Linear TSV documentation

About the Linear TSV standard format

Data files for training and testing risk prediction include at least 3 columns: the original, the machine translation, and the final human post-edited translation or label. Common parallel data file formats like XML and XLIFF do not support that layout.

The ModelFront console supports uploading and downloading the Linear TSV file format for training, testing and using risk prediction because it's simple, standardized and scalable.

Format specification

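The control characters tab, newline and carriage return cannot appear raw inside a segment, because the tab separates columns and the newline separates lines. They can be included by escaping them with a preceding \, that is, by writing them as the two-character sequences \t, \n and \r.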

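The literal character \ should also be escaped with \, so it is written as \\. Therefore a literal \t, \n or \r in the text (a backslash followed by a letter, not a control character) should be double escaped, as \\t, \\n or \\r.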
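The escaping rules above can be sketched in Python as follows. This is a minimal illustration, not ModelFront's implementation: the names escape_field, unescape_field and format_row are hypothetical, and the handling of an unknown escape or a trailing lone backslash (kept as-is) is an assumption that the rules above do not cover.

```python
from typing import Iterable

_UNESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def escape_field(value: str) -> str:
    """Escape one segment so it contains no raw tab, newline or carriage return."""
    # The backslash is escaped first so the escapes added below are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def unescape_field(value: str) -> str:
    """Reverse escape_field by scanning left to right, one escape at a time."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            out.append("\\")  # assumption: keep a trailing lone backslash
        elif nxt in _UNESCAPES:
            out.append(_UNESCAPES[nxt])
        else:
            out.append("\\" + nxt)  # assumption: keep unknown escapes verbatim
    return "".join(out)


def format_row(fields: Iterable[str]) -> str:
    """Join escaped segments with tabs to form one line (without the newline)."""
    return "\t".join(escape_field(f) for f in fields)
```

Scanning left to right matters when unescaping: a chain of string replacements would misread \\t (an escaped backslash followed by the letter t) as a backslash followed by a tab.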

The segments can be plaintext or XML/HTML.

Examples

For a post-edited segment where the original is Apple machines, the machine translation is Äpfelmaschinen and the final human post-edited translation is Apple-Geräte, the line will be:

Apple machines	Äpfelmaschinen	Apple-Geräte

Note the 2 tabs separating the 3 columns!

For a segment that includes linebreaks when displayed, the newline character is escaped:

Machines:\nApple	Maschinen:\nÄpfel	Geräte:\nApple

If there is metadata, it can be provided in additional columns.
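As a rough sketch of how lines like the examples above could be read, the following Python splits a line on tabs, unescapes each segment, and collects any columns after the third as metadata. The names Row, unescape and parse_line are hypothetical, and treating the third column as either the post-edited translation or a label is left to the caller.

```python
import re
from typing import List, NamedTuple

_ESCAPES = {"\\t": "\t", "\\n": "\n", "\\r": "\r", "\\\\": "\\"}


class Row(NamedTuple):
    original: str
    translation: str
    post_edit: str       # final human post-edited translation, or a label
    metadata: List[str]  # any additional columns, in file order


def unescape(field: str) -> str:
    """Replace \\t, \\n, \\r and \\\\ escapes in a single left-to-right pass."""
    return re.sub(r"\\[tnr\\]", lambda m: _ESCAPES[m.group(0)], field)


def parse_line(line: str) -> Row:
    """Split one line into its columns; raw tabs only ever separate columns."""
    fields = [unescape(f) for f in line.rstrip("\r\n").split("\t")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    return Row(fields[0], fields[1], fields[2], fields[3:])


row = parse_line("Machines:\\nApple\tMaschinen:\\nÄpfel\tGeräte:\\nApple\tdoc-1")
print(row.original)  # the escaped \n is now a real line break
print(row.metadata)
```

Splitting before unescaping is the important ordering here: once the escapes are resolved, a tab inside a segment would be indistinguishable from a column separator.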

Why TSV

For natural language text data, especially parallel data, TSV has key advantages over CSV, XML and JSON.

Human-readability

CSV, XML and JSON use delimiters and special characters like commas, brackets and quotes, which occur often in natural language text, and therefore require quoting or encoding.

The tab delimiter is whitespace and rarely occurs in natural language text, so most segments need no escaping at all.

Standardization

There are many conflicting dialects and specifications of CSV and XML.

The Linear TSV standard is the main TSV standard.

Scalability

ModelFront is built to handle very large files. TSV can be read in line by line, without reading the whole file into memory. TSV is also more compact than CSV, XML or JSON.
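Line-by-line reading can be sketched in Python with a generator, so that only one line is held in memory at a time. This is an illustrative sketch, not ModelFront's implementation: iter_rows is a hypothetical name, skipping blank lines is an assumption, and the segments are yielded still escaped. An in-memory io.StringIO stands in for a large file here; an open file object can be passed in the same way.

```python
import io
from typing import Iterable, Iterator, List


def iter_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the columns of each line lazily, one line at a time."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data and are skipped
        yield line.split("\t")


# A file object is itself an iterable of lines, so nothing is loaded up front.
sample = io.StringIO(
    "Apple machines\tÄpfelmaschinen\tApple-Geräte\n"
    "Machines:\\nApple\tMaschinen:\\nÄpfel\tGeräte:\\nApple\n"
)
for columns in iter_rows(sample):
    print(len(columns), columns[0])
```

Because escaped segments never contain a raw newline, each physical line is exactly one row, which is what makes this streaming approach safe.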

Convenience

Standard Unix command-line tools like cut and paste use the tab as their default delimiter, so they read and write TSV without extra options, and they fundamentally operate at the line level.

More reading

Conventions for lossless conversion to TSV

en.wikipedia.org/wiki/Tab-separated_values

Linear TSV

google.com