Evaluation documentation

Reference documentation for evaluations in the ModelFront console

Evaluation is an easy way to get risk predictions for a dataset right in the ModelFront console.

A good evaluation helps you assess the quality of translations, for example to estimate post-editing effort or to compare multiple translation APIs or custom models. You can also use evaluation to filter or check the quality of human translations, translation memories or parallel corpora.

It takes just a few clicks and doesn't require writing any code. Just like the API, an evaluation returns segment-level scores and can include translations if requested.

A single evaluation is for a single language pair, engine and model. To compare languages, engines or models on the same original input segments, just create and run an evaluation for each combination.

Your evaluation is private by default. The only way to share it is to download the results file and share that file.


Creating an evaluation

Name and note

You can always edit Name and Note while an evaluation is running or after it finishes.

Source and target language

The source and target language must not be the same, unless other (und) is selected.

Translations

By default, an evaluation requires a parallel data file.

If you only have source text, you can just select an engine to have segments translated by one of the major machine translation APIs like Google, Microsoft, DeepL or ModernMT.

Data file

The data file can contain parallel data - pairs of sentences or other segments in the source language and target language, similar to machine translation training data.

Or, if you don't have translations, you can just upload a monolingual data file and have translations from one of the external engines filled in.

All files should be UTF-8 encoded.

Parallel formats

.tmx

TMX is an open and standard XML format for importing and exporting translation memories. Segments can contain tags and linebreaks and other control characters.

Only the selected language pair will be extracted. If the file includes multiple variants for that language pair, translations for all variants will be extracted.

.tsv

A tab-separated-values (TSV) file is a plain-text file where columns are separated by the tab character. Applications like Excel can export a spreadsheet to a TSV and the Okapi framework can convert other file types to TSV.

The TSV file for evaluation must have exactly 2 columns (or 3 columns for accuracy testing), and the name must end in .tsv.

ModelFront supports the Linear TSV standard.
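
Before uploading, a file can be checked against the column rule above. The following is a minimal sketch, not ModelFront's own validation. The function name checkColumns is hypothetical, and it assumes the file content has already been read into a string with one row per line.

```javascript
// Hypothetical helper: report rows whose column count differs from the expected one.
// Assumes Linear TSV, where tabs inside segments are escaped,
// so every raw tab character is a column delimiter.
function checkColumns(tsvText, expectedColumns = 2) {
  const badRows = [];
  const lines = tsvText.split('\n');
  lines.forEach((line, i) => {
    // Ignore the empty string left after a final newline.
    if (line === '' && i === lines.length - 1) return;
    // Drop a trailing carriage return from Windows line endings before counting.
    const columns = line.replace(/\r$/, '').split('\t').length;
    if (columns !== expectedColumns) {
      badRows.push({ line: i + 1, columns });
    }
  });
  return badRows;
}

const sample = 'Hello\tHallo\nGood morning\nYes\tJa\tExtra\n';
console.log(checkColumns(sample));
```

An empty result means every row has the expected number of columns. Pass 3 as the second argument for an accuracy test file.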

Monolingual formats

.txt, .text, .md, .markdown, .adoc, .se, .html, .xhtml, .align, .src, .trg, .srt

The monolingual file format option is only for evaluations requesting machine translation. It should only include the original segments, and the machine translations will be filled in with the engine you selected.

If your data includes newline characters, consider TSV.

.tsv

A TSV file with exactly 1 column can also be used. This is useful for data that has control characters like newlines within segments.

File size

Evaluation supports very large files - there is no technical limit. You can evaluate files larger than 1GB with the Google Cloud Storage address option.

Depending on the segment length and our current load across all clients, it takes about 1 hour per million segments. Evaluations that include a request for machine translations take significantly longer, due to the latency of the external translation APIs.

Model

By default, evaluations use our latest default generic model. You can also select a custom model from those that are available to your account.


Results

When an evaluation is finished, ModelFront will send you a notification email, and you'll be able to preview, share and download a spreadsheet file with the full results.

Score

The quality score is similar to human evaluation or BLEU score - an aggregate score for the whole set, 0 to 100, where higher is better. It's only meaningful for evaluations that are large and diverse enough to represent a statistically significant sample.

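A ModelFront quality score is the complement of the average risk: 100 minus the average of the segment risks, weighted by the length of the original source text, including tags. Length-weighting makes the score better reflect actual quality and post-editing effort.

// Sum of the risks weighted by source length, and the total source length.
let weightedRisk = 0;
let totalLength = 0;
res.rows.forEach(({ risk }, i) => {
  const { original } = req.rows[i];
  weightedRisk += risk * original.length;
  totalLength += original.length;
});
// Assumes risk is a percentage (0 to 100), as in the filtering example.
const score = 100 - weightedRisk / totalLength;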

So if the average risk is 10%, the score will be roughly 90.
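
As a worked example of the length-weighting, here is a self-contained sketch. The function name qualityScore and the row shape are hypothetical, and it assumes risk is on a 0 to 100 scale, in line with the 10% and 90 figures above.

```javascript
// Hypothetical helper: length-weighted quality score for rows of { original, risk }.
// Assumes risk is on a 0 to 100 scale.
function qualityScore(rows) {
  let weightedRisk = 0;
  let totalLength = 0;
  for (const { original, risk } of rows) {
    weightedRisk += risk * original.length;
    totalLength += original.length;
  }
  if (totalLength === 0) return NaN; // no text to weight by
  return 100 - weightedRisk / totalLength;
}

const rows = [
  { original: 'abcd', risk: 10 }, // 4 characters
  { original: 'ab', risk: 40 },   // 2 characters
];
// Weighted average risk: (10 * 4 + 40 * 2) / 6 = 20, so the score is 80.
console.log(qualityScore(rows));
```

An unweighted average of the two risks would be 25. The longer, lower-risk segment pulls the weighted average down to 20.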

Chart

The chart is a histogram showing the distribution of translations by risk. High quality translations are clustered along the left.

If there is a peak of risky translations on the right, that's a sign that there is a significant cluster of bad translations.

The chart can help you understand the effect of where you set a cutoff. How many translations will you keep? What final quality will you get?
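
The two questions above can also be answered numerically from downloaded results. This is a rough sketch, assuming risk is on a 0 to 100 scale and that rows strictly below the cutoff are kept. The function name cutoffEffect and the row shape are hypothetical.

```javascript
// Hypothetical helper: what a risk cutoff keeps, for rows of { original, risk }.
// Assumes risk is on a 0 to 100 scale and that rows below the cutoff are kept.
function cutoffEffect(rows, cutoff) {
  const kept = rows.filter(({ risk }) => risk < cutoff);
  let weightedRisk = 0;
  let totalLength = 0;
  for (const { original, risk } of kept) {
    weightedRisk += risk * original.length;
    totalLength += original.length;
  }
  return {
    kept: kept.length,
    share: rows.length ? kept.length / rows.length : 0,
    // Length-weighted score of the kept rows only; NaN if nothing is kept.
    score: totalLength ? 100 - weightedRisk / totalLength : NaN,
  };
}

const rows = [
  { original: 'aaaa', risk: 5 },
  { original: 'bbbb', risk: 15 },
  { original: 'cccc', risk: 60 },
  { original: 'dddd', risk: 90 },
];
for (const cutoff of [25, 75]) {
  console.log(cutoff, cutoffEffect(rows, cutoff));
}
```

A stricter cutoff keeps fewer rows but yields a higher score for what remains, which is the trade-off the chart visualizes.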

Preview

The preview shows the riskiest translations, with the risk score and labels. You can toggle a label filter on to filter out rows by a label.


Downloading

The full results are available as a TMX file or as a TSV file. The TSV file has an additional third column with the predicted risk.

Small and medium datasets can be filtered right in the console before downloading.

For working with very large datasets, we recommend downloading as a TSV file. Guidance on common operations follows below.

Format

The downloaded data file is encoded and escaped the same way as an uploaded data file in the TSV format. You may want to unescape control characters when converting it to another format.

If you open a TSV file in a spreadsheet application like Microsoft Excel, Apple Numbers or Google Sheets, make sure to change Text Qualifier from " to None, in case some of your segments contain ".

If you open a TSV file in Microsoft Excel or in a Windows application, make sure to select UTF-8 as the file encoding.

You can also work with it on the command line, which is recommended for larger files.

Sorting

To sort by risk in Bash:

sort -t$'\t' -k3,3n <file.tsv>

To reverse sort, add -r.

Filtering

To filter while preserving the order in Bash, for example to get only those with risk below 50%:

awk -F "\t" '{ if($3 < 50) { print }}' <file.tsv>

To just peek at the top or bottom, add | head -n 100 or | tail -n 100.

To count the lines, add | wc -l. To write the output to a file, add > <newfile.tsv>.

Slicing

To drop the third column with risk scores and keep only the filtered parallel data in Bash, use cut to get the first two columns:

cut -f1,2 <file.tsv>

Joining

To join multiple eval files with corresponding rows in Bash, use paste.


Accuracy testing

How accurate is ModelFront risk prediction on your data?

If you have post-editing data, it's easy to test and visualize. Post-editing data consists of originals, machine translations and human translations - similar to risk prediction training data.

To start an accuracy test, create an evaluation with a 3-column TSV with a name ending in .edt.tsv.

The accuracy test will include graphs of metrics like throughput and quality, as well as the correlation of predicted risk against post-editing distance. The preview line will include the human post-edited translation if it differs from the machine translation.
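
To get a feel for what correlating predicted risk with post-editing distance means, here is a rough sketch. It assumes character-level Levenshtein distance, normalized by segment length, as a stand-in for post-editing distance, and Pearson correlation as the measure. Both are assumptions for illustration, since the metrics ModelFront actually uses are not specified here. The row shape and sample data are hypothetical.

```javascript
// Character-level Levenshtein distance (an assumed stand-in for post-editing distance).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Pearson correlation coefficient of two equally long numeric arrays.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
    varY += (ys[i] - meanY) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

// Hypothetical rows: machine translation, human post-edit, predicted risk (0 to 100).
const rows = [
  { machine: 'the cat sat', human: 'the cat sat', risk: 2 },
  { machine: 'the cat sit', human: 'the cat sat', risk: 20 },
  { machine: 'a dog stand', human: 'the cat sat', risk: 85 },
];
// Normalize each distance by the longer of the two segments.
const distances = rows.map(({ machine, human }) =>
  levenshtein(machine, human) / Math.max(machine.length, human.length, 1));
const correlation = pearson(rows.map((row) => row.risk), distances);
console.log(distances, correlation);
```

A correlation close to 1 would mean that segments predicted as risky are also the ones that needed the most editing.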

We recommend about 500 lines per test. If you have more data, you can split test sets by content type, quality tier, project or data to have multiple smaller accuracy tests.


Linear TSV

ModelFront supports the linear TSV standard:

The control characters tab (\t), newline (\n) and carriage return (\r) can be included by escaping them with a preceding \.

The literal character \ should also be escaped with \. Therefore the literals \t, \n and \r should be double escaped.
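
The escaping rules above can be sketched as a pair of functions. This is a minimal illustration under those rules, not ModelFront's own implementation. The names escapeCell and unescapeCell are hypothetical.

```javascript
// Escape a segment for a Linear TSV cell: the backslash first, then the control characters.
function escapeCell(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/\t/g, '\\t')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

// Reverse the escaping in a single pass, so that an escaped backslash
// followed by "n" is not mistaken for an escaped newline.
function unescapeCell(cell) {
  const map = { t: '\t', n: '\n', r: '\r', '\\': '\\' };
  return cell.replace(/\\([tnr\\])/g, (_, ch) => map[ch]);
}

// A segment with a real newline, a real tab and a literal backslash followed by "n".
const segment = 'Line one\nLine two\twith a tab and a literal \\n';
const cell = escapeCell(segment);
console.log(cell);
console.log(unescapeCell(cell) === segment);
```

Escaping the backslash first matters: if it were done last, the backslashes just added for tabs and newlines would be escaped a second time.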

The segments can include HTML or other XML tags and HTML or XML encodings - by default, those will be automatically removed and decoded.

For natural language text data, especially parallel data, TSV has key advantages over CSV and XML.

Human-readability

CSV and XML use delimiters and control characters like commas, brackets and quotes, which occur often in natural language text, and therefore require quoting or encoding.

The tab delimiter is whitespace and does not occur frequently.

Standardization

There are many conflicting dialects and specifications of CSV and XML.

The Linear TSV standard is the main TSV standard.

Scalability

ModelFront is built to handle very large files. TSV can be read in line by line - without reading the whole file into memory. TSV is also more compact than CSV or XML.
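
Line-by-line processing can be sketched with a generator that looks at one row at a time. This is an illustrative sketch, assuming a results file with the risk in the third column. The function name filterByRisk is hypothetical.

```javascript
// Hypothetical helper: lazily yield rows whose risk (third column) is below maxRisk.
// Works on any iterable of lines, so the whole file never has to be in memory.
function* filterByRisk(lines, maxRisk) {
  for (const line of lines) {
    const columns = line.split('\t');
    // Skip rows without a usable numeric risk column.
    if (columns.length < 3 || columns[2] === '') continue;
    const risk = Number(columns[2]);
    if (Number.isFinite(risk) && risk < maxRisk) {
      yield line;
    }
  }
}

const lines = [
  'Hello\tHallo\t3.2',
  'Good morning\tGute Nacht\t87.5',
  'Yes\tJa\t1.1',
];
for (const line of filterByRisk(lines, 50)) {
  console.log(line);
}
```

For a real file in Node.js, the same check can run inside a for await loop over the lines produced by the readline module, which reads from a stream rather than loading the file at once.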

Convenience

The built-in command-line tools like cut and paste read and write TSV by default and fundamentally operate at the line level.