Evaluation documentation

Reference documentation for evaluations in the ModelFront console

Evaluation is an easy way to get risk predictions for a dataset right in the ModelFront console.

An evaluation helps you assess the quality of translations, for example to estimate post-editing effort or to compare multiple translation APIs or custom models. You can also use evaluation to filter or check the quality of human translations, translation memories or parallel corpora.

It takes just a few clicks and doesn't require writing any code. Just like the API, an evaluation returns segment-level scores and can include translations if requested.

A single evaluation is for a single language pair, engine and model. To compare languages, engines or models on the same original input segments, just create and run an evaluation for each combination.

Your evaluation is private by default - the only way to share it is to download the results file and share that file.

Creating an evaluation

Name and note

You can always edit Name and Note while an evaluation is running or after it finishes.

Source and target language

The source and target language must not be the same, unless other (und) is selected.


By default, an evaluation requires a parallel data file.

If you only have source text, you can just select an engine to have segments translated by one of the major machine translation APIs like Google, Microsoft, DeepL or ModernMT.

Data file

The data file can contain parallel data - pairs of sentences or other segments in the source language and target language, similar to machine translation training data.

Or, if you don't have translations, you can upload a monolingual data file and have the translations filled in by one of the external engines.

All files should be UTF-8 encoded.
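If you are not sure how a file is encoded, you can check it locally before uploading. This is a minimal sketch that assumes the standard iconv tool is installed; is_utf8 is a hypothetical helper name, not a ModelFront command.

```shell
# Succeeds only if the whole file decodes as valid UTF-8.
# is_utf8 is a hypothetical helper; it relies on iconv exiting
# with an error when it meets an invalid byte sequence.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage, with a placeholder file name:
# is_utf8 data.tsv && echo "valid UTF-8" || echo "not valid UTF-8"
```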

Parallel formats


TMX is an open and standard XML format for importing and exporting translation memories. Segments can contain tags and linebreaks and other control characters.

Only the selected language pair will be extracted. If the file includes multiple variants for that language pair, translations for all variants will be extracted.


A tab-separated-values (TSV) file is a plain-text file where columns are separated by the tab character. Applications like Excel can export a spreadsheet to a TSV.

The .tsv must have exactly 2 columns. The control characters tab (\t), newline (\n) and carriage return (\r) can be included by escaping them with a preceding \. As a result, the literal sequences \t, \n and \r must be double-escaped, that is, written as \\t, \\n and \\r.
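To catch rows that break the two-column rule before uploading, a quick local check along these lines may help. It is only a sketch: check_two_columns is a hypothetical helper name, and it assumes control characters inside segments are already escaped, so that each row is one line.

```shell
# Print the line numbers of rows that do not have exactly
# 2 tab-separated columns. No output means the file passes.
# check_two_columns is a hypothetical helper name.
check_two_columns() {
  awk -F '\t' 'NF != 2 { print NR }' "$1"
}

# Usage, with a placeholder file name:
# check_two_columns data.tsv
```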

Monolingual formats

.txt, .text, .md, .markdown, .adoc, .se, .html, .xhtml, .align, .src, .trg, .srt

The monolingual file format option is only for evaluations requesting machine translation. The file should include only the original segments, and the machine translations will be filled in by the engine you selected.

The control characters tab (\t), newline (\n) and carriage return (\r) can be included by escaping them with a preceding \. As a result, the literal sequences \t, \n and \r must be double-escaped, that is, written as \\t, \\n and \\r.
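The escaping rule for .tsv and monolingual files can be sketched in Bash as follows. This is an illustration of the rule as described here, not a ModelFront tool: escape_segment is a hypothetical helper name, and it handles only the three sequences named above.

```shell
# Escape one segment for upload (Bash).
# escape_segment is a hypothetical helper name.
escape_segment() {
  local s=$1
  # Double-escape the literal two-character sequences \t, \n and \r first,
  # so they are not confused with escaped control characters.
  s=${s//\\t/\\\\t}
  s=${s//\\n/\\\\n}
  s=${s//\\r/\\\\r}
  # Then replace the real control characters with their escaped forms.
  s=${s//$'\t'/\\t}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  printf '%s\n' "$s"
}

escape_segment $'Line one\nLine two'   # prints: Line one\nLine two
escape_segment 'Use \n for newlines'   # prints: Use \\n for newlines
```

The order matters: doubling the literal sequences before escaping the control characters keeps the two cases distinguishable.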

File size

Evaluation supports very large files - there is no technical limit. You can evaluate files larger than 1GB with the Google Cloud Storage address option.

Depending on the segment length and our current load across all clients, it takes about 1 hour per million segments. Evaluations that include a request for machine translations take significantly longer, due to the latency of the external translation APIs.


By default, evaluations use our latest default generic model. You can also select a custom model from those that are available to your account.


When an evaluation is finished, ModelFront will send you a notification email, and you'll be able to preview, share and download a spreadsheet file with the full results.


The quality score is similar to a BLEU score - an aggregate score for the whole set, from 0 to 100, where higher is better.

A ModelFront quality score is the complement of the average risk: 100 minus the average of the risks in percent, weighted by the length of the original source text. Length-weighting makes the score better reflect actual quality and post-editing effort.

So if the average risk is 10%, the score will be roughly 90.
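To make the calculation concrete, here is a rough sketch that recomputes a score from a downloaded results file. It rests on assumptions: the risk in the third column is a percentage, length is counted in characters of the source segment (the unit is not stated here, and some awk versions count bytes instead), and quality_score is a hypothetical helper name. The result may not match the console exactly.

```shell
# Approximate the aggregate quality score from a results .tsv
# with the columns source, translation and risk (in percent).
# ASSUMPTION: segments are weighted by source length in characters.
quality_score() {
  awk -F '\t' '
    { w = length($1); weighted += w * $3; total += w }
    END { if (total > 0) printf "%.1f\n", 100 - weighted / total }
  ' "$1"
}

# Usage, with a placeholder file name:
# quality_score results.tsv
```

For example, a 4-character segment with risk 10 and a 6-character segment with risk 20 give a weighted average risk of (4*10 + 6*20) / 10 = 16, so a score of 84.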


The chart is a histogram showing the distribution of translations by risk. High-quality translations are clustered on the left.

If there is a peak of risky translations on the right, that's a sign that there is a significant cluster of bad translations.

The chart can help you understand the effect of where you set a cutoff. How many translations will you keep? What final quality will you get?
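Once you have downloaded the results, you can try a cutoff on the command line as well. This sketch counts how many segments fall below a given risk; it assumes the risk in the third column is a percentage, and kept_at_cutoff is a hypothetical helper name.

```shell
# Report how many segments have a risk below the cutoff (in percent).
# kept_at_cutoff is a hypothetical helper name.
kept_at_cutoff() {
  awk -F '\t' -v cutoff="$2" '
    $3 < cutoff { kept++ }
    END { printf "%d of %d kept\n", kept, NR }
  ' "$1"
}

# Usage, with a placeholder file name and a cutoff of 50%:
# kept_at_cutoff results.tsv 50
```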

Download data file

The data file with the full results is in the .tsv format with an additional third column with the predicted risk.


The downloaded data file is encoded and escaped the same way as an uploaded data file in .tsv format. You may want to unescape control characters when converting it to another format.

If you open a .tsv in a spreadsheet application such as Microsoft Excel, Apple Numbers or Google Sheets, make sure to change Text Qualifier from " to None, in case some of your segments contain ".

If you open a .tsv in Microsoft Excel or in a Windows application, make sure to select UTF-8 as the file encoding.

You can also work with it on the command line, which is recommended for larger files.


To sort by risk in Bash:

sort -t$'\t' -k3,3n file.tsv

To reverse sort, add -r.


To filter while preserving the order in Bash, for example to get only those with risk below 50%:

awk -F '\t' '$3 < 50' file.tsv

To just peek at the top or bottom, add | head -n 100 or | tail -n 100.

To count the lines, add | wc -l. To write the output to a file, add > newfile.tsv.


To drop the third column with risk scores and keep only the filtered parallel data in Bash, use cut to get the first two columns:

cut -f1,2 file.tsv


To join multiple evaluation files with corresponding rows in Bash, use paste.
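For instance, to put the risks from two evaluations of the same source segments side by side, something like the following could work. It assumes both result files have the same rows in the same order; compare_risks and the file names are hypothetical.

```shell
# Join two results files row by row, then keep the source segment
# and the two risk columns (columns 1, 3 and 6 of the joined rows).
# compare_risks is a hypothetical helper name.
compare_risks() {
  paste "$1" "$2" | cut -f1,3,6
}

# Usage, with placeholder file names:
# compare_risks engine_a.tsv engine_b.tsv > comparison.tsv
```

Because paste matches rows purely by position, a missing or extra row in either file will misalign everything after it, so it is worth comparing the line counts with wc -l first.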