How we evaluate risk prediction

Setting the right risk threshold and measuring accuracy

Before sending data to our training pipeline, we hold out a random 1% of rows for the evaluation sets. If the machine translation is equivalent to the final human post-edited translation, we label it good; otherwise we label it bad.
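A minimal sketch of holding out a random 1% of rows, assuming the rows fit in a Python list. The function name `split_eval`, the fixed seed and the rule of keeping at least one evaluation row are illustrative assumptions, not a description of the actual pipeline.

```python
import random


def split_eval(rows, eval_fraction=0.01, seed=0):
    """Split rows into (training_rows, eval_rows) by random hold-out.

    Hypothetical helper: the seed and the minimum of one eval row
    are assumptions made for this sketch.
    """
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


training_rows, eval_rows = split_eval(list(range(1000)))
```

The held-out rows never reach training, so the evaluation measures how the model behaves on data it has not seen.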

What's equivalent depends on the use case. For the professional post-editing use case, any edit is an edit. For other use cases, or if the post-edits are not fully trusted, training and evaluation can ignore minor differences - for example, variations in casing, whitespace and punctuation - or treat synonyms like No. and Number as equivalent in certain contexts.
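As an illustration, strict and relaxed labelling could be sketched as below. The names `normalize` and `label` are hypothetical, and the relaxed mode only strips ASCII punctuation; synonym handling is left out because it depends on the language and context.

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Casefold, drop ASCII punctuation and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def label(machine_translation: str, post_edit: str, strict: bool = True) -> str:
    """Return "good" if the two strings are equivalent, otherwise "bad".

    strict=True  : any edit is an edit (professional post-editing).
    strict=False : ignore casing, whitespace and punctuation differences.
    """
    if strict:
        equivalent = machine_translation == post_edit
    else:
        equivalent = normalize(machine_translation) == normalize(post_edit)
    return "good" if equivalent else "bad"
```

With these definitions, a pair that differs only in casing and punctuation is labelled bad in strict mode and good in relaxed mode.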

ModelFront returns a risk score between 0.0 (good) and 1.0 (bad), that is, between 0% and 100%. We can therefore choose where to set the risk threshold for evaluation, and you can adjust it whenever you wish.

Translations with a predicted risk below the threshold are labelled good; translations with a risk above the threshold are labelled bad. As we lower the threshold, recall increases but precision decreases: more bad translations are caught, so final quality is higher, but more good translations are flagged too, so potential savings are lower for many use cases.

Recall and precision

The most meaningful way to evaluate a risk prediction system is with two metrics, recall and precision. Recall asks: of the bad translations, how many are caught? Precision asks: of the translations caught, how many are actually bad?

For our purposes, a bad translation is a positive, just like in detection of a medical condition.

To calculate precision and recall, we count:

true negatives - Good translations, which the model correctly predicted you can safely skip. These are your potential savings.

true positives - Bad translations, which the model correctly predicted you should send for post-editing. This is us doing our job!

false negatives - Bad translations, which the model incorrectly predicted are safe. This hurts your final quality.

false positives - Good translations, which the model incorrectly predicted you should send for post-editing. This eats into your potential savings.
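The four counts and the two metrics can be sketched as follows. This is an illustrative helper, not ModelFront's implementation: the names are hypothetical, and treating a risk exactly equal to the threshold as bad is an assumption, since only "below" and "above" are defined above.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    true_negatives: int = 0
    true_positives: int = 0
    false_negatives: int = 0
    false_positives: int = 0

    @property
    def precision(self) -> float:
        """Of the translations caught, the share that is actually bad."""
        caught = self.true_positives + self.false_positives
        return self.true_positives / caught if caught else 0.0

    @property
    def recall(self) -> float:
        """Of the bad translations, the share that is caught."""
        bad = self.true_positives + self.false_negatives
        return self.true_positives / bad if bad else 0.0


def count_outcomes(risks, labels, threshold) -> Counts:
    """Compare thresholded risk scores against "good"/"bad" labels."""
    counts = Counts()
    for risk, actual in zip(risks, labels):
        predicted_bad = risk >= threshold  # assumption: ties count as bad
        if predicted_bad and actual == "bad":
            counts.true_positives += 1
        elif predicted_bad:
            counts.false_positives += 1
        elif actual == "bad":
            counts.false_negatives += 1
        else:
            counts.true_negatives += 1
    return counts


counts = count_outcomes(
    risks=[0.1, 0.2, 0.6, 0.9, 0.3, 0.8],
    labels=["good", "good", "good", "bad", "bad", "bad"],
    threshold=0.5,
)
print(counts, counts.precision, counts.recall)
```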

An example

For example, let's consider a scenario where half the machine translations are good.

If we set a very conservative threshold, the final outcome could be something like 38% skipped and 97% correct - essentially the same final quality as with humans today.

A conservative threshold

In the same scenario, with the same translation quality but a more aggressive threshold, the outcome could be something like 55% skipped and 90% correct.

A more aggressive threshold
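To relate these two outcomes to recall and precision, here is a rough back-of-the-envelope sketch. It rests on one reading of the figures that the text does not confirm: "correct" is taken to be the final share of correct translations after every flagged translation has been post-edited to correctness, so the only remaining errors are skipped bad translations (false negatives). The function name `implied_metrics` is hypothetical.

```python
def implied_metrics(bad_share, skipped_share, final_correct_share):
    """Derive confusion-matrix shares from the headline figures.

    Assumption: all flagged translations end up correct after post-editing,
    so every final error is a false negative.
    """
    false_negatives = 1.0 - final_correct_share
    true_positives = bad_share - false_negatives
    flagged = 1.0 - skipped_share
    false_positives = flagged - true_positives
    true_negatives = skipped_share - false_negatives
    return {
        "true_negatives": true_negatives,
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "recall": true_positives / bad_share,
        "precision": true_positives / flagged,
    }


conservative = implied_metrics(bad_share=0.5, skipped_share=0.38, final_correct_share=0.97)
aggressive = implied_metrics(bad_share=0.5, skipped_share=0.55, final_correct_share=0.90)
print(conservative["recall"], aggressive["recall"])
```

Under that reading, the conservative threshold implies a recall of about 94% and the more aggressive one about 80%, which matches the trade-off described above: raising the threshold skips more but catches fewer bad translations.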

If the machine translation quality improves, then risk prediction allows us to skip more, quality held equal.

For the post-editing use case, the goal is to catch as many bad translations as humans do (typically 95-99%) while minimising false positives. So we lower the threshold until recall is in that range, but no further. Negatives - translations labelled good - are your potential savings: for these you can skip post-editing while safely maintaining today's average final quality.
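The threshold search described here could look roughly like the sketch below. It is a simple scan over the observed risk scores on an evaluation set, not the actual tuning procedure. The names `recall_at` and `choose_threshold` are hypothetical, and a risk equal to the threshold is assumed to count as bad.

```python
def recall_at(risks, is_bad, threshold):
    """Share of bad translations whose risk is at or above the threshold."""
    total_bad = sum(1 for bad in is_bad if bad)
    caught = sum(1 for risk, bad in zip(risks, is_bad) if bad and risk >= threshold)
    return caught / total_bad if total_bad else 1.0


def choose_threshold(risks, is_bad, target_recall=0.95):
    """Return the highest observed risk score whose recall meets the target.

    Scanning from high to low mirrors "lower the threshold until recall
    is in that range, but no further".
    """
    for threshold in sorted(set(risks), reverse=True):
        if recall_at(risks, is_bad, threshold) >= target_recall:
            return threshold
    return 0.0  # only reached when there are no scores at all


risks = [0.05, 0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
is_bad = [False, False, False, True, False, True, True]
threshold = choose_threshold(risks, is_bad, target_recall=1.0)
```

Stopping at the highest threshold that meets the target keeps false positives, and therefore lost savings, as low as the recall requirement allows.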

By default, we provide a conservative evaluation. A very thorough evaluation should also include:

slicing by time - Usually translation quality has improved over time.

manual inspection - Actually looking at the outliers among the false positives and false negatives, as they almost always reveal human post-editing errors.

length-weighting - When calculating savings, it's important to length-weight the lines, because very short lines are almost always riskier but easier to review.
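As a rough illustration of length-weighting, the sketch below compares the share of lines skipped with the share of text skipped. Measuring length in characters is an assumption (word counts would work the same way), and `skipped_share` is a hypothetical helper.

```python
def skipped_share(lines, risks, threshold, length_weighted=True):
    """Share of the workload that falls below the risk threshold.

    With length_weighted=True each line counts by its character length
    (an assumption for this sketch); otherwise every line counts as 1.
    """
    weights = [len(line) if length_weighted else 1 for line in lines]
    total = sum(weights)
    skipped = sum(w for w, risk in zip(weights, risks) if risk < threshold)
    return skipped / total if total else 0.0


lines = ["OK", "Cancel", "Please restart the application to apply the update."]
risks = [0.7, 0.6, 0.1]

by_line = skipped_share(lines, risks, threshold=0.5, length_weighted=False)
by_length = skipped_share(lines, risks, threshold=0.5, length_weighted=True)
print(by_line, by_length)
```

In this toy data the two short, risky lines are flagged and the long, safe line is skipped, so the length-weighted share is much higher than the per-line share.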