How we evaluate risk prediction

Setting the right risk threshold and measuring accuracy

ModelFront predicts whether a translation is right or wrong. But how do we know if ModelFront is right or wrong?

The details vary depending on your use case, but the principles are the same. In this example, we'll focus on hybrid translation for post-editing - auto-approving raw machine translation for as many segments as safely possible.

The goal is to preserve today's full post-editing quality, while gaining efficiency. Typically we train a custom model optimized for the content, machine translation, terminology and quality standards of the client, content or project.

It's the most critical use case - inaccurate predictions can cause not just more post-editing but bad final quality - but there are also standard and objective metrics so we can know how it will do before launch.

Before sending data to our training pipeline, we remove a random 1% of rows for later evaluation - the "eval set". If the machine translation is equivalent to the final human-post-edited translation, we label it good; otherwise, bad.
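The holdout and labelling step can be sketched in a few lines of Python. This is an illustration, not ModelFront's actual pipeline; the function names are hypothetical, and the 1% fraction follows the description above:

```python
import random

def split_eval_set(rows, eval_fraction=0.01, seed=42):
    """Hold out a random fraction of rows for later evaluation."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = max(1, int(len(rows) * eval_fraction))
    return rows[cut:], rows[:cut]  # (training rows, eval rows)

def label(machine_translation, post_edited):
    """Label a row good if the raw MT equals the human post-edit exactly."""
    return "good" if machine_translation == post_edited else "bad"
```

The seed is fixed so that the same eval set can be reproduced when re-running an evaluation.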

What's equivalent depends on the project. For the professional human-quality post-editing use case, any edit is an edit. So every translation is either good or bad - even if only one character was changed.

For other use cases, or if the post-edits are not fully trusted, training and evaluation can ignore minor differences - for example, variations in casing, whitespace and punctuation - or treat synonyms like No. and Number as equivalent in certain contexts.
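A lenient comparison like that could be sketched as follows. The `normalize` helper and the synonym map are hypothetical illustrations of the kinds of differences a project might choose to ignore:

```python
import string

def normalize(text):
    """Strip punctuation, lowercase, and collapse whitespace."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())

def label_lenient(mt, post_edit, synonyms=None):
    """Label good if the MT matches the post-edit after normalization,
    optionally mapping project-specific synonyms (e.g. 'no' -> 'number')."""
    synonyms = synonyms or {}
    def canon(text):
        return [synonyms.get(tok, tok) for tok in normalize(text).split()]
    return "good" if canon(mt) == canon(post_edit) else "bad"
```

Note that "No." normalizes to "no", so the synonym map is keyed on the normalized form.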

ModelFront returns a risk score between 0.0 (good) and 1.0 (bad) - i.e. between 0% and 100%. So we can choose where to set the risk threshold, and adjust it whenever we wish. For the hybrid translation use case, to preserve the post-editing quality, we'll set the threshold low enough that 98% of the auto-approved translations would not have been edited.

Translations with a predicted risk below the threshold are labelled good, translations with a risk above the threshold are labelled bad. As we lower the threshold, recall increases but precision decreases - higher final quality, but lower potential savings for many use cases.
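Applying a chosen threshold is then a one-liner. This is a sketch of the decision rule described above, not ModelFront's API:

```python
def label_by_threshold(risk, threshold):
    """Auto-approve (good) at or below the threshold;
    send for post-editing (bad) above it."""
    return "good" if risk <= threshold else "bad"
```

Lowering the threshold flags more translations as bad, which is what drives the recall/precision trade-off.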

We run the eval set through ModelFront, and see what it predicts on the translations that it has never seen before, for which we have human post-edits.

Recall and precision

The most meaningful way to evaluate a risk prediction system is with two metrics, recall and precision. Of the bad translations, how many are caught? Of the translations caught, how many are actually bad?

For our purposes, a bad translation is a positive, just like in detection of a medical condition.

To calculate precision and recall, we count:

true negatives - Good translations, which the model correctly predicted you can safely skip. These are your potential savings.

true positives - Bad translations, which the model correctly predicted you should send for post-editing. This is us doing our job!

false negatives - Bad translations, which the model incorrectly predicted are safe. This hurts your final quality.

false positives - Good translations, which the model incorrectly predicted you should send for post-editing. This eats into your potential savings.
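The four counts above, and the precision and recall derived from them, can be computed with a small helper. This is a sketch over hypothetical `(risk, actual)` pairs from the eval set:

```python
def precision_recall(rows, threshold):
    """rows: (risk, actual) pairs, where actual is 'good' or 'bad'.
    A positive is a translation flagged as bad (risk above threshold)."""
    tp = fp = fn = tn = 0
    for risk, actual in rows:
        flagged = risk > threshold
        if flagged and actual == "bad":
            tp += 1      # caught a bad translation
        elif flagged and actual == "good":
            fp += 1      # eats into savings
        elif not flagged and actual == "bad":
            fn += 1      # hurts final quality
        else:
            tn += 1      # safely skipped - the savings
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tn
```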

An example

For example, let's consider a scenario where half the machine translations are good.

If we set a very conservative threshold, the final outcome could be something like 38% skipped, and 97% correct - essentially the same final quality as with humans today.

A conservative threshold

The same scenario, in terms of translation quality, but with a more aggressive threshold: 55% skipped and 90% correct.

An aggressive threshold

If the machine translation quality improves, then risk prediction allows us to skip more, quality held equal.

For the post-editing use case, the goal is to catch as many bad translations as humans do (typically 95-99%), but minimise false positives. So we lower the threshold until recall is in that range, but no further. Negatives - translations labelled good - are your potential savings: for these you can skip post-editing while safely maintaining today's average final quality.
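That tuning loop can be sketched as: start permissive and lower the threshold just until eval-set recall reaches the target, preserving as many auto-approvals as possible. The function names, target and step size here are illustrative:

```python
def recall_at(rows, threshold):
    """Fraction of bad translations flagged (risk above threshold)."""
    bad_risks = [risk for risk, actual in rows if actual == "bad"]
    if not bad_risks:
        return 1.0
    return sum(risk > threshold for risk in bad_risks) / len(bad_risks)

def tune_threshold(rows, target_recall=0.95, step=0.01):
    """Lower the threshold until recall reaches the target, but no further."""
    threshold = 1.0
    while threshold > 0.0 and recall_at(rows, threshold) < target_recall:
        threshold = round(threshold - step, 2)
    return threshold
```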

By default, we provide a conservative evaluation. A very thorough evaluation should also include:

slicing by time - Usually translation quality has improved over time.

manual inspection - Actually looking at the outliers among the false positives and false negatives, as they almost always reveal human post-editing errors.

length-weighting - When calculating savings, it's important to length-weight the lines, as almost always the very short lines are riskier, but easier to review.
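A length-weighted savings calculation might look like this sketch, where each segment's source length (e.g. in characters) is paired with its predicted label:

```python
def length_weighted_savings(rows):
    """rows: (source_length, predicted) pairs. Weight skipped segments
    by length, since short lines inflate an unweighted skip rate."""
    total = sum(length for length, _ in rows)
    skipped = sum(length for length, predicted in rows if predicted == "good")
    return skipped / total if total else 0.0
```

In the test data below, 2 of 3 segments are skipped (an unweighted 67%), but the weighted savings differ because the flagged segment is short.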

Other use cases and accuracy metrics

In research, the task is often framed as regression, not classification, so there is no threshold and the metric is F1 or Pearson's correlation coefficient.
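For reference, Pearson's correlation coefficient between predicted risk and a continuous score - such as human-judged quality or post-editing effort - follows a standard formula, shown here in plain Python:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```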

For use cases like evaluation or estimating post-editing effort, those correlation metrics make sense.

For parallel data filtering, because it depends on the dataset size and downstream systems, measuring effectiveness is very complex.

For final validation, either correlation or recall and precision can be used.