Before sending data to our training pipeline, we hold out a random 1% of rows for the eval sets. If the machine translation is equivalent to the final human post-edited translation, we label it good; otherwise, bad.
What counts as equivalent depends on the use case. For the professional post-editing use case, any edit is an edit. For other use cases, or if the post-edits are not fully trusted, training and evaluation can ignore minor differences - for example, variations in casing, whitespace and punctuation - or treat synonyms as equivalent in certain contexts.
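The relaxed equivalence check described above can be sketched roughly like this - the function names and the exact normalization steps are illustrative assumptions, not ModelFront's actual pipeline:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace,
    so minor formatting differences don't count as edits."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return " ".join(text.split())         # collapse runs of whitespace

def label(machine: str, post_edited: str, strict: bool = False) -> str:
    """Label a machine translation 'good' if it matches the human
    post-edit, under either strict or relaxed equivalence."""
    if strict:
        return "good" if machine == post_edited else "bad"
    return "good" if normalize(machine) == normalize(post_edited) else "bad"
```

Under the relaxed check, `label("Hello, World!", "hello world")` is good; under the strict check - any edit is an edit - it is bad.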
ModelFront returns a risk score between 0.0 (good) and 1.0 (bad), i.e. between 0% and 100%, so you can choose where to set the risk threshold for evaluation, and adjust it whenever you wish.
Translations with a predicted risk below the threshold are labelled good, translations with a risk above the threshold are labelled bad. As we lower the threshold, recall increases but precision decreases - higher final quality, but lower potential savings for many use cases.
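The thresholding itself is a one-liner - the default value here is just a placeholder, not a recommended setting:

```python
def classify(risk: float, threshold: float = 0.2) -> str:
    """Label a translation by comparing its predicted risk to the
    chosen threshold: below the threshold it is safe to skip ('good'),
    at or above it should go to post-editing ('bad')."""
    return "good" if risk < threshold else "bad"
```

Lowering `threshold` flags more translations as bad, which is what drives recall up and precision down.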
The most meaningful way to evaluate a risk prediction system is with two metrics, recall and precision. Of the bad translations, how many are caught? Of the translations caught, how many are actually bad?
For our purposes, a bad translation is a positive, just like in detection of a medical condition.
To calculate precision and recall, we count:
true negatives - Good translations that the model correctly predicted you can safely skip. These are your potential savings.
true positives - Bad translations that the model correctly predicted you should send for post-editing. This is us doing our job!
false negatives - Bad translations that the model incorrectly predicted are safe. These hurt your final quality.
false positives - Good translations that the model incorrectly predicted you should send for post-editing. These eat into your potential savings.
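From those four counts, the two metrics fall out directly - a minimal sketch, with bad translations as the positive class as defined above:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision: of the translations caught (flagged bad), how many
    were actually bad. Recall: of the bad translations, how many
    were caught."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall
```

Note that true negatives appear in neither metric - they show up later, as savings.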
For example, let's consider a scenario where half the machine translations are good.
If we set a very conservative threshold, the final outcome could be something like 38% skipped and 97% correct - essentially the same final quality as with humans today.
In the same scenario, in terms of translation quality, a more aggressive threshold could give something like 55% skipped and 90% correct.
If the machine translation quality improves, then risk prediction allows us to skip more, quality held equal.
For the post-editing use case, the goal is to catch as many bad translations as humans do (typically 95-99%) while minimising false positives. So we lower the threshold until recall is in that range, but no further. Negatives - translations labelled good - are your potential savings: for these you can skip post-editing while safely maintaining today’s average final quality.
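That threshold search - lower until the target recall is hit, but no further - can be sketched like this on a labelled eval set. The function and parameter names are illustrative:

```python
def pick_threshold(risks: list[float], is_bad: list[bool],
                   target_recall: float = 0.95) -> float:
    """Return the highest threshold at which recall on bad
    translations reaches the target. Starting high and lowering
    keeps false positives (lost savings) to a minimum."""
    total_bad = sum(is_bad)
    # Candidate thresholds: the observed risk scores, highest first.
    for t in sorted(set(risks), reverse=True):
        caught = sum(1 for r, bad in zip(risks, is_bad) if bad and r >= t)
        if caught / total_bad >= target_recall:
            return t
    return 0.0
```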
By default, we provide a conservative evaluation. A very thorough evaluation should also include:
slicing by time - Usually translation quality has improved over time.
manual inspection - Actually looking at the outliers among the false positives and false negatives, as they almost always reveal human post-editing errors.
length-weighting - When calculating savings, it's important to length-weight the lines, as the very short lines are almost always riskier, but easier to review.
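Length-weighting the savings amounts to counting skipped characters (or words) rather than skipped lines - a minimal sketch, with hypothetical names:

```python
def weighted_savings(lengths: list[int], predicted_good: list[bool]) -> float:
    """Fraction of total text volume (e.g. characters) that can be
    skipped, rather than the fraction of lines. Short risky lines
    then weigh less, matching the actual review effort saved."""
    total = sum(lengths)
    skipped = sum(n for n, good in zip(lengths, predicted_good) if good)
    return skipped / total
```

For example, skipping two long lines of 10 and 8 characters while flagging one short 2-character line gives 90% weighted savings, even though only two of three lines were skipped.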