Translation metrics

What can predicted risk represent?

Is post-editing distance the same as translation quality? What about post-editing effort?

Most of the segment-level translation metrics are correlated, but each answers a subtly different question.

Translation risk prediction can be customized to predict one of the many different possible metrics.

Post-editing probability

Was the machine translation edited or not?

Post-editing probability predicts whether a machine translation was post-edited, even by a single character. It can be configured to ignore casing, whitespace, punctuation or insertions.

This is the default metric for risk prediction for hybrid translation with ModelFront.
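As a minimal sketch of the idea (the function name and flags are hypothetical, not the ModelFront API), post-editing can be detected by comparing the raw machine translation to the final translation after optional normalization:

```python
import string

def was_post_edited(mt: str, final: str,
                    ignore_case: bool = False,
                    ignore_whitespace: bool = False,
                    ignore_punctuation: bool = False) -> bool:
    """Return True if the final translation differs from the raw MT output,
    after applying the configured normalizations."""
    def normalize(text: str) -> str:
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if ignore_whitespace:
            text = "".join(text.split())
        return text
    return normalize(mt) != normalize(final)
```

With `ignore_case=True`, a translation edited only in casing counts as unedited.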

Post-editing distance

By how many characters was the machine translation edited?

Post-editing distance has many of the same characteristics as BLEU. Like BLEU, it can be configured to be character- or word-level, to ignore edits in whitespace or punctuation, and to be length-normalized. And like BLEU, it can penalize harmless synonyms and reordering while failing to penalize small but critical errors.
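For illustration, a character-level, length-normalized post-editing distance can be sketched as a Levenshtein edit distance divided by the length of the longer string (a simplifying assumption; real implementations vary in normalization):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def post_editing_distance(mt: str, final: str) -> float:
    """Character-level edit distance, normalized to [0, 1] by the longer string."""
    if not mt and not final:
        return 0.0
    return levenshtein(mt, final) / max(len(mt), len(final))
```

A distance of 0.0 means the translation was left untouched; higher values mean heavier editing.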

Post-editing time

How much time did it take to post-edit the machine translation?

Post-editing time more accurately reflects translator effort. However, it's difficult to know why a post-edit took a long time (was it terminology research or a coffee break?), and to account for bulk or asynchronous actions like find-and-replace or questions to colleagues.


Direct assessment
How good was the translation?

Direct assessment captures human judgement of the quality of a human or machine translation. Quality can be a binary label or a sliding scale, for a single dimension or for many dimensions, like fluency and accuracy.

For example, it could be a commerce platform's feature "Rate this translation on a scale of 1 to 5 stars" or "Report this translation for offensiveness".

Quality is generally supposed to be independent of source-side risk and inherent difficulty.

Other metrics

There are many other potential metrics, like whether a translator requested clarification, whether a translator deleted the machine translation and started from scratch, whether the translation should be the same as the original, or whether the post-edited translation is flagged during final review.

For filtering parallel data, the only metric that matters is whether it will improve the final machine translation system to be trained - but that's very difficult to know at filtering time, even for a human linguist.

Generate and store more data

Platforms, enterprises, language service providers and translation tools that generate and store more data about each translation are in a position to get ModelFront custom risk prediction models for more metrics and more use cases.

Applying metrics

When integrating a risk prediction model into a production application, it's important to consider application-specific business logic. For example, in a marketplace, certain products, text fields and locales have a higher impact on conversions and total sales, so the risk threshold should be adjusted accordingly.
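One way to encode that business logic is a per-field, per-locale threshold table with fallbacks. The table below is a hypothetical sketch for a marketplace, not part of the ModelFront API; the fields, locales and values are invented for illustration:

```python
# Hypothetical thresholds: lower threshold = send more translations to review.
# Keys are (field, locale); None acts as a wildcard fallback.
THRESHOLDS = {
    ("title", "de"): 0.10,   # high-impact field and locale: review aggressively
    ("title", None): 0.15,   # high-impact field, any other locale
    (None, None): 0.30,      # default for low-impact fields
}

def needs_review(risk: float, field: str, locale: str) -> bool:
    """Route a translation to human review if its predicted risk
    exceeds the most specific matching threshold."""
    threshold = (THRESHOLDS.get((field, locale))
                 or THRESHOLDS.get((field, None))
                 or THRESHOLDS[(None, None)])
    return risk > threshold
```

The most specific rule wins, so high-impact combinations can be tightened without touching the default.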

When aggregating segment-level metrics into project-, corpus- or engine-level metrics, it's important to consider length-weighting the segments. For example, bad translations of longer segments require more post-editing effort. The aggregate quality score in the ModelFront console is weighted by the length of the original source text.
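A minimal sketch of length-weighted aggregation, assuming segment scores are paired with their original source texts (the function name is illustrative, not the console's implementation):

```python
def aggregate_quality(segments):
    """Length-weighted average of segment-level scores.

    segments: list of (source_text, score) pairs.
    Each segment's score is weighted by the length of its source text,
    so long, badly translated segments pull the aggregate down more.
    """
    total_len = sum(len(src) for src, _ in segments)
    if total_len == 0:
        return 0.0
    return sum(len(src) * score for src, score in segments) / total_len
```

An unweighted mean would let many short, easy segments mask a few long, costly ones.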