Translation metrics

What can predicted risk represent?

Is post-editing distance the same as translation quality? What about post-editing effort?

Most of the segment-level translation metrics are correlated, but each answers a subtly different question.

Translation risk prediction can be customized to predict one of the many different possible metrics.

Post-editing probability

Was the machine translation edited or not?

Post-editing probability predicts whether a machine translation was post-edited, even by a single character. It can be configured to ignore casing, whitespace, punctuation or insertions.

This is the default metric for risk prediction for hybrid translation with ModelFront.
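As a minimal sketch of the idea (the function name and flags are hypothetical, not the ModelFront API), post-editing can be detected by comparing the raw machine translation to the final translation after optional normalization:

```python
import string

def was_post_edited(mt: str, final: str,
                    ignore_case: bool = False,
                    ignore_whitespace: bool = False,
                    ignore_punctuation: bool = False) -> bool:
    """Return True if the final translation differs from the raw MT output,
    after applying the configured normalizations."""
    def normalize(text: str) -> str:
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if ignore_whitespace:
            text = "".join(text.split())
        return text
    return normalize(mt) != normalize(final)
```

With `ignore_case=True`, a translation edited only in casing counts as unedited.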

Post-editing distance

By how many characters was the machine translation edited?

Post-editing distance has many of the same characteristics as BLEU. Like BLEU, it can be configured to be character- or word-level, to ignore edits in whitespace or punctuation, and to be length-normalized. And like BLEU, it can penalize harmless synonyms and reordering while failing to penalize small but critical errors.
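For illustration, a character-level, length-normalized post-editing distance can be sketched as a Levenshtein edit distance divided by the length of the longer string (a simplifying assumption; real implementations vary in normalization):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def post_editing_distance(mt: str, final: str) -> float:
    """Character-level edit distance, normalized to [0, 1] by the longer string."""
    if not mt and not final:
        return 0.0
    return levenshtein(mt, final) / max(len(mt), len(final))
```

A distance of 0.0 means the translation was left untouched; higher values mean heavier editing.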

Post-editing time

How much time did it take to post-edit the machine translation?

Post-editing time more accurately reflects translator effort. However, it's difficult to know why a post-edit took a long time (was it terminology research or a coffee break?), and to account for bulk or asynchronous actions like find-and-replace or questions to colleagues.


Direct assessment
How good was the translation?

Direct assessment captures human judgement of the quality of a human or machine translation. Quality can be a binary label or a sliding scale, for a single dimension or for many dimensions, like fluency and accuracy.

For example, it could be a commerce platform's feature "Rate this translation on a scale of 1 to 5 stars" or "Report this translation for offensiveness".

Quality is generally supposed to be independent of source-side risk and inherent difficulty.

Other metrics

There are many other potential metrics, like whether a translator requested clarification, whether a translator deleted the machine translation and started from scratch, whether the translation should be the same as the original, or whether the post-edited translation is flagged during final review.

For filtering parallel data, the only metric that matters is whether it will improve the final machine translation system to be trained - but that's very difficult to know at filtering time, even for a human linguist.

Generate and store more data

Platforms, enterprises, language service providers and translation tools that generate and store more data about each translation are in a position to get ModelFront custom risk prediction models for more metrics and more use cases.

Applying metrics

When integrating a risk prediction model into a production application, it's important to consider application-specific business logic. For example, in a marketplace, certain products, text fields and locales have a higher impact on conversions and total sales, so the risk threshold should be adjusted accordingly.
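One way to encode that business logic is a per-field, per-locale threshold table with fallbacks. The table below is a hypothetical sketch for a marketplace, not part of the ModelFront API; the fields, locales and values are invented for illustration:

```python
# Hypothetical thresholds: lower threshold = send more translations to review.
# Keys are (field, locale); None acts as a wildcard fallback.
THRESHOLDS = {
    ("title", "de"): 0.10,   # high-impact field and locale: review aggressively
    ("title", None): 0.15,   # high-impact field, any other locale
    (None, None): 0.30,      # default for low-impact fields
}

def needs_review(risk: float, field: str, locale: str) -> bool:
    """Route a translation to human review if its predicted risk
    exceeds the most specific matching threshold."""
    threshold = (THRESHOLDS.get((field, locale))
                 or THRESHOLDS.get((field, None))
                 or THRESHOLDS[(None, None)])
    return risk > threshold
```

The most specific rule wins, so high-impact combinations can be tightened without touching the default.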

When aggregating segment-level metrics into project-, corpus- or engine-level metrics, it's important to consider length-weighting the segments. For example, bad translations of longer segments require more post-editing effort. The aggregate quality score in the ModelFront console is weighted by the length of the original source text.
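A minimal sketch of length-weighted aggregation, assuming segment scores are paired with their original source texts (the function name is illustrative, not the console's implementation):

```python
def aggregate_quality(segments):
    """Length-weighted average of segment-level scores.

    segments: list of (source_text, score) pairs.
    Each segment's score is weighted by the length of its source text,
    so long, badly translated segments pull the aggregate down more.
    """
    total_len = sum(len(src) for src, _ in segments)
    if total_len == 0:
        return 0.0
    return sum(len(src) * score for src, score in segments) / total_len
```

An unweighted mean would let many short, easy segments mask a few long, costly ones.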