Is post-editing distance the same as translation quality? What about post-editing effort?
Most segment-level translation metrics are correlated, but each answers a subtly different question. Translation risk prediction can be customized to predict any one of many possible metrics.
Was the machine translation edited or not?
Post-editing probability predicts whether a machine translation was post-edited, even by one character. It can be configured to ignore casing, whitespace, punctuation or insertions.
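As a rough sketch (not ModelFront's actual implementation), a binary edited-or-not label can be derived from the raw machine translation and the final translation like this. The normalization flags are illustrative, and the option to ignore insertions is left out because it requires aligning the two strings:

```python
import re
import string


def was_edited(mt: str, final: str,
               ignore_case: bool = False,
               ignore_whitespace: bool = False,
               ignore_punctuation: bool = False) -> bool:
    """Return True if the final translation differs from the raw MT output,
    after applying the configured normalizations."""
    def normalize(text: str) -> str:
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if ignore_whitespace:
            text = re.sub(r"\s+", "", text)
        return text

    return normalize(mt) != normalize(final)
```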
This is the default metric for risk prediction for hybrid translation with ModelFront.
By how many characters was the machine translation edited?
Post-editing distance has many of the same characteristics as BLEU. Like BLEU, it can be configured to be character- or word-level, to ignore edits in whitespace or punctuation and to be length-normalized. And like BLEU, it can penalize synonyms and reordering while not penalizing small but critical errors.
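For illustration, a character-level, length-normalized post-editing distance can be computed with a standard Levenshtein distance. This is only a sketch of the general idea, not the exact configuration used in production:

```python
def post_editing_distance(mt: str, final: str, normalize: bool = True) -> float:
    """Character-level Levenshtein distance between the MT output and the
    post-edited translation, optionally normalized by the longer length."""
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(final) + 1))
    for i, a in enumerate(mt, start=1):
        curr = [i]
        for j, b in enumerate(final, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    distance = prev[-1]
    if normalize and max(len(mt), len(final)) > 0:
        return distance / max(len(mt), len(final))
    return float(distance)
```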
How much time did it take to post-edit the machine translation?
Post-editing time more accurately reflects translator effort. However, it's difficult to know why a post-edit took a long time (was it terminology research or a coffee break?) and to account for bulk or asynchronous actions like find-and-replace or questions to colleagues.
How good was the translation?
Direct assessment of quality is based on human judgement of the quality of a human or machine translation. Quality can be a binary label or a sliding scale, for a single dimension or many dimensions, like fluency and accuracy.
For example, it could be a commerce platform's feature "Rate this translation on a scale of 1 to 5 stars" or "Report this translation" for offensiveness.
Quality is generally supposed to be independent of source-side risk and inherent difficulty.
There are many other potential metrics, like whether a translator requested clarification, whether a translator deleted the machine translation and started from scratch, whether the translation should be the same as the original, or whether the post-edited translation is flagged during final review.
For filtering parallel data, the only metric that matters is whether including a segment pair will improve the final machine translation system to be trained - but that's very difficult to know at filtering time, even for a human linguist.
Generate and store more data
Platforms, enterprises, language service providers and translation tools that generate and store more data about each translation are in a position to get custom ModelFront risk prediction models for more metrics and more use cases.
When integrating a risk prediction model into a production application, it's important to consider application-specific business logic. For example, in a marketplace, certain products, text fields and locales have a higher impact on conversions and total sales, so the risk threshold should be adjusted accordingly.
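For example, the routing logic in a marketplace integration could look something like the sketch below; the field names, locales and threshold values are hypothetical:

```python
# Hypothetical thresholds: product titles in high-impact locales get a
# stricter (lower) risk threshold than long-tail description fields.
RISK_THRESHOLDS = {
    ("title", "de"): 0.10,
    ("title", "ja"): 0.10,
    ("description", "de"): 0.25,
}
DEFAULT_THRESHOLD = 0.20


def needs_human_review(risk: float, field: str, locale: str) -> bool:
    """Route a translation to human review if its predicted risk exceeds
    the threshold configured for its field and locale."""
    threshold = RISK_THRESHOLDS.get((field, locale), DEFAULT_THRESHOLD)
    return risk > threshold
```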
When aggregating segment-level metrics into project-, corpus- or engine-level metrics, it's important to consider length-weighting the segments. For example, bad translations of longer segments require more post-editing effort. The aggregate quality score in the ModelFront console is weighted by the length of the original source text.
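A minimal sketch of that kind of length-weighted aggregation, assuming the weight is the character count of the original source text (the exact weighting in the ModelFront console may differ):

```python
def aggregate_quality(segments: list[tuple[str, float]]) -> float:
    """Length-weighted average of segment-level scores.

    `segments` is a list of (source_text, score) pairs; each score is
    weighted by the number of characters in the source text, so longer
    segments contribute proportionally more to the aggregate.
    """
    total_length = sum(len(source) for source, _ in segments)
    if total_length == 0:
        return 0.0
    return sum(len(source) * score for source, score in segments) / total_length
```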