A practical guide to machine translation quality prediction
We're sharing a practical guide to machine translation quality prediction with you, based on what we've learned making “quality estimation” research work in the real world.
GenAI started in machine translation in 2017, but for half a decade it failed to accelerate human-quality translation, because buyers still had to send every single word generated by machine translation to manual human post-editing.
Now high-volume translation buyers are already translating up to 10x more efficiently, at the same human quality, by using private custom LLMs to fix and confirm most automatically generated translations - millions and millions of words - with no human review.
More and more translation teams are wondering how to get started with quality prediction, or how to catch up. Fortunately, it has never been easier.
— Adam Bittlingmayer and Artur Aleksanyan, co-founders, ModelFront, February 2025
Quality prediction (QP) is AI for checking machine translations. ✓ or ✗.
Use cases
What quality prediction is for, and what it isn't
Quality prediction is for accelerating translation safely. More efficiency, same quality.
Concretely, quality prediction is for accelerating manual human post-editing workflows that are too slow or too expensive, by cutting out agencies or manual work for millions of segments that would not be edited anyway.
The goal of accelerating post-editing workflows could be to grow capacity, to speed up turnaround time or to save costs. It could even be part of a plan to improve quality by shifting more content from raw machine translation to accelerated post-editing, or from external vendors to in-house, to expand into more languages, to provide more service-level tiers or to reduce workflow steps.
Quality prediction is not for one-off or low-value use cases that are offline or only measure work, like comparing machine translation engines, cleaning translation memories, estimating post-editing effort at the document level or annotating at the word level. Although those other, subtly different use cases do have some value because they are based on real daily problems, they would actually require fundamentally different systems.
Quality prediction is only for actually accelerating post-editing work safely, directly in the main production workflow, by automatically confirming as many segments as safely possible, so that they can skip manual human post-editing.
The value is in fully skipping a manual step for exactly the segments that can be safely skipped, because a segment is the main unit of human cognitive effort. (We cannot confirm a translation without reading the whole sentence.) And this precedent has been well established for decades: the translation memory, legacy workflow systems, CAT tools and payment contracts all work at the segment level.
Integration and workflows
How quality prediction works
Quality prediction makes workflows more efficient by automatically triggering human intervention at the right points.
Think of it like the translation memory (TM), but for the new content - for the machine translations (MT). A translation memory is basically a cache.
A quality prediction system is integrated into the translation management system (TMS) to confirm millions of translations at the segment level.
Example workflow
To accelerate a traditional fully manual post-editing workflow, an automatic quality prediction step is added before the manual human post-editing step.
- Machine translation
  All new segments get a machine translation, like before. Many of these translations are perfect. But which ones?
- Quality prediction
  Now all machine translations get a quality prediction: verified ✓, or rejected ✗. The verified segments skip over manual human post-editing.
- Human post-editing
  Now only the rejected segments are still sent to manual human post-editing.

The final output is a blend of fully automatic translations and manually post-edited translations.
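The three steps above can be sketched as a simple routing function. This is a minimal illustration, not a real implementation: `predict_quality` is a hypothetical stand-in for a custom-trained QP model, and the toy rule inside it exists only so the sketch runs.

```python
# Sketch of the workflow: machine-translated segments are split into
# auto-confirmed segments and segments routed to human post-editing.

def predict_quality(source: str, target: str) -> bool:
    """Hypothetical stand-in for a QP model: verify (True) or reject (False).
    A real system would call a custom-trained LLM here."""
    return len(target) > 0 and not target.endswith("???")

def route_segments(segments):
    """Split (source, machine translation) pairs by quality prediction."""
    confirmed, to_post_edit = [], []
    for source, mt in segments:
        if predict_quality(source, mt):
            confirmed.append((source, mt))     # skips human post-editing
        else:
            to_post_edit.append((source, mt))  # sent to human post-editing
    return confirmed, to_post_edit

segments = [
    ("Hello, world.", "Hallo, Welt."),
    ("Click Save.", "Klicken Sie auf Speichern???"),
]
confirmed, to_post_edit = route_segments(segments)
```

The final output is then the confirmed list plus the post-edited version of the rejected list.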
In a translation workflow, the quality prediction (QP) step comes after the machine translation (MT) step, and before the manual human post-editing (PE) step. In the quality prediction step, the status of confirmed segments is changed to Translated or Confirmed, as if a human translator had already verified them manually. The exact setup is configurable, usually based on how the workflow had been set up previously.
As soon as this automated step has run, the translation management system is updated with the progress and remaining word count and pricing for the file and project. CAT tools skip over confirmed segments, while still showing them for context, like exact matches from the translation memory. Only the segments that cannot be safely verified are still sent to manual human post-editing.
The final output is a blend of fully automated translations and manually post-edited translations.
- Segments automatically verified by the translation memory
- Segments automatically verified by quality prediction
- Segments manually verified, or manually edited and verified
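The update to the remaining word count and pricing after the QP step can be sketched as follows. The status names and the per-word rate here are illustrative assumptions, not any particular TMS's API.

```python
# Sketch of the TMS update after the QP step: confirmed segments change
# status and drop out of the remaining (billable) word count.

def update_project(segments, predictions, rate_per_word=0.08):
    """Mark QP-verified segments Confirmed; recompute remaining words and price.
    segments: list of dicts with a 'source' text.
    predictions: one boolean per segment (True = verified)."""
    remaining_words = 0
    for seg, verified in zip(segments, predictions):
        seg["status"] = "Confirmed" if verified else "NeedsPostEditing"
        if not verified:
            remaining_words += len(seg["source"].split())
    return remaining_words, remaining_words * rate_per_word

segments = [
    {"source": "The quick brown fox jumps over the lazy dog."},
    {"source": "Click the button."},
]
remaining, price = update_project(segments, predictions=[True, False])
```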
Models
The LLMs under the hood
Under the hood, the core technology inside a production system is a set of multilingual large language models (LLMs) for verifying and editing translations.
Quality prediction
The main model is a quality prediction (QP) model, an LLM built and trained for the task of confirming (or rejecting) translations.
Quality prediction (QP)
LLM to confirm (or reject) translations
- Input: Source text, target text
- Output: Verified ✓ or rejected ✗ (boolean)
For example, if the translation is already good, the model returns ✓.
- Input:
“The quick brown fox jumps over the lazy dog.”
,“Der schnelle braune Fuchs springt über den faulen Hund.”
- Output: ✓
If the translation is bad, the model returns ✗.
- Input:
“The quick brown fox jumps over the lazy dog.”
,“Der kecke braune Fuchs gumpt über der lausigen Dogge.”
- Output: ✗
In real world scenarios, a quality prediction model is custom trained to predict if a machine translation will not be edited anyway by the professional human translators at that step in that workflow.
Quality estimation
In the research world, the precursor to quality prediction was machine translation quality estimation (QE). A quality estimation model or system returned raw scores (regression), not a boolean prediction (binary classification).
Quality estimation (QE)
LLM to score translations
- Input: Source text, target text
- Output: Score from 0.0 to 1.0 (floating-point number), often displayed as 0% to 100%
The meaning of raw quality estimation scores also varies between systems, and the distribution varies between models, languages and content types.
To make those raw scores useful in real world workflows, the production system has to provide boolean quality predictions, by calibrating and applying thresholds for each model version, language and content type, to keep the same final human quality.
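One way to calibrate such a threshold can be sketched as below. The function names, the error budget, and the toy data are assumptions for illustration; a production system would calibrate on held-out human-labeled data per model version, language and content type.

```python
# Sketch of threshold calibration: pick the lowest score cutoff whose
# verified set still stays within a quality (error) budget.

def calibrate_threshold(scores, labels, max_error_rate=0.02):
    """scores: raw QE scores (0.0-1.0).
    labels: True if a human would leave the segment unedited.
    Returns the lowest threshold whose verified set stays in budget."""
    best = 1.01  # verify nothing by default
    for t in sorted(set(scores), reverse=True):
        verified = [lab for s, lab in zip(scores, labels) if s >= t]
        if not verified:
            continue
        error_rate = 1 - sum(verified) / len(verified)
        if error_rate <= max_error_rate:
            best = t  # keep lowering while quality holds
        else:
            break
    return best

def predict(score, threshold):
    """Boolean quality prediction from a raw score."""
    return score >= threshold
```

Applied to scores `[0.95, 0.9, 0.8, 0.6]` with labels `[True, True, True, False]` and a zero error budget, the calibrated threshold is 0.8: lowering it further would verify the one bad segment.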
Automatic post-editing
An automatic post-editing (APE) model is a separate, generative LLM built for the task of editing translations.
Automatic post-editing (APE)
LLM to edit translations
- Input: Source text, target text
- Output: Edited target text
In real world scenarios, the automatic post-editing model is custom trained to edit translations in that workflow, and tightly coupled with the custom trained quality prediction model for the same workflow.
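The coupling between the two models can be sketched as below. Both model functions are hypothetical stand-ins with toy rules so the sketch runs; a real system would call custom-trained LLMs.

```python
# Sketch of QP and APE coupled in one workflow: rejected translations
# are auto-edited, then re-checked, and only go to humans if the edited
# version is also rejected.

def predict_quality(source: str, target: str) -> bool:
    """Hypothetical stand-in for a QP model."""
    return "todo" not in target.lower()

def auto_post_edit(source: str, target: str) -> str:
    """Hypothetical stand-in for an APE model: returns edited target text."""
    return target.replace("TODO", "").strip()

def process(source: str, mt: str):
    """Return (final_target, needs_human)."""
    if predict_quality(source, mt):
        return mt, False       # verified as-is
    edited = auto_post_edit(source, mt)
    if predict_quality(source, edited):
        return edited, False   # verified after automatic editing
    return edited, True        # still needs human post-editing
```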
All these models are based on the Transformer, the same core model architecture as machine translation systems, like Google Translate, where it was invented, or genAI systems, like ChatGPT, that followed. But these models are specifically built and trained for the task of verifying or editing a translation.
Requirements for buyers
How to decide if a scenario is a fit for quality prediction, or not
There are strict prerequisites for making quality prediction work in the real world.
- High volume
- Real value
- Control
Are you translating a high volume of content at human quality? For now at least, this technology is only feasible in scenarios with enough volume to justify the overhead by automating millions of words a year, and enough high-quality data.
Does accelerating translation safely create real value for your company? A good test is whether this solution would still interest you and your team if it did not include an LLM. If you just want AI for AI's sake, there are plenty of better options. But if you actually need efficiency and human quality, then it makes sense to use the right technology.
Are you in control of your translation workflows? No, not just TMs. You should be managing the translation management system and workflow steps. Too many companies lost control and even visibility to agencies over the years.
If you're a high-volume, high-quality buyer, who can get real value, and you are in control of your translation workflows, accelerating translation safely has never been easier.
If you don't have control, then that's the first thing to work on. If you start now, you can be in a better position in months.
Requirements for providers
How to decide if a quality prediction provider is a fit, or not
If your scenario meets the prerequisites, next you need to buy (or build) a system that meets your requirements.
In our experience, accelerating translation safely in the real world requires human quality, convenience and alignment. These are minimum requirements for a provider to even be worth considering.
| Requirements | System A | System B | System C |
| --- | --- | --- | --- |
| Human quality: same final quality | ✓ | | |
| Convenience: works easily with any TMS, MT or agency | ✓ | ✓ | |
| Alignment: full focus and no conflict of interest | ✓ | ✓ | |
| Success stories: working in the real world | ✓ | | |
There can be additional considerations, like savings, pricing or geography-specific deployment, in a specific scenario to decide between multiple options that meet these requirements.
Human quality is key to real savings — if quality drops, then the value is unclear. Bad quality prediction is worse than no quality prediction. Ultimately, the provider needs to be responsible for quality. Behind the scenes, it requires a full system and lifecycle management: data checks, custom LLMs with strong guardrails, transparent monitoring, A/B testing, engineers on call, human evaluations, and continuous re-training.
Convenience is also key to savings — upfront costs, ongoing costs and hidden costs can destroy net savings. Quality prediction should just work, with your existing setup — TMS, MT engines and translators or agencies. It should also work with your future setup, not lock you in.
Alignment is key to quality and convenience. No conflicts of interest. Not trying to sell you more manual translation, or lock you in to a translation management system or machine translation engine, or grow engineering headcount.
What it changes
What quality prediction changes, and what it doesn't
Quality prediction should not change the final quality, or the setup - the TMS and CAT, the MT engines, the agencies or in-house translators, or how they work day to day. It should just make translation more efficient.
But making a translation team 2x or 10x more efficient does cause changes. More efficiency leads to more demand. And more efficiency and more demand quickly become the new normal.
This comes with subtle, counterintuitive changes inside the translation team.
Before, in the era of traditional fully manual human post-editing in the 2000s and 2010s, the motivation had been to just push professional human translators to rush through repetitive work.
After the shift to quality prediction, the motivation is to get higher quality for the segments they do look at, because that data is used to train the LLMs that ultimately drive efficiency.
In fact, now there is more direct, concrete value created by all sorts of work, such as cleaning up assets like TMs and terminology, retraining engines, separating workflows, monitoring and evaluations.
The future is already here.
It's just not evenly distributed.
You've already seen content that was made by blending fully automated translations with human translations; you just didn't know it! Software and hardware docs, product catalogs, marketing material, patents... And that's the point! Nobody should know.
For example, if you scroll the world's top marketplace for luxury fashion in Spanish, Arabic or Japanese, you'd never know that 80 to 90% of the product titles and descriptions you're seeing were machine translations that were AI-edited and AI-verified, without a human even looking.
Like the airliner's autopilot, or a credit card's fraud detection, or the humble translation memory, this new technology is growing efficiency without sacrificing quality.
The key was not just automatically doing all work, but rather automatically triggering human intervention at the right points. And of course, the systems had to be properly built, customized, deployed, monitored and updated.
Now quality prediction, with private custom LLMs, guardrails and monitoring, is accessible in every major third-party translation management system, just like machine translation is.
Did you find this guide helpful?
Share modelfront.com/quality-prediction with your team and more folks it can help!
— Adam and Artur