Is a translation good or bad? Measuring quality and risk is fundamental to successful translation at scale. Both human and machine translation benefit from sentence-level and corpus-level metrics.
The top production use cases are post-editing effort estimation and hybrid translation - safely auto-approving raw machine translation, typically for medium-scale, high-value content like technical support documentation or product descriptions. And of course, there have always been tools and processes for final validation.
The top offline use cases are filtering parallel corpora - for example, cleaning translation memories - and evaluation - assessing or comparing translation systems or custom models.
Metrics like BLEU are based on string distance to human reference translations, so they cannot score new incoming translations - which have no reference yet - nor the human reference translations themselves.
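To make the limitation concrete, here is a minimal sketch of the idea behind BLEU - clipped n-gram precision against a reference, times a brevity penalty. This is a toy illustration, not the full sacreBLEU implementation (no smoothing, no corpus-level aggregation); note that it simply cannot run without a `reference` argument.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. Real implementations
    (e.g. sacreBLEU) add smoothing and corpus-level aggregation."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(1, len(hyp) - n + 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The hard dependency on `reference` is exactly why reference-based metrics are useful for offline evaluation but useless for scoring live production translations.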
What are the options if you want to build or buy services, tools or technology for measuring the quality and risk of new translations?
Professional human linguists and translators are the gold standard, whether through an internal human evaluation in a spreadsheet, user-reported quality ratings, an analysis of translator post-editing productivity and effort, or full post-editing.
There is significant research on human evaluation methods, quality frameworks like MQM-DQF, and quality management platforms like TAUS DQF and ContentQuo for standardizing and managing human evaluations, as well as translators and language service providers offering quality reviews or continuous human labelling.
Translation tools like Memsource, Smartling and GlobalLink have features for automatically measuring quality bundled into their platforms. Memsource's feature is based on machine learning.
Xbench, Verifika and LexiQA directly apply exhaustive, hand-crafted linguistic rules, configurations and translation memories to catch common translation errors, especially human translation errors.
They are integrated into existing tools, and their outputs are predictable and interpretable. LexiQA is unique in its partnerships with web-based translation tools and its API.
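The predictability and interpretability of this category come from the deterministic nature of the checks. A minimal sketch of a few such hand-crafted rules - number mismatch, untranslated segments, doubled spaces, extreme length ratios - is below. These rules are illustrative only, not how Xbench, Verifika or LexiQA are actually implemented; real tools ship thousands of language- and client-specific rules and configurations.

```python
import re

def qa_checks(source, target):
    """Run a few deterministic, interpretable checks on a
    source/target segment pair and return the issues found.
    Illustrative rules only."""
    issues = []
    # Numbers in the source should survive translation unchanged.
    src_nums = re.findall(r"\d+(?:[.,]\d+)?", source)
    tgt_nums = re.findall(r"\d+(?:[.,]\d+)?", target)
    if sorted(src_nums) != sorted(tgt_nums):
        issues.append("number mismatch")
    # An identical target often means the segment was left untranslated.
    if source.strip() == target.strip():
        issues.append("possibly untranslated")
    # Doubled spaces are a common copy-paste artifact.
    if "  " in target:
        issues.append("double space")
    # Wildly different lengths suggest truncation or misalignment.
    if len(target) > 3 * max(1, len(source)) or len(source) > 3 * max(1, len(target)):
        issues.append("length ratio")
    return issues
```

Every flag maps back to a human-readable rule, which is exactly why linguists trust and can tune these tools - the trade-off versus machine learning approaches is coverage of errors no rule anticipated.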
ModelFront partners like translate5 and GlobalDoc's LangXpert integrate ModelFront technology as smart features in their translation systems and language service offerings.
If you have the data and the machine learning team and want to build your own system based on machine learning, there is a growing set of open-source options.
The most notable quality estimation frameworks are TransQuest from Tharindu Ranasinghe, OpenKiwi from Unbabel and DeepQuest from the research group led by Lucía Specia. TransQuest offers pretrained models. Zipporah from Hainan Xu and Philipp Koehn is the best-known library for parallel data filtering.
The owners of those repositories are also key contributors to and co-organizers of the WMT shared tasks on Quality Estimation and Parallel Corpus Filtering.
Massively multilingual libraries and pretrained models like LASER are not specifically for translation or translation quality, but they provide a solid baseline for an unsupervised approach to parallel data filtering when combined with other techniques like language identification, regexes and round-trip translation.
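An unsupervised filter of that kind typically combines several cheap signals per segment pair. Here is a minimal sketch using only heuristics that need no external model - length ratio, copy detection, number agreement - with the LASER embedding-similarity and language-identification signals noted in comments, since those require loading external models. The thresholds are illustrative assumptions, not recommended values.

```python
import re

def keep_pair(source, target, min_len=3, max_ratio=2.0):
    """Cheap heuristic filter for one source/target pair from a
    noisy parallel corpus. Returns True to keep the pair.
    In practice you would also threshold embedding cosine
    similarity (e.g. LASER) and verify language identification -
    omitted here because they require external models."""
    src_toks, tgt_toks = source.split(), target.split()
    # Drop very short segments - too little signal to trust.
    if len(src_toks) < min_len or len(tgt_toks) < min_len:
        return False
    # Drop pairs with an extreme length ratio (likely misaligned).
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return False
    # Drop copies - the "translation" is just the source again.
    if source.strip().lower() == target.strip().lower():
        return False
    # Numbers should agree across the pair.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", target)):
        return False
    return True
```

Each heuristic is weak on its own; the approach works because misaligned or noisy pairs tend to fail several signals at once, and the surviving corpus trains measurably better models.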
Unbabel, eBay, Microsoft, Amazon, Facebook and others invest in in-house machine translation quality estimation research and development for their own use, mainly for the content that flows through their platforms at scale.
The main goal is to safely auto-approve raw machine translation for as much as possible, whether in efficient hybrid translation workflows for localization or customer service, or just to limit catastrophes on user- and business-generated content that is machine translated by default.
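The core mechanic of that auto-approval is a risk threshold. A minimal sketch of the routing logic is below - the threshold values are hypothetical, and in production they are tuned per language pair and content type against the cost of a bad translation versus the cost of human review.

```python
def route_segment(risk, approve_threshold=0.2, block_threshold=0.8):
    """Route one machine-translated segment by its predicted risk
    score (0.0 = safe, 1.0 = risky). Thresholds are illustrative
    placeholders, not recommended production values."""
    if risk < approve_threshold:
        return "auto-approve"  # publish raw machine translation as-is
    if risk < block_threshold:
        return "post-edit"     # send to a human translator for review
    return "block"             # too risky to show or publish at all
```

Lowering the approve threshold trades away savings for safety; catastrophic-error prevention on user-generated content typically only needs the block branch.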
Their approaches are based on machine learning, and they publish papers and code.
KantanQES can provide a quality score with every KantanMT translation - KantanMT was the first machine translation provider to do so.
ModelFront is the first and only API for translation risk prediction, and it's also based on machine learning. With a few clicks or a few lines of code, you can access a production-strength system.
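In spirit, integrating a risk prediction API is a few lines of code like the sketch below. The endpoint URL, field names and response shape here are hypothetical placeholders for illustration, not ModelFront's actual API - consult their documentation for the real interface.

```python
import json
import urllib.request

# Hypothetical endpoint and field names - placeholders only,
# not the actual ModelFront API.
API_URL = "https://api.example.com/v1/predict"

def build_request(source, translation, source_lang, target_lang, token):
    """Assemble a risk-prediction request for one segment pair."""
    payload = {
        "sl": source_lang,
        "tl": target_lang,
        "rows": [{"original": source, "translation": translation}],
    }
    return urllib.request.Request(
        f"{API_URL}?token={token}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending the request (network call, so shown but not executed):
# with urllib.request.urlopen(build_request(...)) as resp:
#     risk = json.load(resp)["rows"][0]["risk"]
```

The returned risk score then feeds directly into a thresholding workflow like the auto-approve routing described above.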
Our approach is developed fully in-house, extending ideas from the leading researchers in quality estimation and parallel data filtering, and from our own experience inside the leading machine translation provider.
We've productionized it and made it accessible and useful to more players - enterprise localization teams, language service providers, platform and tool developers and machine translation researchers.
We built in security, scalability, support for 100+ languages and 10K+ language pairs, locales, encodings, formatting, tags and file formats, and integrations with the top machine translation API providers, and we developed a process for customization.
We continuously invest in curated parallel datasets and manually-labeled datasets and track emerging risk types as translation technology, use cases and languages evolve.
Building means investing upfront in research and development of an internal system. Key factors in the decision are:
Building, deploying and maintaining a deep learning-based risk prediction system requires specialized research and engineering experience, as well as data and hardware.
The effort required to build and maintain a system or systems is a function of the number of languages, content types and use cases. If there is a single narrow consistent use case with high volume, then building in-house starts to make more sense.
Most organizations with those capabilities and scale have many interesting and high-value problems core to their businesses for a team of natural language processing researchers and machine learning engineers to work on.
The typical organization that chooses to build a risk prediction system in-house has deep experience building and training machine translation systems from scratch, and even those organizations also decide to buy external machine translation and external risk prediction for some purposes.