Automated evaluation for machine translation researchers and developers

Still chasing BLEU, waiting for human evals and fighting fires?

Neural machine translation has changed the game. But the problems of life as a machine translation researcher have not changed: BLEU obsession, slow human evals and random catastrophic translation fails.

Those problems are not unrelated. Google, Microsoft and everybody else know that BLEU isn't useful when it comes to very good translations or very bad translations. But what is the alternative?

Find your fails before your users do

ModelFront finds translation fails. Now we're offering to help machine translation teams directly. Leveraging our risk prediction technology, we catch your translation fails, scalably and automatably.

Fail lists

With big lists of fails, you can understand and fix problems in your dataset, pipeline or approach, and also apply patches before there's fire.

Fail rates

With the rates of fails per million, you can compare systems and versions, and protect against nasty regressions in the long tail.
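As a rough illustration of how a per-million fail rate lets you compare two versions, here is a minimal sketch. The function name and the counts are hypothetical, chosen only for the example; they are not ModelFront output.

```python
def fails_per_million(fail_count: int, total_segments: int) -> float:
    """Return the number of fails per million translated segments."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    return fail_count * 1_000_000 / total_segments


# Hypothetical counts for two versions of the same system.
baseline = fails_per_million(fail_count=42, total_segments=250_000)
candidate = fails_per_million(fail_count=57, total_segments=300_000)

print(f"baseline:  {baseline:.0f} fails per million")
print(f"candidate: {candidate:.0f} fails per million")
print("regression" if candidate > baseline else "no regression")
```

Normalizing to a rate is what makes the comparison fair when the two evaluation sets differ in size.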

Fail reports

We summarize our findings, and report to you possible major or minor bugs in your pipeline or pre-processing.

Useful metrics

ModelFront lists and rates include a breakdown by error type.


Errors with words dropped or inserted, and errors with agreement


Errors with words left untranslated, or translated to the wrong language


Errors with literally translated or mixed up person names, placenames, brand names, product names, and org names


Errors with numbers, prices, dates, codes, URLs, emails, hashtags and usernames


Errors around localisation of numbers, prices, dates and quotes


Errors with casing and punctuation


Errors with negation, context, function words, content words, ambiguity, and idioms


Robustness to input that is wrong language, very long, mixed language, wrong script, missing diacritics, bad encoding, character variants, spelling variants, typos, missing spaces, non-standard casing, non-standard punctuation, and out of distribution


Translations that are NSFW, offensive or violent, or mix up named entities

Fast and continuous

ModelFront quality evaluations are scalable and automatable. You can compare systems, perform grid search, and set up continuous integration or monitoring on fresh incoming content.

How is it different than quality evaluation, like BLEU?

BLEU requires a reference translation, and doesn't distinguish between stylistic variations and scandalous translation fails. Today there are systems that have roughly the same BLEU score as human translators, but translation fails still happen.

ModelFront requires no reference translation, doesn't get triggered by output that is noisy or machiny but good enough, and also covers the long tail.

How is it different than quality estimation (QE)?

The main metric in the QE task is HTER. Like BLEU, it focuses on edit distance or time, so it does not distinguish stylistic variations from scandalous translation fails. Unlike BLEU, it doesn't require reference translations, but it does require labelled training data. QE is most useful in the context of CAT platforms for human translators.

ModelFront is built specifically to catch really bad translations.

How is it different than round-trip translation?

If only it were that easy! Round-trip translations often match the original even when the translation was bad. This could be because of identity translations - words or phrases that failed to translate - or literal translations - e.g. of an idiom or semantically ambiguous word or phrase - or just bad training data that causes corresponding errors in both directions.

On the other hand, round-trip translations often don't match the original even when the translation was good. Every language has distinctions that it cannot represent, both structural - for example, between cases, tenses, formality or genders - and semantic - because of homonyms, including named entities, and idioms. So a translation is too often a lossy representation of the original. For the many indirect pairs, there is additional loss.

ModelFront uses round-trip translations, but it's a small signal among many.
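To make both failure modes of round-trip checking concrete, here is a minimal sketch with toy stand-ins for translation engines. The engines are hypothetical dictionary lookups, not real systems, and `round_trip_matches` is a name invented for this example.

```python
from typing import Callable

Translator = Callable[[str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def round_trip_matches(source: str, forward: Translator, backward: Translator) -> bool:
    """Return True if translating forward and then back reproduces the source."""
    return normalize(backward(forward(source))) == normalize(source)


# Case 1: a toy engine that fails to translate and just copies its input.
def identity_engine(text: str) -> str:
    return text


# The output is untranslated, yet the round trip matches perfectly.
print(round_trip_matches("The meeting is on Friday.", identity_engine, identity_engine))

# Case 2: toy French/English engines. English has no tu/vous distinction,
# so a good translation can still come back in a different form.
FR_TO_EN = {"tu es prêt": "you are ready", "vous êtes prêt": "you are ready"}
EN_TO_FR = {"you are ready": "vous êtes prêt"}


def fr_to_en(text: str) -> str:
    return FR_TO_EN[normalize(text)]


def en_to_fr(text: str) -> str:
    return EN_TO_FR[normalize(text)]


# The forward translation is fine, yet the round trip does not match.
print(round_trip_matches("Tu es prêt", fr_to_en, en_to_fr))
```

The first check returns True for a bad translation and the second returns False for a good one, which is why a round-trip match is better treated as one weak signal than as a verdict.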

How is it different than confidence?

Modern neural models have some clue about their own confidence, but many if not most translation fails are due to bad data or the inherent complexity of human language. Glassbox techniques can fail precisely when translation fails. On the other hand, confidence may be low just because there are multiple synonymous translations.

ModelFront is an independent system, specifically focused on risk prediction. Translation engines are blackboxes to us.

How is it different than hardcoding rules-based approaches?

Rules-based approaches can be very effective for certain error types, e.g. number formats, or in certain scenarios, e.g. when there is an extensive translation memory. Notable examples include LexiQA, WMT test suites, or features offered within CAT tools like Smartling and Memsource.

ModelFront predictions come from deep learning models, trained on data that includes unlabelled data, so that they scale robustly to unseen data, variations, and noise, across language pairs.

Is it automated, or just crowdsourcing?

ModelFront predictions come from deep learning models, not crowdsourcing or hardcoded rules. Our approach scales to new content types and new languages, and high request volumes.

We're open to other approaches, including hybrid approaches, but we're highly incentivized to automate and generalize as much as we can as soon as we can.

What's not automated? First and foremost, understanding your content types and quality goals. Each client has different definitions of what is a good enough translation.

Intrigued. How to start?

It's simple. We bang on your API, find the broken stuff and ping you with the results.

If you want to focus on specific types of input, you can send us a sample. It can be huge, but it doesn't need to be - we have crawlers. If you want to focus on specific types of errors, you can send us a sample. If you want us to focus on custom or pre-launch systems, you can give us access.

Once you're satisfied, we set it up to be something you can get automatically - fast and continuous quality evaluation.


What are the error types?

We work with clients to define their goals. Error types should be actionable for machine translation researchers and developers.

A useful breakdown includes metrics for errors on named entities, untranslated tokens, symbols like numbers and URLs, structure and semantics.

All metrics can also be sliced by other dimensions like content type, language pair, language pair grouping or length.
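A minimal sketch of what slicing fail counts by dimension could look like, assuming each fail is recorded with a language pair, a content type and an error type. The record layout, field names and sample data are hypothetical and do not describe ModelFront's report format.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Fail(NamedTuple):
    language_pair: str
    content_type: str
    error_type: str


# Hypothetical fail records, for illustration only.
FAILS = [
    Fail("en-de", "reviews", "named_entity"),
    Fail("en-de", "reviews", "untranslated"),
    Fail("en-de", "listings", "number"),
    Fail("fr-en", "reviews", "named_entity"),
    Fail("fr-en", "listings", "named_entity"),
]


def count_by(fails: Iterable[Fail], *dimensions: str) -> Counter:
    """Count fails grouped by one or more record fields."""
    return Counter(tuple(getattr(fail, d) for d in dimensions) for fail in fails)


by_error = count_by(FAILS, "error_type")
by_pair_and_error = count_by(FAILS, "language_pair", "error_type")

for key, count in by_pair_and_error.most_common():
    print(key, count)
```

Grouping by more than one dimension at once makes it easier to see whether an error type is concentrated in a particular language pair or content type.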

What are the severity levels?

The main buckets are critical fails, catastrophic fails and scandalous fails. We don't focus on issues that are just stylistic or subjective.

Critical fails

Critical fails are not just machiny, they are broken enough that the reader can't even understand the key information. Typically they are issues with named entities and identity translation, numbers, prices and codes or semantic ambiguity, or critical structural issues.

Catastrophic fails

Catastrophic fails are not just due to the limitations of machine translation and inherent ambiguities of human language, but to problems in your data crawling, alignment, tokenization, training or serving.

Scandalous fails

Scandalous fails are potentially viral, because they are offensive or political. They are typically also critical or catastrophic fails that happen to involve a word or topic that is sensitive or NSFW.

We don't focus on stylistic preferences like emphasis, but we can catch violations of locale preferences (both lexical and formatting preferences), formality preferences (tu/vous) and casing or punctuation.

Does it only work with neural or seq2seq models?

Your translation system is a blackbox to us. ModelFront works with many systems, even RBMT, SMT or hybrid systems.

Is it accurate on human translations?

No. Human and machine errors have very different typology and distribution. Accuracy on human translation or transcreation input is a non-goal for now. (And it would require strong AI.)

Is it robust to noisy input?

Yes. We work mostly with user-generated data.

Is it robust to adversarial input?

Not yet. Accuracy on adversarial input to ModelFront is something we would love to work on, but so far we have focused on output that major machine translation systems actually produce.

We do focus on adversarial input to translation systems - that's basically what ModelFront does.

Can't you just train better translation?

No, because machine translation is not robust to adversarial input. We believe that an adversarial approach could be used to improve machine translation, but not to fully solve it.

Often it's possible to automatically predict that a translation is risky, but not know what the correct translation should be. Sometimes it requires context or common sense that no system has.

We do not recommend blindly maximizing for our eval metrics. There is often a natural tension between two dimensions of quality. For example, a change to reduce errors of untranslated tokens could increase errors of literally translated named entities.

ModelFront is meant to invoke human judgement efficiently - not replace it.

Our company can't send our data. How can ModelFront help us?

If your translation system is available as an API or the output is on your platform, ModelFront can work unilaterally to get your translations on key datasets. You don't need to send us anything or do any integration.

Yes, the goal is a system that gives you useful but highly automated evaluations on demand.

Can we compare to another company's system?

Yes, assuming that other company's system is publicly accessible.

How much does it cost?

ModelFront offers pilots and annual plans. The pilot covers us investing in custom crawling, human labelling and training, and gives you a chance to give your feedback and assess the usefulness and value to you.

Once you're satisfied, we can agree on a plan and set up the eval to be something you can get automatically - fast and continuous quality evaluation.

Our goal is to be 3x more efficient than you building something internally, and much more efficient and convenient long-term than human evals.

How many fails will we get?

The number of fails depends on the accuracy of your translation system, the input types you want us to focus on and the error types you want us to focus on.

That's great… But does it actually work?

Some problems in natural language require human intelligence or even more. We always ask for sample data and sample errors upfront, to set realistic goals before we start. ModelFront annual plans include a money-back satisfaction guarantee.

Does it work on arbitrary datasets?

Yes, in general. However, for specific quality goals and content types and formats, we need to invest in additional human labels and system tweaks.

There are specific content types and quality goals - e.g. humorous Tweets, or marketing slogans - where machine translation and machine eval are not a good fit.

How can it capture what translation models can't?

There is no silver bullet for translation quality. ModelFront combines multiple approaches and datasets.

There is plenty of data that translation systems do not capture but that other systems, for example image search, already do.

How does the accuracy compare to human eval?

We constantly compare our results to human eval. Our baseline goal is 90% accuracy and 90% precision, and then we invest in human linguist labels and improving our technology to constantly improve precision for your content types and quality goals.
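For reference, this is a minimal sketch of how accuracy and precision can be computed against human eval labels, using the standard definitions. The function name and the sample labels are hypothetical; `True` means "flagged as a fail".

```python
from typing import Optional, Sequence, Tuple


def accuracy_and_precision(
    predicted: Sequence[bool], human: Sequence[bool]
) -> Tuple[float, Optional[float]]:
    """Compare predicted fail flags with human labels.

    Accuracy is the share of segments where prediction and human label agree.
    Precision is the share of predicted fails that humans also marked as fails,
    or None if nothing was predicted as a fail.
    """
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, human))
    true_pos = sum(1 for p, h in pairs if p and h)
    false_pos = sum(1 for p, h in pairs if p and not h)
    agree = sum(1 for p, h in pairs if p == h)
    accuracy = agree / len(pairs)
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else None
    return accuracy, precision


# Hypothetical labels for five segments.
predicted = [True, True, False, False, True]
human = [True, False, False, False, True]
accuracy, precision = accuracy_and_precision(predicted, human)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Because real fails are rare, accuracy alone can look high even for a weak detector, which is one reason to track precision alongside it.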

Can it support pairs between languages that are not English?

Yes, and we love to support translation research and development in that mission.

Can it support multi-modal inputs or context?

Sadly no major translation API is doing this yet. If you want to be the first, we would love to support you in your mission.

Is there an API?

Yes. ModelFront's main solution is a scalable instant risk prediction API. You can set up a continuous integration or get risk predictions on live traffic. ModelFront can also continuously monitor your system.

If you have other ideas about how to automate your quality evaluation workflow or firefighting, or special requirements like on-prem deployment, please talk to us.
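To give a feel for what a continuous integration check built on risk predictions might look like, here is a minimal sketch. It does not use or describe the actual ModelFront API: the `RiskScorer` signature, the thresholds and the `toy_scorer` stand-in are all assumptions made for the example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical signature: (source, translation) -> risk between 0.0 and 1.0.
RiskScorer = Callable[[str, str], float]


def risky_fraction(
    pairs: Iterable[Tuple[str, str]], score_risk: RiskScorer, threshold: float
) -> float:
    """Return the share of translations whose risk is at or above the threshold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    risky = sum(1 for source, translation in pairs if score_risk(source, translation) >= threshold)
    return risky / len(pairs)


def passes_gate(
    pairs: Iterable[Tuple[str, str]],
    score_risk: RiskScorer,
    threshold: float = 0.5,
    max_risky_fraction: float = 0.01,
) -> bool:
    """Return True if the batch is within the allowed share of risky translations."""
    return risky_fraction(pairs, score_risk, threshold) <= max_risky_fraction


def toy_scorer(source: str, translation: str) -> float:
    """Stand-in scorer that only flags output identical to the source."""
    return 1.0 if source.strip().lower() == translation.strip().lower() else 0.0


batch = [
    ("Good morning", "Guten Morgen"),
    ("Thank you", "Thank you"),  # left untranslated
    ("See you soon", "Bis bald"),
    ("Invoice 4471", "Rechnung 4471"),
]
print(risky_fraction(batch, toy_scorer, threshold=0.5))
print(passes_gate(batch, toy_scorer, max_risky_fraction=0.1))
```

In a real pipeline the scorer would be replaced by calls to whatever risk prediction service you use, and a failed gate would block the release or raise an alert.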

Is there a console?

Not yet. For now, ModelFront eval results are delivered as emails with numbers, links and a report. More is coming soon!

ModelFront keeps posting screenshots of our fails. Can you make it stop?

ModelFront posts the most fails from the most popular and accurate translation engines. Consider it an honor! We at ModelFront look forward to the day when somebody posts our fails.

Ready to start or have more questions?

Ping us