What is a good evaluation?

Comparing machine translation systems in a meaningful way

Thanks to open-source neural machine translation, there are more and more machine translation options, and the average quality of major machine translation APIs is very similar across the major language pairs.

Numbers without context

An evaluation score of 74 versus 76 out of 100 is not a significant difference. If the evaluation set has fewer than 1K lines, or isn't a well-randomized sample of a larger corpus, that difference could be pure noise driven by a few outliers.

It will only help you conclude that machine translation is great when it works, but it doesn't always work. You knew that already.

It won't help you decide which API or model is best for your end goal - for example, reducing post-editing effort, selling more items or even something as critical as translating customer service chats or making coronavirus information accessible to the world - or how you could improve it.
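To check whether a gap like 74 versus 76 holds up at all, you can run a paired bootstrap over per-segment scores. Here's a minimal sketch in Python, assuming you already have one score per segment for each system on the same test set - any 0-100 metric will do:

    # Paired bootstrap resampling: is system B really better than system A?
    # Assumes scores_a and scores_b are aligned per-segment scores on the same test set.
    import random

    def paired_bootstrap(scores_a, scores_b, samples=1000):
        n = len(scores_a)
        wins = 0
        for _ in range(samples):
            idx = [random.randrange(n) for _ in range(n)]
            mean_a = sum(scores_a[i] for i in idx) / n
            mean_b = sum(scores_b[i] for i in idx) / n
            if mean_b > mean_a:
                wins += 1
        return wins / samples  # fraction of resamples where B beats A

If that fraction hovers around 0.5, the two-point gap is probably noise.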

Asking good questions

Whether you do a human evaluation, an automated evaluation on ModelFront or both, how do you decide which machine translation API or custom model is better for you?

Setting up a good evaluation is like asking good questions. What does this API or model do best on? How does it do on our most difficult content? What is its Achilles' heel? Is adding custom glossaries a net win?

The quality of the major APIs and models differs mostly on the edge cases. And edge cases are actually easy to test, especially with automated evaluations.

Scale it up

We always recommend evaluating with at least 1K lines for statistical significance, and every new ModelFront account includes free credit for roughly 10K sentences.

With human evaluations, you're often limited by resource constraints. But human evaluations, just like automated evaluations, are invalid if they're too small or not random.

You can even send 1M lines and sort by risk. Translation catastrophes - totally mangled, misleading or offensive translations - happen at rates like one in a thousand or one in a million, so you need scale to catch them.
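Here's a minimal sketch in Python, assuming your data is a tab-separated file with source, translation and a per-segment risk score - the file name and column names are just placeholders for whatever your export actually contains:

    # Draw a well-randomized 1K-line sample for a fair evaluation,
    # and surface the riskiest lines to hunt for catastrophes.
    import csv, random

    def read_tsv(path):
        with open(path, encoding="utf-8") as f:
            return list(csv.DictReader(f, delimiter="\t"))

    def write_tsv(path, rows):
        with open(path, "w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)

    rows = read_tsv("corpus.tsv")
    write_tsv("sample.tsv", random.sample(rows, k=min(1000, len(rows))))
    write_tsv("riskiest.tsv", sorted(rows, key=lambda r: float(r["risk"]), reverse=True)[:100])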

Slice it up

When you look at specific slices, suddenly you see much bigger differences in quality between machine translation APIs and models.

Slicing can help you choose APIs or custom options, change your process or bring concrete evidence to the API provider that there is a systemic issue compared to your other options - their competitors.
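A minimal sketch in Python, assuming you've tagged each scored segment with a slice label - content type, language pair, has-tags and so on:

    # Average per-segment scores by slice and engine so the differences show up.
    from collections import defaultdict

    def score_by_slice(rows):
        totals = defaultdict(lambda: [0.0, 0])
        for slice_label, engine, score in rows:
            totals[(slice_label, engine)][0] += score
            totals[(slice_label, engine)][1] += 1
        return {key: total / count for key, (total, count) in totals.items()}

    # Example rows: ("html_tags", "api_a", 62.0), ("html_tags", "api_b", 78.0), ...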

Content type

Just like you evaluate multiple language pairs separately, it's best to evaluate your marketing material and technical docs separately.

Content type is less about domain - gaming, beauty or finance - and more about the text structure - titles, descriptions, emails, chats, documentation and UI strings are very different.

Tags and formatting

The major machine translation APIs have better quality on text without any tags than on text with tags.

What happens if you evaluate only the segments that contain HTML tags? Can you also induce the problem by putting <span></span> tags around every word in every segment?
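For example, a quick way to generate that stress test in Python:

    # Wrap every word of a segment in <span></span> tags; apply it to every segment,
    # then translate and compare against the clean versions.
    def add_span_tags(segment):
        return " ".join(f"<span>{word}</span>" for word in segment.split())

    # add_span_tags("Add to cart") -> "<span>Add</span> <span>to</span> <span>cart</span>"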

Noise

Noisy input is a fact of life. What happens on noisy text, whether it's user-generated, a simple mistake or the result of process issues?

You can always take clean content and noisify it, for example by making segments all uppercase or all lowercase.
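Something like this is enough to noisify a clean test set:

    # Randomly uppercase or lowercase whole segments to see how gracefully each engine degrades.
    import random

    def noisify(segment):
        return segment.upper() if random.random() < 0.5 else segment.lower()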

Constants

What happens to your constants - URLs, emails, codes, usernames?
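A simple check is to extract the constants from the source and make sure they survive in the translation. The patterns below are just examples - extend them for your own formats:

    # Flag translations where a URL, email or code from the source went missing or changed.
    import re

    CONSTANT_PATTERNS = [
        r"https?://\S+",                   # URLs
        r"[\w.+-]+@[\w-]+\.[\w.]+",        # emails
        r"\b[A-Z0-9]{2,}-[A-Z0-9]{2,}\b",  # codes like SKU-12345
    ]

    def missing_constants(source, translation):
        missing = []
        for pattern in CONSTANT_PATTERNS:
            for match in re.findall(pattern, source):
                if match not in translation:
                    missing.append(match)
        return missing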

Numbers

Numbers are not necessarily constants - decimal separators and units vary between languages and locales.

Are the numbers and formatting correct?
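A crude but useful check is to compare just the digits, ignoring separators, since those legitimately vary by locale:

    # Flag translations whose digits don't match the source, ignoring separators.
    import re

    def digit_mismatch(source, translation):
        digits = lambda text: sorted(re.findall(r"\d", text))
        return digits(source) != digits(translation)

    # digit_mismatch("1,5 kg", "1.5 kg") -> False
    # digit_mismatch("1,5 kg", "2.5 kg") -> True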

Names

How do the APIs compare on your glossary, your product names, your employee names or your customer names?

Language pairs

What happens to medium- and low-resource languages and pairs, even pairs like German to French that are typically implemented indirectly, by pivoting through English?

Locales

Is the quality better when the source text is American English, not British or global? Is the quality better when translating into Brazilian Portuguese than into European Portuguese, or worse for all target languages with multiple variants?

Formatting

Does quality vary depending on formatting, like punctuation and spacing? Can you also induce the problem by replacing all ASCII quotes with fancy quotes, or all fancy quotes with ASCII quotes?
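A rough swap like this is enough to induce the problem - it doesn't even try to pair the quotes correctly:

    # Swap ASCII quotes for curly quotes, or the reverse, to stress-test formatting.
    def fancy_quotes(segment):
        return segment.replace('"', "\u201c").replace("'", "\u2019")

    def ascii_quotes(segment):
        return segment.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")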

Extreme input

Neural machine translation has better fluency, but is also less predictable on random input.

That could be text in a third language, or a segment where the wrong source language was selected.

Length

Machine translation is best at well-formed, medium-length sentences. How do the engines compare on your very short segments? And on very long ones?
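One way to compare is to bucket segments by length and score each bucket separately - the cutoffs here are arbitrary, so pick ones that match your content:

    # Split an evaluation set into short, medium and long segments.
    def length_bucket(segment, max_short=5, min_long=40):
        words = len(segment.split())
        if words <= max_short:
            return "short"
        if words >= min_long:
            return "long"
        return "medium"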

Sentence segmentation

Machine translation APIs, like CAT tools, need to segment paragraphs into sentences. It's hard - it can require full human intelligence - and when it goes wrong, the result is often a bad translation.

How do the engines compare when multiple sentences are in the same segment? And when one sentence is chopped up into multiple segments?
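You can generate both stress cases from normal, well-segmented sentences:

    # Join neighboring sentences into one segment, or chop one sentence into pieces,
    # then compare against the translations of the normal versions.
    def join_pairs(sentences):
        return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]

    def chop(sentence, parts=2):
        words = sentence.split()
        size = max(1, len(words) // parts)
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]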

Time

Machine translation APIs train on web data, but not constantly. Their quality on new topics and new words is not guaranteed.

Is your machine translation API as good on your recent and new incoming content as it is on a sample of all your content?

Custom translation

Customizing translation - whether with glossaries, custom models like Google's or Microsoft's, or instant adaptation like ModernMT's - is not a solved problem.

The generic models are trained on hundreds of millions or billions of translations. Custom models are fine-tuned on top of that with a few thousand to a few million translations, using a higher learning rate - the variable that determines how aggressively to generalize from a few examples.

Too low, and no customization happens. Too high, and the results are unpredictable. The machine learning research phenomenon "catastrophic forgetting" is exactly what it sounds like.

So the machine translation API with the best generic model is not necessarily the machine translation API with the best custom model for your data. They also react differently to your dataset size and quality.

Monitor updates

Your custom model has not been updated since you trained it, but updates to the generic models of most machine translation APIs are launched a few times a year, typically silently.

Is your custom model still better than the latest generic model? Have there been quality regressions on certain slices?

Stay open to all options

Is instant adaptation from a .tmx, like with ModernMT, better for your workflow than training custom models with Microsoft or Google? What about free offline models like Fairseq or Opus MT?

Machine translation APIs are constantly adding new languages, features and pricing. Some are overrated, some are underrated and it all depends on your specific parameters.

Human and machine

Humans are the gold standard - nothing replaces a deep look at the actual text input and output and full understanding of the context and processes.

But, like automated evaluations, human evaluations can also be invalid or unhelpful. And they are very slow and expensive.

Your best bet is a hybrid - use human evaluations to confirm that automated evaluations like ModelFront are directionally correct and in agreement with professional human judgment.

Then use automated evaluation, like ModelFront, for more coverage and convenience - more languages, more APIs, more slices, more scale and more frequency - and to focus on the most important problems for human analyses and solutions.