Constants

When not to translate

One of the timeless dilemmas of translation is what to translate and what not to translate. It could be a word or phrase that is potentially a named entity, like Apple, Stripe or Prince of Persia, or a symbol like a flight code.

In modern times, very often it's an email address, a URL or even computer code or markup tags. If a word or segment is the same across all languages, we call it a constant.

How much are you translating for nothing?

One of the best practices for translation and machine translation workflows is to mark constants so that they are not sent for translation - a step towards controlled language.

However, because the developers or writers who create the content are often not thinking about localization or translation at creation time, all translation workflows struggle with constants.

In reality, roughly 1 in 10 segments we see are constants that should not be translated at all. For certain domains or content types like commerce or technical content, it can be more.

Hyperneural

SMT (statistical machine translation) had one useful characteristic: if the word or segment was new and it didn't know what to do, it would just copy over the word or segment. That was bad and good.

NMT too often fails to just copy the string. Of the segments that should be copied, NMT erroneously changes roughly 2/3 in some way, requiring extra post-editing just to change them back.

Most of the weapons in a machine translation research team's arsenal are double-edged swords. For example, dialing up subword handling improves translation of compounding and morphology, but hurts translation of named entities and constants.

In our internal tests of top machine translation APIs on hundreds of thousands of segments of emails, usernames, URLs and source code, 1/10 to 9/10 were translated, depending on the engine and language pair.

Trade-offs… So this problem is not going away soon.

Efficiency

Can ModelFront catch it? Yes! It's one of the easier risks to predict. So at least you caught it.

But then you and your translators have to deal with fixing it - more work, and fixing mangled URLs is precisely not the type of work that translators dream about.

So even if mistakenly translated constants are only, for example, 7% of the total segments, they may be 20% to 50% of the total post-edits!
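To see how a small share of segments can become a large share of the post-edits, here is a back-of-the-envelope sketch. The function name and the post-edit rates for the remaining segments are assumptions chosen for illustration, not measured figures.

```python
def constant_share_of_post_edits(mangled_constant_rate: float,
                                 other_edit_rate: float) -> float:
    """Share of all post-edits that are spent restoring mangled constants.

    mangled_constant_rate: fraction of ALL segments that are constants the
        engine changed (0.07 in the example above).
    other_edit_rate: fraction of the REMAINING segments that need a
        post-edit for any other reason (an assumed value).
    """
    constant_edits = mangled_constant_rate
    other_edits = (1.0 - mangled_constant_rate) * other_edit_rate
    total_edits = constant_edits + other_edits
    if total_edits == 0:
        return 0.0  # nothing was edited at all
    return constant_edits / total_edits


# Assumed scenario: 30% of the other segments need an edit.
print(round(constant_share_of_post_edits(0.07, 0.30), 2))   # 0.2
# Assumed scenario: only 7.5% of the other segments need an edit.
print(round(constant_share_of_post_edits(0.07, 0.075), 2))  # 0.5
```

Under these assumptions, the better the engine gets at everything else, the more the mangled constants dominate what is left to fix.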

One approach is to manually maintain a very large dictionary of words not to translate, but it will never cover new input, nor deal with the ambiguous cases.

A simple heuristic

So how can you keep constants constant?

You can always first consider a copy of the source segment as a potential translation, and get the risks:

The world needs independent businesses.
The world needs independent businesses.
100% risk

https://www.shopify.com/covid19
https://www.shopify.com/covid19
0% risk

Learn about the actions we’re taking to address the impact of COVID‑19
Learn about the actions we’re taking to address the impact of COVID‑19
100% risk

So right away you see that some segments definitely need to be translated, and some probably don't.

You only need to do this once for the string across all language pairs.
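A minimal sketch of that first pass might look like the following. Everything named here is hypothetical: `risk_fn` stands for whatever risk prediction call you use, `toy_risk` is a crude regex stand-in so the example runs on its own (it is not how a real risk model works), and the `max_risk` threshold is an arbitrary choice.

```python
import re
from typing import Callable, Iterable, Set

# Hypothetical signature: (source, translation) -> risk between 0.0 and 1.0.
RiskFn = Callable[[str, str], float]

# Toy pattern for a whole segment that is a URL or an email address.
_CONSTANT_LIKE = re.compile(r"^(https?://\S+|[\w.+-]+@[\w-]+(\.[\w-]+)+)$")


def toy_risk(source: str, translation: str) -> float:
    """Stand-in for a real risk model, for illustration only."""
    if source == translation:
        return 0.0 if _CONSTANT_LIKE.match(source.strip()) else 1.0
    return 0.5


def find_constants(segments: Iterable[str], risk_fn: RiskFn,
                   max_risk: float = 0.1) -> Set[str]:
    """Score each unique source segment against a copy of itself.

    A low risk for the copy suggests the segment is a constant. The result
    depends only on the source string, so it can be reused for every
    target language.
    """
    constants = set()
    for segment in set(segments):  # score each unique string once
        if risk_fn(segment, segment) <= max_risk:
            constants.add(segment)
    return constants


segments = [
    "The world needs independent businesses.",
    "https://www.shopify.com/covid19",
]
print(find_constants(segments, toy_risk))
# {'https://www.shopify.com/covid19'}
```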

Now you can machine translate them all, and get the risks:

The world needs independent businesses.
El mundo necesita negocios independientes.
5% risk

https://www.shopify.com/covid19
https://www.shopificar.com/covid19
100% risk

Learn about the actions we’re taking to address the impact of COVID‑19
Conozca las acciones que estamos tomando para abordar el impacto de COVID-19
49% risk

What's the best way to integrate this into a workflow?

For those scenarios where we are interested in safely auto-approving raw machine translation for as many segments as possible, we can just use https://www.shopify.com/covid19 as soon as we know it's 0% risk.

In a more conservative scenario with full human review, we can at least let the translator start post-editing from https://www.shopify.com/covid19 - one click to approve - instead of forcing him or her to correct https://www.shopificar.com/covid19.
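Both scenarios can be sketched with one small function. This is an illustration under assumptions, not a reference implementation: `translate_fn` and `risk_fn` are hypothetical callables standing in for your machine translation and risk prediction calls, and the threshold and the `full_human_review` flag are names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures for the two services this sketch relies on.
RiskFn = Callable[[str, str], float]   # (source, translation) -> 0.0..1.0
TranslateFn = Callable[[str], str]     # source -> machine translation


@dataclass
class Candidate:
    source: str
    target: str          # what the workflow uses or shows the translator
    risk: float
    auto_approved: bool  # True only when no human review is needed


def pick_candidate(source: str, translate_fn: TranslateFn, risk_fn: RiskFn,
                   auto_approve_below: float = 0.05,
                   full_human_review: bool = False) -> Candidate:
    """Try the copy first; fall back to machine translation."""
    copy_risk = risk_fn(source, source)
    if copy_risk <= auto_approve_below:
        # Looks like a constant: use the copy and skip machine translation.
        return Candidate(source, source, copy_risk, not full_human_review)
    target = translate_fn(source)
    mt_risk = risk_fn(source, target)
    approved = mt_risk <= auto_approve_below and not full_human_review
    return Candidate(source, target, mt_risk, approved)
```

With `full_human_review=True`, nothing is auto-approved, but the translator still starts from the untouched copy of a constant rather than from a mangled machine translation.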

A metric to monitor

As you run this simple heuristic to check for constants on your translation stream, or retroactively on your translation memory, you can track how many constants are leaking into the translation workflow over time and by source, and be alerted to sudden increases that could indicate a business process problem.
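One rough way to track this is sketched below. The record layout, the function names and the alert threshold are all assumptions for illustration: each record is a `(period, origin, is_constant)` tuple, where `period` is a sortable label such as `"2024-02"` and `origin` names the content source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, bool]  # (period, origin, is_constant)
Key = Tuple[str, str]           # (origin, period)


def constant_rates(records: Iterable[Record]) -> Dict[Key, float]:
    """Fraction of segments flagged as constants, per origin and period."""
    totals: Dict[Key, int] = defaultdict(int)
    constants: Dict[Key, int] = defaultdict(int)
    for period, origin, is_constant in records:
        key = (origin, period)
        totals[key] += 1
        constants[key] += int(is_constant)
    return {key: constants[key] / totals[key] for key in totals}


def sudden_increases(rates: Dict[Key, float],
                     min_jump: float = 0.10) -> List[Tuple[str, str, float, float]]:
    """List (origin, period, previous_rate, rate) where the rate rose by
    at least min_jump compared with the previous period of that origin."""
    by_origin = defaultdict(list)
    for (origin, period), rate in rates.items():
        by_origin[origin].append((period, rate))
    alerts = []
    for origin, series in by_origin.items():
        series.sort()  # relies on period labels sorting chronologically
        for (_, previous), (period, rate) in zip(series, series[1:]):
            if rate - previous >= min_jump:
                alerts.append((origin, period, previous, rate))
    return alerts
```

A fixed jump between consecutive periods is the simplest possible alert rule. A real pipeline would likely want a minimum segment count per period before trusting the rate.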

It's remarkable how the work that is automatable is precisely the work that we humans do not enjoy. Keeping constants constant frees human translators to work on more creative language content.