What causes bad translations?

Machine translation error causes at the system, process and language levels

We often hear about the results of bad machine translations: translator post-editing effort, breakdowns in communication, lost conversions…

But what about the causes of bad machine translations?

Neural machine translation has improved most aspects of quality, but some quality problems seem to be timeless, and a few are new.

The causes can be split into three levels: the translation systems, the translation processes that rely on them, and natural human language itself.


System

System errors originate on the machine translation provider’s side: bad or missing data, or bad pre-processing or post-processing. This level covers anything that can’t be fixed without re-training the machine translation model or re-engineering the training and serving infrastructure.

Bad training data

Bad noisification

Bad pre-processing

Bad post-processing
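
As an illustration of the first item above, bad training data, here is a minimal sketch of the kind of heuristic filter a provider might run over parallel sentence pairs before training. The function names, the checks and the `max_ratio` threshold are assumptions made for this example, not a description of any particular provider's pipeline.

```python
def is_clean_pair(source, target, max_ratio=3.0):
    """Heuristically decide whether a (source, target) pair looks usable for training."""
    source, target = source.strip(), target.strip()
    # Empty sides carry no translation signal.
    if not source or not target:
        return False
    # Identical sides usually mean the text was copied rather than translated.
    if source == target:
        return False
    # A very lopsided character-length ratio suggests misalignment.
    # max_ratio is an illustrative threshold, not a recommended value.
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_ratio


def filter_pairs(pairs, max_ratio=3.0):
    """Keep only the pairs that pass is_clean_pair."""
    return [(s, t) for s, t in pairs if is_clean_pair(s, t, max_ratio)]


pairs = [
    ("Hello", "Bonjour"),
    ("", "Vide"),
    ("Same", "Same"),
    ("Hi", "Une phrase beaucoup trop longue"),
]
clean = filter_pairs(pairs)
```

Only the first pair survives here. Real pipelines typically combine many more signals, and a character-length ratio in particular is only a rough proxy across languages with very different writing systems.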


Process

Process errors are on the client side or in the integration or agreement between the client and the machine translation provider. They’re the most common, and luckily also the easiest to fix.

Bad sentence segmentation

Wrong language

Wrong script

Unlanguage

Tags

Formatting

Encoding
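
Several of the process errors above can be caught on the client side before any text is sent for translation. The following is a rough sketch, using only Python's standard library, of a pre-flight check for three of them: encoding damage, unlanguage and wrong script. The `EXPECTED_SCRIPT` table and the function names are hypothetical, and reading the script from the first word of a Unicode character name is a simplification rather than a proper script lookup.

```python
import unicodedata

# Hypothetical mapping from language code to the script its text is expected to use.
EXPECTED_SCRIPT = {"en": "LATIN", "ru": "CYRILLIC", "el": "GREEK", "hy": "ARMENIAN"}


def dominant_script(text):
    """Return the most common script among the letters in text, or None if it has no letters."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def preflight(text, source_lang):
    """Return a list of likely process-level problems with one input segment."""
    problems = []
    # U+FFFD is the replacement character decoders emit for bytes they could not decode.
    if "\ufffd" in text:
        problems.append("encoding")
    script = dominant_script(text)
    if script is None:
        # No letters at all: numbers, IDs, punctuation and similar unlanguage.
        problems.append("unlanguage")
    elif source_lang in EXPECTED_SCRIPT and script != EXPECTED_SCRIPT[source_lang]:
        problems.append("wrong script")
    return problems
```

With these assumptions, `preflight("Privet mir", "ru")` flags a wrong script, because the Russian text has been romanized, while `preflight("12345 -- 67", "en")` flags unlanguage. A check like this does not detect the wrong language within the same script, which would need an actual language identification model.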


Language

Natural human language is infinitely expressive. There are rules, exceptions, and exceptions to the exceptions. Different languages and cultures express meaning and ideas differently; they do not map 1:1. Some ambiguities require intelligence and reasoning to resolve. Some errors can be solved with more data or more context, but most can only be reduced overall, not eliminated.

Noise

Lexical ambiguity

Syntactic ambiguity

Long-distance dependencies

Untranslatability

Style preferences

Domain

Locales

Dialects


Language is constantly evolving. In most scenarios, we can’t control language, only build and maintain systems and processes to handle it better or fail gracefully.