Can you guess the ambiguous English words that caused the bad translations? Hints: grave, bat, you...
Can you find other translations that expose this hack?
How many translations are affected?
The English hack affects almost all translation pairs and all major engines today - Google Translate, Microsoft Translator and even DeepL.
|       | de         | en         | es         | fr         | ...         | ru         | sv         | zh         |
|-------|------------|------------|------------|------------|-------------|------------|------------|------------|
| de    |            | en → de    | es → de    | fr → de    | ... → de    | ru → de    | sv → de    | zh → de    |
| en    | de → en    |            | es → en    | fr → en    | ... → en    | ru → en    | sv → en    | zh → en    |
| es    | de → es    | en → es    |            | fr → es    | ... → es    | ru → es    | sv → es    | zh → es    |
| fr    | de → fr    | en → fr    | es → fr    |            | ... → fr    | ru → fr    | sv → fr    | zh → fr    |
| ...   | de → ...   | en → ...   | es → ...   | fr → ...   | ... → ...   | ru → ...   | sv → ...   | zh → ...   |
| ru    | de → ru    | en → ru    | es → ru    | fr → ru    | ... → ru    |            | sv → ru    | zh → ru    |
| sv    | de → sv    | en → sv    | es → sv    | fr → sv    | ... → sv    | ru → sv    |            | zh → sv    |
| zh-cn | de → zh-cn | en → zh-cn | es → zh-cn | fr → zh-cn | ... → zh-cn | ru → zh-cn | sv → zh-cn | zh → zh-cn |
| zh-tw | de → zh-tw | en → zh-tw | es → zh-tw | fr → zh-tw | ... → zh-tw | ru → zh-tw | sv → zh-tw | zh → zh-tw |
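To see how the hack works mechanically, here is a minimal sketch of how an engine can serve every cell of the matrix above with only to-English and from-English systems. `translate_direct` is a hypothetical stand-in for a single trained per-direction model, not a real API.

```python
# Minimal sketch of pivot (bridge) translation. translate_direct is a
# hypothetical stand-in for one trained model per direction.

PIVOT = "en"  # the bridge language - almost always English

def translate_direct(text: str, src: str, tgt: str) -> str:
    # Hypothetical: call the single trained model for this direction.
    raise NotImplementedError

def translate(text: str, src: str, tgt: str) -> str:
    """Serve any pair using only systems into and out of the pivot."""
    if src == tgt:
        return text
    if PIVOT in (src, tgt):
        return translate_direct(text, src, tgt)  # a direct system exists
    # e.g. fr -> de is served as fr -> en -> de; any ambiguity the
    # English intermediate introduces is baked into the final output.
    english = translate_direct(text, src, PIVOT)
    return translate_direct(english, PIVOT, tgt)
```

Note how an ambiguous English word in the intermediate text - a grave, a bat - silently corrupts the second hop.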
There are a few exceptions. For example, there are direct systems between very similar languages like Serbian, Croatian and Bosnian, and between variants like Simplified Chinese and Traditional Chinese. And these days, anyone with data and machines can take seq2seq and train a model for any pair.
Why is there no support for more direct pairs?
Supporting all direct pairs between 100 languages would require training, launching and maintaining a system for each ordered pair - 100 x 99. That's about 10,000 in total!
By using a pivot language or bridge language - almost always English - the number of systems can be reduced to about 200: one for each direction between each language and English.
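The arithmetic is easy to verify; a quick sketch (the numbers, not the code, are the point):

```python
# Systems needed for n languages, with and without an English pivot.
n = 100

all_direct = n * (n - 1)   # every ordered pair: 100 x 99 = 9,900
via_pivot = 2 * (n - 1)    # into English and out of English: 198

print(all_direct, via_pivot)  # 9900 198
```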
Lack of training data
There is simply not much parallel data for many pairs, compared to the data available for translation to and from English. That makes it hard to train a model, and also hard to evaluate one.
Engineering costs
Crawling, alignment, training, evaluation, deployment and maintenance for 10,000 pairs in production would create massive engineering costs. The amount of traffic for many obscure pairs simply does not justify the cost and effort.
Current research
Neural machine translation does not increase quality as much as it simplifies engineering. By lowering the barrier to entry, it offers the potential to train and launch many more direct pairs. In fact, Google researchers are already experimenting with a universal model that handles all language pairs.
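As a sketch of how such a universal model can be steered, Google's multilingual NMT work prepends an artificial token marking the target language to the source sentence; the exact token format below is illustrative, not the production one.

```python
# Sketch of the universal (multilingual) approach: one shared model for
# all pairs, steered by a target-language token prepended to the input.
# The "<2xx>" token format is illustrative.

def prepare_input(text: str, tgt: str) -> str:
    return f"<2{tgt}> {text}"

print(prepare_input("How are you?", "de"))  # "<2de> How are you?"
# The shared encoder-decoder then translates into German, and such models
# can even attempt pairs never seen together in training (zero-shot).
```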
The demand for direct translation between non-English languages comes mostly from outside the English-speaking world, so it would not be surprising if DeepL, Reverso, Yandex, Tencent or Baidu were first to market with open-domain direct translation for major pairs like German-French or Spanish-Chinese.
Risk prediction
ModelFront risk prediction is built to support direct and indirect pairs. We see significantly better translation quality for direct pairs.
Not sure whether to invest in training a direct pair? Talk to us about your language pairs, content types and quality goals, and we will gladly advise you on your options.