The progress in machine learning is amazing. With enough data, machines can now beat humans at chess, image labelling, and even driving. Machine translation improves every year.
But human language is complex. Machine translation fails often and unpredictably. The results can be nonsense, misleading, offensive or even dangerous.
The basic idea of risk prediction is simple: we want to guess whether a translation is right or wrong.
More precisely, it is the prediction of the risk that a machine-translated sentence is wrong. In some cases we know that a translation is wrong, but in other cases we only know that there is a risk that it might be wrong.
For example, there could be a 50% probability that a translation is wrong, but a 100% probability that it is risky. A sentence like
"He was hit by a bat" could be translated into Spanish as
"Fue golpeado por un bate" or
"Fue golpeado por un murciélago", depending on whether the wooden stick or the flying animal is meant. The translation could be wrong or it could be right. Without context or a reference translation, we don't know. But we do know it's risky.
With real-time risk prediction, client applications can decide how to react to risk for each translation, in a way that is right for their users and their speed, scale and quality goals.
Traditional quality evaluation, with metrics like BLEU and METEOR, evaluates an engine at the corpus level. It requires existing human reference ("golden") translations, so it can't be used instantly for new content, or for real-time predictions on a stream of translations.
Because risk prediction can easily be run on a large dataset, it also effectively provides quality evaluation, even for a dataset with no human reference translations. ModelFront now also offers quality evaluation for machine translation researchers and developers.
Risk prediction is related to quality estimation, as it's known in the research world. Quality estimation is usually done at the sentence level. State-of-the-art approaches are based on feature engineering and learn a separate model for each language pair, which makes it very hard to support hundreds of languages and tens of thousands of pairs.
The greatest problem with traditional quality estimation is that it is focused on HTER, a measure of post-editing effort. That is useful for CAT tools and TMS, but it doesn't separate stylistic preferences from catastrophic mistranslations.
Developers are using our machine translation risk prediction in creative ways. The results are so scalable and cost-effective that they open up whole new use cases.
It's no secret that human translation delivers much higher quality. On the other hand, machine translation is about 500 times less costly than human translation, and is also instant.
With translation quality risk prediction, you can instantly use the machine translations that are good, and send only the risky translations to human translators. Alternatively, you can sort the translations by risk, so that human translators work on the riskiest translations first.
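The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each translation is assumed to already carry a predicted risk score between 0 and 1 (higher meaning riskier), the `Translation` class, the `route` function and the 0.5 threshold are hypothetical names and values chosen for this example, and the scores in the sample batch are made up.

```python
from dataclasses import dataclass


@dataclass
class Translation:
    source: str
    target: str
    risk: float  # assumption: predicted risk in [0, 1], higher is riskier


def route(translations, threshold=0.5):
    """Split translations into ones to publish and ones to send to humans.

    The threshold is a hypothetical value; a real application would tune it
    to its own quality and cost goals.
    """
    publish = [t for t in translations if t.risk < threshold]
    review = [t for t in translations if t.risk >= threshold]
    # Riskiest first, so human translators start with the worst cases.
    review.sort(key=lambda t: t.risk, reverse=True)
    return publish, review


# Made-up risk scores, for illustration only.
batch = [
    Translation("Good morning", "Buenos días", 0.05),
    Translation("He was hit by a bat", "Fue golpeado por un murciélago", 0.60),
    Translation("Thank you", "Gracias", 0.02),
    Translation("Keep out of reach", "Mantener fuera", 0.85),
]

publish, review = route(batch)
for t in review:
    print(f"{t.risk:.2f}  {t.source} -> {t.target}")
```

Lowering the threshold sends more sentences to human translators and raises cost; raising it does the opposite.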
Now you can get 99% quality at 10% of the price.
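The cost side of that claim can be checked with simple arithmetic. The sketch below assumes that every sentence is machine translated and that a fraction of them is additionally sent to human translators. The 500x cost ratio comes from the figure quoted above; the 10% human fraction and the function name `blended_cost_ratio` are assumptions for illustration, and the cost of risk prediction itself is ignored.

```python
def blended_cost_ratio(human_fraction, mt_discount=500):
    """Cost of a hybrid workflow relative to all-human translation.

    Every sentence is machine translated at 1/mt_discount of the human
    cost, and human_fraction of the sentences are also sent to humans.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    return human_fraction + 1.0 / mt_discount


# Assumption: 10% of sentences are flagged as risky and sent to humans.
print(f"{blended_cost_ratio(0.10):.3f}")  # prints 0.102
```

Under these assumptions the hybrid workflow costs roughly a tenth of translating everything by hand, because the machine translation cost is negligible next to the human share.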
For example, if there are 100 reviews of a product, and 90 of them are translated well, we can just show those 90 to the user, and drop the 10 risky ones, or put them on the last page.
Simply measuring, graphing and monitoring aggregate quality is a big step forward. How many risky translations were served today? Is the quality better for descriptions or for reviews? How does it vary by domain, or by language pair?
By setting up alerts, you can also be sure to know if the translation risk for any slice changes.
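As a rough sketch of this kind of monitoring, the Python below computes the share of risky translations per slice and flags slices whose share rose noticeably against a baseline. It assumes you log a slice name (content type, domain or language pair) and a predicted risk score for every translation served. The function names, the 0.5 risk threshold and the 0.10 alert margin are hypothetical, and the sample records are made up.

```python
from collections import defaultdict


def risky_rate_by_slice(records, threshold=0.5):
    """Return {slice_name: share of translations with risk >= threshold}.

    records is an iterable of (slice_name, risk) pairs.
    """
    totals = defaultdict(int)
    risky = defaultdict(int)
    for slice_name, risk in records:
        totals[slice_name] += 1
        if risk >= threshold:
            risky[slice_name] += 1
    return {name: risky[name] / totals[name] for name in totals}


def slices_to_alert(today, baseline, max_increase=0.10):
    """Names of slices whose risky share rose by more than max_increase."""
    return sorted(
        name
        for name, rate in today.items()
        if rate - baseline.get(name, 0.0) > max_increase
    )


# Made-up log entries: (slice, predicted risk).
records = [
    ("reviews", 0.9),
    ("reviews", 0.2),
    ("descriptions", 0.1),
    ("descriptions", 0.3),
]
today = risky_rate_by_slice(records)
baseline = {"reviews": 0.25, "descriptions": 0.0}
print(today)
print(slices_to_alert(today, baseline))
```

The same grouping works for any slice key, so one log can answer the questions about content type, domain and language pair.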
Compare translations from multiple APIs to choose the best
If you have trained a custom engine like AutoML for Google Translate, you can compare the translations from your custom models with those of the default API, and choose the best one for each sentence. You can also look at aggregate quality to train better custom models.
If there are multiple versions of the original input text, for example with variations in spelling or casing, you can translate each version and use the higher-quality translation.
If the user knows multiple languages, you can even compare the translations into different languages, and show the best one.
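All of these comparisons come down to the same step: score each candidate translation and keep the one with the lowest predicted risk. A minimal Python sketch follows, assuming each candidate already has a risk score between 0 and 1. The function name `pick_lowest_risk`, the candidate labels and the scores are hypothetical and made up for illustration.

```python
def pick_lowest_risk(candidates):
    """Return the (label, translation, risk) tuple with the lowest risk.

    candidates is a non-empty list of (label, translation, risk) tuples,
    where the label names the engine, input variant or target language.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda c: c[2])


# Made-up scores for two engines translating "He was hit by a bat".
candidates = [
    ("default", "Fue golpeado por un murciélago", 0.60),
    ("custom", "Fue golpeado por un bate", 0.25),
]
label, translation, risk = pick_lowest_risk(candidates)
print(label, translation, risk)
```

When two candidates tie, `min` keeps the first one in the list, so putting the preferred engine first gives a simple tie-break.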
Add confidence to reading and composing with machine translation in business workflows. For example, when a team member is replying to a user or customer review or email with a machine translated text, the translation can be checked for risks before it is sent.
After working at Google Translate, using translation APIs and watching friends and family work as human translators, we knew that there are many translations that require human intelligence.
But we also knew that there are too many machine translation errors that are detectable and preventable with the right implementation of today's technology.
ModelFront takes a fundamentally different approach to machine translation risk prediction, using deep learning and very large datasets. ModelFront is opening up many new use cases for balancing machine scale and human quality.
The ModelFront team is led by full-stack engineers with experience building highly-scaled data-centric APIs, marketplaces and SaaS platforms like Google Translate, Google Play, PubNative, Aarki and early-stage startups.
You can catch our team at Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics (ACL), Workshop on Machine Translation (WMT) and Applied Machine Learning Days (AMLD).
For technical support please email [email protected].