Parallel data filtering

Filtering training data for better machine translation

Bad data causes bad translations.

Even back in the days of statistical machine translation, bad data was a top cause of errors in production systems. With neural machine translation models that’s still true - they’re more sensitive in some ways, less interpretable and predictable, and the majority of the data today is synthetically generated. Custom models, which need to learn from relatively little data, can be even less predictable.

The solution is parallel data filtering. The top machine translation research teams, like those at Google and Microsoft, have sophisticated internal systems for parallel data filtering at scale. Inside some companies, there are also projects to filter or label data manually. In open research, there are conference shared tasks and open-source options, like Zipporah and LASER.

And now ModelFront provides production-strength parallel data filtering with a few clicks.


Parallel data

Traditionally there are two main sources of parallel data: human translations like translation memories, and data crawled and aligned from the web.

Monolingual data - text - is much easier to find than parallel data. Modern machine translation researchers lean heavily on synthetic data generated from text that’s already in the target language.

The main synthetic data generation technique is back-translation - machine-translating the text back into the source language or languages. A similar technique is back-copying - simply copying the target text over to the source text.

Those techniques are a bit counterintuitive, but they’re surprisingly effective for two reasons:

  1. Noise in the translation is bad, but source-side noise - noise in the original text - is not just harmless, it’s even good.

  2. The machine translation model should implicitly learn a great target-side language model, beyond what’s in the parallel data.
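
The two techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `translate_to_source` is a hypothetical stand-in for whatever target-to-source machine translation system is available, and the function names are our own.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_copy(target_sentences: Iterable[str]) -> List[Pair]:
    """Back-copying: reuse each target sentence as its own source."""
    return [(t, t) for t in target_sentences]


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Pair]:
    """Back-translation: machine-translate target text into the source language.

    `translate_to_source` is a hypothetical callable wrapping any
    target-to-source MT system. The original text stays on the target side.
    """
    return [(translate_to_source(t), t) for t in target_sentences]
```

In both cases the original text ends up on the target side, so any noise introduced by the process lands on the source side.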

Error types

Parallel data can have all types of errors: misaligned sentences, bad sentence segmentation, bad encodings. It could just be in the wrong language, or mixed language.

Murphy’s Law: “Anything that can go wrong will go wrong.”

Web data is especially noisy. For example, Wikipedia articles are not translations of each other, so the sentences being aligned are not necessarily alignable. Much of the web data is contaminated with new and old machine translations.
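
Some of the error types above can be caught with cheap heuristics before any model is involved. The sketch below is only a first pass under our own assumptions: the labels, the function name and the `max_length_ratio` default are illustrative, and it does not detect wrong-language or machine-translated text.

```python
from typing import Optional


def basic_error(source: str, target: str, max_length_ratio: float = 3.0) -> Optional[str]:
    """Return a label for the first obvious problem found, or None."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return "empty"
    if "\ufffd" in src or "\ufffd" in tgt:
        # U+FFFD is the replacement character left by a failed decode.
        return "bad_encoding"
    if src == tgt:
        # The target is a verbatim copy of the source.
        return "untranslated"
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if longer / shorter > max_length_ratio:
        # Large length gaps often mean misalignment or bad segmentation.
        return "length_mismatch"
    return None
```

The `untranslated` check is meant for crawled or human-translated data. It would also flag pairs created on purpose by back-copying, so apply it before adding synthetic data.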

Is a good translation good training data?

No. In fact, the most creative translations are some of the most problematic. Human translations of books, subtitles or marketing material may be restructured or reordered to fit the flow or layout. An idiom, telephone number, language name or price could be translated to a different, locale-appropriate idiom, telephone number, language name, or currency and format.

EN: I'm feeling lucky.
FR: J'ai de la chance.

EN: Download Firefox - English
FR: Télécharger Firefox - Français

EN: $20K
FR: 17,7 millions d’euros

The best translation for a specific project and context may not generalize well.

Is translation symmetric?

No. A valid translation in one direction can be bad training data if flipped in the other direction. For example, the original text may already be noisy or even already in the target language, or use a rare, informal or gender-specific word. There can be ambiguity or specificity that cannot be captured in both languages.

Can we filter with round-trip translation or vector similarity?

Round-trip translation and vector similarity, for example unsupervised approaches with pretrained models like LASER, are good as a baseline or sanity check.

But they suffer from critical drawbacks. They’re easily fooled by translation that is in the wrong language. They penalize source-side noise as much as target-side noise. They may penalize a synonym, but not penalize overly literal translations, or changes in numbers, named entities, formality, gender, negation or offensiveness - real errors. Even Zipporah is only as good as the data it is trained on, which is often the same as the data to be filtered.
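
For reference, a vector-similarity baseline can be as small as the sketch below. It assumes sentence embeddings have already been computed by some multilingual encoder (LASER is one option) and are passed in as plain lists of floats. The function names and the `min_similarity` default are illustrative assumptions, not recommended values.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def keep_by_similarity(
    source_vec: Sequence[float],
    target_vec: Sequence[float],
    min_similarity: float = 0.8,  # assumed cutoff, tune on held-out data
) -> bool:
    """Keep a pair only if its embeddings are close enough."""
    return cosine_similarity(source_vec, target_vec) >= min_similarity
```

The score is symmetric in its two arguments, which is one way to see why this kind of baseline cannot tell source-side noise from target-side noise.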


Production-strength filtering with a few clicks

ModelFront is a production-strength translation risk prediction system, built to be accessible and useful to more players. You can select the language pair, upload a file, click Start and go have a drink.

The ModelFront console for filtering and evaluation connects to the scalable translation risk prediction API we’ve built for production scenarios. It handles all the error types across 100+ languages, as well as security, tags and encodings. And of course it works for very large datasets, like ParaCrawl, including support for files on cloud storage and an API.

Drop or fix?

By default, filtering is for dropping translations that are too risky. At scale, it may be the only option.
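
Once each pair has a risk score, dropping is a one-line filter. The sketch below assumes scores are floats between 0 and 1 where higher means riskier, and that they arrive as `(source, target, risk)` tuples. That layout is our assumption for illustration, not a description of any particular file format or API response.

```python
from typing import Iterable, List, Tuple

ScoredPair = Tuple[str, str, float]  # (source, target, risk), layout assumed


def drop_risky(pairs: Iterable[ScoredPair], max_risk: float) -> List[Tuple[str, str]]:
    """Keep only the pairs whose risk is at or below max_risk."""
    return [(source, target) for source, target, risk in pairs if risk <= max_risk]
```

For example, `drop_risky(scored, max_risk=0.5)` keeps every pair scored 0.5 or lower and discards the rest.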

But filtering is also a chance to identify systemic issues - for example bad sentence segmentation or bad encoding - and decide if it’s worth fixing the most common of them, either manually or automatically.

For example, many parallel corpora are inconsistent - they contain conflicting translations of the same word or phrase.
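
One simple way to surface such inconsistencies is to group the corpus by source text and list the sources that map to more than one target. This is a minimal sketch under our own assumptions: it matches sources only after trimming whitespace and lowercasing, so it will miss near-duplicates and cannot tell a genuine conflict from a legitimate context-dependent variant.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def conflicting_translations(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each normalized source to its targets, keeping only sources with 2+ distinct targets."""
    targets_by_source: Dict[str, Set[str]] = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source.strip().lower()].add(target.strip())
    return {
        source: targets
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }
```

Sorting the result by how often each source occurs gives a short list of candidates to review and fix rather than drop.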

What’s the best threshold?

If you drop too much training data, your translation quality will suffer.

The best threshold depends on how much data you have, how noisy it is, how diverse the risky translations are, the language pair, the translation system being trained and whether it’s being trained from scratch or just customized.
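
Before committing to a threshold, it helps to see how much data each candidate would leave. The sketch below assumes risk scores are floats where higher means riskier. The function name is hypothetical, and the sketch only measures retained volume, not the resulting translation quality, which still has to be checked by training and evaluating.

```python
from typing import Dict, Sequence


def kept_fraction(risks: Sequence[float], thresholds: Sequence[float]) -> Dict[float, float]:
    """Map each candidate threshold to the share of pairs it would keep."""
    total = len(risks)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(r <= t for r in risks) / total for t in thresholds}
```

A threshold that keeps only a small share of a small custom dataset is a warning sign that fixing systemic issues may pay off more than dropping.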


You can create a free account and start an evaluation on console.modelfront.com in a few clicks. ModelFront provides filtering to machine translation and machine learning researchers, engineers and specialists at language service providers, enterprises, labs and startups.

Machine scale, human quality.

If you want guidance or require customization or a volume discount, you can talk to us. We continually improve accuracy, features and language coverage, and welcome feedback and feature requests.