Parallel data filtering

Filtering training data for better machine translation

Bad data causes bad translations. Even in the days of statistical machine translation, bad data was a top cause of errors in production systems. With neural machine translation models, that’s still true - they’re more sensitive in some ways and less interpretable and predictable, and the majority of the data today is synthetically generated. Custom models, which need to learn from relatively small datasets, can be even less predictable.

Bad data causes bad translations.

The solution is parallel data filtering.

The top machine translation research teams, like those at Google and Microsoft, have sophisticated internal systems for parallel data filtering at scale. Inside some companies, there are also projects to filter or label data manually, or to get it from the crowd. In open research, there are conference shared tasks and open-source options, like Bicleaner, Zipporah and LASER. In the translation industry, it’s known as cleaning a translation memory - sometimes by hand.

And now ModelFront provides production-strength parallel data filtering with a few clicks.


Parallel data

Traditionally there are two main sources of parallel data: human translations like translation memories, and data crawled and aligned from the web.

Monolingual data - text - is much easier to find than parallel data. Modern machine translation researchers lean heavily on synthetic data generated from text that’s already in the target language.

The main synthetic data generation technique is back-translation - machine-translating the text back into the source language or languages. A similar technique is back-copying - simply copying the target text over to the source text.

Those techniques are a bit counterintuitive, but they’re surprisingly effective for two reasons:

  1. Noise in the translation is bad, but source-side noise - noise in the original text - is not just harmless, it’s even good.

  2. The machine translation model should implicitly learn a great target-side language model, beyond what’s in the parallel data.
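As a minimal sketch of the two techniques above: each one pairs real target-language text with a synthetic source side. The names `back_translate`, `back_copy` and `toy_fr_to_en` are invented for this illustration, and the toy lookup only stands in for a real target-to-source machine translation system.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_copy(target_sentences: Iterable[str]) -> List[Pair]:
    """Back-copying: the target text is simply copied over as the source."""
    return [(t, t) for t in target_sentences]


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Pair]:
    """Back-translation: machine-translate target text into the source language."""
    return [(translate_to_source(t), t) for t in target_sentences]


def toy_fr_to_en(text: str) -> str:
    """Hypothetical stand-in for a real French-to-English MT system."""
    lookup = {"Bonjour le monde.": "Hello world."}
    # Unknown sentences pass through unchanged, which mimics a noisy source side.
    return lookup.get(text, text)


mono_fr = ["Bonjour le monde.", "Merci beaucoup."]
synthetic = back_translate(mono_fr, toy_fr_to_en) + back_copy(mono_fr)

for source, target in synthetic:
    print(f"{source!r} -> {target!r}")
```

In both cases the target side is untouched original text. Any noise lands on the source side, which is the property that makes these techniques work.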

Error types

Parallel data can have all types of errors: misaligned sentences, bad sentence segmentation, bad encodings. It could just be in the wrong language, or mixed language.

Murphy’s Law: “Anything that can go wrong will go wrong.”

Web data is especially noisy. For example, Wikipedia articles are not translations of each other, so the sentences being aligned are not necessarily alignable. Much of the web data is contaminated with new and old machine translations.
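Some of these error types can be caught with very cheap rules before any model is involved. The sketch below is only an illustration: the function name `cheap_checks`, the flag names and the length-ratio limit of 3.0 are assumptions, not recommended values, and rules like these miss most real translation errors.

```python
from typing import List


def cheap_checks(source: str, target: str, max_ratio: float = 3.0) -> List[str]:
    """Return a list of flags raised by simple rule-based checks."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return ["empty"]

    flags = []

    # A very lopsided length ratio often signals misalignment or bad segmentation.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_ratio:
        flags.append("length_ratio")

    # The Unicode replacement character and a typical mojibake sequence
    # ("é" decoded with the wrong codec) suggest a bad encoding.
    combined = src + tgt
    if "\ufffd" in combined or "Ã©" in combined:
        flags.append("encoding")

    # Identical sides may mean the text was never translated.
    if src.lower() == tgt.lower():
        flags.append("identical")

    return flags


print(cheap_checks("Hello.", "Bonjour."))
print(cheap_checks("Café", "CafÃ©"))
print(cheap_checks("Firefox", "firefox"))
```

Identical segments are not always wrong, since brand names and URLs are often left as-is. A flag is therefore a reason to inspect a segment, not necessarily to drop it.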

Is a good translation good training data?

No. In fact, the most creative translations are some of the most problematic. Human translations of books, subtitles or marketing material may be restructured or reordered to fit the flow or layout. An abbreviation could be expanded. An idiom, telephone number, language name or price could be translated to a different, locale-appropriate idiom, telephone number, language name, currency or format.

EN As President Macron said, ...
FR Comme l'a dit le président, ...

EN CA docs
FR Médecins pour carcinome

EN I’m feeling lucky.
FR J’ai de la chance.

EN Download Firefox - English
FR Télécharger Firefox - Français

EN $20M
FR 17,7 millions d’euros

Even if our ideal machine translation would convert currencies, it’s risky business - there are dozens of different dollars, like Canadian and Singaporean dollars, and exchange rates change.

The best translation for a specific project and context may not generalize well.

Is translation symmetric?

No. A valid translation in one direction can be bad training data if flipped in the other direction. For example, the original text may be noisy or even already in the target language, or use a rare, informal or gender-specific word. There can be ambiguity or specificity that cannot be captured in both languages.

Can we filter with round-trip translation or vector similarity?

Round-trip translation and vector similarity, for example unsupervised approaches with pretrained models like LASER, are good as a baseline or sanity check.

But they suffer from critical drawbacks. They’re easily fooled by a translation that is in the wrong language. They penalize source-side noise as much as target-side noise. They may penalize a synonym, but not penalize overly literal translations, or changes in numbers, named entities, formality, gender, negation or offensiveness - real errors.

Even Zipporah and Bicleaner are only as good as the dataset they are trained on, which is often either from another domain or the same as the data to be filtered.
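To make the similarity baseline and one of its blind spots concrete, here is a rough sketch. The vectors are invented toy values standing in for the output of a multilingual sentence encoder such as LASER. No real encoder is called, and `numbers_match` is a hypothetical helper, not part of any library.

```python
import math
import re
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def digits_in(text: str) -> List[str]:
    """Digit runs in the text, sorted so that reordering is not a mismatch."""
    return sorted(re.findall(r"\d+", text))


def numbers_match(source: str, target: str) -> bool:
    """True if both sides contain the same digit runs."""
    return digits_in(source) == digits_in(target)


# Invented toy embeddings: a changed number barely moves a sentence vector.
source_vec = [0.90, 0.10, 0.40]  # stands in for "Pay 20 dollars"
target_vec = [0.88, 0.12, 0.41]  # stands in for "Payez 30 dollars"

similarity = cosine_similarity(source_vec, target_vec)
print(f"similarity = {similarity:.4f}")
print(numbers_match("Pay 20 dollars", "Payez 30 dollars"))
```

A pair like this would pass a similarity threshold even though the number changed, while the explicit check catches it. The check has its own gaps, though: it would wrongly flag a legitimate currency conversion or a number that is spelled out on one side.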


Production-strength filtering with a few clicks

ModelFront is a production-strength translation risk prediction system, built to be accessible and useful to more players. You can select the language pair, upload a file, click Start and go have a drink.

The ModelFront console for filtering and evaluation connects to the scalable translation risk prediction API we’ve built for production scenarios. It handles all the major error types - as well as security, tags and encodings - for translations between 100+ languages. And of course it works for very large datasets, like ParaCrawl, including support for files on cloud storage and an API.

What’s the best threshold?

If you drop too much training data, your translation quality will suffer.

The best threshold depends on how much data you have, how noisy it is, how diverse the risky translations are, the language pair, the translation system being trained and whether it’s being trained from scratch or just customized.
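One way to get a feel for the trade-off is to check how much data survives at a few candidate thresholds before committing to one. The sketch below assumes each segment already has a risk score between 0 and 1, where higher means riskier. The scores and the function name `retention_by_threshold` are made up for illustration.

```python
from typing import Dict, Sequence


def retention_by_threshold(
    risks: Sequence[float],
    thresholds: Sequence[float],
) -> Dict[float, float]:
    """Fraction of segments kept if everything above each threshold is dropped."""
    total = len(risks)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(r <= t for r in risks) / total for t in thresholds}


# Hypothetical risk scores for six segments.
risks = [0.02, 0.05, 0.10, 0.35, 0.60, 0.95]
kept = retention_by_threshold(risks, [0.1, 0.5, 0.9])

for threshold, fraction in kept.items():
    print(f"threshold {threshold:.1f}: keep {fraction:.0%} of segments")
```

A sweep like this only shows how much data is left, not how good the resulting model is. The final choice still has to be validated by training and evaluating.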

Drop or fix?

Usually we think of filtering as a way of automatically dropping translations that are too risky. Researchers often just pick a threshold and apply it automatically. At scale, it may be the only option.

But filtering is also a chance to identify systemic issues - for example, bad alignments, bad sentence segmentation, bad encodings or conflicting translations of the same word or phrase - and decide on the best course of action. It can be worth fixing segments, either manually or automatically, or finding or generating other datasets to blend in.

We recommend slicing out different types of segments - for example, segments that are URLs, segments that were an approved TM fuzzy match, segments where the original and translation are the same or segments that are especially trusted - and inspecting and applying different thresholds to shape the final dataset for training custom machine translation.
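A rough sketch of that slicing idea follows. The slice names, the URL pattern, the `trusted` flag and the per-slice thresholds are all assumptions chosen for the example, and the risk scores are invented. In practice each slice should be inspected before its threshold is chosen.

```python
import re
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    source: str
    target: str
    risk: float            # assumed risk score in [0, 1], higher is riskier
    trusted: bool = False  # e.g. from an especially trusted source


URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def slice_of(seg: Segment) -> str:
    """Assign a segment to exactly one slice, checked in priority order."""
    if seg.trusted:
        return "trusted"
    if URL_RE.match(seg.source.strip()):
        return "url"
    if seg.source.strip() == seg.target.strip():
        return "identical"
    return "default"


# Hypothetical per-slice thresholds: keep segments with risk at or below these.
THRESHOLDS: Dict[str, float] = {
    "trusted": 1.0,
    "url": 1.0,
    "identical": 0.2,
    "default": 0.5,
}


def shape(segments: List[Segment]) -> List[Segment]:
    """Keep each segment only if it passes the threshold of its own slice."""
    return [s for s in segments if s.risk <= THRESHOLDS[slice_of(s)]]


segments = [
    Segment("https://example.com", "https://example.com", 0.9),
    Segment("OK", "OK", 0.4),
    Segment("Hello.", "Bonjour.", 0.3),
    Segment("Hello.", "Au revoir.", 0.8),
    Segment("Brand X", "Brand X", 0.6, trusted=True),
]
kept_segments = shape(segments)
print(len(kept_segments), "of", len(segments), "segments kept")
```

Keeping the thresholds in one table makes it easy to adjust a single slice and rerun, which suits the iterative process this kind of shaping needs.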

The freedom to iterate

The task is not just filtering data - it’s shaping your data for training machine translation. That can’t be totally outsourced or isolated from the training process. The ModelFront console is a tool to make that iterative process radically more efficient.


You can create a free account and start an evaluation on console.modelfront.com for a TMX or other common parallel data file format - in just a few clicks.

ModelFront provides filtering to machine translation and machine learning researchers, engineers and specialists at language service providers, enterprises, labs and startups.

Machine scale, human quality.

If you want guidance or require customization or a volume discount, you can talk to us. We continually improve accuracy, features and language coverage, and welcome feedback and feature requests.