Confidence scoring, quality estimation and risk prediction

What is the difference between a quality estimate, a confidence score and a risk prediction?

Microsoft, Unbabel and Memsource use the research-standard term quality estimation, while Smartling calls its feature a quality confidence score. There are also research papers that talk about confidence estimation or error prediction, and tools that use a fuzzy match percentage. Parallel data filtering approaches like Zipporah or LASER use a quality score or a similarity score.

ModelFront uses risk prediction.

They're overlapping concepts and often used interchangeably - the distinctions are as much about tradition, convention and use case as about inputs and outputs. They are all basically a score from 0.0 to 1.0 or 0% to 100%, at sequence-level granularity or finer. Unlike BLEU, HTER, METEOR or WER, they do not require a golden human reference translation.
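
To make that shared shape concrete, here is a minimal sketch of the two kinds of interfaces; the function names are purely illustrative:

```python
# Reference-based metrics like BLEU or HTER need a golden human translation.
def reference_based_metric(reference: str, translation: str) -> float:
    ...

# Confidence scores, quality estimates and risk predictions need only the
# source and the translation, and return a score between 0.0 and 1.0 for
# each sequence (or a finer-grained unit).
def reference_free_score(source: str, translation: str) -> float:
    ...
```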

We're interested in language, so we know the nuances in naming are important.

Confidence scoring

A machine translation confidence score typically refers to a machine translation system's own bet about the quality of its own output for a given input sequence. A higher score correlates with higher quality.

It is typically based on internal variables of the translation system - a so-called glassbox approach. So it can't be used to compare systems or to assess human translation or translation memory matches.
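
For illustration, one common glassbox signal is the length-normalized log-probability the decoder assigned to its own output, exponentiated back into a 0.0 to 1.0 score. This is a minimal sketch under that assumption, not any particular engine's method:

```python
import math

def confidence_from_logprobs(token_logprobs):
    """Turn the decoder's per-token log-probabilities for its own output
    into a 0.0 to 1.0 confidence score (a common glassbox signal)."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Log-probabilities a hypothetical decoder assigned to each output token.
print(confidence_from_logprobs([-0.10, -0.30, -0.05]))  # ~0.86
```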

Quality estimation

A machine translation quality estimate is based on a sequence pair - the source text and the translation text. Like a confidence score, a higher score correlates with higher quality.

It implies a pure supervised blackbox approach, where the system learns from labelled data at training time, but knows nothing about how the translation was produced at run time. It also implies scoring of machine translation only.
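
A minimal sketch of that setup, with toy hand-crafted features and toy labels standing in for the multilingual encoders and human-labelled datasets that real quality estimation systems are trained on:

```python
from sklearn.linear_model import LinearRegression

def blackbox_features(source: str, translation: str) -> list:
    """Features computed only from the source/translation pair, with no access
    to the system that produced the translation. Toy features for illustration;
    real systems use multilingual encoder representations."""
    src_tokens, tgt_tokens = source.split(), translation.split()
    return [
        len(src_tokens),
        len(tgt_tokens),
        len(tgt_tokens) / max(len(src_tokens), 1),  # length ratio
    ]

# Labelled training data: (source, translation, human quality label in [0, 1]).
labelled = [
    ("the black cat", "le chat noir", 1.0),
    ("the black cat", "le chat", 0.6),
    ("the black cat", "une voiture rouge et rapide", 0.1),
    ("good morning", "bonjour", 0.9),
]

model = LinearRegression().fit(
    [blackbox_features(src, tgt) for src, tgt, _ in labelled],
    [label for _, _, label in labelled],
)

def estimate_quality(source: str, translation: str) -> float:
    """Score an unseen pair at run time, knowing nothing about how it was produced."""
    return float(model.predict([blackbox_features(source, translation)])[0])
```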

This term is used in the research literature and at conferences, like the WMT shared task, and it is also the most common term for the approach pioneered at Unbabel and Microsoft - safely auto-approving raw machine translation for as many segments as possible.
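
Once you have such a score, the auto-approval workflow itself is just a threshold; a sketch, where estimate_quality is any scoring function like the one above and the threshold is an arbitrary placeholder:

```python
APPROVAL_THRESHOLD = 0.9  # placeholder; in practice tuned to an acceptable error rate

def route(segments, estimate_quality, threshold=APPROVAL_THRESHOLD):
    """Split (source, machine_translation) pairs into segments whose raw MT is
    auto-approved and segments sent on to human translators or post-editors."""
    approved, needs_review = [], []
    for source, translation in segments:
        if estimate_quality(source, translation) >= threshold:
            approved.append((source, translation))
        else:
            needs_review.append((source, translation))
    return approved, needs_review
```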

It's often contrasted with quality evaluation - a corpus-level score.

In practice, usage varies - researchers do talk about unsupervised and glassbox approaches to quality estimation, and about word-level quality estimation. And there's no reason quality estimation could not be used for more tasks, like quality evaluation or parallel data filtering.

Risk prediction

A translation risk prediction is also based on a sequence pair - the source text and the translation text. A higher score correlates with higher risk.

Like quality estimation, it implies a pure blackbox approach. Unlike quality estimation, it can also be used for everything from parallel data filtering to quality assurance of human translation to corpus- or system-level quality evaluation.
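
Since a higher score means higher risk, the same predictor can drive each of those tasks by thresholding or aggregating it; a sketch, where predict_risk is a hypothetical function that takes a source and a translation and returns a 0.0 to 1.0 risk score:

```python
def filter_parallel_data(pairs, predict_risk, max_risk=0.5):
    """Parallel data filtering: keep only candidate sentence pairs that look low-risk."""
    return [(src, tgt) for src, tgt in pairs if predict_risk(src, tgt) <= max_risk]

def flag_for_review(pairs, predict_risk, min_risk=0.7):
    """Quality assurance of human translation or TM matches: surface the riskiest segments first."""
    risky = [(predict_risk(src, tgt), src, tgt) for src, tgt in pairs]
    return sorted((r for r in risky if r[0] >= min_risk), reverse=True)

def evaluate_corpus(pairs, predict_risk):
    """Corpus- or system-level quality evaluation: average risk, lower is better."""
    risks = [predict_risk(src, tgt) for src, tgt in pairs]
    return sum(risks) / len(risks)
```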

Why did we introduce yet another name? Risk prediction is the term used at ModelFront because it's the most correct and it's what clients actually want, across all use cases.

Often it's impossible to say whether a translation is high quality or low quality, because the input sequence is ambiguous or noisy. When the English Apple is translated into Spanish as Manzana or kept as Apple, it makes no sense to say that both are low quality or medium quality - one of them is probably perfect. But it does make sense to say that, without more context, both are risky.

We also wanted our approach to explicitly break away from quality estimation's focus on post-editing distance or effort and from CAT tools' focus on rule-based translation memory matching, and to be future-proof as use cases and technology evolve.

ModelFront's risk prediction system will grow to include risk types and rich phrase- and word-level information.