Training

Reference documentation for training custom models

Custom risk prediction model training is not yet a self-serve feature accessible to all users via the console and API. The ModelFront team works with clients to train and evaluate custom models, and deploys them so that they're accessible via the console and API.

As part of the guided customization process, the ModelFront team makes a best effort to detect formats and to convert and filter client-provided datasets into a shape usable for training and evaluating custom translation risk prediction.

We've put together a few guidelines on exporting translation data for customization.


Dataset types

Translation risk prediction can be trained with monolingual source or target data, parallel data, post-edited data and labeled data.

Post-edited data and labeled data are ideal and the most standard for risk prediction, for example for hybrid translation.

Labeled data could come from an offline human evaluation, but it could also be the byproduct of a production process like final review.

Contents

Translations from significantly different quality tiers, processes or content types should be separated. They can either be delivered in separate files, or the quality tier or content type can be indicated in an additional field.

For post-editing data, it's important to separate translation memory matches from machine translations.

Translations do not need to be sorted or deduplicated. It's fine to leave segments in chronological order, and to indicate the date in an additional field.
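As a sketch of what additional fields can look like, here is one way to write segments with a content type and a date to a TSV file. The column layout and the names source, target, content_type and date are hypothetical - agree on the actual schema with the ModelFront team.

```python
# Sketch: writing segments with extra metadata fields to a TSV file.
# The columns (source, target, content_type, date) are hypothetical.
rows = [
    ("Hello, world!", "Hallo, Welt!", "ui", "2023-01-15"),
    ("Click Save to continue.", "Klicken Sie auf Speichern.", "help", "2023-02-03"),
]

with open("segments.tsv", "w", encoding="utf-8") as f:
    for source, target, content_type, date in rows:
        f.write("\t".join((source, target, content_type, date)) + "\n")
```

Segments from different quality tiers or content types can then be kept in one file, since the extra field makes them separable later.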

Quality

As a rule, the accuracy of translation risk prediction reflects the scale and quality of the training data. Real datasets of significant scale always have quality issues - noise, inconsistencies and human errors.

The standard customization process includes basic data filtering. Clients can notify ModelFront if a certain subset of the dataset - for example the post-edits before a certain date, or segments containing a certain phrase - is less trusted.

In severe cases, the ModelFront team may recommend an additional data filtering or correction process.

Preprocessing

Training data should generally be preprocessed the same way it will be preprocessed when calling machine translation and risk prediction in production. It's usually better to avoid preprocessing or normalization entirely, since it can remove important signals.

Clients do not need to unescape HTML/XML or remove HTML/XML tags. By default, the ModelFront training pipeline and API unescape HTML/XML and remove tags. This is analogous to the XML/HTML input mode of the major machine translation APIs. If an HTML/XML tag is meant to be interpreted literally, then it should be HTML/XML escaped. Alternatively, clients can request that escaped XML/HTML and XML/HTML tags be interpreted literally by default.
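As a minimal illustration of the escaping convention - using Python's standard html module, not the ModelFront pipeline itself - a tag meant to be read literally can be escaped before upload, and the escaping is fully recoverable:

```python
import html

# A tag meant to be interpreted literally should be HTML-escaped first:
literal = html.escape("Use the <b> tag for bold text.")
# The < and > become &lt; and &gt;, so the segment no longer
# looks like markup to a tag-removal step.

# Unescaping recovers the original text exactly:
assert html.unescape(literal) == "Use the <b> tag for bold text."
```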

Arbitrary template schemes or placeholders like {} or {1} are supported in custom translation risk prediction and can be included in datasets.

Leading, trailing and medial whitespace is considered in translation risk prediction and can be included in datasets.

If you want to use the HTML5 translate attribute - translate="no" - to mark words or phrases that should not be translated, please talk to the ModelFront team.

Formats

There is no standard format for importing and exporting post-edited data and labeled data. The most common formats are TMX, XLIFF, XLS/XLSX, JSON, CSV and TSV. In research, it's also common to use multiple aligned files, typically plaintext files, and this approach can also be used with formats like TMX.

With all formats, it's important to escape delimiters and control characters in a standard way that's fully recoverable. With aligned files, it's critical that the number and order of the segments correspond exactly.

For large datasets - beyond 10M segments per file - we recommend truly line-oriented formats like TSV. They're more scalable, compact and human-readable. They're preferred by the research and engineering communities, and have built-in support in Unix-like operating systems such as Linux and macOS.

The most common TSV standard, now known as Linear TSV, was established by Postgres and MySQL and has good library support. This is also the TSV format supported for upload and download in the ModelFront console. Single-column TSV can be used instead of plaintext if some segments contain newline characters.
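A minimal sketch of the Linear TSV escaping convention - the backslash escapes inherited from Postgres and MySQL - assuming the four standard escaped characters; check against your actual tooling before relying on it:

```python
import re

# Linear TSV escapes the characters that would break a line-oriented
# format: backslash, tab, newline and carriage return.
_ESCAPE = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPE = {"\\\\": "\\", "\\t": "\t", "\\n": "\n", "\\r": "\r"}

def escape_field(text: str) -> str:
    return re.sub(r"[\\\t\n\r]", lambda m: _ESCAPE[m.group()], text)

def unescape_field(text: str) -> str:
    return re.sub(r"\\[\\tnr]", lambda m: _UNESCAPE[m.group()], text)

# Round trip: a segment containing a backslash, a tab and a newline
segment = "path\\to\tfile\nline two"
escaped = escape_field(segment)
assert "\t" not in escaped and "\n" not in escaped
assert unescape_field(escaped) == segment
```

Because every field stays on one line after escaping, standard line-oriented Unix tools like wc, head, sort and cut work on the file directly.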

Note: The TSV format supported by Google AutoML and Amazon Active Custom Translation is in fact CSV with tab delimiters. The CSV standard has many constraints. For example, a double quote (") must be escaped by doubling it (""), certain characters are not permitted in certain positions at all, and header lines may be required.
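As an illustration of the difference, Python's standard csv module with a tab delimiter produces exactly this CSV-style quoting, which is not what a Linear TSV reader expects:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
# A field containing a double quote gets wrapped in quotes, and the
# embedded quote is doubled - CSV rules, not Linear TSV rules.
writer.writerow(['He said "hello"', 'Er sagte "Hallo"'])
print(buf.getvalue())
# Prints: "He said ""hello"""<TAB>"Er sagte ""Hallo"""
```

A Linear TSV reader would pass those surrounding and doubled quotes through as literal characters, so the two formats are not interchangeable even though both use tab delimiters.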