Machine readable text

How to format text for machine translation

It’s no secret that machine translation quality is affected by how easy or difficult the text is. But what’s easy for humans is not necessarily easy for machines.

We humans are good at recognizing what is a tag, a style, a constant, a quote or a sentence boundary. We infer that formatting from context. That’s because we first evolved to understand spoken language, which has no formatting anyway.

While machine translation researchers are working on this perhaps impossible problem, there’s a practical fix: format the text well to make it easier for machines.


Tags

Most machine translation APIs are designed to handle HTML and XML as well as text.

However, we notice that heavy tagging degrades translation quality on the major machine translation APIs.

We also see a lot of content with other non-standard placeholder schemes, like {1} and %S, that are too often mangled or translated as if they were ordinary words.
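
One way to catch mangled placeholders is to compare the placeholders in the source with those in the translation. The following is a minimal sketch, not a definitive implementation: the pattern and the function names are assumptions made for illustration, and the pattern only covers a few common placeholder styles such as {1}, {name}, %s, %S and %1$s.

```python
import re
from collections import Counter

# Assumed pattern: brace placeholders like {1} or {name},
# and printf-style placeholders like %s, %S, %d or %1$s.
PLACEHOLDER = re.compile(r"\{\w+\}|%(?:\d+\$)?[sSd]")


def find_placeholders(text):
    """Count each placeholder that appears in the text."""
    return Counter(PLACEHOLDER.findall(text))


def placeholders_preserved(source, translation):
    """Return True if the translation has exactly the same placeholders as the source."""
    return find_placeholders(source) == find_placeholders(translation)


if __name__ == "__main__":
    source = "Hello {1}, you have %S messages"
    print(placeholders_preserved(source, "Hallo {1}, Sie haben %S Nachrichten"))
    print(placeholders_preserved(source, "Hallo {1}, Sie haben % S Nachrichten"))
```

A check like this does not prevent the problem, but it flags segments where a placeholder was dropped, altered or duplicated so they can be reviewed.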

Styles

Styles, like title case or all caps, are often hardcoded in the text itself.

Click On The Logout Button  
CONTACT US FOR MORE INFORMATION  

It’s better to use truecasing and set the style outside the text:

<span style="text-transform: capitalize;">Click on the logout button</span>  
<span style="text-transform: uppercase;">Contact us for more information</span>  

With truecasing, letters that are uppercase regardless of style, such as those in proper nouns and acronyms, should stay uppercase.
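
As a rough illustration of truecasing, the sketch below lowercases a string with hardcoded title case or all caps, capitalizes the first word, and keeps words from a caller-supplied list in their given casing. The function name and the keep list are hypothetical. Real truecasing usually relies on a dictionary or a trained model, so treat this only as a naive heuristic for short, single-sentence strings.

```python
def truecase(text, keep=()):
    """Lowercase hardcoded title case or all caps, keeping listed words as given."""
    # Map lowercase forms to the casing that should be preserved, e.g. "nasa" -> "NASA".
    keep_map = {word.lower(): word for word in keep}
    result = []
    for index, word in enumerate(text.split()):
        lower = word.lower()
        if lower in keep_map:
            result.append(keep_map[lower])
        elif index == 0:
            # Assumes the string is a single sentence that starts with its first word.
            result.append(lower.capitalize())
        else:
            result.append(lower)
    return " ".join(result)


if __name__ == "__main__":
    print(truecase("Click On The Logout Button"))
    print(truecase("CONTACT NASA FOR MORE INFORMATION", keep=["NASA"]))
```

Note the limitations: words with attached punctuation will not match the keep list, and runs of whitespace are collapsed into single spaces.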

There are similar considerations for other types of styles, like taṭwīl or kashida in Arabic and Perso-Arabic script.

Constants

Constants like URLs or codes are common, especially in commerce and technical content.

He’ll arrive on flight NOT819 from LAX.
We noticed suspicious activity on the domain rome.hotels.

The HTML5 standard includes translate="no", which is supported by major machine translation APIs:

He’ll arrive on flight <span translate="no">NOT819</span> from <span translate="no">LAX</span>.
We noticed suspicious activity on the domain <span translate="no">rome.hotels</span>.
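
If the constants are known in advance, for example from a product catalogue or a glossary, the markup can be added automatically. The sketch below assumes a hypothetical helper, mark_constants, that takes the text and a list of known constants. It escapes the remaining text for HTML, so it is only suitable for plain-text input that is then sent to the API as HTML.

```python
import re
from html import escape


def mark_constants(text, constants):
    """Wrap each known constant in <span translate="no">, escaping the rest for HTML."""
    if not constants:
        return escape(text, quote=False)
    # Try longer constants first so that overlapping entries prefer the longest match.
    ordered = sorted(constants, key=len, reverse=True)
    # The lookarounds avoid matching inside a longer word, e.g. LAX inside RELAX.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(c) for c in ordered) + r")(?!\w)"
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()], quote=False))
        parts.append(
            '<span translate="no">' + escape(match.group(), quote=False) + "</span>"
        )
        last = match.end()
    parts.append(escape(text[last:], quote=False))
    return "".join(parts)


if __name__ == "__main__":
    print(mark_constants("He'll arrive on flight NOT819 from LAX.", ["NOT819", "LAX"]))
```

Whether a given provider honours translate="no" in every case is worth testing with your own content before relying on it.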

Control elements

Machines have a very hard time parsing and translating names or quotes that seem like they are part of the sentence.

Select Add Entry  
Watch When in Rome.

It’s better to mark these somehow:

Select <uicontrol>Add Entry</uicontrol>
Watch <i>When in Rome</i>

A more platform-independent option is to use traditional orthographic hints:

Select "Add Entry"
Watch "When in Rome"

Segmentation

Ideally, input is segmented into sentences. But that’s tricky when there is non-prose content like titles, menus and lists.

Make sure that sentences are not cut in half, which often happens when there are abbreviations that use the period (full-stop) character . or constructions that include : or -. Many segmentation tools work better for clean English content than for user-generated content or text that is not in English or not in the Latin alphabet.

It’s also not good to send very long source sentences or multiple sentences as a single segment. Split too little, and the segments may be translated poorly. Split too much, and context is lost.

Download The Secret Trio feat.
Ara Dinkjian (oud)!
Colour:
Peach

Shoot for balance:

Download The Secret Trio feat. Ara Dinkjian (oud)!
Colour: Peach
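
A segment that was cut in half can often be repaired before translation. The following sketch rejoins a segment with the next one when it ends in a colon or in a known abbreviation. The abbreviation list and the function names are assumptions for illustration. A production setup would more likely configure the segmenter itself, and the list would need to be adapted per language.

```python
# Hypothetical, incomplete list of abbreviations that end in a period.
ABBREVIATIONS = {"feat.", "e.g.", "i.e.", "vs.", "dr.", "mr.", "mrs."}


def is_incomplete(segment):
    """Guess whether a segment was cut off before the sentence ended."""
    last_word = segment.split()[-1].lower()
    return segment.endswith(":") or last_word in ABBREVIATIONS


def merge_broken_segments(segments):
    """Join each incomplete segment with the segment that follows it."""
    merged = []
    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue  # Skip empty lines.
        if merged and is_incomplete(merged[-1]):
            merged[-1] = merged[-1] + " " + segment
        else:
            merged.append(segment)
    return merged


if __name__ == "__main__":
    lines = [
        "Download The Secret Trio feat.",
        "Ara Dinkjian (oud)!",
        "Colour:",
        "Peach",
    ]
    for segment in merge_broken_segments(lines):
        print(segment)
```

The heuristic can merge too much, for example when a heading that ends in a colon is followed by an unrelated paragraph, so the output is worth spot-checking.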

The best implementation of these recommendations depends on your content type, process, tools and machine translation provider.

Enterprises saw the value of clear, simple input text and started using "controlled natural language" half a century ago. They’re also able to control internal and external guidelines and processes.

If you’re a language service provider, it’s harder. External clients send content and you need to work with it however it is formatted. But it can make sense to educate your clients, or just fix the original text once before translating it into many languages.

It’s always good to understand exactly how your tools handle tags, styles, constants, control elements and segmentation.