Search is a key feature of many platforms - search engines, social networks, messaging apps, marketplaces and services for banks, law firms, insurers, bureaucrats or spies.
Users need search results to be multilingual. Important search results can be in any language or alphabet. There are multiple possible translations and transliterations of many names and terms.
To provide your users robust multilingual search, you can automatically translate and transliterate content as you index it.
Multilingual search doesn’t just work out of the box. A naïve exact match is broken by endless edge cases. In the ’90s and ’00s, when English and a few other languages dominated the online world, hacks like normalization, hardcoded lists of exceptions and synonyms, basic stemming, morphology and lemmatization were the standard.
“eBay may be a shark in the ocean, but I am a crocodile in the Yangtze River.”
Jack Ma | 马云 | Mǎ Yún, co-founder of Alibaba
Hardcoding rules and building one-off systems doesn’t scale. The US tech giants like Google added transliteration for the top non-Latin alphabet languages, but too late - they’ve never regained their early losses in Russia and China. Search is still broken for most languages, even on the top search platforms.
A bad set of search results for a simple query: bianki hecaniv (Bianchi bicycle). Could the user find the site or the shop? When the same query is manually corrected and expanded to bianchi OR բիանկի հեծանիվ, the results improve drastically.
Named entities are an especially interesting and difficult type of natural language. We use names to reference and look up any type of entity - a person, organization, place, brand or product. Names can be transliterated or translated literally or creatively. Even English isn’t easy - many platforms fail to match Dave to David or Bank of America to BoA.
Many search systems are application-specific. They work on new, niche or proprietary data, and so they don’t benefit as much from open datasets or user click data feedback loops. They must work out of the box.
Search systems should match a lot - high recall - and then decide how to show the most relevant matches to the user. Relevance combines ranking, for example prioritizing exact matches, clustering of similar results and even concepts of diversity. There may be performance reasons to limit the matches, but with today’s compute, it’s not necessarily a factor in high-value problems.
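The split between matching broadly and then ranking can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `rank` and the three tiers (exact match, case-insensitive match, everything else) are hypothetical, and a real system would also weigh clustering and diversity.

```python
def rank(query: str, candidates: list[str]) -> list[str]:
    """Order high-recall candidates so that the closest matches come first."""

    def tier(candidate: str) -> int:
        if candidate == query:
            return 0  # exact match
        if candidate.casefold() == query.casefold():
            return 1  # same text, different case
        return 2      # matched only through some variant

    # sorted() is stable, so candidates within a tier keep their original order.
    return sorted(candidates, key=tier)


ranked = rank("Bianchi", ["bianki", "bianchi", "Bianchi"])
```

Here the matching step is free to return all three candidates, and the ranking step decides that the exact match is shown first.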
The first instinct is often to normalize both the indexed content and the query. Normalization is a lossy process. Every query has only one normalized variant. Instead of normalization, we should do expansion and even noisification.
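A small sketch of the difference, using only the Python standard library. The helper names `normalize` and `expand` are hypothetical, and the choice of variants (case folding and accent stripping) is just one simple assumption about what an expansion could include.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lossy: case-fold and strip combining marks, returning ONE form."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand(text: str) -> set[str]:
    """Lossless: keep the original and add variants alongside it."""
    return {text, text.casefold(), normalize(text)}


# After normalization, the accented and unaccented spellings are indistinguishable.
same = normalize("R\u00e9sum\u00e9") == normalize("resume")

# After expansion, the original spelling is still available for exact matching.
variants = expand("R\u00e9sum\u00e9")
```

With normalization, the information that the original was accented and capitalized is gone. With expansion, it is kept, so exact matches can still be ranked above looser ones.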
Query transformations - normalization, expansion or noisification - are inefficient. They create additional runtime dependencies, latency and uncertainty. Multiple searches mean even more latency. And it’s hard to recover from a bad transformation, like a bad spelling correction or bad translation, even if you can detect it.
It’s much better to transform the index. You can uppercase and lowercase, you can apply spelling correction or even synonymization, and you can translate and transliterate into many languages and scripts. It doesn’t need to be instant. If you detect problems - for example, a risky translation using ModelFront - you can simply drop the variant, log it or even fix it.
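As a rough sketch of index-time expansion, assuming a toy in-memory inverted index: the table `VARIANTS`, the function `generate_variants`, the risk scores and the `max_risk` threshold are all hypothetical stand-ins for a real translation or transliteration model and a real risk prediction step. They are not the API of any actual service.

```python
from collections import defaultdict

# Hypothetical model output: variants with a made-up risk score in [0, 1].
# The last variant is a letter-by-letter guess, marked risky for illustration.
VARIANTS = {
    "bianchi": [("բիանկի", 0.1), ("bianki", 0.2), ("բիանչի", 0.9)],
}


def generate_variants(term: str) -> list[tuple[str, float]]:
    """Stand-in for a model that returns multiple scored variants."""
    return VARIANTS.get(term.casefold(), [])


def build_index(docs: dict[int, str], max_risk: float = 0.5) -> dict[str, set[int]]:
    """Index each token together with its case variants and low-risk variants."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
            index[token.casefold()].add(doc_id)
            for variant, risk in generate_variants(token):
                if risk <= max_risk:  # drop risky variants at index time
                    index[variant].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Plain lookup per query token (OR semantics), no query transformation."""
    hits: set[int] = set()
    for token in query.split():
        hits |= index.get(token, set())
    return hits


docs = {1: "Bianchi road bike", 2: "Trek mountain bike"}
index = build_index(docs)
found = search(index, "bianki")
```

All the expensive and uncertain work happens in `build_index`, which can run offline. At query time, `search` is a plain lookup, so the queries bianki and բիանկի both reach the Bianchi document without any translation call.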
To repeat: we do not recommend hitting a machine translation API or translation risk prediction API on every query.
This is analogous to the best approaches for making other natural language processing systems, like classifiers or named-entity recognition, multilingual: translate the training data (and filter the translations), don’t translate at inference time.
The ideal translation or transliteration model for search index expansion should generate multiple variants. The major translation APIs cannot do that, and there are almost no transliteration APIs.
If you’re training your own custom translation or transliteration models, then you need your own training data. Training data can be actual parallel data from the web or from humans. You can and should also leverage monolingual data in the target language or languages, with back-translation and back-copying.
Language-agnostic approaches to search that solve problems for long-tail languages and scenarios also improve accuracy on the top languages and scenarios. They require more upfront effort, but once they are working, it’s easy to add new content and new languages.
ModelFront works with clients to build and improve their multilingual search in-house. We share the basics openly, and we’re happy to provide more guidance for specific cases.