One of the advantages of being an IT professional as well as a translator is that, as part of my IT duties, I try to stay ahead of the wave and therefore keep an eye on state-of-the-art research. This obviously means holding memberships in professional societies such as the Association for Computing Machinery (ACM) and the IEEE Computer Society, and reading their publications and the contents of their digital libraries. I also keep an eye on the most relevant SEO blogs and forums and, of course, on the Google webmaster blog.
So I stumbled upon several complaints from users that search engines, and Google in particular, were identifying their non-English (and sometimes even their English) pages as being in the wrong language. This made me wonder how search engines identify the language of a specific website, a question of great importance for SEO translation, and even for multilingual SEO once the translation has been performed.
I started my investigation in the academic community, as I have evidence that search engines (and Google in particular) pay attention to research in this field. A search of the digital libraries of the ACM and the Computer Society was, however, quite disappointing. I found quite a few theoretical papers on language recognition, but none specific to the Web. There were also a number of interesting academic papers on language recognition in search queries, and a quite interesting paper on how culture affects web design (which I will address in a future post), but only six papers and one book shed some light on this particular issue:
- Indexing the Indonesian Web: Language Identification and Miscellaneous Issues, by Vinsensius Berlian Vega and Stéphane Bressan. Though very brief, it provides some interesting insights into the difficulties of language recognition.
- Web Page Language Identification Based on URLs, by Eda Baykan, Monica Henzinger and Ingmar Weber. This paper discusses several machine learning algorithms for language identification, namely Naïve Bayes, Relative Entropy, Maximum Entropy and Decision Tree. Curiously, the study attempts to identify the language of a web page using only its URL.
- Language identification on the World Wide Web, by Katia Hayati, written as part of her master’s degree in Computer Science. The basic approach is the classic n-gram algorithm, supplemented by the Fisher discriminant function. The math and algorithms are relatively simple, but the text contains some hidden pearls that I will point out later on, and which are consistent with what Google is saying.
- Language identification of on-line documents using word shapes, by N. Nobile, S. Bergler, C.Y. Suen and S. Khoury. The authors group characters into classes based on their visual characteristics. Contrary to the previous paper, they do not use generic n-grams, but only bigrams and trigrams, combining their scores with an expert system.
- Language Identification in Web Pages, by Bruno Martins and Mário J. Silva. They also build their study on the well-known n-gram algorithm originally proposed by Cavnar and Trenkle (see the next reference).
- N-gram-based text categorization, by W. B. Cavnar and J. M. Trenkle. The foundational n-gram paper that everybody else references.
- Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti. Strictly speaking it does not explain how search engines detect your language, but it provides profound insight into knowledge discovery, and ultimately identifying a language is exactly that.
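Since several of these papers build on the Cavnar and Trenkle n-gram approach, here is a minimal sketch of how it works, for the curious. This is my own illustrative Python, not code from any of the papers, and it uses toy training sentences where a real system would use large corpora: each language gets a ranked profile of its most frequent character n-grams, and a document is assigned to the language whose profile it matches most closely under the "out-of-place" rank distance.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Build a ranked character n-gram profile (Cavnar-Trenkle style)."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Keep only the most frequent n-grams, ordered by frequency rank
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(lang_profile, doc_profile):
    """Sum of rank differences; n-grams missing from the language
    profile receive a maximum penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(ranks.get(gram, penalty) - rank)
               for rank, gram in enumerate(doc_profile))

def identify(text, training):
    """Pick the training language whose profile is closest to the text."""
    doc_profile = ngram_profile(text)
    return min(training, key=lambda lang: out_of_place(training[lang], doc_profile))

# Toy profiles from single sentences (real systems train on whole corpora)
training = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
print(identify("the dog and the fox", training))  # -> en
```

The original paper goes up to 5-grams and keeps the top 300 of each profile; the sketch above trims this to trigrams for brevity, but the "out-of-place" ranking idea is the same one the papers in the list build upon.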
Now, is this purely academic stuff useful? Actually it is, if you want to understand how search engines detect languages. If you have a look at the Google Research Blog, you will find that they DO look at academic papers for their own research. One particular article that caught my eye was the one titled All Our N-gram are Belong To You. It states: “Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction and others”. This is a strong hint that Google uses n-grams to recognize languages, since it has repeatedly stated that it does not use the “language attribute”. But we’ll leave that for the next part of this post…