In the previous post (Language Identification is difficult) I highlighted how difficult it can be to identify a language. Yet Google manages it somehow: it has over 160 domains and lets users restrict results to pages in 117 languages, so it must be able to work out which language each page is written in. Based on the literature and the hints found in places such as Google’s Webmaster Central, search engines seem to use a variety of methods to detect the language of your web pages.
Google ignores code-level language information, but other search engines (such as Yahoo!) recognize the <meta http-equiv="Content-Language" content="en" /> meta tag as an important indication of a page’s language.
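To make the code-level signal concrete, here is a minimal sketch of how a crawler might read that meta tag, using only Python’s standard library. The class and function names are my own, chosen for illustration; nothing here reflects how any real search engine implements it.

```python
from html.parser import HTMLParser


class ContentLanguageParser(HTMLParser):
    """Remembers the first <meta http-equiv="Content-Language"> value seen."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta" or self.language is not None:
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() == "content-language":
            self.language = attr_map.get("content", "").strip().lower() or None


def declared_language(html):
    """Return the declared language code, or None if the page declares none."""
    parser = ContentLanguageParser()
    parser.feed(html)
    return parser.language


page = '<html><head><meta http-equiv="Content-Language" content="en" /></head></html>'
print(declared_language(page))
```

A declared language is only a claim made by the page author, which may be why an engine would treat it as one input among several rather than as proof.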
The dictionary method consists of building a list of commonly found words (usually stop words such as “the”) for each language, scanning an unknown text for those words, and assigning a language based on the matches. If a simple algorithm finds that a high percentage of the words in a text appear in the dictionary for a specific language, that is a strong argument for assigning that language to the page.
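The idea can be sketched in a few lines of Python. The word lists below are tiny samples I picked for illustration, and the `min_share` threshold is an arbitrary assumption; a real dictionary would be far larger and the threshold would need tuning.

```python
import re

# Illustrative stop-word samples only; real lists would be much longer.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une"},
    "es": {"el", "la", "los", "y", "de", "que", "es", "en", "un", "una"},
}


def guess_language_by_dictionary(text, min_share=0.1):
    """Return the language whose stop words cover the largest share of the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not words:
        return None
    shares = {
        lang: sum(word in vocabulary for word in words) / len(words)
        for lang, vocabulary in STOP_WORDS.items()
    }
    best = max(shares, key=shares.get)
    # Refuse to guess when even the best language matches too few words.
    return best if shares[best] >= min_share else None


print(guess_language_by_dictionary("The cat is in the garden and it likes the sun"))
print(guess_language_by_dictionary("Le chat est dans le jardin et la maison est belle"))
```

Note that “la” appears in both the French and the Spanish list: shared words like this are one reason the method struggles with closely related languages.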
I recently read an interesting article at SEO by the Sea about how search engines know the language of a query. Strictly speaking, it does not deal with identifying the language of web pages, but the basic problem is the same. Moreover, the article mentioned four Google patents which hint that Google might be using some kind of artificial intelligence and/or statistical method to identify the language of a text. An article in the Official Google Blog actually mentions the use of artificial intelligence, and Google’s language detection tool also seems to point in the direction of probabilistic methods.
Moreover, in the first post of this series about whether search engines understand your localized pages, I pointed out one particular article from the Google Research Blog titled All Our N-gram are Belong To You that literally stated “…we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation…”. Thus, it is evident that Google uses probabilistic methods, and it is likely that such methods are used not just for translation but also for language recognition.
Thus, Google could compute the probabilities of the words on the page appearing in each of the different languages, and use artificial intelligence and/or machine learning to predict the most likely language. Combined with character mapping (maps of characters that are more typical of certain languages than of others), this could identify the language with very high probability.
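As a rough illustration of what a probabilistic approach can look like, the sketch below trains a character trigram model per language and picks the language under which the text is most likely. This is a generic textbook technique (n-gram counts with add-one smoothing), not Google’s method, and the two training sentences are toy data I made up; a usable model needs far more text per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split normalized text into overlapping character n-grams."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageModel:
    def __init__(self, n=3):
        self.n = n
        self.counts = {}         # language -> Counter of n-grams
        self.totals = {}         # language -> number of n-grams seen
        self.vocabulary = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocabulary.update(grams)

    def log_likelihood(self, language, text):
        counts = self.counts[language]
        # Add-one smoothing so unseen n-grams get a small, non-zero probability.
        denominator = self.totals[language] + len(self.vocabulary) + 1
        return sum(
            math.log((counts[gram] + 1) / denominator)
            for gram in char_ngrams(text, self.n)
        )

    def classify(self, text):
        return max(self.counts, key=lambda lang: self.log_likelihood(lang, text))


model = NgramLanguageModel()
model.train("en", "the quick brown fox jumps over the lazy dog and the cat is in the house with the children")
model.train("fr", "le renard brun saute par dessus le chien paresseux et le chat est dans la maison avec les enfants")
print(model.classify("the dog is in the house"))
print(model.classify("le chat est dans la maison"))
```

Because the model works on characters rather than whole words, it needs no dictionary and also picks up the character-level cues that character mapping relies on.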
Personally, I strongly doubt that the whole page is scanned this way (the computing power required to process billions of pages would be immense), but using a subsegment of each page or an incremental scanning approach would be reasonable. Some research suggests that a sample of between 400 and 600 characters should be sufficient to identify a language, so why spend more effort on it?
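Taking such a sample is trivial. The helper below is a hypothetical sketch that keeps roughly the first 500 characters (my choice of midpoint in the 400 to 600 range) and cuts at a word boundary, so that whatever detector runs next does not see a truncated word.

```python
def sample_text(text, max_chars=500):
    """Return at most max_chars characters of text, cut at a word boundary."""
    collapsed = " ".join(text.split())  # normalize runs of whitespace
    if len(collapsed) <= max_chars:
        return collapsed
    cut = collapsed.rfind(" ", 0, max_chars + 1)
    # Fall back to a hard cut if the sample contains no space at all.
    return collapsed[:cut] if cut > 0 else collapsed[:max_chars]


long_page = "word " * 1000
print(len(sample_text(long_page)))
```

An incremental variant could feed successive samples to the detector and stop as soon as one language is clearly ahead.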
For those interested in exploring this method, I found an interesting example of probabilistic language determination and training with LingPipe, a toolkit for processing text using computational linguistics.
Google, at least, considers geo-location a factor in language detection. If a page indicates a geographic location, it is probably written in the local language: a site offering services in Paris is likely to be in French, and a page offering services in New York is likely to be in English. This is obviously not always true: if the page belongs to a hotel, chances are that it also has an English version, even if the hotel is in Paris.
I noticed an interesting discussion at the Google Webmaster Central help forum about a site that was in English but was treated as Chinese. The only connection between the site and Chinese was that it had a lot of incoming links from Chinese pages. Thus, we can reasonably assume that the language of incoming links (which Google knows, after all) is a factor in language detection. There are a few hints of this in Google’s Webmaster Central, but nothing too obvious. It seems logical: people are unlikely to link to pages in a different language.
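How such a link signal might be computed is pure speculation on my part, but a simple version would be a majority vote over the languages of the linking pages. In the sketch below, the function name and the `min_links` threshold are assumptions made for illustration.

```python
from collections import Counter


def link_language_signal(link_languages, min_links=5):
    """Return (language, share) for the dominant language among linking pages.

    link_languages holds one language code per incoming link; None marks a
    linking page whose language is unknown. Returns None when there are too
    few links with a known language to say anything.
    """
    known = [lang for lang in link_languages if lang]
    if len(known) < min_links:
        return None
    language, count = Counter(known).most_common(1)[0]
    return language, count / len(known)


# An English site whose incoming links come mostly from Chinese pages.
print(link_language_signal(["zh"] * 8 + ["en"] * 2))
```

If a signal like this were weighted too heavily against the page’s own text, it would produce exactly the kind of misclassification described in that forum thread.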
Katia Hayati (referenced in the first post of the series) highlighted that, in the tested sample of web pages, roughly 95% of all outgoing links were in the same language. Again, this makes sense: people link to what they understand. I could not find even a hint anywhere that search engines use outgoing links for language recognition, but I personally would not rule it out.
Unfortunately, the search engines themselves have not exactly published a wealth of information about how they detect the language of web pages. In the Official Google Blog you can find some information about voice recognition and how they detect the language of queries, but hardly anything about how they detect the language of the web pages themselves. Based on my research, my best guess is that less sophisticated engines will go for the dictionary method and more sophisticated engines will go for probabilistic methods (and hence artificial intelligence), with character mapping, geo-location and link language as complementary information for cases where there is doubt. Less sophisticated search engines (and possibly also some of the major ones, except Google) will also consider code-level language information, either as the main criterion or as complementary information.
What does that mean for those involved in SEO translation or running multilingual websites? Well, for that you will have to wait for the next post…