I highlighted in the previous post (Basic Research in Language Recognition) some of the academic research that is taking place and that provides the groundwork for language recognition and identification by the search engines. But why is the recognition of the language in web pages so difficult?
Machines don’t “understand” a language.
Machines are stupid. Even so-called artificial intelligence is light-years away from the intelligence of a 5-year-old. Any human being who knows how to read can immediately identify the languages he speaks, and often others as well. But machines (and search engines are just that) need to be programmed to analyze the text. For us, a text carries meaning. For a search engine, it is just a stream of characters, though it "knows" (by means of a programmed rule) that words are sequences of characters separated by the space character. Certain words in its dictionary are associated with a certain language and others aren't. Since it cannot understand the meaning of the text, it has to trust that somebody has told it that certain words belong to a certain language. But it does not know what to do with words that are not on its list, because it cannot identify them from context, as a human being would.
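The dictionary-based approach described above can be sketched in a few lines. This is a deliberately naive illustration, not how any real search engine works: the word lists are tiny hypothetical samples, and `guess_language` simply counts how many words appear in each list.

```python
# Minimal sketch of dictionary-based language guessing.
# The word lists below are tiny illustrative samples, not real data.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "de", "que", "en"},
    "fr": {"le", "la", "et", "de", "que", "en"},
}

def guess_language(text: str) -> str:
    # Words are just character sequences separated by spaces...
    words = text.lower().split()
    # ...and each word either appears on a language's list or it doesn't.
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Words on no list at all leave the machine clueless.
    return best if scores[best] > 0 else "unknown"

print(guess_language("the cat is in the house"))       # en
print(guess_language("el perro y el gato que ladra"))  # es
print(guess_language("xyzzy qwerty"))                  # unknown
```

Note how "la" and "que" appear in both the Spanish and French lists: even this toy example runs into the ambiguity that a human resolves effortlessly from context.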
Teaching a machine is far more difficult than teaching a 5-year-old: the child has a brain that is adapted to learning and is far more powerful than any machine (perhaps I should add "as of today"). And those who have kids know how difficult even that is…
Languages might be mixed in the page text.
I’ve seen more than one page written in several languages at once, pretending to be “helpful”. Speaking five languages fluently (and a few more less so), I did not find it helpful at all; rather the opposite: quite confusing. And for search engine spiders it will be even more confusing, as they cannot, like a human reader, tell which parts are in a different language.
Even assuming that a page is “officially” written in a single language, you may encounter other problems. In one Google Webmaster Central help forum thread, the issue was that a page in Turkish had footers and some other text in English that prevented correct recognition. Google can recognize multiple languages on a page (as stated in the same forums), but the volume of text in each language probably determines which language the search engine selects for that particular page. It is unlikely (though not impossible) that Google classifies a page as being in several languages. And the other search engines probably won’t recognize more than one language at a time.
The code-level language information may be incorrect.
In the previously referenced post, there was an interesting statement by a Google employee: he indicated that Google ignores all code-level language information for language detection, because it is frequently copied and pasted regardless of the language of the content on the page. But other search engines may not ignore such information, and it could mislead them into detecting the wrong language. And how long will it take before Google decides to penalize sloppy practices that make language identification more difficult?
The “language” attribute may be spurious.
Many HTML editors set the default language to English or to the locale defined during tool installation. This is bad enough in itself, since the whole page may be classified as being in the wrong language. Worse, I found to my dismay that a quite famous HTML editor, when I inserted text into an existing page, automatically added a “language” attribute to that text which did not correspond to the page language at all. Thus a Spanish page had “English” text all over the place, even though the whole page was written in Spanish. OK, so Google will ignore that. Curiously enough, Microsoft Bing research from 2008 indicated that the most common ‘standard’ lang tag showed up on just 0.000125% of pages on the web, so they did not consider it very useful. But will other search engines do the same?
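You can see the contradiction for yourself by collecting every `lang` attribute on a page. Here is a short sketch using Python's standard HTML parser, fed with a made-up example of the editor behavior described above (a Spanish page whose inserted paragraph was tagged as English):

```python
from html.parser import HTMLParser

# Sketch: collect every lang attribute in a page, so conflicts between
# the declared page language and editor-inserted attributes become visible.
class LangCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.langs.append((tag, value.lower()))

# Hypothetical page: Spanish content, but the editor tagged it as English.
page = '<html lang="es"><body><p lang="en">Hola, esto es español.</p></body></html>'
collector = LangCollector()
collector.feed(page)
print(collector.langs)  # [('html', 'es'), ('p', 'en')] -- contradictory
```

A crawler that trusted these attributes would have to decide which declaration wins; one that has seen enough pages like this may simply ignore them all, as Google says it does.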
The domain name may be meaningless.
One could imagine identifying the language from the URL, and one of the references in the previous post actually tried to do exactly that. But the great difficulty is that many domain names are nonsense (because they correspond to company names or brands), misspelled words, combinations of words that would have to be analyzed in every possible language, or arbitrary combinations of characters and numbers. Though the authors of the study reported some success, it is questionable whether they could repeat it with domain names of these types. In any case, Google has reported that it detects language based on content, not on the URL.
The TLD and country code mean nothing.
Obviously, everybody and his grandmother has been using (and abusing) generic TLDs such as .com, .net, .org or .biz; you can find domains with these extensions in almost every language there is. But even a country-specific domain does not necessarily indicate a specific language. For example, .tv is the Internet country-code top-level domain (ccTLD) for the islands of Tuvalu, and it is used by television stations around the world. Does that mean pages ending in .tv will be written in Tuvaluan or English, the two official languages of that island? Don’t count on it. Even more exotic ccTLDs such as .mn (Mongolia) are abused to the extent that the Minnesota Senate is mapped to senate.mn; don’t expect that particular page to be in Mongolian! With the exception of certain ccTLDs, there is a good probability that the language will match the ccTLD, but it is a probability, not a certainty. And none of that applies to generic TLDs such as .com.
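The senate.mn example shows exactly why a naive ccTLD-to-language table fails. The sketch below uses a tiny illustrative mapping (a real table would cover every ccTLD) to demonstrate how such a heuristic confidently gives the wrong answer:

```python
from urllib.parse import urlparse

# Illustrative-only mapping; a real table would cover all ccTLDs.
CCTLD_LANG_HINT = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "tv": "Tuvaluan/English",  # in practice: TV stations worldwide
    "mn": "Mongolian",         # in practice: e.g. senate.mn (Minnesota)
}

def tld_language_hint(url: str) -> str:
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_LANG_HINT.get(tld, "no hint")

print(tld_language_hint("https://senate.mn/"))      # Mongolian -- wrong!
print(tld_language_hint("https://example.com/"))    # no hint
```

At best, the ccTLD can serve as a weak prior to be combined with content-based detection, never as the answer by itself.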
The character set does not necessarily help.
Some character sets identify the language, but others do not. For example, if a page encoded in EUC-JP (a Japanese character set) uses the portion of EUC-JP that is not pure ASCII, it is almost certainly in Japanese; you can reasonably expect such characters on a Japanese page. On the other hand, most applications in Western languages typically use Latin-1 (also known by its standard name, ISO-8859-1), which is also the default encoding for legacy HTML documents, while web pages today typically use UTF-8. But neither of these character sets allows distinguishing between English, French, German or Spanish texts.
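The EUC-JP case can be checked mechanically: if the raw bytes decode as EUC-JP and contain characters beyond the ASCII range, Japanese is a very safe bet. A minimal sketch of that test:

```python
# Sketch: a page's byte stream hints at its language only for some encodings.
def eucjp_implies_japanese(raw: bytes) -> bool:
    """True if the bytes are valid EUC-JP and use non-ASCII characters."""
    try:
        text = raw.decode("euc_jp")
    except UnicodeDecodeError:
        return False  # not EUC-JP at all
    # ASCII-only content decodes fine in EUC-JP but proves nothing.
    return any(ord(ch) > 127 for ch in text)

japanese = "これは日本語です".encode("euc_jp")
english = b"This is plain ASCII text"
print(eucjp_implies_japanese(japanese))  # True
print(eucjp_implies_japanese(english))   # False: pure ASCII is inconclusive
```

No such trick exists for Latin-1 or UTF-8: the byte `0xE9` decodes to "é" in Latin-1 whether the page is French, Spanish, or an English text quoting a café menu, so a successful decode tells us nothing about the language.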
Meta-tags might be in the incorrect language.
Curiously, I have found translated pages where the meta-tags were still in the original language. And I am not talking about exotic tags, but about important meta-tags such as the site description and the keywords. In addition, some HTML editing tools add meta-tags of their own, such as the program name. I remember that Microsoft FrontPage did that, but it was not the only one, and these meta-tags might be read by some search engines and understood as being in a different language (usually English).
Hosting location or server IP is irrelevant.
One might think that the hosting location or the server’s IP address is relevant for language detection. Perhaps it was, ten or fifteen years ago; it no longer is. A hosting site at the other side of the world is just one click away. I live in Europe and have some of my sites hosted in Texas and the rest in India. The sites are in five different languages. Oh, and none of them is in Hindi (which I do not speak), nor are all the sites hosted in the U.S. in English.
The problem is therefore not as simple as it looks. You may have recognized at a glance that this SEO Translation blog is in English, but a search engine faces a much harder task reaching the conclusion you drew in a split second…
In the next post (PART 3: How Search Engines Identify Language) we’ll see how search engines DO recognize web pages, and how to make sure that they’ll recognize your multilingual site.