In previous posts I discussed Basic Research in Language Recognition, the Difficulty of Language Identification and How Search Engines Recognize the Language of Your Pages. This might be very interesting, but the point is that you WANT search engines to properly recognize the language in which your site is written. Why is this important? I covered that too, in a post about what happens if your website language is not recognized.

So we finally come to the core of the problem: How do we ensure that the search engines recognize the language of our original and localized pages? Because if your translation is not recognized as such, then your multilingual SEO efforts will be totally wasted. Here are a few tips to make sure that the search engines DO recognize your pages:

Do not mix languages!

Given that many search engines use a combination of n-gram algorithms and dictionary methods, language recognition becomes more difficult when several languages coexist on the same page. You risk having the page identified as being in one of its languages, though not necessarily the main one. Murphy’s Law applies here too. If you must include some text in a different language (e.g., a quote), make sure that the remaining text is significantly more substantial than the quote.

Use the page-level “lang” attribute.

Yes, it’s true that Google does not trust it, but that does not mean that other search engines do not use it. And at the very least, if the page-level “lang” attribute corresponds to Google’s “best guess”, it confirms that the page is indeed in that particular language. Something like <html lang="es"> at the beginning of the page will certainly help to indicate that the language is Spanish. It can’t hurt, so do it.
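
As a minimal sketch, assuming a Spanish-language page (the title and body text are placeholders I made up, not taken from a real site), the attribute sits on the root element so that it covers the whole document:

```html
<!DOCTYPE html>
<html lang="es">
<head>
  <meta charset="utf-8">
  <title>Ejemplo de página en español</title>
</head>
<body>
  <!-- Everything inside <html> inherits lang="es" unless overridden -->
  <p>Este texto está escrito en español.</p>
</body>
</html>
```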

Make sure that there are no spurious “lang” attributes in the text.

As pointed out previously, some HTML tools insert a “lang” attribute by default when you add text to an existing page. Now, this text-level language attribute may or may not correspond to the page language: if it does, it is redundant, and if it does not, it is misleading. Use this attribute only when you want to insert some text in a different language, and make sure that it names the correct language.
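
To illustrate the legitimate use, here is a sketch of a Spanish page that quotes one sentence in English. The content is invented for the example; the point is that only the quote carries its own “lang” attribute, while the surrounding text has none:

```html
<html lang="es">
<body>
  <p>Como dijo el autor en su artículo original:</p>
  <!-- The quote is the only element that overrides the page language -->
  <blockquote lang="en">
    <p>Search engines need enough text to identify a language.</p>
  </blockquote>
  <p>El resto de la página sigue en español, sin atributos adicionales.</p>
</body>
</html>
```

If your editor has sprinkled lang="en" over paragraphs of a page like this one, those are the spurious attributes to remove.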

Use a language meta-tag.

For example, <meta http-equiv="Content-Language" content="es-es" /> for Spanish. Like the “lang” attribute, Google will ignore it, but other search engines might not. For example, Yahoo still gives it a lot of importance. Check out the page on HTTP and meta for language information.
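
For context, a sketch of where the tag goes, assuming a page in Spanish as used in Spain (the title is a placeholder):

```html
<head>
  <meta charset="utf-8">
  <!-- "es-es" means Spanish as used in Spain; replace with your own language code -->
  <meta http-equiv="Content-Language" content="es-es" />
  <title>Ejemplo de página en español</title>
</head>
```

Whatever code you put here should agree with the page-level “lang” attribute, otherwise you are sending two contradictory signals.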

Use Google Webmaster Tools.

Now, other search engines might not offer that possibility, but you had better make sure that Google acknowledges at least the main language of your site. If your localized versions are in subdomains, then identify those too. And of course, if you use geo-targeting, use the Set Geographic Target tool in Webmaster Tools.

Clean those HTML pages!

It is always good hygiene to remove the crap from your pages. This includes useless information such as the name of the tool that generated the page, or lots of meta-tags that you might have been copying mindlessly from one page to another, and then to another site. These meta-tags might be in a different language, further muddling the identification of your page language. Also make sure that you have no text in a different language in inconspicuous places, such as footers (e.g., because you used a template).

Make sure the meta-tags are in the same language.

It is incredible how often people leave the page title, keywords or description in the original language when they localize their web pages. Now, that’s a good way to confuse the search engines about the language used. And even if you don’t confuse them, you will confuse the users searching for your localized pages when a description pops up in a language they don’t understand.
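
As a sketch, this is what a consistently localized head might look like for the German version of a hypothetical shoe shop (the shop and all the wording are made up for illustration):

```html
<head>
  <meta charset="utf-8">
  <!-- Title, description and keywords are all in German, like the page body -->
  <title>Handgemachte Lederschuhe online kaufen</title>
  <meta name="description" content="Handgemachte Lederschuhe für Damen und Herren, direkt vom Hersteller." />
  <meta name="keywords" content="Lederschuhe, handgemachte Schuhe, Schuhe online kaufen" />
</head>
```

If any of these three still reads “Handmade leather shoes” after localization, that is exactly the leftover to look for.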

Link to your pages from pages in the same language.

I pointed out in my recent post Why Your Existing Back Links Are Worthless why it was not a good idea to use external links written in a different language. Well, there’s another catch: Google also uses external links to determine the language of your site. So, if you have many links in English to your German pages, Google may decide that those German pages are in English.

Don’t link to pages in different languages from the same page.

It is not clear whether search engines also consider the language of linked pages when they evaluate the language of your page, but I’d say better safe than sorry. And why would you want to link to a page in a different language anyhow? Even if you speak it, your visitors might not!

Identify links to different languages using hreflang.

If you DO need to link to documents in other languages, at least use the hreflang attribute to indicate the language of the target document. It is unclear whether search engines use this particular attribute, but it’s worth a try:

<a href="" hreflang="en"></a> to link to this page from a non-English page.
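
A slightly fuller sketch of the same idea, on a Spanish page that points to English and German versions of a document. The example.com URLs and file names are hypothetical:

```html
<!-- hreflang names the language of the TARGET document, not of the link text -->
<p>
  Este documento también está disponible en
  <a href="https://www.example.com/en/guide.html" hreflang="en">inglés</a> y
  <a href="https://www.example.com/de/guide.html" hreflang="de">alemán</a>.
</p>
```

Note that the link text itself stays in Spanish, so the page does not pick up stray foreign words.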

Do not store pages from different languages in the same location.

One of my posts discussed how to save the pages corresponding to the different languages in either separate subdirectories or subdomains. Apart from the fact that maintenance becomes a mess, if you mix pages of different languages in the same location, the search engine spider will find all those pages together and might deduce (incorrectly) that they are in the same language. I have no evidence that this actually occurs, but I personally would not take this risk.

Provide sufficient text for language recognition.

Interestingly, a lot of the academic studies stated that they had problems in language recognition because they did not have a sufficiently large sample of words to determine the page language. This will also be true for search engines, so make sure that your page holds sufficient text. Avoid pages stuffed with images and no text, and in particular avoid images containing text: the search engine robots do not include OCR software to detect what you have written in your graphics!
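
A sketch of the difference, using a made-up German banner. In the first version the slogan exists only as pixels; in the second it is real text that a language detector can read. I have also added an alt attribute, although whether search engines use alt text for language detection is an assumption on my part:

```html
<!-- Harder to identify: the only words on the page are inside the image -->
<img src="banner-angebot.png" />

<!-- Easier to identify: the same message as real text, image kept as decoration -->
<img src="banner.png" alt="Sommerangebot: 20 % Rabatt auf alle Schuhe" />
<h1>Sommerangebot: 20 % Rabatt auf alle Schuhe</h1>
<p>Nur bis Ende August erhalten Sie 20 % Rabatt auf unser gesamtes Sortiment.</p>
```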

Use language-characteristic words and letters.

For those search engines that use dictionary-based approaches (and yes, perhaps even for Google), it is very helpful if they can detect words (or characters) that are specific to one language. “España”, with its “ñ”, points strongly to Spanish, although the letter is not exclusive to it: Galician and Basque, for example, use it too. Я is a letter of the Cyrillic alphabet, but Russian is not the only language using that alphabet. If you find ن, the script is Arabic, although Persian and Urdu are written in it as well; the letter ש points to Hebrew, or possibly Yiddish. Greek letters are likely to be encountered in mathematical formulas, so don’t abuse them, or at least include them as a graphic. The German ß (as in Gruß) is easily confused with the Greek β, which is a different character, so make sure you type the right one. The French ç is also found in Portuguese, Catalan, Turkish and Albanian, among others, so do not think that it is the key to success!

A final word of warning.

If you combine all these little pieces of advice in your web pages, the probability that a search engine identifies the page language correctly is very high. Ignore them at your own risk: incorrect language recognition by a search engine means that you will NEVER reach the top of the search results. Your ranking might actually be excellent, but perhaps in a language that you yourself do not understand, and your visitors even less. This is not theory: even a superficial search in the Google Webmaster Central help forum shows many people complaining that their pages dropped like rocks in the ranking because Google identified their language incorrectly. And if you were #1 in English, why would you want to exchange that for a #1 in Swahili or Afrikaans? Of course there is a market for those languages. But is it your market?

This closes the series of posts on language identification by the search engines. It is an interesting topic, and there is not much information around, so I’ve tried to compile all the information in one single place, including my personal impressions on how you could improve the recognition of your multilingual websites. Now, what do YOU think?
