SEO Translator

How to optimize your web site translation for the search engines!

Keyword Basics

I just got a call from a potential client who took the step to website localization, and it was a total disaster. Yes, the translation was more or less fine, but the translator managed to mess up his target keywords big time.

Talking with this customer, I suddenly realized that he didn’t have a clue about keywords in his own language, much less about the problems surrounding multinational keywords. So I thought it was about time to start discussing keyword localization in this blog. But first, for readers in the same situation, I should start with the really basic stuff before moving on to more complex aspects.

What are Keywords?

People search for things on the Internet all the time, and the search volume is huge. Comscore reported 131 billion searches per month worldwide (December 2009), a growth of 46% over the previous year. Twitter (not on the previous list) claims 19 billion searches per month, or 800 million searches per day.

So how do people search? They go to a search engine and type in the words that describe what they are looking for. These terms are what we call keywords. What we want is that, when the user searches for those words, our page pops up as the first result in the search engines.

Long Tail Keywords

The fact is that the most searched keywords are also the most coveted, and the competition for them is fierce. Hence it is important to perform keyword research to find a set of keywords for which it is relatively easy to rank, or at least where there is relatively little competition. And that is where long tail keywords come in.

Curiously enough, only about 20% of searches consist of a single word. Statistically, the frequency distribution of keywords consists of a very few keywords that are used very often, followed by a large number of longer phrases containing those same words, each of which appears very infrequently. This distribution is called a long tail because of its shape, and the tail is made up of those less frequently used terms.

Keyword distribution, including long tail keywords.

Thus, “widgets” might be a very frequent keyword, but “blue widgets” is less frequent. “Square blue widgets in Iowa” is likely to be quite infrequent; it’s part of the long tail. Ranking for “widgets” will be difficult, and ads will be expensive, even though there is a lot of traffic. The long tail keyword has less traffic, but it also has lower advertising costs and much less competition. It is for this reason that smart SEOs first target the long tail keywords that include their main keyword.

Keyword Targeting

Now, our page is not going to appear on top of the search engines by a miracle. We need to target those keywords, which means making the search engines consider our page the very best match when a user searches for that particular keyword. For this purpose, we need to sprinkle our keyword through the text (but not too much, otherwise you will be penalized by the search engines for keyword stuffing).
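To get a feeling for how heavily a keyword is sprinkled through a text, you can compute a simple keyword density. The following is only a rough sketch: the function name is my own, it handles single-word keywords only, and no search engine publishes a threshold at which it considers a text stuffed, so treat the number as a sanity check and nothing more.

```python
import re


def keyword_density(text: str, keyword: str) -> float:
    """Return the share of words in `text` that equal `keyword`.

    Hypothetical helper for illustration: case-insensitive,
    single-word keywords only.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


if __name__ == "__main__":
    sample = "Blue widgets are great widgets"
    print(f"{keyword_density(sample, 'widgets'):.0%} of the words are 'widgets'")
```

If the share looks unnaturally high when you read the text aloud, it probably looks unnatural to a search engine too.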

For the On-Page SEO, keywords can be targeted by including them in the following locations:

  1. Keyword in the domain name (e.g. www.mykeyword.com)
  2. Keyword in the Page name (e.g., mykeyword.html)
  3. Keyword in URL (e.g. www.mydomain.com/mykeyword/related-keyword.html)
  4. Keyword in the “Title” tag, preferably at the beginning
  5. Keyword in the “Description” Meta-tag
  6. Keyword in the “keywords” meta-tag. Google does not use it, but others do.
  7. Keywords in the H1, H2 and H3 titles
  8. Keywords in the text body
  9. Keywords in the “alt” attributes of images
  10. Keywords highlighted with strong or italic
  11. Keywords in bigger font
  12. Keywords in the anchor of internal links

Note that in all cases, the closer the keywords are to the beginning, the better.
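Several of these locations can be checked automatically. The sketch below uses Python's standard html.parser module to report where a keyword appears in a page. The class name, the location labels and the sample page are my own inventions for illustration, and it covers only a subset of the list above (title, description and keywords meta-tags, headings, emphasis, link anchors, alt attributes and body text).

```python
from html.parser import HTMLParser

# Tags whose text content we report as a separate location.
TEXT_TAGS = {"title", "h1", "h2", "h3", "strong", "em", "b", "i", "a"}


class KeywordLocator(HTMLParser):
    """Collect the on-page locations in which a keyword appears."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.found = set()
        self._open = []  # currently open tags from TEXT_TAGS

    def _check(self, location, text):
        if text and self.keyword in text.lower():
            self.found.add(location)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("description", "keywords"):
            self._check("meta " + name, attrs.get("content"))
        elif tag == "img":
            self._check("alt", attrs.get("alt"))
        elif tag in TEXT_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            for tag in self._open:
                self._check(tag, data)
        else:
            self._check("body", data)


def find_keyword_locations(page, keyword):
    """Return a sorted list of locations where `keyword` occurs."""
    locator = KeywordLocator(keyword)
    locator.feed(page)
    return sorted(locator.found)


SAMPLE_PAGE = """<html><head><title>Blue widgets for sale</title>
<meta name="description" content="Buy blue widgets online"></head>
<body><h1>Blue widgets</h1><p>We sell gadgets.</p>
<img src="w.png" alt="a blue widget"></body></html>"""

if __name__ == "__main__":
    print(find_keyword_locations(SAMPLE_PAGE, "blue widget"))
```

In the sample page the keyword is present in the title, the description, the H1 and an alt attribute, but missing from the body text, which is exactly the kind of gap such a check is meant to reveal.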

In the next post, I’ll discuss off-page SEO for specific keywords, how to search for keywords and how to create a keyword strategy. Then I’ll discuss why all this is important for multinational SEO.

In my previous post (Your #1 in Google Is Worthless in The Rest of The World) I highlighted that ranking #1 in Google.com is worth much less than you think, because people outside the US get “diverted” to their local Google. Thus, ranking #1 in google.com does NOT mean that you will rank #1 everywhere. In my research I found companies that topped the ranking in google.com and completely disappeared from the radar when searching (in English!) in the local Google sites.

Interestingly, after examining the results in some of the local Googles, I started seeing a trend in the local results, which you might use to your advantage. Now, this is not a scientific study, as the sampling (only 9 local Google sites and 4 keywords) is not sufficient to perform a detailed analysis. But you can call it an educated guess based on available information.

Search for cars in Google India

Apparent Google geo-targeting ranking

OK, so it’s not scientific. But these are the conclusions I reached when looking at the first pages of 9 different local Google sites (all in English):

  1. Local specifically geo-targeted results on the first page amount to 40-60% of the total.
  2. Around 30% are from sites that have no localization or geo-targeting characteristic (e.g., phone numbers)
  3. The rest seem to be sites which, though ranking relatively well in Google.com (from the second page onwards), have a significant number of local links

How are local sites detected?

Here we are on somewhat more firm ground, as some of the information has been published in the Google Webmaster Forum:

“Yes, we do try to find context from these two factors (TLD & server IP) … however, if your site has a geographic TLD/ccTLD (like .co.nz) then we will not use the location of the server as well. Doing that would be a bit confusing, we can’t really “average” between New Zealand and the USA… At any rate, if you are using a ccTLD like .co.nz you really don’t have to worry about where you’re hosting your website, the ccTLD is generally a much stronger signal than the server’s location could ever be. ”

I have also noticed that Google, at least, picks up other signals within the text, such as the country name, city names, addresses and international dialing codes. Now, that might not be the most important factor, but I am convinced that such information is also used for geo-targeting, and might be a reason for your site appearing or not appearing on a certain regional Google site.

The interesting point is that as Google performs strong geo-targeting of its results, you might want to use it to your advantage. There are two possibilities here, one is that you actually want to specifically target a certain market, and the other that you do not want to target a specific market but rather want to retain your ranking also in the regional Google search engines.

Targeting local searches

Thus, when you actually WANT to perform some geo-targeting, you can use the following tricks (in order of importance):

  • Publish your location in Webmaster Tools
  • Use a geographic ccTLD
  • Use a local server in the region you are targeting
  • Get local incoming links, and in particular list your site in regional directories
  • Make sure the content sends signals about your regional information, such as country names, city names, local addresses and international dialing codes
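Two of these signals are easy to inspect with a few lines of code. The sketch below is only an approximation under stated assumptions: it treats any two-letter top-level domain as a ccTLD (which ignores country codes that are marketed generically), and it finds international dialing codes with a simple regular expression. The function names, domains and phone numbers are made up for the example.

```python
import re
from urllib.parse import urlparse


def tld_kind(url):
    """Classify the top-level domain of `url` as 'ccTLD' or 'gTLD'.

    Assumption: every two-letter TLD is a country code.
    """
    host = urlparse(url).hostname or ""
    if "." not in host:
        return "unknown"
    last_label = host.rsplit(".", 1)[-1]
    if len(last_label) == 2 and last_label.isalpha():
        return "ccTLD"
    return "gTLD"


def dialing_codes(text):
    """Return international dialing codes written as +NN in `text`."""
    return re.findall(r"\+(\d{1,3})\b", text)


if __name__ == "__main__":
    print(tld_kind("http://www.example.co.nz/"))  # two-letter TLD
    print(tld_kind("http://www.example.com/"))    # generic TLD
    print(dialing_codes("Call +64 9 555 0100 or +34 91 555 0100"))
```

This says nothing about how Google weighs these signals; it only helps you audit what your own pages are sending out.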

How NOT to disappear from local searches

Again, the sampling base is not sufficient to consider this scientific evidence (and it might change with the next algorithm change anyhow), but based on the information I collected in my experiments, I’ve found that the sites that DO stay in the regional Googles have the following characteristics:

  • Probably they have not been targeted with the Google Webmaster tools
  • They use generic TLDs (gTLDs), not ccTLDs
  • They are general, truly global, and do not display regional information and if they do, it is about many countries/locations
  • They are multilingual, thus sending strong signals about multi-regional presence
  • They have local incoming links, denoting local presence

Curiously, even the big players stumble on these points, so you have a real chance to get to #1. For example, amazon.de scores quite well for cars in Google Germany (google.de). However, in Google Austria (google.at) Amazon does not appear on the first page for this keyword! Yes, I’m talking about Amazon. And that despite the fact that Germany and Austria share a border and even the same language. The “.de” ccTLD allows Amazon.de to score well in Germany, but penalizes it in Austria.

Ranking #1 in local search engines

Often it is not possible to rank #1 in Google or other global search engines. However, apart from the fact that competition is not so strong in regional search engines, local targeting can dramatically boost your local ranking, while not performing it might make your site completely disappear from the regional SERPs, even if it ranks well in google.com. Even some regional targeting by big players (see the Amazon example above) is so sloppy that you can beat them without too much effort.

And I left the best for the end: Lately Google is showing several results for the same site if it finds those results valuable. During my research I found out that this is also true when searching on the “other” Googles, so your single result could multiply into two or three different results locally!

Call the regional results niche results, if you want. But you want traffic, don’t you? Then ignore the regional search engines at your own risk. You thought the world revolved around Google.com? Wrong! Most of the searches around the world are not performed there! Targeting the regional engines at the same time as the global ones is the smart move. You may not rank #1 on Google.com, but perhaps you are No. 1 in many, many other regional sites if your competitors are not as smart as you!

Skeptical? Well, let me put it this way: Do you prefer a #5 in google.com (where you will get barely 2-3% of the traffic) or a #1 in Google India and Google China (where you’ll get 50-60% of the traffic)? And where do you think there are MORE potential customers?

I participated yesterday in a thread on Digitalpoint, where somebody complained that for a certain keyword the Google AdWords keyword checker reported 6 million searches, but he received just 58,000 visits for this keyword despite bouncing between positions #2 and #5.

Apart from the fact that somebody finding you through a search does not mean that he will click on your link (e.g., because the displayed text shows something different from what he is looking for), the number of clicks drops dramatically with each position. But what this person was forgetting is that a high ranking on Google.com does not automatically guarantee a good ranking in the “other” Googles I mentioned in my previous post (How to Access the “Other” Google).

Google has a local bias, whether you like it or not, because Google considers -right or wrong- that a user will be more interested in local content. Yet this person replied:

Well, if you rank in Canada and US up top, then usually you stand close in the local Google searches as well.

Now, that is NOT correct! It is a mistake that too many US-centered marketers make. I have a couple of sites that rank #1 in Google.com and fail to appear on the first page in Google.ca and other locales for the same keyword. I need explicit targeting for that. Google performs geo-targeting based on the user’s locale and location, and the results vary from country to country, EVEN in the same Google! I travel a lot and have tested this. The results you get (in English) when accessing Google France from France, Germany or Spain are not necessarily the same. BTW, you do not have to believe me, try it out yourself!

Remember my last post (How to Access the “Other” Google)? There I highlighted how you could search in the “other” Google search engines. Forget for a moment the language factor, which I will discuss in a later post. Let’s stick to English and carry out an experiment with the results of three Google search engines.

For example, in Google.com for the word “cars” I get the following results:
1. New & Used Cars for Sale, Auto Dealers, Car Reviews and Car … in www.cars.com/
2. Cars (2006)…. in www.imdb.com/title/tt0317219
3. Disney/Pixar Cars – The Official Site… in disney.go.com/cars/
4. CARS.gov – Car Allowance Rebate System – Home – Formerly Referred … in www.cars.gov
5. Used Cars – New Cars – Search New & Used Cars For Sale – carsales … in www.carsales.com.au

Search for cars in Google.com

In Google.de (Germany) the same keyword provides the following results:

1. Cars – Offizielle Film Website Ein Disney / Pixar Film… in www.disney.de/DisneyKinofilme/cars
2. Cars (Film) – Wikipedia… in de.wikipedia.org/wiki/Cars_(Film)
3. Disney/Pixar Cars – The Official Site… in disney.go.com/cars/
4. Cars (2006).. in www.imdb.com/title/tt0317219/
5. Cars (Einzel-DVD): Amazon.de: Jorgen Klubien, Joe Ranft, Randy … in www.amazon.de

Search for cars in Google.de

In Google.ca (Canada), with an English locale, the same keyword provides the following results:
1. Cars (2006)… in www.imdb.com/title/tt0317219/
2. Canadian Aviation Regulations (CARs) – Policy and Regulatory … in www.tc.gc.ca/eng/civilaviation/regserv/cars/menu.htm
3. Cars (film) – Wikipedia, the free encyclopedia… in en.wikipedia.org/wiki/Cars_(film)
4. Automobile – Wikipedia, the free encyclopedia… in en.wikipedia.org/wiki/Automobile
5. Used Cars @ CarCasher.Com… in www.carcasher.com/

Searching for cars in Google Canada

Note that for google.com I get the “standard” feed, but in Google Germany and Canada I get a localized/geo-targeted version. The results are NOT the same! cars.com, which ranks #1 in Google.com, does not even appear on the first page on Google Germany and Google Canada! Neither do #4 (www.cars.gov) and #5 (www.carsales.com.au)!

Thus, the #1 ranking for cars.com on Google.com is meaningless in Canada and Germany, as it does not even appear on the first page when searching in those countries.

However, it is interesting to see which results are repeated. Why is this? Any ideas?
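If you want to repeat this comparison with your own keywords, a few lines of code will show which results are repeated. The sketch below simply uses the three top-5 lists quoted above (with trailing slashes removed); the function name is my own, and nothing here queries Google itself.

```python
# Top-5 results for "cars" as listed above, one list per Google site.
RESULTS = {
    "google.com": [
        "www.cars.com",
        "www.imdb.com/title/tt0317219",
        "disney.go.com/cars",
        "www.cars.gov",
        "www.carsales.com.au",
    ],
    "google.de": [
        "www.disney.de/DisneyKinofilme/cars",
        "de.wikipedia.org/wiki/Cars_(Film)",
        "disney.go.com/cars",
        "www.imdb.com/title/tt0317219",
        "www.amazon.de",
    ],
    "google.ca": [
        "www.imdb.com/title/tt0317219",
        "www.tc.gc.ca/eng/civilaviation/regserv/cars/menu.htm",
        "en.wikipedia.org/wiki/Cars_(film)",
        "en.wikipedia.org/wiki/Automobile",
        "www.carcasher.com",
    ],
}


def repeated_results(results):
    """Map each URL found on more than one site to the sites listing it."""
    seen = {}
    for site, urls in results.items():
        for url in urls:
            seen.setdefault(url.rstrip("/"), []).append(site)
    return {url: sites for url, sites in seen.items() if len(sites) > 1}


if __name__ == "__main__":
    for url, sites in repeated_results(RESULTS).items():
        print(url, "->", ", ".join(sites))
```

Only the IMDb page shows up in all three lists, and the Disney page is shared by google.com and google.de; everything else is specific to one site.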

OK, so you’ve translated your website, performed your multinational SEO, built localized links, and are ready to test the results. But why is your French page not popping up in Google? You did everything by the book, yet the results fail to come. Disaster!

Well, perhaps not. The issue might be that you are looking at the “wrong” Google! Surprised? You should not be. Google identifies your location and the language of your browser, perhaps also your language preferences in Google Webmaster, and tries to localize the results for you. So, if you are based in the US or in Germany, the results will be different from those you would get when searching from France. Actually, if you are not in the US, when you type http://www.google.com, Google will redirect you to your country-specific Google, say, http://www.google.de for Germany, http://www.google.es for Spain, and http://www.google.fr for France.

So if you want to test your Google SERP for your French pages, type the address of the “French” or “Canadian” Google (www.google.fr and www.google.ca respectively). The funny thing is that you will get a page in YOUR language, with an option to switch to the local language (French and English for Canada). Skeptical? Have a look at the next screenshot, taken when accessing Google Canada with a Spanish locale:

Accessing Google Canada with a Spanish locale

In this way, you can actually test your French pages in the same way as French (in the above example: Canadian) users would do.

But wait: Let’s assume that you ARE in France, and want to test Google Canada. Does it work? Yes, it does. And Germany. And Spain. And every single Google there is... except Google.com, which will redirect you back to Google France. But do not despair if you want to access Google.com for global testing. There is a solution: simply add “/ncr” (“no country redirect”) after the Google address, and it will NOT redirect you anywhere. Thus, if you type http://www.google.com/ncr, you will NOT be redirected.
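If you test several markets regularly, it is handy to generate these addresses instead of typing them. The sketch below only builds the URLs. The function names are my own, and the “/search?q=” path used for the query URL is an assumption about Google's URL layout that is not discussed in this post, so check it in your browser before relying on it.

```python
from urllib.parse import urlencode


def google_home_url(host, no_country_redirect=False):
    """Return the home page URL of a Google site, optionally with /ncr."""
    path = "/ncr" if no_country_redirect else "/"
    return f"http://{host}{path}"


def google_search_url(host, query):
    """Return a search URL for `query` on the given Google site.

    Assumption: the site accepts queries at /search?q=...
    """
    return f"http://{host}/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_home_url("www.google.com", no_country_redirect=True))
    for host in ("www.google.fr", "www.google.ca", "www.google.de"):
        print(google_search_url(host, "cars"))
```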

It is always important when you perform SEO translation to check the results in the local markets you are targeting, because Google (and many other search engines) try to provide you with geo-targeted results because they think those are most relevant to you. That is obviously not always true, but it might be true for your customers. And even if they are not, those are the results that they will see, like it or not. So do not think that your SEO campaign is a total failure because your French keywords don’t pop up in Google when you’re based in Germany… check out Google.fr and google.com/ncr, perhaps you are there already on the first page!

Oh, and by the way, a curious side-effect of using the /ncr suffix is that Google stores a cookie on your machine, so the next time you type the .com address you will NOT be redirected!

Trados TagEditor is a tool that has been widely used for website translation. I have owned it since 2003 and have upgraded through the different versions (I hardly noticed any difference between the last two or three upgrades), so by now I am quite familiar with it. TagEditor disappeared with the new SDL Trados 2009, but it is still widely used by translators. There is a lot of information about it on the web, but little is said about whether it is really adequate for website translation.

Yes, TagEditor can handle part of the job

The first question is whether it can handle HTML properly, and the answer is a big “YES, BUT”... Let’s be clear: It does a great job of identifying the HTML tags and making sure that you do not touch them and can concentrate on translating the text. Provided you do not touch the tags, and leave them in the correct places, you will certainly maintain the correct page format.

TagEditor stores the bilingual file in a proprietary format, and you can also view the source and translated texts side by side, properly formatted. Mind you, this is a *very* important feature, which explains why TagEditor has been widely used for the translation of web pages. It is really impressive to see the original format and the translated result and identify immediately where you have made a translation error or skipped a tag. It even allows you to verify (manually, by clicking on them) whether a link is the same in both the source page and the translated one. In this context, kudos for this software, old as it is.

TagEditor website translation side-by side with original

BUT:

Yes, there is a “but”: The “Save target as…” function, which restores the translated text into an HTML file, does all kinds of crazy things with it.

Problems with the character sets

For starters, it often changes the character set. In one translation (though not always; I don’t know whether it simply doesn’t like some character sets), I had the following meta tag:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

But TagEditor changed it arbitrarily to:

<meta http-equiv="content-type" content="text/html; charset=windows-1252">

Now, I don’t like it when a piece of software tries to be smarter than I am, especially if it makes the stupid assumption that because I ran TagEditor on Windows the target HTML file would also run on Windows, which obviously is not necessarily true on the Internet.
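To see what is actually at stake in such a switch, you can compare the two character sets byte by byte. This sketch relies only on the codecs that ship with Python; the function name is mine. It says nothing about why TagEditor changes the declaration, only where the two encodings disagree.

```python
def differing_bytes():
    """Return the byte values that ISO-8859-1 and windows-1252 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin = raw.decode("iso-8859-1")
        try:
            windows = raw.decode("windows-1252")
        except UnicodeDecodeError:
            windows = None  # this byte is undefined in windows-1252
        if latin != windows:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    diffs = differing_bytes()
    print(f"{len(diffs)} bytes differ, from {diffs[0]:#x} to {diffs[-1]:#x}")
    # Accented letters such as the "ó" in "adiós" are encoded identically.
    print("adiós".encode("iso-8859-1") == "adiós".encode("windows-1252"))
    # The euro sign only exists in windows-1252.
    print(b"\x80".decode("windows-1252"))
```

The two encodings disagree only in the range 0x80 to 0x9F, where windows-1252 places characters such as the euro sign and typographic quotes. Ordinary accented letters are unaffected, but a page that uses those extra characters and declares the wrong charset will display them incorrectly.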

Interference with CMS codes

It also replaced “>” with “&gt;” and “<” with “&lt;”. Yes, that is the correct representation of these characters in HTML, but it is plainly WRONG when there is code embedded in the HTML. One of the texts I translated had tags such as “<#=customername#>” that would let a CMS preprocessor insert the customer name at the location of the tag. But, as these tags were happily converted into “&lt;#=customername#&gt;”, the web page ultimately showed some very curious things.
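The effect is easy to reproduce, and for this particular placeholder syntax it is also easy to undo in a post-editing step. The sketch below uses Python's standard html module to show the escaping, and a regular expression of my own to restore placeholders of the form <#=name#>. It assumes that placeholder names consist of letters, digits and underscores only, and it leaves all other escaped characters untouched.

```python
import html
import re

# Matches an escaped CMS placeholder such as &lt;#=customername#&gt;
ESCAPED_PLACEHOLDER = re.compile(r"&lt;#=(\w+)#&gt;")


def restore_placeholders(markup):
    """Turn escaped <#=name#> placeholders back into their raw form."""
    return ESCAPED_PLACEHOLDER.sub(r"<#=\1#>", markup)


if __name__ == "__main__":
    original = "Dear <#=customername#>, 1 < 2"
    escaped = html.escape(original)
    print(escaped)
    print(restore_placeholders(escaped))
```

Note that the genuine “<” in “1 < 2” stays escaped, which is what you want in the final HTML.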

PHP scripts cannot be translated in TagEditor

A complete PHP script (meaning everything between “<?php” and “?>”) was identified as a “PHP” tag, and was not translatable. Unfortunately, the script contained text that was printed to the page on the server side and, given that you could not edit it in TagEditor, this text would be printed on the page in the original language. Now, in theory you can edit tags in TagEditor, but only if the tag is in a translation segment. A PHP tag is NEVER part of a translation segment, so it is not possible to edit it. This stupid detail would usually go undetected by most translators, unless they happened to have the tags expanded, which is usually not the case because of the visual clutter they cause.

META Tags cannot be edited

Another thing I really hated was that, though it did recognize META tags, it handled them exactly like PHP: TagEditor marks each one as a tag, and there is no way to edit it. This is especially annoying with important meta tags such as the description or the keywords, where the only way to translate them was AFTER the HTML in the target language had been generated.

Special characters will clash with CMS software

Finally, special characters such as “á” (which appear in many languages) were converted into their HTML entities (in the case of “ó”, “&oacute;”). In theory this should be OK for plain HTML pages. But in those cases where you actually had to paste or import the translated text back into a CMS that escapes its input again, the translation of a word like “adiós” (goodbye in Spanish) eventually became “adi&amp;oacute;s” in the HTML and showed up in the browser as “adi&oacute;s”.
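This double escaping can be reproduced with Python's standard html module. In the sketch below, to_named_entities is a hypothetical stand-in for what TagEditor does on export, and html.escape stands in for a CMS that escapes whatever you paste into it; neither is the actual code of either product.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace non-ASCII characters by named HTML entities where one exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        out.append(f"&{name};" if name else ch)
    return "".join(out)


if __name__ == "__main__":
    exported = to_named_entities("adiós")   # what the translation tool writes
    stored = html.escape(exported)          # what an escaping CMS stores
    shown = html.unescape(stored)           # what the browser displays
    print(exported, stored, shown, sep="\n")
    # Post-editing fix: decode the entities before importing into the CMS.
    print(html.unescape(exported))
```

The practical consequence is to decode the entities back into real characters before importing the text into a CMS that does its own escaping.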

Conclusion

To be fair, I must point out that TagEditor DOES recognize the “alt” and “title” attributes of pictures, and simply hides the remaining information such as height, width, etc. as tags, so it is reasonably smart and lets you edit the “alt” attributes and titles of pictures as plain text. It would also be unfair to blame it for other post-processing that a CMS might do, but it should at least have an option to disable the automatic conversion of special characters into their HTML codes.

Is there a way around this? Well, there is, in the sense that you can create a special DTD (Document Type Definition) which does not treat these elements as tags but as plain text. However, writing such a DTD is not for the faint-hearted, takes quite a while and does not solve all issues.

So ultimately yes, TagEditor is “somewhat” adequate for website translation, provided you are aware of the traps that you might fall into. And you should note that, though it does have some good points, it will not suffice: You will need to perform some post-editing (say, with a text editor or HTML editor) to clean up things like the character set, or the META tags that TagEditor refused to let you translate.

Though Trados TagEditor is still widely used, it has disappeared in the last version of the Trados suite. Is the latest Trados version better suited for the translation of websites? Well, we will look into that in another post…

In previous posts I discussed Basic Research in Language Recognition, the Difficulty of Language Identification and How Search Engines Recognize the Language of Your Pages. This might be very interesting, but the point is that you WANT search engines to properly recognize the language in which your site is written.

So we finally come to the core of the problem: How do we ensure that the search engines recognize the language of our original and localized pages? Because if your translation is not recognized as such, then your multilingual SEO efforts will be totally wasted. Here are a few tips to make sure that the search engines DO recognize your pages:

Do not mix languages!

Given that many search engines use a combination of n-gram algorithms and dictionary methods, language recognition is made more difficult if several languages coexist on the same page, so you risk that the page in question is identified as being in one of the languages on it, though not necessarily the main one. Murphy’s Law applies here too. If you must include some text in a different language (e.g., a quote), make sure that the remaining text is significantly more substantial than the quote.
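To illustrate why mixed-language pages are risky, here is a toy version of the n-gram idea. It is a deliberately naive sketch: the two “training” sentences are my own, real systems learn from far larger samples and use better scoring, and nothing here reflects how any particular search engine implements it. Each language gets a profile of character trigrams, and a text is assigned to the language whose profile covers most of its trigrams.

```python
from collections import Counter

# Tiny, hand-written training samples (an assumption for illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}


def trigrams(text):
    """Count character trigrams after normalizing case and whitespace."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def language_scores(text):
    """Return, per language, how many trigrams of `text` its profile knows."""
    grams = trigrams(text)
    return {
        lang: sum(count for gram, count in grams.items() if gram in profile)
        for lang, profile in PROFILES.items()
    }


def guess_language(text):
    """Return the language with the highest trigram overlap."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_language("the dog sleeps in the house"))
    print(guess_language("el gato duerme en la casa"))
    # A long English quote raises the English score of a Spanish page.
    print(language_scores("el gato duerme en la casa the dog sleeps in the house"))
```

The last line shows the problem: once a page mixes both languages, both scores are high, and which language “wins” depends on which text is more substantial.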

Use the “lang” attribute.

Yes, it’s true that Google does not trust it, but that does not mean that other search engines do not use it. And at the very least, if the “lang” attribute corresponds to Google’s “best guess”, then it will confirm that the page is indeed in that particular language. Something like <html lang="es"> will certainly help to indicate that the language is Spanish. It can’t hurt, so do it.

Make sure that there are no spurious “lang” attributes.

As pointed out previously, some HTML tools insert a “lang” attribute by default when you write additional text in an existing page. Now, this attribute may or may not correspond to the page language: if it does, it is redundant, and if it does not, it is misleading. Use this attribute only when you want to insert some text in a different language, and make sure that it corresponds to the correct language.

Use a language meta-tag.

For example, <meta http-equiv="Content-Language" content="es-es" /> for Spanish. Like the “lang” attribute, Google will ignore it, but other search engines might not. For example, Yahoo still gives it a lot of importance. Check out the W3.org page on HTTP and meta for language information.
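A quick way to audit what a page declares is to extract these values and list any “lang” attributes hiding further down in the markup. The sketch below uses Python's standard html.parser module; the class name, function name and sample page are my own, and the tool only reports what is declared without judging whether it is correct.

```python
from html.parser import HTMLParser


class DeclaredLanguage(HTMLParser):
    """Collect the language declarations found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.html_lang = None    # lang attribute on <html>
        self.meta_lang = None    # Content-Language meta-tag
        self.inline_langs = []   # (tag, lang) pairs elsewhere in the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if tag == "html":
            self.html_lang = attrs.get("lang")
        elif tag == "meta" and http_equiv == "content-language":
            self.meta_lang = attrs.get("content")
        elif "lang" in attrs:
            self.inline_langs.append((tag, attrs["lang"]))


def declared_languages(page):
    """Return the language declarations of `page` as a dictionary."""
    parser = DeclaredLanguage()
    parser.feed(page)
    return {
        "html": parser.html_lang,
        "meta": parser.meta_lang,
        "inline": parser.inline_langs,
    }


SAMPLE_PAGE = (
    '<html lang="es"><head>'
    '<meta http-equiv="Content-Language" content="es-es" />'
    '</head><body><p>Hola</p><p lang="en">Hello</p></body></html>'
)

if __name__ == "__main__":
    print(declared_languages(SAMPLE_PAGE))
```

Any entry in the “inline” list that you did not put there on purpose is a candidate for the spurious attributes mentioned above.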

Use Google Webmaster Tools.

Now, other search engines might not offer that possibility, but you had better make sure that Google acknowledges at least the main language of your site. If your localized versions are in subdomains, then identify those too. And of course, if you use geo-targeting, use the Set Geographic Target tool in Webmaster Tools.

Clean those HTML pages!

It is always good hygiene to remove the crap from your pages. This includes useless information such as the name of the tool that generated the page, or lots of meta-tags that you might have been copying mindlessly from one page to another, and then to another site. These meta-tags might be in a different language, further muddling the identification of your page language. Also make sure that you have no text in a different language in inconspicuous places, such as footers (e.g., because you used a template).

Make sure the meta-tags are in the same language.

It is incredible how often people leave the page title, keywords or description in the original language when they localize their web pages. Now, that’s a good way to confuse the search engines about the language used. And even if you don’t confuse them, you will confuse the users searching for your localized pages when a description pops up in a language they don’t understand.

Link to your pages from pages in the same language.

I pointed out in my recent post Why Your Existing Back Links Are Worthless why it was not a good idea to use external links written in a different language. Well, there’s another catch: Google also uses external links to determine the language of your site. So, if you have many links in English to your German pages, Google may decide that those German pages are in English.

Don’t link to pages in different languages from the same page.

It is not clear whether search engines consider also the language of linked pages to evaluate the language of your page, but I’d say better be safe than sorry. And why would you want to link to a page in a different language anyhow? Even if you speak it, your visitors might not!

Identify links to different languages using hreflang.

If you DO need to link to documents in other languages, at least use the hreflang attribute to indicate the language of the target document. It is unclear whether search engines use this particular attribute, but it’s worth a try:

<a href="http://www.seo-translator.com/" hreflang="en">seo-translator.com</a> to link to this page from a non-English page.

Do not store pages from different languages in the same location.

One of my posts discussed how to save the pages corresponding to the different languages in either separate subdirectories or subdomains. Apart from the fact that maintenance becomes a mess, if you mix pages of different languages in the same location, the search engine spider will find all those pages together and might deduce (incorrectly) that they are in the same language. I have no evidence that this actually occurs, but I personally would not take the risk.

Provide sufficient text for language recognition.

Interestingly, a lot of the academic studies stated that they had problems in language recognition because they did not have a sufficiently big sample of words to determine the page language. This will be also true for search engines, so make sure that your page holds sufficient text. Avoid pages stuffed with images and no text, and in particular avoid images containing text – the search engine robots do not include OCR software to detect what you have written in your graphics!

Use language-characteristic words and letters.

For those search engines that use dictionary-based approaches (and yes, perhaps even for Google) it is very helpful if they can detect words (or characters) that are specific to one language. “España” is a strong hint that the text is Spanish, and the “ñ” is a letter strongly associated with Spanish, although a few other languages (such as Galician, Basque and Filipino) use it too. Я identifies a letter of the Cyrillic alphabet, but Russian is not the only language using that alphabet. Likewise, ن tells you that the text is in Arabic script and ש that it is in Hebrew script, but Arabic script is also used for Persian and Urdu, and Hebrew script for Yiddish. Greek letters, however, are likely to be encountered in mathematical formulas, so don’t abuse them, or at least include them as a graphic. The German ß (as in Gruß) looks very much like the Greek β, but they are two different characters, so do not substitute one for the other. The French ç is also found in Portuguese, Catalan, Turkish and Albanian, among others, so do not think that it is the key to success!
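If you are unsure which script a character belongs to, the Unicode character database that ships with Python can tell you. The sketch below takes the first word of a character's official Unicode name as a rough indication of its script. That is a simplification of my own, good enough for the letters discussed here, and not a real script-detection algorithm.

```python
import unicodedata


def script_of(ch):
    """Return the first word of the character's Unicode name, e.g. 'CYRILLIC'."""
    return unicodedata.name(ch).split()[0]


def scripts_in(text):
    """Return the set of scripts used by the letters in `text`."""
    return {script_of(ch) for ch in text if ch.isalpha()}


if __name__ == "__main__":
    for ch in "ñЯنשßβ":
        print(ch, unicodedata.name(ch))
    print(scripts_in("Gruß aus España"))
```

The output makes the difference between the German ß (a Latin letter) and the Greek β explicit, and it shows that a script only narrows down the language without identifying it.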

A final word of warning.

If you combine all these little pieces of advice in your web pages, the probability that a search engine identifies the page language correctly is close to certainty. Ignore them at your own risk: incorrect language recognition by a search engine means that you will NEVER top the search results in your target language. Your ranking might actually be excellent, but perhaps in a language that you yourself do not understand, and your visitors even less. This is not theory – even a superficial search in the Google Webmaster Central help forum shows many people complaining that their pages dropped like rocks in the rankings because Google identified their language incorrectly. And if you were #1 in English, why would you want to exchange that for a #1 in Swahili or Afrikaans? Of course there is a market for those languages. But is it your market?

This closes the series of posts on language identification by the search engines. It is an interesting topic, and there is not much information around, so I’ve tried to compile all the information in one single place, including my personal impressions on how you could improve the recognition of your multilingual websites. Now, what do YOU think?


In the previous post (Language Identification is difficult) I highlighted how difficult it can be to identify a language. Yet Google does it somehow: it has over 160 domains and allows restricting search results to pages in 117 languages, so “somehow” it must be able to “understand” which language these pages are in. Based on the literature and the hints found in places such as Google’s Webmaster Central, search engines seem to use a variety of methods to detect the language of your web pages.

Meta-tag.

OK, Google ignores this, but other search engines (such as Yahoo!) will treat the <meta http-equiv="Content-Language" content="en" /> meta-tag as an important indication of the page language.
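
If you want to check what your own pages declare, a small script can pull out the code-level hints. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and it only reads the lang attribute of the html element and the Content-Language meta-tag.

```python
from html.parser import HTMLParser

class DeclaredLanguageParser(HTMLParser):
    """Collects the language hints declared in the markup of a page."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.meta_lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang")
        elif tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")

def declared_languages(html):
    """Return the declared languages found in an HTML string."""
    parser = DeclaredLanguageParser()
    parser.feed(html)
    return {"html_lang": parser.html_lang,
            "meta_content_language": parser.meta_lang}
```

Running this over a translated site is a quick way to spot pages where the two declarations disagree with each other or with the actual content.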

Dictionary methods.

Basically, this consists of building a list of commonly found words (usually stop words such as “the”) for each language, scanning unknown texts for those words, and assigning a language based on the matches. If a simple algorithm finds that a high percentage of the words in a text are in the dictionary for a specific language, that is a strong argument for assigning that language to the page.
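
A toy version of the dictionary method fits in a few lines of Python. The stop-word lists below are tiny, hand-picked samples chosen purely for illustration (a real system would use far larger lists), and the function name is my own.

```python
import re

# Tiny illustrative stop-word lists; an assumption, not a real dictionary.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "es": {"el", "la", "los", "y", "de", "que", "en", "es"},
    "fr": {"le", "la", "les", "et", "de", "que", "est", "un"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

def guess_language(text):
    """Return (language, share of words found in that language's list)."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not words:
        return None, 0.0
    scores = {
        lang: sum(word in vocab for word in words) / len(words)
        for lang, vocab in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)
```

The weakness is easy to see: a text without any listed words gets no language at all, and short texts give very unstable percentages.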

Probabilistic methods.

Now, I recently read an interesting article at SEO by the Sea about how search engines know the language of a query. OK, so strictly speaking it does not deal with identifying the language of web pages, but the basic problem is the same. Moreover, the article mentioned four Google patents which hint that Google might be using some kind of artificial intelligence and/or statistical method to identify the language of a text. And in an article in the Official Google Blog they actually mention the use of artificial intelligence. Google’s language detection tool also seems to point in the direction of probabilistic methods.

Thus, Google could compute the probabilities of all those words appearing on the page for the different languages, and use artificial intelligence and/or machine learning to predict the most likely language. Combined with character mapping (maps of characters that are more characteristic of certain languages than of others), this could provide a very high probability of correct identification.

Personally, I strongly doubt that the whole page is scanned this way (the required computing power for the calculation of billions of pages would be immense), but using a subsegment of the pages or an incremental scanning approach would be reasonable. Some research points out that a sample between 400 and 600 characters should be sufficient to identify a language, so why spend more effort on it?
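
To illustrate what a probabilistic method might look like, here is a small Python sketch of a naive Bayes-style scorer over character trigrams with add-one smoothing. This is my own illustration, not Google's algorithm: the class name, the two training sentences and the 600-character cap (taken from the sample size mentioned above) are all assumptions.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace and return overlapping character n-grams."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageModel:
    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams
        self.vocab = set()  # every n-gram seen in any language

    def train(self, lang, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(lang, Counter()).update(grams)
        self.vocab.update(grams)

    def log_score(self, lang, text):
        counts = self.counts[lang]
        total = sum(counts.values())
        v = len(self.vocab) + 1  # +1 leaves room for unseen n-grams
        return sum(math.log((counts[g] + 1) / (total + v))
                   for g in char_ngrams(text, self.n))

    def classify(self, text, max_chars=600):
        # Only a prefix of the page is scored, not the whole document.
        sample = text[:max_chars]
        return max(self.counts, key=lambda lang: self.log_score(lang, sample))

# Toy training data, far too small for real use.
model = NgramLanguageModel()
model.train("en", "the quick brown fox jumps over the lazy dog "
                  "and the cat sleeps in the house")
model.train("es", "el rápido zorro marrón salta sobre el perro perezoso "
                  "y el gato duerme en la casa")
```

With realistic training corpora the same structure applies; only the amount of data (and the care taken with smoothing) changes.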

For those interested in exploring this method, I found an interesting example of probabilistic language determination and training with LingPipe, a toolkit for processing text using computational linguistics.

Geo-location.

At least Google considers geo-location a factor for language detection. When you think about it, if a page indicates a geo-location, then that page is probably in the local language. A site offering services in Paris is likely to be in French, a page offering services in New York is likely to be in English. This is obviously not always 100% true – if the page is that of a hotel, chances are that it also has an English version, even if it is located in Paris.

Incoming links.

I noticed an interesting discussion at the Google Webmaster Central help forum about a site that was in English but was treated as Chinese. The ONLY connection between the site and Chinese was that it had a lot of incoming links from Chinese pages. Thus, we can reasonably assume that the language of incoming links – which Google knows, after all – might be a factor in language detection. There are a few hints to this in Google’s Webmaster Central, but nothing all too obvious. It seems logical; it is unlikely that people link to pages in a different language.

Outgoing links.

Katia Hayati (Ref. [3] in the first post of the series) highlighted that in the tested sample of web pages roughly 95% of all outgoing links led to pages written in the same language. Again, this makes sense: people link to what they understand. Now, I could not find anywhere even a hint that outgoing links are used by search engines for language recognition, but I personally would not rule it out.

Conclusion.

Unfortunately, the search engines themselves have not exactly published a wealth of information about how they detect the language of web pages. In the Official Google Blog you can find some information about voice recognition and how they detect the language of queries, but hardly anything about how they detect the language of the web pages themselves. Based on my research, my best guess is that less sophisticated engines will go for the dictionary method and more sophisticated engines will go for probabilistic methods (and hence artificial intelligence), with character mapping, geo-location and link language as complementary information for those cases where there are doubts. Less sophisticated search engines – and possibly also some of the major ones, except Google – will also consider code-level language information, either as the main criterion or as complementary information.

What does that mean for those involved in SEO translation or having multilingual websites? Well, for that you will have to wait for the next post… ;)

PART 4: How to Ensure Language Recognition by Search Engines


I highlighted in the previous post (Basic Research in Language Recognition) some of the academic research that is taking place and that provides the groundwork for language recognition and identification by the search engines. But why is the recognition of the language in web pages so difficult?

Machines don’t “understand” a language.

Machines are stupid. Even so-called artificial intelligence is light-years away from the intelligence of a 5-year-old. Any human being who knows how to read can immediately identify the languages he speaks, and often others as well. But machines – and search engines are just that – need to be programmed to analyze the text. For us, a text implies a meaning. For a search engine, it is just a binary stream of characters, though it “knows” (by means of a programmed rule) that words are sequences of characters separated by the “space” character. Certain words in its dictionary are associated with a certain language and others aren’t. Since it cannot understand the meaning of the text, it has to trust the fact that somebody has told it that certain words belong to a certain language. But it does not know what to do with words that are not on its list, as it cannot identify those words by context, as a human being would do.

Teaching a machine is far more difficult than teaching a 5-year-old: the 5-year-old has a brain that is adapted to learning and is far more powerful than any machine. (Perhaps I should add “as of today”.) And those who have kids know how difficult it is to teach them…

Languages might be mixed in the page text.

I’ve seen more than one page written in several languages in an attempt to be “helpful”. Speaking five languages fluently (and a few more less fluently), I did not find it helpful at all, rather the opposite: quite confusing. And for search engine spiders it will be even more confusing, as they will not be able, as a human reader would, to identify what is or is not in a different language.

Even assuming that a page is written “officially” in a single language, you might encounter other problems: in one of the Google Webmaster Central help forum posts, one of the issues was that the page was in Turkish, but it had footers and some other text in English that prevented correct recognition. Google can recognize multiple languages on a page (as stated in the same forums), but the volume of text in each language probably determines the language that is selected by the search engine for that particular page. It is unlikely (though not impossible) that Google classifies a page as being in several languages. And the other search engines probably won’t recognize more than one language at a time.

The code-level language information may be incorrect.

In the previously referenced post, there was an interesting statement by a Google employee: he indicated that Google ignores all code-level language information for language detection, because it is frequently copied and pasted, regardless of the language of the content on the page. But other search engines may not necessarily ignore such information, and this could mislead them into detecting the wrong language. And how long will it take before Google decides to penalize such sloppy practices that make language identification more difficult?

The “language” attribute may be spurious.

Many HTML editors set the default language to English or to the locale defined during tool installation. This is bad in the sense that the whole page may be classified as being in an incorrect language. Worse, when using a quite famous HTML editor I found to my dismay that when I inserted text into an existing page, it automatically added a “language” attribute for that text, which did not correspond at all to the page language. Thus, a Spanish page had “English” text all over the place… even though the whole page was written in Spanish. OK, so Google will ignore that. Curiously enough, Microsoft Bing research in 2008 indicated that the most common ‘standard’ lang tag showed up on just 0.000125% of pages on the web, so they did not consider it very useful. But will other search engines ignore it as well?

The domain name may be meaningless.

One could imagine identifying the language based on the URL name, and reference [2] in the previous post actually tried to identify it in exactly this way. But the great difficulty is that many domain names are nonsense (because they correspond to company names or brands), misspelled words, combinations of words that would have to be analyzed in every possible language, or arbitrary combinations of characters with numbers. Though the authors of the study reported a certain success, it is questionable whether they could repeat it with domain names of these types. In any case, Google has reported that it does not detect language based on the URL but on content.

The TLD and country code mean nothing.

Obviously, everybody and his grandmother has been using (and abusing) the generic TLD codes such as .com, .net, .org or .biz. You can find domains with these extensions in almost every language there is. But even a country-specific domain does not necessarily indicate a specific language. For example, .tv is the Internet country code top-level domain (ccTLD) for the islands of Tuvalu. And it is used by television stations around the world. Does it mean that pages ending in .tv will be written in Tuvaluan or English, which are the two official languages of those islands? Don’t count on it. And even more exotic ccTLDs such as .mn (Mongolia) are abused to such an extent that the Minnesota Senate is mapped to senate.mn. Now, don’t expect that particular page to be in Mongolian! With the exception of certain ccTLDs, there is a great probability that the language will match the ccTLD, but it is a probability, not a certainty. And that does not apply to the generic TLDs such as .com.

The character set does not necessarily help.

Some character sets identify the language, but others do not. For example, if a page uses the portion of EUC-JP (a Japanese character set) that is not pure ASCII, then it is almost certainly in Japanese; you can reasonably expect such characters to be used only on a Japanese page. On the other hand, most applications in Western languages typically use Latin-1 (also known by its standard, ISO-8859-1), which is also the default encoding for legacy HTML documents, while web pages typically use UTF-8. Neither of these two character sets allows distinguishing between English, French, German or Spanish texts.
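
The point is easy to verify yourself. The following Python sketch (the sample sentences and the helper name are mine) checks which texts can be encoded in which character set, using only the codecs that ship with Python.

```python
# Illustrative sample sentences; any text in these languages would do.
SAMPLES = {
    "English": "The weather is nice today",
    "French": "Le garçon est allé à l'école",
    "German": "Die Straße ist schön",
    "Spanish": "El niño está en España",
    "Japanese": "こんにちは世界",
}

def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Which of the sample languages fit into Latin-1?
latin1_ok = [lang for lang, text in SAMPLES.items() if can_encode(text, "latin-1")]
```

All four Western samples fit into Latin-1, so the encoding alone cannot tell them apart, while the Japanese sample does not fit at all and needs something like EUC-JP or UTF-8.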

Meta-tags might be in the incorrect language.

Curiously, I have found translated pages where the meta-tags were still in the original language. And I am not talking about exotic tags; I am talking about important meta-tags such as the site description and the keywords. In addition, some HTML editing tools add meta-tags of their own, such as the program name. I remember that Microsoft FrontPage did that, but it was not the only one, and these meta-tags might be read by some search engines and understood as being in a different language (usually English).

Hosting location or server IP is irrelevant.

One might think that the hosting location or the IP of the server might be relevant for language detection. Perhaps it was, ten or fifteen years ago. It no longer is. A hosting site on the other side of the world is just one click away. I live in Europe, and I have some of my sites hosted in Texas and the rest in India. The sites are in five different languages. Oh, and none of them are in Hindi (which I do not speak), nor are all the sites hosted in the U.S. in English.

The problem is therefore not as simple as it looks. You may have recognized at a glance that this SEO Translation blog is in English, but a search engine will face much bigger problems in reaching the same conclusion that you reached in a split second…

In the next post (PART 3: How Search Engines Identify Language) we’ll see how search engines DO recognize web pages, and how to make sure that they’ll recognize your multilingual site.


One of the advantages of being an IT professional as well as a translator is that as part of my IT duties I try to keep ahead of the wave and therefore keep an eye on state-of-the-art research. This, obviously, means being a member of professional societies such as the Association for Computing Machinery (ACM) or the IEEE Computer Society, and reading the content of their digital libraries as well as their publications. I also keep an eye on the most relevant SEO blogs and forums and of course on the Google webmaster blog.

So I stumbled on several complaints by users that the search engines – and Google in particular – were identifying their non-English (and sometimes even their English) pages as being in an incorrect language. This made me wonder how search engines identify the language of a specific website, as this is of the greatest importance for SEO translation and even for multilingual SEO, once the translation has been performed.

I started my investigation in the academic community, as I have evidence that the search engines (and in particular Google) pay attention to the research in this field. A search in the digital libraries of the ACM and the Computer Society was however quite disappointing. I found quite a few theoretical papers on language recognition, but not specific to the Web. There were also a number of interesting academic papers about the recognition of the language in search queries, and a quite interesting paper on how culture affects web design (which I will address in a future post), but only six papers and one book shed some light on this particular issue:

  1. Indexing the Indonesian Web: Language Identification and Miscellaneous Issues, by Vinsensius Berlian Vega and Stéphane Bressan. Though very brief, it provides some interesting insights on the difficulties regarding language recognition.
  2. Web Page Language Identification Based on URLs, by Eda Baykan, Monica Henzinger and Ingmar Weber. This paper discussed several machine learning algorithms for language identification, namely Naïve Bayes, Relative Entropy, Maximum Entropy and Decision Tree. Curiously, the study was attempting to identify the language of a web page using only its URL.
  3. Language identification on the World Wide Web, by Katia Hayati, as part of her master’s degree in Computer Science.  The basic approach is the classic n-gram based algorithm, supplemented by the Fisher discriminant function. The math and algorithms are relatively simple, but it contains some hidden pearls in the text that I will identify later on, and which are consistent with what Google is saying.
  4. Language identification of on-line documents using word shapes, by N. Nobile, S. Bergler, C.Y. Suen and S. Khoury. The authors developed character classes using visual characteristics. Contrary to the previous paper, they do not use n-grams, but just bigrams and trigrams combining scores and an expert system.
  5. Language Identification in Web Pages, by Bruno Martins and Mário J. Silva. These also construct their study on the well-known n-gram algorithm initially proposed by W. B. Cavnar and J. M. Trenkle (see [6]).
  6. N-gram-based text categorization, by W. B. Cavnar and J. M. Trenkle. The basic n-gram paper that is referenced by everybody else.
  7. Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti.  Strictly speaking it does not indicate how search engines discover your language, but it provides a profound insight on knowledge discovery – and ultimately identifying a language is exactly that.
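
Since the n-gram approach of reference [6] underlies several of these papers, a rough Python sketch of its core idea may help: rank the most frequent n-grams of each language, do the same for the unknown text, and pick the language whose ranking is "least out of place". This is a simplified illustration under my own assumptions (trigrams only, a profile size of 300 and toy training sentences), not a faithful reimplementation of the paper.

```python
from collections import Counter

def ngram_profile(text, n=3, size=300):
    """Map the `size` most frequent character n-grams to their rank (0 = most frequent)."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum the rank differences; n-grams unknown to the language cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def closest_language(text, lang_profiles):
    """Return the language whose profile is nearest to the profile of text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy profiles built from a single sentence each; real profiles need real corpora.
PROFILES = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog "
                        "and the cat sleeps in the house"),
    "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso "
                        "y el gato duerme en la casa"),
}
```

The attraction of this approach is that it needs no dictionary and no understanding of the text, only counting.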

Now, is this purely academic stuff useful? Actually it is, if you want to understand how search engines detect languages. If you have a look at the Google Research Blog, you will find that they DO look at academic papers for their own research. And one particular article that caught my eye was the one titled All Our N-gram are Belong To You. It literally states that “Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction and others”. This is a strong hint that Google uses n-grams to recognize language, as it has repeatedly stated that it does not use the “language attribute”. But we’ll leave that for the next part of this post…

PART 2: Language Identification is difficult


One interesting thing, which is almost never considered, is the fact that error pages very often will pop up in the original site language. Why? Because the webmaster never thought about it in the first place.

Now, from the user experience point of view, this is not exactly thrilling. Imagine that you are navigating a localized French page, and you get a 404 error (Page not found) in English. The user might not understand English in the first place, and he might not even know what a 404 error is. Will he click back to the page where he was? Will he leave altogether? And what will he think of, or do with, more esoteric error codes, such as a 503 (Service Unavailable) error? (A list of HTTP error codes can be found for example in the help for Google webmaster tools.)

The worst possibility is that he came to your site from an external link, and therefore (if he goes back to the previous page) will never have the opportunity to get to know your products or services. A potential customer is lost simply because there was no error handling page that could have sent him in the correct direction.

The fact that a user encounters an error is something that all webmasters should consider, as it impacts the user experience and therefore the possibility to lose clients. Thus, well thought-out sites have special error pages offering assistance such as suggestions on possible pages, a link to the home page or sometimes just a funny message that makes the user forget about the fact that he wanted something that is simply not there.

Sample 404 error


But just as bad as not having an error page is forgetting to localize your error pages when you localize your site. I remember that I was once browsing the English pages of a very interesting Russian site when I encountered a 404 error page – in Russian. I studied some Russian 25 years ago, but I’ve forgotten it all by now, so I did not understand anything.

If you have made the effort to create error pages, also make an effort to localize them, and make sure that when a localized page is not found, the localized error page pops up. It is not very difficult. For example, if you create your localized site in a subdirectory, then a simple modification to the .htaccess file will ensure that the error page pops up in the correct language.
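
On an Apache server this is typically done with an ErrorDocument directive in the .htaccess file of each language subdirectory. The underlying logic is simple enough to sketch in Python: look at the first path segment and choose the matching error page. The paths and the function name below are hypothetical examples, not taken from any real site.

```python
# Hypothetical localized error pages, one per language subdirectory.
ERROR_PAGES = {
    "fr": "/fr/erreur-404.html",
    "es": "/es/error-404.html",
}
DEFAULT_ERROR_PAGE = "/error-404.html"

def error_page_for(path):
    """Choose the error page that matches the language subdirectory of path."""
    first_segment = path.lstrip("/").split("/", 1)[0].lower()
    return ERROR_PAGES.get(first_segment, DEFAULT_ERROR_PAGE)
```

A request for a missing page under /fr/ would then be answered with the French error page, and anything outside the localized subdirectories falls back to the default one.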

A different way would be to detect the browser locale or user language preferences, and present the localized version; it depends on how you handle localized content. But, as I will explain in a different post, this might not be a good idea.

But again, when you localize your error pages, remember for which culture you are writing; do not simply copy the original pages. An error page like the one shown above might be funny in many cultures, but would be offensive in others. Error pages merit the same localization effort as the other pages of your site, as these will make the difference when a user ends up seeing them.

The user experience depends not only on translating the text, but rather on the fact that the user “sees” the site as if it had been made in his language and culture. A small detail such as a localized error page might not seem important, but customers appreciate small details, and will trust you more than if you just took care of the superficial varnish.
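
For readers who still want to see how such detection works, browsers send their language preferences in the Accept-Language request header, with optional q weights. Here is a minimal Python sketch of parsing that header; the function names and the fallback to "en" are my own assumptions, and the parsing is deliberately simplified.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (tag, weight) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        weight = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0
        prefs.append((tag.strip().lower(), weight))
    # sorted() is stable, so equal weights keep the order of the header.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)

def pick_language(header, available, default="en"):
    """Return the first preferred language that the site actually offers."""
    for tag, weight in parse_accept_language(header):
        if weight <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "fr-ch" falls back to "fr"
        if primary in available:
            return primary
    return default
```

Keep in mind that the browser setting reflects the software configuration, which is not always the language the visitor actually prefers to read.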

Also keep the search engines in mind, and make sure that your localized error pages comply with the accepted rules for these kinds of pages. For example, Google discourages the use of so-called “soft 404s” because they can be a confusing experience for users and search engines. A recent post on the Official Google Webmaster Central Blog, however, described a way to correct soft 404s using Google Webmaster Tools.

But look at the localized error pages also as an opportunity: Google acknowledges that it uses the links on error pages, even though it considers these of less importance, so make sure that those links reinforce your localized site!
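
A soft 404 is an error page that is delivered with a "200 OK" status instead of a real 404. To show what the correct behaviour looks like, here is a minimal Python WSGI sketch that returns a localized message together with a genuine 404 status. The page list, the messages and the helper for calling the app are all invented for illustration.

```python
# Hypothetical site data for the example.
MESSAGES = {
    "en": "Page not found",
    "fr": "Page introuvable",
    "es": "Página no encontrada",
}
KNOWN_PAGES = {"/", "/fr/", "/es/"}

def app(environ, start_response):
    """Tiny WSGI app: real pages get 200, everything else a localized 404."""
    path = environ.get("PATH_INFO", "/")
    if path in KNOWN_PAGES:
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"OK"]
    first_segment = path.lstrip("/").split("/", 1)[0]
    lang = first_segment if first_segment in MESSAGES else "en"
    # The status line is what matters: a friendly page sent with
    # "200 OK" would be a soft 404.
    start_response("404 Not Found", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Language", lang),
    ])
    return [MESSAGES[lang].encode("utf-8")]

def call(path):
    """Call the app directly and return (status, headers, body) for inspection."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body.decode("utf-8")
```

However friendly or funny the localized page is, the status line must still say 404 so that search engines do not index the error page as regular content.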
