International SEO: When Google Delivers Unexpected Results Despite Correctly Implemented HREFLANG

Multilingual websites and international SEO often pose a challenge for the korrekte Implementierung des hreflang-Attributs. This attribute is intended to help search engines select the appropriate language and country version of a page for users. However, even with correct implementation, unexpected results can occur if different country versions are created for the same language area, as can happen with Austria, Germany, and Switzerland or England, USA, Australia, etc.

In the Search Console, there is a note that the canonical URL specified by the user was not used for indexing, but another URL was selected by Google as the canonical URL. Example – here the URL was anonymized using Dev Tools:

Even an assignment of the content through ccTLDs (.de and .at) is not helpful. The problem can also arise with top-level domains. So to continue the fictional example:

Canonical assignment for top-level domains

The mechanism behind the problem

This unexpected behavior from Google can occur when the search engine crawler indexes pages from different countries that are very similar in content. For example, Google first crawls the German (DE) version of a page, calculates the so-called hash value, and indexes the page with the DE URL as the canonical version. When Google subsequently crawls the Austrian (AT) version of the page, it calculates the hash value again and recognizes that it is identical to the already indexed DE version. Instead of indexing the AT URL separately, Google adds it to the already existing document.

Hashing is a technique that Google uses to index the content of web pages more efficiently and to detect duplicates. A hash value is a unique, cryptographic value (a type of fingerprint) that is generated by a hash function based on the content of a web page. If two pages have the same hash value, it indicates that their content is identical or very similar. It can be considered secure that Google Simhash einsetzt bei Crawling.

SimHash for comparing content and detecting duplicates

Simhash is a special technique used by Google to efficiently analyze and compare content, particularly with regard to detecting duplicate content. Simhash allows for quickly checking large amounts of data for similarities without having to compare each individual piece of content in its entirety.

Simhash works by extracting specific features from a text, such as frequent words, phrases, or structures. These features are transformed into a vector that represents the main characteristics of the text. After vectorization, hashing takes place. A shorter bit sequence (the Simhash) is generated from this vector using a hash function. The trick is that similar texts lead to similar Simhash values, allowing for quick identification of similar or nearly identical content.

Google uses Simhash for duplicate content detection, which allows it to quickly and efficiently identify duplicate or very similar content on different websites. If two pages have a similar Simhash value, Google can use this as an indicator that the content is very similar and may choose to include only one version of the page in the search index.

Instead of performing a complete text analysis during each crawling operation, Google can quickly check whether the content of a page significantly differs from other content using Simhash. This saves resources and speeds up the crawling and indexing process.

An important aspect of Simhash is its ability to recognize similar content based on the Hamming distance. The Hamming distance measures the number of bit positions in which two Simhash values differ. A small Hamming distance between two Simhash values indicates that the corresponding content is very similar. Unlike traditional hash methods, where even a small change in content can lead to a completely different hash value (collision resistance), Simhash generates similar hash values for similar content, making it particularly useful for detecting nearly identical content.

Pages with similar Simhash values may perform worse in ranking because they are considered duplicate content or less original, as Google aims to display unique and high-quality content in search results.

By the way: SimHash can also be used by operators of very large websites that work a lot with text spinning and programmatic SEO in content creation to automatically compare content and measure the degree of duplicate content.

Why hreflang sometimes doesn’t work

Google uses – presumably alongside other methods – SimHash to analyze the content of pages and determine if pages are duplicates. If two pages, like the DE and AT versions, have the same SimHash value, Google considers them identical and selects a canonical URL, regardless of the hreflang tags. Hreflang tags, just like html-lang attributes or canonical tags, are non-binding for Google, unlike noindex tags or directives in the robots.txt. Therefore, Google does what it sees fit in the interest of the best user experience. The determination of which URL is set as canonical is based on various factors, such as the number of incoming links, positive user signals, and other criteria. This canonical URL is displayed as such in Google Search Console (GSC) and is assigned all traffic.

However, this does not necessarily mean that users will always see this canonical URL in the search results. Rather, Google displays the URL that is most relevant to the user – which does not always have to be the canonical URL. Yes, I know, it sounds complex to the point of being contradictory.

If there are slight differences between the pages – such as a small module on one page that is missing from the other – this can lead to a distinction sufficient to be considered individual content. However, even a minimal difference in search engine guidelines or a change in page content or design can cause Google to treat the pages differently or to consolidate them.

Sometimes Google’s behavior is truly remarkable. I personally know of an online furniture store that has both the .at website for Austrian customers and the .de website for German customers. According to Search Console, Google has chosen the .at URL instead of the .de URL as the canonical URL, even though the .de URL has been defined as the canonical URL for the German site by the website itself.

The case became particularly strange when the furniture shop took the product offline on the .at page and displayed a 404 error.

The URL is still in the index and leads to a 404 page, while the German page still features the product but is not indexed.

The .de page still shows the page content, but the article is sold out. Still, it is surprising that Google has the 404 page indexed and does not have the German counterpart with available page content.

How the problem affects the GSC data

The behavior described by Google, in which different country versions of a page are grouped together and a canonical URL is selected, has direct implications for the significance of the data in the Google Search Console (GSC). This issue causes the performance data displayed in the GSC to not always reflect the actual presentation of the search results.

When Google recognizes different country versions of a page as duplicates and selects one of them as the canonical URL, the entire traffic received by the various versions is attributed to the canonical URL in Google Search Console (GSC). This means that the GSC may only display traffic data for the DE version of a page, even though users in Austria have actually seen the AT version. This skews the data and makes it difficult to accurately measure the success of individual country versions.

The ranking data can also be influenced by this behavior. If Google determines the AT URL as canonical, but still shows users in Germany the DE version, the ranking data in the GSC can be misleading as well. This discrepancy occurs because the GSC data are based on the canonical URL, while the actual search results may display a different URL.

In the GSC, it appears that only one URL is performing, while in reality, several URLs are displayed in different countries. This complicates the analysis and understanding of performance at the country level, especially when companies purposefully apply different content or optimizations for various markets.

Due to this behavior from Google, SEO teams need to adjust their analysis approaches. Instead of solely relying on the data displayed in the GSC, external tools could be used to verify which URLs are actually shown in the search results across different countries. Additionally, a closer look at the filter options in the GSC may be necessary, for example, by analyzing the data by country and country directory to get a clearer picture of the actual performance in the respective markets. Also, Regex queries in the Search Console can be helpful in this case. A separate article from me on this topic will follow shortly.

Strategies for Solutions

It would be best if Google considered each language-country variant as a canonical URL. As we can see, this is not certain for identical languages. One possible solution would be to question the necessity of country variants and examine whether it makes more sense to create only one version for linguistically identical markets. This can be particularly helpful in cases where the differences between countries are minimal. If there are not even any differences in currency, such as between Germany (Euro) and Switzerland (Franc), it makes little sense to create an additional /de-AT/ directory for Austria with Euro, where all content is duplicated.

How much Google is now limiting the indexing of pages can be easily measured through an InURL query in Google Search. An example in the following screenshot: The sitemap of the website lists 644 pages for /de-CH/ and another 1312 for /de-DE/. In the search index, there are actually considerably fewer pages. Furthermore, this website has another directory for /de-AT/. It can be assumed that many pages are not in the index and many others have been consolidated into a canonical URL for the German language.

Some SEO tools also signal issues after crawling regarding Duplicate Content. For this reason, where there are no differences in content, offerings, delivery, or other specific characteristics varying by country, it is better to rely on language directories and consolidate the three directories for Germany, Austria, and Switzerland under /de/.

Another – and probably the best – strategy for correctly canonizing desired URLs is to strongly differentiate the content. Even if the language is the same, differences in metadata, main content, or user guidance can help convince the Google algorithm that these are standalone contents assigned to a specific region. For example, one could use variables to greet website visitors from the respective country. A warm welcome from <variable>, nice to have you here. For <variable> it would then be used accordingly: Austria for de-at, Germany for de-de, Switzerland for de-ch, and so on. The same can be applied to… our delivery terms for <variable>. The more differentiated the content, the more individual the hash value will be, and the more likely Google will determine each page as its own canonical URL.

A third approach could also be to use hreflang tags in sitemaps, as this can lead to clearer assignments by Google in some cases. We ourselves also include our hreflang tags in the sitemap and limit ourselves to the pure language variant without the country code.

In my opinion, the most effective approach remains to differentiate the content as much as possible in order to send clear signals to Google that these are different pages.