X-Robots header
Google is very greedy: it tries to crawl and index almost every file type and URL it can find. One consequence is that it will also try to index PDFs and binary formats such as Word documents and Excel spreadsheets.
The problem with these files is that you can’t add meta tags to them, because there is no HTML to put the tags in. So, you can’t apply a noindex as you normally would.
The problem gets even bigger if you think not only about files but also about fragments such as JSON objects, responses to AJAX calls, or anything else that Googlebot can process.
So, we need a way to tell Google and other search engines how to deal with non-HTML files, and that is why Google introduced the X-Robots directives, which are sent as HTTP response headers.
There are three types:
- The first one is the X-Robots-Tag header. The idea is that you can apply a noindex not only through a meta tag in HTML but also through an HTTP response header. If you have a non-HTML file type that you don’t want to show up in the search results, you can use the same directives with the same values and send them from your web server. The syntax depends on which web server you are using. On Apache you can, for example, set it in .htaccess; on NGINX you can set it in the server configuration.
- Another header that was introduced a bit later is often called X-Robots Rel-Canonical. Technically it is the HTTP Link header carrying rel="canonical". What happens if, say, you have a white paper as a PDF and the same content as an HTML page on your website? The PDF may attract stronger external links, which makes it likely that Google will show the PDF in the search results. That is not the best user experience, because people have to download the full PDF and wait for it. It would be better for them to land on a proper web page with full navigation. With the rel-canonical header you set the canonical for this specific PDF at the server level, in a header instead of an HTML tag, and point from the PDF to the HTML version. Even if the PDF is more relevant or more strongly linked, Google will then normally show the HTML version in the search results. Keep in mind that a canonical is a strong hint, not a guarantee.
- It should also be mentioned that hreflang, which we will cover in more detail in the international chapter, can be applied through server headers as well (again using the Link header).
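To make the three header types above concrete, here is a minimal Python sketch that builds the header name/value pairs a server could send. The function names and the example.com URLs are assumptions made for illustration only; how you attach the headers to a response depends on your web server or framework.

```python
def noindex_header():
    """Header pair that keeps a non-HTML file out of the index."""
    return ("X-Robots-Tag", "noindex")


def canonical_header(html_url):
    """Header pair that points a file (e.g. a PDF) at its HTML version."""
    return ("Link", f'<{html_url}>; rel="canonical"')


def hreflang_header(alternates):
    """Header pair listing language alternates.

    `alternates` maps an hreflang value (e.g. "en", "de") to a URL.
    """
    value = ", ".join(
        f'<{url}>; rel="alternate"; hreflang="{lang}"'
        for lang, url in alternates.items()
    )
    return ("Link", value)


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the resulting header lines.
    examples = [
        noindex_header(),
        canonical_header("https://example.com/whitepaper/"),
        hreflang_header({
            "en": "https://example.com/en/whitepaper.pdf",
            "de": "https://example.com/de/whitepaper.pdf",
        }),
    ]
    for name, value in examples:
        print(f"{name}: {value}")
```

Note that the canonical and the hreflang annotations both travel in the Link header; only the noindex directive uses X-Robots-Tag.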
Generally, X-Robots directives are meant for everything that is not HTML. However, if you prefer, you don’t have to use HTML annotations at all: you can control everything at the server level. Keep in mind that some crawling tools are still not really capable of showing directives set in server headers, so there might be issues with those. But from Google’s perspective, it doesn’t matter whether you use HTML or server-side annotations.
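If your crawling tool does not surface header-based directives, you can check them yourself. The following is a rough sketch, assuming you already have the response headers as (name, value) pairs; the function name `indexing_directives` is made up for this example, and the parsing is deliberately simplified (it ignores user-agent prefixes such as "googlebot:" in X-Robots-Tag and expects rel to directly follow the URL in the Link header).

```python
import re

# Simplified pattern: <url>; rel="canonical" (quotes around the value optional).
CANONICAL_RE = re.compile(r'<([^>]*)>\s*;\s*rel="?canonical"?', re.IGNORECASE)


def indexing_directives(headers):
    """Collect robots directives and the canonical URL from response headers.

    `headers` is an iterable of (name, value) pairs.
    """
    result = {"robots": [], "canonical": None}
    for name, value in headers:
        key = name.lower()  # header names are case-insensitive
        if key == "x-robots-tag":
            # Several directives may be comma-separated in one header.
            result["robots"] += [
                part.strip().lower() for part in value.split(",") if part.strip()
            ]
        elif key == "link":
            match = CANONICAL_RE.search(value)
            if match:
                result["canonical"] = match.group(1)
    return result


if __name__ == "__main__":
    # Hypothetical headers, as a server might send them for a PDF.
    sample = [
        ("Content-Type", "application/pdf"),
        ("X-Robots-Tag", "noindex, nofollow"),
        ("Link", '<https://example.com/whitepaper/>; rel="canonical"'),
    ]
    print(indexing_directives(sample))
```

With Python's standard library, one way to obtain such pairs is `response.headers.items()` on the object returned by `urllib.request.urlopen`.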