Log File Auditing
What is Log File Auditing?
Log files are generally stored on a web server and – simply speaking – contain a history of every request made to your website by a person or crawler. Log file auditing can show you how search engine crawlers are actually handling your site.
Why should you care about log file auditing?
- They help you understand crawl priorities. You can see which pages are prioritized by search engines and should, therefore, be considered the most important.
- They help prevent reduced crawling. Google may reduce its crawl frequency – and eventually rank you lower – if you constantly serve huge numbers of errors.
- They reveal global issues. You can identify crawl shortcomings (such as hierarchy or internal link structure problems) with potential site-wide implications.
- They let you verify proper crawling. You want to make sure Google is crawling everything important: primarily ranking-relevant content, but also both older and fresh items.
- They let you verify proper linking. You want any gained link equity to be passed via proper links and/or redirects.
Only access log files can show how a search engine’s crawler is behaving on your site; all crawling tools are simply trying to simulate their behavior.
Characteristics of a log file
The characteristics of a log file are relatively simple: it is just a text file. The content and structure of log files vary depending on your web server (Apache, NGINX, IIS, etc.), caching layer, and configuration. Make sure to identify which setup you are running and how things look from an infrastructure perspective.
Usually, a log file entry contains:
- the server IP/hostname
- the timestamp of the request
- the method of the request (usually GET/POST)
- the request URL
- the HTTP status code (e.g., 200 when everything is fine, or 301 for a redirect)
- the size of the response in bytes – depending, of course, on whether the server is configured to store this information
- the user-agent, which helps you understand whether the request actually came from a crawler or not
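To make these fields concrete, here is a minimal sketch of parsing a single entry with Python's standard library. It assumes Apache's "combined" log format, which is a common default; the pattern (and the sample line, which is invented for illustration) would need adjusting for other server configurations:

```python
import re

# Regex for Apache's "combined" log format (a common default; adjust
# the pattern if your server writes a different layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Invented sample entry for demonstration purposes.
sample = ('66.249.66.1 - - [20/May/2024:06:25:12 +0000] '
          '"GET /products/widget HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

entry = parse_line(sample)
print(entry["url"], entry["status"])  # /products/widget 200
```

Once each line is a dict like this, all the reports described below become straightforward filtering and counting.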
When you work with log file data, you need to ask the right questions. Log file data can be quite overwhelming because you can do so many different things; make sure you’ve got your questions prepared.
Log file data can be very different from Google Analytics data, for example. While log files are direct, server-side records, Google Analytics relies on client-side code. As the datasets come from two different sources, they can differ.
When requesting access to a log file, keep in mind that you do not need any personal information – so when you talk to your IT team or a client, there is nothing to worry about. It's essentially only about the crawler requests from Google or Bing. There is no need for any user data (operating system, browser, phone number, usernames, etc.); it is not relevant.
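A related caveat: the user-agent string alone can be spoofed by scrapers pretending to be Googlebot. Both Google and Bing document a verification approach based on a reverse DNS lookup of the requesting IP followed by a forward lookup. A sketch (the hostname suffixes are the documented ones; the sample hostname is illustrative):

```python
import socket

# Official crawler hostnames end in these suffixes (per Google's and
# Bing's crawler-verification documentation); the UA string alone can be faked.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_verified_crawler(hostname):
    """Pure check: does a resolved hostname belong to a known crawler?"""
    return hostname.endswith(CRAWLER_SUFFIXES)

def verify_crawler_ip(ip):
    """Reverse-DNS the IP, check the suffix, then forward-resolve the
    hostname and confirm it maps back to the same IP (requires network)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_verified_crawler(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False

print(hostname_is_verified_crawler("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_verified_crawler("scraper.example.com"))              # False
```

In practice you would run this only on suspicious IPs, since doing DNS lookups for every log line is slow.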
Also, you need to be aware of how the server infrastructure is set up. If you are running a cache server, a proxy, and/or a CDN that creates logs elsewhere, you will need those logs as well to get the whole picture.
Log file auditing tools
Once you have your log files ready, it is time to start dealing with the data. There are different ways to approach this. It is not a good idea to rely on a simple text editor: log files are often hundreds of megabytes in size, and trying to open a file that large will cause problems.
For a small site, you can start with DIY solutions based on Excel or Google Sheets. You'd have to build filtering, cross-references, etc. manually, so it is not scalable. There are no nice dashboards and graphs – you'd need to build those first – which is clearly not the simplest way to approach this.
One of the better ways – especially from an SEO perspective – is the Screaming Frog Log File Analyser. It is a beginner-level, desktop-based log file auditing tool with some very useful pre-defined reports. It has a simple interface where you can drill down into different reports, understand crawl events and behavior, see response codes, etc.
However, it does not have any sharing capability. Also, you need to download log files manually from the server and import them into the tool; if the files are large, this can take forever. It is a beginner-level tool for small- and medium-size sites.
Another solution is the Elastic Stack – formerly known as the ELK Stack. It consists of three different tools:
- Elasticsearch: search & analytics engine,
- Logstash: server-side data processing pipeline,
- Kibana: data visualization (charts, graphs, etc.)
The great thing about this one is that it's open source and therefore free to use. On the other hand, you have to set it up and run it on your own server infrastructure, so it needs IT resources.
Other SaaS solutions include logrunner.io, logz.io & Loggly. They are all based on the Elastic Stack but focused on SEO auditing, so they have dashboards where you can, for example, see crawl behavior over time or response codes per crawler.
The beauty of SaaS solutions is that they are almost real time. You pipe your log files into the system, and very quickly you can see what's happening on your website.
It is important to integrate working with your log files into your regular SEO workflow, rather than relying on one-off audits. One-off audits are fine for a start, but log file audits really become invaluable when you combine them with web crawl data and run them on an ongoing basis. Besides, messing around with exports, uploads, and downloads is frustrating.
I’d generally recommend finding something that fits your requirements. All the tools have limitations, and pricing is usually based on the volume of log data processed per month. The advantage of SaaS solutions is the possibility of sharing reports with a team or a client. If you’re doing migrations, this makes things easier because you can see everything that happens in real time.
Log file auditing and essential reports
One of the most obvious things you can do with log file data is try to spot anomalies within a time frame. For this, you need log file data going back a couple of days or even months. You might see spikes in crawl behavior, e.g., Googlebot crawling very aggressively for one or two specific days. Or, for instance, if you want to be found in China but it does not seem to be happening, and the log files show that Baidu does not crawl your site at all, that would indicate a problem.
You can also break the data down to just the bots that are actually accessing the website and get an idea of which other types of crawlers are coming and processing data from your site. A current use case is the Google mobile-first indexing (MFI) switch, where you can see whether the Googlebot smartphone crawler has overtaken Googlebot desktop in terms of crawl volume.
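A sketch of this kind of time-series breakdown, assuming entries have already been parsed into dicts with a `timestamp` and `user_agent` field (the classification rules and sample data here are simplified for illustration – Google's smartphone crawler announces itself with a mobile UA string):

```python
from collections import Counter

def bot_family(user_agent):
    """Roughly classify a request by crawler, based on the UA string."""
    ua = user_agent.lower()
    if "googlebot" in ua:
        # Googlebot smartphone carries a mobile device token in its UA.
        return "googlebot-smartphone" if ("android" in ua or "iphone" in ua) else "googlebot-desktop"
    if "bingbot" in ua:
        return "bingbot"
    if "baiduspider" in ua:
        return "baiduspider"
    return "other"

def daily_bot_counts(entries):
    """Count requests per (day, bot family) so spikes and the
    desktop-vs-smartphone ratio become visible."""
    counts = Counter()
    for e in entries:
        day = e["timestamp"].split(":")[0]  # "20/May/2024:06:25:12 +0000" -> "20/May/2024"
        counts[(day, bot_family(e["user_agent"]))] += 1
    return counts

# Invented sample entries for demonstration.
entries = [
    {"timestamp": "20/May/2024:06:25:12 +0000",
     "user_agent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
                   "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"timestamp": "20/May/2024:07:01:44 +0000",
     "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
]
print(daily_bot_counts(entries))
```

Plotting those counts per day is usually enough to spot an aggressive crawl spike or a crawler that never shows up at all.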
You can identify the pages most crawled by Googlebot and then verify whether they coincide with your domain’s most important URLs. You can also break down crawl requests & status codes by directory to understand how well the pages are crawled, and whether that happens regularly, with huge delays, or not at all.
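The per-directory breakdown can be sketched as follows, again assuming parsed entries with `url` and `status` fields (the grouping on the first path segment is a simplifying assumption – deeper sites might group on two levels):

```python
from collections import defaultdict, Counter

def crawls_by_directory(entries):
    """Group crawl requests by top-level directory and status code."""
    report = defaultdict(Counter)
    for e in entries:
        path = e["url"].split("?")[0]            # ignore query strings
        parts = path.strip("/").split("/")
        directory = "/" + parts[0] + "/" if len(parts) > 1 else "/"
        report[directory][e["status"]] += 1
    return report

# Invented sample entries for demonstration.
entries = [
    {"url": "/products/widget", "status": "200"},
    {"url": "/products/gadget", "status": "200"},
    {"url": "/blog/old-post",   "status": "404"},
]
for directory, statuses in crawls_by_directory(entries).items():
    print(directory, dict(statuses))
```

Sorting the report by total requests immediately shows which sections Googlebot cares about and which it barely touches.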
Going deeper, let’s have a look at redirects. Log files can help you spot incorrect status codes; in terms of redirects, you’d particularly be looking for 302, 307, and 308 responses and changing them to 301s – except for geo-redirects. (A 304 is not a redirect but a “Not Modified” response and is usually harmless.) Watch out for redirect chains as well and try to figure out if there is something you need to tackle on that end.
It’s also super important to understand crawl errors. There are two categories you should especially look out for. The first is the 4xx status code range, mainly 404s and 410s. The general approach is to check whether those 404s happen for one specific crawler rather than another – it might only be happening for Bingbot, say, or for all crawlers. Depending on what is happening, you can decide what to do.
If you want to recover those URLs, bring them back and serve a 200. If a URL no longer exists but you want to keep its inbound link equity, implement a 301 to make sure that equity is passed on. If those 404 URLs are never coming back, consider changing them to 410s: a 410 says the URL is gone and will never return – on purpose, not by accident. Google will reduce re-crawling, and those pages will be removed from the index much faster.
Other important issues can be found in the 5xx status code range, especially 500 and 503. These happen from time to time, so it is natural to see some of them in the log file; it is more about their volume and consistency. If a specific crawler from one specific IP causes the same error over and over again, that is an issue to investigate.
Generally, from an SEO perspective that’s something you can’t do much about yourself and – particularly for the 5xx – it is usually an issue with the server or the infrastructure in general. In such a case it is probably necessary to pass the problem to the IT team.
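As a sketch of this status-code triage – bucketing every non-2xx response per crawler so you can see at a glance whether, say, 404s hit only Bingbot or every bot – assuming entries already carry a crawler label derived from the user-agent (both the field shape and the sample data are illustrative):

```python
from collections import defaultdict, Counter

def error_report(entries):
    """Bucket 3xx/4xx/5xx responses per crawler for triage."""
    report = defaultdict(Counter)
    for e in entries:
        status = int(e["status"])
        if status >= 300:
            report[e["crawler"]][status] += 1
    return report

# Invented sample entries for demonstration.
entries = [
    {"crawler": "googlebot", "status": "302", "url": "/old"},
    {"crawler": "googlebot", "status": "404", "url": "/gone"},
    {"crawler": "bingbot",   "status": "404", "url": "/gone"},
    {"crawler": "bingbot",   "status": "503", "url": "/busy"},
]
report = error_report(entries)
print(dict(report["bingbot"]))  # {404: 1, 503: 1}
```

From here, 404s shared by all crawlers point at broken links or missing redirects, while clusters of 503s point at the infrastructure problems you would escalate to IT.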
Another important task is to identify the most and least crawled URLs and folders. Highly crawled pages and folders can be used for additional internal linking (add link hubs), while low-crawled areas need to be linked more prominently. Crawl frequency reflects where Googlebot spends its time; it is wise to use frequently crawled URLs, for example, to link to new content items so they get indexed sooner rather than later. And once you understand which pages are crawled worst, you can prioritize them and give them more attention.
Log file auditing can also show you whether (new) URLs have been crawled at all. If relevant URLs haven’t been discovered or crawled, your internal linking is probably too weak and those pages need additional internal links. You should also consider XML sitemaps, better/more prominent linking, etc.
The final big thing you can use log file auditing for is understanding what kind of crawl waste is happening. It is usually caused by URL parameters gone wild: without proper URL parameter management, parameters (for tracking, for example) can get appended to each and every URL, and Google will crawl these at some point. You need to watch this very closely – it can also cause duplicate content and loads of other issues, not just crawling problems.
It is highly recommended to track parameter behavior continuously and to watch for new parameters as well as significantly increased crawling of already known ones.
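Such parameter tracking can be sketched as a simple tally of query parameter names across crawled URLs – a sudden newcomer, or a fast-growing count for a known parameter, hints at crawl waste (the sample URLs are invented):

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def parameter_report(urls):
    """Count how often each query parameter name appears in crawled URLs."""
    counts = Counter()
    for url in urls:
        for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True):
            counts[name] += 1
    return counts

# Invented sample of crawled URLs for demonstration.
crawled = [
    "/shoes?color=red&utm_source=mail",
    "/shoes?color=blue",
    "/shoes?sessionid=abc123&utm_source=mail",
]
print(parameter_report(crawled).most_common())
```

Run against yesterday's and today's logs, a diff of the two reports flags exactly the "new parameter" and "significantly increased crawling" cases described above.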
Log files combined with crawl data
One of the really exciting things is gaining insights that you would not have had before, right? A really cool thing I do a lot – and I think it can generate very valuable insights – is to combine various data sources with each other.
Certainly, you can use log file auditing on its own, as shown previously – but that is a limited view. It gets way more exciting when you combine your log file data with input from other sources. The most obvious way is to take data from a web crawl and combine it with your log files to compare simulated behavior with, for example, Googlebot’s actual behavior.
You can also take data from Google Analytics or GSC – or all of those sources at once. Another easy way would be to utilize your XML sitemap and overlay it with the data from your log files.
So let me walk you through a couple of things that you could actually do:
- One of the easiest things you could do is overlay your sitemap with your log files. You might find that the data indicates a lack of internal links within the site architecture: if your site architecture is working properly, all URLs included in the sitemap should also have been crawled. If not, something is wrong.
- If you have data from a web crawl, it could surface a URL that has been set to noindex – for whatever reason. If you then overlay that URL with data from your log file, you might see that this noindex URL is crawled very frequently. In that case, setting it to noindex maybe was not the best idea, right? This happened for one of our clients, where the team made a change in their CMS and set some very strong product pages to noindex – which should not have happened, of course. So overlaying log file data with other sources can also help reveal mistakes and act as a maintenance routine.
- Another report: take a look at your indexable pages and see whether they are really being crawled, and if so, how often. This can be a great starting point for understanding whether they just need improvement or whether you should reconsider indexation altogether – maybe they are simply not good enough? Or you might want to consolidate them with other content on your site and remove the URL from the index entirely. Generally speaking, if Google does not crawl them at all, there will be a reason for it, and you need to figure it out and act accordingly.
- You could also take all your non-indexable pages – not only those with meta robots noindex, but also those with a canonical tag referencing an alternate URL, or pages blocked in robots.txt – that are still being crawled. This is a great way to understand whether Google respects your hints and directives. If not, you need to improve the relevant URLs.
Many things can be done by overlaying different data sources, but the general approach is to build a gap analysis and understand the major differences. Is everything set up correctly? Do all crawlers behave the same way – or does Googlebot behave differently from what you were expecting altogether? Comparing crawl simulations with log file auditing data is super powerful, and once you’ve identified the differences you can immediately take action.
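At its core, this gap analysis is just set arithmetic over URL lists. A minimal sketch, assuming you already have three sets – sitemap URLs, URLs found by your own crawler, and URLs Googlebot actually requested (all sample URLs are invented):

```python
# Three URL sets you would already have from sitemap, crawler export,
# and log files (illustrative data).
sitemap_urls = {"/home", "/products/widget", "/blog/guide"}
crawl_urls   = {"/home", "/products/widget", "/blog/guide", "/tag/misc"}
logged_urls  = {"/home", "/products/widget", "/tag/misc", "/old-page?ref=x"}

# In the sitemap but never requested by Googlebot: likely weak internal linking.
never_crawled = sitemap_urls - logged_urls

# Requested by Googlebot but in neither the sitemap nor your own crawl:
# orphan pages or crawl waste.
unexpected = logged_urls - sitemap_urls - crawl_urls

print("never crawled:", never_crawled)
print("unexpected:", unexpected)
```

Each resulting set is directly actionable: the first list needs links or sitemap fixes, the second needs redirects, parameter handling, or robots directives.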