In web development and SEO parlance, a crawl error happens when a search spider fails to access a webpage or an entire website. Technical issues, content management mishaps, or improperly configured bot restrictions can all lead to these errors. While most sites have a small number of crawl errors and are virtually unaffected by them, chronic crawl errors in significant volumes can lead to poor user experiences and fluctuations in search engine visibility.
As such, managing your site’s crawl errors is an integral part of maintaining a high level of technical health. If you want to make sure that your pages rank well on Google for keywords that are strongly related to your site’s content, this is an area that you can’t overlook.
In this post, we’ll discuss how crawl errors can be detected and classified. Depending on the nature of your site’s crawl errors, different fixes will need to be applied. We’ll also have a look at the case of a website plagued by crawl errors and how fixing them helped it gain significantly more traffic from search engines.
How to Detect Crawl Errors
Detecting crawl errors is easier than ever thanks to Google Search Console. If you’re not familiar with it, this is a free diagnostic service that Google offers to webmasters. Using it is a snap: you just need to verify your ownership of a website and wait for Search Console’s report to start showing data within a few days. You can check out this guide for full details on how to get it up and running.
Once the data starts flowing in, you should be able to see plenty of information in its reports. Data on your backlink profile, content duplication, microdata implementation and more will be there. You can find the Crawl Errors report under Crawl > Crawl Errors in the left sidebar menu. It's bound to look something like this:
If for some reason the report is not showing any data, don’t panic. There are cases where it takes 2-3 weeks to get things rolling. If it takes longer than that, you may have to do some troubleshooting with your setup. This article from Google may provide helpful insights.
Types of Crawl Errors
Once you start seeing crawl error data, it's time to identify the types of errors your site has. There are two main types and several subtypes depending on what's generating the errors. Specifically:
- Site Errors – These are errors that affect the entire website, preventing Google and other bots from accessing any of the pages in a domain. Site errors are very rare and are usually indicative of serious technical issues or large-scale misconfigurations. When Google Search Console does report them, many turn out to be temporary glitches or brief spikes that don't represent larger problems.
However, there are cases when Site Errors are legitimate and will require fixes. There are three subtypes of site errors that Search Console can detect, namely:
- DNS Errors – A DNS error happens when your domain name system server’s operation is interrupted or if there’s a problem with its routing to your domain. It may also indicate a high level of latency, which means slower operations and more troublesome processing overall.
- Server Connectivity Errors – Server errors at the site level indicate that your server returned a busy response, timed out, or gave no response at all when search engines tried to access it. After several unsuccessful attempts to access your site, search bots will abandon the request and report the situation via Search Console.
- Robots.txt Fetch Errors – This type of error happens when search engines can’t find or access your website’s robots.txt file. If you’re unfamiliar with it, this file lists parameters on which pages can be indexed and which pages are off limits to search engines and other bots. Most search engines have policies that prohibit their bots from crawling any of a website’s pages if they’re unsuccessful in reading its robots.txt file.
- URL Errors – As the name suggests, these are crawl errors that happen on a per-page basis as opposed to a sitewide effect. Search Console reports three types of URL errors, specifically:
- Not Found – URLs that return the 404 Not Found server response. These are usually pages that have been deleted or are being hampered by technical troubles. If your internal pages or pages on other sites are linking to these URLs, crawlers can encounter problems and human users can hit dead ends.
- Soft 404s – These errors are reported when pages display "not found" messages but return a response code other than 404. Usually, this happens when webmasters get a little creative with custom 404 Not Found pages but leave the response code at 200 OK (a quick way to spot these is the status-check sketch after this list).
- Server Errors – Like its Site Error counterpart, this URL Error happens when a bot tries to access a webpage in your site but the server is either busy or times out. The difference is that this server error affects only some pages – not the entire website.
- Access Denied – There are cases where your site’s pages become inaccessible due to misconfigured robots.txt files or login requirements to access pages. Make sure that all pages that you want to be open to the public remain unrestricted. Conversely, don’t list restricted URLs on HTML and XML sitemaps to avoid confusing bots.
- Not Followed – Some on-page elements that link to other pages might not be fully crawlable by search engines, preventing them from following the links and indexing the pages they point to. Flash, iFrames and JavaScript can have this effect. In some cases, daisy-chained redirects may also trigger this kind of error.
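If you'd like to verify these URL-level errors by hand, a short script can tell hard 404s, soft 404s and server errors apart. The snippet below is a minimal sketch, assuming the Python `requests` library is installed; the URLs and the "not found" phrases it scans for are placeholders rather than anything from a real site.

```python
# Minimal sketch: classify a URL's response into rough crawl-error categories.
# Assumes the third-party `requests` library is installed (pip install requests).
import requests

URLS = [
    "https://example.com/",              # placeholder URLs
    "https://example.com/deleted-page",
]

SOFT_404_HINTS = ("page not found", "doesn't exist", "no longer available")

def classify(url):
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.ConnectionError:
        return "connection/DNS failure (site-level problem)"
    except requests.exceptions.Timeout:
        return "server timed out (server connectivity error)"
    if response.status_code == 404:
        return "hard 404 (Not Found)"
    if response.status_code >= 500:
        return "server error"
    if response.status_code == 200 and any(h in response.text.lower() for h in SOFT_404_HINTS):
        return "possible soft 404 ('not found' message with a 200 OK code)"
    return f"looks fine ({response.status_code})"

for url in URLS:
    print(url, "->", classify(url))
```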
How Crawl Errors May Harm Your SEO
Crawl errors are viewed by Google and other search engines as natural occurrences in most websites. These things inevitably happen even to the biggest and most popular brands online. Therefore, it only makes sense that there’s no such thing as an outright penalty or ranking demerit just because a site has some crawl errors. However, there’s also a reason why Search Console reports crawl errors the way it does: because having too many of these errors or having them on key pages can hurt your search visibility if left unresolved.
For one thing, having a lot of crawl errors and not addressing them in a timely manner signals to search engines that your site is in poor technical health or that its content isn't being managed efficiently. If your site displays a chronic inability to keep its pages live, other sites with better uptime are likely to be favored by search engine algorithms.
If your site's hierarchy of pages is viewed in diagram form and search engines find dead ends peppered across it, as in the image below, it's easy to see why they might view the site less favorably.
In some cases, crawl errors may not be huge in number but if they happen to key pages in your site, they could still have a significantly negative impact on your organic traffic. If crawl errors are affecting a page that is high on your site’s hierarchy (such as broad category pages) or if it has a significant amount of inbound links pointed to it (such as an article that’s frequently referenced by other sites), the flow of internal link equity may be disrupted and search rankings could suffer as a result.
Fortunately, Search Console lists URL errors according to priority. By default, the top 1,000 errors from the past 90 days are displayed to make prioritization of fixes easier for you.
How to Fix Crawl Errors
There are a number of ways to address crawl errors, depending on which type they fall under. The following are recommendations on how to approach both site-level and URL-level crawl errors:
- DNS Errors – DNS errors usually happen either because your DNS provider is encountering technical difficulties or if you misconfigured your DNS and hosting settings. You can start fixing this issue by checking for provider-side issues. If there aren’t any, review your DNS-hosting setup and make sure that the configuration is correct.
As an added measure, configure your server to yield 500 or 404 responses to non-existent host names.
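If you'd rather not wait for Search Console to re-crawl, you can confirm that your hostnames resolve at all with a few lines of standard-library Python. This is a minimal sketch; example.com and www.example.com stand in for your own domain.

```python
# Minimal sketch: flag hostnames that fail DNS resolution (standard library only).
import socket

HOSTNAMES = ["example.com", "www.example.com"]  # placeholders for your own hosts

for host in HOSTNAMES:
    try:
        addresses = {info[4][0] for info in socket.getaddrinfo(host, 80)}
        print(f"{host} resolves to: {', '.join(sorted(addresses))}")
    except socket.gaierror as error:
        print(f"{host} failed DNS resolution: {error}")
```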
- Server Errors – The occurrence of server errors usually depends on server uptime. Fortunately, most hosting providers these days guarantee a high degree of availability. If you encounter these errors, check with your hosting provider on the status of their service.
In some cases, high user volume and activity can overwhelm your server's resources, leading to timeouts and busy status responses. Consider adding more servers or moving to a higher hosting plan to cope with your audience's demands. You'll also want to manage downloads better and trim the file sizes of page elements to lighten the load on your server.
Bot crawls can also expend a significant amount of your server resources when you have a particularly large site with tens or hundreds of thousands of pages. If this is an issue for you, consider only allowing user agents from search engines you want to be listed in.
- Robots.txt Fetch Errors – Although extremely rare, this error occurs when search engines can't reach your website's robots.txt file, for instance because a firewall is blocking bot access or the file sits behind a password. Make sure that the file does not return a 5xx (unreachable) response code so search engines can move forward with their normal indexing process.
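A quick way to verify both points (that robots.txt is reachable, and that it only keeps out the bots you intend to keep out) is to fetch the file and run a few user agents through Python's built-in robots.txt parser. This is a minimal sketch, assuming `requests` is installed; the URLs and user agent names are illustrative.

```python
# Minimal sketch: confirm robots.txt is reachable and see which bots it allows.
import requests
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"   # placeholder
SAMPLE_URL = "https://example.com/some-page"    # placeholder page to test against
USER_AGENTS = ["Googlebot", "Bingbot", "SomeScraperBot"]

response = requests.get(ROBOTS_URL, timeout=10)
print("robots.txt status code:", response.status_code)
if response.status_code >= 500:
    print("Warning: a 5xx response here can make search engines hold off on crawling.")

parser = RobotFileParser()
parser.parse(response.text.splitlines())
for agent in USER_AGENTS:
    print(f"{agent} may fetch {SAMPLE_URL}:", parser.can_fetch(agent, SAMPLE_URL))
```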
- 404 Not Found Errors – This type of error is the most common and has the most diverse set of possible fixes. Depending on the nature of the page yielding the 404 Not Found error, you may:
- Leave them alone – Pages that have been intentionally deleted and have no replacement can be left alone. In time, Google will recognize this as a deliberate content management action and flush the URL out of its system.
The same is true for dynamic pages that were inadvertently indexed and have subsequently expired. These need no further action unless you’re sure that they’re receiving valuable inbound links from authority sources.
- 301 Redirect them – If the 404 error pages were deleted but new pages have been created to replace them, use 301 redirects to lead users and bots to the replacements. Once search engine bots digest the new order of content in your site, the 404 Not Found listings on the crawl error report should start going away.
In some cases, a page may not have a direct replacement but should still be 301 redirected for off-page optimization reasons. For instance, if you run an ecommerce site and held a holiday sale, there's a good chance your promo pages accumulated backlinks from blogs, forums or even news sites. Deleting those pages nullifies the rank-boosting power of the inbound links you earned.
In cases like these, it might make sense to 301 redirect the URL to the most contextually related page to preserve the link equity that the old page acquired. To see which pages are garnering links for your site, you could use Search Console’s Links to Your Site report under the Search Traffic menu item. Alternatively, you can use a tool called Ahrefs for richer data on your backlink situation.
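How you implement the redirects depends on your stack; most sites handle them at the web server or CDN level. Purely as an illustration of the mechanics, here is a minimal sketch using Flask (assuming it's installed), with a hypothetical mapping of retired URLs to their replacements.

```python
# Minimal sketch: serve 301 redirects for retired URLs (illustrative Flask app).
from flask import Flask, redirect

app = Flask(__name__)

REDIRECTS = {                                   # hypothetical old-to-new mapping
    "/holiday-sale": "/promotions",
    "/old-category/widgets": "/widgets",
}

@app.route("/<path:old_path>")
def legacy_redirect(old_path):
    target = REDIRECTS.get("/" + old_path)
    if target:
        # 301 tells users and bots the move is permanent, passing along link equity.
        return redirect(target, code=301)
    # Anything without a replacement falls through to a hard 404.
    return "Page not found", 404
```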
- Fix broken links – Some 404 errors come from misspelled URLs or coding errors in internal links. Screaming Frog is probably the best tool for quickly and easily finding these errors; you can check out how to do just that in a guide I wrote recently. If you're using WordPress as your CMS, things get even easier with the Broken Link Checker plugin.
- Soft 404 Errors – Download the Soft 404s report from Search Console as a CSV file. It will contain a list of all the URLs that Google deems soft 404s. Then run those URLs through Screaming Frog's List mode crawl.
This allows you to check whether the URLs are giving out response codes other than 404 or 5xx. If they are, make sure to reconfigure them in such a way that they yield the appropriate server responses. In most cases, hard 404s are best.
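If you'd rather script the check than run it through Screaming Frog, the exported CSV can be fed to a short Python loop. This is a minimal sketch, assuming `requests` is installed, that the export was saved as soft_404s.csv, and that the URLs sit in the first column (adjust the index if your export differs).

```python
# Minimal sketch: re-check the response codes of URLs exported from the Soft 404s report.
import csv
import requests

with open("soft_404s.csv", newline="") as handle:   # assumed filename
    rows = list(csv.reader(handle))

for row in rows[1:]:                                 # skip the header row
    url = row[0]
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException as error:
        print(f"{url}: request failed ({error})")
        continue
    if status not in (404, 410) and status < 500:
        print(f"{url}: returns {status} instead of a hard 404")
```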
- Not Followed – Make sure that special webpage features such as Flash, JavaScript or DHTML are not hampering proper bot crawls. Google recommends using a text-based browser such as Lynx to view your pages at a code level. If you see that these elements are hampering normal bot functions, consider using alternatives such as HTML5 to achieve the same effects while still facilitating smooth bot movement through your pages.
- Access Denied – Make sure that the URLs reported don’t require login credentials. Also, make sure that your sitemap is not listing URLs that are blocked with robots.txt.
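One way to catch the second problem early is to cross-check your XML sitemap against your live robots.txt. The sketch below assumes `requests` is installed and that the sitemap lives at the usual /sitemap.xml location; both URLs are placeholders.

```python
# Minimal sketch: flag sitemap URLs that robots.txt blocks for Googlebot.
import requests
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder
ROBOTS_URL = "https://example.com/robots.txt"     # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

sitemap = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
for loc in sitemap.findall(".//sm:loc", NS):
    url = loc.text.strip()
    if not parser.can_fetch("Googlebot", url):
        print("Listed in the sitemap but blocked by robots.txt:", url)
```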
Crawl Error Cleanup in Action
There are many factors that affect search engine visibility. In fact, Google says it tracks about 200 signals that determine a site's relevance and authority in relation to a query. In our experience, dealing with crawl errors is an essential step in establishing a solid SEO framework. This year, we ran a six-month campaign for an ecommerce website that had plenty of technical SEO issues. As you can see in the screenshots, this 1.4-million-page site had crawl errors that came in several different flavors. We've described each issue below, along with the actions we took to rectify it.
Issue: The site receives a few million visits per month, which puts a lot of stress on its servers. During one particularly busy stretch, the site had to accommodate such a high volume of user activity that its pages started timing out and returning busy signals. This would explain why Search Console reported 2,656 server errors in the image.
Action Taken: We took a look at data trends in the past three years and recognized traffic spike patterns. Referencing our business calendar, we were able to deduce which weeks of the year were busiest. These dates correlated with server strain patterns, allowing us to determine which times of the year we should add more resources to help our servers deal with the demands of the business.
Issue: The site ran on a CMS that created a lot of dynamic pages and let them get indexed by search engines by default. When these pages expired, they became inaccessible and the server returned 404 error messages. Due to the sheer size of the site, Search Console listed more than 600,000 of these.
Action Taken: We added a disallow parameter on robots.txt which restricted access to URLs with “?” in them. We also combed Search Console’s list for static pages that needed 301 redirection to their new versions. Lastly, we identified URLs that were not replaced by new pages and had no significant inbound links pointed to them. We chose to let these URLs remain in their 404 Not Found states to let search engines know that they’re gone for good.
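The triage itself is simple enough to script. Below is a minimal sketch of that sorting logic, assuming the Search Console 404 export was saved as not_found.csv with URLs in the first column; the replacement mapping is hypothetical.

```python
# Minimal sketch: sort exported 404 URLs into disallow, redirect, and leave-as-is buckets.
import csv

REPLACEMENTS = {                                  # hypothetical old-to-new mapping
    "https://example.com/old-widgets": "https://example.com/widgets",
}

disallow_candidates, redirect_candidates, leave_as_404 = [], [], []

with open("not_found.csv", newline="") as handle: # assumed filename
    for row in list(csv.reader(handle))[1:]:      # skip the header row
        url = row[0]
        if "?" in url:
            disallow_candidates.append(url)       # expired dynamic pages
        elif url in REPLACEMENTS:
            redirect_candidates.append((url, REPLACEMENTS[url]))
        else:
            leave_as_404.append(url)              # gone for good, leave as 404

print(len(disallow_candidates), "dynamic URLs to block via robots.txt")
print(len(redirect_candidates), "URLs to 301 redirect to new pages")
print(len(leave_as_404), "URLs to leave as hard 404s")
```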
Issue: The site was testing new custom 404 pages that would appear whenever a user would access a non-existent page. However, these were not implemented properly and they were giving out 200 OK response codes. Search Console ended up reporting 290 such cases.
Action Taken: We identified all the affected pages and the site’s web development team configured the CMS to turn the 200 OK response codes to hard 404 Not Found ones.
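The fix follows the same pattern in any framework: keep the custom "not found" template, but change the status code. As an illustration, here is a minimal sketch in Flask (a hypothetical stand-in for the site's actual CMS):

```python
# Minimal sketch: serve a custom "not found" page with a genuine 404 status code.
from flask import Flask, render_template

app = Flask(__name__)

@app.errorhandler(404)
def not_found(error):
    # Returning the (body, 404) tuple is what turns a soft 404 into a hard one;
    # templates/404.html is an assumed custom error page.
    return render_template("404.html"), 404
```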
Results
These actions, combined with other on-page SEO and minor link acquisition activities, were the primary catalysts for sustained organic traffic growth on the site. Due to NDAs, we can't show you a screenshot from Google Analytics, but the visualized organic search growth would look something like this:
Just to be clear: fixing crawl errors alone was not the reason for this organic traffic improvement. However, it was one of the major optimizations we implemented, and its application correlates well with the site's improved search engine visibility. Overall, think of crawl error fixes as a contributor to better user experience and smoother bot crawls – not a magic bullet that pours traffic into your site by itself. Fixing them shows search engines that you practice good housekeeping within your own site – a hallmark of good webmaster practices that leads to more rewarding user experiences.