GlenDemands XML Sitemap


An XML sitemap is a file that helps search engines crawl and index a website’s pages more intelligently. Through this protocol, web crawlers can locate more URLs, determine their place in the site’s information architecture and understand their level of importance in the site’s hierarchy of pages.

In search engine optimization (SEO), XML sitemaps are crucial to the technical and on-page facets of the process. Publishing and submitting a sitemap to search engines increases the chances of having organic search landing pages indexed and ranked for a website’s target keywords.

History

In mid-2005, Google introduced the XML sitemap protocol as a means for webmasters to list all the pages in their websites on a single reference document. The protocol is based on the “crawler-friendly web servers” concept that helps search engines crawl websites more effectively. In addition, it provides additional information that gives algorithms more clues about a page’s importance relative to its place within the website’s information architecture.
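The protocol itself is plain XML. A minimal sitemap listing two pages might look like the following (all URLs and dates are placeholders); only the <loc> element is required, while <lastmod>, <changefreq> and <priority> are optional hints:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/category/widgets</loc>
    <lastmod>2024-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```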

By late 2006, Yahoo and MSN (later rebranded as Bing) had joined Google in supporting the sitemap protocol. In 2007, the three search engines jointly announced that sitemap discovery would also be supported through a listing in a site’s robots.txt file.
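Robots.txt discovery works through a single directive. A hypothetical robots.txt using it might look like this:

```
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
```

The Sitemap directive is independent of the User-agent rules, so it can appear anywhere in the file.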

Initially, webmasters had to write, code, upload and update their sitemaps manually. However, many CMS platforms and plugins have since added support for automatically generating sitemaps and publishing them dynamically.

Applications

XML sitemaps have several key areas of application, including:

  • Large websites – Large websites are particularly prone to having significant numbers of pages left uncrawled and unindexed by search engines. Most of the time, this is due to the depth of the category pages involved in setting up proper topic silos. An XML sitemap gives these deep pages a better chance of being crawled by listing them and indicating their place in the site’s structure.
  • Crawl budget prioritization – “Crawl budget” refers to the allocation that a search engine gives a website in terms of how many pages it will crawl on average whenever it visits. Crawl budget is “earned” as a site grows in terms of authority and number of pages. The greater the crawl budget, the greater the chance that all your pages will be regularly visited by bots.

However, this isn’t the case for a lot of newer websites with relatively large numbers of pages. A new ecommerce site with 5,000 pages, for instance, will probably have fewer than half of its pages crawled initially due to crawl budget limitations. Until the site grows in authority, the number of pages crawled will not increase.

For as long as this is the case, the website can attain a viable level of search visibility by using its XML sitemap to list the most important among its pages. Having priority pages in the sitemap and leaving out the ones that can wait for indexing will increase the chances that search engines will prioritize the indexing and ranking of pages that matter more to you and your target audience.

  • Site Hierarchy Emphasis – A properly written sitemap will allow search engines to gain a better understanding of the hierarchy of pages within a website. In practice, an ecommerce site with a good sitemap should be able to make it obvious to search engines that the home page is the most important page in a site, followed by categories, subcategories, then product pages. The same would apply to other site formats such as blogs, forums, etc.
  • Index Ratio Check – When submitted to search diagnostic services such as Google Search Console, a sitemap will yield all sorts of useful information. One of these is a statistic that shows how many of the URLs listed in the sitemap are actually being indexed. Ideally, this ratio should be 1:1, since a listing in a sitemap tells Google that a webpage is important and should be considered for indexing. However, Google may decline to index a URL for several reasons, including:


  • URL blockage from meta robots tags or the robots.txt protocol.
  • URL redirection to other pages
  • Page availability issues
  • Poor page quality
  • Page content duplication

If your sitemap has more than 10% of its URLs not being indexed, this could be a cause for moderate concern. Search Console will not tell you which pages are not being indexed but at least you’ll have an idea of how many there are and where you’ll want to start looking.

  • Easier Site Audits – This has nothing to do with search engines but everything to do with SEO. If you’re using an application like Screaming Frog to crawl pages that you want to review for proper on-page SEO, you’ll notice that it misses some pages here and there. It also tends to pick up dynamic URLs which you don’t really want to optimize. To make sure you scan just the pages that you want to optimize, you can pull the URLs directly from your XML sitemap and crawl just those.

XML sitemaps can be opened in Microsoft Excel so you can view and sort them in a very neat fashion. To do that, simply open your XML sitemap in a browser and hit CTRL+S to save it. Open Excel, hit CTRL+O and locate the XML file where you saved it. You should be able to open the file rather quickly.
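The URL-extraction step described above can also be scripted. The following Python sketch (the inlined sitemap and its URLs are placeholders; in practice you would fetch the file over HTTP) pulls every <loc> value out of a sitemap so the list can be fed to a crawler’s list mode or a spreadsheet:

```python
# Sketch: extract <loc> URLs from an XML sitemap so they can be fed
# to a crawler's list mode or a spreadsheet. The sitemap is inlined
# here for illustration; in practice you would fetch it over HTTP.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/widget</loc></url>
</urlset>"""

# The sitemap protocol places all elements in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    # Parse as bytes so the encoding declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

urls = extract_urls(SITEMAP_XML)
print("\n".join(urls))  # one URL per line
```

The resulting list can be pasted directly into Screaming Frog’s list mode or saved as a CSV.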

Sitemap Generators

These days, XML sitemaps are rarely ever written manually. Most websites rely on automatic generators to build and update their sitemaps. Below is a list of popular solutions for this task:

  • WordPress SEO – Yoast’s ultra-popular SEO plugin for WordPress comes with its very own sitemap generator. With a wealth of settings and configuration features, you can easily control what goes in and out of your sitemap.
  • Magento Sitemap Suite Extension – If you’re operating an ecommerce site and Magento is your platform of choice, this plugin will help you generate a complete sitemap that evolves along with your website. You can be sure that all the category pages and products you need to have indexed will be crawled and considered by search engines.
  • Google Sitemap – This WordPress plugin generates a simple yet complete XML sitemap. Unlike WordPress SEO, it doesn’t come with a full suite of SEO functionalities. It simply does what its name suggests: create an XML sitemap. This one’s ideal if you prefer to use another SEO plugin or if your WordPress theme has built-in SEO features.
  • XML Sitemap (Drupal) – For those of you who remain faithful to Drupal, this is one of the easiest and most effective XML sitemap add-ons.

Common Errors

There are several common errors that can limit the effectiveness of an XML sitemap in contributing to a site’s overall search visibility. These include:

  • Robots.txt blockage – Some webmasters use robots.txt to block crawlers from pages they don’t want indexed. Unfortunately, they sometimes also list pages in their sitemaps that fall within the robots.txt file’s blocking parameters. This creates a conflict because Google won’t be able to tell how you want the page to be handled. When this happens, Google will simply report the error and hold off on indexing the affected pages.
  • Meta robots blockage – Similar to the robots.txt blockage issue, some pages listed in your XML sitemap might be telling search engines not to index them through the “noindex” meta tag. Again, this creates a conflict and search engines will not index the page. In both instances, the ratio between submitted and indexed pages falls, negatively impacting overall search visibility.
  • Dead Pages – In some cases, pages that have already been deleted continue to be listed in the sitemap. This happens when your sitemap does not dynamically update itself when you make changes to your content library, resulting in crawl errors. To avoid this, make sure that you’re using a reliable sitemap generator that consistently edits itself whenever changes are made.
  • Redirected Pages – Over time, pages in your site will be deleted and replaced with new ones. When this happens, 301 redirects from the old URLs to the new ones need to be applied to lead both human users and crawlers to the corresponding pages. However, you need to make sure the old URLs are taken off the XML sitemap. Once they’ve been redirected, they are no longer of interest to bots or human users; only the new pages that replaced them should be listed.
  • Thin Page Inclusion – Some pages in a website are more for utility than information. These include blog tag pages, checkout pages, internal search result pages and other session-generated pages. They are of no use to most searchers and they don’t target specific keywords either. They’re also thin on substantive content, making them poor landing pages for users driven to your site by search engines. When configuring which pages to have in your sitemap, make sure these “thin” content pages aren’t included.
  • Non-submission – Some webmasters who are new to SEO publish XML sitemaps but neglect to submit them to search diagnostic platforms such as Google Search Console. While that step is optional, it’s necessary for reaping the full benefits that your sitemap brings. Make sure you “tell” Google and Bing where to find your XML sitemap by submitting it to their respective webmaster tool platforms.
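The meta robots conflict described above is easy to spot in a page’s markup. In this hypothetical example, the page asks not to be indexed while the sitemap simultaneously nominates the same URL for indexing:

```html
<!-- In the page's <head>: the page opts out of indexing -->
<head>
  <meta name="robots" content="noindex, follow">
</head>

<!-- Meanwhile, in sitemap.xml the same URL is still listed: -->
<!-- <url><loc>https://www.example.com/private-page</loc></url> -->
```

Removing either the noindex tag or the sitemap entry resolves the conflict, depending on whether you actually want the page indexed.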

Best Practices

There are several best practices that will ensure your sitemap works properly and serves its purpose. Whenever possible, apply these simple tips:

  • Include only the webpages in your site that will make good landing pages for searchers. They have to satisfy at least one of the three primary intents of search: navigational, informational or transactional. Pages that do not fall into these buckets need not be included.
  • Pages with duplicate content should be excluded whether they are duplicating internal or external pages.
  • Pages that have the noindex meta tag or the rel=”canonical” link that points to another page should be excluded.
  • Pages that are being blocked by robots.txt must be excluded or the robots.txt parameters should be edited to lift the restriction.
  • Pages that have been 301 redirected to a new page should not be included.
  • Pages that have been permanently deleted should not be included.
  • Listed URLs must have a corresponding modification date at all times.
  • When possible, stick to a simple, conventional address for your sitemap placed at the root of your domain, such as example.com/sitemap.xml.
  • Whenever possible, create a sitemap index page that branches off into several other sitemaps for specific types of pages in your site. For instance, an ecommerce site can have a sitemap for its general pages, a sitemap for its categories, a sitemap for its products and a sitemap for its blog posts.
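A sitemap index of the kind described in the last tip is itself a small XML file that points at the child sitemaps (all file names and URLs here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-categories.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
  </sitemap>
</sitemapindex>
```

Submitting just the index file to Search Console is enough; the child sitemaps are discovered through it, and indexing statistics are reported per child sitemap.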

Keep in mind that inclusion in a sitemap is not a guarantee that a page will be indexed. Placing a URL in the sitemap simply means you are asking Google more emphatically to consider it. The page’s quality and your own indexing restrictions will ultimately determine whether the page is indexed and ranked for its keywords.