An XML sitemap is a file that webmasters publish to tell search engines which of a website's URLs should be crawled. As the name suggests, a sitemap provides search engine crawlers with a “map” of a given website.
Before the Sitemaps protocol became widely adopted by all major search engines in the 2000s, the only way search engine crawlers could explore a site was by following internal links. While this system was workable, it meant that pages with no internal links pointing to them were “orphaned” and could not be discovered or indexed on search results pages.
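Under the hood, the format is simple: a sitemap is an XML file with one `<url>` entry per page, each carrying a `<loc>` (the URL) and optional fields like `<lastmod>`. As a minimal sketch (the example.com URLs are placeholders, not a real site), here is a two-page sitemap being parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the standard Sitemaps protocol format.
# The example.com URLs are placeholders, not a real site.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/first-post</loc>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(extract_urls(SITEMAP))
# → ['https://www.example.com/', 'https://www.example.com/blog/first-post']
```

Every URL a crawler should visit appears explicitly in the file, which is exactly what removes the dependence on internal links.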
Adding and maintaining a properly configured XML sitemap to a website is a critical part of search engine optimization for several reasons:
- Sitemaps help ensure that critical pages are discovered and indexed, increasing a site’s search engine visibility.
- They also help search engines prioritize the URLs they crawl, which can be important for conserving crawl budgets on larger websites.
- Sitemaps also make it easier for search engines to understand the relative value of different pages, allowing them to be more competitive for their target keywords on search engine results pages (SERPs).
- They help webmasters organize and fine-tune their site structure and navigation to improve organic search visibility and the overall user experience while preventing irrelevant or abnormal pages from persisting and degrading the site.
- Using sitemaps makes it easier for webmasters and SEO professionals to quickly spot technical issues on websites.
- XML sitemaps ensure that important pages can still be discovered even if they become orphaned, keeping them consistently visible in search.
What Tools Do You Need to Audit XML Sitemaps?
The latest versions of Google Search Console (GSC) have functions that let SEOs audit a target website’s sitemap. GSC is also useful because it offers suggestions on how different sitemap issues can be fixed. The main drawback of Google Search Console, however, is that its reports aren’t real-time and may lag behind the current state of the site, which is not ideal for sites that are updated frequently.
At SearchWorks, we also use an SEO diagnostic tool called Screaming Frog to give us an up-to-date picture of a site’s XML sitemap. However, compared to GSC, Screaming Frog does require more SEO knowledge and experience to use effectively. Screaming Frog’s free version also only lets you crawl a limited number of URLs per crawl, so you may need to consider purchasing the full version if you handle large websites or run an SEO-related business.
What to Include in XML Sitemaps
While they can be created manually, sitemaps these days are most often autogenerated by an application or plugin. On WordPress sites, the plugin we tend to use for this purpose is Yoast, as it has a multitude of other SEO functions, allowing us to effectively reduce the number of plugins we use. With Yoast and most other popular WordPress SEO plugins, making sitemaps is just a matter of ticking boxes to choose which pages and page types to include. The experience is broadly similar on other popular platforms like Shopify.
Webmasters can choose to include or exclude whatever pages they want on their sitemaps. However, business websites and other sites that intend to reach a wider audience through organic search may want to include certain pages on their sitemap. These include:
- All public-facing pages
- The homepage
- Category pages (Some SEOs may choose to exclude these)
- Blog articles
- Product pages
- Brand pages
- Visitor account login pages
- Other landing pages
What Not to Include
In most cases, we recommend excluding these page types. Note that you may occasionally want to include some of these depending on the content and intent of your pages:
- Restricted pages (example: admin login pages)
- Deleted pages (404 pages)
- Redirected pages (301 and 302 redirects)
- Pages whose canonical tag points to a different URL
- Tag archives
- Author archives
- Date archives
- Paginated pages (pages that belong to a series, usually long articles broken up into shorter pages; you should still include the first page in your sitemap)
Our Process for Auditing XML Sitemaps
Below are the basic steps our team takes when auditing XML sitemaps, both as part of ongoing website maintenance and when assessing clients’ web properties in preparation for other SEO activities. We recommend downloading Screaming Frog if you want to use the same process we do. For reference, you can follow our process in the video, starting at 24:56.
Step 1: Find the Sitemap
The site’s web developer should be able to tell you where the XML sitemap is located. If you’re not sure where it is, the sitemap can usually be found at the root of the domain.
In our example on the video, the sample site’s root domain is “poundit.com”. Knowing this, we can try finding the sitemap at “poundit.com/sitemap.xml” or “poundit.com/sitemap_xml”, as “sitemap.xml” and “sitemap_xml” are the most common file names used for sitemaps. Less often, sitemaps may follow different naming conventions or be intentionally hidden by the web developer.
In the example video, “poundit.com/sitemap.xml” worked and gave us the parent sitemap. A parent sitemap (also called a sitemap index) is simply a sitemap that lists other child sitemaps. In our example, the website’s development team split the parent sitemap into four smaller sitemaps, each covering a different type of content. This is a useful way of organizing sitemaps: it makes them easier to read and helps search engine bots understand how the different parts of a website function.
If you find a parent sitemap, you can look into the contents of the smaller child sitemaps by simply copy-pasting their URLs into your web browser’s address bar (see 28:42).
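Another place worth checking: many sites declare their sitemap’s location in their robots.txt file using a `Sitemap:` line, which is part of the Sitemaps protocol. As a quick sketch (the robots.txt content below is made up for illustration), this Python snippet pulls those declarations out of a robots.txt file’s text:

```python
def sitemaps_from_robots(robots_txt):
    """Collect sitemap URLs declared via 'Sitemap:' lines in robots.txt.
    The directive name is matched case-insensitively."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

# Hypothetical robots.txt content for illustration only.
ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml"""

print(sitemaps_from_robots(ROBOTS))
# → ['https://www.example.com/sitemap.xml']
```

If the robots.txt file declares nothing, fall back to asking the developer or trying the common paths above.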
Step 2: Extract The URLs from the Sitemap
You will need Excel or a similar spreadsheet application to extract the URLs. Save the XML document by hitting “CTRL + S”, and make sure to save it somewhere you can easily find it.
Next, open the XML file you need in Excel. You should get a dialog box asking you how you would like to open the file. Select “As an XML Table” and click OK. Click “OK” on each of the succeeding dialog boxes. You should be given a spreadsheet with a list of the URLs on the sitemap with columns indicating information related to the URLs.
If there are other child sitemaps, simply repeat the process for each of them. Once you’ve done that, compile all the URLs into a single column on one Excel sheet so that they form one long list.
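If you’d rather skip the spreadsheet step, the same extraction can be scripted. The sketch below, using only Python’s standard library, tells a parent sitemap (a `<sitemapindex>`) apart from an ordinary `<urlset>` and collects the `<loc>` values from either; the example.com child sitemaps are placeholders:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return ('index', child_sitemap_urls) for a parent sitemap,
    or ('urlset', page_urls) for an ordinary sitemap."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc")]
    return kind, locs

# Hypothetical parent sitemap for illustration only.
PARENT = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

kind, children = parse_sitemap(PARENT)
print(kind, children)
```

To build the full list, fetch each child sitemap (for example with `urllib.request`), run `parse_sitemap` on it, and concatenate the resulting page URLs, which is the scripted equivalent of compiling everything into one spreadsheet column.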
Step 3: Prepare Screaming Frog
Once you’ve opened Screaming Frog, go to the top menu and set “Mode” to “List”. Screaming Frog’s default “Spider” mode crawls a site the way a search engine bot would, and it can also be used to extract sitemap data. However, we prefer to extract URL lists straight from sitemaps to be especially sure that no URLs are overlooked.
Step 4: Upload Your List
Once the correct mode is set, go to your Excel spreadsheet and copy all the URLs you collected from the sitemap to your clipboard. Then go back to Screaming Frog, click “Upload”, then select “Paste”. You should get a confirmation screen containing your URL list. If there are any duplicate entries, Screaming Frog will automatically omit them, which is one of the nice features of this app. Click “OK” to begin uploading. Screaming Frog will then crawl the URLs on your list and give you data on their current status.
Step 5: Export Your Data to Excel
While almost all the data given is useful for all kinds of technical and on-page optimizations, you only need data from the first few columns for your XML sitemap audit. To clean up the data, we’ll need to export from Screaming Frog to a different spreadsheet. To do this, click “Export” and save your spreadsheet as a .CSV file. Again, be sure to save the file where you can easily find it. Once you have the file, open it in Excel.
Step 6: Delete or Hide All Unneeded Data
As is, your spreadsheet will be full of data that you won’t need for an XML sitemap audit. To make things easier and reduce the odds of making avoidable mistakes, delete or hide all the data columns except for “Address”, “Content”, “Status Code”, “Status”, “Indexability”, and “Indexability Status”.
Step 7: Sort and Mark Your Data
Sort the “Status Code” data column from largest to smallest. This should bring any pages with 301, 302, or 404 statuses to the top of the list. If you find any anomalous URLs, flag them for later.
Next, sort the “Indexability” data column in reverse alphabetical order (from Z to A). This should bring up any non-indexable URLs to the top of the list. You can then mark these or remove all the other non-problematic URLs from the list to leave you with a smaller, easier-to-manage list of URLs.
Lastly, sort the data in the “Indexability Status” column in normal alphabetical order to bring all the canonicalized URLs to the top. These should also be flagged for inspection and removal.
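The sorting and flagging steps above can also be sketched in code. Assuming the exported CSV uses the column names listed earlier (this may vary by Screaming Frog version), the following Python snippet pulls out every row that has a non-200 status code or is non-indexable:

```python
import csv
import io

# Columns we keep for the audit; assumes these headers exist in the export.
KEEP = ["Address", "Status Code", "Indexability", "Indexability Status"]

def flag_problem_urls(csv_text):
    """Return rows whose status code is not 200 or that are non-indexable."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Status Code"] != "200" or row["Indexability"] != "Indexable":
            flagged.append({k: row.get(k, "") for k in KEEP})
    return flagged

# Hypothetical export data for illustration only.
SAMPLE = (
    "Address,Content,Status Code,Status,Indexability,Indexability Status\n"
    "https://www.example.com/,text/html,200,OK,Indexable,\n"
    "https://www.example.com/old-page,text/html,301,Moved Permanently,"
    "Non-Indexable,Redirected\n"
)

print(flag_problem_urls(SAMPLE))  # only the redirected row is flagged
```

This reproduces the spreadsheet sorting in one pass: anything the function returns is a URL you would have flagged by hand.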
Step 8: Inspect Functioning Non-Indexable URLs
If a page is on the sitemap, it implies that the site owner wants it indexed by search engines. However, if the page is functioning but non-indexable because of a meta robots tag, this cannot happen. Therefore, non-indexable pages on a sitemap either have to be removed from the sitemap to conserve the crawl budget or made indexable so that they can show up on search engines.
To resolve this contradiction, open each non-indexable URL on your browser to confirm if they should be indexed. To reiterate, all public-facing pages, blog articles, product pages, brand pages, public login pages, and other landing pages that you want to be searchable should be indexable. Sometimes, you will need to contact the website owner or webmaster to confirm whether or not the pages you found are meant to be visible to visitors.
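If you want to confirm a page’s noindex status without opening every URL by hand, the meta robots check can be scripted. This is a simplified sketch: it only inspects `<meta name="robots">` tags in the page source and ignores the X-Robots-Tag HTTP header, which can also set noindex:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.directives.append((a.get("content") or "").lower())

def is_noindex(html):
    """Return True if any meta robots tag in the page contains 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)

print(is_noindex('<head><meta name="robots" content="noindex,follow"></head>'))
# → True
```

Running this over the flagged URLs (after fetching each page’s HTML) gives you a quick list of which ones carry a noindex tag before you decide what to keep.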
Step 9: Remove or Edit Functioning Non-Indexable URLs from the Sitemap
Once you have a clear idea of which pages need to be made indexable, it’s time to remove their noindex tags through the site’s backend. All other non-indexable pages on the list should then be removed from the sitemap.
If directly working on the URLs is outside of your responsibilities, you’ll have to send the website owner or web developer lists containing the URLs to remove from the sitemap and the URLs that need to be made indexable.
If you have to do these things yourself, removing noindex tags and updating sitemaps is usually easy, especially if the website is hosted on a popular platform like Shopify or WordPress. These platforms have large support communities, and you should be able to find SEO plugins and other tools that let you set or remove noindex tags on single URLs or entire page categories. To learn more about setting noindex tags, read our Meta Robots Tag Audit Guide.
Step 10: Remove URLs With Crawl Errors and Canonical Tags from the Sitemap
Following the principle that only pages meant for the public should be on XML sitemaps, all the error and canonicalized pages that you flagged earlier should be removed. Go back to your spreadsheet and gather these URLs in one list.
After these problem URLs are collated, whoever is in charge of the site should be notified so that they can inspect and remove these pages from the sitemap. Removing these pages should help conserve the site’s crawl budget and improve the visibility of higher-value pages on the site.
XML sitemap audits are a basic part of technical SEO that should be done prior to making any modifications on clients’ websites. They should also be done periodically to conserve the site’s crawl budget, particularly if the site is very large or is updated frequently. Doing these audits can be key for helping a site maintain its position on the SERPs and should help make its content more visible to its intended audience.
Some webmasters fail to update their XML sitemaps frequently enough because of the several steps involved. However, once you understand what’s going on, it’s not all that complicated, especially if you have SEO diagnostic tools like Screaming Frog.
If you want the Philippines’ most respected SEO agency to run an XML sitemap audit on your site, don’t hesitate to reach out and set up a meeting.