Duplicate Content: Common Causes and Practical Solutions

As the name suggests, duplicate content refers to content that is identical or very similar to content that appears in one or more other locations, either within a website or across different websites. The operative word here is “location,” which refers to a unique web address, also known as a URL. If the content at one URL closely resembles or exactly matches the content at another URL, you have duplicate content.

For example, if you run a pet blog and you’ve written an article about the betta fish, you may publish it on a page that ends up having the web address https://example.com/betta-fish. However, if you publish it on another page with the web address https://example.com/pet-fish/betta-fish, then this becomes a duplicate of the first one.

Why Does Content Duplication Happen?

Content duplication can happen intentionally: a webmaster or author might purposely publish the same content on several webpages with different URLs. The person might simply want the content to be available on other pages, unaware of how search engines handle this practice (more on this later).

However, content duplication can also occur because of the way some content management systems are configured. This is especially common on ecommerce websites, where a product can appear in multiple categories and the website ends up allowing multiple paths and URLs to the same piece of content. Many websites also use query string parameters in URLs to track items such as searches, faceted navigation inputs, session IDs, languages, and currencies, and each parameter combination can produce yet another URL for the same page.
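
To make the query string problem concrete, here is a minimal Python sketch that collapses URL variants differing only in tracking parameters into a single form. The parameter names in IGNORED_PARAMS and the function name are assumptions chosen for illustration; the right list depends on which parameters actually leave the page content unchanged on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters that do not change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign", "ref"}


def strip_ignored_params(url: str) -> str:
    """Return the URL without ignored parameters and with the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Sorting makes ?a=1&b=2 and ?b=2&a=1 compare as equal.
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


variant_a = strip_ignored_params("https://example.com/shop?color=red&sessionid=1")
variant_b = strip_ignored_params("https://example.com/shop?sessionid=2&color=red")
print(variant_a == variant_b)
```

A helper like this is useful for auditing a crawl export: URLs that reduce to the same normalized form are likely duplicates of each other.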

Content duplication can also happen when a website creates different versions of a web page for specific purposes. For example, a website may create a stripped-down version of a webpage so that it loads faster and looks better on a smartphone. A website may also generate a print-friendly version of a webpage to make it easier for users to get the content printed.

Similarly, different versions of a webpage can exist when it is made accessible both as a www and as a non-www page, as well as when it’s made available through both the “http” and “https” (secure) protocols.
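
As a rough illustration, the Python sketch below maps the four protocol and host variants of a page onto one preferred form. The choice of “https” and the non-www host example.com is an assumption made for the example, as is the function name; your own preferred version may differ.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example only.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "example.com"


def to_preferred_origin(url: str) -> str:
    """Rewrite www/non-www and http/https variants to the preferred origin."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in {PREFERRED_HOST, "www." + PREFERRED_HOST}:
        return urlunsplit(
            (PREFERRED_SCHEME, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
        )
    # URLs on other hosts are returned unchanged.
    return url


variants = [
    "http://example.com/betta-fish",
    "http://www.example.com/betta-fish",
    "https://example.com/betta-fish",
    "https://www.example.com/betta-fish",
]
# All four variants collapse to a single URL.
print({to_preferred_origin(v) for v in variants})
```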

In some cases, however, unscrupulous webmasters duplicate content across domains in an attempt to spam web users and manipulate the search engine rankings of their websites. Such practices can result in serious consequences for the websites in question.

The good thing is that most of the cases mentioned above happen not because of deliberate efforts at being spammy but because the duplicated content pieces are generated automatically by the websites themselves. This means content duplication is often a technical issue that can be addressed with corresponding technical solutions.

Can Your Website Be Penalized by Google for Having Duplicate Content?

Much has been made of the evils of duplicate content and of the hard luck that befalls websites that have it. One of the most pervasive myths to emerge over the years is that Google penalizes websites simply because they contain duplicate content. This belief became even more prevalent following the Google Panda algorithm update in February 2011, which sought to lower the rank of low-quality websites, particularly content farms and scraper sites that copy content from other websites, often without the webmasters’ permission.

We now know that Google does not necessarily penalize websites just because they have duplicate content. As early as September 2008, Webmaster Trends analyst Susan Moskwa made it clear in a post on the Google Webmaster Central Blog that this is not the case:

“Let’s put this to bed once and for all, folks: There’s no such thing as a “duplicate content penalty.” At least, not in the way most people mean when they say that.”

In July 2013, Matt Cutts, former head of the web spam team at Google, also answered the question in a video posted on Google Webmasters’ official YouTube channel:

“The answer is I wouldn’t stress about it—unless the content that you have that’s duplicated is spammy or keyword stuffing, or something like that. Then an algorithm or person might take action on it.”

For the most part, your website might be perceived by the Panda algorithm as being low-quality only if you have a significant amount of content that was deliberately lifted and plagiarized from other sources on the web. In other cases, such as when you have a large number of duplicate product pages in your ecommerce site, you most likely won’t be penalized by Google. However, you may still run into problems because of the way Google filters duplicate content in their search results.

Issues You Might Encounter from Having Duplicate Content

From the perspective of web users, duplicate content is usually not a cause for concern because they see the same content no matter which version of it they land on. The situation is different for search engines like Google, however, because they have to choose among these different versions of the same content.

First, search engines will not know which versions of the content to include in or exclude from their search indices. Often a search engine like Google will be forced to choose the version that it believes is the most relevant for a given search term. As such, there is a possibility that the least desirable or least profitable version of the content will be shown at the top of the search results instead of the main version that you want to appear.

Second, search engines will have a difficult time figuring out how to rank the different versions of the content in their search results. By having multiple versions of the same content on your website, you’re basically dividing the link equity among pages with the same content. When web users link to different versions of the same content on your site, that content piece will not gain as much search visibility and link equity as it would if all the links pointed to the main version.

Simple Solutions to Duplicate Content

There are a number of practical solutions to duplicate content issues. Here are some of them:

Create 301 Redirects

One of the most commonly employed solutions to duplicate content is creating 301 redirects. They are permanent redirects usually used to help people and search engines find content items that have been moved to new URLs.

Setting up 301 redirects from duplicate pages also allows you to send web users and search engine crawlers to the main or original copies of your content. They are useful because they pass most of the link equity from the duplicate pages to the destination page, effectively eliminating the possibility of these duplicate content pieces competing with the main one.
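
In practice, 301 redirects are usually configured in the web server or content management system, but the mechanism itself is simple. The Python sketch below uses the standard library’s http.server module to show it: a request for a duplicate path is answered with status 301 and a Location header naming the main copy. The REDIRECTS mapping, the handler name, and the paths are hypothetical and reuse the betta fish example from earlier in this article.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping of duplicate paths to the main copy of the content.
REDIRECTS = {"/pet-fish/betta-fish": "/betta-fish"}


def redirect_target(path):
    """Return the main URL path for a duplicate path, or None if there is none."""
    return REDIRECTS.get(path)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path also contains any query string; this sketch matches exact paths only.
        target = redirect_target(self.path)
        if target is not None:
            # 301 tells browsers and crawlers that the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"main copy\n")
```

To try it locally, the handler can be passed to http.server.HTTPServer, for example HTTPServer(("127.0.0.1", 8000), RedirectHandler).serve_forever(). Requesting /pet-fish/betta-fish should then redirect to /betta-fish.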

Use Rel=“Canonical” Tags

A rel=“canonical” tag is an HTML link element, placed on a duplicate page, that tells search engine crawlers which URL is the main or original copy of that content. In other words, it informs crawlers that the page carrying the tag is only a copy of the main (i.e., the “canonical”) content page.

Like a 301 redirect, a rel=“canonical” tag passes link equity from the duplicate pages to the main content. Unlike a 301 redirect, however, it is often easier to implement because it is simply added to the HTML head of a webpage.
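
For sites that render pages from templates, the tag can be generated rather than typed by hand. The following Python sketch builds the link element for a given main URL; the function name is an assumption for illustration, and the URL reuses the betta fish example from earlier.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build the <link> element that goes inside the duplicate page's <head>."""
    # Escaping keeps characters such as & and " from breaking the attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Placed in the head of https://example.com/pet-fish/betta-fish,
# this tag points crawlers to the main copy.
print(canonical_link_tag("https://example.com/betta-fish"))
```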

Set Your Preferred Domain

You can use the Google Search Console to tell Google which version of your domain you want to use for search results. For instance, if your website is accessible through both the www and the non-www versions of the URL, you can tell Google to favor one over the other so that the search engine will use that version in all subsequent crawls of the site.

Use Content=“NoIndex,Follow” Tags

Another option against duplicate content is the meta robots tag with content=“noindex,follow”. It can be added to the HTML head of a webpage to tell search engines to follow the links on that page while keeping the page itself out of their search indices. This tag is particularly useful for duplicate content issues arising from paginated content.
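
As a small illustration, the Python sketch below builds the meta robots tag and applies one possible policy for paginated content: index the first page of a series and mark the later pages noindex,follow. Both function names and the pagination policy are assumptions for the example, not a rule that search engines prescribe.

```python
def robots_meta_tag(index: bool, follow: bool) -> str:
    """Build a meta robots tag for the <head> of a page."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    content = ",".join(directives)
    return f'<meta name="robots" content="{content}">'


def robots_tag_for_page(page_number: int) -> str:
    """Hypothetical policy: only the first page of a series is indexed."""
    return robots_meta_tag(index=(page_number == 1), follow=True)


print(robots_tag_for_page(3))
```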

Syndicate and Republish Content with Caution

You may be thinking of syndicating your content on other websites or, conversely, republishing on your own website content pieces that you contributed earlier to other websites. The best practice is to make your content assets unique across these different websites. Ideally, no two content pieces should be the same.

If you really can’t avoid syndicating your content, ask the webmasters for backlinks to your own website and ask them to use the noindex meta tag to prevent search engines from indexing their duplicate copies of your content. If you’re republishing on your site a guest post you wrote for another site in the past, use the rel=“canonical” tag to let search engines know where the content piece originally appeared.

For webmasters who strive to operate within the guidelines of search engines, duplicate content issues rarely result in a penalty. Nevertheless, the presence of significant duplicate content on your site could still affect its search visibility because of the way Google and other search engines filter such content. By implementing the right technical solutions, however, you’ll be able to fix these duplicate content issues, allowing you to concentrate on the webpages that actually matter to you.

Glen Dimaandal
Glen Dimaandal is the founder and CEO of SearchWorks.Ph. He has been doing SEO since 2008 and is consistently featured in mainstream media and industry conferences. His core skills include SEO, SEM, data analytics and business development.