Over on the Official Google Webmaster Central Blog, Google has explained in detail how it deals with duplicate content. Should you care? Yes, especially if you have a WordPress blog.
As Google points out, “Most of the time when we see this (duplicate content), it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on”.
What do they do about it? “During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in ‘regular’ and ‘printer’ versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. However, we prefer to focus on filtering rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the ‘less desired’ version of a page shown in our index”.
Yikes! Less desired version! Now I know why archive, category, and feed pages show up in Google’s index for my site.
To the rescue comes the DupPrevent WordPress Plugin, which helps you avoid being penalized by Google for duplicate content by inserting a noindex meta tag in pages that might trigger Google’s duplicate content filters. The plugin also includes a robots.txt file that disallows spider access to files that need not be included in a search engine’s index.
If you already have a robots.txt file, just append the lines from the plugin’s file to it. I just installed the plugin on this WordPress blog (v2.04).
I have also installed this plugin on a WordPress v1.52 blog, and it works perfectly! Now it will take some time to see whether the indexing for my blogs improves.
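If you would rather not paste the robots.txt lines in by hand, here is a minimal Python sketch of the append step. It is my own illustration, not part of the plugin: the function name `merge_robots_txt` and the example paths are made up, and it assumes both files contain a single `User-agent: *` group with no blank lines inside it.

```python
def _normalize(line: str) -> str:
    """Lowercase the field name only; paths in robots.txt are case-sensitive."""
    field, sep, value = line.partition(":")
    return field.strip().lower() + sep + value.strip()


def merge_robots_txt(existing: str, plugin_rules: str) -> str:
    """Append plugin rules to an existing robots.txt, skipping rules already present.

    Assumption: both inputs hold one `User-agent: *` group, so appending
    at the end keeps every rule in that same group.
    """
    merged = [line.rstrip() for line in existing.splitlines()]
    # Drop trailing blank lines so the appended rules stay attached to the group.
    while merged and not merged[-1]:
        merged.pop()
    seen = {_normalize(line) for line in merged}
    for line in plugin_rules.splitlines():
        rule = line.strip()
        if rule and _normalize(rule) not in seen:
            merged.append(rule)
            seen.add(_normalize(rule))
    return "\n".join(merged) + "\n"


# Hypothetical file contents, for illustration only.
existing = "User-agent: *\nDisallow: /wp-admin/\n"
plugin_rules = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /feed/\n"
merged_text = merge_robots_txt(existing, plugin_rules)
```

If your existing robots.txt has several user-agent groups, merge by hand instead, because a rule appended at the end of the file only applies to the last group.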
Update (June 07): The plugin described above is no longer available. If you would like to block duplicate content from getting indexed, try the WordPress Duplicate Content Cure Plugin.
Ah, but what I wonder is how well it actually decides what not to index. A quick glance at the code shows it’s an incredibly simple routine. Only page 1 of the homepage, single posts, category posts, and pages are indexed. Everything else gets a noindex.
It’s a very cut-and-dried decision, and it also noindexes all your tags. Does this really solve the problem?
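To make that rule concrete, here is a rough Python sketch of the decision as described above. It is not the plugin’s actual code (the plugin is PHP, and I am only restating the rule). The `PageContext` type, the page-kind names, and the exact meta tag string are assumptions for illustration. I also assume every category page is indexable, since the description does not say whether paged category listings are treated differently.

```python
from dataclasses import dataclass

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'

# Page kinds that stay indexable regardless of page number (assumed names).
INDEXABLE_KINDS = {"single", "category", "page"}


@dataclass
class PageContext:
    kind: str             # e.g. "home", "single", "category", "page", "tag", "archive"
    page_number: int = 1  # position in a paginated listing


def robots_meta(ctx: PageContext) -> str:
    """Return the tag to emit in <head>, or an empty string if the page is indexable."""
    if ctx.kind == "home":
        # Only the first page of the homepage is indexed.
        indexable = ctx.page_number == 1
    else:
        indexable = ctx.kind in INDEXABLE_KINDS
    return "" if indexable else NOINDEX_TAG


print(repr(robots_meta(PageContext("home", 1))))
print(repr(robots_meta(PageContext("home", 2))))
print(repr(robots_meta(PageContext("tag"))))
```

Written out like this, the bluntness is easy to see: the only inputs are the page type and, for the homepage, the page number, so tag pages always get a noindex no matter what they contain.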