WordPress, RSS, and Robots.txt

I have been playing with my robots.txt file because I noticed my RSS feeds showing up in the Google search results. RSS feeds are not really human-readable, and they duplicate content that already exists on the site, so having them in a search engine provides no benefit.

I did some searching around and found what looks to be a good example WordPress robots.txt file at askApache, but it does not address the RSS feed problem at all, so I just added:

Disallow: /feed/
Disallow: /comments/

to my robots.txt file. That takes care of the primary RSS feeds for posts and comments, but in WordPress each post also has its own feeds, in the form of “/post-name/feed/” and “/post-name/comments/feed/”. The original 1994 spec for robots.txt does not allow a wildcard character, but in searching around I found that some search engines support one. Google spells out its pattern-matching rules pretty well.
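To make the wildcard idea concrete, here is a minimal Python sketch of Google-style matching as I understand it: a * stands for any run of characters, a trailing $ anchors the end of the URL, and everything else is matched literally from the start of the path. The function names and the rule list are made up for illustration, and the sketch ignores Allow lines, rule precedence, and percent-encoding, so treat it as a rough approximation rather than what any crawler actually runs.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow value into a regex (hypothetical helper)."""
    anchored = rule.endswith("$")          # trailing $ means "URL must end here"
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then join the pieces with "match anything".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_disallowed(path: str, rules: list[str]) -> bool:
    """True if any non-empty rule matches from the start of the path."""
    for rule in rules:
        # An empty Disallow value blocks nothing, so skip it.
        if rule and rule_to_regex(rule).match(path):
            return True
    return False

# Illustrative rules only; not a complete robots.txt.
RULES = ["/feed/", "/comments/", "*/feed/", "/*.php$", "/*?*"]

for path in ["/feed/", "/post-name/feed/", "/post-name/comments/feed/",
             "/post-name/", "/index.php", "/?p=12"]:
    print(path, is_disallowed(path, RULES))
```

Under these assumptions, the per-post feed paths are caught only by the rule with the leading *, while an ordinary post URL such as /post-name/ matches nothing.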

So,

Disallow: */feed/

should work, and so I added it too. I left in the two lines mentioned above just in case some other search-bot does not recognize the * as a wild-card. Testing out my new robots.txt file at

Here is my current robots.txt:

# Allow all
User-agent: *
Disallow:

# disallow all files in /wp- directories
Disallow: /wp-*/
Disallow: /images/
Disallow: /stats/
Disallow: /feed/
Disallow: /comments/
Disallow: */feed/
Disallow: /Test/

# disallow the wp-* files in the root directory
Disallow: /wp-

# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$

# disallow all files with ? in url
Disallow: /*?*
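A bot that follows only the original 1994 rules may read a file like this quite differently from Googlebot. As a rough check, the sketch below feeds a shortened copy of the file to Python's standard-library urllib.robotparser, used here purely as a stand-in for such a bot (an assumption; real crawlers vary). That parser does plain prefix matching, treats a blank line as the end of a record, and applies the first rule that matches, so the empty Disallow: line and the blank line after it both matter. ExampleBot, example.com, and the TIGHTENED variant are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Ask the stdlib parser whether `agent` may fetch `path` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse from a string, no network access
    return parser.can_fetch(agent, "http://example.com" + path)

# Shortened copy of the layout above: an empty Disallow, a blank line, then rules.
AS_LISTED = """\
User-agent: *
Disallow:

Disallow: /feed/
Disallow: */feed/
"""

# Same rules with the empty Disallow and the blank line removed.
TIGHTENED = """\
User-agent: *
Disallow: /feed/
Disallow: */feed/
"""

for path in ["/feed/", "/post-name/feed/"]:
    print(path, allowed(AS_LISTED, path), allowed(TIGHTENED, path))
```

With this parser, the as-listed layout blocks nothing, because the record ends at the blank line and the empty Disallow allows everything. The tightened variant blocks /feed/ but still allows /post-name/feed/, since the * is taken literally. That fits the idea of keeping the plain /feed/ and /comments/ lines for bots that do not understand wildcards.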

2 Comments »

  1. 1

    Only Googlebot has explicit support for these wildcards in robots.txt, so your robots.txt should start out more conservative, then specify rules just for the Google bots.

    User-agent:  *
    Disallow: /wp-
    Disallow: /images/
    Disallow: /stats/
    Disallow: /feed/
    Disallow: /comments/
    Disallow: /Test/
    
    
    User-agent: Googlebot
    # disallow all files ending with these extensions
    Disallow: /*.php$
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.txt$
    
    # disallow all files with ? in url
    Disallow: /*?*
    Disallow: */feed/
    Disallow: */trackback/
    

    And as for your RSS issues, I moved to FeedBurner and it's awesome! Google actually lets you submit your RSS feed as a sitemap in the webmaster tool!

    Comment by askApache — February 23, 2007 @ 7:11 am


  2. 2

    Thanks for the comment!

    Re Feedburner… I have thought about moving, and looked at it, but have not made the jump yet. It looks like it would be a great way to keep stats on your feed readers.

    Re Robots.txt… I began the evening thinking that none of the search bots could handle wildcards, since that appears to be what robotstxt.org states.

    Then I happened across a post stating that Google can handle wildcards, and I thought, hmm, maybe I should create some rules just for the Google bot.

    Then, as I read more, I found an article on the Yahoo blog from November 2006 saying that the Yahoo bot can now handle wildcards!

    Rather than go looking to find out which bots have kept up with the times, I figure that the worst that can happen is that a bot will ignore the rule. The robots.txt file seems a lot more forgiving than .htaccess.

    Comment by YeOleImposter — February 23, 2007 @ 9:34 am


Copyright by Gary Paulson