Bloggers Page Rank Tips #2: Tell Search Engines What To Do With Robots.txt

by Colin Lim

Ever wanted to tell the big search engines where to go? Well, you can, though not quite in the way some of you may be thinking (I think there are some forums around that will let you let off a little steam about the search engines). You can tell a search engine which parts of your blog you want it to crawl and include in its index.

Why would you want to stop a search engine crawling your entire blog?

There are several reasons you might want this. Here are a few:

  1. I’d say that most blogs include a bunch of folders that are irrelevant to the blog’s subject and shouldn’t really show up in search results, such as your cgi-bin folder or the folders that Wordpress creates to make your blog work, like wp-content, wp-admin etc.
  2. You may still be building your blog and do not want the search engine spiders to crawl unfinished work.
  3. If you allow a search engine bot/crawler/spider to index your entire blog, the chances are you are going to "bleed" page rank because it indexes parts of your blog that aren’t optimised for search engines, contain little or no content, and do not relate to your blog’s main topic.
  4. You may have parts of your blog that you are using to sell certain affiliate programs. These are landing pages that are very similar but slightly different depending on who you are selling to. If you let a search engine crawl all of them, you could get penalized for duplicate content.
  5. There are some bots and spiders you simply don’t want anything to do with, like the ones that just collect email addresses.
  6. You have some private material in your folders that you haven’t password protected.

So to stop any of this from happening you use what is called a robots.txt file. This file is the standard way of telling a search engine what it’s allowed to crawl and index and what it isn’t. Well-behaved search engines look for this file before they crawl your blog. If one doesn’t exist, the crawler will simply crawl your entire blog. Keep in mind that the file is a request rather than a lock: respectable crawlers obey it, but anyone can still open those folders directly, so truly private material still needs password protection.

It is important to note that the robots.txt file is a plain text file, not HTML or PHP. The best way to create one is with a program like Windows Notepad or anything else that saves plain text rather than Word documents etc. Avoid word processors and similar programs, as they can add characters that search engine spiders do not recognize and that may invalidate your robots.txt file.
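If you are not sure whether your editor slipped in anything odd, a quick check along these lines can help. This is only a rough sketch: the function name find_problems is made up for this post, and it flags anything outside plain ASCII (curly quotes, a byte order mark and so on), which is stricter than what crawlers actually require.

```python
# Byte order mark that some editors silently put at the start of a file.
UTF8_BOM = b"\xef\xbb\xbf"


def find_problems(data):
    """Return a list of warnings for bytes that are not plain ASCII text."""
    problems = []
    if data.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 byte order mark")
        data = data[len(UTF8_BOM):]
    for number, line in enumerate(data.splitlines(), start=1):
        # Allow printable ASCII (32 to 126) and tabs (9); flag everything else.
        bad = [b for b in line if b > 126 or (b < 32 and b != 9)]
        if bad:
            problems.append(f"line {number}: {len(bad)} byte(s) outside plain ASCII text")
    return problems


# A file typed in a plain text editor, and one with curly quotes pasted in.
clean = b"User-agent: *\nDisallow: /cgi-bin/\n"
pasted = "User-agent: *\nDisallow: \u201c/cgi-bin/\u201d\n".encode("utf-8")

print(find_problems(clean))
print(find_problems(pasted))
```

To check a real file, read it in binary mode first, for example open("robots.txt", "rb").read(), and pass the result to the function.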

What do you put into a robots.txt file?

Well, the format of this file is universal and it should look like this:

User-agent: (this is the name of the spider or bot)

Disallow: (name of the file or folder you do not want the spider or bot to index)

The first line, "User-agent", says which crawler you want to address. You may address all crawlers by putting a "*" here, so your top line would be User-agent: * (this means apply the following rules to all agents). If you want a list of known user-agents, visit www.user-agents.org.

The second line, "Disallow", tells the agent(s) you are addressing which folders or pages they are not allowed to access/index. So if you do not want them to index your cgi-bin, you’d put Disallow: /cgi-bin/. Be sure to use the correct path to your folders here. If you wanted to tell the agent not to index any part of your blog, you would use Disallow: / instead. But this would mean your blog would not get ranked at all, so unless your blog is top secret, I would not use this!
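To see how the two lines work together, here is a small sketch using Python's built-in urllib.robotparser module, which reads rules much like a polite crawler would. The bot name BadBot and the address www.example.com are placeholders I made up; the rules shut BadBot out of everything with Disallow: / while every other crawler is only kept out of the cgi-bin.

```python
from urllib.robotparser import RobotFileParser

# Two groups of rules: one for a single named bot, one for everybody else.
# "BadBot" is a made-up name used purely for illustration.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Ask the parser whether `agent` may fetch `path` on a placeholder site."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("BadBot", "/index.php"))            # blocked everywhere
print(allowed("Googlebot", "/index.php"))         # normal pages are fine
print(allowed("Googlebot", "/cgi-bin/form.cgi"))  # cgi-bin is off limits
```

Real crawlers may read some edge cases differently from Python's parser, so treat this as a sanity check rather than a guarantee.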

Here is an example of a robots.txt file that will prevent all search engines indexing some of the folders that Wordpress creates:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
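
Before uploading, you can test rules like these yourself. The sketch below feeds the same rules to Python's built-in urllib.robotparser and asks whether a few paths may be crawled. The address www.example.com and the sample paths are placeholders I made up, so swap in your own.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
"""


def build_parser(robots_text):
    """Load robots.txt rules from a string instead of fetching them online."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, path, agent="Googlebot"):
    """Return True if `agent` may crawl `path` on the placeholder site."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


parser = build_parser(ROBOTS_TXT)
for path in ["/", "/2008/05/my-first-post/", "/wp-admin/options.php"]:
    verdict = "allowed" if is_allowed(parser, path) else "blocked"
    print(path, "is", verdict)
```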

Once you have created your file, save it as robots.txt and upload it via FTP to your blog’s host. You need to put this in your root directory - that is, the directory that contains your index.php or index.html file. This is normally /www or /public_html.
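
The reason the root directory matters is that crawlers ask for the file at the very top of your address, never inside a subfolder. As a rough illustration (robots_url is a helper name I made up, and the example.com address is a placeholder), this works out where a crawler would look for any given page:

```python
from urllib.parse import urlsplit


def robots_url(page_url):
    """Return the address where crawlers look for robots.txt for this page."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the top level.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("http://www.example.com/blog/2008/05/my-first-post.html"))
```

If opening that address in your browser shows the rules you typed, the file is in the right place.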

And there you have it, you are now telling the search engines where to go!

About the Author

Colin Lim has had several years’ experience making money online, starting with his own web hosting business that he set up in the early 1990’s. Colin has made money from several different methods including eBay, Affiliate Marketing, Pay Per Click, property, shares trading etc.
