
Robots.txt and Search Engine Spiders - SEO

Posted on: 11/26/2007 10:53:03 AM by Amitgupta007_99 | Category: SEO | Level: Intermediate
A spider is a program that crawls the web and fetches web pages for search engines. It can start virtually anywhere and reach other pages by following links.

When a search engine visits a web site, either through a submission or by following a link from one site to another, the search engine robot (also known as a crawler, agent, bot or spider) will look for a text file called robots.txt. The file normally resides in the root directory of the site, such as "www.abc.com/robots.txt". Robots.txt instructs spiders to visit or not to visit particular web pages of the website.
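Because the file sits in the root directory, its address can be worked out from any page of the site. The following is a minimal sketch of that idea in Python; the function name robots_txt_url and the sample page address are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the file always sits at the root,
    # so the page's own path, query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.abc.com/cars/index.htm?page=2"))
# prints http://www.abc.com/robots.txt
```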

A robots.txt file is just a simple text file; it doesn’t require any special type of formatting like font face, size etc. Make sure that the file is saved as (all lowercase) robots.txt.

The three most common items you will find in a robots.txt file are:
• allow
• disallow
• and the wildcard or asterisk: "*"

Normally you would use the "disallow" command to keep an engine out of certain areas of your site so that they are not indexed, while the "allow" command is usually redundant, since spiders will follow any link that you have not prohibited. Finally, the wildcard stands for all engines. Thus, if you had a folder called "images" under the main directory, such as "www.abc.com/images/", you might use the following code to disallow all spiders from that folder:

User-agent: *
Disallow: /images/
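
To see how a well-behaved spider might read these two lines, here is a small sketch using urllib.robotparser from the Python standard library. The spider name "ExampleBot" and the page addresses are made up for the example; real spiders use their own parsers and may behave differently.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as in the example above.
rules = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical spider name; the wildcard covers it.
image_allowed = parser.can_fetch("ExampleBot", "http://www.abc.com/images/logo.gif")
page_allowed = parser.can_fetch("ExampleBot", "http://www.abc.com/index.htm")

print(image_allowed)  # False: the file is inside the disallowed folder
print(page_allowed)   # True: nothing prohibits this page
```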

Keep the trailing slash when you really mean to block a folder and not individual files, as in:
Disallow: /images/
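
The trailing slash matters because a Disallow value is matched against the beginning of the requested path. The sketch below compares the two spellings with Python's urllib.robotparser; the helper is_allowed, the spider name "ExampleBot" and the file names are assumptions for illustration, and other parsers may differ in detail.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_path: str, url_path: str) -> bool:
    """Check url_path against a one-rule robots.txt that applies to all spiders."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser.can_fetch("ExampleBot", "http://www.abc.com" + url_path)


# With the trailing slash only the folder is blocked.
print(is_allowed("/images/", "/images/logo.gif"))  # False
print(is_allowed("/images/", "/images.htm"))       # True

# Without it, every path that starts with "/images" is blocked.
print(is_allowed("/images", "/images/logo.gif"))   # False
print(is_allowed("/images", "/images.htm"))        # False
```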

We also have to keep an eye (through log files or web analytics) on the different spiders that a single search engine may send, such as “Googlebot” and “Googlebot-Image”. We have to be very clear about which files and images need to be indexed by which spider; images, for example, need to be indexed by the image spider.
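
As a rough illustration of per-spider rules, the sketch below gives the image spider the whole site while keeping the general spider out of the images folder, and checks the result with Python's urllib.robotparser. The rules themselves are only an example, not a recommendation. Note that this particular parser compares names loosely, so the more specific group is listed first here; that ordering is a precaution for this parser, not a requirement of every spider.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: the image spider may fetch everything
# (an empty Disallow prohibits nothing), the general spider
# is kept out of /images/.
rules = """\
User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

image_url = "http://www.abc.com/images/logo.gif"
page_url = "http://www.abc.com/index.htm"

print(parser.can_fetch("Googlebot-Image", image_url))  # True
print(parser.can_fetch("Googlebot", image_url))        # False
print(parser.can_fetch("Googlebot", page_url))         # True
```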

Meta Tags and Robots.txt

<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
Indicates neither to index the webpage nor to follow its links.

<META name="ROBOTS" content="NOINDEX">
Indicates not to index the webpage.

<META name="ROBOTS" content="NOFOLLOW">
Indicates not to follow any links on the webpage.

<META name="ROBOTS" content="NOINDEX, FOLLOW">
Indicates to follow the links on the webpage but not to index the webpage.

<META name="ROBOTS" content="INDEX, NOFOLLOW">
Indicates to index the webpage but not to follow its links.
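
The combinations above can be read mechanically. The following sketch uses html.parser from the Python standard library to pull the ROBOTS meta tag out of a page and report whether indexing and link following are permitted. The class name RobotsMetaParser and the function robots_directives are hypothetical names chosen for this example, and the sketch assumes that a missing directive means "allowed".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # HTMLParser lowercases tag and attribute names, not values.
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str):
    """Return (may_index, may_follow) for the given page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


print(robots_directives('<META name="ROBOTS" content="NOINDEX, FOLLOW">'))
# prints (False, True)
```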

Outbound links & robots.txt

Outbound links are links from your webpage to other websites, and they pass some of your webpage's PR (PageRank) on to the pages they point to. The problem mainly involves websites that let outsiders or visitors post content, where people post useless content and links to their own websites to promote those websites or their products. You can discourage these attempts by adding rel="nofollow" to such links:

<a href="http://www.abc.com/cars.htm" rel="nofollow">the truth about cars.</a>
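
On a site that accepts visitor posts, such links are usually generated by code rather than typed by hand. Here is a minimal sketch, assuming a hypothetical helper named nofollow_link that your posting code would call for every visitor-supplied link; it also escapes the values so that a visitor cannot inject extra markup.

```python
from html import escape


def nofollow_link(url: str, text: str) -> str:
    """Build an anchor tag for a visitor-supplied link, marked rel="nofollow"."""
    safe_url = escape(url, quote=True)  # protect the attribute value
    safe_text = escape(text)            # protect the visible link text
    return f'<a href="{safe_url}" rel="nofollow">{safe_text}</a>'


print(nofollow_link("http://www.abc.com/cars.htm", "the truth about cars."))
# prints <a href="http://www.abc.com/cars.htm" rel="nofollow">the truth about cars.</a>
```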

Conclusion

Robots.txt is a vital part of any website. It can be compared with a city's traffic control system, so it is worth keeping it up to date with all the directions spiders need. Robots.txt can also help reduce spam and the penalties associated with duplicate content.
We humans risk our health to earn money and then give away money to earn our health back. In the same way, when we try to get indexed by every available spider, we also attract BAD agents: software-generated robots that can download a mirror of your website for plagiarism, or steal your clients by posting a similar website. We lose bandwidth, documents, images, AdSense money and prospective business.
So, we need to take control of robots.txt to save resources, minimize the risk of losing content, money and prospective business, and ENJOY the growth.

If you need any help, write to amit@r2ainformatics.com or buzz me on +91 9821376830

All the best!!!!


Amit P Gupta
Web Strategist



About Amit Gupta

Experience:4 year(s)
Home page:http://www.r2ainformatics.com
Member since:Monday, July 23, 2007
Biography:His experience covers a wide spectrum: SEO, analytics, consulting, technical editing and college instruction. Amit holds more than 3 technical certifications and has completed MCA. Amit may be reached at amit@r2ainformatics.com

Response(s) to this Article
Posted by: Animesh | Posted on: 04 Dec 2007 11:52:47 PM
Very informative article .

But I have some doubts.

Is the robots.txt file created automatically when we host our application, or does it not exist until we create it?

And if it is created automatically, then who is responsible for creating it?

Thanks

Animesh Misra
Posted by: ProgTalk | Posted on: 20 Apr 2008 08:50:46 PM
It is not created automatically. You usually have to create it manually and put it in your root directory.
Posted by: Raja | Posted on: 23 May 2008 07:53:02 AM
No, Animesh. The robots.txt file will not be created automatically. You will have to create it yourself.

There is a protocol you need to follow. You can get some information from http://www.robotstxt.org/ or http://www.google.com/webmasters/

