The Inner Workings of Robots, Spiders, and Web Crawlers
Gaining Control Over Robots
Lee Underwood

As robots visit your Web site, they follow every link and visit every directory unless you tell them otherwise. There are a few methods you can use to gain some control over what and where robots search. As I said earlier, most robots obey the robots.txt file and the robots meta tag. Let's take a look and see what these are and how they work.

Robot Meta Tags

Meta tags are used for different things: listing the date the page was created, the author, keywords, a description of the page, and instructions for robots. The robot meta tag tells the robot whether to index the current page and follow the links on it. It's useful when you don't have access to the robots.txt file. (Remember though, every Web page is potentially accessible.)

The meta tag is placed between the <head> and </head> tags. It accepts several values: all, none, index, noindex, follow, and nofollow. (Multiple values must be separated by commas.) The default, without the meta tag, is all, meaning the robot can index the current page and follow all the links on it. The none value means the robot is not to index the current page or follow any links on it. The index and noindex values tell the robot whether it can index the current page; follow and nofollow tell it whether it can follow the links on that page. To keep the current page from being indexed but still allow the links to be followed, you would use:

<meta name="robots" content="noindex,follow">

To stop the robot from indexing the current page and following the links, you would use:

<meta name="robots" content="noindex,nofollow">

It's important to remember that not all robots support this tag. Most search engines do, but it would be better to use the robots.txt file as it's more effective.
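To make the rules above concrete, here is a minimal Python sketch of how a robot might read the robots meta tag and decide whether it may index a page and follow its links. The class and function names (RobotsMetaParser, robots_meta_policy) are made up for this illustration, and real robots may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # Values are comma-separated, e.g. "noindex,follow".
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_meta_policy(html):
    """Return (may_index, may_follow); with no tag the default is "all"."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
may_index, may_follow = robots_meta_policy(page)
```

For the sample page, the sketch reports that the page should not be indexed but its links may be followed.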

The Robots.txt File

The Robots Exclusion Protocol was created to limit robot access to Web sites. However, it's not a mandatory protocol; robots follow it voluntarily. When a robot visits a Web site, it first looks for a file called "robots.txt" in the root directory, e.g., http://www.yoursite.com/robots.txt. The file will not work in any other directory. It must also be a plain-text (ASCII) file.
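Because the file is only ever looked for in the root directory, its address can be worked out from any page on the site. The following Python sketch shows one way to do that; the function name robots_txt_url and the sample page address are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.yoursite.com/articles/robots/page2.html?print=1")
```

Whatever page the robot starts from, the result points at the root of the same host.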

The format of the file is not too difficult to understand. Each entry or "record" in the file is separated by one or more blank lines. The first line of a record contains the User-agent: field, followed by the name of the robot to be excluded or an asterisk ("*"), meaning all robots, e.g.,

User-agent: EmailSiphon

User-agent: *

Following that, on the next lines, comes the list of directories that you don't want the robot to visit, one Disallow: line per directory, e.g.,

Disallow: /cgi-bin/
Disallow: /javascript/

To block a robot from your entire site, list the root directory by itself, i.e.,

Disallow: /
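The record layout described above (records separated by blank lines, a User-agent: line, then one or more Disallow: lines) can be sketched in a few lines of Python. This is only an illustration of the format, not a complete parser; the function name parse_records is hypothetical, and fields other than User-agent and Disallow are ignored.

```python
def parse_records(text):
    """Split robots.txt text into (user_agents, disallowed_paths) records."""
    records = []
    agents, paths = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current record.
            if agents:
                records.append((agents, paths))
            agents, paths = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            paths.append(value)
    if agents:  # the last record may not be followed by a blank line
        records.append((agents, paths))
    return records


sample = (
    "User-agent: EmailSiphon\n"
    "Disallow: /\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /cgi-bin/\n"
    "Disallow: /javascript/\n"
)
records = parse_records(sample)
```

Here the sample text yields two records: one barring EmailSiphon from the whole site, and one barring all robots from two directories.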

A robots.txt file would look something like the following:

User-agent: EmailSiphon
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /img/
Disallow: /style/css

In the above example, the robots EmailSiphon and CherryPicker are banned from the whole site (if they obey the rules). All other robots are banned from the "/cgi-bin/", "/javascript/", "/img/", and "/style/css" directories. Each path starts with a slash because it is matched against the beginning of the URL path.
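To check how a well-behaved robot would read a file like this, you can feed it to the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below uses the example file with every path written with a leading slash, since rules are matched as prefixes of the URL path. "SomeOtherBot" is a made-up robot name used only for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: EmailSiphon
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /img/
Disallow: /style/css
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no download

# EmailSiphon is barred from the whole site.
siphon_home = parser.can_fetch("EmailSiphon", "/index.html")

# Any other robot falls under the "*" record.
other_home = parser.can_fetch("SomeOtherBot", "/index.html")
other_cgi = parser.can_fetch("SomeOtherBot", "/cgi-bin/form.pl")
other_css = parser.can_fetch("SomeOtherBot", "/style/css/site.css")
```

In this sketch, siphon_home, other_cgi, and other_css come out False, while other_home comes out True. Keep in mind that this only models a robot that chooses to obey the file.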

That about covers it for the robots. For more information, check out the links below. (For the sake of interest, take a look at Microsoft's robots.txt file.)

Major Search Engine Robots
Miscellaneous Links
Spambots
Software

Tutorial adapted from WebReference


Contents:
1. Getting Up to Speed with Webbots
2. Gaining Control Over Robots





