HTML Extensions
Beyond broken and poorly formed HTML, a further problem is that some browser vendors have extended the basic W3C HTML specifications by adding their own tags. Search engine spiders may not support such non-standard tags, so pages that display well in Internet Explorer may not be readable by the spiders.
Robots.txt
The robots.txt file is a text file placed in the root folder of your web site that tells the search engines which pages on your site you would prefer not to have indexed. In general, the major spiders will obey your request and not index those pages. However, compliance is voluntary, so blocking pages with a robots.txt file will not necessarily stop email harvesters from reading them.
A major problem can arise when you write your robots.txt file in such a way that you accidentally block all or part of your site from the spiders.
The basic robots.txt file consists of one or more lines of text as follows:
User-agent: *     # the spider the rules apply to
Disallow: /tmp    # what is to be disallowed
Disallow: /logs   # what is to be disallowed
In the previous example, all spiders (User-agent: *) are requested not to index any pages whose paths start with /tmp or /logs. The catch is that each Disallow value is matched as a prefix, as if it ended with a wildcard. Blocking /logs therefore stops the spiders from indexing /logs, /logs/log1.txt, and /logsee.php, even though you may not have meant to block /logsee.php.
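The prefix behavior can be illustrated with a minimal Python sketch. The function name is_blocked is hypothetical and this is not how any particular spider is implemented; it only assumes that each Disallow value is compared against the start of the URL path.

```python
def is_blocked(path, disallow_rules):
    """Return True if the path starts with any Disallow value (plain prefix match)."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(path.startswith(rule) for rule in disallow_rules if rule)


rules = ["/tmp", "/logs"]  # the Disallow values from the example above

for path in ["/logs", "/logs/log1.txt", "/logsee.php", "/index.html"]:
    print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

Under the same assumption, writing the rule as /logs/ with a trailing slash would cover files inside that folder without matching /logsee.php, although it would also no longer match the bare path /logs.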
Some search engines have also extended the robots.txt specification to allow pattern matching. With pattern matching, instead of looking for a simple prefix match against the URL, wildcard characters are introduced to allow more flexible matching. These include "*" for any sequence of characters and "$" to mean the end of the URL. So, for instance, the line "Disallow: /abc*?$" means, for Google, disallow all URLs that start with "/abc" and end with a "?". For search engines that do not support the extension, the characters are taken literally, and the line means disallow all URLs that start with "/abc*?$".
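The difference between the two interpretations can be sketched in Python. This is an illustration of the rules as described above, not the code any search engine actually runs, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any sequence".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing $, the pattern only has to match a prefix.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches_with_wildcards(path, pattern):
    """Interpretation for spiders that support pattern matching."""
    return pattern_to_regex(pattern).match(path) is not None


def matches_literal_prefix(path, pattern):
    """Interpretation for spiders that treat * and $ as ordinary characters."""
    return path.startswith(pattern)


pattern = "/abc*?$"
for path in ["/abcdef?", "/abcdef?x=1", "/abc*?$/page"]:
    print(path,
          matches_with_wildcards(path, pattern),
          matches_literal_prefix(path, pattern))
```

The same line in a robots.txt file can therefore block quite different sets of URLs depending on which spider reads it.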
In addition, some, but not all, spiders support the "Allow:" command. The Allow command has the same syntax as Disallow but explicitly tells the search engine spiders they can index the page referenced.
Google gives an example robots.txt file that will block all robots except Googlebot from indexing your site:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Because these additions are non-standard, they often cause confusion.
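One way to check how a robots.txt file is likely to be read is to run it through a parser before publishing it. The sketch below uses the urllib.robotparser module from Python's standard library, which understands Allow lines, to test the Google example above. The spider name OtherBot and the www.example.com URL are placeholders chosen for illustration, and real spiders may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, supplied as text rather than fetched from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url="http://www.example.com/index.html"):
    """Ask the parser whether the named spider may fetch the URL."""
    return parser.can_fetch(agent, url)


print("Googlebot:", may_fetch("Googlebot"))
print("OtherBot:", may_fetch("OtherBot"))
```

Running the same check with the names of the spiders you care about is a cheap way to catch a rule that accidentally blocks more of the site than intended.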
Comments