Saturday, September 20, 2008

Spider spotting

You can monitor and evaluate how well your search engine submissions are working with two methods: spider spotting and URL checks.

Search engine spiders that visit your site and crawl its pages leave distinctive traces in your access log. These traces tell you whether a spider has visited, which pages it requested, and how often and for how long it visited.

The best way to identify spider visits is to find out which visitors requested the file robots.txt from your site. In practice only spiders request this file, because it tells them which pages they should not crawl, so checking for it is normally the first thing a crawler does. If you analyze your access log with a log analysis tool, you can spot all the visits that began with this request.
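As a rough illustration of that log search, the sketch below scans access log lines and collects every client that asked for robots.txt. It assumes the server writes the common or combined log format; the names LOG_LINE, robots_requesters and SAMPLE_LOG, the addresses and the agent string "ExampleBot/1.0" are all made up for this example.

```python
import re

# Assumed layout (common/combined log format):
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def robots_requesters(lines):
    """Return {host: agent} for every client that requested /robots.txt."""
    found = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed format
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        if path.lower() == "/robots.txt":
            # agent is missing when the log uses the plain common format
            found[m.group("host")] = m.group("agent") or ""
    return found

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [20/Sep/2008:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [20/Sep/2008:10:00:02 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [20/Sep/2008:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    for host, agent in robots_requesters(SAMPLE_LOG).items():
        print(host, agent)
```

The hosts returned here are only candidates. Each one still has to be related to a search engine by its host name or agent name.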

You can then note the host name and relate it to the major search engines. Host names usually contain the search engine company’s name, since they name the site that hosts the spider. The other identifier is the agent (browser) name that each search engine's spider uses. Get a list of host names and agent names from available resources (these names tend to change often), and build your own list by searching your access logs for all occurrences of known engine, host or agent names. Concentrate on the top engines, even though you may find several smaller, lesser-known search engines visiting your site.
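One way to keep such a list is as a small lookup table that is checked against each visitor's agent and host name. The sketch below uses only name fragments mentioned in this article (Architextspider, atex, Slurp, Lycos Spider). The table layout, the function name identify_engine and the sample agent strings are assumptions for illustration, and real names change often.

```python
# (engine, agent name fragments, host name fragments), all lower case.
# Fragments come from this article; extend the table from your own logs.
KNOWN_SPIDERS = [
    ("Excite", ["architextspider"], ["atex"]),
    ("Inktomi", ["slurp"], []),
    ("Lycos", ["lycos spider"], []),
]

def identify_engine(agent="", host=""):
    """Return the engine whose known fragment appears in agent or host, else None."""
    # Assumption: treat underscores as spaces so "Lycos_Spider" also matches.
    agent_lc = agent.lower().replace("_", " ")
    host_lc = host.lower()
    for engine, agent_parts, host_parts in KNOWN_SPIDERS:
        if any(part in agent_lc for part in agent_parts):
            return engine
        if any(part in host_lc for part in host_parts):
            return engine
    return None

if __name__ == "__main__":
    print(identify_engine(agent="Mozilla/3.0 (Slurp/si)"))
    print(identify_engine(agent="Mozilla/4.0", host="dialup.example.net"))
```

Plain substring matching is simple but can give false positives with short fragments such as "atex", so unexpected matches are worth checking by hand.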

Pay attention not only to the total number of visits but also to the activity pattern of each recent visit, to judge how many pages the spider actually covered. This is a very good way to check whether your submissions have worked, or whether other inducements such as links from other sites have worked. It also helps you evaluate separately the effectiveness of the submission, indexing and page ranking characteristics of your site.
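To look at the activity pattern rather than the raw visit count, the requests of each known spider host can be grouped and the distinct pages listed. The sketch below treats all requests from one host on one calendar day as a single visit, which is a simplifying assumption. The log layout (common log format), the names REQUEST and pages_per_visit, and the sample lines are likewise hypothetical.

```python
import re
from collections import defaultdict

# Assumed common log format; captures host, day (text before the first ':'), path.
REQUEST = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:\]]+)[^\]]*\] "\S+ (?P<path>[^\s"]+)'
)

def pages_per_visit(lines, spider_hosts):
    """Map (host, day) to the sorted distinct paths that spider host requested."""
    pages = defaultdict(set)
    for line in lines:
        m = REQUEST.match(line)
        if m and m.group("host") in spider_hosts:
            pages[(m.group("host"), m.group("day"))].add(m.group("path"))
    return {key: sorted(paths) for key, paths in pages.items()}

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [20/Sep/2008:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    '192.0.2.10 - - [20/Sep/2008:10:00:02 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.10 - - [20/Sep/2008:10:00:05 +0000] "GET /about.html HTTP/1.0" 200 2048',
    '192.0.2.10 - - [21/Sep/2008:09:30:00 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '198.51.100.7 - - [20/Sep/2008:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
]

if __name__ == "__main__":
    visits = pages_per_visit(SAMPLE_LOG, {"192.0.2.10"})
    for (host, day), paths in sorted(visits.items()):
        print(host, day, len(paths), "pages:", ", ".join(paths))
```

Comparing the page lists from before and after a submission, or after a new inbound link appears, gives a rough sense of whether the spider is covering more of the site.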

Some examples of host names and agent names are given below:

• AltaVista: hostname may have within its name; agent is often called

• Excite: host name may have atex or and agent name is Architextspider.

• Inktomi: agent and host names have and Slurp is often used as the agent name.

• Lycos: uses within its host name and Lycos Spider is often part of the agent name.