Thursday, September 18, 2008

How Inktomi Works





Inktomi is one of the most popular crawler-based search engines. Unlike other crawler-based

engines such as Lycos or AllTheWeb, however, it does not make its index available to the

public through a site of its own. Instead, Inktomi licenses its search index to other

companies, which can then provide search services to their visitors without having to

build an index of their own.





Inktomi uses a robot named Slurp to crawl and index web pages.






Slurp – The Inktomi Robot


Slurp collects documents from the web to build a searchable index for search services

that use the Inktomi search engine, including Microsoft's MSN Search and HotBot. Some of

Slurp's characteristics are given below:


Frequency of accesses



Slurp accesses a website about once per minute on average. Since network delays are

involved, the rate may appear slightly higher over short periods, but the average

frequency generally remains at or below one access per minute.



robots.txt



Slurp obeys the Robots Exclusion Standard (RES), specifically the 1994 version. Where the

1996 proposed standard disambiguates the 1994 standard, the proposed standard is

followed.



Slurp will obey the first record in the robots.txt file with a User-Agent containing

"Slurp". If there is no such record, it will obey the first entry with a User-Agent of "*".



This is discussed in detail later in this book.
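To make this concrete, here is a minimal robots.txt sketch; the disallowed paths are hypothetical. Slurp would apply the first record, which names it, and ignore the wildcard record that other robots would fall back on:

    # Record matched by Slurp (User-Agent contains "Slurp")
    User-agent: Slurp
    Disallow: /cgi-bin/
    Disallow: /private/

    # Fallback record for all other robots
    User-agent: *
    Disallow: /cgi-bin/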



NOINDEX meta-tag



Slurp obeys the NOINDEX meta-tag. If you place

    <meta name="robots" content="noindex">

in the head of your web document, Slurp will retrieve the document, but it will not index

it or place it in the search engine's database.



Repeat downloads



In general, Slurp downloads only one copy of each file from your site during a given

crawl. Occasionally the crawler is stopped and restarted, and it re-crawls pages it has

recently retrieved. These re-crawls happen infrequently and should not be cause for

alarm.



Searching the results



Documents that Slurp retrieves are passed to the Inktomi search engines, where they are

indexed and entered into the search database soon after being crawled.



Following links



Slurp follows HREF links. It does not follow SRC links. This means that Slurp does not

retrieve or index individual frames referred to by SRC links.
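As an illustration (the URLs below are hypothetical), Slurp would follow the anchor's HREF but would not fetch the frame documents referenced by SRC attributes:

    <!-- Followed: an ordinary hyperlink using HREF -->
    <a href="http://www.example.com/products.html">Products</a>

    <!-- Not followed: frames loaded via SRC -->
    <frameset cols="20%,80%">
      <frame src="http://www.example.com/menu.html">
      <frame src="http://www.example.com/content.html">
    </frameset>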



Dynamic links



Slurp is able to crawl dynamic links and dynamically generated documents, but it will not

do so by default. There are good reasons for this: dynamically generated documents can

form effectively infinite URL spaces, and dynamically generated links and documents can

be different on every retrieval, so there is little value in indexing them.
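For example (hypothetical URLs), a script-driven calendar or a session-based catalogue can produce an endless series of distinct-looking addresses that all lead to generated content:

    http://www.example.com/calendar.cgi?month=10&year=2008
    http://www.example.com/calendar.cgi?month=11&year=2008   (and so on, indefinitely)
    http://www.example.com/catalog.asp?item=42&sessionid=8F3A91C2
    http://www.example.com/catalog.asp?item=42&sessionid=77B0D5E4   (same page, new session ID)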



Content guidelines for Inktomi



Given below are Inktomi's content guidelines and policies; in other words, the kinds of

content Inktomi indexes and the kinds it avoids.



Inktomi indexes:

- Original and unique content of genuine value
- Pages designed primarily for humans, with search engine considerations secondary
- Hyperlinks intended to help people find interesting, related content, when applicable
- Metadata (including title and description) that accurately describes the contents of a Web page (see the sketch after this list)
- Good Web design in general
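As a small, hypothetical sketch of descriptive metadata, the title and description below match what the page actually contains rather than targeting unrelated search terms:

    <head>
      <title>Handmade Oak Furniture - Example Woodworks</title>
      <meta name="description"
            content="Catalogue of handmade oak tables, chairs and bookcases,
                     with prices, dimensions and ordering information.">
    </head>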

