How Inktomi Works
Inktomi is one of the most popular crawler-based search engines. Unlike other crawler-based engines such as Lycos or AllTheWeb, however, it does not make its index available to the public through its own site. Instead, Inktomi licenses its search index to other companies, which can then provide search services to their visitors without having to build their own index. Inktomi uses a robot named Slurp to crawl and index web pages.
Slurp – The Inktomi Robot
Slurp collects documents from the web to build a searchable index for search services powered by the Inktomi search engine, including those of Microsoft and HotBot. Some of the characteristics of Slurp are given below:
Frequency of accesses
Slurp accesses a website no more than once every five seconds on average. Because network delays are involved, the rate may appear slightly higher over short periods, but the average frequency generally remains at or below one access every five seconds.
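This politeness policy is easy to picture in code. The sketch below is not Inktomi's implementation, just a minimal Python illustration of a per-host rate limiter that enforces a five-second interval (the function and variable names are invented for this example):

    import time

    MIN_INTERVAL = 5.0   # seconds between requests to any one host
    last_access = {}     # host -> time of the most recent request

    def polite_fetch(host, fetch):
        # Sleep until at least MIN_INTERVAL has passed since the
        # previous request to this host, then perform the fetch.
        wait = last_access.get(host, 0.0) + MIN_INTERVAL - time.time()
        if wait > 0:
            time.sleep(wait)
        last_access[host] = time.time()
        return fetch(host)

A crawler that routes every request through such a gate stays under the advertised rate no matter how many of a site's pages are queued.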
robots.txt
Slurp obeys the Robots Exclusion Standard. Specifically, it adheres to the 1994 Robots Exclusion Standard (RES); where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.
Slurp obeys the first record in the robots.txt file whose User-Agent line contains "Slurp". If there is no such record, it obeys the first record with a User-Agent of "*".
This is discussed in detail later in this book.
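As an illustration (the paths here are hypothetical), a robots.txt file such as the following keeps Slurp out of /cgi-bin/ while restricting all other robots only from /private/:

    User-agent: Slurp
    Disallow: /cgi-bin/

    User-agent: *
    Disallow: /private/

Because a record naming Slurp exists, Slurp applies that record and ignores the wildcard record entirely; it remains free to crawl /private/.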
NOINDEX meta-tag
Slurp obeys the NOINDEX meta-tag. If you place <meta name="robots" content="noindex"> in the head of your web document, Slurp will retrieve the document, but it will not index the document or place it in the search engine's database.
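A minimal placement sketch (the page content is invented for this example):

    <html>
    <head>
    <title>Internal draft page</title>
    <meta name="robots" content="noindex">
    </head>
    <body>This page will be fetched by Slurp but kept out of the index.</body>
    </html>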
Repeat downloads
In general, Slurp downloads only one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it re-crawls pages it has recently retrieved. These re-crawls happen infrequently and should not be any cause for alarm.
Searching the results
Documents that Slurp crawls are passed to the Inktomi search engines immediately, where they are indexed and entered into the search database shortly afterwards.
Following links
Slurp follows HREF links. It does not follow SRC links. This means that Slurp does not
retrieve or index individual frames referred to by SRC links.
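For example (the URLs are hypothetical), Slurp would follow the first link below but not the frame references:

    <a href="http://www.example.com/page.html">Followed: an HREF link</a>

    <frameset cols="50%,50%">
    <frame src="http://www.example.com/left.html">
    <frame src="http://www.example.com/right.html">
    </frameset>

The two frame documents are referred to only by SRC attributes, so Slurp neither retrieves nor indexes them unless they are also linked via HREF somewhere else.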
Dynamic links
Slurp has the ability to crawl dynamic links and dynamically generated documents, but it does not crawl them by default. There are good reasons for this: dynamically generated documents can form infinite URL spaces, and dynamically generated links and documents can differ on every retrieval, so there is little value in indexing them.
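A hypothetical illustration of an infinite URL space: a calendar script whose every page links to the "next month" page generates an endless chain of crawlable URLs, even though the site has a finite amount of real content.

    http://www.example.com/calendar.cgi?month=1
    http://www.example.com/calendar.cgi?month=2
    http://www.example.com/calendar.cgi?month=3
    ...

A crawler that followed such links blindly would never finish the site, which is one reason dynamic URLs are skipped by default.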
Content guidelines for Inktomi
Given here are the content guidelines and policies for Inktomi; in other words, listed below is the kind of content Inktomi prefers to index.
Inktomi indexes:
- Original and unique content of genuine value
- Pages designed primarily for humans, with search engine considerations secondary
- Hyperlinks intended to help people find interesting, related content, when applicable
- Metadata (including title and description) that accurately describes the contents of a Web page (see the example after this list)
- Good Web design in general
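For instance (the values are invented for this example), metadata that accurately describes a page looks like this:

    <head>
    <title>Inktomi Crawler FAQ</title>
    <meta name="description" content="How the Slurp robot crawls the web,
    what it indexes, and how to control it with robots.txt and meta tags.">
    </head>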