What AltaVista Doesn't Index
AltaVista doesn't index everything. In fact, features that Web designers may add to sites
at great expense may block crawlers, meaning that those pages will never be indexed and
never be found through search engines. As a result, those sites may end up spending far
more on promotion than they would have had to otherwise.
Here are some kinds of pages AltaVista doesn't index. The list underscores the importance of
using plain text for your web pages.
First, sites that require any kind of registration or password lock out AltaVista. Keep in
mind that a web crawler cannot fill out a form of any kind. If you need to fill out a form
to get to the next page, the crawler halts right there. If you would like to gather
information about your users/members but would also like your pages to be indexed,
make the registration optional.
Similarly, the AltaVista crawler cannot get content from a database, because it cannot fill
out a form. If the content of your database is largely text, you might consider creating
plain text static HTML pages with that same content, so it can be indexed and found.
Dynamic pages also block AltaVista spiders. While it's great to give visitors to your site
unique experiences, tailored to their needs, the techniques you use to do that could stop
most search engines including AltaVista from indexing your content and hence could
greatly reduce your potential traffic. Dynamically generated pages are created on the fly
from a variety of elements held in databases. When the AltaVista crawler arrives at such
a page, it captures the content but halts immediately and will not follow the links,
because it sees an infinite number of pages ahead -- a black hole that would
bring it to a crash.
Active Server Pages (.asp) with question marks in their URLs (indicating that the page is
a script for the construction of a page, rather than just static content) fall into this
category.
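As a rough way to spot which of your own addresses fall into this category, the sketch below flags any URL that carries a query string (the part after a question mark). The function name and the example.com addresses are made up for illustration, and the check only mirrors the "question mark in the URL" rule of thumb described above; it is not how AltaVista's crawler actually decides.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a non-empty query string after a '?'."""
    return bool(urlsplit(url).query)


# Hypothetical addresses, used only to show the check in action.
urls = [
    "http://www.example.com/products/widgets.html",
    "http://www.example.com/catalog.asp?item=42",
]
for url in urls:
    label = "dynamic" if looks_dynamic(url) else "static"
    print(f"{label}: {url}")
```

Pages flagged this way are candidates for a static HTML copy, as suggested for database content earlier.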
If you have information inside frames, that will probably prove to be a hindrance, but is
not an absolute barrier. AltaVista indexes the outside of the frame as a distinct page. It
will also index each pane of the frame window as a separate page. That means that if the
content matching a query is in a pane, visitors who click through from the search results
will see the pane and only the pane -- not the full page as it was designed. So if you want
visitors from search engines to experience your pages the way they were intended to be seen,
you should have non-frames as well as frames versions of those pages, and submit the
non-frames versions with Add URL.
AltaVista also can't index text that is embedded in graphics. Search engines simply
cannot "see" the text unless the Webmaster puts ALT text behind the picture, describing it
and listing those important words. But pictures, as pictures, can be indexed for Image
search at AltaVista.
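To find pictures on a page that have no ALT text, a small script along these lines may help. It is a minimal sketch using Python's standard html.parser module; the class name, function name, and sample markup are invented for the example.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag whose ALT text is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            attr_map = dict(attrs)
            alt = attr_map.get("alt")
            if not alt or not alt.strip():
                self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images that carry no usable ALT text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical markup: one image with ALT text, two without.
sample = (
    '<img src="logo.gif" alt="Acme Widgets">'
    '<img src="banner.gif">'
    '<IMG SRC="spacer.gif" ALT="">'
)
print(images_missing_alt(sample))
```

Purely decorative images such as spacers may reasonably be left without descriptive ALT text, so treat the output as a list to review rather than a list of errors.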
Text that appears in multi-media files (audio and video) cannot be indexed. But those
same files can be indexed at AltaVista for Multimedia search.
Information that is generated by Java applets or in XML coding cannot be indexed.
Acrobat files cannot be indexed either. But technology exists that will enable AltaVista to
convert those files to indexable form.
Exceptionally large pages also present a problem at AltaVista. As a pragmatic
compromise, intended to help optimize the performance of AltaVista, they fully index the
first 64 Kbytes of text on any single page. They will harvest the hyperlinks from the
whole document for following up later, but they will only index the first 64 Kbytes. So if
you want to post an entire book, it's best to break it up into chapters, and then all the text
can be indexed.
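Breaking a long text into parts can be sketched as follows. This assumes that "64 Kbytes" means 64 * 1024 bytes of UTF-8 text and that markup is ignored; neither detail is stated above, so adjust the limit if you want a safety margin. The function name is hypothetical.

```python
LIMIT_BYTES = 64 * 1024  # assumption: 64 Kbytes taken as 65,536 bytes of text


def split_for_indexing(paragraphs, limit=LIMIT_BYTES):
    """Group paragraphs into parts whose encoded size stays within limit.

    A single paragraph larger than the limit still becomes its own
    (oversized) part; this sketch does not split inside a paragraph.
    """
    parts, current, size = [], [], 0
    for para in paragraphs:
        para_size = len(para.encode("utf-8")) + 1  # +1 for the joining newline
        if current and size + para_size > limit:
            parts.append("\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_size
    if current:
        parts.append("\n".join(current))
    return parts


# Five made-up paragraphs of 30,000 characters each.
book = ["x" * 30000] * 5
for number, part in enumerate(split_for_indexing(book), start=1):
    print(f"part {number}: {len(part.encode('utf-8'))} bytes")
```

In practice you would split at chapter boundaries, as suggested above, and use a check like this only to confirm that each chapter stays under the limit.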
HTML comments (text placed between <!-- and -->) aren't indexed at all. Those are
intended as private communications, not viewable by Web site visitors, except by using
View/Page Source.
Also, consider technical factors. If a site has a slow connection, it might time out for the
crawler. Very complex pages, too, may time out before the crawler can harvest the text.
If you have a hierarchy of directories at your site, put the most important information
high, not deep. AltaVista will presume that the higher you placed the information, the
more important it is. And crawlers may not venture deeper than three or four or five
directory levels.
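To see how deep a given page sits, you can count the directories in its address. The sketch below is one simple way to do that; the function name and sample URLs are invented, and the threshold of three levels is just the low end of the range mentioned above, not a documented crawler setting.

```python
from urllib.parse import urlsplit


def directory_depth(url: str) -> int:
    """Count the directories between the site root and the page itself."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # If the path does not end in "/", treat the last segment as the file name.
    if segments and not path.endswith("/"):
        segments.pop()
    return len(segments)


MAX_SAFE_DEPTH = 3  # assumption: the cautious end of "three or four or five"

# Hypothetical addresses for illustration.
for url in [
    "http://www.example.com/index.html",
    "http://www.example.com/docs/guides/2000/spring/archive/notes.html",
]:
    depth = directory_depth(url)
    note = "may be too deep" if depth > MAX_SAFE_DEPTH else "ok"
    print(f"depth {depth} ({note}): {url}")
```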
Above all remember the obvious - full-text search engines such as AltaVista index text.
You may well be tempted to use fancy and expensive design techniques that either block
search engine crawlers or leave your pages with very little plain text that can be indexed.
Ranking Rules
The simple rule of thumb is that content counts, and that content near the top of a page
counts for more than content at the end. In particular, the HTML title and the first couple
lines of text are the most important part of your pages. If the words and phrases that
match a query happen to appear in the HTML title or first couple lines of text of one of
your pages, chances are very good that that page will appear high in the list of search
results.
AltaVista bases its ranking on both static factors (a computation of the value of a page
independent of any particular query) and query-dependent factors.
It values:
Long pages, which are rich in meaningful text (not randomly generated letters and
words).
Pages that serve as good hubs, with lots of links to pages that have related
content (topic similarity, rather than random meaningless links, such as those
generated by link exchange programs or intended to generate a false impression of
"popularity").
The connectivity of pages, including not just how many links there are to a page
but where the links come from: the number of distinct domains and the "quality"
ranking of those particular sites. This is calculated for the site and also for
individual pages. A site or a page is "good" if many pages at many different sites
point to it, and especially if many "good" sites point to it.
The level of the directory in which the page is found. Higher is considered more
important. If a page is buried too deep, the crawler simply won't go that far
and will never find it.
These static factors are recomputed about once a week, and new good pages slowly
percolate upward in the rankings. Note that there are advantages to having a simple
address and sticking to it, so others can build links to it, and so you know that it's in the
index.
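The connectivity factor in the list above depends on how many distinct domains link to a page, not just on the raw number of links. If you have a list of pages that link to you (from your server's referrer logs, for example), a sketch like the following counts the distinct sources. It treats each host name as a "domain", which is a simplifying assumption, and it says nothing about the "quality" weighting, which is not described in enough detail here to reproduce.

```python
from urllib.parse import urlsplit


def distinct_linking_hosts(linking_urls):
    """Return the set of distinct host names found in the linking URLs."""
    hosts = set()
    for url in linking_urls:
        host = urlsplit(url).hostname  # already lowercased by urlsplit
        if host:
            hosts.add(host)
    return hosts


# Hypothetical referrers: three links, but only two distinct hosts.
links = [
    "http://a.example.com/links.html",
    "http://a.example.com/more-links.html",
    "http://B.example.org/resources/",
]
hosts = distinct_linking_hosts(links)
print(f"{len(links)} links from {len(hosts)} distinct hosts")
```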
Query-dependent factors include:
The HTML title.
The first lines of text.
Query words and phrases appearing early in a page rather than late.
Meta tags, which are treated as ordinary words in the text, but weighted like words that
appear early in the text (unless the meta tags are patently unrelated to the content
on the page itself, in which case the page will be penalized).
Words mentioned in the "anchor" text associated with hyperlinks to your pages.
(E.g., if lots of good sites link to your site with anchor text "breast cancer" and the
query is "breast cancer," chances are good that you will appear high in the list of
matches.)