Thursday, September 18, 2008

What AltaVista doesn’t Index



AltaVista doesn't index everything. In fact, features that Web designers may add to sites

at great expense may block crawlers, meaning that those pages will never be indexed and

never be found through search engines. As a result, those sites may end up spending far

more on promotion than they would have had to otherwise.



Here are some pages AltaVista doesn’t index. This only highlights the importance of

using plain text for your web pages.






First, sites that require any kind of registration or password lock out AltaVista. Keep in

mind that a web crawler cannot fill out a form of any kind. If you need to fill out a form

to get to the next page, the crawler halts right there. If you would like to gather

information about your users/members but would also like your pages to be indexed,

make the registration optional.



Similarly, the AltaVista crawler cannot get content from a database, because it cannot fill

out a form. If the content of your database is largely text, you might consider creating

plain text static HTML pages with that same content, so it can be indexed and found.
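
As a rough illustration of that idea, here is a minimal Python sketch that turns database records into plain static HTML pages plus an index page that links to them, so a crawler can reach every page by following ordinary links. The record fields (slug, title, body) and the function names are hypothetical, chosen only for this example; a real site would read the records from its own database.

```python
import html
import tempfile
from pathlib import Path

def render_page(title, body):
    """Build a minimal static HTML page; the text is escaped so it stays plain text."""
    return (
        "<html><head><title>{t}</title></head>\n"
        "<body><h1>{t}</h1>\n<p>{b}</p></body></html>\n"
    ).format(t=html.escape(title), b=html.escape(body))

def export_records(records, out_dir):
    """Write one .html file per record, plus an index page linking to all of them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for rec in records:
        name = rec["slug"] + ".html"
        page = render_page(rec["title"], rec["body"])
        (out / name).write_text(page, encoding="utf-8")
        links.append('<li><a href="%s">%s</a></li>' % (name, html.escape(rec["title"])))
    index = (
        "<html><head><title>Index</title></head><body><ul>\n"
        + "\n".join(links)
        + "\n</ul></body></html>\n"
    )
    (out / "index.html").write_text(index, encoding="utf-8")
    return len(links)

# Hypothetical records standing in for rows pulled from a database.
records = [
    {"slug": "widgets", "title": "Widgets", "body": "All about our widgets."},
    {"slug": "gadgets", "title": "Gadgets & Gizmos", "body": "All about our gadgets."},
]

with tempfile.TemporaryDirectory() as tmp:
    count = export_records(records, tmp)
    print(count, "pages written:", sorted(p.name for p in Path(tmp).iterdir()))
```

The index page matters as much as the pages themselves: without plain links pointing at the static files, the crawler has no way to discover them.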



Dynamic pages also block AltaVista spiders. While it's great to give visitors to your site

unique experiences, tailored to their needs, the techniques you use to do that could stop

most search engines including AltaVista from indexing your content and hence could

greatly reduce your potential traffic. Dynamically generated pages are created on the fly

from a variety of elements held in databases. When the AltaVista crawler arrives at such

a page, it captures the content but halts immediately, and will not follow the links,

because it sees an infinite number of pages ahead -- a black hole that would

bring it to a crash.



Active Server Pages (.asp) with question marks in their URLs (indicating that the page is

a script for the construction of a page, rather than just static content) fall into this

category.
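
If you want to check which of your own addresses would fall into this category, a small Python sketch like the one below can flag URLs that carry a query string. The function name and the sample URLs are made up for illustration, and the test is only an approximation of what a crawler might do: a URL that ends in a bare "?" with nothing after it is treated as static here.

```python
from urllib.parse import urlsplit

def looks_dynamic(url):
    """Return True if the URL has a non-empty query string after a '?'."""
    return bool(urlsplit(url).query)

# Hypothetical sample addresses.
urls = [
    "http://www.example.com/products/widgets.html",
    "http://www.example.com/catalog.asp?item=42",
]

for url in urls:
    label = "dynamic" if looks_dynamic(url) else "static"
    print(label, url)
```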








If you have information inside frames, that will probably prove to be a hindrance, but is

not an absolute barrier. AltaVista indexes the outside of the frame as a distinct page. It

will also index each pane of the frame window as a separate page. That means that if the

content matching a query is in a pane, visitors clicking on those links will see the

pane and only the pane -- not the full page as it was designed. So if you want visitors

from search engines to experience your pages the way they were intended to be seen, you

should have non-frames as well as frames versions of those pages, and submit the

non-frames versions with Add URL.



AltaVista also can't index text that is embedded in graphics. Search engines simply

cannot "see" the text unless the Webmaster puts ALT text behind the picture, describing it

and listing those important words. But pictures, as pictures, can be indexed for Image

search at AltaVista.
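
To find pictures on your own pages that have no ALT text, something like the following Python sketch may help. It uses the standard library HTML parser to list every img tag whose alt attribute is missing or empty. The class name, function name, and sample markup are invented for this example.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt attribute is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if not (attr_map.get("alt") or "").strip():
                self.missing.append(attr_map.get("src") or "(no src)")

def images_missing_alt(page_html):
    finder = MissingAltFinder()
    finder.feed(page_html)
    finder.close()
    return finder.missing

# Hypothetical page fragment: the first image has no ALT text, the second does.
sample = (
    '<img src="logo.gif">'
    '<img src="sale.gif" alt="Spring sale: 20% off garden tools">'
)
print(images_missing_alt(sample))
```

Any file name the sketch reports is a picture whose words a text crawler cannot see.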



Text that appears in multimedia files (audio and video) cannot be indexed. But those

same files can be indexed at AltaVista for Multimedia search.



Information that is generated by Java applets or in XML coding cannot be indexed.

Acrobat files cannot be indexed either. But technology exists that will enable AltaVista to

convert those files to indexable form.



Exceptionally large pages also present a problem at AltaVista. As a pragmatic

compromise, intended to help optimize the performance of AltaVista, they fully index the

first 64 Kbytes of text on any single page. They will harvest the hyperlinks from the

whole document for following up later, but they will only index the first 64 Kbytes. So if

you want to post an entire book, it's best to break it up into chapters, and then all the text

can be indexed.
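
Here is a minimal Python sketch of how a long text could be broken into pieces that each stay within the 64 Kbyte figure. It groups whole paragraphs together and measures size in UTF-8 bytes; both of those choices, and the function name, are assumptions made for this example. It counts only the text, so the HTML markup you wrap around each piece adds to the file size.

```python
LIMIT_BYTES = 64 * 1024  # the 64 Kbyte figure mentioned above

def split_by_size(paragraphs, limit=LIMIT_BYTES):
    """Group paragraphs into chunks whose UTF-8 size stays within `limit` bytes.

    A single paragraph that is larger than the limit ends up as its own
    oversized chunk; this sketch does not split inside a paragraph.
    """
    chunks, current, size = [], [], 0
    for para in paragraphs:
        # Count the paragraph plus the two-character blank-line separator.
        para_size = len(para.encode("utf-8")) + 2
        if current and size + para_size > limit:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_size
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Hypothetical "book": 200 paragraphs of 1,000 characters each.
book = ["x" * 1000 for _ in range(200)]
parts = split_by_size(book)
print(len(parts), "parts; largest is", max(len(p.encode("utf-8")) for p in parts), "bytes")
```

In practice you would more likely split at chapter headings, as suggested above, and use a check like this only to confirm that no chapter is too long.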



HTML comments (text placed between <!-- and -->) aren't indexed at all. Those are

intended as private communications, not viewable by Web site visitors, except by using

View/Page Source.



Also, consider technical factors. If a site has a slow connection, it might time out for the

crawler. Very complex pages, too, may time out before the crawler can harvest the text.

If you have a hierarchy of directories at your site, put the most important information

high, not deep. AltaVista will presume that the higher you placed the information, the

more important it is. And crawlers may not venture deeper than three or four or five

directory levels.
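
To see how deep your pages sit, you could count the directories in each address. The Python sketch below does that. The function name, the sample URLs, and the threshold of three levels are assumptions for illustration, with the threshold taken from the low end of the "three or four or five" range above.

```python
from urllib.parse import urlsplit

def directory_depth(url):
    """Count the directories between the site root and the page itself."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # A path ending in "/" names a directory; otherwise the last segment is the file.
    if path.endswith("/"):
        return len(segments)
    return max(len(segments) - 1, 0)

MAX_DEPTH = 3  # assumed cut-off for this example

# Hypothetical sample addresses.
for url in [
    "http://www.example.com/index.html",
    "http://www.example.com/products/widgets.html",
    "http://www.example.com/a/b/c/d/e/page.html",
]:
    depth = directory_depth(url)
    note = "may be too deep" if depth > MAX_DEPTH else "ok"
    print(depth, note, url)
```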



Above all, remember the obvious: full-text search engines such as AltaVista index text.

You may well be tempted to use fancy and expensive design techniques that either block

search engine crawlers or leave your pages with very little plain text that can be indexed.



Ranking Rules






The simple rule of thumb is that content counts, and that content near the top of a page

counts for more than content at the end. In particular, the HTML title and the first couple

of lines of text are the most important parts of your pages. If the words and phrases that

match a query happen to appear in the HTML title or first couple of lines of text of one of

your pages, chances are very good that that page will appear high in the list of search

results.



AltaVista bases its ranking on both static factors (a computation of the value of a page

independent of any particular query) and query-dependent factors.



It values:



Long pages, which are rich in meaningful text (not randomly generated letters and

words).



Pages that serve as good hubs, with lots of links to pages that have related

content (topic similarity, rather than random meaningless links, such as those

generated by link exchange programs or intended to generate a false impression of

"popularity").



The connectivity of pages, including not just how many links there are to a page

but where the links come from: the number of distinct domains and the "quality"

ranking of those particular sites. This is calculated for the site and also for

individual pages. A site or a page is "good" if many pages at many different sites

point to it, and especially if many "good" sites point to it.



The level of the directory in which the page is found. Higher is considered more

important. If a page is buried too deep, the crawler simply won't go that far

and will never find it.



These static factors are recomputed about once a week, and new good pages slowly

percolate upward in the rankings. Note that there are advantages to having a simple

address and sticking to it, so others can build links to it, and so you know that it's in the

index.
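
The connectivity idea above (a page is "good" if many pages point to it, and especially if "good" pages point to it) is circular, and one common way to handle that kind of definition is to recompute the scores over and over until they settle. The Python sketch below shows a generic link-score iteration of that kind. It is not AltaVista's actual formula, which is not described here; the damping value, the iteration count, and the tiny link graph are all assumptions made for illustration.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Iteratively score pages so that links from high-scoring pages count for more.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    count = len(pages)
    score = {p: 1.0 / count for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new = {p: (1.0 - damping) / count for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for target in pages:
                    new[target] += damping * score[page] / count
        score = new
    return score

# Hypothetical graph: three pages link to "hub", and "hub" links back to "a".
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = link_scores(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph "hub" comes out on top because three pages point to it, and "a" comes second because its one incoming link is from the highest-scoring page.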



Query-dependent factors include:



The HTML title.

The first lines of text.

Query words and phrases appearing early in a page rather than late.

Meta tags, which are treated as ordinary words in the text, but like words that

appear early in the text (unless the meta tags are patently unrelated to the content

on the page itself, in which case the page will be penalized).

Words mentioned in the "anchor" text associated with hyperlinks to your pages.

(E.g., if lots of good sites link to your site with anchor text "breast cancer" and the

query is "breast cancer," chances are good that you will appear high in the list of

matches.)
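
To make the query-dependent factors listed above concrete, here is a toy Python scoring function. It rewards query words in the title, rewards words that appear early in the body more than words that appear late, and adds credit for anchor text. The weights (3.0, 1.0, 0.5) and the function names are invented for this example and say nothing about the values AltaVista actually uses.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def query_score(query, title, body, anchor_texts=()):
    """Toy score; the weights are illustrative assumptions only."""
    title_words = tokenize(title)
    body_words = tokenize(body)
    anchor_words = [w for text in anchor_texts for w in tokenize(text)]
    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += 3.0  # a match in the HTML title counts most
        if term in body_words:
            first = body_words.index(term)
            # 1.0 for a match at the first word, tapering toward 0 at the end.
            score += 1.0 - first / len(body_words)
        # Each mention in anchor text from linking pages adds a little.
        score += 0.5 * anchor_words.count(term)
    return score

early = query_score("breast cancer", "Health",
                    "breast cancer information and more words here")
late = query_score("breast cancer", "Health",
                   "information and more words here about breast cancer")
print(early > late)
```

The same page text scores higher when the query words come first, which is the point of putting your most important words in the title and opening lines.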
