2) Make full use of early paragraphs to include relevant keywords. Most
search engines place emphasis on early text, and less on the words further
down the page. The numbers vary from engine to engine, but you can assume
the first 50 words are crucial, the next 50 are important, the 50 following
are likely to be read. After that, it's anybody's guess, though some engines
do manage to fully index pages with more than a thousand words. Try to get
your important keywords - the expressions you expect your visitors to use in
their searches - into your first 150 words.
3) Don't overdo any repetition. If you repeat your keywords too often, you
could be penalized. There's no magic number to aim for, but if you repeat a
keyword no more than three times, you should be safe.
4) Concentrate on the main text. You might have a separate top table
(perhaps containing an advert and logo) plus a left-hand column of links.
These will appear in the HTML file before your main, central text block.
There's a temptation to think these areas are more important than the main
text area because spiders read them first. If these outlying areas contain a
lot of unlinked text, that may well be true. But many engines try to ignore
peripheral HTML blocks, especially if they're heavy on links, and head
straight for the center. It's not too difficult for them to do. They simply
look for the largest heading (within <h1>, <h2> and similar tags) on the
page and assume that whatever follows it is the most important text area.
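To picture the layout, here's a stripped-down sketch of that kind of page
(the table structure, filenames and text are invented for illustration):

<table><tr><td>
<!-- top table: logo and advert -->
</td></tr></table>
<table><tr>
<td>
<!-- left-hand column: navigation links -->
<a href="about.html">About</a><br>
<a href="contact.html">Contact</a>
</td>
<td>
<h1>Search Engine Tips</h1>
<!-- the spider heads here: the text after the largest heading -->
<p>Your main, keyword-rich text block goes here...</p>
</td>
</tr></table>

A spider using the strategy described above skips past both link areas and
treats everything after the <h1> as the page's real content.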
5) It's not much use getting your keywords in the right place if you've
chosen the wrong ones. It doesn't help the spiders either. They'd prefer you
to choose the right keywords so their indexing works as intended. It's worth
spending a few hours on deciding your keywords, maybe trying out a few
expressions in the search engines and seeing if they deliver the sites you
want to compete with.
6) Spiders have lists of stop words - mainly related to adult content and
profanity. When they find one of these words they may abandon your site
altogether. If you have a page that includes a possible stop word, hide it
from spiders by making it an exclusion in your robots.txt file (see later).
Also watch out for words that have two meanings, one of which is sexual.
Spiders don't understand context.
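For example, suppose you have a page about adult education - innocent
enough, but the word "adult" could trip a naive filter. A minimal robots.txt
entry to hide it (the filename is invented for illustration):

User-Agent: *
Disallow: /adult_education.html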
7) If you have pages full of links, make sure there's plenty of text to
accompany them. Pure link listings are often ignored by spiders, but if you
add a couple of sentences describing each link, the problem disappears.
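So instead of a bare run of <a> tags, aim for something like this (the
descriptions are illustrative, using links from the end of this article):

<p><a href="http://searchenginewatch.com/">SearchEngineWatch</a> -
everything you need to know about how search engines index and rank
pages.</p>
<p><a href="http://spider-food.net/">Spider Food</a> - tips and tutorials
on making pages attractive to search engine spiders.</p>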
Popular Sites are Exceptions
Often you can learn a few tricks by looking at the most popular sites on the
Web and seeing how they do things. But not in this case. The most popular
sites are given a special status by search engines and indexed under
slightly different rules than regular sites. They are more likely to be
indexed thoroughly and frequently, which means they don't have to try as
hard. Also, because it's assumed they won't try to spam the engines, they're
forgiven the occasional mistake, such as overusing a keyword.
Titles and Filenames Count
Spiders like to see useful page titles, and some also appreciate relevant
filenames. It helps them, but unfortunately the mechanism has been abused,
so they're wary. Try to use filenames and page titles that match your text
content and keywords, rather than using them to cover keywords that don't
otherwise get a mention. Words in filenames can be separated by
underscores - a convention IT professionals used before the Internet
arrived, so it's perfectly acceptable. But if your filenames turn
into a long sequence of keywords, spiders will assume you're trying to spam
them.
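As a rough example, a page based on this article might use a matching
filename and title like these (both invented for illustration):

<!-- filename: search_engine_tips.html -->
<title>Search Engine Tips - Getting Your Site Indexed</title>

Two or three words joined by underscores is fine;
search_engine_spider_index_ranking_keywords.html is asking for trouble.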
Meta Tags
These go in the page's HTML header, in two parts - keywords and description. The
meta tag system has been so heavily abused that some engines simply ignore
them. But it's still worth spending a few minutes on creating them for the
engines that remain interested. Keep them short and don't use words that are
missing from the main text. If you spend a long time working on meta tags,
you're probably trying to manipulate the system and you may well be found
out. Create them quickly, using the simplest, most obvious content, and it's
more likely they'll work as intended.
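Here's what short, honest meta tags might look like for a page based on this
article (the content is illustrative):

<head>
<title>Search Engine Tips - Getting Your Site Indexed</title>
<meta name="keywords" content="search engines, spiders, indexing">
<meta name="description" content="Practical tips on getting your pages
indexed and ranked by search engine spiders.">
</head>

Every word in the keywords list should also appear in the main text.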
Image Alt Descriptions
These create the text that shows in an image space before a graphic loads,
and subsequently when the mouse rolls over it. They've been sorely abused,
often crammed with long lists of keywords, and again the spiders have wised
up and tend to ignore them, or penalize obvious abuse.
Their proper use is to show visitors with text-only browsers (and
impaired-vision visitors with talking browsers) what they're missing. Using
them as a method of presenting keywords is spamming and you can hardly
complain if it gets you a ranking penalty.
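A proper alt description is a short, honest label for the image (the
filename is invented for illustration):

<img src="foxglove_logo.gif" alt="Foxglove logo" width="120" height="60">

The abusive version - alt="search engines spiders ranking keywords..." - is
exactly what gets penalized.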
Frames
Frames confuse most spiders. If you insist on using frames, then make the
most of your <noframes> tag and include a link within it to a sitemap or
contents page that lists your pages and links to them directly, rather than
linking to framesets. You can always force the framesets to appear when the
links are followed in a regular browser by using JavaScript, which the
spiders will ignore. It's a lot of work but at least it should get you
listed in the search engines.
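Here's a sketch of the approach, with illustrative filenames. First the
frameset page:

<frameset cols="200,*">
<frame src="menu.html">
<frame src="content.html" name="main">
<noframes>
<body>
<p>This site uses frames. For a full list of pages, see the
<a href="sitemap.html">sitemap</a>, which links to each page directly.</p>
</body>
</noframes>
</frameset>

And on each content page, a snippet like this rebuilds the frameset when a
visitor arrives directly from a search engine listing (spiders ignore the
script; a fuller version would pass the current page's name along so the
frameset can reload it):

<script type="text/javascript">
// if this page has loaded outside its frameset, restore the frameset
if (top == self) top.location.href = "index.html";
</script>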
Robots.txt
This text file goes in your root directory and gives instructions to spiders
about which files and directories to ignore when they're trawling your site.
It can have other uses too, but many of these are close to spamming
techniques so won't be covered here.
Here's a sample robots.txt file:
User-Agent: *
Disallow: /images/
Disallow: /bookmark
Disallow: /cgi_bin/
Disallow: /status/
This tells all spiders (first line) not to look inside the directories
called images, cgi_bin and status, and to ignore any file whose name begins
with bookmark - bookmark1.html, bookmark2.html and so on. Robots.txt matches
by prefix, so wildcards aren't needed (and the standard doesn't support
them). Incidentally, the linebreaks are important.
It's a good idea to include a robots.txt file on your site, even if you
don't have much to exclude. It helps prevent spiders wasting their time
poking around in your image directories. And since spiders often give up on
a site before fully indexing it (especially a new site), it can help you get
the more important areas of your site indexed.
Directory Structure
Spiders find their way around your site by following your internal links.
They prioritize pages in the root directory, then first-level directories,
and if you're lucky (or run a very popular site) they may look at
subdirectories beyond that, but often they won't bother. That's why you find
most professional sites have a flat structure, with many pages in the root
directory and first-level subdirectories, rather than a deep structure with
many levels of subdirectories.
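As a rough illustration, this kind of structure gets indexed (the paths are
invented):

/index.html
/products.html
/products/widgets.html

while pages buried like this often don't:

/catalog/hardware/tools/small/widgets.html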
Dynamic Pages
Spiders generally have trouble with these. Also they're a little frightened
of them because they can get trapped inside a dynamic page server, and may
even bring the server down. For this reason spiders identify dynamic pages
by the question mark contained in their URLs, and usually avoid them. Some
will allow you to submit specific dynamic pages, but they still won't follow
the internal links within them.
One solution is to create static gateway pages that include static links to
other pages on your site. Make sure the link URLs are complete in
themselves rather than generated on the fly, that they contain no question
marks, and that your server can translate these static links into the
dynamic pages they stand for. Also make sure there's plenty of text on the
gateway page - if it's purely made up of links it may be ignored.
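A bare-bones gateway page might look like this - the URLs are invented for
illustration and assume your server can map them onto the underlying
dynamic pages:

<h1>Product Catalog</h1>
<p>Our catalog covers hand tools, power tools and workshop accessories,
with specifications and prices for every item.</p>
<p><a href="/catalog/hand-tools.html">Hand tools</a> - hammers, saws and
screwdrivers for home and trade use.</p>
<p><a href="/catalog/power-tools.html">Power tools</a> - drills, sanders
and grinders, with buying advice.</p>

Note there are no question marks in the links, and the page carries real
text rather than links alone.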
An alternative is to make technical alterations to your system so the server
can cope with a visit from a spider, and then replace the question mark with
a less obvious symbol such as a % sign. There's no point in making this
replacement if the server won't be able to cope. The usual problem is that
links to dynamic pages are often created dynamically themselves, and spiders
can't manage this. They request pages with incomplete URLs, missing query
string elements; the server responds by asking for more information to
complete the URL; the spider can't understand the response, and the exchange
turns into a dangerous loop. To get over this you have to create a
workaround for the incomplete-URL problem, and technically that's a
demanding task.
For more details on getting dynamic sites indexed, try NetMechanic
(http://www.netmechanic.com/news/vol4/promo_no3.htm) and Spider Food
(http://spider-food.net/dynamic-page-optimization.html).
Additional Links
* More on the robots.txt file
(http://www.wdvl.com/Location/Search/Robots.html)
* How To Use HTML Meta Tags
(http://searchenginewatch.com/webmasters/meta.html)
* Everything you need to know about search engines, at SearchEngineWatch
(http://searchenginewatch.com/).
* Theme-based spidering is a relatively new concept. Read about it at
NetMechanic (http://www.netmechanic.com/news/vol4/promo_no13.htm)
* Learn how to attract hungry arachnids at Spider Food
(http://spider-food.net/).
About the Author:
Andrew Starling is a Web developer, consultant and author based in the UK.
He was previously the Managing Editor of the Web Developer's Journal for
internet.com and Technology Editor of the UK's Internet Magazine, for which
he still writes. His own Web sites are Foxglove.co.uk and Tinhat.com.
Foxglove is a satirical site and was chosen as the Mirror newspaper's site
of the day back in August 2000. Tinhat covers Internet security and privacy.
Email: astarling@foxglove.co.uk