Fravia's Nofrill
Web design

November 1998
Search engines' vagaries

First elements of searchenginology - 1

Well, ever wondered which of the search engines gets more hits, which one indexes more pages, which SEs are more spammed (where the ratio of "noise" to relevant information is worse)?
And did you ever wonder why there's so much opportunistic spam among the first links that most SEs return (especially when the query is too broad)?
And did you ever think that each search engine has its own algos in order to refresh/accept/select/classify/list the results of your queries... and that these same algos can be (pretty easily) cracked... and that many commercial 'slave-catchers' are doing "professionally" exactly that, in order to scrape some money from your useless clicks (destroying at the same time the value of the search engine services for millions of users)?
Good old fravia+ will now explain all this (at least in part), so that you may choose and use the SEs better... and survive on the web!

[Some data] [Spam for clicking] [Reversing search spiders] [Write your own search bots] [Fravia's tips for your own site]

Some data
Main SEs: indexed pages, web coverage and number of monthly visitors (source: fravia's scripts, results for mid-July)

Search engine         Indexed pages   Web coverage   Monthly visitors
Altavista (AV)        150 Mil         42.9%           7 Mil
Hotbot (HB)           120 Mil         34.3%           5 Mil
Northernlight (NL)     85 Mil         24.3%           3 Mil
Excite (EX)            50 Mil         14.3%          15 Mil
Infoseek (IS)          33 Mil          9.4%          13 Mil
Lycos (LY)             31 Mil          8.9%          10 Mil
Webcrawler (WC)         3 Mil          0.9%           6 Mil
Yahoo (YA)            not             not            32 Mil
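
The coverage column looks like each index size divided by one common estimate of the whole web. The small Python sketch below divides back to see which total the figures imply, and also compares visitors with index size. Both helper names are made up for this illustration, and the "implied total" is an inference from the table, not a number given by the scripts behind it.

```python
# Rows copied from the table above:
# (name, indexed pages in millions, web coverage in %, monthly visitors in millions)
TABLE = [
    ("Altavista", 150, 42.9, 7),
    ("Hotbot", 120, 34.3, 5),
    ("Northernlight", 85, 24.3, 3),
    ("Excite", 50, 14.3, 15),
    ("Infoseek", 33, 9.4, 13),
    ("Lycos", 31, 8.9, 10),
    ("Webcrawler", 3, 0.9, 6),
]


def implied_web_size(indexed_mil, coverage_pct):
    """Total web size (millions of pages) implied by one table row."""
    return indexed_mil / (coverage_pct / 100.0)


def visitors_per_indexed_page(indexed_mil, visitors_mil):
    """Monthly visitors per indexed page: high values mean a small index with a big audience."""
    return visitors_mil / indexed_mil


if __name__ == "__main__":
    for name, indexed, coverage, visitors in TABLE:
        print(
            f"{name:14s} implied web: {implied_web_size(indexed, coverage):6.1f} Mil   "
            f"visitors/page: {visitors_per_indexed_page(indexed, visitors):.2f}"
        )
```

Every row lands between roughly 330 and 355 million pages, and the visitors-per-page ratio shows that the most visited engines in the table are not the ones with the largest indexes.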

What can we conclude from these data?

Spam for clicking
Spam is, in the case of the search engines, something different from the usual email spam that you already hate, yet it is opportunistic crap nevertheless. There are people all over the world who scrape money by luring lusers and zombies into clicking banners (after having sold their unhappy clients the lie that clicking means 'commercial opportunity').
Those commercial slave-catchers have quickly understood a couple of simple truths:

The consequence is that often enough the only answers to your queries will be 'commercial' pages that HAVE BEEN DESIGNED to figure among the first positions, instead of the true knowledge sites you're looking for, which don't care for this 'positioning' crap. It's as simple as that: the search engines that you are using are not giving you what you expect, nor what they are supposed to; they are just giving you tons of commercial spam. And millions of users are wasting time, so that the commercial slave-catchers can scrape some money, as usual...
That's reason enough to thwart the commercial SE spammers' plans, annoy them and eventually destroy their so artfully designed sites, and I'll teach you how to do it...

Reversing search spiders
Any real reverser can quickly fool the SEs' algos (commercial spammers, who are pretty stupid individuals, actually do it all the time - for money).
Algos vary from search engine to search engine (of course, that's the reason the same query gives DIFFERENT results on each search engine, btw) and may be very simple or complex.

The simplest way to reverse the SEs' algos is to perform searches on very commercial subjects (those where the spammers from the 'insert site consultancies' battle a lot) and have a look at the first 15-30 sites that you get as the result of your query. Your reverser's eye will quickly pick up the relevant patterns...
Let's take Excite as an example, in order to let you grasp the complexity (and at the same time the banality) of the issues at stake.
The following is taken from my own essay reversing search engines' bots:

The above snippets are taken from the 'Excite' chapter of my own (unpublished) small booklet 'reversing search engines bots', yet every single search engine has its own idiosyncrasies.
Altavista, for instance, seems to rotate at least two ranking methods several times during the course of a day, in order to keep spammers at bay. On Altavista, quite correctly, "root domains" get a relevancy boost, which frustrates spammers. But many other small things seem to be in play as well. I believe, for instance, that font size increases keyword weight in some AV algos (because they assume that parts of the text written in bigger fonts are more relevant for the page). Also, if you try to submit more than 60 pages for a given domain, AV kills it... how do the spammers then spam in this case? They use a unix shell and start a lynx -dump. So they can submit 20,000 pages in 4 hours from only one server.
Moreover, Altavista searches use different algorithms depending on the PART of the database they fall in: the main AV page uses one algorithm, yet the small AV search panel in Micro$oft's Explorer 4 and up (yes, you'll have to lower yourself to use this puke browser too, if you want to fish algos on the search engines) uses a different algorithm at AV. You can tell that the M$IE searches are different, because they will have an "n200" in the referer string in your logs.
AltaVista algos are based on the oldest search engine, the one in use on the web of the 'older ones', and AV is THEREFORE still the best place to find detailed information, even if the submitters (or spammers) never intended to list it that way. AV bots still follow all the links they find, and sometimes they go 'crazy' and chart NEW UNDISCOVERED TERRAIN, something that happens very seldom on most other search engines. I personally hope that AV will retain its roots, and crawl and list as it does now, even though the net is becoming more and more commercial, notwithstanding all our efforts.
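
Spotting those "n200" visits in your own logs can be sketched in a few lines of Python. This assumes your server writes the common "combined" log format (referer and user agent in quotes at the end of each line). The function name, the `altavista.example` host and the sample line are invented for illustration, and where exactly "n200" sits inside the referer is not something this page documents, so the check is a plain substring test.

```python
import re

# Tail of a "combined" format log line: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def classify_altavista_referer(log_line):
    """Return 'msie-panel', 'main-page', or None if the hit did not come from Altavista."""
    match = LINE_RE.search(log_line)
    if match is None:
        return None  # not a combined-format line
    referer = match.group("referer")
    if "altavista" not in referer.lower():
        return None  # visitor arrived from somewhere else
    # Assumption: the marker appears literally somewhere in the referer string.
    return "msie-panel" if "n200" in referer else "main-page"


# Invented sample line, only to show the expected shape of the input.
SAMPLE = (
    '10.0.0.1 - - [12/Nov/1998:10:00:00 +0100] "GET /index.htm HTTP/1.0" 200 5120 '
    '"http://altavista.example/query?q=reversing&n200" "Mozilla/4.0 (compatible; MSIE 4.01)"'
)

if __name__ == "__main__":
    print(classify_altavista_referer(SAMPLE))
```

Counting which of your pages rank under each label, for the same query, is one way to compare the two algorithms side by side.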

If you are interested in this kind of stuff (hopefully not in order to spam on your own :-) you may want to reverse some algos by yourself and have a general look around the various spammers' newsgroups (yes, they exchange their findings on the web: "I got three clients into the AV top 10, but cannot get them into WC" and so on).
Anyway, one of the results of this awful "commercial oriented" activity (these assholes would sell their sisters for a couple of clicks) is that MOST OF THE TIME you can and must FORGET the top-ranked 10% of the links in any query result you get. Yes, you understood me right: all the search engines' anti-spam tricks notwithstanding (at the moment the most refined antispammer is probably Infoseek, since it was for ages the most spammed SE), the first results of any query will NOT be relevant, because of the spam (unless your query is very specific).
Well, if you don't believe me, just try it: the broader your search category, the more useless the links reported in the first positions will be. You'll probably have more luck when you start after the first 10% of the relevant results.

There are a couple of tricks you can use:
Jump the first results
Say, for instance, that you have searched for three terms, and you get term1 100000, term2 2000, term3 40000.
Now you know that your 'first broad relevant' palette is 2000 (the term2 findings). 10% of that is 200, so you may begin at 'page' 20 of your results (at ten hits per page). Don't worry, you won't lose much.
Negate spammers
Have a look at the first page and see if there are several hits from the same spammer among the first 10. Say you see three hits from http://www.spammon.com: just add -spammon to the search string (which should still be inside the search window) and have a look at the first 10 hits you get NOW. Repeat as necessary.
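
Both tricks are simple enough to sketch in Python. The two function names are made up for this illustration, ten hits per page is an assumption (it is what the 200 hits / page 20 arithmetic implies), and taking the label just before the top-level domain as the name to negate is a simplification that will misfire on domains like something.co.uk.

```python
from collections import Counter
from urllib.parse import urlparse


def first_page_to_read(hit_counts, skip_percent=10, per_page=10):
    """Results page to start from, after jumping the first skip_percent of the hits."""
    smallest = min(hit_counts)                  # the rarest term bounds the relevant set
    skipped = smallest * skip_percent // 100    # integer arithmetic, no rounding surprises
    return skipped // per_page


def exclusion_terms(result_urls, min_hits=3):
    """Build '-name' terms for hosts that appear at least min_hits times in the results."""
    hosts = Counter(urlparse(url).hostname or "" for url in result_urls)
    terms = []
    for host, count in hosts.items():
        if host and count >= min_hits:
            labels = host.split(".")
            # Simplification: www.spammon.com -> spammon
            name = labels[-2] if len(labels) >= 2 else labels[0]
            terms.append("-" + name)
    return terms


if __name__ == "__main__":
    print(first_page_to_read([100000, 2000, 40000]))
    first_ten = [
        "http://www.spammon.com/buy.htm",
        "http://www.spammon.com/cheap.htm",
        "http://www.spammon.com/now.htm",
        "http://some.university.example/paper.htm",
    ]
    print("term1 term2 term3 " + " ".join(exclusion_terms(first_ten)))
```

With the numbers from the example above, the first function returns 20 and the second one yields the single term -spammon.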

Punish Spammers
An interesting idea is to 'punish' spammers' sites on your own, simply by resubmitting them with a lot of hidden spamming text that YOU have added... The search engines' algos will exclude them, mostly without warning. :-)
Alternatively you can email the search engines about your favourite spammer target; just remember to keep it very factual and to the point. Also, maintain a professional tone. No all-caps screams, no judgements. Just the facts: "On such-and-such search, such-and-such commercial domain is dominating the results, denying the searcher access to a variety of real results from which to choose."

Write your own search bots
Uff! It's a long way to come to terms with the search engines, isn't it? Well, there's ANOTHER way: write your own dedicated search bots. It's easier than you may think, and it works MUCH better than being at the mercy of the (at times comical) algos of the main SEs.
Once you have learned how to write your own search bots (study perl, my son) you'll be able to incorporate these tricks inside your own probes... yes... as you now probably understand: public search engines may be used if you have nothing better, but your own search bots will be much better!
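
To give an idea of how little is needed, here is a minimal sketch of such a bot: it follows links breadth-first from a start page and ranks the pages it finds by how often a term occurs in them. It is written in Python rather than Perl purely for illustration. The names `LinkAndTextParser`, `crawl` and `FAKE_WEB` are invented, and the "web" is a small in-memory dictionary so that the example runs without a network; a real probe would pass in a function that actually fetches URLs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect href targets and text chunks from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch, term, max_pages=50):
    """Breadth-first crawl; return [(url, occurrences of term)], best pages first."""
    seen = {start_url}
    queue = deque([start_url])
    hits = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        html = fetch(url)          # fetch returns the page source, or None if unavailable
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        count = " ".join(parser.text).lower().count(term.lower())
        if count:
            hits.append((url, count))
        for href in parser.links:
            target = urljoin(url, href)    # resolve relative links against the current page
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Tiny stand-in for the web, so the sketch runs offline.
FAKE_WEB = {
    "http://site.test/": '<a href="a.html">A</a> <a href="b.html">B</a> intro',
    "http://site.test/a.html": 'reversing notes, more reversing <a href="/">home</a>',
    "http://site.test/b.html": 'one line on reversing <a href="c.html">C</a>',
    "http://site.test/c.html": "nothing relevant here",
}

if __name__ == "__main__":
    for url, count in crawl("http://site.test/", FAKE_WEB.get, "reversing"):
        print(count, url)
```

The ranking rule here (a raw occurrence count) is deliberately naive; the point of writing your own bot is that you choose the rule, so tricks like skipping known spam domains can go straight into the loop.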


Fravia's tips for your own site

OK, so many have written to me begging for some tips to improve the ranking of their own pages that, even if I don't really agree, I am going to publish the following. I sincerely hope that most of my readers are just 'correct' people who need a boost for their clever reversing site, not commercial spammers who will misuse this info for money. You may also notice that I purposely DO NOT use these same tips to boost the position of my own site (nor do I advertise it on usenet), since I DO NOT WANT TO BOOST MY SITE TOO MUCH. I believe in fact that it is not the quantity of people reading my site but the QUALITY of those people that makes the real difference (as usual in all sectors of life). Anyway, here we go, and let's just hope you won't use this to sell some pathetic crap on the web... btw, the following is strongly Altavista geared...

That said, "tip" sheets are not worth much, since the SEs' algos are continuously improving, and obvious spamming techniques will get you nowhere. Most "tip" sheets on SE positioning that you will find around the web can be compared to a "tip" sheet bought at a race track: they are usually wrong. Trust your instincts, study and reverse as much as you can, and experiment with different strategies...

(c) Fravia 1995, 1996, 1997, 1998. All rights reserved