spiderSpy™ — the full control

Frequently Asked Questions (FAQ)

What are spiders and how do they work?

Warning: this is a long one! :-)

A: A “spider” or “crawler” is a program (usually automated) employed, most commonly by a search engine, to grab data from websites for later indexing. There are different types of spiders and, hence, various forms of spider behavior.

E.g. there are submission checkup spiders activated when you submit a site to a search engine: they run a simple check on whether the submitted URL is valid, whether the server is available/accessible, whether a redirect is in effect, etc.

If the URL passes this test, it will typically be stored in a task queue for later crawling.

When the time comes (and this may literally take weeks) for your site to be fully spidered or crawled (the terms are used synonymously here), the job is done by another spider program. It may either fetch only a single page's content for later indexing (“flat crawl”) or follow internal and/or even external links (i.e. hyperlinks leading to other sites) up to a predefined level (“deep crawl”), typically 3-5 levels deep.
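
To make the difference between a flat crawl and a deep crawl concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular search engine works: the LINKS table is a hypothetical in-memory stand-in for real pages, and the function name crawl is our own.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a1"],
    "http://example.com/b": [],
    "http://example.com/a1": ["http://example.com/a2"],
    "http://example.com/a2": [],
}


def crawl(start_url, max_depth, links=LINKS):
    """Return the URLs visited, breadth first, following links at most
    max_depth levels away from start_url.

    max_depth=0 corresponds to a "flat crawl" (the start page only);
    larger values correspond to a "deep crawl".
    """
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited
```

Calling crawl("http://example.com/", 0) visits the start page only, while a larger max_depth picks up pages further down the link chain.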

The data thus collected will then be fed into the search engine's database, where it will be indexed (again, this process may take several weeks to actually happen). After indexing, your site will be available for searches.

How do you actually set up and maintain your spider list?

A: Establishing such a list is no mean task: while many, many spider lists are freely available on the net, our (commercial) fantomas spiderSpy™ service goes a step further in systematizing any spider whose data we can get hold of.

To give you an impression of what we are up against: we have verified and stored more than 900 unique Inktomi spiders alone; AltaVista accounts for 1,100+ spiders, etc. We are also covering international search engine spiders, e.g. from Germany, Japan, etc.

Nor does this ever stop: major search engines tend to implement new spiders almost on a weekly basis; spider names may be changed, IPs reassigned, etc.

Then there are new search engines entering the market all the time. This is what our fantomas spiderScouts Department is all about: it does just what it says, scouting for search engine spiders across the whole net. To this end, we have set up a string of “spider traps”: these range from proprietary software to dedicated domains whose only function is to invite spidering through regular (daily) page submissions. Then there's our own log file evaluation, third party sources, etc.

You can read more about it at: http://fantomaster.com/fasvsspy01.html

There, you will also find a comprehensive list of search engines currently covered.

How do you reference search engine spiders — by UserAgent or by IP?

A: Strictly by IP, as this is the only really safe approach. The UserAgent variable can easily be forged; some browsers (such as Opera) actually offer you a choice of UserAgent variables to submit to any web site you visit. Thus, a snooping competitor might spoof a search engine spider's UserAgent variable to detect whether you are using cloaked pages, a scenario you should avoid at all costs!
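
As a rough illustration of IP-only identification, the following Python sketch classifies a visitor by address and deliberately ignores the UserAgent. The addresses in SPIDER_IPS are hypothetical placeholders from the documentation range, not real spider IPs, and is_spider is a name of our own choosing.

```python
import ipaddress

# Hypothetical verified spider addresses (documentation-range placeholders).
SPIDER_IPS = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("192.0.2.11"),
}


def is_spider(remote_addr, user_agent=None):
    """Classify a visitor by IP address only.

    user_agent is accepted but deliberately ignored, because the
    visitor can set it to anything.
    """
    try:
        return ipaddress.ip_address(remote_addr) in SPIDER_IPS
    except ValueError:
        return False  # not a syntactically valid IP address
```

A visitor from an unlisted address claiming to be a spider in its UserAgent is therefore still treated as a human visitor.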

Why are some spiders commented out, e.g. Googlebots?

A: These are typically classified as "Decloaking Hazards". E.g. the Googlebot spiders will cache your pages unless you exclude them from doing so by implementing a proprietary meta tag.

This is hazardous to cloaked setups because the spider, unless it is commented out and, hence, treated as a human visitor, will store the cloaked content, which will then be displayed during search routines. Meaning that any competitor could catch you out cloaking, which is not what you would typically want.

Instead of commenting spiders out, why not simply leave them out altogether?

A: Some clients prefer to edit the spider list manually. Indeed, our fantomas shadowSniper™ program specifically offers this option. However, the spider list keeps expanding all the time and it's well nigh impossible for a single user to keep up with the plethora of search engine spiders haunting the Web.

We are including “Decloaking Hazards” in the list in commented-out form to show these users which spiders they should not include by mistake or for sheer lack of information.

In a later version, we may split the list in two, placing these commented-out spiders in a separate exclusion list file.

Why are some spiders labelled SE, while some have no label?

A: This trims the file to reduce server load when it is used as a cloaking engine list. In the ASCII text version of the botBase, each search engine is generally listed only once (akin to a header or category title), followed by its spiders.

Use the CSV version if you want full reference for each spider.

How can I prevent Google from caching my pages?

A:

  1. Ask them to stop caching your pages: they will comply if only because they would run the risk of a copyright violation suit. It will probably take a few weeks till the cached pages disappear, but disappear they will. (Been there, done that.)
  2. They are also offering a do-it-yourself solution: simply include the following in your meta tags section: <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"> [ Source: http://www.google.com/faq.html ]

In any case, should you have requested Google not to cache your pages, you could safely uncomment their spiders in our botBase's engine list.

If I leave the comments in your spider list, will the robots still read only what is not commented?

A: The engine list file in its ASCII (text) version is primarily targeted toward the fantomas shadowSniper™ keyword switch script only — it determines the manner in which each individual spider will be treated when accessing your site.

Other than that, the commented text serves no further technical purpose. So you can leave the comments in the file or simply delete them, it makes no difference, EXCEPT (we really can't stress this point too much!): if you uncomment a spider IP, this spider will, of course, be fed the cloaked page when accessing your site. This isn't always desirable, as for instance in the case of AltaVista's Babelfish (automatic translation) spider, but of course it is entirely up to you to modify the list in accord with your own specific requirements.
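
The effect of commenting can be pictured with a small Python sketch. Note that the file layout used here is an assumption made purely for illustration (one entry per line, "#" as the comment marker); it is not the actual botBase format, and active_entries and SAMPLE are hypothetical names.

```python
def active_entries(list_text, comment_marker="#"):
    """Return the entries that are not commented out, in file order.

    Assumed layout (illustration only): one entry per line, with
    comment lines starting with comment_marker.
    """
    entries = []
    for raw in list_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_marker):
            continue  # blank or commented: never treated as a spider entry
        entries.append(line)
    return entries


# Hypothetical sample list; addresses are documentation-range placeholders.
SAMPLE = """\
# Example Engine (hypothetical)
192.0.2.10
# 192.0.2.11
spider.example.com
"""
```

In this sketch, removing the "#" in front of 192.0.2.11 would add that address to the active entries, which is exactly the step that makes a spider receive the cloaked page.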

Are you using unresolved or resolved IPs to determine search engine spiders? And if both, which version should I use?

A: Depending on your server's specific configuration, spiders (like all site visitors, for that matter) will be identified

  1. by unresolved IP (e.g. "111.222.333.255"),
  2. by resolved IP, i.e. host name (e.g. "spiderdomain.com"), or
  3. by either of the two.

Obviously, the first and the last option are preferable because some spiders' IPs aren't resolvable.
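
The third option (match on either form) might look roughly like the Python sketch below. The list entries are hypothetical, and the resolver is passed in as a parameter so the matching logic can be tried without a network connection; the default uses the standard library's reverse lookup.

```python
import socket

# Hypothetical list entries: numeric addresses and host name suffixes.
SPIDER_IPS = {"192.0.2.10"}
SPIDER_HOST_SUFFIXES = ("spiderdomain.com",)


def resolve(ip):
    """Reverse-resolve ip to a host name, or return None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def is_listed(ip, resolver=resolve):
    """Match a visitor by unresolved IP first, then by resolved host name."""
    if ip in SPIDER_IPS:
        return True  # works even when the IP is not resolvable
    host = resolver(ip)
    if host is None:
        return False
    host = host.lower().rstrip(".")
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in SPIDER_HOST_SUFFIXES
    )
```

Checking the numeric address first means that unresolvable spider IPs are still caught, which is why the first and last options are the safer ones.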

Depending on this configuration, you might even do without the resolved entries altogether, though we advise against it because the overall drain on CPU resources is really minimal.

Plus, the more you modify your spider list file, the more administrative overhead you will incur when updating it from our site, which, as you know, is updated at least every six hours.

Does your fantomas spiderSpy™ software program work on NT as well?

A: It doesn't have to: you won't be buying the software but, rather, access to the database, which is updated 4 times a day.

What are “environmental variables” or “footprints”, and how do they relate to search engine spiders?

A: Every program — be it spider or web browser — accessing a web site will leave a “footprint”, i.e. it comes with a set of various environmental variables (i.e. data) which can be read and recorded by the visited system.

These variables include (but are not limited to):

  1. The originating unique IP (“Internet Protocol”) address, e.g. “216.35.116.41”. Many servers (but by no means all of them) will attempt to “resolve” this IP, i.e. translate it into a host name in the owning organization's domain (in this case, Inktomi's). Sometimes, however, IPs will not resolve gracefully, for reasons beyond the scope of this short summary.
  2. The UserAgent, which is basically a more or less freely assignable name tag such as “Slurp/si” or, in our example, “Slurp/si (slurp@inktomi.com; http://www.inktomi.com/slurp.html)”. Your web browser will usually have its own UserAgent, e.g. “Mozilla/4.72 [en] (Win98; I)” for a Netscape browser or “Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; AtHome0107)” for MS Internet Explorer.
  3. The referrer (variable “http_referer”; yes, no typo: it really lacks an “r”), which shows the last page visited (more typical for web browsers than for search engine spiders), e.g. “http://www.google.com/search?q=ip+blocker”. (In this example, the visitor obviously searched for the keyword phrase “ip blocker” at Google.com.) The referrer is important for several reasons, one of them being evaluation of your site's search engine positioning. (By visiting the URL http://www.google.com/search?q=ip+blocker you can establish if/where your site is ranked under this search phrase.)
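
For illustration, the three variables listed above can be read from a CGI-style environment as in this Python sketch. REMOTE_ADDR, HTTP_USER_AGENT and HTTP_REFERER are the standard CGI variable names; the function name footprint and the assumption that the search phrase travels in a "q" parameter (as in the Google example above) are ours.

```python
from urllib.parse import urlparse, parse_qs


def footprint(environ):
    """Extract IP, UserAgent and referrer from a CGI-style environment mapping."""
    referrer = environ.get("HTTP_REFERER", "")
    query = parse_qs(urlparse(referrer).query)
    return {
        "ip": environ.get("REMOTE_ADDR", ""),
        "user_agent": environ.get("HTTP_USER_AGENT", ""),
        "referrer": referrer,
        # Search phrase, if the referrer carries a "q" parameter; else None.
        "search_phrase": query.get("q", [None])[0],
    }


# Sample visit built from the values quoted in the list above.
SAMPLE_ENVIRON = {
    "REMOTE_ADDR": "216.35.116.41",
    "HTTP_USER_AGENT": "Mozilla/4.72 [en] (Win98; I)",
    "HTTP_REFERER": "http://www.google.com/search?q=ip+blocker",
}
```

In a real CGI script the mapping would typically be os.environ; spiders often send no referrer at all, in which case search_phrase is simply None.
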

What is IP Delivery and where does your service come in?

A: For various reasons many webmasters opt for a technique variously termed “IP Delivery”, “Cloaking”, “Stealthing”, “Food”, “Ghosting” or “Phantomizing”.

This will typically work by feeding a search engine spider with a different page or page content than a human visitor will receive.

This may be done to protect a page's code from thieving competitors, to offer spiders (which are normally text oriented) indexable content for pages which consist of mere graphics (“splash pages”), or to generally improve a site's search engine ranking by feeding the spiders content optimized for search engine indices (but not very readable for the human eye), etc.

Obviously, if a webmaster is cloaking pages, his or her system must be able to recognize a spider for what it is and distinguish it reliably from human visitors to serve pertinent content.

Enter our spider list as quoted: this gives relevant data required for cloaking setups: the UserAgent (#UA), the IP, the resolved domain name, and the search engine these belong to.

I am interested in buying a one-year access to your spider IP list. However, I see there are several engines you cover, many I've never heard of.

A: You normally wouldn't have: we cater to an international clientele, hence we're also covering German, French and Spanish search engines, and more.

Can you direct me on how I can update the botBase for the fantomas shadowSniper™ software in my fantomas Webmaster Suite™?

A: We have recently implemented our proprietary fantomas spyFetcher™ script to automate the process. It is available as a Perl/CGI program for the Unix platform and as an ASP script for Windows systems. You can download it from our subscribers section.

Tip: While the fantomas shadowSniper™ script references the engine list as “register.txt” by default, you may change this variable to “spiderspy.txt” — this will save you having to rename it.

How do I update my IP-file without having a lot of duplicate IP numbers when I add my own?

A: Sorry, we don't have software to support that. You could write your own script to delete duplicates, or export the file to a database first, sort it and trash the dupes, whatever.
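
A do-it-yourself duplicate remover along these lines could be as short as the following Python sketch. It assumes one entry per line (IP or host name) and compares entries case-insensitively; merge_spider_lists is a hypothetical name, and comment lines or other file details would need handling of their own.

```python
def merge_spider_lists(official_lines, own_lines):
    """Merge two lists of entries, dropping duplicates and keeping
    the order in which entries were first seen."""
    seen = set()
    merged = []
    for raw in list(official_lines) + list(own_lines):
        entry = raw.strip()
        key = entry.lower()  # host names are compared case-insensitively
        if not entry or key in seen:
            continue  # skip blank lines and entries already present
        seen.add(key)
        merged.append(entry)
    return merged
```

For example, the lines of the downloaded list could be passed as official_lines and your own additions as own_lines, with the result written back out one entry per line.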

Alternatively, you can send us your findings (many of our clients are doing this now). We'll check them out and, if we can confirm them, include them in our spiderSpy database immediately.
