June 2, 2009

Invisible Web

Many untrained users naively expect that they can locate anything on the World Wide Web by using Google, Yahoo, or Ask.com. But as powerful as these search engines are, they do not index everything on the web. In fact, search engines index less than 10% of the entire web! The remaining 90% is called the "Invisible Web", also known as "The Cloaked Web" or "The Deep Web". This is the massive body of content that is publicly available but hidden from regular search engines.

Indeed, this is a tough concept to grasp - that billions of web pages cannot be found by Google. But it's true: billions of pages are beyond the abilities of search engine cataloging. The robot "spiders" which scan and catalog the World Wide Web are limited... they can neither see nor index everything.

To better visualize this concept, let's start with some size estimates from Google.com, Yahoo.com, Cyberatlas, and MIT. These stats are current as of Fall 2007:
Google.com indexes 12.5 billion public web pages.
71 billion static web pages are publicly-available. These pages can easily be found by Google and other search engines. (e.g. www.honda.com, www.australia.gov.au)
6.5 billion static pages are hidden from the public. As private intranet content, these are the corporate pages that are only open to employees of specific companies. (e.g. employees.honda.com, secure.australia.gov.au)
220+ billion database-driven pages are completely invisible to Google. These invisible pages are not the regular web pages you and I can make. Rather, these are dynamic database reports that exist only when called from large databases. (e.g. custom online car quote for Shelly, Australian government discussion on aboriginal taxation)
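To make the "exists only when called" idea concrete, here is a minimal sketch of a link-following spider on a toy site. Everything in it is hypothetical and made up for illustration (the page paths, the `quote_report` function, the "sedan" value); real spiders and real sites are far more complicated. It only shows why a page generated from a form submission has no link for a spider to follow.

```python
from collections import deque

# Hypothetical toy site: each static page maps to the links it contains.
STATIC_PAGES = {
    "/": ["/models", "/quote-form"],
    "/models": ["/"],
    "/quote-form": [],  # holds a form to fill in, not links to results
}


def quote_report(name, model):
    """Dynamic page: built on demand, only when someone submits the form."""
    return f"Custom quote for {name}: {model}"


def crawl(start):
    """Follow links breadth-first and return every static page reached."""
    seen = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in STATIC_PAGES:
            continue
        seen.add(url)
        queue.extend(STATIC_PAGES[url])
    return seen


indexed = crawl("/")
report = quote_report("Shelly", "sedan")

print(sorted(indexed))
print(report)
```

The spider reaches the page that holds the form, but Shelly's quote never shows up in `indexed`, because no link anywhere on the site points to it.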


Google, considered the best search database today, can only catalog a fraction of this monstrous content. Even with electronic spiders cataloging millions of web pages each week, Google currently indexes only 12.5 billion out of the 220+ billion pages out there... less than 6% of all available internet content.
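For anyone who wants to check the percentage, here is a quick sketch of the arithmetic using the Fall 2007 estimates quoted above. The variable names are my own, and "220+" is treated as exactly 220, so the true share would be no higher than what this shows. The second figure, which adds the static pages into the total, is an extra comparison and not a number from the original sources.

```python
# Page counts in billions, from the Fall 2007 estimates quoted above.
GOOGLE_INDEXED = 12.5
PUBLIC_STATIC = 71.0
PRIVATE_STATIC = 6.5
DATABASE_DRIVEN = 220.0  # quoted as "220+", so treated here as a lower bound


def coverage_percent(indexed, total):
    """Share of `total` pages that are indexed, as a percentage."""
    return 100.0 * indexed / total


# Google's index compared with the database-driven pages alone.
vs_database_driven = coverage_percent(GOOGLE_INDEXED, DATABASE_DRIVEN)

# Google's index compared with all three categories added together.
all_pages = PUBLIC_STATIC + PRIVATE_STATIC + DATABASE_DRIVEN
vs_everything = coverage_percent(GOOGLE_INDEXED, all_pages)

print(f"{vs_database_driven:.1f}% of the database-driven pages")
print(f"{vs_everything:.1f}% of all three categories combined")
```

Either way you count it, the result stays under the 6% mark.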

So if Google only catalogs 6% of the World Wide Web, and other search engines catalog even less, then where is the remaining web content, more than 90% of it, hidden?

2 comments:

Darcy said...

A lot of the deep web information can be found in domain-specific search portals that search vetted and peer-reviewed databases. These portals tend to be narrowly focused repositories of rich information, a real plus for the focused researcher.

Try these sites:

www.mednar.com - medical
www.biznar.com - business
www.worldwidescience.org - science
www.science.gov - science
www.scitopia.org - science
www.nutrition.gov - health

Many of the deep web portals support topic clustering as well.

Matthew S. Theobald said...

The aforementioned deep web portals are a significant development but do not reveal the deep web to Google.

Internous.com's Internet Search Environment Number has a good video illustrating the problem and solution: ID databases and tag them, creating a database of databases.

Search YouTube for "internous" for a really cool video-game-like animation on how it works, or visit http://www.isen.org/isenuploads2009 for a high-resolution version.

Thanks for your attention on the subject.

-m@
