The Hidden Web
What is the Hidden Web?
- Difficult to define
- Also known as the "deep web" and the "invisible web"
- Content-rich databases from universities, libraries, organizations, businesses and government
What is hidden in the Hidden Web?
- Stuff that web crawlers can't reach or won't add
- Information that resides on an Intranet
- Commercial resources with domain or IP limitations
- Sites using a robots.txt file to keep search engines out
- Real-time data (stock quotes, weather, sports, election results, etc.)
- Graphics (except through the text of the ALT attribute)
- Archives (newspapers)
- Contents of PDF and other file types (e.g., .doc, .xls, .ppt, Flash, streaming media, etc.)
- The content of sites requiring registration or a login
- Dynamically generated pages (for example: CGI, ASP, CFM) where data is requested by a form; also some pages with "?" in the URL
- Two web sites now index Adobe PDF files
It's How Big?!?
- No real agreement among the experts
- Approximately 500 times larger than the surface web and faster growing (BrightPlanet study, 2000)
- 2-50 times larger than the visible net (Chris Sherman and Gary Price, The Invisible Web, 2001)
The Hidden Web in Parts!
- In their book "The Invisible Web: Finding Hidden Internet Resources Search Engines Can't See", Chris Sherman and Gary Price divide the Hidden Web into 4 parts:
- Opaque web
- Private web
- Proprietary web
- Truly Hidden web
Opaque Web
- crawl depth
- expensive to index every web page
- crawl frequency
- the most powerful crawlers can hit only about 10 million pages a day
- maximum number of viewable results
- disconnected URLs (pages that no other page links to)
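The crawl-frequency point comes down to simple arithmetic, sketched below. The rate of 10 million pages a day is the figure quoted above. The function name `days_for_full_pass` and the collection sizes are hypothetical, chosen only to illustrate the calculation.

```python
PAGES_PER_DAY = 10_000_000  # crawl rate quoted in the slide above


def days_for_full_pass(total_pages: int, pages_per_day: int = PAGES_PER_DAY) -> int:
    """Return the whole number of days needed to visit every page once."""
    if total_pages < 0 or pages_per_day <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_day must be > 0")
    # Integer ceiling division: a partly used day still counts as a day.
    return (total_pages + pages_per_day - 1) // pages_per_day


if __name__ == "__main__":
    # Hypothetical collection sizes, for illustration only.
    for size in (50_000_000, 1_000_000_000):
        days = days_for_full_pass(size)
        print(f"{size:,} pages: {days} days per full pass")
```

The larger the collection, the longer a page waits between visits, which is why a crawler may skip deep or rarely linked pages.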
Private Web
- password protected
- use a "robots.txt" file to prevent crawling
- a single "robots.txt" file at the root of a web site can exclude any of its sub-directories
- or use a robots meta tag in a page, for example: <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
- "noindex" tells a compliant spider not to index the page; "nofollow" tells it not to follow the links on the page
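As a rough illustration of how a well-behaved crawler honors an exclusion file, the sketch below uses Python's standard `urllib.robotparser` module. The sample rules, the `www.example.com` URLs, the user agent name `ExampleBot`, and the helper names `build_parser` and `is_crawlable` are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules; a real site serves these from /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /reports/drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse exclusion rules from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules allow the given user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://www.example.com/public/index.html",
        "https://www.example.com/private/report.html",
    ):
        print(url, "crawlable" if is_crawlable(rules, url) else "excluded")
```

Pages excluded this way stay reachable to a person who knows the address. They simply never enter a compliant search engine's index.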
Proprietary Web
- registration required
- some are free
- some are fee-based
Truly Hidden Web
- technical reasons why crawlers can't find or enter web page
- search engines may have chosen to omit the web page
- dynamically generated web pages
- relational databases that require a query
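A crawler might flag dynamically generated pages with a simple heuristic like the one sketched below. It is not how any particular search engine works. It checks for a query string (the "?" in the URL) and for the CGI, ASP and CFM extensions mentioned earlier in these slides. The function name `looks_dynamic` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Extensions taken from the examples named in these slides (CGI, ASP, CFM).
DYNAMIC_EXTENSIONS = (".cgi", ".asp", ".cfm")


def looks_dynamic(url: str) -> bool:
    """Guess whether a URL points to a page generated on request."""
    parts = urlsplit(url)
    has_query = bool(parts.query)  # anything after "?" in the URL
    has_dynamic_extension = parts.path.lower().endswith(DYNAMIC_EXTENSIONS)
    return has_query or has_dynamic_extension


if __name__ == "__main__":
    for url in (
        "http://www.example.com/about.html",
        "http://www.example.com/search.asp?term=census",
        "http://www.example.com/lookup.cfm",
    ):
        print(url, "dynamic" if looks_dynamic(url) else "static")
```

A page built from a database query exists only after someone submits the form, so a crawler that follows links alone never sees it.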
Why Use the Hidden Web?
- Quality of content/higher level of authority
- Comprehensiveness
- Focused
- Timeliness
- The material isn't available elsewhere on the web
When to Use the Hidden Web?
- Standard search engines aren't working
- A precise answer is needed
- Data or statistics are needed
- High-quality or authoritative results are needed
- When timeliness is important
- You know the subject area well
- Looking for collections (images, sounds, manuscripts, etc.)
- Reference books online (handbooks, guides, dictionaries, encyclopedias, etc.)
Strategies for Searching the Hidden Web
- Have the mindset of a hunter, detective or passionate collector
- Use search engines to get to the front door
- "searchable database"
- "interactive database"
- Use the site map to see if databases or statistics are mentioned
- Use the site's internal search tool and search for "database"
- Check subject-specific discussion groups
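The "front door" strategy above can be sketched as a small helper that pairs a subject with the two phrases from the slide. The function name `front_door_queries` and the sample subject are hypothetical. The resulting strings are meant to be pasted into any general search engine.

```python
# The two phrases suggested in the slide above, quoted for exact-phrase matching.
SEED_PHRASES = ('"searchable database"', '"interactive database"')


def front_door_queries(subject: str) -> list[str]:
    """Combine a subject with each seed phrase to form search engine queries."""
    subject = subject.strip()
    if not subject:
        raise ValueError("subject must not be empty")
    return [f"{subject} {phrase}" for phrase in SEED_PHRASES]


if __name__ == "__main__":
    for query in front_door_queries("water quality"):
        print(query)
```

Keeping the subject specific, as the cautions below suggest, tends to surface a database's entry page instead of pages that merely mention databases.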
A Few Words of Caution
- Search engines are only as good as their
- You must have some idea of what you are looking for
- Learn how to use a search engine before you actually use it
- Specific is better than general
Tools to Help You Use the Hidden Web
- About.com
- CiteSeer: Scientific Literature Digital Library
- CompletePlanet
- FirstGov
- IncyWincy: The Invisible Web Search Engine
- Infomine
- Librarians' Index to the Internet
- Scirus
Staying Informed
- Read publications that review web sites
- Subscribe to discussion groups (also called forums) in your specific areas of interest
- Subscribe to newsletters that relate to what you are interested in
That's All Folks!