Communication: Searching: Deep Web: Definitions

Surface Web, Deep Web, Hidden Web

Surface Web:

The surface web, also called the visible web, is the portion of the Web that is freely available to the general public. Almost any page with a simple web address [http://www.servername.domain/filename] is a surface web page. These pages are indexed [crawled] by search engines such as Google and Bing.

Deep Web:

The Deep Web, also known as the Invisible Web, is the portion of the web not reached by standard search engines such as Google and Bing. By some estimates, search engines index less than 10% of the web; the remaining 90% or more is called the Deep Web. Estimates of its size range from 2 to 500 times that of the surface web.

Most search engines do not find content on the Deep Web because it is stored in databases rather than in static HTML pages. Google and Bing might lead us to a front door [a search interface], but they generally can't search the contents of the database behind it. It is up to you to search the database; the results of your search are then loaded into a dynamically generated HTML page for viewing.
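The idea of a dynamically generated results page can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the in-memory SQLite database stands in for a library catalog, and the table name, the sample records, and the functions `build_catalog` and `search_page` are hypothetical. The point is only that the HTML does not exist until someone submits a query, so a crawler that never fills in the search form has nothing to index.

```python
import sqlite3
from html import escape


def build_catalog():
    """Create a small in-memory database standing in for a library catalog."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (title TEXT, subject TEXT)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [
            ("Teaching Fractions", "mathematics education"),
            ("Reading in the Early Grades", "literacy"),
            ("Fractions and Decimals", "mathematics education"),
        ],
    )
    return conn


def search_page(conn, term):
    """Run the user's query and build an HTML results page on the fly."""
    rows = conn.execute(
        "SELECT title FROM records WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(title)}</li>" for (title,) in rows)
    return (
        f"<html><body><h1>Results for {escape(term)}</h1>"
        f"<ul>{items}</ul></body></html>"
    )


catalog = build_catalog()
page = search_page(catalog, "Fractions")  # the page exists only after this call
print(page)
```

A different search term produces a different page from the same database, which is why no single fixed address holds this content.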

Some database providers have found it valuable to program their database content to show up on the surface web. For example, if you are searching for a product for sale, such as a Halloween costume, a Google search will send you directly to the matching pages in the databases of Amazon, Party City, and Spirit of Halloween. A search for a movie title will lead you to the IMDb page for that movie.

Hidden Web:

The Hidden Web, also known as the Private Web, is the portion of the web viewable by a restricted set of people. Web resources can be restricted with a firewall, by IP address [UW Libraries databases], or by password [UW course reserves]. Search engines and tools designed to search the deep web will not find this content.
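As a rough sketch of how IP address and password restrictions might work together, the example below uses Python's standard `ipaddress` module. The network range, the function `may_access`, and the rule that either condition is enough are all assumptions for illustration; they do not describe how UW Libraries actually controls access.

```python
import ipaddress

# Hypothetical campus address range (192.0.2.0/24 is reserved for documentation).
CAMPUS_NETWORK = ipaddress.ip_network("192.0.2.0/24")


def may_access(client_ip, has_valid_password=False):
    """Allow on-campus addresses, or off-campus visitors who have logged in."""
    on_campus = ipaddress.ip_address(client_ip) in CAMPUS_NETWORK
    return on_campus or has_valid_password


print(may_access("192.0.2.15"))         # on-campus visitor
print(may_access("203.0.113.9"))        # off campus, not logged in (e.g., a crawler)
print(may_access("203.0.113.9", True))  # off campus, logged in
```

A search engine crawler normally arrives from outside the permitted range and without a password, so it is turned away like the second visitor.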

World Wide Web [schematic]

Only SOME of the Invisible Web [grey box] material is available via UW Libraries databases and sites listed on this guide.

Data from P. Gil, "What is the invisible web?" Dec. 2010. Schematic by N. Tann, Goshen College, March 2011.

Image: The Harold & Wilma Good Library

Search Engines can't find this content...

Content found in databases – Database content that is dynamically generated as the result of a query cannot be found by general-purpose search engines. Examples: ERIC database, library catalogs.
Subscription database content – Fee-based database content is only accessible to those who have subscribed. [Many libraries offer their members free access to subscription databases.] Examples: EBSCOhost databases, LexisNexis Academic.
Information offered on very content-rich websites – General-purpose search engines only partially index very large [deep] websites. The parts of the website that they do not index become part of the Invisible Web. Examples: Library of Congress, U.S. Census Bureau.
Real time content – Information about events currently taking place may not yet be indexed by general-purpose search engines.
Media content – Video and graphics are often not found because search engines deal almost exclusively with text.
Formats – Information occurs in various formats, some of which are not indexed by general-purpose search engines. It also takes time for new formats to appear in search engines. Example: Any new format.
Computer code – Code that cannot be easily found by a search engine, such as JavaScript, Flash, or other dynamically generated code.
Sites requiring login authorization – These sites require users to log in or identify themselves as having the right to access and use content. Examples: Canvas, membership sites.
Sites with interactive content – These sites require information from the user before they can generate an answer. Examples: Travel direction sites, job hunting sites.
New content – It may take time for a search engine to find and include new websites and newly added website content.
Sites that are not linked to by other sites – Search engines index websites by following links from one website to another. If there aren't any links to a site, it might not be found or included.
Sites blocked by the Robots Exclusion Protocol – These sites use a robots.txt file to ask search engine crawlers not to index some or all of their pages. They are not intended for open access use.
Gated social media communities – Social networking sites such as Facebook, LinkedIn, etc.
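The point above about sites that no other site links to can be sketched with a toy crawler. The link graph and the page names below are hypothetical, and real crawlers are far more elaborate; this is only a minimal breadth-first traversal showing that a page with no inbound links is never reached.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home.example": ["news.example", "shop.example"],
    "news.example": ["home.example", "archive.example"],
    "shop.example": [],
    "archive.example": [],
    "orphan.example": ["home.example"],  # links out, but nothing links to it
}


def crawl(start):
    """Return every page reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


indexed = crawl("home.example")
print(sorted(indexed))
print("orphan.example" in indexed)
```

Starting from `home.example`, the crawler reaches four pages but never `orphan.example`, because no page it visits links there.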

Characteristics of Invisible Web Content	Examples
Database content [dynamically generated for a particular inquiry]	ERIC, library catalogs
Subscription databases	EBSCOhost, LexisNexis Academic
Deep websites	Library of Congress, U.S. Census Bureau
Real time content
Formats	Any new format
Sites that require login	Canvas, MyUW, UW course reserves, membership sites
Sites that require that forms be filled out	Sites offering travel directions, job hunting sites
New content	Any new websites or content newly added to an existing website
Sites with a no-index protocol	Private websites
Social networking sites	Facebook, LinkedIn, etc.
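For the "no-index protocol" row in the table above, the sketch below shows how a well-behaved crawler might consult a site's robots.txt rules before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleBot`, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # load the rules without a network request

public_ok = parser.can_fetch("ExampleBot", "https://www.example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "https://www.example.org/private/notes.html")

print(public_ok)   # the public page may be crawled
print(private_ok)  # the /private/ area is off limits to compliant crawlers
```

The protocol is a request rather than an enforcement mechanism: compliant search engines skip the disallowed pages, but the pages themselves are not password protected.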

Content adapted from: http://library.laguardia.edu/invisibleweb/characteristics