Journalism: The Web: Definitions

This guide supports students in COM 360, COM 361, and anyone trying to locate information available through the Deep/Invisible Web.

Surface Web, Deep Web, Hidden Web

Surface Web:

The surface web, also called the visible web, is the portion of the Web that is freely available to the general public. Almost any page with a simple web address [http://www.servername.domain/filename] is a surface web page. These pages are indexed [crawled] by web crawlers that are used to build the databases of search engines such as Google and Bing.
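As a rough illustration of how a crawler builds an index by following links, here is a minimal sketch. The three pages and their example.edu addresses are hypothetical and live only in memory; a real crawler would fetch pages over the network and handle far more cases.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical miniature "web": address -> HTML source.
PAGES = {
    "http://example.edu/index.html": '<a href="http://example.edu/about.html">About</a>',
    "http://example.edu/about.html": '<a href="http://example.edu/index.html">Home</a>',
    "http://example.edu/orphan.html": "<p>No page links here.</p>",
}


def crawl(start):
    """Breadth-first crawl: return every address reachable by links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


indexed = crawl("http://example.edu/index.html")
```

In this sketch the crawler reaches the index and about pages, but never the orphan page, because no other page links to it.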

Deep Web:

The Deep Web, also known as the Invisible Web, is a portion of the web not reached by standard search engines such as Google and Bing. "It’s almost impossible to measure the size of the Deep Web. While some early estimates put the size of the Deep Web at 4,000-5,000 times larger than surface web, the changing dynamic of how information is accessed and presented means that the Deep Web is growing exponentially and at a rate that defies quantification" (Bright Planet: "How large is the deep web?").

Content on the Deep Web is not found by most search engines because it is stored in a database rather than in static HTML pages. Google and Bing might lead you to the front door [a search interface], but they generally can't search the content of the database. It is up to you to search the database yourself; the results of your search are then loaded into a dynamically generated HTML page for viewing.
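To make the idea of a dynamically generated page concrete, here is a minimal sketch. It assumes a hypothetical two-record catalog held in an in-memory SQLite database; the table, titles, and the search_page function are invented for illustration only.

```python
import sqlite3
from html import escape

# Hypothetical catalog stored as database rows, not as HTML pages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Deep Web Basics", 2010), ("Crawling the Surface Web", 2013)],
)


def search_page(term):
    """Run a user's query and build an HTML results page on the fly."""
    rows = conn.execute(
        "SELECT title, year FROM articles WHERE title LIKE ? ORDER BY year",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(title)} ({year})</li>" for title, year in rows)
    return f"<html><body><ul>{items}</ul></body></html>"


page = search_page("Deep")
```

The results page exists only after someone submits a search term. A crawler that merely follows links never submits one, so it has nothing to index.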

Some database providers have found it valuable to program their database content to show up on the surface web. For example, if you are searching for a product for sale such as a Halloween costume, a Google search will send you directly to the product page in the database of Amazon, Party City, or Spirit Halloween. A search for a movie title will lead you to the IMDb page for that movie.

Hidden Web:

The Hidden Web, also known as the Private Web, is the portion of the web viewable by a restricted set of people. Web resources can be restricted with a firewall, by IP address restriction [UW Libraries databases], or by password [UW Canvas]. Search engines will not find this content.
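IP address restriction can be pictured with a short sketch. The address range below is a hypothetical stand-in (a block reserved for documentation), not the range any real library uses, and is_allowed is an invented helper.

```python
import ipaddress

# Hypothetical "campus" range; a real site would list its own networks.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_allowed(client_ip):
    """Return True if the visitor's address falls inside an allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


on_campus = is_allowed("192.0.2.15")
off_campus = is_allowed("198.51.100.7")
```

A search engine's crawler connects from outside the allowed range, so under this kind of check it is turned away just like any off-campus visitor.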

For more info about web crawlers, see A brief history of web crawlers from CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research.

See also What is the Deep Web? story from CNNtech.

World Wide Web

Only SOME of the Invisible Web [grey box] material is available via UW Libraries databases and sites listed on this guide.

Data from P. Gil, "What is the invisible web?" Dec. 2010. Schematic by N. Tann, Goshen College, March 2011.

Image: The Harold & Wilma Good Library

What is the Deep Web?

By: Jose Pagliery & Tal Yellin / CNNMoney

Search Engines can't find this content...

Content found in databases – Database content that is dynamically generated as the result of a query cannot be found by general-purpose search engines. Examples: ERIC database, library catalogs.
Subscription database content – Fee-based database content is only accessible to those who have subscribed. [Many libraries offer their members free access to subscription databases.] Examples: EBSCOhost databases, LexisNexis Academic.
Information offered on very content rich websites – General-purpose search engines only partially index very large [deep] websites. The parts of the website that they do not index become part of the Invisible Web. Examples: Library of Congress, U. S. Census Bureau.
Real time content – Information about events currently taking place may not yet be indexed by general-purpose search engines.
Media content – Video and graphics are often not found because search engines deal almost exclusively with text.
Formats – Information occurs in various formats, some of which are not indexed by general-purpose search engines. It also takes time for new formats to appear in search engines. Example: Any new format.
Computer code – Code that a search engine cannot easily find, such as JavaScript, Flash, or other dynamically generated code.
Sites requiring login authorization – These sites require users to log in or otherwise identify themselves as having the right to access and use content. Examples: Canvas, membership sites.
Sites with interactive content – These sites require information from the user before they can generate an answer. Examples: Travel direction sites, job hunting sites.
New content – It may take time for a search engine to find and include new websites and newly added website content.
Sites that are not linked to by other sites – Search engines index websites by following links from one website to another. If there aren't any links to a site, it might not be found or included.
Sites blocked by the Robots Exclusion Protocol – These sites are not intended for open access use.
Gated social media communities – Social networking sites such as Facebook and LinkedIn, where much of the content is visible only to members.
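The Robots Exclusion Protocol item above can be sketched with Python's standard urllib.robotparser module. The robots.txt rules, the crawler name ExampleBot, and the example.edu addresses are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks each address before fetching it.
can_public = parser.can_fetch("ExampleBot", "http://example.edu/public/page.html")
can_private = parser.can_fetch("ExampleBot", "http://example.edu/private/page.html")
```

Under these assumed rules the public page may be crawled and the private page may not. The protocol is a request that cooperative crawlers honor, not an access control.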

Characteristics of Invisible Web Content	Examples
Database content [dynamically generated for a particular inquiry]	Communication & Mass Media Complete, library catalogs (UW Libraries Search)
Subscription databases	EBSCOhost, Nexis Uni
Deep websites	Library of Congress, U.S. Census Bureau
Real time content
Formats	Any new format
Sites that require login	Canvas, MyUW, UW course reserves, membership sites
Sites that require that forms be filled out	Sites offering travel directions, job hunting sites
New content	Any new websites or content newly added to an existing website
Sites with a no-index protocol	Private websites
Social networking sites	Facebook, LinkedIn, etc.

Content adapted from: http://guides.laguardia.edu/c.php?g=762553&p=5467880