Topic: searching the Web


archiving Information in Hypertext
browsing with a user interface
full-text indexing
hypertext as external memory
hypertext as a global database
hypertext links
information retrieval by relevance
information retrieval by searching
information services
problems with information retrieval
searching hypertext
text trails through hypertext
using keywords to search hypertext
World-Wide Web


The Web has brought challenges and opportunities to information retrieval. With care, the vast bulk of the Web can be indexed and prioritized. Links help identify authoritative sites, but relevance remains elusive. Links also enable neighborhood search. (cbb 11/07)
Subtopic: authority vs. relevance

Quote: web search concerns authoritativeness instead of relevance; trusted source of correct information that has a strong web presence [»boroA2_2005]
Quote: hyperlinks are endorsements for web pages; a network of recommendations that identifies authoritative documents [»boroA2_2005]
Quote: want authoritative results from broad-topic queries; the set of relevant results is too large [»kleiJM9_1999]

Subtopic: search

Quote: even though the World Wide Web was built on the idea of links, global navigation has been replaced by search engines [»furnGW3_1997]
Quote: Google records proximity, font size, and raw HTML for all pages [»brinS4_1998]
Quote: vector space model does not work well on the web; returns short documents [»brinS4_1998]
Quote: a search for "Bill Clinton" should return reasonable results
Quote: in a hypertext, explore the original context for items; e.g., articles, books, or other hypertext units
Quote: with hypertext, continually uncover new items of interest, often with growing relevance

Subtopic: presentation order

Quote: three times more clicks for the top search result than the second position; by swapping results [»joacT8_2007]
Quote: strong presentation bias for search results even if you present the top 10 results in reverse order [»joacT8_2007]
Quote: skipped search results are clearly less relevant than selected results; pairwise relative preference had 80% agreement with manually ranked results [»joacT8_2007]
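
One way to read "skipped results are less relevant than selected results" is as a rule for extracting pairwise preferences from a click log: a clicked result is preferred over every unclicked result ranked above it. The sketch below assumes that reading; the function name and document ids are hypothetical, and the cited study may use additional strategies.

```python
def preferences_from_clicks(ranking, clicked):
    """Return (preferred, less_preferred) pairs for one results page.

    Assumption: a clicked document is preferred over each unclicked
    document that was shown above it (the user saw it and skipped it).
    """
    clicked = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        for skipped in ranking[:position]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


# Hypothetical results page: the user clicked only the third result.
ranking = ["d1", "d2", "d3", "d4"]
print(preferences_from_clicks(ranking, clicked={"d3"}))
```

Nothing is inferred about d4 here, since a result below the last click may never have been examined.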

Subtopic: reachability

Quote: the best link analysis algorithm ranks nodes by their reachability; BFS combines InDegree with the Hits algorithm; 44% highly relevant [»boroA2_2005]

Subtopic: topics, communities

Quote: identify communities as bipartite graph of three web pages that all point to the same three other web pages; 95% had a unifying topic [»chakS8_1998]
Quote: tightly-knit communities, cycles, and isolated components are generally irrelevant; reachability and high in-degree are better measures of relevance; BFS/InDegree better than Hits/PageRank [»boroA2_2005]
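
The community signature described above (three pages that all point to the same three other pages) can be found by brute force on a small link graph. This is a minimal sketch, assuming the graph fits in memory as a dictionary of out-links; the function name and page names are hypothetical, and a real crawl would need a far more scalable method.

```python
from itertools import combinations


def find_cores(links, size=3):
    """Return (sources, targets) pairs where every source links to every target.

    links maps a page to the pages it points to. A pair is reported when
    `size` source pages share at least `size` common targets, excluding
    the sources themselves.
    """
    out = {page: set(targets) for page, targets in links.items()}
    cores = []
    for sources in combinations(sorted(out), size):
        common = set.intersection(*(out[s] for s in sources))
        common -= set(sources)
        if len(common) >= size:
            cores.append((sources, tuple(sorted(common))))
    return cores


# Hypothetical graph: three fan pages all link to the same three sites.
links = {
    "f1": ["c1", "c2", "c3"],
    "f2": ["c1", "c2", "c3", "x"],
    "f3": ["c3", "c2", "c1"],
    "g": ["c1"],
}
print(find_cores(links))
```

The loop examines every triple of source pages, so its cost grows cubically with the number of pages; it is meant only to make the definition concrete.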

Subtopic: Web context

Quote: link analysis algorithms find the Web context of a query; locate highly relevant pages that do not contain the actual query words [»boroA2_2005]

Subtopic: metadata

Quote: online metadata consists of links (anchor text), tags of pictures or URLs, page views (access logs), and free-form reviews or comments [»ramaR8_2007]

Subtopic: neighborhood search

Quote: WebGlimpse allows neighborhood search computed at indexing time; allows jumps to related, close-by pages [»manbU1_1997]
Quote: ScentTrails faster than ShortScent; both faster than searching or browsing; company site with 3-4 links from home page to destination [»olstC9_2003]
Quote: ScentTrails allowed simultaneous use of browsing cues and search cues; e.g., find a copier with recyclable toner [»olstC9_2003]
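
Neighborhood search can be illustrated as a keyword search restricted to pages within a few links of a starting page. The sketch below is an assumption about the general idea, not WebGlimpse's implementation: it computes the neighborhood at query time with a breadth-first walk, whereas the quoted system computes neighborhoods at indexing time. All names and pages are hypothetical.

```python
from collections import deque


def neighborhood(links, start, max_hops=2):
    """Return the set of pages reachable from start in at most max_hops links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


def search_neighborhood(pages, links, start, word, max_hops=2):
    """Return nearby pages whose text contains word, in sorted order."""
    nearby = neighborhood(links, start, max_hops)
    word = word.lower()
    return sorted(p for p in nearby if word in pages.get(p, "").lower().split())


# Hypothetical company site.
pages = {
    "home": "welcome",
    "products": "copier models",
    "toner": "recyclable toner",
    "far": "toner archive",
}
links = {"home": ["products"], "products": ["toner"], "toner": ["far"]}
print(search_neighborhood(pages, links, "home", "toner"))
```

Precomputing and storing each page's neighborhood set during indexing would trade disk space for faster queries.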

Subtopic: statistics

Quote: study of user searches at Excite [»jansBJ1_1998]
Quote: 6% of Excite users used AND or "..."; 1% used parentheses [»jansBJ1_1998]
Quote: two-thirds of Excite users used only one query; a fifth modified it once [»jansBJ1_1998]
Quote: more than half of Excite users only viewed first results page; a fifth viewed two pages of ten results per page [»jansBJ1_1998]
Quote: a quarter of the 63 top subject terms were sexual; 10% of the 63 top subject terms were places (e.g., state, american); 8% were economic (e.g., employment, jobs) [»jansBJ1_1998]
Quote: 16% of the 63 top subject terms were modifiers (e.g., free, new, big)
Quote: average query on Excite contained 2.3 terms; a third were one term only; much shorter than normal IR queries [»jansBJ1_1998]
Quote: analysis of AltaVista search query log; 25 most common queries covered 1.5% of total; most sessions consisted of one query, viewed one screen of results [»silvC9_1999]

Subtopic: search feedback

Quote: clicks on search results can provide implicit feedback to the search engine [»joacT8_2007]
Quote: click streams and attentional metadata outweigh all other metadata; about 46 billion clicks per day worldwide [»ramaR8_2007]
Quote: 5% of Excite queries used 'More Like This'; traditional IR searching uses relevance feedback more [»jansBJ1_1998]
Quote: iterative search by identifying relevant and irrelevant documents; augment query with keywords from relevant documents [»saltG4_1970]
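
The iterative search loop described above (mark documents relevant or irrelevant, then augment the query with keywords from the relevant ones) can be sketched with simple term counting. This is a simplified illustration, not the weighting scheme of the cited work; the function name, the `top_k` parameter, and the sample documents are assumptions.

```python
from collections import Counter


def expand_query(query_terms, relevant_docs, irrelevant_docs, top_k=2):
    """Add up to top_k new terms that favor the relevant documents.

    Each term scores +1 per relevant document containing it and -1 per
    irrelevant document containing it. Only positive-scoring terms that
    are not already in the query are added.
    """
    scores = Counter()
    for doc in relevant_docs:
        scores.update(set(doc.lower().split()))
    for doc in irrelevant_docs:
        scores.subtract(set(doc.lower().split()))
    existing = {t.lower() for t in query_terms}
    candidates = [(t, s) for t, s in scores.items() if s > 0 and t not in existing]
    candidates.sort(key=lambda item: (-item[1], item[0]))  # best score, then alphabetical
    return list(query_terms) + [t for t, _ in candidates[:top_k]]


# Hypothetical feedback round for an ambiguous query.
print(expand_query(
    ["jaguar"],
    relevant_docs=["jaguar cat habitat", "jaguar cat rainforest"],
    irrelevant_docs=["jaguar car dealer"],
))
```

Running the expanded query and repeating the judgment step gives the iterative loop.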

Subtopic: compare search engines

Quote: compare search engines by interleaving results from each engine [»joacT8_2007]
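
A minimal sketch of interleaving, assuming a balanced merge in which the two rankings take turns contributing their next result and duplicates are shown once. The function name and the `a_first` tie-break flag are assumptions (in practice the first list would be chosen at random), and the cited paper's exact procedure may differ.

```python
def interleave(a, b, a_first=True):
    """Merge two rankings so each contributes about equally to any prefix."""
    merged = []
    ka = kb = 0  # next unread position in each ranking
    while ka < len(a) and kb < len(b):
        if ka < kb or (ka == kb and a_first):
            if a[ka] not in merged:
                merged.append(a[ka])
            ka += 1
        else:
            if b[kb] not in merged:
                merged.append(b[kb])
            kb += 1
    return merged


# Hypothetical rankings from two engines for the same query.
engine_a = ["a", "b", "c"]
engine_b = ["b", "d", "a"]
print(interleave(engine_a, engine_b))
```

Clicks on the merged list can then be credited to whichever engine ranked the clicked result higher, which gives a preference between the two engines without asking users for explicit judgments.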

Subtopic: pagerank

Quote: PageRank poor for common queries, good for sparse queries such as 'jaguar'; mixes different communities; promotes different pages than other search algorithms [»boroA2_2005]
Quote: peer-review is an artifact of journal publishing; will be replaced by autonomous references from respected sources [»pempS7_2000]
Quote: Clever analyzes hyperlinks to identify authorities on a topic and hubs with links to authorities [»chakS8_1998]
Quote: definition of PageRank algorithm; iterative calculation using citations, out links, and damping factor [»brinS4_1998]
Quote: rank web pages by counting backlinks to page; normalize by number of links; PageRank is probability of a visit [»brinS4_1998]
Quote: high PageRank if many pages point to the page, or if highly ranked pages point to the page [»brinS4_1998]
Quote: improved algorithm for PageRank; sum of values is one; handles isolated cycles and dangling pages [»kimSJ3_2002]
Quote: the iterative algorithm for hubs and authorities converges to the principal eigenvectors of the weighting matrices [»kleiJM9_1999]
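
The PageRank quotes above describe an iterative calculation over backlinks, normalized by each page's number of out links and mixed with a damping factor. The sketch below is one common formulation, assuming the ranks are normalized to sum to one and that the rank of dangling pages (pages with no out links) is spread evenly over all pages. It is not the exact algorithm of any cited paper; the function name and sample graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a dict of page -> rank; the ranks sum to one.

    links maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by dangling pages is shared by every page.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: c is linked from both a and b.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this graph c ranks highest because two pages point to it, and a ranks above b because its single backlink comes from the highly ranked c.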

Subtopic: indexing

Quote: index anchor text as well as target; 259 million anchors for 24 million pages [»brinS4_1998]
Quote: index the anchor text as a title of the target page; helps identify a document; good for Web searching [»zobeJ7_2006]
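
Indexing anchor text under the target page can be sketched with a small inverted index. The data layout below (a dictionary of page text plus a list of source, anchor text, target triples) and the example URLs are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(pages, anchors):
    """Return word -> set of URLs.

    pages maps url -> body text. anchors is a list of
    (source_url, anchor_text, target_url) triples. Anchor words are
    credited to the target page, as if they were part of its title.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    for _source, anchor_text, target in anchors:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


# Hypothetical pages: the home page never uses the words "search engine".
pages = {
    "a.example/home": "welcome to our site",
    "b.example/list": "useful links",
}
anchors = [("b.example/list", "search engine", "a.example/home")]
index = build_index(pages, anchors)
print(sorted(index["search"]))
```

The home page is found for "search" even though the word appears only in a link pointing at it.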

Subtopic: crawler

Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur [»brinS4_1998]
Quote: Google processes 4 million pages a day; indexer keeps up with the crawler [»brinS4_1998]

Subtopic: archive

Quote: Google stores all web documents it finds; allows independent, efficient research of the web [»brinS4_1998]

Subtopic: implementation

Quote: Google uses BigFiles with 64-bit offsets, multiple file systems, compression, and allocation/deallocation [»brinS4_1998]
Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc [»brinS4_1998]
Quote: a Google query touches hundreds of megabytes of data and executes billions of CPU cycles [»barrLA3_2003]
Quote: Google runs more than 15,000 commodity-class PCs
Quote: energy efficiency and price-performance are the primary factors for Google clusters
Quote: search is parallelizable by randomly dividing index into pieces called index shards; a pool of machines for each shard [»barrLA3_2003]
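
The index-shard idea can be sketched as random partitioning followed by a scatter and merge. This is a toy illustration under stated assumptions: each shard is a plain dictionary scanned in a loop rather than a real inverted index served by a pool of machines, and every name here is hypothetical.

```python
import random


def build_shards(docs, num_shards, seed=0):
    """Randomly assign each document to one of num_shards shards."""
    rng = random.Random(seed)
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[rng.randrange(num_shards)][doc_id] = text
    return shards


def search_shard(shard, word):
    """Return ids of documents in one shard that contain word."""
    return [doc_id for doc_id, text in shard.items() if word in text.lower().split()]


def search(shards, word):
    """Query every shard and merge the partial results."""
    hits = []
    for shard in shards:  # each shard is independent, so this loop could run in parallel
        hits.extend(search_shard(shard, word))
    return sorted(hits)


# Hypothetical document collection.
docs = {1: "web search", 2: "link analysis", 3: "search engines", 4: "web crawler"}
shards = build_shards(docs, num_shards=3)
print(search(shards, "search"))
```

Because a document lives in exactly one shard, the merged result is the same regardless of how the random assignment falls.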

Subtopic: commercial problems

Quote: problem of manipulating search engines for profit; e.g., metadata is easily abused since it is invisible

Related Topics

Topic: archiving Information in Hypertext (6 items)
Topic: archives (19 items)
Topic: browsing with a user interface (14 items)
Topic: full-text indexing (37 items)
Topic: hypertext as external memory (24 items)
Topic: hypertext as a global database (30 items)
Topic: hypertext links (45 items)
Topic: information retrieval by relevance (33 items)
Topic: information retrieval by searching (35 items)
Topic: information services (17 items)
Topic: problems with information retrieval (51 items)
Topic: searching hypertext (17 items)
Topic: text trails through hypertext (17 items)
Topic: using keywords to search hypertext (26 items)
Topic: World-Wide Web (42 items)

Updated barberCB 7/05
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.