Topic: searching the Web

Topic: archiving Information in Hypertext
Topic: archives
Topic: browsing with a user interface
Topic: full-text indexing
Topic: hypertext as external memory
Topic: hypertext as a global database
Topic: hypertext links
Topic: information retrieval by relevance
Topic: information retrieval by searching
Topic: information services
Topic: problems with information retrieval
Topic: searching hypertext
Topic: text trails through hypertext
Topic: using keywords to search hypertext
Topic: World-Wide Web

Summary

The Web has brought challenges and opportunities to information retrieval. With care, the vast bulk of the Web can be indexed and prioritized. Links help identify authoritative sites, but relevance remains elusive. Links also enable neighborhood search. (cbb 11/07)

Subtopic: authority vs. relevance

Quote: web search concerns authoritativeness instead of relevance; trusted source of correct information that has a strong web presence [»boroA2_2005]
Quote: hyperlinks are endorsements for web pages; a network of recommendations that identifies authoritative documents [»boroA2_2005]
Quote: want authoritative results from broad-topic queries; the set of relevant results is too large [»kleiJM9_1999]

Subtopic: search

Quote: even though the World Wide Web was built on the idea of links, global navigation has been replaced by search engines [»furnGW3_1997]
Quote: Google records proximity, font size, and raw HTML for all pages [»brinS4_1998]
Quote: vector space model does not work well on the web; returns short documents [»brinS4_1998]
Quote: a search for "Bill Clinton" should return reasonable results
Quote: in a hypertext, explore the original context for items; e.g., articles, books, or other hypertext units
Quote: with hypertext, continually uncover new items of interest, often with growing relevance

Subtopic: presentation order

Quote: three times more clicks for the top search result than the second position; by swapping results [»joacT8_2007]
Quote: strong presentation bias for search results even if you present the top 10 results in reverse order [»joacT8_2007]
Quote: skipped search results are clearly less relevant than selected results; pairwise relative preference had 80% agreement with manually ranked results [»joacT8_2007]
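
The pairwise preferences mentioned above can be sketched as follows. This is a minimal illustration, assuming a "click beats skipped results above it" rule; the function name click_skip_above and the list-based inputs are hypothetical and not taken from the cited study.

```python
def click_skip_above(ranking, clicked):
    """Return (preferred, less_preferred) pairs implied by one result page.

    Assumption: a clicked result is preferred over every unclicked
    result that was ranked above it (the user saw it and skipped it).
    """
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Every unclicked result above the click counts as skipped.
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                prefs.append((doc, skipped))
    return prefs


if __name__ == "__main__":
    print(click_skip_above(["d1", "d2", "d3", "d4"], ["d3"]))
```

Because the rule only compares a click with results above it, it yields relative judgments rather than absolute relevance scores, which is one way to sidestep the presentation bias toward the top position.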

Subtopic: reachability

Quote: the best link analysis algorithm ranks nodes by their reachability; BFS combines InDegree with the Hits algorithm; 44% highly relevant [»boroA2_2005]

Subtopic: topics, communities

Quote: identify communities as bipartite graph of three web pages that all point to the same three other web pages; 95% had a unifying topic [»chakS8_1998]
Quote: tightly-knit communities, cycles, and isolated components are generally irrelevant; reachability and high in-degree are better measures of relevance; BFS/InDegree better than Hits/PageRank [»boroA2_2005]
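
The community signature described above (three pages that all point to the same three other pages) can be illustrated with a brute-force sketch. The function name bipartite_cores and the dictionary-of-out-links input are assumptions for illustration; a crawler-scale implementation would need pruning that is not shown here.

```python
from itertools import combinations


def bipartite_cores(out_links, size=3):
    """Find (size, size) cores: `size` pages that all link to the same `size` targets.

    out_links maps a page to the pages it links to. Brute force over
    all combinations, so only suitable for small example graphs.
    """
    cores = []
    pages = sorted(out_links)
    for fans in combinations(pages, size):
        # Targets that every page in `fans` links to.
        common = set.intersection(*(set(out_links[p]) for p in fans))
        common -= set(fans)  # keep the two sides of the core disjoint
        for centers in combinations(sorted(common), size):
            cores.append((fans, centers))
    return cores


if __name__ == "__main__":
    graph = {
        "a": ["x", "y", "z"],
        "b": ["x", "y", "z", "w"],
        "c": ["z", "y", "x"],
        "d": ["x"],
    }
    print(bipartite_cores(graph))
```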

Subtopic: Web context

Quote: link analysis algorithms find the Web context of a query; locate highly relevant pages that do not contain the actual query words [»boroA2_2005]

Subtopic: metadata

Quote: online metadata consists of links (anchor text), tags of pictures or URLs, page views (access logs), and free-form reviews or comments [»ramaR8_2007]

Subtopic: neighborhood search

Quote: WebGlimpse allows neighborhood search computed at indexing time; allows jumps to related, close-by pages [»manbU1_1997]
Quote: ScentTrails faster than ShortScent; both faster than searching or browsing; company site with 3-4 links from home page to destination [»olstC9_2003]
Quote: ScentTrails allowed simultaneous use of browsing cues and search cues; e.g., find a copier with recyclable toner [»olstC9_2003]
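
A neighborhood search of the kind attributed to WebGlimpse above might be sketched like this: neighborhoods are precomputed once at indexing time, then a query is restricted to the neighborhood of the current page. The hop limit of 2, the function names, and the word-set index are all assumptions for illustration, not WebGlimpse's actual interface.

```python
from collections import deque


def neighborhoods(links, max_hops=2):
    """Precompute, for each page, the pages reachable within max_hops links."""
    result = {}
    for start in links:
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            page, dist = queue.popleft()
            if dist == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in links.get(page, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        result[start] = seen
    return result


def neighborhood_search(index, hoods, page, term):
    """Pages in the neighborhood of `page` whose word set contains `term`."""
    return sorted(p for p in hoods.get(page, ()) if term in index.get(p, set()))


if __name__ == "__main__":
    links = {"home": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": []}
    index = {"a": {"toner"}, "c": {"toner", "copier"}, "d": {"toner"}}
    hoods = neighborhoods(links)
    print(neighborhood_search(index, hoods, "home", "toner"))
```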

Subtopic: statistics

Quote: study of user searches at www.excite.com [»jansBJ1_1998]
Quote: 6% of Excite users used AND or "..."; 1% used parentheses [»jansBJ1_1998]
Quote: two-thirds of Excite users used only one query; a fifth modified it once [»jansBJ1_1998]
Quote: more than half of Excite users only viewed first results page; a fifth viewed two pages of ten results per page [»jansBJ1_1998]
Quote: a quarter of the 63 top subject terms were sexual; 10% of the 63 top subject terms were places (e.g., state, american); 8% were economic (e.g., employment, jobs) [»jansBJ1_1998]
Quote: 16% of the 63 top subject terms were modifiers (e.g., free, new, big)
Quote: average query on Excite contained 2.3 terms; a third were one term only; much shorter than normal IR queries [»jansBJ1_1998]
Quote: analysis of AltaVista search query log; 25 most common queries covered 1.5% of total; most sessions consisted of one query, viewed one screen of results [»silvC9_1999]

Subtopic: search feedback

Quote: clicks on search results can provide implicit feedback to the search engine [»joacT8_2007]
Quote: click streams and attentional metadata outweigh all other metadata; about 46 billion clicks per day worldwide [»ramaR8_2007]
Quote: 5% of Excite queries used 'More Like This'; traditional IR searching uses relevance feedback more [»jansBJ1_1998]
Quote: iterative search by identifying relevant and irrelevant documents; augment query with keywords from relevant documents [»saltG4_1970]
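
The iterative feedback loop described above (mark documents relevant or irrelevant, then augment the query with keywords from the relevant ones) can be sketched roughly as follows. This is a deliberately simplified stand-in: scoring a term by the number of relevant documents containing it minus the number of irrelevant ones is an assumption, and expand_query is a hypothetical name. Classical relevance feedback works on weighted term vectors rather than raw counts.

```python
from collections import Counter


def expand_query(query_terms, relevant_docs, irrelevant_docs, extra=2):
    """Add up to `extra` terms that occur in relevant but not irrelevant documents."""
    score = Counter()
    for doc in relevant_docs:
        score.update(set(doc.lower().split()))    # +1 per relevant document
    for doc in irrelevant_docs:
        score.subtract(set(doc.lower().split()))  # -1 per irrelevant document
    present = set(query_terms)
    candidates = [(s, t) for t, s in score.items() if s > 0 and t not in present]
    # Highest score first; ties broken alphabetically for repeatable output.
    candidates.sort(key=lambda st: (-st[0], st[1]))
    return list(query_terms) + [t for _, t in candidates[:extra]]


if __name__ == "__main__":
    print(expand_query(
        ["jaguar"],
        relevant_docs=["jaguar cat habitat", "jaguar cat prey"],
        irrelevant_docs=["jaguar car dealer"],
    ))
```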

Subtopic: compare search engines

Quote: compare search engines by interleaving results from each engine [»joacT8_2007]
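
One way to interleave two rankings is the scheme often called balanced interleaving, sketched below. Whether this matches the exact procedure of the cited study is an assumption; the function name and the a_first flag (normally a coin flip) are illustrative.

```python
def balanced_interleave(a, b, a_first=True):
    """Merge rankings a and b so that any prefix draws about equally from both.

    a_first decides which ranking contributes first when both pointers
    are level; in an experiment it would be chosen at random per query.
    """
    merged = []
    ka = kb = 0  # how far we have read into a and b
    while ka < len(a) and kb < len(b):
        if ka < kb or (ka == kb and a_first):
            if a[ka] not in merged:
                merged.append(a[ka])
            ka += 1
        else:
            if b[kb] not in merged:
                merged.append(b[kb])
            kb += 1
    return merged


if __name__ == "__main__":
    print(balanced_interleave(["a", "b", "c"], ["b", "d", "a"]))
```

Clicks on the merged list can then be credited to whichever engine ranked the clicked result higher, so the user's behavior itself casts the vote between the two engines.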

Subtopic: pagerank

Quote: PageRank poor for common queries, good for sparse queries such as 'jaguar'; mixes different communities; promotes different pages than other search algorithms [»boroA2_2005]
Quote: peer-review is an artifact of journal publishing; will be replaced by autonomous references from respected sources [»pempS7_2000]
Quote: Clever analyzes hyperlinks to identify authorities on a topic and hubs with links to authorities [»chakS8_1998]
Quote: definition of PageRank algorithm; iterative calculation using citations, out links, and damping factor [»brinS4_1998]
Quote: rank web pages by counting backlinks to page; normalize by number of links; PageRank is probability of a visit [»brinS4_1998]
Quote: high PageRank if many pages point to the page, or if highly ranked pages point to the page [»brinS4_1998]
Quote: improved algorithm for PageRank; sum of values is one; handles isolated cycles and dangling pages [»kimSJ3_2002]
Quote: the iterative algorithm for hubs and authorities converges to the principal eigenvectors of the weighting matrices [»kleiJM9_1999]
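
The iterative PageRank calculation described above can be sketched as a short power iteration. This version is an illustration under stated assumptions: ranks are normalized to sum to one, and pages with no out links spread their rank evenly over all pages. The damping value 0.85 and the fixed iteration count are conventional choices, not figures taken from the quotes.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iteratively estimate the probability of visiting each page.

    out_links maps a page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dangling pages (no out links) is shared by everyone.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                # Each page divides its rank equally among its out links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": []}
    for page, value in sorted(pagerank(graph).items()):
        print(page, round(value, 3))
```

In the example graph, page c is pointed to by two pages and page a is pointed to by the highly ranked c, so both end up above b and d, which have no backlinks.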

Subtopic: indexing

Quote: index anchor text as well as target; 259 million anchors for 24 million pages [»brinS4_1998]
Quote: index the anchor text as a title of the target page; helps identify a document; good for Web searching [»zobeJ7_2006]
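
Indexing anchor text under the target page, as described above, might look like the following sketch. The page structure (a dict with "text" and "anchors" keys), the example URLs, and the function name build_index are assumptions for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index: word -> set of URLs.

    pages: {url: {"text": str, "anchors": [(anchor_text, target_url), ...]}}
    Body words are indexed under the page itself; anchor words are
    indexed under the page the link points to.
    """
    index = defaultdict(set)
    for url, page in pages.items():
        for word in page["text"].lower().split():
            index[word].add(url)
        for anchor_text, target in page["anchors"]:
            for word in anchor_text.lower().split():
                index[word].add(target)  # credit the target, not the source
    return index


if __name__ == "__main__":
    pages = {
        "hub.example": {"text": "links", "anchors": [("search engine", "engine.example")]},
        "engine.example": {"text": "welcome", "anchors": []},
    }
    print(sorted(build_index(pages)["search"]))
```

In the example, engine.example is found for the word "search" even though that word never appears in its own text.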

Subtopic: crawler

Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur [»brinS4_1998]
Quote: Google processes 4 million pages a day; indexer keeps up with the crawler [»brinS4_1998]

Subtopic: archive

Quote: Google stores all web documents it finds; allows independent, efficient research of the web [»brinS4_1998]

Subtopic: implementation

Quote: Google uses BigFiles with 64-bit offsets, multiple file systems, compression, and allocation/deallocation [»brinS4_1998]
Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc [»brinS4_1998]
Quote: a Google query touches hundreds of megabytes of data and executes billions of CPU cycles [»barrLA3_2003]
Quote: Google runs more than 15,000 commodity-class PCs
Quote: energy efficiency and price-performance are the primary factors for Google clusters
Quote: search is parallelizable by randomly dividing index into pieces called index shards; a pool of machines for each shard [»barrLA3_2003]
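
The index-shard idea above can be sketched in a few lines: documents are assigned to shards at random, each shard answers the query independently, and the partial results are merged. Scoring by raw term count, the function names, and the sequential loop are assumptions for illustration; in a real cluster each shard would be served by its own pool of machines.

```python
import heapq
import random


def build_shards(docs, num_shards, seed=0):
    """Randomly divide documents into num_shards small indexes."""
    rng = random.Random(seed)
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[rng.randrange(num_shards)][doc_id] = text.lower().split()
    return shards


def search_shard(shard, term):
    """Score matching documents in one shard by term count (an assumption)."""
    return [(words.count(term), doc_id) for doc_id, words in shard.items() if term in words]


def search(shards, term, k=3):
    """Query every shard and merge the top-k results."""
    hits = []
    for shard in shards:  # sequential here; shards are independent, so this could run in parallel
        hits.extend(search_shard(shard, term))
    return [doc_id for _, doc_id in heapq.nlargest(k, hits)]


if __name__ == "__main__":
    docs = {"d1": "web search web", "d2": "web", "d3": "search engine", "d4": "web web web"}
    print(search(build_shards(docs, 2), "web", k=2))
```

Because each document lives in exactly one shard, the merged answer is the same no matter how the documents were divided, which is what makes the random split safe.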

Subtopic: commercial problems

Quote: problem of manipulating search engines for profit; e.g., metadata is easily abused since it is invisible [»brinS4_1998]

Related Topics

Topic: archiving Information in Hypertext (6 items)
Topic: archives (19 items)
Topic: browsing with a user interface (14 items)
Topic: full-text indexing (37 items)
Topic: hypertext as external memory (24 items)
Topic: hypertext as a global database (30 items)
Topic: hypertext links (45 items)
Topic: information retrieval by relevance (33 items)
Topic: information retrieval by searching (35 items)
Topic: information services (17 items)
Topic: problems with information retrieval (51 items)
Topic: searching hypertext (17 items)
Topic: text trails through hypertext (17 items)
Topic: using keywords to search hypertext (26 items)
Topic: World-Wide Web (42 items)