ThesaHelp: references a-b
Topic: searching the Web
Topic: hypertext as a global database
Topic: probability
Topic: hypertext links
Topic: problems with information retrieval
Topic: information retrieval with queries
Topic: data caching
Group: data structures
Topic: unique numeric names as surrogates
Topic: examples of file systems
Topic: data compression algorithms
Topic: error safe systems
Topic: full-text indexing
Topic: bugs
Brin, S., Page, L.,
"The anatomy of a large-scale hypertextual web search engine",
Seventh International World-Wide Web Conference, April 1998, Brisbane, Australia,
Computer Networks and ISDN Systems, 30, 1-7, pp. 107-117.
Other Reference
notes from full version at or
section numbers from
1.3.2 ;;Quote: Google stores all web documents it finds; allows independent, efficient research of the web
| 2.1.1 ;;Quote: rank web pages by counting backlinks to page; normalize by number of links; PageRank is probability of a visit
| 2.1.1 ;;Quote: definition of PageRank algorithm; iterative calculation using citations, out links, and damping factor
| 2.1.2 ;;Quote: high PageRank if many pages point to the page, or if highly ranked pages point to the page
| 2.2 ;;Quote: index anchor text as well as target; 259 million anchors for 24 million pages
| 2.3 ;;Quote: Google records proximity, font size, and raw HTML for all pages
| 3.1 ;;Quote: vector space model does not work well on the web; returns short documents
| 3.1+;;Quote: a search for "Bill Clinton" should return reasonable results
| 3.2 ;;Quote: problem of manipulating search engines for profit; e.g., metadata is easily abused since it is invisible
| 4.2 ;;Quote: designed Google data structures to avoid disk seeks; a seek still takes 10 milliseconds
| 4.2.1 ;;Quote: Google uses BigFiles with 64-bit offsets, multiple file systems, compression, and allocation/deallocation
| 4.2.2 ;;Quote: Google compresses the repository using zlib; faster than bzip
| 4.2.2+;;Quote: Google stores docID, length, URL and document; with error log, can rebuild everything
| 4.2.5 ;;Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc.
| 4.3 ;;Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur
| 5.2 ;;Quote: Google processes 4 million pages a day; indexer keeps up with the crawler
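Example for 2.1.1
The PageRank notes above (iterative calculation using citations, out links, and damping factor) can be illustrated with a small sketch. It follows the formula given in the paper, PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A and C(T) is the number of links going out of T. The function name, the dictionary-based graph, the fixed iteration count, and the handling of pages without out links are assumptions made for illustration; this is not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively estimate PageRank for a small link graph.

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format, chosen for this sketch).
    d:     damping factor; the paper says it is usually set to 0.85.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct links going out of each page.
    out_count = {p: len(set(links.get(p, ()))) for p in pages}

    # Backlinks: for each page, the pages that point to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    # Start every page at rank 1 and repeatedly apply the formula.
    # Assumption: pages with no out links simply pass on no rank.
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - d) + d * sum(rank[q] / out_count[q] for q in incoming[p])
            for p in pages
        }
    return rank


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

In this three-page graph, C receives links from both A and B, so it ends up ranked above B, which matches the note for 2.1.2: a page ranks high if many pages, or highly ranked pages, point to it.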