Topic: b-trees
Topic: database keys and indexing
Topic: external search and sort
Topic: information retrieval by relevance
Topic: information retrieval by searching
Topic: information retrieval by topic
Topic: information retrieval with an index
Topic: manual indexing
Topic: searching the Web
Topic: using a description as a name
Topic: World-Wide Web
| |
Summary
A full-text index of everything is indispensable. It automatically provides access to information, perhaps retrieving lost information. All terms should be indexed. Phrase searching is particularly effective.
With the Web, full-text indexing has become ubiquitous. The key is a web crawler that searches the Web for new information.
Use an inverted index for indexing. A lexicon is efficiently computed with move-to-front hash chains. Compress the indices. Partitioning text into fixed sized blocks may help. (cbb 10/07)
Subtopic: full-text index
Quote: a full-text index of everything is indispensable; fast access to information, uncovers lost information [»manbU10_1993]
| Subtopic: compressed, self-index
Quote: a compressed self-index takes space close to that of compressed text, replaces it, and provides fast substring search; research survey [»navaG4_2007]
| Quote: self-indexing by suffix arrays derived from the Burrows-Wheeler transform of a suffix tree; can also use Lempel-Ziv compression [»navaG4_2007]
| Quote: compress indices to 15% of uncompressed collection size; may be faster as well [»scholF8_2002]
| Subtopic: web crawler
Quote: Google processes 4 million pages a day; indexer keeps up with the crawler [»brinS4_1998]
| Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur [»brinS4_1998]
| Subtopic: indexing everything
Quote: automatically save and index every copy, fax, and print to the infinite memory multifunction machine [»hullJJ3_2001]
| Quote: the Elephant file system saves all important versions; easy undo of recent changes [»santDS12_1999]
| Quote: all terms should be indexed; including numbers, URL tokens, and stopwords [»zobeJ7_2006]
| Subtopic: vocabulary accumulation
Quote: vocabulary accumulation by move-to-front hash chains; 99% of searches at first node in chain [»heinS4_2002]
| Quote: lexicon of all words in database, linked to concordance, disk address, lexical hierarchy, and permuted index [»wittIH_1991]
| Quote: use a two-level lexicon to reduce memory requirements; half of the terms only appear once in the 2-gigabyte TREC collection [»moffA8_1995]
| Quote: retrieve passages with document-ordered processing for rare terms and term-ordered processing for common terms; efficient [»kaszM10_1999]
| Quote: chained hash tables are best for building vocabularies; considerably faster than splay trees and binary trees [»zobeJ12_2001]
| Subtopic: passage and partitions
Quote: index passages instead of full text; proximate terms, similar length, point into document, easy to present [»kaszM10_1999]
| Quote: best retrieval with 150-300 word passages starting every 25 words [»kaszM10_1999]
| Quote: Glimpse features a small index and approximate matching; divide data into blocks, index the blocks, full text search of possible matches [»manbU10_1993]
| Quote: use equivalence relations, pivots, and compact partitions to index metric spaces for proximity queries; pivot best if memory, compact partitioning best in high dimension [»chavE9_2001]
| Quote: an index partition in high-dimensional space spans most of the space in most dimensions, from border to border; coarse partitioning [»bohmC9_2001]
| Subtopic: inverted file
Quote: use inverted files for text query evaluation; better than suffix arrays and signature files [»zobeJ7_2006]
| Subtopic: stop list
Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
| Quote: a nextword index allows efficient processing of stopwords in phrase queries
| Quote: removing commonest stopwords from queries reduces search time by 60% on average, 3x on longer queries [»bahlD8_2002]
| Subtopic: phrase index
Quote: build two indices, a word-level index for phrase and Boolean searches and a document-level index for ranked queries [»zobeJ7_2006]
| Quote: for phrase queries, use an index for word pairs that begin with a common word; the three commonest words halves phrase querying time [»zobeJ7_2006]
| Quote: use nextword index for phrase queries; 3x faster at 3% extra space [»bahlD8_2002]
| Quote: phrases are not substantially superior to single terms for indexing; nor are sophisticated analysis tools [»saltG4_1970]
| Quote: n-grams of text streams are efficient, robust, and capture statistically significant phrases [»coheJD12_1998]
| Quote: genomic nucleotide data has a fairly uniform distribution of 9-grams; poor locality [»heinS4_2002]
| Subtopic: disk index
Quote: with interactive information retrieval, a query may be modified many times; an inverted file/index is more efficient [»eastCM1_1988]
| Quote: B+-trees are nearly optimal for disk-based indices; good storage utilization [»bayeR_1972]
| Quote: business data uses indexed sequential files for sequential processing and indexed retrieval; also relational and object-oriented databases [»glasRL9_1997]
| Subtopic: index implementation
Quote: indexing core for constructing a document-level index with ranked query evaluation; refinements for reorganization, phrases, and compression; includes bibliography [»zobeJ7_2006]
| Quote: merge-based index construction scales well; 100MB of memory, can minimize disk space overhead, only one parsing pass, compression effective [»zobeJ7_2006]
| Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc [»brinS4_1998]
| Quote: WebGlimpse gets offsets into a pointer file, collects file names, titles, and URLs, and then searches files directly
| Quote: search is parallelizable by randomly dividing index into pieces called index shards; a pool of machines for each shard [»barrLA3_2003]
|
Related Topics
Topic: b-trees (16 items)
Topic: database keys and indexing (18 items)
Topic: external search and sort (23 items)
Topic: information retrieval by relevance (33 items)
Topic: information retrieval by searching (35 items)
Topic: information retrieval by topic (16 items)
Topic: information retrieval with an index (32 items)
Topic: manual indexing (19 items)
Topic: searching the Web (53 items)
Topic: using a description as a name (21 items)
Topic: World-Wide Web (42 items)
|