Th Topic: full-text indexing

Topic: full-text indexing

topics > computer science > information > Group: information retrieval

Topic:
b-trees
Topic:
database keys and indexing
Topic:
external search and sort
Topic:
information retrieval by relevance
Topic:
information retrieval by searching
Topic:
information retrieval by topic
Topic:
information retrieval with an index
Topic:
manual indexing
Topic:
searching the Web
Topic:
using a description as a name
Topic:
World-Wide Web

Summary

A full-text index of everything is indispensable. It automatically provides access to information, perhaps retrieving lost information. All terms should be indexed. Phrase searching is particularly effective.
With the Web, full-text indexing has become ubiquitous. The key is a web crawler that searches the Web for new information.
Use an inverted index for indexing. A lexicon is efficiently computed with move-to-front hash chains. Compress the indices. Partitioning text into fixed sized blocks may help. (cbb 10/07)

Subtopic: full-text index

Quote: a full-text index of everything is indispensable; fast access to information, uncovers lost information [»manbU10_1993]

Subtopic: compressed, self-index

Quote: a compressed self-index takes space close to that of compressed text, replaces it, and provides fast substring search; research survey [»navaG4_2007]
Quote: self-indexing by suffix arrays derived from the Burrows-Wheeler transform of a suffix tree; can also use Lempel-Ziv compression [»navaG4_2007]
Quote: compress indices to 15% of uncompressed collection size; may be faster as well [»scholF8_2002]

Subtopic: web crawler

Quote: Google processes 4 million pages a day; indexer keeps up with the crawler [»brinS4_1998]
Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur [»brinS4_1998]

Subtopic: indexing everything

Quote: automatically save and index every copy, fax, and print to the infinite memory multifunction machine [»hullJJ3_2001]
Quote: the Elephant file system saves all important versions; easy undo of recent changes [»santDS12_1999]
Quote: all terms should be indexed; including numbers, URL tokens, and stopwords [»zobeJ7_2006]

Subtopic: vocabulary accumulation

Quote: vocabulary accumulation by move-to-front hash chains; 99% of searches at first node in chain [»heinS4_2002]
Quote: lexicon of all words in database, linked to concordance, disk address, lexical hierarchy, and permuted index [»wittIH_1991]
Quote: use a two-level lexicon to reduce memory requirements; half of the terms only appear once in the 2-gigabyte TREC collection [»moffA8_1995]
Quote: retrieve passages with document-ordered processing for rare terms and term-ordered processing for common terms; efficient [»kaszM10_1999]
Quote: chained hash tables are best for building vocabularies; considerably faster than splay trees and binary trees [»zobeJ12_2001]

Subtopic: passage and partitions

Quote: index passages instead of full text; proximate terms, similar length, point into document, easy to present [»kaszM10_1999]
Quote: best retrieval with 150-300 word passages starting every 25 words [»kaszM10_1999]
Quote: Glimpse features a small index and approximate matching; divide data into blocks, index the blocks, full text search of possible matches [»manbU10_1993]
Quote: use equivalence relations, pivots, and compact partitions to index metric spaces for proximity queries; pivot best if memory, compact partitioning best in high dimension [»chavE9_2001]
Quote: an index partition in high-dimensional space spans most of the space in most dimensions, from border to border; coarse partitioning [»bohmC9_2001]

Subtopic: inverted file

Quote: use inverted files for text query evaluation; better than suffix arrays and signature files [»zobeJ7_2006]

Subtopic: stop list

Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
Quote: a nextword index allows efficient processing of stopwords in phrase queries
Quote: removing commonest stopwords from queries reduces search time by 60% on average, 3x on longer queries [»bahlD8_2002]

Subtopic: phrase index

Quote: build two indices, a word-level index for phrase and Boolean searches and a document-level index for ranked queries [»zobeJ7_2006]
Quote: for phrase queries, use an index for word pairs that begin with a common word; the three commonest words halves phrase querying time [»zobeJ7_2006]
Quote: use nextword index for phrase queries; 3x faster at 3% extra space [»bahlD8_2002]
Quote: phrases are not substantially superior to single terms for indexing; nor are sophisticated analysis tools [»saltG4_1970]
Quote: n-grams of text streams are efficient, robust, and capture statistically significant phrases [»coheJD12_1998]
Quote: genomic nucleotide data has a fairly uniform distribution of 9-grams; poor locality [»heinS4_2002]

Subtopic: disk index

Quote: with interactive information retrieval, a query may be modified many times; an inverted file/index is more efficient [»eastCM1_1988]
Quote: B+-trees are nearly optimal for disk-based indices; good storage utilization [»bayeR_1972]
Quote: business data uses indexed sequential files for sequential processing and indexed retrieval; also relational and object-oriented databases [»glasRL9_1997]

Subtopic: index implementation

Quote: indexing core for constructing a document-level index with ranked query evaluation; refinements for reorganization, phrases, and compression; includes bibliography [»zobeJ7_2006]
Quote: merge-based index construction scales well; 100MB of memory, can minimize disk space overhead, only one parsing pass, compression effective [»zobeJ7_2006]
Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc [»brinS4_1998]
Quote: WebGlimpse gets offsets into a pointer file, collects file names, titles, and URLs, and then searches files directly
Quote: search is parallelizable by randomly dividing index into pieces called index shards; a pool of machines for each shard
[»barrLA3_2003]

Related Topics

Topic: b-trees (16 items)
Topic: database keys and indexing (18 items)
Topic: external search and sort (23 items)
Topic: information retrieval by relevance (33 items)
Topic: information retrieval by searching (35 items)
Topic: information retrieval by topic (16 items)
Topic: information retrieval with an index (32 items)
Topic: manual indexing (19 items)
Topic: searching the Web (53 items)
Topic: using a description as a name (21 items)
Topic: World-Wide Web
(42 items)