Topic: full-text indexing

topics > computer science > information > Group: information retrieval

database keys and indexing
external search and sort
information retrieval by relevance
information retrieval by searching
information retrieval by topic
information retrieval with an index
manual indexing
searching the Web
using a description as a name
World-Wide Web


A full-text index of everything is indispensable. It automatically provides access to information, perhaps retrieving lost information. All terms should be indexed. Phrase searching is particularly effective.

With the Web, full-text indexing has become ubiquitous. The key is a web crawler that searches the Web for new information.

Use an inverted index for indexing. A lexicon is efficiently computed with move-to-front hash chains. Compress the indices. Partitioning text into fixed sized blocks may help. (cbb 10/07)

Subtopic: full-text index up

Quote: a full-text index of everything is indispensable; fast access to information, uncovers lost information [»manbU10_1993]

Subtopic: compressed, self-index up

Quote: a compressed self-index takes space close to that of compressed text, replaces it, and provides fast substring search; research survey [»navaG4_2007]
Quote: self-indexing by suffix arrays derived from the Burrows-Wheeler transform of a suffix tree; can also use Lempel-Ziv compression [»navaG4_2007]
Quote: compress indices to 15% of uncompressed collection size; may be faster as well [»scholF8_2002]

Subtopic: web crawler up

Quote: Google processes 4 million pages a day; indexer keeps up with the crawler [»brinS4_1998]
Quote: running the web crawler generated a fair amount of e-mail and phone calls; need to solve problems as they occur [»brinS4_1998]

Subtopic: indexing everything up

Quote: automatically save and index every copy, fax, and print to the infinite memory multifunction machine [»hullJJ3_2001]
Quote: the Elephant file system saves all important versions; easy undo of recent changes [»santDS12_1999]
Quote: all terms should be indexed; including numbers, URL tokens, and stopwords [»zobeJ7_2006]

Subtopic: vocabulary accumulation up

Quote: vocabulary accumulation by move-to-front hash chains; 99% of searches at first node in chain [»heinS4_2002]
Quote: lexicon of all words in database, linked to concordance, disk address, lexical hierarchy, and permuted index [»wittIH_1991]
Quote: use a two-level lexicon to reduce memory requirements; half of the terms only appear once in the 2-gigabyte TREC collection [»moffA8_1995]
Quote: retrieve passages with document-ordered processing for rare terms and term-ordered processing for common terms; efficient [»kaszM10_1999]
Quote: chained hash tables are best for building vocabularies; considerably faster than splay trees and binary trees [»zobeJ12_2001]

Subtopic: passage and partitions up

Quote: index passages instead of full text; proximate terms, similar length, point into document, easy to present [»kaszM10_1999]
Quote: best retrieval with 150-300 word passages starting every 25 words [»kaszM10_1999]
Quote: Glimpse features a small index and approximate matching; divide data into blocks, index the blocks, full text search of possible matches [»manbU10_1993]
Quote: use equivalence relations, pivots, and compact partitions to index metric spaces for proximity queries; pivot best if memory, compact partitioning best in high dimension [»chavE9_2001]
Quote: an index partition in high-dimensional space spans most of the space in most dimensions, from border to border; coarse partitioning [»bohmC9_2001]

Subtopic: inverted file up

Quote: use inverted files for text query evaluation; better than suffix arrays and signature files [»zobeJ7_2006]

Subtopic: stop list up

Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
Quote: a nextword index allows efficient processing of stopwords in phrase queries
Quote: removing commonest stopwords from queries reduces search time by 60% on average, 3x on longer queries [»bahlD8_2002]

Subtopic: phrase index up

Quote: build two indices, a word-level index for phrase and Boolean searches and a document-level index for ranked queries [»zobeJ7_2006]
Quote: for phrase queries, use an index for word pairs that begin with a common word; the three commonest words halves phrase querying time [»zobeJ7_2006]
Quote: use nextword index for phrase queries; 3x faster at 3% extra space [»bahlD8_2002]
Quote: phrases are not substantially superior to single terms for indexing; nor are sophisticated analysis tools [»saltG4_1970]
Quote: n-grams of text streams are efficient, robust, and capture statistically significant phrases [»coheJD12_1998]
Quote: genomic nucleotide data has a fairly uniform distribution of 9-grams; poor locality [»heinS4_2002]

Subtopic: disk index up

Quote: with interactive information retrieval, a query may be modified many times; an inverted file/index is more efficient [»eastCM1_1988]
Quote: B+-trees are nearly optimal for disk-based indices; good storage utilization [»bayeR_1972]
Quote: business data uses indexed sequential files for sequential processing and indexed retrieval; also relational and object-oriented databases [»glasRL9_1997]

Subtopic: index implementation up

Quote: indexing core for constructing a document-level index with ranked query evaluation; refinements for reorganization, phrases, and compression; includes bibliography [»zobeJ7_2006]
Quote: merge-based index construction scales well; 100MB of memory, can minimize disk space overhead, only one parsing pass, compression effective [»zobeJ7_2006]
Quote: compact encoding of indexed hit list; plain hits with capitalization, font size, offset; fancy hits for URL, anchor, etc [»brinS4_1998]
Quote: WebGlimpse gets offsets into a pointer file, collects file names, titles, and URLs, and then searches files directly
Quote: search is parallelizable by randomly dividing index into pieces called index shards; a pool of machines for each shard

Related Topics up

Topic: b-trees (16 items)
Topic: database keys and indexing (18 items)
Topic: external search and sort (23 items)
Topic: information retrieval by relevance (33 items)
Topic: information retrieval by searching (35 items)
Topic: information retrieval by topic (16 items)
Topic: information retrieval with an index (32 items)
Topic: manual indexing (19 items)
Topic: searching the Web (53 items)
Topic: using a description as a name (21 items)
Topic: World-Wide Web
(42 items)

Updated barberCB 6/05
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.