Topic: text compression




Text compression reduces the space occupied by text. Most general-purpose compression algorithms handle text well; word-based compression does especially well, and lexicons compress efficiently. (cbb 2/07)
Subtopic: compressed document collection

Quote: public domain code for compressing and indexing large document collections; entire retrieval system is 40% of the original [»wittIH_1994]
Quote: XRAY compression for large text/nontext files, random access, and new data; efficient; training phase, testing phase to adjust the phrase model, and coding phase [»cannA7_2002]

Subtopic: redundancy of natural language

Quote: about 75% redundancy in English, even in reverse order; estimated the entropy as 1 bit per letter [»shanCE1_1951]
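Shannon's 75% figure can be reproduced with a one-line calculation: compare an estimated entropy of natural English (his experiments bounded it between roughly 0.6 and 1.3 bits per letter) against the maximum entropy of a 27-symbol alphabet (26 letters plus space). A minimal sketch, assuming ~1.2 bits per letter as the estimate:

```python
import math

# Redundancy = 1 - (actual entropy / maximum possible entropy).
# 27 symbols: 26 letters plus the space character.
max_entropy = math.log2(27)        # about 4.75 bits per symbol
estimated_entropy = 1.2            # bits/letter; within Shannon's 0.6-1.3 range
redundancy = 1 - estimated_entropy / max_entropy
print(f"redundancy: {redundancy:.0%}")   # about 75%
```

With 1.0 bit per letter the figure rises to about 79%, so the "about 75%" claim is robust across Shannon's estimated range.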

Subtopic: word-based text compression

Quote: use word-based text compression with Huffman codes and swap-to-near-front for large, dynamic text collections [»moffA3_1994]
Quote: the number of distinct words in a text grows nearly linearly with the size of the text, e.g., due to misspellings [»moffA3_1994]
Quote: for large document collections, better compression with Huffman encoded words than with gzip; simple synchronization, about half as fast, stores lexicon in memory
Quote: store document id for every word in text using 6% of the size of text; store entire database in 1/3 size of original text [»zobeJ8_1995]
Quote: use swap-to-near-front for auxiliary terms in word-based text compression [»moffA3_1994]
Quote: spaceless words -- encode word if followed by space, else encode word and separator [»demoES8_1998]
Quote: can improve data compression by replacing each word with a signature formed by replacing non-essential characters with '*' [»franR5_1996]
Quote: use antidictionaries for efficient, linear time compression of fixed data sources; i.e., words that are not in the text [»crocM7_1999]
Quote: compression by antidictionaries erases characters on compression and reconstructs them on decompression; e.g., humans can identify erased characters of English text [»crocM7_1999]
Quote: vdelta combines compressing and differencing via Tichy's block-move plus hashing; better than diff [»huntJJ4_1998]
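The core idea of the quotes above, Huffman coding over words rather than characters, can be sketched briefly. This is a minimal illustration, not the Moffat or de Moura schemes: it tokenizes the text into alternating word and separator runs (instead of the spaceless-words trick), builds a Huffman code per distinct token, and keeps the lexicon alongside the bit string. The helper names (`huffman_codes`, `compress`, `decompress`) are hypothetical.

```python
import heapq
import re
from collections import Counter

def huffman_codes(freqs):
    """Build a prefix-free code from a token -> frequency map."""
    heap = [(f, i, {tok: ""}) for i, (tok, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                          # degenerate single-token case
        return {tok: "0" for tok in heap[0][2]}
    tiebreak = len(heap)                        # keeps tuple comparison off the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in c1.items()}
        merged.update({t: "1" + c for t, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def compress(text):
    # Alternating runs of letters and non-letters, so separators are tokens too.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    codes = huffman_codes(Counter(tokens))
    bits = "".join(codes[t] for t in tokens)
    return bits, codes                          # the codes dict is the lexicon

def decompress(bits, codes):
    inverse = {c: t for t, c in codes.items()}
    out, cur = [], ""
    for b in bits:                              # prefix-free, so greedy match works
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

bits, lexicon = compress("the cat sat on the mat, the cat sat")
```

Frequent words like "the" get short codes, which is why word-based Huffman beats character-level schemes on large collections; the cost, as the quotes note, is keeping the lexicon in memory.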

Subtopic: compressed index or lexicon

Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
Quote: compress lexicon entry into 3 bytes; 8 characters on average, shared prefix, compressed suffix, encoded count, predicted entry size [»wittIH_1991]
Quote: index gigabyte text collections with compression and an external, multi-way mergesort; average of one byte per pointer; less than 4 hours [»moffA8_1995]
Quote: compress indices to 15% of uncompressed collection size; may be faster as well
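The shared-prefix trick behind the 3-byte lexicon entries quoted above is front coding: with entries in sorted order, each word is stored as the number of leading characters it shares with its predecessor plus the differing suffix. A minimal sketch with hypothetical helper names:

```python
def front_code(sorted_words):
    """Encode a sorted word list as (shared-prefix length, suffix) pairs."""
    prev, out = "", []
    for w in sorted_words:
        k = 0
        while k < min(len(prev), len(w)) and prev[k] == w[k]:
            k += 1
        out.append((k, w[k:]))
        prev = w
    return out

def front_decode(entries):
    """Rebuild the word list from front-coded entries."""
    prev, out = "", []
    for k, suffix in entries:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

words = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]
coded = front_code(words)   # e.g., "jezer" becomes (4, "r")
```

Because sorted lexicon entries share long prefixes, the prefix length plus a short suffix usually fits in a few bytes, which is the basis of the 3-bytes-per-entry figure.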

Related Topics

Topic: archives (19 items)
Topic: compressed data (16 items)
Topic: data compression algorithms (53 items)
Topic: revision delta (18 items)
Topic: searching compressed data (9 items)
Topic: strings (13 items)
Topic: suffix trie and suffix array (20 items)
Topic: version control (34 items)

Updated barberCB 12/05
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.