Topic: text compression

Summary

Text compression reduces the space occupied by text. Most compression algorithms work well with text, and word-based compression is also effective. Lexicons compress efficiently. (cbb 2/07)

Subtopic: compressed document collection

Quote: public domain code for compressing and indexing large document collections; entire retrieval system is 40% of the original [»wittIH_1994]
Quote: XRAY compression for large text/nontext files, random access, and new data; efficient; training phase, testing phase to adjust the phrase model, and coding phase [»cannA7_2002]

Subtopic: redundancy of natural language

Quote: about 75% redundancy in English, even in reverse order; estimated the entropy as 1 bit per letter [»shanCE1_1951]
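
The relation between entropy and redundancy in the quote above can be checked with a little arithmetic. The following is a minimal sketch, not Shannon's procedure: it assumes a 27-symbol alphabet (26 letters plus space), defines redundancy as one minus the ratio of entropy to the maximum log2(27) bits per symbol, and uses a hypothetical sample string. The zeroth-order estimate from single-letter frequencies is much higher than 1 bit per letter because it ignores context.

```python
import math
from collections import Counter


def letter_entropy(text: str) -> float:
    """Zeroth-order entropy in bits per symbol, from single-symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(entropy_bits: float, alphabet_size: int = 27) -> float:
    """Fraction of the maximum log2(alphabet_size) bits that carries no information."""
    return 1.0 - entropy_bits / math.log2(alphabet_size)


# Hypothetical sample; any lowercase English text will do.
sample = "the quick brown fox jumps over the lazy dog and the cat"
h0 = letter_entropy(sample)

print(f"zeroth-order entropy: {h0:.2f} bits/symbol")
print(f"redundancy at that entropy: {redundancy(h0):.0%}")
print(f"redundancy at 1 bit/letter: {redundancy(1.0):.0%}")
```

Under these assumptions, 1 bit per letter corresponds to a redundancy of roughly 79%, in the same range as the figure of about 75% quoted above.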

Subtopic: word-based text compression

Quote: use word-based text compression with Huffman codes and swap-to-near-front for large, dynamic text collections [»moffA3_1994]
Quote: the number of distinct words in a text is nearly linear in the size of the text; e.g., misspellings [»moffA3_1994]
Quote: for large document collections, better compression with Huffman encoded words than with gzip; simple synchronization, about half as fast, stores lexicon in memory
Quote: store document id for every word in text using 6% of the size of text; store entire database in 1/3 size of original text [»zobeJ8_1995]
Quote: use swap-to-near-front for auxiliary terms in word-based text compression [»moffA3_1994]
Quote: spaceless words -- encode word if followed by space, else encode word and separator [»demoES8_1998]
Quote: can improve data compression by replacing each word with a signature formed by replacing non-essential characters with '*' [»franR5_1996]
Quote: use antidictionaries for efficient, linear time compression of fixed data sources; i.e., words that are not in the text [»crocM7_1999]
Quote: compression by antidictionaries erases characters on compression and reconstructs them on decompression; e.g., humans can identify erased characters of English text [»crocM7_1999]
Quote: vdelta combines compressing and differencing via Tichy's block-move plus hashing; better than diff [»huntJJ4_1998]
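
The word-based Huffman and spaceless-words quotes above can be combined in a small sketch. It is illustrative only and is not the code from the cited papers: it assumes a static, two-pass Huffman code rather than a dynamic scheme with swap-to-near-front, it splits text with a simple regular expression, and all function names are hypothetical. A single space between two words is never encoded; the decoder re-inserts it whenever two word tokens are adjacent.

```python
import heapq
import re
from collections import Counter
from itertools import count

TOKEN = re.compile(r"\w+|\W+")  # alternating runs of word / non-word characters


def is_word(token: str) -> bool:
    return re.match(r"\w", token) is not None


def tokenize(text: str) -> list:
    """Spaceless words: drop a single space that sits between two words."""
    tokens = TOKEN.findall(text)
    kept = []
    for i, tok in enumerate(tokens):
        # Tokens alternate, so an interior " " always lies between two words.
        if tok == " " and 0 < i < len(tokens) - 1:
            continue
        kept.append(tok)
    return kept


def huffman_code(freqs: dict) -> dict:
    """Return {symbol: bit string} for a frequency table."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in left.items()}
        merged.update({s: "1" + b for s, b in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(text: str):
    tokens = tokenize(text)
    code = huffman_code(Counter(tokens))
    return "".join(code[t] for t in tokens), code


def decode(bits: str, code: dict) -> str:
    inverse = {b: s for s, b in code.items()}
    tokens, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # Huffman codes are prefix-free
            tokens.append(inverse[current])
            current = ""
    out = []
    for prev, tok in zip([None] + tokens, tokens):
        if prev is not None and is_word(prev) and is_word(tok):
            out.append(" ")  # restore the implicit single space
        out.append(tok)
    return "".join(out)


text = "to be, or not to be:  that is the question"
bits, code = encode(text)
print(len(bits), "bits for", len(text), "characters (code table not counted)")
print(decode(bits, code) == text)
```

The bit count excludes the code table (the lexicon), which a real system would also have to store or keep in memory.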

Subtopic: compressed index or lexicon

Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
Quote: compress lexicon entry into 3 bytes; 8 characters on average, shared prefix, compressed suffix, encoded count, predicted entry size [»wittIH_1991]
Quote: index gigabyte text collections with compression and an external, multi-way mergesort; average of one byte per pointer; less than 4 hours [»moffA8_1995]
Quote: compress indices to 15% of uncompressed collection size; may be faster as well [»scholF8_2002]
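
One common way to bring index pointers down to about a byte each is to store the gaps between successive document ids and encode each gap with a variable-length byte code. The sketch below shows that general idea only; it is not necessarily the coding used in the papers cited above, and the function names and sample posting list are hypothetical.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low bits first.

    The high bit is set only on the final byte of each number.
    """
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def compress_postings(doc_ids: list) -> bytes:
    """Store a sorted list of document ids as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decompress_postings(data: bytes) -> list:
    """Invert compress_postings by summing the decoded gaps."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:  # final byte of this gap
            value |= (byte & 0x7F) << shift
            prev += value
            ids.append(prev)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return ids


# Hypothetical posting list for one term.
postings = [3, 7, 8, 120, 121, 5000]
packed = compress_postings(postings)
print(len(packed), "bytes for", len(postings), "pointers")
print(decompress_postings(packed) == postings)
```

Frequent terms have small gaps, so most of their pointers fit in a single byte; only rare, widely spaced terms need two or more bytes per pointer.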

Related Topics

Topic: archives (19 items)
Topic: compressed data (16 items)
Topic: data compression algorithms (53 items)
Topic: revision delta (18 items)
Topic: searching compressed data (9 items)
Topic: strings (13 items)
Topic: suffix trie and suffix array (20 items)
Topic: version control (34 items)