Summary
Text compression reduces the space that text occupies. Most general-purpose compression algorithms handle text well; for large collections, word-based compression works especially well, and lexicons themselves compress efficiently. (cbb 2/07)
Subtopic: compressed document collection
Quote: public domain code for compressing and indexing large document collections; entire retrieval system is 40% of the original [»wittIH_1994]
Quote: XRAY compression for large text/nontext files, random access, and new data; efficient; training phase, testing phase to adjust the phrase model, and coding phase [»cannA7_2002]
Subtopic: redundancy of natural language
Quote: about 75% redundancy in English, even in reverse order; estimated the entropy as 1 bit per letter [»shanCE1_1951]
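Shannon's 1 bit per letter comes from exploiting long-range context; even a simple order-0 estimate over single characters shows much of the redundancy. A minimal sketch (standard entropy formula, not Shannon's guessing method):

    import math
    from collections import Counter

    def order0_entropy(text):
        # Order-0 entropy in bits per character: H = -sum(p * log2(p)).
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Typical English text measures around 4 bits/char at order 0, versus
    # log2(27) ~= 4.75 bits for uniform letters plus space; Shannon's ~1 bit
    # per letter estimate uses far longer context than single characters.
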
Subtopic: word-based text compression
Quote: use word-based text compression with Huffman codes and swap-to-near-front for large, dynamic text collections [»moffA3_1994]
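A minimal sketch of word-based Huffman coding, assuming whitespace tokenization (illustrative only, not Moffat's exact scheme; the lexicon maps each distinct word to a bit code):

    import heapq
    from collections import Counter

    def huffman_codes(freqs):
        # Build a Huffman code over words; returns {word: bitstring}.
        # The integer tick is a tiebreaker so the heap never compares dicts.
        heap = [(f, i, {w: ''}) for i, (w, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tick = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {w: '0' + bits for w, bits in c1.items()}
            merged.update({w: '1' + bits for w, bits in c2.items()})
            heapq.heappush(heap, (f1 + f2, tick, merged))
            tick += 1
        return heap[0][2]

    words = "to be or not to be".split()
    codes = huffman_codes(Counter(words))
    encoded = ''.join(codes[w] for w in words)  # frequent words get short codes
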
Quote: the number of distinct words in a text is nearly linear in the size of the text; e.g., misspellings [»moffA3_1994]
Quote: for large document collections, better compression with Huffman-encoded words than with gzip; simple synchronization, about half as fast, stores lexicon in memory
Quote: store document id for every word in text using 6% of the size of text; store entire database in 1/3 size of original text [»zobeJ8_1995]
Quote: use swap-to-near-front for auxiliary terms in word-based text compression [»moffA3_1994]
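Swap-to-near-front keeps recently used terms cheap to code without fully reorganizing the list; as a stand-in, here is the classic move-to-front transform of which it is a variant (Moffat's actual rule places the accessed term near, not at, the front):

    def move_to_front(tokens):
        # Code each token as its current list position, then move it to the
        # front; recently used tokens therefore get small, cheap numbers.
        table, out = [], []
        for t in tokens:
            if t in table:
                pos = table.index(t)
                table.remove(t)
            else:
                pos = len(table)  # first occurrence: position past the end
            table.insert(0, t)
            out.append(pos)
        return out
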
Quote: spaceless words -- encode word if followed by space, else encode word and separator [»demoES8_1998]
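A minimal round-trip sketch of spaceless words, assuming the text starts with a word (the token representation is hypothetical):

    import re

    def spaceless_encode(text):
        # A word implies one following space, so only other separators
        # need explicit tokens.
        tokens = []
        for m in re.finditer(r'(\w+)(\W*)', text):
            word, sep = m.group(1), m.group(2)
            tokens.append(('W', word))
            if sep != ' ':
                tokens.append(('S', sep))
        return tokens

    def spaceless_decode(tokens):
        out = []
        for i, (kind, val) in enumerate(tokens):
            out.append(val)
            if kind == 'W' and (i + 1 == len(tokens) or tokens[i + 1][0] == 'W'):
                out.append(' ')  # the implicit single space after a word
        return ''.join(out)
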
Quote: can improve data compression by replacing each word with a signature formed by replacing non-essential characters with '*' [»franR5_1996]
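A sketch of the signature idea; the quote does not say which characters are non-essential, so this hypothetical rule blanks interior vowels (the actual scheme chooses signatures against a shared dictionary so the original words stay recoverable):

    def signature(word):
        # Hypothetical rule: keep first and last characters, replace interior
        # vowels with '*'; the runs of '*' make a later compression pass easier.
        if len(word) <= 2:
            return word
        mid = ''.join('*' if c in 'aeiou' else c for c in word[1:-1])
        return word[0] + mid + word[-1]

    # signature("compression") -> 'c*mpr*ss**n'
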
Quote: use antidictionaries, i.e., words that do not occur in the text, for efficient, linear-time compression of fixed data sources [»crocM7_1999]
Quote: compression by antidictionaries erases characters on compression and reconstructs them on decompression; e.g., humans can identify erased characters of English text [»crocM7_1999]
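A minimal sketch over binary strings: if a forbidden word u+b has u as a suffix of the text read so far, the next bit cannot be b, so it is forced and can be erased (naive scan; the published scheme compiles the antidictionary into an automaton):

    def forced_bit(prefix, antidict):
        # Antidictionary entries are words that never occur in the text.
        for w in antidict:
            u, b = w[:-1], w[-1]
            if prefix.endswith(u):
                return '0' if b == '1' else '1'
        return None

    def ad_compress(text, antidict):
        kept = []
        for i, bit in enumerate(text):
            if forced_bit(text[:i], antidict) is None:
                kept.append(bit)  # unpredictable bit: keep it
            # else: forced by context, erased
        return ''.join(kept), len(text)

    def ad_decompress(kept, n, antidict):
        out, j = [], 0
        for _ in range(n):
            f = forced_bit(''.join(out), antidict)
            if f is not None:
                out.append(f)  # reconstruct the erased bit
            else:
                out.append(kept[j])
                j += 1
        return ''.join(out)

    # ad_compress('0101', {'00', '11'}) -> ('0', 4): after the first bit,
    # every bit is forced to alternate, so three of the four bits are erased.
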
Quote: vdelta combines compressing and differencing via Tichy's block-move plus hashing; better than diff [»huntJJ4_1998]
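A greedy sketch of the copy/insert model behind such differencers: index source k-grams, copy long matches, and insert the rest as literals (illustrative; not Hunt's algorithm):

    def make_delta(src, tgt, k=4):
        # Index every k-gram of the source by position.
        index = {}
        for i in range(len(src) - k + 1):
            index.setdefault(src[i:i + k], []).append(i)
        ops, lit, i = [], [], 0
        while i < len(tgt):
            best_len, best_pos = 0, -1
            for p in index.get(tgt[i:i + k], []):
                n = 0
                while i + n < len(tgt) and p + n < len(src) and src[p + n] == tgt[i + n]:
                    n += 1
                if n > best_len:
                    best_len, best_pos = n, p
            if best_len >= k:
                if lit:
                    ops.append(('insert', ''.join(lit)))
                    lit = []
                ops.append(('copy', best_pos, best_len))
                i += best_len
            else:
                lit.append(tgt[i])
                i += 1
        if lit:
            ops.append(('insert', ''.join(lit)))
        return ops

    def apply_delta(src, ops):
        return ''.join(src[op[1]:op[1] + op[2]] if op[0] == 'copy' else op[1]
                       for op in ops)
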
Subtopic: compressed index or lexicon
Quote: don't use a stop list; compress common words by predicting the inter-word gap; 100 words are 76% of references and 44% of compressed size [»wittIH_1991]
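A sketch of why common words cost little without a stop list: store each posting list as gaps between document ids, with Elias gamma coding so the small gaps of frequent words take few bits (illustrative; not the paper's exact code):

    def gamma(n):
        # Elias gamma code for n >= 1: unary length prefix, then binary offset.
        bits = bin(n)[2:]
        return '0' * (len(bits) - 1) + bits

    def encode_postings(doc_ids):
        # Sorted document ids -> gaps -> gamma codes.
        out, prev = [], 0
        for d in doc_ids:
            out.append(gamma(d - prev))
            prev = d
        return ''.join(out)

    # encode_postings([3, 5, 6, 10]) codes the gaps 3, 2, 1, 4.
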
Quote: compress lexicon entry into 3 bytes; 8 characters on average, shared prefix, compressed suffix, encoded count, predicted entry size [»wittIH_1991]
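A round-trip sketch of the shared-prefix part of the entry (front coding over a sorted lexicon; the byte-level packing is omitted):

    def front_code(sorted_words):
        # Each entry: (length of prefix shared with predecessor, suffix).
        coded, prev = [], ''
        for w in sorted_words:
            p = 0
            while p < min(len(prev), len(w)) and prev[p] == w[p]:
                p += 1
            coded.append((p, w[p:]))
            prev = w
        return coded

    def front_decode(coded):
        words, prev = [], ''
        for p, suffix in coded:
            prev = prev[:p] + suffix
            words.append(prev)
        return words

    # front_code(['compress', 'compressed', 'compression'])
    # -> [(0, 'compress'), (8, 'ed'), (8, 'ion')]
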
Quote: index gigabyte text collections with compression and an external, multi-way mergesort; average of one byte per pointer; less than 4 hours [»moffA8_1995]
Quote: compress indices to 15% of uncompressed collection size; may be faster as well [»scholF8_2002]
Related Topics
Topic: archives (19 items)
Topic: compressed data (16 items)
Topic: data compression algorithms (53 items)
Topic: revision delta (18 items)
Topic: searching compressed data (9 items)
Topic: strings (13 items)
Topic: suffix trie and suffix array (20 items)
Topic: version control (34 items)