Th Topic: data compression algorithms

Topic: data compression algorithms

Topic:
compressed code
Topic:
compressed data
Topic:
revision delta
Topic:
searching compressed data
Topic:
text compression

Summary

Many algorithms have been developed for data compression. Data modeling identifies compressible data. The model is then used to transform the data into a more compact representation.
The data model may be predefined, trained over sample input, or dynamically adapt to the data. The algorithm needs to balance compression speed, decompression speed, and space requirements. Search and dynamic execution further constrain the algorithm.
The result may be bit-encoded (e.g., arithmetic coding), byte-encoded (e.g., Huffman coding), or word-encoded.
Examples include LZ algorithms, byte-pair compression, Burrows Wheeler transforms, anti-dictionary. (cbb 2/07)

Subtopic: measuring compression

Quote: measure data compression by determining the maximum compressibility of an infinite sequence by a finite-state encoder [»zivJ9_1978]

Subtopic: compression models

Quote: for data compression, separate statistical model (domain experts) from encoding (compression experts) [»frasCW5_1999]
Quote: develop statistical model for compression by proposing a large number of predictors and inferring a decision tree; e.g., opcode vs. literal data [»frasCW5_1999]
Quote: compression of protein sequences by their biochemical properties; better than ppm or gzip [»neviCG3_1999]

Subtopic: compression alphabet

Quote: for data compression alphabet encode sequences of characters instead of full alphabet [»balkB10_2000]

Subtopic: trained model

Quote: first use of training for lossless compression of tables; 100:1 compression; determine dependency tree of columns; group columns that compress well [»buchAL1_2000]
Quote: use a preliminary scan to improve text compression for small files, less than 10Kb; keep model in memory [»zobeJ8_1995]
Quote: decode at any offset with a compressed-phrase dictionary; e.g., for databases; use bitwise Elias codes and Huffman coding [»cannA3_2001]
Quote: XRAY compression for large text/nontext files, random access, and new data; efficient; training phase, testing phase to adjust the phrase model, and coding phase [»cannA7_2002]
Quote: a compressed-phrase dictionary can average 1.47 bpc; better than gzip and compress; compression 10x slower [»cannA3_2001]

Subtopic: adaptive models

Quote: practical text compression forms adaptive models to help predict which characters will come next; survey [»bellT12_1989]
Quote: described adaptive data compression algorithms that handle context changes; one pass algorithm, constant space [»fialER4_1989]
Quote: for data compression, use partial string matching to create a high-order Markov model; can code English text in 2.2 bits/char [»cleaJG4_1984]

Subtopic: Huffman code

Quote: Huffman's algorithm for data compression; a minimum-redundancy encoding of a message [»huffDA9_1952]
Quote: use Huffman code to compress memory swap pages with many zeros; ten times faster than gzip or LZRW1 [»rizzL4_1997]
Quote: use Huffman codes to compress text for CDROM and limit access; encoding appears random [»kleiST7_1989]
Quote: use decoding tables to process Huffman codes; use CDROM's robust, physical encoding to ignore their sensitivity to errors [»kleiST7_1989]
Quote: use LZH-Light for low memory usage and small packets; combines LZ77 with Huffman encoding [»ignaS10_1998]
Quote: improvements to Lempel-Ziv data compression; takes advantage of multiple matches, short matches, and blocked I/O; smaller and faster [»jenaSK3_1999]
Quote: a variable-byte compression scheme is twice as fast as bitwise compression for query evaluation of web documents
Quote: instead of Huffman codes, use pre-defined, variable length codes or universal codes for Burrows-Wheeler compression [»fenwP11_2002]

Subtopic: arithmetic coding

Quote: arithmetic coding better than Huffman codes for adaptive compression over large alphabets [»moffA3_1995]
Quote: use arithmetic coding when frequency statistics don't match log-2 relationships of Huffman coding [»cleaJG4_1984]
Quote: arithmetic coding represents a message by an interval of real numbers; allows fractional bits for a symbol [»wittIH6_1987]
Quote: dynamic markov compression is very effective on text but requires lots of computation and memory [»cormGV12_1987]
Quote: algorithm for arithmetic coding; uses words instead of characters; statistics module for symbol frequencies [»moffA3_1995]
Quote: compress text by arithmetic coding of sentences using lexicon; 20-word sentence into 220 bits with 3% overhead [»wittIH_1991]
Quote: comparison of compression techniques with the Calgary corpus: arithmetic coders best, then gzip's variation of LZ77 [»wittIH_1994]
Quote: improvement to Fenwick's algorithm for cumulative symbol frequency counts; use for adaptive, arithmetic coding [»moffA7_1999]

Subtopic: LZ compression

Quote: LZW hardware compression operates at under 3 cycles/symbol and performs well for 10,000 character messages [»welcTA6_1984]
Quote: LZW compression uses a translation table, 12-bit codes, and greedy parsing to keep finding the longest recognized string [»welcTA6_1984]
Quote: the LZ algorithm approaches the performance of offline, fixed code book, data compression algorithms [»zivJ5_1977]
Quote: the LZ data compression algorithm encodes future segments by maximum length copying [»zivJ5_1977]
Quote: the LZ data compression algorithm is susceptible to error propagation
Quote: the LZ algorithm creates a new phrase as soon as a prefix of the unparsed phrase differs from all preceding phrases [»zivJ9_1978]
Quote: compared longest match search for Ziv-Lempel compression; lists for each character, best overall [»bellT7_1993]
Quote: for Ziv-Lempel compression of text, found window size of 8192 gave 4-bits/char with knee at 1000 [»bellT7_1993]
Quote: Google compresses repository uses zlib; faster than bzip [»brinS4_1998]
Quote: use LZH-Light for low memory usage and small packets; combines LZ77 with Huffman encoding [»ignaS10_1998]

Subtopic: byte pair

Quote: compress blocks of data by repeatedly replacing common byte pairs with an unused byte; comparable to LZW compression with much lower space requirements [»gageP2_1994]

Subtopic: shortest edit sequence

Quote: compress multiple versions of programs and data with deltas, i.e., an edit sequence from one to the other [»tichWF11_1984]
Quote: algorithms for the shortest edit sequence for converting between two strings [»tichWF11_1984]
Quote: shortest edit sequence by removing longest prefixes [»tichWF11_1984]

Subtopic: anti-dictionary

Quote: use antidictionaries for efficient, linear time compression of fixed data sources; i.e., words that are not in the text [»crocM7_1999]
Quote: compression by antidictionaries erases characters on compression and reconstructs on decompression; e.g., humans can identified erased characters of english text [»crocM7_1999]
Quote: DCA compression and antidictionaries allow search and parallel algorithms because their encodings do not depend on context except for the block's prefix [»crocM7_1999]

Subtopic: Burrows Wheeler Transform, block-sorting

Quote: best file compressors were block-sorting (bzip) and Burrows-Wheeler Transform (BWT) [»fenwP4_1998]
Quote: Burrows Wheeler Transform reversibly reorders data; models data by the string following a character [»nelsMR9_1996]
Quote: use Burrows Wheeler Transform for compression; leaves clumps of repeating characters; as good as PKZIP
Quote: data compression in O(kn) time and O(n) space by Burrows-Wheeler Transformation; groups same right context together; fast and efficient [»balkB10_2000]
Quote: instead of Huffman codes, use pre-defined, variable length codes or universal codes for Burrows-Wheeler compression [»fenwP11_2002]
Quote: use sticky, move-to-front for Burrows-Wheeler compression
Quote: modifications to Burrows-Wheeler compression; better compression ratio than gzip but much worse decompression due to arithmetic decoding [»deorS11_2000]

Related Topics

Topic: compressed code (17 items)
Topic: compressed data (16 items)
Topic: revision delta (18 items)
Topic: searching compressed data (9 items)
Topic: text compression
(17 items)