Quote: comparison of compression techniques with the Calgary corpus: arithmetic coders best, then gzip's variation of LZ77

QuoteRef: wittIH_1994, p. 64

Topic: data compression algorithms
Topic: text compression

Quotation Skeleton

[The Calgary corpus] contains several files of English text, as … [executable] code. [File sizes range from 20K to 750K] … [p. 68] "Table 2.6 Speed (Mbyte/second) for encoding and … [ppm, adaptive arithmetic coder by Cleary, Moffat, Witten with 16 Mbytes memory, 6 encode, 5 decode, 2.34 bpc; huffword, static Huffman encoded words by Moffat, Eddy, Zobel with >100K memory, 10 encode, 55 decode, 3.56 bpc; gzip-fast LZ77 encoder by Adler, Wales, Gailly with >10K memory, 25 encode, 120 decode, 2.81 bpc; file copy at 500 Mbyte/second] … [p. 353] For … large files the cost of storing [huffword's] lexicon is only a small fraction of the total output … [2 Gbyte] TREC collection, huffword [compresses better than gzip. Size of compressed file relative to uncompressed file: gzip-fast 35.5%, gzip-best 33.7%, huffword 29.4%, dmc 27.5%, ppm 23.4%.] … [p. 59] [Furthermore, huffword's compression ratio does not degrade substantially if the text is broken into documents or pages.] … [p. 357] [If singleton words are stored on disk, the TREC lexicon requires 5 Mbytes of memory.]
Copyright clearance needed for quotation.
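
The skeleton above reports compression in two units: bits per character (bpc) for the Calgary corpus and compressed size as a percentage of the original for TREC. The sketch below shows how the two relate, assuming 8-bit characters, and how a fast and a best setting of an LZ77-style coder could be measured. It uses Python's zlib module (DEFLATE, the algorithm behind gzip) at levels 1 and 9 as stand-ins for gzip-fast and gzip-best. The function names and the sample text are hypothetical, and the output is not expected to reproduce the quoted figures.

```python
import zlib


def bits_per_char(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits per input byte (8.0 means no saving)."""
    return 8 * len(compressed) / len(original)


def relative_size(original: bytes, compressed: bytes) -> float:
    """Compressed size as a percentage of the original size."""
    return 100 * len(compressed) / len(original)


def bpc_to_percent(bpc: float) -> float:
    """Convert bits per character to percent of original, assuming 8-bit characters."""
    return 100 * bpc / 8


# Hypothetical sample text; a real test would use corpus files instead.
sample = b"the quick brown fox jumps over the lazy dog " * 2000

# zlib levels 1 and 9 stand in for the gzip-fast and gzip-best settings.
fast = zlib.compress(sample, 1)
best = zlib.compress(sample, 9)

for name, data in (("level 1", fast), ("level 9", best)):
    print(f"{name}: {bits_per_char(sample, data):.3f} bpc, "
          f"{relative_size(sample, data):.2f}% of original")

# Unit conversion only: the ppm figure of 2.34 bpc expressed as a percentage.
print(f"2.34 bpc = {bpc_to_percent(2.34):.2f}% of original")
```

Because the bpc and percentage figures in the quotation come from different collections, the conversion only relates the units; it does not make the Calgary and TREC results directly comparable.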

Additional Titles

Quote: for large document collections, better compression with Huffman encoded words than with gzip; simple synchronization, about half as fast, stores lexicon in memory

Related Topics

Topic: data compression algorithms (53 items)
Topic: text compression (16 items)