[The Calgary corpus] contains several files of English text, as … [executable] code. [File sizes range from 20K to 750K] … [p. 68] "Table 2.6 Speed (Mbyte/second) for encoding and … [ppm, adaptive arithmetic coder by Cleary, Moffat, Witten with 16 Mbytes memory, 6 encode, 5 decode, 2.34 bpc; huffword, static Huffman encoded words by Moffat, Eddy, Zobel with >100K memory, 10 encode, 55 decode, 3.56 bpc; gzip-fast LZ77 encoder by Adler, Wales, Gailly with >10K memory, 25 encode, 120 decode, 2.81 bpc; file copy at 500 Mbyte/second] … [p. 353] For … large files the cost of storing [huffword's] lexicon is only a small fraction of the total output … [2 Gbyte] TREC collection, huffword [compresses better than gzip. Size of compressed file relative to uncompressed file: gzip-fast 35.5%, gzip-best 33.7%, huffword 29.4%, dmc 27.5%, ppm 23.4%.] … [p. 59] [Furthermore, huffword's compression ratio does not degrade substantially if the text is broken into documents or pages.] … [p. 357] [If singleton words are stored on disk, the TREC lexicon requires 5 Mbytes of memory.]
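The quotation reports compression in two different units: bits per character (bpc) for the Calgary corpus and percent of original size for TREC. The sketch below converts one to the other so the figures can be compared. It assumes 8-bit input characters, which the quotation does not state; the function name `bpc_to_percent` is made up for this illustration.

```python
def bpc_to_percent(bpc: float, bits_per_char: int = 8) -> float:
    """Compressed size as a percentage of the original size.

    Assumes every input character occupies `bits_per_char` bits
    (8 by default; this is an assumption, not stated in the source).
    """
    return 100.0 * bpc / bits_per_char


# Calgary corpus figures quoted from Table 2.6, in bits per character.
calgary_bpc = {"ppm": 2.34, "gzip-fast": 2.81, "huffword": 3.56}

for name, bpc in calgary_bpc.items():
    print(f"{name}: {bpc:.2f} bpc = {bpc_to_percent(bpc):.1f}% of original size")
```

Under this assumption, huffword trails gzip-fast on the small Calgary files (3.56 bpc against 2.81 bpc) but leads it on TREC (29.4% against 35.5%), which fits the quoted remark that the lexicon cost is only a small fraction of the output for large files.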
Copyright clearance needed for quotation.