[The Calgary corpus] contains several files of English text, as … [executable] code. [File sizes range from 20K to 750K.] … [p. 68] "Table 2.6 Speed (Mbyte/second) for encoding and …" [ppm, adaptive arithmetic coder by Cleary, Moffat, Witten with 16 Mbytes memory, 6 encode, 5 decode, 2.34 bpc; huffword, static Huffman encoded words by Moffat, Eddy, Zobel with >100K memory, 10 encode, 55 decode, 3.56 bpc; gzip-fast, LZ77 encoder by Adler, Wales, Gailly with >10K memory, 25 encode, 120 decode, 2.81 bpc; file copy at 500 Mbyte/second.] … [p. 353] For … large files the cost of storing [huffword's] lexicon is only a small fraction of the total output … [2 Gbyte] TREC collection, huffword [compresses better than gzip. Size of compressed file relative to uncompressed file: gzip-fast 35.5%, gzip-best 33.7%, huffword 29.4%, dmc 27.5%, ppm 23.4%.] … [p. 59] [Furthermore, huffword's compression ratio does not degrade substantially if the text is broken into documents or pages.] … [p. 357] [If singleton words are stored on disk, the TREC lexicon requires 5 Mbytes of memory.]
   
   Copyright clearance needed for quotation.
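
[Annotator's sketch, not from the quoted book.] The quoted remark that the lexicon is only a small fraction of the output for large files can be illustrated with a toy word-based static Huffman model. This is a minimal sketch under stated assumptions, not the huffword program itself: it uses a single Huffman code over alternating word and non-word tokens, and it charges the lexicon at a crude one byte per UTF-8 byte plus one terminator byte per entry. The function names (`tokenize`, `huffman_code_lengths`, `word_model_bits_per_char`) are hypothetical.

```python
import heapq
import re
from collections import Counter


def tokenize(text):
    """Split text into alternating runs of word and non-word characters."""
    return re.findall(r"\w+|\W+", text)


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a static Huffman code."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, tie_breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def word_model_bits_per_char(text):
    """Return (bpc without lexicon, bpc with lexicon) for a toy word model.

    Assumption: each lexicon entry costs its UTF-8 bytes plus one
    terminator byte; a real coder would store the lexicon more compactly.
    """
    if not text:
        raise ValueError("text must be non-empty")
    freqs = Counter(tokenize(text))
    lengths = huffman_code_lengths(freqs)
    body_bits = sum(freqs[t] * lengths[t] for t in freqs)
    lexicon_bits = 8 * sum(len(t.encode("utf-8")) + 1 for t in freqs)
    n_chars = len(text)
    return body_bits / n_chars, (body_bits + lexicon_bits) / n_chars
```

Calling `word_model_bits_per_char` on a short text and on a much longer text drawn from the same vocabulary shows the lexicon's share of the total shrinking as the input grows, which is the effect the quotation describes. The absolute figures from this sketch are not comparable to the bpc values quoted above.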