Topic: hash table and hash functions

topics > computer science > information > Group: information retrieval

data structures

adaptive hash table
associative memory
data caching
database implementation
database keys and indexing
hash filter
one-way hash function
perfect hash table
search algorithms
signature files
symbol table


A hash function generates a random set of n-bits from a larger number. In a good hash function, all bits of the original number can effect all bits of the result. A perfect hash function generates unique values for a given set of input values. A one-way hash function generates a value which is cryptographically secure, i.e., an adversary can not locate an input value that hashes to a given value. Jenkins has a theory of good hash functions.

A hash function may be used for many purposes. The primary use is locating a list of possible values in a hash table. With a good hash function, each list of possible values is short. The list may be stored implicitly or explicitly. Since most entries are accessed infrequently, a move-to-front heuristic allows long hash chains.

Other uses include information retrieval signatures, authentication, trie hash, and duplicate detection. (cbb 4/98)

Subtopic: selection of hash functions up

Quote: universal classes of hash functions with rapid evaluation; allows random choice of a hash function [»cartJL_1979]
Quote: tested several hash functions: division by primes, shifting, folding, and mid-square [»gremLL9_1982]
Quote: for hashing, use division by any number without small prime factors; avoids periodicities in the source, e.g., even numbers [»lumVY4_1971]
Quote: a hash function should avoid funneling, i.e., the map of a set of input bits to a smaller set of output bits [»jenkB9_1997]
Quote: the fastest code for a hash function to mix three 32-bit values reversibly
Quote: a hash function should be reversible, i.e., a permutation; otherwise it loses information about earlier blocks [»jenkB9_1997]
Quote: a good mixing function for natural language: a transposition followed by a sequence of alternating substitutions and linear operations
Quote: Jenkins' LOOKUP2 is a high quality, fast hash function; e.g., used in Spin [»dillPC4_2004]
Quote: hash function by relatively prime, radix transform; separates data, e.g., high order bits mapped to many digits [»linAD5_1963]

Subtopic: double and triple hashing up

Quote: for open-addressing, triple hashing modifies the increment as well as the hash value; much higher accuracy for Bloom filters [»dillPC4_2004]
Quote: triple hashing with LOOKUP2 generating 96 bits of data; as fast as double hashing with 1/13th as many omissions [»dillPC4_2004]
Quote: if hash into the shorter of two buckets, the fraction of buckets with k keys is about (n/km)^2 [»mitzM5_2002]
Quote: if hash into two groups of buckets, always break ties into the same group; greatly reduces long chains [»mitzM5_2002]

Subtopic: string hash functions up

Quote: hash text strings by xor and a table of randomish bytes for each character [»pearPK6_1990]
Quote: an initial hash value of 0 is better than using the length of the string [»mckeBJ2_1990]
Quote: for hashing, use double shifted sum modulo an odd number; or select number by testing
Quote: a shifted sum works well for hashing but the modulus needs careful selection; primes aren't best [»mckeBJ2_1990]
Quote: combine string characters for hashing by rotating add, prime multiplication, or summing groups of 4 [»ahoAV_1986]
Quote: use a recursive hash function to compute hash values for consecutive n-grams; up to 7 times faster than nonrecursive integer division; e.g. for indexing all character sequences in a document [»coheJD7_1997]

Subtopic: binary hash functions up

Quote: Pearson's hash function uses an intermediate table; very uniform [»reicC6_1991]
Quote: Rabin and Karp's hash function can be computed incrementally; more uniform than Pearson, but slower [»reicC6_1991]
Quote: use incremental hash function for efficient deltas of binary files; faster than SCCS [»reicC6_1991]

Subtopic: hash buckets up

Quote: map records into hash buckets with overflow into trailer buckets; independent of key length and source language [»linAD5_1963]

Subtopic: hash chaining up

Quote: move-to-front works well for hash tables; better than extendible hashing; effective for vocabulary accumulation at 80 strings per slot [»zobeJ12_2001]
Quote: chained hash tables are best for building vocabularies; considerably faster than splay trees and binary trees [»zobeJ12_2001]
Quote: compact chaining stores strings in the hash node; faster and less space [»zobeJ11_2005]
Quote: use a contiguous array for all strings in a hash chain; good spatial locality; reduces overhead for skew distributions to 2 bits per string [»zobeJ11_2005]
Quote: use 7- or 15-bit lengths for hashed strings; string match fails on first character [»zobeJ11_2005]
Quote: for words, exact-fit array hashing most efficient with load average of 20; for URLs, best load average was less than 1 [»zobeJ11_2005]

Subtopic: hash table sizing up

Quote: with twice as many hash buckets as keys, about 1.6 out of every 10,000 buckets will have five or more keys [»mitzM5_2002]

Subtopic: hashing execution up

Quote: oblivious hashing is a hash function over assignments and control flows; verify software protection by hashing security-sensitive code paths [»chenY10_2002]

Subtopic: hashing not fast up

Quote: computing hash values is the most expensive operation in a model checker [»dillPC4_2004]
Quote: ternary search trees better than binary trees, tries, and hashing; dynamic, fast operations including unsuccessful search and sort [»bentJ4_1998]

Subtopic: associative array up

Quote: script programmers used associative arrays while non-script programmers used arrays and 10-ary trees
Quote: dictionary by triples ; the functional form 'fetch' can retrieve value given a name [»backJ8_1978a]
QuoteRef: feldJA8_1969 ;;440 associative memory: accessing by specifying part of the contents

Subtopic: other uses of hashing up

Quote: a compressed signature file is like a hash index; 10-30% space overhead, fast retrieval and insertion, suitable for write-only [»faloC9_1988]
Quote: criss-cross hash joins interleaves zoned partitions configured through page maps; low memory and I/O costs; even better if ordered [»gopaRD7_2001]
Quote: use n-gram-hash-and-skip for matching keywords in a large dictionary of 10,000-100,000 entries [»coheJD12_1998]
Quote: trie hash by sort on leading digits with split between buckets by a prefix; better than B-tree, can 100% load for ordered insertion [»litwWA7_1991]
Quote: delete duplicates by multiple passes using hashing; linear average time; constant extra space

Related Topics up

Group: data structures   (12 topics, 278 quotes)

Topic: adaptive hash table (19 items)
Topic: arrays (58 items)
Topic: associative memory (5 items)
Topic: data caching (35 items)
Topic: database implementation (18 items)
Topic: database keys and indexing (18 items)
Topic: hash filter (18 items)
Topic: one-way hash function (24 items)
Topic: perfect hash table (9 items)
Topic: search algorithms (40 items)
Topic: signature files (21 items)
Topic: symbol table
(4 items)

Updated barberCB 2/06
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.