fuchsia / third_party / github.com / google / farmhash / refs/heads/main / . / Understanding_Hash_Functions

UNDERSTANDING HASH FUNCTIONS | |

by Geoff Pike | |

Version 0.2 --- early draft --- comments and questions welcome! | |

References appear in square brackets. | |

1 INTRODUCTION | |

Hashing has proven tremendously useful in constructing various fast | |

data structures and algorithms. It is typically possible to simplify | |

the analysis of hash-based algorithms if one assumes that the relevant | |

hash functions are high quality. At the other extreme, if the | |

relevant hash functions were always to return the same value, many | |

hash-based algorithms become algorithms that are slower, simpler, but still well-known. | |

For example, a chaining hash table devolves into a linked list. | |

There are many possible definitions of hash function quality. For | |

example, one might want a list of keys and their hashes to provide no | |

pattern that would allow an opponent to predict anything about the | |

hashes of other keys. Although I cannot prove it, I think I can meet | |

this and many other definitions of quality with | |

f(s) = SHA-3(concatenation of z and s), | |

where z is some secret string known only to me. This well-known trick | |

provides, I think, more high-quality hash functions than anyone will | |

need, though greater computational power in the future may push us to | |

replace SHA-3 from time to time. | |

In short, discussions about choosing a hash function are almost always | |

discussions about speed, energy consumption, or similar. Concerns | |

about hash quality are easy to fix, for a price. | |

2 ANATOMY OF A HASH FUNCTION | |

Hash functions that input strings of arbitrary length are written in | |

terms of an internal state, S. In many cases the internal state is a | |

fixed number of bits and will fit in machine registers. One generic | |

sketch of a string hash is: | |

let S = some initial value | |

let c = the length of S in bits | |

while (input is not exhausted) { | |

let t = the next c bits of input (padded with zeroes if less than c remain) | |

S = M(S xor t) | |

} | |

let n = the number of bytes hashed | |

return F(S, n) | |

where M is a hash function that inputs and outputs c bits, and F is a | |

hash function that inputs c bits (plus, say, 64 for its second argument) | |

and outputs however many bits one needs to return. In some sense we have | |

reduced the string-hashing problem to two integer hashing problems. | |

2.1 INTEGER HASHING TECHNIQUES | |

A hash function that inputs and outputs the same number of bits, say, | |

32, can use reversible bit-twiddling operations, each of which is | |

"onto" in the mathematical sense. For example, multiplication by an | |

odd constant is reversible, as all odd numbers are relatively prime to | |

2^32. Other commonly used reversible operations include: | |

o Adding or xoring a constant | |

o Bitwise rotation or other bitwise permutations | |

o bit j = (bit j) xor (bit k) for unequal constants j and k | |

o "Shift mix": S = S xor (S >> k), where k is, say, 17 | |

o Replacing a fixed-length bit string with its cyclic redundancy | |

checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike] | |

Each of the above is a "bad" hash function that inputs and outputs | |

the same number of bits. One can simply compose two or more of those | |

bad hash functions to construct a higher-quality hash function. | |

One common quality goal for integer hashing (and string hashing) is | |

that flipping the 19th bit, or any other small change, applied to | |

multiple input keys, causes a seemingly unpredictable difference each | |

time. Similarly, any change to an input should lead to a seemingly | |

unpredictable selection of the output bits to flip. | |

Therefore, if we want a high-quality hash function that inputs c bits | |

and outputs fewer than c bits, we can simply truncate the output of a | |

high-quality hash function that inputs and outputs c bits. | |

To give a concrete example, here is Bob Jenkins' mix(), published in | |

1996 [Jenkins]. Its input is 96 bits in three 32-bit variables, and its output | |

is 96 bits. However, one may use a subset of the output bits, as every | |

output bit is affected by every non-empty subset of the input bits. | |

Input: a, b, and c | |

Algorithm: | |

a -= b; a -= c; a ^= (c>>13); | |

b -= c; b -= a; b ^= (a<<8); | |

c -= a; c -= b; c ^= (b>>13); | |

a -= b; a -= c; a ^= (c>>12); | |

b -= c; b -= a; b ^= (a<<16); | |

c -= a; c -= b; c ^= (b>>5); | |

a -= b; a -= c; a ^= (c>>3); | |

b -= c; b -= a; b ^= (a<<10); | |

c -= a; c -= b; c ^= (b>>15); | |

Output: a, b, and c | |

2.2 VARIATIONS ON STRING HASHING | |

There are three variations on our initial sketch worth noting. | |

First, for speed, one can special-case short inputs, as the CityHash | |

and FarmHash algorithms do. The number of special cases can be | |

reduced by using loads that may overlap: for example, a hash of a 9- | |

to 16-byte string can be implemented by a hash that inputs two 8-byte | |

values (the first 8 and last 8 bytes of the input string) and the string | |

length [CityHash, FarmHash]. | |

Second, one may choose different means of incorporating input bits | |

into the internal state. One example: the mixing of S and input bits | |

may be interleaved with the mixing of parts of S and other parts of S. | |

Another example: the input bits processed in a loop iteration might be | |

xor'ed into multiple places in S, rather than just one, or might be | |

hashed with each other before touching S [Murmur]. The advantages and | |

disadvantages of these are unclear. | |

Third, one may repeatedly "squeeze information" from S, by remixing it with | |

itself and then revealing a subset of S. This is convenient when one would | |

like a family of hash functions with different output lengths. A special | |

case of the idea, called the "sponge construction," has been well studied and | |

adopted by the authors of Keccak and others [SHA-3]. | |

3 HASH FUNCTIONS FOR HASH TABLES | |

It isn't hard to find real-life examples where hash tables or the hash | |

functions for them take more than 5% of a program's CPU time. | |

Improvements to hash tables and their hash functions are therefore a | |

classic example of software performance tuning. Unfortunately, the | |

best choice may be platform-dependent, so to avoid writing your own | |

collection of #ifdefs, please consider selecting something like the | |

FarmHash family of hash functions, that supply decent | |

platform-dependent logic for you. | |

To tune a program, often one will replace an existing hash function with a | |

faster, lower-quality hash function, despite the increased chance of unlucky | |

or pathological performance problems. Clever algorithms can mitigate this | |

risk. For example, hash tables can start with one hash function and then | |

switch to another if things seem to be going poorly. Therefore, one should | |

rarely plan to spend much CPU time on a secure hash function (such as SHA-3) | |

or a near-universal hash function (such as VHASH) when seeking the best | |

possible performance from a hash table. Against that, those types of hash | |

functions can limit the risk of pathological performance problems when one is | |

designing around typical hash-based algorithms that stick with a single hash | |

function no matter how it behaves on the data at hand. | |

4 | |

REFERENCES | |

[Murmur] Appleby, Austin. https://code.google.com/p/smhasher, | |

https://sites.google.com/site/murmurhash/ | |

[SMHasher] Appleby, Austin. https://code.google.com/p/smhasher | |

[SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/ | |

[Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html | |

[VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In | |

Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006. | |

[CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash | |

[FarmHash] Pike, Geoff. https://code.google.com/p/farmhash | |

[Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf |