A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but when you are eg dealing with UUIDs it is satisfied.
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like sha256 or crc take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, ie only looking at the first few bytes is very cheap. Determining the length of a file is even cheaper.
Now I'm no expert in that matter, but the fs deduplicators I've seen were block, not file, based. Those can clearly not use the file length as they are blissfully unaware of files (or any structure for that matter). Those use a rather expensive hash function (you really want to avoid hash collisions), but (at least some ten years ago) memory, not processing speed, was the limiting factor.
Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
> I think you are thinking of block level online deduplicators that are integrated into the file system?
> Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
Most modern CoW filesystems also allow you to mark two files as duplicates, but without sharing subsequent mutations between them. Rmlint supports that, too.
Btw, I'm working on adding deduplication to Bcachefs, and because it's extent-based and not blockbased, the logic will look a lot more like rmlint than what you described.
While I generally like to reinvent the wheel, for hash functions I strongly recommend to use a proved good one. Djb2 by the venerable Daniel Bernstein satisfies all the requirements of TFA.
h = 5381
while still has data:
h = h * 33 + next_byte()
return h
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.
Djb2 is hardly a proven good hash :) It's really easy to find collisions for it, and it's not seeded, so you're kind of screwed regardless. It's the odd middle ground between "safely usable in practice" and "fast in practice", which turns out to be "neither safe nor fast" in this case.
Hash functions and PRNGs are closely related, they share many properties and they can be built from the same algorithmic components, so for many kinds of PRNGs there are corresponding kinds of hash functions and vice-versa.
Nevertheless, the purposes of hash functions and PRNGs are different and complementary.
A PRNG receives a short fixed-length value (the seed) and it expands it into a long pseudo-random sequence of arbitrary length.
A hash function receives a long input sequence of arbitrary length and it generates a short fixed-length pseudo-random value.
Good PRNGs are injective functions and good hash functions are surjective functions.
Normally the design methods for PRNGs and for hash functions should be presented together, because it is easy to interconvert algorithms for one of them with algorithms for the other. For instance, given a good hash function one could make a PRNG by computing the hashes of a sequence of numbers or the hashes of a sequence of strings of increasing length, and given a good PRNG one could make a hash function by accumulating somehow the input into a PRNG seed and taking the first generated number, or better by using input chunks as seeds and then accumulating the first generated numbers into a single value.
However for a successful conversion between PRNG and hash function algorithms, the source algorithm may have have to be overdesigned, to guarantee good enough properties even after the conversion.
When an algorithm is designed directly as a hash function or as a PRNG, with clearly specified requirements, it can be designed only as good as strictly necessary, enabling thus a better performance.
Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)
If your input is i.i.d. random, then truncating works great. Eg if your keys are UUIDs then truncating can work well.
Another use:
Suppose you write a tool like rmlint that is looking for duplicate files. Generally, you compute some hash for each file, see if you got any duplicates, and then compare the relevant files directly.
A traditional hash like crc or sha256 takes O(n) to compute. But for files you can start with some cheaper hashes, like file length. After all, files of different length can't have the same content. Taking the first few bytes of your file is another cheap 'hash' you can compute.
Only when these cheap 'hashes' show that you have a potential duplicate, do you go and pay for a more expensive hash.
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
I work on the machine code level, so the only characteristic I'm interested in is how many ticks it takes to compute the result, not how many transistors it requires or anything like that. All modern CPUs take 1 tick to compute both XOR, addition, and many other simple arithmetic operations, so even though addition is technically more complicated in CPU designs, it never surfaces in software. In the context of this post, I preferred addition instead of XOR to reduce cancel-out and propagate entropy between bits.
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?
Which bits are the best is completely dependent on the particular hash function.
For instance, if the last operation during computing a hash function was the kind of integer multiplication where the result has the same length as the operands, then the top bits are the best (because they depend on more of the input bits than the bottom result bits).
On the other hand, if the last operation during computing a hash function was the kind of integer multiplication where the result has a length that is the sum of the lengths of the operands (i.e. where the high bits of the result are not discarded), then the best bits of the result are neither the top bits nor the bottom bits, but the middle bits, for the same reason as before, they are the bits that depend on more of the input bits than either the bottom bits or the top bits of the result. (This fact becomes obvious if you do a long multiplication with pen on paper. After you compute the partial products and you start to add them, you can see that when computing either the top digits or the bottom digits you need to add much less digits from the partial products than when you compute the middle digits, so the middle digits are more thoroughly mixed as functions of the input digits.)
When the last operation is an addition, because the carry bits propagate from the bottom bits towards the top bits, the top bits are the best, because the higher a bit is in the result there are more input bits on which it depends.
I think the reason real-world implementations don't do this is to speed up access when the key is a small integer. Say, if your IDs are spread uniformly between 1 and 1000, taking the bottom 7 bits is a great hash, while the top 7 bits would just be zeros. So it's optimizing for a trivial hash rather than a general-purpose fast hash.
And since most languages require each data type to provide its own hash function, you kind of have to assume that the hash is half-assed and bottom bits are better. I think only Rust could make decisions differently here, since it's parametric over hashers, but I haven't seen that done.
I think older processors used to have a slower implementation for shifts, which made this slower.
Nowadays swisstable and other similar hashtables use the top bits and simd/swar techniques to quickly filter out collisions after determining the starting bucket.
Sorry, I've read and reread TFA but the concept still evades me. Is it that, since it's easier for a hash function to have higher entropy in the higher bits than in the lower ones, it would be more logical for hash tables to discard the lower bits and keep the higher ones?
Higher entropy in the higher bits is a property of addition and multiplication when the result has the same size as the operands (i.e. the result is taken modulo 2^N, which folds back the values exceeding the size of the result).
When other operations are used, there may be other bits with higher entropy. For example, when full-length multiplication is used (i.e. where the length of the result is the sum of the lengths of the operands), the middle bits have the highest entropy, not the top bits. On CPUs like the Intel/AMD x86-64 CPUs, where fast long multiplication instructions are available, this can be exploited in more performant hash functions and PRNGs.
In hash functions, additions and multiplications are frequently used together with rotations, in order to redistribute the entropy from the top bits to the bottom bits, during the following operations.
Honorary mention: byte swapping instructions (originally added to CPUs for endianness conversion) can also be used to redistribute entropy, but they're slightly slower than rotations on Intel, which is why I think they aren't utilized much.
Yes, that's my point. It's not true that all hash functions have this characteristic, but most fast ones do. (And if you're using a slow-and-high-quality hash function, the distinction doesn't matter, so might as well use top bits.)
Yes. But that's a known misfeature of C and no other language does it like that. Plus I kind of meant arbitrary byte strings where you can have embedded zeroes and thus have to know the length.
You'd probably want 'return len(str)', if this is Python?
In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Any `return c` for some constant is a valid and correct hash function. It just has a lot of collisions and degenerates hash-maps to terrible performance. That was in fact my first thought when I read "simplest hash functions".
Or pad all entries with 0s to an arbitrarily long size. The 0s can be assumed, and not actually stored. Therefore arbitrarily long entries need not be shortened.
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.
It's very very unlikely to get collisions there, but still not impossible. Whenever you map data of arbitrary length (infinite possibilities) to a limited length collisions are possible.
with a password you may be mapping into a smaller space or a bigger space, because what you want is to get them all same length, but yeah you may in some cases be mapping into a smaller space, hadn't thought of that, although I sort of also think it is unlikely.
But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.
For passwords: the input _space_ is bigger. That doesn't say anything about the length of any particular password.
> But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.
For passowrds, you are not worried so much about two users accidentally getting the same hash, you are worried about people finding a pre-image that hashes to the same output as your user's password.
Let's consider a hash table with an allocation of 1MB, which is about 2^20 bytes. Assume also that each entry occupies a byte. Assuming that the hash function's values are distributed randomly, the probability of there being a collision with only 1000 entries is approximately 38% = 1-(2^20)!/(2^20 - 1000)!/(2^20)^1000. See the "Birthday Problem".
padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.
A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
You are right!
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but when you are eg dealing with UUIDs it is satisfied.
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like sha256 or crc take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, ie only looking at the first few bytes is very cheap. Determining the length of a file is even cheaper.
Now I'm no expert in that matter, but the fs deduplicators I've seen were block, not file, based. Those can clearly not use the file length as they are blissfully unaware of files (or any structure for that matter). Those use a rather expensive hash function (you really want to avoid hash collisions), but (at least some ten years ago) memory, not processing speed, was the limiting factor.
https://github.com/sahib/rmlint is the one I had in mind.
> Those use a rather expensive hash function (you really want to avoid hash collisions), [...]
Then we are clearly not thinking of the same kind of software.
> but (at least some ten years ago) memory, not processing speed, was the limiting factor.
In what I described, IO is the limiting factor. You want to avoid having to read the whole file, if you can.
I think you are thinking of block level online deduplicators that are integrated into the file system?
> https://github.com/sahib/rmlint is the one I had in mind.
Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
> I think you are thinking of block level online deduplicators that are integrated into the file system?
Indeed, I was.
> Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
Most modern CoW filesystems also allow you to mark two files as duplicates, but without sharing subsequent mutations between them. Rmlint supports that, too.
Btw, I'm working on adding deduplication to Bcachefs, and because it's extent-based and not blockbased, the logic will look a lot more like rmlint than what you described.
This comment feels about half as long as it ought to be. Can you say more?
While I generally like to reinvent the wheel, for hash functions I strongly recommend to use a proved good one. Djb2 by the venerable Daniel Bernstein satisfies all the requirements of TFA.
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.Djb2 is hardly a proven good hash :) It's really easy to find collisions for it, and it's not seeded, so you're kind of screwed regardless. It's the odd middle ground between "safely usable in practice" and "fast in practice", which turns out to be "neither safe nor fast" in this case.
I just realized that a hash function is nothing less than the output of a deterministic random number generator xored with some data
Hash functions and PRNGs are closely related, they share many properties and they can be built from the same algorithmic components, so for many kinds of PRNGs there are corresponding kinds of hash functions and vice-versa.
Nevertheless, the purposes of hash functions and PRNGs are different and complementary.
A PRNG receives a short fixed-length value (the seed) and it expands it into a long pseudo-random sequence of arbitrary length.
A hash function receives a long input sequence of arbitrary length and it generates a short fixed-length pseudo-random value.
Good PRNGs are injective functions and good hash functions are surjective functions.
Normally the design methods for PRNGs and for hash functions should be presented together, because it is easy to interconvert algorithms for one of them with algorithms for the other. For instance, given a good hash function one could make a PRNG by computing the hashes of a sequence of numbers or the hashes of a sequence of strings of increasing length, and given a good PRNG one could make a hash function by accumulating somehow the input into a PRNG seed and taking the first generated number, or better by using input chunks as seeds and then accumulating the first generated numbers into a single value.
However for a successful conversion between PRNG and hash function algorithms, the source algorithm may have have to be overdesigned, to guarantee good enough properties even after the conversion.
When an algorithm is designed directly as a hash function or as a PRNG, with clearly specified requirements, it can be designed only as good as strictly necessary, enabling thus a better performance.
Could you explain what you mean here?
Hashes are _functions_ so provide the same output given the same input.
If you don't reseed the RNG after every hash computation, then you break this vital property of hashes.
And if you do reseed, then your claim boils down to "every hash function is just an XOR against a contstant" which certainly is not true either.
Sorry, what?
That might we one very particular way to write a hash function, but it's far from the only one.
Believe it or not, for some purposes taking the first few bytes of a string or even just the length of the string are good hash functions.
Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)
What are those purposes?
If your input is i.i.d. random, then truncating works great. Eg if your keys are UUIDs then truncating can work well.
Another use:
Suppose you write a tool like rmlint that is looking for duplicate files. Generally, you compute some hash for each file, see if you got any duplicates, and then compare the relevant files directly.
A traditional hash like crc or sha256 takes O(n) to compute. But for files you can start with some cheaper hashes, like file length. After all, files of different length can't have the same content. Taking the first few bytes of your file is another cheap 'hash' you can compute.
Only when these cheap 'hashes' show that you have a potential duplicate, do you go and pay for a more expensive hash.
> Like addition
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
I work on the machine code level, so the only characteristic I'm interested in is how many ticks it takes to compute the result, not how many transistors it requires or anything like that. All modern CPUs take 1 tick to compute both XOR, addition, and many other simple arithmetic operations, so even though addition is technically more complicated in CPU designs, it never surfaces in software. In the context of this post, I preferred addition instead of XOR to reduce cancel-out and propagate entropy between bits.
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?
Which bits are the best is completely dependent on the particular hash function.
For instance, if the last operation during computing a hash function was the kind of integer multiplication where the result has the same length as the operands, then the top bits are the best (because they depend on more of the input bits than the bottom result bits).
On the other hand, if the last operation during computing a hash function was the kind of integer multiplication where the result has a length that is the sum of the lengths of the operands (i.e. where the high bits of the result are not discarded), then the best bits of the result are neither the top bits nor the bottom bits, but the middle bits, for the same reason as before, they are the bits that depend on more of the input bits than either the bottom bits or the top bits of the result. (This fact becomes obvious if you do a long multiplication with pen on paper. After you compute the partial products and you start to add them, you can see that when computing either the top digits or the bottom digits you need to add much less digits from the partial products than when you compute the middle digits, so the middle digits are more thoroughly mixed as functions of the input digits.)
When the last operation is an addition, because the carry bits propagate from the bottom bits towards the top bits, the top bits are the best, because the higher a bit is in the result there are more input bits on which it depends.
I think the reason real-world implementations don't do this is to speed up access when the key is a small integer. Say, if your IDs are spread uniformly between 1 and 1000, taking the bottom 7 bits is a great hash, while the top 7 bits would just be zeros. So it's optimizing for a trivial hash rather than a general-purpose fast hash.
And since most languages require each data type to provide its own hash function, you kind of have to assume that the hash is half-assed and bottom bits are better. I think only Rust could make decisions differently here, since it's parametric over hashers, but I haven't seen that done.
I think older processors used to have a slower implementation for shifts, which made this slower.
Nowadays swisstable and other similar hashtables use the top bits and simd/swar techniques to quickly filter out collisions after determining the starting bucket.
Sorry, I've read and reread TFA but the concept still evades me. Is it that, since it's easier for a hash function to have higher entropy in the higher bits than in the lower ones, it would be more logical for hash tables to discard the lower bits and keep the higher ones?
Higher entropy in the higher bits is a property of addition and multiplication when the result has the same size as the operands (i.e. the result is taken modulo 2^N, which folds back the values exceeding the size of the result).
When other operations are used, there may be other bits with higher entropy. For example, when full-length multiplication is used (i.e. where the length of the result is the sum of the lengths of the operands), the middle bits have the highest entropy, not the top bits. On CPUs like the Intel/AMD x86-64 CPUs, where fast long multiplication instructions are available, this can be exploited in more performant hash functions and PRNGs.
In hash functions, additions and multiplications are frequently used together with rotations, in order to redistribute the entropy from the top bits to the bottom bits, during the following operations.
Honorary mention: byte swapping instructions (originally added to CPUs for endianness conversion) can also be used to redistribute entropy, but they're slightly slower than rotations on Intel, which is why I think they aren't utilized much.
Yes, that's my point. It's not true that all hash functions have this characteristic, but most fast ones do. (And if you're using a slow-and-high-quality hash function, the distinction doesn't matter, so might as well use top bits.)
This is a excellent article for anyone looking for some more in-depth analysis of tabulation based hashing methods: https://arxiv.org/abs/1505.01523
O(1) only if you're working in a language with length-stored strings (like Pascal[0]), right?
In something like C with its generic strings[1], it would surely have to be O(n) since you have to scan the entire string to calculate its length?
(I have always been terrible at big-O, mind.)
[0] There's probably more of them by now.
[1] ie. not a specific length-stored string type.
Yes. But that's a known misfeature of C and no other language does it like that. Plus I kind of meant arbitrary byte strings where you can have embedded zeroes and thus have to know the length.
You'd probably want 'return len(str)', if this is Python?
In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Yes, too much Rust :D Eliding the `return` becomes so intuitive after a while that you sort of forget that most other languages require it.
Haskell and the Lisps also work like this.
Yes, I know, it's the natural way to do it in functional programming. Honestly I doubt there are any FP languages that don't do it like that.
Re: missing return keyword
Actually, in Python, None is a valid key...
(I'm so sorry. JavaScript has ruined me.)
Any `return c` for some constant is a valid and correct hash function. It just has a lot of collisions and degenerates hash-maps to terrible performance. That was in fact my first thought when I read "simplest hash functions".
That's why I said "probably".
If you use the identity function as your hashing function then is it O(0) because you are done before you start?
For an arbitrarily long input, you still have to compress it to constant size somehow.
Or pad all entries with 0s to an arbitrarily long size. The 0s can be assumed, and not actually stored. Therefore arbitrarily long entries need not be shortened.
I don't think there are any reasonable use cases for a non-constant-length hash.
a hash function that produce hashes that are already in the hash table should, IMO, not be called a hash function.
I get why technically it is a hash function, but still, no.
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
ok damn, I did not know this, obviously. Thanks.
I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.
It's very very unlikely to get collisions there, but still not impossible. Whenever you map data of arbitrary length (infinite possibilities) to a limited length collisions are possible.
Doesn't even have to be arbitrary length.
Whenever you map into a smaller space, you get collisions. The bigger space doesn't have to be infinite.
with a password you may be mapping into a smaller space or a bigger space, because what you want is to get them all same length, but yeah you may in some cases be mapping into a smaller space, hadn't thought of that, although I sort of also think it is unlikely.
But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.
For passwords: the input _space_ is bigger. That doesn't say anything about the length of any particular password.
> But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.
For passowrds, you are not worried so much about two users accidentally getting the same hash, you are worried about people finding a pre-image that hashes to the same output as your user's password.
Let's consider a hash table with an allocation of 1MB, which is about 2^20 bytes. Assume also that each entry occupies a byte. Assuming that the hash function's values are distributed randomly, the probability of there being a collision with only 1000 entries is approximately 38% = 1-(2^20)!/(2^20 - 1000)!/(2^20)^1000. See the "Birthday Problem".
A perfect hash function https://en.wikipedia.org/wiki/Perfect_hash_function has to be specially constructed for each desired set of inputs. Generic hash functions cannot be 'perfect'.
Here is a hash function that does not have hash collisions:
That is a function, but not a hash function!
Well it no longer constrains the data in a fixed output length.
Sure, but if you constrain to fixed output length, you will definitely have collisions (Pigeon Hole Principle). There's no way around that.
padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.
what programming language is this?