Cryptographic hash functions are amazing things, with almost magical properties. They have the ability to compress an arbitrarily large block of data into a fixed size, called its hash. For example, the SHA256 hash function will compress any amount of data into a unique string of 256 bits (32 bytes).

Mathematically, this is impossible. Consider how many strings of 256 bits exist. This is a straightforward question, and the answer is 2^{256}. Then, how many strings of arbitrary length exist? Clearly, there are infinitely many. We cannot map an infinite set into a finite one while ensuring that distinct elements always have distinct images. More generally, we cannot map a large set into a smaller one in a one-to-one fashion. So, even restricting to inputs of length 257, there must exist pairs of distinct inputs that map to the same output hash.

What gives? Well, we can only ever evaluate the hash function a finite number of times. Adding up the entire number of times that anyone has evaluated an SHA256 hash ever in the world, we get a very large number, but it is still a lot less than 2^{256}. All that we need is for the hash function to *effectively* be one-to-one, so that with all the time and computing power in the world, the probability of ever finding two distinct inputs with the same hash is vanishingly small.

It is for this reason that cryptographically secure hash functions have strong pseudo-random properties. Although the hash of any input string can be computed by a deterministic algorithm, the result obtained depends seemingly randomly on the input. It is tricky to make this idea mathematically precise, but one consequence is that the probability of finding two distinct strings with the same hash is about the same as randomly generating strings of 256 bits, and finding two that are equal. The space of strings of 256 bits is of size 2^{256} which, although finite, is *huge*. The probability of two totally randomly generated 256 bit numbers ever being the same is so small that it is effectively impossible. The situation is similar to that of digital signatures for a public key cryptosystem, where it is theoretically possible to find a valid digital signature without knowing the private key but, in practice, it is impossible.

For strong hash functions, there is no known way to go in the reverse direction, so that we cannot reconstruct a data string from its hash. In some ways this is a drawback since we cannot simply compress a file into 256 bits by a hash function and then decompress it later. In other ways, though, it is a big benefit of hash functions, as they allow private data to be represented in a secure way.

Cryptographic hash functions have the following two properties.

- Preimage resistance. Given a hash, it is computationally infeasible to find an input string which hashes to that value.
- Collision resistance. It is computationally infeasible to find any two distinct strings with the same hash value.

The first of these two properties is actually a consequence of the second. If we could find preimages of hashes, then we could also find two strings with the same hash. To just do this, randomly generate input strings of sufficient length and compute their hash. Then, find a preimage which, with high probability, will be distinct from the original string, giving a hash collision.

A few applications of hash functions are:

- Error free transmission of large files. If we want to send a large file over the internet, then we can first send its hash. Then transmit the large file and, by computing the hash of the received file, we can detect if errors have occurred.
- File storage. If we upload files to a file storage system, then we would like to know if any of our locally stored copies have any changes. Rather than having to download the entire filesystem to compare, we can just download the file hashes.
- Storage of secret data. Secret data such as passwords can be stored in a secure way by just saving their hashes. It can then be verified that a password is correct by re-computing the hash, but the password itself never needs to be stored.
- Proof of work methods used to secure blockchains rely on hash functions, which I will explain in another post.
- Hash functions can be used to create an ID for referencing a block of data. For example, in Bitcoin, transaction IDs are created as the hash of a transaction, and every block is linked to the previous block by containing its hash.
- Merkle trees allow arbitrarily large collections of data to be stored as a single 256 bit hash, and it is possible to verify that individual blocks of data are indeed contained in the collection without ever having to handle the whole database.

#### Probability of a Hash Collision

Consider a hash function whose image space has size *N*. For the SHA256 hash, for example, we have *N* = 2^{256}. Next, suppose that we compute a hash function a large number *m* times, each time on a different input. What is the probability of two hashes being the same? We can compute this by making the assumption that each hash is randomly and uniformly distributed over the image space. In that case, the probability of any pair of distinct hash calculations giving the same result is 1/*N*. As there are precisely *m*(*m* – 1)/2 ways of choosing a pair from the collection of *m* hashes, the expected number of hash collisions is,

m(m – 1)2
N |
. |

This works as an upper bound on the probability of there being at least one collision. So, the probability of a collision satisfies the bound,

p ≤ |
m^{2}2
N |
. |

Consider computing a quadrillion quadrillion hashes, or *m* = 10^{30}. This is many orders of magnitude more than is feasible with all the computing power and all the time in the world. Even in this case, the probability of there existing two hashes which collide is bounded by

p ≤ |
10
^{60}2
^{257} |
≈ 4 × 10^{-18}. |

The probability of a humanity-destroying asteroid hitting the earth in any given second, based on this occurring approximately once every 30 million years, is about 10^{-15}. The probability of a hash collision is orders of magnitude smaller than this.

This calculation makes the assumption that hashes behave as uniformly distributed random 256 bit strings. If it was possible to generate input strings whose hashes have a significantly biased distribution, then that would increase the probability of a collision.