Merkle Trees


As I described previously, cryptographic hash functions have the ability to compress an arbitrarily large block of data into a fixed size. The result, or hash, then acts as a unique identifier for the original data. Merkle trees do all this and more. The idea is that, as with a hash function, a large block of data can be represented by a single fixed-size identifier, known as the Merkle root or hash root. This much is exactly the same as with a regular hash function. However, if the data is broken down into a collection of individual parts, then it is possible to verify each part individually. This is different from regular hash functions, where the entire block of data is required in order to recompute the hash and verify the data correctness. Some example uses of Merkle trees are:

  • Transmission of large collections of files or databases. Instead of downloading a large database over the internet, we can first download its Merkle root. Then, the individual entries of the database can be downloaded as needed, possibly from different sources, and separately verified to be correct.
  • Efficient storage of large amounts of user permissions. Only the Merkle root needs to be stored; then, when a user wants to verify his or her permissions, this can be done individually rather than requiring access to the entire database.
  • Bitcoin stores the Merkle root of the transaction data in the block header. This can be used by thin clients to verify that individual transactions are included in a block without having to download and store the entire blockchain. Only the headers are required to be stored.

Suppose that we have an array of N data blocks. Then, the procedure is as follows:

  • The Merkle root of the array of data is computed, which is a binary string of fixed length (256 bits is sufficient).
  • For any data block in the array, we can transmit this, along with a short piece of information known as the Merkle proof, to any other party in possession of the Merkle root. Using this, they are able to verify that the data block is correct and was indeed in the original data.
  • The other party does not need to trust the person transmitting the data. They only need to be confident that they have the correct Merkle root. Then, the proof is sufficient for the verification, since it is impossible to create a valid proof for any data which was not in the original array.
  • The size of the Merkle proof is of the order of the logarithm of N. This is bounded for practical purposes. For example, it is very unlikely that we would ever have arrays longer than 2^100 so, effectively, the base-2 logarithm of N is bounded by 100.

The magic here is that the other party only needs to have the 256 bit Merkle root stored locally. Then, to prove that a block of data is in the array, we only have to transmit this block along with a short proof, regardless of the size of the original array of data.
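To get a feel for the sizes involved: with 32-byte hashes, a Merkle proof contains roughly one sibling hash per level of the tree, so about log2(N) hashes in total. A quick back-of-the-envelope calculation (the numbers here are illustrative, not from any particular system):

```python
import math

N = 1_000_000                            # a million data blocks
hash_size = 32                           # bytes in a 256-bit hash
proof_hashes = math.ceil(math.log2(N))   # one sibling hash per tree level
proof_bytes = proof_hashes * hash_size

print(proof_hashes, proof_bytes)         # 20 hashes, 640 bytes
```

So proving membership in a million-entry array costs well under a kilobyte, compared with transmitting the whole array.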


Constructing the Merkle Tree

I now go into the mechanics of a Merkle tree. The first thing that we need is a cryptographic hash function, which maps any block of data to an essentially unique binary string of fixed size. For example, the SHA256 hash function can be used, which creates hashes of 256 bits (Bitcoin actually uses a double, or repeated, SHA256 hash). Throughout this post, I assume a hash function has been chosen, which will be denoted by H.

We build a Merkle tree from a sequence of blocks of data as follows. First, take the hash of each item. Then, arrange them as the leaves of a binary tree, where each internal node is assigned a value equal to the hash of its two children concatenated together. For a sequence of 16 data items, labelled t0 through t15, this is as in figure 1. Here, I have labelled each node by its hash value hb for a binary string b indicating the position of the node in the tree. So, for example, we have h1001 = H(t9) and h100 = H(h1000h1001). These hash values can all be computed by starting at the leaf nodes and, iteratively, working up the tree. Finally, the Merkle root is the hash value h at the very top.
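The bottom-up construction just described can be sketched in a few lines of Python. This is a minimal sketch: it uses a single SHA256 for H (Bitcoin uses a double SHA256), assumes the number of blocks is a power of two, and the function names are my own:

```python
import hashlib

def H(data: bytes) -> bytes:
    """The chosen hash function: here a single SHA256."""
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash each block, then repeatedly hash concatenated pairs up the tree.
    Assumes len(blocks) is a power of two."""
    level = [H(b) for b in blocks]            # leaf hashes, e.g. h1001 = H(t9)
    while len(level) > 1:
        level = [H(level[i] + level[i + 1])   # internal node = H(left || right)
                 for i in range(0, len(level), 2)]
    return level[0]                           # the Merkle root h at the top

root = merkle_root([f"t{i}".encode() for i in range(16)])
```

For the two-leaf case this reduces to H(H(t0)H(t1)), matching the node formula above; real implementations also need a rule for odd-length levels (Bitcoin duplicates the last hash).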

Figure 1: A Merkle tree with 16 leaves.


Proof of Work

A bitcoin mining farm.

Here’s an interesting experiment. Go to any Bitcoin block explorer and look at some blocks. I chose the recent block numbers (or heights) 686710 through 686719. Next, look at their hashes. These are 256 bit numbers, usually expressed as 64-character hexadecimal strings and, for the heights I selected, they are:

000000000000000000044e63ddeb30d097e0b3ccd9cca2069137cb7297e424bd
0000000000000000000230e3d3dea9fbbb65723ca07a33c1b9b3089c994983d9
0000000000000000000074c2db5661b85341d10e106e2e773549a38e1ec4d16c
0000000000000000000688941d5031bd7167880b62b333f274e61cc8e381a028
00000000000000000004aa907cf6a1e9d44e60ec4638242be5d767a0df07f772
0000000000000000000cbf470e7497e224c598609d72c7921991aec121433883
0000000000000000000b1a18922652ebc989f899b59b050dd0a9f6a312fa5998
0000000000000000000655ec73770127f0d28937af9d6413237f46592a215340
0000000000000000000a6e8c366aa51b534ce5ab399e9e8c72df290873bb9f76
000000000000000000001ded7e91b5d29a08e8166044ef3fc06a0944e9877a95

One thing immediately stands out. All of the hashes start with a lot of zeros. Statistically, this should be virtually impossible unless the blocks have been chosen specifically to have such unlikely hash values. Interpreting the 256 bit strings as binary integers and dividing through by 2^256 gives a number between 0 and 1 (which I will denote by p). For the hash of an arbitrary block of data, the p values would be uniformly distributed on the unit interval but, for the blocks above, they are:
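This division is easy to reproduce (Python, using the first block hash listed above):

```python
# Hash of block 686710, as a length-64 hexadecimal string.
h = "000000000000000000044e63ddeb30d097e0b3ccd9cca2069137cb7297e424bd"

# Interpret as a 256-bit integer and divide by 2^256 to get p in [0, 1).
p = int(h, 16) / 2**256
print(p)  # ≈ 3.56e-24
```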

  height    p
  686710    3.56e-24
  686711    1.81e-24
  686712    3.77e-25
  686713    5.40e-24
  686714    3.86e-24
  686715    1.06e-23
  686716    9.18e-24
  686717    5.24e-24
  686718    8.63e-24
  686719    9.67e-26

All of these values are minuscule, being less than about 10^-23. By chance, this can only happen with probability about 10^-23 independently for each block. In fact, due to the pseudorandom properties of the SHA256 hash function, there is really only one way that this can happen. Many, many candidate blocks must have been produced each time, of the order of 10^23 of them, with only those having small hash values being transmitted to the network and accepted as valid.

This is, of course, the proof of work (POW) protocol, or consensus mechanism. It ensures that miners who create new Bitcoin blocks have to perform significant work, regulating the rate of block production and, vitally, ensuring immutability of the blockchain. Once a block has been added to the chain, and is already underneath several others (i.e., has several confirmations), then it is virtually impossible for it to be altered. This would require redoing all of the proof of work for the block in question and for all blocks above it, and doing this quickly enough to catch up with the main chain.
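The mining loop itself can be sketched in a few lines. This is only an illustration: it uses a single SHA256 rather than Bitcoin's double hash, a made-up header, and an artificially easy target so that it runs instantly:

```python
import hashlib

def mine(header: bytes, target: int) -> int:
    """Try successive nonces until the hash, read as an integer, falls below target."""
    nonce = 0
    while True:
        h = hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()
        if int.from_bytes(h, "big") < target:
            return nonce
        nonce += 1

# An easy target: hashes below 2^240, so roughly 1 in 65536 attempts succeeds.
# Bitcoin's real target is vastly smaller, requiring ~10^23 attempts.
nonce = mine(b"example block header", 2**240)
```

Halving the target doubles the expected number of attempts, which is how the network tunes the difficulty to keep block production at a steady rate.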

Block Hashes

The application of hash functions that I want to describe in this post is as the glue used to hold together the blocks in a blockchain. Put another way, it is a hash function that puts the chain into blockchain.


A blockchain is just an ordered sequence of blocks of data, with new blocks constantly getting added. While it needs to be flexible enough that new blocks can be added, it also requires immutability, so that already existing blocks are never altered. This is achieved with the help of hash functions. The idea is that every block contains a field which is set to the hash of the previous block. The initial or genesis block is the only one without a previous block to refer to so, for that one only, its previous block hash field is set to zero. An example of a blockchain containing only 5 blocks is shown in figure 1 below.

Figure 1: A blockchain held together by a hash function.

The rectangles labelled Bn refer to the blocks, which contain data including one field taking the value h(Bn-1), which is the hash of Bn-1. Finally, we compute the hash of the last block, h(B5), which serves as an identifier for the blockchain. In the case of Bitcoin, these are double (repeated) SHA256 hashes, and the blocks used here are actually the block headers. The header contains only a hash (specifically, the Merkle root) of the transactions instead of the transaction data itself, so is much smaller than the entire block. Otherwise, it makes no conceptual difference.
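A toy version of this chaining, with a single SHA256 standing in for Bitcoin's double hash and plain byte strings standing in for real block headers (the function names are my own):

```python
import hashlib

def h(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def build_chain(payloads: list[bytes]) -> list[bytes]:
    """Each block starts with the hash of the previous block's bytes;
    the genesis block's previous-hash field is set to zero."""
    prev = bytes(32)                  # genesis: previous-hash field of all zeros
    chain = []
    for data in payloads:
        block = prev + data           # previous-hash field followed by block data
        chain.append(block)
        prev = h(block)
    return chain

chain = build_chain([b"B1", b"B2", b"B3", b"B4", b"B5"])
chain_id = h(chain[-1])               # hash of the last block identifies the chain
```

Altering any block changes its hash, which changes the next block's previous-hash field, and so on up to the chain identifier; this is the immutability property.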

Hash Functions


Cryptographic hash functions are amazing things, with almost magical properties. They have the ability to compress an arbitrarily large block of data into a fixed size, called its hash. For example, the SHA256 hash function will compress any amount of data into a unique string of 256 bits (32 bytes).
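This compression is easy to see in practice. Using Python's standard hashlib (a single SHA256 here, whereas Bitcoin uses a double hash), inputs of wildly different sizes all produce 32-byte digests:

```python
import hashlib

# Hash two inputs of very different sizes; both digests are 256 bits (32 bytes).
small = hashlib.sha256(b"hello").digest()
large = hashlib.sha256(b"x" * 10_000_000).digest()

print(len(small), len(large))              # both 32 bytes
print(hashlib.sha256(b"hello").hexdigest())
```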

Mathematically, this is impossible. Consider how many strings of 256 bits exist. This is a straightforward question, and the answer is 2^256. Then, how many strings of arbitrary length exist? Clearly, there are infinitely many. We cannot map an infinite set into a finite one while ensuring that distinct elements always have distinct images. More generally, we cannot map a large set into a smaller one in a one-to-one fashion. So, even restricting to inputs of length 257 bits, there must exist pairs of distinct inputs that map to the same output hash.

What gives? Well, we can only ever evaluate the hash function a finite number of times. Adding up the total number of times that anyone, anywhere, has ever evaluated a SHA256 hash, we get a very large number, but it is still a lot less than 2^256. All that we need is for the hash function to effectively be one-to-one, so that with all the time and computing power in the world, the probability of ever finding two distinct inputs with the same hash is vanishingly small.

It is for this reason that cryptographically secure hash functions have strong pseudo-random properties. Although the hash of any input string can be computed by a deterministic algorithm, the result obtained depends seemingly randomly on the input. It is tricky to make this idea mathematically precise, but one consequence is that the probability of finding two distinct strings with the same hash is about the same as randomly generating strings of 256 bits, and finding two that are equal. The space of strings of 256 bits is of size 2^256 which, although finite, is huge. The probability of two totally randomly generated 256 bit numbers ever being the same is so small that it is effectively impossible. The situation is similar to that of digital signatures for a public key cryptosystem, where it is theoretically possible to find a valid digital signature without knowing the private key but, in practice, it is impossible.

For strong hash functions, there is no known way to go in the reverse direction, so that we cannot reconstruct a data string from its hash. In some ways this is a drawback since we cannot simply compress a file into 256 bits by a hash function and then decompress it later. In other ways, though, it is a big benefit of hash functions, as they allow private data to be represented in a secure way.

Cryptographic hash functions have the following two properties.

  • Preimage resistance. Given a hash, it is computationally infeasible to find an input string which hashes to that value.
  • Collision resistance. It is computationally infeasible to find any two distinct strings with the same hash value.

The first of these two properties is actually a consequence of the second. If we could find preimages of hashes, then we could also find two strings with the same hash. To do this, just randomly generate an input string of sufficient length and compute its hash. Then, find a preimage of that hash which, with high probability, will be distinct from the original string, giving a hash collision.

A few applications of hash functions are:

  • Error-free transmission of large files. If we want to send a large file over the internet, then we can first send its hash. Then transmit the large file and, by computing the hash of the received file, we can detect if errors have occurred.
  • File storage. If we upload files to a file storage system, then we would like to know if any of our locally stored copies have any changes. Rather than having to download the entire filesystem to compare, we can just download the file hashes.
  • Storage of secret data. Secret data such as passwords can be stored in a secure way by just saving their hashes. It can then be verified that a password is correct by re-computing the hash, but the password itself never needs to be stored.
  • Proof of work methods used to secure blockchains rely on hash functions, which I will explain in another post.
  • Hash functions can be used to create an ID for referencing a block of data. For example, in Bitcoin, transaction IDs are created as the hash of a transaction, and every block is linked to the previous block by containing its hash.
  • Merkle trees allow arbitrarily large collections of data to be stored as a single 256 bit hash, and it is possible to verify that individual blocks of data are indeed contained in the collection without ever having to handle the whole database.
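The password bullet above, for instance, can be sketched as follows. This uses a bare SHA256 purely for illustration; real systems should use a salted, deliberately slow password hash such as bcrypt or Argon2:

```python
import hashlib

def store(password: str) -> bytes:
    """Save only the hash; the password itself is never stored."""
    return hashlib.sha256(password.encode()).digest()

def check(password: str, stored: bytes) -> bool:
    """Verify a password attempt by re-computing the hash and comparing."""
    return hashlib.sha256(password.encode()).digest() == stored

record = store("correct horse battery staple")
print(check("correct horse battery staple", record))  # True
print(check("wrong guess", record))                   # False
```

Preimage resistance is what makes this safe: anyone who steals the stored hash still cannot feasibly recover the password from it.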
