Merkle Trees

driving

As I described previously, cryptographic hash functions have the ability to compress an arbitrarily large block of data into a fixed size. The result, or hash, then acts as a unique identifier for the original data. Merkle trees do all this and more. The idea is that, as with a hash function, a large block of data can be represented by a single fixed-size identifier, known as the Merkle root or hash root. This much is exactly the same as with a regular hash function. However, if the data is broken down into a collection of individual parts, then it is possible to verify each part individually. This is different from regular hash functions, where the entire block of data is required in order to recompute the hash and verify the data correctness. Some example uses of Merkle trees are:

  • Transmission of large collections of files or databases. Instead of downloading a large database over the internet, we can first download its Merkle root. Then, the individual entries of the database can be downloaded as needed, possibly from different sources, and separately verified to be correct.
  • Efficient storage of large amounts of user permissions. Only the Merkle root needs to be stored then, when a user wants to verify his or her permissions, this can be done individually rather than requiring access to the entire database.
  • Bitcoin stores the Merkle root of the transaction data in the block header. This can be used by thin clients to verify that individual transactions are included in a block without having to download and store the entire blockchain. Only the headers are required to be stored.

Suppose that we have an array of N data blocks. Then, the procedure is as follows:

  • The Merkle root of the array of data is computed, which is a binary string of fixed length (256 bits is sufficient).
  • For any data block in the array, we can transmit this along with a short piece of information known as the Merkle proof, to any other party in possession of the Merkle root. Using this, they are able to verify that the data block is correct and was indeed in the original data.
  • The other party does not need to trust the person transmitting the data. They only need to be confident that they have the correct Merkle root. Then, the proof is sufficient for the verification, since it is impossible to create a valid proof for any data which was not in the original array.
  • The size of the Merkle proof is of the order of the logarithm of N. This is bounded for practical purposes. For example, it is very unlikely that we would ever have arrays longer than 2100 so, effectively, the base-2 logarithm of N is bounded by 100.

The magic here is that the other party only needs to have the 256 bit Merkle root stored locally. Then, to prove that a block of data is in the array, we only have to transmit this block along with a short proof, regardless of the size of the original array of data.


Constructing the Merkle Tree

I now go into the mechanics of a Merkle tree. The first thing that we need is a cryptographic hash function, which maps any block of data to an essentially unique binary string of fixed size. For example, the SHA256 hash function can be used, which creates hashes of 256 bits (Bitcoin actually uses a double, or repeated, SHA256 hash). Throughout this post, I assume a hash function has been chosen, which will be denoted by H.

We build a Merkle tree from a sequence of blocks of data as follows. First, take the hash of each item. Then, arrange them as the leaves of a binary tree, where each internal node is assigned a value equal to the hash of its two children concatenated together. For a sequence of 16 data items, labelled t0 through t15, this is as in figure 1. Here, I have labelled each node by its hash value hb for a binary string b indicating the position of the node in the tree. So, for example, we have h1001 = H(t9) and h100 = H(h1000h1001). These hash values can all be computed by starting at the leaf nodes and, iteratively, working up the tree. Finally, the Merkle root is the hash value h at the very top.

Merkle tree
Figure 1: A Merkle tree with 16 leaves.

Note that the total number of hashes that need to be computed to build the tree is 31 or, more generally, about twice the number of leaves. So, the time taken to compute the tree is of the order of the size of the original data.

Next, suppose that someone — Alice, say — is in the possession of the Merkle root h. How can we prove to her that an individual entry (t5 for example) is correct and is in the array? We do not want to transmit the entire array. Consider the path through the tree from the root to the leaf. This is marked in green in figure 2 below, and includes the hash values h, h0, h01, h010.

blockchain attack
Figure 2: The path from the root to a leaf of the tree.

Each of these hashes can be computed from the values of its two children, one of which is in the path (coloured green), and the other of which bounds the path and is coloured orange. Specifically, h1, h00, h011, h0100 are children of nodes on the path, but do not lie on the path itself.

The Merkle proof is the sequence (0101,h1,h00,h011,h0100). The first element is simply a binary string giving the shape of the path. If this data is transmitted to Alice, then she can perform the following calculations in order.

h0101 = H(t5)
h010 = H(h0100h0101)
h01 = H(h010h011)
h0 = H(h00h01)
h = H(h0h1)

The right hand side of each line only involves t5, the values from the Merkle proof, or values computed in previous lines. Then, Alice compares this computed value of h with her stored Merkle root. If they agree, the proof is valid and verifies that the array contains t5. If they do not agree, the proof is invalid.

So long as we send Alice the correct values for t5 and the correct hash values in the proof, then all her calculations simply reproduce those used in the original construction of the tree. She is guaranteed to reproduce the same Merkle root, so determines that the proof is valid. But, what about the other way round. If the calculations above show that the proof is valid, can Alice be absolutely sure that the true value of t5 was received? Put another way, is it possible to fool Alice by constructing valid Merkle proofs for data which does not exist in the original array?

Luckily, Alice can be confident that the correct value was received. At least, she can under the assumption that there does indeed exist a leaf at the place in the tree corresponding to t5. For example, if she already knows that the array was of size 16, so that the tree is as in figure 1, then this is the case. Alice can argue as follows.

  1. As h = H(h0h1), the values of h0 and h1 are correct.
  2. Then, as h0 = H(h00h01), the values of h00 and h01 are correct.
  3. Then, as h01 = H(h010h011), the values of h010 and h011 are correct.
  4. Then, as h010 = H(h0100h0101), the values of h0100 and h0101 are correct.
  5. Finally, as h0101 = H(t5), the value of t5 is correct.

The reasoning in each of these steps makes use of the collision-resistance of the hash function. In the final step, this means that there can be no known block of data, other than t5, whose hash is equal to the given value. Hence, she must have received the correct value of t5. For each of the other steps, she is reasoning that there is only one known binary string whose hash equals the value on the left hand side and, therefore, the value that she has for the concatenation hahb is correct. As the hashes ha,hb are of known length, they can be recovered from the concatenation, so again she must have the correct values.

As mentioned above, the argument assumes that Alice knows that there is a leaf of the tree in the position specified by the path in the Merkle proof. What happens if she does not know this? In that case, the argument above does not hold, so she cannot be sure that the received value of t5 was in the original array. This is because she could have been subject to a second preimage attack which I discuss below, and is an easily fixed issue.

Writing code to generate Merkle roots and trees is straightforward. For example, the following short C++ function takes in the array of hashes of the data elements and returns the Merkle root. This is a modified, and simplified, version of the function used by the Bitcoin source code (merkle.cpp).

uint256 ComputeMerkleRoot(std::vector hashes) { if(hashes.size() == 0) return uint256(); while(hashes.size() > 1) { if(hashes.size() & 1) hashes.push_back(uint256()); SHA256D64(hashes[0].data(), hashes[0].data(), hashes.size()/2); hashes.resize(hashes.size()/2); } return hashes[0]; }

The data type ‘uint256’ represents a 256 bit (or 32 byte) unsigned integer, as returned by the hash function. The function call SHA256D64 here is taking the array of 32 byte integers, treating them as an array of 64 byte integers of half the length, and computing their hash values. This results in an array of half the length, and this operation is repeated until the array is of length 1, then the value is returned.

There are a couple of additional lines in this calculation that I did not address above. First, if the initial array is empty then we simply return 0, or uint(), for the Merkle root. By preimage resistance, this is not a hash of any known value. Secondly, at each iteration of the calculation, the array of hashes could have an odd length so cannot be grouped into pairs. This does not happen if the original array has length equal to a power of two, such as the case described above of length 16. The remediation is to simply append 0, or uint(), to the end. The effect of this is to truncate the binary tree at nodes with zero hash, to reduce number of leaves. This is as shown in figure 3 for an initial array of 10 data values t0 through t9.

Merkle tree with zeros
Figure 3: Using zero hashes for truncating a Merkle tree.

Oddly, Bitcoin does not use this simple method but, instead, duplicates the last element of the array if it has an odd length. This caused a bug, which needed to be fixed by a messy workaround as described by comments in the source code (see merkle.cpp).

A more elegant and efficient way to handle data arrays of length not a power of two, without truncating, is to use binary trees whose leaves do not all have the same depth. So long as the depths do not vary by more than 1, the tree is balanced, as in figure 4.

balanced Merkle tree
Figure 4: A balanced Merkle tree.

For an original array of length N, the binary tree will have depth equal to the base 2 logarithm log2N, rounded up if this is not an integer. So, the total size of the Merkle proofs are proportional to logN.


Second Preimage Attack

Merkle trees constructed as described so far are vulnerable to second preimage attacks, where it is possible to create a valid Merkle proof for an item which is not actually in the original data array. In the example above, where we attempt to prove to Alice that the element t5 is in the array, Alice had a watertight argument that a valid Merkle proof is sufficient to determine that the claimed element is correct. However, the argument only works if she already knows that the tree does have a leaf at the stated position. If she does not know the size of the Merkle tree already then it is possible that t5 as not actually in the array at all.

So, what can go wrong? If the tree is smaller than that depicted in figure 1, then any of the first 4 steps of her argument can fail. For example, the first step was to conclude that the values she has for h0 and h1 are correct, based on the fact that h0h1 hashes to h. However, for all she knows, the tree may not even have nodes corresponding to these hashes, and could have just a single leaf with data element equal to h0h1. Similarly, in the second step of her argument, she cannot conclude that the values she has for h00 and h01 are actually hashes in the tree. Maybe, instead, it has a leaf with data element h00h01.

The final step of Alice’s argument can also be wrong if the tree is bigger than depicted in figure 1. The fact that t5 has hash value h0101 does not mean that it is the value of a data element of the array. It could, instead, be equal to the concatenation h01010h01011 of other hash values at internal nodes of the tree.

In the other direction, if the Merkle tree is as in figure 1, then it is possible to create Merkle proofs for data elements which are not in the original array. For example, (01,h1,h00) is a valid Merkle proof for h010h011. In fact, it is possible to build a new Merkle tree smaller than the one in figure 1 which has same root. An example is shown in figure 5 below, where the leaf data values are concatenations of pairs of hashes from the original tree.

Merkle second preimage attack
Figure 5: A different Merkle tree with the same root.

Fortunately, these issues are easily fixed. It is only necessary to be able to distinguish the values at internal nodes of the tree from those at the leaf nodes. One method is to prepend the leaf values by the byte (or bit) 0 before taking the hash, and by 1 at internal nodes. So, in figure 1, we now have h1001 = H(0t9) and h100 = H(1h1000h1001). Another method is to apply the hash function twice for leaf nodes and once for internal nodes. Yet another, if the tree has a constant depth d as in figures 1 and 3, is to distribute the depth along with the Merkle root. Or, replace the Merkle root by the hash H(hd). Since a Merkle proof also gives the depth of the tree, it is straightforward to add a check for the depth to the verification algorithm.

Unfortunately, the Merkle tree used by Bitcoin does not do any of these, so is vulnerable to second preimage attacks. This could potentially cause an issue for thin clients, which only hold the block header and rely on Merkle proofs to check that a transaction is in a block. However, it is very unlikely that the concatenation of a pair of hash values is also equal to a valid bitcoin transaction and, so, the problems can be avoided by simply refusing to process transactions that are 64 bytes long, or Merkle proofs where any of the concatenated hash values are also valid transactions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s