As I described previously, cryptographic hash functions have the ability to compress an arbitrarily large block of data into a fixed size. The result, or hash, then acts as a unique identifier for the original data. *Merkle trees* do all this and more. The idea is that, as with a hash function, a large block of data can be represented by a single fixed-size identifier, known as the *Merkle root* or *hash root*. This much is exactly the same as with a regular hash function. However, if the data is broken down into a collection of individual parts, then it is possible to verify each part individually. This is different from regular hash functions, where the entire block of data is required in order to recompute the hash and verify the data correctness. Some example uses of Merkle trees are:

- Transmission of large collections of files or databases. Instead of downloading a large database over the internet, we can first download its Merkle root. Then, the individual entries of the database can be downloaded as needed, possibly from different sources, and separately verified to be correct.
- Efficient storage of large amounts of user permissions. Only the Merkle root needs to be stored then, when a user wants to verify his or her permissions, this can be done individually rather than requiring access to the entire database.
- Bitcoin stores the Merkle root of the transaction data in the block header. This can be used by thin clients to verify that individual transactions are included in a block without having to download and store the entire blockchain. Only the headers are required to be stored.

Suppose that we have an array of *N* data blocks. Then, the procedure is as follows:

- The Merkle root of the array of data is computed, which is a binary string of fixed length (256 bits is sufficient).
- For any data block in the array, we can transmit this along with a short piece of information known as the
*Merkle proof*, to any other party in possession of the Merkle root. Using this, they are able to verify that the data block is correct and was indeed in the original data. - The other party does not need to trust the person transmitting the data. They only need to be confident that they have the correct Merkle root. Then, the proof is sufficient for the verification, since it is impossible to create a valid proof for any data which was not in the original array.
- The size of the Merkle proof is of the order of the logarithm of
*N*. This is bounded for practical purposes. For example, it is very unlikely that we would ever have arrays longer than 2^{100}so, effectively, the base-2 logarithm of*N*is bounded by 100.

The magic here is that the other party only needs to have the 256 bit Merkle root stored locally. Then, to prove that a block of data is in the array, we only have to transmit this block along with a short proof, regardless of the size of the original array of data.

#### Constructing the Merkle Tree

I now go into the mechanics of a Merkle tree. The first thing that we need is a cryptographic hash function, which maps any block of data to an essentially unique binary string of fixed size. For example, the SHA256 hash function can be used, which creates hashes of 256 bits (Bitcoin actually uses a double, or repeated, SHA256 hash). Throughout this post, I assume a hash function has been chosen, which will be denoted by *H*.

We build a Merkle tree from a sequence of blocks of data as follows. First, take the hash of each item. Then, arrange them as the leaves of a binary tree, where each internal node is assigned a value equal to the hash of its two children concatenated together. For a sequence of 16 data items, labelled *t*_{0} through *t*_{15}, this is as in figure 1. Here, I have labelled each node by its hash value *h _{b}* for a binary string

*b*indicating the position of the node in the tree. So, for example, we have

*h*

_{1001}=

*H*(

*t*

_{9}) and

*h*

_{100}=

*H*(

*h*

_{1000}

*h*

_{1001}). These hash values can all be computed by starting at the leaf nodes and, iteratively, working up the tree. Finally, the Merkle root is the hash value

*h*at the very top.

Note that the total number of hashes that need to be computed to build the tree is 31 or, more generally, about twice the number of leaves. So, the time taken to compute the tree is of the order of the size of the original data.

Next, suppose that someone — Alice, say — is in the possession of the Merkle root *h*. How can we prove to her that an individual entry (*t*_{5} for example) is correct and is in the array? We do not want to transmit the entire array. Consider the path through the tree from the root to the leaf. This is marked in green in figure 2 below, and includes the hash values *h*, *h*_{0}, *h*_{01}, *h*_{010}.

Each of these hashes can be computed from the values of its two children, one of which is in the path (coloured green), and the other of which bounds the path and is coloured orange. Specifically, *h*_{1}, *h*_{00}, *h*_{011}, *h*_{0100} are children of nodes on the path, but do not lie on the path itself.

The Merkle proof is the sequence (0101,*h*_{1},*h*_{00},*h*_{011},*h*_{0100}). The first element is simply a binary string giving the shape of the path. If this data is transmitted to Alice, then she can perform the following calculations in order.

h_{0101} = H(t_{5}) |

h_{010} = H(h_{0100}h_{0101}) |

h_{01} = H(h_{010}h_{011}) |

h_{0} = H(h_{00}h_{01}) |

h = H(h_{0}h_{1}) |

The right hand side of each line only involves *t*_{5}, the values from the Merkle proof, or values computed in previous lines. Then, Alice compares this computed value of *h* with her stored Merkle root. If they agree, the proof is valid and verifies that the array contains *t*_{5}. If they do not agree, the proof is invalid.

So long as we send Alice the correct values for *t*_{5} and the correct hash values in the proof, then all her calculations simply reproduce those used in the original construction of the tree. She is guaranteed to reproduce the same Merkle root, so determines that the proof is valid. But, what about the other way round. If the calculations above show that the proof is valid, can Alice be absolutely sure that the true value of *t*_{5} was received? Put another way, is it possible to fool Alice by constructing valid Merkle proofs for data which does not exist in the original array?

Luckily, Alice can be confident that the correct value was received. At least, she can under the assumption that there does indeed exist a leaf at the place in the tree corresponding to *t*_{5}. For example, if she already knows that the array was of size 16, so that the tree is as in figure 1, then this is the case. Alice can argue as follows.

- As
*h*=*H*(*h*_{0}*h*_{1}), the values of*h*_{0}and*h*_{1}are correct. - Then, as
*h*_{0}=*H*(*h*_{00}*h*_{01}), the values of*h*_{00}and*h*_{01}are correct. - Then, as
*h*_{01}=*H*(*h*_{010}*h*_{011}), the values of*h*_{010}and*h*_{011}are correct. - Then, as
*h*_{010}=*H*(*h*_{0100}*h*_{0101}), the values of*h*_{0100}and*h*_{0101}are correct. - Finally, as
*h*_{0101}=*H*(*t*_{5}), the value of*t*_{5}is correct.

The reasoning in each of these steps makes use of the collision-resistance of the hash function. In the final step, this means that there can be no known block of data, other than *t*_{5}, whose hash is equal to the given value. Hence, she must have received the correct value of *t*_{5}. For each of the other steps, she is reasoning that there is only one known binary string whose hash equals the value on the left hand side and, therefore, the value that she has for the concatenation *h _{a}h_{b}* is correct. As the hashes

*h*are of known length, they can be recovered from the concatenation, so again she must have the correct values.

_{a},h_{b}As mentioned above, the argument assumes that Alice knows that there is a leaf of the tree in the position specified by the path in the Merkle proof. What happens if she does not know this? In that case, the argument above does not hold, so she cannot be sure that the received value of *t*_{5} was in the original array. This is because she could have been subject to a *second preimage attack* which I discuss below, and is an easily fixed issue.

Writing code to generate Merkle roots and trees is straightforward. For example, the following short C++ function takes in the array of hashes of the data elements and returns the Merkle root. This is a modified, and simplified, version of the function used by the Bitcoin source code (merkle.cpp).

`uint256 ComputeMerkleRoot(std::vector hashes) { if(hashes.size() == 0) return uint256(); while(hashes.size() > 1) { if(hashes.size() & 1) hashes.push_back(uint256()); SHA256D64(hashes[0].data(), hashes[0].data(), hashes.size()/2); hashes.resize(hashes.size()/2); } return hashes[0]; }`

The data type ‘uint256’ represents a 256 bit (or 32 byte) unsigned integer, as returned by the hash function. The function call SHA256D64 here is taking the array of 32 byte integers, treating them as an array of 64 byte integers of half the length, and computing their hash values. This results in an array of half the length, and this operation is repeated until the array is of length 1, then the value is returned.

There are a couple of additional lines in this calculation that I did not address above. First, if the initial array is empty then we simply return 0, or uint(), for the Merkle root. By preimage resistance, this is not a hash of any known value. Secondly, at each iteration of the calculation, the array of hashes could have an odd length so cannot be grouped into pairs. This does not happen if the original array has length equal to a power of two, such as the case described above of length 16. The remediation is to simply append 0, or uint(), to the end. The effect of this is to truncate the binary tree at nodes with zero hash, to reduce number of leaves. This is as shown in figure 3 for an initial array of 10 data values *t*_{0} through *t*_{9}.

Oddly, Bitcoin does not use this simple method but, instead, duplicates the last element of the array if it has an odd length. This caused a bug, which needed to be fixed by a messy workaround as described by comments in the source code (see merkle.cpp).

A more elegant and efficient way to handle data arrays of length not a power of two, without truncating, is to use binary trees whose leaves do not all have the same depth. So long as the depths do not vary by more than 1, the tree is *balanced*, as in figure 4.

For an original array of length *N*, the binary tree will have depth equal to the base 2 logarithm log_{2}*N*, rounded up if this is not an integer. So, the total size of the Merkle proofs are proportional to log*N*.

#### Second Preimage Attack

Merkle trees constructed as described so far are vulnerable to *second preimage* attacks, where it is possible to create a valid Merkle proof for an item which is not actually in the original data array. In the example above, where we attempt to prove to Alice that the element *t*_{5} is in the array, Alice had a watertight argument that a valid Merkle proof is sufficient to determine that the claimed element is correct. However, the argument only works if she already knows that the tree does have a leaf at the stated position. If she does not know the size of the Merkle tree already then it is possible that *t*_{5} as not actually in the array at all.

So, what can go wrong? If the tree is smaller than that depicted in figure 1, then any of the first 4 steps of her argument can fail. For example, the first step was to conclude that the values she has for *h*_{0} and *h*_{1} are correct, based on the fact that *h*_{0}*h*_{1} hashes to *h*. However, for all she knows, the tree may not even have nodes corresponding to these hashes, and could have just a single leaf with data element equal to *h*_{0}*h*_{1}. Similarly, in the second step of her argument, she cannot conclude that the values she has for *h*_{00} and *h*_{01} are actually hashes in the tree. Maybe, instead, it has a leaf with data element *h*_{00}*h*_{01}.

The final step of Alice’s argument can also be wrong if the tree is bigger than depicted in figure 1. The fact that *t*_{5} has hash value *h*_{0101} does not mean that it is the value of a data element of the array. It could, instead, be equal to the concatenation *h*_{01010}*h*_{01011} of other hash values at internal nodes of the tree.

In the other direction, if the Merkle tree is as in figure 1, then it is possible to create Merkle proofs for data elements which are not in the original array. For example, (01,*h*_{1},*h*_{00}) is a valid Merkle proof for *h*_{010}*h*_{011}. In fact, it is possible to build a new Merkle tree smaller than the one in figure 1 which has same root. An example is shown in figure 5 below, where the leaf data values are concatenations of pairs of hashes from the original tree.

Fortunately, these issues are easily fixed. It is only necessary to be able to distinguish the values at internal nodes of the tree from those at the leaf nodes. One method is to prepend the leaf values by the byte (or bit) 0 before taking the hash, and by 1 at internal nodes. So, in figure 1, we now have *h*_{1001} = *H*(0*t*_{9}) and *h*_{100} = *H*(1*h*_{1000}*h*_{1001}). Another method is to apply the hash function twice for leaf nodes and once for internal nodes. Yet another, if the tree has a constant depth *d* as in figures 1 and 3, is to distribute the depth along with the Merkle root. Or, replace the Merkle root by the hash *H*(*hd*). Since a Merkle proof also gives the depth of the tree, it is straightforward to add a check for the depth to the verification algorithm.

Unfortunately, the Merkle tree used by Bitcoin does not do any of these, so is vulnerable to second preimage attacks. This could potentially cause an issue for thin clients, which only hold the block header and rely on Merkle proofs to check that a transaction is in a block. However, it is very unlikely that the concatenation of a pair of hash values is also equal to a valid bitcoin transaction and, so, the problems can be avoided by simply refusing to process transactions that are 64 bytes long, or Merkle proofs where any of the concatenated hash values are also valid transactions.