Blockchain Basics - Part One: Hashing

blockchainmerkle-treedata-structuresdecentralized

I'm working on a side project that requires a data store distributed amongst users. As such using a blockchain for the data store is one potential solution.
The blockchain structure is one of those topics I studied during school but never really applied or used. As such I need to review the basics for this project! When studying topics one of techniques I use to check if I really understand the topic is to write about it as though I'm presenting it to others. Therefore in this series of posts I'll review the basics of blockchain, what the fundamental concepts are, and techniques for implementing them.

Hashing Data

The first concept to know is the hash function and what it means to hash data. Data is ultimately a binary string - the larger the data the longer the string. A hash function is a mapping from a binary string of arbitrary length to a binary string of fixed length, where this length is something defines as part of defining the function. By applying the hash function to some input xx we hash x, generating its hash yy.

y=H(x)y = H(x)

For our purposes a we're interested in hash functions that have the following properties:

  1. Efficient and secure - hashing xx should be quick, but reversing the hash (i.e. computing xx from yy) should be very difficult. That is a brute force search over all inputs is needed.
  2. Minimal collisions - when hashing two different inputs xx and xx', we do not want H(x)=H(x)H(x) = H(x'). Ideally the hash function is injective with different inputs always yielding different outputs.
  3. Uniformly distributed - hashes should be widely spread across the range of possible values in a uniform manner. Another consequence of this is if xx and xx' as similar inputs, H(x)H(x) and H(x)H(x') should not necessarily be similar hashes.

A simple example of a hash function is to take the first 5 bits of the input string and call that the hash. While this is quick, for various inputs (e.g. names or messages) we'll eventually have a collision since their are only 25=322^5=32 possible hashes.

It may not seem clear how to construct a good hash function then. Fortunately the National Institute of Standards and Technologies provides a collection of functions called the Secure Hash Algorithms (SHA) family of Algorithms that are implemented by several packages such as NodeJs' crypto package. Below is an example of how the SHA-256 hashing function works by hashing input into a 256-length string (shown in hexadecimal to save screen space.) Note how even a slight difference in the text results in a drastically different hash.

Some Uses for Hashes

A hash function can be thought of a signature for the input. Suppose you want to sign your name to a document so others can verify that it was indeed you who signed it. Anyone can put your name, so instead you provide the hash of some message only you know. Then at a later time when you're asked to verify your signature you can regenerate the hash, providing the input that created it.

Another application is within data structures called dictionaries. A dictionary stores values based on a key value. The trick is to implement the dictionary in a manner where looking up a value using the key is fast. Using the uniform distribution property, hashes can act as the indices of an array, allowing for a constant time look up.

Merkle Tree

Now that we can generate a hash value you may ask how we can hash multiple bits of data. Since hashes work on data of arbitrary length the first thought may be to concatenate the data first then generate a single hash.

y=H(x1x2xm)y = H(x_1 \odot x_2 \odot \dots \odot x_m)

While this could work, one drawback is it will require storing all data in the event we would like to regenerate or verify the hash. For example if each input xix_i is several megabytes we'll require storing at least mm megabytes of data.

Recall that a hash is a signature of the input data, mapping it to a smaller representation (e.g. 256 bits when using SHA-256.) Instead if we can compute the hash of each value first, this greatly reduces the amount of data we need to store without losing the ability to regenerate or verify the hash.

y=M(H(x1),H(x2),,H(xm))y = M(H(x_1), H(x_2), \dots, H(x_m))

Next we'd like to accumulate all of these hashes into a single hash to represent the data by a lone value. One way to achieve this is via a Merkle tree. A Merkle tree consist of three types of notes:

  1. Merkle root
  2. Non-leaf nodes
  3. Leaf nodes

Leaf nodes contain the hashes of the input data. Two of these hashes are concatenated together and then hashed generating a non-leaf nodes. The hashes of non-leaf nodes are then combined together, repeating this process until the final two non-leaf nodes are combined together generating the Merkle root. This Merkle root represents the collection of data as a single hash.

Below is a simple snippet of how to generate a Merkle root using a fixed number of fields with React's useState hook:

import React, { useEffect, useState } from 'react';
import crypto from 'crypto';
const [data, setData] = useState({
  name: "Jane Doe",
  address: "555 Traversal St., New York", 
  email: "janedoe@email.com",
  position: "Software Engineer Manager"
});

const [hash, setHash] = useState("");

useEffect(() => {
  // compute  Merkle root
  const hashFunction = crypto.createHash('sha256');
  const nameHash = hashFunction.update(data.name).digest('hex');
  const addressHash = hashFunction.update(data.address).digest('hex');
  const emailHash = hashFunction.update(data.email).digest('hex');
  const positionHash = hashFunction.update(data.position).digest('hex');
  // non-leaf nodes, intermediate hashes
  const t1 = nameHash + addressHash;
  const t2 = emailHash + positionHash;
  const t1Hash = hashFunction.update(t1).digest('hex');
  const t2Hash = hashFunction.update(t2).digest('hex');
  const rootHash = hashFunction.update(t1Hash + t2Hash).digest('hex');
  setHash(rootHash);
}, [data]);

Another benefit of this approach is we if any of the individual data are altered the changes to their hashes will propagate up to the Merkle root (more on this in the next post!)

Conclusions and One Last Application

Using a Merkle Tree we can combine the hashes of multiple objects into a single hash. I'll discuss this more in the next post and how it fits into generating blocks of a blockchain, but we can add a new nonce object to hash which only purpose is to alter the Merkle hash of our data.

We then play the following game: how can we alter the nonce value such that the Merkle hash ends with nn zeros? Because there's no way to predict how different inputs will alter the hash we do a brute force search over different nonces. When there are multiple people playing this game, it is a race to search through this space.

In this post we introduced hashing functions, some properties of them, and how we can combine multiple hashes, allowing us to generate a single hash for a collection of data. In the next post we'll discuss how these get pulled together to create blocks and blockchains.