Multihash is a protocol for differentiating outputs from various well-established hash functions, addressing size and encoding considerations. It is useful for writing applications that future-proof their use of hashes and allow multiple hash functions to coexist.
Multihash is particularly important in systems which depend on cryptographically secure hash functions. Attacks may break the cryptographic properties of secure hash functions. These cryptographic breaks are particularly painful in large tool ecosystems, where tools may have made assumptions about hash values, such as function and digest size. Upgrading becomes a nightmare, as all tools which make those assumptions would have to be upgraded to use the new hash function and new hash digest length. Tools may face serious interoperability problems or error-prone special casing.
How many programs out there assume a git hash is a sha1 hash?
How many scripts assume the hash digest is exactly 160 bits?
How many tools will break when these values change?
How many programs will fail silently when these values change?
This is precisely where Multihash shines. It was designed for upgrading.
When using Multihash, a system warns the consumers of its hash values that these may have to be upgraded in case of a break. Even though the system may still only use a single hash function at a time, the use of multihash makes it clear to applications that hash values may use different hash functions or be longer in the future. Tooling, applications, and scripts can avoid making assumptions about the length, and read it from the multihash value instead. This way, the vast majority of tooling – which may not do any checking of hashes – would not have to be upgraded at all. This vastly simplifies the upgrade process, avoiding the waste of hundreds or thousands of software engineering hours, deep frustrations, and high blood pressure.
A multihash follows the TLV (type-length-value) pattern.
<hash-func-type> is an unsigned variable integer identifying the hash function. There is a default table, and it is configurable. The default table is the multicodec table.
<digest-length> is an unsigned variable integer counting the length of the digest, in bytes.
<digest-value> is the hash function digest, with a length of exactly <digest-length> bytes.
sha2-256(code in hex:
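As a rough illustration of the type-length-value layout described above, here is a minimal Python sketch that encodes and decodes a single multihash. It is not one of the listed implementations: the helper names (encode_varint, decode_varint, encode_multihash, decode_multihash) are hypothetical, and the code 0x12 for sha2-256 is assumed from the multicodec table rather than taken from this text.

```python
import hashlib
from typing import Tuple

SHA2_256 = 0x12  # assumed from the multicodec table; not listed in this text


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint.

    Each byte carries 7 bits of the value, lowest group first; the most
    significant bit of a byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_multihash(code: int, digest: bytes) -> bytes:
    """<hash-func-type> + <digest-length> + <digest-value>."""
    return encode_varint(code) + encode_varint(len(digest)) + digest


def decode_multihash(buf: bytes) -> Tuple[int, bytes]:
    """Read the function code and length from the value itself."""
    code, pos = decode_varint(buf)
    length, pos = decode_varint(buf, pos)
    digest = buf[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest shorter than declared length")
    return code, digest


digest = hashlib.sha256(b"multihash").digest()
mh = encode_multihash(SHA2_256, digest)
print(mh.hex())  # starts with "1220": code 0x12, then length 0x20 (32 bytes)
```

A consumer that calls decode_multihash never hard-codes the digest size, which is the property the format is after.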
These implementations are available:
The following multihash examples are different hash function outputs of the same exact input:
The multihash examples are chosen to show different hash functions and different hash digest lengths at play.
sha1(code in hex:
sha2-256(code in hex:
sha2-512(code in hex:
Note: this is the actual SHA-512 (as per code 0x13) truncated to 256 bits; some libraries support a hash called SHA-512/256 that has the same 256-bit length but a different initialization vector (as defined in FIPS 180-4).
sha2-512(code in hex:
blake2b-512(code in hex:
blake2b-256(code in hex:
blake2s-256(code in hex:
blake2s-128(code in hex:
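To make the idea of "same input, different functions and lengths" concrete, the following Python sketch builds a few multihashes for one input using hashlib. It is only a sketch under stated assumptions: the codes 0x11 (sha1) and 0x12 (sha2-256) are assumed from the multicodec table (only 0x13 appears in this text), the multihash helper is hypothetical, and it handles only codes and lengths below 128, where each varint is a single byte. The input and the printed values are not the examples from this page.

```python
import hashlib
from typing import Optional

# 0x13 (sha2-512) is stated in the note above; 0x11 and 0x12 are assumed
# from the multicodec table.
SHA1, SHA2_256, SHA2_512 = 0x11, 0x12, 0x13


def multihash(code: int, digest: bytes, length: Optional[int] = None) -> bytes:
    """Prefix a digest with its function code and length.

    If `length` is given, the digest is truncated to that many bytes first.
    Hypothetical helper: only single-byte varints (values < 128) are handled.
    """
    if length is not None:
        digest = digest[:length]
    if code >= 0x80 or len(digest) >= 0x80:
        raise ValueError("this sketch only handles single-byte varints")
    return bytes([code, len(digest)]) + digest


data = b"multihash"
examples = {
    "sha1, 160 bits": multihash(SHA1, hashlib.sha1(data).digest()),
    "sha2-256, 256 bits": multihash(SHA2_256, hashlib.sha256(data).digest()),
    "sha2-512, 256 bits": multihash(SHA2_512, hashlib.sha512(data).digest(), 32),
    "sha2-512, 512 bits": multihash(SHA2_512, hashlib.sha512(data).digest()),
}

for name, mh in examples.items():
    # The first two bytes tell a reader which function and how many bytes follow.
    print(f"{name:20} code=0x{mh[0]:02x} length={mh[1]:3d} {mh.hex()}")
```

The two sha2-512 entries share the code 0x13 and differ only in the length byte, so the truncation is visible in the value itself.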
Q: Why have digest length as a separate number?
Because combining the hash function code and the digest length means a function code really becomes a “function-and-digest-size code”. This makes custom digest sizes annoying to use and much less flexible: we would need hundreds of codes to cover all the combinations people would want.
Q: Why varints (variable integers)?
So that we have no limitation on functions or lengths.
Q: What kind of varints?
A Most Significant Bit unsigned varint, as defined by the multiformats/unsigned-varint doc.
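A small Python sketch may help show why varints impose no fixed limit: small values take one byte, and larger values simply take more. The function names are hypothetical, and the sketch only follows the basic scheme (7 value bits per byte, most significant bit as a continuation flag, lowest group first); the referenced unsigned-varint document may impose further rules that are not enforced here.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, MSB set when more bytes follow."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes) -> int:
    """Decode a buffer that holds exactly one varint."""
    value = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            if i != len(buf) - 1:
                raise ValueError("trailing bytes after varint")
            return value
    raise ValueError("truncated varint")


for value in (0x12, 127, 128, 300, 16384):
    encoded = encode_varint(value)
    assert decode_varint(encoded) == value  # round trip
    print(f"{value:>6} -> {encoded.hex()}")
# Output:
#     18 -> 12
#    127 -> 7f
#    128 -> 8001
#    300 -> ac02
#  16384 -> 808001
```

Codes and lengths below 128 fit in one byte, so the header stays at two bytes while larger codes and lengths remain representable.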
Q: Don’t we have to agree on a table of functions?
Yes, but we already have to agree on functions, so this is not hard. The table even leaves some room for custom function codes.
Q: Why not use
For three reasons:
(1) Multihash and all other multiformats endeavor to make the values be “in-band” and to be treated as the original value. The construction <string-prefix>:<hex-digest> is human readable and tuned for some outputs. Hashes are stored compactly in their binary representation. Forcing applications to always convert is cumbersome (split on :, turn the right hand side into binary, remove the
(2) Multihash and all other multiformats endeavor to be as compact as possible, which means a binary packed representation will help save a lot of space in systems that use millions or billions of hashes. For example, a 100 TB file in IPFS may have as many as 400 million subobjects, which would mean 400 million hashes.
400,000,000 hashes * (7 - 2) bytes = 2 GB
(3) The length is extremely useful when hashes are truncated. This is a type of choice that should be expressed in-band. It is also useful when hashes are concatenated or kept in lists, and when scanning a stream quickly.
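Reasons (2) and (3) can be sketched in a few lines of Python. The first part reproduces the space estimate, assuming the 7 is the byte length of a textual prefix such as "sha256:" and the 2 is a multihash header of one code byte plus one length byte (both are readings of the formula, not statements from this text). The second part walks a buffer of concatenated multihashes using only the in-band lengths. The helper names are hypothetical, and the codes 0x11 and 0x12 are assumed from the multicodec table.

```python
import hashlib
from typing import Iterator, Tuple

# Reason (2): bytes saved by a 2-byte binary header over a textual prefix.
TEXT_PREFIX = "sha256:"  # assumed example prefix, 7 bytes
HEADER_BYTES = 2         # one byte of function code + one byte of length
hash_count = 400_000_000
saved_bytes = hash_count * (len(TEXT_PREFIX.encode()) - HEADER_BYTES)
print(saved_bytes / 10**9, "GB")  # 2.0 GB


# Reason (3): the in-band length lets a reader split a stream of hashes.
def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the unsigned varint at buf[pos]."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def iter_multihashes(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (function_code, digest) for each multihash packed back to back."""
    pos = 0
    while pos < len(buf):
        code, pos = read_varint(buf, pos)
        length, pos = read_varint(buf, pos)
        end = pos + length
        if end > len(buf):
            raise ValueError("truncated digest")
        yield code, buf[pos:end]
        pos = end


data = b"multihash"
stream = (
    bytes([0x11, 20]) + hashlib.sha1(data).digest()            # sha1
    + bytes([0x13, 32]) + hashlib.sha512(data).digest()[:32]   # truncated sha2-512
    + bytes([0x12, 32]) + hashlib.sha256(data).digest()        # sha2-256
)
items = list(iter_multihashes(stream))
print([(hex(code), len(digest)) for code, digest in items])
# [('0x11', 20), ('0x13', 32), ('0x12', 32)]
```

No delimiter or out-of-band schema is needed to find the boundaries, even though the three entries use different functions and one is truncated.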
Q: Is Multihash only for cryptographic hashes?
What about non-cryptographic hashes like
We decided to make Multihash work for all hash functions, not just cryptographic hash functions. The same kind of choices that people make around
We wanted to be able to include
SHA1, as they are widely used even now, despite no longer being secure. Ultimately, we could consider these cryptographic hash functions that have transitioned into non-cryptographic hash functions. Perhaps all of them eventually do.
Q: How do I add hash functions to the table?
Three options to add custom hash functions:
Q. I want to upgrade a large system to use Multihash. Could you help me figure out how?
Sure, ask for help on IRC, GitHub, or other fora. See the Multiformats Community listing.
Q. I wish Multihash would _______. I really hate _______.
Those are not questions. But please leave any and all feedback over in the Multihash repo. It will help us improve the project and make sure it addresses our users’ needs. Thanks!
There is a spec in progress, which we hope to submit to the IETF. It is being worked on at this pull-request.
The Multihash format was invented by @jbenet, and refined by the IPFS Team. It is now maintained by the Multiformats community. The Multihash implementations are written by a variety of authors, whose hard work has made future-proofing and upgrading hash functions much easier. Thank you!
The Multihash format (this documentation and the specification) is Open Source software, licensed under the MIT License and patent-free. The multihash implementations listed here are also Open Source software. Please contribute to make them great! Your bug reports, new features, and documentation improvements will benefit everyone.
Multihash is part of the Multiformats Project, a collection of protocols which aim to future-proof systems, today. Check out the other multiformats. It is also maintained and sponsored by Protocol Labs.