Self-describing hashes
Multihash is a protocol for differentiating outputs from various well-established hash functions, addressing size and encoding considerations. It is useful for writing applications that future-proof their use of hashes and allow multiple hash functions to coexist.
Multihash is particularly important in systems which depend on cryptographically secure hash functions. Attacks may break the cryptographic properties of secure hash functions. These cryptographic breaks are particularly painful in large tool ecosystems, where tools may have made assumptions about hash values, such as function and digest size. Upgrading becomes a nightmare, as all tools which make those assumptions would have to be upgraded to use the new hash function and new hash digest length. Tools may face serious interoperability problems or error-prone special casing.
How many programs out there assume a git hash is a sha1 hash?
How many scripts assume the hash value digest is exactly 160 bits?
How many tools will break when these values change?
How many programs will fail silently when these values change?
This is precisely where Multihash shines. It was designed for upgrading.
When using Multihash, a system warns the consumers of its hash values that these may have to be upgraded in case of a break. Even though the system may still only use a single hash function at a time, the use of multihash makes it clear to applications that hash values may use different hash functions or be longer in the future. Tooling, applications, and scripts can avoid making assumptions about the length, and read it from the multihash value instead. This way, the vast majority of tooling – which may not do any checking of hashes – would not have to be upgraded at all. This vastly simplifies the upgrade process, avoiding the waste of hundreds or thousands of software engineering hours, deep frustrations, and high blood pressure.
A multihash follows the TLV (type-length-value) pattern:
<hash-func-type><digest-length><digest-value>
The type, <hash-func-type>, is an unsigned variable integer identifying the hash function. There is a default table, and it is configurable. The default table is the multicodec table.
The length, <digest-length>, is an unsigned variable integer counting the length of the digest, in bytes.
The value, <digest-value>, is the hash function digest, with a length of exactly <digest-length> bytes.
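The layout above can be sketched in a few lines of Python. This is a minimal illustration, not one of the official implementations: the helper names `encode_varint` and `multihash_sha2_256` are made up for this example, and the only table entry assumed is sha2-256 with code 0x12.

```python
import hashlib

SHA2_256 = 0x12  # sha2-256 code from the default table


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint.

    Seven bits per byte, least significant group first; the high bit of
    each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash_sha2_256(data: bytes) -> bytes:
    """Hypothetical helper: wrap a SHA-256 digest as type + length + value."""
    digest = hashlib.sha256(data).digest()
    return encode_varint(SHA2_256) + encode_varint(len(digest)) + digest


mh = multihash_sha2_256(b"hello")
print(mh.hex())  # starts with "1220": code 0x12, length 0x20 (32 bytes)
```

Both the code and the length fit in a single varint byte here, so the whole prefix is two bytes.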
For example:
122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8
The hash function is sha2-256 (code in hex: 0x12).
The length is 32 bytes (in hex: 0x20).
The digest is 41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8.
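Reading a multihash is the reverse walk: read one varint for the function code, one for the length, then take that many bytes. The sketch below parses the example value; `decode_varint` and `decode_multihash` are illustrative names, not the API of any particular library.

```python
from typing import Tuple


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the unsigned varint starting at offset."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, offset
        shift += 7


def decode_multihash(mh: bytes) -> Tuple[int, int, bytes]:
    """Split a multihash into (function code, digest length, digest)."""
    code, pos = decode_varint(mh)
    length, pos = decode_varint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest is not exactly <digest-length> bytes")
    return code, length, digest


example = bytes.fromhex(
    "1220"
    "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)
code, length, digest = decode_multihash(example)
print(hex(code), length, digest.hex())
```

A consumer that only stores or compares hashes never needs to look up what code 0x12 means; it only needs the length to know where the value ends.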
These implementations are available:
The following multihash examples are different hash function outputs of the same exact input:
Merkle–Damgård
The multihash examples are chosen to show different hash functions and different hash digest lengths at play.
11148a173fd3e32c0fa78b90fe42d305f202244e2739
The hash function is sha1 (code in hex: 0x11).
The length is 20 bytes (in hex: 0x14).
The digest is 8a173fd3e32c0fa78b90fe42d305f202244e2739.
122041dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8
The hash function is sha2-256 (code in hex: 0x12).
The length is 32 bytes (in hex: 0x20).
The digest is 41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8.
132052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4
The hash function is sha2-512 (code in hex: 0x13).
The length is 32 bytes (in hex: 0x20).
The digest is 52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4.
Note: this is the actual SHA-512 (as per code 0x13) truncated to 256 bits; some libraries support a hash called SHA-512/256 that has the same 256-bit length but a different initialization vector (as defined in FIPS 180-4).
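To make the note concrete, here is a small sketch of what "SHA-512 truncated to 256 bits" means in practice: the full 64-byte digest is computed and only its first 32 bytes are kept, with the length field recording the truncation. The input bytes are an arbitrary placeholder, not the input used for the examples on this page.

```python
import hashlib

data = b"example input"  # placeholder input, chosen only for illustration

full = hashlib.sha512(data).digest()  # 64 bytes
truncated = full[:32]                 # keep the first 256 bits

# Codes 0x13 and lengths 0x20 / 0x40 are all below 0x80, so each one
# is a single varint byte.
mh_truncated = bytes([0x13, 0x20]) + truncated
mh_full = bytes([0x13, 0x40]) + full

print(mh_truncated.hex())
print(mh_full.hex())
```

The truncated multihash is a prefix-compatible view of the full one: same function code, shorter length, and the same leading digest bytes. SHA-512/256 would produce entirely different bytes for the same input, which is why it should not be labeled with code 0x13.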
134052eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0
The hash function is sha2-512 (code in hex: 0x13).
The length is 64 bytes (in hex: 0x40).
The digest is 52eb4dd19f1ec522859e12d89706156570f8fbab1824870bc6f8c7d235eef5f4c2cbbafd365f96fb12b1d98a0334870c2ce90355da25e6a1108a6e17c4aaebb0.
c0e40240d91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2
The hash function is blake2b-512 (code in hex: 0xb240).
The length is 64 bytes (in hex: 0x40).
The digest is d91ae0cb0e48022053ab0f8f0dc78d28593d0f1c13ae39c9b169c136a779f21a0496337b6f776a73c1742805c1cc15e792ddb3c92ee1fe300389456ef3dc97e2.
a0e402207d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030
The hash function is blake2b-256 (code in hex: 0xb220).
The length is 32 bytes (in hex: 0x20).
The digest is 7d0a1371550f3306532ff44520b649f8be05b72674e46fc24468ff74323ab030.
e0e40220a96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d
The hash function is blake2s-256 (code in hex: 0xb260).
The length is 32 bytes (in hex: 0x20).
The digest is a96953281f3fd944a3206219fad61a40b992611b7580f1fa091935db3f7ca13d.
d0e402100a4ec6f1629e49262d7093e2f82a3278
The hash function is blake2s-128 (code in hex: 0xb250).
The length is 16 bytes (in hex: 0x10).
The digest is 0a4ec6f1629e49262d7093e2f82a3278.
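The blake2 examples begin with three bytes such as c0e402 even though the function code is written as 0xb240. That is the varint encoding at work: codes of 0x80 and above no longer fit in one byte. The sketch below, using an illustrative `encode_varint` helper, reproduces the prefixes shown in the examples.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, low group first, high bit = continue."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


codes = [
    ("sha1", 0x11),
    ("sha2-256", 0x12),
    ("blake2b-512", 0xB240),
    ("blake2b-256", 0xB220),
    ("blake2s-256", 0xB260),
    ("blake2s-128", 0xB250),
]

for name, code in codes:
    print(f"{name:12} 0x{code:x} -> {encode_varint(code).hex()}")

# sha1         0x11 -> 11
# sha2-256     0x12 -> 12
# blake2b-512  0xb240 -> c0e402
# blake2b-256  0xb220 -> a0e402
# blake2s-256  0xb260 -> e0e402
# blake2s-128  0xb250 -> d0e402
```

Taking 0xb240 step by step: the low seven bits are 0x40, emitted as 0xc0 with the continuation bit set; the next seven bits are 0x64, emitted as 0xe4; the remaining bits are 0x02, emitted as the final byte.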
Q: Why have digest length as a separate number?
Because combining the hash function code and the hash digest length would turn the function code into a “function-and-digest-size code”. That makes custom digest sizes annoying to use and much less flexible: we would need hundreds of codes for all the combinations people would want to use.
Q: Why varints (variable integers)?
So that we have no limitation on functions or lengths.
Q: What kind of varints?
A Most Significant Bit unsigned varint, as defined by the multiformats/unsigned-varint doc.
Q: Don’t we have to agree on a table of functions?
Yes, but we already have to agree on functions, so this is not hard. The table even leaves some room for custom function codes.
Q: Why not use "sha256:<digest>"?
For three reasons:
(1) Multihash and all other multiformats endeavor to make the values be “in-band” and to be treated as the original value. The construction <string-prefix>:<hex-digest> is human readable and tuned for some outputs. Hashes are stored compactly in their binary representation. Forcing applications to always convert is cumbersome (split on :, turn the right hand side into binary, remove the :, concat).
(2) Multihash and all other multiformats endeavor to be as compact as possible, which means a binary packed representation will help save a lot of space in systems that use millions or billions of hashes. For example, a 100 TB file in IPFS may have as many as 400 million subobjects, which would mean 400 million hashes.
400,000,000 hashes * (7 - 2) bytes = 2 GB
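The arithmetic can be checked with a short sketch. It assumes the 7 in the formula is the length of the text prefix "sha256:" and the 2 is the two-byte multihash prefix (function code plus length); that reading is an interpretation of the formula, not something stated explicitly here.

```python
digest_hex = "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"

text_prefix = b"sha256:"             # 7 bytes
binary_prefix = bytes([0x12, 0x20])  # 2 bytes: sha2-256 code + length 32

saving_per_hash = len(text_prefix) - len(binary_prefix)  # 5 bytes
total_saving = 400_000_000 * saving_per_hash             # bytes

print(saving_per_hash, total_saving)  # 5 2000000000

# If the digest is also kept as hex text rather than binary, the gap widens.
text_form = text_prefix + digest_hex.encode("ascii")
binary_form = binary_prefix + bytes.fromhex(digest_hex)
print(len(text_form), len(binary_form))  # 71 34
```

The formula counts only the prefix difference. Storing the digest itself as hex doubles its size, so the real difference between the two forms is larger than 2 GB for the same number of hashes.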
(3) The length is extremely useful when hashes are truncated. This is a type of choice that should be expressed in-band. It is also useful when hashes are concatenated or kept in lists, and when scanning a stream quickly.
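As a sketch of the stream-scanning point: because every value carries its own length, a reader can walk a buffer of back-to-back multihashes without knowing any of the hash functions involved. The function names below are illustrative, and the stream is built from two of the example values on this page.

```python
from typing import Iterator, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_pos) for the unsigned varint at pos."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def iter_multihashes(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (function code, digest) for each multihash concatenated in buf."""
    pos = 0
    while pos < len(buf):
        code, pos = read_varint(buf, pos)
        length, pos = read_varint(buf, pos)
        end = pos + length
        if end > len(buf):
            raise ValueError("truncated digest")
        yield code, buf[pos:end]
        pos = end  # jump straight to the next multihash


stream = bytes.fromhex(
    "1114" "8a173fd3e32c0fa78b90fe42d305f202244e2739"
    "1220" "41dd7b6443542e75701aa98a0c235951a28a0d851b11564d20022ab11d2589a8"
)

for code, digest in iter_multihashes(stream):
    print(hex(code), len(digest), digest.hex())
```

With a "sha256:<digest>" style string, the same scan would need a delimiter or a per-function length table; here the length read from each value is enough.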
Q: Is Multihash only for cryptographic hashes? What about non-cryptographic hashes like murmur3, cityhash, etc?
We decided to make Multihash work for all hash functions, not just cryptographic hash functions. The same kind of choices that people make around
We wanted to be able to include MD5 and SHA1, as they are widely used even now, despite no longer being secure. Ultimately, we could consider these cryptographic hash functions that have transitioned into non-cryptographic hash functions. Perhaps all of them eventually do.
Q: How do I add hash functions to the table?
Three options to add custom hash functions:
Q: I want to upgrade a large system to use Multihash. Could you help me figure out how?
Sure, ask for help in IRC, github, or other fora. See the Multiformats Community listing.
Q: I wish Multihash would _______. I really hate _______.
Those are not questions. But please leave any and all feedback over in the Multihash repo. It will help us improve the project and make sure it addresses our users’ needs. Thanks!
There is a spec in progress, which we hope to submit to the IETF. It is being worked on at this pull-request.
The Multihash format was invented by @jbenet, and refined by the IPFS Team. It is now maintained by the Multiformats community. The Multihash implementations are written by a variety of authors, whose hard work has made future-proofing and upgrading hash functions much easier. Thank you!
The Multihash format (this documentation and the specification) is Open Source software, licensed under the MIT License and patent-free. The multihash implementations listed here are also Open Source software. Please contribute to make them great! Your bug reports, new features, and documentation improvements will benefit everyone.
Multihash is part of the Multiformats Project, a collection of protocols which aim to future-proof systems, today. Check out the other multiformats. It is also maintained and sponsored by Protocol Labs.