Suitable 16-bit checksum?

Question

I am writing a toy1 transport protocol and I am trying to implement a feature whereby:

A 16-bit checksum is included in every packet.
This checksum can be extended to be 48-bit if the data is large, integrity is important, or whatever (creating this mostly for fun).
This checksum is intended to protect against errors, not a determined attacker.

My initial thought was to use a well known algorithm such as MD5/SHA2 and simply take the first/last 16/48 bits. This raises a few questions:

What are the consequences of simply taking a "substring" of a checksum? I assume, perhaps naively, that it would just increase the likelihood of a collision.
Is there any reason to prefer taking the last x bits instead of the first x bits, or vice versa?
When truncating a checksum like this, does SHA actually perform any better than MD5? Keeping in mind I am concerned with errors, not attackers.

[1] That is to say, I'm creating it for the purpose of learning. So "just use UDP" or "just use TCP" does apply, but that's why I'm not.

[2] Maybe use MD5 if it's 16-bit, SHA if it's 48-bit.

Squeamish Ossifrage · Accepted Answer

You can use a 16-bit truncation of MD5 or SHA-1 or SHA-256 or SHAKE128 as a checksum, but it's not a good checksum.
Why not?  Let's model them by a uniform random function of $t$ bits, which is a reasonable model in this scenario.  In this case, $Pr[H(x) = h] = 1/2^t$ for any message $x$ and hash $h$, and every $H(x)$ is independent for each distinct $x$.  Thus, the probability of failing to detect an error $e$ in a message $x$ is $Pr[H(x + e) = H(x)] = 1/2^t$ for any message $x$ and any error $e$.  This is the best error detection you can hope for from a hash function designed like a random oracle (or close to it, length extensions notwithstanding).
If you choose a good 16-bit CRC (preprint, paywall-free; web site), then you can be guaranteed:

any error of odd parity will be detected, provided the generator polynomial has $x + 1$ as a factor;
any burst error of up to 16 bits will be detected, except an error of the generator polynomial itself, provided the generator polynomial does not have $x$ as a factor (i.e., has a nonzero constant term);
many errors of up to a certain Hamming weights in data words of up to a certain size will be detected.

Cryptographic hash functions designed for random oracle use like MD5, SHA-1, SHA-256, SHAKE128, etc., guarantee none of these.  They're also often much more expensive to compute than error-detecting codes like checksums, because you're paying an extremely high cost for collision resistance which is of no concern for error detection.

What are the consequences of simply taking a "substring" of a checksum? I assume, perhaps naively, that it would just increase the likelihood of a collision.

Collisions are not relevant here.  The only thing that is relevant is the probability of error detection.  Any substring of MD5 or SHA-1 or SHA-256 or SHAKE128 is as good as any other.

Is there any reason to prefer taking the last x bits instead of the first x bits, or vice versa?

No.  If this led to an appreciably different probability of error detection, then that would be a noteworthy result about the hash function.

When truncating a checksum like this, does SHA actually perform any better than MD5? Keeping in mind I am concerned with errors, not attackers.

No.  They're both bad choices for detecting independent random bit errors.

b degnan · Answer

From what I have found, there is not a way to answer this question in a cryptographic context that will make cryptographers happy.  If you have a cryptographic hash with 16-bits, or even 48-bits, you have limited protection on using the checksum to detect someone forging the data depending on how much data you send.

I would just use Error Correcting Codes as a checksum in the stream and forgo the idea of using the checksum for anything but stream verification and not data integrity.

It's also worth mentioning that hash takes a lot more cycles and complexity than you probably want if you have a simple device.

Maarten Bodewes · Answer

What are the consequences of simply taking a "substring" of a checksum? I assume, perhaps naively, that it would just increase the likelihood of a collision.

That's correct for cryptographic checksums.

For non-cryptographic checksums it may also destroy properties with regards to finding e.g. single or double bit errors or regional errors (that the cryptographic hashes do not possess).

Is there any reason to prefer taking the last x bits instead of the first x bits, or vice versa?

No, not when using a hash that operates on bits / bytes / words, where each bit should be as probable depending on the input. This is true for almost all hashes in use; you may want to check hashes that are based on number theory - but in general you would not want to use those in the first place.

It's a non-formal standard to take the leftmost bits / bytes when truncating hashes.

When truncating a checksum like this, does SHA actually perform any better than MD5? Keeping in mind I am concerned with errors, not attackers.

No, both outputs could be thought of to be pseudo random.

As Eugene and poncho indicated in the comments, it is a much better idea to go for a suitable cyclic redundancy check with well specified error detecting capabilities.

Suitable 16-bit checksum?

3 Answers

Add your own answers!

Ask a Question