A Low Complexity "Fix" for MD5
Summary
The cryptographic
hash
function MD5 has become unusable for cryptographic and authentication
purposes because
hash collisions have been found that open up MD5 hashed data to
undetectable
alteration. Cryptographic functions (in order to work correctly) are
supposed
to be completely free of collisions. At least 100 MD5 hash collisions have been found via HashClash,
a
BOINC distributed computing project. Possibly millions of productive
MD5
collisions exist in the selective modification of 'dataset bits' that
MD5 was meant to protect. Thus, it is expected
that MD5 will be phased out in the next 5 years. MD5 can have its life
extended
by running the function backward and taking the resultant hash and
adding it to
the hash of normally run MD5. MD5's algorithm need
not be
altered.
Historical: MD2 and MD5's Known Weaknesses
-
Rogier and Chauvaud (1997) described collisions of MD2's
compression
function, although they were unable to extend the attack to the full
MD2. In 2004, MD2 was shown to be vulnerable to a preimage attack with time complexity equivalent
to 2104 applications of the compression function (Muller,
2004). The author concludes, "MD2 can no longer be considered a
secure one-way hash function". The design weaknesses of MD2 led to
the creation of MD5.
-
MD5 makes only one pass over the data, if two prefixes with the
same hash can be constructed, a common suffix can be added to both to
make the collision more reasonable. And because the current
collision-finding techniques allow the preceding hash state to be
specified arbitrarily, a collision can be found for any desired
prefix—for any given string of characters X, two colliding files
can be
determined which both begin with X. All that is required to generate
two colliding files is a template file, with a 128-byte block of data
aligned on a 64-byte boundary, that can be changed freely by the
collision-finding algorithm.
There is no formal definition which captures all of the properties
considered desirable for a cryptographic hash function. These
properties below are generally considered prerequisites:
- Preimage resistant (Similar to ideal one way functions, but a slightly
different property): given hash it should be hard to find any message
such that hash = hash_function(message).
- Second preimage resistant: given an input message1,
it should be hard to find another input, message2
(not
equal to message1) such that hash_function(message1) = hash_function(message2).
- Collision-resistant: it should be hard to find two
different messages message1
and message2 such
that hash_function(message1) = hash_function(message2).
- In order to avoid a possible birthday attack, the hash
function output (in bits) must be at least twice as large as what is
required for
preimage-resistance.
A hash function meeting these criteria may still have undesirable
properties. For instance, most popular hash functions are vulnerable to
length-extension attacks: given hash(message) and length(message)
but not message, by choosing a suitable message' an
attacker can
calculate hash(message || message'), where || denotes
concatenation.
This property can be used to break naive authentication schemes based
on hash functions. The HMAC construction works around these problems.
An ideal hash function would be maximally "boring": it would have no
interesting properties such as length extension, and the only
interesting way it would differ from a random function would be in that
it was deterministic and efficiently computable. This criterion is of
course deeply resistant to formal expression; the closest thing to
formal expression is the random oracle model, which is an
idealization no real hash function can satisfy.
Note
- Most MD5 hash collisions (when they are found) are not useable for forging electronic
documents and data. A great deal of computational effort (with larger
data sets, but not smaller
data sets) is still involved for those wishing to forge MD5
hashes.
- MD5 is still a vast improvement over
using CRC-32 or CRC-40. Mathmatical research suggests that MD5 may
still be better by
many orders of magnitude compared to CRC-64-ECMA, the strongest CRC in
common use today.
- A small percentage of all MD5
collisions (in the subset of all collisions) can
be used to forge small data sets. However, routine use of forged
MD5 data (of larger datasets) is not practical as it is with CRC-32 or
some versions of CRC-40.
- Most communications, data transfer and
authentications protocols are designed (or
should be redesigned) to trap MD5 forgeries.
In spite of the relatively good news about the
health of
MD5, it is set to put into the cryptography technology dustbin.
MD5 need not be thrown
away just
yet. A low complexity fix can keep MD5 in common use for another
decade, but in
a slightly modified form.
The Fix, Forward + Backward MD5 (MD5BF)
Key terms & definitions
- BOF: Beginning of File (or data
stream, data packet, data field etc)
- EOF: End of File (or data stream, data
packet, data field etc)
- MD5(): MD5 in its default run mode
(BOF to EOF)
- MD5_backward(): MD5 run the opposite
direction (EOF to BOF)
- [+] : XOR function,
bitwise not bytewise.
- I will just assume File =
ASCII(“1234567890”)
- Thus: MD5(File) =
e807f1fcf82d132f9bb018ca6738a19f (hex)
- A File 'read backward' = ASCII(“0987654321”), but is always
indicated as _backward() in the specification syntax used here.
- The code of MD5()
is not changed at all -- only the bytes are fed to from EOF to BOF.
- Thus: MD5_backward(File) =
6fb42da0e32e07b61c9f0251fe627a9c (hex)
- I have ignored the issue of Big-Endian and Little-Endian
here, as I don’t believe this binary math layer issue affects my
technical description at this level
Why the backwards option was
chosen
Running MD5 backward by itself is not an option. MD5_backward()
is
mathematically as weak to MD5 in its traditional forward direction.
- In some cases it may be advisable that
branch versions of this hash function be created that add CRC-32(File)
or CRC-64-ECMA(File) to MD5_backward()
after the hash function has been run -- in order to be unhelpful to
those seeking to find collisions. The exact math is
still being worked on here.
- Syntactically the CRC addition should
be reflected as such: MD5_backward_CRC32()
or MD5_backward_CRC64ECMA().
- Adding the CRC to the beginning or end
of the data stream is also an option, but probably riskier for some
uses of MD5 that have evolved over time.
What is not fixed
Rainbow Table Immunity: there is no real increased immunity for MD5BF
with respect to Rainbow Table immunity. So long as MD5 is asked to
create a hash for a number of bits smaller than the number of bits that
MD5 creates as output -- the Rainbow Talble weakenss will continue to
be a problem.
What is fixed
MD5BF is much more immune to the append_word_to_hash() -->
find_hash_collission() weakness in MD5. MD5BF may indeed push the
process of appending to a bytestrem to an existing bytestream to find a
hash collission into near computational impossiblity (with current
equipment).
Rainbow Table (or similar plaintext attacks) immunity for long
bytestreams: MD5BF should be considered immune to these kinds of
attacks up to some limit. Counterfitting MD5BF signitures should be
nearly impossible under normal circomstances.
The New MD5-PSB & MD5-PMD2
MD5-PSB (Plus 'hash' Sum Backward)
- MD5-PSB = {MD5(File) [+] MD5_backward(File)}
- Costs compared to standard MD5:
MD5 sum + running MD5 backward + bytewise or wordwise hash equivalncy
testing + binary addition (~2.05x of MD5)
- For small data sets, this should be no
more than 1%-5% extra cost, computationally.
- MD5() and MD5_backward() must be compared to see
if they are equal, using bytewise testing this means 16 iterations to
compare all 160 bits -- but using word testing this means only 8
iterations. This testing algorythm is highly optimisable. If the two
hashes are equal, an XORed mask of '101010101010....' (binary) should
be applied to the MD5 hash.
- Memory requirements: for embedded
systems, this could be done in under 4kb.
Demonstration of MD5-PSB
|
MD5(File)
=
|
e807f1fcf82d132f9bb018ca6738a19f
|
hex
|
|
[+]
|
|
xor
|
|
MD5_backward(File)
=
|
6fb42da0e32e07b61c9f0251fe627a9c
|
hex
|
|
Sum
=
|
|
|
MD5-PMD2 (Plus MD2)
- MD5-PMD2 = {MD5(File) [+]
Trim-to-MD5-length(MD2(File))}
- Costs compared to standard MD5: MD2
sum + trimming off MD2's + MD5 sum + binary addition.
- For small data sets, this should be no
more than 1%-5% extra cost, computationally.
Demonstration of MD5-PMD2
|
MD5(File)
=
|
e807f1fcf82d132f9bb018ca6738a19f
|
hex
|
|
[+]
|
|
xor
|
|
MD2(File)
=
|
116f4212fae29cf4a77e1be4f53dc551
|
hex
|
|
Sum
=
|
|
|
MD5-FB-SHA-Interop (SHA-0 and SHA-1 Bit Field Length Option)
- Take MD5()
[+] MD5_backward(), but using MD5() at the beginning of a SHAxxx length bit field and MD5_backward()
at the end of an SHAxxx length bit field.
Using padding of binary 0s in parts of the
data field that have no MD5 origin hash content bits.
- The beginning bits will come from
traditional MD5()
- The terminating bits will come from
MD5_backward()
- Some of the bits remaining will come
from the addition of MD5() [+] MD5_backward()
- Computational cost will probably be
less than running SHA on some devices depending on how the algorithm is
implemented
- This approach is
not designed to work with longer bit lengths longer than 160
bits, as the SHA hash family has many members that have longer bit
length fields.
Technical References
- Hash Functions: http://en.wikipedia.org/wiki/Hash_function
- Cryptographic Hash Functions: http://en.wikipedia.org/wiki/Cryptographic_hash_function
- MD5: http://en.wikipedia.org/wiki/MD5
- MD4: http://en.wikipedia.org/wiki/MD4
- MD2: http://en.wikipedia.org/wiki/MD2
- SHA Hash Functions: http://en.wikipedia.org/wiki/SHA
- SHA & MD Family replacements being
tested at NIST: http://ehash.iaik.tugraz.at/wiki/The_SHA-3_Zoo
- Checksums (CRCs):
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
- Paper on CRC collisions: http://apollo.backplane.com/matt/crc64.html
Date Created
16 OCT
2006
Last Revised
06 MAR
2007
15 NOV
2008
Created By
Max Power