The Hash Algorithm Dilemma–Hash Value Collisions

Article Posted: December 01, 2008

Digital evidence, like any other type of evidence, requires identification, collection, a chain of custody, examination and analysis, and finally authentication in court when it is presented to the trier of fact.

Following best practices, a forensic hash is used for identification, verification, and authentication of file data. A forensic hash is a form of checksum. A checksum is a mathematical calculation that, in its simplest form, adds up the bits in a data string and produces a value. MD5 (Message Digest 5) and SHA-1 (Secure Hash Algorithm 1) are more complex forms of checksum algorithms. Forensic hashing applies a mathematical function to the collected data, producing a hash value that serves as a practically unique identifier for the acquired data (similar to a DNA sequence or a fingerprint of the data). The algorithm computes a fixed-length value for a digital file: 128 bits for MD5 and 160 bits for SHA-1, usually written as a string of hexadecimal digits. Any change to the data will, for all practical purposes, result in a change to the hash value.

Both MD5 and SHA-1 are commonly used on forensic image files. The hash is normally computed during acquisition of the evidence, during verification of the forensic image (the duplicate of the evidence), and again at the end of the examination to confirm the integrity of the data and of the forensic processing. MD5 and SHA-1 hash values are also currently used to validate the integrity of downloaded files in information technology applications. The scientific and consumer communities accept them as confirmation that a downloaded file is the same, complete file that was requested.
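The acquire-then-verify cycle described above can be sketched in a few lines of Python. This is only an illustration built on the standard hashlib module, not the procedure of any particular forensic tool; the function names hash_file and verify_image, the chunk size, and the idea of passing in the hash values recorded at acquisition are assumptions made for the example.

```python
import hashlib


def hash_file(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at `path`.

    The file is read in chunks so that a large forensic image does not
    have to fit in memory. Both digests are updated in a single pass.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify_image(path, acquisition_md5, acquisition_sha1):
    """Re-hash the file and compare against the values recorded at acquisition.

    Hypothetical helper: returns True only if BOTH hash values still match.
    The comparison is case-insensitive because tools differ in whether
    they print hexadecimal digits in upper or lower case.
    """
    md5_hex, sha1_hex = hash_file(path)
    return (md5_hex == acquisition_md5.lower()
            and sha1_hex == acquisition_sha1.lower())
```

In use, an examiner would record the pair returned by hash_file at acquisition and call verify_image on the working copy before and after the examination. Changing even a single byte of the file makes the check fail.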

Recently, research and news reports have generated a great deal of discussion about hash algorithms and their validity for forensic uses. Over the past several years, MD5, the primary hash algorithm used in forensic applications, has been compromised for use in encryption, a cryptographic application of this mathematical process. The SHA-1 algorithm has been compromised on a theoretical level, but attempts to demonstrate that compromise in practice have not yet been successful. The question then becomes: how do these compromises affect the use of these algorithms in forensics?

When I testified recently, a defense attorney brought this subject up. The testimony went something like this:

Q. “Mr. Lewis, are you aware that the MD5 algorithm has been compromised?”
A. “Yes, I am.”

Q. “So, its use to authenticate evidence is no longer valid!”
A. “No, the use of the MD5 algorithm is still a valid function for authentication.”

Q. “Why is that?”
A. “There are multiple uses for hash algorithms. One is cryptography (encryption), another is identification, and another is authentication. In digital evidence forensics, we use hash algorithms for known file identification and evidence authentication, which differs from its use in encryption.”

The questions and answers went on while the eyes of the jury glazed over. At the conclusion of the trial, the jury told the District Attorney that this line of questioning had become too complex to follow and did not seem relevant to the case being tried.
