Digital Evidence, like any other type of evidence, requires identification, collection, a chain of custody, examination/analysis, and finally authentication in court during presentation to the trier of fact.

Following best practices, a forensic hash is used for identification, verification, and authentication of file data. A forensic hash is a form of a checksum. A checksum is a mathematical calculation, which in its simplest form, adds up the assorted bits in a data string and provides a value. MD5 (Message Digest 5) and SHA-1 (Secure Hash Algorithm 1) are more complex forms of checksum algorithms. A forensic hash is the process of using a mathematical function and applying it to the collected data, which results in a hash value that is a unique identifier for the acquired (collected) data (similar to a DNA sequence or a fingerprint of the data). When a hash algorithm is used, it computes a string of numbers for a digital file. Any change to the data will result in a change to the hash value. Both MD5 and SHA-1 algorithms are commonly used on forensic image files. The hash process is normally used during acquisition of the evidence, during verification of the forensic image (duplicate of the evidence), and again at the end of the examination to ensure the integrity of the data and forensic processing. MD5 and SHA-1 hash values are also currently used to validate the integrity of downloaded files in information technology applications. They have been accepted by the scientific and consumer community to confirm that the files that are downloaded are the same and complete files that are requested to be downloaded.

Recently, research and news has created a great deal of discussion about hash algorithms and their validity for forensic uses. Over the past several years, the primary hash algorithm used in forensic applications, MD5, has been compromised for use in encryption, a cryptographic use of this mathematical process. The SHA-1 algorithm has been compromised on a theoretical level and attempts proving the theoretical compromise have not yet been successful. The question is then asked, how do these compromises affect their use in forensics?

When I testified recently a defense attorney brought this subject up. The testimony went something like this.

Q. “Mr. Lewis, are you aware that the MD5 algorithm has been compromised?”
A. “Yes, I am.”

Q. “So, its use to authenticate evidence is no longer valid!”
A. “No, the use of the MD5 algorithm is still a valid function for authentication.”

Q. “Why is that?”
A. “There are multiple uses for hash algorithms. One is cryptography (encryption), another is identification, and another is authentication. In digital evidence forensics, we use hash algorithms for known file identification and evidence authentication, which differs from its use in encryption.”

The questions and answers went on while the eyes of the jury glazed over. At the conclusion of the trial, the jury provided feedback to the District Attorney, and indicated that this line of questioning got too complex for them to understand and did not seem relevant to the case being tried.

There appear to be three areas of concern regarding the use of hash algorithms; 1) encryption, 2) known file identification, and 3) file and/or data authentication. An algorithm is a set of rules for a mathematical process, and can be graphically represented by a flow chart. Mathematical processes with rules are predictable and repeatable. This is their inherent weakness. Using the predictability and targeting the weakness of the process, it can be influenced. When used in cryptography by manipulating files prior to the application of the algorithm, a “collision” can be caused. When two files can be influenced in similar blocks of data, they can be manipulated to result in a matching hash value.

The original MD5 attack research conducted by Professor Xiaoyun Wang and her co-authors in 2004, used files that needed to be exactly the same size in byte length. Later research by Stevens, Lenstra, and de Weger, using “chosen prefix” attacks, does not require this constraint, but the resulting files must be the same size after the application of the attack creating the collision. In a chosen prefix attack, data packets are appended to the files subjected to the attack. These data packets create the condition which results in the collision. The data packets are unique to compensate for the differences in the original files subjected to the attack. The files must be padded, having insignificant data bits added, to create the final length that makes the file byte size match.

Simple mathematical examples can illustrate the concept of creating a collision using a chosen prefix attack. When two uniquely different strings of data are calculated using our algorithm, they result in different values. In this illustration, our mathematical process (algorithm) is addition.

1+1=2 and 1+2=3

If we influence the data and append additional unique data to the string we can create a collision.

1+1+5=7 and 1+2+4=7

By influencing the mathematical process and adding data, we now have two matching results. We have caused a collision resulting in our data having the same value. If we change the mathematical process (algorithm) applied to the data, we will see that the data is not the same.

1x1=1 and 1x2=2

By changing the algorithm to multiplication, we will see a different unique result from the influenced data.

1x1x5=5 and 1x2x4=8

The compromise of the MD5 and SHA-1 hash algorithms is vastly more complex.

While it is possible to cause two files to have matching hash values, it is a complex process. The person creating the compromised files must have physical possession of the files to be altered. The affected/compromised files must be altered prior to the hash algorithm being run so that a matching hash value is produced. Research by Stevens, et al. has shown, in their vulnerability assessment, that a known hash value cannot be targeted to produce a duplicate hash of a known file. “We cannot target a given hash value, and produce a (meaningful) input bit string hashing to that given value… colliding files have to be specially prepared by the attacker… Existing files with a known hash that have not been prepared in this way are not vulnerable.” This is important in the use of hash sets to identify known files. Since the known file hash sets have already been created independently by the National Institute of Standards and Technology (NIST) in the National Software Reference Library (NSRL) Hash Sets. Additionally, known file filtering is utilized to identify known contraband files. The National Child Victim Identification Program (NCVIP) Hash Sets are created to identify victim images of child sexual exploitation. It is infeasible, if not impossible, to create a hash value of a contraband image, and have a known hash set filter it out of a case, in an attempt to hide it from an examiner.

Using the files produced by Stevens et al. in their MD5 collision research, I applied both MD5 and SHA-1 hash algorithms. The resulting MD5 hashes confirmed their successful collision. The resulting SHA-1 values, however, were unique and did not share the same colliding values. The use of two hash algorithms would identify files specifically created to collide. Examination of the data packets confirmed the unique data appended to the files. The appended data was easily identified at the end of the file structure, when compared to the original file. If it were possible to reinsert a manipulated file into a forensic image, a cascading effect would occur changing the hash values throughout its parent folders in the image ultimately creating a mismatch to the acquisition hash.

What does this mean to the forensic examiner? There are a few worst case scenarios heard from fellow examiners that turn out to be unfounded fears.

If criminals know creating a hash collision would prevent known file filter identification, why would they not preconfigure matching MD5 sum files for precisely that purpose? For example, if I have an extensive collection of contraband pornography, why not make those files perfectly match a duplicate set of innocuous picture files?

Even if criminals caused collisions with contraband files, that would not change the visual content of an illicit image, just prevent MD5 identification of the image.

I wouldn't be at all surprised to find out that some criminal group is working on creating a hash set to perfectly match the corresponding hash sets that Law Enforcement uses for identification and filtering.

It is highly improbable at this time because information from the cryptographic collision research indicates that it is not possible to construct a known hash value using this technique.

“We cannot target a given hash value, and produce a (meaningful) input bit string hashing to that given value. ...both colliding files have to be specially prepared... Existing files with a known hash that have not been prepared in this way are not vulnerable.” (Stevens et al.)

Additionally, the “criminal group” would need two things that would be difficult to obtain, the Law Enforcement Known File Hash Sets and all the physical files represented in those hash set databases. Using the Stevens et al. average of one collision every five minutes, it would take over 69.25 years to create an MD5 collision for the 14.5 million unique MD5 values in the NIST National Software Reference Library hash database.

So, in court, the prosecutor gets his expert to state on the record that the MD5 hash value proves that the file in question is contraband. Then the defense attorney brings in his expert, who shows the exact same MD5 hash value for a picture of a fire truck and boom! Your case just went bad–unless you have examined every file and identified those that were known contraband.

MD5 hash value can’t prove the file is contraband, and an expert should not conclude or opine that the MD5 hash value proves that it is. MD5 hash values can identify a known file that should then be visually verified. Yes, it is very time consuming, and following best practices is a part of our process in these cases.

So, you scatter a bunch of fakes among the real ones and then sit back and wait for the expert to testify.

This already occurs (there are “teen/child” websites that promote “real teens” and/or barely legal). That is why visual verification and resources such as National Center for Missing and Exploited Children (NCMEC) Child Victim Identification Program are critical for the investigators’ and examiners’ preparation prior to conclusions, reporting, and trial testimony.

Modern computing power makes has the potential to allow individuals to download a program to cause MD5 hash value collisions.

In 2007, it took researchers Stevens et al. just under two days to create the two file collisions for their MD5 chosen prefix attack. However, they did develop a theory that using a chosen prefix attack, a two file MD5 collisions could be preformed in an average of five minutes. It took the original researchers 800 days to prove their concept, with their first two file MD5 collision.

Wouldn’t it be trivial to take it to full automation where, in the matter of a few days or weeks, you’ve matched every known hash?

There are fatal flaws in the above scenario based on the information provided by the current research. However let’s take it a step further. We know that MD5 has been proven to have been shown vulnerable for cryptographic collisions. Finding a colliding message pair for SHA-1 hash values is one of the biggest challenges in practical cryptography. Due to recent progress of cryptanalytic methods, a distributed computing project to tackle the problem was begun in August 2007 and is currently using over 24 thousand computers to tackle the problem. They have yet to achieve their first SHA-1 collision.

There are a number of final ideas related to the limitations of the MD5 hash algorithm. Since the MD5 hash result is a 128 bit value, it has a finite limitation (it is restricted to the number of characters allowed). The value is made up of 32 hexadecimal characters. At some point in time, different files will share the same value.

Steve Mead of NIST put together a nice paper and presentation on the Viability of MD5 and SHA-1 Hashes, which explains the limitation. In it, he describes the number of possible MD5 values as 2^128, which is 1,700,000,000,000,000, 000,000,000,000,000,000,000,000 possible combinations. Statistically to have a 50 percent probability of a duplicate file hash, the number of unique files would need to reach 850,000,000,000,000,000,000,000,000,000,000,000,000. If we could use all the electronic storage media on earth to store unique data files, and duplicating that storage on all the planets in our solar system, we would not be able to make a dent in this number. Further, SHA-1 is a 160bit value resulting in a 40 character hexadecimal string. 2^160 is a much larger number than MD5’s 2^128.

A recent case with a 320 Gigabyte evidence hard drive that had 145.1 Gigabytes of stored data, contained 127,946 files. This quantity of files appears to be typical of cases encountered in a normal forensic exam, and is far less than the 2^128 number of possible hash values available with the MD5 hash algorithm. While this was a single computer, even a case involving a network with 100 similar computers would result in approximately 12.8 million files, still far less than the MD5 universe will allow.

There is a great analogy which I heard recently, putting these numbers into an understandable perspective. For use in file identification and authentication, there is a greater probability that single individual, from a twelve member jury, will win the Power Ball Lottery sixty days in a row, than an accidental occurrence of two matching MD5 hash values from files that have not been manipulated to collide. It appears that based on the research and the use of sound practices, MD5 and SHA-1 hash algorithms both have a long useful life in Digital and Multimedia Forensics.


Don L. Lewis is a Forensic Computer Analyst with the Lakewood, CO Police Department. Don began his Law Enforcement career in 1979 as a Crime Scene Photographer, and Photo Lab Technician. Don has been with the Lakewood PD for 19 years, the last six in computer forensics. Don provides consultation to individuals in law enforcement on the local and national level, trains personnel in conventional and digital imaging, analysis techniques and procedures. Don is currently the Vice Chairman for the Scientific Working Group for Digital Evidence (SWGDE). Don may be reached at