I recently saw a post on a list I belong to asking about de-duplicating and de-NSRLing some files, that is, removing duplicates and filtering out files whose hashes appear in the NSRL (the NIST National Software Reference Library). The poster was trying to do this in a very popular forensic product, and after 4 days he still had nothing. Someone replied (I had thought the same thing) suggesting a SQL Server database for the job. If you are not that familiar with databases, this would not be an easy task, so I thought it would make a good project. To start off, the solution needs to accommodate a large amount of data and still perform well, which is a bigger challenge than you may think.
The parameters for the project are:
1. The NSRL reference table will hold only one set of hash values (I chose MD5, but you could choose SHA1 or CRC).
2. Load the NSRL data in a timely manner.
3. Be able to add my own hash sets to compare against as well.
4. Use as much free software as possible.
5. Load my hashes to compare in a timely manner.
6. Compare my hashes in a timely manner.
7. Be able to easily report and extract knowns and unknown hash sets from what I loaded.
8. Work on both Windows and Linux (Sorry Mac)
I started off by using SQLite with a Perl script to load the NSRL data. The load took approximately 1 hour, which I thought was pretty good for that amount of data and an embedded database, especially since you would only do this task about once a quarter. The problem came next, when I tried to create an index on the table and it went out to lunch. After a couple of hours I knew I would have to come up with a different database solution. I then looked at the free version of Oracle. I am pretty familiar with this database and it also has a Linux version, which is why I chose it over SQL Server. Here is where it starts to get hard, since the free version limits me to 4 GB of data. I installed it without a problem and started it up. It was using approximately 300 MB of memory, so anyone wanting to do this should probably have at least 1 GB of memory on their machine.
I next created some tablespaces, users and tables. I then used Oracle's SQL*Loader product to load the data into the database and indexed the table. This took about 3.5 GB between the index and table (40,000,000+ rows). I then created a list of hashes from a previous examination using X-Ways Forensics version 13, loaded it into the database (600,000+ rows), and created tables of known and unknown hashes for the examination. After trying many different things to make it fast and small, I finally came up with the following:
The NSRL table is deduplicated from 40,000,000+ rows down to 14,000,000+ rows, and from 3.5 GB (table and index) down to 1.2 GB (table and index), with a load time of approximately 36 minutes.
My hash set was smaller than 500 MB and took approximately 5 minutes to load the 660,000+ rows and create 2 tables (known hash set and unknown hash set). The known hashes table has approximately 46,000 rows and the unknown hashes table has 604,000+ rows.
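The deduplicate-then-compare idea above can be sketched in a few lines. This is only an illustration, not the Oracle scripts from this post: it uses Python's built-in sqlite3 module on a tiny in-memory data set, and the table names (nsrl_md5, case_hashes, known_hashes, unknown_hashes), the file names and the hash values are all made up for the example.

```python
import sqlite3


def build_tables(conn, nsrl_md5s, case_rows):
    """Load reference and case hashes, then split the case into known/unknown."""
    cur = conn.cursor()
    # PRIMARY KEY both indexes the column and rejects duplicate hashes.
    cur.execute("CREATE TABLE nsrl_md5 (md5 TEXT PRIMARY KEY)")
    cur.execute("CREATE TABLE case_hashes (md5 TEXT, file_name TEXT)")
    # INSERT OR IGNORE skips an MD5 that is already in the reference table,
    # which is what deduplicates the reference set.
    cur.executemany(
        "INSERT OR IGNORE INTO nsrl_md5 (md5) VALUES (?)",
        ((h.upper(),) for h in nsrl_md5s),
    )
    cur.executemany(
        "INSERT INTO case_hashes (md5, file_name) VALUES (?, ?)",
        ((h.upper(), name) for h, name in case_rows),
    )
    # Known = case hashes that appear in the reference table.
    cur.execute(
        "CREATE TABLE known_hashes AS "
        "SELECT c.md5, c.file_name FROM case_hashes c "
        "WHERE EXISTS (SELECT 1 FROM nsrl_md5 n WHERE n.md5 = c.md5)"
    )
    # Unknown = everything else from the case.
    cur.execute(
        "CREATE TABLE unknown_hashes AS "
        "SELECT c.md5, c.file_name FROM case_hashes c "
        "WHERE NOT EXISTS (SELECT 1 FROM nsrl_md5 n WHERE n.md5 = c.md5)"
    )
    conn.commit()


def count_rows(conn, table):
    # Table names are fixed by this sketch, never taken from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Made-up hashes: the reference list contains one duplicate.
    nsrl = ["a" * 32, "a" * 32, "b" * 32]
    case = [
        ("a" * 32, "known_file.exe"),
        ("c" * 32, "unknown1.bin"),
        ("d" * 32, "unknown2.bin"),
    ]
    build_tables(conn, nsrl, case)
    tables = ("nsrl_md5", "known_hashes", "unknown_hashes")
    return {t: count_rows(conn, t) for t in tables}


if __name__ == "__main__":
    print(demo())
```

The same two queries (an EXISTS and a NOT EXISTS against the indexed reference table) are the general shape of the comparison in any SQL database. How well they scale to 14,000,000+ reference rows depends on the database, as my SQLite experience above shows.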
I have uploaded the scripts here (sql and sqlload), along with batch files to run them, so you can create your own little hash comparison system. There is an install.txt file to help you get started. Once you install Oracle Express and download the NSRL data, you should be able to get going.
If you don't want to use MD5 as I did, just change the MD5 references to SHA1 or CRC and adjust the load cards to load only what you want. You can also change the hash set tables to whatever you want to load. Just use what I supplied as a template for your modifications. With a little creativity you can also create your own lists of knowns and unknowns and compare against those as well; just use the nsrl schema as a template.
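If you would rather pre-filter the NSRL file before loading it, picking a single hash column can be sketched as below. This is a rough Python illustration, not one of the scripts I uploaded. It assumes the NSRL file is comma-separated with a quoted header row naming the columns "SHA-1", "MD5" and "CRC32"; check the header of the file you downloaded before relying on those names. The sample rows are made up.

```python
import csv
import io

# Assumed header layout of the NSRL file; verify against your own download.
ASSUMED_HEADER = (
    '"SHA-1","MD5","CRC32","FileName","FileSize",'
    '"ProductCode","OpSystemCode","SpecialCode"'
)


def extract_hash_column(lines, column="MD5"):
    """Yield the chosen hash column, upper-cased, from NSRL-style CSV lines."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    for row in reader:
        yield row[column].upper()


# Made-up sample rows in the assumed layout (not real NSRL data).
SHA_A, MD5_A = "1" * 40, "A" * 32
SHA_B, MD5_B = "2" * 40, "B" * 32
SAMPLE = "\n".join(
    [
        ASSUMED_HEADER,
        f'"{SHA_A}","{MD5_A}","0000AAAA","example_a.dll",1024,1,"WIN",""',
        f'"{SHA_B}","{MD5_B}","0000BBBB","example_b.dll",2048,2,"WIN",""',
    ]
) + "\n"

if __name__ == "__main__":
    # Switch "MD5" to "SHA-1" or "CRC32" to build a different reference set.
    for value in extract_hash_column(io.StringIO(SAMPLE), "MD5"):
        print(value)
```

To use it on a real file, pass an open file object instead of the io.StringIO sample and write each value to a one-column text file for the loader.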
Looking back, I feel I accomplished everything I set out to. It is fast: 41 minutes from start to finish if I do not have the NSRL already loaded, otherwise roughly 5 minutes for 660,000+ rows. It is a free solution. I can export the rows and create reports as well. Using Oracle Express I can run it on either Windows or Linux, and since I do not use any GUI tools, few modifications are needed to make it work on either platform. I would love to hear your experiences with using this and what timings you get with your hash set comparisons.
Questions/Comments/Thoughts?
Tuesday, May 22, 2007
Comparing Large Hash sets Against NSRL.......
Monday, May 7, 2007
Thumbs DB Files
I received an email about a new product from InfinaDyne. It is called ThumbsDisplay, and it displays the contents of a Thumbs.db file. It will also do the following:
Cut and paste the picture to another application
Print 3 types of reports (contact sheet with all the pictures displayed, picture with date and time, full-size picture with date and time).
Scan the drive for all thumbs.db files.
You can also call the program with a thumbs.db file as a parameter and it will load that file into the viewer. This is really nice since you can then use it to view thumbs.db files from within other forensics programs, e.g. X-Ways Forensics. One of the best things about this program is the price, only $29.99. If you want to test drive it before you buy, they also have a demo version you can download.
The only drawback I see right now is that you can only print the reports; you can't save them. You need something like CutePDF installed to print the report to a PDF file. Maybe in a future release they will add this feature. Otherwise it seems like a great inexpensive tool to keep in the toolbox. And in case you are wondering, I did pay for my own copy of the program; I am not getting anything free here.
Thoughts/Comments/Questions.
Posted by Mark McKinnon at 12:30 PM
Labels: CutePDF, InfinaDyne, ThumbsDisplay, X-ways Forensics