Tuesday, May 22, 2007

Comparing Large Hash Sets Against the NSRL

I recently saw a post on a list I belong to asking about de-duplicating and de-NSRLing some files. The poster was trying to do this in a very popular forensic product, and after 4 days he still had nothing. Someone replied (I had thought the same thing) suggesting a SQL Server database for the job. If you are not that familiar with databases, though, this is not an easy task. It seemed like a good project to me. To start off, you need to accommodate a large amount of data, and it has to perform well (a bigger challenge than you may think).

The parameters for the project are:

1. The NSRL reference table will only hold one set of hash values (I chose MD5, but you could choose SHA1 or CRC).

2. Load the NSRL data in a timely manner.

3. Be able to add my own hash sets to compare against as well.

4. Use as much free software as possible.

5. Load my hashes to compare in a timely manner.

6. Compare my hashes in a timely manner.

7. Be able to easily report on and extract the known and unknown hash sets from what I loaded.

8. Work on both Windows and Linux (sorry, Mac).

I started off by using SQLite with a Perl script to load the NSRL data. I was able to load the data in approximately 1 hour, which I thought was pretty good for that amount of data and an embedded database, especially since you would only do this task perhaps once a quarter. The problem came next, when I tried to create an index on the table and it went out to lunch. After a couple of hours I knew I would have to come up with a different database solution.

I then looked at the free version of Oracle, Oracle Express Edition (I am pretty familiar with this database and it also has a Linux version, which is why I chose it over SQL Server). Here is where it starts to get hard, since the free version limits you to 4 GB of data. I installed it without a problem and started it up. It was using approximately 300 MB of memory, so anyone wanting to do this should probably have at least 1 GB of memory on their machine.

I next created some tablespaces, users and tables. I then used Oracle's SQL*Loader utility to load the data into the database and indexed the table. This took about 3.5 GB between the table and index (40,000,000+ rows). I then created a list of hashes from a previous examination using X-Ways Forensics version 13, loaded this data into the database (600,000+ rows), and created tables of known and unknown hashes for the examination. After trying many different things to make it fast and small, I finally settled on an approach along the following lines.
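As a rough sketch of the first piece (the table, column and file names here are illustrative; the downloadable scripts have the real ones), the staging table and its SQL*Loader control card look something like this:

    -- Staging table for the raw NSRL RDS rows (Oracle)
    CREATE TABLE nsrl_hash (
        md5       CHAR(32)      NOT NULL,
        file_name VARCHAR2(255),
        file_size NUMBER
    );

    -- nsrl_hash.ctl: load card for NSRLFile.txt
    -- (RDS column order: SHA-1, MD5, CRC32, FileName, FileSize,
    --  ProductCode, OpSystemCode, SpecialCode; skip the header row)
    OPTIONS (SKIP=1)
    LOAD DATA
    INFILE 'NSRLFile.txt'
    APPEND
    INTO TABLE nsrl_hash
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( sha1           FILLER,
      md5,
      crc32          FILLER,
      file_name,
      file_size,
      product_code   FILLER,
      op_system_code FILLER,
      special_code   FILLER )

Running sqlldr with that control card does the actual load, and a CREATE INDEX on the md5 column afterwards makes the lookups usable.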

The end result: the NSRL table is deduplicated from 40,000,000 rows down to 14,000,000+ rows, and from 3.5 GB (table and index) down to 1.2 GB (table and index), with a load time of approximately 36 minutes.
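The deduplication itself can be a simple CREATE TABLE AS SELECT against the staging table (again, the names are illustrative):

    -- Keep one row per MD5; NOLOGGING speeds up the bulk copy
    CREATE TABLE nsrl_md5 NOLOGGING AS
        SELECT DISTINCT md5 FROM nsrl_hash;

    CREATE INDEX nsrl_md5_idx ON nsrl_md5 (md5) NOLOGGING;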

My hash set was smaller than 500 MB and took approximately 5 minutes to load the 660,000+ rows and create the two tables (known hash set and unknown hash set). The known hashes table has approximately 46,000 rows, with the unknown hashes table having 604,000+ rows.
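The known/unknown split is then just two queries against the deduplicated NSRL table; a sketch, assuming the X-Ways export was loaded into a table called case_hash:

    -- Case hashes that appear in the NSRL (knowns)
    CREATE TABLE known_hash AS
        SELECT c.*
          FROM case_hash c
         WHERE EXISTS (SELECT 1 FROM nsrl_md5 n WHERE n.md5 = c.md5);

    -- Case hashes not in the NSRL (unknowns)
    CREATE TABLE unknown_hash AS
        SELECT c.*
          FROM case_hash c
         WHERE NOT EXISTS (SELECT 1 FROM nsrl_md5 n WHERE n.md5 = c.md5);

    -- Quick sanity check / report of the split
    SELECT 'known' AS category, COUNT(*) AS row_count FROM known_hash
    UNION ALL
    SELECT 'unknown', COUNT(*) FROM unknown_hash;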

Now I have uploaded the scripts here (sql and sqlload) along with batch files to run, so you can create your own little hash comparison system. There is an install.txt file to help you get started. Once you install Oracle Express and download the NSRL data you should be able to get going.

If you don't want to use MD5 as I did, just change the MD5 references to SHA1 or CRC and adjust the load cards to load only what you want. You can also change the hash set tables to whatever you want to load; just use what I supplied as a template for your modifications. With a little creativity you can create your own lists of knowns and unknowns and compare against those as well, again using the NSRL schema as a template.
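For example, switching the illustrative load card above from MD5 to SHA-1 is just a matter of swapping which field is kept and which are skipped (the table would need a sha1 CHAR(40) column instead of md5):

    OPTIONS (SKIP=1)
    LOAD DATA
    INFILE 'NSRLFile.txt'
    APPEND
    INTO TABLE nsrl_hash
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( sha1 CHAR(40),        -- now loaded instead of skipped
      md5            FILLER,
      crc32          FILLER,
      file_name,
      file_size,
      product_code   FILLER,
      op_system_code FILLER,
      special_code   FILLER )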

Looking back, I feel I accomplished everything I set out to do. It is fast: 41 minutes from start to finish if the NSRL is not already loaded, otherwise roughly 5 minutes for 660,000+ rows. It is a free solution. I can export the rows and create reports as well. Using Oracle Express I can run it on either the Windows or Linux platform, and since I do not use any GUI tools there are not too many modifications to make it work on either. I would love to hear your experiences with using this and what timings you get with your hash set comparisons.

Questions/Comments/Thoughts?

4 comments:

hogfly said...

Mark,
This sounds really interesting.

Can you detail the specs on the box you ran this on?

Anonymous said...

What kind of box was it?


Mark McKinnon said...

Hogfly,

Sorry I missed your post. I ran this on a Dell Latitude D610 laptop with 1 GB of memory, a 40 GB hard drive and a 1.8 GHz Pentium M processor.

Mark

Anonymous said...

You can potentially boost the performance by adding direct=true to the sqlldr command
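For example (illustrative user and control file names):

    sqlldr userid=nsrl/password control=nsrl_hash.ctl direct=true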

