Tuesday, May 22, 2007

Comparing Large Hash sets Against NSRL.......

I recently saw a post on a list I belong to asking about DeDuplicating and DeNSRLing some files. He was trying to do this in a very popular forensic product and after 4 days he still had nothing. Someone replied (I had thought the same thing) about using a SQL Server database to do this. Now if you are not that familiar with using databases then this would not be an easy task. Thinking about this I thought it would make a good project. To start off you first need to accommodate a large amount of data and it should perform well (that is a bigger challenge then you may think).

The parameters for the project are:

1. The NSRL reference table will only hold 1 set of hash values (I chose MD5 to use but you could choose SHA1 or CRC).

2. Load the NSRL data in a timely manner.

3. Be able to add my own hash sets to compare against as well.

4. Use as much free software as possible.

5. Load my hashs to compare in a timely manner.

6. Compare my hashs in a timely manner.

7. Be able to easily report and extract knowns and unknown hash sets from what I loaded.

8. Work on both Windows and Linux (Sorry Mac)

I started off by using SQLite with a perl script to load the NSRL data. I was able to load the NSRL data in aprox 1 hour which for the amount of data and an embedded database I thought was pretty good as well as you would only do this task possibly once a quarter. The problem came next when I tried to create an index on the table and it went out to lunch. After a couple of hours I knew I would have to come up with a different database solution. I then looked at the free version of Oracle (I am pretty familiar with this database and it also has a Linux version, that is why I chose it over SQL Server), now here is where it starts to get hard since I am limited to only having 4GB of data in the free version. I installed it without a problem and started it up. It was using aprox 300M of memory so for anyone out there wanting to do this you should probably have 1gb of memory on your machine.

I next started to create some tablespaces, users and tables. I then used Oracle's SQL Loader product to load the data into the database and then indexed the table. This took about 3.5 GB between the index and table (40,000,000+ rows). I then created a list of hashs from a previous examination that using x-ways forensics version 13. I then loaded this data into the database (600,000+ rows) and then created a table of known and unknown hashs for the examination. After trying many different things to make it fast and small I finally came up with the following:

NSRL table is deduplicated from 40,000,000 rows down to 14,000,000+ rows and from 3.5 GB (table and index) down to 1.2gb (table and index) with a load time of aprox 36 minutes.

My hash set was smaller then 500m and took aprox 5 minutes to load the 660,000+ rows and create 2 tables (known hash set and unknown hash set). The known hashs table has aprox 46,000 rows with the unknown hashs tables having 604,000+ rows.

Now I have uploaded the scripts here (sql and sqlload) and batch files to run to create your own little hash comparison system. There is a install.txt file to help you get started. Once you install Oracle Express and download the NSRL data you should be able to get started.

If you don't want to use the MD5 that I did then just change the MD5 references to SHA1 or CRC and then the load cards to only load what you want. You can also change the hash set tables to what ever you want to load. Just use what I supplied as a template to make your modifications. With a little creativity you can also create your own list of knowns and unknowns and use these to compare against as well, just use the nsrl schema as a template.

Now looking back I feel I accomplished everything I set out to. It is fast, 41 minutes from start to finish if I do not have the NSRL already loaded, otherwise it takes roughly 5 minutes for 660,000+ rows. It is a free solution. I can now export the rows, create reports as well. Using Oracle Express I can run it on either Windows or Linux platform and since I do not use any gui tools there are not too many modifications to make it work on either platform. I would love to hear your experiences with using this and what timing's you get with your hash set comparisons.



hogfly said...

This sounds really interesting.

Can you detail the specs on the box you ran this on?

cell phone number search said...

what kind of box was it?

Anonymous said...

now I got it, thanks!

Cell Phone Number Search

Mark McKinnon said...


Sorry I missed your post. I ran this on a Dell D610 Latitude laptop with 1gb memory, 40gb hd and 1.8 ghz pentium M processor.


homeopathy eczema said...

black mold exposureblack mold symptoms of exposurewrought iron garden gatesiron garden gates find them herefine thin hair hairstylessearch hair styles for fine thin hairnight vision binocularsbuy night vision binocularslipitor reactionslipitor allergic reactionsluxury beach resort in the philippines

afordable beach resorts in the philippineshomeopathy for eczema.baby eczema.save big with great mineral makeup bargainsmineral makeup wholesalersprodam iphone Apple prodam iphone prahacect iphone manualmanual for P 168 iphonefero 52 binocularsnight vision Fero 52 binocularsThe best night vision binoculars here

night vision binoculars bargainsfree photo albums computer programsfree software to make photo albumsfree tax formsprintable tax forms for free craftmatic air bedcraftmatic air bed adjustable info hereboyd air bedboyd night air bed lowest pricefind air beds in wisconsinbest air beds in wisconsincloud air beds

best cloud inflatable air bedssealy air beds portableportables air bedsrv luggage racksaluminum made rv luggage racksair bed raisedbest form raised air bedsaircraft support equipmentsbest support equipments for aircraftsbed air informercialsbest informercials bed airmattress sized air beds

bestair bed mattress antique doorknobsantique doorknob identification tipsdvd player troubleshootingtroubleshooting with the dvd playerflat panel television lcd vs plasmaflat panel lcd television versus plasma pic the bestThe causes of economic recessionwhat are the causes of economic recessionadjustable bed air foam The best bed air foam

hoof prints antique equestrian printsantique hoof prints equestrian printsBuy air bedadjustablebuy the best adjustable air bedsair beds canadian storesCanadian stores for air beds

migraine causemigraine treatments floridaflorida headache clinicdrying dessicantair drying dessicantdessicant air dryerpediatric asthmaasthma specialistasthma children specialistcarpet cleaning dallas txcarpet cleaners dallascarpet cleaning dallas

vero beach vacationvero beach vacationsbeach vacation homes veroms beach vacationsms beach vacationms beach condosmaui beach vacationmaui beach vacationsmaui beach clubbeach vacationsyour beach vacationscheap beach vacations

bob hairstylebob haircutsbob layeredpob hairstylebobbedclassic bobCare for Curly HairTips for Curly Haircurly hair12r 22.5 best pricetires truck bustires 12r 22.5

washington new housenew house houstonnew house san antonionew house venturanew houston house houston house txstains removal dyestains removal clothesstains removalteeth whiteningteeth whiteningbright teeth

jennifer grey nosejennifer nose jobscalebrities nose jobsWomen with Big NosesWomen hairstylesBig Nose Women, hairstyles

Anonymous said...

You can potentially boost the performance by adding direct=true to the sqlldr command

Anonymous said...

The first misconception that wannabes might invariably have is Hogan could be marginally harder than mountain trekking. If Hogan scarpe donna was the case, Hogan uomo need not have to be stiffer and flexible simultaneously. Add to Hogan scarpe uomo the extra mid-sole cushion to make you forget the heavy thumping downhill running.

pay per head said...

It is nice to find a site about my interest. My first visit to your site is been a big help