PostgreSQL and Fuzzy Hashes– a natural combo?

A while back, I had the idea of creating a hash of geometry in order to create a unique constraint on geometry to avoid duplicate geometries in a PostGIS table. It works. It’s slow, but it works. I haven’t used it since.

What I really wanted to do though was be able to detect if geometries were similar as well (not just detect identical geometries), to test if geometries are homologous. Wouldn’t it be nice to know how similar two shapes are?

A couple of nights ago, I was introduced to a technique that may just do that– fuzzy hashes. Now a hash is a clever way to convert any and everything (in a computer), regardless of size, into the same size number, say 128-byte integer which is unique to the particular input. This has all sorts of uses, from identifying viruses, to indexing information, to ensuring that when you download something from the internet, you are getting what you thought you were downloading in an unmodified form. The unique property of a hash is that, make a small change to the input, and the output hash is completely different from the hash of unchanged input. If you want to know if two things are similar, a hash won’t help. In fact, hashes of bad files are used in forensics, and work-arounds to hash-based forensics can be as simple as changing a single byte in a file.

Enter fuzzy hashes. In short, fuzzy hashes break up a file into logical pieces or blocks, hash each block, and then similarity between files can be determined block-by-block. This is useful for file forensics, anti-virus, determining the origins of nefarious code, detecting plagiarism etc. In long, well… just read this. Also, check out ssdeep.

The question is, can we use this the geographic sphere. How would this be useful in, e.g. PostGIS?

Let’s give it a try:

If we convert our purple line above to text in PostGIS:

SELECT ST_AsEWKT(the_geom) FROM trail_test;

we get something like this:

"SRID=3734;MULTILINESTRING((2179985.47870642 567446.852705703 1030.80067171231 -1.79769313486232e+308,2179982.97870642 567445.252427514 1030.79452346621 -1.79769313486232e+308,2179980.47870642 567443.652149326 1030.74616711558 -1.79769313486232e+308,2179977.97870642 567442.051871137 1030.61640858078 -1.79769313486232e+308,2179975.47870642 567440.451592949 1030.50568889286

…

-1.79769313486232e+308,2179794.15034441 566393.137591605 1051.18052872389 -1.79769313486232e+308,2179794.04164876 566390.637591605 1051.46559305229 -1.79769313486232e+308,2179793.93295311 566388.137591605 1051.74478920181 -1.79769313486232e+308,2179793.82425746 566385.637591605 1052.01504940301 -1.79769313486232e+308,2179793.73503707 566383.585522703 1052.23441454273 -1.79769313486232e+308))"

Which fuzzy hashed looks like this:

# ssdeep -b temp ssdeep,1.0--blocksize:hash:hash,filename 384:RpLQqsmoBbzjvPW7KEhoAyoGchYqpmeYOB7LCfCMJrOevT1WLwR7UyYIU:7Eq5oBbzjvPWuAyoGcSqIUNLCfCMJ6eq,"temp"

If we change some vertices and rerun, but this time change the ssdeep command to tell us similarity:

Create a hash of the first geometry and write it to file:
# ssdeep -b temp > temphash

then to compare the hash of the second file:
# ssdeep -bm temphash temp1 temp1 matches temp (99)

Not surprisingly, they are 99% similar.
Now this is a painful way to do this comparison. However, imagine a function in the PostGIS suite wherein you could do a geometry-by-geometry similarity comparison for tables. Imagine combining that analysis with a snap-to-grid function to further generalize it’s applicability.

Any thoughts on uses? I know I’d use it to determine whether two data sources were related, i.e. were modified from the same parent, but I’m guessing there are some other more subtle applications.

2 thoughts on “PostgreSQL and Fuzzy Hashes– a natural combo?”

Pingback: Looping Trails– alternate routing for recreational use « Smathermather's Weblog
jo (@nejo) says:

August 12, 2011 at 6:52 am

This is exactly what I’m also looking for. I’ve got some database with content that is similar but not 100%. And i want to deduplicate this content when i query the database (full text search). With fuzzy hashes deduplicating these queries would be fast and easy. It’s actually how Solr does it’s deduplication.

Smathermather's Weblog

Remote Sensing, GIS, Ecology, and Oddball Techniques

PostgreSQL and Fuzzy Hashes– a natural combo?

2 thoughts on “PostgreSQL and Fuzzy Hashes– a natural combo?”

Leave a comment Cancel reply

Share this:

Related

2 thoughts on “PostgreSQL and Fuzzy Hashes– a natural combo?”

Leave a comment Cancel reply