Smathermather's Weblog

Remote Sensing, GIS, Ecology, and Oddball Techniques

Posts Tagged ‘Database Optimization’

Efficient delivery of raster data part 4

Posted by smathermather on May 1, 2016

Guest post from my colleague Patrick Lorch, who wrote up what we did the other day in order to view a whole bunch of tiled images in a directory in QGIS (I did some mild editing to his post; mistakes are mine). The great thing about the approach is that it generalizes to most tools that use GDAL for their raster API. This is part of a series. You can view this series in reverse with this post:

Building virtual datasets from a bunch of tiffs
What do you do when someone gives you aerial images stored as tiles in different directories representing different zoom levels? The goal is to make them easy to use as base layers in QGIS. The answer is to reference them in a virtual dataset (VRT).

gdalbuildvrt is the ticket
First make lists of tiffs
If the directory structure is something like this:

total 8047
drwxr-xr-x 7 pdl    4096 Apr 29 14:22 ./
drwxr-xr-x 5 pdl    4096 Apr 27 22:10 ../
drwxr-xr-x 2 pdl 1310720 Apr 22 21:37 0/
drwxr-xr-x 2 pdl  393216 Apr 22 22:54 1/
drwxr-xr-x 2 pdl   98304 Apr 28 14:44 2/
drwxr-xr-x 2 pdl   32768 Apr 28 14:44 3/
drwxr-xr-x 2 pdl    8192 Apr 28 14:44 4/

You first need a set of list files, one listing the tiffs in each directory.

ls 0/*.tif > list0.txt
ls 1/*.tif > list1.txt
ls 2/*.tif > list2.txt
ls 3/*.tif > list3.txt
ls 4/*.tif > list4.txt

Now make the VRTs:

gdalbuildvrt -input_file_list list0.txt aerial_2015_0.vrt
gdalbuildvrt -input_file_list list1.txt aerial_2015_1.vrt
gdalbuildvrt -input_file_list list2.txt aerial_2015_2.vrt
gdalbuildvrt -input_file_list list3.txt aerial_2015_3.vrt
gdalbuildvrt -input_file_list list4.txt aerial_2015_4.vrt

Now you can open these in QGIS depending on what zoom level you need.

These VRTs may now be loaded as ordinary rasters in QGIS or whatever you please. In this case, we retiled with multiple resample levels (see this post for more info), so we’ll have to define the minimum/maximum scale ranges at which each image collection is visible.

Thanks for the write-up, Pat!

Posted in GDAL

(Whichever tiler you use) and efficient delivery of raster data (image pyramid layer) (update2)

Posted by smathermather on April 15, 2016

Subdivision of geographic data is a panacea for problems you didn’t know you had.

Maybe you deal with vector data, so you pre-tile your vector data to ship to the browser to render– you’re makin’ smaller data. Maybe you use cutting-edge PostGIS, so you apply ST_Subdivide to keep your data smaller than the database page size, like Paul Ramsey describes here. Smaller’s better… . Or perhaps you are forever reprojecting your data in strange ways, across problematic boundaries, or need to buffer in an optimal coordinate system to avoid distortion. Regardless of the reason, smaller is better.

Maybe you aren’t doing vector work, but raster instead. What’s the equivalent tiling process? I wrote about this for GeoServer almost 5 (eep!) years ago now (with a slightly more recent follow-up), and much of what I wrote still applies:

  • Pre-tile your raw data in modest chunks.
  • Use GeoTIFF so you can use its internal data structures to have even smaller tiles inside your tiles.
  • Create pyramids / pre-summarized data as tiles too.

Fortunately, while these posts were written for GeoServer, they apply to any tiler. Pre-process with gdal_retile:

gdal_retile.py -v -r bilinear -levels 4 -ps 6144 6144 -co "TILED=YES" -co "BLOCKXSIZE=256" -co "BLOCKYSIZE=256" -s_srs EPSG:3734 -targetDir aerial_2011 --optfile list.txt

Let’s break this down a little:

First we choose our resampling method for our pyramids (bilinear). Lanczos would also be fine here.

-r bilinear

Next we set the number of resampling levels. Each level is downsampled by a factor of two from the previous one, so 4 levels gives overviews down to 1/16th of the source resolution. How many levels you need will depend on the size of the dataset.

-levels 4

Next we specify the pixel and line size of the output GeoTIFF. This can be pretty large. We probably want to avoid a size that forces the use of BigTIFF (i.e., anything over 4GB).

-ps 6144 6144
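To put a number on that: assuming a typical 3-band, 8-bit aerial image, a 6144×6144 tile is roughly 6144 × 6144 × 3 ≈ 113 MB uncompressed, comfortably below the 4 GB point where BigTIFF becomes necessary.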

Now we get into the GeoTIFF data structure: we internally tile the TIFFs and make the internal tiles 256×256 pixels. We could also choose 512. We’re just aiming to have our tile size near the size that we are going to send to the browser.

-co "TILED=YES" -co "BLOCKXSIZE=256" -co "BLOCKYSIZE=256"

Finally, we specify our coordinate system (this is State Plane Ohio), our output directory (which needs to be created ahead of time), and our input file list.

-s_srs EPSG:3734 -targetDir aerial_2011 --optfile list.txt

That’s it. Now you have a highly optimized raster dataset that can:

  • serve the level of detail necessary for a given request,
  • and extract only the data necessary for a given request.

Pretty much any geospatial solution that uses GDAL can leverage this work to make for very fast rendering of raster data to a tile cache. If space is an issue, apply compression options that match your use case (see the sketch below).
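For example, a sketch of the same retile command with lossless DEFLATE compression added as a creation option (JPEG would save considerably more space for aerial imagery, at the cost of being lossy):

gdal_retile.py -v -r bilinear -levels 4 -ps 6144 6144 -co "TILED=YES" -co "BLOCKXSIZE=256" -co "BLOCKYSIZE=256" -co "COMPRESS=DEFLATE" -s_srs EPSG:3734 -targetDir aerial_2011 --optfile list.txt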

Posted in GDAL

    Cartography and USGS — Fake Building Footprints in PostGIS now with distance operator (part 2)

    Posted by smathermather on May 17, 2012

    In a previous couple of posts (this one, and this one), we dealt with point rotations, first with bounding box searches, and then with nominal use of operators.

First we create a function to do our angle calculation, then use a SELECT to loop through all the records and apply it.

Within our function, we first find the (in this case) five nearest streets using the <#> distance operator, which is based on bounding boxes, then we further order by ST_Distance with a LIMIT 1 to get just the closest street. By the way, this is about twice as fast as the last approach.

    And our new code:

    CREATE OR REPLACE FUNCTION angle_to_street (geometry) RETURNS double precision AS $$
    
    WITH index_query as
                    (SELECT ST_Distance($1,road.geom) as dist,
                                    degrees(ST_Azimuth($1, ST_ClosestPoint(road.geom, $1))) as azimuth
                    FROM street_centerlines As road
                    ORDER BY $1 <#> road.geom limit 5)
    
    SELECT azimuth
                    FROM index_query
                    ORDER BY dist
    LIMIT 1;
    
    $$ LANGUAGE SQL;
    
    DROP TABLE IF EXISTS address_points_rot;
    CREATE TABLE address_points_rot AS
    
                    SELECT addr.*, angle_to_street(addr.geom)
                    FROM
                                    address_points addr;
    

    I have to credit Alexandre Neto with much of the code for this solution.
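One prerequisite worth making explicit: the <#> bounding-box distance operator only gets its speed from a spatial index on the street geometries, so make sure one exists. A minimal sketch, using the table and column from the function above (the index name is arbitrary):

CREATE INDEX street_centerlines_geom_idx ON street_centerlines USING GIST (geom);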

Posted in Analysis, Database, Database Optimization, PostGIS, PostgreSQL, SQL

    GeoServer and efficient delivery of raster data (image pyramid layer) (update)

    Posted by smathermather on May 11, 2012

A perennial favorite on this blog is “GeoServer and efficient delivery of raster data (image pyramid layer)”. I am neither the first nor the last authority on this topic (check the GeoSolutions blog for authoritative work on GeoServer and rasters, and also look to the GeoServer documentation), but I’ve had some good experiences with serving rasters in GeoServer, especially using image pyramid layers.

Read the original, as this post just augments it, but here are some targets to hit with the retiling necessary for larger datasets. This is the command I currently use for the retiling:

    
    gdal_retile.py -v -r bilinear -levels 4 -ps 6144 6144 -co "TILED=YES" -co "BLOCKXSIZE=256" -co "BLOCKYSIZE=256" -s_srs  EPSG:3734 -targetDir aerial_2011 --optfile list.txt
    
    

Don’t be afraid of big block sizes. Bump your memory up in your application container, stop worrying, and learn to love the larger TIFF. I keep my total number of output TIFFs to no more than about 2000; beyond that I start to see performance issues in my implementation.
Also, give the image pyramid code a break: gdal_retile leaves the base-resolution tiles loose in the target directory alongside the numbered level subdirectories it creates, so move them into their own 0/ directory. After retiling, do this:

    mkdir 0
    mv *.tif 0/.
    

Posted in GeoServer, GeoWebCache

    Dialog– What qualifies as a benchmark? Part 1

    Posted by smathermather on July 25, 2011

Normally, my blog is a bit of a monologue.  That’s not a bad thing, but it can be a little lonely.  Every now and then I get some great (and often skeptical) comments, which enhance things considerably.

    What follows is some dialog about the LiDAR shootout series, largely between Etienne and Pierre, posted with their permission:

    Pierre:

    “Etienne, Stephen,

    “I really appreciate the benchmark work you are doing to compare different technologies to solve your lidar data processing problem.

“However, if you want your tests to have as much credibility as possible, and especially if you wish to publish the results on the web, I would recommend that you compare horses with horses, dogs with dogs, general vector technologies with general vector technologies, raster technologies with raster technologies, specialized lidar technologies with specialized lidar technologies. Otherwise the dog will always lose against the horses. You have to be clear on that and explain well the differences and the pros and cons between all the tech you will be using in your benchmark. Are you doing the test in ArcGIS using ESRI lidar technology (or a general vector one)? Same question for R and PostGIS Raster (in the last case it is clear that the answer is no). In your graph you are comparing very different kinds of animals without giving much chance to the dogs.

“For every technology you will also have to test different approaches in order to find the fastest one, so it can be compared with the fastest approach in the other technologies. Comparing a slow approach in one tech with the fastest in another would be a bit unfair. This is what we did for PostGIS Raster. Did you do the same with R and ArcGIS?

“Basically I think it is just unfair to compare ArcGIS, R and PostGIS Raster with LAStools for working with lidar data, especially if you are not using the ArcGIS and R packages for dealing with lidar. Tell us if you did not use them. We will all learn something.”

    Etienne:

“As you said, the point is to compare different strategies to get the job done. That’s it. It’s not a question of being fair or not; we report facts. Maybe the bottom line is that PostGIS Raster shouldn’t be used to extract lidar height. However, it can be used for other things. Well, we all learned something.

“It also emphasizes the fact that there is no point_cloud type in PostGIS. I guess this could improve the PostGIS results a lot, as it seems handling point clouds might be really fast. I learned that.

“Each solution has its drawbacks: for ArcGIS I think we’ll all agree about the cost of the license, and furthermore the cost of the lidar plugin. LAStools requires a license for commercial use (which is really low, btw, compared to ArcGIS + Spatial Analyst + LP360). PostGIS Raster requires a Postgres installation and some SQL knowledge, etc. R requires learning a new language, etc.

“Now I must acknowledge, the tests we did were on only 500k points. That could be considered a lot, or not. When dealing with lidar, one could easily expect more than a thousand times this quantity. But it was the limit imposed by the LP360 version I used, and also a reasonable size for benchmarking. Note that PostGIS Raster was really stable in its timing (whatever the size).

“Finally, please note that Martin’s approach is new and he just released the tool a week ago. It is, to the best of my knowledge, the first one to do it.

“Pierre, explain your point better; I’m not sure I get it (or I disagree, as you can see).”

    Pierre:
“My point is just that you have to be clear in your post about the fact that LAStools is a technology developed explicitly for dealing with lidar data. PostGIS Raster is not, and I guess you are not using ESRI lidar technology, and I have no idea how you do it with R. The point is to compare potatoes with potatoes.

“If I were on the ESRI side I would quickly post a comment on the post saying “Did you read this article?” http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Estimating_forest_canopy_density_and_height/00q8000000v3000000/

“I don’t know how you did it in ArcGIS, but it seems that they are storing lidar data as multipoints.

“A good benchmark should spell all this out.

    Etienne (Pierre > quoted):

> If I were on the ESRI side I would quickly post a comment on the post saying “Did you read this article?”

    > http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Estimating_forest_canopy_density_and_height/00q8000000v3000000/
“I would reply: apart from the fact that it uses lidar data, it has nothing to do with the subject of the present article 🙂 But I’d be happy to discover a better way than what I did (for all the tools, btw).

“Maybe you should (re-)read the post; it’s written there. For R as well. Maybe, Steve, you should clarify (this is my fault) that for ArcGIS I used LP360 (an evaluation copy) to load the lidar data, which was limited to 500k points. But it wasn’t involved in the computation, as I used (as written in the post) «ExtractValuesToPoints».”

    Pierre:

> So it’s a good benchmark?

    “I just think it would be more fair to create three test categories:

    “1) Lidar technologies (apparently your ArcGIS test should be in this category)

“2) Vector/raster technologies (why not use non-lidar techs if lidar techs are so much better?)

    “3) Nearest neighbor technique using vector only technology

    “I won’t comment further…

    Etienne:

“I personally would not encourage these categories, as there is some confusion (at least in my mind) about the definition of what a «LiDAR technology» is. However, I think it would be a good idea to nuance the conclusions, which I think will be done in further posts (or not). It’s also pretty clear from the RMSE measures what the main drawback of each solution is (along with speed).

    “I’m still open for discussion…

    Stephen (me):

    “Hi Pierre,
“I agree with you in principle, but ArcGIS aside, there are some very real opportunities for optimization of point-in-raster analyses with tiling, indexing, etc.  In other words, I don’t see any practical reason why PostGIS Raster couldn’t be (nearly) as fast as an NN technique, so long as the tile size is optimized for the density of the analysis in question.  The only thing that makes it a dog is its age as a technology.

    (moments later after discovering the lengthy dialog…)

    “It looks like I missed all the discussion here…

“The reason such benchmarks are interesting is (in part) that they are not fair.  PostGIS isn’t optimized for LiDAR analyses.  PostGIS Raster isn’t optimized for LiDAR analyses.  PostGIS Raster is a very young technology, which really isn’t fair either.  The PostGIS development folks know this, and were interested in seeing my initial benchmarks with NN analyses for LiDAR height (last year) because they are aware of the optimizations that are not yet there for handling point clouds.  Vector and raster analyses (as we’ve run them here) should not, in principle or practice, be substantially different in speed, so long as the query analyzer maximizes the efficiency of the techniques.  Add in multipoint optimizations and other voodoo, and well, that’s a different story.  Another aspect to this is that comparing LAStools to the rest is unfair not just because it’s optimized for lidar but because it’s working at a much lower level (file level) than the abstractions a database provides.  The trade-off, of course, is that a database can provide transaction support, a standardized language, automated query analyzers, etc., that don’t apply to file-level processing.

    “What I see here, categories aside, is a need for the other 2-3 parts of this 3-4 part series– the breakdown of which techniques are used, how they’re implemented, and the rationale behind them.  Don’t worry, we’ll get there.  That’s the danger of starting with the punchline, but we’ll build the nuance in.

     

    So, there we are.  Looks like I’m committed to building this out with a little more nuance.  🙂  Stay tuned.

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL

    LiDAR Shootout! — New Chart, Final Results

    Posted by smathermather on July 21, 2011

    Comparison of lidar height calculations

In reviewing the final numbers and charts from Etienne and Pierre, the results above are what we see.  The only revision is a moderate increase in speed for the PostGIS Raster query.

Final results in speed for LAStools– ~350,000 points per second.  In other words– off-the-charts fast.  And the initial RMSE of ~25 feet was a mistake– it is probably closer to 0.2 feet.

    Stay tuned for detailed reviews of these techniques (with code).

     

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL

    LiDAR Shootout!

    Posted by smathermather on July 18, 2011

For a couple of months now I’ve been corresponding with Etienne Racine and Pierre Racine of Laval University in Quebec City.  They decided to take on the problem of finding the speed and accuracy of a number of different techniques for extracting canopy height from LiDAR data.  They have been kind enough to allow me to post the results here.  This will be a multipart series.  We’ll start with the punchline, and then build out the code behind each of the queries/calculations in turn (the category will be “LiDAR Shootout“) with separate posts.

    In the hopper for testing were essentially 4 different ways to extract canopy height from a LiDAR point cloud and (in three of the cases) a DEM extracted from the LiDAR point cloud.  The 4 techniques were as follows:

• Use ArcGIS’s ExtractValuesToPoints.
• Use R, with the raster and rgdal packages and the raster::extract() function, with in-memory results.
• Use the PostGIS (2.0 svn) Raster datatype (formerly WKT Raster).
• And use a simple nearest-neighbor approach with ST_DWithin in PostGIS (sketched below).
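To make that last approach concrete, here is a minimal sketch of the idea (the table and column names are hypothetical, not the ones used in the benchmark): for each non-ground point, find the nearest ground point within a search radius and subtract the elevations.

SELECT v.gid,
       v.z - (SELECT g.z
              FROM lidar_ground g
              WHERE ST_DWithin(v.geom, g.geom, 10)
              ORDER BY ST_Distance(v.geom, g.geom)
              LIMIT 1) AS canopy_height
FROM lidar_veg v;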

The data are from the Ohio Statewide Imagery Program (OSIP), run by the Ohio Geographically Referenced Information Program (OGRIP).  Ohio is blessed with an excellent 2006-2008 LiDAR dataset and a statewide DEM derived from the (effectively) 3.5-foot horizontal posting data (the specs may say 5 ft, I can’t remember…).

    Speeeed:

So, first to speed and the initial results.  Who wins in the speed category?  Measured as points per second (consistently on the same hardware), nearest neighbor wins by a landslide (bigger is better here):

    Application:  Points per Second

    ArcGIS:  4504
    R:  3228
    PostGIS Raster:  1451 (maybe as high as 4500)
PostGIS NN:  18208

Now, later on, Pierre discovered that changes to indexing may help, and an optimization suggested by the EXPLAIN query analyzer tripled the PostGIS Raster query speed, making it about the same speed as ArcGIS.  More on that later.

    Figure removed– to be replaced in another post

    Accuracy:

    A fast calculation is always better, unless you trade off accuracy in some detrimental way.  With PostGIS NN, we’re not interpolating our LiDAR ground point cloud before calculating the difference, so relative to the other three techniques, we could be introducing some bias/artifacts here, a point Jeff Evans makes here.  Overall error relative to the interpolated solution introduced by using an NN technique on this dataset: 0.268 feet.  Whew!  Sigh of relief for me!  We may spend some time looking at the geographic distribution of that error to see if it’s random or otherwise, but that’s very near the error level for the input data.

    Addendum:

8 days ago, Martin Isenburg let the lasheight tool drop.  LAStools is impressively fast.  Preliminary tests by Etienne: 140450 points per second.

Oy!  A factor of 8 faster than PostGIS NN.  And it uses an on-the-fly DEM calculated using Delaunay triangulation.  Preliminary results indicate a very different DEM than the one calculated by the state:

    RMSE: 25.4227446600656

More research will be needed to find out the source of the differences: they are not random.

    In other news:

PostGIS will get a true KNN technique in the near future.  Such a project is funded, so stay tuned for faster results:

    http://www.postgis.org/pipermail/postgis-devel/2011-June/013931.html

Also, index changes to PostGIS could come down the pike, if funded, that would result in bounding-box queries that are 6 times faster, à la:

    http://blog.opengeo.org/2011/05/27/pgcon-notes-3/

Between the two optimizations, we could give LAStools a run for its money on speed 🙂 .  In the meantime, preliminary RMSE aside, LAStools (the 5th tool, not part of the original test) may win hands down.

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL

    Contours– Symbolized from a single table

    Posted by smathermather on July 15, 2011

    In a previous post, I restructured the contour data for display in GeoServer, e.g.:

    
    UPDATE base.contours_2
    	SET 	div_10 = CAST( contours_2.elevation % 10 AS BOOLEAN ),
    		div_20 = CAST( contours_2.elevation % 20 AS BOOLEAN ),
    		div_50 = CAST( contours_2.elevation % 50 AS BOOLEAN ),
    		div_100 = CAST( contours_2.elevation % 100 AS BOOLEAN ),
    		div_250 = CAST( contours_2.elevation % 250 AS BOOLEAN );
    

    The SLD styling for the contours based on the new data structure will be forthcoming with my intern’s upcoming guest post, but in the meantime, I have a teaser of a map snapshot, served up through a GeoExt interface (note the correct contour intervals showing in the legend simply and elegantly because of the data structure and a single SLD):

    Contour Map, 10ft and 50ft contours

    Over-zoom showing 2ft and 10ft Contours

Posted in Database, Database Optimization, GeoServer, PostGIS

    Contours– Structuring PostGIS data for viewing with GeoServer

    Posted by smathermather on May 25, 2011

    Naively structured data is my bane– the desire (and need) to get stuff done so often overtakes the time needed to do things the better way. So, we bootstrap.

A long time ago, we managed to load a few tens of gigs of contour data into PostGIS and partition it into 2ft, 10ft, 20ft, 50ft, 100ft, and 250ft tables using SELECT queries with a modulus operator, e.g.:

    
    CREATE TABLE base.cuy_contours_10
    	AS
    	SELECT elevation, the_geom
    		FROM base.cuy_contours_2
    		WHERE base.cuy_contours_2.elevation % 10 = 0;
    

And then we built SLDs to display the data in GeoServer, formed the data into layer groups, and away we went… .

However, layer groups can often be replaced by properly structured PostGIS tables. For large (and homogeneous) datasets like this, it’s the only way to go. In addition, properly structuring this allows us to take advantage of GeoServer’s ability to modify the legend based on zoom extent, which, for representing contours at a range of scales, is almost a “must have”.

    To structure the table, we could be disciplined and build this out as a proper normalized relational dataset where our gid is used to determine if a contour is divisible by 10, 20, 50 etc., and while “I don’t wanna” isn’t a good enough reason not to do this, I think the computational overhead of a database view piecing these data back into a single table each time we need to access this would not be justified in the light of the static nature of the table. So database normalization be darned, disk space is cheap, full speed ahead. Let’s add some boolean fields for flagging whether a contour is divisible by our numbers and calculate that out:
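If the boolean columns don’t already exist, something like this adds them first (a sketch using the same div_* column names as the updates below):

ALTER TABLE base.contours_2
	ADD COLUMN div_10 boolean,
	ADD COLUMN div_20 boolean,
	ADD COLUMN div_50 boolean,
	ADD COLUMN div_100 boolean,
	ADD COLUMN div_250 boolean;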

    
    UPDATE base.contours_2
    	SET div_10 = CAST( contours_2.elevation % 10 AS BOOLEAN );
    

    
    UPDATE base.contours_2
    	SET div_250 = CAST( contours_2.elevation % 250 AS BOOLEAN );
    

    yields the following (with highlights added):

Sort of the opposite of what I intended (an elevation evenly divisible by 10 leaves a remainder of 0, which casts to false), so the “falses” maybe should be “trues” and vice versa, but hey, it’ll work anyway. BTW, we did test doing this calculation a few different ways, and while I can’t remember all the details, doing this as a cast instead of doing an update query with a while statement testing the modulus was much faster (3x faster).
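To see the inversion concretely, here’s a quick check (any elevation evenly divisible by 10 will do):

SELECT 850 % 10 AS remainder, CAST(850 % 10 AS BOOLEAN) AS div_10;
-- remainder = 0, div_10 = false: evenly divisible elevations come out false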

Ok, so now we style an SLD for it, and sit back and enjoy (pics later…).

Posted in Database, Database Optimization, GeoServer, PostGIS, SQL

    PostgreSQL and Fuzzy Hashes– a natural combo?

    Posted by smathermather on May 20, 2011

    A while back, I had the idea of creating a hash of geometry in order to create a unique constraint on geometry to avoid duplicate geometries in a PostGIS table. It works. It’s slow, but it works. I haven’t used it since.
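Something along these lines (hypothetical table and column names; md5 over the canonical text form stands in for whatever hash you prefer):

-- Reject exact duplicate geometries by enforcing uniqueness on a hash of the EWKT
CREATE UNIQUE INDEX parcels_geom_md5_idx ON parcels (md5(ST_AsEWKT(geom)));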

What I really wanted to do, though, was to be able to detect whether geometries were similar (not just identical), to test whether geometries are homologous.  Wouldn’t it be nice to know how similar two shapes are?

A couple of nights ago, I was introduced to a technique that may just do that– fuzzy hashes.  Now, a hash is a clever way to convert anything and everything (in a computer), regardless of size, into a number of the same size, say a 128-bit integer, which is effectively unique to the particular input.  This has all sorts of uses, from identifying viruses, to indexing information, to ensuring that when you download something from the internet, you are getting what you thought you were downloading in an unmodified form.  The defining property of a hash is that if you make a small change to the input, the output hash is completely different from the hash of the unchanged input.  So if you want to know whether two things are similar, a hash won’t help.  In fact, hashes of bad files are used in forensics, and work-arounds to hash-based forensics can be as simple as changing a single byte in a file.

    Enter fuzzy hashes.  In short, fuzzy hashes break up a file into logical pieces or blocks, hash each block, and then similarity between files can be determined block-by-block.  This is useful for file forensics, anti-virus, determining the origins of nefarious code, detecting plagiarism etc.  In long, well… just read this.  Also, check out ssdeep.

The question is, can we use this in the geographic sphere?  How would this be useful in, e.g., PostGIS?

    Let’s give it a try:

    Line geometry-- original

    If we convert our purple line above to text in PostGIS:

    SELECT ST_AsEWKT(the_geom) FROM trail_test;

    we get something like this:

    "SRID=3734;MULTILINESTRING((2179985.47870642 567446.852705703 1030.80067171231 -1.79769313486232e+308,2179982.97870642 567445.252427514 1030.79452346621 -1.79769313486232e+308,2179980.47870642 567443.652149326 1030.74616711558 -1.79769313486232e+308,2179977.97870642 567442.051871137 1030.61640858078 -1.79769313486232e+308,2179975.47870642 567440.451592949 1030.50568889286

    -1.79769313486232e+308,2179794.15034441 566393.137591605 1051.18052872389 -1.79769313486232e+308,2179794.04164876 566390.637591605 1051.46559305229 -1.79769313486232e+308,2179793.93295311 566388.137591605 1051.74478920181 -1.79769313486232e+308,2179793.82425746 566385.637591605 1052.01504940301 -1.79769313486232e+308,2179793.73503707 566383.585522703 1052.23441454273 -1.79769313486232e+308))"
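One way to get that text into a file that ssdeep can read (a sketch; the database name is hypothetical, and psql’s -A and -t flags strip the alignment and headers so only the EWKT lands in the file):

psql -A -t -d gisdb -c "SELECT ST_AsEWKT(the_geom) FROM trail_test" > temp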

Which, fuzzy hashed, looks like this:

    # ssdeep -b temp
    ssdeep,1.0--blocksize:hash:hash,filename
    384:RpLQqsmoBbzjvPW7KEhoAyoGchYqpmeYOB7LCfCMJrOevT1WLwR7UyYIU:7Eq5oBbzjvPWuAyoGcSqIUNLCfCMJ6eq,"temp"

    If we change some vertices and rerun, but this time change the ssdeep command to tell us similarity:

    Original Geometry

    Modified Geometry


    Create a hash of the first geometry and write it to file:
    # ssdeep -b temp > temphash

    then to compare the hash of the second file:
    # ssdeep -bm temphash temp1
    temp1 matches temp (99)

    Not surprisingly, they are 99% similar.
Now this is a painful way to do this comparison. However, imagine a function in the PostGIS suite wherein you could do a geometry-by-geometry similarity comparison for tables. Imagine combining that analysis with a snap-to-grid function to further generalize its applicability.

    Any thoughts on uses? I know I’d use it to determine whether two data sources were related, i.e. were modified from the same parent, but I’m guessing there are some other more subtle applications.

Posted in PostGIS