Smathermather's Weblog

Remote Sensing, GIS, Ecology, and Oddball Techniques

Posts Tagged ‘CLUSTER’

LiDAR Shootout!

Posted by smathermather on July 18, 2011

For a couple of months now I’ve been corresponding with Etienne Racine and Pierre Racine of Laval University in Quebec City.  They decided to take on the problem of benchmarking the speed and accuracy of a number of different techniques for extracting canopy height from LiDAR data, and they have been kind enough to allow me to post the results here.  This will be a multipart series: we’ll start with the punchline, and then build out the code behind each of the queries/calculations in separate posts (under the “LiDAR Shootout” category).

In the hopper for testing were essentially 4 different ways to extract canopy height from a LiDAR point cloud and (in three of the cases) a DEM extracted from the LiDAR point cloud.  The 4 techniques were as follows:

  • Use ArcGIS’s ExtractValuesToPoints tool.
  • Use R, with the raster and rgdal packages, calling the raster::extract() function with in-memory results.
  • Use the PostGIS (2.0 svn) Raster datatype (formerly WKT Raster); a sketch of this style of query follows the list.
  • Use a simple nearest-neighbor approach with ST_DWithin in PostGIS.
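For a flavor of the PostGIS Raster style of query (third bullet above), here is a minimal sketch.  The table and column names are assumptions for illustration only (a DEM raster table dem(rid, rast) and a point table veg_points(gid, z, the_geom)), not the actual benchmark tables, and the functions are the PostGIS 2.0-era raster functions:

-- Canopy height as vegetation z minus the DEM elevation beneath each point.
-- Assumed tables: dem(rid, rast) and veg_points(gid, z, the_geom).
SELECT p.gid,
       p.z - ST_Value(d.rast, p.the_geom) AS height
  FROM veg_points AS p
  JOIN dem AS d
    ON ST_Intersects(d.rast, p.the_geom);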

The data are from the Ohio Statewide Imagery Program (OSIP), run by the Ohio Geographically Referenced Information Program (OGRIP).  Ohio is blessed with an excellent 2006-2008 LiDAR dataset and a statewide DEM derived from the (effectively) 3.5-foot horizontal posting data (the specs may say 5-ft, I can’t remember…).

Speeeed:

So, first to speed and the initial results.  Who wins in the speed category?  Measured in points per second (all on the same hardware), nearest neighbor wins by a landslide (bigger is better here):

Application       Points per second
-----------       -----------------
ArcGIS            4504
R                 3228
PostGIS Raster    1451 (maybe as high as 4500)
PostGIS NN        18208

Later on, Pierre discovered that changes to indexing, plus optimization guided by the EXPLAIN query analyzer, roughly tripled the PostGIS Raster query speed, making it about the same speed as ArcGIS.  More on that later.

Figure removed– to be replaced in another post

Accuracy:

A fast calculation is always better, unless you trade off accuracy in some detrimental way.  With PostGIS NN, we’re not interpolating our LiDAR ground point cloud before calculating the difference, so relative to the other three techniques we could be introducing some bias or artifacts, a point Jeff Evans makes here.  Overall error relative to the interpolated solution introduced by using an NN technique on this dataset: 0.268 feet.  Whew!  Sigh of relief for me!  We may spend some time looking at the geographic distribution of that error to see whether it’s random or otherwise, but that’s very near the error level of the input data.
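For what it’s worth, a summary like that overall error figure can be computed directly in SQL once both height estimates are attached to each point.  A minimal sketch, assuming a hypothetical table heights(gid, height_nn, height_dem, the_geom) holding the nearest-neighbor height and the DEM-based height for each vegetation point:

-- Mean error, RMSE, and worst case of the NN heights relative to the DEM-based heights.
SELECT avg(height_nn - height_dem)                 AS mean_error,
       sqrt(avg(power(height_nn - height_dem, 2))) AS rmse,
       max(abs(height_nn - height_dem))            AS max_abs_error
  FROM heights;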

Addendum:

Eight days ago, Martin Isenburg let the lasheight tool drop.  LAStools is impressively fast.  Preliminary tests by Etienne: 140,450 points per second.

Oy!  A factor of 8 faster than PostGIS NN.  And it uses a DEM calculated on the fly with Delaunay triangulation.  Preliminary results indicate a very different DEM than the one calculated by the state:

RMSE: 25.4227446600656

More research will need to be done to find the source of the differences: they are not random.

In other news:

PostGIS will get a true KNN technique in the near future.  Such a project is funded, so stay tuned for faster results:

http://www.postgis.org/pipermail/postgis-devel/2011-June/013931.html
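For reference, the index-assisted KNN support discussed in that thread eventually takes the form of an ORDER BY on the <-> distance operator.  A rough sketch of the single-point form is below; it requires PostgreSQL 9.1+ and PostGIS 2.0+ (newer than the versions used in these posts), and the veg/ground table names are borrowed from the posts further down this page:

-- Nearest ground point to one vegetation point, using the KNN distance operator <->.
-- Requires PostgreSQL 9.1+ and PostGIS 2.0+; not available on the versions benchmarked here.
SELECT g.gid, g.z
  FROM ground AS g
 ORDER BY g.the_geom <-> (SELECT the_geom FROM veg WHERE gid = 1)
 LIMIT 1;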

Also, if funded, index changes to PostGIS could come down the pike that would make bounding-box queries as much as 6 times faster, à la:

http://blog.opengeo.org/2011/05/27/pgcon-notes-3/

Between the two optimizations, we could give LAStools a run for its money on speed  🙂 .  In the meantime, preliminary RMSE aside, LAStools (the 5th tool, not part of this test) may win hands down.

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL

PostGIS and LiDAR, the saga completes?

Posted by smathermather on April 6, 2010

I have to thank this post for the code that follows: http://postgis.refractions.net/pipermail/postgis-users/2009-June/023660.html

I’ve been poking at this nearest-neighbor-for-points problem for a few months now, never dedicating enough time to think it through, nor expending enough energy to get really good at SQL, let alone at how it can be applied in spatial operations with an OGC-compliant database like PostGIS.  But here’s my naive nearest neighbor solution.  It gets vegetation height from a set of vegetation points and a set of ground points, basically by taking the nearest ground point and subtracting its z value from the z value of the vegetation point.

The points themselves, while attributed with x, y, and z, are in fact 2D point sets, so the nearest point is not the nearest in the 3D domain, but the nearest horizontally.  I did this on purpose: it reduces errors in areas with steep slopes, and it has less computational overhead.

SELECT DISTINCT ON(g1.gid)  g1.z - g2.z as height
 FROM veg As g1, ground As g2   
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)   
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

height   
-----------
 0.00000
 33.80000
 0.52000
 0.00000
 0.00000
 0.00000
 10.79000
 11.55000
 6.92000
 7.61000
 22.54000
 50.16000
 50.62000
 50.96000
 45.93000

It’d be nice to know which points are which, and where they are, so we’ll grab the gid for each point, as well as the values for x, y, and z:

SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
 FROM veg As g1, ground As g2   
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)   
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);
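If these per-point heights will be reused, the same query can be materialized into its own table rather than re-run each time.  A sketch, with veg_height as a hypothetical table name:

-- Persist the per-point canopy heights (hypothetical table name veg_height).
CREATE TABLE veg_height AS
SELECT DISTINCT ON (g1.gid)
       g1.gid, g2.gid AS gid_ground,
       g1.x, g1.y, g1.z,
       g1.z - g2.z AS height,
       g1.the_geom AS geometry
  FROM veg AS g1, ground AS g2
 WHERE g1.gid <> g2.gid
   AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
 ORDER BY g1.gid, ST_Distance(g1.the_geom, g2.the_geom);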

How efficient is this?

EXPLAIN ANALYZE SELECT DISTINCT ON(g1.gid)  g1.gid As gid, g2.gid As gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
 FROM veg As g1, ground As g2
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1347373.60..1347373.61 rows=1 width=225) (actual time=6342.041..6358.930 rows=1882 loops=1)
   ->  Sort  (cost=1347373.60..1347373.60 rows=1 width=225) (actual time=6342.039..6353.324 rows=42239 loops=1)
         Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
         Sort Method:  external merge  Disk: 1704kB
         ->  Nested Loop  (cost=44.80..1347373.59 rows=1 width=225) (actual time=0.057..6267.593 rows=42239 loops=1)
               Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND (g2.the_geom && st_expand(g1.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
               ->  Seq Scan on ground g2  (cost=0.00..57.22 rows=2522 width=113) (actual time=0.011..0.927 rows=2522 loops=1)
               ->  Materialize  (cost=44.80..63.71 rows=1891 width=112) (actual time=0.000..0.170 rows=1891 loops=2522)
                     ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.634 rows=1891 loops=1)
 Total runtime: 6392.724 ms
(10 rows)

Oh, yeah. We need some indexes.

CREATE INDEX veg_the_geom_idx ON veg USING gist(the_geom);
CREATE INDEX ground_the_geom_idx ON ground USING gist(the_geom);
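One small step not shown here: after building the indexes it’s cheap to refresh the planner’s statistics so the new indexes get costed sensibly.  Something like:

-- Refresh planner statistics for both tables after creating the indexes.
VACUUM ANALYZE veg;
VACUUM ANALYZE ground;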

 

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.057..393.715 rows=1882 loops=1)
   ->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.055..387.902 rows=42239 loops=1)
         Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
         Sort Method:  external sort  Disk: 1712kB
         ->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.127..302.111 rows=42239 loops=1)
               Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
               ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.014..0.490 rows=1891 loops=1)
               ->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
                     Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
 Total runtime: 410.327 ms
(10 rows)

Whew! That’s a bit faster. Will clustering help us this time?

ALTER TABLE veg CLUSTER ON veg_the_geom_idx;
ALTER TABLE ground CLUSTER ON ground_the_geom_idx;

 

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.542..393.175 rows=1882 loops=1)
   ->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.541..387.374 rows=42239 loops=1)
         Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
         Sort Method:  external sort  Disk: 1712kB
         ->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.065..302.340 rows=42239 loops=1)
               Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
               ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.459 rows=1891 loops=1)
               ->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
                     Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
 Total runtime: 409.784 ms
(10 rows)

Eh, not much. But this is a pretty small dataset. We’ll see later how it helps on a larger dataset.

And the relevant stats:

These were run on a tortured late-model MacBook Pro with a 2.4 GHz Intel Core 2 Duo and a couple of gigs of RAM, running PostgreSQL.

SELECT version();
 version                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.2 on i386-apple-darwin10.2.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1), 64-bit
(1 row)

and PostGIS 1.5.1.

I promised in this post a test to find out whether my supposition holds that the LiDAR point cloud is dense enough to preclude the need to difference against an interpolated ground surface, rather than against the raw ground point cloud as I’m using it here.  I haven’t forgotten… .  That’s the next test, I think.

Posted in Database, Database Optimization, PostGIS, SQL

PostGIS and LiDAR– oops!

Posted by smathermather on February 7, 2010

I won’t offer much prelude.  Read this post, and this post first… .

Inadvertently I demonstrated the value of spatial indices, i.e. I meant to use them and didn’t.  In attempting to sort my tables by their spatial index I got the following errors:

ALTER TABLE lidar_ground CLUSTER ON lidar_ground_the_geom_idx;
 ERROR:  "lidar_ground_the_geom_idx" is not an index for table "lidar_ground"
ALTER TABLE lidar_veg CLUSTER ON lidar_veg_the_geom_idx;
 ERROR:  "lidar_veg_the_geom_idx" is not an index for table "lidar_veg"

Looking further, I saw there were no spatial indices on either table.

Returning to my original post, I had typos:

CREATE INDEX lidar_veg_the_geom_idx ON lidar USING gist(the_geom);
CREATE INDEX lidar_ground_the_geom_idx ON lidar USING gist(the_geom);

Do you see it?  It should be

CREATE INDEX lidar_veg_the_geom_idx ON lidar_veg USING gist(the_geom);
CREATE INDEX lidar_ground_the_geom_idx ON lidar_ground USING gist(the_geom);

I created two identical indices on the lidar table, not on the two tables I was interested in.  So, what was the relative performance hit?

Without the indices:

Vinton=# EXPLAIN ANALYZE SELECT pgis_fn_nn(b.the_geom, 10,1,1,'lidar_ground','true','vnum','the_geom') FROM lidar_veg AS b WHERE b.vnum > 576505 AND b.vnum < 576600;
 QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using lidar_veg_pkey on lidar_veg b  (cost=0.00..108.97 rows=31 width=100) (actual time=571.154..28589.303 rows=50 loops=1)
 Index Cond: ((vnum > 576505) AND (vnum < 576600))
 Total runtime: 28589.398 ms
(3 rows)

With the indices:

EXPLAIN ANALYZE SELECT pgis_fn_nn(b.the_geom, 10,1,1,'lidar_ground','true','vnum','the_geom') FROM lidar_veg AS b WHERE b.vnum > 576505 AND b.vnum < 576600;
 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using lidar_veg_pkey on lidar_veg b  (cost=0.00..108.97 rows=31 width=100) (actual time=86.489..138.579 rows=50 loops=1)
 Index Cond: ((vnum > 576505) AND (vnum < 576600))
 Total runtime: 138.717 ms
(3 rows)

Now with CLUSTERing:

ALTER TABLE lidar_veg CLUSTER ON lidar_veg_the_geom_idx;
ALTER TABLE lidar_ground CLUSTER ON lidar_ground_the_geom_idx;

EXPLAIN ANALYZE SELECT pgis_fn_nn(b.the_geom, 10,1,1,'lidar_ground','true','vnum','the_geom') FROM lidar_veg AS b WHERE b.vnum > 576505 AND b.vnum < 576600;
 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using lidar_veg_pkey on lidar_veg b  (cost=0.00..108.97 rows=31 width=100) (actual time=1.929..56.285 rows=50 loops=1)
 Index Cond: ((vnum > 576505) AND (vnum < 576600))
 Total runtime: 56.378 ms
(3 rows)

Indices sped up the query by a factor of 206 (28589 ms down to 139 ms).  Adding the CLUSTER sped it up another 2.4 times over and above the indices (139 ms down to 56 ms), for a query that runs 507 times faster in total.  Did I say the full query would take a few years?  Ahem, I meant a couple of days.  Specifically, about 2.

Posted in Database, Database Optimization, PostGIS, SQL

Rethinking PostGIS Analyses– Remembering to CLUSTER

Posted by smathermather on February 6, 2010

Something’s been nagging at me about the level of optimization (or lack thereof) that I applied to my database before my other posts on using PostGIS to analyze LiDAR data (e.g. this post).  My results seemed remarkably slow, but I couldn’t put my finger on why that was problematic.

Then, as I was testing a smaller LiDAR dataset in order to do a shading analysis for my future garden, I noticed that with those 1800 points (a high-density point cloud, considering I live on a quarter acre), the nearest-neighbor algorithm was much faster per point than when applied to the larger dataset I was testing before.  On one hand this seems reasonable: a bigger dataset takes longer to process.  But, properly optimized, a large dataset should not take significantly longer per record; we want linear, or near-linear, scaling of processing.  When we don’t get it, as in this case, it indicates that the data are not structured in a way that allows fast access to the records of interest.

Re-ordering records (with PostgreSQL’s CLUSTER) to reflect the hierarchy of an index is a good way to optimize a database, assuming you know which index is the most important one for organizing your data.  In the case of large spatial datasets, this is usually the spatial index.

In this example, I will apply CLUSTERing to the PostGIS database, clustering on the spatial index in order to optimize spatial queries.

ALTER TABLE lidar_ground CLUSTER ON lidar_ground_the_geom_idx;
ALTER TABLE lidar_veg CLUSTER ON lidar_veg_the_geom_idx;
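One detail that’s easy to forget: ALTER TABLE ... CLUSTER ON only records which index a table should be clustered on; the rows aren’t physically reordered until a CLUSTER command is actually run (and an ANALYZE afterwards keeps the planner’s statistics current).  A sketch of that follow-up step on the same tables:

-- Physically rewrite each table in spatial-index order, then refresh statistics.
CLUSTER lidar_ground;
CLUSTER lidar_veg;
ANALYZE lidar_ground;
ANALYZE lidar_veg;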

I will report back with some results.  Hopefully this helps.

Posted in Database, Database Optimization, PostGIS, SQL