Smathermather's Weblog

Remote Sensing, GIS, Ecology, and Oddball Techniques

Archive for the ‘Database Optimization’ Category

Normalizing tables in PostgreSQL: experiments with Fulcrum

Posted by smathermather on February 12, 2015

Fulcrum by Spatial Networks is an interesting and very useful hosted service for collecting geo data in the field with whatever smartphone or tablet you already own.  It works well so long as you just want to collect points, and the form-building capabilities are quite extraordinary.  Also, for many applications it’s cheaper than custom solutions, so it hits that sweet spot well.  If you’ve been paying attention, you know I prefer my hosted solutions to be (nearly) purely open source, so as to avoid vendor lock-in.  Fulcrum is sufficiently compelling that we are using it, carefully and with a bit of joy, in spite of it not meeting my usual preferences.

Screen shot of Fulcrum's form builder

So, I thought I’d share some table normalization we are doing on the wetland inventory data we’ll be collecting with Fulcrum.  The State of Ohio specifies a rapid assessment method for classifying wetlands, the Ohio Rapid Assessment Method for Wetlands (ORAM) (warning– pdf).  This assessment method has force of law, and on paper it is a single two-sided form with a range of physical, water regime, land cover, and other characterizations.  To date we’ve always filled it out on paper, transcribed it at the office, and then populated spreadsheets or, more recently, a PostGIS/PostgreSQL database.

Where we run into some difficulty is that some of our form fields can hold multiple values.  Fulcrum returns a single field for these, with comma-delimited contents:

comma_delimited

Since we need to do lookups and calculations on these fields, we really need some normalization. Enter regexp_split_to_table.

CREATE TABLE nr.oram_metrics AS
  SELECT oram_id, 'm1_wetland_area'::text AS metric, regexp_split_to_table(m1_wetland_area, ',')::text AS selection
    FROM nr.cm_oram_data;
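If you haven’t met regexp_split_to_table before, it turns one delimited string into one row per element. A quick standalone example with a made-up value:

SELECT regexp_split_to_table('forested,emergent,shrub', ',') AS selection;

 selection
-----------
 forested
 emergent
 shrub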

Now in our case, we have one lookup table for ORAM. Perhaps it should be multiple tables, but it’s just one discrete (ht. Brian Timoney for grammar fix) thing, this ORAM form, so we keep it simple. This means for each of our columns that should be normalized, we can throw them all in one table. We use UNION ALL to accomplish this:

CREATE TABLE nr.oram_metrics AS
  SELECT oram_id, 'm1_wetland_area'::text AS metric, regexp_split_to_table(m1_wetland_area, ',')::text AS selection
    FROM nr.cm_oram_data

UNION ALL

  SELECT oram_id, 'm2a_upland_buffer_width'::text AS metric, regexp_split_to_table(m2a_upland_buffer_width, ',')::text AS selection
    FROM nr.cm_oram_data;

Finally, we can use a FOREIGN KEY constraint with a table that has all our metrics and selections plus our lookup value to complete the exercise:

CREATE TABLE nr.oram_metrics AS
  SELECT oram_id, 'm1_wetland_area'::text AS metric, regexp_split_to_table(m1_wetland_area, ',')::text AS selection
    FROM nr.cm_oram_data

UNION ALL

  SELECT oram_id, 'm2a_upland_buffer_width'::text AS metric, regexp_split_to_table(m2a_upland_buffer_width, ',')::text AS selection
    FROM nr.cm_oram_data;

ALTER TABLE nr.oram_metrics ADD CONSTRAINT "metric_constraint"
  FOREIGN KEY (metric, selection) REFERENCES nr.oram_score_lookup_all (metric, selection);
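One note on that last step: for the foreign key to be accepted, nr.oram_score_lookup_all needs a primary key or unique constraint on (metric, selection). A minimal sketch of what that lookup table looks like (the score column name is my stand-in for whatever the lookup value is actually called):

CREATE TABLE nr.oram_score_lookup_all (
  metric text NOT NULL,            -- e.g. 'm1_wetland_area'
  selection text NOT NULL,         -- one allowed choice for that metric
  score numeric,                   -- placeholder name for the lookup value
  PRIMARY KEY (metric, selection)  -- target of the foreign key above
);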

Posted in Database, Database Optimization, Fulcrum, PostGIS, PostgreSQL, SQL | Leave a Comment »

Cartography and USGS — Fake Building Footprints in PostGIS now with distance operator (part 2)

Posted by smathermather on May 17, 2012

In a previous couple of posts (this one, and this one), we dealt with point rotations, first with bounding box searches, and then with nominal use of operators.

First we create a function to do our angle calculations, then use a SELECT to run it against all the records.

Within our function, we first find the (in this case) five nearest streets using the bounding-box distance operator <#>, then further order by ST_Distance with a LIMIT 1 to get just the closest street.  By the way, this is about twice as fast as the last approach.

And our new code:

CREATE OR REPLACE FUNCTION angle_to_street (geometry) RETURNS double precision AS $$

-- Pull the five nearest streets by bounding-box distance (<#>), which the
-- spatial index can answer, computing true distance and azimuth for each.
WITH index_query as
	(SELECT ST_Distance($1, road.geom) as dist,
		degrees(ST_Azimuth($1, ST_ClosestPoint(road.geom, $1))) as azimuth
	FROM street_centerlines As road
	ORDER BY $1 <#> road.geom limit 5)

-- Of those five, keep the azimuth of the truly closest street.
SELECT azimuth
	FROM index_query
	ORDER BY dist
LIMIT 1;

$$ LANGUAGE SQL;

DROP TABLE IF EXISTS address_points_rot;
CREATE TABLE address_points_rot AS
	SELECT addr.*, angle_to_street(addr.geom)
	FROM address_points addr;
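One assumption worth making explicit: the <#> ordering only pays off if street_centerlines has a spatial index for it to use. If the table doesn’t already have one, something like this is needed first:

-- GiST index the ORDER BY $1 <#> road.geom above can walk
CREATE INDEX street_centerlines_geom_idx
  ON street_centerlines USING gist (geom);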

I have to credit Alexandre Neto with much of the code for this solution.

Posted in Analysis, Database, Database Optimization, PostGIS, PostgreSQL, SQL | Leave a Comment »

Dialog– What qualifies as a benchmark? Part 1

Posted by smathermather on July 25, 2011

Normally, my blog is a bit of a monologue.  It’s not a bad thing, but it can be a little lonely.  Every now and then I get great (and often doubtful) comments, which enhance things considerably.

What follows is some dialog about the LiDAR shootout series, largely between Etienne and Pierre, posted with their permission:

Pierre:

“Etienne, Stephen,

“I really appreciate the benchmark work you are doing to compare different technologies to solve your lidar data processing problem.

“However, if you want your tests to have as much credibility as possible and mostly if you wishes to publish the results on the web I would recommend that you compare horses with horses, dogs with dogs, general vector technologies with general vector technologies, raster technologies with raster technologies, specialized lidar technologies with specialized lidar technologies. Otherwise the dog will always lose against the horses. You have to be clear on that and well explain the differences and the pro’s and con’s between all the tech you will be using in your benchmark. Are you doing the test in ArcGIS using ESRI lidar technology (or general vector one)? Same question for R and PostGIS Raster (in the last case it is clear that the answer is no). In your graph you are comparing very different kind of animals without giving much chances to dogs.

“For every technologies you will also have to test different approaches in order to find the fastest one for this technology in order to compare it with the fastest one in the other technologies. Comparing a slow approach in one tech with the fastest in the other would be a bit unfair. This is what we did for PostGIS Raster. Did you do the same with R and ArcGIS?

“Basically I think it is just unfair to compare ArcGIS, R and PostGIS Raster with LAStools for working with Lidar data. Mostly if you are not using ArcGIS and R packages for dealing with lidar. Teach us you did not use them if you did not. We will all learn something.”

Etienne:

“As you said, the point is to compare different strategies to get the job done. That’s it. It’s not a question of being fair or not, we report facts. Maybe the bottom line is that pgraster shouldn’t be used to extract lidar height. However it can be used for other things. Well, we all learned something.

“It also emphasize the fact that there is no point_cloud type in postgis. I guess this could improve a lot the postgis results as it seems handling point cloud might be really fast. I learned that.

“Each solution has its drawbacks, for ArcGIS I think we’ll all agree about the cost of the licence and furthermore the cost of the lidar plugin. For lastools, it require a licensing for commercial use (which is really low btw compared to ArcGIS + Spatial analyst + LP360). PG raster require a postgres installation and some SQL knowledge, etc. R require to learn a new language, etc.

“Now I must acknowledge, the tests we did were on only 500k points. That could be considered a lot, and not. When dealing with lidar, one could easily expect to get more than a thousand times this quantity. But it was the limit imposed by the LP360 version I used and also a reasonable size for benchmarking. Note that pg raster was really stable in the timing (whatever the size).

“Finally, please note that Martin’s approach is new and he just released the tool a week ago. It’s to the better of my knowledge, the first one the do it.

“Pierre, explain your point better, I’m not sure I get it (or I disagree, as you can see).”

Pierre:
“My point is just that you have to be clear in your post about the fact that LAStool is a technology developed explicitly for dealing with lidar data. PostGIS raster is not and I guess you are not using ESRI lidar technology and I have no idea how you do it with R. The point is to compare potatoes with potatoes.

“If I would be from the ESRI side I would quickly post a comment in the post saying “Did you read this article?” http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Estimating_forest_canopy_density_and_height/00q8000000v3000000/

“A don’t know how you did it in ArcGIS but it seems that they are storing lidar data as multipoints.

“A good benchmark should precise all this.”

Etienne (Pierre > quoted):

> If I would be from the ESRI side I would quickly post a comment in the post saying “Did you read this article?”

> http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Estimating_forest_canopy_density_and_height/00q8000000v3000000/
“I would reply: apart from the fact that it uses lidar data, it has nothing to do with the object of the present article 🙂 But I’d be happy to discover a better way than what I did (for all the tools btw).

“Maybe you should (re-)read the post, it’s written. For R as well. Maybe Steve, you should precise (this is my fault) that for ArcGIS, I’ve used LP360 (evaluation) to load lidar data, which was limited to 500k points. But it wasn’t involved in the computation as I used (as written in the post) «ExtractValuesToPoints».”

Pierre:

> So it’s a good benchmark ?

“I just think it would be more fair to create three test categories:

“1) Lidar technologies (apparently your ArcGIS test should be in this category)

“2) Vector/raster technologies (why not use non lidar techs if lidar techs are so much better?)

“3) Nearest neighbor technique using vector only technology

“I won’t comment further…”

Etienne:

“I personally would not encourage these categories as there is some confusion (at least in my mind) on the definition of what is a «LiDAR thechnology». However I think it would be a good idea to nuance the conclusions, which I think will be done in further posts (or not). It’s also pretty clear from the RMSE measures what is the main drawback of each solution (with speed).

“I’m still open for discussion…”

Stephen (me):

“Hi Pierre,
     “I agree with you in principle, but ArcGIS aside, there are some very real opportunities for optimization on point-in-raster analyses with tiling, indexing, etc.  In other words, I don’t see any practical reasons why PostGIS Raster couldn’t be (nearly) as fast as an NN technique, so long as the tile size is optimized for the density of the analysis in question.  The only thing that makes it a dog is its age as a technology.”

(moments later after discovering the lengthy dialog…)

“It looks like I missed all the discussion here…

“The reason such benchmarks are interesting is (in part) that they are not fair.  PostGIS isn’t optimized for LiDAR analyses.  PostGIS Raster isn’t optimized for LiDAR analyses.  PostGIS Raster is very young technology, which really isn’t fair.  The PostGIS development folks know this, and were interested in seeing my initial benchmarks with NN analyses for LiDAR height (last year) because they are aware of the optimizations that are not yet there for handling point clouds.  Vector and raster analyses (as we’ve run them here) should not, in principle or practice, be substantially different in speed, so long as the query analyzer maximizes the efficiency of the techniques.  Add in multipoint optimizations and other voodoo, and well, that’s a different story.  Another aspect to this is that comparing LAStools to the rest is unfair not just because it’s optimized for lidar but because it’s working at a much lower level (file level) than the abstractions a database provides.  The trade-off, of course, is that a database can provide transaction support, a standardized language, automated query analyzers, etc., that don’t apply to file-level processing.

“What I see here, categories aside, is a need for the other 2-3 parts of this 3-4 part series– the breakdown of which techniques are used, how they’re implemented, and the rationale behind them.  Don’t worry, we’ll get there.  That’s the danger of starting with the punchline, but we’ll build the nuance in.”

 

So, there we are.  Looks like I’m committed to building this out with a little more nuance.  🙂  Stay tuned.

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL | Leave a Comment »

LiDAR Shootout! — New Chart, Final Results

Posted by smathermather on July 21, 2011

Comparison of lidar height calculations

In reviewing the final numbers and charts from Etienne and Pierre, above are the results we see.  The only revision is a moderate increase in speed for the PG Raster query.

Final results in speed for LAStools– ~350,000 points per second.  In other words– off-the-charts fast.  And the initial RMSE of ~25 feet was a mistake– it is probably closer to 0.2 feet.

Stay tuned for detailed reviews of these techniques (with code).

 

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL | Leave a Comment »

LiDAR Shootout!

Posted by smathermather on July 18, 2011

For a couple of months now I’ve been corresponding with Etienne Racine and Pierre Racine of Laval University in Quebec City.  They decided to take on the problem of finding the speed and accuracy of a number of different techniques for extracting canopy height from LiDAR data.  They have been kind enough to allow me to post the results here.  This will be a multipart series.  We’ll start with the punchline, and then build out the code behind each of the queries/calculations in turn (the category will be “LiDAR Shootout“) with separate posts.

In the hopper for testing were essentially 4 different ways to extract canopy height from a LiDAR point cloud and (in three of the cases) a DEM extracted from the LiDAR point cloud.  The 4 techniques were as follows:

  • Use ArcGIS ExtractValuesToPoints.
  • Use R using the raster and rgdal packages with raster::extract() function with in-memory results.
  • Use the PostGIS (2.0 svn) Raster datatype (formerly WKT Raster).
  • And use a simple nearest neighbor approach with ST_DWithin within PostGIS.

The data are from the Ohio Statewide Imagery Program (OSIP) program, run by Ohio Geographically Referenced Information Program (OGRIP).  Ohio is blessed with an excellent 2006-2008 LiDAR dataset and statewide DEM derived from the (effectively) 3.5 foot horizontal posting data (specs may say 5-ft, I can’t remember…).

Speeeed:

So, first to speed, initial results.  Who wins in the speed category?  Measured as points per second (on consistently the same hardware), nearest neighbor wins by a landslide (bigger is better here):

Application:  Points per Second

ArcGIS:  4504
R:  3228
PostGIS Raster:  1451 (maybe as high as 4500)
PostGIS NN:  18208

Now, later on, Pierre discovered that changes to indexing may help, and an EXPLAIN query analyzer optimization tripled the PostGIS Raster query speed, making it about the same speed as ArcGIS.  More on that later.

Figure removed– to be replaced in another post

Accuracy:

A fast calculation is always better, unless you trade off accuracy in some detrimental way.  With PostGIS NN, we’re not interpolating our LiDAR ground point cloud before calculating the difference, so relative to the other three techniques, we could be introducing some bias/artifacts here, a point Jeff Evans makes here.  Overall error relative to the interpolated solution introduced by using an NN technique on this dataset: 0.268 feet.  Whew!  Sigh of relief for me!  We may spend some time looking at the geographic distribution of that error to see if it’s random or otherwise, but that’s very near the error level for the input data.
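For reference, error figures in this series are essentially root-mean-square differences between the two height estimates. A sketch of that calculation, assuming a hypothetical comparison table holding both estimates per point:

-- Hypothetical table height_comparison(gid, height_nn, height_interp)
SELECT sqrt(avg((height_nn - height_interp) ^ 2)) AS rmse
  FROM height_comparison;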

Addendum:

8 days ago, Martin Isenburg let the lasheight tool drop.  LAStools is impressively fast.  Preliminary tests by Etienne: 140,450 points per second.

Oy!  A factor of 8 faster than PostGIS NN.  And it uses an on-the-fly DEM calculated using Delaunay triangulation.  Preliminary results indicate a very different DEM than the one calculated by the state:

RMSE: 25.4227446600656

More research will need to be done to find the source of the differences: they are not random.

In other news:

PostGIS will get a true KNN technique in the near future.  Such a project is funded, so stay tuned for faster results:

http://www.postgis.org/pipermail/postgis-devel/2011-June/013931.html

Also, index changes to PostGIS could come down the pike, if funded, that would result in bounding box queries that are 6 times faster, à la:

http://blog.opengeo.org/2011/05/27/pgcon-notes-3/

Between the two optimizations, we could give LAStools a run for its money on speed 🙂 .  In the meantime, preliminary RMSE aside, LAStools (the fifth tool, not among the original four) may win hands down.

Posted in Database, Database Optimization, LiDAR Shootout, PostGIS, SQL | 9 Comments »

Contours– Symbolized from a single table

Posted by smathermather on July 15, 2011

In a previous post, I restructured the contour data for display in GeoServer, e.g.:


UPDATE base.contours_2
	SET 	div_10 = CAST( contours_2.elevation % 10 AS BOOLEAN ),
		div_20 = CAST( contours_2.elevation % 20 AS BOOLEAN ),
		div_50 = CAST( contours_2.elevation % 50 AS BOOLEAN ),
		div_100 = CAST( contours_2.elevation % 100 AS BOOLEAN ),
		div_250 = CAST( contours_2.elevation % 250 AS BOOLEAN );
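One quick sanity check of my own (not from the original workflow): remember the cast sends a zero modulus to false, so the 50ft contours are the rows where div_50 is false:

-- Contours evenly divisible by 50 cast to false (0 -> false)
SELECT count(*) FROM base.contours_2 WHERE div_50 = false;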

The SLD styling for the contours based on the new data structure will be forthcoming in my intern’s upcoming guest post, but in the meantime, I have a teaser of a map snapshot, served up through a GeoExt interface (note the correct contour intervals showing in the legend, simply and elegantly, because of the data structure and a single SLD):

Contour Map, 10ft and 50ft contours

Over-zoom showing 2ft and 10ft Contours

Posted in Database, Database Optimization, GeoServer, PostGIS | Leave a Comment »

Contours– Structuring PostGIS data for viewing with GeoServer

Posted by smathermather on May 25, 2011

Naively structured data is my bane– the desire (and need) to get stuff done so often overtakes the time needed to do things the better way. So, we bootstrap.

A long time ago, we managed to load a few tens of gigs of contour data into PostGIS and partitioned it into 2ft, 10ft, 20ft, 50ft, 100ft and 250ft tables using select queries with a modulus operator, e.g.


CREATE TABLE base.cuy_contours_10
	AS
	SELECT elevation, the_geom
		FROM base.cuy_contours_2
		WHERE base.cuy_contours_2.elevation % 10 = 0;
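Not shown in that snippet, and an assumption on my part about how the tables were finished off: each derived table also wants its own spatial index before GeoServer starts querying it.

-- Keeps GeoServer's bounding-box queries against the 10ft table fast
CREATE INDEX cuy_contours_10_the_geom_idx
  ON base.cuy_contours_10 USING gist (the_geom);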

And then we built SLDs to display the data in GeoServer, formed the data into layer groups, and away we went…

However, layer groups can often be replaced by properly structured PostGIS tables.  For large (and homogeneous) datasets like this, it’s the only way to go.  In addition, properly structuring this allows us to take advantage of GeoServer’s ability to modify the legend based on zoom extent, which for representing contours at a range of scales is almost a “must have”.

To structure the table, we could be disciplined and build this out as a properly normalized relational dataset, where our gid is used to determine whether a contour is divisible by 10, 20, 50, etc.  While “I don’t wanna” isn’t a good enough reason not to do this, I think the computational overhead of a database view piecing these data back into a single table each time we need them would not be justified, given the static nature of the table.  So database normalization be darned, disk space is cheap, full speed ahead.  Let’s add some boolean fields flagging whether a contour is divisible by each of our numbers, and calculate that out.
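Before that, though, the div_* columns themselves need to exist (they may well have come in with the data load; if not, this is the sketch, using the same names as the updates below):

ALTER TABLE base.contours_2
	ADD COLUMN div_10 boolean,
	ADD COLUMN div_20 boolean,
	ADD COLUMN div_50 boolean,
	ADD COLUMN div_100 boolean,
	ADD COLUMN div_250 boolean;

With those in place, calculating the flags: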


UPDATE base.contours_2
	SET div_10 = CAST( contours_2.elevation % 10 AS BOOLEAN );


UPDATE base.contours_2
	SET div_250 = CAST( contours_2.elevation % 250 AS BOOLEAN );

yields the following (with highlights added):

Sort of the opposite of what I intended (the “falses” maybe should be “trues” and vice versa), but hey, it’ll work anyway.  BTW, we did test doing this calculation a few different ways, and while I can’t remember all the details, doing it as a cast calculation instead of an update query with a while statement testing the modulus was much faster (3x faster).

Ok, so now we style an SLD for it, and sit back and enjoy (pics later…).

Posted in Database, Database Optimization, GeoServer, PostGIS, SQL | Leave a Comment »

Further optimization of the PostGIS LiDAR Vegetation Height Query

Posted by smathermather on April 21, 2010

There’s much to be said for knowing your data in order to best optimize the analysis of it.  Beyond all other bits of cleverness, having a functional understanding of your problem is the first step toward conceiving an intelligent and efficient solution.

One thing that I didn’t do two posts ago was to spend any time deciding how far out the search for nearby points with ST_DWithin would be.  So I played around a bit visualizing the data to address this question.  Here in green are the ground points shown in Quantum GIS.

And below they are displayed as points two feet in diameter (kind of like buffering them, but here, it’s just taking advantage of qGIS’ display properties).

They almost converge.  Going up to 3 feet, we can conclude that the spacing of this LiDAR dataset is about 3 feet.  These points are over a flat residential area (the rectangular gaps are buildings).

And below we have the points over a cliff/slump area.  We can see some gaps in this extreme topography.  So I’ll conclude that if I use a search radius of 3.5 feet, I should be able to more efficiently perform my nearest neighbor search and find the heights of the vegetation and buildings relative to the closest ground point.

SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g2.z as z, g1.z - g2.z as height, g1.the_geom as geometry
 FROM veg As g1, ground As g2   
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 3.5)   
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

So, here is the vegetation height shown from lightest green (shortest) to darkest green (tallest):

We can throw in buildings similarly colored in gray (in this case we extend the search window to 30 feet to ensure we find a ground point nearby):
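The buildings pass isn’t spelled out above, but it is presumably the same query with the wider radius; a hypothetical sketch, assuming a bldg point table structured like veg:

-- Hypothetical: nearest ground point within 30 feet of each building point
SELECT DISTINCT ON(b.gid) b.gid as gid, b.z - g.z as height, b.the_geom as geometry
 FROM bldg As b, ground As g
 WHERE ST_DWithin(b.the_geom, g.the_geom, 30)
 ORDER BY b.gid, ST_Distance(b.the_geom, g.the_geom);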

And a closer look over my house.

I know I wrote down some numbers for how much faster this is than searching 15 feet, but I can’t remember where I put them.  Needless to say, this is the most important optimization.

Posted in Database, Database Optimization, PostGIS, SQL | 1 Comment »

Trigger Speeeeed, or triggers and data maintenance in PostGIS

Posted by smathermather on April 21, 2010

I have completed 4 out of 6 of my triggers for maintaining boundary data in PostGIS.  This is a follow-up to this post.  If you haven’t read my PostGIS trigger post, follow this link and come back.  I’ll wait.  Go on.

Back?  OK.  So I’ve implemented my property boundary triggers.  I need about 6-7 different versions of our boundary for cartographic reasons– some of which are dissolved (ST_Union) based on one set of attributes, some based on another, etc.  Needless to say, I just want to modify one version, and have the other versions automagically be created.  This is where PostGIS functions and triggers come in.  Anything you can conceive of doing with the hundreds of PostGIS spatial functions, you can run within a trigger.

First, as always, testing.  Each time I update the record (can’t figure out how to do triggers once per transaction in PostgreSQL, so this runs with each record insert) I want to take my boundary and create a dissolved copy of it in another table.  Naively, I tried to remove and recreate the table first:

DROP TRIGGER dissolver ON base.reservation_boundaries_public_private_cm;

CREATE OR REPLACE FUNCTION dissolve_bounds() RETURNS TRIGGER AS $dissolver$

BEGIN
 IF(TG_OP='INSERT') THEN

 DROP TABLE IF EXISTS base.reservation_boundaries_public_private_cm_dissolved;

 CREATE TABLE base.reservation_boundaries_public_private_cm_dissolved
 AS
 SELECT res, ST_Union(the_geom)
 AS the_geom
 FROM base.reservation_boundaries_public_private_cm
 GROUP BY res;

 ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD COLUMN gid BIGSERIAL;
 ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD PRIMARY KEY (gid);
 ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD CONSTRAINT enforce_dims_the_geom CHECK (ndims(the_geom) = 2);
 ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD CONSTRAINT enforce_srid_the_geom CHECK (srid(the_geom) = 102722);
 GRANT SELECT ON TABLE base.reservation_boundaries_public_private_cm_dissolved TO low_level_access;
 COMMENT ON TABLE base.reservation_boundaries_public_private_cm_dissolved IS 'Base boundary layer, with public and private boundaries, dissolved (CM)';

 END IF;
 RETURN NEW;
END;

$dissolver$ LANGUAGE plpgsql;

CREATE TRIGGER dissolver
 AFTER INSERT ON base.reservation_boundaries_public_private_cm
 FOR EACH STATEMENT EXECUTE PROCEDURE dissolve_bounds();

It works, but it is a little slow.  It takes about 50 seconds to complete the transaction.  So, when I wrote this I was a fundamental SQL nube.  Now, still nubish, I realize that dropping and recreating the table is quite inefficient.  It’s much better to remove the contents of the table, and insert new records (here is just the modified code between IF() THEN and END IF;).

 DELETE FROM base.reservation_boundaries_public_private_cm_dissolved;

 INSERT INTO base.reservation_boundaries_public_private_cm_dissolved
  SELECT res, ST_Union(the_geom)
   AS the_geom
   FROM base.reservation_boundaries_public_private_cm
   GROUP BY res;

Moderately faster– runs in about 30 seconds.

To maintain all the tables, I have a few more statements run in the trigger.  Remarkably, this doesn’t take much longer than modifying just one table within the trigger, or about 30 seconds:

DROP TRIGGER dissolver ON base.reservation_boundaries_public_private_cm;

CREATE OR REPLACE FUNCTION dissolve_bounds() RETURNS TRIGGER AS $dissolver$

BEGIN
 IF(TG_OP='INSERT') THEN

 DELETE FROM base.reservation_boundaries_public_private_cm_dissolved;

 INSERT INTO base.reservation_boundaries_public_private_cm_dissolved
  SELECT res, ST_Union(the_geom)
   AS the_geom
   FROM base.reservation_boundaries_public_private_cm
   GROUP BY res;

 DELETE FROM base.reservation_boundaries_cm_dissolved;

 INSERT INTO base.reservation_boundaries_cm_dissolved
  SELECT res, ST_Union(the_geom)
   AS the_geom
   FROM base.reservation_boundaries_public_private_cm
   WHERE access = 1
   GROUP BY res;

 DELETE FROM base.reservation_boundaries_cm;

 INSERT INTO base.reservation_boundaries_cm
 SELECT res, the_geom
 FROM base.reservation_boundaries_public_private_cm
 WHERE access = 1;

 DELETE FROM base.reservation_boundaries_private_cm;

 INSERT INTO base.reservation_boundaries_private_cm
 SELECT res, the_geom
 FROM base.reservation_boundaries_public_private_cm
 WHERE access = 0;

 END IF;
 RETURN NEW;
END;

$dissolver$ LANGUAGE plpgsql;

CREATE TRIGGER dissolver
 AFTER INSERT ON base.reservation_boundaries_public_private_cm
 FOR EACH STATEMENT EXECUTE PROCEDURE dissolve_bounds();
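A hypothetical usage sketch (the feature values and geometry are made up; only res, access, and the_geom come from the statements above): a single insert into the source table now refreshes every derived table.

-- Firing the trigger: the AFTER INSERT statement-level trigger rebuilds the derived tables
INSERT INTO base.reservation_boundaries_public_private_cm (res, access, the_geom)
 VALUES ('Example Reservation', 1,
   ST_GeomFromText('MULTIPOLYGON(((0 0, 0 100, 100 100, 100 0, 0 0)))', 102722));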

Now, with just a few more things to stick in there for masking versions of my boundary, and a script to export this to a shapefile to synchronize with my file system, PostGIS gets to maintain my boundaries, not me or an intern.  Between you and me, my interns are good and very reliable, but PostGIS is much more reliable than I am.

Posted in Database, Database Optimization, PostGIS, SQL | 3 Comments »

PostGIS and LiDAR, the saga completes?

Posted by smathermather on April 6, 2010

I have to thank this post for the code that follows: http://postgis.refractions.net/pipermail/postgis-users/2009-June/023660.html

I’ve been putting at this nearest-neighbor-for-points problem for a few months now, never dedicating enough time to think it through, nor expending enough energy to get really good at SQL, let alone at how it can be applied in spatial operations with an OGC-compliant database like PostGIS.  But here’s my naive nearest neighbor solution.  It gets vegetation height from a set of vegetation points and a set of ground points by taking the nearest ground point and subtracting the z value of the ground from the z value of the vegetation.

The points themselves, while attributed with x, y, and z, are in fact 2D point sets, so the nearest point is not in the 3D domain, but the nearest point horizontally.  I did this on purpose– it reduces errors in areas with steep slopes, and has less computational overhead.

SELECT DISTINCT ON(g1.gid)  g1.z - g2.z as height
 FROM veg As g1, ground As g2   
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)   
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

height   
-----------
 0.00000
 33.80000
 0.52000
 0.00000
 0.00000
 0.00000
 10.79000
 11.55000
 6.92000
 7.61000
 22.54000
 50.16000
 50.62000
 50.96000
 45.93000

It’d be nice to know which points are which, and where they are, so we’ll grab the gid for each point, as well as the values for x, y, and z:

SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
 FROM veg As g1, ground As g2   
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)   
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

How efficient is this?

EXPLAIN ANALYZE SELECT DISTINCT ON(g1.gid)  g1.gid As gid, g2.gid As gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
 FROM veg As g1, ground As g2
 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
 ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

Unique  (cost=1347373.60..1347373.61 rows=1 width=225) (actual time=6342.041..6358.930 rows=1882 loops=1)
->  Sort  (cost=1347373.60..1347373.60 rows=1 width=225) (actual time=6342.039..6353.324 rows=42239 loops=1)
Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
Sort Method:  external merge  Disk: 1704kB
->  Nested Loop  (cost=44.80..1347373.59 rows=1 width=225) (actual time=0.057..6267.593 rows=42239 loops=1)
Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND (g2.the_geom && st_expand(g1.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
->  Seq Scan on ground g2  (cost=0.00..57.22 rows=2522 width=113) (actual time=0.011..0.927 rows=2522 loops=1)
->  Materialize  (cost=44.80..63.71 rows=1891 width=112) (actual time=0.000..0.170 rows=1891 loops=2522)
->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.634 rows=1891 loops=1)

Total runtime: 6392.724 ms

(10 rows)

Oh, yeah. We need some indexes.

CREATE INDEX veg_the_geom_idx ON veg USING gist(the_geom);
CREATE INDEX ground_the_geom_idx ON ground USING gist(the_geom);

 

Re-running the same EXPLAIN ANALYZE query:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.057..393.715 rows=1882 loops=1)
->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.055..387.902 rows=42239 loops=1)
Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
Sort Method:  external sort  Disk: 1712kB
->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.127..302.111 rows=42239 loops=1)
Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.014..0.490 rows=1891 loops=1)
->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
Total runtime: 410.327 ms

(10 rows)

Whew! That’s a bit faster. Will clustering help us this time?

-- Note: ALTER TABLE ... CLUSTER ON only marks the index for clustering;
-- a separate CLUSTER command does the physical reordering of rows.
ALTER TABLE veg CLUSTER ON veg_the_geom_idx;
ALTER TABLE ground CLUSTER ON ground_the_geom_idx;

 

And the plan afterwards:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.542..393.175 rows=1882 loops=1)
->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.541..387.374 rows=42239 loops=1)
Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
Sort Method:  external sort  Disk: 1712kB
->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.065..302.340 rows=42239 loops=1)
Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.459 rows=1891 loops=1)
->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
Total runtime: 409.784 ms
(10 rows)

Eh, not much. But this is a pretty small dataset. We’ll see later how it helps on a larger dataset.

And the relevant stats:

These were run on a tortured late-model MacBook Pro with a 2.4GHz Intel Core 2 Duo and a couple of gigs of RAM, running PostgreSQL.

SELECT version();
 version                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.2 on i386-apple-darwin10.2.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1), 64-bit
(1 row)

and PostGIS 1.5.1.

I promised a test in this post to find out whether my supposition is right: that the LiDAR point cloud is dense enough to preclude the need to difference against an interpolated ground point cloud, vs. the raw point cloud as I’m using it here.  I haven’t forgotten… That’s the next test, I think.

Posted in Database, Database Optimization, PostGIS, SQL | 2 Comments »