# Archive for April, 2010

## Cost of nearest neighbor search depending on distance

Posted by smathermather on April 24, 2010

A quick review of the cost of searching with increasing distances.  See this original post for the code being run.

```sql
SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g2.z as z, g1.z - g2.z as height, g1.the_geom as geometry
FROM veg As g1, ground As g2
WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 3.5)
ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);
```

This is the total query time comparing 167,000 to 222,000 points using ST_DWithin to do a nearest neighbor search with different search distances.
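The growth in cost with search radius is easy to see outside the database too.  This Python sketch (illustrative only: synthetic random points and a brute-force distance test, not the post's LiDAR data) counts how many point pairs fall within each radius.  The candidate set that ST_DWithin must winnow grows roughly with the square of the search distance, which is why the radius matters so much:

```python
import random

random.seed(42)

# Synthetic stand-ins for the veg and ground tables (not the post's data).
veg = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(300)]
ground = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(300)]

def candidates_within(points_a, points_b, radius):
    """Count point pairs within `radius` of each other: a proxy for the
    work a brute-force ST_DWithin join does at that search distance."""
    r2 = radius * radius
    return sum(
        1
        for ax, ay in points_a
        for bx, by in points_b
        if (ax - bx) ** 2 + (ay - by) ** 2 <= r2
    )

# Pair counts (and hence join work) climb quickly with the radius.
for radius in (3.5, 7.0, 15.0, 30.0):
    print(radius, candidates_within(veg, ground, radius))
```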

Posted in Database, PostGIS, SQL

## Further optimization of the PostGIS LiDAR Vegetation Height Query

Posted by smathermather on April 21, 2010

There’s much to be said for knowing your data in order to best optimize the analysis of it.  Beyond all other bits of cleverness, having a functional understanding of your problem is the first step toward conceiving an intelligent and efficient solution.

One thing that I didn’t do two posts ago was to spend any time deciding how far out the search for nearby points with ST_DWithin would be.  So I played around a bit visualizing the data to address this question.  Here in green are the ground points shown in Quantum GIS.

And below they are displayed as points two feet in diameter (kind of like buffering them, but here it’s just taking advantage of QGIS’s display properties).

They almost converge.  Going up to 3 feet, they do converge, so we can conclude that the spacing of this LiDAR dataset is about 3 feet.  These points are over a flat residential area (the rectangular gaps are buildings).

And below we have the points over a cliff/slump area.  We can see some gaps in this extreme topography.  So I’ll conclude that if I use a search radius of 3.5 feet, I should be able to perform my nearest neighbor search more efficiently, and still find the heights of the vegetation and buildings relative to the closest ground point.
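The eyeball test can also be backed with a quick calculation.  This Python sketch (an illustration on a synthetic regular grid, not the LiDAR tiles) estimates spacing as the median nearest-neighbor distance, the kind of figure that justifies picking a 3.5-foot search radius for roughly 3-foot spacing:

```python
import math
import statistics

def estimate_spacing(points):
    """Median distance from each point to its nearest neighbor:
    a numeric stand-in for eyeballing point spacing in QGIS."""
    nearest = []
    for i, (x1, y1) in enumerate(points):
        best = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        )
        nearest.append(best)
    return statistics.median(nearest)

# A regular 3-foot grid should report 3 feet of spacing.
grid = [(x * 3.0, y * 3.0) for x in range(10) for y in range(10)]
print(estimate_spacing(grid))  # 3.0
```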

```sql
SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g2.z as z, g1.z - g2.z as height, g1.the_geom as geometry
FROM veg As g1, ground As g2
WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 3.5)
ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);
```

So, here is the vegetation height shown from lightest green (shortest) to darkest green (tallest):

We can throw in buildings similarly colored in gray (in this case we extend the search window to 30 feet to ensure we find a ground point nearby):

And a closer look over my house.

I know I wrote down some numbers for how much faster this is than searching 15 feet, but I can’t remember where I put them.  Needless to say, this is the most important optimization.
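For clarity, the DISTINCT ON query keeps only the closest in-range ground point per vegetation point.  The same logic in plain Python (a sketch with made-up points and a brute-force search, not the production query):

```python
import math

# Tiny stand-ins for the veg and ground tables: (gid, x, y, z).
veg = [(1, 0.0, 0.0, 10.0), (2, 10.0, 0.0, 25.0)]
ground = [(101, 1.0, 0.0, 2.0), (102, 9.0, 9.0, 5.0), (103, 50.0, 50.0, 0.0)]

def heights(veg_pts, ground_pts, radius):
    """For each veg point, pair it with its nearest ground point within
    `radius` (the ST_DWithin + ORDER BY ST_Distance + DISTINCT ON pattern)
    and report height = veg z minus ground z."""
    out = {}
    for gid, x, y, z in veg_pts:
        in_range = [
            (d, gz)
            for _, gx, gy, gz in ground_pts
            if (d := math.hypot(x - gx, y - gy)) <= radius
        ]
        if in_range:
            _, nearest_ground_z = min(in_range)
            out[gid] = z - nearest_ground_z
    return out

print(heights(veg, ground, 3.5))  # {1: 8.0} -- gid 2 has no ground point within 3.5 ft
print(heights(veg, ground, 15))   # {1: 8.0, 2: 23.0}
```

Widening the radius trades speed for coverage, which is exactly the knob the post is tuning.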

## Trigger Speeeeed, or triggers and data maintenance in PostGIS

Posted by smathermather on April 21, 2010

I have completed 4 out of 6 of my triggers for maintaining boundary data in PostGIS.  This is a follow-up to this post.  If you haven’t read my PostGIS trigger post, follow this link and come back.  I’ll wait.  Go on.

Back?  OK.  So I’ve implemented my property boundary triggers.  I need about 6-7 different versions of our boundary for cartographic reasons– some dissolved (ST_Union) based on one set of attributes, some based on another, etc.  Needless to say, I just want to modify one version and have the other versions automagically be created.  This is where PostGIS functions and triggers come in: anything you can conceive of doing with the hundreds of PostGIS spatial functions, you can run within a trigger.
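The dissolves these triggers maintain are just grouped unions.  As a toy illustration of the ST_Union ... GROUP BY pattern (not PostGIS: here a "geometry" is just a set of grid cells, and the reservation names are made up):

```python
from collections import defaultdict

# Made-up parcels: (reservation name, set of grid cells standing in for geometry).
parcels = [
    ("Bedford", {(0, 0), (0, 1)}),
    ("Bedford", {(0, 1), (1, 1)}),
    ("Hinckley", {(5, 5)}),
]

def dissolve(rows):
    """Mimic SELECT res, ST_Union(the_geom) ... GROUP BY res:
    union all geometries that share the same attribute value."""
    merged = defaultdict(set)
    for res, geom in rows:
        merged[res] |= geom
    return dict(merged)

print(dissolve(parcels))
```

The trigger's job is simply to rerun this grouped union whenever the source table changes, so the derived tables never drift out of date.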

First, as always, testing.  Each time I update the record (I can’t figure out how to fire a trigger once per transaction in PostgreSQL, so this runs with each record insert), I want to take my boundary and create a dissolved copy of it in another table.  Naively, I tried dropping and recreating the table first:

```sql
DROP TRIGGER dissolver ON base.reservation_boundaries_public_private_cm;

CREATE OR REPLACE FUNCTION dissolve_bounds() RETURNS TRIGGER AS $dissolver$

BEGIN
IF(TG_OP='INSERT') THEN

DROP TABLE IF EXISTS base.reservation_boundaries_public_private_cm_dissolved;

CREATE TABLE base.reservation_boundaries_public_private_cm_dissolved
AS
SELECT res, ST_Union(the_geom)
AS the_geom
FROM base.reservation_boundaries_public_private_cm
GROUP BY res;

ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD COLUMN gid BIGSERIAL;
ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD PRIMARY KEY (gid);
ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD CONSTRAINT enforce_dims_the_geom CHECK (ndims(the_geom) = 2);
ALTER TABLE base.reservation_boundaries_public_private_cm_dissolved ADD CONSTRAINT enforce_srid_the_geom CHECK (srid(the_geom) = 102722);
GRANT SELECT ON TABLE base.reservation_boundaries_public_private_cm_dissolved TO low_level_access;
COMMENT ON TABLE base.reservation_boundaries_public_private_cm_dissolved IS 'Base boundary layer, with public and private boundaries, dissolved (CM)';

END IF;
RETURN NEW;
END;

$dissolver$ LANGUAGE plpgsql;

CREATE TRIGGER dissolver
AFTER INSERT ON base.reservation_boundaries_public_private_cm
FOR EACH STATEMENT EXECUTE PROCEDURE dissolve_bounds();
```

It works, but it is a little slow– it takes about 50 seconds to complete the transaction.  When I wrote this I was a fundamental SQL newbie.  Now, still newbish, I realize that dropping and recreating the table is quite inefficient.  It’s much better to remove the contents of the table and insert new records (here is just the modified code between IF() THEN and END IF;):

```sql
DELETE FROM base.reservation_boundaries_public_private_cm_dissolved;

INSERT INTO base.reservation_boundaries_public_private_cm_dissolved
SELECT res, ST_Union(the_geom)
AS the_geom
FROM base.reservation_boundaries_public_private_cm
GROUP BY res;
```

Moderately faster– runs in about 30 seconds.

To maintain all the tables, I run a few more statements in the trigger.  Remarkably, this doesn’t take much longer than modifying just one table within the trigger– about 30 seconds:

```sql
DROP TRIGGER dissolver ON base.reservation_boundaries_public_private_cm;

CREATE OR REPLACE FUNCTION dissolve_bounds() RETURNS TRIGGER AS $dissolver$

BEGIN
IF(TG_OP='INSERT') THEN

DELETE FROM base.reservation_boundaries_public_private_cm_dissolved;

INSERT INTO base.reservation_boundaries_public_private_cm_dissolved
SELECT res, ST_Union(the_geom)
AS the_geom
FROM base.reservation_boundaries_public_private_cm
GROUP BY res;

DELETE FROM base.reservation_boundaries_cm_dissolved;

INSERT INTO base.reservation_boundaries_cm_dissolved
SELECT res, ST_Union(the_geom)
AS the_geom
FROM base.reservation_boundaries_public_private_cm
WHERE access = 1
GROUP BY res;

DELETE FROM base.reservation_boundaries_cm;

INSERT INTO base.reservation_boundaries_cm
SELECT res, the_geom
FROM base.reservation_boundaries_public_private_cm
WHERE access = 1;

DELETE FROM base.reservation_boundaries_private_cm;

INSERT INTO base.reservation_boundaries_private_cm
SELECT res, the_geom
FROM base.reservation_boundaries_public_private_cm
WHERE access = 0;

END IF;
RETURN NEW;
END;

$dissolver$ LANGUAGE plpgsql;

CREATE TRIGGER dissolver
AFTER INSERT ON base.reservation_boundaries_public_private_cm
FOR EACH STATEMENT EXECUTE PROCEDURE dissolve_bounds();
```

Now just a few more things to stick in there for masking versions of my boundary, plus a script to export this to a shapefile to synchronize with my file system, and PostGIS gets to maintain my boundaries, not me or an intern.  Between you and me, my interns are good and very reliable, but PostGIS is much more reliable than I am.

Posted in Database, Database Optimization, PostGIS, SQL

## PostGIS and LiDAR, the saga completes?

Posted by smathermather on April 6, 2010

I have to thank this post for the code that follows: http://postgis.refractions.net/pipermail/postgis-users/2009-June/023660.html

I’ve been picking at this nearest-neighbor-for-points problem for a few months now, never dedicating enough time to think it through, nor expending enough energy to get really good at SQL, let alone at how it can be applied in spatial operations with an OGC-compliant database like PostGIS.  But here’s my naive nearest neighbor solution.  It’s my solution for getting vegetation height from a set of vegetation points and a set of ground points: basically taking the nearest ground point and subtracting the z value of the ground from the z value of the vegetation.

The points themselves, while attributed with x, y, and z, are in fact 2D point sets, so the nearest point is not the nearest in the 3D domain, but the nearest point horizontally.  I did this on purpose– it reduces errors in areas with steep slopes, and has less computational overhead.
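A small Python sketch (synthetic numbers, chosen so the two definitions disagree) of why the 2D choice matters on a steep slope: the nearest point in 3D can be a distant point at a similar elevation, while the nearest point in 2D is the ground directly below, which is what we want to subtract for height:

```python
import math

def nearest(points, target, dims):
    """Return the point nearest to `target`, comparing only the
    first `dims` coordinates (2 = horizontal only, 3 = full 3D)."""
    return min(points, key=lambda p: math.dist(p[:dims], target[:dims]))

# A vegetation return partway up a steep slope...
veg = (0.0, 0.0, 30.0)
# ...and two ground returns: one directly below it but far lower,
# one farther away horizontally at nearly the same elevation.
ground = [(1.0, 0.0, 0.0), (8.0, 0.0, 29.0)]

print(nearest(ground, veg, dims=2))  # (1.0, 0.0, 0.0) -- nearest horizontally
print(nearest(ground, veg, dims=3))  # (8.0, 0.0, 29.0) -- nearest in 3D
```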

```sql
SELECT DISTINCT ON(g1.gid)  g1.z - g2.z as height
FROM veg As g1, ground As g2
WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);

height
-----------
0.00000
33.80000
0.52000
0.00000
0.00000
0.00000
10.79000
11.55000
6.92000
7.61000
22.54000
50.16000
50.62000
50.96000
45.93000
```

It’d be nice to know which points are which, and where they are at as well, so we’ll grab the gid for each point, as well as the value for x, y, and z:

```sql
SELECT DISTINCT ON(g1.gid)  g1.gid as gid, g2.gid as gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
FROM veg As g1, ground As g2
WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);
```

How efficient is this?

```sql
EXPLAIN ANALYZE SELECT DISTINCT ON(g1.gid)  g1.gid As gid, g2.gid As gid_ground, g1.x as x, g1.y as y, g1.z as z, g1.z - g2.z as height, g1.the_geom as geometry
FROM veg As g1, ground As g2
WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 15)
ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom);
```

```
Unique  (cost=1347373.60..1347373.61 rows=1 width=225) (actual time=6342.041..6358.930 rows=1882 loops=1)
  ->  Sort  (cost=1347373.60..1347373.60 rows=1 width=225) (actual time=6342.039..6353.324 rows=42239 loops=1)
        Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
        Sort Method:  external merge  Disk: 1704kB
        ->  Nested Loop  (cost=44.80..1347373.59 rows=1 width=225) (actual time=0.057..6267.593 rows=42239 loops=1)
              Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND (g2.the_geom && st_expand(g1.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
              ->  Seq Scan on ground g2  (cost=0.00..57.22 rows=2522 width=113) (actual time=0.011..0.927 rows=2522 loops=1)
              ->  Materialize  (cost=44.80..63.71 rows=1891 width=112) (actual time=0.000..0.170 rows=1891 loops=2522)
                    ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.634 rows=1891 loops=1)
Total runtime: 6392.724 ms
(10 rows)
```

Oh, yeah.  We need some indexes.

```sql
CREATE INDEX veg_the_geom_idx ON veg USING gist(the_geom);
CREATE INDEX ground_the_geom_idx ON ground USING gist(the_geom);
```

```
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.057..393.715 rows=1882 loops=1)
  ->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=378.055..387.902 rows=42239 loops=1)
        Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
        Sort Method:  external sort  Disk: 1712kB
        ->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.127..302.111 rows=42239 loops=1)
              Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
              ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.014..0.490 rows=1891 loops=1)
              ->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
                    Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
Total runtime: 410.327 ms
(10 rows)
```

Whew!  That’s a bit faster.  Will clustering help us this time?

```sql
ALTER TABLE veg CLUSTER ON veg_the_geom_idx;
ALTER TABLE ground CLUSTER ON ground_the_geom_idx;
```

```
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.542..393.175 rows=1882 loops=1)
  ->  Sort  (cost=1266.33..1266.33 rows=1 width=225) (actual time=377.541..387.374 rows=42239 loops=1)
        Sort Key: g1.gid, (st_distance(g1.the_geom, g2.the_geom))
        Sort Method:  external sort  Disk: 1712kB
        ->  Nested Loop  (cost=0.00..1266.32 rows=1 width=225) (actual time=0.065..302.340 rows=42239 loops=1)
              Join Filter: ((g1.gid <> g2.gid) AND (g1.the_geom && st_expand(g2.the_geom, 15::double precision)) AND _st_dwithin(g1.the_geom, g2.the_geom, 15::double precision))
              ->  Seq Scan on veg g1  (cost=0.00..42.91 rows=1891 width=112) (actual time=0.006..0.459 rows=1891 loops=1)
              ->  Index Scan using ground_the_geom_idx on ground g2  (cost=0.00..0.37 rows=1 width=113) (actual time=0.025..0.065 rows=29 loops=1891)
                    Index Cond: (g2.the_geom && st_expand(g1.the_geom, 15::double precision))
Total runtime: 409.784 ms
(10 rows)
```

Eh, not much.  But this is a pretty small dataset.  We’ll see later how it helps on a larger dataset.

And the relevant stats: these were run on a tortured late-model MacBook Pro with a 2.4GHz Intel Core 2 Duo and a couple of gigs of RAM, running PostgreSQL:

```
SELECT version();
                                                                    version
------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.2 on i386-apple-darwin10.2.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1), 64-bit
(1 row)
```

and PostGIS 1.5.1.

I promised a test in this post to find out whether my supposition holds: that a dense enough LiDAR point cloud precludes the need for doing the difference with an interpolated ground point cloud, vs. the raw point cloud as I’m using it here.  I haven’t forgotten… .  That’s the next test, I think.
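The GiST index wins because a cheap bounding-box test prunes most pairs before the exact distance check.  The same idea in miniature: a hand-rolled grid index in Python (an illustration of the pruning principle, not how GiST actually works).  Both paths must return identical rows; only the number of candidates examined changes:

```python
import random
from collections import defaultdict

random.seed(7)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(2000)]

def within_brute(points, qx, qy, r):
    """Sequential scan: test every point, like the pre-index plan."""
    r2 = r * r
    return {p for p in points if (p[0] - qx) ** 2 + (p[1] - qy) ** 2 <= r2}

def build_grid(points, cell):
    """Bucket points by grid cell -- a crude spatial index."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def within_grid(grid, cell, qx, qy, r):
    """Index scan: only visit cells overlapping the query's bounding box,
    then apply the exact distance test to the survivors."""
    r2 = r * r
    lo_x, hi_x = int((qx - r) // cell), int((qx + r) // cell)
    lo_y, hi_y = int((qy - r) // cell), int((qy + r) // cell)
    out = set()
    for cx in range(lo_x, hi_x + 1):
        for cy in range(lo_y, hi_y + 1):
            for x, y in grid[(cx, cy)]:
                if (x - qx) ** 2 + (y - qy) ** 2 <= r2:
                    out.add((x, y))
    return out

grid = build_grid(pts, cell=15.0)
print(within_brute(pts, 50, 50, 15) == within_grid(grid, 15.0, 50, 50, 15))  # True
```

CLUSTER, by contrast, only reorders rows on disk to match the index, which is why it helps little here: the whole dataset already fits in memory.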

Posted in Database, Database Optimization, PostGIS, SQL