PostGIS Gripe: Limits to PostgreSQL’s B-tree Indexing

Is this a gripe about PostGIS or PostgreSQL? Hard to say. At my office, we are migrating to a PostGIS/GeoServer/client-of-your-choice stack. As usual, I have to try the hard stuff first. So, rather than keeping the 2 ft contour lines for our seven-county area in shapefiles and doing some indexing on them there, I’m attempting to load them into PostGIS, all in a single table. Does that make any sense? Maybe not. Honestly, I really don’t know. So far, though, PostGIS has done pretty well. I loaded all 9.6 GB of 2 ft contours from one county into the database, served it up directly in uDig, and then through GeoServer to OpenLayers and Google Earth. For viewing the whole county it was slow, but I’d never let a user view the 2 ft contour dataset all in one go. Zoomed in, however, it did pretty well; in fact, considering there was no caching turned on for OpenLayers, I’d say it did splendidly.

First, how I loaded it:

I used shp2pgsql. My contours were in a bunch of little chunks derived from a state DEM dataset, so I iterated through them:

Create the table:

shp2pgsql -s 102722 N2110610 base.contours > N2110610.sql

psql -U postgres -d CM -f N2110610.sql
Then iterate through the appends (note the -a flag):

shp2pgsql -a -s 102722 N2110615 base.contours > N2110615.sql
psql -U postgres -d CM -f N2110615.sql

We could script this. I have to confess to having done it in Excel with worksheet functions, since my Bash scripting skills help me none in the Windows world, but a friend told me about FOR…IN…DO in batch scripting, so next time…
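
For next time, here is a rough batch-file sketch of that loop (untested; it assumes the tile shapefiles sit in the current directory, and that the table has already been created from the first tile as above, so that first tile should be skipped or moved aside before running it):

FOR %%f IN (N*.shp) DO (
  REM strip the .shp extension, then append each tile and load it
  shp2pgsql -a -s 102722 %%~nf base.contours > %%~nf.sql
  psql -U postgres -d CM -f %%~nf.sql
)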

So, loading was easy, and all worked well. Viewing was fine. What is my gripe? B-tree indexing. What if I wanted to ensure I had only unique geometries in my database? I might create the table like this instead:

CREATE TABLE "base"."contours" (gid serial PRIMARY KEY,
"objectid" int8,
"type" int8,
"elevation" smallint,
"shape_len" numeric);
SELECT AddGeometryColumn('base','contours','the_geom','102722','MULTILINESTRING',2);
alter table base.contours add constraint unique_geom unique (the_geom);

Adding a unique constraint to the geometry creates an implicit B-tree index, and therein lies the problem: B-tree index entries are limited in size to around 2700 bytes, and these contour geometries exceed that many times over. According to the PostgreSQL documentation, “Currently, only B-tree indexes can be declared unique.”
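
Out of curiosity, a rough way to see how far past that limit the rows go is to measure the geometries themselves, using the EWKB length as a stand-in for the index entry size:

SELECT max(octet_length(ST_AsEWKB(the_geom))) AS largest_geom_bytes
FROM base.contours;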

This is probably fine for most datasets, but it really limits the use of a uniqueness constraint on geometry columns holding large geometries. Since this dataset won’t require updates, my work-around (I hope) will be to load the data into the database, perform a SELECT DISTINCT, and dump the results into another table to clean up any possible duplicates, but it sure would be nice if I could constrain the table in the first place…

7 thoughts on “PostGIS Gripe: Limits to PostgreSQL’s B-tree Indexing”

  1. Hi,
    Be careful: when you declare a unique constraint on the_geom, you are actually not eliminating exact geometry duplicates, but geometries with the same bounding box.
    As a matter of fact, it uses the = operator, which in PostGIS is defined on geometries as a bbox comparison.
    bye,
    Vincent

  2. Actually using DISTINCT the_geom would use the same bbox = logic that Vincent mentioned, so that won’t work either.

    If you want to go the DISTINCT route, you’ll need to do a
    INSERT INTO newtable(the_geom)
    SELECT DISTINCT ST_AsEWKB(the_geom)
    FROM oldtable

    that will force it into binary mode for the distinct thus preventing a bbox = thingy, but on insert into the table it will be cast back to a geometry.
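
    To see the difference concretely, here is a sketch with two made-up line strings that share a bounding box (behaviour depends on your PostGIS version; recent releases changed = to compare actual coordinates):

    SELECT ST_GeomFromText('LINESTRING(0 0, 10 10)')
         = ST_GeomFromText('LINESTRING(0 10, 10 0)');
    -- true under the bbox-based = described above

    SELECT ST_AsEWKB(ST_GeomFromText('LINESTRING(0 0, 10 10)'))
         = ST_AsEWKB(ST_GeomFromText('LINESTRING(0 10, 10 0)'));
    -- false: the EWKB byte strings differ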

  3. This is very helpful, thank you. I can’t offhand think of any case where bbox instead of a full comparison would get me in trouble, but it’d always be out there lurking as a possibility, and there’s nothing worse than a subtle lurking problem.

    Now a nagging question about floating point values: if the comparison is done on well known binary, do we have to worry about comparison of floating point values? What I mean is, can we be certain that each time a float value is represented, it will be represented identically to the last time, and thus we can detect when two values are equal? How does PostGIS handle this?

  4. So, ignorant question, why wouldn’t this work (other than my table would be too large to index):
    alter table base.contours add constraint unique_geom unique ST_AsEWKB(the_geom);

  5. Hmm, I suppose that would work, though it may not for large geometries, and I’m not sure about the floating-point case. You could also try this trick I mentioned, which would also use a spatial index so might perform better, though admittedly I haven’t tried it. Also check out other people’s suggestions:

    http://postgis.refractions.net/pipermail/postgis-users/2008-November/021891.html

    This is not something I run into often, so I haven’t really stress-tested any of these tricks as far as geometries go.

    hope that helps,
    Regina

  6. Why not hash the geometry value and place it into another column, say “the_geom_hash”? Then put a unique constraint on the_geom_hash, and create an insert trigger that does the hashing and sets the value of the_geom_hash column.

    I realize it’s possible to have hash collisions, but the hashes should generally be unique. If your data can tolerate this very slim possibility of error, then it should work fine.

  7. I had to do some reading on Wikipedia to make sure I understood this response, but I think it will work, so long as there aren’t hash collisions (a term I’d never heard before but now understand… I probably shouldn’t confess to such noob-ness).

    So, my interpretation of Abe’s comment is that an alternative approach is to create the unique index on a proxy for the geometry field, without having to index something as long as the geometry itself. The idea is to turn the geometry into a hash representation to compress it, hope there aren’t any false duplicates when I do inserts later (hash collisions), and index and constrain that. Brilliant approach, and a new tool in my toolbox. So, maybe something like this:

    So, first, get the binary straight from the database:

    ST_AsBinary(the_geom)

    then encode that binary to text (encode() needs a format argument, e.g. 'hex'):

    encode(…, 'hex')

    and then convert this to an MD5 hash

    MD5()

    for a complete command like this:

    MD5(encode(ST_AsBinary(the_geom), 'hex'))

    resulting in something like this:
    900150983cd24fb0d6963f7d28e17f72

    That’s much shorter than the original values, and might be indexable.

    To be tested ASAP (next week most likely…). A rough sketch of the whole approach is below.
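
    A minimal sketch of the full idea, untested, using the the_geom_hash column from Abe’s comment and made-up names for the constraint, trigger function, and trigger:

    ALTER TABLE base.contours ADD COLUMN the_geom_hash char(32);
    ALTER TABLE base.contours ADD CONSTRAINT unique_geom_hash UNIQUE (the_geom_hash);

    CREATE OR REPLACE FUNCTION base.hash_the_geom() RETURNS trigger AS $$
    BEGIN
      -- store the MD5 of the geometry's binary form as a short, index-able proxy
      NEW.the_geom_hash := MD5(encode(ST_AsBinary(NEW.the_geom), 'hex'));
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER contours_hash_geom
      BEFORE INSERT ON base.contours
      FOR EACH ROW EXECUTE PROCEDURE base.hash_the_geom();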
