This is a quick blog post about technologies that I don’t know well… so please comment if you know better. GeoGig and dat are great tools for addressing versioning in data, so what’s the difference?
GeoGig is built on Java and meant for any “simple features” geometry (points, lines, polygons).
Its strength is that it is built from the ground up to handle geometries well, going beyond CRUD functions to specifically address geospatial problems in versioning. Think of it as git for geospatial data.
There’s a hosted version in pre-release from BoundlessGeo called Versio, which is meant to be the GitHub for geospatial data. You can also run your own local version from http://geogig.org/
From the website:
“Users are able to import raw geospatial data (currently from Shapefiles, PostGIS or SpatiaLite) in to (sic) a repository where every change to the data is tracked. These changes can be viewed in a history, reverted to older versions, branched in to sandboxed areas, merged back in, and pushed to remote repositories.”
Ok, so how about dat?
“Dat is an open source project that provides a streaming interface between every file format and data storage backend.”
A cursory look indicates it will work for geospatial data, but effectively as blobs, with no special handling for changes within features like GeoGig. But, it does what GeoGig does not, and that is to make datasets automatically syncable.
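To make the blob versus feature distinction concrete, here is a toy sketch in Python. It is not how dat or GeoGig actually work internally, and the function names (blob_changed, feature_diff) and the sample road feature are made up for illustration. It only shows the difference between knowing that a record changed and knowing what changed inside its geometry.

```python
import hashlib
import json


def blob_changed(old_feature, new_feature):
    """Blob-style check: hash each serialized feature and compare the digests."""
    def digest(feature):
        payload = json.dumps(feature, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
    return digest(old_feature) != digest(new_feature)


def feature_diff(old_feature, new_feature):
    """Feature-aware check: report which attributes and which vertices differ."""
    changes = {"attributes": {}, "vertices": []}

    old_props = old_feature["properties"]
    new_props = new_feature["properties"]
    for key in sorted(set(old_props) | set(new_props)):
        if old_props.get(key) != new_props.get(key):
            changes["attributes"][key] = (old_props.get(key), new_props.get(key))

    # Compare vertices by position; assumes a simple LineString coordinate list.
    old_coords = old_feature["geometry"]["coordinates"]
    new_coords = new_feature["geometry"]["coordinates"]
    for i in range(max(len(old_coords), len(new_coords))):
        old_pt = old_coords[i] if i < len(old_coords) else None
        new_pt = new_coords[i] if i < len(new_coords) else None
        if old_pt != new_pt:
            changes["vertices"].append((i, old_pt, new_pt))
    return changes


# Hypothetical road segment where a single vertex has been nudged.
old = {
    "properties": {"name": "Main St"},
    "geometry": {"type": "LineString",
                 "coordinates": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]},
}
new = {
    "properties": {"name": "Main St"},
    "geometry": {"type": "LineString",
                 "coordinates": [[0.0, 0.0], [1.0, 1.5], [2.0, 2.0]]},
}

print(blob_changed(old, new))   # the whole record is flagged as changed
print(feature_diff(old, new))   # only the moved vertex is reported
```

The blob check is cheap and works for any kind of data, which fits the generic approach. The feature-aware diff needs to understand geometry, but it is what lets two people edit different vertices of the same feature and still merge their work.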
Like all projects, each has its strengths. Choose your project wisely.
9 thoughts on “Quick (and likely apocryphal) post on versioning and databases”
Thanks for this post, as I’ve been wondering how dat can help me bridge the gap between SQL Server and PostGIS… but I can’t figure out if that’s what it’s for, or if it’s a tool for helping developers access these different formats only for their apps…
Will watch for more comments / info!
It seems like a good tool for abstracting away the problems of data sharing between database types. I don’t know how far along they are with SQL Server support; I haven’t looked…
It would be great if dat would be able to support geospatial data in the future. The syncing feature sounds neat. I’ve yet to use it though. Thanks for the post!
dat supports geospatial data, but you won’t get into the weeds on within-row edits in any meaningful way. And since a row in geospatial data can contain so much complexity in geometry, the CRUD approach has its limits. That said, for synchronizing data across several different mediums (without attention to the above), it seems to hold great potential.
I think it’s a fair summary. Disclaimer: I was the CTO of Boundless until recently and was very involved with GeoGig and Versio up until a few weeks ago. In my current job dat is already on the radar for some of the problems we will need to solve, so I think I have kind of a unique perspective (but need more time with dat to form a solid opinion).
My current thought is GeoGig is very specialized, and targets geo only. Yes, it could be extended to support more generic datasets, but nothing is happening along those lines in the project. GeoGig’s design is very focused on translating the Git workflow to geospatial, on the command line. Building a sync feature on top of GeoGig is possible and not too hard actually. Dat, on the other hand, seems to take a more generic approach to data. It has more breadth and makes fewer assumptions, which probably means that it can’t yet handle large geospatial datasets but is able to work with a wider variety of information. Geospatial data rarely goes alone, so having a tool that can version any data is a huge plus. I think some of the streaming features of dat will prove to be a huge advantage in the medium to long run (thinking IoT, sensors and the like here).
Thanks Juan! I’m glad to hear I am in the ballpark.
IoT, sensors, and such are much of why I’m excited about dat. In the environmental / ecological circles I run in, sensors and sensor data collection and distribution are a large problem space.
BTW Juan, where’s that old OpenGeo blog post describing the issues of leaves when using git with text-encoded geospatial data? That was such a great description of the problem space…
I believe that content was taken off the Boundless website (some parts were very outdated).
That explains why I couldn’t find it. It was a nice foundational piece though. Any equivalent explanations of the problem space?