One thing I've learned in the last few years is that there is no theoretical reason why properly indexed and summarized data cannot be displayed as quickly at all scales as at any one scale. This is the principle behind the extraordinary efficiency of delivering data and imagery through slippy-map interfaces like Google Maps, Bing Maps, and OpenLayers, as well as efficient thick-client interfaces like Google Earth. So, in principle, and largely in practice, serving spatial data for a whole county or the whole world shouldn't be any more onerous than serving data for a particular site, so long as you have adequate storage for the pre-rendered and summarized data, and time to pre-render it. Since storage tends to be cheaper than processing and network speed, this is a no-brainer.
A number of great Open Source tools exist to help with serving large amounts of data efficiently, not the least of which is my favorite, GeoServer (paired with GeoWebCache). For serving imagery, in our case orthorectified 6-inch (0.1524 meter) aerial imagery, we have a few options. GeoServer does natively support GeoTIFF, but for this large an area at this level of detail we'd have to wade into the realm of BigTIFF support through the GDAL extension, because we have 160GB of imagery to serve. We could use wavelet-compressed imagery, e.g. MrSID, ECW, or JPEG2000, but I don't have a license to create a lossless version of these, and besides, storage is cheaper than processors; wavelet-compressed imagery may be a good field solution, but for server-side work it doesn't make a lot of sense unless it's all you have available. Finally, there are two data source extensions to GeoServer meant for large imagery: the ImageMosaic Plugin and the ImagePyramid Plugin. The ImageMosaic Plugin works well for serving large numbers of images, and has some great flexibility with respect to handling transparency and image overlap. The ImagePyramid extension is tuned for serving imagery at many scales. The latter is what we chose to deploy.
The ImagePyramid extension takes advantage of gdal_retile.py, a utility built as part of GDAL that takes an image or set of images and re-tiles them to a standardized size (e.g. 2048×2048) and creates overviews as separate images in a hierarchy (here shown as outlines of the images):
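As a rough illustration, the retiling step looks something like the following; the paths, level count, and tile size here are assumptions for the sketch, not the exact command we ran. GeoServer's ImagePyramid store is then pointed at the output directory.

    # Re-tile the source orthos into 2048x2048 GeoTIFF tiles and build 4 levels of
    # overview tiles in a directory hierarchy that the ImagePyramid store can read.
    gdal_retile.py -v -levels 4 -ps 2048 2048 \
        -targetDir pyramid/ \
        source_orthos/*.tif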
But here's the problem: for some reason, I can't load all the images at once. If I do, only the low-resolution pyramid levels (8-foot pixels and larger) load. If I break the area into smaller chunks, most of them fewer than 2,000 images each, they load fine.
I think the following snippet is fairly naive and without much merit to be honest.
{quote}We could use wavelet-compressed imagery, e.g. MrSID, ECW, or JPEG2000, but I don't have a license to create a lossless version of these, and besides, storage is cheaper than processors; wavelet-compressed imagery may be a good field solution, but for server-side work it doesn't make a lot of sense unless it's all you have available{quote}
* Wavelet-compressed formats are absolutely useful for server-side work
* JP2 and MrSID have both lossy and lossless compression options; ECW is lossy only
* The "storage is cheap" argument doesn't really fly or scale. Enterprise-level storage is still expensive; slow storage is cheap
* Reading off disk is the bottleneck for most server deployments. For well-architected software, efficient in-memory caching of compressed wavelet imagery means you can store more data in memory, reducing the disk I/O
* Wavelet compression yields better image quality than JPEG compression (which I assume you are using in the TIFF tiles). Furthermore, the quality will be consistent across the whole mosaic regardless of scale. Using nearest-neighbor resampling for your overviews will give artifacts that may appear, then disappear, depending on scale. Not a great user experience IMO, but it kinda depends how picky you are
Full disclosure: I work at ERDAS … but I'd definitely encourage you to qualify some of your assumptions, as you may be surprised 🙂
As to enterprise storage, I think most folks have drunk the appliance Kool-Aid. Shops used to large IT budgets, and to the need for them, saw storage appliances come online just as virtualization seemed to demand them, and the conversation switched from how inexpensive storage was becoming back to belly-aching about how much we spend on storage, instead of creating dedicated NFS shares on decent, or even great, non-appliance, non-proprietary hardware. But this is the view from the outside. So far, I haven't had to deliver to mass numbers of users, so I'll reserve this tirade/apology for when/if I have numbers to back it up, or find someone who has run the numbers.
As to how I have it deployed: I use a small tile size, a small blocksize within the TIFFs, and (gasp) no compression. Overviews are calculated at power-of-two scales with bilinear resampling, which removes the nearest-neighbor issues and scales quite nicely with low processing overhead on the prep side. I've tried all the more complicated resampling algorithms, and while bilinear isn't very nice at non-power-of-two scales, working four cells at a time it looks as smooth as anything. The data are well structured spatially to maximize delivery speed and reduce search time.
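For the curious, that tuning corresponds roughly to flags along these lines; the specific tile size, blocksize, and level count are illustrative guesses rather than our exact values.

    # Smaller tiles, small internal TIFF blocks, no compression (the GTiff default since
    # no COMPRESS creation option is given), and bilinear resampling for the
    # power-of-two overview levels.
    gdal_retile.py -v -levels 6 -r bilinear \
        -ps 512 512 \
        -co "TILED=YES" -co "BLOCKXSIZE=256" -co "BLOCKYSIZE=256" \
        -targetDir pyramid/ \
        source_orthos/*.tif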
I have yet to see I/O bottlenecks in my GeoServer deployments, and we essentially simulate really high-demand scenarios by letting users print through the MapFish protocol, so I'm trying to envision a scenario in which having all of my data in memory would be beneficial. Any of my data can be accessed quickly not because it's nearer to the processor (latency), but because it's well structured: the appropriate scale/resolution of data is available for any given area and any given application. I could see the potential for data in memory if I were doing serious calculations on the raster values, or otherwise rendering them, but then I certainly wouldn't want a lossy source.
I'm assuming, given your disclosure, that ERDAS has a solution which uses the architecture you describe. What would be meaningful would be a proper Image Service Shootout, à la the FOSS4G MapServer/GeoServer shootouts of yore. I wouldn't be the one to talk to about that, but someone on the GeoServer team might be interested. I think you may be surprised at the performance of software not protected by patents and IP. 🙂
I did a quick search for FOSS4G shootouts of yore, and it looks like ERDAS Apollo was a participant last year, so props for that. Now that I've looked, I see, Chris, that you are the PM for that project. Unfortunately, we don't know the outcome of that shootout, as "They're refusing to publish results because CPU-bound results cannot be compared with Disk-bound results. Yeah, contestants can step aside from displaying results for any reason they see fit." This may be a fair reason, given the discussion above, but it does leave one wanting more info.
http://slashgeo.org/2010/09/08/Words-FOSS4G-2010-and-Famous-WMS-Shoutout-Winner
It looks like ERDAS will not be participating this year:
http://2011.foss4g.org/sessions/web-mapping-performance-shootout
My apologies, Chris, for telling you about an event you already know about. Maybe next year ERDAS will participate again?
Useful article. Thanks for posting.