The tilecache goldrush

Does anyone else see a problem with the trend of the past few years? “Tile-itis” is reaching critical mass and it is driving me bonkers. We’re taking styling, reprojection and tile-size choices away from users and giving them … tiles. No wait, fast tiles? Really? Oh, so I can put them on Google Maps? Awesome. Can I have them in projection X? No, sorry, we don’t have another terabyte to reseed the cache. Can I have just the streets? No, sorry, same problem.

Why do I seem to be the only one asking “wtf” when I see something like this at OAM:

This means, as a rule of thumb, that the network must store ((4/3) + 1) * 3 = 7 MB of imagery plus tiles for every 1 MB of source imagery uploaded. If we load up all of the approximately 4 TB of LandSat-7 data at a 30m resolution, and generate a complete tile set, we will need 16-28 TB of storage in the network to hold it all. If stored on EC2, this would cost up to US$3,000 per month — and that’s just for one layer at a low resolution.

Or when a user asks a simple question:

We want to serve the US NAIP Aerials in 1m resolution (which are a total of about 4.7 TB of MrSid/Jp2 data) on a interactive  web map as an optional map background. [sic] .. we determined early on is that MapServer is too slow to serve compressed imagery such as the native MrSid Jp2 imagery on the fly for our needs. [On using Mapserver to serve uncompressed tifs] … would also “blow up” the total data volume to something about 60 TB … Thus, we are in the process of researching options on how to serve the compressed data as fast as possible “on the fly” and without the need for caching them on disk

All the replies, except one from (somewhat ironically :)) Christopher Schmidt, ignore the initial constraint and instantly tell the user that a cache is required.

The root of the problem is the assumption that every organisation, every deployment, absolutely, unequivocally must create a tile-geo-arcgis-spatial-osm-mapproxy-squid-cache. We’ve gotta do what Google does! I truly fear many organisations are being misled and unnecessarily moved to tiling solutions that, quite frankly, they don’t need. More importantly though, GIS software representatives are using the community’s affinity (addiction?) for tiling everything to mask software that, quite frankly, performs poorly to begin with.

So let us all take a deeeep breath the next time we’re scoping out an imagery solution. Why do you need a tile cache? It’s great that your cache can max out a 100 Mbit connection (it’s not hard), but you’ve not only increased your storage requirements by a factor of 4, 8 or 20, you’ve also taken functionality away from your customers and limited yourself to one convention.

If you do need a cache, and by crikey they are needed in many situations, implement an LRU or hybrid cache solution, but most importantly, give your customers the original WMS service as well. For all its warts, at least it gives them some options.
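
To make the LRU suggestion concrete, here is a minimal sketch in Python of a cache that keeps only the most recently used tiles and renders everything else on demand. The class name `LRUTileCache`, the `fake_render` function and the key layout are hypothetical and exist only for illustration; in a real deployment the renderer would be a request to the original WMS service, and nothing here reflects how any particular product does it.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class LRUTileCache:
    """Hold at most max_tiles tiles; render anything else on demand."""

    def __init__(self, render: Callable[[Hashable], bytes], max_tiles: int = 1024):
        self._render = render        # hypothetical on-demand renderer
        self._max_tiles = max_tiles
        self._tiles: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> bytes:
        if key in self._tiles:
            # Cache hit: mark the tile as most recently used.
            self._tiles.move_to_end(key)
            self.hits += 1
            return self._tiles[key]
        # Cache miss: render on demand, store, then evict the oldest if full.
        self.misses += 1
        tile = self._render(key)
        self._tiles[key] = tile
        if len(self._tiles) > self._max_tiles:
            self._tiles.popitem(last=False)  # drop the least recently used
        return tile


def fake_render(key) -> bytes:
    """Stand-in for a real on-demand renderer (assumption, not a real API)."""
    layer, projection, z, x, y = key
    return f"{layer}/{projection}/{z}/{x}/{y}".encode()


cache = LRUTileCache(fake_render, max_tiles=2)
cache.get(("streets", "EPSG:3857", 3, 1, 2))  # miss
cache.get(("streets", "EPSG:3857", 3, 1, 3))  # miss
cache.get(("streets", "EPSG:3857", 3, 1, 2))  # hit
cache.get(("streets", "EPSG:4326", 3, 1, 2))  # miss, evicts the (3, 1, 3) tile
print(cache.hits, cache.misses)
```

Because the layer and projection are part of the key, a request for "just the streets" or "projection X" is simply another cache miss that gets rendered on demand, rather than a reason to reseed a terabyte of tiles.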

So to answer both quotes above,

  1. Stored as a single ECW compressed at 1:20, the 4 TB of uncompressed 30 m Landsat 7 data for the whole world comes to approx. 200 GB, is visually lossless and costs about $30 per month to store on Amazon S3. As some examples, I have the following 3-band mosaics:
    1. Landsat742.ecw, 1,414,317 px x 534,778 px, which totals 2,515,088 KB (yes, that’s ~2.5 GB). Did I mention this was created way back in 2003?
    2. Melbourne.ecw, 413,333 px x 346,667 px, which totals 30,626,916 KB or ~30 GB, from our friends at SKM Ausimage
    3. Metro_Central_2007_Mosaic.ecw, 224,100 px x 304,400 px, which totals ~11.5 GB, from Landgate
  2. ERDAS Apollo can serve all these mosaics as 256 px tiles on demand and still max out the 100 Mbit network, no problems. To prove it, I ran our tiling test tool over a gigabit connection back to Apollo and measured the throughput over a short 180-second test plan:
    1. Landsat.ecw
      1. Random: 31837 tiles, avg 181.79 tiles per second, RT 0.03 seconds, throughput 15.2 MB / sec
      2. Sequential: 60673 tiles, avg 314.41 tiles per second, RT 0.02 seconds, throughput 26.65 MB / sec
    2. Melbourne.ecw
      1. Random: 10286 tiles, avg 109.92 tiles per second, RT 0.05 seconds, throughput 13.43 MB / sec
      2. Sequential: 39980 tiles, avg 230.25 tiles per second, RT 0.02 seconds, throughput 34.89 MB / sec
    3. Metro_Central_2007_Mosaic.ecw
      1. Random: 35585 tiles, avg 203.18 tiles per second, RT 0.02 seconds, throughput 33.15 MB / sec
      2. Sequential: 47191 tiles, avg 271.19 tiles per second, RT 0.02 seconds, throughput 51.12 MB / sec
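
As a quick sanity check on the "max out the 100 Mbit network" claim, the figures above can be converted to megabits per second. This is only a sketch of the arithmetic: it assumes 1 MB = 1,000,000 bytes and 8 bits per byte, and the function names are made up for illustration. The numbers are copied from the results above.

```python
LINK_MBIT = 100  # the 100 Mbit network mentioned in the post

# (mosaic, access pattern, tiles per second, throughput in MB/s)
results = [
    ("Landsat.ecw", "random", 181.79, 15.2),
    ("Landsat.ecw", "sequential", 314.41, 26.65),
    ("Melbourne.ecw", "random", 109.92, 13.43),
    ("Melbourne.ecw", "sequential", 230.25, 34.89),
    ("Metro_Central_2007_Mosaic.ecw", "random", 203.18, 33.15),
    ("Metro_Central_2007_Mosaic.ecw", "sequential", 271.19, 51.12),
]


def to_mbit(mb_per_sec: float) -> float:
    """Convert MB/s to Mbit/s (assumes 8 bits per byte)."""
    return mb_per_sec * 8


def avg_tile_kb(mb_per_sec: float, tiles_per_sec: float) -> float:
    """Average tile size in KB implied by throughput and tile rate."""
    return mb_per_sec * 1000 / tiles_per_sec


for mosaic, pattern, tiles_per_sec, mb_per_sec in results:
    mbit = to_mbit(mb_per_sec)
    print(
        f"{mosaic:32s} {pattern:10s} {mbit:7.1f} Mbit/s "
        f"({mbit / LINK_MBIT:.1f}x a 100 Mbit link), "
        f"~{avg_tile_kb(mb_per_sec, tiles_per_sec):.0f} KB per tile"
    )
```

Under those assumptions even the slowest run (Melbourne.ecw, random access) exceeds what a 100 Mbit link could carry, which is why the test was run over gigabit.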

So instead of looking purely at the throughput of the tile cache server (which has been proven to be a fizzer), if we also take the storage requirements into account and plot the two variables, I know which one I’d choose. That ERDAS Apollo license is looking pretty damn attractive right now, isn’t it … isn’t it *starts shaking*?
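
The storage side of that comparison is simple arithmetic, sketched below in Python using only numbers already quoted in this post: the OAM rule of thumb, the 4 TB Landsat 7 source and the 1:20 ECW compression. It assumes decimal units (1 TB = 1000 GB); the variable names are illustrative, and the per-GB price is merely what the $30 per month figure implies, not a current Amazon rate.

```python
from fractions import Fraction

SOURCE_TB = 4          # uncompressed Landsat 7 source imagery, from the quote
ECW_RATIO = 20         # 1:20 ECW compression, from the post
GB_PER_TB = 1000       # assumption: decimal units

# OAM rule of thumb: ((4/3) + 1) * 3 MB stored per 1 MB of source imagery.
cache_multiplier = (Fraction(4, 3) + 1) * 3
full_cache_tb = SOURCE_TB * cache_multiplier

# Single compressed ECW mosaic served on demand.
ecw_gb = Fraction(SOURCE_TB * GB_PER_TB, ECW_RATIO)

# How many times more storage the full tile set needs.
storage_ratio = full_cache_tb * GB_PER_TB / ecw_gb

# Per-GB monthly price implied by "$30 per month" for the ECW (derived, not quoted).
implied_usd_per_gb = Fraction(30) / ecw_gb

print(f"full tile set: {float(full_cache_tb):.0f} TB")
print(f"single ECW:    {float(ecw_gb):.0f} GB")
print(f"ratio:         {float(storage_ratio):.0f}x")
print(f"implied price: ${float(implied_usd_per_gb):.2f} per GB per month")
```

Exact fractions are used so that the rule of thumb comes out to exactly 7 rather than a floating-point approximation.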

What I also find interesting is that there seems to be a slight resurgence of on-demand solutions once users, invariably, realise the scalability or flexibility issues of full tile caches. JPEG2000 is certainly making a comeback for image serving, but don’t forget that Kakadu has the same licensing restriction as the ECWJP2 SDK; it ain’t free-as-in-beer either. OSM’s mod_tile is also a good example of a hybrid solution with on-demand rendering.

ps. Has anyone tested beyond 100 Mbit on any other tiling solution?

pps. ERDAS has its own tiling container format, known as OTDF. This is for our most demanding customers, who need performance beyond the figures above.

5 thoughts on “The tilecache goldrush”

  1. crschmidt’s answer still involves storing the imagery as uncompressed TIFFs, which would still inflate the on-disk size to something quite large given ~5 TB of original compressed imagery.

    The ‘truth’ is that it’s going to cost you $ either way, whether it’s for hardware (read: disk) or licenses, and there are valid reasons for doing it either way.

  2. That’s true Jeffrey, but his solution is still an order of magnitude better than using a full cache.

    And yes, the truth is $ either way (no such thing as a free lunch!). Although I can almost guarantee you that the cost of an Apollo license would be offset extremely quickly by the cost of purchasing or paying for the equivalent storage with the alternative solution. So really, the scatterplot should be three dimensional: tile throughput vs storage cost (size) vs hardware/software cost.

    I’ll put my money on software winning out by a significant margin. Any takers? :-)

  3. It depends on too many variables to say that software will always win, the key factors being what the actual application is, and what kinds of data and how much of it are involved. There are very good reasons that Google (and Bing etc.) serve up tiles into Earth and Maps and don’t operate from the compressed data.
