Cloud-Native Geo-Indexing with LocationTech GeoMesa
Scaling LocationTech Intelligence with Cloud-Native Technologies
We live in an increasingly data-rich, sensor-observed world. A growing number of companies are gathering, selling, analyzing, and fusing streams of IoT data. Every day, the number of streams increases and the volume and velocity of those streams increases. This leads to a world where one needs to manage not only complex data feeds, but also limit and control costs for data warehouing and analysis.
Commercial cloud providers like Amazon Web Services, Google Cloud, and Microsoft Azure are competing to provide novel approaches to easing transitions to the cloud. Cloud-native storage like Amazon S3, Google Cloud Filestore, and Azure Blob Storage reduce the cost to house terabytes and petabytes of data in cloud-based systems. For on-premise solutions, large organizations run their own data centers to manage their protected customer and commercial data sources. As they do so, many have implemented robust filesystems.
While much of the gathered data has a geospatial component, most cloud-native solutions offer little in terms of spatial indexing. LocationTech GeoMesa is a collection of tools for managing, streaming, persisting, analyzing, and visualizing volumes of spatiotemporal data. GeoMesa has kept pace with the industry by integrating with both commercial cloud providers and solutions commonly used on-premise.
What is LocationTech GeoMesa?
Previous Eclipse newsletter articles have described GeoMesa’s geospatial indexing on distributed databases (such as Apache Accumulo and HBase), streaming using Kafka, and how GeoMesa is used to manage satellitle AIS data and analyze it with Spark SQL.
The broader GeoMesa ecosystem includes projects which integrate GeoMesa's powerful indexing and analysis capabilities with Apache NiFi (for data management) and GeoServer for OGC web service access. NiFi provides a resilient system for data management in an enterprise. It has numerous processors for reading and writing data from data streams and common databases. GeoServer is an open-source web server designed to make accessing geospatial data easy using open standards.
By leveraging NiFi for data management, Kafka for streaming data, HBase for data storage, GeoServer for the web tier, and OpenLayers client-side, one can build a powerful lambda architecture which allows end users to visualize, inspect, and analyze real-time and historical data.
GeoMesa FileSystem DataStore
For users who already manage Accumulo or HBase, adopting GeoMesa is easy; installing GeoMesa’s distributed runtime is quick. After that, they need to select and run a collection of additional services to create a solution that addresses their needs.
On the other hand, if one isn’t already an expert at managing a distributed database, it is often desirable to design architectures around more accessible resources. The GeoMesa FileSystem DataStore fits this use case by adding configurable, spatiotemporal indexing to any file system compatible with Hadoop.
Such file systems include classical, on-premise solutions like HDFS, Swift, and Ceph as well as offerings like Amazon S3, Azure Blob Storage, and Google Cloud Filestore. As part of the GeoMesa ecosystem, the FileSystem DataStore has rich command line tools and integrations with GeoServer and Spark SQL.
Since the FileSystem DataStore requires no additional services, its overhead to setup is incredibly low. The FileSystem DataStore achieves this by decoupling storage and query resources as well as leveraging leading big-data file format projects like Apache Orc and Parquet.
GeoMesa HBase deployed on EMR with S3
The GeoMesa ecosystem is designed to accelerate many user-facing analysis and visualization workflows. To do this, the GeoMesa team has constructed complex, server-side optimizations with Accumulo iterators and HBase coprocessors. While the FileSystem DataStore answers the mail around ease of deployment and low cost, it doesn’t provide access to these tools.
For users looking for a quick and cost-effective deployment on a distributed database, LocationTech GeoMesa can be deployed on HBase over S3 with Amazon EMR*. Leveraging EMR, one can scale an HBase cluster up and down to meet changing needs. Ad hoc EMR clusters can also provide access to Spark clusters of whatever scale is needed to complete batch analysis leveraging data in the GeoMesa HBase tables.
* Some Accumulo users have had success using Accumulo on Microsoft Azure using Azure BlobStorage.
Cost-effective analysis
Both the FileSystem DataStore and the HBase DataStore provide tools for managing volumes of geospatial data while keeping costs under control. A new set of features allows one to get the best of both worlds by using Accumulo or HBase together with the FileSystem DataStore.
The first feature is the Merge DataStore View. This lets one configure multiple GeoTools datastores into one consistent view in GeoServer and Spark. Since this works at the GeoTools DataStore level, one could provide a view across traditional data sources like PostGIS or Shapefiles in addition to any of the GeoMesa DataStores.
The second feature is the ability to manage time-based partitions for an HBase or Accumulo DataStore. As an example, one could create separate partitions for each month of data. Distributed databases maintain write ahead queues, logs, and memory with the assumption that new data could come at any point. For IoT data, typically there is a window of time where new data arrives and then it is not updated. Having time-based partitions allow for old, mostly-immutable tables to be optimized for reading. In addition to optimizing for future writes, old tables can be moved into cheaper ‘cold’ storage.
Combining these features with the HBase on S3 and the FileSystem DataStore allows for a powerful, flexible, low-cost architecture which optimizes common user queries and allows for batch analysis.
For IoT data, user-driven visualizations tend to focus on more recent data. The most recent weeks or months of data can be stored in a time-partitioned HBase. Older partitions can be aged-off into the FileSystem DataStore. Using the Merge DataStore View, one can access recent data quickly via HBase and still see a consistent view of all the data in the enterprise.
Conclusion
LocationTech GeoMesa was born out of a need for high-performance spatiotemporal indexing on distributed databases. Over the years, the project has pushed the envelope for ease of use by integrating with existing technologies like GeoServer, NiFi, and Spark. The GeoMesa team succeeded in delivering a set of libraries that can be formed into solutions which fit into many varied deployment environments and scale to match any budget.
Visit geomesa.org to learn more about GeoMesa. Join our email list or chat with us on Gitter to ask questions and join the community. Developers can jump into the code at GitHub.
Additional information
Learn more about:
- GeoMesa on HBase EMR
- GeoMesa FileSystem DataStore
- GeoMesa DataStore Partitioning
- GeoMesa Merge DataStore View