Azavea is pleased to introduce the open source GeoTrellis geospatial data processing framework to the Eclipse community. Released under an Apache 2 license, GeoTrellis is a pure-Scala, open source project developed to support geospatial processing at both web-scale and cluster-scale. The framework was designed to solve three core problems, with an initial focus on raster processing:
- Create low latency, scalable geoprocessing web services;
- Create batch geoprocessing services that can act on large data sets by operating on a distributed architecture;
- Parallelize geoprocessing operations to take full advantage of multi-core architectures.
Azavea has been experimenting with techniques for accelerating processing of spatial data for almost 10 years. Early efforts focused on understanding and overcoming performance bottlenecks in a single type of raster processing activity, a weighted overlay operation that could support geographic prioritization. After creating an initial prototype application that supported real-time weighted overlay operations for prioritizing residential real estate decisions, Azavea received a small research and development grant under the U.S. Department of Agriculture’s Small Business Innovation Research (SBIR) grant program (#2006-33610-16777). This seed grant enabled the development of DecisionTree, a software framework for geographic prioritization. This early framework successfully supported real-time weighted overlay operations at the city and county scale, but the fact that it only supported a single Map Algebra operation limited its utility as a general framework.
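To make the weighted overlay idea concrete, here is a minimal sketch in plain Python. It is not DecisionTree or GeoTrellis code; the function name `weighted_overlay`, the layer names and the 0-10 scores are all assumptions made for illustration. Each output cell is the weighted average of the matching cells in the input layers.

```python
from typing import List, Sequence

Raster = List[List[float]]  # row-major grid of cell values


def weighted_overlay(layers: Sequence[Raster], weights: Sequence[float]) -> Raster:
    """Combine equally sized rasters cell by cell as a weighted average."""
    if not layers or len(layers) != len(weights):
        raise ValueError("need at least one layer and one weight per layer")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(w * layer[r][c] for layer, w in zip(layers, weights)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical suitability layers, each cell scored from 0 to 10.
transit_access = [[8, 2], [5, 9]]
school_quality = [[4, 6], [7, 1]]

# Transit access counts three times as much as school quality.
score = weighted_overlay([transit_access, school_quality], [3, 1])
```

Because every output cell depends only on the same cell in each input layer, changing a weight only requires re-running this one pass, which is what makes the operation a natural fit for interactive prioritization.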
In 2010, work on two projects (a sustainable transit web application for the William Penn Foundation and an educational game focused on watershed modeling for the Stroud Water Research Center) gave Azavea an opportunity to implement changes that would result in a more generic, low-latency geospatial data processing framework. While MapReduce and its Hadoop implementation were attracting a great deal of attention for distributed processing, we elected to take a different approach. Our primary use case was to provide real-time processing for web and mobile applications in which users could manipulate model parameters to generate new spatial data. Since then, the framework has been used to support a variety of applications, including planning, digital humanities, government infrastructure investment, and forest growth simulation and modeling. Recent work has also extended the framework to support machine learning applications for crime risk forecasting and low-latency processing of data streams. In 2011, Azavea decided to release the new software, now called GeoTrellis, as an open source project under an Apache 2 license. The project was submitted to the Eclipse LocationTech working group in late 2013.
While the original goal of the GeoTrellis project was to transform user interaction with geospatial data by bringing the power of spatial analysis to real-time, interactive web applications, this goal has recently been extended to include cluster-scale batch processing through the integration of Spark. The web-scale use case is fairly mature, and enables analysts to run operations with sub-second response times, ranging from simple raster math (*, +, /, etc.) to fairly sophisticated raster and vector operations like Kernel Density and Cost Distance. The cluster-scale use case is a relatively new effort to port GeoTrellis' rich library of algorithms to Spark, and thereby support both batch and interactive processing on large geospatial datasets such as satellite imagery. The Spark effort is a collaboration between Azavea and DigitalGlobe.
Transit Web App Example
Watershed Modeling Example
The core GeoTrellis framework processes both large and small data sets with low latency by distributing the computation across multiple threads, cores, CPUs and machines. After evaluating several languages and architectural approaches, Azavea selected Scala as the language and the Akka framework to implement an Actor model of distributed processing. GeoTrellis can rapidly process raster and vector data and distribute that processing, and it includes data import and conversion tools for an optimized raster data structure known as an ARG file. GeoTrellis is complementary to other open source geospatial projects such as GeoServer, OpenLayers and PostGIS.
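The general pattern behind distributing a raster computation can be sketched as: split the raster into tiles, apply the same operation to each tile independently, and stitch the results back together. The sketch below uses a thread pool from Python's standard library. It is not the Akka Actor model that GeoTrellis uses, and the helper names `split_rows`, `map_tiles` and `double` are hypothetical. In CPython, threads will not speed up pure-Python arithmetic, so read this as an illustration of the structure, not of the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Raster = List[List[float]]  # row-major grid of cell values


def split_rows(raster: Raster, tile_rows: int) -> List[Raster]:
    """Cut a raster into horizontal strips of at most tile_rows rows."""
    return [raster[i:i + tile_rows] for i in range(0, len(raster), tile_rows)]


def map_tiles(raster: Raster, tile_rows: int,
              op: Callable[[Raster], Raster], workers: int = 4) -> Raster:
    """Apply op to each tile on a worker pool, then reassemble the rows."""
    tiles = split_rows(raster, tile_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input tiles.
        results = list(pool.map(op, tiles))
    return [row for tile in results for row in tile]


def double(tile: Raster) -> Raster:
    """A per-cell operation: safe to run on any tile in isolation."""
    return [[2 * v for v in row] for row in tile]


raster = [[r * 4 + c for c in range(4)] for r in range(6)]
doubled = map_tiles(raster, tile_rows=2, op=double)
```

This only works unchanged for operations where each output cell depends on a single input cell. Neighborhood operations need each tile to carry a border of cells from adjacent tiles, which this sketch does not attempt.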
GeoTrellis spatial data processing is organized into operations. Following the formal Map Algebra nomenclature developed by C. Dana Tomlin, operations include Local, Focal and Zonal operations for raster data, a few vector operations and network operations. Multiple operations can be composed into Models. A geoprocessing model in GeoTrellis is composed of smaller geoprocessing operations with well-defined inputs and outputs.
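The three raster operation families can be illustrated with a small plain-Python sketch. These functions are not the GeoTrellis API; the names `local_add`, `focal_mean` and `zonal_sum`, and the sample grids, are assumptions chosen for illustration. A Local operation looks at one cell at a time, a Focal operation looks at a cell's neighborhood, and a Zonal operation summarizes cells grouped by a second raster of zone identifiers.

```python
from typing import Dict, List

Raster = List[List[float]]  # row-major grid of cell values


def local_add(a: Raster, b: Raster) -> Raster:
    """Local: each output cell depends only on the same cell in each input."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(raster: Raster) -> Raster:
    """Focal: mean of the 3x3 neighborhood, clipped at the raster edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def zonal_sum(values: Raster, zones: List[List[int]]) -> Dict[int, float]:
    """Zonal: total of the value cells that fall in each zone id."""
    totals: Dict[int, float] = {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            totals[zone] = totals.get(zone, 0) + value
    return totals


elevation = [[1, 2], [3, 4]]
rainfall = [[10, 20], [30, 40]]
zones = [[1, 1], [2, 2]]

# A tiny "model": the output of one operation is the input of the next.
combined = local_add(elevation, rainfall)
smoothed = focal_mean(combined)
per_zone = zonal_sum(combined, zones)
```

Because each function has a well-defined raster input and output, they chain together in the way the paragraph above describes for models.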
GeoTrellis is designed to help a developer create simple, standard REST services that return the results of geoprocessing models. Like an RDBMS that can optimize queries, GeoTrellis will automatically parallelize and optimize geoprocessing models where possible.
GeoTrellis Fast Geoprocessing
New Features and Documentation
For the recent GeoTrellis 0.9 release, the GeoTrellis documentation site was significantly revised. It includes both case studies and some samples that were developed since the 0.8 release. There is a full set of release notes available on the site, but major enhancements include:
- API Refactor: We are moving away from requiring users to manually create operations and pass in rasters as arguments. In 0.9, objects called “DataSources” represent the source of data, with the operations to transform or combine that data as methods on those objects.
- File I/O: Reading ARGs from disk is significantly faster. In some cases, improvements are an order of magnitude or more.
- Tile Operation Improvements: Running multiple operations over tiled data has been greatly improved.
- Clustering Improvements: Several steps have been taken to make it easier to distribute operations over a cluster using Akka clustering.
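The shift described in the API Refactor item, from passing rasters into standalone operations to calling methods on an object that represents the data, can be sketched as follows. This is a hypothetical Python illustration of the style only. The class `RasterSource` and its methods `local_map`, `local_combine` and `get` are invented for this example and are not the GeoTrellis 0.9 DataSource API.

```python
from typing import Callable, List

Raster = List[List[float]]  # row-major grid of cell values


class RasterSource:
    """Hypothetical data source: wraps a deferred way of producing a raster."""

    def __init__(self, load: Callable[[], Raster]) -> None:
        self._load = load

    def local_map(self, f: Callable[[float], float]) -> "RasterSource":
        """Return a new source that applies f to every cell when evaluated."""
        return RasterSource(lambda: [[f(v) for v in row] for row in self._load()])

    def local_combine(self, other: "RasterSource",
                      f: Callable[[float, float], float]) -> "RasterSource":
        """Return a new source that combines matching cells of two sources."""
        return RasterSource(
            lambda: [
                [f(a, b) for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(self._load(), other._load())
            ]
        )

    def get(self) -> Raster:
        """Evaluate the chain of transformations and return the raster."""
        return self._load()


# Hypothetical inputs; a real source would read from disk or a catalog.
slope = RasterSource(lambda: [[1, 2], [3, 4]])
landcover = RasterSource(lambda: [[10, 10], [20, 20]])

result = (
    slope.local_map(lambda v: v * 2)
    .local_combine(landcover, lambda a, b: a + b)
    .get()
)
```

Nothing is computed until `get` is called, so a framework built in this style has the whole chain available to optimize or parallelize before running it.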
GeoTrellis Cluster Processing
We have big plans for GeoTrellis that will extend its utility and make it easier to implement across a broad range of domains. Upcoming milestones will include:
- Integrate Apache Spark with GeoTrellis to allow for interactive analysis of terabyte raster data sets.
- Support for operating on data stored in the Hadoop Distributed File System (HDFS).
- Support for multi-band rasters.
- Develop a Scala wrapper for another Eclipse LocationTech project, the Java Topology Suite (JTS).
- Add more Map Algebra operations.
These milestones will move us closer to our long-term vision of a general-purpose, high-performance raster geoprocessing library and runtime designed to perform and scale for the web.