
Re: [geomesa-users] Improving geomesa performance with MapReduce

Hello,

To answer some of your questions:

* Accumulo doesn't really have any concept of a trigger. There are certain 'hacky' ways to approximate one (e.g. constraints), but they aren't recommended.
* GeoMesa has a concept of query interceptors[1], which let you rewrite a query with custom code. This may not be sufficient for your needs, as it doesn't let you directly change the return values, but it may be a useful integration point (a rough sketch follows this list).
* MapReduce jobs can be initiated in a variety of ways, but that is not really within the scope of GeoMesa. I'd refer you to the Hadoop documentation here.
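For illustration, here's roughly what an interceptor could look like in Java. The interface details (package, the init/rewrite/close signatures) and the registration mechanism are my reading of [1], so verify them against the docs; the time-window predicate and the "dtg" attribute are just a hypothetical reduction rule:

    import org.geotools.data.DataStore;
    import org.geotools.data.Query;
    import org.geotools.filter.text.ecql.ECQL;
    import org.locationtech.geomesa.index.planning.QueryInterceptor; // package per [1] - verify
    import org.opengis.feature.simple.SimpleFeatureType;
    import org.opengis.filter.Filter;

    // Rewrites every incoming query by AND-ing in an extra predicate.
    public class TimeWindowInterceptor implements QueryInterceptor {

        @Override
        public void init(DataStore ds, SimpleFeatureType sft) {
            // one-time setup, e.g. caching schema information
        }

        @Override
        public void rewrite(Query query) {
            try {
                // hypothetical rule: restrict every query to a fixed time window
                String extra = "dtg DURING 2019-10-01T00:00:00Z/2019-10-02T00:00:00Z";
                Filter current = query.getFilter();
                if (current == null || current == Filter.INCLUDE) {
                    query.setFilter(ECQL.toFilter(extra));
                } else {
                    query.setFilter(ECQL.toFilter(
                        "(" + ECQL.toCQL(current) + ") AND (" + extra + ")"));
                }
            } catch (Exception e) {
                throw new RuntimeException("failed to rewrite query", e);
            }
        }

        @Override
        public void close() {
            // release anything acquired in init
        }
    }

Per [1], you'd then register the class on your SimpleFeatureType's user data so the query planner picks it up; see the linked docs for the exact key.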

In general, I would suggest that you first consider returning data in the Apache Arrow[2] or custom GeoMesa 'binary'[3] formats. Either one can greatly reduce the bandwidth required to return a given result set, while still returning the same number of features. You may also want to consider feature sampling[4], which will reduce the total number of features returned.
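With the GeoTools API, you would typically request these through query hints. A minimal sketch follows; the hint accessors are my reading of [2] and [4], so double-check them against your GeoMesa version, and "mySchema" is a placeholder feature type name:

    import org.geotools.data.Query;
    import org.locationtech.geomesa.index.conf.QueryHints; // package per [4] - verify

    public class SampledQueryExample {

        // Builds a query that asks GeoMesa to sample (and optionally
        // re-encode) results server-side.
        public static Query buildSampledQuery() {
            Query query = new Query("mySchema");
            // return roughly 10% of matching features [4]
            query.getHints().put(QueryHints.SAMPLING(), 0.1f);
            // for Apache Arrow-encoded results [2], something like:
            // query.getHints().put(QueryHints.ARROW_ENCODE(), Boolean.TRUE);
            return query;
        }
    }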

If those options are insufficient, then I would suggest writing your data reduction as an Accumulo iterator or combiner, which will let you do map/reduce-style programming directly in Accumulo. It sounds like your data reduction depends on each query - if so, you'd need to modify the GeoMesa query planner in order to configure and invoke your iterators. If the data reduction can be done globally, then you can simply configure the iterators on your table directly, and they will be run for each query and compaction.
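To illustrate the Accumulo mechanism, here's a minimal combiner using only plain Accumulo APIs. It assumes string-encoded long values, so it's not something to attach to a GeoMesa index table as-is - a real version would need to decode GeoMesa's serialized features - and the class/table names are hypothetical:

    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;

    // Collapses all values for a given key into their sum, at scan and
    // compaction time. Assumes each value is a UTF-8 string-encoded long.
    public class SumValuesCombiner extends Combiner {

        @Override
        public Value reduce(Key key, Iterator<Value> iter) {
            long sum = 0L;
            while (iter.hasNext()) {
                sum += Long.parseLong(new String(iter.next().get(), StandardCharsets.UTF_8));
            }
            return new Value(Long.toString(sum).getBytes(StandardCharsets.UTF_8));
        }
    }

    // Attaching it globally to a (hypothetical) table:
    //   IteratorSetting setting = new IteratorSetting(15, "sum", SumValuesCombiner.class);
    //   Combiner.setCombineAllColumns(setting, true); // apply to all columns
    //   connector.tableOperations().attachIterator("myTable", setting);

Note that Accumulo also ships built-in combiners (e.g. the LongCombiner family) that cover common aggregations, which may save you writing your own.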

GeoMesa doesn't currently have an integration point for adding new iterators, but if you'd like to contribute something to that effect, it may make your solution more robust as the API would be well defined and 'officially' supported.

Hope that helps,

Emilio

[1]: https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-query-interceptors
[2]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#arrow-encoding
[3]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#binary-encoding
[4]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#feature-sampling

On 10/22/19 7:16 PM, Udstrand, Will M wrote:

Hey Emilio,

 

In our current setup we are using Accumulo as the backend database, and we are querying GeoMesa with the open-source API via org.geotools.* and org.opengis.*.

 

 

From: Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx>
Sent: Tuesday, October 22, 2019 11:22 AM
To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
Cc: Udstrand, Will M <Will.Udstrand@xxxxxxxxx>; Gorham, Kent <Kent.Gorham@xxxxxxxxx>; Wagner, Brett D <Brett.Wagner@xxxxxxxxx>
Subject: Re: [geomesa-users] Improving geomesa performance with MapReduce

 

Can you say more about your setup? What back-end database are you using? Are you using geoserver for querying, or something else?

Thanks,

Emilio

On 10/22/19 11:23 AM, Udstrand, Will M wrote:

Problem Description:

Currently our platform uses GeoMesa to store large amounts of geographic and time-sensitive metadata, and we are experiencing very poor performance (i.e. high latency) with the system's current configuration. The primary bottleneck is the large amount of data returned by GeoMesa, so we are actively pursuing avenues to reduce the size of the responses. We have been investigating the use of MapReduce within the system, but have run into some knowledge gaps due to the lack of documentation. The idea behind our MapReduce use case is to either intercept queries coming into our cluster, or periodically run jobs that combine and reduce the primary dataset and place the results into a separate table. Ideally we would intercept the queries, due to the complications of the data reduction: the reduction is dependent on the parameters of a query.

 

MapReduce Options

* Intercept queries coming into our cluster and have them trigger jobs that combine and reduce the query's raw metadata into a smaller set of formatted/processed data points, which is then returned to our backend services as the result of the query.
* Periodically, or on events such as a write to a table, trigger a job that processes and reduces the primary data set and writes the result to our new "query" table.

 

Questions

* Can MapReduce jobs be triggered by events in the database?
* Can one intercept the queries written to a GeoMesa instance?
* How are MapReduce jobs initiated, and can they be triggered programmatically?
* Can we send back the results of a MapReduce job as the result of a query?
* Are there any other options to reduce the latency incurred by large responses from the database?

 

We were hoping that you'd be able to give us some insight into our problem and help us assess the feasibility of our MapReduce and GeoMesa use case.

 





