
Re: [geomesa-users] Improving geomesa performance with MapReduce

Generally, when trying to debug a query result, you can get a lot of insight from enabling explain query logging[1]. In your case, by default GeoMesa creates 2 indices, a spatial z2 index and a spatio-temporal z3 index. Because space is a constrained value, we can represent the entire world as a single range. However, because time is open-ended, we have to 'bin' the index by time periods (weeks by default). When scanning a large time period, you end up having to scan each time bin. This can lead to significant overhead in query times, even if there is no data, as we still have to construct the query ranges and send them to Accumulo. There are a few things you can do to mitigate this:

* You can create an attribute index[2] on your date field, at the cost of increasing your size on disk and decreasing overall write speeds. The attribute index key is optimized for that type of query.
* You can increase the time period[3], which will reduce the number of time bins scanned. Generally, you want to align your time period with the range of data you expect to query.
* You can reduce the range decomposition[4] for a query. Having fewer, broader ranges can slow down scans, since more false positives have to be filtered out, but it will reduce the overhead involved with sending many ranges to Accumulo. (The first three options in this list are applied in the sketch below.)
* If your data is well-known, you can apply date filters to each query that define your data bounds. In GeoServer, you can do this through configuring default layer filters.
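
For reference, explain logging is typically enabled by setting the org.locationtech.geomesa.index.utils.Explainer logger to TRACE, per [1]. And here is a rough, untested Java sketch of the first three options above - the type name "observations" and the attribute names are placeholders, and the user-data keys come from the linked docs, so double-check them against your GeoMesa version:

import org.geotools.data.DataStore;
import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
import org.opengis.feature.simple.SimpleFeatureType;

public class IndexTuningSketch {
    public static void createTunedSchema(DataStore ds) throws Exception {
        // 'index=true' requests an attribute index on the date field [2];
        // "observations", "dtg" and "geom" are placeholder names
        SimpleFeatureType sft = SimpleFeatureTypes.createType("observations",
            "dtg:Date:index=true,*geom:Point:srid=4326");

        // widen the z3 time bins from the default of one week [3]
        sft.getUserData().put("geomesa.z3.interval", "month");

        ds.createSchema(sft);

        // target fewer, broader scan ranges per query [4];
        // note this is a JVM-wide system property
        System.setProperty("geomesa.scan.ranges.target", "100");
    }
}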

Thanks,

Emilio

[1]: https://www.geomesa.org/documentation/user/datastores/query_planning.html#explaining-query-plans
[2]: https://www.geomesa.org/documentation/user/datastores/index_basics.html#attribute-index
[3]: https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-z-index-time-interval
[4]: https://www.geomesa.org/documentation/user/datastores/runtime_config.html#geomesa-scan-ranges-target

On 10/25/19 12:34 PM, Gorham, Kent wrote:

Hello Emilio,

 

Thank you for the information.  I’ve investigated some of those avenues, but I have also been performing additional tests and do not understand the results.

 

We have populated a GeoMesa/Accumulo database (with 8 nodes) with 3,110,400 records (360*180*48). There are 48 datapoints recorded at every point on the planet, incrementing in time by 1 millisecond each. We have both point and time as part of our feature type.

For clarity, location -180,-90 will contain a datapoint with a time value of 0, and location 180,90 will contain a time value of 3,110,399.

 

We wrote a test to retrieve the data in various ways (by location only, by time only, and by both time and location).

 

For location, as we approach a zero-sized search area, the time for the search approaches zero. See 'location only.png': the blue line is the number of records returned, and the orange line is the time in milliseconds to do the search. The search area starts at (-180,-90,180,90) and decreases by one degree of longitude with each search. We only performed 300 searches. This graph makes sense.

NOTE: The iterator to read the records returned simply counts the records, so it is not a factor.

 

However, when we search by time only (see time-only.png), there seems to be some significant overhead associated with performing the search. In this test we reach zero because we have calculated the appropriate increment for each search (e.g. 0 to 3,110,400 milliseconds for the first search, with milliseconds converted directly to Dates, and 10,368 to 3,110,400 for the second). We have still only performed 300 iterations in the test loop.
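
(For reference, each time-only search is built roughly like the following GeoTools snippet, where "observations" and "dtg" stand in for our actual feature type name and date attribute:)

import org.geotools.data.Query;
import org.geotools.filter.text.cql2.CQLException;
import org.geotools.filter.text.ecql.ECQL;

public class TimeOnlySearch {
    // first iteration: the full window, 0 to 3,110,400 ms after the epoch (UTC)
    public static Query firstSearch() throws CQLException {
        return new Query("observations", ECQL.toFilter(
            "dtg DURING 1970-01-01T00:00:00.000Z/1970-01-01T00:51:50.400Z"));
    }
}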

 

Also, if we perform a time search over a range that does not contain any data (e.g. from 3,110,401 to NOW), the system still takes a couple of seconds (~2.4 seconds) to return zero results.

 

We realize that there is a significant amount of time between 3,110,400 (12-31-1969 18:51:50) and NOW, even though there's no data there (but possibly index entries exist?). We wonder if that is part of the problem.

 

We would like to understand this overhead that occurs with a temporal search.

Would you be able to explain it, or is there a good way to diagnose it?

 

Thanks,

Kent


From: Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx>
Sent: Thursday, October 24, 2019 10:42 AM
To: Udstrand, Will M <Will.Udstrand@xxxxxxxxx>; geomesa-users@xxxxxxxxxxxxxxxx
Cc: Gorham, Kent <Kent.Gorham@xxxxxxxxx>; Udstrand, Will M <Will.Udstrand@xxxxxxxxx>
Subject: Re: [geomesa-users] Improving geomesa performance with MapReduce

 

Hello,

To answer some of your questions:

* Accumulo doesn't really have any concept of a trigger. There are certain 'hacky' ways to approximate one (e.g. constraints), but they aren't recommended.
* GeoMesa has a concept of query interceptors[1], which let you rewrite a query with custom code. This may not be sufficient for your needs, as it doesn't let you directly change the return values, but it may be a useful integration point (see the sketch after this list).
* MapReduce jobs can be initiated in a variety of ways, but that is not really within the scope of GeoMesa. I'd refer you to the Hadoop documentation here.
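
As a rough, untested sketch of that interceptor integration point (the interface below follows the docs in [1]; interceptors are registered through the 'geomesa.query.interceptors' user-data key on your schema, and the rewrite shown is just an example):

import org.geotools.data.DataStore;
import org.geotools.data.Query;
import org.locationtech.geomesa.index.planning.QueryInterceptor;
import org.opengis.feature.simple.SimpleFeatureType;

public class BoundsInterceptor implements QueryInterceptor {

    @Override
    public void init(DataStore ds, SimpleFeatureType sft) {
        // no per-schema state needed for this example
    }

    @Override
    public void rewrite(Query query) {
        // runs before query planning - e.g. cap the result size,
        // tighten the filter, or attach hints
        query.setMaxFeatures(10000);
    }

    @Override
    public void close() {
    }
}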

In general, I would suggest that you first consider returning data in the Apache Arrow[2] or custom GeoMesa 'binary'[3] formats. Either one can greatly reduce the bandwidth required to return a given result set, while still returning the same number of features. You may also want to consider feature sampling[4], which will reduce the total number of features returned.
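
For example, in Java those options are set as hints on the GeoTools Query - a rough sketch, assuming the hint keys from the linked docs (exact names may vary slightly across GeoMesa versions):

import org.geotools.data.Query;
import org.locationtech.geomesa.index.conf.QueryHints;

public class ReducedOutput {
    public static Query reduceOutput(Query query) {
        // return Arrow-encoded batches instead of individual features [2]
        query.getHints().put(QueryHints.ARROW_ENCODE(), Boolean.TRUE);
        // return ~10% of matching features via server-side sampling [4]
        query.getHints().put(QueryHints.SAMPLING(), 0.1f);
        return query;
    }
}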

If those options are insufficient, then I would suggest writing your data reduction as an Accumulo iterator or combiner, which will let you do map/reduce-style programming directly in Accumulo. It sounds like your data reduction depends on each query - if so, you'd need to modify the GeoMesa query planner in order to configure and invoke your iterators. If the data reduction can be done globally, then you can simply configure the iterators on your table directly, and they will be run for each query and compaction.
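
For illustration, a bare-bones combiner might look like the sketch below. This one only counts values - GeoMesa rows store serialized SimpleFeatures, so a real reduction would first need to decode the values with GeoMesa's feature serializers:

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

public class CountingCombiner extends Combiner {
    @Override
    public Value reduce(Key key, Iterator<Value> iter) {
        // collapse all values sharing a row/column into a single count
        long count = 0;
        while (iter.hasNext()) {
            iter.next();
            count++;
        }
        return new Value(Long.toString(count).getBytes(StandardCharsets.UTF_8));
    }
}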

GeoMesa doesn't currently have an integration point for adding new iterators, but if you'd like to contribute something to that effect, it may make your solution more robust as the API would be well defined and 'officially' supported.

Hope that helps,

Emilio

[1]: https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-query-interceptors
[2]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#arrow-encoding
[3]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#binary-encoding
[4]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#feature-sampling

On 10/22/19 7:16 PM, Udstrand, Will M wrote:

Hey Emilio,

 

In our current setup we are using Accumulo as the backend database, and we are querying GeoMesa with the open-source API via org.geotools.* and org.opengis.*.


From: Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx>
Sent: Tuesday, October 22, 2019 11:22 AM
To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
Cc: Udstrand, Will M <Will.Udstrand@xxxxxxxxx>; Gorham, Kent <Kent.Gorham@xxxxxxxxx>; Wagner, Brett D <Brett.Wagner@xxxxxxxxx>
Subject: Re: [geomesa-users] Improving geomesa performance with MapReduce

 

Can you say more about your setup? What back-end database are you using? Are you using geoserver for querying, or something else?

Thanks,

Emilio

On 10/22/19 11:23 AM, Udstrand, Will M wrote:

Problem Description:

Currently in our platform we are using GeoMesa to store large amounts of geographic and time-sensitive metadata, and we are experiencing very poor performance (i.e. high latency) with the system's current configuration. The primary bottleneck is the large amount of data returned by GeoMesa, so we are actively pursuing avenues to reduce the size of the responses. We have been investigating the use of MapReduce within the system, but have run into some knowledge gaps due to the lack of documentation. The idea behind our MapReduce use case is either to intercept queries coming into our cluster, or to run periodic jobs that combine and reduce the primary dataset and place the results into a separate table. Ideally we would intercept the queries, since the reduction depends on the parameters of each query.

 

MapReduce Options

* When intercepting queries coming into our cluster, we'd have them trigger jobs that combine and reduce the query's raw metadata into a smaller set of formatted/processed data points, which would then be returned to our backend services as the result of the query.

* Periodically, or on events such as a write to a table, trigger a job that processes and reduces the primary dataset and writes the result to our new "query" table (sketched below).
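
(Roughly, we imagine the periodic job being wired up like the sketch below. This assumes GeoMesa's GeoMesaAccumuloInputFormat and the configure helper shown in the GeoMesa MapReduce examples; the exact class location and method signatures may differ by version, and "periodic-reduce" is just a placeholder job name. We'd then launch it programmatically with job.waitForCompletion(true), or on a schedule.)

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.geotools.data.Query;
import org.locationtech.geomesa.jobs.mapreduce.GeoMesaAccumuloInputFormat;

public class PeriodicReduceJob {
    // dsParams: the same connection parameters passed to DataStoreFinder
    public static Job build(Map<String, String> dsParams, Query query) throws Exception {
        Job job = Job.getInstance(new Configuration(), "periodic-reduce");
        job.setJarByClass(PeriodicReduceJob.class);
        job.setInputFormatClass(GeoMesaAccumuloInputFormat.class);
        // reads the SimpleFeatures matching 'query' as mapper input
        GeoMesaAccumuloInputFormat.configure(job, dsParams, query);
        // the mapper/reducer and the output to the "query" table would be set here
        return job;
    }
}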

 

Questions

* Can MapReduce jobs be triggered by events in the database?

* Can one intercept the queries written to a GeoMesa instance?

* How are MapReduce jobs initiated, and can they be triggered programmatically?

* Can we send back the results of a MapReduce job as the result of a query?

* Are there any other options to reduce the latency incurred by large responses from the database?

 

We were hoping that you'd be able to give us some insight into these problems, and some additional help on the feasibility of our MapReduce and GeoMesa use case.

 




