Hello Emilio,
Thank you for
the information. I’ve investigated some of those avenues,
but I have also been performing additional tests and do not
understand the results.
We have populated a GeoMesa/Accumulo database (with 8 nodes) with 3,110,400 records (360*180*48). There are 48 datapoints recorded at every point on the planet. At each location, we have 48 points whose timestamps increment by 1 millisecond each. We have both point and time as part of our feature type.
For clarity, location -180,-90 will contain a datapoint with a time value of 0, and location 180,90 will contain a time value of 3,110,399.
We wrote a test to retrieve the data in various ways (by location only, by time only, and by time and location).
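(For reference, the test issues its queries through the GeoTools API, roughly along the lines of the sketch below; the connection parameters, the type name 'observations', and the attribute names 'geom' and 'dtg' are illustrative placeholders rather than our exact schema.)

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.Query;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.geotools.filter.text.ecql.ECQL;

public class QueryTimingSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection parameters for the GeoMesa Accumulo data store
        Map<String, Serializable> params = new HashMap<>();
        params.put("accumulo.instance.id", "myInstance");
        params.put("accumulo.zookeepers", "zoo1,zoo2,zoo3");
        params.put("accumulo.user", "user");
        params.put("accumulo.password", "password");
        params.put("accumulo.catalog", "myCatalog");

        DataStore ds = DataStoreFinder.getDataStore(params);
        SimpleFeatureSource source = ds.getFeatureSource("observations");

        // Spatial-only, temporal-only, and combined filters ('geom'/'dtg' are placeholders)
        String spatial  = "bbox(geom, -180, -90, 180, 90)";
        String temporal = "dtg DURING 1970-01-01T00:00:00.000Z/1970-01-01T00:51:50.400Z";
        String both     = spatial + " AND " + temporal;

        for (String cql : new String[] { spatial, temporal, both }) {
            long start = System.currentTimeMillis();
            int count = 0;
            // the iterator simply counts the records returned
            try (SimpleFeatureIterator features =
                     source.getFeatures(new Query("observations", ECQL.toFilter(cql))).features()) {
                while (features.hasNext()) {
                    features.next();
                    count++;
                }
            }
            System.out.println(cql + " -> " + count + " records in "
                    + (System.currentTimeMillis() - start) + " ms");
        }

        ds.dispose();
    }
}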
For location, as we approach a zero-sized search area, the time for the search approaches zero. See 'location only.png'. The blue line is the number of records returned; the orange line is the time in milliseconds to do the search. The search area starts at (-180,-90,180,90) and decreases by one degree of longitude with each search. We only performed 300 searches. This graph makes sense.
NOTE: The
iterator to read the records returned simply counts the
records, so it is not a factor.
However, when we search by time only (see 'time-only.png'), we see that there seems to be some significant overhead associated with performing the search. In this search, we reach zero records because we have calculated the appropriate increment for each search (e.g. 0 to 3,110,400 milliseconds for the first search (milliseconds are converted directly to Date), 10,368 to 3,110,400 for the second search). We have still only performed 300 iterations in the test loop.
Also, if we perform a time search over a time range that does not contain any data (e.g. from 3,110,401 to NOW), the system still takes a couple of seconds (~2.4 seconds) to return zero results.
We realize that there is a significant amount of time between 3,110,400 (12-31-1969 18:51:50) and NOW, even though there's no data there (but possibly indexes exist?). We wonder if that is part of the problem.
We would like
to understand this overhead that occurs with a temporal
search.
Would you be
able to explain it, or is there a good way to diagnose it?
Thanks,
Kent
Hello,
To answer some of your questions:
* Accumulo doesn't really have any concept of a trigger. There are certain 'hacky' ways to do so (e.g. constraints), but they aren't recommended.
* GeoMesa has a concept of query interceptors[1], which let
you rewrite a query with custom code. This may not be
sufficient for your needs as it doesn't let you directly
change the return values, but may be a useful integration
point.
* MapReduce jobs can be initiated in a variety of ways, but that is not really within the scope of GeoMesa. I'd refer you to the Hadoop documentation here; a rough sketch of programmatic job submission is included after this list.
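As a rough, non-GeoMesa-specific illustration of programmatic initiation, a minimal MapReduce driver using the standard Hadoop client API looks roughly like this (the class name and input/output paths are placeholders, and the identity Mapper/Reducer stand in for whatever reduction logic you'd actually write):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReductionJobLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "periodic-data-reduction");
        job.setJarByClass(ReductionJobLauncher.class);
        // Identity mapper/reducer used as placeholders -- replace with your reduction logic
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion submits the job and blocks until it finishes;
        // job.submit() would return immediately instead
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}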
In general, I would suggest that you first consider returning
data in the Apache Arrow[2] or custom GeoMesa 'binary'[3]
formats. Either one can greatly reduce the bandwidth required
to return a given result set, while still returning the same
number of features. You may also want to consider feature
sampling[4], which will reduce the total number of features
returned.
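For reference, both of those options are typically requested through query hints on the GeoTools Query. A very rough sketch is below; note that the hint constants and package used here (QueryHints.SAMPLING and QueryHints.ARROW_ENCODE in org.locationtech.geomesa.index.conf) are my assumptions and should be verified against the documentation links for the GeoMesa version you're running:

import org.geotools.data.Query;
import org.geotools.filter.text.ecql.ECQL;
import org.locationtech.geomesa.index.conf.QueryHints;

public class AnalyticQuerySketch {
    // Builds a query with feature sampling and Arrow encoding requested via query hints.
    // NOTE: the hint names/package below are assumptions -- check the GeoMesa docs for your version.
    public static Query buildReducedQuery(String typeName, String cql) throws Exception {
        Query query = new Query(typeName, ECQL.toFilter(cql));
        // Ask GeoMesa to return roughly 10% of the matching features
        query.getHints().put(QueryHints.SAMPLING(), 0.1f);
        // Ask for results encoded as Apache Arrow record batches instead of full SimpleFeatures
        query.getHints().put(QueryHints.ARROW_ENCODE(), Boolean.TRUE);
        return query;
    }
}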
If those options are insufficient, then I would suggest writing your data reduction as an Accumulo iterator or combiner, which will let you do map/reduce style programming directly in Accumulo. It sounds like your data reduction depends on each query - if so, you'd need to modify the GeoMesa query planner in order to configure and invoke your iterators. If the data reduction can be done globally, then you can simply configure the iterators on your table directly, and they will be run for each query and compaction.
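Purely as an illustration of the mechanism, attaching a combiner to a plain Accumulo table looks like the sketch below (a standalone example against a hypothetical table of numeric counts; it is not something to attach as-is to a GeoMesa index table, since GeoMesa values are serialized features rather than simple numbers):

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class CombinerSetupSketch {
    public static void main(String[] args) throws Exception {
        // Connection details, table name, and column family are placeholders
        Connector conn = new ZooKeeperInstance("myInstance", "zoo1,zoo2,zoo3")
                .getConnector("user", new PasswordToken("password"));

        // A SummingCombiner collapses all values for a key into their sum
        IteratorSetting setting = new IteratorSetting(15, "sum", SummingCombiner.class);
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        SummingCombiner.setColumns(setting,
                Collections.singletonList(new IteratorSetting.Column("counts")));

        // Once attached, the combiner runs for every scan and compaction on the table
        conn.tableOperations().attachIterator("my_reduced_table", setting);
    }
}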
GeoMesa doesn't currently have an integration point for adding
new iterators, but if you'd like to contribute something to
that effect, it may make your solution more robust as the API
would be well defined and 'officially' supported.
Hope that helps,
Emilio
[1]:
https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-query-interceptors
[2]:
https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#arrow-encoding
[3]:
https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#binary-encoding
[4]:
https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#feature-sampling
On 10/22/19 7:16 PM, Udstrand, Will M wrote:
Hey Emilio,
In our current setup we are using Accumulo as the backend database, and we are querying GeoMesa with the open source API via org.geotools.* and org.opengis.*.
Can you say more about your setup? What back-end database are you using? Are you using GeoServer for querying, or something else?
Thanks,
Emilio
On 10/22/19 11:23 AM, Udstrand, Will M wrote:
Problem Description:
Currently in our platform we are using GeoMesa to store large amounts of geographical and time-sensitive metadata, and we are experiencing very poor performance metrics (i.e. latency) with our system's current configuration. The primary bottleneck has to do with the large amount of data returned by GeoMesa, so we are actively pursuing avenues to reduce the size of the responses. We have been investigating the use of MapReduce within the system, but have run into some knowledge gaps due to the lack of documentation. The idea behind our MapReduce use case is to either intercept queries coming into our cluster, or to run jobs periodically to combine and reduce the primary dataset and place the results in a separate table. Ideally we would intercept the queries, due to the complications of the data reduction, since the reduction is dependent on the parameters of a query.
MapReduce Options
· When intercepting queries coming into our cluster, we'd have them trigger jobs that combine and reduce the query's raw metadata into a smaller set of formatted/processed data points, which is then returned to our backend services as the result of the query.
· Periodically, or on events such as a write to a table, trigger a job that processes and reduces the primary data set and writes the result to our new "query" table.
Questions
· Can MapReduce jobs be triggered by events in the database?
· Can one intercept the queries issued to a GeoMesa instance?
· How are MapReduce jobs initiated, and can they be triggered programmatically?
· Can we send back the results of a MapReduce job as the result of a query?
· Are there any other options to reduce the latency incurred by large responses from the database?
We were hoping that you'd be able to give us some insight into our problems, and additional help in terms of the feasibility of our MapReduce and GeoMesa use case.