Thanks, Emilio!
The number of partitions per day is rather small for our use case - up to four only. But within a single partition we easily end up with thousands of small files. The reason is that we are collecting events from a small number of vehicles and writing them directly into the GeoMesa FSDS without any pre-aggregation. Maybe the FSDS is just not a good fit for our use case and we should use some other data store? We chose the FSDS because it was the simplest and cheapest approach we could think of, and query performance was of secondary importance to us. But as the amount of data and the number of files grow, performance is becoming an issue.
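For illustration, this is roughly the kind of buffering we could put in front of the FSDS writes so that each flush produces only a few files per touched partition instead of one tiny file per event. It is just a sketch against the GeoTools API; the class name, batch size and (missing) error handling are placeholders, not our actual code:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.geotools.data.DataStore;
    import org.geotools.data.FeatureWriter;
    import org.geotools.data.Transaction;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.feature.simple.SimpleFeatureType;

    /** Buffers incoming vehicle events and writes them to the FSDS in batches. */
    class BufferedFsdsWriter {

        private final DataStore store;    // FSDS data store, created once and reused
        private final String typeName;    // feature type the events belong to
        private final int batchSize;      // e.g. a few thousand events per flush
        private final List<SimpleFeature> buffer = new ArrayList<>();

        BufferedFsdsWriter(DataStore store, String typeName, int batchSize) {
            this.store = store;
            this.typeName = typeName;
            this.batchSize = batchSize;
        }

        /** Queue one event; flush when the batch is full. */
        synchronized void add(SimpleFeature event) throws IOException {
            buffer.add(event);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Write the whole batch through a single append session. */
        synchronized void flush() throws IOException {
            if (buffer.isEmpty()) {
                return;
            }
            try (FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                     store.getFeatureWriterAppend(typeName, Transaction.AUTO_COMMIT)) {
                for (SimpleFeature event : buffer) {
                    SimpleFeature toWrite = writer.next();
                    toWrite.setAttributes(event.getAttributes());
                    writer.write();
                }
            }
            buffer.clear();
        }
    }

Flushing every few thousand events (or every few minutes) should keep the number of files per partition manageable.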
We compact by partition already. Maybe we'll give JDBC metadata persistence a try. Thanks for that hint.
One more thing we've noticed: using the FSDS with a local file system (i.e. a file://... URL) seems to be considerably faster than using an S3-compatible object store (i.e. an s3a://... URL). Is that simply due to the nature of an object store being slower than a file system, or might it be an issue with the underlying org.apache.hadoop.fs.FileSystem implementation? We are using hadoop-2.8.5.
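In case it is useful for others on the list, these are the standard hadoop-aws 2.8 S3A settings we plan to experiment with, set in core-site.xml or programmatically. The values are guesses on our part, not tested recommendations:

    import org.apache.hadoop.conf.Configuration;

    public class S3aTuning {

        /** Returns a Hadoop configuration with a few S3A knobs adjusted. */
        public static Configuration tuned() {
            Configuration conf = new Configuration();
            // allow more parallel connections when lots of small files are opened
            conf.set("fs.s3a.connection.maximum", "64");
            conf.set("fs.s3a.threads.max", "64");
            // ORC reads are seek-heavy; "random" avoids streaming whole objects
            conf.set("fs.s3a.experimental.input.fadvise", "random");
            // larger readahead (in bytes) to reduce the number of GET requests
            conf.set("fs.s3a.readahead.range", "1048576");
            return conf;
        }
    }

Even with tuning, every file is still at least one HTTP round trip on S3, so reducing the number of files probably matters more than any of these settings.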
Best,
Christian
Hello,
The FSDS is going to work best when you only have to query a few large files. The metadata will be cached, so if you keep a data store around (e.g. in GeoServer), it shouldn't be doing repeated reads of the metadata files. That leads me to believe that you are seeing slowness from scanning a large number of files, where the overhead of opening each file dominates the query time.
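For example, something along these lines, where the data store is created once and then shared across requests (the parameter keys are from memory - double-check them against the FSDS documentation):

    import java.io.IOException;
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;

    public class FsdsHolder {

        // one data store for the whole application, so the FSDS metadata is
        // only read (and cached) once instead of on every request
        private static DataStore store;

        public static synchronized DataStore get() throws IOException {
            if (store == null) {
                Map<String, Serializable> params = new HashMap<>();
                params.put("fs.path", "s3a://my-bucket/geomesa/"); // placeholder root path
                params.put("fs.encoding", "orc");
                store = DataStoreFinder.getDataStore(params);
            }
            return store;
        }
    }

Calls like FsdsHolder.get().getTypeNames() should then only be expensive the first time.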
A few suggestions:
You're creating a lot of partitions - up to 256 per day. How
much data ends up in a typical partition with your current
setup? I would suggest trying with 2 or 4 bits of precision
in your partition scheme.
How are you ingesting data? You should try to avoid creating lots of small data files, as scanning them adds a lot of per-file overhead.
If you aren't already, make sure that you compact by
partition. Assuming your data is coming in semi-live, there
won't be any writes going to older partitions. Compacting
them again will not improve performance, but may generate
considerable work.
Finally, you may want to switch to JDBC for metadata
persistence, which should alleviate most of the issues
around metadata operations:
https://www.geomesa.org/documentation/user/filesystem/metadata.html#relational-database-persistence
Re: getTypeNames, that could probably be improved, although
the metadata is read once and then cached, so you will
likely pay that penalty the first time you access each
feature type anyway. I've opened a ticket to track the issue
here:
https://geomesa.atlassian.net/browse/GEOMESA-2678
Thanks,
Emilio
Hi GeoMesa Users,
We are using GeoMesa with an S3 file system data store and are experiencing extremely slow response times when we access our data - even with a “moderate” number of files stored in it (let’s say 10,000).
Our setup:
* GeoMesa 2.3.0
* Filesystem data store pointing to an S3 URL
** encoding: orc
** partition scheme: daily,xz2-8bits
** leaf-storage: true
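(For scale: if I understand the scheme correctly, 8 bits of xz2 precision allow up to 2^8 = 256 spatial partitions per daily slice.)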
We’re accessing that data store using different “clients”:
* a Java microservice which uses the GeoTools GeoMesa API and is running in the same AWS region as the S3 bucket
* GeoServer (2.14) running in the same AWS region as the S3 bucket
* the geomesa-fs CLI running in the same AWS region as the S3 bucket
All of them are really slow (it takes minutes, sometimes hours, until we get a response). Doing some debugging in our microservice, we found that even operations like org.geotools.data.DataStore.getTypeNames() take a really long time, because all of the metadata files seem to be scanned (which does not seem to be necessary, since reading the per-feature top-level storage.json files should be sufficient). Is that “works as designed”, or might it be a bug in the GeoMesa FSDS implementation?
Is there anything
(besides switching the actual data store) we can do to
improve the performance?
We’re doing a “geomesa-fs compact …” from time to time, which gives us fairly acceptable performance (but also takes hours, sometimes even days, to complete).
Thanks,
Christian
Mit freundlichen Grüßen / Kind regards
Christian Sickert
Crowd Data & Analytics for Automated Driving
Daimler AG - Mercedes-Benz Cars Development - RD/AFC
+49 176 309 71612
christian.sickert@xxxxxxxxxxx