Re: [geomesa-users] GeoMesa FSDS on S3 - very slow response times

Hello,

Yes, you should definitely aggregate the data before ingesting. If you use NiFi for your ingest pipeline, I believe it has some provided processors that can do pre-aggregation. You can also use a regular feature writer, and hold it open - there is a property for controlling how often data is flushed to disk, which will determine how many files you create: https://www.geomesa.org/documentation/user/filesystem/configuration.html#geomesa-fs-writer-partition-timeout
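
For example, something along these lines (just a sketch - the "fs.path"/"fs.encoding" parameter keys and the exact timeout property name are from memory, so double-check them against the docs above):

    import java.io.IOException;
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;
    import org.geotools.data.FeatureWriter;
    import org.geotools.data.Transaction;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.feature.simple.SimpleFeatureType;

    public class FsdsIngestSketch {

        // Sketch only: the parameter keys and the partition timeout property
        // name are assumptions - verify them against the linked documentation.
        static void ingest(Iterable<SimpleFeature> events) throws IOException {
            // flush partitions after they have been idle for a while, rather than per write
            System.setProperty("geomesa.fs.writer.partition.timeout", "60 seconds");

            Map<String, Serializable> params = new HashMap<>();
            params.put("fs.path", "s3a://my-bucket/geomesa/"); // hypothetical root path
            params.put("fs.encoding", "orc");
            DataStore store = DataStoreFinder.getDataStore(params);

            // hold a single writer open across many events, instead of opening a new
            // writer (and creating new files) for every event
            FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                    store.getFeatureWriterAppend("my-feature-type", Transaction.AUTO_COMMIT);
            try {
                for (SimpleFeature event : events) {
                    SimpleFeature next = writer.next();
                    next.setAttributes(event.getAttributes());
                    writer.write();
                }
            } finally {
                writer.close(); // closing flushes any buffered data to files
            }
        }
    }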

If you need semi-real-time access to the data, then you probably shouldn't use the FSDS. However, you could use a hybrid approach, with more recent data stored in a more performant store, and older compacted data in the FSDS, using the Merged View data store: https://www.geomesa.org/documentation/user/merged_view.html
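
Just to illustrate the idea, here is the merge spelled out by hand with plain GeoTools (hypothetical store and type-name variables) - the Merged View store linked above effectively does this for you behind a single data store:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.geotools.data.DataStore;
    import org.geotools.data.simple.SimpleFeatureIterator;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.filter.Filter;

    public class HybridQuerySketch {

        // Illustration only: run the same query against the "live" store and the
        // compacted FSDS, then concatenate the results.
        static List<SimpleFeature> query(DataStore recentStore, DataStore fsdsStore,
                                         String typeName, Filter filter) throws IOException {
            List<SimpleFeature> results = new ArrayList<>();
            for (DataStore store : Arrays.asList(recentStore, fsdsStore)) {
                try (SimpleFeatureIterator features =
                         store.getFeatureSource(typeName).getFeatures(filter).features()) {
                    while (features.hasNext()) {
                        results.add(features.next());
                    }
                }
            }
            return results;
        }
    }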

Using S3 is generally going to be slower than a local filesystem, and the slowness will be exacerbated by lots of small files. Once you open a file connection to S3, it can stream data back fairly quickly, but there is some latency in establishing the connection in the first place. It's also not a random-access file system, so ORC/Parquet can likely do some optimized skipping around when reading a local file that they can't do against S3. I'm not sure whether the Hadoop S3 implementation would be a bottleneck or not.

Thanks,

Emilio

On 7/31/19 2:58 AM, christian.sickert@xxxxxxxxxxx wrote:

Thanks, Emilio!

 

The number of partitions per day is rather small for our use case - only up to four. But for a single partition we easily end up with thousands of small files. The reason is that we are collecting events from (a small number of) vehicles, which are stored directly in the GeoMesa FSDS without any pre-aggregation. Maybe the FSDS is just not a perfect fit for our use case and we should use some other data store? We decided on the FSDS because it was the simplest and cheapest approach we could think of, and query performance was of secondary importance to us. But as the amount of data and the number of files grows, performance is becoming an issue.

 

We compact by partition already. Maybe we'll give JDBC metadata persistence a try. Thanks for that hint.

 

One more thing we've noticed: using the FSDS with a local file system (i.e. a file://... URL) seems to be considerably faster than using an S3-compatible object store (i.e. an s3a://… URL). Is that due to the nature of an object store being slower than a file system, or might it be an issue with the underlying org.apache.hadoop.fs.FileSystem implementation? We are using hadoop-2.8.5.

 

Best,

Christian

 

From: geomesa-users-bounces@xxxxxxxxxxxxxxxx <geomesa-users-bounces@xxxxxxxxxxxxxxxx> On behalf of Emilio Lahr-Vivaz
Sent: Tuesday, July 30, 2019 5:58 PM
To: geomesa-users@xxxxxxxxxxxxxxxx
Subject: Re: [geomesa-users] GeoMesa FSDS on S3 - very slow response times

 

Hello,

The FSDS is going to work best when you only have to query a few large files. The metadata will be cached, so if you keep a data store around (e.g. in geoserver), it shouldn't be doing repeated reads of the metadata files. That leads me to believe that you are seeing slowness from scanning a large number of files, where the overhead of opening the file is dominating the query time.

A few suggestions:

You're creating a lot of partitions - up to 256 per day. How much data ends up in a typical partition with your current setup? I would suggest trying with 2 or 4 bits of precision in your partition scheme.
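
For example, when creating the feature type, something like this (the "geomesa.fs.scheme" and "geomesa.fs.leaf-storage" user-data keys are from memory, so verify them against the FSDS docs):

    import java.io.IOException;

    import org.geotools.data.DataStore;
    import org.geotools.data.DataUtilities;
    import org.geotools.feature.SchemaException;
    import org.opengis.feature.simple.SimpleFeatureType;

    public class CoarserSchemeSketch {

        // Sketch only: the user-data keys below are assumptions - verify the exact
        // keys and syntax in the FSDS documentation.
        static void createSchema(DataStore store) throws IOException, SchemaException {
            // hypothetical attributes - use your actual feature type spec here
            SimpleFeatureType sft = DataUtilities.createType("my-feature-type",
                "vehicleId:String,dtg:Date,*geom:Point:srid=4326");
            // 2 bits -> at most 4 spatial partitions per day instead of 256 with 8 bits
            sft.getUserData().put("geomesa.fs.scheme", "daily,xz2-2bits");
            sft.getUserData().put("geomesa.fs.leaf-storage", "true");
            store.createSchema(sft);
        }
    }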

How are you ingesting data? You should try to avoid creating lots of small data files, as that requires a lot of overhead to scan.

If you aren't already, make sure that you compact by partition. Assuming your data is coming in semi-live, there won't be any writes going to older partitions. Compacting them again will not improve performance, but may generate considerable work.
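
Something like the following with the CLI (the flag names and the partition name format are from memory - check 'geomesa-fs help compact' for the exact options, and only pass the partitions that actually received new data):

    geomesa-fs compact \
        -p s3a://my-bucket/geomesa/ \
        -f my-feature-type \
        --partitions 2019/07/30/0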

Finally, you may want to switch to JDBC for metadata persistence, which should alleviate most of the issues around metadata operations:
https://www.geomesa.org/documentation/user/filesystem/metadata.html#relational-database-persistence

Re: getTypeNames, that could probably be improved, although the metadata is read once and then cached, so you will likely pay that penalty the first time you access each feature type anyway. I've opened a ticket to track the issue here:
https://geomesa.atlassian.net/browse/GEOMESA-2678

Thanks,

Emilio

On 7/30/19 11:23 AM, christian.sickert@xxxxxxxxxxx wrote:

Hi GeoMesa Users,

 

we are using GeoMesa with an S3 file system datastore and are experiencing extremely slow response times when we access our data - even with a “moderate” number of files stored in it (let’s say 10,000).

 

Our setup:

* GeoMesa 2.3.0

* Filesystem datastore pointing to an S3 URL

** encoding: orc

** partition scheme: daily,xz2-8bits

** leaf-storage: true

 

We’re accessing that data store using different “clients”:

* a Java microservice which uses the GeoTools GeoMesa API and is running in the same AWS region as the S3 bucket

* GeoServer (2.14) running in the same AWS region as the S3 bucket

* geomesa-fs CLI running in the same AWS region as the S3 bucket

 

All of them are really slow (it takes minutes, sometimes hours, until we get a response). Doing some debugging with our microservice, we found out that even operations like org.geotools.data.DataStore.getTypeNames() take really long, because all of the metadata files seem to be scanned (which does not seem necessary, since reading the per-feature top-level storage.json files should be sufficient). Is that “works-as-designed”, or might it be a bug in the GeoMesa FSDS implementation?

 

Is there anything (besides switching the actual data store) we can do to improve the performance?

 

We’re running a “geomesa-fs compact …” from time to time, which gives us fairly acceptable performance (but also takes hours, sometimes even days, to complete).

 

Thanks,

Christian

 

 

 

Mit freundlichen Grüßen / Kind regards

Christian Sickert

Crowd Data & Analytics for Automated Driving
Daimler AG - Mercedes-Benz Cars Development - RD/AFC

+49 176 309 71612
christian.sickert@xxxxxxxxxxx

 



 

Default Disclaimer Daimler AG
If you are not the addressee, please inform us immediately that you have received this e-mail by mistake, and delete it. We thank you for your support.


_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users

