Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Hello Emilio,

First, thanks for taking the time to write such a detailed answer!

On Mon, Jan 23, 2017 at 5:56 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

Is this a one-off ingest, or continuous streaming data?

My use case currently deals with a one-off ingest -- at least that's where performance is critical, given the amount of data to process.
 
BigTable is fairly opaque, in that it hides any database configuration from you.

It's true that there's not much to configure apart from the instance ID and the project ID. But I suppose that's the beauty of the thing 😊
 
Thus, optimizations are limited. There is no way to e.g. write database files directly, so whatever ingest mechanism you use will end up using the same client writers. The bottleneck will likely be your BigTable instance - any client bottlenecks can be overcome by parallelizing your ingestion clients. Client connections are configured through the hbase-site.xml file - I haven't played around with it too much, but there might be some optimizations possible there.

I'll have a look to see whether there are any settings to "tweak" in the HBase client talking to BigTable.
(I remember having to set some timeout value by the way.)
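
For reference, here is a rough sketch of what such an hbase-site.xml might look like. The project/instance property names come from the bigtable-hbase client docs; the timeout property name is from memory and should be verified against the client version you use:

```xml
<!-- Sketch of an hbase-site.xml for the Bigtable HBase client.
     Verify property names against your bigtable-hbase client version. -->
<configuration>
  <property>
    <name>google.bigtable.project.id</name>
    <value>my-gcp-project</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>my-bigtable-instance</value>
  </property>
  <!-- The kind of timeout setting mentioned above (exact name unverified) -->
  <property>
    <name>google.bigtable.rpc.timeout.ms</name>
    <value>60000</value>
  </property>
</configuration>
```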
 
An issue you might run into is BigTable node parallelism - GeoMesa creates some initial split points in the table structure, but my understanding is that BigTable will eventually collapse those back down if your data isn't large enough (in the TB). Thus, you might only be utilizing a single node for writing.
 
I suppose what you're saying relates to the point in BigTable's documentation titled "The workload isn't appropriate for Cloud Bigtable"?
 
In general, you want to have your clients 'close' to your back end - so in this case running your ingestion in GCE.

Yes, I've indeed been using GCE so far, for performance and cost reasons.
 
To get started, you can pretty easily use the GeoMesa command line tools for a local ingestion of flat files (you will have to define a GeoMesa converter that maps your data into SimpleFeatures). You can specify multiple local threads, up to the number of files you are processing.

Speaking of the command line tools, it would be nice if the binaries for the BigTable backend were built by default and published to the Maven repository.
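
For anyone following along, a GeoMesa converter definition for a simple CSV might look roughly like the sketch below (the column layout and field names here are invented for illustration; see the GeoMesa converter docs for the exact syntax supported by your version):

```hocon
# Hypothetical feature type + converter for a CSV with columns:
# id, name, date, lon, lat
geomesa.sfts.mydata = {
  attributes = [
    { name = "name", type = "String" }
    { name = "dtg",  type = "Date" }
    { name = "geom", type = "Point", srid = 4326, default = true }
  ]
}
geomesa.converters.mydata = {
  type     = "delimited-text"
  format   = "CSV"
  id-field = "$1"
  fields = [
    { name = "name", transform = "$2" }
    { name = "dtg",  transform = "date('yyyy-MM-dd', $3)" }
    { name = "geom", transform = "point($4::double, $5::double)" }
  ]
}
```

The ingest itself would then be a command-line invocation of the tools with this converter; check the tools' built-in help for the exact flags in your release.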

If you find that you need more ingest throughput, you can use the same converter to run a distributed map/reduce ingest. For BigTable, there may be some classpath issues to be sorted out with the GeoMesa map/reduce ingest - in particular getting your hbase-site.xml on the distributed classpath. If you go this route and hit any issues, let us know.

Using map/reduce is an interesting idea. I personally don't have much experience in that subject, but I'll definitely have a look.
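
On the classpath point you raised: one hypothetical way to get hbase-site.xml visible to a distributed job is via the standard Hadoop environment variables, something like the sketch below (untested on my side):

```shell
# Hypothetical setup: put the directory containing hbase-site.xml
# on the Hadoop classpath before launching the map/reduce ingest.
export HADOOP_CONF_DIR=/path/to/conf            # contains hbase-site.xml
export HADOOP_CLASSPATH="$HADOOP_CONF_DIR:$HADOOP_CLASSPATH"
```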

We don't currently have any tools for ingesting directly from another database - you could pretty easily write something custom, or just export to files and ingest those.

If I'm not mistaken, GeoMesa relies on GeoTools to support reading the various file formats, right?
While super useful, this abstraction makes it quite difficult for an "outside observer" like me to understand precisely what happens when, say, I call something like:

      bigTableFeatureStore.addFeatures(inputFeatureCollection);

where inputFeatureCollection comes from, say, a huge Shapefile or CSV file.
Between this very simple call and the source data finally arriving in BigTable, perfectly organised, quite a lot happens 😁
For example, during this pipe-like operation, is there any kind of buffering or "batch" commit? Can I expect GeoTools to process large amounts of features without leaking memory?
That's the kind of topic where my understanding is a bit limited, given that everything just works automagically!
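
To make the question concrete, the kind of buffering I mean can be sketched in a library-agnostic way: a writer that accumulates features and flushes every N, so memory stays bounded regardless of input size. This is purely illustrative and not the actual GeoTools/GeoMesa implementation; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of batched writes: buffer up to batchSize items,
// hand them to the sink in one bulk call, and never hold the full input
// in memory. (Not the actual GeoTools/GeoMesa implementation.)
public class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;   // e.g. one bulk write to the store
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    public BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
            flushes++;
        }
    }

    public int getFlushes() { return flushes; }

    public static void main(String[] args) {
        List<Integer> written = new ArrayList<>();
        BatchingWriter<Integer> w = new BatchingWriter<>(100, written::addAll);
        for (int i = 0; i < 250; i++) {
            w.add(i);
        }
        w.flush();  // flush the final partial batch
        System.out.println(written.size() + " items in " + w.getFlushes() + " flushes");
        // prints "250 items in 3 flushes"
    }
}
```

Whether GeoTools/GeoMesa does something equivalent internally on addFeatures is exactly what I'd like to understand.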

Going back to your idea of splitting the input files, could this also be done dynamically, based on an org.geotools.data.{Query + Filter} combination that would effectively "shard" the data? For example, with 5 ingester threads, each one would process a fifth of the FeatureCollection (via a modulo 5).
Does that make any sense or would it be not very reliable / efficient in practice?
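
The modulo idea itself is easy to sketch independently of GeoTools: hash each feature id into one of N shards and give each ingester thread one shard. The open question is whether such a predicate can be pushed down into a Query/Filter efficiently or has to be applied client-side. A hypothetical client-side version (all names invented):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of modulo-based sharding: assign each feature id
// to one of N ingester threads by hashing the id. Applied client-side;
// whether this can be pushed into a GeoTools Filter is the open question.
public class ShardAssigner {
    private final int shards;

    public ShardAssigner(int shards) { this.shards = shards; }

    // Stable shard in [0, shards) for a given feature id
    public int shardOf(String featureId) {
        return Math.floorMod(featureId.hashCode(), shards);
    }

    // Partition ids into per-thread work lists
    public List<List<String>> partition(List<String> ids) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < shards; i++) out.add(new ArrayList<>());
        for (String id : ids) out.get(shardOf(id)).add(id);
        return out;
    }

    public static void main(String[] args) {
        ShardAssigner a = new ShardAssigner(5);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) ids.add("feature-" + i);
        List<List<String>> shards = a.partition(ids);
        int total = 0;
        for (List<String> s : shards) total += s.size();
        System.out.println(shards.size() + " shards, " + total + " ids total");
        // prints "5 shards, 1000 ids total"
    }
}
```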

One minor GeoTools optimization is to use the PROVIDED_FID hint, if you already have unique IDs. If not, GeoMesa will generate UUIDs for each feature. (the converter framework I mentioned earlier supports this by default).

That's good to know, thanks for the tip!
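
For the archives, my understanding from the GeoTools docs is that the hint is set per feature via its user data, roughly like the sketch below (not runnable standalone, since it needs GeoTools on the classpath; the helper method is hypothetical):

```java
// Sketch only: requires GeoTools on the classpath.
// The hint tells the store to keep the id we provide instead of generating one.
import org.geotools.data.DataUtilities;
import org.geotools.data.simple.SimpleFeatureStore;
import org.geotools.factory.Hints;  // org.geotools.util.factory.Hints in newer versions
import org.opengis.feature.simple.SimpleFeature;

public class ProvidedFidExample {
    static void writeWithProvidedFid(SimpleFeatureStore store, SimpleFeature feature)
            throws java.io.IOException {
        feature.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE);
        store.addFeatures(DataUtilities.collection(new SimpleFeature[]{ feature }));
    }
}
```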

Best regards,

--
Damiano Albani
Geodan
