Re: [geomesa-dev] Ingest performance issues with newest version of geomesa
Blake,
Here are some of our findings regarding map-reduce ingest of point vs.
non-point geometries from text files:
1. Non-point geometries are somewhat slower than point geometries to
ingest, but this difference was dwarfed in our tests by the following
effects:
1.A. The size of the feature description in the file being ingested:
Non-point geometries have, by definition, larger descriptions. In our
testing with OpenStreetMap (OSM) data, lines describing non-point
geometries were more than 10 times longer than lines describing point
geometries (1,429 bytes/line compared to 105 bytes/line).
1.A.1. For confirmation, we took our two files -- point and
non-point geometries -- and ran them through a dummy mapper that ignored
the real data and substituted pre-defined geometries. This eliminated
the GeoMesa-specific (and point vs. non-point) processing costs of the
map-reduce ingest. Even in this setup, the point files ingested 6-8
times faster than the non-point files.
1.B. Map-reduce overhead: In our tests, approximately 1/2 of the
entire elapsed time was occupied by tasks not directly related to the
GeoMesa ingest of features. These tasks include creating and preparing
the destination table; copying JAR files and data files to HDFS; etc.
Many of these are fixed costs, meaning that as the size of the data file
increases, the net throughput -- in records ingested per second --
should increase.
1.C. The number of map tasks: In our testing, there were no
(explicit) reducers, because all of the insertions were made from within
the map tasks. If you don't adjust the file-split size, Hadoop uses
relatively few map tasks, which doesn't necessarily provide you with as
much parallelism as you would like. We found that matching the splits
to the number of task-slots worked reasonably well.
2. We experimented with some ways to make non-point geometries ingest
faster relative to point geometries, but found that all of these methods
had a corresponding increase in query time after the data were fully
ingested. (This is because making ingest faster meant incurring more
false positive matches to sift through during a query.) That is,
non-point geometries seem to include a certain amount of "pay me now or
pay me later" baggage. The current implementation appears to represent
a reasonable balance between ingest time and query time.
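As a rough illustration of point 1.B (a sketch only; the function name
and every number below are made up for the example, not measurements
from our tests): if a job carries a fixed setup cost and then ingests at
a steady rate, the net throughput climbs toward that steady rate as the
input grows.

```python
def net_throughput(records, fixed_overhead_s, ingest_rate_per_s):
    """Records per second over the whole job, fixed setup costs included."""
    elapsed_s = fixed_overhead_s + records / ingest_rate_per_s
    return records / elapsed_s


# Hypothetical job: 60 s of fixed overhead (table setup, copying JARs and
# data to HDFS), then 10,000 records/s once the map tasks are running.
for n in (100_000, 600_000, 10_000_000):
    print(n, round(net_throughput(n, 60.0, 10_000.0)))
```

With these assumed numbers, the 600,000-record job spends half of its
elapsed time on overhead, so it nets only half of the steady rate; the
larger job gets much closer to the full rate.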
To sum up, these are our recommendations on how to increase the
throughput of map-reduce ingest of non-point geometries:
- use the smallest, simplest input files possible; if there are fields
you don't need, eliminate them before the ingest occurs
- favor fewer, larger map-reduce ingest jobs over more, smaller jobs
- ensure that you are using a reasonable number of map-tasks for your
cluster
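For the last recommendation, here is a back-of-the-envelope sketch of
sizing splits to match the available task slots. The helper names, the
two-map-slots-per-node figure, and the 1 GiB input size are assumptions
for illustration; only the 14-node cluster size comes from this thread.

```python
def split_size_bytes(input_bytes, task_slots):
    """Smallest split size that keeps the split count at or below task_slots."""
    # Integer ceiling division, so the last split is never dropped.
    return max(1, -(-input_bytes // task_slots))


def num_splits(input_bytes, split_bytes):
    """Number of splits a single file of input_bytes yields at split_bytes each."""
    return -(-input_bytes // split_bytes)


nodes = 14                  # cluster size mentioned in this thread
slots_per_node = 2          # assumption: check your own cluster configuration
input_bytes = 1024 ** 3     # assumption: a single 1 GiB input file

size = split_size_bytes(input_bytes, nodes * slots_per_node)
print(size, num_splits(input_bytes, size))
```

The resulting byte count is the kind of value you would hand to Hadoop's
maximum-split-size setting (mapreduce.input.fileinputformat.split.maxsize
in Hadoop 2; older releases use mapred.max.split.size). The number of map
tasks you actually get also depends on the input format, the number of
input files, and the HDFS block size, so treat this as a starting point.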
If you have any questions, please contact us at your convenience:
geomesa-users@xxxxxxxxxxxxxxxx or geomesa-dev@xxxxxxxxxxxxxxxx; they
both work just fine.
Thanks!
Sincerely,
-- Chris Eichelberger
On Mon, 2014-06-02 at 14:14 -0400, Chris Eichelberger wrote:
> Blake,
>
> I am still working through some of the numbers here on ingest of points
> v. non-point geometries, but will write you back as soon as I have
> something cogent.
>
> Thanks!
>
> Sincerely,
> -- Chris Eichelberger
>
>
> On Fri, 2014-05-30 at 19:35 -0400, Anthony Fox wrote:
> > Blake, I don't think it's you. Chris has some preliminary performance
> > stats on non-point geometry features and something has impacted
> > ingest. We are still looking into it. Will get back to you when we
> > know more.
> >
> > On May 30, 2014, at 5:59 PM, "Peno, Blake"
> > <Blake.Peno@xxxxxxxxxxxxxxx> wrote:
> >
> >
> > > When I use a MapReduce job, I use a split for each layer of my
> > > dataset, which is about 340ish splits. I’m getting ingest speeds of
> > > > around 75/s using this MapReduce job, but it’s actually much faster
> > > for me if I just push them one at a time without using any of the
> > > MapReduce stuff, so I have to assume I’m doing something
> > > incorrectly, but I’m not really sure what. You guys will have to
> > > forgive me, as I’m not very well versed with hadoop in general, so
> > > working with geomesa is a bit of a learning experience for me.
> > >
> > >
> > >
> > > If you could get me some information on how fast you can ingest
> > > polygons, I can confirm that the problem is on my end and just keep
> > > learning and fixing things over here. I just want to make sure that
> > > it is just me getting these speed issues.
> > >
> > >
> > >
> > > Blake
> > >
> > >
> > >
> > > From: geomesa-dev-bounces@xxxxxxxxxxxxxxxx
> > > [mailto:geomesa-dev-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Anthony
> > > Fox
> > > Sent: Thursday, May 29, 2014 7:48 AM
> > > To: Discussions between GeoMesa committers
> > > Subject: Re: [geomesa-dev] Ingest performance issues with newest
> > > version of geomesa
> > >
> > >
> > >
> > > Blake, we're running some tests against polygons and will let you
> > > know the result. Can you tell me how many map tasks were
> > > instantiated by your MapReduce job?
> > >
> > >
> > > Thanks,
> > > Anthony
> > >
> > >
> > >
> > >
> > > On Wed, May 28, 2014 at 5:31 PM, Peno, Blake
> > > <Blake.Peno@xxxxxxxxxxxxxxx> wrote:
> > >
> > > > The MapReduce job from the example is getting me about the same
> > > > ingestion speed. I’m getting an average of around 120 ingests per
> > > > second (according to the Accumulo overview site). Let me know
> > > if your polygons are getting any better performance than this and
> > > I’m just doing something wrong.
> > >
> > >
> > >
> > > > From: geomesa-dev-bounces@xxxxxxxxxxxxxxxx
> > > [mailto:geomesa-dev-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Anthony
> > > Fox
> > > Sent: Wednesday, May 28, 2014 2:29 PM
> > >
> > >
> > > To: Discussions between GeoMesa committers
> > > Subject: Re: [geomesa-dev] Ingest performance issues with newest
> > > version of geomesa
> > >
> > >
> > >
> > >
> > > Blake,
> > >
> > > This is a good mailing list to contact us - you can also use the
> > > users mailing list (geomesa-users@xxxxxxxxxxxxxxxx). We benchmarked
> > > against point data - I'll test out an area and lines ingest and let
> > > you know some numbers. I'd recommend creating MapReduce jobs for
> > > your ingest (or a Storm job if it is streaming). That way, you'd
> > > get lots of parallelism and the index requires no communication so
> > > parallelism is fine. Check out the tutorial here:
> > >
> > > http://geomesa.github.io/2014/04/17/geomesa-gdelt-analysis/
> > >
> > >
> > > The code referenced in that tutorial (available on GitHub)
> > > demonstrates MapReduce based ingest. For Storm, check out:
> > >
> > > http://geomesa.github.io/2014/05/16/geomesa-osm-analysis/
> > >
> > >
> > > Let me know if this helps.
> > >
> > >
> > > Thanks,
> > > Anthony
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, May 28, 2014 at 2:23 PM, Peno, Blake
> > > <Blake.Peno@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Sorry, forgot to mention our cluster is 14 nodes.
> > >
> > >
> > >
> > > > From: geomesa-dev-bounces@xxxxxxxxxxxxxxxx
> > > [mailto:geomesa-dev-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Peno,
> > > Blake
> > > Sent: Wednesday, May 28, 2014 1:21 PM
> > >
> > >
> > > To: Discussions between GeoMesa committers
> > > Subject: Re: [geomesa-dev] Ingest performance issues with newest
> > > version of geomesa
> > >
> > >
> > >
> > >
> > > I’m using Java to push features as described in the documentation
> > > PDF. I’m getting a FeatureSource from the DataStore and using the
> > > > addFeatures method. 500k/second is about 50k times faster
> > > than what I’ve been getting recently. Even before updating to the
> > > latest version I wasn’t getting anywhere near that. It seems to be
> > > much faster when using point data, of course, but most of my data is
> > > area and line features.
> > >
> > >
> > >
> > > Also, side note, is this the mailing list I should be using? I know
> > > > I’m not a developer of geomesa per se, but I didn’t know how else
> > > to contact you guys easily.
> > >
> > >
> > >
> > > > From: geomesa-dev-bounces@xxxxxxxxxxxxxxxx
> > > [mailto:geomesa-dev-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Anthony
> > > Fox
> > > Sent: Wednesday, May 28, 2014 11:54 AM
> > > To: Discussions between GeoMesa committers
> > > Subject: Re: [geomesa-dev] Ingest performance issues with newest
> > > version of geomesa
> > >
> > >
> > >
> > > Blake,
> > >
> > > We recently switched from a text based encoding to an Avro binary
> > > encoding. This should have actually sped up your ingest
> > > significantly - it performed very well in tests we ran during
> > > development of the binary encoding. As a point of reference, we
> > > have been able to ingest (on a 21 node cluster) about 500K records
> > > per second using a map/reduce job. Can you give a bit more detail
> > > about how you are performing your ingest?
> > >
> > >
> > > Thanks,
> > > Anthony
> > >
> > >
> > >
> > >
> > > On Wed, May 28, 2014 at 12:48 PM, Peno, Blake
> > > <Blake.Peno@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > >
> > >
> > > I recently upgraded to the newest version of geomesa on github, and
> > > I’ve noticed that my performance has drastically dropped in regards
> > > to pushing features to geomesa. At this rate it’s going to take
> > > about a week to get all of my data uploaded. Has something changed
> > > that would cause this, or am I missing something simple?
> > >
> > >
> > >
> > > _______________________________________________
> > > geomesa-dev mailing list
> > > geomesa-dev@xxxxxxxxxxxxxxxx
> > > http://locationtech.org/mailman/listinfo/geomesa-dev
> > >
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > geomesa-dev mailing list
> > > geomesa-dev@xxxxxxxxxxxxxxxx
> > > http://locationtech.org/mailman/listinfo/geomesa-dev
> > >
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > geomesa-dev mailing list
> > > geomesa-dev@xxxxxxxxxxxxxxxx
> > > http://locationtech.org/mailman/listinfo/geomesa-dev
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > geomesa-dev mailing list
> > > geomesa-dev@xxxxxxxxxxxxxxxx
> > > http://locationtech.org/mailman/listinfo/geomesa-dev
> > >
> > _______________________________________________
> > geomesa-dev mailing list
> > geomesa-dev@xxxxxxxxxxxxxxxx
> > http://locationtech.org/mailman/listinfo/geomesa-dev
>
> _______________________________________________
> geomesa-dev mailing list
> geomesa-dev@xxxxxxxxxxxxxxxx
> http://locationtech.org/mailman/listinfo/geomesa-dev