Re: [geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Hello,

On Tue, Feb 21, 2017 at 4:46 PM, Damiano Albani <damiano.albani@xxxxxxxxx> wrote:
Now the remaining issue is that I don't understand the overall behavior of the MapReduce job on Google Dataproc: only 1 worker node (out of 2, in my case) gets tasks (albeit, correctly, 1 task per vCPU) and, even more surprisingly, I don't see any performance boost in Bigtable write throughput.

For the record, switching to the preview version of the Dataproc image somehow fixed my issue.
MapReduce ingest jobs are now split across all the nodes, and they run so fast that I think Bigtable has become the bottleneck.
At least when starting from an empty Bigtable instance, that is.

This comment on Stack Overflow made me think that it could be preferable to pre-split the Bigtable table before ingesting the data.
(Bigtable will eventually rebalance those splits on its own, if I understand correctly.)
Given that I use UUID strings as feature identifiers, I suppose I could use split prefixes going from "0" to "f", as in the sketch below?
Anyway, I'll report back if that improves performance.
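Just to make the idea concrete, here is a minimal sketch of what such a pre-split could look like through the HBase-compatible Admin API (which, as far as I know, is also what the GeoMesa Bigtable data store goes through). The table name "my_features" and column family "d" are made-up placeholders, and the connection is built from a plain HBase configuration; with the Bigtable HBase client you would instead construct it from your project and instance:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        // 15 split points ("1" through "f") yield 16 initial ranges,
        // one per leading hex digit of the UUID feature identifiers.
        byte[][] splits = new byte[15][];
        for (int i = 1; i < 16; i++) {
            splits[i - 1] = Bytes.toBytes(Integer.toHexString(i));
        }

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // "my_features" and "d" are placeholder names for illustration only
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_features"));
            desc.addFamily(new HColumnDescriptor("d"));
            admin.createTable(desc, splits);
        }
    }
}

Note that 15 split points are enough to cover all 16 "0" to "f" prefixes, since the range below the first split point is implicit. And as Bigtable rebalances tablets on its own over time, the splits only need to be roughly even to help the initial bulk load.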

Regards,

--
Damiano Albani
Geodan
