Re: [geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Awesome! I think that Avro files are not splittable in our input format because they have a defined header and format that must be read by a single mapper. My understanding is that it's like XML - if you arbitrarily split an XML document, each piece will no longer be valid. I could be wrong though, and there may be better workarounds as well.
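
For reference, that "single mapper" behavior is what Hadoop's isSplitable
hook controls. A minimal sketch of the mechanism (not GeoMesa's actual
class, just an illustration):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Illustrative only: an input format that hands each file, whole,
    // to a single mapper. Returning false here is what prevents Hadoop
    // from splitting the file across tasks.
    abstract class WholeFileInputFormat[K, V] extends FileInputFormat[K, V] {
      override protected def isSplitable(context: JobContext, path: Path): Boolean = false
    }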

Thanks,

Emilio

On 02/20/2017 08:27 AM, Anthony Fox wrote:
Dan,

This is great!  Any chance you could submit a PR?  I'll merge ASAP as I
need it for some work I'm doing now.  I just haven't gotten around to
enabling the distributed ingest on GCP - it currently works on S3 and
HDFS.  And regarding the shaded jars, definitely open to suggestions.  I
struggled with this recently when running some Spark jobs on a GCP
Dataproc cluster.  Basically, getting the hdfs-site.xml file with the
INSTANCE and PROJECT set properly into the jar that gets distributed
should happen as part of the deployment.  What do you think?
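Concretely, I'm picturing a site file baked into the shaded jar, something
like this (the property names below are placeholders - the exact keys come
from the Bigtable client configuration, so double-check them):

    <!-- hdfs-site.xml bundled into the distributed jar; the property
         names here are illustrative placeholders, not confirmed keys -->
    <configuration>
      <property>
        <name>google.bigtable.project.id</name>
        <value>my-gcp-project</value>
      </property>
      <property>
        <name>google.bigtable.instance.id</name>
        <value>my-bigtable-instance</value>
      </property>
    </configuration>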

Thanks,
Anthony


Damiano Albani <damiano.albani@xxxxxxxxx> writes:

Hello,

I've been successfully ingesting Avro-formatted data into Bigtable using
the command-line program. This was done via a MapReduce job targeting Avro
files located on GCS, thanks to the Google Cloud Storage Connector for
Spark and Hadoop
<https://cloud.google.com/hadoop/google-cloud-storage-connector>.
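For reference, the ingest invocation looked roughly like this (catalog,
feature type, converter, and bucket names are placeholders):

    # Placeholder names throughout; the gs:// path is handled by the
    # GCS connector once it is on the classpath
    geomesa-bigtable ingest \
      --catalog my_catalog \
      --spec my_feature_type \
      --converter my_converter \
      "gs://my-bucket/data/*.avro"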

By the way, don't you think it would be appropriate to include a dependency
on this connector in the *geomesa-bigtable-tools* module by default?
A related change would be to add *"gs://"* to the list of *distPrefixes* in
*AbstractIngest*
<https://github.com/locationtech/geomesa/blob/master/geomesa-tools/src/main/scala/org/locationtech/geomesa/tools/ingest/AbstractIngest.scala#L91>.
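
In code terms, that would be roughly (a sketch, paraphrasing the linked
line rather than quoting it exactly):

    // Sketch of the suggested change in AbstractIngest: recognize GCS
    // paths as "distributed" inputs alongside the existing prefixes
    val distPrefixes = Seq("hdfs://", "s3n://", "s3a://", "gs://")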

I've used Google Cloud Dataproc (i.e. a hosted Hadoop environment) to run
the MapReduce job. The issue I ran into was that Dataproc requires a JAR
file (or several JARs) to run the job, so I couldn't simply tell it to
call *"geomesa-bigtable convert ..."*.
The solution I came up with was to build a shaded JAR of
*geomesa-bigtable-tools*.
Do you think it would be a good idea to provide such a JAR by default for
Hadoop usage?
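
For the record, submitting the shaded JAR looked roughly like this
(cluster name, bucket, and trailing arguments are placeholders):

    # Placeholders throughout; the main class is taken from the jar's
    # manifest, or can be given explicitly via --class
    gcloud dataproc jobs submit hadoop \
      --cluster my-cluster \
      --jar gs://my-bucket/geomesa-bigtable-tools-shaded.jar \
      -- ingest --catalog my_catalog ...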

Last point I wanted to mention: it looks like the input of the MapReduce
job was *not* split, even though I was using Avro files on purpose.
I suppose it has to do with *AvroFileInputFormat*
<https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/AvroFileInputFormat.scala>
extending *FileStreamInputFormat*
<https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/FileStreamInputFormat.scala>,
which explicitly returns *"isSplitable = false"*.
Should *AvroFileInputFormat* thus simply override it with *"isSplitable =
true"*? (I haven't tested how GeoMesa would react.)
I suppose the TSV and CSV input formats should also be marked as
splittable, by the way, shouldn't they?
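
In code terms, the (untested) change would be roughly this, assuming the
class can simply be subclassed or edited in place:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.locationtech.geomesa.jobs.mapreduce.AvroFileInputFormat

    // Untested sketch: let Hadoop split Avro inputs. Avro containers do
    // carry sync markers, but whether GeoMesa's record reader can start
    // mid-file is exactly the open question above.
    class SplittableAvroFileInputFormat extends AvroFileInputFormat {
      override protected def isSplitable(context: JobContext, path: Path): Boolean = true
    }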

Thanks,


