Re: [geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Awesome! I think that Avro files are not splittable in our input format because they have a defined header and format that must be read by a single mapper. My understanding is that it's like XML - if you arbitrarily split an XML document, each piece will no longer be valid. I could be wrong though, and there may be better workarounds as well.
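
For reference, that "single mapper" behavior is what Hadoop's isSplitable
hook controls. A minimal sketch of the mechanism (not GeoMesa's actual
class, just an illustration):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Illustrative only: an input format that hands each file, whole,
    // to a single mapper. Returning false here is what prevents Hadoop
    // from splitting the file across tasks.
    abstract class WholeFileInputFormat[K, V] extends FileInputFormat[K, V] {
      override protected def isSplitable(context: JobContext, path: Path): Boolean = false
    }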

Thanks,

Emilio

On 02/20/2017 08:27 AM, Anthony Fox wrote:
Dan,

This is great!  Any chance you could submit a PR?  I'll merge ASAP as I
need it for some work I'm doing now.  I just haven't gotten around to
enabling the distributed ingest on GCP - it currently works on S3 and
HDFS.  And regarding the shaded jars, definitely open to suggestions.  I
struggled with this recently when running some Spark jobs on a GCP
Dataproc cluster.  Basically, getting the hdfs-site.xml file with the
INSTANCE and PROJECT set properly into the jar that gets distributed
should happen as part of the deployment.  What do you think?
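Concretely, I'm picturing a site file baked into the shaded jar, something
like this (the property names below are placeholders - the exact keys come
from the Bigtable client configuration, so double-check them):

    <!-- hdfs-site.xml bundled into the distributed jar; the property
         names here are illustrative placeholders, not confirmed keys -->
    <configuration>
      <property>
        <name>google.bigtable.project.id</name>
        <value>my-gcp-project</value>
      </property>
      <property>
        <name>google.bigtable.instance.id</name>
        <value>my-bigtable-instance</value>
      </property>
    </configuration>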

Thanks,
Anthony


Damiano Albani <damiano.albani@xxxxxxxxx> writes:

Hello,

I've been successfully ingesting Avro-formatted data into Bigtable using
the command-line program. This was done via a MapReduce job targeting Avro
files located on GCS, thanks to the Google Cloud Storage Connector for
Spark and Hadoop
<https://cloud.google.com/hadoop/google-cloud-storage-connector>.
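For reference, the ingest invocation looked roughly like this (catalog,
feature type, converter, and bucket names are placeholders):

    # Placeholder names throughout; the gs:// path is handled by the
    # GCS connector once it is on the classpath
    geomesa-bigtable ingest \
      --catalog my_catalog \
      --spec my_feature_type \
      --converter my_converter \
      "gs://my-bucket/data/*.avro"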

By the way, don't you think it would be appropriate to include a dependency
on this connector in the *geomesa-bigtable-tools* module by default?
A related change would be to add *"gs://"* to the list of *distPrefixes* in
*AbstractIngest*
<https://github.com/locationtech/geomesa/blob/master/geomesa-tools/src/main/scala/org/locationtech/geomesa/tools/ingest/AbstractIngest.scala#L91>.
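
In code terms, that would be roughly (a sketch, paraphrasing the linked
line rather than quoting it exactly):

    // Sketch of the suggested change in AbstractIngest: recognize GCS
    // paths as "distributed" inputs alongside the existing prefixes
    val distPrefixes = Seq("hdfs://", "s3n://", "s3a://", "gs://")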

I've used Google Cloud Dataproc (i.e. a hosted Hadoop environment) to run
the MapReduce job. The issue I ran into was that Dataproc requires a JAR
file (or several JARs) to run the job, so I couldn't simply tell it to
call *"geomesa-bigtable convert ..."*.
The solution I came up with was to build a shaded JAR of
*geomesa-bigtable-tools*.
Do you think it would be a good idea to provide such a JAR by default for
Hadoop usage?
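
For the record, submitting the shaded JAR looked roughly like this
(cluster name, bucket, and trailing arguments are placeholders):

    # Placeholders throughout; the main class is taken from the jar's
    # manifest, or can be given explicitly via --class
    gcloud dataproc jobs submit hadoop \
      --cluster my-cluster \
      --jar gs://my-bucket/geomesa-bigtable-tools-shaded.jar \
      -- ingest --catalog my_catalog ...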

Last point I wanted to mention: it looks like the input of the MapReduce
job was *not* split, even though I was using Avro files on purpose.
I suppose it has to do with *AvroFileInputFormat*
<https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/AvroFileInputFormat.scala>
extending *FileStreamInputFormat*
<https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/FileStreamInputFormat.scala>,
which explicitly returns *"isSplitable = false"*.
Should *AvroFileInputFormat* thus simply override it with *"isSplitable =
true"*? (I haven't tested how GeoMesa would react.)
I suppose the TSV and CSV input formats should also be marked as
splittable, by the way, shouldn't they?
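
In code terms, the (untested) change would be roughly this, assuming the
class can simply be subclassed or edited in place:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.locationtech.geomesa.jobs.mapreduce.AvroFileInputFormat

    // Untested sketch: let Hadoop split Avro inputs. Avro containers do
    // carry sync markers, but whether GeoMesa's record reader can start
    // mid-file is exactly the open question above.
    class SplittableAvroFileInputFormat extends AvroFileInputFormat {
      override protected def isSplitable(context: JobContext, path: Path): Boolean = true
    }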

Thanks,


