Re: [geomesa-users] Loading Data

From: Hunter Provyn <fhp@xxxxxxxx>
Date: Wed, 16 Apr 2014 17:18:43 -0400

Hi Chris,

Here is an example Java project that ingests GDELT using Hadoop 2.2, Accumulo 1.5, and the tip of GeoMesa master.
It took 30 minutes to ingest a 72 GB TSV file, which is an uncompressed concatenation of GDELT up to Feb 24, 2014.

We plan to roll it into geomesa/geomesa-gdelt but for now it is a separate project:

https://github.com/ccri/geomesa-gdelt

Instructions
1) mvn install
2) hadoop jar target/geomesa-gdelt-1.0-SNAPSHOT.jar geomesa.gdelt.GDELTIngest -instanceId [instanceId] -zookeepers [zookeepers] -user [user] -password [password] -auths [auths] -tableName [tableName] -featureName [featureName] -ingestFile [ingestFile]

The job copies its own jar to HDFS. The ingestFile must be a GDELT-format TSV that is already on HDFS.
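
Since the command in step 2 takes eight flags, it is easy to leave one out. The sketch below is a hypothetical pre-flight check, not code from the geomesa-gdelt project: it assumes the arguments arrive as alternating "-flag value" pairs, as in the command above, and reports which flags are missing. The class and method names (IngestArgs, parse, missing) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of geomesa-gdelt.
class IngestArgs {
    // Flag names copied from the hadoop jar command in step 2.
    static final List<String> REQUIRED = Arrays.asList(
        "instanceId", "zookeepers", "user", "password",
        "auths", "tableName", "featureName", "ingestFile");

    // Turns ["-user", "root", "-password", "secret"] into {user=root, password=secret}.
    // Assumes every flag is followed by exactly one value.
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every flag needs a value");
        }
        Map<String, String> parsed = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag but got: " + args[i]);
            }
            parsed.put(args[i].substring(1), args[i + 1]);
        }
        return parsed;
    }

    // Lists the required flags that were not supplied, in the order of REQUIRED.
    static List<String> missing(Map<String, String> parsed) {
        List<String> absent = new ArrayList<>();
        for (String flag : REQUIRED) {
            if (!parsed.containsKey(flag)) {
                absent.add(flag);
            }
        }
        return absent;
    }
}
```

Calling IngestArgs.missing(IngestArgs.parse(args)) before submitting the job gives a readable list of what was forgotten.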

I hope this is helpful - let me know if you have any questions. This branch is still under development and we are also working on a complete tutorial to accompany it.

thanks,
Hunter

On 04/11/2014 04:42 PM, Hunter Provyn wrote:

Hi Chris,

We recommend the steps below for ingesting non-shapefile data such as a CSV or TSV file:

1. In Java code, get a handle on a DataStore using DataStoreFinder.getDataStore().
2. Create a SimpleFeatureType for GDELT using DataUtilities.
3. Call dataStore.createSchema(featureType).
4. Run the MapReduce job with that schema.

I'm working on an example project in Java that I will send you when complete.

The example below uses DataUtilities to create a SimpleFeatureType for GDELT. I took the types from the GDELT online documentation, so you may want to double-check some of them in the sftSpec string:

String name = "gdelt";
String sftSpec =
"GLOBALEVENTID:Integer,SQLDATE:Date,MonthYear:Integer,Year:Integer,FractionDate:Float,Actor1Code:String,Actor1Name:String,Actor1CountryCode:String,Actor1KnownGroupCode:String,Actor1EthnicCode:String,Actor1Religion1Code:String,Actor1Religion2Code:String,Actor1Type1Code:String,Actor1Type2Code:String,Actor1Type3Code:String,Actor2Code:String,Actor2Name:String,Actor2CountryCode:String,Actor2KnownGroupCode:String,Actor2EthnicCode:String,Actor2Religion1Code:String,Actor2Religion2Code:String,Actor2Type1Code:String,Actor2Type2Code:String,Actor2Type3Code:String,IsRootEvent:Integer,EventCode:String,EventBaseCode:String,EventRootCode:String,QuadClass:Integer,GoldsteinScale:Float,NumMentions:Integer,NumSources:Integer,NumArticles:Integer,AvgTone:Float,Actor1Geo_Type:Integer,Actor1Geo_FullName:String,Actor1Geo_CountryCode:String,Actor1Geo_ADM1Code:String,Actor1Geo_Lat:Float,Actor1Geo_Long:Float,Actor1Geo_FeatureID:Integer,Actor2Geo_Type:Integer,Actor2Geo_FullName:String,Actor2Geo_CountryCode:String,Actor2Geo_ADM1Code:String,Actor2Geo_Lat:Float,Actor2Geo_Long:Float,Actor2Geo_FeatureID:Integer,ActionGeo_Type:Integer,ActionGeo_FullName:String,ActionGeo_CountryCode:String,ActionGeo_ADM1Code:String,ActionGeo_Lat:Float,ActionGeo_Long:Float,ActionGeo_FeatureID:Integer,DATEADDED:Integer";

SimpleFeatureType featureType = DataUtilities.createType(name, sftSpec);

dataStore.createSchema(featureType);
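
One way to double-check the types is to test a few real rows against the spec before running the full ingest. The sketch below uses only the JDK, not GeoTools, and makes some assumptions: it handles only plain "name:Type" entries like the ones above, it treats empty fields as acceptable, and it checks only Integer and Float columns (Date and String are passed through). The names SpecCheck, parseSpec, and firstProblem are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for sanity-checking a spec string against TSV rows.
class SpecCheck {
    // Parses "name:Type,name:Type" into an ordered name -> type map.
    // Entries with extra options (more than one colon) are rejected.
    static Map<String, String> parseSpec(String spec) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad spec entry: " + entry);
            }
            attrs.put(parts[0], parts[1]);
        }
        return attrs;
    }

    // Returns null if the row fits the spec, otherwise a description
    // of the first problem found.
    static String firstProblem(Map<String, String> attrs, String tsvLine) {
        // The -1 limit keeps trailing empty columns.
        String[] cols = tsvLine.split("\t", -1);
        if (cols.length != attrs.size()) {
            return "expected " + attrs.size() + " columns but found " + cols.length;
        }
        int i = 0;
        for (Map.Entry<String, String> attr : attrs.entrySet()) {
            String value = cols[i++];
            if (value.isEmpty()) {
                continue; // assumption: blank fields are allowed
            }
            try {
                if (attr.getValue().equals("Integer")) {
                    Integer.parseInt(value);
                } else if (attr.getValue().equals("Float")) {
                    Float.parseFloat(value);
                }
            } catch (NumberFormatException e) {
                return attr.getKey() + ": '" + value + "' is not a valid " + attr.getValue();
            }
        }
        return null;
    }
}
```

Running firstProblem over the first few thousand lines of the ingest file with parseSpec(sftSpec) should show quickly whether any column was given the wrong type.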

Hunter

On 04/11/2014 01:22 PM, Chris Snider wrote:
Hi,

I saw some of the GeoMesa YouTube videos referencing loading data as well as the “Spatio-temporal Indexing in Non-relational Distributed Databases” paper referencing loading the GDELT dataset. Are there any documented steps on how to load the GDELT dataset?

Additionally, I have been able to load features to a feature type using the WFS-T endpoint. Is there a better, faster, or more efficient method of loading even modest amounts of data? For example, I extracted the geometry, name, and admin columns from a Natural Earth country polygon set to push into GeoMesa through the WFS-T endpoint. I can only push between 5 and 10 rows without hitting a timeout.

Thanks,

Chris Snider

Senior Software Engineer

Intelligent Software Solutions, Inc.
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
http://www.locationtech.org/mailman/listinfo/geomesa-users
