Re: [geomesa-users] duplicate data in geomesa 1.2.1--how? and why?

Hi Ben,

If your simple feature type has a geometry field and a date field, GeoMesa will use them for indexing. If you have more than one date or geometry field, you can indicate which ones should be used for the index (more below). If nothing is indicated, I believe GeoMesa defaults to the first one declared. You can check which fields are indexed by scanning the GeoMesa metadata table in the Accumulo shell. The exact entries vary by version, but you can probably figure it out; if not, please reply with the scan output and we can parse it for you.

Exactly how you indicate the defaults depends on how you are creating your simple feature type. For dates, the end result should be a user-data entry with the key "geomesa.index.dtg" whose value is the name of your date field. For geometries, the default should be the one returned by simpleFeatureType.getGeometryDescriptor().

If you just have lat/lon, you will need to turn them into an actual geometry type in your simple feature type.

During ingestion, you can then set the indexed fields to whatever values you want (sys time, provided time, etc).
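As a rough illustration of the two preparation steps above (turning lat/lon into a geometry, and using a time taken from the data), here is a minimal sketch in plain Java. It only builds a WKT point string and a java.util.Date; the class and method names are hypothetical, and handing the results to a feature builder is left out because that depends on how you create your features. Note that WKT puts longitude (x) before latitude (y).

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.Date;

// Hypothetical helper, not part of GeoMesa or GeoTools.
final class LatLonPrep {

    // Builds WKT text for a point; x is longitude, y is latitude.
    static String toWktPoint(double lat, double lon) {
        if (lat < -90 || lat > 90 || lon < -180 || lon > 180) {
            throw new IllegalArgumentException("lat/lon out of range: " + lat + ", " + lon);
        }
        // toPlainString avoids exponent notation for very small values.
        return "POINT (" + BigDecimal.valueOf(lon).toPlainString()
                + " " + BigDecimal.valueOf(lat).toPlainString() + ")";
    }

    // Parses an ISO-8601 UTC timestamp supplied by the data (not system time).
    static Date toDate(String isoInstant) {
        return Date.from(Instant.parse(isoInstant));
    }

    public static void main(String[] args) {
        System.out.println(toWktPoint(51.57, -1.31));
        System.out.println(toDate("2017-02-12T15:28:00Z").getTime());
    }
}
```

The resulting geometry and date would then be set as the values of the indexed geometry and date attributes during ingestion.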

Thanks,

Emilio

On 02/14/2017 10:35 AM, Benjamin Weaver wrote:

Thanks, Emilio,


There is a lot of very valuable information here.


Two questions just to clarify (you were clear in your answers--the lack of clarity is in my understanding of things):


1. How would we index in Geomesa on latitude, longitude, and a time we provide from our own data, i.e. not a system generated timestamp?




From: geomesa-users-bounces@xxxxxxxxxxxxxxxx <geomesa-users-bounces@xxxxxxxxxxxxxxxx> on behalf of Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx>
Sent: 13 February 2017 14:42
To: geomesa-users@xxxxxxxxxxxxxxxx
Subject: Re: [geomesa-users] duplicate data in geomesa 1.2.1--how? and why?
 
Hi Ben,

1. The key used to write in geomesa depends on the particular index, but it will always include the feature ID, so if the feature ID changes you will get a duplicate record.

2. If you're using our converter framework, it has functions that use an MD5 hash of the values as the feature ID, which will prevent duplicates. If not, you can do the same thing by generating the feature ID yourself and setting the PROVIDED_FID or USE_PROVIDED_FID hint. We also have a pluggable SPI interface for generating feature IDs when they aren't set. See http://www.geomesa.org/documentation/user/datastores/runtime_config.html#geomesa-feature-id-generator. By default we generate a UUID that includes parts of the Z3 index, so that features grouped in space-time will also be grouped in Accumulo. Note that the feature ID is a string and has no inherent restrictions on form.

3. The Z3 index uses the default date attribute to index records, not the insertion time.
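To make points 1 and 2 concrete, here is a small self-contained Java sketch. It is not GeoMesa code: the key layout (index prefix plus feature ID), the class name, and the helper names are all assumptions made for illustration, and a TreeMap stands in for a sorted key-value table. It shows that writing identical values twice yields two rows when the ID differs per write, and one row when the ID is an MD5 hash of the values.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import java.util.function.Function;

// Hypothetical illustration only; GeoMesa's real key layout differs by index.
final class FeatureIdDedupSketch {

    // Assumed layout: an index-specific prefix followed by the feature ID.
    static String rowKey(String indexPrefix, String featureId) {
        return indexPrefix + "|" + featureId;
    }

    // Deterministic ID: hex MD5 over the attribute values, in order.
    static String md5Id(String... values) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String v : values) {
                md.update(v.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Stand-in for any generator whose output changes on every write.
    static String randomId(String... values) {
        return UUID.randomUUID().toString();
    }

    // Writes the same record twice and returns the number of rows stored.
    static int ingestTwice(Function<String[], String> idGenerator, String... values) {
        Map<String, String> table = new TreeMap<>();
        for (int i = 0; i < 2; i++) {
            String id = idGenerator.apply(values);
            table.put(rowKey("z3", id), String.join(",", values));
        }
        return table.size();
    }

    public static void main(String[] args) {
        String[] record = {"POINT (-1.31 51.57)", "2017-02-12T15:28:00Z", "sensor-7"};
        // Distinct IDs give distinct keys, so the duplicate is kept.
        System.out.println(ingestTwice(v -> randomId(v), record));
        // Same values give the same ID and key, so the second write overwrites the first.
        System.out.println(ingestTwice(v -> md5Id(v), record));
    }
}
```

In a real ingest, an ID computed this way would be set on the feature together with the provided-FID hint mentioned in point 2, so that the store keeps it instead of generating its own.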

Let me know if anything isn't clear!

Thanks,

Emilio

On 02/12/2017 03:28 PM, Benjamin Weaver wrote:

Hi all,


If we ingest, say, the same line of text data twice (by mistake) in GeoMesa 1.2.1, we end up with duplicate data in our Accumulo (1.7.2) database. We are ingesting using GeoMesa-generated featureIDs (setting our featureBuilder.setFeatureID to null, without the use of Hints).


A colleague asked me, why are duplicates generated in this case? I realized I did not know.


1. How, exactly, in our configuration of GeoMesa + Accumulo, is a GeoMesa row or record made unique? I know the importance of Accumulo logical rows, but in this case of identical data we would want to ensure insertion of only one GeoMesa record, namely, one instance of our GeoMesa SimpleFeature.


1a. Are duplicate GeoMesa rows added because the time at insertion differs, or because different featureIDs are randomly generated on each insertion?



Potentially related questions:


2. How are featureIDs generated by GeoMesa? I thought randomly, but I read a comment somewhere suggesting that featureIDs were created from an MD5 hash of all the values in the feature. A colleague points out that even if this is so, a featureID does not resemble an MD5 hash, so it must be composed at least partially by other means.


3. A potentially related question: can we create a z3 index by using a data-derived timestamp--not the insertion timestamp-- as the time dimension?


All comments and perspectives are appreciated and welcome!


Ben Weaver





This email (and any attachments) may contain confidential information and is intended solely for the recipient(s) to whom the email is addressed. If you received this email in error, please inform us immediately and delete the email and all attachments without further using, copying or disclosing the information. This email and any attachments are believed to be, but cannot be guaranteed to be, secure or virus-free. Satellite Applications Catapult Limited is registered in England & Wales. Company Number: 7964746. Registered office: Electron Building, Fermi Avenue, Harwell Oxford, Didcot, Oxfordshire OX11 0QR.

_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users
