Re: [geowave-dev] Accumulo Key Structure

Re: [geowave-dev] Accumulo Key Structure - Storing Point Data

From: Eric Robertson <rwgdrummer@xxxxxxxxx>
Date: Tue, 13 Oct 2015 09:55:36 -0400
Delivered-to: geowave-dev@xxxxxxxxxxxxxxxx

Marcel,

There is one SFC per tier and per bin. Each SFC is independent (Hilbert values start over).

(1) Any data item occupies one tier (per bin).

(1a) A geometry exists in only one tier (per bin). If the geometry is not a point and spans the edge of a cell, it is duplicated.

(1b) A geometry with a time range exists in one tier. If the geometry and/or the time range spans the edge of a cell, it is duplicated.

Note: A point exists in the most granular tier.
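To make the duplication rule in (1a) and (1b) concrete, here is a minimal one-dimensional sketch. It is not GeoWave's tiered SFC code: the class TierCellSketch and the method cellsTouched are hypothetical names, and the grid is a plain equal-width split of one dimension into 2^bits cells. A point always lands in exactly one cell, while a range that crosses a cell edge touches several cells and would need one entry per cell.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a 1-D equal-width grid, not GeoWave's tiered SFC implementation. */
final class TierCellSketch {

    /**
     * Returns the indices of the cells that the closed range [min, max] touches
     * when the dimension [lo, hi) is split into 2^bits equal cells.
     */
    static List<Integer> cellsTouched(double min, double max, double lo, double hi, int bits) {
        int cellCount = 1 << bits;
        double cellWidth = (hi - lo) / cellCount;
        int first = clamp((int) Math.floor((min - lo) / cellWidth), cellCount);
        int last = clamp((int) Math.floor((max - lo) / cellWidth), cellCount);
        List<Integer> cells = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            cells.add(i);
        }
        return cells;
    }

    /** Keeps an index inside [0, cellCount - 1] so that max == hi maps to the last cell. */
    private static int clamp(int index, int cellCount) {
        return Math.max(0, Math.min(cellCount - 1, index));
    }
}
```

With longitude as the dimension and bits = 2 (four cells of 90 degrees), cellsTouched(30, 30, -180, 180, 2) yields a single cell, whereas cellsTouched(-10, 10, -180, 180, 2) yields two cells because the range crosses the edge at 0.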

(2) Bins can be used in any scenario, but they are necessary for 'unbounded' dimensions (e.g. time). When a numeric range spans more than one bin (e.g. it crosses a year boundary), multiple bins are used, creating a duplicate for each bin. The data occupies only one tier in each bin. Within a tier, data can still be duplicated, as per statement (1).
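As a rough illustration of statement (2), the sketch below lists the yearly bins that a time range overlaps. It assumes a year periodicity in UTC; YearBinSketch and binsSpanned are hypothetical names and this is not the TemporalBinningStrategy implementation. Each returned year stands for one bin, so a range that crosses New Year would be written once per bin.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: yearly bins in UTC, not GeoWave's binning code. */
final class YearBinSketch {

    /** Returns every calendar year (UTC) that the closed range [start, end] overlaps. */
    static List<Integer> binsSpanned(Instant start, Instant end) {
        int firstYear = start.atZone(ZoneOffset.UTC).getYear();
        int lastYear = end.atZone(ZoneOffset.UTC).getYear();
        List<Integer> bins = new ArrayList<>();
        for (int year = firstYear; year <= lastYear; year++) {
            bins.add(year);
        }
        return bins;
    }
}
```

A range from 2009-12-31 to 2010-01-02 maps to the bins 2009 and 2010, while a range that stays inside 2010 maps to the single bin 2010.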

Only an index strategy can decompose a key. TieredSFCIndexStrategy has private/protected methods to perform this function. See the private static method 'getSFCIdAndBinInfo'.

On Tue, Oct 13, 2015 at 9:34 AM, Marcel <m.jacob@xxxxxxxxxxx> wrote:

Rich,
thank you for your explanations. I made a new, simpler drawing for the single-tier strategy, because I only have point data. I hope this sketch is easier to understand and more precise than the first one.
Assume a precision of 2 for each dimension. According to the TieredSFCIndexFactory class, only one space-filling curve is created for point data. So for each tier there is only one SFC, isn't there? This would mean that my SFC has to span all of my bins: the Hilbert values 1-8 belong to binId = 2000, the values 9-16 (or do they start again from 1?) to binId 2001, and so on.

Okay, the AccumuloRowId helped a lot, but the most interesting part for me is the insertionId. How would you decompose it further (tier, bin, and Hilbert value) if you want to analyze the indexId?

Thanks in advance,
Marcel Jacob.
On 12.10.2015 at 21:50, Rich Fecher wrote:
Marcel,
I'm trying to understand the attachment and have concluded that there is some confusion about binning that I can't quite pinpoint. I think the diagram conflates binning with Hilbert values in a way that does not match how they actually work. The bin is a basic concept that is completely decoupled from the Hilbert curve. There is a lot of discussion on space-filling curves that could serve as good background here: https://github.com/geotrellis/curve/issues/3#issuecomment-76588640

If you look at Rob's comment regarding "Unbounded Dimensions", it's a fairly accurate characterization of binning as primarily concerned with bounding the unbounded dimension using periodicity. Our default uses a year as the periodicity, but we have an enum that lets day or month be used instead through index configuration (https://github.com/ngageoint/geowave/blob/master/core/geotime/src/main/java/mil/nga/giat/geowave/core/geotime/index/dimension/TemporalBinningStrategy.java). With GDELT data, you will likely want to continue to use year. In the case of space and time, bin IDs are really straightforward because there is only one unbounded dimension (time). When there are multiple unbounded dimensions, it can become less clear. If you use the default index, your bin ID will be the year. The Hilbert value would be a 3D SFC value with the time dimension bounded by the beginning and end of that particular year. So 2010 would be the prefix in your example, and the SFC value would be based on January 1st (year agnostic, i.e. it would be much like a longitude value of -180).
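A small sketch may help show what "year agnostic" means here. It assumes a year periodicity in UTC and a linear mapping of time onto [0, 1) within the year; YearPeriodSketch, binId and fractionOfYear are hypothetical names, not GeoWave API. The year becomes the bin ID, and only the position within the year would feed into the SFC, so January 1st sits at the lower bound of the time dimension in every year, much as -180 sits at the lower bound of longitude.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Illustrative only: splits a timestamp into a yearly bin and a year-agnostic offset. */
final class YearPeriodSketch {

    /** The bin ID under the assumed year periodicity: simply the UTC calendar year. */
    static int binId(Instant t) {
        return t.atZone(ZoneOffset.UTC).getYear();
    }

    /** Fraction of the year that has elapsed at t, in [0, 1). */
    static double fractionOfYear(Instant t) {
        ZonedDateTime z = t.atZone(ZoneOffset.UTC);
        ZonedDateTime yearStart = ZonedDateTime.of(z.getYear(), 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        ZonedDateTime nextYearStart = yearStart.plusYears(1);
        long startMillis = yearStart.toInstant().toEpochMilli();
        double elapsed = t.toEpochMilli() - startMillis;
        double total = nextYearStart.toInstant().toEpochMilli() - startMillis;
        return elapsed / total;
    }
}
```

For 2010-01-01T00:00:00Z this gives bin 2010 and an offset of 0.0; for 2003-01-01T00:00:00Z it gives bin 2003 and the same offset of 0.0, so both points would see the same time coordinate inside their respective bins.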

Does that make more sense?

As far as seeing key structure, you can scan accumulo programmatically and use https://github.com/ngageoint/geowave/blob/master/extensions/datastores/accumulo/src/main/java/mil/nga/giat/geowave/datastore/accumulo/AccumuloRowId.java to try to see the individual components, although insertion ID can really only be further decomposed by a NumericIndexStrategy.
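For orientation, the following standalone sketch splits a row ID into components without depending on GeoWave. The byte layout is an assumption made for illustration: insertion ID, adapter ID and data ID concatenated, followed by three 4-byte integers (adapter ID length, data ID length, number of duplicates). Check the linked AccumuloRowId source for the layout your version really uses. RowIdSketch and compose are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/**
 * Illustrative only. ASSUMED layout (verify against AccumuloRowId):
 * [insertionId][adapterId][dataId][adapterIdLength:int][dataIdLength:int][duplicates:int]
 */
final class RowIdSketch {
    private static final int TRAILER_LENGTH = 3 * Integer.BYTES;

    final byte[] insertionId;
    final byte[] adapterId;
    final byte[] dataId;
    final int numberOfDuplicates;

    RowIdSketch(byte[] rowId) {
        if (rowId.length < TRAILER_LENGTH) {
            throw new IllegalArgumentException("row ID shorter than the assumed trailer");
        }
        int bodyLength = rowId.length - TRAILER_LENGTH;
        // Read the three trailing integers first; they describe how to cut the body.
        ByteBuffer trailer = ByteBuffer.wrap(rowId, bodyLength, TRAILER_LENGTH);
        int adapterIdLength = trailer.getInt();
        int dataIdLength = trailer.getInt();
        numberOfDuplicates = trailer.getInt();
        if (adapterIdLength < 0 || dataIdLength < 0
                || (long) adapterIdLength + dataIdLength > bodyLength) {
            throw new IllegalArgumentException("lengths do not fit the assumed layout");
        }
        int insertionIdLength = bodyLength - adapterIdLength - dataIdLength;
        insertionId = Arrays.copyOfRange(rowId, 0, insertionIdLength);
        adapterId = Arrays.copyOfRange(rowId, insertionIdLength, insertionIdLength + adapterIdLength);
        dataId = Arrays.copyOfRange(rowId, insertionIdLength + adapterIdLength, bodyLength);
    }

    /** Builds a row ID in the assumed layout; only used to exercise the parser. */
    static byte[] compose(byte[] insertionId, byte[] adapterId, byte[] dataId, int duplicates) {
        return ByteBuffer
                .allocate(insertionId.length + adapterId.length + dataId.length + TRAILER_LENGTH)
                .put(insertionId).put(adapterId).put(dataId)
                .putInt(adapterId.length).putInt(dataId.length).putInt(duplicates)
                .array();
    }
}
```

Under this assumption the insertion ID comes out as an opaque byte array. Splitting it further into tier, bin and SFC value is the part that only the index strategy can do.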

It would be nice to provide an implementation of org.apache.accumulo.core.util.format.Formatter for this so that the scans could be performed directly in the accumulo shell and the keys could be nicely formatted and human readable. We have that for the values as much as possible by using this, but have nothing equivalent for understanding the keys:

https://github.com/ngageoint/geowave/blob/master/extensions/datastores/accumulo/src/main/java/mil/nga/giat/geowave/datastore/accumulo/util/PersistentDataFormatter.java

I just created the issue for that (https://github.com/ngageoint/geowave/issues/528), as it's a fairly straightforward task for interested new contributors, although it requires some digging into the key structure, which may be a valuable way to digest the concept.
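Until a key formatter exists, one low-tech way to make row IDs legible when scanning programmatically is to print printable ASCII as-is and escape every other byte as hex. The helper below is only a sketch of that idea: KeyHexSketch and toReadable are hypothetical names, and it does not implement the Accumulo Formatter interface.

```java
/** Illustrative only: renders binary key bytes as ASCII with \xNN escapes. */
final class KeyHexSketch {

    /** Printable ASCII is kept; every other byte (and the backslash) becomes \xNN. */
    static String toReadable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x20 && value < 0x7F && value != '\\') {
                sb.append((char) value);
            } else {
                sb.append(String.format("\\x%02X", value));
            }
        }
        return sb.toString();
    }
}
```

Applied to the bytes of a scanned row, this keeps readable parts such as a year prefix or an adapter name visible and shows the binary SFC portion as escaped hex instead of unreadable characters.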

Rich

On Mon, Oct 12, 2015 at 2:51 PM, Marcel <m.jacob@xxxxxxxxxxx> wrote:

Hello,
I've got a couple of questions about storing point data. In the attachment you can find a drawing of my current understanding of how this key structure might work.

I read your presentation at the Accumulo Summit, but it's not quite clear to me how to determine some values.
http://accumulosummit.com/program/talks/geowave-geospatial-and-geotemporal-data-storage-and-retrieval-in-accumulo/

I've chosen a very simple case with 8 cubes. The whole cube represents the world from 2000 to 2015 (16 years).
If I want to store my point P(30, -180, 2010-01-01), it is said that we first have to determine the "tier". Because it's a point, it will be stored in the highest tier number. In my case there are only tier 0 and tier 1. Next comes the bin. This is where my assumptions start: we need a binID. In my drawing this is determined using a Hilbert curve. Is this correct? Because my point P is in the last of the 8 sub-cubes, the binID would be set to 8. Because the date range is known, this could be done without any problems. But when I want to add my point P to Accumulo without any additional information, this would cause problems. Is there a default date range which is used? Or will the binID be added later on, when all the data is in Accumulo (and the date range is known)?
Each bin has its own Hilbert space. But which resolution do you use? (In my drawing it is also a first-order Hilbert curve.) Where do you store the boundaries for each bin (or are they calculated on the fly)? The resulting entries in Accumulo for my example are at the bottom of my sheet of paper.
Within the Accumulo structure I can't see a parameter which partitions the data evenly across my nodes. Do you avoid hotspots with a random prefix?

I hope my sketch helps you understand what my problems with the Accumulo key structure are. Please correct me if my drawing is wrong; it's hard to get an understanding of this complex structure.

Is there a method which returns an entry in the accumulo data format? I wrote a Scanner, but part of the results of the rowId were not readable: "2003 >)æ šb ï¿¿ geowave-gdelt260176188"

Thanks in advance,
Marcel Jacob.

_______________________________________________
geowave-dev mailing list
geowave-dev@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geowave-dev
