
Re: [geowave-dev] Avoiding Hotspots

A follow up wrt hot spots:

First, from my previous answer, 'attempt to add' was supposed to be 'attempt to aid'.  

Second, I found this nice little write-up about balancing that also mentions GeoWave and addresses some of the ideas I expressed in the prior response: https://blogs.apache.org/accumulo/entry/balancing_groups_of_tablets.  Like I said, nothing comes for free.  A little planning and data profiling will yield good results.

I am planning on pushing some examples that use the numeric index strategy approach.  Even though GeoWave has default index strategies, the defaults are not 'one size fits all'.  When using GeoServer, one can create a custom index and store it in the IndexStore before creating a layer in GeoServer and ingesting the data.  The GeoServer plugin finds and recognizes the index, choosing the pre-stored index over one of the defaults.

When performing a bulk ingest, each 'format' supports a method to obtain the preferred indices.  A developer can override this method to provide a custom index.

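// CustomBalanceIndexStrategy is the user-supplied NumericIndexStrategy;
// it is combined with the default spatial strategy.
final Index index = new CustomIdIndex(
    new CompoundIndexStrategy(
        new CustomBalanceIndexStrategy(),
        IndexType.SPATIAL_VECTOR.createDefaultIndexStrategy()),
    IndexType.SPATIAL_VECTOR.getDefaultIndexModel(),
    new ByteArrayId("my_index_id"));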


On Wed, Oct 21, 2015 at 3:18 PM, Eric Robertson <rwgdrummer@xxxxxxxxx> wrote:
The key structure is documented here: http://ngageoint.github.io/geowave/documentation.html#architecture-accumulo.

Using the default indexing, assuming that only points are stored, the tier does not change.  The bin is used for unconstrained numeric ranges (e.g. time).  With the Hilbert value being the third part of the key, data within a specific locality is likely to be collocated on one server.  This is not guaranteed, since space-filling curves have some long stretches.  Furthermore, row IDs are unique; they include adapter ID and data ID.  A split can occur anywhere in the range.

Per the documentation, supported by observation, Accumulo keeps the ingest load for a single table on a single tablet server, re-balancing only as the file grows beyond a certain size.  Nothing comes for free here.  Doing a little extra work can have great benefit.  If, for example, temporal data is indexed over many years and the distribution of the data is known in advance, pre-splitting can occur, since the bin ID (e.g. one per year for temporal data by default) is the second most significant component of the key.
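
A rough sketch of what computing such pre-split points could look like, in plain Java.  The key prefix layout used here (one tier byte followed by the year as four ASCII digits) and the names YearPreSplitSketch, splitPoint and yearlySplits are assumptions made for illustration only; the real GeoWave bin encoding should be checked against the key structure documentation before using anything like this.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class YearPreSplitSketch {

    // ASSUMPTION: hypothetical prefix layout [1 tier byte][4 ASCII year digits].
    // This is not taken from GeoWave; it only illustrates "one split per bin".
    static byte[] splitPoint(byte tier, int year) {
        byte[] bin = String.format("%04d", year).getBytes(StandardCharsets.US_ASCII);
        byte[] split = new byte[1 + bin.length];
        split[0] = tier;
        System.arraycopy(bin, 0, split, 1, bin.length);
        return split;
    }

    // One split point per yearly bin, in ascending year order.
    static List<byte[]> yearlySplits(byte tier, int firstYear, int lastYear) {
        List<byte[]> splits = new ArrayList<>();
        for (int year = firstYear; year <= lastYear; year++) {
            splits.add(splitPoint(tier, year));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<byte[]> splits = yearlySplits((byte) 0, 2010, 2015);
        System.out.println(splits.size() + " split points");
    }
}
```

With Accumulo, each byte array would be wrapped in an org.apache.hadoop.io.Text and the resulting sorted set passed to TableOperations.addSplits(tableName, splits) before the ingest starts.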

We have not offered any additional completed tools to attempt to add with hot spotting.  We do have ongoing discussions about adding some Index Strategies to help.

One customized way to handle this is to add a new NumericIndexStrategy and combine it with the tiered index strategy using CompoundIndexStrategy.  The NumericIndexStrategy would use a limited set of identifiers, selected uniformly.  Doing so requires some caution.  Queries need to run across all possible identifiers (getQueryRanges returns all identifiers).  If you would like to try this, I would be happy to help in any way.
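
To make the trade-off concrete, here is a minimal sketch of the idea in plain Java.  It does not implement GeoWave's NumericIndexStrategy interface; the class SaltedKeySketch, the constant NUM_SALTS and the methods below are hypothetical names used only to show the mechanics: writes pick one of a small, fixed set of identifiers, and every query has to fan out across all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SaltedKeySketch {

    // ASSUMPTION: a small, fixed set of identifiers (0..NUM_SALTS-1).
    static final int NUM_SALTS = 4;

    // Pick an identifier from the data ID so that entries spread roughly
    // uniformly; floorMod keeps the result non-negative.
    static byte saltFor(byte[] dataId) {
        return (byte) Math.floorMod(Arrays.hashCode(dataId), NUM_SALTS);
    }

    // Prepend one identifier byte to an existing key.
    static byte[] prefix(byte salt, byte[] key) {
        byte[] out = new byte[1 + key.length];
        out[0] = salt;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }

    // Key used on insert: [identifier][spatial key].
    static byte[] insertionKey(byte[] dataId, byte[] spatialKey) {
        return prefix(saltFor(dataId), spatialKey);
    }

    // A single [start, end] range becomes NUM_SALTS ranges at query time,
    // one per identifier. This is the cost of spreading the writes.
    static List<byte[][]> queryRanges(byte[] start, byte[] end) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int salt = 0; salt < NUM_SALTS; salt++) {
            ranges.add(new byte[][] {
                prefix((byte) salt, start), prefix((byte) salt, end) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        byte[] key = insertionKey(new byte[] {1, 2, 3}, new byte[] {10, 20});
        System.out.println("insertion key: " + Arrays.toString(key));
        System.out.println("query ranges: "
            + queryRanges(new byte[] {10}, new byte[] {20}).size());
    }
}
```

The number of identifiers is the knob: more identifiers spread ingest and query work over more tablets, but multiply the number of ranges each query must scan.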


Some of the initial work on alternative approaches has been set aside for the moment, although I suspect that our priorities will change soon.




On Wed, Oct 21, 2015 at 11:38 AM, Marcel <m.jacob@xxxxxxxxxxx> wrote:
Hello all,
this is a question concerning performance and load balancing while querying. Do you have a strategy for avoiding hotspots, like a random sharding key as a prefix? All data is distributed uniformly over all nodes, but this doesn't guarantee that all my nodes have approximately the same number of records to process when executing a query. Maybe one node keeps all the information about Germany (simplified assumption).

Best regards,
Marcel Jacob.
_______________________________________________
geowave-dev mailing list
geowave-dev@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geowave-dev


