
Re: [geowave-dev] SimpleIngestIndexWriter example

Dave,
I can help with the real-time ingest from AWS Kinesis, and my colleague will follow up regarding the secondary indices.  Secondary indexing, along with collecting the appropriate stats to inform query planning, is a feature that is actively being improved for our next release, and he has the best insight into the latest development.

Generally, Accumulo should actually be a really good fit for your use case of querying during real-time ingest.  The time to worry about query performance degradation is when the tables are actively running minor or major compactions.  If that happens to be occurring after you kill the Spark ingest, it could explain what you saw, but it seems awfully coincidental.  You can force a compaction (programmatically or through the Accumulo shell) at any time if there are low-usage windows. Looking at the current code for a text-based attribute secondary index, I can see that when the NGram secondary index is used, a HyperLogLog statistic is added to the metadata table (the statistic is used for query planning).  So it may help to force a compaction on the *_GEOWAVE_METADATA table periodically, or prior to testing the query. Programmatically, after constructing an Accumulo Connector, this can be done like so:

connector.tableOperations().compact("spire_GEOWAVE_METADATA", null, null, true, true);
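The Accumulo shell equivalent would be along these lines (table name as in the call above; `-w` waits for the compaction to finish before returning):

```
compact -t spire_GEOWAVE_METADATA -w
```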

As some background, when stats are written to GeoWave's metadata table, an Accumulo combiner is used to merge them together (so every call to close the writer flushes a new set of locally accumulated statistics as new entries in the table).  The typical/default statistics are very inexpensive to merge, but it's not clear to me how expensive merging HyperLogLog stats can be.  When a compaction runs, the table is re-written with the merged results, so you should see the metadata table grow in number of entries and shrink again after compaction.
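As an aside, for a standard dense HyperLogLog sketch, a merge is just an element-wise maximum over the two register arrays, so the merge operation itself is cheap; the practical cost also includes (de)serializing the sketch on each combine. A minimal illustration of the merge step (this is not GeoWave's actual code; the class name and the four-register arrays are made up for the example):

```java
// Merging two HyperLogLog-style sketches: take the per-register max.
// Real sketches use thousands of registers; four are shown for brevity.
class HllMergeSketch {

    // Merge two sketches of equal register count element-wise.
    static byte[] merge(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) Math.max(a[i], b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] local = {3, 0, 5, 1};   // registers flushed by one writer
        byte[] stored = {2, 4, 1, 1};  // registers already in the metadata table
        byte[] merged = merge(local, stored);
        System.out.println(java.util.Arrays.toString(merged)); // [3, 4, 5, 1]
    }
}
```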

Hopefully that helps.
Rich

On Mon, Aug 15, 2016 at 1:29 AM, David McDonald <dmcdonald@xxxxxxxxxxxxxx> wrote:
Hi Rich,

Thanks for your very detailed reply. 

I discovered the main performance hit was actually due to the fact that I was still ingesting at the time I had been running my queries. Our AIS vessel messages arrive on an AWS Kinesis stream, so I have been using Spark Streaming for my ingest. I had assumed this would have an impact on performance, so I had previously tried stopping the Spark job before running my tests; however, I had never really given the cluster much time between stopping the job and running the tests. Turns out I need to wait several minutes after stopping the ingest before the performance improves, I guess because Accumulo can take a little while to catch up if it has a queue of inserts backed up locally.

Is there something you would suggest to better support a realtime scenario like this? Do I just need to use a bigger cluster?

I also discovered how to control partitioning in code by using a CompoundIndexStrategy like below. Took me a little while to figure this one out. 
final NumericDimensionDefinition[] SPATIAL_DIMENSIONS = new NumericDimensionDefinition[] {
        new LongitudeDefinition(),
        new LatitudeDefinition(true)
};

return new CustomIdIndex(
        new CompoundIndexStrategy(
                // randomly shard across 6 partitions
                new RoundRobinKeyIndexStrategy(6),
                TieredSFCIndexFactory.createFullIncrementalTieredStrategy(
                        SPATIAL_DIMENSIONS,
                        new int[] {31, 31},
                        SFCFactory.SFCType.HILBERT)),
        new BasicIndexModel(new NumericDimensionField[] {
                new LongitudeField(),
                new LatitudeField(true)
        }),
        new ByteArrayId("SPATIAL_IDX_ALLTIERS"));

Another question I have is around secondary indexes. The AIS messages contain a few columns that I would like to filter on: vessel ID (int), type (int), status (text), accuracy (text). I discovered I could do this in code when defining my SimpleFeatureType like so:
builder.add(ab.binding(Integer.class)
        .nillable(false)
        .userData(NumericSecondaryIndexConfiguration.INDEX_KEY, true)
        .buildDescriptor("type"));
builder.add(ab.binding(String.class)
        .nillable(false)
        .userData(TextSecondaryIndexConfiguration.INDEX_KEY, true)
        .buildDescriptor("status"));
However, something strange is happening with the text fields, resulting in a huge number of entries in the 2ND_IDX_NGRAM_2_4 table: 30 million entries for only 150 thousand records. Is this expected?

[Attachment: Screen Shot 2016-08-14 at 16.06.52.png]

Cheers
Dave


On Thu, 11 Aug 2016 at 22:42 Rich Fecher <rfecher@xxxxxxxxx> wrote:
Hi Dave,
There are a couple of things you can do that should dramatically improve your experience.  First, note that you can ingest with multiple indices and GeoWave will choose the appropriate index for a query; since you are only using a spatial index, your queries are using the spatial index.  You can confirm this in the Accumulo monitor web interface: while panning the map, there is a column showing the active scans.

The first change is quite simple and will probably give the biggest benefit. While the cluster is very small, it seems like you are running into a natural bottleneck: rendering a map with a million-plus points.  GeoWave can return points to GeoServer much more rapidly than GeoServer can render them.  To work around this bottleneck, GeoWave has various tricks. In your case, the best one to employ is GeoWave's ability to use the bounds of a map request to translate a pixel to its internal spatial index; GeoWave will then skip returning data from the tablet servers once a point satisfying the filters has been found for a given pixel's bounds.  As a result, when interacting with a map with millions or more points, there is a cap on how much work GeoServer has to do based on the pixels of the map request, and GeoServer is no longer dealing with thousands or more overlapping points from Accumulo.  To enable this, add a "rendering transformation" to your SLD:
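A minimal sketch of such a transformation, assuming the geowave:DecimatePoints rendering transform and the standard GeoServer env-variable parameters described in the GeoWave documentation (verify the function and parameter names against your GeoWave version):

```xml
<FeatureTypeStyle>
  <Transformation>
    <ows:Function name="geowave:DecimatePoints">
      <ows:Function name="parameter">
        <ows:Literal>outputBBOX</ows:Literal>
        <ows:Function name="env">
          <ows:Literal>wms_bbox</ows:Literal>
        </ows:Function>
      </ows:Function>
      <ows:Function name="parameter">
        <ows:Literal>outputWidth</ows:Literal>
        <ows:Function name="env">
          <ows:Literal>wms_width</ows:Literal>
        </ows:Function>
      </ows:Function>
      <ows:Function name="parameter">
        <ows:Literal>outputHeight</ows:Literal>
        <ows:Function name="env">
          <ows:Literal>wms_height</ows:Literal>
        </ows:Function>
      </ows:Function>
      <ows:Function name="parameter">
        <ows:Literal>pixelSize</ows:Literal>
        <ows:Literal>1.5</ows:Literal>
      </ows:Function>
    </ows:Function>
  </Transformation>
  <Rule>
    <!-- your existing point symbolizer / stylization rules go here -->
  </Rule>
</FeatureTypeStyle>
```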

"pixelSize" is an optional parameter that acts as a scale factor (since a rendered point typically spans multiple pixels, it's usually safe for this to be slightly greater than 1; we often use 1.5).  You can use this transformation in combination with any stylization rules.

Regarding splits/partitions, recommendations depend heavily on use case and cluster size.  When you configure your GeoWave index you can set `--numPartitions`, which sets up random pre-splits on the table when you ingest (essentially randomly sharding your data).  This increases the performance of initial writes to Accumulo (otherwise you rely on exceeding the tablet size threshold and having Accumulo split naturally as you ingest a large amount of data).  As for how many splits: you typically want at least as many active tablets as the total number of cores in your cluster (for example, three m4.large nodes at 2 vCPUs each suggests at least 6 tablets).  This can also be achieved by decreasing the table split threshold (`table.split.threshold` in the Accumulo shell, which defaults to 1 GB).  As for when to pre-split versus relying on Accumulo's natural splitting as the threshold is exceeded: pre-splits have a couple of advantages. Because the splits are randomized, they let a single user easily guarantee full cluster utilization even when querying a small spatial extent, and they let ingest immediately write to multiple nodes without waiting for the split threshold (although for best performance on a large-scale ingest, bulk/offline ingest is recommended).  For a multi-tenant environment in which concurrent users query different areas, or when the use case typically touches large extents of data that naturally span multiple nodes, natural splits preserve data locality best and should be more efficient.

Anyways, setting `--numPartitions` on your configured index is something worth playing with (you will need to re-ingest to use it), or you can just decrease the split threshold and compact your existing spatial index table through the Accumulo shell to force more splits and better cluster utilization.  My intuition is that you are mainly running into the rendering bottleneck, and using the rendering transformation in your SLD will be the most critical change.  Generally, that cluster is small (we typically use m4.xlarge instances for our small sandboxes), and if you want to see better performance, just scale it a bit larger (that is the entire point of the technology).  But in this case you should be able to play with your relatively tiny dataset and, with slightly different configuration, see very reasonable performance.
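For the shell route, the commands would look something like this (the table name is a placeholder for your spatial index table; `config -s` sets a per-table property and `compact -w` waits for the compaction to finish):

```
config -t <your_spatial_index_table> -s table.split.threshold=128M
compact -t <your_spatial_index_table> -w
```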

Good luck and let us know how it goes!

Rich


On Thu, Aug 11, 2016 at 4:41 AM, David McDonald <dmcdonald@xxxxxxxxxxxxxx> wrote:
Hi there,

I've been trialling GeoWave on EMR for a project we're working on (querying AIS vessel points), and I'm not seeing the performance I expected; I'm struggling to discover where I've gone wrong.

I have used the example SimpleIngestIndexWriter.java class to write a few million records into my cluster (3 x m4.large core nodes & 1 master), with new SpatialDimensionalityTypeProvider().createPrimaryIndex(), no secondary indexes, and I am using the GeoServer OpenLayers preview to pan around the map and view basic vessel location points.

When I have a few hundred thousand records, the map responds in less than 1 second, but as I reach 1 million records and above, response times increase to 40 seconds or more. I have noticed that all of my records are being written to one node in the cluster. However, even with a single node I was expecting better performance.

Is there some way that I can determine whether the spatial index is actually being used?

Should I be using a bigger server?

Also, is there some way to specify partitioning options when inserting records using an IndexWriter? I've noticed you can specify partitioning options when using the command-line ingest tools. Would this help for querying AIS vessel points?

Thanks for your time!

Dave

_______________________________________________
geowave-dev mailing list
geowave-dev@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geowave-dev




