For some reason, I am not seeing the latest threads from the group, so I will respond to your request here.
In fact, you have stumbled on two concerns:
(1) The code does not handle the empty case well. I created issue #513 and addressed it in branch GEOWAVE-513. You are welcome to opine on the fix, as I plan to have it reviewed and merged tomorrow. We are about to make another minor release (which includes a DBScan refactor).
(2) The batch ingest only updates statistics at the end. The DBScan refactor branch does include an adjustment to support periodic flushes. Fixed-width histograms show greater distortion with frequent 'merges' of independent records, so the intent was to minimize the number of 'writes' and 'merges' by flushing only at the end. I ran into the same problem you did, hence the fix. One question I have: the fix flushes after a fixed number of records, and the only control given to the developer is a system property that turns off flushing. It may make sense to also expose a system property that alters the flush rate (at the developer's own risk of distortion and performance degradation). Thoughts?
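To make the idea concrete, here is a minimal sketch of the kind of knob I am proposing. The property names and the StatisticsWriter interface are hypothetical placeholders, not the actual GeoWave API; it only illustrates a flush threshold read from a system property, with the existing disable switch left intact.

    // Hypothetical sketch of a configurable flush threshold. The property names and
    // the StatisticsWriter interface are placeholders, not the actual GeoWave API.
    public class PeriodicStatsFlusher {

        /** Placeholder for whatever actually persists the accumulated statistics. */
        public interface StatisticsWriter {
            void flush();
        }

        // The disable switch mirrors the existing "turn off flushing" property;
        // the rate property is the proposed addition (hypothetical names).
        private static final boolean FLUSH_DISABLED =
            Boolean.getBoolean("geowave.stats.flush.disable");
        private static final int FLUSH_EVERY =
            Integer.getInteger("geowave.stats.flush.rate", 10000);

        private final StatisticsWriter writer;
        private long recordsSinceFlush = 0;

        public PeriodicStatsFlusher(final StatisticsWriter writer) {
            this.writer = writer;
        }

        /** Call once per ingested record; flushes when the configured threshold is reached. */
        public void recordIngested() {
            if (FLUSH_DISABLED) {
                return;
            }
            if (++recordsSinceFlush >= FLUSH_EVERY) {
                writer.flush();
                recordsSinceFlush = 0;
            }
        }
    }

A developer who accepts the distortion risk could then set the rate lower to see statistics sooner, while everyone else keeps the current behavior.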
The DBScan refactor addresses some critical issues. There are bugs in the current code that lead to indeterminate results. Furthermore, performance is directly affected by the partitioning in the Mapper. Originally, I thought the best performance would come from a cell size equal to twice the maximum distance. For a heavy load of data distributed over a large map, a small cluster cannot handle that number of keys. Since the partition size is independently configurable, I can choose a large cell size to reduce the number of keys (or buy a bigger cluster). This increases the workload (and memory requirements) of the reducer. To compensate, the reducer performs a secondary partitioning with a cell size equal to twice the maximum distance. The reducer tosses cells that contain fewer than the minimum number of neighbors. I realize this may toss some critical geometries, but it reduces the overall workload considerably and only affects less dense areas. The reducer then pre-processes the data, looking for geometries with a large number of neighbors and compressing them into single convex polygons. I have found there is nothing more telling than processing large amounts of data on a small, under-nourished cluster.
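For what it's worth, here is a rough sketch of the reducer-side secondary partitioning described above: bucket points into cells whose size is twice the maximum distance, then discard cells that cannot meet the neighbor minimum. The types and names are illustrative only, not the actual GeoWave classes.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Rough sketch of the reducer-side secondary partitioning: cell size is twice
    // the maximum distance, and cells with fewer than the minimum number of
    // neighbors are tossed. Types and names are illustrative, not GeoWave classes.
    public class SecondaryPartitioner {

        public static final class Point {
            final double x;
            final double y;
            public Point(final double x, final double y) {
                this.x = x;
                this.y = y;
            }
        }

        private final double cellSize;   // 2 * maximum distance
        private final int minNeighbors;  // cells below this count are discarded

        public SecondaryPartitioner(final double maxDistance, final int minNeighbors) {
            this.cellSize = 2.0 * maxDistance;
            this.minNeighbors = minNeighbors;
        }

        /** Bucket points into grid cells, then drop cells that cannot meet the neighbor minimum. */
        public Map<String, List<Point>> partition(final List<Point> points) {
            final Map<String, List<Point>> cells = new HashMap<>();
            for (final Point p : points) {
                final long cx = (long) Math.floor(p.x / cellSize);
                final long cy = (long) Math.floor(p.y / cellSize);
                cells.computeIfAbsent(cx + ":" + cy, k -> new ArrayList<>()).add(p);
            }
            // Tossing sparse cells trades some accuracy in low-density areas
            // for a considerably smaller reducer workload.
            cells.values().removeIf(cell -> cell.size() < minNeighbors);
            return cells;
        }
    }

The trade-off is the one noted above: sparse cells are dropped outright, which only matters in low-density areas, in exchange for a much smaller workload on the reducer.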
Thanks for helping.