Re: [geomesa-users] Key/Index construction question.

From: Moises Baly <moises@xxxxxxxxxxxxx>
Date: Wed, 23 Sep 2015 15:19:29 -0400
Delivered-to: geomesa-users@xxxxxxxxxxxxxxxx
List-archive: <https://www.locationtech.org/mhonarc/lists/geomesa-users>

Hi Chris,

Thank you for your answer. We are looking into the performance of using secondary indexes, and in parallel we'll start looking into extending the grammar and the query planners. This is the first time we're looking through the query planner code (you're right, it's going to be challenging). Would it be possible to give us a high-level path for tackling the query planner modification?

Thank you again for your time,

Moises

On Wed, Sep 23, 2015 at 12:19 PM, Chris Eichelberger <cne1x@xxxxxxxx> wrote:

Moises,

Apologies in advance, but this turned out to be (another) longer
response than I had expected.

This note has two sections: 1) a direct response to the question about
the index-schema format; 2) suggestions for how you might avoid
changing the index-schema format at all, and use the existing indexing
mechanisms instead.

Part 1: Concerning the index-schema format

The best reference for the index-schema format syntax is the code itself
(fortunately, Scala makes this sort of DSL grammar mostly readable from
source):

https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/IndexSchema.scala#L146-L160

As you can see, the RowID and the ColF re-use the same syntactic
requirement ("keypart" inside the DSL), meaning that anything that is
valid in the RowID is -- so far as the parser is concerned -- valid in
the ColF. You are correct that the ID-substitution token is only valid
at the end of the ColQ; it is not allowed in either the RowID or ColF.
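
To make the placement rule concrete, here is a minimal sketch of a check for it. This is not the GeoMesa parser (the real grammar is in the IndexSchema.scala file linked above). The object name `SchemaIdCheck` and its methods are hypothetical, and the sketch assumes only what the schema strings quoted in this thread show: the three key parts are separated by a literal `::`, and the ID token is written `%#id`. Treating a schema with no ID token as acceptable is also an assumption.

```scala
object SchemaIdCheck {
  // Assumed spelling of the ID-substitution token, taken from the schemas in this thread.
  val IdToken = "%#id"

  /** Splits a schema string into its (RowID, ColF, ColQ) parts, or explains why it cannot. */
  def parts(schema: String): Either[String, (String, String, String)] =
    schema.split("::", -1) match {
      case Array(row, cf, cq) => Right((row, cf, cq))
      case other              => Left(s"expected 3 parts separated by '::', found ${other.length}")
    }

  /** True when the ID token is absent from RowID and ColF, and is either absent from
    * the ColQ or appears exactly once, as the last thing in the ColQ. */
  def idPlacementOk(schema: String): Boolean =
    parts(schema) match {
      case Right((row, cf, cq)) =>
        val firstInCq = cq.indexOf(IdToken)
        !row.contains(IdToken) &&
          !cf.contains(IdToken) &&
          (firstInCq < 0 || firstInCq == cq.length - IdToken.length)
      case Left(_) => false
    }
}
```

With this toy check, the schema from the original test (ID token at the end of the ColQ) passes, while the schema that puts `%#id` in the ColF is rejected, which mirrors the parser error described in the question.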

In order to do exactly what you described, there are multiple
considerations:

A. You would need to extend that grammar. This is not particularly
difficult.

B. You would need to extend the key-planners. This can be more
challenging, especially if this is the first time you've looked through
this code.

As you might guess, the aggregate recommendation from the GeoMesa team
would probably be, "It's probably easier to find another way to do what
you want." Fortunately, there may be just such a way...

Part 2: How to use the existing index structures

There are multiple query strategies. For example, there is one strategy
that is geo-time oriented, and one strategy that is (secondary)
attribute-oriented. The difference among these is which index (or
combination of indexes) they use in what order.

This is the way the Accumulo filters are applied, roughly in order (for
a geo-time strategy):

1. coarse geo-time filtering on the "_st_idx" (or "_z3") tables
2. fine-grained geo-time filtering
3. feature-based filtering (using ECQL expressions)

This suggests that, if your geo-time constraints are highly selective,
then storing the filter attribute inside your simple feature may be
adequate to get the performance you're looking for, because that
filtering happens only on the entries with qualifying geo-time data, and
is distributed (uniformly) across tablet servers.
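
The ordering of the three stages can be sketched on an in-memory collection. This is only an illustration of why a selective geo-time constraint keeps the attribute test cheap; it does not use Accumulo iterators or any GeoMesa API. `Entry`, `Query`, `run`, and the cell strings are hypothetical, and the coarse stage is simplified to a prefix test on a cell key.

```scala
object FilterStages {
  // A stored entry: a cell key (standing in for the index key) plus the feature's own fields.
  final case class Entry(id: String, cell: String, lon: Double, lat: Double, time: Long, kind: String)

  final case class Query(cellPrefix: String,
                         minLon: Double, maxLon: Double,
                         minLat: Double, maxLat: Double,
                         start: Long, end: Long,
                         kind: String)

  /** Applies the filters in the order listed above; each stage only sees the survivors
    * of the previous one. */
  def run(entries: Seq[Entry], q: Query): Seq[Entry] =
    entries
      .filter(_.cell.startsWith(q.cellPrefix))                 // 1. coarse: cheap test on the key
      .filter(e => e.lon >= q.minLon && e.lon <= q.maxLon &&
                   e.lat >= q.minLat && e.lat <= q.maxLat &&
                   e.time >= q.start && e.time <= q.end)       // 2. fine-grained geo-time test
      .filter(_.kind == q.kind)                                // 3. attribute test, evaluated last
}
```

If stage 1 and stage 2 discard most entries, stage 3 runs on very few of them, which is the case where keeping the filter attribute inside the simple feature may be good enough.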

If the geo-time constraints are not very selective, but your attribute
constraints are highly selective, then using an attribute strategy in
your query will invert that filter order, essentially performing the
attribute-selection first, and then filtering down to the geo-time
constraints.

If neither the geo-time nor the attribute constraints are singly
selective, then you may be able to get some lift out of creating a
synthetic field that is jointly selective, and then use a (secondary
attribute) index on that value.
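
The three cases above amount to a small decision rule, sketched here under stated assumptions. Nothing in this block is GeoMesa API: `StrategyChoice`, the `selectiveBelow` threshold of 1%, and the way `syntheticKey` joins a Geohash prefix to an attribute value are all hypothetical choices made for illustration. The "fractions" are rough estimates of how much of the data a constraint lets through (smaller means more selective).

```scala
object StrategyChoice {
  sealed trait Suggestion
  case object GeoTimeFirst extends Suggestion        // geo-time index, attribute filtered afterwards
  case object AttributeFirst extends Suggestion      // attribute index, geo-time filtered afterwards
  case object SyntheticJointField extends Suggestion // index a field that combines both

  /** Picks a suggestion from estimated pass-through fractions (0.0 to 1.0). */
  def suggest(geoTimeFraction: Double,
              attributeFraction: Double,
              selectiveBelow: Double = 0.01): Suggestion =
    if (geoTimeFraction <= selectiveBelow) GeoTimeFirst
    else if (attributeFraction <= selectiveBelow) AttributeFirst
    else SyntheticJointField

  /** One possible synthetic field: a coarse Geohash prefix joined to the attribute value,
    * so that an equality match on the field constrains both at once. */
  def syntheticKey(geohash: String, attribute: String, prefixChars: Int = 4): String =
    s"${geohash.take(prefixChars)}_$attribute"
}
```

A value such as `syntheticKey("dqb0muw", "HOUSE")` would be stored as an ordinary string attribute on the feature and given a secondary attribute index; how coarse the prefix should be depends on the data and would need to be tuned.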

I hope that helps. If not, please just let us know.

Thanks!

Sincerely,
-- Chris

On Wed, 2015-09-23 at 10:45 -0400, Moises Baly wrote:
> Hi there:
>
> On the same subject of keys, I have a couple of questions about
> building them:
>
> 1- I only have one way to store non-constant "strings" within the key
> - using the #id - correct? For example, I have a point and want to
> store something of the sort -> gh :: some_string_ie_HOUSE :: #cstr,
> changing that string on insertion into Accumulo. The way I would do
> this would be with a schema such as
> "%~#s%99#r%0,11#gh::%~#s%#id::%~#s%TEST#cstr". However, this gives me
> a parser error, I think because there is a restriction on the id()
> position - it has to be at the end.
>
> The idea is that I want to be able to filter first by location (gh),
> then by a particular string in the column family.
>
> 2- When building the key schema, '%#i' allows you to index what comes
> after, right?
>
> Thanks for your time,
>
> Moises
>
> On Fri, Sep 18, 2015 at 3:29 PM, Moises Baly <moises@xxxxxxxxxxxxx> wrote:
> Perfect.
>
> Thank you again for your answers, we are looking forward to going
> into production with GM.
>
> Kind regards,
>
> Moises
>
> On Fri, Sep 18, 2015 at 3:22 PM, Chris Eichelberger <cne1x@xxxxxxxx> wrote:
> Moises,
>
> These are reasonable questions. I'll re-use your numbering.
>
> 1. We right-pad lower-precision (larger) Geohashes with periods, so a
> 10-bit Geohash for Charlottesville might be "dq....." when padded to
> 35 bits. This becomes a minor bit of hassle for the query planner,
> which has to accommodate the (possible) presence of these characters
> in addition to valid Geohash characters, but it's not too bad.
>
> 2. You are correct that each index key encodes a disjoint subset of
> the entire geometry's covering. Fortunately, the entire geometry is
> stored elsewhere in the value of the Accumulo entry, so no
> reconstruction is required on the client side.
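
The padding described in point 1 can be sketched in a few lines. This is an illustration, not GeoMesa code: `GeohashPadding` and `padTo` are hypothetical names, and the only facts relied on are that a base-32 Geohash character carries 5 bits and that the pad character is a period.

```scala
object GeohashPadding {
  // Each base-32 Geohash character encodes 5 bits.
  val BitsPerChar = 5

  /** Right-pads a Geohash string with the pad character up to the length implied by targetBits. */
  def padTo(geohash: String, targetBits: Int, pad: Char = '.'): String = {
    require(targetBits % BitsPerChar == 0, "targetBits must be a multiple of 5")
    val targetChars = targetBits / BitsPerChar
    require(geohash.length <= targetChars, "geohash is already longer than the target")
    geohash + (pad.toString * (targetChars - geohash.length))
  }
}
```

Under these assumptions a 10-bit (two-character) Geohash padded to 35 bits (seven characters) gains five periods, so every row key has the same width regardless of the cell's resolution.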
>
> Sincerely,
> -- Chris
>
>
>
> On Fri, 2015-09-18 at 15:14 -0400, Moises Baly wrote:
> > This is an amazing explanation!! Thank you very much for taking the
> > time to be so clear.
> >
> > Two additional questions:
> > 1- If we are deconstructing non-point geometries into geohashes of
> > different precisions, and, say, I specified my key schema as being
> > "%~#s%foo#cstr%0,7#gh%99#r::_::_" (don't mind cf and cq, just an
> > example) - in which I want to have a length-7 geohash in the row id
> > - how do you fit the different precisions you obtain into my 7
> > specification? Or am I not making sense here?
> >
> > 2- In the index schema builder, the index or data flag (%#i) builds
> > an "index" over a particular portion of the entire key?
> >
> > @Emilio: so if I understood you correctly you have 6 "entire" rows,
> > but if you look at the cf or cq portions you might have many more
> > distinct values, correct?
> >
> > For example, I store a polygon, and then I want to retrieve that
> > particular polygon. How do you go about putting it together again?
> > It has to depend on some sort of identifier, no?
> >
> > Thank you both again for your time,
> >
> > Moises
> >
> > On Fri, Sep 18, 2015 at 2:47 PM, Chris Eichelberger <cne1x@xxxxxxxx> wrote:
> > Moises,
> >
> > Good question! The good news is that there is nothing special about
> > how the keys are being constructed; the interesting part is in how
> > GeoMesa decides which keys should be constructed...
> >
> > (Apologies in advance if, in the course of lecturing, I tell you
> > things you already know.)
> >
> > The first point to remember is that each Geohash index-entry
> > represents a cell. For 35-bit Geohashes, each cell is no more than
> > ~150 meters square. A 0-bit (degenerate) Geohash is the entire
> > surface of the (flat) Earth. Each bit of precision you add to a
> > Geohash halves exactly one of its dimensions (when zero-based, even
> > bits halve longitude; odd bits halve latitude).
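
The bit-by-bit halving can be written out directly. The following is a sketch of the standard Geohash construction, not GeoMesa's own implementation; `GeohashBits`, `Cell`, `encode`, and `toBase32` are names made up for this example. The comparison `>=` puts a point that sits exactly on a midpoint into the upper half, which matches cells that include their minimum edge and exclude their maximum edge.

```scala
object GeohashBits {
  final case class Cell(minLon: Double, maxLon: Double, minLat: Double, maxLat: Double)

  // The 0-bit (degenerate) Geohash: the whole flat Earth.
  val World: Cell = Cell(-180.0, 180.0, -90.0, 90.0)

  // Standard Geohash base-32 alphabet (no a, i, l, o).
  val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** Narrows the world cell one bit at a time; returns the bit string and the final cell. */
  def encode(lon: Double, lat: Double, bits: Int): (String, Cell) = {
    val sb = new StringBuilder
    var cell = World
    for (i <- 0 until bits) {
      if (i % 2 == 0) { // even bit: halve longitude
        val mid = (cell.minLon + cell.maxLon) / 2
        if (lon >= mid) { sb.append('1'); cell = cell.copy(minLon = mid) }
        else { sb.append('0'); cell = cell.copy(maxLon = mid) }
      } else { // odd bit: halve latitude
        val mid = (cell.minLat + cell.maxLat) / 2
        if (lat >= mid) { sb.append('1'); cell = cell.copy(minLat = mid) }
        else { sb.append('0'); cell = cell.copy(maxLat = mid) }
      }
    }
    (sb.toString, cell)
  }

  /** Renders a bit string whose length is a multiple of 5 as base-32 Geohash characters. */
  def toBase32(bits: String): String = {
    require(bits.length % 5 == 0, "need a whole number of 5-bit groups")
    bits.grouped(5).map(g => Base32(Integer.parseInt(g, 2))).mkString
  }
}
```

For a point near Charlottesville (longitude -78.5, latitude 38.03), ten bits come out as `0110010110`, which renders as `dq`. At 35 bits there are 18 longitude halvings and 17 latitude halvings, so the cell is 360/2^18 degrees wide and 180/2^17 degrees tall.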
> >
> > Whenever you are indexing data that contain only single-point
> > geometries, there will be one index-key per record, because every
> > point will fall inside exactly one Geohash cell. (Each Geohash cell
> > in GeoMesa includes its minimum X and minimum Y values, but excludes
> > its maximum X and maximum Y extents.)
> >
> > Whenever you are indexing non-point geometries -- line strings;
> > polygons; etc. -- you have a problem: How do you create a single
> > index-entry for a geometry that can cross multiple cell boundaries?
> > If you only index the vertices, you lose information about the fact
> > that the geometry covers the space between them. There are
> > typically two approaches to solving this problem:
> >
> > 1. You can encode a single entry that represents the
> > minimum-bounding cell description that contains your geometry; or
> >
> > 2. you can decompose your geometry into covering cells, at
> > potentially heterogeneous resolutions (different sizes), and index
> > each of those separately (and then de-duplicate results at query
> > time so that each feature appears no more than once in any given
> > results set).
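
Approach 2 can be sketched as a recursive subdivision. This is a simplified stand-in, not the decomposition GeoMesa performs: the geometry is assumed to be an axis-aligned rectangle (`Box`) so that the intersection tests stay trivial, cells are identified by their bit-string prefix rather than a base-32 Geohash, and `CoveringSketch`, `cover`, `indexEntries`, and `dedupe` are hypothetical names. A cell that lies wholly inside the geometry is emitted as-is, a cell that straddles its edge is split further, and recursion stops at `maxBits`, which is what produces cells of different sizes.

```scala
object CoveringSketch {
  final case class Box(minX: Double, maxX: Double, minY: Double, maxY: Double) {
    // Interiors overlap (boxes that merely touch along an edge do not count).
    def intersects(o: Box): Boolean =
      minX < o.maxX && o.minX < maxX && minY < o.maxY && o.minY < maxY
    def contains(o: Box): Boolean =
      minX <= o.minX && o.maxX <= maxX && minY <= o.minY && o.maxY <= maxY
  }

  /** Bit-string prefixes of the cells covering geom; only straddling cells are refined. */
  def cover(geom: Box,
            maxBits: Int,
            cell: Box = Box(-180.0, 180.0, -90.0, 90.0),
            prefix: String = ""): List[String] =
    if (!cell.intersects(geom)) Nil
    else if (geom.contains(cell) || prefix.length >= maxBits) List(prefix)
    else {
      val (lo, hi) =
        if (prefix.length % 2 == 0) { // even bit: split longitude
          val mid = (cell.minX + cell.maxX) / 2
          (cell.copy(maxX = mid), cell.copy(minX = mid))
        } else { // odd bit: split latitude
          val mid = (cell.minY + cell.maxY) / 2
          (cell.copy(maxY = mid), cell.copy(minY = mid))
        }
      cover(geom, maxBits, lo, prefix + "0") ++ cover(geom, maxBits, hi, prefix + "1")
    }

  /** One index entry per covering cell, every entry pointing at the same feature ID. */
  def indexEntries(featureId: String, geom: Box, maxBits: Int): List[(String, String)] =
    cover(geom, maxBits).map(cellKey => (cellKey, featureId))

  /** Query-time de-duplication: each feature ID is reported at most once. */
  def dedupe(hits: Seq[(String, String)]): Seq[String] =
    hits.map(_._2).distinct
}
```

A small rectangle centered on longitude 0, latitude 0 and limited to two bits yields four entries (one per quadrant); a query that hits all four still reports the feature once after `dedupe`. The cap on `maxBits` plays the role of limiting how many covering cells a single geometry may produce.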
> >
> > GeoMesa takes approach #2 (for now; we're experimenting with other
> > ways to do this). This is how the polygon you quote, with a large
> > number of points, can be decomposed into just a few covering cells;
> > each of those covering cells receives its own index key. I've
> > attached an image to this email that shows how a polygon and a
> > line-string can be decomposed. In practice, we do not allow
> > non-point geometries to be decomposed into so many covering
> > Geohashes. Here is the reference to the code in GeoMesa where this
> > decomposition is called:
> >
> > https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/STIndexEntry.scala#L49
> >
> > Please note that, with the advent of the new Z3 index, we will be
> > revisiting this scheme. The Z3 index is much faster than the old
> > Geohash-based index, but does not yet support non-point geometries,
> > so it's a great opportunity for us to improve that feature.
> >
> > I hope this addressed some of your questions; if not, or if you
> > think of new ones, please just let us know.
> >
> > Thanks!
> >
> > Sincerely,
> > -- Chris
> >
> > On Fri, 2015-09-18 at 14:14 -0400, Moises Baly wrote:
> > > Hi there:
> > >
> > > I've come across some tests in the project in my quest to
> > > understand how indexes work and how the index is partitioned in
> > > Accumulo's Key (what goes where, and how it is constructed).
> > >
> > > val dummyType = SimpleFeatureTypes.createType("DummyType",s"foo:String,bar:Geometry,baz:Date,$DEFAULT_GEOMETRY_PROPERTY_NAME:Geometry,$DEFAULT_DTG_PROPERTY_NAME:Date,$DEFAULT_DTG_END_PROPERTY_NAME:Date")
> > > val customType = SimpleFeatureTypes.createType("DummyType",s"foo:String,bar:Geometry,baz:Date,*the_geom:Geometry,dt_start:Date,$DEFAULT_DTG_END_PROPERTY_NAME:Date")
> > > customType.setDtgField("dt_start")
> > > val dummyEncoder = SimpleFeatureSerializers(dummyType, SerializationType.AVRO)
> > > val customEncoder = SimpleFeatureSerializers(customType, SerializationType.AVRO)
> > > val dummyIndexValueEncoder = IndexValueEncoder(dummyType)
> > > val geometryFactory = new GeometryFactory(new PrecisionModel, 4326)
> > > val now = new DateTime().toDate
> > >
> > > val Apr_23_2001 = new DateTime(2001, 4, 23, 12, 5, 0, DateTimeZone.forID("UTC")).toDate
> > >
> > > val schemaEncoding = "%~#s%feature#cstr%99#r::%~#s%0,4#gh::%~#s%4,3#gh%#id"
> > >
> > > val index = IndexSchema.buildKeyEncoder(dummyType, schemaEncoding)
> > > val line : Geometry = WKTUtils.read("LINESTRING(-78.5000092574703 38.0272986617359,-78.5000196719491 38.0272519798381,-78.5000300864205 38.0272190279085,-78.5000370293904 38.0271853867342,-78.5000439723542 38.027151748305,-78.5000509153117 38.027118112621,-78.5000578582629 38.0270844741902,-78.5000648011924 38.0270329867966,-78.5000648011781 38.0270165108316,-78.5000682379314 38.026999348366,-78.5000752155953 38.026982185898,-78.5000786870602 38.0269657099304,-78.5000856300045 38.0269492339602,-78.5000891014656 38.0269327579921,-78.5000960444045 38.0269162820211,-78.5001064588197 38.0269004925451,-78.5001134017528 38.0268847030715,-78.50012381616 38.0268689135938,-78.5001307590877 38.0268538106175,-78.5001411734882 38.0268387076367,-78.5001550593595 38.0268236046505,-78.5001654737524 38.0268091881659,-78.5001758881429 38.0267954581791,-78.5001897740009 38.0267810416871,-78.50059593303 38.0263663951609,-78.5007972751677 38.0261625038609)")
> > > val item = AvroSimpleFeatureFactory.buildAvroFeature(dummyType, List("TEST_LINE", line, now, line, now, now), "TEST_LINE")
> > > val toWrite = new FeatureToWrite(item, "", dummyEncoder, dummyIndexValueEncoder)
> > > val indexEntries = index.encode(toWrite).toList
> > > indexEntries.size must equalTo(1)
> > > indexEntries.head.size() mustEqual(6)
> > > val cf = new Text(indexEntries.head.getUpdates.get(0).getColumnFamily)
> > > val cq = new Text(indexEntries.head.getUpdates.get(0).getColumnQualifier)
> > > val keyStr = cf + "::" + cq
> > >
> > >
> > > How do all those points in the LineString translate to encoding
> > > only 6 rows in Accumulo? As far as I understand, the Key definition
> > > (string :: gh :: gh + ID) should encode a single point, correct?
> > > What am I missing in the process here?
> > >
> > > If somebody could walk me through this example, with special
> > > attention to how the key is being constructed, it would be very
> > > much appreciated.
> > >
> > > Thank you for your time
> > >
> > > Moises
> > >
> >
> > > _______________________________________________
> > > geomesa-users mailing list
> > > geomesa-users@xxxxxxxxxxxxxxxx
> > > To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> > > http://www.locationtech.org/mailman/listinfo/geomesa-users

_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
http://www.locationtech.org/mailman/listinfo/geomesa-users
