Re: [geomesa-users] Schema of st_idx table

Vaibhav,

A few more-or-less random thoughts about what you've written...

1.  Prepending the ID to the spatio-temporal index table's key may not
do what you seem to have in mind.  For instance, it looks as if you have
preserved the shard number after the ID, which was meant to distribute
data uniformly across tablet servers.  (In fact, the shard number
that is assigned by GeoMesa to a record is based on a hash of the
feature ID, ensuring that we always know in which partition to find a
feature based on its ID.)  Adding the ID itself in front means that the
data are no longer uniformly sharded across tablet servers.  Further, it
means that the query planners can only work from the right-hand side of
the key (unless you know, a priori, which keys you need, in which case
you don't need a spatio-temporal index at all); this means resorting to
regex-type filters that have to visit most (if not all) records in the
entire table, which may represent a significant loss in query
efficiency.
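
To make the first point concrete, here is a minimal Python sketch. It is not GeoMesa code: the key layouts, the shard count, the hash function, and the sample features are all assumptions chosen for illustration. It only shows that a geohash prefix maps to one contiguous range per shard when the shard leads the key, and to no usable range when an ID leads the key.

```python
import zlib
from bisect import bisect_left, bisect_right

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_of(feature_id: str) -> int:
    # Hypothetical stand-in for a hash-of-ID shard assignment.
    return zlib.crc32(feature_id.encode("utf-8")) % NUM_SHARDS


def shard_first_key(fid: str, geohash: str, dtg: str) -> str:
    # Assumed layout: shard leads, so a geohash prefix is scannable per shard.
    return f"{shard_of(fid)}~{geohash}~{dtg}~{fid}"


def id_first_key(fid: str, geohash: str, dtg: str) -> str:
    # Assumed layout with the ID in front: rows now sort by ID.
    return f"{fid}~{shard_of(fid)}~{geohash}~{dtg}"


def prefix_scan(sorted_keys, prefix):
    # Return the contiguous slice of keys that start with `prefix`.
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


# Made-up sample features: (id, geohash, date-hour).
features = [
    ("a1", "sx9", "2014021818"),
    ("b2", "dqc", "2014021818"),
    ("c3", "sx9", "2014021900"),
    ("d4", "dqb", "2014022000"),
]

shard_keys = sorted(shard_first_key(*f) for f in features)
id_keys = sorted(id_first_key(*f) for f in features)

# Shard-first: one short range scan per shard finds every "sx9" row.
hits = [k for s in range(NUM_SHARDS)
        for k in prefix_scan(shard_keys, f"{s}~sx9~")]

# ID-first: no prefix can be formed from the geohash alone, so the only
# option is to visit every row and filter on the right-hand side.
no_prefix_hits = prefix_scan(id_keys, "sx9")
full_scan_hits = [k for k in id_keys if "~sx9~" in k]
```

In this toy setup the shard-first layout touches only the matching rows, while the ID-first layout has to inspect all of `id_keys` to find the same two features.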

2.  Storing the ID (instead of the record) in the value field means
doing two-phase queries, one in which you find the list of IDs you need,
and a second in which you scan an IDs-ordered table to return the full
records.  We have used this sort of index before in GeoMesa -- it's what
we call the "join" index type for secondary attributes -- but this turns
out to work very well *only* when there is a single query running at any
given time.  If you haven't already read it, you might enjoy this paper,
particularly page 5 that addresses this exact scenario, albeit not in
the context of GeoMesa:

  http://ieee-hpec.org/2013/index_htm_files/28-2868615.pdf
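
The difference between the two index styles can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the dictionaries and lists stand in for Accumulo tables, and the record contents and function names are invented for illustration.

```python
# Toy stand-in for an ID-ordered records table (assumed data).
records = {
    "f1": {"name": "alice", "geohash": "sx9"},
    "f2": {"name": "bob", "geohash": "dqc"},
    "f3": {"name": "alice", "geohash": "dqb"},
}

# "join"-style index: attribute value -> feature ID only.
join_index = sorted((rec["name"], fid) for fid, rec in records.items())

# "full"-style index: the whole record is stored alongside the index entry.
full_index = sorted(
    ((rec["name"], fid, rec) for fid, rec in records.items()),
    key=lambda entry: entry[:2],
)


def query_join(name):
    # Phase 1: scan the index to collect matching IDs.
    ids = [fid for attr, fid in join_index if attr == name]
    # Phase 2: one extra lookup in the records table per matching ID.
    results, lookups = [], 0
    for fid in ids:
        results.append(records[fid])
        lookups += 1
    return results, lookups


def query_full(name):
    # Single phase: the index entry already carries the record.
    return [rec for attr, _fid, rec in full_index if attr == name], 0


join_results, join_lookups = query_join("alice")
full_results, full_lookups = query_full("alice")
```

Both functions return the same records; the join variant pays one additional lookup per hit, and those second-phase lookups are the part that has to share the records table with every other query running at the same time.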

3.  GeoMesa queries occur in two places:  There is significant rewriting
and planning that occurs on the client side, and then the iterators do
the sifting and sorting on the Accumulo tablet-server side.

3.A.  Query planners:  On the client-side, we try to take care to plan
queries so that various kinds of requests -- feature selection; requests
for aggregate densities; non-point geometries; etc. -- can run well.  To
do this, GeoMesa has to consider rewriting queries (to handle
complicated filters), parse optional query commands/hints to identify
the specific task, map query-satisfaction strategies to the specific
request, and enumerate scan ranges that are relevant to the iterator
initialization (if appropriate).  Here is one file that sits near the
center of much of this work:


https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/QueryPlanner.scala
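
As a rough illustration of what "enumerating scan ranges" means, here is a small Python sketch. It is not taken from QueryPlanner.scala: the row layout (shard~featureType~geohash~yyyyMMddHH), the function name, and the parameters are all hypothetical, and the real planner does considerably more work.

```python
from datetime import datetime
from itertools import product


def scan_ranges(num_shards, feature_type, geohashes, start, end):
    """Build one (start_row, end_row) pair per shard and geohash prefix.

    Assumes a hypothetical row layout of
    shard~featureType~geohash~yyyyMMddHH.
    """
    if start > end:
        return []
    lo = start.strftime("%Y%m%d%H")
    hi = end.strftime("%Y%m%d%H")
    ranges = []
    for shard, gh in product(range(num_shards), geohashes):
        prefix = f"{shard}~{feature_type}~{gh}~"
        # "\uffff" makes the end row sort after every key in the last hour.
        ranges.append((prefix + lo, prefix + hi + "\uffff"))
    return ranges


ranges = scan_ranges(
    num_shards=2,
    feature_type="featureNewsMediaV1_2",
    geohashes=["sx9", "sxd"],
    start=datetime(2014, 2, 18, 18),
    end=datetime(2014, 2, 18, 20),
)
```

Each shard must be scanned because a feature's shard depends on its ID, not on its location or time, so the number of ranges grows with the shard count times the number of geohash prefixes covering the query area.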

3.B.  Accumulo iterators:  As you can see from the following link, there
are quite a few separate iterators that the query-planners can use to
execute various search strategies:


https://github.com/locationtech/geomesa/tree/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/iterators

Most specifically, you may be interested in the
SpatioTemporalIntersectingIterator.scala that is responsible for the
(deprecated) "_st_idx" table reads.  Of course, it extends base classes,
but it's a place to start reading.

The bottom line, as you already know, is that indexing and querying are
tricky.  To get reasonable performance across all queries may well be
impossible, so a good index requires knowing as much as you can about
your data and query profiles.  I'm not sure where you are in your
literature review for this, but you may want to read the D4M paper (if
you haven't already!) for some interesting pointers on high-level
indexing concerns:

  http://www.mit.edu/~kepner/D4M/

All the best!

Sincerely,
  -- Chris


On Wed, 2015-09-16 at 10:51 +0530, vaibhav.thapliyal wrote:
> Dear Chris,
> 
> Thanks for explaining the structure of the st_idx table. I was able to 
> index the lat-lon and temporal data in the row. However I have a slight 
> modification as compared to the geomesa schema. I have prepended an "id" 
> field in front of the existing row content. So now I have the row like this:
> 
> ~48253cb63641fa0d31faa52c~0~0~featureNewsMediaV1_2~sx9~2014021818
> 
> I also used Kryo serializer library to create an encoded form of the id 
> which I am storing in the value field.
> So my value is like this.
> 
> \x0148253cb63641fa0d31faa52\xE3\x00\x00\x00\x00\x00
> 
> which is different from the way value is encoded in the original geomesa 
> tables.
> 
> So it would be great if you could point out somewhere in the code as to 
> how value is encoded and how the geomesa iterators perform query on this 
> schema.
> 
> Thanks
> Vaibhav
> 
> On 09/10/2015 05:56 PM, Chris Eichelberger wrote:
> > Vaibhav,
> >
> > For what it's worth, the "st_idx" table is about to be deprecated in
> > favor of a new Z3 structure.  We will shortly have documentation up
> > describing how the new structure works.  (The short version is that it's
> > a concatenation of the most-significant bits of T and a Z-order
> > combination of X, Y, and the low-order bits of T.)
> >
> > In the "st_idx" table, the value is typically a Kryo-encoded version of
> > the entire Feature, which allows us to do fine-grained filtering within
> > the Accumulo iterators beyond the spatio-temporal constraints.
> >
> > Secondary indexes, those not involving (X, Y, T), are created by
> > concatenating the feature-type name with the attribute name and the
> > value within the "attr_idx" table.  The value in this table will be
> > either a Kryo-encoded version of the Feature or the record ID (depending
> > on the index type requested in the SimpleFeature specification, "full"
> > or "join", respectively.)  This way, if we know that an attribute's
> > value has high selectivity (high cardinality), we can seek directly to
> > the part of the "attr_idx" table where those records reside.  This
> > structure also enables range queries and right-hand wildcards on
> > attribute values.
> >
> > I hope this helps.  If you have any additional questions, please just
> > let us know.
> >
> > Thanks!
> >
> > Sincerely,
> >    -- Chris
> >
> >
> > On Thu, 2015-09-10 at 17:45 +0530, vaibhav.thapliyal wrote:
> >> Hello everyone,
> >>
> >> I am trying to use geomesa for querying geo-spatial data. While doing
> >> so, I came across the st_idx table which is used to index
> >> spatio-temporal data. On reading the research paper I found out the way
> >> indexing is being done, and how data is mapped on Row, colf and
> >> col-qualifier, using geo-hashed value of the lat-lon, i.e. how the key is
> >> being generated. What I am failing to understand is what is contained
> >> inside the Value field of Accumulo.
> >>
> >> Also if someone could explain how non-spatio-temporal data (e.g.
> >> Customer Name) is stored/indexed in geomesa, it would be very helpful.
> >>
> >> Thanks
> >> Vaibhav
> >> _______________________________________________
> >> geomesa-users mailing list
> >> geomesa-users@xxxxxxxxxxxxxxxx
> >> To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> >> http://www.locationtech.org/mailman/listinfo/geomesa-users