Re: [geomesa-users] Schema of st_idx table

Vaibhav,

You shouldn't have to write new code to do this.  If you use the
GeoTools API for GeoMesa, all of this should be handled
behind-the-scenes for you.  For example, the SimpleFeatureTypes utility
class will accept some of the indexing options for you when you create
your type; here's an example of using this class to establish a
"full"-type index encoding:


https://github.com/locationtech/geomesa/blob/master/geomesa-utils/src/test/scala/org/locationtech/geomesa/utils/geotools/SimpleFeatureTypesTest.scala#L246

Kryo encoding of features is now the default in GeoMesa.

If you have any additional questions, please just let us know.

Thanks!

Sincerely,
  -- Chris



On Mon, 2015-09-21 at 14:29 +0530, vaibhav.thapliyal wrote:
> Thank you, Chris, for the very clear explanation of how GeoMesa indexes
> and queries data.
> 
> Given the disadvantage you mentioned of the join index type, I am now
> trying to store the whole record in the Value field. Can you point me to
> the class in the GeoMesa code that encodes and inserts these records into
> the Value field, and to where the Kryo serialization is done?
> 
> Thanks
> Vaibhav
> 
> On 09/16/2015 05:11 PM, Chris Eichelberger wrote:
> > Vaibhav,
> >
> > A few more-or-less random thoughts about what you've written...
> >
> > 1.  Pre-pending the ID to the spatio-temporal index table's key may not
> > do what you seem to have in mind.  For instance, it looks as if you have
> > preserved the shard number after the ID; the shard number is meant to
> > distribute data uniformly across tablet servers.  (In fact, the shard number
> > that is assigned by GeoMesa to a record is based on a hash of the
> > feature ID, ensuring that we always know in which partition to find a
> > feature based on its ID.)  Adding the ID itself in front means that the
> > data are no longer uniformly sharded across tablet-servers.  Further, it
> > means that the query planners can only work from the right-hand side of
> > the key (unless you know, a priori, which keys you need, in which case
> > you don't need a spatio-temporal index at all); this means resorting to
> > regex-type filters that have to visit most (if not all) records in the
> > entire table, which may represent a significant loss in query
> > efficiency.
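
The effect described in point 1 can be illustrated with a small sketch. It is not GeoMesa code: the shard count, the CRC32 hash, the `~`-separated key layout, and the sample features are all assumptions made for the example. It only shows why a shard-led key allows short range scans while an ID-led key forces a filter over every row.

```python
import zlib
from bisect import bisect_left

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(feature_id, num_shards=NUM_SHARDS):
    """Derive a shard deterministically from a hash of the feature ID.

    CRC32 is a stand-in; the hash GeoMesa actually uses is not stated here.
    """
    return zlib.crc32(feature_id.encode("utf-8")) % num_shards


def shard_led_row(feature_id, type_name, geohash, date_hour):
    """Row key that leads with the shard, then type, geohash, and time."""
    return f"{shard_for(feature_id)}~{type_name}~{geohash}~{date_hour}"


def id_led_row(feature_id, type_name, geohash, date_hour):
    """The same key with the feature ID prepended, as in the modified schema."""
    return f"{feature_id}~{shard_led_row(feature_id, type_name, geohash, date_hour)}"


def prefix_scan(sorted_rows, prefix):
    """Return the contiguous slice of sorted rows that start with prefix."""
    start = bisect_left(sorted_rows, prefix)
    end = bisect_left(sorted_rows, prefix + "\uffff")
    return sorted_rows[start:end]


# Twenty made-up features; every fifth one falls in geohash cell "sx9".
features = [(f"id{i:03d}", "sx9" if i % 5 == 0 else "dr5", "2014021818")
            for i in range(20)]
shard_led = sorted(shard_led_row(fid, "news", gh, hr) for fid, gh, hr in features)
id_led = sorted(id_led_row(fid, "news", gh, hr) for fid, gh, hr in features)

# Shard-led keys: one short range scan per shard finds the "sx9" rows.
hits = [row for shard in range(NUM_SHARDS)
        for row in prefix_scan(shard_led, f"{shard}~news~sx9~")]

# ID-led keys: no usable prefix, so every row must be visited and filtered.
visited = len(id_led)
filtered = [row for row in id_led if "~sx9~" in row]
```

Both approaches return the same four features, but the ID-led layout has to touch all twenty rows to find them.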
> >
> > 2.  Storing the ID (instead of the record) in the value field means
> > doing two-phase queries, one in which you find the list of IDs you need,
> > and a second in which you scan an IDs-ordered table to return the full
> > records.  We have used this sort of index before in GeoMesa -- it's what
> > we call the "join" index type for secondary attributes -- but this turns
> > out to work very well *only* when there is a single query running at any
> > given time.  If you haven't already read it, you might enjoy this paper,
> > particularly page 5 that addresses this exact scenario, albeit not in
> > the context of GeoMesa:
> >
> >    http://ieee-hpec.org/2013/index_htm_files/28-2868615.pdf
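
The two-phase pattern from point 2 can be sketched as follows. Plain dictionaries stand in for the Accumulo tables, and the records and lookup counters are invented for the illustration. The point is only that a "join"-style index pays one extra lookup per matching ID, while a "full"-style index answers from the index entry alone.

```python
# ID-ordered record table: feature ID -> full record (made-up data).
records = {
    "f1": {"name": "Smith", "city": "Pune"},
    "f2": {"name": "Jones", "city": "Delhi"},
    "f3": {"name": "Smith", "city": "Mumbai"},
}

join_index = {}  # "join" style: attribute value -> feature IDs only
full_index = {}  # "full" style: attribute value -> complete records
for fid, rec in records.items():
    join_index.setdefault(rec["name"], []).append(fid)
    full_index.setdefault(rec["name"], []).append(rec)


def query_join(name):
    """Two phases: find the IDs, then fetch each record from the ID table."""
    ids = join_index.get(name, [])             # phase 1: index lookup
    found = [records[fid] for fid in ids]      # phase 2: one fetch per ID
    return found, 1 + len(ids)                 # results, number of lookups


def query_full(name):
    """One phase: the index entry already carries the whole record."""
    return full_index.get(name, []), 1         # results, number of lookups


join_result, join_lookups = query_join("Smith")
full_result, full_lookups = query_full("Smith")
```

The results are identical; the difference is the number of lookups, which grows with the number of matches in the join case.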
> >
> > 3.  GeoMesa queries occur in two places:  There is significant rewriting
> > and planning that occurs on the client side, and then the iterators do
> > the sifting and sorting on the Accumulo tablet-server side.
> >
> > 3.A.  Query planners:  On the client-side, we try to take care to plan
> > queries so that various kinds of requests -- feature selection; requests
> > for aggregate densities; non-point geometries; etc. -- can run well.  To
> > do this, GeoMesa has to consider rewriting queries (to handle
> > complicated filters), parse optional query commands/hints to identify
> > the specific task, map query-satisfaction strategies to the specific
> > request, and enumerate scan ranges that are relevant to the iterator
> > initialization (if appropriate).  Here is one file that sits near the
> > center of much of this work:
> >
> >
> > https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/QueryPlanner.scala
> >
> > 3.B.  Accumulo iterators:  As you can see from the following link, there
> > are quite a few separate iterators that the query-planners can use to
> > execute various search strategies:
> >
> >
> > https://github.com/locationtech/geomesa/tree/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/iterators
> >
> > Most specifically, you may be interested in the
> > SpatioTemporalIntersectingIterator.scala that is responsible for the
> > (deprecated) "_st_idx" table reads.  Of course, it extends base classes,
> > but it's a place to start reading.
> >
> > The bottom line, as you already know, is that indexing and querying are
> > tricky.  To get reasonable performance across all queries may well be
> > impossible, so a good index requires knowing as much as you can about
> > your data and query profiles.  I'm not sure where you are in your
> > literature review for this, but you may want to read the D4M paper (if
> > you haven't already!) for some interesting pointers on high-level
> > indexing concerns:
> >
> >    http://www.mit.edu/~kepner/D4M/
> >
> > All the best!
> >
> > Sincerely,
> >    -- Chris
> >
> >
> > On Wed, 2015-09-16 at 10:51 +0530, vaibhav.thapliyal wrote:
> >> Dear Chris,
> >>
> >> Thanks for explaining the structure of the st_idx table. I was able to
> >> index the lat-lon and temporal data in the row. However, I have made a
> >> slight modification to the GeoMesa schema: I have prepended an "id"
> >> field to the existing row content. So now my row looks like this:
> >>
> >> ~48253cb63641fa0d31faa52c~0~0~featureNewsMediaV1_2~sx9~2014021818
> >>
> >> I also used the Kryo serializer library to create an encoded form of the
> >> id, which I am storing in the value field.
> >> So my value looks like this:
> >>
> >> \x0148253cb63641fa0d31faa52\xE3\x00\x00\x00\x00\x00
> >>
> >> which is different from the way the value is encoded in the original
> >> GeoMesa tables.
> >>
> >> So it would be great if you could point out where in the code the value
> >> is encoded, and how the GeoMesa iterators perform queries on this
> >> schema.
> >>
> >> Thanks
> >> Vaibhav
> >>
> >> On 09/10/2015 05:56 PM, Chris Eichelberger wrote:
> >>> Vaibhav,
> >>>
> >>> For what it's worth, the "st_idx" table is about to be deprecated in
> >>> favor of a new Z3 structure.  We will shortly have documentation up
> >>> describing how the new structure works.  (The short version is that it's
> >>> a concatenation of the most-significant bits of T and a Z-order
> >>> combination of X, Y, and the low-order bits of T.)
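
A rough sketch of the idea in that short version follows. It is not GeoMesa's actual Z3 layout: the bit widths, the bit order within the interleave, and the assumption that X, Y, and T have already been discretized to small non-negative integers are all made up for the illustration.

```python
def interleave3(x, y, t, bits):
    """Z-order (Morton) interleave of the low `bits` bits of x, y, and t."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (3 * i)      # x bit goes to position 3i
        z |= ((y >> i) & 1) << (3 * i + 1)  # y bit goes to position 3i + 1
        z |= ((t >> i) & 1) << (3 * i + 2)  # t bit goes to position 3i + 2
    return z


def z3_like_key(x, y, t, bits=4):
    """Key = (high-order bits of t, Z-order value of x, y, low-order bits of t).

    Tuples compare element by element, so sorting these keys groups rows by
    coarse time first and by the space-filling curve within each time bucket.
    """
    t_high = t >> bits
    t_low = t & ((1 << bits) - 1)
    return (t_high, interleave3(x, y, t_low, bits))


key = z3_like_key(x=0, y=0, t=0b10011)  # t_high = 1, t_low = 0b0011
```

Leading with the coarse time bucket keeps rows from the same period together, while the interleaved part keeps nearby (X, Y, T) values close within that bucket.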
> >>>
> >>> In the "st_idx" table, the value is typically a Kryo-encoded version of
> >>> the entire Feature, which allows us to do fine-grained filtering within
> >>> the Accumulo iterators beyond the spatio-temporal constraints.
> >>>
> >>> Secondary indexes, those not involving (X, Y, T), are created by
> >>> concatenating the feature-type name with the attribute name and the
> >>> value within the "attr_idx" table.  The value in this table will be
> >>> either a Kryo-encoded version of the Feature or the record ID (depending
> >>> on the index type requested in the SimpleFeature specification, "full"
> >>> or "join", respectively.)  This way, if we know that an attribute's
> >>> value has high selectivity (high cardinality), we can seek directly to
> >>> the part of the "attr_idx" table where those records reside.  This
> >>> structure also enables range queries and right-hand wildcards on
> >>> attribute values.
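
The reason this key structure supports direct seeks, range queries, and right-hand wildcards can be shown with a sorted list standing in for the "attr_idx" table. The separator byte, the trailing feature ID in the key, and the sample data are assumptions for the sketch and not the actual GeoMesa encoding.

```python
from bisect import bisect_left

SEP = "\x00"  # hypothetical separator that sorts below any printable character


def attr_row(type_name, attr, value, feature_id):
    """Concatenate feature-type name, attribute name, value, and feature ID."""
    return SEP.join((type_name, attr, value, feature_id))


def scan(rows, start_key, stop_key):
    """Return the rows in the half-open key range [start_key, stop_key)."""
    return rows[bisect_left(rows, start_key):bisect_left(rows, stop_key)]


def prefix_scan(rows, prefix):
    """Return the rows that start with prefix (a right-hand wildcard)."""
    return scan(rows, prefix, prefix + chr(0x10FFFF))


names = {"f1": "Smith", "f2": "Smythe", "f3": "Jones", "f4": "Adams", "f5": "Smith"}
rows = sorted(attr_row("customer", "name", v, fid) for fid, v in names.items())
base = "customer" + SEP + "name" + SEP


def ids(found):
    """Extract the feature IDs from a list of matching rows."""
    return [row.rsplit(SEP, 1)[1] for row in found]


equal_smith = ids(prefix_scan(rows, base + "Smith" + SEP))  # name = 'Smith'
like_sm = ids(prefix_scan(rows, base + "Sm"))               # name LIKE 'Sm%'
b_to_k = ids(scan(rows, base + "B", base + "K"))            # 'B' <= name < 'K'
```

A left-hand wildcard such as `'%son'` has no key prefix to seek to, which is why only right-hand wildcards benefit from this layout.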
> >>>
> >>> I hope this helps.  If you have any additional questions, please just
> >>> let us know.
> >>>
> >>> Thanks!
> >>>
> >>> Sincerely,
> >>>     -- Chris
> >>>
> >>>
> >>> On Thu, 2015-09-10 at 17:45 +0530, vaibhav.thapliyal wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I am trying to use GeoMesa for querying geospatial data. While doing
> >>>> so, I came across the st_idx table, which is used to index
> >>>> spatio-temporal data. From the research paper I understood how the
> >>>> indexing is done and how the key is generated, i.e., how data is mapped
> >>>> onto the row, column family, and column qualifier using the geohashed
> >>>> value of the lat-lon. What I am failing to understand is what is
> >>>> contained inside the Value field of Accumulo.
> >>>>
> >>>> Also, if someone could explain how data that is not spatio-temporal
> >>>> (e.g., Customer Name) is stored/indexed in GeoMesa, it would be very
> >>>> helpful.
> >>>>
> >>>> Thanks
> >>>> Vaibhav
> >>>> _______________________________________________
> >>>> geomesa-users mailing list
> >>>> geomesa-users@xxxxxxxxxxxxxxxx
> >>>> To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> >>>> http://www.locationtech.org/mailman/listinfo/geomesa-users