Re: [geomesa-users] Schema of st_idx table
Vaibhav,
You shouldn't have to write new code to do this. If you use the
GeoTools API for GeoMesa, all of this should be handled
behind-the-scenes for you. For example, the SimpleFeatureTypes utility
class accepts some of the indexing options when you create
your type; here's an example of using this class to establish a
"full"-type index encoding:
https://github.com/locationtech/geomesa/blob/master/geomesa-utils/src/test/scala/org/locationtech/geomesa/utils/geotools/SimpleFeatureTypesTest.scala#L246
Kryo encoding of features is now the default in GeoMesa.
If you have any additional questions, please just let us know.
Thanks!
Sincerely,
-- Chris
On Mon, 2015-09-21 at 14:29 +0530, vaibhav.thapliyal wrote:
> Thank you Chris for a very good explanation on how geomesa indexes and
> queries data.
>
> As you mentioned the disadvantage of the Join index type, I
> am trying to store the whole record in the Value field. Can you point
> out any class in the GeoMesa code that encodes and inserts these records in
> the Value field, and also give a pointer on encoding through Kryo serialization?
>
> Thanks
> Vaibhav
>
> On 09/16/2015 05:11 PM, Chris Eichelberger wrote:
> > Vaibhav,
> >
> > A few more-or-less random thoughts about what you've written...
> >
> > 1. Pre-pending the ID to the spatio-temporal index table's key may not
> > do what you seem to have in mind. For instance, it looks as if you have
> > preserved the shard number after the ID, which was meant to distribute
> > data uniformly across tablet servers. (In fact, the shard number
> > that is assigned by GeoMesa to a record is based on a hash of the
> > feature ID, ensuring that we always know in which partition to find a
> > feature based on its ID.) Adding the ID itself in front means that the
> > data are no longer uniformly sharded across tablet-servers. Further, it
> > means that the query planners can only work from the right-hand side of
> > the key (unless you know, a priori, which keys you need, in which case
> > you don't need a spatio-temporal index at all); this means resorting to
> > regex-type filters that have to visit most (if not all) records in the
> > entire table, which may represent a significant loss in query
> > efficiency.
> >
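To make the sharding point concrete, here is a minimal Python sketch. The simplified `shard~geohash~datehour` layout, the `~` separator, the shard count, and the use of CRC32 as the hash are all assumptions chosen for illustration; they are not GeoMesa's actual key format or hash function.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration
SEP = "~"       # separator borrowed from the example row in this thread


def shard_for(feature_id, num_shards=NUM_SHARDS):
    """Derive a stable shard number from a hash of the feature ID."""
    return zlib.crc32(feature_id.encode("utf-8")) % num_shards


def shard_first_row(feature_id, geohash, date_hour):
    """Simplified row key that leads with the shard number."""
    return SEP.join((str(shard_for(feature_id)), geohash, date_hour))


def id_first_row(feature_id, geohash, date_hour):
    """Row key with the feature ID placed in front of the shard number."""
    return SEP.join((feature_id, str(shard_for(feature_id)), geohash, date_hour))


def scan_prefixes(geohash_prefix, num_shards=NUM_SHARDS):
    """One contiguous row prefix per shard covers a geohash prefix."""
    return [SEP.join((str(s), geohash_prefix)) for s in range(num_shards)]


def matches_any_prefix(row, prefixes):
    """True if a row falls inside one of the planned prefix scans."""
    return any(row.startswith(p) for p in prefixes)
```

With shard-first keys, a spatial query needs only `NUM_SHARDS` prefix scans. Once the ID leads the key, no row starts with any of those prefixes, so the planner can no longer seek and has to filter the whole table.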
> > 2. Storing the ID (instead of the record) in the value field means
> > doing two-phase queries, one in which you find the list of IDs you need,
> > and a second in which you scan an IDs-ordered table to return the full
> > records. We have used this sort of index before in GeoMesa -- it's what
> > we call the "join" index type for secondary attributes -- but this turns
> > out to work very well *only* when there is a single query running at any
> > given time. If you haven't already read it, you might enjoy this paper,
> > particularly page 5 that addresses this exact scenario, albeit not in
> > the context of GeoMesa:
> >
> > http://ieee-hpec.org/2013/index_htm_files/28-2868615.pdf
> >
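The difference between the two value layouts in point 2 can be sketched in a few lines of Python. The in-memory dictionaries and lists stand in for Accumulo tables, and every name here (`build_join_index`, `two_phase_query`, the `gh` field) is hypothetical; this is an illustration of the access pattern, not GeoMesa code.

```python
def build_join_index(records):
    """Index entries whose value is only the feature ID ("join" style)."""
    return sorted((rec["gh"], fid) for fid, rec in records.items())


def build_full_index(records):
    """Index entries whose value is the whole record ("full" style)."""
    return sorted(((rec["gh"], rec) for rec in records.values()),
                  key=lambda entry: entry[0])


def two_phase_query(join_index, record_table, wanted):
    """Phase 1 collects IDs from the index; phase 2 fetches each record."""
    ids = [fid for key, fid in join_index if wanted(key)]
    return [record_table[fid] for fid in ids]  # one extra lookup per hit


def single_phase_query(full_index, wanted):
    """The record is already in the index value, so one scan is enough."""
    return [rec for key, rec in full_index if wanted(key)]
```

Both functions return the same records; the join variant simply pays for a second round of lookups against the ID-ordered table for every hit.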
> > 3. GeoMesa queries occur in two places: There is significant rewriting
> > and planning that occurs on the client side, and then the iterators do
> > the sifting and sorting on the Accumulo tablet-server side.
> >
> > 3.A. Query planners: On the client-side, we try to take care to plan
> > queries so that various kinds of requests -- feature selection; requests
> > for aggregate densities; non-point geometries; etc. -- can run well. To
> > do this, GeoMesa has to consider rewriting queries (to handle
> > complicated filters), parse optional query commands/hints to identify
> > the specific task, map query-satisfaction strategies to the specific
> > request, and enumerate scan ranges that are relevant to the iterator
> > initialization (if appropriate). Here is one file that sits near the
> > center of much of this work:
> >
> >
> > https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/QueryPlanner.scala
> >
> > 3.B. Accumulo iterators: As you can see from the following link, there
> > are quite a few separate iterators that the query-planners can use to
> > execute various search strategies:
> >
> >
> > https://github.com/locationtech/geomesa/tree/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/iterators
> >
> > Most specifically, you may be interested in the
> > SpatioTemporalIntersectingIterator.scala that is responsible for the
> > (deprecated) "_st_idx" table reads. Of course, it extends base classes,
> > but it's a place to start reading.
> >
> > The bottom line, as you already know, is that indexing and querying are
> > tricky. To get reasonable performance across all queries may well be
> > impossible, so a good index requires knowing as much as you can about
> > your data and query profiles. I'm not sure where you are in your
> > literature review for this, but you may want to read the D4M paper (if
> > you haven't already!) for some interesting pointers on high-level
> > indexing concerns:
> >
> > http://www.mit.edu/~kepner/D4M/
> >
> > All the best!
> >
> > Sincerely,
> > -- Chris
> >
> >
> > On Wed, 2015-09-16 at 10:51 +0530, vaibhav.thapliyal wrote:
> >> Dear Chris,
> >>
> >> Thanks for explaining the structure of the st_idx table. I was able to
> >> index the lat-lon and temporal data in the row. However I have a slight
> >> modification as compared to the geomesa schema. I have prepended an "id"
> >> field to the existing row content. So now I have the row like this:
> >>
> >> ~48253cb63641fa0d31faa52c~0~0~featureNewsMediaV1_2~sx9~2014021818
> >>
> >> I also used the Kryo serializer library to create an encoded form of the id
> >> which I am storing in the value field.
> >> So my value is like this.
> >>
> >> \x0148253cb63641fa0d31faa52\xE3\x00\x00\x00\x00\x00
> >>
> >> which is different from the way value is encoded in the original geomesa
> >> tables.
> >>
> >> So it would be great if you could point out where in the code the
> >> value is encoded and how the geomesa iterators perform queries on this
> >> schema.
> >>
> >> Thanks
> >> Vaibhav
> >>
> >> On 09/10/2015 05:56 PM, Chris Eichelberger wrote:
> >>> Vaibhav,
> >>>
> >>> For what it's worth, the "st_idx" table is about to be deprecated in
> >>> favor of a new Z3 structure. We will shortly have documentation up
> >>> describing how the new structure works. (The short version is that it's
> >>> a concatenation of the most-significant bits of T and a Z-order
> >>> combination of X, Y, and the low-order bits of T.)
> >>>
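The short description of the Z3 structure above can be sketched roughly as follows in Python. The one-week time bin, the 21 bits per dimension, and all function names are assumptions made for illustration; the thread only states that the key concatenates the high-order bits of T with a Z-order combination of X, Y, and the low-order bits of T.

```python
BITS = 21                       # bits per dimension (assumption)
WEEK_SECONDS = 7 * 24 * 3600    # coarse time bin (assumption)


def normalize(value, lo, hi, bits=BITS):
    """Scale a value in [lo, hi] to an integer in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return min(cells, max(0, int((value - lo) / (hi - lo) * cells)))


def interleave3(x, y, t, bits=BITS):
    """Z-order: interleave the bits of x, y, and t into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (3 * i)
        z |= ((y >> i) & 1) << (3 * i + 1)
        z |= ((t >> i) & 1) << (3 * i + 2)
    return z


def z3_key(lon, lat, epoch_seconds):
    """Coarse time bin first, then the Z-order value of (x, y, time offset)."""
    week, offset = divmod(epoch_seconds, WEEK_SECONDS)
    z = interleave3(
        normalize(lon, -180.0, 180.0),
        normalize(lat, -90.0, 90.0),
        normalize(offset, 0, WEEK_SECONDS),
    )
    return (week, z)
```

Because the coarse time bin leads the key, rows sort by time bin first and by the interleaved space-time value within each bin.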
> >>> In the "st_idx" table, the value is typically a Kryo-encoded version of
> >>> the entire Feature, which allows us to do fine-grained filtering within
> >>> the Accumulo iterators beyond the spatio-temporal constraints.
> >>>
> >>> Secondary indexes, those not involving (X, Y, T), are created by
> >>> concatenating the feature-type name with the attribute name and the
> >>> value within the "attr_idx" table. The value in this table will be
> >>> either a Kryo-encoded version of the Feature or the record ID (depending
> >>> on the index type requested in the SimpleFeature specification, "full"
> >>> or "join", respectively). This way, if we know that an attribute's
> >>> value has high selectivity (high cardinality), we can seek directly to
> >>> the part of the "attr_idx" table where those records reside. This
> >>> structure also enables range queries and right-hand wildcards on
> >>> attribute values.
> >>>
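As a rough illustration of why this key layout supports seeks, range queries, and right-hand wildcards, here is a small Python sketch using a sorted list in place of the "attr_idx" table. The null-byte separator, the function names, and the sample data are assumptions for illustration and do not reflect GeoMesa's actual row encoding.

```python
import bisect

SEP = "\x00"            # hypothetical separator between key parts
MAX_CHAR = "\U0010ffff"  # highest code point, used as an upper bound


def attr_row(type_name, attr, value):
    """Concatenate feature-type name, attribute name, and value."""
    return SEP.join((type_name, attr, value))


def build_index(type_name, attr, features):
    """Sorted (row, feature ID) entries, as in a "join"-style index."""
    return sorted((attr_row(type_name, attr, f[attr]), fid)
                  for fid, f in features.items())


def range_scan(index, start_row, end_row):
    """Return IDs for rows in [start_row, end_row) by seeking, not filtering."""
    lo = bisect.bisect_left(index, (start_row,))
    hi = bisect.bisect_left(index, (end_row,))
    return [fid for _, fid in index[lo:hi]]


def prefix_scan(index, type_name, attr, value_prefix):
    """Right-hand wildcard (value_prefix*) expressed as a single range."""
    start = attr_row(type_name, attr, value_prefix)
    return range_scan(index, start, start + MAX_CHAR)
```

Since the rows are sorted, a wildcard such as `al*` becomes one contiguous range, which is the reason only right-hand wildcards can be served this way.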
> >>> I hope this helps. If you have any additional questions, please just
> >>> let us know.
> >>>
> >>> Thanks!
> >>>
> >>> Sincerely,
> >>> -- Chris
> >>>
> >>>
> >>> On Thu, 2015-09-10 at 17:45 +0530, vaibhav.thapliyal wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I am trying to use geomesa for querying geo-spatial data. While doing
> >>>> so, I came across the st_idx table which is used to index
> >>>> spatio-temporal data. On reading the research paper I found out how
> >>>> indexing is done and how data is mapped onto the row, column family, and
> >>>> column qualifier using the geohashed value of the lat-lon, i.e., how the key is
> >>>> generated. What I am failing to understand is what is contained
> >>>> inside the Value field of Accumulo.
> >>>>
> >>>> Also, if someone could explain how non-spatio-temporal data (e.g.,
> >>>> Customer Name) is stored/indexed in geomesa, it would be very helpful.
> >>>>
> >>>> Thanks
> >>>> Vaibhav
> >>>> _______________________________________________
> >>>> geomesa-users mailing list
> >>>> geomesa-users@xxxxxxxxxxxxxxxx
> >>>> To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> >>>> http://www.locationtech.org/mailman/listinfo/geomesa-users