Re: [smila-dev] search record: group by vs. faceting


From: Thomas Menzel <tmenzel@xxxxxxx>
Date: Mon, 9 Jan 2012 13:17:00 +0100

Hi folks,

Attached you will find some sample XMLs that specify both the query and the result side.

> I assume that the $maxcount most relevant results would still be listed as “records” as in an “ungrouped” search additionally, at least optionally?

Hm, I don’t quite understand your comment here. Do you want one of the grouped results to be returned redundantly in the normal results, i.e. a main group that is selected based on its hit count?

If not: please explain further, especially what you mean by “the $maxcount most relevant results”.

Anyhow, I have provided the option “_asMainResult” to define the main group.

> attribute values vs. keys & dynamic groups

OK, I see your point. I had thought of doing something similar to your approach myself in order to be more flexible, but deemed it more bloated than I wanted it to be, and hence came up with the proposal I sent.

But since I like the idea with

- being open to dynamic grouping

- being able to ship additional parameters more easily

I will go with your proposal.

Not relevant anymore now, but I’m wondering if we should have one serialization format dictate the design…

> should not define two structures for very similar things, but rather try to create one structure that supports all “grouping/faceting/clustering” use cases

As I said above and mentioned in my initial mail, faceting and grouping/clustering are two fundamentally different things. Faceting is only concerned with counting, and from the facet results themselves one can’t directly infer which facet “contains” which result items (the reverse is possible if the result items contain the faceted values again, but then you just do the work again). With “group by”, on the other hand, the results are nested in groups, so we don’t have just one result list but one for each group. For this reason I think we really should have different structures here. And that is not all: with Solr you can do faceting *and* group by at the same time, so for this reason alone we need the two different return structures.
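To make the distinction concrete, here is a minimal sketch in Python. It is not SMILA code: the record layout and the function names `facet` and `group_by` are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical result items; ids and attribute values are made up for this sketch.
records = [
    {"id": "r1", "type": "LED", "size": 32},
    {"id": "r2", "type": "LED", "size": 40},
    {"id": "r3", "type": "Plasma", "size": 40},
]

def facet(records, attribute):
    """Faceting: only count how often each attribute value occurs."""
    return Counter(r[attribute] for r in records)

def group_by(records, attribute):
    """Group by: keep a separate result list for each attribute value."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r)
    return dict(groups)

facets = facet(records, "type")     # counts only, no result items
groups = group_by(records, "type")  # one result list per value
```

The facet output carries no reference back to the result items, while the group output is a set of result lists; that is why the two need different return structures.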

As you can see in the examples, I have extended the faceting to support ranges and also the filtering on selected facet values. One could take this even further. The question is: do we want to specify it (the filtering) in that detail as a general convention, or shall we leave this to the implementations of the integrated search technologies?

Thomas Menzel @ brox IT-Solutions GmbH

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Jürgen Schumacher
Sent: Thursday, 5 January 2012 11:28
To: Smila project developer mailing list
Subject: Re: [smila-dev] search record: group by vs. faceting

First, a Happy New Year to everybody (:

Basically, it’s fine with me to extend the groups result structure. Some questions or remarks:

- I assume that the $maxcount most relevant results would still be listed as “records” as in an “ungrouped” search additionally, at least optionally?

- Usually we try not to use attribute values as Map keys because this can lead to problems in some JSON parsers (some assume that there is only a relatively limited number of keys in JSON objects because they are more like member names rather than “hash map keys”, so they store (or even intern()) all used keys which may lead to memory problems if arbitrary keys are used), so I would prefer to have the attribute values of the groups stored as Values, too (I’ll do a consolidated example below).

- For readability it would be nice if the “groups” structure contained the attribute names, too. This would also allow representing “dynamic” groupings later (as a hypothetical extension of your example: LEDs are sub-grouped by size, while Plasmas are sub-grouped by manufacturer, because all results have the same size … or something like this). Or it could allow multiple sub-groupings for one group value, etc. (the structure would then evolve into some kind of “decision tree” to help the user find the best result). It would be easy to add a “type” attribute to the top-level “groups” map to describe which kind of grouping is contained, if it’s necessary to know this on the search client side.

- I think the current grouping with non-hierarchical groups is still useful in other scenarios, so it would be nice if the “groups” structure could support both use cases. I didn’t want to introduce a separate structure for this; the idea was that we should not define two structures for very similar things, but rather try to create one structure that supports all “grouping/faceting/clustering” use cases, because that is usually easier for clients.
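
The concern about attribute values as map keys can be illustrated with a small sketch in Python. The two shapes and the helper `keys_to_values` are assumptions for illustration only and not part of any SMILA API.

```python
import json

# Shape A: attribute values used as map keys, so the key set is open-ended.
groups_by_key = {"LED": {"count": 323}, "Plasma": {"count": 17}}

def keys_to_values(groups):
    """Turn {value: {...}} into [{"value": value, ...}] with a fixed key set."""
    return [{"value": value, **props} for value, props in groups.items()]

# Shape B: the attribute value is stored as a value; only "value" and "count" are keys.
groups_as_values = keys_to_values(groups_by_key)
print(json.dumps(groups_as_values))
```

In shape B a parser only ever sees a small, fixed set of keys, no matter how many distinct attribute values the index contains.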

So, my proposal would be to extend the current structure by adding sub-grouping and the possibility to add results to the groups. Which parts of the structure are actually used would depend on the available features of the integrated search engine (of course, a search engine integration could add specific parameters to the groupby parameter to specify what is returned or not). Also, it would be easier for a search engine to add specific properties to the structure without having to break it again.

For example, your example could look like this (XML makes it quite big, it would be much more readable in JSON ;-):

<Map>

<Map>

…

</Seq>

</Map>

<Map>

…

</Seq>

</Map>

</Seq>

</Map>

<Map>

<Val key="value">Plasma</Val>

<Map>

…

</Seq>

</Map>

<Map>

…

</Seq>

</Map>

</Seq>

</Map>

</Seq>

</Map>

Regards,

Juergen


<Val key="query">tv</Val>
<Map key="groupby">
  <!-- a grouping is defined in a map that bears the group name. For each definition in this Map there
    exists a corresponding Map under the groups Map in the result, allowing for multiple groups to be
    returned. -->
  <Map key="${group-name}">
    <!--
      The grouping features are highly dependent on the search technology. An integrated search technology
      may choose to place its own config mapping for grouping into this section and extend it, or place it
      entirely elsewhere. For this reason the outer structure is a map to host named parameters.
    -->
    <Seq key='attributes'>
      <!--
        Since SMILA shall remain open, it will only specify the simple case of grouping by attribute fields,
        and a given search technology might not even support this fully.

        By convention SMILA defines the behavior of this attributes Seq as follows: for each subsequent
        attribute given, a nested groups structure is added to each value of the previous attribute.
      -->
      <Map>
        <Val key="attribute">type</Val>
        <Val key="maxcount" type="long">10</Val>
        ...
      </Map>
      <Map>
        <Val key="attribute">size</Val>
        <Val key="maxcount" type="long">10</Val>
        ...
      </Map>
    </Seq>
  </Map>
  <!-- optional. Usually grouped results are solely returned in the groups Map, but sometimes there is
    the need/desire to return one of the groups as the main result list, as if it were a normal search.
    Sample use case: field collapsing. The value given must correspond to one of the group names. -->
  <Val key="_asMainResult">${group-name}</Val>
</Map>



<Map key="groups">
  <!-- the groups result structure supports both lists of groups and nesting. The top-level groups correspond
    to the top-level definitions in the group by section. Integrators are highly encouraged to stick to
    this structure and possibly extend it with additional result values. -->
  <Seq key="${group-name}">
    <Seq key="type">  <!-- key: name of the group command, usually attribute name -->
      <Map>
        <Val key="value">LED</Val>
        <Val key="count" type="long">323</Val>
        <Map key="groups">
          <Seq key="size">   <!--key: group attribute name -->
            <Map>
              <Val key="value">32</Val>
              <Val key="count" type="long">13</Val>
              <Seq key="results">
                <!-- exact same structure as the normal result list -->
                ...
              </Seq>
            </Map>
            <Map>
              <Val key="value">40</Val>
              <Val key="count" type="long">29</Val>
              <Seq key="results">
                ...
              </Seq>
            </Map>
          </Seq>
        </Map>
      </Map>
      <Map>
        <Val key="value">Plasma</Val>
        <Val key="count" type="long">17</Val>
        <Map key="groups">
          <Seq key="size">    <!--key: group attribute name -->
            <Map>
              <Val key="value">32</Val>
              <Val key="count" type="long">5</Val>
              <Seq key="results">
                ...
              </Seq>
            </Map>
            <Map>
              <Val key="value">40</Val>
              <Val key="count" type="long">12</Val>
              <Seq key="results">
                ...
              </Seq>
            </Map>
          </Seq>
        </Map>
      </Map>
    </Seq>
  </Seq>
</Map>
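
To show how a client might consume the nested groups result, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The embedded sample is a shortened, well-formed copy of the structure above; the group name `by-type` and the function `walk` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the groups result above; "by-type" is a made-up group name.
SAMPLE = """
<Map key="groups">
  <Seq key="by-type">
    <Seq key="type">
      <Map>
        <Val key="value">LED</Val>
        <Val key="count" type="long">323</Val>
        <Map key="groups">
          <Seq key="size">
            <Map>
              <Val key="value">32</Val>
              <Val key="count" type="long">13</Val>
            </Map>
          </Seq>
        </Map>
      </Map>
      <Map>
        <Val key="value">Plasma</Val>
        <Val key="count" type="long">17</Val>
      </Map>
    </Seq>
  </Seq>
</Map>
"""

def walk(seq, path=()):
    """Yield (path, count) for every group below an attribute-named Seq."""
    attribute = seq.get("key")
    for group in seq.findall("Map"):
        value = group.find("Val[@key='value']").text
        count = int(group.find("Val[@key='count']").text)
        here = path + ((attribute, value),)
        yield here, count
        # Sub-groups live in a nested Map with key "groups".
        nested = group.find("Map[@key='groups']")
        if nested is not None:
            for sub in nested.findall("Seq"):
                yield from walk(sub, here)

root = ET.fromstring(SAMPLE)
rows = [row
        for named_group in root.findall("Seq")
        for attribute_seq in named_group.findall("Seq")
        for row in walk(attribute_seq)]
```

Because the attribute name is carried as the key of each Seq, the same recursion would also cope with dynamic sub-groupings, where different group values are sub-grouped by different attributes.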

<Val key="query">tv</Val>
<!-- TODO
  - filters
  - amazon style
-->

<Seq key="facetby">
  <Map>
    <!-- facet-name defines the key in facets result map. Internal use of this value depends on search
      technology but it is likely to correspond to an attribute name. More advanced faceting features
      might not though... -->
    <Val key="facet-name">type</Val>
    <!-- one of: enum, gap. Optional, defaults to enum. Valid values may be extended by the search
      technology. Enum will return a facet per value; a range type will return the counts per range
      defined through gap or buckets.
    -->
    <Val key="type">enum</Val>
    <Map key="sortby">
      <!-- one of: value, count, others by extension of the search technology -->
      <Val key="criterion">count</Val>
      <!-- one of: ascending, descending -->
      <Val key="order">ascending</Val>
    </Map>
    <!-- only valid with type=enum -->
    <Val key="maxcount" type="long">5</Val>
    <!-- the filter is not required initially, but later on when a user wants to filter on one or more
      values -->
    <Seq key="filterOn">
      <Val>LED</Val>
      <Val>Plasma</Val>
    </Seq>
  </Map>
  <Map>
    <!-- defines a range facet with equal sized gaps. The ranges are named according to their lower edge, 
      and referenced like that in the result as well as in <filterOn> -->
    <Val key="facet-name">size-gap</Val>
    <Val key="attribute">size</Val>
    <Val key="type">gap</Val>
    <Val key="start">0</Val>
    <Val key="end">60</Val>
    <Val key="gap">10</Val>

    <Seq key="filterOn">
      <Val>20</Val>
      <Val>30</Val>
      <Val>40</Val>
    </Seq>
  </Map>
</Seq>
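
The gap facet above names each range by its lower edge. A minimal Python sketch of that bucketing follows; it assumes half-open ranges `[edge, edge + gap)` and an exclusive `end`, neither of which is specified in the samples, and the function names are hypothetical.

```python
from collections import Counter

def gap_bucket(value, start, end, gap):
    """Return the lower edge naming the range of value, or None if out of range.

    Assumption: ranges are half-open, [edge, edge + gap), and end is exclusive.
    """
    if value < start or value >= end:
        return None
    return start + ((value - start) // gap) * gap

def gap_facet(values, start, end, gap):
    """Count values per range, keyed by the lower edge of each range."""
    buckets = (gap_bucket(v, start, end, gap) for v in values)
    return Counter(b for b in buckets if b is not None)

# Made-up sizes; 65 lies outside start=0..end=60 and is not counted.
sizes = [32, 37, 40, 42, 65]
counts = gap_facet(sizes, start=0, end=60, gap=10)
```

With these assumptions a size of 32 is counted under the range named 30, which is also the name a client would send back in filterOn.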


<Map key="facets">
  <!-- the groups result structure supports both lists of groups and nesting. The top-level groups correspond
    to the top-level definitions in the group by section. Integrators are highly encouraged to stick to
    this structure and possibly extend it with additional result values. -->
  <Seq key="type">
    <Map>
      <Val key="value">LED</Val>
      <Val key="count" type="long">323</Val>
    </Map>
    <Map>
      <Val key="value">Plasma</Val>
      <Val key="count" type="long">17</Val>
    </Map>
  </Seq>
  <Seq key="size">
    <Map>
      <Val key="value">30</Val>
      <Val key="count" type="long">5</Val>
    </Map>
    <Map>
      <Val key="value">40</Val>
      <Val key="count" type="long">12</Val>
    </Map>
  </Seq>
</Map>

<Map>
  <!-- defines arbitrary ranges that might overlap and even leave gaps. Both are shown here; the example
    isn't a very good use case for this, but it gets the idea across. -->
  <Val key="facet-name">size-ranged</Val>
  <Val key="attribute">size</Val>
  <Val key="type">ranges</Val>
  <Seq key="ranges">
    <Map>
      <!-- defines the value's value in the facets result -->
      <Val key='value-name'><![CDATA[< 32]]></Val>
      <!-- * is the special value signifying an open bound: the lowest possible start or the highest possible end -->
      <Val key='start'>*</Val>
      <Val key='end'>30</Val>
    </Map>
    <Map>
      <Val key='value-name'><![CDATA[40..50]]></Val>
      <Val key='start'>40</Val>
      <Val key='end'>50</Val>
    </Map>
    <Map>
      <Val key='value-name'><![CDATA[52..60]]></Val>
      <Val key='start'>52</Val>
      <Val key='end'>70</Val>
    </Map>

    <Seq key='filterOn'>
      <Val>40..50</Val>
    </Seq>
  </Seq>
</Map>

<Map key="facets">
  <Seq key="size-ranged">
    <Map>
      <Val key="value"><![CDATA[< 32]]></Val>
      <Val key="count" type="long">5</Val>
    </Map>
    <Map>
      <Val key="value">40..50</Val>
      <Val key="count" type="long">12</Val>
    </Map>
  </Seq>
</Map>
