Re: [smila-dev] search record: group by vs. faceting

From: Jürgen Schumacher <juergen.schumacher@xxxxxxxxxxxxx>
Date: Fri, 13 Jan 2012 10:06:58 +0100

Hi,

Thomas wrote:
> > should not define two structures for very similar things, but rather try to create one structure that support all “grouping/faceting/clustering” use cases
> As I said above and mentioned in my initial mail, faceting and grouping/clustering are two fundamentally different things...

Thanks, I got it (now ;-). It's OK to have both. Still, wherever parameters or result structures are similar, we should use the same representation for them. But your examples already do that.

> As you can see in the examples I have extended the faceting to support ranges and also the filtering of selected facet values.  One 
> could drive this even further. The question is: do we want to spec it (filtering) in that detail as a general convention or shall we
> leave this to impl. of integrated search technologies?

No, I don't think we should specify too much. It would just be guessing. So rather: keep it simple, let's focus on what we (you ;-) need today. If some really fancy feature comes along next year that doesn't fit, we can extend the specification then.

> Attached you will find some sample XMLs that spec both query and result side.

Looks OK to me, basically. I'm a bit concerned about the <Map key="${group-name}"> level. I can see that there may be use cases for it, but it makes usage a bit inconvenient in most use cases, where only one grouping is used. Could we make it optional? Then in most use cases this would be sufficient (and it's very similar to the faceting parameters):

<Val key="query">tv</Val>
<Seq key="groupby">
  <Map>
    <Val key="attribute">type</Val>
    <Val key="maxcount" type="long">10</Val>
    ...
  </Map>
  <Map>
    <Val key="attribute">size</Val>
    <Val key="maxcount" type="long">10</Val>
    ...
  </Map>
</Seq>

while in more sophisticated use cases your proposal could be used?
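
To illustrate what "optional" could mean for an implementation, here is a minimal Python sketch that accepts both forms. It assumes that the named form is a map keyed by group name (my reading of the <Map key="${group-name}"> level, not something the examples confirm), and the helper name normalize_groupby is hypothetical:

```python
def normalize_groupby(groupby):
    """Return a list of (group_name, params) pairs from either form."""
    if isinstance(groupby, list):
        # Simple form: a sequence of maps. The attribute doubles as the name.
        return [(params["attribute"], params) for params in groupby]
    if isinstance(groupby, dict):
        # Named form (assumed shape): a map keyed by group name.
        return list(groupby.items())
    raise TypeError("groupby must be a sequence or a map")


simple = [
    {"attribute": "type", "maxcount": 10},
    {"attribute": "size", "maxcount": 10},
]
named = {"by-type": {"attribute": "type", "maxcount": 10}}

simple_groups = normalize_groupby(simple)
named_groups = normalize_groupby(named)
```

Either way the search engine integration would see the same list of named groupings, so the extra level costs nothing in the simple case.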

On the faceting examples: I suppose you are more familiar with the possible options here, so I cannot discuss these in detail. Just one thing:

You write in facetby.xml:

    <!-- facet-name defines the key in facets result map. Internal use of this value depends on search
      technology but it is likely to correspond to an attribute name. More advanced faceting features
      might not though... -->
    <Val key="facet-name">type</Val>

And later:

    <Val key="facet-name">size-gap</Val>
    <Val key="attribute">size</Val>

- We don't use "-" in parameter names yet, I think, so this should rather be "facetName".
- As the default use case is "faceting using attributes", I think it would be nicer to represent this in the "normal" parameter structure. So "normally" you specify the attribute to use for faceting, and the attribute name is used as the facet name, too. The first example could then be just:

<Map>
  <Val key="attribute">type</Val>
  <Val key="type">enum</Val>
  ...

If you want to, you can still add a facetName:

<Map>
  <Val key="facetName">type-enum</Val>
  <Val key="attribute">type</Val>
  <Val key="type">enum</Val>
  ...

which then would be used in the result as the key of the sequence instead of the attribute name:

<Map key="facets">
  <Seq key="type-enum">
    ...
  </Seq>
  <Seq key="size-gap">     
    ...
  </Seq>
  ...
</Map>

This would make it possible to have different facetings for a single attribute. Of course the client needs to remember which faceting is based on which attribute if the key is not the attribute name. But I suppose that's not a real problem (:

If the facet parameter does not contain an "attribute" because the faceting algorithm does not use a single attribute or whatever, the "facetName" would be required, of course. The faceting algorithm would rather be specified by the "type" parameter anyway, instead of the name, or did I get this wrong? 

On the other hand: if we don't have a real need for the "facetName" parameter now, it should be left out. Let's keep it simple.
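
If we do keep it, the defaulting rule described above is small. A minimal Python sketch, assuming facet parameters arrive as plain maps; the helper names facet_key and index_facets are hypothetical, and rejecting duplicate keys is my own addition:

```python
def facet_key(facet):
    """Result key for a facet: "facetName" if given, else the attribute."""
    name = facet.get("facetName") or facet.get("attribute")
    if name is None:
        # No attribute to fall back on, so the name is required.
        raise ValueError('facet needs "attribute" or "facetName"')
    return name


def index_facets(facets):
    """Map result keys to facet parameters, rejecting duplicate keys."""
    keyed = {}
    for facet in facets:
        key = facet_key(facet)
        if key in keyed:
            raise ValueError("duplicate facet key: " + key)
        keyed[key] = facet
    return keyed


facets = [
    {"attribute": "type", "type": "enum"},
    {"facetName": "size-gap", "attribute": "size"},
]
keys = list(index_facets(facets))
```

Two facetings on the same attribute then only work if at least one of them carries an explicit "facetName".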

>> I assume that the $maxcount most relevant results would still be listed as “records” as in a “ungrouped” search additionally, at least optionally?
> Hm, not quite understanding you comment here. Do you want to have one of the grouped results be returned redundantly in the normal 
> results, i.e. a main group that is selected on its hit count?
> If no: plz explain further, especially what you mean by: the $maxcount most relevant results

I just wanted to make sure that the result record can still contain the standard "records" list of ungrouped results - if the search engine can produce it, of course:

{
  "count": 1234,
  "records": [...], // first 10 results (or whatever "maxcount" was set to in the request) ordered by ranking
  "groups": [...] // grouping result
}

> Anyhow, I have provided the option “_asMainResult” to define the main group.

I'm not sure if this is really necessary, but if you need it, it's OK with me.

Btw, parameters in the search request record do not need "_" prefixes, because attribute names (as defined in the index schema) should not appear on the top level. They are placed in a map under "query" instead. The query can be written either as a single query string (as in the "default search") or as a query record (as in the "advanced search").

>> attribute values vs. keys & dynamic groups
> I will go with your proposal.

Fine (:

> Not relevant anymore now, but I’m wondering if we should have one serialization format dictate the design…

No, it should not, of course, and it doesn't. I just think that JSON is more convenient for describing examples to human readers. It's equivalent to the XML representation in any case.
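
To illustrate the equivalence, here is a minimal Python sketch that maps a JSON-style structure onto the Map/Seq/Val elements used in the XML examples. It only covers strings and integers; type="long" for integers follows the "maxcount" example above, while the function name to_element is hypothetical and any enclosing record element is not modeled:

```python
import xml.etree.ElementTree as ET


def to_element(value, key=None):
    """Convert a dict/list/str/int into Map/Seq/Val elements."""
    attrs = {} if key is None else {"key": key}
    if isinstance(value, dict):
        elem = ET.Element("Map", attrs)
        for child_key, child in value.items():
            elem.append(to_element(child, child_key))
    elif isinstance(value, list):
        elem = ET.Element("Seq", attrs)
        for child in value:
            # Sequence members carry no key.
            elem.append(to_element(child))
    elif isinstance(value, bool):
        # bool is a subclass of int in Python; not handled in this sketch.
        raise TypeError("unsupported value type: bool")
    elif isinstance(value, int):
        attrs["type"] = "long"
        elem = ET.Element("Val", attrs)
        elem.text = str(value)
    elif isinstance(value, str):
        elem = ET.Element("Val", attrs)
        elem.text = value
    else:
        raise TypeError("unsupported value type: " + type(value).__name__)
    return elem


request = {
    "query": "tv",
    "groupby": [{"attribute": "type", "maxcount": 10}],
}
element = to_element(request)
print(ET.tostring(element, encoding="unicode"))
```

The mapping is purely mechanical in both directions, which is why I feel free to write examples in whichever form is shorter.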

Cheers,
Juergen.
