[KinoSearch] After search, which field/s scored highest

Peter Karman peter at peknet.com
Tue Nov 14 07:56:45 PST 2006



Marvin Humphrey scribbled on 11/13/06 11:47 PM:
> 
> On Nov 12, 2006, at 12:03 PM, Peter Karman wrote:
> 
>>> I'd reckon that the KS/Lucene "Field" is closer, conceptually, to the 
>>> fields in a database table.  You can divvy up a document by context 
>>> if you so choose, but a field could be other things as well: date 
>>> indexed, document id, etc.
>>
>> I use MetaNames the same way:
>>
>> <row>
>>  <date>123456789</date>
>>  <name>Cowboy</name>
>>  <age>34</age>
>> </row>
>>
>> MetaNames date name age
>> MetaNameAlias swishdefault row
> 
> Looking at the documentation for MetaNameAlias at 
> <http://www.swish-e.com/devel/devel_docs/swish-config.html#metanamealias>, 
> I gather that you would be able to search "row" for either "Cowboy" or 
> "34" and you'd hit the doc above -- correct?

yes. Though that only works for XML, iirc.

And you can also specify name=Cowboy to get more specific.
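
A toy sketch of that lookup behaviour in Python, for illustration only. This is not how Swish-e stores anything: `TinyIndex`, its methods, and the idea of copying every token into a catch-all default field are assumptions made up for the example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy fielded inverted index (hypothetical; not Swish-e's real structure)."""

    DEFAULT = "swishdefault"

    def __init__(self, aliases=None):
        # alias name -> real field name, e.g. {"row": "swishdefault"}
        self.aliases = dict(aliases or {})
        # (field, token) -> set of doc ids
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        for field, text in fields.items():
            for token in str(text).lower().split():
                # Index under the specific field and under the catch-all default.
                self.postings[(field, token)].add(doc_id)
                self.postings[(self.DEFAULT, token)].add(doc_id)

    def search(self, token, field=None):
        # Resolve aliases; with no field given, fall back to the default field.
        field = self.aliases.get(field, field) if field else self.DEFAULT
        return sorted(self.postings.get((field, token.lower()), set()))


index = TinyIndex(aliases={"row": "swishdefault"})
index.add(1, {"date": "123456789", "name": "Cowboy", "age": "34"})
```

Under these assumptions, `index.search("Cowboy", "row")` and `index.search("34", "row")` both find doc 1, while `index.search("Cowboy", "name")` is the more specific form and `index.search("Cowboy", "age")` finds nothing.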


> It's not clear to me how that would be useful for storing Swish 
> "properties" as I understand them.  What benefit would you get from 
> sorting those?  I'm imagining that the output of a search in Swish is a 
> series of doc_num => score pairs, as it is in KS and Lucene.  So you 
> have to do some sort of lookup based on document number to retrieve 
> them, and I don't see why you would want the properties database sorted 
> by anything other than document number.

here's a recent overview of the pre-sorted properties Swish uses:
http://article.gmane.org/gmane.comp.web.swish-e/6369/match=sort+properties

of course, the scheme mentioned there doesn't work well with large doc sets or 
incremental indexes, both of which KS excels at.
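
A rough Python sketch of the general pre-sorting technique, and of why it gets in the way of incremental indexing. It is a simplification written for this thread, not taken from the linked post or from Swish's code; the function names and sample data are invented.

```python
def build_presorted(prop_values):
    """Index time: map each doc number to the rank of its property value."""
    ordered = sorted(prop_values, key=lambda doc: prop_values[doc])
    return {doc: rank for rank, doc in enumerate(ordered)}


def sort_hits(hits, ranks):
    """Search time: order hits by precomputed rank, no property lookups."""
    return sorted(hits, key=lambda doc: ranks[doc])


names = {1: "Cowboy", 2: "Alice", 3: "Zed"}  # made-up "name" property
ranks = build_presorted(names)
hits_by_name = sort_hits([3, 1, 2], ranks)

# Adding one document can shift the rank of documents already indexed,
# so in this sketch the whole table is rebuilt.
names[4] = "Bob"
ranks_after_add = build_presorted(names)
```

Sorting results becomes a cheap integer comparison per hit, but doc 1 moves from rank 1 to rank 2 as soon as "Bob" is added, which is the cost that shows up with incremental updates.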

There are many optimizations Swish makes for speed that work against it when 
trying to scale. Or support UTF-8. Or do incremental indexing. Or...


> Out of curiosity, can you change up MetaNameBias on a token-by-token basis?
> 

no. per-MetaName only. But since you can assign MetaName on a token-by-token 
basis, I guess you could 'fake it.'

Swish users have asked for a bias per-token. I think that's a nice feature.
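
One way to picture the 'fake it' approach, as a hedged Python sketch: each token carries a MetaName, the bias is looked up per MetaName, and putting selected tokens under a specially biased MetaName gives them an effective per-token boost. The scoring formula and the names `score`, `boosted`, and `body` are invented for the example and are not Swish's ranking code.

```python
def score(tokens, query, bias):
    """Sum a base weight of 1 plus the MetaName's bias for each matching token.

    tokens: list of (token, metaname) pairs for one document.
    bias:   dict of metaname -> bias; MetaNames not listed get 0.
    """
    total = 0
    for token, metaname in tokens:
        if token == query:
            total += 1 + bias.get(metaname, 0)
    return total


bias = {"boosted": 5}  # per-MetaName bias only
doc = [
    ("cowboy", "boosted"),  # this occurrence was assigned the boosted MetaName
    ("cowboy", "body"),
    ("hat", "body"),
]
```

With these made-up numbers, `score(doc, "cowboy", bias)` is 7 and `score(doc, "hat", bias)` is 1, even though the bias table itself knows nothing about individual tokens.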



>> Yes, one of the big things I've been thinking about with Swish3 is 
>> that it would be nice to make it easy to experiment with different 
>> retrieval/ranking schemes. Sounds like you're on that track.
>>
>> fwiw, Xapian already implements something like this.
> 
> To the best of my knowledge, Xapian allows you to specify only the 
> term-weighting formula: BM25 vs. BM11 and such.

yes, you're right. Sorry. As my 12th grade English teacher used to say, I 
committed a faulty reference with my 'something like this'. "This" was intended 
to refer to the idea of storing weights with the term position, which Xapian 
does implement in the Document class, add_posting() method.

I'm growing to like the Xapian API for 'posting', 'term', 'data' and 'value'.
http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html

But that's not the 'rich position' idea. I get that.
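
To make the distinction concrete, here is a small Python stand-in that mimics the shape of Xapian's `add_posting(term, position, wdf_increment)` call. `ToyDocument` is a made-up class, not the Xapian binding. It only illustrates that the weight increment accumulates per term while positions are stored bare.

```python
class ToyDocument:
    """Made-up stand-in for a document object; not the real Xapian class."""

    def __init__(self):
        self.positions = {}  # term -> list of positions, no per-position data
        self.wdf = {}        # term -> accumulated within-document frequency

    def add_posting(self, term, pos, wdfinc=1):
        # Record the position and bump the term's within-document frequency.
        self.positions.setdefault(term, []).append(pos)
        self.wdf[term] = self.wdf.get(term, 0) + wdfinc


doc = ToyDocument()
doc.add_posting("cowboy", 1, wdfinc=3)  # e.g. an occurrence weighted up
doc.add_posting("cowboy", 7)            # ordinary occurrence, increment of 1
```

After these two calls the toy document holds positions `[1, 7]` and a single accumulated weight of 4 for "cowboy"; nothing records which position contributed which weight.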

> Lucene has the 
> Similarity class, where, by overriding certain functions like 
> lengthNorm() and tf(), you have control over basically the same thing, 
> term weighting.  I plan to make a variant of Similarity available in KS, 
> probably in version 0.20.
> 
> But for Lucy, I'm talking about something else.  I'm talking about 
> extending Lucy with e.g. link analysis data a la PageRank value, or 
> maybe aggregate vector data a la Latent Semantic Analysis.
> 

very cool.

> That sort of extensibility can be achieved by making it possible to 
> override the serialization and sorting functionality for Document, 
> Field, Posting, and such.  Happily, I've now been tasked with sorting, 
> and as sorting will depend on this new scheme, I'll have more to say 
> about it in a bit.
> 

nice.

-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com


