[KinoSearch] Wildcards

Father Chrysostomos sprout at cpan.org
Fri Feb 29 10:11:02 PST 2008


On Feb 26, 2008, at 8:30 PM, Marvin Humphrey wrote:

>
> On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:
>
>> I’ve written some code that follows approach #1 above, namely, it  
>> iterates through the posting lists one after the other, keeping a  
>> list of doc nums that have been seen. It counts them afterwards, to  
>> get an accurate ‘doc_freq’.
>
> PostingList objects have a get_doc_freq() method, so you can just do  
> this:
>
>  my $doc_freq = 0;
>  $doc_freq += $_->get_doc_freq for @posting_lists;

There is a problem with this approach that is best demonstrated with  
an example: If there are two documents, one containing ‘dog’ and  
‘dot’, and the other containing just ‘dog’, and the search term is  
‘do*’, then the doc freq should be 2, since the term matches two docs.  
The doc freqs of the individual terms (‘dog’ and ‘dot’) are 2 and 1,  
respectively, so if we add them together we get 3, and if we average  
them out, we get 1.5, neither of which is the right answer.
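
To make the arithmetic concrete, here is a minimal sketch in plain Perl. It does not touch the KinoSearch API at all: the %postings hash is a hypothetical stand-in for the doc nums that each matching posting list would yield, using the two-document example above.

```perl
use strict;
use warnings;

# Hypothetical stand-in data (not KinoSearch objects):
# each matching term maps to the doc nums in its posting list.
my %postings = (
    dog => [ 1, 2 ],
    dot => [ 1 ],
);

# Summing the per-term doc freqs counts doc 1 twice.
my $summed = 0;
$summed += scalar @$_ for values %postings;

# Recording each doc num once in a hash gives the number of distinct docs.
my %seen;
for my $doc_nums ( values %postings ) {
    $seen{$_} = 1 for @$doc_nums;
}
my $doc_freq = scalar keys %seen;

print "summed: $summed, distinct: $doc_freq\n";
```

The summed figure comes out as 3 and the distinct count as 2, which is why the doc nums have to be tracked rather than just the per-term totals.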

>
>
>> Is this something you would be willing to include in core, so I  
>> don’t have to repeat it in multiple subclasses?
>
>
> That approach, which is the one used in  
> KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good  
> option -- it just makes for the simplest and shortest code sample.
>
> First, you lose all the information other than document numbers.   
> When iterating over a PostingList, you'd typically want to access  
> info like the number of times the term appears in the document.

Thank you for pointing this out. I’ve just realised that the  
WildCardQuery implementation I’m working on iterates through them  
twice. I’ll optimise it later.
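
A rough sketch of what a single pass might look like, assuming each posting can report how many times its term occurs in the document. Again this is plain Perl over made-up data, not the KinoSearch API; the nested %postings hash and the counts in it are purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in data (not KinoSearch objects):
# term => { doc num => times the term occurs in that doc }.
my %postings = (
    dog => { 1 => 3, 2 => 1 },
    dot => { 1 => 2 },
);

# One pass over every posting: accumulate the per-document match count.
my %freq_for_doc;
for my $term ( keys %postings ) {
    my $docs = $postings{$term};
    $freq_for_doc{$_} += $docs->{$_} for keys %$docs;
}

# The distinct-document count falls out of the same hash.
my $doc_freq = scalar keys %freq_for_doc;

printf "doc %d: %d matches\n", $_, $freq_for_doc{$_}
    for sort { $a <=> $b } keys %freq_for_doc;
print "doc_freq: $doc_freq\n";
```

Keeping the per-document count in the same hash that deduplicates doc nums means the doc freq and the within-document frequency both come from one traversal.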


> It would be overkill to add a large, complex CompositePostingList  
> class to KS right now, in order to avoid short-term code duplication.

Fair enough.



