Class CarmelUniformTermPruningPolicy
- java.lang.Object
  - org.apache.lucene.index.pruning.PruningPolicy
    - org.apache.lucene.index.pruning.TermPruningPolicy
      - org.apache.lucene.index.pruning.CarmelUniformTermPruningPolicy
public class CarmelUniformTermPruningPolicy extends TermPruningPolicy
Enhanced implementation of Carmel Uniform Pruning. TermPositions whose in-document frequency is below a specified threshold are pruned. See CarmelTopKTermPruningPolicy for a link to the paper describing this policy.
Conclusions of that paper indicate that it is best to compute per-term thresholds, as is done in CarmelTopKTermPruningPolicy. However, for large indexes with a large number of terms that method might be too slow, and the (enhanced) uniform approach implemented here may be faster, although it might produce inferior search quality.
This implementation enhances the Carmel uniform pruning approach by allowing three levels of thresholds to be specified:
- one default threshold, applied globally (for terms in all fields)
- a threshold per field
- a threshold per term
These thresholds are applied so that the most specific one always takes precedence: a per-term threshold is used first if present, then a per-field threshold if present, and finally the default threshold.
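The precedence rule can be sketched in a few lines of plain Java. This is an illustration only, not the class's actual code: `ThresholdLookupSketch`, its `resolve` method, the `Map<String, Float>` value type, and the sample values in the usage note are all assumptions. The only details taken from this page are the lookup order and the key formats (a field name, or a term written as `field:text`).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene) that picks the most specific threshold. */
final class ThresholdLookupSketch {
    // Keys are either a field name ("body") or a term in field:text form ("body:lucene").
    private final Map<String, Float> thresholds;
    private final float defaultThreshold;

    ThresholdLookupSketch(Map<String, Float> thresholds, float defaultThreshold) {
        this.thresholds = new HashMap<>(thresholds); // defensive copy
        this.defaultThreshold = defaultThreshold;
    }

    /** Returns the per-term threshold if present, else per-field, else the default. */
    float resolve(String field, String text) {
        Float perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Float perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }
}
```

With a map holding `body` and `body:lucene` entries, `resolve("body", "lucene")` would return the per-term value, `resolve("body", "index")` the per-field value, and `resolve("title", "lucene")` the default.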
Thresholds are maintained in a map, keyed by either field names or terms in field:text format.
Thresholds in this method of pruning are expressed as the percentage of the top-N scoring documents per term that are retained. The list of top-N documents is established by using a regular IndexSearcher and Similarity to run a simple TermQuery.
A smaller threshold value will produce a smaller index. See TermPruningPolicy for size vs. performance considerations.
For indexes with a large number of terms this policy might still be too slow, since it issues a term query for each term in the index. In such situations, the term frequency pruning approach in TFTermPruningPolicy will be faster, though it might produce inferior search quality.
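To make the "percentage of top-scoring documents retained" idea concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The `TopPercentSketch` class, the `Hit` pair, the 0 to 100 percent scale, and rounding the cutoff up are all assumptions made for illustration; in the real policy the scores would come from running a TermQuery, and the exact cutoff arithmetic may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration: keep only the best-scoring share of a term's documents. */
final class TopPercentSketch {

    /** Stand-in for one search hit: a document number and its score for the term. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Returns the document numbers whose postings would be retained,
     * best score first. percent is assumed to be in the range 0..100.
     */
    static List<Integer> retainedDocs(List<Hit> hits, double percent) {
        List<Hit> sorted = new ArrayList<>(hits);
        // Highest score first; the sort is stable, so ties keep their input order.
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        // Assumption: round up, and clamp to the number of hits available.
        int keep = (int) Math.ceil(sorted.size() * percent / 100.0);
        keep = Math.max(0, Math.min(keep, sorted.size()));

        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            docs.add(sorted.get(i).doc);
        }
        return docs;
    }
}
```

Under these assumptions, postings for documents outside the returned list are the ones that would be pruned, which is why a smaller threshold yields a smaller index.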
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class CarmelUniformTermPruningPolicy.ByDocComparator
-
Field Summary
-
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
fieldFlags, in
-
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
-
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t)
  Called when moving TermPositions to a new Term.
- boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t)
  Prune all postings per term (invoked once per term per doc).
- int pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm)
  Prune some postings per term (invoked once per term per doc).
- boolean pruneTermEnum(org.apache.lucene.index.TermEnum te)
  Pruning of all postings for a term (invoked once per term).
- int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector tfv)
  Pruning of individual terms in term vectors.
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
pruneAllFieldPostings, prunePayload, pruneWholeTermVector
Method Detail
-
pruneTermEnum
public boolean pruneTermEnum(org.apache.lucene.index.TermEnum te) throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).
- Specified by: pruneTermEnum in class TermPruningPolicy
- Parameters:
  te - positioned term enum.
- Returns: true if all postings for this term should be removed, false otherwise.
- Throws: IOException
-
initPositionsTerm
public void initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t) throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.
- Specified by: initPositionsTerm in class TermPruningPolicy
- Parameters:
  tp - input term positions
  t - current term
- Throws: IOException
-
pruneAllPositions
public boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t) throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc).
- Specified by: pruneAllPositions in class TermPruningPolicy
- Parameters:
  termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
  t - current term
- Returns: true if the current posting should be removed, false otherwise.
- Throws: IOException
-
pruneTermVectorTerms
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector tfv) throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.
- Specified by: pruneTermVectorTerms in class TermPruningPolicy
- Parameters:
  docNumber - document number
  field - field name
  terms - array of terms
  freqs - array of term frequencies
  tfv - the original term frequency vector
- Returns: 0 if no terms are to be removed, or a positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
- Throws: IOException
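The return contract for term vector pruning (null out the removed entries and return how many were removed) can be illustrated without any Lucene types. The sketch below is hypothetical: `TermVectorPruningSketch`, `pruneLowFrequencyTerms`, and the minimum-frequency rule are assumptions chosen only to show the contract, and do not describe how this policy actually selects terms.

```java
/** Hypothetical illustration of the "set to null and return the count" contract. */
final class TermVectorPruningSketch {

    /**
     * Marks every term whose frequency is below minFreq for removal by
     * setting its slot in terms to null. Returns the number of slots
     * nulled by this call, or 0 if nothing was removed.
     */
    static int pruneLowFrequencyTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            // Skip slots that are already null so the count matches the new nulls.
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null;
                removed++;
            }
        }
        return removed;
    }
}
```

The caller can then rebuild the term vector from the non-null entries; the returned count tells it how many entries to expect to be missing.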
-
pruneSomePositions
public int pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).
- Specified by: pruneSomePositions in class TermPruningPolicy
- Parameters:
  docNum - current document number
  positions - original term positions in the document (and indirectly term frequency)
  curTerm - current term
- Returns: 0 if no postings are to be removed, or a positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
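The contract for partial position pruning (mark removed positions with -1 and return the number marked) can be shown with a small plain-Java sketch. `PositionPruningSketch`, `pruneAfterFirst`, and the "keep only the first few positions" rule are assumptions for illustration; they are not taken from this policy's implementation.

```java
/** Hypothetical illustration of the "set to -1 and return the count" contract. */
final class PositionPruningSketch {

    /**
     * Keeps the first maxKept entries of positions and marks every later
     * entry for removal by overwriting it with -1. Returns the number of
     * entries marked by this call, or 0 if nothing was removed.
     */
    static int pruneAfterFirst(int[] positions, int maxKept) {
        int removed = 0;
        for (int i = Math.max(0, maxKept); i < positions.length; i++) {
            // Entries already marked with -1 are not counted again.
            if (positions[i] != -1) {
                positions[i] = -1;
                removed++;
            }
        }
        return removed;
    }
}
```

Because the array length also conveys the in-document term frequency, a caller can derive the new frequency as the original length minus the returned count.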