Package org.apache.lucene.analysis.standard
Standards-based analyzers implemented with JFlex.
The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:
StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, it does not emit URLs and email addresses as single tokens; they are instead split up into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, ClassicFilter, LowerCaseFilter and StopFilter.
UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
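The analyzers listed above all share the same shape: a tokenizer followed by a chain of filters. The sketch below illustrates that shape using only the JDK, so it runs without Lucene on the classpath. It is not Lucene code: the class AnalyzerChainSketch and its methods are hypothetical names, java.text.BreakIterator stands in for the JFlex-generated tokenizer (its word boundaries are not guaranteed to match the UAX#29 rules exactly), the stop word list is illustrative only, and the StandardFilter step is omitted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a tokenizer + filter chain; this is NOT Lucene code.
final class AnalyzerChainSketch {

    // Tokenizer stand-in: split on the JDK's word boundaries and keep only
    // segments that contain at least one letter or digit (drops spaces, punctuation).
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Filter stand-ins: lowercase each token, then drop tokens found in the stop set.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "a", "and"); // illustrative list only
        System.out.println(analyze("The Quick brown fox", stopWords)); // prints [quick, brown, fox]
    }
}
```

The order of the stages matters in this sketch: lowercasing runs before the stop word check, so "The" is matched against the lowercase entry "the" and removed.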
Interface Summary
Interface: Description
StandardTokenizerInterface: Internal interface for supporting versioned grammars.
Class Summary
Class: Description
ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter: Normalizes tokens extracted with ClassicTokenizer.
ClassicTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardAnalyzer: Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
StandardFilter: Normalizes tokens extracted with StandardTokenizer.
StandardTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
UAX29URLEmailAnalyzer: Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.