Class ICUTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class ICUTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

    Words are first broken across script boundaries, then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.

    See Also:
    ICUTokenizerConfig
    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
    • Field Summary

      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
    • Constructor Summary

      Constructors 
      Constructor Description
      ICUTokenizer(Reader input)
      Construct a new ICUTokenizer that breaks text into words from the given Reader.
      ICUTokenizer(Reader input, ICUTokenizerConfig config)
      Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
    • Method Summary

      Modifier and Type Method Description
      void end()  
      boolean incrementToken()  
      void reset()  
      void reset(Reader input)  
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        close, correctOffset
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
    • Constructor Detail

      • ICUTokenizer

        public ICUTokenizer(Reader input)
        Construct a new ICUTokenizer that breaks text into words from the given Reader.

        The default script-specific handling is used.

        Parameters:
        input - Reader containing text to tokenize.
        See Also:
        DefaultICUTokenizerConfig
      • ICUTokenizer

        public ICUTokenizer(Reader input,
                            ICUTokenizerConfig config)
        Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
        Parameters:
        input - Reader containing text to tokenize.
        config - Tailored BreakIterator configuration.
    • Method Detail

      • incrementToken

        public boolean incrementToken()
                               throws IOException
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • reset

        public void reset()
                   throws IOException
        Overrides:
        reset in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • reset

        public void reset(Reader input)
                   throws IOException
        Overrides:
        reset in class org.apache.lucene.analysis.Tokenizer
        Throws:
        IOException
      • end

        public void end()
                 throws IOException
        Overrides:
        end in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
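
    A consumer normally calls the methods above in a fixed order: reset(), then incrementToken() until it returns false, then end(), and finally close(). The following is a minimal sketch of that loop, assuming the usual TokenStream consumer workflow applies to this class. ToyTokenizer is a hypothetical stand-in that compiles without Lucene on the classpath; it splits on whitespace only, and its term() and finalOffset() accessors are invented for illustration and are not part of ICUTokenizer.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical whitespace tokenizer that only mirrors the method names listed above. */
class ToyTokenizer implements Closeable {
    private Reader input;
    private final StringBuilder term = new StringBuilder();
    private int offset;      // characters consumed so far
    private int finalOffset; // recorded by end()

    ToyTokenizer(Reader input) {
        this.input = input;
    }

    /** Clears internal state; call before the first incrementToken(). */
    void reset() {
        term.setLength(0);
        offset = 0;
        finalOffset = 0;
    }

    /** Points the tokenizer at a new Reader so the same instance can be reused. */
    void reset(Reader newInput) {
        this.input = newInput;
        reset();
    }

    /** Advances to the next token; returns false once the input is exhausted. */
    boolean incrementToken() throws IOException {
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            offset++;
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return true; // whitespace closes the current token
                }
            } else {
                term.append((char) c);
            }
        }
        return term.length() > 0; // last token may end at end of input
    }

    /** Performs end-of-stream bookkeeping, here just the final offset. */
    void end() {
        finalOffset = offset;
    }

    String term() {
        return term.toString();
    }

    int finalOffset() {
        return finalOffset;
    }

    @Override
    public void close() throws IOException {
        input.close();
    }

    /** Runs reset / incrementToken / end; leaves close() to the caller so the instance can be reused. */
    static List<String> consume(ToyTokenizer tokenizer) {
        List<String> terms = new ArrayList<>();
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                terms.add(tokenizer.term());
            }
            tokenizer.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        ToyTokenizer tokenizer = new ToyTokenizer(new StringReader("one two"));
        System.out.println(consume(tokenizer));
        tokenizer.reset(new StringReader("three")); // reuse with a new Reader
        System.out.println(consume(tokenizer));
        tokenizer.close();
    }
}
```

    With the real class, the token text would be read from an attribute obtained through addAttribute (inherited from AttributeSource) rather than from an accessor like term().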