Interface Tokenizer

All Known Implementing Classes:
ExtendedWhitespaceTokenizer

public interface Tokenizer
Splits input characters into tokens such as words, digits, acronyms and punctuation. For each token, the following information is available:
token type
The kind of token: number, URI, punctuation, acronym and others. See the constants declared in this interface with the TT_ prefix, e.g. TT_TERM.
token flags
Additional flags, such as an indication of whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
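
The sketch below illustrates one way a token type and token flags could share a single short value and be read back with bit masks. It is an assumption-laden illustration, not the library's actual encoding: the class name TokenCodeSketch, the TYPE_MASK constant, the helper methods and every numeric value are hypothetical. Only the names TT_TERM and TF_SEPARATOR_SENTENCE come from the description above, and their values here are invented.

```java
public class TokenCodeSketch {
    // Hypothetical layout chosen for illustration only:
    // the low 8 bits hold the token type, the higher bits hold flags.
    static final int TYPE_MASK = 0x00FF;

    // Invented values; the real constants may differ.
    static final int TT_TERM = 0x0001;
    static final int TT_PUNCTUATION = 0x0002;
    static final short TF_SEPARATOR_SENTENCE = 0x0100;

    /** Extracts the type bits from a combined token code. */
    static int typeOf(short token) {
        return token & TYPE_MASK;
    }

    /** Checks whether the given flag bit is set in a combined token code. */
    static boolean hasFlag(short token, short flag) {
        return (token & flag) != 0;
    }

    /** True for a punctuation token that also carries the sentence-separator flag. */
    static boolean isSentenceDelimiter(short token) {
        return typeOf(token) == TT_PUNCTUATION && hasFlag(token, TF_SEPARATOR_SENTENCE);
    }

    public static void main(String[] args) {
        short period = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        short word = (short) TT_TERM;
        System.out.println(isSentenceDelimiter(period));
        System.out.println(isSentenceDelimiter(word));
    }
}
```

Keeping the type and the flags in disjoint bit ranges lets a caller test either one without allocating a token object. Whether the documented interface uses this exact layout is not stated here.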
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final short
    The current token is a common word.
    static final short
    The current token is part of the query.
    static final short
    Current token is a document separator (never returned from parsing).
    static final short
    Current token separates document's logical fields.
    static final short
    Current token is a sentence separator.
    static final short
    Current token terminates the input (never returned from parsing).
    static final int
     
    static final int
     
    static final int
     
    static final int
    Indicates the end of the token stream.
    static final int
     
    static final int
     
    static final int
     
    static final int
     
    static final int
     
    static final int
     
    static final int
     
  • Method Summary

    Modifier and Type
    Method
    Description
    short
    Returns the next token from the input stream.
    void
    reset(Reader reader)
    Resets the tokenizer to process new data from the given reader.
    void
    Sets the current token image to the provided buffer.
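
The reset-then-iterate pattern implied by the method summary can be sketched as follows. This is a self-contained stand-in, not the documented interface: MiniTokenizer, WhitespaceMiniTokenizer, the nextToken() and image() method names, and the TT_EOF and TT_TERM values are all hypothetical. Only reset(Reader reader) and the short return type are taken from the summary above. The real interface hands the token image back through a caller-provided buffer, which this sketch replaces with a plain String accessor for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    // Invented values for illustration only.
    static final short TT_EOF = -1;
    static final short TT_TERM = 1;

    /** Hypothetical, simplified stand-in for the documented interface. */
    interface MiniTokenizer {
        void reset(Reader reader);
        short nextToken() throws IOException;
        String image();
    }

    /** Splits on whitespace only; every token is reported as TT_TERM. */
    static class WhitespaceMiniTokenizer implements MiniTokenizer {
        private Reader reader;
        private final StringBuilder image = new StringBuilder();

        public void reset(Reader reader) {
            this.reader = reader;
            image.setLength(0);
        }

        public short nextToken() throws IOException {
            image.setLength(0);
            int c;
            // Skip leading whitespace.
            while ((c = reader.read()) != -1 && Character.isWhitespace(c)) {
                // nothing to do
            }
            if (c == -1) {
                return TT_EOF;
            }
            // Accumulate characters until whitespace or end of input.
            do {
                image.append((char) c);
            } while ((c = reader.read()) != -1 && !Character.isWhitespace(c));
            return TT_TERM;
        }

        public String image() {
            return image.toString();
        }
    }

    /** Resets the tokenizer on new text and collects all token images. */
    static List<String> tokens(MiniTokenizer tokenizer, String text) {
        List<String> out = new ArrayList<>();
        try {
            tokenizer.reset(new StringReader(text));
            while (tokenizer.nextToken() != TT_EOF) {
                out.add(tokenizer.image());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenizer tokenizer = new WhitespaceMiniTokenizer();
        // The same instance is reused for a second input via reset.
        System.out.println(tokens(tokenizer, "first input"));
        System.out.println(tokens(tokenizer, "second  input here"));
    }
}
```

Calling reset with a new reader lets one tokenizer instance serve many inputs, which is presumably why the documented interface exposes reset(Reader reader) rather than requiring a new object per document.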