Class ExtendedWhitespaceTokenizer

java.lang.Object
org.carrot2.language.ExtendedWhitespaceTokenizer
All Implemented Interfaces:
Tokenizer

public final class ExtendedWhitespaceTokenizer extends Object implements Tokenizer
A tokenizer that splits input on whitespace but can also extract more complex tokens, such as URLs, e-mail addresses and sentence delimiters.
  • Constructor Details

    • ExtendedWhitespaceTokenizer

      public ExtendedWhitespaceTokenizer()
  • Method Details

    • reset

      public void reset(Reader input)
      Resets this tokenizer to start parsing another stream.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      input - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
    • nextToken

      public short nextToken() throws IOException
      Description copied from interface: Tokenizer
      Returns the next token from the input stream.
      Specified by:
      nextToken in interface Tokenizer
      Returns:
      the type of the token, as defined by Tokenizer.TT_TERM and the other Tokenizer type constants, or Tokenizer.TT_EOF when the end of the data stream has been reached.
      Throws:
      IOException
    • setTermBuffer

      public void setTermBuffer(MutableCharArray array)
      Description copied from interface: Tokenizer
      Stores the current token's image in the provided buffer.
      Specified by:
      setTermBuffer in interface Tokenizer
      Parameters:
      array - buffer in which the current token's image should be stored
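
The three methods above are typically used as one cycle: reset the tokenizer on a reader, call nextToken until it reports the end of the stream, and fetch each token's image with setTermBuffer. The sketch below illustrates that call order only. So that it runs without the library on the classpath, it uses hypothetical stand-ins: SketchTokenizer mirrors the documented method names, PlainWhitespaceTokenizer is a toy implementation with no URL or e-mail detection, a StringBuilder takes the place of MutableCharArray, and the constant values are illustrative rather than the library's.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    /** Hypothetical stand-in mirroring the three methods documented above. */
    interface SketchTokenizer {
        short TT_TERM = 1;  // illustrative value, not the library's constant
        short TT_EOF = -1;  // illustrative value, not the library's constant

        void reset(Reader input);

        short nextToken() throws IOException;

        // Assumption: the real method takes a MutableCharArray instead.
        void setTermBuffer(StringBuilder buffer);
    }

    /** Toy implementation: splits on whitespace only. */
    static final class PlainWhitespaceTokenizer implements SketchTokenizer {
        private Reader input;
        private final StringBuilder image = new StringBuilder();

        @Override
        public void reset(Reader input) {
            this.input = input;
            image.setLength(0);
        }

        @Override
        public short nextToken() throws IOException {
            image.setLength(0);
            int c = input.read();
            // Skip leading whitespace.
            while (c != -1 && Character.isWhitespace(c)) {
                c = input.read();
            }
            if (c == -1) {
                return TT_EOF;
            }
            // Collect characters up to the next whitespace or end of stream.
            while (c != -1 && !Character.isWhitespace(c)) {
                image.append((char) c);
                c = input.read();
            }
            return TT_TERM;
        }

        @Override
        public void setTermBuffer(StringBuilder buffer) {
            buffer.setLength(0);
            buffer.append(image);
        }
    }

    /** Collects every token image using the reset / nextToken / setTermBuffer cycle. */
    static List<String> tokenImages(SketchTokenizer tokenizer, String text) throws IOException {
        List<String> images = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        // The caller owns the reader: the tokenizer is documented not to close it.
        try (Reader reader = new StringReader(text)) {
            tokenizer.reset(reader);
            while (tokenizer.nextToken() != SketchTokenizer.TT_EOF) {
                tokenizer.setTermBuffer(buffer);
                images.add(buffer.toString());
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        // One tokenizer instance is reused for two inputs via reset.
        SketchTokenizer tokenizer = new PlainWhitespaceTokenizer();
        System.out.println(tokenImages(tokenizer, "first input"));
        System.out.println(tokenImages(tokenizer, "second  input stream"));
    }
}
```

Against the real class, the same loop would presumably construct an ExtendedWhitespaceTokenizer, pass a MutableCharArray to setTermBuffer, and compare the result of nextToken with Tokenizer.TT_EOF. Because reset accepts a new reader, a single instance can be reused across inputs, and closing each reader remains the caller's responsibility.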