Package org.carrot2.language
Interface Tokenizer
- All Known Implementing Classes:
ExtendedWhitespaceTokenizer
public interface Tokenizer
Splits input characters into tokens representing e.g. words, digits, acronyms, punctuation. For
each token, the following information is available:
- token type
  Types of tokens: numbers, URIs, punctuation, acronyms and others. See all constants in this class declared with the TT_ prefix, e.g. TT_TERM.
- token flags
  Additional token flags, such as an indication whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
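To illustrate how a type and a flag might be read from one token code, here is a minimal sketch. It assumes that flags are OR-ed into the same short value as the type and that TYPE_MASK selects the type bits; this page does not spell that out. The numeric values are placeholders chosen for the example (the page lists constant names only), and TokenCodeSketch, typeOf and hasFlag are hypothetical helper names.

```java
// Sketch only: constant VALUES are placeholders assumed for this example.
// Only the constant NAMES come from the Tokenizer documentation.
final class TokenCodeSketch {
    static final int TYPE_MASK = 0x000F;                // assumed: low bits hold the type
    static final int TT_TERM = 0x0001;                  // assumed value
    static final int TT_PUNCTUATION = 0x0003;           // assumed value
    static final short TF_SEPARATOR_SENTENCE = 0x0100;  // assumed value

    /** Extracts the token type (one of the TT_ constants) from a token code. */
    static int typeOf(short tokenCode) {
        return tokenCode & TYPE_MASK;
    }

    /** Checks whether the given TF_ flag is set on a token code. */
    static boolean hasFlag(short tokenCode, short flag) {
        return (tokenCode & flag) != 0;
    }

    public static void main(String[] args) {
        // A sentence-ending period: punctuation type plus the sentence separator flag.
        short period = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        System.out.println(typeOf(period) == TT_PUNCTUATION);
        System.out.println(hasFlag(period, TF_SEPARATOR_SENTENCE));
        System.out.println(hasFlag((short) TT_TERM, TF_SEPARATOR_SENTENCE));
    }
}
```

With real code, the constants would be referenced from the Tokenizer interface itself rather than redefined.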
-
Field Summary
Fields
Modifier and Type    Field                    Description
static final short   TF_COMMON_WORD           The current token is a common word.
static final short   TF_QUERY_WORD            The current token is part of the query.
static final short   TF_SEPARATOR_DOCUMENT    Current token is a document separator (never returned from parsing).
static final short   TF_SEPARATOR_FIELD       Current token separates document's logical fields.
static final short   TF_SEPARATOR_SENTENCE    Current token is a sentence separator.
static final short   TF_TERMINATOR            Current token terminates the input (never returned from parsing).
static final int     TT_ACRONYM
static final int     TT_BARE_URL
static final int     TT_EMAIL
static final int     TT_EOF                   Indicates the end of the token stream.
static final int     TT_FILE
static final int     TT_FULL_URL
static final int     TT_HYPHTERM
static final int     TT_NUMERIC
static final int     TT_PUNCTUATION
static final int     TT_TERM
static final int     TYPE_MASK
-
Method Summary
Modifier and Type    Method                                  Description
short                nextToken()                             Returns the next token from the input stream.
void                 reset(Reader reader)                    Resets the tokenizer to process new data.
void                 setTermBuffer(MutableCharArray array)   Sets the current token image to the provided buffer.
-
Field Details
-
TYPE_MASK
static final int TYPE_MASK
-
TT_TERM
static final int TT_TERM
-
TT_NUMERIC
static final int TT_NUMERIC
-
TT_PUNCTUATION
static final int TT_PUNCTUATION
-
TT_EMAIL
static final int TT_EMAIL
-
TT_ACRONYM
static final int TT_ACRONYM
-
TT_FULL_URL
static final int TT_FULL_URL
-
TT_BARE_URL
static final int TT_BARE_URL
-
TT_FILE
static final int TT_FILE
-
TT_HYPHTERM
static final int TT_HYPHTERM
-
TT_EOF
static final int TT_EOF
Indicates the end of the token stream.
-
TF_SEPARATOR_SENTENCE
static final short TF_SEPARATOR_SENTENCE
Current token is a sentence separator.
-
TF_SEPARATOR_DOCUMENT
static final short TF_SEPARATOR_DOCUMENT
Current token is a document separator (never returned from parsing).
-
TF_SEPARATOR_FIELD
static final short TF_SEPARATOR_FIELD
Current token separates document's logical fields.
-
TF_TERMINATOR
static final short TF_TERMINATOR
Current token terminates the input (never returned from parsing).
-
TF_COMMON_WORD
static final short TF_COMMON_WORD
The current token is a common word. This flag is not directly available from the tokenizer.
-
TF_QUERY_WORD
static final short TF_QUERY_WORD
The current token is part of the query. This flag is not directly available from the tokenizer.
-
Method Details
-
reset
Resets the tokenizer to process new data.
Parameters:
reader - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
Throws:
IOException
-
nextToken
Returns the next token from the input stream.
Returns:
the type of the token as defined by the TT_TERM and other constants, or TT_EOF when the end of the data stream has been reached.
Throws:
IOException
-
setTermBuffer
Sets the current token image to the provided buffer.
Parameters:
array - buffer in which the current token's image should be stored
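The three methods above suggest a simple read loop: reset the tokenizer with a reader, call nextToken until TT_EOF, and fetch each token image through setTermBuffer. The sketch below shows that loop against a hypothetical stand-in, not the Carrot2 classes. SketchTokenizer and WhitespaceSketchTokenizer are invented for this example, StringBuilder stands in for MutableCharArray, the constant values are placeholders, and the toy implementation reports every whitespace-delimited run as TT_TERM.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    // Placeholder values, assumed for this sketch only.
    static final int TT_TERM = 1;
    static final int TT_EOF = -1;

    /** Hypothetical stand-in mirroring the three documented methods. */
    interface SketchTokenizer {
        void reset(Reader reader) throws IOException;
        short nextToken() throws IOException;
        void setTermBuffer(StringBuilder buffer); // StringBuilder replaces MutableCharArray here
    }

    /** Toy implementation: each whitespace-delimited run is one TT_TERM token. */
    static final class WhitespaceSketchTokenizer implements SketchTokenizer {
        private Reader reader;
        private final StringBuilder image = new StringBuilder();

        public void reset(Reader reader) {
            this.reader = reader;
        }

        public short nextToken() throws IOException {
            image.setLength(0);
            int c = reader.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = reader.read(); // skip leading whitespace
            }
            if (c == -1) {
                return (short) TT_EOF;
            }
            while (c != -1 && !Character.isWhitespace(c)) {
                image.append((char) c);
                c = reader.read();
            }
            return (short) TT_TERM;
        }

        public void setTermBuffer(StringBuilder buffer) {
            buffer.setLength(0);
            buffer.append(image); // copy the current token image into the caller's buffer
        }
    }

    /** Collects the images of all tokens produced for the given text. */
    static List<String> terms(String text) throws IOException {
        SketchTokenizer tokenizer = new WhitespaceSketchTokenizer();
        StringBuilder buffer = new StringBuilder();
        List<String> out = new ArrayList<>();
        // The tokenizer does not close the reader, so the caller does.
        try (Reader reader = new StringReader(text)) {
            tokenizer.reset(reader);
            while (tokenizer.nextToken() != TT_EOF) {
                tokenizer.setTermBuffer(buffer);
                out.add(buffer.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(terms("data mining  rocks"));
    }
}
```

The try-with-resources block reflects the note on reset that the tokenizer leaves the reader open at the end of the stream.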