Package org.carrot2.text.preprocessing
Class CompletePreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.CompletePreprocessingPipeline
All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
Performs a complete preprocessing on the provided documents. The preprocessing consists of the
following steps:
InputTokenizer
CaseNormalizer
LanguageModelStemmer
StopListMarker
PhraseExtractor
LabelFilterProcessor
DocumentAssigner
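The order of the early steps can be illustrated with a toy sketch in plain Java. It does not use Carrot2 classes and is not the library's implementation: `ToyPipeline`, `tokenize`, `normalizeCase` and `markStopWords` are hypothetical names, simple lower-casing stands in for case normalization, and stemming, phrase extraction, label filtering and document assignment are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the first pipeline steps; not Carrot2 code.
final class ToyPipeline {
    /** Tokenizer stand-in: split on runs of characters that are neither letters nor digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Case normalizer stand-in (assumption: plain lower-casing). */
    static List<String> normalizeCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Stop list marker stand-in: one flag per token, true when the token is a stop word. */
    static List<Boolean> markStopWords(List<String> tokens, Set<String> stopWords) {
        List<Boolean> flags = new ArrayList<>();
        for (String t : tokens) {
            flags.add(stopWords.contains(t));
        }
        return flags;
    }

    public static void main(String[] args) {
        List<String> tokens = normalizeCase(tokenize("The Quick brown Fox"));
        List<Boolean> stop = markStopWords(tokens, Set.of("the", "a", "of"));
        // prints: [the, quick, brown, fox] [true, false, false, false]
        System.out.println(tokens + " " + stop);
    }
}
```

Each step consumes the output of the previous one, which is why stop-word marking in this sketch sees already normalized tokens.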
Field Summary
Fields (modifier and type, field, description):
- protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer: Case normalizer used by the algorithm.
- documentAssigner: Document assigner used by the algorithm, contains modifiable parameters.
- labelFilters: Label filtering is a composite of individual filters.
- final AttrInteger phraseDfThreshold: Phrase Document Frequency cut-off threshold.
- protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming: Stemmer used by the algorithm.
- protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker: Stop list marker used by the algorithm, contains modifiable parameters.
- protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer: Tokenizer used by the algorithm.
- final AttrInteger wordDfThreshold: Word Document Frequency cut-off threshold.
Fields inherited from class org.carrot2.attrs.AttrComposite:
attributes
Constructor Summary
Constructors:
- CompletePreprocessingPipeline()
Method Summary
Methods (modifier and type, method):
- PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Methods inherited from class org.carrot2.attrs.AttrComposite:
accept
Field Details
wordDfThreshold
Word Document Frequency cut-off threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
phraseDfThreshold
Phrase Document Frequency cut-off threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
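The document frequency cut-off applies the same rule to words and to phrases, and can be sketched in plain Java as follows. This is an illustration under assumptions, not Carrot2 code: `DfThresholdSketch`, `documentFrequency` and `retained` are hypothetical names, and documents are modelled as lists of already tokenized terms.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a document frequency (DF) cut-off; not Carrot2 code.
final class DfThresholdSketch {
    /** Counts, for each term, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set removes repeats so each document contributes at most 1 per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Keeps terms whose DF is at least dfThreshold; terms in fewer documents are ignored. */
    static Set<String> retained(List<List<String>> docs, int dfThreshold) {
        Set<String> kept = new TreeSet<>();
        documentFrequency(docs).forEach((term, count) -> {
            if (count >= dfThreshold) {
                kept.add(term);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("data", "mining", "data"),
            List.of("data", "clustering"),
            List.of("web", "clustering"));
        // prints: [clustering, data]
        System.out.println(retained(docs, 2));
    }
}
```

In the example, "data" occurs twice in the first document but is counted once there, so its document frequency is 2 rather than 3.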
labelFilters
Label filtering is a composite of individual filters.
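The idea of a composite filter can be sketched with plain `java.util.function.Predicate` values. This is a hypothetical illustration only: `CompositeLabelFilterSketch`, `allOf`, `apply` and the two example filters are made-up names, and the sketch assumes a label is kept only when every individual filter accepts it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of composing label filters; not Carrot2 code.
final class CompositeLabelFilterSketch {
    /** Builds a composite that accepts a label only if every individual filter accepts it. */
    static Predicate<String> allOf(List<Predicate<String>> filters) {
        return label -> {
            for (Predicate<String> filter : filters) {
                if (!filter.test(label)) {
                    return false;
                }
            }
            return true;
        };
    }

    /** Returns the labels accepted by the given filter, in their original order. */
    static List<String> apply(List<String> labels, Predicate<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String label : labels) {
            if (filter.test(label)) {
                kept.add(label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Predicate<String> minLength = label -> label.length() >= 3;    // example filter (assumption)
        Predicate<String> notNumeric = label -> !label.matches("\\d+"); // example filter (assumption)
        Predicate<String> composite = allOf(List.of(minLength, notNumeric));
        // prints: [data mining]
        System.out.println(apply(List.of("ai", "2024", "data mining"), composite));
    }
}
```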
documentAssigner
Document assigner used by the algorithm, contains modifiable parameters.
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains modifiable parameters.
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Constructor Details
CompletePreprocessingPipeline
public CompletePreprocessingPipeline()
Method Details
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Specified by:
preprocess in interface ContextPreprocessor