Package org.carrot2.clustering.stc
Class STCClusteringAlgorithm
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.clustering.stc.STCClusteringAlgorithm
- All Implemented Interfaces: AcceptingVisitor, ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm, largely as described in: Oren Zamir, Oren Etzioni,
Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. Some liberties were
taken where the description of STC was not clear enough or where improvements seemed
possible.
-
Field Summary
Fields
dictionaries: Per-request overrides of language components (dictionaries).
documentCountBoost: Base cluster document count boost.
ignoreWordIfInHigherDocsPercent: Ignore words appearing in more than the provided fraction of documents.
maxBaseClusters: Maximum number of base clusters.
maxClusters: Maximum number of final clusters to keep.
maxPhraseOverlap: Maximum cluster phrase overlap.
maxPhrasesPerLabel: Maximum number of phrases from base clusters to promote to the cluster's label.
maxWordsPerLabel: Maximum allowed number of words per label.
mergeStemEquivalentBaseClusters: Merge all stem-equivalent base clusters before running the merge phase.
mergeThreshold: Base cluster merge threshold.
minBaseClusterScore: Minimum base cluster score, before coverage merging.
minBaseClusterSize: Minimum required number of documents in a base cluster.
mostGeneralPhraseCoverage: Minimum coverage required for a phrase to appear in the cluster description.
static final String NAME
optimalPhraseLength: Optimal label length.
optimalPhraseLengthDev: Optimal cluster label length's tolerance.
preprocessing: Configuration of the text preprocessing stage.
final AttrString queryHint: Query terms used to retrieve documents.
scoreWeight: Balance between cluster score and size during cluster sorting.
singleTermBoost: Base cluster score override for single-term clusters.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
<T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents): Performs STC clustering of documents.
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept
Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm
optionalLanguageComponents, supports
-
Field Details
-
NAME
-
queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters. -
ignoreWordIfInHigherDocsPercent
Ignore words appearing in more than the provided fraction of documents. A number between 0 and 1; if a word occurs in a larger fraction of snippets than this ratio, it is ignored. -
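The document-frequency cut-off described above can be sketched in plain Java. This is only an illustration of the idea, not Carrot2's code: the class `DocFrequencyFilterSketch`, the method `wordsToIgnore` and the representation of documents as lists of words are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative stand-in, not part of Carrot2. */
final class DocFrequencyFilterSketch {
  /** Returns the words that occur in more than maxFraction of the documents. */
  static Set<String> wordsToIgnore(List<List<String>> docs, double maxFraction) {
    Map<String, Integer> docFrequency = new HashMap<>();
    for (List<String> doc : docs) {
      // Count each word at most once per document.
      for (String word : new HashSet<>(doc)) {
        docFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFrequency.entrySet()) {
      // Compare the document count against the allowed share of all documents.
      if (e.getValue() > maxFraction * docs.size()) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("suffix", "tree", "clustering"),
        List.of("suffix", "array"),
        List.of("suffix", "tree"),
        List.of("search", "results", "suffix"));
    System.out.println(wordsToIgnore(docs, 0.7));
  }
}
```

With a ratio of 0.7, only "suffix" (present in all four sample documents) is dropped; "tree" appears in half of them and is kept.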
minBaseClusterScore
Minimum base cluster score, before coverage merging. -
minBaseClusterSize
Minimum required number of documents in a base cluster. -
maxBaseClusters
Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase. -
maxClusters
Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded. -
mergeThreshold
Base cluster merge threshold. -
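The page does not say how the threshold is applied. In the original STC description by Zamir and Etzioni, two base clusters are connected when the documents they share make up more than half of each cluster. The sketch below generalizes that test to a configurable threshold; the class `MergeCriterionSketch`, its method names and the use of `>=` are assumptions for illustration, and the library's own criterion may differ.

```java
import java.util.BitSet;

/** Illustrative stand-in, not part of Carrot2. */
final class MergeCriterionSketch {
  /**
   * Two base clusters are treated as mergeable when the shared documents
   * cover at least mergeThreshold of each cluster's documents.
   */
  static boolean mergeable(BitSet docsA, BitSet docsB, double mergeThreshold) {
    if (docsA.isEmpty() || docsB.isEmpty()) {
      return false;
    }
    BitSet common = (BitSet) docsA.clone();
    common.and(docsB);
    double shared = common.cardinality();
    return shared / docsA.cardinality() >= mergeThreshold
        && shared / docsB.cardinality() >= mergeThreshold;
  }

  /** Builds a document set from document indices. */
  static BitSet docs(int... ids) {
    BitSet set = new BitSet();
    for (int id : ids) {
      set.set(id);
    }
    return set;
  }

  public static void main(String[] args) {
    BitSet a = docs(0, 1, 2, 3);
    BitSet b = docs(2, 3, 4);
    // Shared documents: 2 of 4 for a, 2 of 3 for b.
    System.out.println(mergeable(a, b, 0.5));
    System.out.println(mergeable(a, b, 0.6));
  }
}
```

Under this reading, a higher threshold demands more overlap before base clusters are combined, so fewer and tighter merged clusters result.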
maxPhraseOverlap
Maximum cluster phrase overlap. -
mostGeneralPhraseCoverage
Minimum coverage required for a phrase to appear in the cluster description. -
maxWordsPerLabel
Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed. -
maxPhrasesPerLabel
Maximum number of phrases from base clusters to promote to the cluster's label. -
singleTermBoost
Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function. -
optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score. -
optimalPhraseLengthDev
Tolerance of the optimal cluster label length. A factor in calculation of the base cluster score. -
documentCountBoost
Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster. -
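The page names singleTermBoost, optimalPhraseLength, optimalPhraseLengthDev and documentCountBoost as factors of the base cluster score but does not give the formula. The sketch below shows one plausible way such factors could combine, assuming a Gaussian-shaped penalty centered on the optimal phrase length and a single-term override that replaces that penalty. The class `BaseClusterScoreSketch`, the formula and the parameter order are assumptions, not the library's confirmed implementation.

```java
/** Illustrative stand-in, not part of Carrot2; the formula is an assumption. */
final class BaseClusterScoreSketch {
  static double score(
      int phraseLength,
      int documentCount,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double documentCountBoost,
      double singleTermBoost) {
    // More documents in the base cluster raise the score.
    double boost = documentCount * documentCountBoost;

    // Assumed reading of the override: single-term clusters skip the penalty.
    if (phraseLength == 1 && singleTermBoost > 0) {
      return boost * singleTermBoost;
    }

    // Assumed penalty: 1.0 at the optimal length, decaying with distance from it.
    double distance = phraseLength - optimalPhraseLength;
    double penalty =
        Math.exp(-(distance * distance) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    return boost * penalty;
  }

  public static void main(String[] args) {
    for (int length = 1; length <= 6; length++) {
      System.out.println(length + " words: " + score(length, 10, 3, 2.0, 1.0, 0.5));
    }
  }
}
```

In this model a larger optimalPhraseLengthDev flattens the penalty curve, so phrases far from the optimal length lose less score.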
scoreWeight
Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score. -
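The two endpoints described above are consistent with a linear blend of score and size. The sketch below assumes exactly that: clusters are ordered by `scoreWeight * score + (1 - scoreWeight) * size`, largest first. The class `ScoreWeightSortSketch` and the `Candidate` stand-in type are hypothetical; how the library treats values between 0.0 and 1.0 is not stated on this page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative stand-in, not part of Carrot2. */
final class ScoreWeightSortSketch {
  /** Hypothetical stand-in for a cluster: a label, a document count and a score. */
  static final class Candidate {
    final String label;
    final int size;
    final double score;

    Candidate(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Orders candidates by the assumed linear blend of score and size, descending. */
  static Comparator<Candidate> byWeightedScoreAndSize(double scoreWeight) {
    Comparator<Candidate> ascending =
        Comparator.comparingDouble(c -> scoreWeight * c.score + (1.0 - scoreWeight) * c.size);
    return ascending.reversed();
  }

  /** Returns the labels in sorted order without modifying the input list. */
  static List<String> sortedLabels(List<Candidate> clusters, double scoreWeight) {
    List<Candidate> copy = new ArrayList<>(clusters);
    copy.sort(byWeightedScoreAndSize(scoreWeight));
    List<String> labels = new ArrayList<>();
    for (Candidate c : copy) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<Candidate> clusters = List.of(
        new Candidate("large, low score", 10, 1.0),
        new Candidate("small, high score", 2, 5.0));
    System.out.println(sortedLabels(clusters, 0.0));
    System.out.println(sortedLabels(clusters, 1.0));
  }
}
```

With a weight of 0.0 the larger cluster comes first; with 1.0 the higher-scoring one does. Because score and size are on different scales, intermediate weights in this sketch favor whichever quantity has the larger numeric range.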
mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
-
preprocessing
Configuration of the text preprocessing stage. -
dictionaries
Per-request overrides of language components (dictionaries).
- Since: 4.1.0
-
-
Constructor Details
-
STCClusteringAlgorithm
public STCClusteringAlgorithm()
-
-
Method Details
-
requiredLanguageComponents
- Specified by: requiredLanguageComponents in interface ClusteringAlgorithm
- Returns: A set of classes required to be present in the LanguageComponents instance provided for clustering.
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Performs STC clustering of documents.
- Specified by: cluster in interface ClusteringAlgorithm
- Type Parameters:
T - Any subclass of Document. Clusters of objects of the same type are returned.
- Parameters:
docStream - A stream of documents for clustering.
languageComponents - LanguageComponents with a set of suppliers for the required language-specific components.
- Returns: A list of top-level clusters (clusters can form a hierarchy via Cluster.getClusters()).
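Because the returned clusters can be nested, callers usually walk the result recursively. The sketch below shows that traversal on a hypothetical stand-in type so that it runs without the library on the classpath: `Node`, its `label` field and `add` method are invented for this example, and only the `getClusters()` accessor name mirrors the one documented above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in, not part of Carrot2. */
final class ClusterHierarchySketch {
  /** Hypothetical cluster stand-in; only getClusters() mirrors the documented accessor. */
  static final class Node {
    final String label;
    private final List<Node> children = new ArrayList<>();

    Node(String label) {
      this.label = label;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }

    List<Node> getClusters() {
      return children;
    }
  }

  /** Flattens a cluster hierarchy into an indented outline, depth-first. */
  static List<String> outline(List<Node> topLevel) {
    List<String> lines = new ArrayList<>();
    collect(topLevel, 0, lines);
    return lines;
  }

  private static void collect(List<Node> clusters, int depth, List<String> out) {
    for (Node cluster : clusters) {
      out.add("  ".repeat(depth) + cluster.label);
      // Descend into subclusters, indenting one level further.
      collect(cluster.getClusters(), depth + 1, out);
    }
  }

  public static void main(String[] args) {
    Node search = new Node("Web Search").add(new Node("Search Results"));
    outline(List.of(search, new Node("Suffix Tree"))).forEach(System.out::println);
  }
}
```

The same depth-first pattern should carry over to the real result list by replacing `Node` with the library's cluster type.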
-