Class STCClusteringAlgorithm

java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.clustering.stc.STCClusteringAlgorithm
All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm

public final class STCClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm, implemented largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper where STC's description was unclear or where improvements seemed possible.
  • Field Details

    • NAME

      public static final String NAME
    • queryHint

      public final AttrString queryHint
      Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
    • ignoreWordIfInHigherDocsPercent

      public AttrDouble ignoreWordIfInHigherDocsPercent
      Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this is ignored.
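      The fraction threshold can be illustrated with a minimal, self-contained sketch. It is not Carrot2 code: the class `DocFrequencyFilterSketch`, the method `wordsToIgnore` and the pre-tokenized input are assumptions made for illustration, and the library's actual preprocessing may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Carrot2 API.
final class DocFrequencyFilterSketch {

  /**
   * Returns the words that occur in more than maxDocFraction of the documents.
   * Each document is assumed to be already tokenized into words.
   */
  static Set<String> wordsToIgnore(List<List<String>> docs, double maxDocFraction) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : docs) {
      // Count each word at most once per document (document frequency).
      for (String word : new HashSet<>(doc)) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }

    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
      if (entry.getValue() > maxDocFraction * docs.size()) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("data", "mining", "tools"),
        List.of("data", "mining", "course"),
        List.of("data", "warehouse"),
        List.of("data", "science"));
    // "data" occurs in 4 of 4 documents, which exceeds 0.7 * 4 = 2.8.
    System.out.println(wordsToIgnore(docs, 0.7));
  }
}
```

      In this sketch a word that appears in every document carries no discriminating information, so it is dropped before any clusters are formed.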
    • minBaseClusterScore

      public AttrDouble minBaseClusterScore
      Minimum base cluster score, before coverage merging.
    • minBaseClusterSize

      public AttrInteger minBaseClusterSize
      Minimum required number of documents in a base cluster.
    • maxBaseClusters

      public AttrInteger maxBaseClusters
      Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase.
    • maxClusters

      public AttrInteger maxClusters
      Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.
    • mergeThreshold

      public AttrDouble mergeThreshold
      Base cluster merge threshold.
    • maxPhraseOverlap

      public AttrDouble maxPhraseOverlap
      Maximum cluster phrase overlap.
    • mostGeneralPhraseCoverage

      public AttrDouble mostGeneralPhraseCoverage
      Minimum coverage required for a phrase to appear in cluster description.
    • maxWordsPerLabel

      public AttrInteger maxWordsPerLabel
      Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.
    • maxPhrasesPerLabel

      public AttrInteger maxPhrasesPerLabel
      Maximum number of phrases from base clusters to promote to the cluster's label.
    • singleTermBoost

      public AttrDouble singleTermBoost
      Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
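      The override rule can be sketched as a small, self-contained function. The names `SingleTermBoostSketch`, `baseClusterScore` and `penalizedScore` are hypothetical, and the penalty function itself is taken as an already computed input because this page does not define it.

```java
// Hypothetical illustration only; not part of the Carrot2 API.
final class SingleTermBoostSketch {

  /**
   * Picks the score of a base cluster.
   *
   * @param phraseTermCount number of terms in the base cluster's phrase
   * @param penalizedScore score produced by the (unspecified) penalty function
   * @param singleTermBoost the override value; ignored when not greater than zero
   */
  static double baseClusterScore(int phraseTermCount, double penalizedScore, double singleTermBoost) {
    if (phraseTermCount == 1 && singleTermBoost > 0) {
      // Single-term cluster with an active override: the penalty result is discarded.
      return singleTermBoost;
    }
    return penalizedScore;
  }

  public static void main(String[] args) {
    System.out.println(baseClusterScore(1, 0.2, 0.5)); // override applies
    System.out.println(baseClusterScore(1, 0.2, 0.0)); // override disabled
    System.out.println(baseClusterScore(3, 1.7, 0.5)); // multi-term phrase, override not applicable
  }
}
```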
    • optimalPhraseLength

      public AttrInteger optimalPhraseLength
      Optimal label length. A factor in calculation of the base cluster score.
    • optimalPhraseLengthDev

      public AttrDouble optimalPhraseLengthDev
      Tolerance of the optimal cluster label length. A factor in calculation of the base cluster score.
    • documentCountBoost

      public AttrDouble documentCountBoost
      Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
    • scoreWeight

      public AttrDouble scoreWeight
      Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only; a value of 1.0 sorts clusters by score only.
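      The two endpoints can be illustrated with a short sketch. The weighting used here, size^(1 - w) * score^w, is an assumption chosen because it reduces to size alone at 0.0 and to score alone at 1.0; this page does not state the formula the library applies. `ScoreWeightSortSketch` and `ClusterInfo` are hypothetical names, not Carrot2 types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the Carrot2 API.
final class ScoreWeightSortSketch {

  /** Minimal stand-in for a cluster: a label, a score and a document count. */
  static final class ClusterInfo {
    final String label;
    final double score;
    final int size;

    ClusterInfo(String label, double score, int size) {
      this.label = label;
      this.score = score;
      this.size = size;
    }
  }

  /** Assumed weighting: equals size when scoreWeight is 0.0 and score when it is 1.0. */
  static double rank(ClusterInfo c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  /** Returns cluster labels ordered from the highest rank to the lowest. */
  static List<String> sortedLabels(List<ClusterInfo> clusters, double scoreWeight) {
    List<ClusterInfo> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble((ClusterInfo c) -> rank(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (ClusterInfo c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<ClusterInfo> clusters = List.of(
        new ClusterInfo("A", 5.0, 2),
        new ClusterInfo("B", 1.0, 10),
        new ClusterInfo("C", 3.0, 4));
    System.out.println(sortedLabels(clusters, 0.0)); // by size: [B, C, A]
    System.out.println(sortedLabels(clusters, 1.0)); // by score: [A, C, B]
  }
}
```

      Intermediate values trade the two criteria off against each other; how exactly they do so depends on the weighting actually used.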
    • mergeStemEquivalentBaseClusters

      public AttrBoolean mergeStemEquivalentBaseClusters
      Merge all stem-equivalent base clusters before running the merge phase.
      See Also:
      • "http://issues.carrot2.org/browse/CARROT-1008"
    • preprocessing

      public BasicPreprocessingPipeline preprocessing
      Configuration of the text preprocessing stage.
    • dictionaries

      public EphemeralDictionaries dictionaries
      Per-request overrides of language components (dictionaries).
      Since:
      4.1.0
  • Constructor Details

    • STCClusteringAlgorithm

      public STCClusteringAlgorithm()
  • Method Details