TfIdfCategorizer (Nuxeo ECM Projects 7.2 API)

java.lang.Object
- org.nuxeo.ecm.platform.categorization.categorizer.tfidf.PrimitiveVectorHelper
- - org.nuxeo.ecm.platform.categorization.categorizer.tfidf.TfIdfCategorizer

All Implemented Interfaces:

Serializable, Categorizer
```
public class TfIdfCategorizer
extends PrimitiveVectorHelper
implements Categorizer, Serializable
```
Maintains an in-memory map of TF count vectors (for a few reference documents or topics only), along with a common IDF estimate computed from all previously seen text content.
See: http://en.wikipedia.org/wiki/Tfidf
Classification is then achieved using the cosine similarity between the TF-IDF of the document to classify and the registered topics.
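
The following is a minimal, self-contained sketch of the idea described above: weight term counts by IDF, then compare a document to each topic by cosine similarity. It does not reproduce the Nuxeo implementation (which works on hashed primitive vectors); the class name `TfIdfSketch`, its method names, and the smoothed IDF formula are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NOT the Nuxeo implementation.
class TfIdfSketch {

    // Raw term counts for one tokenized document (the TF part).
    static Map<String, Integer> counts(List<String> terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : terms) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Smoothed IDF, log((1 + N) / (1 + df)) + 1. The exact formula is an assumption.
    static Map<String, Double> idf(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String t : counts(doc).keySet()) {
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        int n = corpus.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((1.0 + n) / (1.0 + e.getValue())) + 1.0);
        }
        return idf;
    }

    // TF-IDF vector of a document; terms unknown to the corpus get weight 0 (assumption).
    static Map<String, Double> tfidf(List<String> terms, Map<String, Double> idf) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts(terms).entrySet()) {
            v.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return v;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            nb += x * x;
        }
        if (na == 0 || nb == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Because all weights are non-negative, the similarity computed this way falls between 0 and 1, which matches the range documented for getSimilarities.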

See Also:
Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static org.apache.commons.logging.Log`	`log`

Constructor Summary

Constructors
Constructor and Description
`TfIdfCategorizer()`
`TfIdfCategorizer(int dim)`

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`disableUpdate()` Precompute all the TF-IDF vectors and unload the original count vectors to spare some memory.
`static Float`	`findMedian(Map<String,Float> sortedMap)`
`org.apache.lucene.analysis.Analyzer`	`getAnalyzer()`
`double`	`getDensity()`
`int`	`getDimension()`
`Map<String,Float>`	`getSimilarities(List<String> terms)` For each registered topic, compute the cosine similarity of the TFIDF vector of the topic and the one of the document given by a list of tokens.
`Map<String,Float>`	`getSimilarities(String allThePets)` For each registered topic, compute the cosine similarity of the TFIDF vector of the topic and the one of the document.
`Set<String>`	`getTopicNames()`
`HashingVectorizer`	`getVectorizer()`
`List<String>`	`guessCategories(String textContent, int maxSuggestions)` Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
`List<String>`	`guessCategories(String textContent, int maxSuggestions, Double precisionThreshold)` Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
`void`	`learnFiles(File folder)` Utility method to initialize the parameters from a set of UTF-8 encoded text files with names used as topic names.
`static TfIdfCategorizer`	`load(InputStream in)` Load a TfIdfCategorizer instance from its compressed binary representation.
`static TfIdfCategorizer`	`load(String modelPath)` Load a TfIdfCategorizer instance from its compressed binary representation from a named resource in the classloading path of the current thread.
`static void`	`main(String[] args)`
`void`	`saveToFile(File file)` Save the model to a compressed binary format on the filesystem.
`void`	`saveToStream(OutputStream out)` Save a compressed binary representation of the trained model.
`static Map<String,Float>`	`sortByDecreasingValue(Map<String,Float> map)`
`List<String>`	`tokenize(String textContent)`
`void`	`update(String topicName, List<String> terms)` Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic.
`void`	`update(String topicName, String textContent)` Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic.

Methods inherited from class org.nuxeo.ecm.platform.categorization.categorizer.tfidf.PrimitiveVectorHelper
add, dot, normOf, sum

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - log
```
public static final org.apache.commons.logging.Log log
```
- Constructor Detail
  - TfIdfCategorizer
```
public TfIdfCategorizer()
```
  - TfIdfCategorizer
```
public TfIdfCategorizer(int dim)
```
- Method Detail
  - getVectorizer
```
public HashingVectorizer getVectorizer()
```
  - getAnalyzer
```
public org.apache.lucene.analysis.Analyzer getAnalyzer()
```
  - disableUpdate
```
public void disableUpdate()
```
    Precompute all the TF-IDF vectors and unload the original count vectors to spare some memory. Updates won't be possible any more.
  - update
```
public void update(String topicName,
          List<String> terms)
```
    Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. Warning: this method is not thread-safe; it should not be used concurrently with getSimilarities(List).
    
    Parameters:
    topicName - the name of the document topic or category
    terms - the list of document tokens (use a Lucene analyzer to extract them, for instance)
  - update
```
public void update(String topicName,
          String textContent)
```
    Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. Warning: this method is not thread-safe; it should not be used concurrently with getSimilarities(List).
    
    Parameters:
    topicName - the name of the document topic or category
    textContent - textual content to be tokenized and analyzed
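
    Since update is documented as unsafe to call concurrently with getSimilarities, callers that train and query from several threads need their own synchronization. One possible approach, sketched below, is a read/write lock around a model object. `GuardedModelSketch` and its toy per-topic counter are hypothetical stand-ins, not part of the Nuxeo API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper: a toy "model" guarded by a read/write lock.
class GuardedModelSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for the real model state: number of terms seen per topic.
    private final Map<String, Integer> termCountsByTopic = new HashMap<>();

    // Writers take the exclusive lock, so no reader sees a half-applied update.
    void update(String topicName, List<String> terms) {
        lock.writeLock().lock();
        try {
            termCountsByTopic.merge(topicName, terms.size(), Integer::sum);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers share the lock with each other but are blocked while an update runs.
    int termCount(String topicName) {
        lock.readLock().lock();
        try {
            return termCountsByTopic.getOrDefault(topicName, 0);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```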
  - getSimilarities
```
public Map<String,Float> getSimilarities(List<String> terms)
```
    For each registered topic, compute the cosine similarity of the TFIDF vector of the topic and the one of the document given by a list of tokens.
    
    Parameters:
    terms - a tokenized document.
    
    Returns:
    a map of topic names to float values from 0 to 1, sorted by decreasing value.
  - getSimilarities
```
public Map<String,Float> getSimilarities(String allThePets)
```
    For each registered topic, compute the cosine similarity of the TFIDF vector of the topic and the one of the document.
    
    Parameters:
    allThePets - the document to be tokenized and analyzed
    
    Returns:
    a map of topic names to float values from 0 to 1 sorted by reverse value.
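
    A `Map` only has a meaningful order if the implementation preserves one. As an illustration of how a result ordered by score could be built, the sketch below copies entries into a `LinkedHashMap` after sorting them by decreasing value. The class and method names are hypothetical, and this is not claimed to be how sortByDecreasingValue is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: order a score map by decreasing value.
class SortedSimilaritiesSketch {
    static Map<String, Float> byDecreasingValue(Map<String, Float> scores) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        // Highest score first
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        // LinkedHashMap iterates in insertion order, so the sort order is kept
        Map<String, Float> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }
}
```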
  - getDimension
```
public int getDimension()
```
  - learnFiles
```
public void learnFiles(File folder)
                throws IOException
```
    Utility method to initialize the parameters from a set of UTF-8 encoded text files with names used as topic names.
    The content of each file is assumed to be lines of terms separated by whitespace, without punctuation.
    
    Parameters:
    folder - the folder containing the text files
    
    Throws:
    
    IOException
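
    To make the expected input layout concrete, here is a sketch that reads such a folder into a map of topic name to terms. It is an illustration, not the Nuxeo code: `LearnFilesSketch` and `readTopics` are hypothetical names, and using the full file name (extension included) as the topic name is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reader for a folder of UTF-8 topic files.
class LearnFilesSketch {
    static Map<String, List<String>> readTopics(Path folder) throws IOException {
        Map<String, List<String>> topics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-folders
                }
                List<String> terms = new ArrayList<>();
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    // Terms are separated by whitespace; blank lines yield nothing
                    for (String term : line.trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            terms.add(term);
                        }
                    }
                }
                // Assumption: the file name is used unchanged as the topic name
                topics.put(file.getFileName().toString(), terms);
            }
        }
        return topics;
    }
}
```

    Each entry of the resulting map could then be passed to update(String topicName, List<String> terms).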
  - saveToFile
```
public void saveToFile(File file)
                throws IOException
```
    Save the model to a compressed binary format on the filesystem.
    
    Parameters:
    file - where to write the model
    
    Throws:
    
    IOException
  - saveToStream
```
public void saveToStream(OutputStream out)
                  throws IOException
```
    Save a compressed binary representation of the trained model.
    
    Parameters:
    out - the output stream to write to
    
    Throws:
    
    IOException
  - load
```
public static TfIdfCategorizer load(InputStream in)
                             throws IOException,
                                    ClassNotFoundException
```
    Load a TfIdfCategorizer instance from its compressed binary representation.
    
    Parameters:
    in - the input stream to read from
    
    Returns:
    a new instance with parameters coming from the saved version
    
    Throws:
    
    IOException
    
    ClassNotFoundException
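
    The class is Serializable and load can throw ClassNotFoundException, which suggests Java object serialization. The compression format is not stated on this page, so the round trip below uses GZIP purely as an assumption. `ModelIoSketch` and `ToyModel` are hypothetical stand-ins for the real model class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical save/load pair: Java serialization wrapped in GZIP (assumed format).
class ModelIoSketch {

    // Toy stand-in for a trained model
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, float[]> topicVectors = new HashMap<>();
    }

    static void save(ToyModel model, OutputStream out) throws IOException {
        // Closing the object stream also finishes the GZIP stream and closes out
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(model);
        }
    }

    static ToyModel load(InputStream in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in))) {
            return (ToyModel) ois.readObject();
        }
    }
}
```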
  - getDensity
```
public double getDensity()
```
  - getTopicNames
```
public Set<String> getTopicNames()
```
  - load
```
public static TfIdfCategorizer load(String modelPath)
                             throws IOException,
                                    ClassNotFoundException
```
    Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread.
    
    Parameters:
    modelPath - the path of the file model in the classloading path
    
    Returns:
    a new instance with parameters coming from the saved version
    
    Throws:
    
    IOException
    
    ClassNotFoundException
  - main
```
public static void main(String[] args)
                 throws FileNotFoundException,
                        IOException,
                        ClassNotFoundException
```
    Throws:
    
    FileNotFoundException
    
    IOException
    
    ClassNotFoundException
  - guessCategories
```
public List<String> guessCategories(String textContent,
                           int maxSuggestions)
```
    Description copied from interface: Categorizer
    
    Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
    
    Specified by:
    
    guessCategories in interface Categorizer
  - guessCategories
```
public List<String> guessCategories(String textContent,
                           int maxSuggestions,
                           Double precisionThreshold)
```
    Description copied from interface: Categorizer
    
    Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
    
    Specified by:
    
    guessCategories in interface Categorizer
    
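    Parameters:
    precisionThreshold - the precision threshold, or null to use the default threshold of the implementation.
    
    Returns:
    a list of suggested categories, sorted by decreasing confidence.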
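
    This page does not define how maxSuggestions and precisionThreshold interact. One plausible reading, sketched here as an assumption, is to walk the scores in decreasing order, keep those at or above the threshold, and stop after maxSuggestions results. `SuggestionSketch`, `suggest`, and the placeholder default of 0.5 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical selection step: top-N topics above a minimum score.
class SuggestionSketch {
    // Arbitrary placeholder; the real default threshold is not documented here
    static final double DEFAULT_THRESHOLD = 0.5;

    // sortedScores is assumed to iterate from highest to lowest score
    static List<String> suggest(Map<String, Float> sortedScores, int maxSuggestions, Double threshold) {
        double min = threshold != null ? threshold : DEFAULT_THRESHOLD;
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Float> e : sortedScores.entrySet()) {
            if (suggestions.size() >= maxSuggestions) {
                break;
            }
            if (e.getValue() >= min) {
                suggestions.add(e.getKey());
            }
        }
        return suggestions;
    }
}
```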
  - tokenize
```
public List<String> tokenize(String textContent)
```
  - sortByDecreasingValue
```
public static Map<String,Float> sortByDecreasingValue(Map<String,Float> map)
```
  - findMedian
```
public static Float findMedian(Map<String,Float> sortedMap)
```
