TfIdfCategorizer (Nuxeo ECM Projects 8.3 API)

java.lang.Object
- org.nuxeo.ecm.platform.categorization.categorizer.tfidf.PrimitiveVectorHelper
- - org.nuxeo.ecm.platform.categorization.categorizer.tfidf.TfIdfCategorizer

All Implemented Interfaces:

Serializable, Categorizer
```
public class TfIdfCategorizer
extends PrimitiveVectorHelper
implements Categorizer, Serializable
```
Maintains an in-memory map of TF count vectors (only for a few reference documents or topics), along with a common IDF estimate computed from all previously seen text content.
See: http://en.wikipedia.org/wiki/Tfidf
Classification then uses the cosine similarity between the TF-IDF vector of the document to classify and the TF-IDF vectors of the registered topics.
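
The scoring described above can be illustrated with a minimal, self-contained sketch. It is not the Nuxeo implementation: the class name `TfIdfCosineSketch`, its method names, and the smoothed IDF formula are assumptions chosen for illustration only.

```java
// Hypothetical sketch of TF-IDF weighting and cosine similarity on count vectors.
final class TfIdfCosineSketch {

    /** Assumed smoothed IDF per dimension: log(1 + total / (1 + count)). */
    static float[] idf(long[] allTermCounts, long totalTermCount) {
        float[] idf = new float[allTermCounts.length];
        for (int i = 0; i < idf.length; i++) {
            idf[i] = (float) Math.log1p((double) totalTermCount / (1 + allTermCounts[i]));
        }
        return idf;
    }

    /** Term frequency (count divided by the sum of counts) weighted by IDF. */
    static float[] tfidf(long[] counts, float[] idf) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        float[] weighted = new float[counts.length];
        if (sum == 0) {
            return weighted; // empty document: all-zero vector
        }
        for (int i = 0; i < counts.length; i++) {
            weighted[i] = ((float) counts[i] / sum) * idf[i];
        }
        return weighted;
    }

    /** Cosine similarity; returns 0 when either vector has zero norm. */
    static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f;
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }
}
```

Because all weights are non-negative, the cosine stays between 0 and 1, which matches the range documented for `getSimilarities`.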

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected long[]`	`allTermCounts`
`protected org.apache.lucene.analysis.Analyzer`	`analyzer`
`protected float[]`	`cachedIdf`
`protected Map<String,Object>`	`cachedTopicTfIdf`
`protected Map<String,Float>`	`cachedTopicTfIdfNorm`
`protected int`	`dim`
`static org.apache.commons.logging.Log`	`log`
`protected Double`	`ratioOverMedian`
`protected Set<String>`	`topicNames`
`protected Map<String,Object>`	`topicTermCount`
`protected long`	`totalTermCount`
`protected boolean`	`updateDisabled`
`protected HashingVectorizer`	`vectorizer`

Constructor Summary

Constructors
Constructor and Description
`TfIdfCategorizer()`
`TfIdfCategorizer(int dim)`

Method Summary

Modifier and Type	Method and Description
`void`	`disableUpdate()` Precompute all the TF-IDF vectors and unload the original count vectors to save memory.
`static Float`	`findMedian(Map<String,Float> sortedMap)`
`org.apache.lucene.analysis.Analyzer`	`getAnalyzer()`
`double`	`getDensity()`
`int`	`getDimension()`
`protected float[]`	`getIdf()`
`Map<String,Float>`	`getSimilarities(List<String> terms)` For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document, given as a list of tokens.
`Map<String,Float>`	`getSimilarities(String allThePets)` For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document.
`protected float[]`	`getTfIdf(long[] counts)`
`Set<String>`	`getTopicNames()`
`HashingVectorizer`	`getVectorizer()`
`List<String>`	`guessCategories(String textContent, int maxSuggestions)` Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
`List<String>`	`guessCategories(String textContent, int maxSuggestions, Double precisionThreshold)` Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
`protected void`	`invalidateCache()`
`protected void`	`invalidateCache(String topicName)`
`void`	`learnFiles(File folder)` Utility method to initialize the parameters from a set of UTF-8 encoded text files with names used as topic names.
`static TfIdfCategorizer`	`load(InputStream in)` Load a TfIdfCategorizer instance from its compressed binary representation.
`static TfIdfCategorizer`	`load(String modelPath)` Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread.
`static void`	`main(String[] args)`
`void`	`saveToFile(File file)` Save the model to a compressed binary format on the filesystem.
`void`	`saveToStream(OutputStream out)` Save a compressed binary representation of the trained model.
`static Map<String,Float>`	`sortByDecreasingValue(Map<String,Float> map)`
`protected float[]`	`tfidf(String topicName)`
`protected float`	`tfidfNorm(String topicName)`
`List<String>`	`tokenize(String textContent)`
`void`	`update(String topicName, List<String> terms)` Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic.
`void`	`update(String topicName, String textContent)` Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic.

Methods inherited from class org.nuxeo.ecm.platform.categorization.categorizer.tfidf.PrimitiveVectorHelper
add, dot, normOf, sum

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - log
```
public static final org.apache.commons.logging.Log log
```
  - topicNames
```
protected final Set<String> topicNames
```
  - topicTermCount
```
protected final Map<String,Object> topicTermCount
```
  - cachedTopicTfIdf
```
protected final Map<String,Object> cachedTopicTfIdf
```
  - cachedTopicTfIdfNorm
```
protected final Map<String,Float> cachedTopicTfIdfNorm
```
  - allTermCounts
```
protected long[] allTermCounts
```
  - dim
```
protected final int dim
```
  - cachedIdf
```
protected float[] cachedIdf
```
  - totalTermCount
```
protected long totalTermCount
```
  - vectorizer
```
protected final HashingVectorizer vectorizer
```
  - analyzer
```
protected transient org.apache.lucene.analysis.Analyzer analyzer
```
  - ratioOverMedian
```
protected Double ratioOverMedian
```
  - updateDisabled
```
protected boolean updateDisabled
```
- Constructor Detail
  - TfIdfCategorizer
```
public TfIdfCategorizer()
```
  - TfIdfCategorizer
```
public TfIdfCategorizer(int dim)
```
- Method Detail
  - getVectorizer
```
public HashingVectorizer getVectorizer()
```
  - getAnalyzer
```
public org.apache.lucene.analysis.Analyzer getAnalyzer()
```
  - disableUpdate
```
public void disableUpdate()
```
    Precompute all the TF-IDF vectors and unload the original count vectors to save memory. Updates are no longer possible afterwards.
  - update
```
public void update(String topicName,
                   List<String> terms)
```
    Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. Warning: this method is not thread safe; it should not be used concurrently with getSimilarities(List).
    
    Parameters:
    
    topicName - the name of the document topic or category
    
    terms - the list of document tokens (use a Lucene analyzer to extract them, for instance)
  - update
```
public void update(String topicName,
                   String textContent)
```
    Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. Warning: this method is not thread safe; it should not be used concurrently with getSimilarities(List).
    
    Parameters:
    
    topicName - the name of the document topic or category
    
    textContent - textual content to be tokenized and analyzed
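
    As a rough illustration of what such an update involves, the sketch below accumulates term counts into fixed-size vectors. The class `HashedCountsSketch` and the use of `String.hashCode()` as the bucket function are assumptions; the hashing actually performed by `HashingVectorizer` is not described on this page.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: per-topic and global term counts in vectors of size dim.
final class HashedCountsSketch {
    final int dim;
    final long[] allTermCounts;
    final Map<String, long[]> topicTermCount = new HashMap<>();
    long totalTermCount;

    HashedCountsSketch(int dim) {
        this.dim = dim;
        this.allTermCounts = new long[dim];
    }

    /** Maps a term to a bucket in [0, dim); String.hashCode is a stand-in hash. */
    int bucket(String term) {
        return Math.floorMod(term.hashCode(), dim);
    }

    /** Not thread safe: plain arrays and a HashMap are mutated in place. */
    void update(String topicName, List<String> terms) {
        long[] counts = topicTermCount.computeIfAbsent(topicName, k -> new long[dim]);
        for (String term : terms) {
            int i = bucket(term);
            counts[i]++;        // evidence for this topic
            allTermCounts[i]++; // evidence shared by all topics, used for the IDF estimate
            totalTermCount++;
        }
    }
}
```

    Distinct terms may collide in the same bucket, so the counts are approximate; a larger `dim` reduces collisions at the cost of memory.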
  - invalidateCache
```
protected void invalidateCache(String topicName)
```
  - invalidateCache
```
protected void invalidateCache()
```
  - getSimilarities
```
public Map<String,Float> getSimilarities(List<String> terms)
```
    For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document, given as a list of tokens.
    
    Parameters:
    
    terms - a tokenized document.
    
    Returns:
    
    a map of topic names to float values from 0 to 1, sorted by decreasing value.
  - getSimilarities
```
public Map<String,Float> getSimilarities(String allThePets)
```
    For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document.
    
    Parameters:
    
    allThePets - the document to be tokenized and analyzed
    
    Returns:
    
    a map of topic names to float values from 0 to 1, sorted by decreasing value.
  - tfidfNorm
```
protected float tfidfNorm(String topicName)
```
  - tfidf
```
protected float[] tfidf(String topicName)
```
  - getTfIdf
```
protected float[] getTfIdf(long[] counts)
```
  - getIdf
```
protected float[] getIdf()
```
  - getDimension
```
public int getDimension()
```
  - learnFiles
```
public void learnFiles(File folder)
                throws IOException
```
    Utility method to initialize the parameters from a set of UTF-8 encoded text files with names used as topic names.
    The content of each file is assumed to be lines of terms separated by whitespace, without punctuation.
    
    Parameters:
    
    folder - the folder containing the text files
    
    Throws:
    
    IOException
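
    The file layout described above can be read with a few lines of standard Java. The following sketch is illustrative only: the class `LearnFilesSketch`, the method `readTopics`, and the choice to strip the file extension from the topic name are assumptions, not documented behavior.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one topic per UTF-8 text file, terms separated by whitespace.
final class LearnFilesSketch {

    /** Returns topic name -> list of terms for every regular file in the folder. */
    static Map<String, List<String>> readTopics(Path folder) throws IOException {
        Map<String, List<String>> topics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Assumption: "sports.txt" yields the topic "sports".
                String topic = dot > 0 ? name.substring(0, dot) : name;
                List<String> terms = new ArrayList<>();
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    for (String term : line.trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            terms.add(term);
                        }
                    }
                }
                topics.put(topic, terms);
            }
        }
        return topics;
    }
}
```

    Each entry of the returned map could then be passed to `update(String topicName, List<String> terms)`.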
  - saveToFile
```
public void saveToFile(File file)
                throws IOException
```
    Save the model to a compressed binary format on the filesystem.
    
    Parameters:
    
    file - where to write the model
    
    Throws:
    
    IOException
  - saveToStream
```
public void saveToStream(OutputStream out)
                  throws IOException
```
    Save a compressed binary representation of the trained model.
    
    Parameters:
    
    out - the output stream to write to
    
    Throws:
    
    IOException
  - load
```
public static TfIdfCategorizer load(InputStream in)
                             throws IOException,
                                    ClassNotFoundException
```
    Load a TfIdfCategorizer instance from its compressed binary representation.
    
    Parameters:
    
    in - the input stream to read from
    
    Returns:
    
    a new instance with parameters coming from the saved version
    
    Throws:
    
    IOException
    
    ClassNotFoundException
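
    The class implements Serializable, so one plausible reading of "compressed binary representation" is Java serialization wrapped in GZIP. The sketch below shows that round trip under this assumption; `GzipModelSketch` and its nested `Model` class are hypothetical stand-ins, and the actual on-disk format is not specified on this page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: GZIP-compressed Java serialization of a small model object.
final class GzipModelSketch {

    /** Stand-in for a trained model; only serializable state is kept. */
    static final class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        final long[] allTermCounts;
        final long totalTermCount;

        Model(long[] allTermCounts, long totalTermCount) {
            this.allTermCounts = allTermCounts;
            this.totalTermCount = totalTermCount;
        }
    }

    static void save(Model model, OutputStream out) throws IOException {
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        ObjectOutputStream objects = new ObjectOutputStream(gzip);
        objects.writeObject(model);
        objects.flush();
        gzip.finish(); // writes the GZIP trailer without closing the caller's stream
    }

    static Model load(InputStream in) throws IOException, ClassNotFoundException {
        ObjectInputStream objects = new ObjectInputStream(new GZIPInputStream(in));
        return (Model) objects.readObject();
    }
}
```

    With Java serialization, fields declared `transient` (such as `analyzer` in this class) are not written and have to be recreated after loading.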
  - getDensity
```
public double getDensity()
```
  - getTopicNames
```
public Set<String> getTopicNames()
```
  - load
```
public static TfIdfCategorizer load(String modelPath)
                             throws IOException,
                                    ClassNotFoundException
```
    Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread.
    
    Parameters:
    
    modelPath - the path of the model file in the classloading path
    
    Returns:
    
    a new instance with parameters coming from the saved version
    
    Throws:
    
    IOException
    
    ClassNotFoundException
  - main
```
public static void main(String[] args)
                 throws FileNotFoundException,
                        IOException,
                        ClassNotFoundException
```
    Throws:
    
    FileNotFoundException
    
    IOException
    
    ClassNotFoundException
  - guessCategories
```
public List<String> guessCategories(String textContent,
                                    int maxSuggestions)
```
    Description copied from interface: Categorizer
    
    Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
    
    Specified by:
    
    guessCategories in interface Categorizer
  - guessCategories
```
public List<String> guessCategories(String textContent,
                                    int maxSuggestions,
                                    Double precisionThreshold)
```
    Description copied from interface: Categorizer
    
    Compute a list of suggested categories, sorted by decreasing confidence based on the textual content of the document.
    
    Specified by:
    
    guessCategories in interface Categorizer
    
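    Parameters:
    
    precisionThreshold - the precision threshold, or null to use the default threshold of the implementation.
    
    Returns:
    
    a list of suggested categories, sorted by decreasing confidence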
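
    How the threshold is applied is not documented here. The presence of a `ratioOverMedian` field suggests that scores may be compared to the median similarity; the sketch below follows that guess. The class `SuggestionSketch`, the method `suggest`, and the default value of 3.0 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep the best topics whose score is well above the median.
final class SuggestionSketch {

    /** Hypothetical default; the real default threshold is not stated on this page. */
    static final double DEFAULT_RATIO_OVER_MEDIAN = 3.0;

    /**
     * @param sortedSimilarities topic -> score, iterated by decreasing score
     * @param median             median of the scores
     * @param maxSuggestions     upper bound on the number of suggestions
     * @param precisionThreshold minimum score/median ratio, or null for the default
     */
    static List<String> suggest(Map<String, Float> sortedSimilarities, float median,
            int maxSuggestions, Double precisionThreshold) {
        double threshold = precisionThreshold != null ? precisionThreshold : DEFAULT_RATIO_OVER_MEDIAN;
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Float> entry : sortedSimilarities.entrySet()) {
            if (suggestions.size() >= maxSuggestions) {
                break;
            }
            // With a zero median, every score is treated as passing the threshold.
            double ratio = median > 0 ? entry.getValue() / median : Double.POSITIVE_INFINITY;
            if (ratio < threshold) {
                break; // scores are sorted, so later entries cannot pass either
            }
            suggestions.add(entry.getKey());
        }
        return suggestions;
    }
}
```

    A relative threshold of this kind adapts to documents whose similarities are uniformly low or high, which an absolute cutoff would not.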
  - tokenize
```
public List<String> tokenize(String textContent)
```
  - sortByDecreasingValue
```
public static Map<String,Float> sortByDecreasingValue(Map<String,Float> map)
```
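    This helper carries no description. Judging from its name and signature, it returns the entries ordered from the highest value to the lowest. A minimal sketch of that behavior, using a hypothetical class `SortByValueSketch` and an insertion-ordered `LinkedHashMap` as an assumed result type:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: copy a map into a new map iterated by decreasing value.
final class SortByValueSketch {

    static Map<String, Float> sortByDecreasingValue(Map<String, Float> map) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        Map<String, Float> sorted = new LinkedHashMap<>(); // preserves insertion order
        for (Map.Entry<String, Float> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```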
  - findMedian
```
public static Float findMedian(Map<String,Float> sortedMap)
```
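    This helper is also undocumented. The sketch below shows one plausible behavior, assuming the map iterates in sorted order, that an even number of entries yields the mean of the two middle values, and that an empty map yields 0. The class `MedianSketch` and these conventions are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: median of the values of a map that iterates in sorted order.
final class MedianSketch {

    static Float findMedian(Map<String, Float> sortedMap) {
        List<Float> values = new ArrayList<>(sortedMap.values());
        int n = values.size();
        if (n == 0) {
            return 0f; // assumed convention for an empty map
        }
        if (n % 2 == 1) {
            return values.get(n / 2);
        }
        // Even count: average the two middle values (assumed convention).
        return (values.get(n / 2 - 1) + values.get(n / 2)) / 2f;
    }
}
```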
