public class TfIdfCategorizer extends PrimitiveVectorHelper implements Categorizer, Serializable
See: http://en.wikipedia.org/wiki/Tfidf
Classification is achieved using the cosine similarity between the TF-IDF vector of the document to classify and the TF-IDF vectors of the registered topics.
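As an illustration of the comparison described above, here is a minimal, self-contained sketch of TF-IDF weighting followed by cosine similarity. It is not the code of this class: the class name `TfIdfCosineSketch`, the helper names, and the `count * log(topicCount / topicFrequency)` weighting are assumptions chosen for the example (one common textbook variant), and the actual implementation may weight terms differently.

```java
final class TfIdfCosineSketch {

    // Dot product of two vectors of the same length.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean norm of a vector.
    static float normOf(float[] a) {
        return (float) Math.sqrt(dot(a, a));
    }

    // Cosine similarity; defined here as 0 when either vector is all zeros.
    static float cosine(float[] a, float[] b) {
        float na = normOf(a);
        float nb = normOf(b);
        return (na == 0f || nb == 0f) ? 0f : dot(a, b) / (na * nb);
    }

    // Assumed weighting: term count times log(topicCount / number of topics
    // containing the term). Terms absent from the document or from every
    // topic keep a weight of 0.
    static float[] tfIdf(int[] counts, int[] topicFrequency, int topicCount) {
        float[] weights = new float[counts.length];
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0 && topicFrequency[i] > 0) {
                double idf = Math.log((double) topicCount / topicFrequency[i]);
                weights[i] = (float) (counts[i] * idf);
            }
        }
        return weights;
    }
}
```

With this weighting, a term that appears in every topic gets a weight of zero, so it does not contribute to the similarity score.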
Constructor and Description |
---|
TfIdfCategorizer() |
TfIdfCategorizer(int dim) |
Modifier and Type | Method and Description |
---|---|
void | disableUpdate(): Precompute all the TF-IDF vectors and unload the original count vectors to save memory. |
static Float | findMedian(Map<String,Float> sortedMap) |
org.apache.lucene.analysis.Analyzer | getAnalyzer() |
double | getDensity() |
int | getDimension() |
Map<String,Float> | getSimilarities(List<String> terms): For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document, given as a list of tokens. |
Map<String,Float> | getSimilarities(String allThePets): For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document. |
Set<String> | getTopicNames() |
HashingVectorizer | getVectorizer() |
List<String> | guessCategories(String textContent, int maxSuggestions): Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
List<String> | guessCategories(String textContent, int maxSuggestions, Double precisionThreshold): Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
void | learnFiles(File folder): Utility method to initialize the parameters from a set of UTF-8 encoded text files whose names are used as topic names. |
static TfIdfCategorizer | load(InputStream in): Load a TfIdfCategorizer instance from its compressed binary representation. |
static TfIdfCategorizer | load(String modelPath): Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread. |
static void | main(String[] args) |
void | saveToFile(File file): Save the model to a compressed binary format on the filesystem. |
void | saveToStream(OutputStream out): Save a compressed binary representation of the trained model. |
static Map<String,Float> | sortByDecreasingValue(Map<String,Float> map) |
List<String> | tokenize(String textContent) |
void | update(String topicName, List<String> terms): Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
void | update(String topicName, String textContent): Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
Methods inherited from class PrimitiveVectorHelper: add, dot, normOf, sum
public static final Log log
public TfIdfCategorizer()
public TfIdfCategorizer(int dim)
public HashingVectorizer getVectorizer()
public org.apache.lucene.analysis.Analyzer getAnalyzer()
public void disableUpdate()
public void update(String topicName, List<String> terms)
topicName
- the name of the document topic or category
terms
- the list of document tokens (use a Lucene analyzer to
extract them, for instance)
public void update(String topicName, String textContent)
topicName
- the name of the document topic or category
textContent
- textual content to be tokenized and analyzed
public Map<String,Float> getSimilarities(List<String> terms)
terms
- a tokenized document
public Map<String,Float> getSimilarities(String allThePets)
allThePets
- the document to be tokenized and analyzed
public int getDimension()
public void learnFiles(File folder) throws IOException
The content of each file is assumed to be lines of terms separated by whitespace, without punctuation.
folder
- the folder containing the text files
IOException
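The expected input layout can be illustrated with a small standalone reader. This is only a sketch under assumptions, not the code of learnFiles: the class `TopicFolderSketch` and the method `readTopics` are hypothetical names, and using the file name unchanged (extension included) as the topic name is a guess about how topic names are derived.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TopicFolderSketch {

    // Reads every regular file in the folder as UTF-8 and returns
    // topic name -> list of whitespace-separated terms.
    static Map<String, List<String>> readTopics(Path folder) throws IOException {
        Map<String, List<String>> topics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-folders and other non-file entries
                }
                List<String> terms = new ArrayList<>();
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    for (String term : line.trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            terms.add(term);
                        }
                    }
                }
                // Assumption: the file name is used as-is as the topic name.
                topics.put(file.getFileName().toString(), terms);
            }
        }
        return topics;
    }
}
```

Each resulting entry could then be passed to update(String topicName, List<String> terms), one call per topic.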
public void saveToFile(File file) throws IOException
file
- where to write the model
IOException
public void saveToStream(OutputStream out) throws IOException
out
- the output stream to write to
IOException
public static TfIdfCategorizer load(InputStream in) throws IOException, ClassNotFoundException
in
- the input stream to read from
IOException
ClassNotFoundException
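The save and load methods describe a compressed binary representation, and the declared ClassNotFoundException is consistent with standard Java object serialization. The sketch below shows one way such a round trip can work, assuming GZIP compression wrapped around ObjectOutputStream and ObjectInputStream. The compression format actually used by this class is not stated here, and `CompressedModelSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressedModelSketch {

    // Serializes the model and compresses it with GZIP (assumed format).
    // Closing the object stream also writes the GZIP trailer.
    static void save(Serializable model, OutputStream out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(model);
        }
    }

    // Decompresses and deserializes a model written by save().
    static Object load(InputStream in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in))) {
            return ois.readObject();
        }
    }
}
```

With the real class, the equivalent round trip would be saveToStream(out) followed by the static load(in), with the result already typed as TfIdfCategorizer.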
public double getDensity()
public static TfIdfCategorizer load(String modelPath) throws IOException, ClassNotFoundException
modelPath
- the path of the model file in the classloading path
IOException
ClassNotFoundException
public static void main(String[] args) throws FileNotFoundException, IOException, ClassNotFoundException
public List<String> guessCategories(String textContent, int maxSuggestions)
Specified by:
guessCategories in interface Categorizer
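The ordering "by decreasing confidence" can be pictured as sorting a topic-to-similarity map and keeping the first maxSuggestions names. The following is a rough sketch of that idea only: `SuggestionRankingSketch`, `rankByDecreasingValue`, and `topSuggestions` are hypothetical names, and the real guessCategories may apply additional filtering that is not modelled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SuggestionRankingSketch {

    // Returns a copy of the map whose iteration order goes from the
    // highest value to the lowest.
    static Map<String, Float> rankByDecreasingValue(Map<String, Float> map) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(map.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        Map<String, Float> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Float> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    // Keeps at most maxSuggestions topic names, best match first.
    static List<String> topSuggestions(Map<String, Float> similarities, int maxSuggestions) {
        List<String> suggestions = new ArrayList<>();
        for (String topic : rankByDecreasingValue(similarities).keySet()) {
            if (suggestions.size() >= maxSuggestions) {
                break;
            }
            suggestions.add(topic);
        }
        return suggestions;
    }
}
```

A LinkedHashMap is used for the sorted copy because it preserves insertion order, which a plain HashMap would not.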
public List<String> guessCategories(String textContent, int maxSuggestions, Double precisionThreshold)
Specified by:
guessCategories in interface Categorizer
precisionThreshold
- or null to use the default threshold of the
implementation.
Copyright © 2011 Nuxeo SA. All Rights Reserved.