public class TfIdfCategorizer extends PrimitiveVectorHelper implements Categorizer, Serializable
See: http://en.wikipedia.org/wiki/Tfidf
Classification is then achieved using the cosine similarity between the TF-IDF of the document to classify and the registered topics.
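The classification step described above can be illustrated with a small, self-contained sketch. It is not the Nuxeo implementation: the class name `TfIdfCosineSketch`, its methods, the sparse `Map` representation, and the smoothed IDF formula are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only (hypothetical names); not the TfIdfCategorizer source. */
class TfIdfCosineSketch {

    /** Count how often each term occurs in a tokenized document. */
    static Map<String, Integer> termCounts(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Weight each term count by a smoothed inverse document frequency,
     * idf = ln((1 + N) / (1 + df)) + 1. This IDF variant is an assumption.
     */
    static Map<String, Double> tfIdf(Map<String, Integer> counts,
                                     Map<String, Integer> docFreq,
                                     int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((1.0 + totalDocs) / (1.0 + df)) + 1.0;
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Under these assumptions, a document is assigned to the topics whose TF-IDF vectors give the highest `cosine` value against the document's own vector.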
Modifier and Type | Field and Description |
---|---|
static org.apache.commons.logging.Log | log |
Constructor and Description |
---|
TfIdfCategorizer() |
TfIdfCategorizer(int dim) |
Modifier and Type | Method and Description |
---|---|
void | disableUpdate(): Precompute all the TF-IDF vectors and unload the original count vectors to save memory. |
static Float | findMedian(Map<String,Float> sortedMap) |
org.apache.lucene.analysis.Analyzer | getAnalyzer() |
double | getDensity() |
int | getDimension() |
Map<String,Float> | getSimilarities(List<String> terms): For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document given as a list of tokens. |
Map<String,Float> | getSimilarities(String allThePets): For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document. |
Set<String> | getTopicNames() |
HashingVectorizer | getVectorizer() |
List<String> | guessCategories(String textContent, int maxSuggestions): Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
List<String> | guessCategories(String textContent, int maxSuggestions, Double precisionThreshold): Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
void | learnFiles(File folder): Utility method to initialize the parameters from a set of UTF-8 encoded text files whose names are used as topic names. |
static TfIdfCategorizer | load(InputStream in): Load a TfIdfCategorizer instance from its compressed binary representation. |
static TfIdfCategorizer | load(String modelPath): Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread. |
static void | main(String[] args) |
void | saveToFile(File file): Save the model to a compressed binary format on the filesystem. |
void | saveToStream(OutputStream out): Save a compressed binary representation of the trained model. |
static Map<String,Float> | sortByDecreasingValue(Map<String,Float> map) |
List<String> | tokenize(String textContent) |
void | update(String topicName, List<String> terms): Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
void | update(String topicName, String textContent): Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
Methods inherited from class PrimitiveVectorHelper: add, dot, normOf, sum
public TfIdfCategorizer()
public TfIdfCategorizer(int dim)
public HashingVectorizer getVectorizer()
public org.apache.lucene.analysis.Analyzer getAnalyzer()
public void disableUpdate()
public void update(String topicName, List<String> terms)
topicName - the name of the document topic or category
terms - the list of document tokens (use a Lucene analyzer to extract them, for instance)
public void update(String topicName, String textContent)
topicName - the name of the document topic or category
textContent - textual content to be tokenized and analyzed
public Map<String,Float> getSimilarities(List<String> terms)
terms - a tokenized document
public Map<String,Float> getSimilarities(String allThePets)
allThePets - the document to be tokenized and analyzed
public int getDimension()
public void learnFiles(File folder) throws IOException
The content of each file is assumed to be lines of terms separated by whitespace, without punctuation.
folder - the folder containing the text files
Throws: IOException
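The file layout expected by learnFiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the class `TopicFolderReaderSketch` and its method `readTopics` are hypothetical, and stripping the file extension to obtain the topic name is a guess.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only (hypothetical names); not the TfIdfCategorizer source. */
class TopicFolderReaderSketch {

    /**
     * Read every regular file in the folder as UTF-8 text and map a topic
     * name to the whitespace-separated terms found in that file.
     * Assumption: the topic name is the file name without its extension.
     */
    static Map<String, List<String>> readTopics(Path folder) throws IOException {
        Map<String, List<String>> topics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-folders and other non-file entries
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String topic = dot > 0 ? name.substring(0, dot) : name;

                List<String> terms = new ArrayList<>();
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String trimmed = line.trim();
                    if (trimmed.isEmpty()) {
                        continue; // ignore blank lines
                    }
                    for (String term : trimmed.split("\\s+")) {
                        terms.add(term);
                    }
                }
                topics.put(topic, terms);
            }
        }
        return topics;
    }
}
```

Each resulting entry has the shape accepted by update(String topicName, List<String> terms), so a caller could feed the map to a categorizer one topic at a time.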
public void saveToFile(File file) throws IOException
file - where to write the model
Throws: IOException
public void saveToStream(OutputStream out) throws IOException
out - the output stream to write to
Throws: IOException
public static TfIdfCategorizer load(InputStream in) throws IOException, ClassNotFoundException
in - the input stream to read from
Throws: IOException, ClassNotFoundException
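The save and load methods describe a compressed binary representation, and the class implements Serializable. One plausible way to build such a format is Java serialization wrapped in GZIP. The sketch below shows that round trip; the choice of GZIP, the class `CompressedModelIoSketch`, and the `ToyModel` type are assumptions, and the actual on-disk format of TfIdfCategorizer may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch only (hypothetical names); not the TfIdfCategorizer source. */
class CompressedModelIoSketch {

    /** Hypothetical stand-in for a trained model: one float vector per topic. */
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, float[]> topicVectors = new HashMap<>();
    }

    /** Serialize the model and compress it with GZIP. Closes the given stream. */
    static void saveToStream(Serializable model, OutputStream out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(model);
        }
    }

    /** Decompress and deserialize a model written by saveToStream. Closes the given stream. */
    static Object load(InputStream in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in))) {
            return ois.readObject();
        }
    }
}
```

Deserialization is what makes ClassNotFoundException a possible outcome of loading: the stream names the classes it contains, and they must be available to the reading JVM.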
public double getDensity()
public static TfIdfCategorizer load(String modelPath) throws IOException, ClassNotFoundException
modelPath - the path of the model file in the classloading path
Throws: IOException, ClassNotFoundException
public static void main(String[] args) throws FileNotFoundException, IOException, ClassNotFoundException
public List<String> guessCategories(String textContent, int maxSuggestions)
Specified by: guessCategories in interface Categorizer
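Returning suggestions "sorted by decreasing confidence" with a cap of maxSuggestions can be sketched as a sort over a similarity map followed by truncation. The class `SuggestionRankingSketch` and its method are hypothetical. Treating the optional threshold as a minimum similarity, and treating null as "no cutoff", are assumptions made for this illustration; the real implementation applies its own default threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only (hypothetical names); not the TfIdfCategorizer source. */
class SuggestionRankingSketch {

    /**
     * Return at most maxSuggestions topic names, ordered by decreasing similarity.
     * Assumption: a non-null threshold is a minimum similarity; null disables the cutoff.
     */
    static List<String> topCategories(Map<String, Float> similarities,
                                      int maxSuggestions,
                                      Double threshold) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(similarities.entrySet());
        // Highest similarity first.
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Float> entry : entries) {
            if (suggestions.size() >= maxSuggestions) {
                break; // enough suggestions collected
            }
            if (threshold != null && entry.getValue() < threshold) {
                break; // entries are sorted, so all remaining ones are below the cutoff too
            }
            suggestions.add(entry.getKey());
        }
        return suggestions;
    }
}
```

A map such as the one returned by getSimilarities could be passed in directly, since it has the same Map<String,Float> shape.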
public List<String> guessCategories(String textContent, int maxSuggestions, Double precisionThreshold)
Specified by: guessCategories in interface Categorizer
precisionThreshold - the threshold to apply, or null to use the default threshold of the implementation
Copyright © 2015 Nuxeo SA. All rights reserved.