public class TfIdfCategorizer extends PrimitiveVectorHelper implements Categorizer, Serializable
See: http://en.wikipedia.org/wiki/Tfidf
Classification is then achieved using the cosine similarity between the TF-IDF vector of the document to classify and the TF-IDF vectors of the registered topics.
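As an illustration of the scoring described above, here is a minimal, self-contained sketch of TF-IDF weighting followed by cosine similarity. It is not the Nuxeo implementation: the class `TfIdfCosineSketch` and its method names are hypothetical, and the smoothed IDF formula is only one common variant chosen for the example. The parameter names `allTermCounts` and `totalTermCount` merely echo the fields listed on this page.

```java
/** Illustrative sketch only; names and the IDF formula are assumptions. */
final class TfIdfCosineSketch {

    /** Smoothed inverse frequency per hashed dimension (assumed variant). */
    static float[] idf(long[] allTermCounts, long totalTermCount) {
        float[] idf = new float[allTermCounts.length];
        for (int i = 0; i < idf.length; i++) {
            // Rare dimensions get a larger weight; the +1 terms avoid division by zero.
            idf[i] = (float) Math.log(1.0 + (double) totalTermCount / (1.0 + allTermCounts[i]));
        }
        return idf;
    }

    /** Element-wise product of raw counts and IDF weights. */
    static float[] tfIdf(long[] counts, float[] idf) {
        float[] v = new float[counts.length];
        for (int i = 0; i < v.length; i++) {
            v[i] = counts[i] * idf[i];
        }
        return v;
    }

    /** Cosine similarity; returns 0 when either vector has zero norm. */
    static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f;
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}
```

With such helpers, a document would be scored against each topic by calling `cosine(tfIdf(docCounts, idf), tfIdf(topicCounts, idf))` and the topics with the highest scores would be suggested first.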
Modifier and Type | Field and Description |
---|---|
protected long[] | allTermCounts |
protected org.apache.lucene.analysis.Analyzer | analyzer |
protected float[] | cachedIdf |
protected Map<String,Object> | cachedTopicTfIdf |
protected Map<String,Float> | cachedTopicTfIdfNorm |
protected int | dim |
static org.apache.commons.logging.Log | log |
protected Double | ratioOverMedian |
protected Set<String> | topicNames |
protected Map<String,Object> | topicTermCount |
protected long | totalTermCount |
protected boolean | updateDisabled |
protected HashingVectorizer | vectorizer |
Constructor and Description |
---|
TfIdfCategorizer() |
TfIdfCategorizer(int dim) |
Modifier and Type | Method and Description |
---|---|
void | disableUpdate() Precompute all the TF-IDF vectors and unload the original count vectors to save memory. |
static Float | findMedian(Map<String,Float> sortedMap) |
org.apache.lucene.analysis.Analyzer | getAnalyzer() |
double | getDensity() |
int | getDimension() |
protected float[] | getIdf() |
Map<String,Float> | getSimilarities(List<String> terms) For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document given as a list of tokens. |
Map<String,Float> | getSimilarities(String allThePets) For each registered topic, compute the cosine similarity between the TF-IDF vector of the topic and that of the document. |
protected float[] | getTfIdf(long[] counts) |
Set<String> | getTopicNames() |
HashingVectorizer | getVectorizer() |
List<String> | guessCategories(String textContent, int maxSuggestions) Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
List<String> | guessCategories(String textContent, int maxSuggestions, Double precisionThreshold) Compute a list of suggested categories, sorted by decreasing confidence, based on the textual content of the document. |
protected void | invalidateCache() |
protected void | invalidateCache(String topicName) |
void | learnFiles(File folder) Utility method to initialize the parameters from a set of UTF-8 encoded text files whose names are used as topic names. |
static TfIdfCategorizer | load(InputStream in) Load a TfIdfCategorizer instance from its compressed binary representation. |
static TfIdfCategorizer | load(String modelPath) Load a TfIdfCategorizer instance from its compressed binary representation, read from a named resource in the classloading path of the current thread. |
static void | main(String[] args) |
void | saveToFile(File file) Save the model to a compressed binary format on the filesystem. |
void | saveToStream(OutputStream out) Save a compressed binary representation of the trained model. |
static Map<String,Float> | sortByDecreasingValue(Map<String,Float> map) |
protected float[] | tfidf(String topicName) |
protected float | tfidfNorm(String topicName) |
List<String> | tokenize(String textContent) |
void | update(String topicName, List<String> terms) Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
void | update(String topicName, String textContent) Update the model to take into account the statistical properties of a document that is known to be relevant to the given topic. |
Methods inherited from class PrimitiveVectorHelper: add, dot, normOf, sum
public static final org.apache.commons.logging.Log log
protected final Set<String> topicNames
protected final Map<String,Object> topicTermCount
protected final Map<String,Object> cachedTopicTfIdf
protected final Map<String,Float> cachedTopicTfIdfNorm
protected long[] allTermCounts
protected final int dim
protected float[] cachedIdf
protected long totalTermCount
protected final HashingVectorizer vectorizer
protected transient org.apache.lucene.analysis.Analyzer analyzer
protected Double ratioOverMedian
protected boolean updateDisabled
public TfIdfCategorizer()
public TfIdfCategorizer(int dim)
public HashingVectorizer getVectorizer()
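The combination of a `HashingVectorizer`, a fixed `dim` and `long[]` count vectors suggests the hashing trick, where each token is mapped to one of `dim` buckets instead of being stored in a vocabulary. The sketch below shows that general idea only. It is an assumption about the technique, not a description of the actual `HashingVectorizer` class, and the names `HashingCountsSketch` and `count` are hypothetical.

```java
import java.util.List;

/** Illustrative hashing-trick sketch; not the real HashingVectorizer. */
final class HashingCountsSketch {

    /** Counts tokens into a fixed-size vector indexed by a hash of each token. */
    static long[] count(List<String> terms, int dim) {
        long[] counts = new long[dim];
        for (String term : terms) {
            // floorMod keeps the bucket index non-negative even for negative hash codes.
            int bucket = Math.floorMod(term.hashCode(), dim);
            counts[bucket]++;
        }
        return counts;
    }
}
```

Different tokens may collide in the same bucket; a larger `dim` reduces collisions at the cost of memory.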
public org.apache.lucene.analysis.Analyzer getAnalyzer()
public void disableUpdate()
public void update(String topicName, List<String> terms)
topicName - the name of the document topic or category
terms - the list of document tokens (use a Lucene analyzer to extract them, for instance)
public void update(String topicName, String textContent)
topicName - the name of the document topic or category
textContent - textual content to be tokenized and analyzed
protected void invalidateCache(String topicName)
protected void invalidateCache()
public Map<String,Float> getSimilarities(List<String> terms)
terms - a tokenized document
public Map<String,Float> getSimilarities(String allThePets)
allThePets - the document to be tokenized and analyzed
protected float[] getTfIdf(long[] counts)
protected float[] getIdf()
public int getDimension()
public void learnFiles(File folder) throws IOException
The content of each file is assumed to be lines of terms separated by whitespace, without punctuation.
folder - the folder containing the topic text files
IOException
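To make the expected input layout concrete, the following sketch reads a folder of UTF-8 text files into one token list per topic. It does not call `learnFiles` and is not its implementation. The names `LearnFilesSketch` and `readTopics` are hypothetical, and stripping the file extension to obtain the topic name is an assumption, since this page only says that file names are used as topic names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative reader for a folder of whitespace-separated term files. */
final class LearnFilesSketch {

    /** Returns topic name -> terms, one entry per regular file in the folder. */
    static Map<String, List<String>> readTopics(Path folder) throws IOException {
        Map<String, List<String>> topics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Assumption: the topic name is the file name without its extension.
                String topic = dot > 0 ? name.substring(0, dot) : name;
                List<String> terms = topics.computeIfAbsent(topic, k -> new ArrayList<>());
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    for (String term : line.trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            terms.add(term);
                        }
                    }
                }
            }
        }
        return topics;
    }
}
```

Each resulting entry has the shape expected by `update(String topicName, List<String> terms)`.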
public void saveToFile(File file) throws IOException
file - where to write the model
IOException
public void saveToStream(OutputStream out) throws IOException
out - the output stream to write to
IOException
public static TfIdfCategorizer load(InputStream in) throws IOException, ClassNotFoundException
in - the input stream to read from
IOException
ClassNotFoundException
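The class implements `Serializable` and `load` declares `ClassNotFoundException`, which is consistent with Java object serialization. The sketch below shows one plausible way to write and read a compressed serialized object. The choice of GZIP is an assumption, since this page only says "compressed binary", and `CompressedModelIoSketch` with its `save` and `load` helpers is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative round trip of a Serializable object through GZIP (assumed format). */
final class CompressedModelIoSketch {

    /** Writes the object compressed; the caller remains responsible for closing out. */
    static void save(Serializable model, OutputStream out) throws IOException {
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        ObjectOutputStream oos = new ObjectOutputStream(gzip);
        oos.writeObject(model);
        oos.flush();
        // finish() writes the GZIP trailer without closing the underlying stream.
        gzip.finish();
    }

    /** Reads back an object written by save(). */
    static Object load(InputStream in) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in));
        return ois.readObject();
    }
}
```

Fields declared `transient`, such as `analyzer` on this page, are skipped by Java serialization and have to be rebuilt after loading.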
public double getDensity()
public Set<String> getTopicNames()
public static TfIdfCategorizer load(String modelPath) throws IOException, ClassNotFoundException
modelPath - the path of the model file in the classloading path
IOException
ClassNotFoundException
public static void main(String[] args) throws FileNotFoundException, IOException, ClassNotFoundException
public List<String> guessCategories(String textContent, int maxSuggestions)
Specified by: guessCategories in interface Categorizer
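The summary describes the result as a list of categories sorted by decreasing confidence and capped at `maxSuggestions`. A minimal sketch of that ranking step, starting from a map of per-topic similarities, could look as follows. `SuggestionSketch` and `topCategories` are hypothetical names, and any thresholding the real method applies is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative ranking of topics by similarity; not the actual guessCategories logic. */
final class SuggestionSketch {

    /** Returns at most maxSuggestions topic names, highest similarity first. */
    static List<String> topCategories(Map<String, Float> similarities, int maxSuggestions) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(similarities.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Float> entry : entries) {
            if (names.size() >= maxSuggestions) {
                break;
            }
            names.add(entry.getKey());
        }
        return names;
    }
}
```

The input map has the same shape as the value returned by `getSimilarities`.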
public List<String> guessCategories(String textContent, int maxSuggestions, Double precisionThreshold)
Specified by: guessCategories in interface Categorizer
precisionThreshold - or null to use the default threshold of the implementation.
public static Float findMedian(Map<String,Float> sortedMap)
Copyright © 2016 Nuxeo SA. All rights reserved.