public class HashingVectorizer extends Object implements Serializable
We use a hash representation to keep memory requirements low: it avoids storing an explicit map from string bigrams to feature vector indices in memory.
http://hunch.net/~jl/projects/hash_reps/index.html http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters
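The idea described above can be sketched in a few lines. The class below is a hypothetical illustration, not the Nuxeo source: the name `HashedCountsSketch`, the use of `Objects.hash` as the hash function, and the reading of "probes" as the number of hash positions incremented per token (in the style of a counting Bloom filter) are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration class; not the Nuxeo HashingVectorizer source.
class HashedCountsSketch {

    // Map a token and a probe index to a bucket in [0, dim).
    // Assumption: Objects.hash is used for mixing; the real hash function may differ.
    static int bucket(String token, int probe, int dim) {
        return Math.floorMod(Objects.hash(token, probe), dim);
    }

    // Count tokens into a fixed-size array. No token-to-index map is stored,
    // so memory use depends only on dim, not on the vocabulary size.
    static long[] count(List<String> tokens, int dim, int probes) {
        long[] counts = new long[dim];
        for (String token : tokens) {
            for (int p = 0; p < probes; p++) {
                counts[bucket(token, p, dim)]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = count(Arrays.asList("the", "cat", "the"), 16, 2);
        System.out.println(Arrays.toString(counts));
    }
}
```

Two distinct tokens may land in the same bucket; with a small `dim` this happens more often, which is the collision trade-off the `dimension` setting controls.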
Constructor and Description |
---|
HashingVectorizer() |
Modifier and Type | Method and Description |
---|---|
void | addCounts(List<String> tokens, long[] counts) |
long[] | count(List<String> tokens) |
HashingVectorizer | dimension(int dim) Chain configuration of the number of buckets, which is also the number of dimensions of the vector; small values mean a high probability of collisions. |
HashingVectorizer | probes(int probes) Chain configuration of the number of probes, i.e. |
HashingVectorizer | window(int window) Chain configuration of the number of terms to hash together: window = 1 means unigrams and bigrams; window = 3 would add bigrams of distance 2, and so on. |
public HashingVectorizer dimension(int dim)
public HashingVectorizer window(int window)
public HashingVectorizer probes(int probes)
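Because each of these three methods returns a `HashingVectorizer`, configuration calls can be chained. The stand-in class below sketches that pattern and lists which token pairs a window would combine. It is hypothetical throughout: the class name `ChainedConfigSketch`, the default values, the `pairs` helper, and the reading of `window` as the maximum distance between the two positions of a pair are assumptions for illustration, not the documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in used only to illustrate chaining; not the Nuxeo class.
class ChainedConfigSketch {
    private int dim = 1024;  // assumed default, for illustration only
    private int window = 1;  // assumed default
    private int probes = 1;  // assumed default

    // Each setter returns this so that calls can be chained.
    ChainedConfigSketch dimension(int dim) { this.dim = dim; return this; }
    ChainedConfigSketch window(int window) { this.window = window; return this; }
    ChainedConfigSketch probes(int probes) { this.probes = probes; return this; }

    // List the token pairs that would be hashed together, assuming "window"
    // is the maximum distance between the two positions of a pair.
    List<String> pairs(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = Math.max(0, i - window); j < i; j++) {
                out.add(tokens.get(j) + " " + tokens.get(i));
            }
        }
        return out;
    }

    @Override
    public String toString() {
        return "dim=" + dim + ", window=" + window + ", probes=" + probes;
    }

    public static void main(String[] args) {
        ChainedConfigSketch v = new ChainedConfigSketch().dimension(4096).probes(2).window(2);
        System.out.println(v);
        System.out.println(v.pairs(Arrays.asList("a", "b", "c")));
    }
}
```

With the real class, the equivalent chain would presumably read `new HashingVectorizer().dimension(...).probes(...).window(...)`, followed by `count(tokens)` to obtain the `long[]` vector.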
Copyright © 2015 Nuxeo SA. All rights reserved.