public class HashingVectorizer extends Object implements Serializable
We use a hash representation to keep memory requirements low by avoiding an explicit in-memory map from string bigrams to feature vector indices.
http://hunch.net/~jl/projects/hash_reps/index.html http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters
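
The idea described above can be illustrated with a minimal, self-contained sketch. It is not the source of this class: `HashingSketch`, its default values, the use of `Objects.hash`, and the reading of `window` as the maximum distance between paired tokens are all assumptions made for illustration. Only the method names and signatures mirror the summary on this page.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in (not the Nuxeo class): illustrates the hashing trick only. */
final class HashingSketch {
    private int dim = 1 << 19; // assumed default: number of buckets = vector dimension
    private int probes = 2;    // assumed default: hash functions applied per feature
    private int window = 1;    // assumed default: max distance between paired tokens

    // Chain configuration: each setter returns this so calls can be chained.
    HashingSketch dimension(int dim) { this.dim = dim; return this; }
    HashingSketch probes(int probes) { this.probes = probes; return this; }
    HashingSketch window(int window) { this.window = window; return this; }

    /** Returns a fresh count vector of length dim for the given tokens. */
    long[] count(List<String> tokens) {
        long[] counts = new long[dim];
        addCounts(tokens, counts);
        return counts;
    }

    /** Adds unigram and token-pair counts to an existing vector; no string-to-index map is kept. */
    void addCounts(List<String> tokens, long[] counts) {
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            for (int probe = 0; probe < probes; probe++) {
                counts[hash(token, null, probe)]++; // unigram feature
                // Pair the token with each earlier token at most `window` positions back.
                for (int d = 1; d <= window && d <= i; d++) {
                    counts[hash(token, tokens.get(i - d), probe)]++;
                }
            }
        }
    }

    /** Maps a feature directly to a bucket; the probe number acts as a seed (assumed scheme). */
    int hash(String token, String prevToken, int probe) {
        return Math.floorMod(Objects.hash(prevToken, token, probe), dim);
    }
}
```

Under these assumptions, `new HashingSketch().dimension(16).window(1).probes(2).count(tokens)` returns a `long[]` of length 16 whatever the vocabulary size. Distinct features may share a bucket, which is why small dimensions raise the probability of collisions.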
Modifier and Type | Field and Description |
---|---|
protected int | dim |
protected int | probes |
protected int | window |
Constructor and Description |
---|
HashingVectorizer() |
Modifier and Type | Method and Description |
---|---|
void | addCounts(List<String> tokens, long[] counts) |
long[] | count(List<String> tokens) |
HashingVectorizer | dimension(int dim) Chain configuration of the number of buckets, which is also the number of vector dimensions; small values mean a high probability of collisions. |
protected int | hash(String token, int probe) |
protected int | hash(String token, String prevToken, int probe) |
HashingVectorizer | probes(int probes) Chain configuration of the number of probes, i.e. |
HashingVectorizer | window(int window) Chain configuration of the number of terms to hash together: window = 1 means unigrams and bigrams, window = 3 would add bigrams of distance 2, and so on. |
public HashingVectorizer()
public HashingVectorizer dimension(int dim)
public HashingVectorizer window(int window)
public HashingVectorizer probes(int probes)
Copyright © 2016 Nuxeo SA. All rights reserved.