Nuxeo ECM Projects 5.4.3-SNAPSHOT

org.nuxeo.common.utils
Class FullTextUtils

java.lang.Object
  extended by org.nuxeo.common.utils.FullTextUtils

public class FullTextUtils
extends Object

Utility functions for simple fulltext parsing. They do not try to be exhaustive, but they work for simple cases.


Field Summary
static int MIN_SIZE
static String STOP_WORDS
static Set<String> stopWords
static String UNACCENTED
static Pattern wordPattern
 
Method Summary
static Set<String> parseFullText(String string, boolean removeDiacritics)
          Extracts the words from a string for simple fulltext indexing.
static String parseWord(String string, boolean removeDiacritics)
          Parses a word and returns a simplified lowercase form.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordPattern

public static final Pattern wordPattern

MIN_SIZE

public static final int MIN_SIZE
See Also:
Constant Field Values

STOP_WORDS

public static final String STOP_WORDS
See Also:
Constant Field Values

stopWords

public static final Set<String> stopWords

UNACCENTED

public static final String UNACCENTED
See Also:
Constant Field Values

Method Detail

parseFullText

public static Set<String> parseFullText(String string,
                                        boolean removeDiacritics)
Extracts the words from a string for simple fulltext indexing.

Initial order is kept, but duplicate words are removed.

Short words and stop words are omitted, accents are removed when removeDiacritics is true, and pseudo-stemming is applied.

Parameters:
string - the string to parse
removeDiacritics - true if diacritics must be removed
Returns:
an ordered set of resulting words
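
The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the Nuxeo implementation: the class name ParseFullTextSketch, the separator pattern, the minimum size of 3, the short stop word list, and the "strip one trailing s" rule standing in for pseudo-stemming are all assumptions made for illustration, since this page does not show the real constant values.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the Nuxeo implementation.
class ParseFullTextSketch {

    // Assumed values; the real constants are not shown on this page.
    static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}]+");
    static final int MIN_SIZE = 3;
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "for", "with"));

    static Set<String> parseFullText(String string, boolean removeDiacritics) {
        // LinkedHashSet keeps first-seen order and drops duplicates.
        Set<String> words = new LinkedHashSet<>();
        for (String token : SEPARATORS.split(string)) {
            String word = token.toLowerCase(Locale.ROOT);
            if (removeDiacritics) {
                // Decompose accented letters, then drop the combining marks.
                word = Normalizer.normalize(word, Normalizer.Form.NFD)
                        .replaceAll("\\p{M}", "");
            }
            // Naive pseudo-stemming (assumption): strip one trailing "s".
            if (word.length() > MIN_SIZE && word.endsWith("s")) {
                word = word.substring(0, word.length() - 1);
            }
            if (word.length() < MIN_SIZE || STOP_WORDS.contains(word)) {
                continue; // short word or stop word
            }
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        // U+00E9 is "e" with an acute accent.
        System.out.println(parseFullText("The cats and the cat sat: caf\u00e9s!", true));
        // prints [cat, sat, cafe]
    }
}
```

In this sketch "cats" and "cat" collapse to one entry because the trailing "s" is stripped before the duplicate check, and the result keeps the order in which each word first appeared.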

parseWord

public static String parseWord(String string,
                               boolean removeDiacritics)
Parses a word and returns a simplified lowercase form.

Parameters:
string - the word to parse
removeDiacritics - true if diacritics must be removed
Returns:
the simplified word, or null if it was removed as a stop word or a short word
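
The null-returning contract is easy to overlook, so here is a minimal sketch of a method with the same shape. It is not the Nuxeo implementation: the class name ParseWordSketch, the minimum size of 3, and the stop word list are assumptions made for illustration, and diacritics are removed with java.text.Normalizer rather than with the UNACCENTED table listed above.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration; not the Nuxeo implementation.
class ParseWordSketch {

    // Assumed values; the real constants are not shown on this page.
    static final int MIN_SIZE = 3;
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "for", "with"));

    static String parseWord(String string, boolean removeDiacritics) {
        if (string.length() < MIN_SIZE) {
            return null; // removed as a short word
        }
        String word = string.toLowerCase(Locale.ROOT);
        if (removeDiacritics) {
            // Decompose accented letters, then drop the combining marks.
            word = Normalizer.normalize(word, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
        }
        if (STOP_WORDS.contains(word)) {
            return null; // removed as a stop word
        }
        return word;
    }

    public static void main(String[] args) {
        // U+00E9 is "e" with an acute accent.
        System.out.println(parseWord("Caf\u00e9", true)); // prints cafe
        System.out.println(parseWord("The", true));       // prints null
        System.out.println(parseWord("ok", true));        // prints null
    }
}
```

Callers of a method with this contract need to check for null before using the result, for example before adding the word to an index.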

Copyright © 2011 Nuxeo SAS. All Rights Reserved.