Nuxeo Enterprise Platform 5.4

org.nuxeo.common.utils
Class FullTextUtils

java.lang.Object
  extended by org.nuxeo.common.utils.FullTextUtils

public class FullTextUtils
extends java.lang.Object

Utility functions for simple fulltext parsing. They are not meant to be exhaustive, but they handle simple cases.


Field Summary
static int MIN_SIZE
           
static java.lang.String STOP_WORDS
           
static java.util.Set<java.lang.String> stopWords
           
static java.lang.String UNACCENTED
           
static java.util.regex.Pattern wordPattern
           
 
Method Summary
static java.util.Set<java.lang.String> parseFullText(java.lang.String string, boolean removeDiacritics)
          Extracts the words from a string for simple fulltext indexing.
static java.lang.String parseWord(java.lang.String string, boolean removeDiacritics)
          Parses a word and returns a simplified lowercase form.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordPattern

public static final java.util.regex.Pattern wordPattern

MIN_SIZE

public static final int MIN_SIZE
See Also:
Constant Field Values

STOP_WORDS

public static final java.lang.String STOP_WORDS
See Also:
Constant Field Values

stopWords

public static final java.util.Set<java.lang.String> stopWords

UNACCENTED

public static final java.lang.String UNACCENTED
See Also:
Constant Field Values

Method Detail

parseFullText

public static java.util.Set<java.lang.String> parseFullText(java.lang.String string,
                                                            boolean removeDiacritics)
Extracts the words from a string for simple fulltext indexing.

Initial order is kept, but duplicate words are removed.

Short words and stop words are omitted, accents are removed when removeDiacritics is true, and pseudo-stemming is applied.

Parameters:
string - the string to parse
removeDiacritics - whether diacritics must be removed
Returns:
an ordered set of resulting words
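
The behaviour described above (initial order kept, duplicates removed, short and stop words omitted) can be illustrated with a minimal, self-contained sketch. This is not the Nuxeo implementation: the class FullTextSketch, the method extractWords, the separator pattern, and the values chosen for MIN_SIZE and STOP_WORDS are all assumptions made for illustration, since this page does not list the real constant values. Diacritics removal and pseudo-stemming are left out to keep the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only; NOT the code of org.nuxeo.common.utils.FullTextUtils.
class FullTextSketch {

    // Hypothetical values: the real MIN_SIZE and STOP_WORDS are not shown on this page.
    static final int MIN_SIZE = 3;
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "with"));

    // Assumption: anything that is not a letter or a digit separates words.
    static final Pattern SEPARATORS = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Returns the lowercase words of the text in order of first appearance.
    static Set<String> extractWords(String text) {
        // LinkedHashSet keeps insertion order and silently drops duplicates.
        Set<String> words = new LinkedHashSet<>();
        for (String token : SEPARATORS.split(text)) {
            String word = token.toLowerCase(Locale.ROOT);
            // Skip empty tokens, short words, and stop words.
            if (word.length() < MIN_SIZE || STOP_WORDS.contains(word)) {
                continue;
            }
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [cat, dog, chase]
        System.out.println(extractWords("The cat and the dog chase the cat"));
    }
}
```

An insertion-ordered set is one straightforward way to obtain "an ordered set of resulting words" in a single pass; whether the actual class uses LinkedHashSet is not stated here.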

parseWord

public static java.lang.String parseWord(java.lang.String string,
                                         boolean removeDiacritics)
Parses a word and returns a simplified lowercase form.

Parameters:
string - the word to parse
removeDiacritics - whether diacritics must be removed
Returns:
the simplified word, or null if it was removed as a stop word or a short word
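
The contract above (lowercase form, optional diacritics removal, null for stop or short words) can be sketched as follows. This is a hedged illustration, not the library's code: the class WordSketch, the method simplify, the MIN_SIZE and STOP_WORDS values, and the order of the checks are assumptions. Diacritics are removed here with java.text.Normalizer, which may differ from the technique the real class uses (the UNACCENTED constant suggests a lookup string).

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; NOT the code of org.nuxeo.common.utils.FullTextUtils.
class WordSketch {

    // Hypothetical values: the real constants are not shown on this page.
    static final int MIN_SIZE = 3;
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "des"));

    // Returns a simplified lowercase word, or null if the word is filtered out.
    static String simplify(String word, boolean removeDiacritics) {
        String w = word.toLowerCase(Locale.ROOT);
        if (removeDiacritics) {
            // Decompose accented letters (NFD), then drop the combining marks.
            w = Normalizer.normalize(w, Normalizer.Form.NFD)
                          .replaceAll("\\p{M}+", "");
        }
        // Assumption: filtering happens after normalization.
        if (w.length() < MIN_SIZE || STOP_WORDS.contains(w)) {
            return null;
        }
        return w;
    }

    public static void main(String[] args) {
        String ete = "\u00c9t\u00e9"; // "Été", written with escapes to avoid encoding issues
        System.out.println(simplify(ete, true));   // prints ete
        System.out.println(simplify("The", true)); // prints null
    }
}
```

Callers of a method with this contract need to check for null before using the result, for example before adding the word to an index.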


Copyright © 2010 Nuxeo SAS. All Rights Reserved.