public class DefaultFulltextParser extends Object implements FulltextParser
The regular expression used to split words can be configured with the system property "org.nuxeo.fulltext.wordsplit". The default is "[\\s\\p{Punct}]+".
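The following is a minimal, standalone sketch of how a word-split pattern could be resolved from that system property and applied to a string. It is not the Nuxeo implementation: the class `WordSplitSketch` and its methods are hypothetical, and the constant values are assumed from the property name and default quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Nuxeo API.
class WordSplitSketch {

    // Assumed values, taken from the class description above.
    static final String WORD_SPLIT_PROP = "org.nuxeo.fulltext.wordsplit";
    static final String WORD_SPLIT_DEF = "[\\s\\p{Punct}]+";

    // Compiles the pattern from the system property, falling back to the default.
    static Pattern wordSplitPattern() {
        return Pattern.compile(System.getProperty(WORD_SPLIT_PROP, WORD_SPLIT_DEF));
    }

    // Splits the input on the pattern and drops empty tokens.
    static List<String> split(String s) {
        List<String> words = new ArrayList<>();
        for (String word : wordSplitPattern().split(s)) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello, world!  It's fine."));
    }
}
```

With the default pattern, whitespace and ASCII punctuation both act as separators, so an apostrophe splits "It's" into two tokens. Starting the JVM with `-Dorg.nuxeo.fulltext.wordsplit=\\s+` would, in this sketch, split on whitespace only.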
Modifier and Type | Field and Description |
---|---|
protected static int | HTML_MAGIC_OFFSET |
protected static String | TEXT_HTML |
static String | WORD_SPLIT_DEF |
protected static Pattern | WORD_SPLIT_PATTERN |
static String | WORD_SPLIT_PROP |
Constructor and Description |
---|
DefaultFulltextParser() |
Modifier and Type | Method and Description |
---|---|
String | parse(String s, String path) Parses one property value to normalize the fulltext for the database. |
void | parse(String s, String path, List<String> strings) Parses one property value to normalize the fulltext for the database. |
String | parse(String s, String path, String mimeType, DocumentLocation documentLocation) Parses one property value to normalize the fulltext for the database. |
void | parse(String s, String path, String mimeType, DocumentLocation documentLocation, List<String> strings) Parses one property value to normalize the fulltext for the database. |
protected String | preprocessField(String s, String path, String mimeType) Preprocesses one field at the given path. |
protected String | removeHtml(String s) |
public static final String WORD_SPLIT_PROP
public static final String WORD_SPLIT_DEF
protected static final Pattern WORD_SPLIT_PATTERN
protected static final int HTML_MAGIC_OFFSET
protected static final String TEXT_HTML
public DefaultFulltextParser()
public String parse(String s, String path)
Description copied from interface: FulltextParser
Parses one property value to normalize the fulltext for the database. The passed path may be null if the passed string is not coming from a specific path, for instance when it was extracted from binary data.
Specified by: parse in interface FulltextParser
Parameters:
s - the string to be parsed and normalized
path - the abstracted path for the property (where all complex indexes have been replaced by *), or null
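To illustrate what an "abstracted path" might look like, here is a small standalone sketch that replaces index segments with `*`. It assumes slash-separated property paths in which list indexes appear as purely numeric segments (for example `files/0/file/name`). That path format, the class `AbstractPathSketch`, and the helper `abstractPath` are assumptions made for illustration, not something this page documents.

```java
// Hypothetical illustration only; not part of the Nuxeo API.
class AbstractPathSketch {

    // Replaces every purely numeric segment of a slash-separated path with "*".
    // Returns null for a null path, mirroring the fact that the path may be null.
    static String abstractPath(String path) {
        if (path == null) {
            return null;
        }
        // Limit -1 keeps trailing empty segments so the path shape is preserved.
        String[] segments = path.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            if (segments[i].matches("\\d+")) {
                segments[i] = "*";
            }
        }
        return String.join("/", segments);
    }

    public static void main(String[] args) {
        System.out.println(abstractPath("files/0/file/name")); // files/*/file/name
    }
}
```

Under these assumptions, every element of a list property maps to the same abstracted path, which lets a parser decide how to treat a field without caring which list element it came from.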
public void parse(String s, String path, List<String> strings)
Description copied from interface: FulltextParser
Like FulltextParser.parse(String, String) but uses the passed list to accumulate words.
Specified by: parse in interface FulltextParser
Parameters:
s - the string to be parsed and normalized
path - the abstracted path for the property (where all complex indexes have been replaced by *), or null
strings - the list into which normalized words should be accumulated
public String parse(String s, String path, String mimeType, DocumentLocation documentLocation)
Description copied from interface: FulltextParser
Parses one property value to normalize the fulltext for the database. The passed path may be null if the passed string is not coming from a specific path, for instance when it was extracted from binary data.
Specified by: parse in interface FulltextParser
Parameters:
s - the string to be parsed and normalized
path - the abstracted path for the property (where all complex indexes have been replaced by *), or null
mimeType - the mimeType of the string to be parsed and normalized. This may be null
documentLocation - the documentLocation of the Document from which the property value string was extracted. This may be null
public void parse(String s, String path, String mimeType, DocumentLocation documentLocation, List<String> strings)
Like FulltextParser.parse(String, String) but uses the passed list to accumulate words.
The default implementation normalizes text to lowercase and removes punctuation. The documentLocation parameter is currently unused but has some use cases for potential subclasses.
This can be subclassed.
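The behavior described above (lowercasing, punctuation removal, accumulation into a caller-supplied list) can be sketched without any Nuxeo dependency as follows. This is an approximation for illustration, not the actual implementation: the class `NormalizeSketch` is hypothetical, the path, mimeType and documentLocation parameters are omitted, the use of `Locale.ROOT` is an assumption, and building the String-returning variant by joining the accumulated words with spaces is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Nuxeo API.
class NormalizeSketch {

    // Default split pattern quoted in the class description.
    static final Pattern WORD_SPLIT = Pattern.compile("[\\s\\p{Punct}]+");

    // Accumulating variant: appends each lowercased word to the passed list.
    static void parse(String s, List<String> strings) {
        if (s == null) {
            return;
        }
        for (String word : WORD_SPLIT.split(s)) {
            if (!word.isEmpty()) {
                strings.add(word.toLowerCase(Locale.ROOT));
            }
        }
    }

    // String variant: assumed here to join the accumulated words with spaces.
    static String parse(String s) {
        List<String> strings = new ArrayList<>();
        parse(s, strings);
        return String.join(" ", strings);
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello, World! 42")); // hello world 42
    }
}
```

The accumulating variant lets a caller collect words from several property values into a single list before producing the final fulltext string.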
Specified by: parse in interface FulltextParser
Parameters:
s - the string to be parsed and normalized
path - the abstracted path for the property (where all complex indexes have been replaced by *), or null
mimeType - the mimeType of the string to be parsed and normalized. This may be null
documentLocation - the documentLocation of the Document from which the property value string was extracted. This may be null
strings - the list into which normalized words should be accumulated
protected String preprocessField(String s, String path, String mimeType)
Preprocesses one field at the given path. The path is unused for now.
protected String removeHtml(String s)
Copyright © 2018 Nuxeo. All rights reserved.