org.apache.stanbol.commons.opennlp
Class OpenNLP

java.lang.Object
  extended by org.apache.stanbol.commons.opennlp.OpenNLP

@Service(value=OpenNLP.class)
public class OpenNLP
extends java.lang.Object

Core of our EnhancementEngine, separated from the OSGi service to make it easier to test.


Field Summary
protected  java.util.Map<java.lang.String,java.lang.Object> models
          Map holding the already built models. TODO: change to use a WeakReferenceMap.
 
Constructor Summary
OpenNLP()
          Default constructor
OpenNLP(DataFileProvider dataFileProvider)
          Constructor intended to be used when running outside an OSGi environment (e.g. when used for unit tests).
 
Method Summary
 opennlp.tools.chunker.ChunkerModel getChunkerModel(java.lang.String language)
          Getter for the chunker model for the passed language.
 opennlp.tools.namefind.TokenNameFinderModel getNameModel(java.lang.String type, java.lang.String language)
          Getter for the named entity finder model for the passed entity type and language.
 opennlp.tools.postag.POSModel getPartOfSpeachModel(java.lang.String language)
          Getter for the part-of-speech model for the passed language.
 opennlp.tools.sentdetect.SentenceModel getSentenceModel(java.lang.String language)
          Getter for the sentence detection model of the passed language.
 opennlp.tools.tokenize.Tokenizer getTokenizer(java.lang.String language)
          Getter for the Tokenizer of a given language.
 opennlp.tools.tokenize.TokenizerModel getTokenizerModel(java.lang.String language)
          Getter for the tokenizer model for the passed language.
protected  java.io.InputStream lookupModelStream(java.lang.String modelName, java.util.Map<java.lang.String,java.lang.String> properties)
          Looks up an OpenNLP data file via the dataFileProvider.
protected static java.lang.String removeNonUtf8CompliantCharacters(java.lang.String text)
          Removes non-UTF-8-compliant characters (typically control characters) so as to avoid polluting the annotation graph with snippets that are not serializable as XML.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

models

protected java.util.Map<java.lang.String,java.lang.Object> models
Map holding the already built models. TODO: change to use a WeakReferenceMap.

Constructor Detail

OpenNLP

public OpenNLP()
Default constructor


OpenNLP

public OpenNLP(DataFileProvider dataFileProvider)
Constructor intended to be used when running outside an OSGi environment (e.g. when used for unit tests).

Parameters:
dataFileProvider - the DataFileProvider used to load model data.
Method Detail

getSentenceModel

public opennlp.tools.sentdetect.SentenceModel getSentenceModel(java.lang.String language)
                                                        throws opennlp.tools.util.InvalidFormatException,
                                                               java.io.IOException
Getter for the sentence detection model of the passed language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
java.io.IOException - on any error while reading the model data
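
The "built on first request, then reused" behaviour described for the model getters can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the Stanbol source: ModelSource, LazyModelCache, getModel, build and buildCount are hypothetical names, ModelSource stands in for the DataFileProvider lookup, and a byte array stands in for a real model object.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the DataFileProvider lookup; returns null if no data exist. */
interface ModelSource {
    InputStream open(String modelName) throws IOException;
}

/** Illustrative lazy cache; an assumption about the pattern, not the Stanbol source. */
class LazyModelCache {
    // Mirrors the documented 'models' field: model name -> already built model.
    protected final Map<String, Object> models = new HashMap<>();
    private final ModelSource source;
    int buildCount = 0; // counts builds so that caching can be observed

    LazyModelCache(ModelSource source) {
        this.source = source;
    }

    /** Returns the cached model, builds it on first use, or returns null if no data are found. */
    synchronized Object getModel(String modelName) throws IOException {
        Object model = models.get(modelName);
        if (model != null) {
            return model;
        }
        InputStream in = source.open(modelName);
        if (in == null) {
            return null; // nothing is cached, so a later call looks the data up again
        }
        try (InputStream data = in) {
            model = build(data);
        }
        models.put(modelName, model);
        buildCount++;
        return model;
    }

    // Placeholder for a real model constructor such as new SentenceModel(InputStream).
    private static Object build(InputStream data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        for (int n = data.read(buffer); n != -1; n = data.read(buffer)) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Callers of the real getters should check for a null return before constructing a detector from the model, since null signals that no model data were found for the language.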

getNameModel

public opennlp.tools.namefind.TokenNameFinderModel getNameModel(java.lang.String type,
                                                                java.lang.String language)
                                                         throws opennlp.tools.util.InvalidFormatException,
                                                                java.io.IOException
Getter for the named entity finder model for the passed entity type and language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
type - the type of the named entities to find (e.g. person, organization)
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
java.io.IOException - on any error while reading the model data
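
To make the roles of the type and language parameters concrete, the sketch below derives a data file name from them. The pattern mirrors the naming of the pre-trained models published for OpenNLP 1.5 (for example en-ner-person.bin); whether this class resolves files with exactly that pattern is an assumption, and ModelNames and nameFinderModel are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical helper; the file-name pattern is an assumption, not taken from this class. */
class ModelNames {
    /** Builds a name such as "en-ner-person.bin" from an entity type and a language code. */
    static String nameFinderModel(String type, String language) {
        if (type == null || language == null) {
            throw new IllegalArgumentException("type and language must not be null");
        }
        return String.format("%s-ner-%s.bin",
                language.toLowerCase(Locale.ROOT),
                type.toLowerCase(Locale.ROOT));
    }
}
```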

getTokenizerModel

public opennlp.tools.tokenize.TokenizerModel getTokenizerModel(java.lang.String language)
                                                        throws opennlp.tools.util.InvalidFormatException,
                                                               java.io.IOException
Getter for the tokenizer model for the passed language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
java.io.IOException - on any error while reading the model data

getTokenizer

public opennlp.tools.tokenize.Tokenizer getTokenizer(java.lang.String language)
Getter for the Tokenizer of a given language. This first tries to create a TokenizerME instance if the required TokenizerModel for the passed language is available. If no such model is available, it returns the SimpleTokenizer instance.

Parameters:
language - the language or null to build a SimpleTokenizer
Returns:
the Tokenizer for the passed language.
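
The fallback described above (a model-backed tokenizer when a model exists, otherwise a simple rule-based one) can be sketched as follows. The sketch uses only the standard library: Tok stands in for opennlp.tools.tokenize.Tokenizer, SIMPLE for the SimpleTokenizer instance, and ModelLookup for the model getter; all three are hypothetical names. Treating a failed load like a missing model is an assumption, suggested only by the fact that getTokenizer declares no checked exceptions.

```java
import java.io.IOException;

/** Illustrative fallback logic; a sketch under stated assumptions, not the Stanbol source. */
class TokenizerFallback {
    /** Hypothetical stand-in for opennlp.tools.tokenize.Tokenizer. */
    interface Tok {
        String[] tokenize(String text);
    }

    /** Hypothetical stand-in for the model lookup; returns null if no model exists. */
    interface ModelLookup {
        Tok load(String language) throws IOException;
    }

    // Stand-in for the rule-based tokenizer: splits on whitespace only.
    static final Tok SIMPLE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    private final ModelLookup lookup;

    TokenizerFallback(ModelLookup lookup) {
        this.lookup = lookup;
    }

    /** Returns the model-backed tokenizer if one can be loaded, otherwise SIMPLE. */
    Tok getTokenizer(String language) {
        if (language != null) {
            try {
                Tok modelBacked = lookup.load(language);
                if (modelBacked != null) {
                    return modelBacked;
                }
            } catch (IOException e) {
                // Assumption: a failed load is handled like a missing model.
            }
        }
        return SIMPLE;
    }
}
```

Because a tokenizer is always returned, callers do not need a null check here, unlike with the model getters.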

getPartOfSpeachModel

public opennlp.tools.postag.POSModel getPartOfSpeachModel(java.lang.String language)
                                                   throws java.io.IOException,
                                                          opennlp.tools.util.InvalidFormatException
Getter for the part-of-speech model for the passed language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
java.io.IOException - on any error while reading the model data

getChunkerModel

public opennlp.tools.chunker.ChunkerModel getChunkerModel(java.lang.String language)
                                                   throws opennlp.tools.util.InvalidFormatException,
                                                          java.io.IOException
Getter for the chunker model for the passed language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are present
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
java.io.IOException - on any error while reading the model data

lookupModelStream

protected java.io.InputStream lookupModelStream(java.lang.String modelName,
                                                java.util.Map<java.lang.String,java.lang.String> properties)
                                         throws java.io.IOException
Looks up an OpenNLP data file via the dataFileProvider.

Parameters:
modelName - the name of the model
Returns:
the stream or null if not found
Throws:
java.io.IOException - on any error while opening the model file

removeNonUtf8CompliantCharacters

protected static java.lang.String removeNonUtf8CompliantCharacters(java.lang.String text)
Removes non-UTF-8-compliant characters (typically control characters) so as to avoid polluting the annotation graph with snippets that are not serializable as XML.
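
One way to make text safe for XML serialization is to keep only the code points permitted by the Char production of XML 1.0. The sketch below does that; it is an assumption about a reasonable approach, not the actual body of removeNonUtf8CompliantCharacters, and it drops offending characters, whereas the real method may treat them differently. XmlSafeText, isXml10Char and stripNonXmlChars are hypothetical names.

```java
/** Illustrative filter based on the XML 1.0 Char production; not the Stanbol source. */
class XmlSafeText {
    /** True if the code point is allowed in an XML 1.0 document. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the text without code points that XML 1.0 cannot represent; null stays null. */
    static String stripNonXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Tabs, line feeds and carriage returns survive the filter, while other control characters and unpaired surrogates are dropped.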



Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.