org.apache.stanbol.commons.opennlp
Class OpenNLP

java.lang.Object
  extended by org.apache.stanbol.commons.opennlp.OpenNLP

@Service(value=OpenNLP.class)
public class OpenNLP
extends Object

OSGI service that lets you load OpenNLP models via the Stanbol DataFileProvider infrastructure. This allows users to copy models to the 'datafiles' directory, or developers to provide models via OSGI bundles.

This service also provides methods that directly return the OpenNLP component wrapping the model.


Field Summary
protected  Map<String,Object> models
          Map holding the already built models. TODO: change to use a WeakReferenceMap.
 
Constructor Summary
OpenNLP()
          Default constructor
OpenNLP(org.apache.stanbol.commons.stanboltools.datafileprovider.DataFileProvider dataFileProvider)
          Constructor intended to be used when running outside an OSGI environment (e.g. in unit tests).
 
Method Summary
 opennlp.tools.chunker.Chunker getChunker(String language)
          Getter for the Chunker for a given language.
 opennlp.tools.chunker.ChunkerModel getChunkerModel(String language)
          Getter for the chunker model for the given language.
<T> T getModel(Class<T> modelType, String modelName, Map<String,String> properties)
          Getter for the model with the given type, name and properties.
 opennlp.tools.namefind.TokenNameFinder getNameFinder(String type, String language)
          Getter for the TokenNameFinder for the given entity type and language.
 opennlp.tools.namefind.TokenNameFinderModel getNameModel(String type, String language)
          Getter for the named entity finder model for the given entity type and language.
 opennlp.tools.postag.POSModel getPartOfSpeachModel(String language)
          Getter for the "part-of-speech" model for the given language.
 opennlp.tools.postag.POSTagger getPartOfSpeechTagger(String language)
          Getter for the "part-of-speech" tagger for the given language.
 opennlp.tools.sentdetect.SentenceDetector getSentenceDetector(String language)
          Getter for the sentence detector of the given language.
 opennlp.tools.sentdetect.SentenceModel getSentenceModel(String language)
          Getter for the sentence detection model of the given language.
 opennlp.tools.tokenize.Tokenizer getTokenizer(String language)
          Getter for the Tokenizer of a given language.
 opennlp.tools.tokenize.TokenizerModel getTokenizerModel(String language)
          Getter for the tokenizer model for the given language.
protected  InputStream lookupModelStream(String modelName, Map<String,String> properties)
          Looks up an OpenNLP data file via the DataFileProvider.
protected static String removeNonUtf8CompliantCharacters(String text)
          Remove non UTF-8 compliant characters (typically control characters) so as to avoid polluting the annotation graph with snippets that are not serializable as XML.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

models

protected Map<String,Object> models
Map holding the already built models. TODO: change to use a WeakReferenceMap.
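
The TODO above mentions a WeakReferenceMap. As a rough illustration of that idea only, the sketch below keeps cached values behind weak references so that the cache alone does not keep a model alive. WeakValueCache is a hypothetical helper written for this example; it is not part of Stanbol, and the real field is a plain Map<String,Object>.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration; not a Stanbol or OpenNLP class.
final class WeakValueCache<K, V> {
    private final Map<K, WeakReference<V>> entries = new ConcurrentHashMap<>();

    /** Stores the value; the cache itself holds it only weakly. */
    void put(K key, V value) {
        entries.put(key, new WeakReference<>(value));
    }

    /** Returns the cached value, or null if it was never stored or has been collected. */
    V get(K key) {
        WeakReference<V> ref = entries.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            // The referent was garbage collected: drop the stale entry.
            entries.remove(key, ref);
        }
        return value;
    }
}
```

With such a cache a model stays available for as long as some component still holds a strong reference to it, and may need to be rebuilt afterwards.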

Constructor Detail

OpenNLP

public OpenNLP()
Default constructor


OpenNLP

public OpenNLP(org.apache.stanbol.commons.stanboltools.datafileprovider.DataFileProvider dataFileProvider)
Constructor intended to be used when running outside an OSGI environment (e.g. in unit tests).

Parameters:
dataFileProvider - the dataFileProvider used to load Model data.

Method Detail

getSentenceModel

public opennlp.tools.sentdetect.SentenceModel getSentenceModel(String language)
                                                        throws opennlp.tools.util.InvalidFormatException,
                                                               IOException
Getter for the sentence detection model of the given language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data
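
The model getters are described as returning an already built model when one exists, building a new one from data files otherwise, and returning null when no data are found. A minimal sketch of that load-or-reuse pattern follows. StreamSource, ModelReader and LazyModelCache are hypothetical names introduced for this example; they stand in for the data file lookup and for a model constructor, and are not the Stanbol or OpenNLP types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a data file lookup; not the Stanbol DataFileProvider interface.
interface StreamSource {
    /** Returns a stream for the named file, or null if the file is not available. */
    InputStream open(String name) throws IOException;
}

// Hypothetical stand-in for a model constructor that reads from a stream.
interface ModelReader<T> {
    T read(InputStream in) throws IOException;
}

final class LazyModelCache {
    private final Map<String, Object> models = new HashMap<>();
    private final StreamSource source;
    int loads = 0; // counts how often model data were actually read

    LazyModelCache(StreamSource source) {
        this.source = source;
    }

    /** Returns the cached model, builds it on first use, or returns null if no data exist. */
    synchronized <T> T get(Class<T> type, String name, ModelReader<T> reader) throws IOException {
        Object cached = models.get(name);
        if (cached != null) {
            return type.cast(cached);
        }
        // try-with-resources tolerates a null resource and simply skips close().
        try (InputStream in = source.open(name)) {
            if (in == null) {
                return null; // no model data found
            }
            T model = reader.read(in);
            loads++;
            models.put(name, model);
            return model;
        }
    }
}
```

Callers of such a getter need to check for null before using the result, since a missing data file is reported by the return value rather than by an exception.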

getSentenceDetector

public opennlp.tools.sentdetect.SentenceDetector getSentenceDetector(String language)
                                                              throws IOException
Getter for the sentence detector of the given language.

Parameters:
language - the language
Returns:
the sentence detector or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getNameModel

public opennlp.tools.namefind.TokenNameFinderModel getNameModel(String type,
                                                                String language)
                                                         throws opennlp.tools.util.InvalidFormatException,
                                                                IOException
Getter for the named entity finder model for the given entity type and language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
type - the type of the named entities to find (person, organization)
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getNameFinder

public opennlp.tools.namefind.TokenNameFinder getNameFinder(String type,
                                                            String language)
                                                     throws IOException
Getter for the TokenNameFinder for the given entity type and language.

Parameters:
type - the type of the named entities to find (person, organization)
language - the language
Returns:
the TokenNameFinder or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getTokenizerModel

public opennlp.tools.tokenize.TokenizerModel getTokenizerModel(String language)
                                                        throws opennlp.tools.util.InvalidFormatException,
                                                               IOException
Getter for the tokenizer model for the given language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getTokenizer

public opennlp.tools.tokenize.Tokenizer getTokenizer(String language)
Getter for the Tokenizer of a given language. This first tries to create a TokenizerME instance if the required TokenizerModel for the given language is available. If no such model is available, it returns the SimpleTokenizer instance.

Parameters:
language - the language or null to build a SimpleTokenizer
Returns:
the Tokenizer for the given language.
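
The fallback described above (a model-based tokenizer when a model exists, a rule-based one otherwise) can be sketched as follows. TokenizerSketch and TokenizerFallback are hypothetical names used only for this example. The rule-based splitter is loosely modelled on character-class tokenization and is an assumption, not a reimplementation of OpenNLP's SimpleTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical minimal tokenizer type; OpenNLP's own Tokenizer interface is richer.
interface TokenizerSketch {
    String[] tokenize(String text);
}

final class TokenizerFallback {
    /** Rule-based fallback: splits on whitespace and on changes of character class. */
    static final TokenizerSketch SIMPLE = text -> {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int lastClass = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2 : 3;
            if (cls != lastClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 0) {
                current.append(c);
            }
            lastClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    };

    /** Uses the model-based tokenizer if the lookup finds one, else the simple fallback. */
    static TokenizerSketch forLanguage(String language, Function<String, TokenizerSketch> modelLookup) {
        if (language != null) {
            TokenizerSketch modelBased = modelLookup.apply(language);
            if (modelBased != null) {
                return modelBased;
            }
        }
        return SIMPLE;
    }
}
```

Because a fallback always exists, a getter built this way never returns null, which matches the description of getTokenizer and differs from the other component getters.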

getPartOfSpeachModel

public opennlp.tools.postag.POSModel getPartOfSpeachModel(String language)
                                                   throws IOException,
                                                          opennlp.tools.util.InvalidFormatException
Getter for the "part-of-speech" model for the given language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getPartOfSpeechTagger

public opennlp.tools.postag.POSTagger getPartOfSpeechTagger(String language)
                                                     throws IOException
Getter for the "part-of-speech" tagger for the given language.

Parameters:
language - the language
Returns:
the tagger or null if no model data are found
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getModel

public <T> T getModel(Class<T> modelType,
                      String modelName,
                      Map<String,String> properties)
           throws opennlp.tools.util.InvalidFormatException,
                  IOException
Getter for the model with the given type, name and properties.

Parameters:
modelType - the type of the Model (e.g. ChunkerModel)
modelName - the name of the model file. MUST BE available via the DataFileProvider.
properties - additional properties about the model (passed to the DataFileProvider). Note that "Description", "Model Type" and "Download Location" are set to default values if not defined in the passed value.
Returns:
the loaded (or cached) model
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data
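
The note on the properties parameter says that "Description", "Model Type" and "Download Location" receive default values when the caller does not supply them. A minimal sketch of such defaulting is shown below. ModelProperties is a hypothetical helper, and the default values it writes are placeholders invented for this example; the values actually used by the service are not stated on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Stanbol API.
final class ModelProperties {
    /** Returns a copy of the given properties with the three documented keys filled in. */
    static Map<String, String> withDefaults(Class<?> modelType, String modelName,
                                            Map<String, String> properties) {
        Map<String, String> result = new HashMap<>();
        if (properties != null) {
            result.putAll(properties);
        }
        // Placeholder defaults (assumptions); caller-supplied values are never overwritten.
        result.putIfAbsent("Description", "Model file " + modelName);
        result.putIfAbsent("Model Type", modelType.getSimpleName());
        result.putIfAbsent("Download Location", "unknown");
        return result;
    }
}
```

Copying into a new map leaves the caller's map untouched, so the same properties object can be reused for several lookups.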

getChunkerModel

public opennlp.tools.chunker.ChunkerModel getChunkerModel(String language)
                                                   throws opennlp.tools.util.InvalidFormatException,
                                                          IOException
Getter for the chunker model for the given language. If the model is not yet available, a new one is built. The required data are loaded by using the DataFileProvider service.

Parameters:
language - the language
Returns:
the model or null if no model data are present
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

getChunker

public opennlp.tools.chunker.Chunker getChunker(String language)
                                         throws IOException
Getter for the Chunker for a given language.

Parameters:
language - the language
Returns:
the Chunker or null if no model is present
Throws:
opennlp.tools.util.InvalidFormatException - in case the found model data are in the wrong format
IOException - on any error while reading the model data

lookupModelStream

protected InputStream lookupModelStream(String modelName,
                                        Map<String,String> properties)
                                 throws IOException
Looks up an OpenNLP data file via the DataFileProvider.

Parameters:
modelName - the name of the model
Returns:
the stream or null if not found
Throws:
IOException - on any error while opening the model file

removeNonUtf8CompliantCharacters

protected static String removeNonUtf8CompliantCharacters(String text)
Remove non UTF-8 compliant characters (typically control characters) so as to avoid polluting the annotation graph with snippets that are not serializable as XML.
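
Since the stated goal is text that can be serialized as XML, one way to picture this kind of filtering is to keep only code points allowed by the XML 1.0 Char production. The sketch below does exactly that. XmlText and its methods are hypothetical names for this example, and the rule used here is an assumption; the characters actually removed by removeNonUtf8CompliantCharacters are not specified on this page and may differ.

```java
// Hypothetical helper for illustration; not the Stanbol implementation.
final class XmlText {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the text without code points that XML 1.0 cannot represent; null stays null. */
    static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length());
        // Iterating by code point keeps valid surrogate pairs intact and drops unpaired ones.
        text.codePoints().filter(XmlText::isXmlChar).forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Tabs, line feeds and carriage returns pass through unchanged, while other control characters such as NUL or BEL are dropped.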



Copyright © 2010-2013 The Apache Software Foundation. All Rights Reserved.