org.apache.clerezza.uima.metadatagenerator.mediatype
Class TikaTextExtractor

java.lang.Object
  extended by org.apache.clerezza.uima.metadatagenerator.mediatype.TikaTextExtractor
All Implemented Interfaces:
MediaTypeTextExtractor

public class TikaTextExtractor
extends Object
implements MediaTypeTextExtractor

A MediaTypeTextExtractor implementation based on Apache Tika.

Author:
Davide Palmisano

Constructor Summary
TikaTextExtractor()
          Construct an instance using the default Tika configuration.
TikaTextExtractor(String tikaConfigPath)
          Construct an instance using a custom tika-config.xml configuration file.
 
Method Summary
 String extract(byte[] bytes)
          Extract the text from the provided input if its Media Type is supported.
 boolean supports(javax.ws.rs.core.MediaType mediaType)
          Check if the provided MediaType is supported by this extractor.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TikaTextExtractor

public TikaTextExtractor()
Construct an instance using the default Tika configuration.


TikaTextExtractor

public TikaTextExtractor(String tikaConfigPath)
Construct an instance using a custom tika-config.xml configuration file.

Parameters:
tikaConfigPath - the path to the tika-config.xml configuration file.

Method Detail

supports

public boolean supports(javax.ws.rs.core.MediaType mediaType)
Check if the provided MediaType is supported by this extractor.

Specified by:
supports in interface MediaTypeTextExtractor
Parameters:
mediaType - the MediaType to be checked.
Returns:
true if the provided MediaType is supported.

extract

public String extract(byte[] bytes)
               throws UnsupportedMediaTypeException
Extract the text from the provided input if its Media Type is supported.

Specified by:
extract in interface MediaTypeTextExtractor
Parameters:
bytes - a byte array representing the input.
Returns:
a String with the extracted text.
Throws:
UnsupportedMediaTypeException - if the implicit Media Type of the input is not supported.
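
Usage sketch (illustrative only): the two methods above suggest a check-then-extract calling pattern, with UnsupportedMediaTypeException handled as a fallback. The sketch below shows that pattern using stand-in types so that it compiles without Clerezza, Tika, or JAX-RS on the classpath. Everything in it is an assumption made for illustration: TextExtractor mirrors the two method signatures but takes the media type as a plain String instead of javax.ws.rs.core.MediaType, PlainTextStub is a hypothetical extractor that only handles text/plain, the nested UnsupportedMediaTypeException is a simplified stand-in, and extractOrEmpty is a hypothetical helper. None of these names belong to the documented API.

```java
import java.nio.charset.StandardCharsets;

class ExtractorUsageSketch {

    /** Simplified stand-in for the library's UnsupportedMediaTypeException (assumption). */
    static class UnsupportedMediaTypeException extends Exception {
        UnsupportedMediaTypeException(String message) {
            super(message);
        }
    }

    /**
     * Stand-in mirroring the two methods documented above (assumption).
     * The media type is a plain String here, not javax.ws.rs.core.MediaType.
     */
    interface TextExtractor {
        boolean supports(String mediaType);

        String extract(byte[] bytes) throws UnsupportedMediaTypeException;
    }

    /** Hypothetical stub that only handles text/plain; it does not use Apache Tika. */
    static class PlainTextStub implements TextExtractor {
        @Override
        public boolean supports(String mediaType) {
            return "text/plain".equals(mediaType);
        }

        @Override
        public String extract(byte[] bytes) throws UnsupportedMediaTypeException {
            // The stub treats empty input as undetectable, to exercise the error path.
            if (bytes == null || bytes.length == 0) {
                throw new UnsupportedMediaTypeException("cannot determine the media type of empty input");
            }
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /**
     * Hypothetical helper: ask supports() first, then call extract(),
     * and fall back to an empty string if either step rejects the input.
     */
    static String extractOrEmpty(TextExtractor extractor, String mediaType, byte[] bytes) {
        if (!extractor.supports(mediaType)) {
            return "";
        }
        try {
            return extractor.extract(bytes);
        } catch (UnsupportedMediaTypeException e) {
            return "";
        }
    }
}
```

With the real class, the same pattern would presumably be applied to a TikaTextExtractor instance and a javax.ws.rs.core.MediaType argument. Returning an empty string on failure is only one possible policy; a caller may prefer to propagate the exception.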


Copyright © 2012 The Apache Software Foundation. All Rights Reserved.