org.apache.oodt.cas.metadata.extractors
Class ProdTypePatternMetExtractor

java.lang.Object
  extended by org.apache.oodt.cas.metadata.AbstractMetExtractor
      extended by org.apache.oodt.cas.metadata.extractors.CmdLineMetExtractor
          extended by org.apache.oodt.cas.metadata.extractors.ProdTypePatternMetExtractor
All Implemented Interfaces:
MetExtractor

public class ProdTypePatternMetExtractor
extends CmdLineMetExtractor

Assigns a ProductType based on a filename pattern, while simultaneously assigning values to metadata elements embedded in the filename pattern.

Suppose I have files in the staging area ready to be ingested. These files usually have information encoded in the filename to distinguish the contents of one file from another. For example, book-1234567890.txt might hold the contents of a book with ISBN 1234567890, and page-1234567890-12.txt might hold the text on page 12 of the book with ISBN 1234567890.

It would be useful to generate metadata from the information encoded in the filename (think: filename => metadata). The ProdTypePatternMetExtractor allows this in a flexible manner using regular expressions. Let's take a look at the config file for this met extractor.

 product-type-patterns.xml

 <config>
   <!-- <element> MUST be defined before <product-type> so their patterns can be resolved -->
   <!-- name MUST be an element defined in elements.xml (also only upper and lower case alpha chars) -->
   <!-- regexp MUST be valid input to java.util.regex.Pattern.compile() -->
   <element name="ISBN" regexp="[0-9]{10}"/>
   <element name="Page" regexp="[0-9]*"/>

   <!-- name MUST be a ProductType name defined in product-types.xml -->
   <!-- metadata elements inside brackets MUST be mapped to the ProductType,
        as defined in product-type-element-map.xml -->
   <product-type name="Book" template="book-[ISBN].txt"/>
   <product-type name="BookPage" template="page-[ISBN]-[Page].txt"/>
 </config>
 
 

This file defines a regular expression for the "ISBN" metadata element, in this case, a 10-digit number. Also, the "Page" metadata element is defined as a sequence of 0 or more digits.

Next, the file defines a filename pattern for the "Book" product type. The pattern is compiled into a regular expression, substituting the previously defined regexes as capture groups. For example, "book-[ISBN].txt" compiles to "book-([0-9]{10}).txt", and the ISBN met element is assigned to capture group 1. When a filename matches this pattern, two metadata assignments occur: (1) the ISBN met element is set to the text matched by capture group 1, and (2) the ProductType met element is set to "Book".

Similarly, the second pattern sets ISBN, Page, and ProductType for files matching "page-([0-9]{10})-([0-9]*).txt".
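
The template-to-regex step described above can be sketched in plain Java. This is a minimal illustration, not the extractor's actual source: the class name TemplateRegexSketch and the methods compileTemplate and extract are hypothetical, and the sketch assumes that element regexes contain no capture groups of their own and that literal template text is copied verbatim (so "." stays unescaped, as in the compiled patterns quoted above).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and structure are assumptions, not the OODT implementation.
class TemplateRegexSketch {

    // Matches a bracketed element reference such as [ISBN] (alphabetic names only).
    private static final Pattern ELEMENT_REF = Pattern.compile("\\[([A-Za-z]+)\\]");

    // Replaces each [Name] with "(" + that element's regex + ")" and records
    // the element names in capture-group order.
    static String compileTemplate(String template, Map<String, String> elementRegexes,
                                  List<String> groupOrder) {
        Matcher m = ELEMENT_REF.matcher(template);
        StringBuilder regex = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            String elementRegex = elementRegexes.get(name);
            if (elementRegex == null) {
                throw new IllegalArgumentException("Undefined element: " + name);
            }
            regex.append(template, last, m.start()); // literal text, copied verbatim
            regex.append('(').append(elementRegex).append(')');
            groupOrder.add(name);
            last = m.end();
        }
        regex.append(template.substring(last));
        return regex.toString();
    }

    // Returns the extracted metadata (including ProductType), or null if the
    // filename does not match the template.
    static Map<String, String> extract(String filename, String productType, String template,
                                       Map<String, String> elementRegexes) {
        List<String> groupOrder = new ArrayList<>();
        String regex = compileTemplate(template, elementRegexes, groupOrder);
        Matcher m = Pattern.compile(regex).matcher(filename);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> met = new LinkedHashMap<>();
        for (int i = 0; i < groupOrder.size(); i++) {
            met.put(groupOrder.get(i), m.group(i + 1)); // group numbers start at 1
        }
        met.put("ProductType", productType);
        return met;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("ISBN", "[0-9]{10}");
        elements.put("Page", "[0-9]*");
        System.out.println(extract("page-1234567890-12.txt", "BookPage",
                "page-[ISBN]-[Page].txt", elements));
    }
}
```

In this sketch, a filename that does not match the template yields null, which a caller could use to move on to the next product-type pattern.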

This achieves several things:

  1. assigning met elements based on regular expressions
  2. assigning product type based on an easy-to-understand pattern with met elements clearly indicated
  3. reuse of met element regular expressions

Differences from FilenameTokenMetExtractor:

  1. Allows dynamic length metadata (does not rely on offset and length of metadata)
  2. Assigns ProductType
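
To illustrate the first difference, the following sketch contrasts fixed-position extraction with regex-based extraction of the page number. It does not use FilenameTokenMetExtractor itself; the class DynamicLengthSketch, its methods, and the offset values are hypothetical and stand in for any offset-and-length scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not taken from either extractor's source.
class DynamicLengthSketch {

    // Fixed-position extraction: assumes the page number occupies exactly
    // two characters starting at offset 16 ("page-" + 10 digits + "-").
    static String pageByOffset(String filename) {
        return filename.substring(16, 18);
    }

    // Regex extraction: the page number may be any number of digits.
    static String pageByRegex(String filename) {
        Matcher m = Pattern.compile("page-([0-9]{10})-([0-9]*).txt").matcher(filename);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] names = {
            "page-1234567890-12.txt",  // two-digit page: both approaches agree
            "page-1234567890-7.txt",   // one-digit page: the fixed offset picks up the "."
            "page-1234567890-123.txt"  // three-digit page: the fixed offset truncates
        };
        for (String name : names) {
            System.out.println(name + " offset=" + pageByOffset(name)
                    + " regex=" + pageByRegex(name));
        }
    }
}
```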

Differences from org.apache.oodt.cas.crawl.AutoDetectProductCrawler:

  1. Does not require defining a custom MIME type and MIME-type regex. Really, all you want is to assign a ProductType directly, rather than indirectly by assigning a custom MIME type that maps to a ProductType.

Differences from org.apache.oodt.cas.filemgr.metadata.extractors.examples.FilenameRegexMetExtractor:

  1. Assigns ProductType. FilenameRegexMetExtractor runs after ProductType is already determined.
  2. Runs on the client-side (crawler). FilenameRegexMetExtractor runs on the server-side (filemgr).
  3. Different patterns for different ProductTypes. FilenameRegexMetExtractor config applies the same pattern to all files.

Prerequisites:

  1. <element> tag occurs before <product-type> tag
  2. <element> @name attribute MUST be defined in FileManager policy elements.xml
  3. <element> @regexp attribute MUST be valid input to Pattern.compile(String)
  4. <product-type> @name attribute MUST be a ProductType name (not ID) defined in product-types.xml
  5. met elements used in <product-type> @template attribute MUST be mapped to the ProductType, as defined in product-type-element-map.xml
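
Some of these prerequisites can be checked before a config file is deployed. The sketch below covers only the two that need nothing beyond the JDK: the alphabetic-only naming rule noted in the config comments and prerequisite 3. The class ElementConfigCheckSketch and its methods are hypothetical; checks against elements.xml, product-types.xml, and product-type-element-map.xml depend on the File Manager policy files and are not attempted here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical pre-flight checks for <element> attributes; not part of OODT.
class ElementConfigCheckSketch {

    // Element names may contain only upper- and lower-case alphabetic characters.
    static boolean isValidName(String name) {
        return name != null && name.matches("[A-Za-z]+");
    }

    // The regexp attribute must be accepted by Pattern.compile(String).
    static boolean isValidRegexp(String regexp) {
        if (regexp == null) {
            return false;
        }
        try {
            Pattern.compile(regexp);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidName("ISBN") && isValidRegexp("[0-9]{10}"));
        System.out.println(isValidRegexp("[0-9")); // unclosed character class
    }
}
```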

Words of Caution

Author:
rickdn (Ricky Nguyen)

Field Summary
 
Fields inherited from class org.apache.oodt.cas.metadata.AbstractMetExtractor
config, LOG, reader
 
Constructor Summary
ProdTypePatternMetExtractor()
           
 
Method Summary
protected  Metadata extrMetadata(File file)
          Extracts Metadata from the given File
static void main(String[] args)
           
 
Methods inherited from class org.apache.oodt.cas.metadata.extractors.CmdLineMetExtractor
processMain, processMain
 
Methods inherited from class org.apache.oodt.cas.metadata.AbstractMetExtractor
extractMetadata, extractMetadata, extractMetadata, extractMetadata, extractMetadata, extractMetadata, extractMetadata, setConfigFile, setConfigFile, setConfigFile
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ProdTypePatternMetExtractor

public ProdTypePatternMetExtractor()
Method Detail

extrMetadata

protected Metadata extrMetadata(File file)
                         throws MetExtractionException
Description copied from class: AbstractMetExtractor
Extracts Metadata from the given File

Specified by:
extrMetadata in class AbstractMetExtractor
Parameters:
file - The File from which Metadata will be extracted
Returns:
The Metadata extracted
Throws:
MetExtractionException - If any error occurs

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 1999-2014 Apache OODT. All Rights Reserved.