Ingest Preparation

Part 1 - Ingest preparation

The first part of the project is to create an ingest program that takes 4 classes of data. We have called them Class 1, Class 2, Class 3 and Class 4.

Development for these classes is described below.

Class 01

The Tasks and results for this class are shown below.

Default METS Document


  • Create default METS documents
    • Decide where the data is coming from (default, from JHove or from file?)
    • Place pointers in the METS documents that will be replaced after ingest


Get MARC records from Catalogue


  • Export records from Virtua into MARC ISO format then convert to MARCXML
    • Find unique searchable property from MARC
    • Run VTLS scripts to retrieve MARC records
    • Run MARC4j to convert MARC record to MARC XML record
    • Split MARCXML collection into individual records as 1 MARC record = 1 METS document


  • MARC data for the John Thomas Collection extracted successfully from GEAC.
  • The MARC documents were created from GEAC, so they will have to be updated with VTLS identifiers
  • MARC records split successfully
    • This program is in mets_create/src/uk/org/llgc/utils/
    • Usage: java SplitMarc <Marc File> <Output Directory>
    • It creates files named BIB_ID.xml e.g. LLGCb13538389.xml in the output directory
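
The splitting step can be sketched roughly as follows. This is not the SplitMarc source, only a minimal illustration using the JDK's DOM classes. The class name SplitMarcSketch is hypothetical, and it assumes the BIB_ID is held in controlfield 001 of each MARCXML record, which may not match the real program.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for SplitMarc: one output file per MARCXML <record>.
class SplitMarcSketch {
    static final String MARC_NS = "http://www.loc.gov/MARC21/slim";

    /** Writes each record to pOutDir/<id>.xml and returns the number of records written. */
    static int split(final String pCollectionXml, final File pOutDir) {
        try {
            DocumentBuilderFactory tFactory = DocumentBuilderFactory.newInstance();
            tFactory.setNamespaceAware(true);
            Document tCollection = tFactory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pCollectionXml)));
            NodeList tRecords = tCollection.getElementsByTagNameNS(MARC_NS, "record");
            Transformer tWriter = TransformerFactory.newInstance().newTransformer();
            for (int i = 0; i < tRecords.getLength(); i++) {
                Element tRecord = (Element) tRecords.item(i);
                String tId = findId(tRecord);
                if (tId == null) {
                    throw new IllegalArgumentException("Record " + i + " has no controlfield 001");
                }
                // One MARC record = one file = (later) one METS document
                tWriter.transform(new DOMSource(tRecord),
                        new StreamResult(new File(pOutDir, tId + ".xml")));
            }
            return tRecords.getLength();
        } catch (RuntimeException tException) {
            throw tException;
        } catch (Exception tException) {
            throw new IllegalStateException("Could not split MARCXML collection", tException);
        }
    }

    // Assumption: the BIB_ID used for the file name is the value of controlfield 001.
    private static String findId(final Element pRecord) {
        NodeList tControls = pRecord.getElementsByTagNameNS(MARC_NS, "controlfield");
        for (int j = 0; j < tControls.getLength(); j++) {
            Element tField = (Element) tControls.item(j);
            if ("001".equals(tField.getAttribute("tag"))) {
                return tField.getTextContent().trim();
            }
        }
        return null;
    }
}
```

Calling SplitMarcSketch.split(xml, new File("out")) on a collection of two records would leave two files in the out directory, each named after its record identifier.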

Create 'Ingest METS' documents


  • Create ingest METS documents
    • Inputs:
      • MARCXML Record
      • Default METS document
    • Output:
      • Ingest ready METS document 1 per object


Created the following packages:

  • - Handles checksumming objects
  • - Creates a JHove metadata wrapper for the JHove application
    • - Main class which builds the ingest METS
    • - Contains a collection of MARC records
    • - Contains a MARC record with convenience methods for retrieving properties from MARC XML
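
As an illustration of what the checksumming package has to produce, here is a minimal sketch that computes a hex digest in pure Java. Note that this swaps in java.security.MessageDigest for illustration, whereas the project configuration calls external command-line tools. The class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the project's checksum package.
class ChecksumSketch {
    /** Returns the lower-case hex digest of pData for the named algorithm, e.g. "MD5" or "SHA-1". */
    static String hexDigest(final String pAlgorithm, final byte[] pData) {
        try {
            MessageDigest tDigest = MessageDigest.getInstance(pAlgorithm);
            StringBuilder tHex = new StringBuilder();
            for (byte tByte : tDigest.digest(pData)) {
                // Two hex characters per byte, zero padded
                tHex.append(String.format("%02x", tByte));
            }
            return tHex.toString();
        } catch (NoSuchAlgorithmException tException) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + pAlgorithm, tException);
        }
    }
}
```

The resulting strings are what would be recorded against each datastream in the METS, one per algorithm.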

Ingest METS documents into Fedora


  • Ingest METS document into Fedora
    • Add datastream in Fedora for each Image datastream in METS
    • Copy Dublin Core to Dublin Core datastream in object and replace with a pointer
    • Replace certain attributes of the METS document with the handle of the object e.g. the attribute OBJID in the parent METS element.
    • Create a RELS-EXT datastream with the OAI ID for OAI harvesting and relate object to collections
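
For the last step, a RELS-EXT datastream might look something like the output of the sketch below. It is a guess at the shape rather than the Bridge Project code. The class name is hypothetical, and the use of oai:itemID and rel:isMemberOf follows the usual Fedora conventions as far as I understand them. The PIDs in the usage note are made up.

```java
// Hypothetical builder for a RELS-EXT RDF document; not the ingest code itself.
class RelsExtSketch {
    /** Builds RELS-EXT RDF relating pPid to a collection and carrying its OAI identifier. */
    static String build(final String pPid, final String pOaiId, final String pCollectionPid) {
        return "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n"
             + "         xmlns:rel=\"info:fedora/fedora-system:def/relations-external#\"\n"
             + "         xmlns:oai=\"http://www.openarchives.org/OAI/2.0/\">\n"
             + "  <rdf:Description rdf:about=\"info:fedora/" + escape(pPid) + "\">\n"
             + "    <oai:itemID>" + escape(pOaiId) + "</oai:itemID>\n"
             + "    <rel:isMemberOf rdf:resource=\"info:fedora/" + escape(pCollectionPid) + "\"/>\n"
             + "  </rdf:Description>\n"
             + "</rdf:RDF>\n";
    }

    // Minimal XML escaping so identifiers cannot break the markup.
    private static String escape(final String pValue) {
        return pValue.replace("&", "&amp;").replace("<", "&lt;")
                     .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

For example, RelsExtSketch.build("example:1", "oai:example.org:example:1", "collection:example") gives one rdf:Description about info:fedora/example:1 that is a member of info:fedora/collection:example.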


The ingest mostly reuses existing code from the Bridge Project for ingesting a METS document into Fedora. Because each METS document has some unique features, the repository bridge ingest mechanism was designed to be as flexible as possible so that each METS document can be handled differently. The advantage of using this program is that, once it has been set up, it handles all the ingest processes.

The main directory structure is shown below:

 uk/org/llgc/fedora/ingestMets  -- Main files (Explained Below)
 uk/org/llgc/fedora/junit  -- This package tests the objects in Fedora to see if they have been
                              added correctly and the METS is correctly updated
 uk/org/llgc/fedora/metadata  -- Object which handles creating the Dublin Core datastream from the XML contained in the ingest METS
 uk/org/llgc/fedora/mets -- Contains helper classes which help link objects to disseminators

uk/org/llgc/fedora/ingestMets/ -- Main Class which implements the bridge project Handler Adapter and should be able to handle all Class 1 Objects

uk/org/llgc/fedora/ingestMets/utils/ -- A helper class which allows you to pass in a directory containing METS documents and ingests them all

Create Simple Dissemination Programs

Simple dissemination programs need to be created to allow users to check the data in Fedora. These disseminators should be generic enough to work with any collection and object. The three disseminators I created were:

Show Collection:

This displays a collection's Dublin Core and gives a link to all the objects in the collection. This can be seen here. It could be assigned to any collection whose objects have the showObject disseminator.

Show Object:

This displays an object's Dublin Core and allows someone to look at all the datastreams of an object. This can be seen by clicking on one of the links in the show collection disseminator but an example is here. This disseminator requires each object to have the getFullMets disseminator.

Get Full METS:

When the METS document is stored in the repository, certain datastreams are extracted from it and a pointer is left in their place. This disseminator pulls all the datastreams back together into a METS document on dissemination. Currently it only pulls in the Dublin Core, which is stored in a separate datastream for OAI-PMH harvesting (and Fedora requires it). In future this disseminator could also pull in the rights metadata.

Issues still needing to be addressed

  • METS Rights needs to be decided
    • A conversion from METS Rights to Fedora rights needs to be created
  • PREMIS dictionary needs to be discussed
    • List of actions used e.g. Object Ingested and Object Checksummed
  • DC Legacy data needs to be converted from DOC format to DC for inclusion in the METS

To create the METS Documents it took UNKNOWN

To ingest the METS documents it took about 8 hours

Class 02

The Tasks and results for this class are shown below.

Default METS Document


  • Create default METS documents
    • Decide where the data is coming from (default, from JHove or from file?)
    • Place pointers in the METS documents that will be replaced after ingest
  • Create Parent METS with a link in the structural map to point to the location of the child METS documents
  • Create Child METS which will be the same as the Class01 objects


Get MARC records from Catalogue


  • Export records from Virtua into MARC ISO format then convert to MARCXML
    • Find unique searchable property from MARC
    • Run VTLS scripts to retrieve MARC records
    • Run MARC4j to convert MARC record to MARC XML record
    • Split MARCXML collection into individual records as 1 MARC record = 1 METS document


  • MARC data for the Geoff Charles Collection extracted successfully from GEAC.
  • The MARC documents were created from GEAC, so they will have to be updated with VTLS identifiers
  • MARC records split successfully
    • This program is in mets_create/src/uk/org/llgc/utils/
    • Usage: java SplitMarc <Marc File> <Output Directory>
    • It creates files named BIB_ID.xml e.g. LLGCb13538389.xml in the output directory

Create 'Ingest METS' documents


  • Create ingest METS documents
    • Inputs:
      • MARCXML Record
      • Default Parent METS document
      • Default Child METS document (Same as Class01 METS document)
    • Output:
      • Ingest ready METS document 1 Parent per MARC record and 1 Child document per image


The METS creation functionality had to be rewritten. It is now split into components that add information to the default METS from the MARC record (METSEnhancers) and components that process the datastreams (METSProcessors). To be a METSEnhancer, an object must implement the interface and the following methods:

 public void initalize(final XMLProperties pProps) throws IOException;
 public Document getMets();
 public void setMets(final Document pMets);
 public void process(final MARCRecord pMARCRecord) throws JDOMException, IOException;

The main program calls the methods in the following order: initalize, setMets, process and getMets.

The initalize method is passed an XML tree of the METSEnhancer's individual properties (see the properties explained below). It gives the object a chance to set up its attributes before the process method is called.

Once the METS document has been passed to the METSEnhancer and process has been called the result should be accessible from the getMets method.
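
The call order can be sketched with simplified stand-in types. This is only an illustration: a Map of root attributes stands in for the JDOM Document, plain Maps stand in for XMLProperties and MARCRecord, and all three type names below are hypothetical. Passing the output of one enhancer on to the next is also an assumption about how the main program chains them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for METSEnhancer: a Map of root attributes replaces
// org.jdom.Document, and plain Maps replace XMLProperties and MARCRecord.
interface SketchEnhancer {
    void initalize(Map<String, String> pProps);
    Map<String, String> getMets();
    void setMets(Map<String, String> pMets);
    void process(Map<String, String> pMarcRecord);
}

// Copies the MARC ID and label onto the METS root, like a header enhancer would.
class HeaderSketch implements SketchEnhancer {
    private Map<String, String> _mets;
    public void initalize(final Map<String, String> pProps) { /* nothing to configure */ }
    public Map<String, String> getMets() { return _mets; }
    public void setMets(final Map<String, String> pMets) { _mets = pMets; }
    public void process(final Map<String, String> pMarcRecord) {
        _mets.put("ID", pMarcRecord.get("id"));
        _mets.put("LABEL", pMarcRecord.get("label"));
    }
}

class EnhancerChainSketch {
    /** Runs every enhancer in order and returns the enhanced METS stand-in. */
    static Map<String, String> run(final List<SketchEnhancer> pEnhancers,
            final Map<String, String> pDefaultMets, final Map<String, String> pMarcRecord) {
        Map<String, String> tMets = pDefaultMets;
        for (SketchEnhancer tEnhancer : pEnhancers) {
            tEnhancer.initalize(new HashMap<String, String>()); // 1. per-enhancer properties
            tEnhancer.setMets(tMets);                           // 2. hand over the current METS
            tEnhancer.process(pMarcRecord);                     // 3. copy values from the MARC record
            tMets = tEnhancer.getMets();                        // 4. collect the result
        }
        return tMets;
    }
}
```

Keeping the four calls in one loop means a new enhancer only has to implement the interface; the driver does not need to change.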

The process method gets the properties from the MARC record and puts them in the METS document. An example of a METSEnhancer is shown below:

 import java.io.IOException;
 import org.jdom.Document;
 import org.jdom.Element;
 import org.jdom.JDOMException;

 public class HeaderEnhancer implements METSEnhancer {
        protected Document _mets = null;

        public HeaderEnhancer() {
        }

        public void initalize(final XMLProperties pProps) throws IOException {
               // This enhancer has no properties to set up
        }

        /**
         * Get mets.
         * @return mets as Document.
         */
        public Document getMets() {
            return _mets;
        }

        /**
         * Set mets.
         * @param pMets the value to set.
         */
        public void setMets(final Document pMets) {
             _mets = pMets;
        }

        public void process(final MARCRecord pMARCRecord) throws JDOMException {
                // Process Header
                Element tRoot = this.getMets().getRootElement();
                tRoot.setAttribute("ID", pMARCRecord.getID());
                tRoot.setAttribute("LABEL", pMARCRecord.getLabel());
        }
 }

Other METS enhancers include:

  • - Places a Dublin Core datastream in the METS; the values are taken from the MARC record using the Library of Congress MARC to DC XSL stylesheet conversion
  • - Places the location of the MARC record from Virtua into the METS record
  • - Converts some of the MARC attributes to the MODS record
  • - Adds some PREMIS metadata from the MARC record

METS Processors act in the same way. The methods that need to be implemented are as follows:

 public void initalize(final XMLProperties pProps) throws IOException;
 public Document getMets();
 public void setMets(final Document pMets);
 public void process(final HashMap pDatastreams) throws JDOMException, IOException;

As you can see, the method names are exactly the same as those of the METSEnhancer; the only difference is that a HashMap, pDatastreams, is passed to process instead of the MARC record. This hash contains the following key/value pairs (example URLs only):

 Key => Value
 archive =>
 reference =>
 thumbnail =>

The keys correspond to the USE attribute in the METS file sections and the URLs point to the actual image locations. A simple example of a METSProcessor is one which adds the MIME type for the reference and thumbnail images rather than running them through JHove.

 public void process(final HashMap pDatastreams) throws JDOMException, IOException {
   Namespace METS = Namespace.getNamespace("mets", "");
   Namespace XLINK = Namespace.getNamespace("xlink", "");
   // Assumption: both namespaces are registered for the XPath lookup below
   List tMETSandXLINKNS = new ArrayList();
   tMETSandXLINKNS.add(METS);
   tMETSandXLINKNS.add(XLINK);
   if (_mimeType == null) {
     throw new IllegalStateException("You must call initalize(XMLProperties) before calling process");
   }
   // Now set links
   Element tFileSec = XMLUtilities.getXPathEl(this.getMets(), "//METS:fileSec", METS);
   Iterator tFiles = tFileSec.getChildren().iterator();
   Element tFile = null;
   Attribute tLink = null;
   while (tFiles.hasNext()) {
     tFile = (Element) tFiles.next();
     tLink = XMLUtilities.getXPathAttribute(tFile, "./METS:file/METS:FLocat/@xlink:href", tMETSandXLINKNS);
     // TODO Maybe these should be run through JHove to discover the MIME type
     tFile.getChild("file", METS).setAttribute("MIMETYPE", _mimeType.getMimeType((String)pDatastreams.get(tFile.getAttributeValue("USE"))));
     /**/_logger.debug("Use is " + tFile.getAttribute("USE"));
   }
 }
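
The _mimeType helper used by this processor is not shown on this page. One way such a lookup could work is by file extension, as in the sketch below. The class name, the extension table and the fallback type are all assumptions, not the project's implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical MIME type lookup keyed on the file extension of a datastream URL.
class MimeTypeSketch {
    private final Map<String, String> _types = new HashMap<String, String>();

    MimeTypeSketch() {
        _types.put("tif", "image/tiff");
        _types.put("tiff", "image/tiff");
        _types.put("jpg", "image/jpeg");
        _types.put("jpeg", "image/jpeg");
        _types.put("png", "image/png");
    }

    /** Returns the MIME type for pUrl, or application/octet-stream if the extension is unknown. */
    String getMimeType(final String pUrl) {
        if (pUrl == null) {
            throw new IllegalArgumentException("No URL supplied for this datastream");
        }
        int tDot = pUrl.lastIndexOf('.');
        String tExtension = tDot < 0 ? "" : pUrl.substring(tDot + 1).toLowerCase(Locale.ROOT);
        String tType = _types.get(tExtension);
        return tType == null ? "application/octet-stream" : tType;
    }
}
```

A table like this is cheap but only as good as the file names; running the images through JHove, as the TODO suggests, would identify the format from the content instead.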

Which METSEnhancers are assigned to the METS document is controlled by the configuration file.

 <?xml version="1.0" encoding="UTF-8"?>
 <CONFIG xmlns:METS="" xmlns:premis="">
               <JHOVE class="">
                                       <METS:digiprovMD ID="**FILL_ME**">
                                               <METS:mdWrap MDTYPE="PREMIS">
                                                                                <premis:eventIdentifierType>WlAbNL</premis:eventIdentifierType>
                                                                                <premis:eventIdentifierValue>FORMAT_VALIDATION-001</premis:eventIdentifierValue>
                                                                                <premis:agentIdentifierType>WlAbNL</premis:agentIdentifierType>
                                                                                <premis:agentIdentifierValue></premis:agentIdentifierValue>
                                       <METS:digiprovMD ID="**FILL_ME**">
                                               <METS:mdWrap MDTYPE="PREMIS">
                                                                                <premis:agentIdentifierType>WlAbNL</premis:agentIdentifierType>
                                                                                <premis:agentIdentifierValue></premis:agentIdentifierValue>
                                                                       <premis:agentName>Jhove version 1.0</premis:agentName>
               <CHECKSUMS class="" generator="">
                       <command type="md5">/usr/bin/md5sum</command>
                       <command type="sha">/usr/bin/gpg --print-md sha1</command>
                                       <METS:digiprovMD ID="**FILL_ME**">
                                               <METS:mdWrap MDTYPE="PREMIS">
                                                                                <premis:eventIdentifierType>WlAbNL</premis:eventIdentifierType>
                                                                                <premis:eventIdentifierValue>MESSAGE_DIGEST_CALCULATION-001</premis:eventIdentifierValue>
                                                                       <premis:eventType>message digest calculation</premis:eventType>
                                       <METS:digiprovMD ID="**FILL_ME**">
                                               <METS:mdWrap MDTYPE="PREMIS">
                                                                       <premis:agentName>UNIX checksum tools (/usr/bin/md5sum, /usr/bin/gpg --print-md sha1), see</premis:agentName>
               <ASSIGN_DATASTREAM_URLS class="">
               <HEADER class="" />
               <DUBLIN_CORE class="">
               <MODS_ENHANCER class="" />
               <MARC_POINTER class="">
               <PREMIS_IDENTIFIERS class="" />

The HeaderEnhancer class adds the attribute ID to the root of the METS document and places the ID from the MARC record into it. The label from the MARC record is then placed in the LABEL attribute on the root of the METS document.
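
To show how a configuration file like this can drive the choice of classes, the sketch below reads each child element of CONFIG and its class attribute, in document order. It is a guess at the mechanism rather than the project's code. The class name ConfigSketch and the class names in the usage note are hypothetical, since the real class attribute values are not shown here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical reader for the CONFIG file: element name -> value of its class attribute.
class ConfigSketch {
    /** Maps each child element of the root to its class attribute, keeping document order. */
    static Map<String, String> handlerClasses(final String pConfigXml) {
        try {
            Element tRoot = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pConfigXml)))
                    .getDocumentElement();
            Map<String, String> tHandlers = new LinkedHashMap<String, String>();
            for (Node tNode = tRoot.getFirstChild(); tNode != null; tNode = tNode.getNextSibling()) {
                if (tNode instanceof Element) {
                    Element tChild = (Element) tNode;
                    tHandlers.put(tChild.getTagName(), tChild.getAttribute("class"));
                }
            }
            return tHandlers;
        } catch (Exception tException) {
            throw new IllegalArgumentException("Could not parse configuration", tException);
        }
    }
}
```

Given a CONFIG with children HEADER (class="example.HeaderEnhancer") and MODS_ENHANCER (class="example.ModsEnhancer"), the map holds those two entries in that order. The main program could then create each object with Class.forName(name).getDeclaredConstructor().newInstance() and pass the child element to initalize as its properties.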

Created the following packages:

  • - Handles checksumming objects
  • - Creates a JHove metadata wrapper for the JHove application
    • - Main class which builds the ingest METS
    • - Contains a collection of MARC records
    • - Contains a MARC record with convenience methods for retrieving properties from MARC XML

Ingest METS documents into Fedora


  • Ingest METS document into Fedora
    • Add datastream in Fedora for each Image datastream in METS
    • Copy Dublin Core to Dublin Core datastream in object and replace with a pointer
    • Replace certain attributes of the METS document with the handle of the object e.g. the attribute OBJID in the parent METS element.
    • Create a RELS-EXT datastream with the OAI ID for OAI harvesting and relate object to collections


The ingest mostly reuses existing code from the Bridge Project for ingesting a METS document into Fedora. Because each METS document has some unique features, the repository bridge ingest mechanism was designed to be as flexible as possible so that each METS document can be handled differently. The advantage of using this program is that, once it has been set up, it handles all the ingest processes.

The main directory structure is shown below:

 uk/org/llgc/fedora/ingestMets  -- Main files (Explained Below)
 uk/org/llgc/fedora/junit  -- This package tests the objects in Fedora to see if they have been
                              added correctly and the METS is correctly updated
 uk/org/llgc/fedora/metadata  -- Object which handles creating the Dublin Core datastream from the XML contained in the ingest METS
 uk/org/llgc/fedora/mets -- Contains helper classes which help link objects to disseminators

uk/org/llgc/fedora/ingestMets/ -- Main Class which implements the bridge project Handler Adapter and should be able to handle all Class 1 Objects

uk/org/llgc/fedora/ingestMets/utils/ -- A helper class which allows you to pass in a directory containing METS documents and ingests them all

Create Simple Dissemination Programs

Simple dissemination programs need to be created to allow users to check the data in Fedora. These disseminators should be generic enough to work with any collection and object. The three disseminators I created were:

Show Collection:

This displays a collection's Dublin Core and gives a link to all the objects in the collection. This can be seen here. It could be assigned to any collection whose objects have the showObject disseminator.

Show Object:

This displays an object's Dublin Core and allows someone to look at all the datastreams of an object. This can be seen by clicking on one of the links in the show collection disseminator but an example is here. This disseminator requires each object to have the getFullMets disseminator.

Get Full METS:

When the METS document is stored in the repository, certain datastreams are extracted from it and a pointer is left in their place. This disseminator pulls all the datastreams back together into a METS document on dissemination. Currently it only pulls in the Dublin Core, which is stored in a separate datastream for OAI-PMH harvesting (and Fedora requires it). In future this disseminator could also pull in the rights metadata.

Issues still needing to be addressed

  • METS Rights needs to be decided
    • A conversion from METS Rights to Fedora rights needs to be created
  • PREMIS dictionary needs to be discussed
    • List of actions used e.g. Object Ingested and Object Checksummed
  • DC Legacy data needs to be converted from DOC format to DC for inclusion in the METS

To create the METS Documents it took UNKNOWN

To ingest the METS documents it took about 8 hours

Class 03

We are considering changing child objects to be part of a collection. For example, in the Geoff Charles collection only the parent objects are part of collection:geoff_charles; maybe all the images should be as well. This would allow easier searching of children using fedoragsearch.

The following XSL is present in the fedoragsearch file demoFoxmlToLucene.xslt:

 <xsl:if test="substring(/foxml:digitalObject/foxml:objectProperties/foxml:extproperty[@NAME='']/@VALUE, 1, 3)='gch'">
   <xsl:if test="/foxml:digitalObject/foxml:objectProperties/foxml:property[@NAME='info:fedora/fedora-system:def/model#contentModel' and @VALUE='METS-VITAL01']">
     <xsl:apply-templates mode="activeDemoFedoraObject"/>
   </xsl:if>
 </xsl:if>

It would be better if we could check the collection in the RELS-EXT datastream rather than having to check whether the nlw_id starts with gch.

This would mean a change to the code which retrieves collections from:

 select $member $title from <#ri> where 
 $member <fedora-rels-ext:isMemberOf> <info:fedora/collection:digital_books> 
 and $member <dc:title> $title

to:


 select $member $title from <#ri> where 
 $member <fedora-rels-ext:isMemberOf> <info:fedora/collection:digital_books> 
 and $member <fedora-model:contentModel> 'METS-VITAL03-Parent'
 and $member <dc:title> $title
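
If the query is built in code, the optional content model restriction could be handled by a small helper like the one below. This is only a sketch: the class and method names are hypothetical and it is not the existing collection retrieval code.

```java
// Hypothetical builder for the collection member query, with an optional content model filter.
class CollectionQuerySketch {
    /**
     * Builds the query for the members of pCollectionPid. When pContentModel is
     * not null, only objects with that content model are returned.
     */
    static String membersQuery(final String pCollectionPid, final String pContentModel) {
        StringBuilder tQuery = new StringBuilder();
        tQuery.append("select $member $title from <#ri> where\n");
        tQuery.append("$member <fedora-rels-ext:isMemberOf> <info:fedora/")
              .append(pCollectionPid).append(">\n");
        if (pContentModel != null) {
            // Restrict to parents so that child images in the collection are skipped
            tQuery.append("and $member <fedora-model:contentModel> '")
                  .append(pContentModel).append("'\n");
        }
        tQuery.append("and $member <dc:title> $title");
        return tQuery.toString();
    }
}
```

Calling membersQuery("collection:digital_books", "METS-VITAL03-Parent") gives the second query above, and passing null for the content model gives the first.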

Class 04