IIIF Newspapers
IIIF (the International Image Interoperability Framework) is a set of APIs that give access to the rich metadata that drives some of NLW's digital resources. We hope this implementation of IIIF for Newspapers will be replicated by other IIIF institutions, so that code developed against the Welsh Newspapers will also work with other sites. Note that the data for the NLW Newspapers is limited to items published before 1890 due to copyright restrictions.
Structure
The Newspaper website and its IIIF representation are split into four levels:
Title
e.g. The Cambrian News. Contains descriptive data about the title e.g. name, place of publication, publication frequency, subjects etc.
http://newspapers.library.wales/browse/3320639
Where
- 3320639 is the PID (Persistent Identifier) for the Title
Equivalent IIIF information: http://dams.llgc.org.uk/iiif/newspapers/3320639.json
Issue
The physical thing you would buy or read. Contains date published (yyyy-mm-dd) and possibly a note saying if it is a supplement to the main paper.
http://newspapers.library.wales/view/3320860
where
- 3320860 is the PID for an Issue
Equivalent IIIF information: http://dams.llgc.org.uk/iiif/newspaper/issue/3320860/manifest.json
Article
e.g. Advertisement or the stories of the day. Note:
- Articles can link across pages
- Articles are made up of rectangular boxes on a page
- Articles don't cross issues
- Adverts tend to be grouped together into one article
- Articles have types e.g. Advertisement, news, notices etc.
http://newspapers.library.wales/view/3320860/3320863/7
Where
- 3320860 is the PID of the issue
- 3320863 is the PID of the page
- 7 is the 7th article of issue 3320860
Article text: http://dams.llgc.org.uk/iiif/3320863/annotation/list/ART7.json
Where
- 3320863 is the PID of the page
- ART7 is the 7th article of issue 3320860
Page
A single side of a newspaper page. Each page has a label, usually [1], [2], [3] etc., denoting the page number.
http://newspapers.library.wales/view/3320860/3320861
where
- 3320860 is the PID for an Issue
- 3320861 is the PID for a page
Search Result
For information, the structure of a search result URL for a page is:
http://newspapers.library.wales/view/3320860/3320863/7/SHIP%20NEWS.%20SMTANSEA
where:
- 3320860 is the PID of the issue
- 3320863 is the PID of the page
- 7 is the 7th article of issue 3320860
- SHIP%20NEWS.%20SMTANSEA is the text searched for
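The URL patterns listed in this section can be collected into a small helper. The following is a minimal sketch, assuming the patterns shown above hold for every PID; the class name `NewspaperUrls` and its method names are illustrative and not part of any NLW library.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds website and IIIF URLs from PIDs,
// following the URL patterns described in this section.
final class NewspaperUrls {
    private static final String SITE = "http://newspapers.library.wales";
    private static final String IIIF = "http://dams.llgc.org.uk/iiif";

    // IIIF collection for a title PID
    static String collection(String titlePid) {
        return IIIF + "/newspapers/" + titlePid + ".json";
    }

    // IIIF manifest for an issue PID
    static String manifest(String issuePid) {
        return IIIF + "/newspaper/issue/" + issuePid + "/manifest.json";
    }

    // Annotation list (article text) for the n-th article, addressed via a page PID
    static String articleText(String pagePid, int articleNo) {
        return IIIF + "/" + pagePid + "/annotation/list/ART" + articleNo + ".json";
    }

    // Website URL for an article within an issue and page
    static String articlePage(String issuePid, String pagePid, int articleNo) {
        return SITE + "/view/" + issuePid + "/" + pagePid + "/" + articleNo;
    }

    // Website URL for a search result; spaces are written as %20 as in the example above
    static String searchResult(String issuePid, String pagePid, int articleNo, String text) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
        return articlePage(issuePid, pagePid, articleNo) + "/" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(manifest("3320860"));
        System.out.println(searchResult("3320860", "3320863", 7, "SHIP NEWS. SMTANSEA"));
    }
}
```

For example, `NewspaperUrls.articleText("3320863", 7)` gives the annotation list URL for article 7 shown earlier.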
IIIF
Full details on the IIIF standard can be found at http://iiif.io/. It is made up of two standards: the Image API and the Presentation API. Most of this article discusses the Presentation API, which gives access to the metadata, structure and OCR of the Newspapers. The IIIF implementation is split into the following files:
IIIF Collection - equivalent of a Newspaper Title
http://dams.llgc.org.uk/iiif/newspapers/3320639.json
Includes the following sections:
- metadata - metadata about the title in a list of Key Value pairs
{
  "label": [                     -- this is the descriptive text to show for this field
    {
      "@value": "Frequency",     -- this is the English descriptive text to use
      "@language": "en"
    },
    {
      "@value": "Amlder",        -- this is the Welsh descriptive text to use
      "@language": "cy-GB"
    }
  ],
  "value": "Weekly"              -- this is the value of the field
}
In a viewer, the above may be displayed as:
Frequency: Weekly
if the viewer is set to English.
- manifests - a list of manifests or issues that are part of this title.
{
  "@id": "http://dams.llgc.org.uk/iiif/newspaper/issue/3320860/manifest.json",   -- this is the link to the manifest/issue
  "@type": "sc:Manifest",
  "navDate": "1804-12-01T00:00:00Z",   -- this is the issue date
  "label": "Cambrian (1804-12-01)"     -- this is the text to show for this issue (if showing on a webpage as a list of issues)
}
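Two things a client typically does with a collection are picking the label in the viewer's language and turning navDate into a date. The sketch below assumes the label list has already been parsed into Java Lists and Maps; the class `CollectionFields` and its methods are hypothetical names used only for illustration, and the fallback to the first label is a design choice of this sketch rather than something the collection prescribes.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;

// Hypothetical helper for fields found in the collection JSON.
final class CollectionFields {

    // Returns the @value whose @language equals the requested language or
    // starts with it (so "cy" matches "cy-GB"); falls back to the first label.
    static String labelFor(List<Map<String, String>> labels, String language) {
        for (Map<String, String> label : labels) {
            String tag = label.get("@language");
            if (tag != null && (tag.equals(language) || tag.startsWith(language + "-"))) {
                return label.get("@value");
            }
        }
        return labels.isEmpty() ? null : labels.get(0).get("@value");
    }

    // Converts a navDate timestamp such as "1804-12-01T00:00:00Z" to a calendar date (UTC).
    static LocalDate issueDate(String navDate) {
        return Instant.parse(navDate).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        List<Map<String, String>> labels = List.of(
                Map.of("@value", "Frequency", "@language", "en"),
                Map.of("@value", "Amlder", "@language", "cy-GB"));
        System.out.println(labelFor(labels, "cy") + ": Weekly");
        System.out.println(issueDate("1804-12-01T00:00:00Z"));
    }
}
```

Sorting the manifests by `issueDate` would give a chronological list of issues for a title.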
IIIF Manifest - equivalent to a Newspaper Issue
A manifest contains enough information for a viewer to render the Newspaper, including links to the images, OCR and descriptive information. You can view this issue in an IIIF-compatible viewer such as the Universal Viewer (UV). A manifest contains:
- metadata section (this is shown in the 'Information' right hand panel in the UV)
- sequences (list of pages equivalent to the Thumbnails tab in the UV)
- structures (the list of articles, as shown in the Index tab in the UV)
Sequences
Sequences are made up of canvases, each of which contains the information about a page. A canvas links to an IIIF image for the page and also to the OCR. An example canvas is below:
{
  "@id": "http://dams.llgc.org.uk/iiif/3320860/canvas/3320863",
  "@type": "sc:Canvas",
  "label": "[3]",
  "height": 7905,
  "width": 5826,
  "images": [
    {
      "@id": "http://dams.llgc.org.uk/iiif/3320860/annotation/3320863.json",
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "resource": {
        "@id": "http://dams.llgc.org.uk/iiif/2.0/3320863/image/full/512,/0/default.jpg",
        "@type": "dctypes:Image",
        "format": "image/jpeg",
        "service": {
          "@context": "http://iiif.io/api/image/2/context.json",
          "@id": "http://dams.llgc.org.uk/iiif/2.0/image/3320863",   --- this is a link to the IIIF Image API
          "profile": "http://iiif.io/api/image/2/level1.json",
          "height": 7905,
          "width": 5826,
          "tiles": [
            { "width": 256, "scaleFactors": [1,2,4,8,16,32] }
          ]
        },
        "height": 7905,
        "width": 5826
      },
      "on": "http://dams.llgc.org.uk/iiif/3320860/canvas/3320863"
    }
  ],
  "otherContent": [
    {
      "@id": "http://dams.llgc.org.uk/iiif/3320863/annotation/list/ART3.json",   --- This is a link to the OCR
      "@type": "sc:AnnotationList",
      "label": "TAAMAYAYIITFP *'mmm Avto-fwrnmpi ( ;",
      "within": {
        "@id": "http://dams.llgc.org.uk/iiif/3320860/annotation/layer/ART3.json",
        "@type": "sc:Layer",
        "label": "OCR Article Text"
      }
    }
  ]
}
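Once a canvas has been parsed into nested Java Maps and Lists, the two links called out in the example (the Image API service and the OCR annotation lists) can be pulled out by walking the structure. This is a minimal sketch that assumes the canvas has the shape shown above; `CanvasLinks` and its methods are hypothetical names, and no error handling is included for canvases that lack these keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: navigates a canvas that has been parsed into Maps and Lists.
final class CanvasLinks {

    // Returns the @id of the IIIF Image API service for the first image on the canvas.
    static String imageServiceId(Map<String, Object> canvas) {
        List<?> images = (List<?>) canvas.get("images");
        Map<?, ?> annotation = (Map<?, ?>) images.get(0);
        Map<?, ?> resource = (Map<?, ?>) annotation.get("resource");
        Map<?, ?> service = (Map<?, ?>) resource.get("service");
        return (String) service.get("@id");
    }

    // Returns the @id of every annotation list (OCR) linked from otherContent.
    static List<String> ocrListIds(Map<String, Object> canvas) {
        List<String> ids = new ArrayList<>();
        Object other = canvas.get("otherContent");
        if (other instanceof List) {
            for (Object entry : (List<?>) other) {
                ids.add((String) ((Map<?, ?>) entry).get("@id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Cut-down version of the example canvas, keeping only the keys used here.
        Map<String, Object> canvas = Map.<String, Object>of(
                "images", List.of(Map.of("resource", Map.of("service",
                        Map.of("@id", "http://dams.llgc.org.uk/iiif/2.0/image/3320863")))),
                "otherContent", List.of(Map.of("@id",
                        "http://dams.llgc.org.uk/iiif/3320863/annotation/list/ART3.json")));
        System.out.println(imageServiceId(canvas));
        System.out.println(ocrListIds(canvas));
    }
}
```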
Annotation Lists
Annotation lists are the way IIIF exposes OCR text as annotations on images. An example annotation list is below:
http://dams.llgc.org.uk/iiif/3320863/annotation/list/ART7.json
An annotation list contains a list of annotations and an individual annotation looks as follows:
{
  "@id": "http://dams.llgc.org.uk/iiif/3320863/annotation/4809243416439",
  "@type": "oa:Annotation",
  "motivation": "sc:painting",
  "resource": {
    "@type": "cnt:ContentAsText",
    "format": "text/plain",
    "chars": "SHIP"
  },
  "on": "http://dams.llgc.org.uk/iiif/3320860/canvas/3320863#xywh=4809,2434,164,39"   -- this is the coordinates of the word on the image
}
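Because each annotation carries a single word in "chars", the text of an article can be rebuilt by joining the words. The sketch below assumes the annotations have been parsed into Maps and appear in reading order, which is an assumption and not something stated here; `ArticleText` is a hypothetical name, and the second word in the sample data is made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: rebuilds plain text from word-level annotations.
final class ArticleText {

    // Joins the "chars" of each annotation's resource with single spaces, in list order.
    // Annotations without a resource or without chars are skipped.
    static String fromAnnotations(List<Map<String, Object>> annotations) {
        StringJoiner text = new StringJoiner(" ");
        for (Map<String, Object> annotation : annotations) {
            Object resource = annotation.get("resource");
            if (resource instanceof Map) {
                Object chars = ((Map<?, ?>) resource).get("chars");
                if (chars != null) {
                    text.add(chars.toString());
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Illustrative sample data only; real annotations carry more keys.
        List<Map<String, Object>> words = List.of(
                Map.<String, Object>of("resource", Map.of("chars", "SHIP")),
                Map.<String, Object>of("resource", Map.of("chars", "NEWS.")));
        System.out.println(fromAnnotations(words));
    }
}
```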
To convert a canvas link like the one above into a IIIF Image API URL for that region, first split the canvas link into its parts:
http://dams.llgc.org.uk/iiif/3320860/canvas/3320863#xywh=4809,2431,164,39
- 3320860 is the issue PID
- 3320863 is the canvas (page) PID
- 4809,2431,164,39 is the x,y,w,h of the region
These parts can then be rearranged into the following URL:
http://dams.llgc.org.uk/iiif/2.0/image/3320863/4809,2431,164,39/512,/0/default.jpg
where
- 3320863 is the page PID
- 4809,2431,164,39 is the region x,y,w,h
- 512 is the width in pixels
- 0 is the degrees of rotation
- default is the quality (color, gray and bitonal are the other options in Image API 2.0)
- .jpg is the format
More details on the Image API can be found at http://iiif.io/api/image/2.0/.
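The conversion from a canvas region link to an Image API URL can be sketched in a few lines. This assumes the canvas and image URL patterns shown in this section; `CanvasRegion` and `toImageUrl` are hypothetical names, and rotation, quality and format are fixed to 0, default and jpg for simplicity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a canvas link with an #xywh fragment into an Image API URL.
final class CanvasRegion {
    // Groups: 1 = issue PID, 2 = page PID, 3..6 = x, y, w, h
    private static final Pattern CANVAS_FRAGMENT = Pattern.compile(
            "^http://dams\\.llgc\\.org\\.uk/iiif/(\\d+)/canvas/(\\d+)#xywh=(\\d+),(\\d+),(\\d+),(\\d+)$");

    // Builds {image service}/{region}/{width},/0/default.jpg for the region on the canvas.
    static String toImageUrl(String canvasUri, int width) {
        Matcher m = CANVAS_FRAGMENT.matcher(canvasUri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a canvas region link: " + canvasUri);
        }
        String pagePid = m.group(2);
        String region = m.group(3) + "," + m.group(4) + "," + m.group(5) + "," + m.group(6);
        return "http://dams.llgc.org.uk/iiif/2.0/image/" + pagePid
                + "/" + region + "/" + width + ",/0/default.jpg";
    }

    public static void main(String[] args) {
        System.out.println(toImageUrl(
                "http://dams.llgc.org.uk/iiif/3320860/canvas/3320863#xywh=4809,2431,164,39", 512));
    }
}
```

The issue PID is matched but not used, since the image URL is addressed by the page PID alone.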
Working with JSON-LD
IIIF data is published as JSON-LD, which is a combination of JSON and Linked Open Data. It can be worked with in two ways: as plain JSON or as Linked Data (RDF). The json-ld website lists numerous libraries for a variety of languages. The one I've used is JSON-LD Java, which reads in a JSON-LD file and converts it into Lists and Maps, e.g.
import java.util.Map;
import com.github.jsonldjava.utils.JsonUtils;
...
// pStream is an InputStream over the JSON-LD document, e.g. a manifest
Map<String,Object> tJson = (Map<String,Object>)JsonUtils.fromInputStream(pStream);
System.out.println("Manifest label: " + tJson.get("label"));
It is also possible to treat JSON-LD as RDF and load it into a triple store, for example Sesame. You can then query it using SPARQL, the Linked Data query language, which works a bit like SQL for databases. Some example queries in ITQL, which is very similar to SPARQL, can be found at Itql_scripts.
Tips & Tricks
Simple Annotation Server
It is possible to view an annotation list for a Newspaper by downloading and running the Simple Annotation Server. The documentation includes an example Newspaper and instructions on how to get started correcting the OCR.
--Glen 00:53, 25 February 2016 (GMT)