Have you ever needed to pass a lot of configuration parameters to a Catmandu Fix script, or found yourself writing the same repetitive code over and over again in large scripts? There is a neat Bash trick you can use to preprocess your scripts.
For instance, imagine you have a large JSON file that needs to be processed for many customers. To each record in the file you need to add the URL of their homepage:
add_field("homepage","http://library.edu")
You could do this by creating a Fix script for every customer and running the convert command for each of them:
$ catmandu convert --fix customer1.fix < data.json
$ catmandu convert --fix customer2.fix < data.json
$ catmandu convert --fix customer3.fix < data.json
There is another way to do this using named pipe redirects (process substitution) in Bash. Instead of writing one Fix script for each customer, you can write a single Fix script for all customers that contains a placeholder for the preprocessor:
add_field("homepage",HOMEPAGE)
With this script, customer.fix, you can use a preprocessor to fill in the HOMEPAGE placeholder:
$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer1.edu\" customer.fix) < data.json
$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer2.edu\" customer.fix) < data.json
$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer3.edu\" customer.fix) < data.json
Bash creates a temporary named pipe that is given as input to the catmandu command, while in the background an m4 preprocessor processes the customer.fix file.
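To see exactly what the catmandu command receives, you can run the preprocessing step on its own; with the customer.fix above, m4 simply expands the HOMEPAGE macro in place:
$ m4 -DHOMEPAGE=\"http://customer1.edu\" customer.fix
add_field("homepage","http://customer1.edu")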
You can put any command into these named pipe redirects. There are plenty of interesting preprocessors available that can be used to process Fix files, such as cpp, m4 and even the Template Toolkit tpage command.
As developers of the Catmandu project we are all shocked by the earthquake in Kathmandu which resulted in thousands of deaths and severe conditions for the survivors. We can all support the people in Kathmandu by supporting the Oxfam actions.
Visit https://www.oxfam.org/ for more information on how you can help!
The Catmandu data processing toolkit facilitates many import, export, and conversion tasks by supporting common APIs (e.g. SRU, OAI-PMH) and databases (e.g. MongoDB, CouchDB, SQL…). But sometimes the best API and database is the file system. In this brief article I’ll show how to use a “hotfolder” to automatically import files into another Catmandu store.
A hotfolder is a directory in which files can be placed to automatically get processed. To facilitate the creation of such directories I created the CPAN module File::Hotfolder. Let’s first define a sample importer and storage in the catmandu.yml configuration file:
---
importer:
  json:
    package: JSON
    options:
      multiline: 1
store:
  couchdb:
    package: CouchDB
    options:
      default_bag: import
...
We can now manually import JSON files into the import
database of a local CouchDB like this:
catmandu import json to couchdb < filename.json
Manually calling such a command for each file can be slow and requires access to the command line. How about defining a hotfolder to automatically import all JSON files into CouchDB? Here is an implementation:
use Catmandu -all;
use File::Hotfolder;
use File::Basename;

my $hotfolder = "import";          # directory to watch
my $importer  = "json";            # importer defined in catmandu.yml
my $suffix    = qr{\.json};        # only process files ending in .json
my $store     = store("couchdb");  # store defined in catmandu.yml

watch( $hotfolder,
    filter   => $suffix,
    scan     => 1,                 # also process files already in the directory
    delete   => 1,                 # remove files after successful import
    print    => WATCH_DIR | FOUND_FILE | CATCH_ERROR,
    callback => sub {
        $store->add_many( importer($importer, file => shift) );
    },
    catch    => 1,                 # don't die on errors in the callback
)->loop;
The directory import
is first scanned for existing files with extension .json
and then watched for modified or new files. As soon as a file has been found, it is imported. The CATCH_ERROR
option ensures that the program is not killed if an import fails, for instance because of invalid JSON.
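A minimal way to try this out, assuming the script above is saved as hotfolder.pl and the hotfolder directory exists:
$ mkdir -p import
$ perl hotfolder.pl &
$ cp filename.json import/   # the file is imported into CouchDB and then deleted (delete => 1)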
The current version of File::Hotfolder only works with Unix but it may be extended to other operating systems as well.
Context
The New Yorker tells us that the average life of a Web page is about a hundred days. Websites don’t have to be deliberately deleted to disappear: sites hosted by corporations tend to die with their hosts. Even the Web page you are viewing now is in flux; new blog posts might appear, comments and reviews are added. Bookmarks or references you make to Web pages are in general not pointing to the same information you were reading when you visited the page, or when you were writing an article about that page. All this is very problematic in an academic context, where provenance and diplomatics are crucial for analysing documents. To point to a static version of a Web page one can make use of services like the Internet Archive, Perma.cc and Archive Today. But these solutions tend to be ad hoc; there is no common strategy to refer to a static version of a web page. In comes Memento, a protocol created by Herbert Van de Sompel and Michael Nelson which adds services on top of HTTP to travel the web of the past.
During a two day Hackathon event at Ghent University Library technologists from all over Europe gathered to explore time travel using the Memento protocol presented by Herbert Van de Sompel and Harihar Shankar from Los Alamos National Laboratory.
The slides of this event are available here.
Participants
Herbert Van de Sompel – Los Alamos National Laboratory, Harihar Shankar – Los Alamos National Laboratory, Najko Jahn – Bielefeld University, Vitali Peil – Bielefeld University, Christian Pietsch – Bielefeld University, Dries Moreels – Ghent University, Patrick Hochstenbach – Ghent University, Nicolas Steenlant – Ghent University, Nicolas Franck – Ghent University, Katrien Deroo – Ghent University, Ruben Verborgh – iMinds, Miel Vander Sande – iMinds, Snorri Briem – Lund University, Maria Hedberg – Lund University, Benoit Pauwels – Université Libre de Bruxelles, Anthony Leroy – Université Libre de Bruxelles, Benoit Erken – Université Catholique de Louvain, Laurent Dubois – Université Catholique de Louvain
Introduction into Memento
The goal of Memento is to provide a protocol for accessing historical versions of web resources. These archived versions, called Mementos, can reside in the content management system of a website or in external services such as web archives.
Take Wikipedia as an example. To view the current version of the lemma ‘Memento_Project’ one needs to visit the web resource http://en.wikipedia.org/wiki/Memento_Project. Wikipedia also provides historical versions of this resource at http://en.wikipedia.org/w/index.php?title=Memento_Project&action=history. In this case the MediaWiki platform keeps all the historical versions of a resource.
Another example is Tim Berners-Lee’s homepage at http://www.w3.org/People/Berners-Lee/. The W3C website doesn’t provide an archive of versions of this webpage, but they are archived at the Internet Archive, Archive-It, UK Web Archive and Archive Today.
How can a machine discover all versions of a web resource automatically?
As Gerald Sussman says: “Wishful thinking is essential to good engineering, and certainly essential to good computer science”. We might imagine any web resource (such as the Wikipedia page or Berners-Lee homepage above), called the original resource (URI-R), as a box that just tells a machine where to find all its archived versions using a standard syntax, the HTTP protocol.
A machine visits the resource URI-R and requests the “2007-05-31” version. The answer should be a link to the archived version of the resource, called the Memento (URI-M). There are some complications which the Memento protocol has to take into account.
To implement the time travel protocol, Memento introduces a service called a TimeGate (URI-G), which acts as a router for time travel requests. As input it receives the address of a resource (URI-R) and a datetime (e.g. “2007-05-31”), and as response it returns the URL of the archived resource, the Memento (URI-M).
A machine visits URI-R and requests the “2007-05-31” version. The server redirects the machine to a TimeGate (URI-G) which has a routing table where to find archived versions, or at least a version close to the requested date.
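In HTTP terms this datetime negotiation happens via the Accept-Datetime request header. Here is a minimal sketch with curl, assuming the URL pattern of the public Time Travel aggregator TimeGate; the exact response headers depend on the service, but you should get a redirect to a Memento close to the requested date, plus Link headers pointing to the original resource and other versions:
$ curl -I -H "Accept-Datetime: Thu, 31 May 2007 12:00:00 GMT" \
    http://timetravel.mementoweb.org/timegate/http://www.w3.org/People/Berners-Lee/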
The TimeGate can be a service that runs locally querying the local content management system or on the Internet maybe connected to a large web archive or a knowledge base of access routes to versioning systems like GitHub or Wikipedia.
You might ask, how does a TimeGate (URI-G) itself know where the archived version of a particular resource lives? We can look at three cases:
In the example below a machine requests the version “2007-05-31” of a resource from a TimeGate (URI-G). The TimeGate doesn’t know the answer itself, but it can query one or more remote TimeGate services (e.g. Internet Archive, Archive-It, Archive Today), which expose an index of Mementos (a TimeMap, URI-T), and request all available versions of a resource. Some TimeGate servers might give zero results, some might answer with a listing of all available versions. Based on this information the TimeGate server can decide which results best fit the original request.
Memento Example
As a practical example, one can turn any web browser into a machine that understands the Memento protocol by including a bit of JavaScript in a web page.
In a <html><head> include the following code snippet:
<link rel="stylesheet" type="text/css" href="http://robustlinks.mementoweb.org/demo/robustlinks.css" />
<script type="text/javascript" src="http://robustlinks.mementoweb.org/demo/robustlinks.js"></script>
Now one can add HTML5 attributes to web links. In this way it is possible to link to a particular version of a web resource. E.g. to link to the “2014-11-01” version of the LibreCat homepage one can write:
<a href="http://librecat.org" data-versiondate="2014-11-01">link</a>
This link will automatically get a menu option pointing to the archived version of this web page (using http://timetravel.mementoweb.org/ as TimeGate).
See a demonstration here: http://librecat.org/memento/demo.html
Using the Memento plugin for Chrome this JavaScript trick is not even needed: data-versiondate attributes will automatically be turned into archive links. One can choose among many public web archives as TimeGate. In the case of Archive Today one can even play an active role in archiving webpages: just provide a URL and it will be stored!
Read more on this project on the Robust Links page.
Hackathon
The second day was used to implement the Memento protocol in various tools and environments. All the results are available as open source projects on Github:
https://github.com/MementoHackathon2015
LDF Memento
The Web is full of high-quality Linked Data, but in general it can’t be reliably queried. Public SPARQL endpoints are often unavailable because they need to answer many unique queries. The Linked Data Fragments conceptual framework allows defining more lightweight interfaces, which enable client-side execution of complex queries.
During the Hackathon Miel Vander Sande and Ruben Verborgh of iMinds extended the LDF server and client to allow Memento-based querying. A demonstrator was built in which many versions of DBpedia are made available using the Memento protocol. By adding the correct headers to queries, historical Linked Data dumps can be queried with SPARQL and compared.
R Memento
In data science, R is the language for data analysis and data mining. The language is known for its strong statistical and graphical support.
Najko Jahn of Bielefeld University created an R client for Memento called Timetravelr. With this tool he demonstrated how HTML tables can be extracted from websites and transformed into a dataset. Using the Memento protocol, this dataset can be tracked over time to generate a time series. In his demonstration Najko showed the evolution of conforming OAI repositories by tracking the OAI registry over time.
GitLab Memento
GitLab is a web-based Git repository manager with wiki and issue tracking features. GitLab is similar to GitHub, but unlike GitHub it is available as an open source version. Bielefeld University Library is using GitLab as a platform to manage source code and (soon) research data. During the Hackathon, Christian Pietsch (Bielefeld University) created a GitLab handler for the Memento TimeGate software using the GitLab Commits API.
Plack Memento
PSGI/Plack is Perl middleware for building web applications, comparable to WSGI in Python and Rack in Ruby. Using Plack it becomes very easy to build RESTful web applications with only a few lines of Perl code. By creating Plack plugins, new functionality can be added to existing web applications without needing to change the application-specific code.
Nicolas Steenlant (Ghent University), Vitali Peil (Bielefeld University) and Maria Hedberg (Lund University) created a Memento plugin for Plack which turns every REST application into a Memento TimeGate, provided a versioning database is available. As a special case Nicolas, Vitali and Maria demonstrated with Catmandu how versioning can be added to databases such as Elasticsearch, MongoDB, CouchDB and DBI. Programmers only need to take care of the logic of the database records; Catmandu and Plack take care of the rest.
Catmandu Memento
Catmandu is the ETL backbone of the LibreCat project. Using Catmandu librarians can extract bibliographic data from various sources such as catalogs, institutional repositories, A&I databases, search engines and transform this data with a small language called Fix. The results of these transformations can be published again into catalogs, search engines, CSV reports, Atom feeds and Linked Data.
During the Hackathon Patrick Hochstenbach (Ghent University) and Snorri Briem (Lund University) created Memento support for the Catmandu tools. As a demonstration they showed how librarians can use Catmandu as a URL checker: MARC records were exported from a catalog as input, URLs were extracted from the 856$u field and checked against TimeGates for the availability of archived versions.
Thank you all for joining our Catmandu advent calendar this month. We hope that you enjoyed our daily posts. Catmandu is a very rich programming environment which provides command line tools and even an API. In these blogposts we provided only a short introduction into all these modules. Hopefully we will see you next year again with more examples!
The Catmandu community consists of all people involved in the project, no matter if they do programming, documentation, or drawing cats. We want to thank them all for a wonderful year!
And a big round of applause for our contributors who kept sending us bug reports and ideas for new features. If you would like to contribute, then please take a look at the contributions section of the Catmandu documentation. Don’t be shy to contact us with questions, feature requests, bug fixes, documentation and cat cartoons!
This advent calendar will stay online for your reference.
As a special gift we still have some Catmandu USB sticks available that we can send to you. Please send a line to “patrick dot hochstenbach at ugent dot be”. The first 5 emailers will get a free USB stick!
Yesterday we learned how to import RDF data with Catmandu. Exporting RDF can be as easy as this:
catmandu convert RDF --url http://d-nb.info/1001703464 to RDF
By default, the RDF exporter Catmandu::Exporter::RDF emits RDF/XML, an ugly and verbose serialization format of RDF. Let’s configure catmandu to use the also verbose but less ugly NTriples format. This can either be done by appending --type ntriples on the command line or by adding the following to the config file catmandu.yml:
exporter:
  RDF:
    package: RDF
    options:
      type: ntriples
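With the command line option, the same export as above looks like this:
$ catmandu convert RDF --url http://d-nb.info/1001703464 to RDF --type ntriples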
The NTriples format illustrates the “true” nature of RDF data as a set of RDF triples or statements, each consisting of three parts (subject, predicate, object).
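For illustration, here is a single made-up NTriples statement: the subject and predicate are URIs in angle brackets, the object is a literal string, and the statement ends with a dot.
<http://example.org/book/1> <http://purl.org/dc/elements/1.1/title> "An example title" .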
Catmandu can be used to convert one RDF serialization format into another, but more specialized RDF tools such as rapper perform better, especially for large data sets. Catmandu is better suited for processing RDF data to JSON, YAML, CSV etc. and vice versa.
Let’s proceed with a more complex workflow and with what we’ve learned at day 13 about OAI-PMH and another popular repository: http://arxiv.org. There is a dedicated Catmandu module Catmandu::ArXiv for searching the repository, but ArXiv also supports OAI-PMH for bulk download. We could specify all options at command line, but putting the following into catmandu.yml
will simplify each call:
importer:
  arxiv-cs:
    package: OAI
    options:
      url: http://export.arxiv.org/oai2
      metadataPrefix: oai_dc
      set: cs
Now we can harvest all computer science papers (set: cs
) for a selected day (e.g. 2014-12-19
):
$ catmandu convert arxiv-cs --from 2014-12-19 --until 2014-12-19 to YAML
The repository may impose a delay of 20 seconds, so be patient. For more precise data, we had better use the original data format from arXiv:
$ catmandu convert arxiv-cs --set cs --from 2014-12-19 --until 2014-12-19 --metadataPrefix arXiv to YAML > arxiv.yaml
The resulting format is based on XML. Have a look at the original data (requires module Catmandu::XML):
$ catmandu convert YAML to XML --field _metadata --pretty 1 < arxiv.yaml
$ catmandu convert YAML --fix 'xml_simple(_metadata)' to YAML < arxiv.yaml
Now we’ll transform this XML data to RDF. This is done with the following fix script, saved in file arxiv2rdf.fix
:
xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)
move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")
move_field(m.title,dc_title)
remove_field(m)
The following command generates one RDF triple per record, consisting of an arXiv article identifier, the property http://purl.org/dc/elements/1.1/title
and the article title:
$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml
To better understand what’s going on, convert to YAML instead of RDF, so the internal aREF data structure is shown:
$ catmandu convert YAML to YAML --fix arxiv2rdf.fix < arxiv.yaml
_id: http://arxiv.org/abs/1201.1733
dc_title: On Conditional Decomposability
…
This record looks similar to the records imported from RDF at day 13. The special field _id
refers to the subject in RDF triples: a handy feature for small RDF graphs that share the same subject in all RDF triples. Nevertheless, the same RDF graph could have been encoded like this:
---
http://arxiv.org/abs/1201.1733:
  dc_title: On Conditional Decomposability
...
To transform more parts of the original record to RDF, we only need to map field names to prefixed RDF property names. Here is a more complete version of arxiv2rdf.fix
:
xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)
move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")
move_field(m.title,dc_title)
move_field(m.abstract,bibo_abstract)
move_field(m.doi,bibo_doi)
copy_field(bibo_doi,owl_sameAs)
prepend(owl_sameAs,"http://dx.doi.org/")
move_field(m.license,cc_license)
move_field(m.authors.author,dc_creator)
unless exists(dc_creator.0)
move_field(dc_creator,dc_creator.0)
end
do list(path=>dc_creator)
add_field(a,foaf_Person)
copy_field(forenames,foaf_name.0)
copy_field(keyname,foaf_name.$append)
join_field(foaf_name,' ')
move_field(forenames,foaf_givenName)
move_field(keyname,foaf_familyName)
move_field(suffix,schema_honorificSuffix)
remove_field(affiliation)
end
remove_field(m)
The result is one big RDF graph for all records:
$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml
Have a look at the internal aREF format by using the same fix with convert to YAML
and try conversion to other RDF serialization forms. The most important part of transformation to RDF is to find matching RDF properties from existing ontologies. The example above uses properties from Dublin Core, Creative Commons, Friend of a Friend, Schema.org, and Bibliographic Ontology.
Continue to Day 18: Merry Christmas! >>
A common problem of data processing is the large number of data formats, dialects, and conceptions. For instance the
author
field in one record format may differ from a similar field in another format in its meaning or name. As shown in the previous articles, Catmandu can help to bridge such differences, but it can also help to map from and to data structured in a completely different paradigm. This article will show how to process data expressed in RDF, the language of the Semantic Web and Linked Open Data.
RDF differs from previous formats, such as JSON and YAML, MARC, or CSV in two important aspects:
Because graph structures are fundamentally different from record structures, there is no obvious mapping between RDF and records in Catmandu. For this reason you had better use dedicated RDF technology as long as your data is RDF. Catmandu, however, can help to process from RDF and to RDF, as shown today and tomorrow, respectively. Let’s first install the Catmandu module Catmandu::RDF for RDF processing:
$ cpanm --sudo Catmandu::RDF
If you happen to use this on a virtual machine from the Catmandu USB stick, you may first have to update another module to remove a nasty bug (the password is “catmandu”):
$ cpanm --sudo List::Util
You can now retrieve RDF data from any Linked Open Data URI like this:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 to YAML
We could also download RDF data into a file and parse the file with Catmandu afterwards:
$ curl -L -H 'Accept: application/rdf+xml' http://dx.doi.org/10.2474/trol.7.147 > rdf.xml
$ catmandu convert RDF --type rdfxml to YAML < rdf.xml
$ catmandu convert RDF --file rdf.xml to YAML # alternatively
Downloading RDF with Catmandu::RDF option --url
, however, is shorter and adds an _url
field that contains the original source. The RDF data converted to YAML with Catmandu looks like this (I removed some parts to keep it shorter). The format is called another RDF Encoding Form (aREF) because it can be transformed from and to other RDF encodings:
---
_url: http://dx.doi.org/10.2474/trol.7.147
http://dx.doi.org/10.2474/trol.7.147:
  dct_title: Frictional Coefficient under Banana Skin@
  dct_creator:
  - <http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72>
  dct_date:
  - 2012^xs_gYear
  dct_isPartOf: <http://id.crossref.org/issn/1881-2198>
http://id.crossref.org/issn/1881-2198:
  a: bibo_Journal
  bibo_issn: 1881-2198@
  dct_title: Tribology Online@
http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72:
  a: foaf_Person
  foaf_name: Daichi Uchijima@
http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72:
  foaf_name: Kensei Tanaka@
http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72:
  foaf_name: Kiyoshi Mabuchi@
http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72:
  foaf_name: Rina Sakai@
...
The sample record contains a special field _url
with the original source URL and six fields with URLs (or URIs), each corresponding to an RDF resource. The field with the original source URL (http://dx.doi.org/10.2474/trol.7.147) can be used as starting point. Each subfield (dct_title
, dct_creator
, dct_date
, dct_isPartOf
) corresponds to an RDF property, abbreviated with namespace prefix. To fetch data from these fields, we could use normal fix functions and JSON path expressions, as shown at day 7 but there is a better way:
Catmandu::RDF provides the fix function aref_query
to map selected parts of the RDF graph to another field. Try to get the the title field with this command:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix 'aref_query(dct_title,title)' to YAML
More complex transformations should better be put into a fix file, so create file rdf.fix
with the following content:
aref_query(dct_title,title)
aref_query(dct_date,date)
aref_query(dct_creator.foaf_name,author)
aref_query(dct_isPartOf.dct_title,journal)
If you apply the fix, there are four additional fields with data extracted from the RDF graph:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix rdf.fix to YAML
The aref_query
function also accepts a query language, similar to JSON path, but the path is applied to an RDF graph instead of a simple hierarchy. Moreover, one can limit results to plain strings or to URIs. For instance, the author URIs can be accessed with aref_query(dct_creator.,author). This feature is especially useful if RDF data contains a property with multiple types of objects, literal strings as well as other resources. We can aggregate both with the following fixes:
aref_query(dct_creator@, authors)
aref_query(dct_creator.foaf_name@, authors)
Before proceeding you should add the following option to the config file catmandu.yml:
importer:
  RDF:
    package: RDF
    options:
      ns: 2014091
This makes sure that RDF properties are always abbreviated with the same prefixes, for instance dct
for http://purl.org/dc/terms/.
Continue to Day 17: Exporting RDF data with Catmandu >>
Today we will look a bit further into MARC processing with Catmandu. By now you should already know how to start up the Virtual Catmandu (hint: see day 1) and the UNIX command prompt (hint: see day 2). We already saw a bit of MARC processing in day 9, and today we will show you how to transform MARC records into Dublin Core. This is in preparation for creating RDF and Linked Data in the later posts.
First I’m going to teach you how to process different types of MARC files. On the Virtual Catmandu system we provided five example MARC files. You can find them in your Documents folder:
When you examine these files with the UNIX less command you will see that the files all have slightly different formats:
$ less Documents/camel.mrk
$ less Documents/camel.usmarc
$ less Documents/marc.xml
$ less Documents/rug01.sample
There are many ways in which MARC data can be written into a file. Every vendor likes to use its own format. You can compare this with the different ways a text document can be stored: as Word, as Open Office, as PDF and plain text. If we are going to process these files with catmandu, then we need to tell the system what the exact format is.
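For instance, the ISO 2709 and MARCXML examples above could be converted with a matching --type option (a sketch; see the Catmandu::Importer::MARC documentation for the full list of supported types):
$ catmandu convert MARC --type USMARC to YAML < Documents/camel.usmarc
$ catmandu convert MARC --type XML to YAML < Documents/marc.xml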
We will work today with the last example, rug01.sample, which is a small export from the Aleph catalog of Ghent University Library. Ex Libris uses a special MARC format to structure their data, called Aleph sequential. We need to tell catmandu not only that our input file is MARC but also that it is in this special Aleph format. Let’s convert it to YAML to see what we get:
$ catmandu convert MARC --type ALEPHSEQ to YAML < Documents/rug01.sample
To transform this MARC file into Dublin Core we need to create a fix file. You can use the UNIX command nano for this (hint: see day 5 how to create files with nano). Create a file dublin.fix:
$ nano dublin.fix
And type into nano the following fixes:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(700,creator.$append)
marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)
marc_map(260b,publisher)
marc_map(260c,date)
marc_map(650a,subject.$append)
remove_field(record)
Every MARC record contains the title of the record in the 245 field. In the first line we map the MARC 245 field to a new field in the record called title:
marc_map(245,title)
In the second and third line we map the authors to a field creator. In the rug01.sample file the authors are stored in the MARC 100 and MARC 700 fields. Because there is usually more than one author in a record, we need to $append them to create an array (a list) of one or more creators.
In lines 4 and 5 we use the same trick to filter the ISBN and ISSN numbers out of the record, which we store in the separate fields isbn and issn (indeed, these are not Dublin Core fields; we will process them later).
In line 6 and line 7 we read the MARC-260 field which contains publisher and date information. Here we don’t need the $append trick because there is usually only one 260-field in a MARC record.
In line 8 the subjects are extracted from the 650 field using the same $append trick as above. Notice that we only extracted the $a subfield? If you want to add more subfields you can list them as in marc_map(650abcdefgh,subject.$append)
Given the dublin.fix file above we can execute the filtering command like this:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample
As always you can type | less at the end of this command to slow down the screen output, or store the results into a file with > results.txt. Hint:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample | less
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample > results.txt
The results should look like this:
_id: '000000002'
creator:
- Katz, Jerrold J.
date: '1977.'
isbn:
- '0855275103 :'
publisher: Harvester press,
subject:
- Semantics.
- Proposition (Logic)
- Speech acts (Linguistics)
- Generative grammar.
- Competence and performance (Linguistics)
title: Propositional structure and illocutionary force :a study of the contribution of sentence meaning to speech acts /Jerrold J. Katz.
...
Congratulations, you’ve created your first mapping file to transform library data from MARC to Dublin Core! We need to add a bit more cleaning to delete some periods and commas here and there but as is we already have our first mapping.
Below you’ll find a complete example. You can read more about our Fix language online.
marc_map(245,title, -join => " ")
marc_map(100,creator.$append)
marc_map(700,creator.$append)
marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)
replace_all(isbn.*," .","")
replace_all(issn.*," .","")
marc_map(260b,publisher)
replace_all(publisher,",$","")
marc_map(260c,date)
replace_all(date,"\D+","")
marc_map(650a,subject.$append)
remove_field(record)
Continue to Day 16: Importing RDF data with Catmandu >>
In the last days you have learned how to store data with Catmandu. Storing data is a cool thing, but sharing data is awesome. Interoperability is important, as other people may use your data (and you will profit from other people’s interoperable data).
In the day 13 tutorial we’ve learned the basic principle of metadata harvesting via OAI-PMH.
We will set up our OAI service with the Perl Dancer framework and an easy-to-use plugin called Dancer::Plugin::Catmandu::OAI. To install the required modules run:
$ cpanm Dancer
$ cpanm Dancer::Plugin::Catmandu::OAI
and you also might need
$ cpanm Template
Let’s start and index some data with Elasticsearch as learned in the previous post:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc to Elasticsearch --index_name oai --bag publication
After this, you should have some data in your Elasticsearch index. Run the following command to check this:
$ catmandu export Elasticsearch --index_name oai --bag publication
Everything is fine, so let’s create a simple webservice which exposes the collected data via OAI-PMH. The following code can be downloaded from this gist.
Download this gist and create a symbolic link
$ ln -s catmandu.yml config.yml
This is necessary for the Dancer app. In this case Catmandu and Dancer are using the same configuration file.
store:
  oai:
    package: Elasticsearch
    options:
      index_name: oai
      bags:
        publication:
          cql_mapping:
            default_index: basic
            indexes:
              _id:
                op:
                  'any': true
                  'all': true
                  '=': true
                  'exact': true
                field: '_id'
              basic:
                op:
                  'any': true
                  'all': true
                  '=': true
                  '<>': true
                field: '_all'
                description: "index with common fields..."
              datestamp:
                op:
                  '=': true
                  '<': true
                  '<=': true
                  '>=': true
                  '>': true
                  'exact': true
                field: '_datestamp'
      index_mappings:
        publication:
          properties:
            _datestamp: {type: date, format: date_time_no_millis}

plugins:
  'Catmandu::OAI':
    store: oai
    bag: publication
    datestamp_field: datestamp
    repositoryName: "My OAI DataProvider"
    uri_base: "http://oai.service.com/oai"
    adminEmail: me@example.com
    earliestDatestamp: "1970-01-01T00:00:01Z"
    deletedRecord: persistent
    repositoryIdentifier: oai.service.com
    cql_filter: "datestamp>2014-12-01T00:00:00Z"
    limit: 200
    delimiter: ":"
    sampleIdentifier: "oai:oai.service.com:1585315"
    metadata_formats:
      -
        metadataPrefix: oai_dc
        schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
        template: oai_dc.tt
        fix:
          - nothing()
    sets:
      -
        setSpec: openaccess
        setName: Open Access
        cql: 'oa=1'
#!/usr/bin/env perl

use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;

Catmandu->load;
Catmandu->config;

oai_provider '/oai';

dance;
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
[%- FOREACH var IN ['title' 'creator' 'subject' 'description' 'publisher' 'contributor' 'date' 'type' 'format' 'identifier' 'source' 'language' 'relation' 'coverage' 'rights'] %]
[%- FOREACH val IN $var %]
  <dc:[% var %]>[% val | html %]</dc:[% var %]>
[%- END %]
[%- END %]
</oai_dc:dc>
What’s going on here? Well, the script oai-app.pl defines a route /oai via the plugin Dancer::Plugin::Catmandu::OAI.
The template oai_dc.tt defines the xml output of the records. And finally the configuration file catmandu.yml handles the settings for the Dancer plugin as well as for the Elasticsearch indexing and querying.
Run the following command to start a local webserver
$ perl oai-app.pl
and point your browser to http://localhost:3000/oai?verb=Identify
. To get some records go to http://localhost:3000/oai?verb=ListRecords&metadataPrefix=oai_dc
.
Yes, it’s that easy. You can extend this simple example by adding fixes to transform the data as you need it.
Continue to Day 15: MARC to Dublin Core >>
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.
Every OAI server must provide metadata records in Dublin Core, other (bibliographic) formats like MARC may be supported additionally. Available metadata formats can be detected with “ListMetadataFormats“. You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
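Since OAI-PMH is plain HTTP, you can quickly check which formats a repository offers before harvesting; for example, against the Ghent University Library endpoint used below:
$ curl 'https://lib.ugent.be/oai?verb=ListMetadataFormats'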
The OAI server may support selective harvesting, so OAI clients can get only subsets of records from a repository. The client requests could be limited via datestamps (--from, --until) or set membership (--set).
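For example, to harvest only Dublin Core records from one set that changed during December 2014 (the dates here are just illustrative):
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --from 2014-12-01 --until 2014-12-31 to YAML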
To get some Dublin Core records from the collection of Ghent University Library and convert it to JSON (default) run the following catmandu command:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc
You can also harvest MARC data and store it in a file:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml to MARC --type USMARC > ugent.mrc
Instead of harvesting the whole metadata you can get the record identifiers (--listIdentifiers) only:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --listIdentifiers 1 to YAML
You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):
$ nano simple.fix
Add the following fixes to the file:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
Now you can run an ETL process (extract, transform, load) with one command:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag ugent
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag ugent
The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats use the default handler (raw) or implement your own. Read our documentation for further details.
Continue to Day 14: Set up your own OAI data service >>