On May 21st, 2019, Nicolas Steenlant (our main developer and guru of Catmandu) released version 1.20 of our Catmandu toolkit with some very interesting new features. The main addition is a brand new way to implement Catmandu Fixes using the new Catmandu::Path implementation. This work by Nicolas makes it much easier and more straightforward to implement any kind of Fix in Perl.
In the previous versions of Catmandu there were only two options to create new Fixes:

- Implementing a fix method. This was very easy: update the $data hash you get as the first argument, return the updated $data, and you were done. The disadvantage was that accessing fields in a deeply nested record was tricky and slow to code.
- Writing emit functions. These are functions that generate Perl code on the fly. Using emit functions it was easier to get fast access to deeply nested data, but creating Fix packages this way was pretty complex.

In Catmandu 1.20 there is now support for a third and easy way to create new Fixes using the Catmandu::Fix::Builder and Catmandu::Path classes. Let me give a simple example of a skeleton Fix that does nothing:
package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    sub {
        my $data = $_[0];
        # ..do some magic here ...
        $data;
    }
}

1;
In the code above we start implementing a rot13(path)
Fix that should read a string at a JSON path and encrypt it using the ROT13 algorithm. This Fix is only a skeleton and doesn't do anything yet. What we have is:

- the as_path method, to easily access data on JSON paths;
- the has path construct, to read in the arguments for our Fix;
- the _build_fixer method, which returns the closure that does the actual work.

We can use this skeleton builder to implement our ROT13 algorithm. Add these lines instead of the # ..do some magic here ... part:
# On the path update the string value...
as_path($self->path)->updater(
    if_string => sub {
        my $value = shift;
        $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
        $value;
    },
)->($data);
The as_path method receives a JSON path string and creates an object which you can use to manipulate data on that path. One can update the values found with the updater method, read data at that path with the getter method, or create a new path with the creator method. In our example, we update the string found at the JSON path using the if_string condition. The updater supports several conditions:
- if_string needs a closure describing what should happen when a string is found on the JSON path;
- if_array_ref needs a closure for when an array is found on the JSON path;
- if_hash_ref needs a closure for when a hash is found on the JSON path.

In our case we are only interested in transforming strings with our rot13(path) fix. The ROT13 algorithm is very simple: it just replaces every letter by the letter 13 positions further in the alphabet. When we execute this fix on some sample data we get this result:
$ catmandu -I lib convert Null to YAML --fix 'add_field(demo,hello);rot13(demo)'
---
demo: uryyb
...
In this case the Fix can be written much more compactly once we know that every Catmandu::Path method returns a closure (hint: look at the ->($data) in the code above). The complete Fix can look like this:
package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    # On the path update the string value...
    as_path($self->path)->updater(
        if_string => sub {
            my $value = shift;
            $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
            $value;
        },
    );
}

1;
This is as easy as it gets to manipulate deeply nested data with your own Perl tools. All the code is in Perl; there is no limit on the number of external CPAN packages one can include in these Builder fixes.
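To illustrate that, here is a minimal hypothetical Fix (not part of Catmandu) that replaces a string at a path by its MD5 digest, following the same Builder pattern shown above and pulling in the core Digest::MD5 module:

package Catmandu::Fix::md5;   # hypothetical example, not shipped with Catmandu

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;
use Digest::MD5 qw(md5_hex);  # any CPAN or core module can be used here

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    # replace the string found at the path by its hexadecimal MD5 digest
    as_path($self->path)->updater(
        if_string => sub { md5_hex(shift) },
    );
}

1;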
We can't wait to see what Catmandu extensions you will create.
In this blog post I'll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process two times as much data, you need two times as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:
$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added 100 (55/sec)
added 200 (76/sec)
added 300 (87/sec)
added 400 (92/sec)
added 500 (90/sec)
added 600 (94/sec)
added 700 (97/sec)
added 800 (97/sec)
added 900 (96/sec)
added 1000 (97/sec)
In the example above we process an 'input.marc' MARC file into an 'output.json' JSON file with some difficult data cleaning in the 'heavy_load.fix' Fix script. Using a single process we reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records.
Can we make this any faster?
Computers these days come equipped with multiple processors. Using a single process, only one of these processors is used for calculations. You would get much more 'bang for the buck' if all the processors could be used. One technique to do that is called 'parallel processing'.
To check the number of processors available on your Linux machine, inspect the file '/proc/cpuinfo':
$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
The example above shows two lines: I have two cores available to do processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way then our processing could be 2, 4, 8 or 16 times as fast (in principle).
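If you don't want to count lines, the 'nproc' command from GNU coreutils prints the number of available processing units directly (on the two-core laptop from the example above it prints 2):

$ nproc
2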
To check if your computer is using all that calculating power, use the ‘uptime’ command:
$ uptime
11:15:21 up 622 days, 1:53, 2 users, load average: 1.23, 1.70, 1.95
In the example above I ran 'uptime' on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case), the server is waiting for input. If the load average is equal to the number of cores, the server is using all the CPU power available. If the load is bigger than the number of cores, there is more work available than the machine can execute and some processes have to wait.
Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I'm going to use a Unix tool called 'GNU parallel' to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:
sudo yum install parallel
The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are many ways to cut your data into parts. I'll show you a trick we use at Ghent University Library with the help of a MongoDB installation.
First, install MongoDB and the MongoDB Catmandu plugins (these examples are taken from our CentOS documentation):
$ sudo cat > /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF
$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB
Next, we are going to store our input data in a MongoDB database with the help of a Catmandu Fix script that adds some random numbers to the data:
$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc
With the ‘random.fix’ like:
random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")
The 'random()' Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory 'lib/Catmandu/Fix'). This will make sure that every record in your input file contains five random numbers: 'part.rand2', 'part.rand4', 'part.rand8', 'part.rand16' and 'part.rand32'. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts depending on the number of processors you have in your machine.
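As a quick sanity check you can count how many records ended up in each chunk; with four chunks every part should hold roughly a quarter of the records (this uses the same 'data' database created above):

$ catmandu count MongoDB --database_name data --query '{"part.rand4": 0}'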
To access one chunk of your data the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:
$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2
We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.
#!/bin/bash
# file: parallel.sh
CPU=$1
if [ "${CPU}" == "" ]; then
/usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
elif [ "${CPU}" != "" ]; then
catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi
The example script above shows how a conversion process could run on a 2-processor machine. The lines with '/usr/bin/parallel' show how GNU parallel is used to call this script with two arguments, '0' and '1' (for the 2-processor example). The lines with 'catmandu export' show how chunks of data are read from the database and processed with the 'heavy_load.fix' Fix script.
If you have a 32-processor machine, you would need to provide parallel an input which contains the numbers 0 to 31 and change the query to 'part.rand32'.
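As a sketch, such a 32-way variant of the script could generate the CPU numbers with 'seq' instead of listing them by hand (same 'data' database and 'heavy_load.fix' as above):

#!/bin/bash
# file: parallel32.sh - hypothetical 32-way variant of parallel.sh
CPU=$1
if [ "${CPU}" == "" ]; then
    # feed the numbers 0..31 to GNU parallel, which calls this script again with one number each
    seq 0 31 | /usr/bin/parallel -u $0 {}
else
    catmandu export MongoDB --database_name data -q "{\"part.rand32\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi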
GNU parallel is a very powerful command. It gives you the opportunity to run many processes in parallel and even to spread out the load over many machines if you have a cluster. When all these machines have access to your MongoDB database, then all of them can receive chunks of data to be processed. The only task left is to combine all results, which can be as easy as a simple 'cat' command:
$ cat result.*.json > final_result.json
Yesterday we learned how to import RDF data with Catmandu. Exporting RDF can be as easy as this:
catmandu convert RDF --url http://d-nb.info/1001703464 to RDF
By default, the RDF exporter Catmandu::Exporter::RDF emits RDF/XML, an ugly and verbose serialization format of RDF. Let's configure catmandu to use the also verbose but less ugly NTriples. This can either be done by appending --type ntriples on the command line or by adding the following to the config file catmandu.yml:
exporter:
  RDF:
    package: RDF
    options:
      type: ntriples
The NTriples format illustrates the “true” nature of RDF data as a set of RDF triples or statements, each consisting of three parts (subject, predicate, object).
Catmandu can be used for converting from one RDF serialization format to another, but more specialized RDF tools, such as rapper, are more performant, especially for large data sets. Catmandu can better help to process RDF data to JSON, YAML, CSV etc. and vice versa.
Let's proceed with a more complex workflow, using what we've learned at day 13 about OAI-PMH and another popular repository: http://arxiv.org. There is a dedicated Catmandu module Catmandu::ArXiv for searching the repository, but arXiv also supports OAI-PMH for bulk download. We could specify all options on the command line, but putting the following into catmandu.yml will simplify each call:
importer:
  arxiv-cs:
    package: OAI
    options:
      url: http://export.arxiv.org/oai2
      metadataPrefix: oai_dc
      set: cs
Now we can harvest all computer science papers (set: cs) for a selected day (e.g. 2014-12-19):
$ catmandu convert arxiv --from 2014-12-19 --to 2014-12-19 to YAML
The repository may impose a delay of 20 seconds, so be patient. For more precise data, we'd better use the original data format from arXiv:
$ catmandu convert arxiv --set cs --from 2014-12-19 --to 2014-12-19 --metadataPrefix arXiv to YAML > arxiv.yaml
The resulting format is based on XML. Have a look at the original data (requires module Catmandu::XML):
$ catmandu convert YAML to XML --field _metadata --pretty 1 < arxiv.yaml
$ catmandu convert YAML --fix 'xml_simple(_metadata)' to YAML < arxiv.yaml
Now we'll transform this XML data to RDF. This is done with the following fix script, saved in the file arxiv2rdf.fix:
xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)
move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")
move_field(m.title,dc_title)
remove_field(m)
The following command generates one RDF triple per record, consisting of an arXiv article identifier, the property http://purl.org/dc/elements/1.1/title
and the article title:
$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml
To better understand what’s going on, convert to YAML instead of RDF, so the internal aREF data structure is shown:
$ catmandu convert YAML to YAML --fix arxiv2rdf.fix < arxiv.yaml
_id: http://arxiv.org/abs/1201.1733
dc_title: On Conditional Decomposability
…
This record looks similar to the records imported from RDF at day 13. The special field _id
refers to the subject in RDF triples: a handy feature for small RDF graphs that share the same subject in all RDF triples. Nevertheless, the same RDF graph could have been encoded like this:
---
http://arxiv.org/abs/1201.1733:
  dc_title: On Conditional Decomposability
...
To transform more parts of the original record to RDF, we only need to map field names to prefixed RDF property names. Here is a more complete version of arxiv2rdf.fix:
xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)
move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")
move_field(m.title,dc_title)
move_field(m.abstract,bibo_abstract)
move_field(m.doi,bibo_doi)
copy_field(bibo_doi,owl_sameAs)
prepend(owl_sameAs,"http://dx.doi.org/")
move_field(m.license,cc_license)
move_field(m.authors.author,dc_creator)
unless exists(dc_creator.0)
move_field(dc_creator,dc_creator.0)
end
do list(path=>dc_creator)
add_field(a,foaf_Person)
copy_field(forenames,foaf_name.0)
copy_field(keyname,foaf_name.$append)
join_field(foaf_name,' ')
move_field(forenames,foaf_givenName)
move_field(keyname,foaf_familyName)
move_field(suffix,schema_honorificSuffix)
remove_field(affiliation)
end
remove_field(m)
The result is one big RDF graph for all records:
$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml
Have a look at the internal aREF format by using the same fix with convert to YAML
and try conversion to other RDF serialization forms. The most important part of transformation to RDF is to find matching RDF properties from existing ontologies. The example above uses properties from Dublin Core, Creative Commons, Friend of a Friend, Schema.org, and Bibliographic Ontology.
Continue to Day 18: Merry Christmas! >>
A common problem of data processing is the large number of data formats, dialects, and conceptions. For instance, the author field in one record format may differ from a similar field in another format in its meaning or name. As shown in the previous articles, Catmandu can help to bridge such differences, but it can also help to map from and to data structured in a completely different paradigm. This article will show how to process data expressed in RDF, the language of the Semantic Web and Linked Open Data.
RDF differs from previous formats, such as JSON and YAML, MARC, or CSV, in two important aspects: data is not structured in records with fields but as a graph of triples (statements), and things are identified by URIs instead of internal field names.
Because graph structures are fundamentally different from record structures, there is no obvious mapping between RDF and records in Catmandu. For this reason you had better use dedicated RDF technology as long as your data is RDF. Catmandu, however, can help to process from RDF and to RDF, as shown today and tomorrow, respectively. Let's first install the Catmandu module Catmandu::RDF for RDF processing:
$ cpanm --sudo Catmandu::RDF
If you happen to use this on a virtual machine from the Catmandu USB stick, you may first have to update another module to remove a nasty bug (the password is “catmandu”):
$ cpanm --sudo List::Util
You can now retrieve RDF data from any Linked Open Data URI like this:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 to YAML
We could also download RDF data into a file and parse the file with Catmandu afterwards:
$ curl -L -H 'Accept: application/rdf+xml' http://dx.doi.org/10.2474/trol.7.147 > rdf.xml
$ catmandu convert RDF --type rdfxml to YAML < rdf.xml
$ catmandu convert RDF --file rdf.xml to YAML # alternatively
Downloading RDF with Catmandu::RDF option --url
, however, is shorter and adds an _url
field that contains the original source. The RDF data converted to YAML with Catmandu looks like this (I removed some parts to keep it shorter). The format is called another RDF Encoding Form (aREF) because it can be transformed from and to other RDF encodings:
---
_url: http://dx.doi.org/10.2474/trol.7.147
http://dx.doi.org/10.2474/trol.7.147:
  dct_title: Frictional Coefficient under Banana Skin@
  dct_creator:
  - <http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72>
  dct_date:
  - 2012^xs_gYear
  dct_isPartOf: <http://id.crossref.org/issn/1881-2198>
http://id.crossref.org/issn/1881-2198:
  a: bibo_Journal
  bibo_issn: 1881-2198@
  dct_title: Tribology Online@
http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72:
  a: foaf_Person
  foaf_name: Daichi Uchijima@
http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72:
  foaf_name: Kensei Tanaka@
http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72:
  foaf_name: Kiyoshi Mabuchi@
http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72:
  foaf_name: Rina Sakai@
...
The sample record contains a special field _url
with the original source URL and six fields with URLs (or URIs), each corresponding to an RDF resource. The field with the original source URL (http://dx.doi.org/10.2474/trol.7.147) can be used as starting point. Each subfield (dct_title
, dct_creator
, dct_date
, dct_isPartOf
) corresponds to an RDF property, abbreviated with namespace prefix. To fetch data from these fields, we could use normal fix functions and JSON path expressions, as shown at day 7 but there is a better way:
Catmandu::RDF provides the fix function aref_query to map selected parts of the RDF graph to another field. Try to get the title field with this command:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix 'aref_query(dct_title,title)' to YAML
More complex transformations are better put into a fix file, so create a file rdf.fix with the following content:
aref_query(dct_title,title)
aref_query(dct_date,date)
aref_query(dct_creator.foaf_name,author)
aref_query(dct_isPartOf.dct_title,journal)
If you apply the fix, there are four additional fields with data extracted from the RDF graph:
$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix rdf.fix to YAML
The aref_query
function also accepts a language, similar to JSON path, but the path is applied to an RDF graph instead of a simple hierarchy. Moreover one can limit results to plain strings or to URIs. For instance the author URIs can be accessed with aref_query(dct_creator.,author)
. This feature is especially useful if RDF data contains a property with multiple types of objects, literal strings as well as other resources. We can aggregate both with the following fixes:
aref_query(dct_creator@, authors)
aref_query(dct_creator.foaf_name@, authors)
Before proceeding you should add the following option to the config file catmandu.yml:
importer:
  RDF:
    package: RDF
    options:
      ns: 2014091
This makes sure that RDF properties are always abbreviated with the same prefixes, for instance dct
for http://purl.org/dc/terms/.
Continue to Day 17: Exporting RDF data with Catmandu >>
In the last days you have learned how to store data with Catmandu. Storing data is a cool thing, but sharing data is awesome. Interoperability is important, as other people may use your data (and you will profit from other people's interoperable data).
In the day 13 tutorial we’ve learned the basic principle of metadata harvesting via OAI-PMH.
We will set up our OAI service with the Perl Dancer framework and an easy-to-use plugin called Dancer::Plugin::Catmandu::OAI. To install the required modules run:
$ cpanm Dancer
$ cpanm Dancer::Plugin::Catmandu::OAI
and you also might need
$ cpanm Template
Let’s start and index some data with Elasticsearch as learned in the previous post:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc to Elasticsearch --index_name oai --bag publication
After this, you should have some data in your Elasticsearch index. Run the following command to check this:
$ catmandu export Elasticsearch --index_name oai --bag publication
Everything is fine, so let's create a simple webservice which exposes the collected data via OAI-PMH. The following code can be downloaded from this gist.
Download this gist and create a symbolic link
$ ln -s catmandu.yml config.yml
This is necessary for the dancer app. In this case Catmandu and Dancer are using the same configuration file.
store:
  oai:
    package: Elasticsearch
    options:
      index_name: oai
      bags:
        publication:
          cql_mapping:
            default_index: basic
            indexes:
              _id:
                op:
                  'any': true
                  'all': true
                  '=': true
                  'exact': true
                field: '_id'
              basic:
                op:
                  'any': true
                  'all': true
                  '=': true
                  '<>': true
                field: '_all'
                description: "index with common fields..."
              datestamp:
                op:
                  '=': true
                  '<': true
                  '<=': true
                  '>=': true
                  '>': true
                  'exact': true
                field: '_datestamp'
      index_mappings:
        publication:
          properties:
            _datestamp: {type: date, format: date_time_no_millis}

plugins:
  'Catmandu::OAI':
    store: oai
    bag: publication
    datestamp_field: datestamp
    repositoryName: "My OAI DataProvider"
    uri_base: "http://oai.service.com/oai"
    adminEmail: me@example.com
    earliestDatestamp: "1970-01-01T00:00:01Z"
    deletedRecord: persistent
    repositoryIdentifier: oai.service.com
    cql_filter: "datestamp>2014-12-01T00:00:00Z"
    limit: 200
    delimiter: ":"
    sampleIdentifier: "oai:oai.service.com:1585315"
    metadata_formats:
      -
        metadataPrefix: oai_dc
        schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
        template: oai_dc.tt
        fix:
          - nothing()
    sets:
      -
        setSpec: openaccess
        setName: Open Access
        cql: 'oa=1'
#!/usr/bin/env perl

use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;

Catmandu->load;
Catmandu->config;

oai_provider '/oai';

dance;
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
[%- FOREACH var IN ['title' 'creator' 'subject' 'description' 'publisher' 'contributor' 'date' 'type' 'format' 'identifier' 'source' 'language' 'relation' 'coverage' 'rights'] %]
[%- FOREACH val IN $var %]
  <dc:[% var %]>[% val | html %]</dc:[% var %]>
[%- END %]
[%- END %]
</oai_dc:dc>
What’s going on here? Well, the script oai-app.pl defines a route /oai via the plugin Dancer::Plugin::Catmandu::OAI.
The template oai_dc.tt defines the XML output of the records. And finally the configuration file catmandu.yml handles the settings for the Dancer plugin as well as for the Elasticsearch indexing and querying.
Run the following command to start a local webserver:
$ perl oai-app.pl
and point your browser to http://localhost:3000/oai?verb=Identify. To get some records go to http://localhost:3000/oai?verb=ListRecords&metadataPrefix=oai_dc.
Yes, it’s that easy. You can extend this simple example by adding fixes to transform the data as you need it.
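For example, the fix section of a metadata format in catmandu.yml can point to your own fix file instead of nothing() (the file name 'oai_dc_cleanup.fix' below is just a placeholder for whatever transformation you need):

metadata_formats:
  -
    metadataPrefix: oai_dc
    schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
    template: oai_dc.tt
    fix:
      - 'oai_dc_cleanup.fix'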
Continue to Day 15: MARC to Dublin Core >>
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.
Every OAI server must provide metadata records in Dublin Core; other (bibliographic) formats like MARC may be supported additionally. Available metadata formats can be detected with "ListMetadataFormats". You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
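You can send a ListMetadataFormats request to an OAI endpoint with any HTTP client, for example against the Ghent University Library server used below:

$ curl 'https://lib.ugent.be/oai?verb=ListMetadataFormats'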
The OAI server may support selective harvesting, so OAI clients can get only subsets of records from a repository. The client requests could be limited via datestamps (--from, --until) or set membership (--set).
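For instance, to harvest only the Dublin Core records of the 'flandrica' set that changed in a given date range (the dates below are arbitrary examples):

$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --from 2014-12-01 --until 2014-12-31 to YAML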
To get some Dublin Core records from the collection of Ghent University Library and convert it to JSON (default) run the following catmandu command:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc
You can also harvest MARC data and store it in a file:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml to MARC --type USMARC > ugent.mrc
Instead of harvesting the whole metadata you can get the record identifiers (--listIdentifiers) only:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --listIdentifiers 1 to YAML
You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):
$ nano simple.fix
Add the following fixes to the file:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
Now you can run an ETL process (extract, transform, load) with one command:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag ugent
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag ugent
The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats use the default handler (raw) or implement your own. Read our documentation for further details.
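For instance, if a repository offers a metadata format for which no dedicated handler exists (the 'mods' prefix below is just a hypothetical example), you can keep the raw XML:

$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix mods --handler raw to YAML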
Continue to Day 14: Set up your own OAI data service >>
ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine. You can store structured JSON documents and by default ElasticSearch will try to detect the data structure and index the data. ElasticSearch uses Lucene to provide full text search capabilities with a powerful query language. Install guides for various platforms are available at the ElasticSearch reference. To install the corresponding Catmandu module run:
$ cpanm Catmandu::Store::ElasticSearch
$ cpanm Search::Elasticsearch::Client::5_0::Direct
[For those of you running the Catmandu VirtualBox this installation is not required. ElasticSearch is installed by default.]
Now get some JSON data to work with:
$ wget -O banned_books.json https://lib.ugent.be/download/librecat/data/verbannte-buecher.json
First index the data with ElasticSearch. You have to specify an index (--index_name), a type (--bag) and the client version (--client); for Elasticsearch versions >= 2.0 you also have to add a prefix to your IDs (--key_prefix):
$ catmandu import -v JSON --multiline 1 to ElasticSearch --index_name books --bag banned --key_prefix my_ --client '5_0::Direct' < banned_books.json
Now you can export all items from an index to different formats, like XLSX, YAML and XML:
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to YAML
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XML
$ catmandu export -v ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XLSX --file banned_books.xlsx
You can count all indexed items or those which match a query:
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'
You can search an index for a specific value and export all matching items:
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"' to JSON
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"' to CSV --fields 'my_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'
You can delete whole collections from a database or just items which match a query:
$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'
$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct'
Catmandu::Store::ElasticSearch supports CQL as query language. For setup and usage see documentation.
Continue to Day 13: Harvest data with OAI-PMH >>
MongoDB is a cross-platform document-oriented database. As a NoSQL database, MongoDB uses JSON-like documents (BSON) with dynamic schemas, making the integration of data in applications easier and faster. Install guides for various platforms are available at the MongoDB manual. To install the corresponding Catmandu module run:
$ cpanm Catmandu::Store::MongoDB
Now get some JSON data to work with:
$ wget -O banned_books.json https://lib.ugent.be/download/librecat/data/verbannte-buecher.json
First import the data to MongoDB. You have to specify in which database (--database_name) and collection (--bag) you want to store the data:
$ catmandu import -v JSON --multiline 1 to MongoDB --database_name books --bag banned < banned_books.json
Now you can export all items from a collection to different formats, like XLSX, YAML and XML:
$ catmandu export MongoDB --database_name books --bag banned to YAML
$ catmandu export MongoDB --database_name books --bag banned to XML
$ catmandu export -v MongoDB --database_name books --bag banned to XLSX --file banned_books.xlsx
You can count all items in a collection or those which match a query:
$ catmandu count MongoDB --database_name books --bag banned
$ catmandu count MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": "1937"}'
$ catmandu count MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}'
MongoDB uses a JSON-like query language that supports a variety of operators.
You can query a collection for a specific value and export all matching items:
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": "1937"}' to JSON
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}' to CSV --fields '_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'
You can use regular expressions for queries, e.g. to get all items which were published at a place starting with "B":
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": {"$regex":"^B.*"}}' to CSV --fields '_id,firstEditionPublicationPlace'
MongoDB supports several comparison operators, e.g. you can query items which were published before/after a specific date or at specific places:
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": {"$lt":"1940"}}' to CSV --fields '_id,firstEditionPublicationYear'
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": {"$gt":"1940"}}' to CSV --fields '_id,firstEditionPublicationYear'
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace":{"$in":["Berlin","Bern"]}}' to CSV --fields '_id,firstEditionPublicationPlace'
Logical operators are also supported, so you can combine query clauses:
$ catmandu export MongoDB --database_name books --bag banned --query '{"$and":[{"firstEditionPublicationYear": "1937"},{"firstEditionPublicationPlace": "Berlin"}]}' to JSON
$ catmandu export MongoDB --database_name books --bag banned --query '{"$or":[{"firstEditionPublicationPlace": "Berlin"},{"secondEditionPublicationPlace": "Berlin"}]}' to JSON
With the element query operators you can match items that contain a specified field:
$ catmandu export MongoDB --database_name books --bag banned --query '{"field_xyz":{"$exists":"true"}}'
Collections and items can be moved within MongoDB or even to other stores or search engines:
$ catmandu move MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}' to MongoDB --database_name books --bag berlin
$ catmandu move MongoDB --database_name books --bag banned to Elasticsearch --index_name books --bag banned
You can delete whole collections from a database or just items which match a query:
$ catmandu delete MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}'
$ catmandu delete MongoDB --database_name books --bag banned
MongoDB supports several more methods. These methods are not available via the Catmandu command-line interface, but can be used in Catmandu modules and scripts, as in the sketch below.
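A minimal sketch of such a script, assuming the 'books' database and 'banned' collection from the examples above:

#!/usr/bin/env perl
use Catmandu;

# open the MongoDB store and the 'banned' collection (bag)
my $bag = Catmandu->store('MongoDB', database_name => 'books')->bag('banned');

# count all items and print the title (or id) of each one
printf "%d banned books\n", $bag->count;

$bag->each(sub {
    my $item = shift;
    print $item->{title} // $item->{_id}, "\n";
});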
See Catmandu::Store::MongoDB for further documentation.
Continue to Day 12: Index your data with ElasticSearch >>
CSV and Excel files are widely used to store and exchange simple structured data. Many open datasets are published as CSV files, e.g. datahub.io. Within the library community CSV files are used for the distribution of title lists (KBART), e.g. Knowledge Base+. Excel spreadsheets are often used to generate reports.
Catmandu implements importers and exporters for both formats. The CSV module is already part of the core system; the Catmandu::XLS and Catmandu::Exporter::Table modules may have to be installed separately (note that these steps are not required if you have the virtual catmandu box):
$ sudo cpanm Catmandu::XLS
$ sudo cpanm Catmandu::Exporter::Table
Get some CSV data to work with:
$ curl "https://lib.ugent.be/download/librecat/data/goodreads.csv" > goodreads.csv
Now you can convert the data to different formats, like JSON, YAML and XML.
$ catmandu convert CSV to XML < goodreads.csv
$ catmandu convert CSV to XLS --file goodreads.xls < goodreads.csv
$ catmandu convert XLS to JSON < goodreads.xls
$ catmandu convert CSV to XLSX --file goodreads.xlsx < goodreads.csv
$ catmandu convert XLSX to YAML < goodreads.xlsx
You can extract specified fields while converting to another tabular format. This is quite handy for analysis of specific fields or to generate reports.
$ catmandu convert CSV to TSV --fields ISBN,Title < goodreads.csv
$ catmandu convert CSV to XLS --fields 'ISBN,Title,Author' --file goodreads.xls < goodreads.csv
The field names are read from the header line or must be given via the ‘fields’ parameter.
By default Catmandu expects that CSV fields are separated by a comma ',' and strings are quoted with double quotes '"'. You can specify other characters as separator or quotes with the parameters 'sep_char' and 'quote_char':
$ echo '12157;$The Journal of Headache and Pain$;2193-1801' | catmandu convert CSV --header 0 --fields 'id,title,issn' --sep_char ';' --quote_char '$'
In the example above we create a little CSV fragment using the "echo" command for our small test. It will print a tiny CSV string which uses ";" and "$" as separation and quotation characters.
When exporting data to a tabular format you can change the field names in the header or omit the header:
$ catmandu convert CSV to CSV --fields 'ISBN,Title,Author' --columns 'A,B,C' < goodreads.csv
$ catmandu convert CSV to TSV --fields 'ISBN,Title,Author' --header 0 < goodreads.csv
If you want to export complex/nested data structures to a tabular format, you must "flatten" the data structure. This can be done with Fixes, as sketched below.
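A small sketch with hypothetical field names: if a record held a nested list of author names and a publisher sub-record, fixes like these would flatten them into plain columns before exporting to CSV:

# turn a hypothetical list of author names into one CSV-friendly string
join_field(authors,'; ')
# pull a nested subfield up to a top-level column
move_field(publisher.name,publisher_name)
remove_field(publisher)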
See Catmandu::Importer::CSV, Catmandu::Exporter::CSV and Catmandu::XLS for further documentation.
Continue in Day 11: Store your data in MongoDB >>
In the previous days we learned how we can use the catmandu command to process structured data like JSON. Today we will use the same command to process MARC metadata records. In this process we will see that MARC can be processed using JSON paths but this is a bit cumbersome. We will introduce MARCspec as an easier way to point to parts of a MARC record.
As always, you need to startup your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).
In the Virtual Catmandu installation we provided a couple of example MARC files that we can inspect with the UNIX command cat or less. In the UNIX prompt inspect the file Documents/camel.usmarc, for instance, with cat:
$ cat Documents/camel.usmarc
You should see something like this:
Like JSON, the MARC file contains structured data but the format is different. All the data is on one line, but there isn't at first sight a clear separation between fields and values. The field/value structure is there, but you need to use a MARC parser to extract this information. Catmandu contains a MARC parser which can be used to interpret this file. Type the following command to transform the MARC data into YAML (which we introduced in the previous posts):
$ catmandu convert MARC to YAML < Documents/camel.usmarc
You will see something like this:
When transforming MARC into YAML it looks like something with a simple top level field _id containing the identifier of the MARC record and a record field with a deeper array structure (or, more correctly, an array-of-arrays structure).
We can use catmandu to read the _id fields of the MARC record with the retain_field fix we learned in the Day 6 post:
$ catmandu convert MARC --fix 'retain_field(_id)' to YAML < Documents/camel.usmarc
You will see:
---
_id: 'fol05731351 '
...
---
_id: 'fol05754809 '
...
---
_id: 'fol05843555 '
...
---
_id: 'fol05843579 '
...
What is happening here? The MARC file Documents/camel.usmarc contains more than one MARC record. For every MARC record catmandu extracts the _id field.
Extracting data out of the MARC record itself is a bit more difficult. MARC is an array-of-arrays; you need indexes to extract the data. For instance, the MARC leader is usually in the first field of a MARC record. In the previous posts we learned that you need to use the 0 index to extract the first field out of an array:
$ catmandu convert MARC --fix 'retain_field(record.0)' to YAML < Documents/camel.usmarc
---
_id: 'fol05731351 '
record:
- - LDR
  - ~
  - ~
  - _
  - 00755cam 22002414a 4500
...
The leader value itself is the fifth entry in the resulting array. So, we need index 4 to extract it:
$ catmandu convert MARC --fix 'copy_field(record.0.4,leader); retain_field(leader)' to YAML < Documents/camel.usmarc
We used here a copy_field fix to extract the value into a field called leader. The retain_field fix is used to keep only this leader field in the result. To process MARC data this way would be very verbose, plus you need to know at which index position the fields are that you are interested in. This is something you usually don’t know.
Catmandu introduces Carsten Klee’s MARCspec to ease the extraction of MARC values out of a record. With the marc_map fix the command above would read:
marc_map("LDR",leader)
retain_field(leader)
I skipped writing the catmandu commands here (they will be the same every time). You can put these fixes into a file using nano (see the Day 5 post) and execute it as:
catmandu convert MARC --fix myfixes.txt to YAML < Documents/camel.usmarc
Where myfixes.txt contains the fixes above.
To extract the title field (field 245, remember? ;)), you can write:
marc_map("245",title)
retain_field(title)
Or, if you are only interested in the $a subfield you could write:
marc_map("245a",title)
retain_field(title)
More elaborate mappings are possible. I'll show you more complete examples in the next posts. As a warming up, here is some code to extract all the record identifiers, titles and ISBN numbers in a MARC file into a CSV listing (which you can open in Excel).
Step 1, create a fix file myfixes.txt containing:
marc_map("245",title)
marc_map("020a",isbn.$append)
join_field(isbn,",")
remove_field(record)
Step 2, execute this command:
$ catmandu convert MARC --fix myfixes.txt to CSV < Documents/camel.usmarc
You will see this as output:
_id,isbn,title
"fol05731351 ","0471383147 (paper/cd-rom : alk. paper)","ActivePerl with ASP and ADO /Tobias Martinsson."
"fol05754809 ",1565926994,"Programming the Perl DBI /Alligator Descartes and Tim Bunce."
"fol05843555 ",,"Perl :programmer's reference /Martin C. Brown."
"fol05843579 ",0072120002,"Perl :the complete reference /Martin C. Brown."
"fol05848297 ",1565924193,"CGI programming with Perl /Scott Guelich, Shishir Gundavaram & Gunther Birznieks."
"fol05865950 ",0596000138,"Proceedings of the Perl Conference 4.0 :July 17-20, 2000, Monterey, California."
"fol05865956 ",1565926099,"Perl for system administration /David N. Blank-Edelman."
"fol05865967 ",0596000278,"Programming Perl /Larry Wall, Tom Christiansen & Jon Orwant."
"fol05872355 ",013020868X,"Perl programmer's interactive workbook /Vincent Lowe."
"fol05882032 ","0764547291 (alk. paper)","Cross-platform Perl /Eric F. Johnson."
In the fix above we mapped the 245-field to the title. The ISBN is in the 020-field. Because MARC records can contain one or more 020 fields, we created an isbn array using the isbn.$append syntax. Next we turned the isbn array back into a comma-separated string using the join_field fix. As a last step we deleted all the fields we didn't need in the output with the remove_field fix.
In this post we demonstrated how to process MARC data. In the next post we will show some examples how catmandu typically can be used to process library data.
Continue with Day 10: Working with CSV and Excel files >>