
Preprocessing Catmandu fixes

Ever needed to pass a lot of configuration parameters to a Catmandu Fix script, or found yourself writing the same repetitive code over and over again in large scripts? There is a neat Bash trick you can use to preprocess your scripts.

For instance, imagine you have a large JSON file that needs to be processed for many customers. To each record in the file you need to add the URL of the customer's homepage:


add_field("homepage","http://library.edu")

You could do this by creating a Fix script for every customer and running the convert command once per customer:


$ catmandu convert --fix customer1.fix < data.json
$ catmandu convert --fix customer2.fix < data.json
$ catmandu convert --fix customer3.fix < data.json

There is another way to do this by using named pipe redirects in Bash. Instead of writing one Fix script for each customer you can write one Fix script for all customers that contains a preprocessor placeholder:


add_field("homepage",HOMEPAGE)

With this script saved as customer.fix you can use a preprocessor to populate the HOMEPAGE placeholder:


$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer1.edu\" customer.fix) < data.json

$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer2.edu\" customer.fix) < data.json

$ catmandu convert --fix <(m4 -DHOMEPAGE=\"http://customer3.edu\" customer.fix) < data.json

Bash creates a temporary named pipe that is given as input to the catmandu command, while in the background the m4 preprocessor processes the customer.fix file.
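
To see what catmandu actually receives, you can run the m4 preprocessor on its own and inspect the expanded Fix script it prints:


$ m4 -DHOMEPAGE=\"http://customer1.edu\" customer.fix
add_field("homepage","http://customer1.edu")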

You can enter any command into the named pipe redirects. There are plenty of interesting preprocessors available that can be used to process Fix files, such as cpp, m4 and even the Template Toolkit tpage command.
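
For example, with Template Toolkit you could keep the placeholder in template syntax instead of an m4 macro. A minimal sketch, assuming a hypothetical template file customer.fix.tt that contains add_field("homepage","[% homepage %]"):


$ catmandu convert --fix <(tpage --define homepage=http://customer1.edu customer.fix.tt) < data.json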

Day 17: Exporting RDF data with Catmandu

Yesterday we learned how to import RDF data with Catmandu. Exporting RDF can be as easy as this:

catmandu convert RDF --url http://d-nb.info/1001703464 to RDF

By default, the RDF exporter Catmandu::Exporter::RDF emits RDF/XML, an ugly and verbose serialization format of RDF. Let’s configure catmandu to use the also verbose but less ugly NTriples instead. This can either be done by appending --type ntriples on the command line or by adding the following to the config file catmandu.yml:

exporter:
  RDF:
    package: RDF
    options:
      type: ntriples

The NTriples format illustrates the “true” nature of RDF data as a set of RDF triples or statements, each consisting of three parts (subject, predicate, object).

Catmandu can be used for converting from one RDF serialization format to another, but more specialized RDF tools, such as rapper, are more performant, especially for large data sets. Catmandu is better suited to processing RDF data to JSON, YAML, CSV etc. and vice versa.
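
For instance, converting a large NTriples dump into Turtle is typically much faster with rapper (part of the Raptor toolkit) than with Catmandu. A sketch, assuming rapper is installed and triples.nt is just a placeholder file name:


$ rapper -i ntriples -o turtle triples.nt > triples.ttl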

Let’s proceed with a more complex workflow, using what we’ve learned at day 13 about OAI-PMH, and another popular repository: http://arxiv.org. There is a dedicated Catmandu module Catmandu::ArXiv for searching the repository, but arXiv also supports OAI-PMH for bulk download. We could specify all options at the command line, but putting the following into catmandu.yml will simplify each call:

importer:
  arxiv-cs:
    package: OAI
    options:
      url: http://export.arxiv.org/oai2
      metadataPrefix: oai_dc
      set: cs

Now we can harvest all computer science papers (set: cs) for a selected day (e.g. 2014-12-19):

$ catmandu convert arxiv-cs --from 2014-12-19 --to 2014-12-19 to YAML

The repository may impose a delay of 20 seconds, so be patient. For more precise data, we had better use the original data format from arXiv:

$ catmandu convert arxiv-cs --set cs --from 2014-12-19 --to 2014-12-19 --metadataPrefix arXiv to YAML > arxiv.yaml

The resulting format is based on XML. Have a look at the original data (requires module Catmandu::XML):

$ catmandu convert YAML to XML --field _metadata --pretty 1 < arxiv.yaml
$ catmandu convert YAML --fix 'xml_simple(_metadata)' to YAML < arxiv.yaml

Now we’ll transform this XML data to RDF. This is done with the following fix script, saved in file arxiv2rdf.fix:

xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)

move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")

move_field(m.title,dc_title)
remove_field(m)

The following command generates one RDF triple per record, consisting of an arXiv article identifier, the property http://purl.org/dc/elements/1.1/title and the article title:

$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml

To better understand what’s going on, convert to YAML instead of RDF, so the internal aREF data structure is shown:

$ catmandu convert YAML to YAML --fix arxiv2rdf.fix < arxiv.yaml

_id: http://arxiv.org/abs/1201.1733
dc_title: On Conditional Decomposability

This record looks similar to the records imported from RDF at day 16. The special field _id refers to the subject in RDF triples: a handy feature for small RDF graphs that share the same subject in all RDF triples. Nevertheless, the same RDF graph could have been encoded like this:

---
http://arxiv.org/abs/1201.1733:
  dc_title: On Conditional Decomposability
...

To transform more parts of the original record to RDF, we only need to map field names to prefixed RDF property names. Here is a more complete version of arxiv2rdf.fix:


xml_simple(_metadata)
retain_field(_metadata)
move_field(_metadata,m)
    
move_field(m.id,_id)
prepend(_id,"http://arxiv.org/abs/")
    
move_field(m.title,dc_title)
move_field(m.abstract,bibo_abstract)
    
move_field(m.doi,bibo_doi)
copy_field(bibo_doi,owl_sameAs)
prepend(owl_sameAs,"http://dx.doi.org/")
            
move_field(m.license,cc_license)
          
move_field(m.authors.author,dc_creator)
unless exists(dc_creator.0)
  move_field(dc_creator,dc_creator.0)
end         
            
do list(path=>dc_creator)
  add_field(a,foaf_Person)
  copy_field(forenames,foaf_name.0)
  copy_field(keyname,foaf_name.$append)
  join_field(foaf_name,' ')
  move_field(forenames,foaf_givenName)
  move_field(keyname,foaf_familyName)
  move_field(suffix,schema_honorificSuffix)
  remove_field(affiliation)
end 
    
remove_field(m)

The result is one big RDF graph for all records:

$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml

Have a look at the internal aREF format by using the same fix with convert to YAML, and try conversion to other RDF serialization forms. The most important part of the transformation to RDF is to find matching RDF properties from existing ontologies. The example above uses properties from Dublin Core, Creative Commons, Friend of a Friend, Schema.org, and the Bibliographic Ontology.
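
For instance, to serialize the same graph as Turtle instead of the default, you can pass a different type to the RDF exporter. A sketch; which type names are available depends on the RDF serializers installed on your system:


$ catmandu convert YAML to RDF --fix arxiv2rdf.fix --type turtle < arxiv.yaml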

Continue to Day 18: Merry Christmas! >>

Day 16: Importing RDF data with Catmandu

A common problem of data processing is the large number of data formats, dialects, and conceptions. For instance the author field in one record format may differ from a similar field in another format in its meaning or name. As shown in the previous articles, Catmandu can help to bridge such differences, but it can also help to map from and to data structured in a completely different paradigm. This article will show how to process data expressed in RDF, the language of the Semantic Web and Linked Open Data.

RDF differs from previous formats, such as JSON, YAML, MARC, or CSV, in two important aspects:

  1. There are no records and fields: RDF data instead is a graph structure, built of nodes (“resources” or “values”) and directed links.
  2. Link types (“properties”) are identified by URIs and defined in “ontologies”. In theory this removes the common problem of diverging field meanings and names described in the introduction.

Because graph structures are fundamentally different from record structures, there is no obvious mapping between RDF and records in Catmandu. For this reason you are better off using dedicated RDF technology as long as your data is RDF. Catmandu, however, can help to process from RDF and to RDF, as shown today and tomorrow, respectively. Let’s first install the Catmandu module Catmandu::RDF for RDF processing:

$ cpanm --sudo Catmandu::RDF

If you happen to use this on a virtual machine from the Catmandu USB stick, you may first have to update another module to remove a nasty bug (the password is “catmandu”):

$ cpanm --sudo List::Util

You can now retrieve RDF data from any Linked Open Data URI like this:

$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 to YAML

We could also download RDF data into a file and parse the file with Catmandu afterwards:

$ curl -L -H 'Accept: application/rdf+xml' http://dx.doi.org/10.2474/trol.7.147 > rdf.xml
$ catmandu convert RDF --type rdfxml to YAML < rdf.xml
$ catmandu convert RDF --file rdf.xml to YAML # alternatively

Downloading RDF with the Catmandu::RDF option --url, however, is shorter and adds an _url field that contains the original source. The RDF data converted to YAML with Catmandu looks like this (I removed some parts to keep it shorter). The format is called Another RDF Encoding Form (aREF) because it can be transformed from and to other RDF encodings:

---
_url: http://dx.doi.org/10.2474/trol.7.147
http://dx.doi.org/10.2474/trol.7.147:
  dct_title: Frictional Coefficient under Banana Skin@
  dct_creator:
  - <http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72>
  - <http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72>
  dct_date:
  - 2012^xs_gYear
  dct_isPartOf: <http://id.crossref.org/issn/1881-2198>
http://id.crossref.org/issn/1881-2198:
  a: bibo_Journal
  bibo_issn: 1881-2198@
  dct_title: Tribology Online@
http://id.crossref.org/contributor/daichi-uchijima-y2ol1uygjx72:
  a: foaf_Person
  foaf_name: Daichi Uchijima@
http://id.crossref.org/contributor/kensei-tanaka-y2ol1uygjx72:
  foaf_name: Kensei Tanaka@
http://id.crossref.org/contributor/kiyoshi-mabuchi-y2ol1uygjx72:
  foaf_name: Kiyoshi Mabuchi@
http://id.crossref.org/contributor/rina-sakai-y2ol1uygjx72:
  foaf_name: Rina Sakai@
...

The sample record contains a special field _url with the original source URL and six fields with URLs (or URIs), each corresponding to an RDF resource. The field with the original source URL (http://dx.doi.org/10.2474/trol.7.147) can be used as a starting point. Each subfield (dct_title, dct_creator, dct_date, dct_isPartOf) corresponds to an RDF property, abbreviated with a namespace prefix. To fetch data from these fields, we could use normal fix functions and JSON path expressions, as shown at day 7, but there is a better way:

Catmandu::RDF provides the fix function aref_query to map selected parts of the RDF graph to another field. Try to get the title field with this command:

$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix 'aref_query(dct_title,title)' to YAML

More complex transformations should better be put into a fix file, so create file rdf.fix with the following content:

aref_query(dct_title,title)
aref_query(dct_date,date);
aref_query(dct_creator.foaf_name,author)
aref_query(dct_isPartOf.dct_title,journal)

If you apply the fix, there are four additional fields with data extracted from the RDF graph:

$ catmandu convert RDF --url http://dx.doi.org/10.2474/trol.7.147 --fix rdf.fix to YAML

The aref_query function also accepts a query language, similar to JSON path, but the path is applied to an RDF graph instead of a simple hierarchy. Moreover, one can limit results to plain strings or to URIs. For instance the author URIs can be accessed with aref_query(dct_creator.,author). This feature is especially useful if RDF data contains a property with multiple types of objects: literal strings and other resources. We can aggregate both with the following fixes:

aref_query(dct_creator@, authors)
aref_query(dct_creator.foaf_name@, authors)

Before proceeding you should add the following option to the config file catmandu.yml:

importer:
  RDF:
    package: RDF
    options:
      ns: 2014091

This makes sure that RDF properties are always abbreviated with the same prefixes, for instance dct for http://purl.org/dc/terms/.

Continue to Day 17: Exporting RDF data with Catmandu >>

Day 15: MARC to Dublin Core

Today we will look a bit further into MARC processing with Catmandu. By now you should already know how to start up the Virtual Catmandu (hint: see day 1) and open the UNIX command prompt (hint: see day 2). We already saw a bit of MARC processing in day 9 and today we will show you how to transform MARC records into Dublin Core. This is in preparation for creating RDF and Linked Data in the later posts.

First I’m going to teach you how to process different types of MARC files. On the Virtual Catmandu system we provided five example MARC files. You can find them in your Documents folder:

  • Documents/camel.mrk
  • Documents/camel.usmarc
  • Documents/marc.xml
  • Documents/rug01.aleph
  • Documents/rug01.sample

When you examine these files with the UNIX less command you will see that each file has a slightly different format:

$ less Documents/camel.mrk
$ less Documents/camel.usmarc
$ less Documents/marc.xml
$ less Documents/rug01.sample

There are many ways in which MARC data can be written into a file. Every vendor likes to use its own format. You can compare this with the different ways a text document can be stored: as Word, as Open Office, as PDF or as plain text. If we are going to process these files with catmandu, then we need to tell the system what the exact format is.
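
For instance, the ISO 2709 file and the MARCXML file can be converted like this (a sketch; ISO 2709 is the default type of the MARC importer, so no --type option is needed for it):


$ catmandu convert MARC to YAML < Documents/camel.usmarc
$ catmandu convert MARC --type XML to YAML < Documents/marc.xml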

We will work today with the last example, rug01.sample, which is a small export out of the Aleph catalog of Ghent University Library. Ex Libris uses a special MARC format to structure their data, which is called Aleph sequential. We need to tell catmandu not only that our input file is MARC but also that it is in this special Aleph format. Let’s convert it to YAML to see what it gives:

$ catmandu convert MARC --type ALEPHSEQ to YAML < Documents/rug01.sample

To transform this MARC file into Dublin Core we need to create a fix file. You can use the UNIX command nano for this (hint: see day 5 on how to create files with nano). Create a file dublin.fix:

$ nano dublin.fix

And type into nano the following fixes:

marc_map(245,title)

marc_map(100,creator.$append)
marc_map(700,creator.$append)

marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)

marc_map(260b,publisher)
marc_map(260c,date)

marc_map(650a,subject.$append)

remove_field(record)

Every MARC record contains the title of the record in the 245-field. In the first line we map the MARC-245 field to a new field in the record called title:

marc_map(245,title)

In the second and third line we map authors to a field creator. In the rug01.sample file the authors are stored in the MARC-100 and MARC-700 fields. Because there is usually more than one author in a record, we need to $append them to create an array (a list) of one or more creators.

In line 4 and line 5 we use the same trick to filter the ISBN and ISSN numbers out of the record, which we store in the separate fields isbn and issn (indeed these are not Dublin Core fields, we will process them later).

In line 6 and line 7 we read the MARC-260 field which contains publisher and date information. Here we don’t need the $append trick because there is usually only one 260-field in a MARC record.

In line 8 the subjects are extracted from the 650-field using the same $append trick as above. Notice that we only extracted the $a subfield? If you want to add more subfields you can list them as in marc_map(650abcdefgh,subject.$append).

Given the dublin.fix file above we can execute the filtering command like this:

$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample

As always you can type | less at the end of this command to slow down the screen output, or store the results into a file with > results.txt. Hint:

$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample | less
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample > results.txt

The results should look like this:

_id: '000000002'
creator:
- Katz, Jerrold J.
date: '1977.'
isbn:
- '0855275103 :'
publisher: Harvester press,
subject:
- Semantics.
- Proposition (Logic)
- Speech acts (Linguistics)
- Generative grammar.
- Competence and performance (Linguistics)
title: Propositional structure and illocutionary force :a study of the contribution of sentence meaning to speech acts /Jerrold J. Katz.
...

Congratulations, you’ve created your first mapping file to transform library data from MARC to Dublin Core! We still need to add a bit more cleaning to delete some periods and commas here and there, but as is we already have our first mapping.

Below you’ll find a complete example. You can read more about our Fix language online.

marc_map(245,title, -join => " ")

marc_map(100,creator.$append)
marc_map(700,creator.$append)

marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)

replace_all(isbn.*," .","")
replace_all(issn.*," .","")

marc_map(260b,publisher)
replace_all(publisher,",$","")

marc_map(260c,date)
replace_all(date,"\D+","")

marc_map(650a,subject.$append)
remove_field(record)
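
With this cleaned-up fix file you can also export a flat table instead of YAML, for example as CSV. A sketch; the field list is only an illustration, and list fields such as creator would need to be joined into a string first:


$ catmandu convert MARC --type ALEPHSEQ to CSV --fix dublin.fix --fields '_id,title,date,publisher' < Documents/rug01.sample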

Continue to Day 16: Importing RDF data with Catmandu >>

Day 14: Set up your own OAI data service

In the last days you have learned how to store data with Catmandu. Storing data is a cool thing, but sharing data is awesome. Interoperability is important as other people may use your data (and you will profit from other people’s interoperable data).

In the day 13 tutorial we’ve learned the basic principle of metadata harvesting via OAI-PMH.

We will set up our OAI service with the Perl Dancer framework and an easy-to-use plugin called Dancer::Plugin::Catmandu::OAI. To install the required modules run:

$ cpanm Dancer

$ cpanm Dancer::Plugin::Catmandu::OAI

and you also might need

$ cpanm Template

Let’s start and index some data with Elasticsearch as learned in the previous post:

$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc to Elasticsearch --index_name oai --bag publication

After this, you should have some data in your Elasticsearch index. Run the following command to check this:

$ catmandu export Elasticsearch --index_name oai --bag publication

Everything is fine, so let’s create a simple webservice which exposes the collected data via OAI-PMH. The following code can be downloaded from this gist.

Download this gist and create a symbolic link

$ ln -s catmandu.yml config.yml

This is necessary for the Dancer app. In this case Catmandu and Dancer are using the same configuration file.

catmandu.yml:

store:
  oai:
    package: Elasticsearch
    options:
      index_name: oai
      bags:
        publication:
          cql_mapping:
            default_index: basic
            indexes:
              _id:
                op:
                  'any': true
                  'all': true
                  '=': true
                  'exact': true
                field: '_id'
              basic:
                op:
                  'any': true
                  'all': true
                  '=': true
                  '<>': true
                field: '_all'
                description: "index with common fields..."
              datestamp:
                op:
                  '=': true
                  '<': true
                  '<=': true
                  '>=': true
                  '>': true
                  'exact': true
                field: '_datestamp'
      index_mappings:
        publication:
          properties:
            _datestamp: {type: date, format: date_time_no_millis}

plugins:
  'Catmandu::OAI':
    store: oai
    bag: publication
    datestamp_field: datestamp
    repositoryName: "My OAI DataProvider"
    uri_base: "http://oai.service.com/oai"
    adminEmail: me@example.com
    earliestDatestamp: "1970-01-01T00:00:01Z"
    deletedRecord: persistent
    repositoryIdentifier: oai.service.com
    cql_filter: "datestamp>2014-12-01T00:00:00Z"
    limit: 200
    delimiter: ":"
    sampleIdentifier: "oai:oai.service.com:1585315"
    metadata_formats:
      -
        metadataPrefix: oai_dc
        schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
        template: oai_dc.tt
        fix:
          - nothing()
    sets:
      -
        setSpec: openaccess
        setName: Open Access
        cql: 'oa=1'

oai-app.pl:

#!/usr/bin/env perl
use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;
Catmandu->load;
Catmandu->config;
oai_provider '/oai';
dance;

oai_dc.tt:

<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
[%- FOREACH var IN ['title' 'creator' 'subject' 'description' 'publisher' 'contributor' 'date' 'type' 'format' 'identifier' 'source' 'language' 'relation' 'coverage' 'rights'] %]
[%- FOREACH val IN $var %]
<dc:[% var %]>[% val | html %]</dc:[% var %]>
[%- END %]
[%- END %]
</oai_dc:dc>

What’s going on here? Well, the script oai-app.pl defines a route /oai via the plugin Dancer::Plugin::Catmandu::OAI.
The template oai_dc.tt defines the XML output of the records. And finally the configuration file catmandu.yml handles the settings for the Dancer plugin as well as for the Elasticsearch indexing and querying.

Run the following command to start a local webserver

$ perl oai-app.pl

and point your browser to http://localhost:3000/oai?verb=Identify. To get some records, go to http://localhost:3000/oai?verb=ListRecords&metadataPrefix=oai_dc.

Yes, it’s that easy. You can extend this simple example by adding fixes to transform the data as you need it.
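
You can also harvest from your freshly created data provider with Catmandu itself, which makes a nice end-to-end test. A sketch, assuming the Dancer application is still running on port 3000:


$ catmandu convert OAI --url http://localhost:3000/oai --metadataPrefix oai_dc to YAML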

Continue to Day 15: MARC to Dublin Core >>

Day 12: Index your data with ElasticSearch

ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine. You can store structured JSON documents and by default ElasticSearch will try to detect the data structure and index the data. ElasticSearch uses Lucene to provide full text search capabilities with a powerful query language. Install guides for various platforms are available at the ElasticSearch reference. To install the corresponding Catmandu module run:

$ cpanm Catmandu::Store::ElasticSearch
$ cpanm Search::Elasticsearch::Client::5_0::Direct

[For those of you running the Catmandu VirtualBox this installation is not required. ElasticSearch is installed by default.]

Now get some JSON data to work with:

$ wget -O banned_books.json https://lib.ugent.be/download/librecat/data/verbannte-buecher.json

First index the data with ElasticSearch. You have to specify an index (--index_name), a type (--bag) and the client version (--client); for Elasticsearch versions >= 2.0 you also have to add a prefix to your IDs (--key_prefix):

$ catmandu import -v JSON --multiline 1 to ElasticSearch --index_name books --bag banned --key_prefix my_ --client '5_0::Direct' < banned_books.json

Now you can export all items from an index to different formats, like XLSX, YAML and XML:

$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to YAML
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XML
$ catmandu export -v ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XLSX --file banned_books.xlsx

You can count all indexed items or those which match a query:

$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'

You can search an index for a specific value and export all matching items:

$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"' to JSON
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"' to CSV --fields 'my_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'


You can delete whole collections from a database or just items which match a query:

$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'
$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct'

Catmandu::Store::ElasticSearch supports CQL as a query language. For setup and usage, see the documentation.

Continue to Day 13: Harvest data with OAI-PMH >>

Day 11: Store your data in MongoDB

MongoDB is a cross-platform document-oriented database. As a NoSQL database, MongoDB uses JSON-like documents (BSON) with dynamic schemas, making the integration of data in applications easier and faster. Install guides for various platforms are available at the MongoDB manual. To install the corresponding Catmandu module run:

$ cpanm Catmandu::Store::MongoDB

Now get some JSON data to work with:

$ wget -O banned_books.json https://lib.ugent.be/download/librecat/data/verbannte-buecher.json

First import the data to MongoDB. You have to specify in which database (--database_name) and collection (--bag) you want to store the data:

$ catmandu import -v JSON --multiline 1 to MongoDB --database_name books --bag banned < banned_books.json

Now you can export all items from a collection to different formats, like XLSX, YAML and XML:

$ catmandu export MongoDB --database_name books --bag banned to YAML
$ catmandu export MongoDB --database_name books --bag banned to XML
$ catmandu export -v MongoDB --database_name books --bag banned to XLSX --file banned_books.xlsx

You can count all items in a collection or those which match a query:

$ catmandu count MongoDB --database_name books --bag banned
$ catmandu count MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": "1937"}'
$ catmandu count MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}'

MongoDB uses a JSON-like query language that supports a variety of operators.

You can query a collection for a specific value and export all matching items:

$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": "1937"}' to JSON
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}' to CSV --fields '_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'

You can use regular expressions for queries, e.g. to get all items which were published at a place starting with “B”:

$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": {"$regex":"^B.*"}}' to CSV --fields '_id,firstEditionPublicationPlace'

MongoDB supports several comparison operators, e.g. you can query items which were published before/after a specific date or at specific places:

$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": {"$lt":"1940"}}' to CSV --fields '_id,firstEditionPublicationYear'
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationYear": {"$gt":"1940"}}' to CSV --fields '_id,firstEditionPublicationYear'
$ catmandu export MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace":{"$in":["Berlin","Bern"]}}' to CSV --fields '_id,firstEditionPublicationPlace'

Logical operators are also supported, so you can combine query clauses:

$ catmandu export MongoDB --database_name books --bag banned --query '{"$and":[{"firstEditionPublicationYear": "1937"},{"firstEditionPublicationPlace": "Berlin"}]}' to JSON
$ catmandu export MongoDB --database_name books --bag banned --query '{"$or":[{"firstEditionPublicationPlace": "Berlin"},{"secondEditionPublicationPlace": "Berlin"}]}' to JSON

With the element query operators you can match items that contain a specified field:

$ catmandu export MongoDB --database_name books --bag banned --query '{"field_xyz":{"$exists":"true"}}'

Collection and items can be moved within MongoDB or even to other stores or search engines:

$ catmandu move MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}' to MongoDB --database_name books --bag berlin
$ catmandu move MongoDB --database_name books --bag banned to Elasticsearch --index_name books --bag banned

You can delete whole collections from a database or just items which match a query:

$ catmandu delete MongoDB --database_name books --bag banned --query '{"firstEditionPublicationPlace": "Berlin"}'
$ catmandu delete MongoDB --database_name books --bag banned

MongoDB supports several more methods. These methods are not available via the Catmandu command line interface, but can be used in Catmandu modules and scripts.

See Catmandu::Store::MongoDB for further documentation.

Continue to Day 12: Index your data with ElasticSearch >>

Create a fixer [Part 1]

By Patrick Hochstenbach

This is part one of a two-part overview of extending the Catmandu Fix language. As you might know already, the Fix language is used in Catmandu to transform JSON data. Our rationale for it is to ease the manipulation of library data formats. Fix is to JSON data what XSLT is to XML.

If you have a JSON input file containing name/value pairs you can use fixes to add or delete fields:

$ echo '{ "name": "Patrick" }' | catmandu convert
{"name":"Patrick"}

$ echo '{ "name": "Patrick" }' | catmandu convert --fix 'add_field("age","42")'
{"age":42,"name":"Patrick"}

You can write fixes on the command line or in a file:

$ cat myfixes.txt
add_field("age","42");
add_field("my.favorite.color","blue");
$ echo '{ "name": "Patrick" }' | catmandu convert --fix myfixes.txt
{"name":"Patrick","my":{"favorite":{"color":"blue"}},"age":42}

Check out our Cheat Sheet for more examples of possible fixes.

Fix code can also be integrated into Perl scripts. We can repeat the experiments above with the Perl code below.

use Catmandu::Fix;
use DDP;

my $fixer = Catmandu::Fix->new(fixes => ['add_field("age","42")']);
my $hash  = $fixer->fix({ name => 'Patrick '});

p $hash;

# Read the fixes from a file
my $fixer = Catmandu::Fix->new(fixes => ['myfixes.txt']);
my $hash  = $fixer->fix({ name => 'Patrick '});

p $hash;

Starting from Catmandu version 0.8006 it is also possible to inline all the fixes.

use Catmandu::Fix::add_field as => 'add_field';
use DDP;

my $hash  = { name => 'Patrick ' };

add_field($hash,'age','42');
add_field($hash,'my.favorite.color','blue');

p $hash;

To extend the Fix language you need to create a new Perl package in the Catmandu::Fix namespace which implements a ‘fix’ method. This method gets a Perl hash as input and should return the changed Perl hash. Here is a trivial example of an ‘invert_field’ Fix which transforms a name/value pair into a value/name pair:

package Catmandu::Fix::invert_field;

sub new {
    my ($class,$path) = @_;
    return bless { path => $path } , $class;
}

sub fix {
    my ($self,$data) = @_;
    my $name  = $self->{path};
    my $value = $data->{$name};

    delete $data->{$name};

    $data->{$value} = $name;

    $data;
}

1;

With this fix installed we can invert name/value pairs in our input data:

$ echo '{ "name": "Patrick" }' | catmandu convert --fix 'invert_field("name")'
 {"Patrick":"name"}

If you know exactly where in the Perl hash you need to make data changes, then this method of creating Fix functions is quite easy and straightforward. Things get complicated when you want to manipulate deeply nested hashes and arrays. For instance, almost all of the Catmandu provided fixes can manipulate very deeply nested structures:

add_field("my.very.deep.field.13.subfield.1.name","Patrick");

The ‘add_field’ fix above will operate on a very deep hash of arrays. To support this in your own Fix functions you need to be able to parse JSON paths. In a next blog post we will go into more detail about this process.

Versioning using Catmandu::Plugin

By Patrick Hochstenbach

Recently we had a question on our mailing list from Jakob Voss asking whether Catmandu supports versioning of records. Indeed we do! In this post I’ll give an example of what is possible using the Catmandu::Plugin modules developed by Nicolas Steenlant.

First, let’s create a sample application in Perl to store some records in an in-memory hash for test purposes:

 


#!/usr/bin/env perl

use Catmandu::Store::Hash;
use Data::Dumper;

my $store = Catmandu::Store::Hash->new();

$store->bag->add({
    '_id'  => '001',
    'name' => 'John Doe'
});

$store->bag->add({
    '_id'  => '001',
    'name' => 'John Moo'
});

$store->bag->each(sub { print Dumper($_[0]) });

First we store a record with id ‘001’ and name ‘John Doe’, and then we overwrite this record with ‘John Moo’. As a result you should see only the version ‘John Moo’ printed on screen.

To add versioning we have to edit the initialisation of the Hash store and add a ‘Versioning’ plugin to the ‘data’ bag. As you may recall, ‘data’ is the default name of the storage container in your Catmandu::Store. Using the get_history method all the versions of your records can be retrieved.


#!/usr/bin/env perl

use Catmandu::Store::Hash;
use Data::Dumper;

my $store = Catmandu::Store::Hash->new(
    bags => {
        'data' => {
            plugins => [qw(Versioning)]
        }
});

$store->bag->add({
    '_id'  => '001',
    'name' => 'John Doe'
});

$store->bag->add({
    '_id'  => '001',
    'name' => 'John Moo'
});

print "Versions:\n";

for (@{$store->bag->get_history('001')}) {
    print Dumper($_);
}

Or, you can get one particular version with the ‘get_version’ method:

my $obj = $store->bag->get_version('001',2);

Or, the previous version with ‘get_previous_version’:

my $obj = $store->bag->get_previous_version('001');

Sometimes you don’t want to create a new version of a record when nothing in the data has changed except for, say, some date field. This can be set using the ‘version_compare_ignore’ option when creating a store:

 


my $store = Catmandu::Store::Hash->new(
        bags => {
            'data' => {
                plugins => [qw(Versioning)] ,
                version_compare_ignore => [qw(access_date)],
            }
});

Using this setting, new versions will only be created when any field except ‘access_date’ changes in value.

Extend Catmandu without Perl

By Patrick Hochstenbach

With Catmandu we create ETL-pipelines for library workflows. Read data from OAI, SRU, Z39.50, PubMed, arXiv, transform it with Catmandu Fixes and load the results into Solr, MongoDB, CouchDB, or serialize it into YAML, CSV, XML, whatever you like. Read my blog post about the Catmandu Cheat Sheet to get a quick recap.

Today I want to show you how you can create your own Fix routines in any programming language using the Catmandu::Fix::cmd module which Nicolas Steenlant created.

First we create a small Perl script to generate some sample JSON we will use in our examples (you can use your own JSON file or translate this trivial script into Python, Ruby, Java, C, Clojure, Go …).

Here is our little JSON generator:

#!/usr/bin/env perl
# file: generate.pl

use JSON;

for (1...1000) {
    print encode_json({ random => rand }) , "\n";
}

When we execute the script we’ll get one thousand lines of JSON in our terminal:

$ ./generate.pl
{"random":0.721613357218615}
{"random":0.491180438229559}
{"random":0.868290266595814}
.
.
.

It is now easy to use Catmandu Fixes to transform these JSON records. E.g. we can add a new field ‘title’ with content ‘test’:

$ ./generate.pl | catmandu convert JSON --fix 'add_field("title","test")'
{"random":0.611390470122803,"title":"test"}
{"random":0.915937067437753,"title":"test"}
{"random":0.461684127836374,"title":"test"}
.
.
.

This add_field() Fix was written in Perl. What if you need to write a new, complicated Fix routine and don’t want to use Perl? Well, we have Catmandu::Fix::cmd to the rescue! You can create fixes in any language you like: as long as your program can read JSON records from the standard input and can write JSON records to the standard output you are cool. Let’s try that out.

As example we create a Python script to read JSON from the stdin, add a title field and write the JSON back to stdout.

#!/usr/bin/env python
# file: catjson.py
import sys
import json

while 1:
    line = sys.stdin.readline()
    if not line: break
    data = json.loads(line.strip())
    data['title'] = "test";
    print json.dumps(data)

If we run this we can see the expected result.

$ ./generate.pl | ./catjson.py
{"random": 0.530965947974309, "title": "test"}
{"random": 0.371021223752646, "title": "test"}
{"random": 0.0907161737840951, "title": "test"}
.
.
.

With the Catmandu Fix ‘cmd’ we can make this Python program part of an ETL-pipeline. In the simple example below we will repeat the previous test:

$ ./generate.pl | catmandu convert JSON --fix 'cmd("./catjson.py")'
{"random":0.554686750713572,"title":"test"}
{"random":0.275637603863029,"title":"test"}
{"random":0.318374223918873,"title":"test"}
.
.
.

Now that this is working, you can add the whole Catmandu stack to this pipeline. Add different importers, new fixes, store into ElasticSearch or MongoDB. E.g. we can do an SRU query and use our Python and Perl fixes simultaneously:

$ catmandu convert SRU --base http://www.unicat.be/sru --query dna --fix 'cmd("./catjson.py");remove_field("recordData")'
{"recordPacking":"xml","recordPosition":"1","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
{"recordPacking":"xml","recordPosition":"2","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
{"recordPacking":"xml","recordPosition":"3","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
.
.
.
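
Instead of printing to the terminal, the same pipeline can of course end in a store, as shown in the earlier posts. A sketch; the database and bag names are only examples:


$ ./generate.pl | catmandu import JSON --fix 'cmd("./catjson.py")' to MongoDB --database_name test --bag random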

Here is how the same program might look in Lua:

#!/usr/bin/env luajit
-- file: catjson.lua
-- requires dkjson http://chiselapp.com/user/dhkolf/repository/dkjson/home
local json = require ("dkjson")

for line in io.lines() do
    local obj, pos, err = json.decode (line, 1, nil)
    obj['title'] = 'test'
    print(json.encode(obj))
end

With the same expected results:

$ ./generate.pl | catmandu convert JSON --fix 'cmd("./catjson.lua")'
{"random":0.54868770433573,"title":"test"}
{"random":0.26483418097243,"title":"test"}
{"random":0.15708750198151,"title":"test"}
.
.
.

Using Catmandu::Fix::cmd you can create complicated fix routines to extend your data crunching needs.