Matching authors against VIAF identities

At Ghent University Library we enrich catalog records with VIAF identities to enhance the search experience in the catalog. When searching for all the books about ‘Chekhov’ we want to match all name variants of this author. Consult the VIAF page for ‘Chekhov, Anton Pavlovich, 1860-1904’ and you will see many of them.

  • Chekhov
  • Čehov
  • Tsjechof
  • Txékhov
  • etc

Any of these name variants can appear in the catalog data if authority control is not in place (or not maintained). Searching for any one of these names should return results for all the variants. In the past, maintaining an authority file was a labor-intensive, manual job for catalogers. Using results from Linked Data Fragments research by Ruben Verborgh (iMinds), the Catmandu-RDF tools created by Jakob Voss (GBV) and RDF-LDF by Patrick Hochstenbach, Ghent University started an experiment to automatically enrich authors with VIAF identities. In this blog post we report on the setup and results of this experiment, which will also be presented at ELAG2015.


Three ingredients are needed to create a web of data:

  1. A scalable way to produce data.
  2. The infrastructure to publish data.
  3. Clients accessing the data and reusing them in new contexts.

On the production side there doesn’t seem to be any problem: libraries can easily create huge datasets. Any transformation of library data to linked data will quickly generate an enormous number of RDF triples. We see this in the size of publicly available datasets:

Accessing data also seems covered from a consumer’s perspective. Instead of thousands of APIs and many document formats for every dataset, SPARQL and RDF provide the programmer with a single protocol and document model.

The claim of the Linked Data Fragments researchers is that on the publication side, reliable queryable access to public Linked Data datasets largely remains problematic due to the low availability percentages of public SPARQL endpoints [Ref]. This is confirmed by a 2013 study by researchers from Pontificia Universidad Católica in Chile and the National University of Ireland, which found that more than half of the public SPARQL endpoints are offline about 1.5 days per month. This corresponds to an availability rate of less than 95% [Ref].

The source of this high rate of unavailability can be traced back to the service model of Linked Data, where two extremes exist for publishing data (see image below).

At one extreme, data dumps (or dereferencing of URLs) can be made available, which requires only a simple HTTP server but lots of processing power on the client side. At the other extreme, an open SPARQL endpoint can be provided, which requires a lot of processing power (hence, hardware investment) on the server side. With SPARQL endpoints, clients can demand the execution of arbitrarily complicated queries. Furthermore, since each client sends unique, highly specific queries, regular caching mechanisms are ineffective: they can only optimize for repeated identical requests.

This situation can be compared with giving end users either a database SQL dump or an open database connection on which any possible SQL statement can be executed. To a lesser extent, libraries are well aware of the different modes of operation between running OAI-PMH services and Z39.50/SRU services.

Linked Data Fragments researchers provide a third way, Triple Pattern Fragments, which tries to offer the best of both worlds: access to a full dump of a dataset through a queryable and cacheable interface. For more information on the scalability of this solution, see the report presented at the 5th International USEWOD Workshop.

The experiment

VIAF doesn’t provide a public SPARQL endpoint, but a complete dump of the data is available for download. In our experiment we used the VIAF (Virtual International Authority File) dump, which is made available under the ODC Attribution License. From this dump we created an HDT database. HDT provides a very efficient format to compress RDF data while maintaining browse and search functionality. Using command line tools, RDF/XML, Turtle and NTriples can be compressed into an HDT file with an index. This standalone file can be used to query huge datasets without the need for a database. The VIAF conversion to HDT results in a 7 GB file and a 4 GB index.
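The conversion itself can be done with the command line tools that ship with the HDT libraries. As a sketch (tool names from the hdt-cpp toolkit; the file names are examples and options may differ between versions):

```
$ rdf2hdt -f turtle viaf.ttl viaf.hdt   # compress the RDF dump into an HDT file
$ hdtSearch viaf.hdt                    # interactively search the compressed data
```

The first run also builds the index file next to the HDT file, so subsequent queries start quickly.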

Using the Linked Data Fragments server by Ruben Verborgh, this HDT file can be published as a NodeJS application.

For a demonstration of this server visit the iMinds experimental setup at:

Triple Pattern Fragments provides a simple REST protocol to query this dataset. For instance, it is possible to download the complete dataset using this query:

$ curl -H "Accept: text/turtle"

If we only want the triples concerning Chekhov, we can provide a query parameter:

$ curl -H "Accept: text/turtle"

Likewise, using the predicate and object query parameters, any combination of triples can be requested from the server.

$ curl -H "Accept: text/turtle""Chekhov"
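All three requests follow the same Triple Pattern Fragments URL pattern: a fragment URL with optional subject, predicate and object query parameters. With a hypothetical endpoint URL (not the actual iMinds address) this looks like:

```
$ curl -H "Accept: text/turtle" 'http://example.org/viaf?object="Chekhov"'
```

The server answers with the matching triples plus metadata and hypermedia controls that tell a client how to request the other patterns.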

The memory requirements of this server are small enough to run a copy of the VIAF database on a MacBook Air laptop with 8GB RAM.

Using specialised Triple Pattern Fragments clients, SPARQL queries can be executed against this server. For the Catmandu project we created a Perl client RDF::LDF which is integrated into Catmandu-RDF.

To request all triples from the endpoint use:

$ catmandu convert RDF --url --sparql 'SELECT * {?s ?p ?o}'

Or, only those Triples that are about “Chekhov”:

$ catmandu convert RDF --url --sparql 'SELECT * {?s ?p "Chekhov"}'

In the Ghent University experiment a more direct approach was taken to match authors to VIAF. First, a MARC dump from the catalog is streamed into a Perl program using a Catmandu iterator. Then we extract the 100 and 700 fields, which contain $a (name) and $d (date) subfields. These two subfields are combined into a search query, as if we would search for:

Chekhov, Anton Pavlovich, 1860-1904
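In Catmandu Fix terms, the extraction step can be sketched like this (a sketch, not the exact code from the gist; the target field name is a made-up example):

```
marc_map('100ad', search.$append, join: ', ')
marc_map('700ad', search.$append, join: ', ')
```

Each mapped value then serves as one lookup key against the local VIAF copy.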

If there is exactly one hit in our local VIAF copy, then the result is reported. A complete script to process MARC files this way is available at a GitHub gist. To run the program against a MARC dump execute the command:

$ ./ --type USMARC file.mrc
000000089-2 7001  L $$aEdwards, Everett Eugene,$$d1900-
000000122-8 1001  L $$aClelland, Marjorie Bolton,$$d1912-
000000124-4 7001  L $$aSchein, Edgar H.
000000124-4 7001  L $$aKilbridge, Maurice D.,$$d1920-
000000124-4 7001  L $$aWiseman, Frederick.
000000221-6 1001  L $$aMiller, Wilhelm,$$d1869-
000000256-9 1001  L $$aHazlett, Thomas C.,$$d1928-

[edit: 2017-05-18 an updated version of the code is available as a Git project ]

All the authors in the MARC dump are exported. If there is exactly one match against VIAF, it is added to the author field. We ran this command overnight in a single thread against 338,426 authors containing a date and found 135,257 exact matches in VIAF (40%).

In a recent follow-up to our experiments, we investigated how LDF clients can be used in a federated setup. By combining the triple results from many LDF servers in the LDF algorithm, one SPARQL query can be run over many machines. This is demonstrated at the iMinds demo site, where a single SPARQL query can be executed over the combined VIAF and DBPedia datasets. A Perl implementation of this federated search is available in the latest version of RDF-LDF on GitHub.

We strongly believe in the success of this setup and the scalability of this solution, as demonstrated by Ruben Verborgh at the USEWOD Workshop. Using Linked Data Fragments, a range of solutions is available to publish data on the web: from simple data dumps to a full SPARQL endpoint, any service level can be provided given the resources available. For more than half a year, DBPedia has been running an LDF server with 99.9994% availability on an 8-CPU, 15 GB RAM Amazon server handling 4.5 million requests. Scaling out, services such as the LOD Laundromat clean 650,000 datasets and provide access to them using a single fat LDF server (256 GB RAM).

For more information on federated searches with Linked Data Fragments, visit the blog post by Ruben Verborgh at:

Preprocessing Catmandu fixes

Ever had a need for passing a lot of configuration parameters to a Catmandu Fix script, or found yourself writing repetitive code over and over again in large scripts? There is a neat Bash trick you can use to preprocess your scripts.

For instance, imagine you have a large JSON file that needs to be processed for many customers. To each record in the file you need to add the URL of the customer’s homepage:
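Each per-customer Fix script (e.g. customer1.fix) could be a single add_field; the field name and URL here are made-up examples:

```
add_field(homepage, "http://customer1.example.org")
```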


You could do this by creating a Fix script for every customer and running the convert command for each of them:

$ catmandu convert --fix customer1.fix < data.json
$ catmandu convert --fix customer2.fix < data.json
$ catmandu convert --fix customer3.fix < data.json

There is another way to do this using named pipe redirects in Bash. Instead of writing one Fix script for each customer, you can write one Fix script for all customers that includes preprocessor macros:
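The shared customer.fix then contains a macro name instead of a literal value; a minimal sketch (the field name is a made-up example):

```
add_field(homepage, HOMEPAGE)
```

When m4 runs over this file, every occurrence of HOMEPAGE is replaced by the value given with its -D option.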


With this script customer.fix you can use a preprocessor to populate the HOMEPAGE field:

$ catmandu convert --fix <(m4 -DHOMEPAGE=\"\" customer.fix) < data.json

$ catmandu convert --fix <(m4 -DHOMEPAGE=\"\" customer.fix) < data.json

$ catmandu convert --fix <(m4 -DHOMEPAGE=\"\" customer.fix) < data.json

Bash creates a temporary named pipe that is given as input to the catmandu command, while in the background an m4 processor processes the customer.fix file.
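Process substitution itself is plain Bash and easy to try out without Catmandu; here sed stands in for m4 as a trivial preprocessor:

```shell
# Bash replaces <( ... ) with a /dev/fd path connected to the command's
# output, so 'cat' reads the preprocessed fix text as if it were a file.
cat <(printf 'add_field(homepage, HOMEPAGE)\n' | sed 's|HOMEPAGE|"http://example.org"|')
```

This prints the Fix line with HOMEPAGE already substituted, exactly what the catmandu command would receive via --fix.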

You can put any command into the named pipe redirect. There are plenty of interesting preprocessors that can be used to process Fix files, such as cpp, m4 and even the Template Toolkit tpage command.

Importing files from a hotfolder directory

The Catmandu data processing toolkit facilitates many import, export, and conversion tasks with support for common APIs (e.g. SRU, OAI-PMH) and databases (e.g. MongoDB, CouchDB, SQL…). But sometimes the best API and database is the file system. In this brief article I’ll show how to use a “hotfolder” to automatically import files into a Catmandu store.

A hotfolder is a directory in which files can be placed to automatically get processed. To facilitate the creation of such directories I created the CPAN module File::Hotfolder. Let’s first define a sample importer and storage in catmandu.yml configuration file:

    importer:
      json:
        package: JSON
        options:
          multiline: 1
    store:
      couchdb:
        package: CouchDB
        options:
          default_bag: import

We can now manually import JSON files into the import database of a local CouchDB like this:

catmandu import json to couchdb < filename.json

Manually calling such command for each file can be slow and requires access to the command line. How about defining a hotfolder to automatically import all JSON files into CouchDB? Here is an implementation:

use Catmandu -all;
use File::Hotfolder;

my $hotfolder = "import";
my $importer  = "json";
my $suffix    = qr{\.json$};
my $store     = store("couchdb");

watch( $hotfolder,
    filter   => $suffix,
    scan     => 1,
    delete   => 1,
    callback => sub {
        # import each new or modified file into the store
        $store->add_many( importer($importer, file => shift) );
    },
    catch    => 1,
)->loop;
The import directory is first scanned for existing files with extension .json and then watched for modified or new files. As soon as a file is found, it is imported. The catch option ensures that the program is not killed if an import fails, for instance because of invalid JSON.

The current version of File::Hotfolder only works on Unix, but it may be extended to other operating systems as well.

LibreCat/Memento Hackathon



The New Yorker tells us that the average life of a Web page is about a hundred days. Websites don’t have to be deliberately deleted to disappear: sites hosted by corporations tend to die with their hosts. Even the Web page you are viewing now is in flux; new blog posts might appear, comments and reviews are added. Bookmarks or references you make to Web pages in general do not point to the same information you were reading when you visited the page, or when you were writing an article about that page. All this is very problematic in an academic context, where provenance and diplomatics are crucial to analyse documents. To point to a static version of a Web page one can make use of services like the Internet Archive and Archive Today. But these solutions tend to be ad hoc; there is no common strategy to refer to a static version of a web page. In comes Memento, a protocol created by Herbert Van de Sompel and Michael Nelson which adds services on top of HTTP to travel the web of the past.

During a two day Hackathon event at Ghent University Library technologists from all over Europe gathered to explore time travel using the Memento protocol presented by Herbert Van de Sompel and Harihar Shankar from Los Alamos National Laboratory.

The slides of this event are available here.


Herbert Van de Sompel – Los Alamos National Laboratory, Harihar Shankar – Los Alamos National Laboratory, Najko Jahn – Bielefeld University, Vitali Peil – Bielefeld University, Christian Pietsch – Bielefeld University, Dries Moreels – Ghent University, Patrick Hochstenbach – Ghent University, Nicolas Steenlant – Ghent University, Nicolas Franck – Ghent University, Katrien Deroo – Ghent University, Ruben Verborgh – iMinds, Miel Vander Sande – iMinds, Snorri Briem – Lund University, Maria Hedberg – Lund University, Benoit Pauwels – Université Libre de Bruxelles, Anthony Leroy – Université Libre de Bruxelles, Benoit Erken – Université Catholique de Louvain, Laurent Dubois – Université Catholique de Louvain

Introduction to Memento

The goal of Memento is to provide a protocol for accessing historical versions of web resources. These archived versions, called Mementos, can reside in the content management system of a website or in external services such as web archives.

Take Wikipedia as an example. To view the current version of the lemma ‘Memento_Project’ one visits its web resource. Wikipedia also provides historical versions of this resource; in this case the WikiMedia platform keeps all the historical versions of a resource.

Another example is Tim Berners-Lee’s homepage. The W3C website doesn’t provide an archive of versions of this webpage, but it is archived at the Internet Archive, Archive-It, UK Web Archive and Archive Today.

How can a machine discover all versions of a web resource automatically?

As Gerald Sussman says: “Wishful thinking is essential to good engineering, and certainly essential to good computer science”. We might imagine any web resource (such as the Wikipedia page or Berners-Lee’s homepage above), called the original resource (URI-R), as a box that simply tells a machine where to find all its archived versions using a standard syntax, the HTTP protocol.


A machine visits the resource URI-R and requests the “2007-05-31” version. The answer should be a link to the archived version of the resource, called the Memento (URI-M). There are some complications which the Memento protocol needs to consider:

  • Not all websites have a content management system with versioning; the resource at URI-R might not know where an archived Memento is, or whether one exists at all.
  • Web archives such as the Internet Archive don’t have complete coverage of the Internet; many web archives might need to be visited to find a Memento.
  • Even when a resource is available in a Web archive, not all versions of the resource might be available: the archive may contain only a fragmented history of versions.

To implement the time travel protocol, Memento introduces a service called a TimeGate (URI-G), which acts as a router for time travel requests. As input it receives the address of a resource (URI-R) and a datetime (e.g. “2007-05-31”); as response it returns the URL of the archived resource, the Memento (URI-M).

A machine visits URI-R and requests the “2007-05-31” version. The server redirects the machine to a TimeGate (URI-G), which has a routing table of where to find archived versions, or at least a version close to the requested date.
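On the wire, this datetime negotiation happens with the Accept-Datetime HTTP header defined by the Memento protocol (RFC 7089); a sketch against a hypothetical resource:

```
$ curl -I -H "Accept-Datetime: Thu, 31 May 2007 12:00:00 GMT" http://example.org/page
```

A Memento-enabled server answers with a Link header advertising its TimeGate, which in turn redirects the client to the Memento (URI-M) closest to the requested datetime.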


The TimeGate can be a service that runs locally, querying the local content management system, or on the Internet, perhaps connected to a large web archive or a knowledge base of access routes to versioning systems like GitHub or Wikipedia.

You might ask, how does a TimeGate (URI-G) itself know where the archived version of a particular resource lives? We can look at three cases:

  • When the TimeGate is connected to the content management system of a website, it can query the local version database. Given a local URL and date it can find out which local versions are available. The TimeGate can even provide a complete listing of all versions of a particular local URL; this is known as the TimeMap (URI-T) of a resource.
  • When a TimeGate needs to find an archived version of a remote URL for which no further information is known locally, it can forward the request to other well-known TimeGate servers. Typically a TimeGate running at a web archive has a huge repository of URLs for which Mementos exist. Based on this information the request can be answered.
  • Or the TimeGate knows the version APIs of services such as GitHub, Wikipedia and the Internet Archive, and acts as a gateway translating Memento requests into service-specific version requests.

In the example below a machine requests the version “2007-05-31” of a resource from a TimeGate (URI-G). The TimeGate doesn’t know the answer, but can query one or more remote TimeGate services (which keep an index of Mementos in a TimeMap, URI-T), e.g. Internet Archive, Archive-It and Archive Today, and request all versions of the resource. Some TimeGate servers might give zero results; some might answer with a listing of all available versions. Based on this information the TimeGate server can decide which results best fit the original request.


Memento Example

As a practical example, one can turn any web browser into a machine that understands the Memento protocol by including a bit of JavaScript into a web page.

In the <head> of an HTML page include the following code snippet:

<link rel="stylesheet" type="text/css" href="" />
<script type="text/javascript" src=""></script>

Now one can add HTML5 data attributes to web links. In this way it is possible to link to a particular version of a web resource. E.g. to link to the “2014-11-01” version of the LibreCat homepage one can write:

<a href="" data-versiondate="2014-11-01">link</a>

Automatically this link will get a menu option pointing to the archived version of this web page (using a web archive as TimeGate).


See a demonstration here:

Using the Memento plugin for Chrome, this JavaScript trick is not even needed: data-versiondate attributes are automatically turned into archive links. One can choose from many public web archives as TimeGate. In the case of Archive Today one can even take an active role in archiving webpages: just provide a URL and it will be stored!

Read more on this project on the Robust Links page.


The second day was used to implement the Memento protocol in various tools and environments. All the results are available as open source projects on GitHub:

LDF Memento

The Web is full of high-quality Linked Data, but in general it can’t be reliably queried. Public SPARQL endpoints are often unavailable because they need to answer many unique queries. The Linked Data Fragments conceptual framework makes it possible to define more lightweight interfaces, which enable client-side execution of complex queries.

During the Hackathon, Miel Vander Sande and Ruben Verborgh of iMinds extended the LDF server and client to allow Memento-based querying. A demonstrator was built in which many versions of DBPedia are made available using the Memento protocol. By adding the correct headers to queries, historical Linked Data dumps can be queried with SPARQL and compared.

R Memento

In data science, R is the language for data analysis and data mining. The language is known for its strong statistical and graphical support.

Najko Jahn of Bielefeld University created an R client for Memento called Timetravelr. With this tool he demonstrated how HTML tables can be extracted from websites and transformed into a dataset. Using the Memento protocol, this dataset can be tracked over time to generate a time series. In his demonstration Najko showed the evolution of conforming OAI repositories by tracking the OAI registry over time.

GitLab Memento

GitLab is a web-based Git repository manager with wiki and issue tracking features. It is similar to GitHub but, unlike GitHub, has an open source version. Bielefeld University Library is using GitLab as a platform to manage source code and (soon) research data. During the Hackathon, Christian Pietsch (Bielefeld University) created a GitLab handler for the Memento TimeGate software using the GitLab Commits API.

Plack Memento

PSGI/Plack is a Perl middleware to build web applications, comparable with WSGI in Python and Rack in Ruby. Using Plack it becomes very easy to make RESTful web applications with only a few lines of Perl code. By creating Plack plugins new functionality can be added to existing web applications without needing to change the application specific code.

Nicolas Steenlant (Ghent University), Vitali Peil (Bielefeld University) and Maria Hedberg (Lund University) created a Memento plugin for Plack which turns every REST application into a Memento TimeGate if a versioning database is available. As a special case, Nicolas, Vitali and Maria demonstrated with Catmandu how versioning can be added to databases such as Elasticsearch, MongoDB, CouchDB and DBI. Programmers only need to take care of the logic of the database records; Catmandu and Plack take care of the rest.

Catmandu Memento

Catmandu is the ETL backbone of the LibreCat project. Using Catmandu librarians can extract bibliographic data from various sources such as catalogs, institutional repositories, A&I databases, search engines and transform this data with a small language called Fix. The results of these transformations can be published again into catalogs, search engines, CSV reports, Atom feeds and Linked Data.

During the Hackathon, Patrick Hochstenbach (Ghent University) and Snorri Briem (Lund University) created Memento support for the Catmandu tools. As a demonstration they showed how librarians can use Catmandu as a URL checker: MARC records were exported from a catalog, URLs were extracted from the 856$u field and checked against TimeGates for the availability of archived versions.

Day 18: Merry Christmas!


Thank you all for joining our Catmandu advent calendar this month. We hope that you enjoyed our daily posts. Catmandu is a very rich programming environment which provides command line tools and even an API. In these blogposts we provided only a short introduction into all these modules. Hopefully we will see you next year again with more examples!

The Catmandu community consists of all people involved in the project, no matter if they do programming, documentation, or drawing cats. We want to thank them all for a wonderful year!

  • Christian Pietsch
  • Dave Sherohman
  • Friedrich Summann
  • Jakob Voss
  • Johann Rolschewski
  • Jörgen Eriksson
  • Maria Hedberg
  • Mathias Lösch
  • Najko Jahn
  • Nicolas Franck
  • Nicolas Steenlant
  • Patrick Hochstenbach
  • Petra Kohorst
  • Snorri Briem
  • Vitali Peil

And a big round of applause for our contributors who kept sending us bug reports and ideas for new features. If you would like to contribute, then please take a look at the contributions section of the Catmandu documentation. Don’t be shy to contact us with questions, feature requests, bug fixes, documentation and cat cartoons!

This advent calendar will stay online for your reference.

As a special gift we still have some Catmandu USB sticks available that we can send to you. Please drop a line to “patrick dot hochstenbach at ugent dot be”. The first 5 emailers will get a free USB stick!


Day 17: Exporting RDF data with Catmandu

Yesterday we learned how to import RDF data with Catmandu. Exporting RDF can be as easy as this:

catmandu convert RDF --url to RDF

By default, the RDF exporter Catmandu::Exporter::RDF emits RDF/XML, an ugly and verbose serialization format of RDF. Let’s configure Catmandu to use the also verbose but less ugly NTriples. This can either be done by appending --type ntriples on the command line or by adding the following to the config file catmandu.yml:

    exporter:
      RDF:
        package: RDF
        options:
          type: ntriples

The NTriples format illustrates the “true” nature of RDF data as a set of RDF triples or statements, each consisting of three parts (subject, predicate, object).
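For instance, a single hypothetical statement in NTriples looks like this (subject and object values are examples; the predicate is the Dublin Core title property):

```
<http://example.org/article/123> <http://purl.org/dc/elements/1.1/title> "An example title" .
```

Each line is one self-contained triple terminated by a dot, which is what makes the format easy to stream and to diff.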

Catmandu can be used for converting one RDF serialization format to another, but more specialized RDF tools, such as rapper, are more performant, especially for large datasets. Catmandu is better suited to converting RDF data to JSON, YAML, CSV etc. and vice versa.

Let’s proceed with a more complex workflow, using what we learned at day 13 about OAI-PMH, and another popular repository: arXiv. There is a dedicated Catmandu module Catmandu::ArXiv for searching the repository, but arXiv also supports OAI-PMH for bulk download. We could specify all options on the command line, but putting the following into catmandu.yml will simplify each call:

    importer:
      arxiv:
        package: OAI
        options:
          metadataPrefix: oai_dc
          set: cs

Now we can harvest all computer science papers (set: cs) for a selected day (e.g. 2014-12-19):

$ catmandu convert arxiv --from 2014-12-19 --to 2014-12-19 to YAML

The repository may impose a delay of 20 seconds, so be patient. For more precise data, we better use the original metadata format from arXiv:

$ catmandu convert arxiv --set cs --from 2014-12-19 --to 2014-12-19 --metadataPrefix arXiv to YAML > arxiv.yaml

The resulting format is based on XML. Have a look at the original data (requires module Catmandu::XML):

$ catmandu convert YAML to XML --field _metadata --pretty 1 < arxiv.yaml
$ catmandu convert YAML --fix 'xml_simple(_metadata)' to YAML < arxiv.yaml

Now we’ll transform this XML data to RDF. This is done with the following fix script, saved in file arxiv2rdf.fix:
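A minimal sketch of such a Fix script (the field names are assumptions based on the aREF output shown further below):

```
xml_simple(_metadata)
copy_field(_metadata.title, dc_title)
retain(_id, dc_title)
```

The record’s _id becomes the RDF subject, and dc_title is interpreted as the Dublin Core title property.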




The following command generates one RDF triple per record, consisting of an arXiv article identifier as subject, the dc:title property, and the article title:

$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml

To better understand what’s going on, convert to YAML instead of RDF, so the internal aREF data structure is shown:

$ catmandu convert YAML to YAML --fix arxiv2rdf.fix < arxiv.yaml

dc_title: On Conditional Decomposability

This record looks similar to the records imported from RDF at day 13. The special field _id refers to the subject in RDF triples: a handy feature for small RDF graphs that share the same subject in all RDF triples. Nevertheless, the same RDF graph could have been encoded like this:

  dc_title: On Conditional Decomposability

To transform more parts of the original record to RDF, we only need to map field names to prefixed RDF property names. Here is a more complete version of arxiv2rdf.fix:

unless exists(dc_creator.0)
  # assumption: single creator values are wrapped in a list here
  move_field(dc_creator, dc_creator.$append)
end
do list(path=>dc_creator)
  join_field(foaf_name,' ')
end

The result is one big RDF graph for all records:

$ catmandu convert YAML to RDF --fix arxiv2rdf.fix < arxiv.yaml

Have a look at the internal aREF format by using the same fix with convert to YAML, and try conversion to other RDF serialization formats. The most important part of the transformation to RDF is finding matching RDF properties in existing ontologies. The example above uses properties from Dublin Core, Creative Commons, Friend of a Friend, and the Bibliographic Ontology.

Continue to Day 18: Merry Christmas! >>