Parallel Processing with Catmandu
In this blog post I’ll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process two times as much data, you need two times as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:
$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added 100 (55/sec)
added 200 (76/sec)
added 300 (87/sec)
added 400 (92/sec)
added 500 (90/sec)
added 600 (94/sec)
added 700 (97/sec)
added 800 (97/sec)
added 900 (96/sec)
added 1000 (97/sec)
In the example above we process an ‘input.marc’ MARC file into an ‘output.json’ JSON file with some difficult data cleaning in the ‘heavy_load.fix’ Fix script. Using a single process we reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records.
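To put a number on that: at 97 records per second, one million records take about 1,000,000 / 97 ≈ 10,300 seconds, which is just under three hours, and ten million records take ten times as long.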
Can we make this any faster?
Computers these days are all equipped with multiple processors. Using a single process, only one of these processors is used for calculations. You would get much more ‘bang for the buck’ if all the processors could be used. One technique to do that is called ‘parallel processing’.
To check the number of processors available on your machine, look at the file ‘/proc/cpuinfo’ on your Linux system:
$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
The example above shows two lines: I have two cores available to do processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way, our processing could in principle be 2, 4, 8 or 16 times as fast.
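A quicker alternative on most Linux systems is the ‘nproc’ command from GNU coreutils, which simply prints the number of processing units available:
$ nproc
2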
To check if your computer is using all that calculating power, use the ‘uptime’ command:
$ uptime
11:15:21 up 622 days, 1:53, 2 users, load average: 1.23, 1.70, 1.95
In the example above I ran ‘uptime’ on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case), the server is waiting for input. If the load average is equal to the number of cores, the server is using all the CPU power available. If the load is bigger than the number of cores, there is more work available than the machine can execute and some processes need to wait.
Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I’m going to use a Unix tool called ‘GNU parallel’ to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:
sudo yum install parallel
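The yum command above is for CentOS/RHEL; on a Debian or Ubuntu machine the equivalent would be:
sudo apt-get install parallel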
The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are many ways to cut your data into parts. I’ll show you a trick we use at Ghent University Library with the help of a MongoDB installation.
First, install MongoDB and the MongoDB Catmandu plugins (these examples are taken from our CentOS documentation):
$ sudo tee /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF
$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB
Next, we are going to store our input data in a MongoDB database with the help of a Catmandu Fix script that adds some random numbers to the data:
$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc
With ‘random.fix’ looking like:
random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")
The ‘random()’ Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory ‘lib/Catmandu/Fix’). It will make sure that every record in your input file contains five random numbers: ‘part.rand2’, ‘part.rand4’, ‘part.rand8’, ‘part.rand16’ and ‘part.rand32’. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts depending on the number of processors you have in your machine.
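After this import every record in MongoDB carries, next to its _id and record fields, a nested ‘part’ field with these random numbers. In YAML such a record would contain something like the following (the values shown are of course just an illustration):
part:
  rand2: 1
  rand4: 3
  rand8: 6
  rand16: 12
  rand32: 27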
To access one chunk of your data the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:
$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2
We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.
#!/bin/bash
# file: parallel.sh
CPU=$1
if [ "${CPU}" == "" ]; then
/usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
elif [ "${CPU}" != "" ]; then
catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi
The example script above shows how a conversion process could run on a 2-processor machine. The lines with ‘/usr/bin/parallel’ show how GNU parallel is used to call this script with two arguments, ‘0’ and ‘1’ (for the 2-processor example). The line with ‘catmandu export’ shows how chunks of data are read from the database and processed with the ‘heavy_load.fix’ Fix script.
If you have a 32-processor machine, you would need to provide parallel an input which contains the numbers 0 to 31 and change the query to ‘part.rand32’.
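As a sketch (following exactly the same pattern as the script above, just with more chunks), a 32-processor variant could generate those numbers with ‘seq’ instead of listing them in a here-document:
#!/bin/bash
# file: parallel32.sh - a 32-chunk variant of parallel.sh
CPU=$1
if [ "${CPU}" == "" ]; then
# no argument given: feed the numbers 0..31 to GNU parallel,
# which calls this script again once per chunk
seq 0 31 | /usr/bin/parallel -u $0 {}
else
# an argument was given: export and process one chunk of the data
catmandu export MongoDB --database_name data -q "{\"part.rand32\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi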
GNU parallel is a very powerful command. It gives you the opportunity to run many processes in parallel and even to spread the load over many machines if you have a cluster. When all these machines have access to your MongoDB database, they can all receive chunks of data to be processed. The only task left is to combine all results, which can be as easy as a simple ‘cat’ command:
$ cat result.*.json > final_result.json
Matching authors against VIAF identities
At Ghent University Library we enrich catalog records with VIAF identities to enhance the search experience in the catalog. When searching for all the books about ‘Chekhov’ we want to match all name variants of this author. Consult VIAF http://viaf.org/viaf/95216565/#Chekhov,_Anton_Pavlovich,_1860-1904 and you will see many of them.
- Chekhov
- Čehov
- Tsjechof
- Txékhov
- etc
Any of these name variants can be present in the catalog data if authority control is not in place (or not maintained). Searching for any of these names should return results for all the variants. In the past it was a labor-intensive, manual job for catalogers to maintain an authority file. Using results from Linked Data Fragments research by Ruben Verborgh (iMinds) and the Catmandu-RDF tools created by Jakob Voss (GBV) and RDF-LDF by Patrick Hochstenbach, Ghent University started an experiment to automatically enrich authors with VIAF identities. In this blog post we will report on the setup and results of this experiment, which will also be reported at ELAG2015.
Context
Three ingredients are needed to create a web of data:
- A scalable way to produce data.
- The infrastructure to publish data.
- Clients accessing the data and reusing them in new contexts.
On the production side there doesn’t seem to be any problem: libraries can create huge datasets. Any transformation of library data to linked data will quickly generate an enormous number of RDF triples. We see this in the size of publicly available datasets:
- UGent Academic Bibliography: 12.000.000 triples
- Libris catalog: 50.000.000 triples
- Gallica: 72.000.000 triples
- DBPedia: 500.000.000 triples
- VIAF: 600.000.000 triples
- Europeana: 900.000.000 triples
- The European Library: 3.500.000.000 triples
- PubChem: 60.000.000.000 triples
Accessing data, from a consumer’s perspective, also seems to be the “easy” part. Instead of thousands of APIs and many document formats for every dataset, SPARQL and RDF provide the programmer with a single protocol and document model.
The claim of the Linked Data Fragments researchers is that on the publication side, reliable queryable access to public Linked Data datasets largely remains problematic due to the low availability percentages of public SPARQL endpoints [Ref]. This is confirmed by a 2013 study by researchers from Pontificia Universidad Católica in Chile and the National University of Ireland, in which more than half of the public SPARQL endpoints were offline 1.5 days per month. This gives an availability rate of less than 95% [Ref].
The source of this high rate of unavailability can be traced back to the service model of Linked Data, where two extremes exist to publish data (see image below).
On one side, data dumps (or dereferencing of URLs) can be made available, which requires a simple HTTP server and lots of processing power on the client side. On the other side, an open SPARQL endpoint can be provided, which requires a lot of processing power (hence, hardware investment) on the server side. With SPARQL endpoints, clients can demand the execution of arbitrarily complicated queries. Furthermore, since each client requests unique, highly specific queries, regular caching mechanisms are ineffective, since they can only be optimized for repeated identical requests.
This situation can be compared with providing a database SQL dump to end users versus an open database connection on which any possible SQL statement can be executed. To a lesser extent, libraries are well aware of the different modes of operation between running OAI-PMH services and Z39.50/SRU services.
Linked Data Fragments researchers provide a third way, Triple Pattern Fragments, to publish data which tries to provide the best of both worlds: access to a full dump of a dataset while providing a queryable and cacheable interface. For more information on the scalability of this solution I refer to the report presented at the 5th International USEWOD Workshop.
The experiment
VIAF doesn’t provide a public SPARQL endpoint, but a complete dump of the data is available at http://viaf.org/viaf/data/. In our experiments we used the VIAF (Virtual International Authority File) dump, which is made available under the ODC Attribution License. From this dump we created an HDT database. HDT provides a very efficient format to compress RDF data while maintaining browse and search functionality. Using command line tools, RDF/XML, Turtle and NTriples can be compressed into an HDT file with an index. This standalone file can be used to query huge datasets without the need of a database. A VIAF conversion to HDT results in a 7 GB file and a 4 GB index.
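To give an idea of what that looks like in practice, the hdt-cpp project ships an rdf2hdt command line tool; converting an N-Triples dump is then a single (long-running) command, something along these lines:
$ rdf2hdt viaf.nt viaf.hdt
The accompanying index is typically built the first time the HDT file is opened for querying, for instance by the hdtSearch tool from the same project.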
Using the Linked Data Fragments server by Ruben Verborgh, available at https://github.com/LinkedDataFragments/Server.js, this HDT file can be published as a NodeJS application.
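For completeness, here is a sketch of how that can look. The configuration format below is an assumption based on the Server.js documentation at the time, so double-check the project README for your version:
$ npm install -g ldf-server
$ cat > config.json <<EOF
{
  "title": "VIAF Linked Data Fragments server",
  "datasources": {
    "viaf": {
      "title": "VIAF",
      "type": "HdtDatasource",
      "settings": { "file": "viaf.hdt" }
    }
  }
}
EOF
$ ldf-server config.json 5000 4
The last command starts the server on port 5000 with 4 worker processes.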
For a demonstration of this server visit the iMinds experimental setup at: http://data.linkeddatafragments.org/viaf
Using Triple Pattern Fragments a simple REST protocol is available to query this dataset. For instance it is possible to download the complete dataset using this query:
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf
If we only want the triples concerning Chekhov (http://viaf.org/viaf/95216565) we can provide a query parameter:
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?subject=http://viaf.org/viaf/95216565
Likewise, using the predicate and object query parameters, any combination of triples can be requested from the server.
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?object="Chekhov"
The memory requirements of this server are small enough to run a copy of the VIAF database on a MacBook Air laptop with 8GB RAM.
Using specialised Triple Pattern Fragments clients, SPARQL queries can be executed against this server. For the Catmandu project we created a Perl client RDF::LDF which is integrated into Catmandu-RDF.
To request all triples from the endpoint use:
$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p ?o}'
Or, only those Triples that are about “Chekhov”:
$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p "Chekhov"}'
In the Ghent University experiment a more direct approach was taken to match authors to VIAF. First, a MARC dump from the catalog is streamed into a Perl program using a Catmandu iterator. Then, we extract the 100 and 700 fields, which contain $a (name) and $d (date) subfields. These two subfields are combined into a search query, as if we would search:
Chekhov, Anton Pavlovich, 1860-1904
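As an illustration, a minimal Fix sketch of how such a query string can be assembled (the actual script linked below builds the query directly in Perl, so treat the exact mapping here as an assumption):
# combine the $a (name) and $d (date) subfields of the 100 field into one
# search string, e.g. "Chekhov, Anton Pavlovich, 1860-1904"
marc_map(100ad, search.$append, -join => " ")
# do the same for the added entries in the 700 field
marc_map(700ad, search.$append, -join => " ")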
If there is exactly one hit in our local VIAF copy, then the result is reported. A complete script to process MARC files this way is available at a GitHub gist. To run the program against a MARC dump execute the import_viaf.pl command:
$ ./import_viaf.pl --type USMARC file.mrc
000000089-2 7001 L $$aEdwards, Everett Eugene,$$d1900- http://viaf.org/viaf/110156902
000000122-8 1001 L $$aClelland, Marjorie Bolton,$$d1912- http://viaf.org/viaf/24253418
000000124-4 7001 L $$aSchein, Edgar H.
000000124-4 7001 L $$aKilbridge, Maurice D.,$$d1920- http://viaf.org/viaf/29125668
000000124-4 7001 L $$aWiseman, Frederick.
000000221-6 1001 L $$aMiller, Wilhelm,$$d1869- http://viaf.org/viaf/104464511
000000256-9 1001 L $$aHazlett, Thomas C.,$$d1928- http://viaf.org/viaf/65541341
[edit: 2017-05-18 an updated version of the code is available as a Git project https://github.com/LibreCat/MARC2RDF ]
All the authors in the MARC dump will be exported. If there is exactly one match against VIAF, it will be added to the author field. We ran this command for one night in a single thread against 338.426 authors containing a date and found 135.257 exact matches in VIAF (=40%).
In a quite recent follow-up of our experiments, we investigated how LDF clients can be used in a federated setup. By combining the triple results from many LDF servers in the LDF algorithm, one SPARQL query can be run over many machines. These results are demonstrated at the iMinds demo site, where a single SPARQL query can be executed over the combined VIAF and DBPedia datasets. A Perl implementation of this federated search is available in the latest version of RDF-LDF at GitHub.
We strongly believe in the success of this setup and the scalability of this solution, as demonstrated by Ruben Verborgh at the USEWOD Workshop. Using Linked Data Fragments, a range of solutions is available to publish data on the web. From simple data dumps to a full SPARQL endpoint, any service level can be provided given the resources available. For more than half a year DBPedia has been running an LDF server with 99.9994% availability on an 8 CPU, 15 GB RAM Amazon server, serving 4.5 million requests. Scaling out, services such as the LOD Laundromat clean 650.000 datasets and provide access to them using a single fat LDF server (256 GB RAM).
For more information on the Federated searches with Linked Data Fragments visit the blog post of Ruben Verborgh at: http://ruben.verborgh.org/blog/2015/06/09/federated-sparql-queries-in-your-browser/
Day 15: MARC to Dublin Core
Today we will look a bit further into MARC processing with Catmandu. By now you should already know how to start up the Virtual Catmandu (hint: see day 1) and open the UNIX command prompt (hint: see day 2). We already saw a bit of MARC processing in day 9 and today we will show you how to transform MARC records into Dublin Core. This is in preparation for creating RDF and Linked Data in later posts.
First I’m going to teach you how to process different types of MARC files. On the Virtual Catmandu system we provided five example MARC files. You can find them in your Documents folder:
- Documents/camel.mrk
- Documents/camel.usmarc
- Documents/marc.xml
- Documents/rug01.aleph
- Documents/rug01.sample
When you examine these files with the UNIX less command you will see that they all have a slightly different format:
$ less Documents/camel.mrk
$ less Documents/camel.usmarc
$ less Documents/marc.xml
$ less Documents/rug01.sample
There are many ways in which MARC data can be written into a file. Every vendor likes to use its own format. You can compare this with the different ways a text document can be stored: as Word, as Open Office, as PDF or as plain text. If we are going to process these files with catmandu, we need to tell the system what the exact format is.
We will work today with the last example, rug01.sample, which is a small export from the Aleph catalog of Ghent University Library. Ex Libris uses a special MARC format to structure their data, which is called Aleph sequential. We need to tell catmandu not only that our input file is in MARC, but also that it is in this special Aleph format. Let’s convert it to YAML to see what it gives:
$ catmandu convert MARC --type ALEPHSEQ to YAML < Documents/rug01.sample
To transform this MARC file into Dublin Core we need to create a fix file. You can use the UNIX command nano for this (hint: see day 5 how to create files with nano). Create a file dublin.fix:
$ nano dublin.fix
And type into nano the following fixes:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(700,creator.$append)
marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)
marc_map(260b,publisher)
marc_map(260c,date)
marc_map(650a,subject.$append)
remove_field(record)
Every MARC record contains the title in the 245 field. In the first line we map the MARC 245 field to a new field in the record called title:
marc_map(245,title)
In the second and third line we map authors to a field creator. In the rug01.sample file the authors are stored in the MARC 100 and MARC 700 fields. Because there is usually more than one author in a record, we need to $append them to create an array (a list) of one or more creators.
In line 4 and line 5 we use the same trick to filter the ISBN and ISSN numbers out of the record, which we store in the separate fields isbn and issn (indeed these are not Dublin Core fields, we will process them later).
In line 6 and line 7 we read the MARC-260 field which contains publisher and date information. Here we don’t need the $append trick because there is usually only one 260-field in a MARC record.
In line 8 the subjects are extracted from the 650 field using the same $append trick as above. Notice that we only extracted the $a subfield? If you want to add more subfields you can list them, as in marc_map(650abcdefgh,subject.$append).
Given the dublin.fix file above we can execute the conversion command like this:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample
As always you can type | less at the end of this command to slow down the screen output, or store the results into a file with > results.txt. Hint:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample | less
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample > results.txt
The results should look like this:
_id: '000000002'
creator:
- Katz, Jerrold J.
date: '1977.'
isbn:
- '0855275103 :'
publisher: Harvester press,
subject:
- Semantics.
- Proposition (Logic)
- Speech acts (Linguistics)
- Generative grammar.
- Competence and performance (Linguistics)
title: Propositional structure and illocutionary force :a study of the contribution of sentence meaning to speech acts /Jerrold J. Katz.
...
Congratulations, you’ve created your first mapping file to transform library data from MARC to Dublin Core! We need to add a bit more cleaning to delete some periods and commas here and there, but as it is we already have our first mapping.
Below you’ll find a complete example. You can read more about our Fix language online.
marc_map(245,title, -join => " ")
marc_map(100,creator.$append)
marc_map(700,creator.$append)
marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)
replace_all(isbn.*," .","")
replace_all(issn.*," .","")
marc_map(260b,publisher)
replace_all(publisher,",$","")
marc_map(260c,date)
replace_all(date,"\D+","")
marc_map(650a,subject.$append)
remove_field(record)
Continue to Day 16: Importing RDF data with Catmandu >>
Day 9: Processing MARC with Catmandu
In the previous days we learned how we can use the catmandu command to process structured data like JSON. Today we will use the same command to process MARC metadata records. In this process we will see that MARC can be processed using JSON paths but this is a bit cumbersome. We will introduce MARCspec as an easier way to point to parts of a MARC record.
As always, you need to startup your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).
In the Virtual Catmandu installation we provided a couple of example MARC files that we can inspect with the UNIX command cat or less. In the UNIX prompt inspect the file Documents/camel.usmarc, for instance, with cat:
$ cat Documents/camel.usmarc
You should see something like this:
Like JSON, the MARC file contains structured data but the format is different. All the data is on one line, but there isn’t at first sight a clear separation between fields and values. The field/value structure is there, but you need to use a MARC parser to extract this information. Catmandu contains a MARC parser which can be used to interpret this file. Type the following command to transform the MARC data into YAML (which we introduced in the previous posts):
$ catmandu convert MARC to YAML < Documents/camel.usmarc
You will see something like this:
When transforming MARC into YAML we get something with a simple top level field _id containing the identifier of the MARC record, and a record field with a deeper array structure (or, more correctly, an array-of-arrays structure).
We can use catmandu to read the _id fields of the MARC record with the retain_field fix we learned in the Day 6 post:
$ catmandu convert MARC --fix 'retain_field(_id)' to YAML < Documents/camel.usmarc
You will see:
---
_id: 'fol05731351 '
...
---
_id: 'fol05754809 '
...
---
_id: 'fol05843555 '
...
---
_id: 'fol05843579 '
...
What is happening here? The MARC file Documents/camel.usmarc contains more than one MARC record. For every MARC record catmandu extracts the _id field.
Extracting data out of the MARC record itself is a bit more difficult. MARC is an array-of-arrays, so you need indexes to extract the data. For instance, the MARC leader is usually in the first field of a MARC record. In the previous posts we learned that you need to use the 0 index to extract the first field out of an array:
$ catmandu convert MARC --fix 'retain_field(record.0)' to YAML < Documents/camel.usmarc
---
_id: 'fol05731351 '
record:
- - LDR
  - ~
  - ~
  - _
  - 00755cam 22002414a 4500
...
The leader value itself is the fifth entry in the resulting array. So, we need index 4 to extract it:
$ catmandu convert MARC --fix 'copy_field(record.0.4,leader); retain_field(leader)' to YAML < Documents/camel.usmarc
Here we used a copy_field fix to extract the value into a field called leader. The retain_field fix is used to keep only this leader field in the result. Processing MARC data this way would be very verbose, plus you need to know at which index positions the fields you are interested in are located. This is something you usually don’t know.
Catmandu introduces Carsten Klee’s MARCspec to ease the extraction of MARC values out of a record. With the marc_map fix the command above would read:
marc_map("LDR",leader)
retain_field(leader)
I skipped writing the catmandu commands here (they will be the same every time). You can put these fixes into a file using nano (see the Day 5 post) and execute them as:
catmandu convert MARC --fix myfixes.txt to YAML < Documents/camel.usmarc
Where myfixes.txt contains the fixes above.
To extract the title field (field 245, remember? ;)), you can write:
marc_map("245",title)
retain_field(title)
Or, if you are only interested in the $a subfield you could write:
marc_map("245a",title)
retain_field(title)
More elaborate mappings are possible. I’ll show you more complete examples in the next posts. As a warm-up, here is some code to extract all the record identifiers, titles and ISBN numbers in a MARC file into a CSV listing (which you can open in Excel).
Step 1, create a fix file myfixes.txt containing:
marc_map("245",title)
marc_map("020a",isbn.$append)
join_field(isbn,",")
remove_field(record)
Step 2, execute this command:
$ catmandu convert MARC --fix myfixes.txt to CSV < Documents/camel.usmarc
You will see this as output:
_id,isbn,title
"fol05731351 ","0471383147 (paper/cd-rom : alk. paper)","ActivePerl with ASP and ADO /Tobias Martinsson."
"fol05754809 ",1565926994,"Programming the Perl DBI /Alligator Descartes and Tim Bunce."
"fol05843555 ",,"Perl :programmer's reference /Martin C. Brown."
"fol05843579 ",0072120002,"Perl :the complete reference /Martin C. Brown."
"fol05848297 ",1565924193,"CGI programming with Perl /Scott Guelich, Shishir Gundavaram & Gunther Birznieks."
"fol05865950 ",0596000138,"Proceedings of the Perl Conference 4.0 :July 17-20, 2000, Monterey, California."
"fol05865956 ",1565926099,"Perl for system administration /David N. Blank-Edelman."
"fol05865967 ",0596000278,"Programming Perl /Larry Wall, Tom Christiansen & Jon Orwant."
"fol05872355 ",013020868X,"Perl programmer's interactive workbook /Vincent Lowe."
"fol05882032 ","0764547291 (alk. paper)","Cross-platform Perl /Eric F. Johnson."
In the fix above we mapped the 245 field to the title. The ISBN is in the 020 field. Because MARC records can contain one or more 020 fields, we created an isbn array using the isbn.$append syntax. Next we turned the isbn array back into a comma-separated string using the join_field fix. As a last step we deleted all the fields we didn’t need in the output with the remove_field fix.
In this post we demonstrated how to process MARC data. In the next post we will show some examples of how catmandu can typically be used to process library data.
Continue with Day 10: Working with CSV and Excel files >>