Catmandu 1.20
On May 21st, 2019, Nicolas Steenlant (our main developer and guru of Catmandu) released version 1.20 of our Catmandu toolkit with some very interesting new features. The main addition is a brand new way to implement Catmandu Fixes using the new Catmandu::Path implementation. This work by Nicolas makes it much easier and more straightforward to implement any kind of Fix in Perl.
In the previous versions of Catmandu there were only two options to create new fixes:
- Create a Perl package in the Catmandu::Fix namespace which implements a fix method. This was very easy: update the $data hash you got as the first argument, return the updated $data, and you were done (see the sketch below this list). The disadvantage was that accessing fields in a deeply nested record was tricky and slow to code.
- Create a Perl package in the Catmandu::Fix namespace which implemented emit functions. These were functions that generate Perl code on the fly. Using emit functions it was easier to get fast access to deeply nested data, but creating such Fix packages was pretty complex.
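To make this concrete, a classic fix-method package looked roughly like the sketch below (the package name and the field it adds are made up for illustration; this is not code from the Catmandu distribution):

package Catmandu::Fix::hello_world;

use Catmandu::Sane;
use Moo;

sub fix {
    my ($self, $data) = @_;
    # update the $data hash you received and return it
    $data->{greeting} = 'hello world';
    $data;
}

1;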
In Catmandu 1.20 there is now support for a third and easy way to create new Fixes using the Catmandu::Fix::Builder and Catmandu::Fix::Path classes. Let me give a simple example of a skeleton Fix that does nothing:
package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    sub {
        my $data = $_[0];
        # .. do some magic here ...
        $data;
    }
}

1;
In the code above we start implementing a rot13(path) Fix that should read a string on a JSON path and encrypt it using the ROT13 algorithm. This Fix is only the skeleton which doesn’t do anything yet. What we have is:
- We import the as_path method to be able to easily access data on JSON paths.
- We import Catmandu::Fix::Has to be able to use has path constructs to read in arguments for our Fix.
- We import Catmandu::Fix::Builder to use the new Catmandu 1.20 builder class, which provides a _build_fixer method.
- The builder is nothing more than a closure that reads the data, does some action on the data and returns the data.
We can use this skeleton builder to implement our ROT13 algorithm. Add these lines instead of the # do some magic part:
# On the path update the string value...
as_path($self->path)->updater(
    if_string => sub {
        my $value = shift;
        $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
        $value;
    },
)->($data);
The as_path method receives a JSON path string and creates an object which you can use to manipulate data on that path. One can update the values found with the updater method, read data at that path with the getter method, or create a new path with the creator method. In our example, we update the string found at the JSON path using the if_string condition. The updater has many conditions:
- if_string needs a closure describing what should happen when a string is found on the JSON path.
- if_array_ref needs a closure describing what should happen when an array is found on the JSON path.
- if_hash_ref needs a closure describing what should happen when a hash is found on the JSON path.
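For example, using the same mechanism one could reverse every array found on a path. The snippet below is a minimal sketch (the ‘tags’ path and the surrounding $data hash are assumptions for illustration):

# Reverse the array found at the JSON path 'tags'
as_path('tags')->updater(
    if_array_ref => sub {
        my $array = shift;
        [ reverse @$array ];
    },
)->($data);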
In our case we are only interested in transforming strings using our rot13(path) fix. The ROT13 algorithm is very easy: it simply replaces each letter by the letter thirteen positions further in the alphabet. When we execute this fix on some sample data we get this result:
$ catmandu -I lib convert Null to YAML --fix 'add_field(demo,hello);rot13(demo)'
---
demo: uryyb
...
In this case the Fix can be written much shorter once we know that every Catmandu::Path method returns a closure (hint: look at the ->($data) in the code above). The complete Fix can look like:
package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    # On the path update the string value...
    as_path($self->path)->updater(
        if_string => sub {
            my $value = shift;
            $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
            $value;
        },
    );
}

1;
This is as easy as it can get to manipulate deeply nested data with your own Perl tools. All the code is in Perl; there is no limit on the number of external CPAN packages one can include in these Builder fixes.
We can’t wait to see what Catmandu extensions you will create.
Parallel Processing with Catmandu
In this blog post I’ll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process 2 times as much data, you need 2 times as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:
$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added 100 (55/sec)
added 200 (76/sec)
added 300 (87/sec)
added 400 (92/sec)
added 500 (90/sec)
added 600 (94/sec)
added 700 (97/sec)
added 800 (97/sec)
added 900 (96/sec)
added 1000 (97/sec)
In the example above we process an ‘input.marc’ MARC file into an ‘output.json’ JSON file with some difficult data cleaning in the ‘heavy_load.fix’ Fix script. Using a single process we can reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records.
Can we make this any faster?
Computers you buy today are all equipped with multiple processors. Using a single process, only one of these processors is used for calculations. One would get much more ‘bang for the buck’ if all the processors could be used. One technique to do that is called ‘parallel processing’.
To check the number of processors available on your machine, use the file ‘/proc/cpuinfo’ on your Linux system:
$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
The example above shows two lines: I have two cores available to do processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way, then our processing could be 2, 4, 8 or 16 times as fast (in principle).
To check if your computer is using all that calculating power, use the ‘uptime’ command:
$ uptime
11:15:21 up 622 days, 1:53, 2 users, load average: 1.23, 1.70, 1.95
In the example above I ran ‘uptime’ on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case) it means: the server is waiting for input. If the load average is equal to the number of cores it means: the server is using all the CPU power available. If the load is bigger than the number of cores, then there is more work available than can be executed by the machine, and some processes need to wait.
Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I’m going to use a Unix tool called ‘GNU parallel’ to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:
sudo yum install parallel
The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are very many ways to cut your data into many parts. I’ll show you a trick we use at Ghent University Library with help of a MongoDB installation.
First, install MongoDB and the MongoDB Catmandu plugins (these examples are taken from our CentOS documentation):
$ sudo cat > /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF
$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB
Next, we are going to store our input data in a MongoDB database with help of a Catmandu Fix script that adds some random numbers to the data:
$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc
With the ‘random.fix’ like:
random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")
The ‘random()’ Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory ‘lib/Catmandu/Fix’). This will make sure that every record in your input file contains five random numbers: ‘part.rand2’, ‘part.rand4’, ‘part.rand8’, ‘part.rand16’ and ‘part.rand32’. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts depending on the number of processors you have in your machine.
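Conceptually, each random(path,N) call simply stores a random integer between 0 and N-1 at the given path. In plain Perl the idea looks like this sketch (not the actual implementation of the Fix):

# What random("part.rand4","4") does conceptually for one record
my $data = { title => 'some record' };
$data->{part}{rand4} = int(rand(4));   # yields 0, 1, 2 or 3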
To access one chunk of your data the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:
$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2
We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.
#!/bin/bash
# file: parallel.sh
CPU=$1
if [ "${CPU}" == "" ]; then
/usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
elif [ "${CPU}" != "" ]; then
catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi
The example script above shows how a conversion process could run on a 2-processor machine. The lines with ‘/usr/bin/parallel’ show how GNU parallel is used to call this script with two arguments ‘0’ and ‘1’ (for the 2-processor example). The line with ‘catmandu export’ shows how chunks of data are read from the database and processed with the ‘heavy_load.fix’ Fix script.
If you have a 32-processor machine, you would need to provide parallel an input which contains the numbers 0, 1, 2 up to 31 and change the query to ‘part.rand32’.
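You don’t have to type those numbers by hand: a small Perl one-liner can generate them and feed them to parallel. This is a sketch which assumes the query inside parallel.sh has been changed to ‘part.rand32’:

$ perl -E 'say $_ for 0..31' | /usr/bin/parallel -u ./parallel.sh {}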
GNU parallel is a very powerful command. It gives the opportunity to run many processes in parallel and even to spread out the load over many machines if you have a cluster. When all these machines have access to your MongoDB database, they can all receive chunks of data to be processed. The only task left is to combine all results, which can be as easy as a simple ‘cat’ command:
$ cat result.*.json > final_result.json
Matching authors against VIAF identities
At Ghent University Library we enrich catalog records with VIAF identities to enhance the search experience in the catalog. When searching for all the books about ‘Chekov’ we want to match all name variants of this author. Consult VIAF http://viaf.org/viaf/95216565/#Chekhov,_Anton_Pavlovich,_1860-1904 and you will see many of them.
- Chekhov
- Čehov
- Tsjechof
- Txékhov
- etc
Any of these name variants can be available in the catalog data if authority control is not in place (or not maintained). Searching for any of these names should return results for all the variants. In the past it was a labor-intensive, manual job for catalogers to maintain an authority file. Using results from Linked Data Fragments research by Ruben Verborgh (iMinds) and the Catmandu-RDF tools created by Jakob Voss (GBV) and RDF-LDF by Patrick Hochstenbach, Ghent University started an experiment to automatically enrich authors with VIAF identities. In this blog post we will report on the setup and results of this experiment, which will also be reported at ELAG2015.
Context
Three ingredients are needed to create a web of data:
- A scalable way to produce data.
- The infrastructure to publish data.
- Clients accessing the data and reusing them in new contexts.
On the production side there doesn’t seem to be any problem for libraries to create huge datasets. Any transformation of library data to linked data will quickly generate an enormous number of RDF triples. We see this in the size of publicly available datasets:
- UGent Academic Bibliography: 12.000.000 triples
- Libris catalog: 50.000.000 triples
- Gallica: 72.000.000 triples
- DBPedia: 500.000.000 triples
- VIAF: 600.000.000 triples
- Europeana: 900.000.000 triples
- The European Library: 3.500.000.000 triples
- PubChem: 60.000.000.000 triples
Also for accessing data, from a consumer’s perspective the “easy” part seems to be covered. Instead of thousands of available APIs and many document formats for every dataset, SPARQL and RDF provide the programmer a single protocol and document model.
The claim of the Linked Data Fragments researchers is that on the publication side, reliable queryable access to public Linked Data datasets largely remains problematic due to the low availability percentages of public SPARQL endpoints [Ref]. This is confirmed by a 2013 study by researchers from Pontificia Universidad Católica in Chile and the National University of Ireland, in which more than half of the public SPARQL endpoints seemed to be offline 1.5 days per month. This gives an availability rate of less than 95% [Ref].
The source of this high rate of unavailability can be traced back to the service model of Linked Data, where two extremes exist to publish data (see image below).
On one side, data dumps (or dereferencing of URLs) can be made available, which requires a simple HTTP server and lots of processing power on the client side. On the other side, an open SPARQL endpoint can be provided, which requires a lot of processing power (hence, hardware investment) on the server side. With SPARQL endpoints, clients can demand the execution of arbitrarily complicated queries. Furthermore, since each client requests unique, highly specific queries, regular caching mechanisms are ineffective, since they can only be optimized for repeated identical requests.
This situation can be compared with providing a database SQL dump to end users, or an open database connection on which any possible SQL statement can be executed. To a lesser extent, libraries are well aware of the different modes of operation between running OAI-PMH services and Z39.50/SRU services.
Linked Data Fragments researchers provide a third way, Triple Pattern Fragments, to publish data which tries to provide the best of both worlds: access to a full dump of datasets while providing a queryable and cacheable interface. For more information on the scalability of this solution I refer to the report presented at the 5th International USEWOD Workshop.
The experiment
VIAF doesn’t provide a public SPARQL endpoint, but a complete dump of the data is available at http://viaf.org/viaf/data/. In our experiments we used the VIAF (Virtual International Authority File) dump, which is made available under the ODC Attribution License. From this dump we created an HDT database. HDT provides a very efficient format to compress RDF data while maintaining browse and search functionality. Using command line tools, RDF/XML, Turtle and NTriples can be compressed into an HDT file with an index. This standalone file can be used to query huge datasets without the need of a database. A VIAF conversion to HDT results in a 7 GB file and a 4 GB index.
Using the Linked Data Fragments server by Ruben Verborgh, available at https://github.com/LinkedDataFragments/Server.js, this HDT file can be published as a NodeJS application.
For a demonstration of this server visit the iMinds experimental setup at: http://data.linkeddatafragments.org/viaf
Using Triple Pattern Fragments a simple REST protocol is available to query this dataset. For instance it is possible to download the complete dataset using this query:
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf
If we only want the triples concerning Chekhov (http://viaf.org/viaf/95216565) we can provide a query parameter:
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?subject=http://viaf.org/viaf/95216565
Likewise, using the predicate and object query parameters, any combination of triples can be requested from the server.
$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?object="Chekhov"
The memory requirements of this server are small enough to run a copy of the VIAF database on a MacBook Air laptop with 8GB RAM.
Using specialised Triple Pattern Fragments clients, SPARQL queries can be executed against this server. For the Catmandu project we created a Perl client RDF::LDF which is integrated into Catmandu-RDF.
To request all triples from the endpoint use:
$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p ?o}'
Or, only those Triples that are about “Chekhov”:
$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p "Chekhov"}'
In the Ghent University experiment a more direct approach was taken to match authors to VIAF. First, as input, a MARC dump from the catalog is streamed into a Perl program using a Catmandu iterator. Then we extract the 100 and 700 fields which contain $a (name) and $d (date) subfields. These two subfields are combined into a search query, as if we would search:
Chekhov, Anton Pavlovich, 1860-1904
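In simplified Perl, the extraction of these name/date pairs looks roughly like the sketch below (it assumes the record structure produced by the Catmandu MARC importer; the real script linked below also performs the VIAF lookup):

use Catmandu;

# Sketch: stream a MARC dump and print the "name date" query strings
# built from the 100 and 700 fields
my $importer = Catmandu->importer('MARC', file => 'file.mrc', type => 'USMARC');

$importer->each(sub {
    my $record = shift;
    for my $field (@{ $record->{record} }) {
        my ($tag, $ind1, $ind2, @subfields) = @$field;
        next unless $tag eq '100' or $tag eq '700';
        my %sf = @subfields;
        next unless defined $sf{a} && defined $sf{d};
        print "$sf{a} $sf{d}\n";
    }
});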
If there is exactly one hit in our local VIAF copy, then the result is reported. A complete script to process MARC files this way is available at a GitHub gist. To run the program against a MARC dump execute the import_viaf.pl command:
$ ./import_viaf.pl --type USMARC file.mrc
000000089-2 7001 L $$aEdwards, Everett Eugene,$$d1900- http://viaf.org/viaf/110156902
000000122-8 1001 L $$aClelland, Marjorie Bolton,$$d1912- http://viaf.org/viaf/24253418
000000124-4 7001 L $$aSchein, Edgar H.
000000124-4 7001 L $$aKilbridge, Maurice D.,$$d1920- http://viaf.org/viaf/29125668
000000124-4 7001 L $$aWiseman, Frederick.
000000221-6 1001 L $$aMiller, Wilhelm,$$d1869- http://viaf.org/viaf/104464511
000000256-9 1001 L $$aHazlett, Thomas C.,$$d1928- http://viaf.org/viaf/65541341
[edit: 2017-05-18 an updated version of the code is available as a Git project https://github.com/LibreCat/MARC2RDF ]
All the authors in the MARC dump will be exported. If there is exactly one single match against VIAF it will be added to the author field. We ran this command for one night in a single thread against 338.426 authors containing a date and found 135.257 exact matches in VIAF (=40%).
In a quite recent follow-up of our experiments, we investigated how LDF clients can be used in a federated setup. When the LDF algorithm combines the triple results from many LDF servers, one SPARQL query can be run over many machines. These results are demonstrated at the iMinds demo site where a single SPARQL query can be executed over the combined VIAF and DBPedia datasets. A Perl implementation of this federated search is available in the latest version of RDF-LDF at GitHub.
We strongly believe in the success of this setup and the scalability of this solution as demonstrated by Ruben Verborgh at the USEWOD Workshop. Using Linked Data Fragments, a range of solutions is available to publish data on the web. From simple data dumps to a full SPARQL endpoint, any service level can be provided given the resources available. For more than half a year DBPedia has been running an LDF server with 99.9994% availability on an 8 CPU, 15 GB RAM Amazon server, handling 4.5 million requests. Scaling out, services such as the LOD Laundromat clean 650.000 datasets and provide access to them using a single fat LDF server (256 GB RAM).
For more information on the Federated searches with Linked Data Fragments visit the blog post of Ruben Verborgh at: http://ruben.verborgh.org/blog/2015/06/09/federated-sparql-queries-in-your-browser/
Day 6: Introduction into Catmandu
In the previous days we learned the UNIX commands grep, nano, ls and less. Today we will introduce you to a UNIX command we have created in the LibreCat project called catmandu. The catmandu command is used to process structured information. To demo this command, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).
In this tutorial we are going to process structured information. We call data structured when it is organised in such a way that it is easily processable by computers. Previously we processed text documents like War and Peace, which are structured only in words and sentences; a computer doesn’t know which words are part of the title or which words contain names. We had to tell the computer that. Today we will download a weather report in a structured format called JSON and inspect it with the command catmandu.
At the UNIX prompt type in this command:
$ curl http://api.openweathermap.org/data/2.5/weather?q=Gent,be
[Update: as of end 2015 the OpenWeatherMap API requires an API key. Use this link to download a copy of the Ghent weather report :
$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json
]
You will see a JSON output like:
{"coord":{"lon":3.72,"lat":51.05},"sys":{"type":3,"id":4839,"message":0.0349,"country":"BE", "sunrise":1417159365,"sunset":1417189422},"weather":[{"id":500,"main":"Rain","description":"light rain", "icon":"10d"}],"base":"cmc stations","main":{"temp":281.15,"pressure":1006,"humidity":87,"temp_min":281.15, "temp_max":281.15},"wind":{"speed":3.6,"deg":100},"rain":{"3h":0.5} ,"clouds":{"all":56},"dt":1417166878,"id":2797656, "name":"Gent","cod":200}
All these fields tell something about the current weather in Gent, Belgium. You can recognise that there is a light rain and the temperature is 281.15 degrees Kelvin (about 8 degrees Celsius). Write the output of this command to a file weather.json (using the ‘>’ sign we learned in the day 5 tutorial) so that we can use it in the next examples.
$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json > weather.json
When you type the ls command you should see the new file name weather.json appearing.
With the catmandu command you can process this file to make it a bit easier readable. For instance type:
$ catmandu convert JSON to YAML < weather.json
YAML is another format for structured information which is a bit easier to read for human eyes. Our weather report should now look like this:
Catmandu can be used to process structured information like the UNIX grep command can process unstructured information. For instance, let’s try to filter out the name of this report. Type in this command:
$ catmandu convert JSON --fix 'retain_field(name)' to YAML < weather.json
You should end up with something like:
---
name: Gent
...
The --fix option in Catmandu is used to ‘massage’ the input weather.json, filtering for the fields we would like to see. Only one fix was used, ‘retain_field’, which throws away all the data from the input except the ‘name’ field. By the way, the file weather.json wasn’t changed! We only read the file and displayed the output of the catmandu command.
The temperature in Gent is in the ‘temp’ part of the ‘main’ section in weather.json. To filter this out we need two retain_field fixes: one for the main section and one for the temp field:
$ catmandu convert JSON --fix 'retain_field(main); retain_field(main.temp)' to YAML < weather.json
You should now see something like this:
---
main:
  temp: 281.15
...
When massaging data you often need to create many fixes to process a data file in the format you need. With the nano command you can write all the fixes in a file. Start the nano editor with the command:
$ nano weather.fix
In nano type now the two fixes above:
retain_field(main)
retain_field(main.temp)
To exit nano type Ctrl-X, press Y to confirm the changes and press Enter to confirm the file name.
With this file it will be a bit easier to create many fixes. The name of the fix file can be used to repeat the commands above:
$ catmandu convert JSON --fix weather.fix to YAML < weather.json
To add more fixes we can again edit the weather.fix file. Type:
$ nano weather.fix
And add these lines after the two previous lines:
prepend(main.temp,"The temperature is ")
append(main.temp," degrees Kelvin")
Save the changes with Ctrl-X, Y, Enter and execute catmandu again:
$ catmandu convert JSON --fix weather.fix to YAML < weather.json
You should now see this output:
---
main:
  temp: The temperature is 281.15 degrees Kelvin
...
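By the way, everything the catmandu command does can also be done from a Perl script using the Catmandu module. Here is a minimal sketch of the same pipeline (it assumes the weather.json and weather.fix files created above):

use Catmandu;

# Read weather.json, apply the fixes in weather.fix and print YAML
my $importer = Catmandu->importer('JSON', file => 'weather.json');
my $fixer    = Catmandu->fixer('weather.fix');
my $exporter = Catmandu->exporter('YAML');

$exporter->add_many($fixer->fix($importer));
$exporter->commit;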
Catmandu contains many fixes to manipulate data. Check the documentation to get a complete list. This post only presented a short introduction to catmandu. In the next posts we will go deeper into its capabilities.
Continue to Day 7: Catmandu JSON paths >>
Day 4: grep, less and wc
Yesterday we gave you a little introduction into UNIX programming. Today we will show you how many words there are in Tolstoy’s War and Peace. First, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and open the UNIX prompt (hint: see our day 2 tutorial). As a result you should see this on your screen:
Ready?
First we are going to teach you a new UNIX command, cat. In our Catmandu project the cat command is our favorite. With this command you can read War and Peace in 2 seconds! Let’s try it out: type ‘cat Documents/war_and_peace.txt‘ at the UNIX prompt and press ‘enter’.
$ cat Documents/war_and_peace.txt
“What was that?!” you might wonder. Well, that was the complete War and Peace running across your screen. We provided the cat command with one argument, ‘Documents/war_and_peace.txt’, which is the filename that contains the complete text of War and Peace (I downloaded this text from the Gutenberg Project for you).
In UNIX it is possible to glue the output from one command to the input of another command; this is called a pipe. To use a pipe in a command you have to find a funny little key on your computer which has this sign ‘|’. On my computer it looks like this:
With this pipe symbol ‘|’ you can glue the output of one command to another command. We will use a pipe to count the number of lines, words and characters in a file with the UNIX wc command like this:
$ cat Documents/war_and_peace.txt | wc
64620 563290 3272023
The output contains three numbers: 64620, 563290 and 3272023. The first number counts the number of lines in the file Documents/war_and_peace.txt. The second number counts the number of words in Documents/war_and_peace.txt. And the third number counts the number of characters in Documents/war_and_peace.txt.
Five hundred sixty-three thousand two hundred and ninety words counted with one simple command! This is the power of command line processing.
The file Documents/war_and_peace.txt contains the English translation of War and Peace. We can count the number of times the word ‘war’ is mentioned in this novel. You need to use a new UNIX command, grep, to do this trick. Type the following commands at the UNIX prompt and I will explain in a moment what happens.
$ cat Documents/war_and_peace.txt | grep -ow war | wc
274 274 1096
We count 274 occurrences of the word ‘war’ in War and Peace. With the cat command we read the document Documents/war_and_peace.txt. With the pipe ‘|’ symbol we send all the text of this document to the grep command, where we use the -ow option to search for the word ‘war’. With the pipe ‘|’ symbol we send all the ‘war’ words to the wc command which will count the number of ‘war’-s.
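For the Perl-curious: the same count can also be done with a one-liner. This is just a sketch that should be equivalent to the grep/wc pipeline above:

$ perl -nle '$count += () = /\bwar\b/g; END { print $count }' Documents/war_and_peace.txt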
You can experiment with these commands to better understand what happens. If you type:
$ cat Documents/war_and_peace.txt
, then you will see the complete War and Peace in the output. When you type:
$ cat Documents/war_and_peace.txt | grep -ow war
, then you will see many lines containing ‘war’, ‘war’, ‘war’ (one line for every occurence of the word ‘war’ in War and Peace). When you type:
$ cat Documents/war_and_peace.txt | grep -ow war | wc
, then you will count all these ‘war’ lines. Pretty neat, eh?
What about ‘war’ at the beginning of a sentence, like ‘War’, or someone shouting ‘WAR!’? To have a correct count we need to be case-insensitive. You can do this by adding the -i option to the grep command like this:
$ cat Documents/war_and_peace.txt | grep -i -ow war | wc
297 297 1188
We get 23 more occurrences.
Now we can try to find out if War and Peace is more about ‘war’ than ‘peace’ by counting the number of times ‘peace’ is mentioned:
$ cat Documents/war_and_peace.txt | grep -i -ow peace | wc
110 110 660
This proves ‘peace’ is mentioned only 110 times and ‘war’ 297 times!
To finish this tutorial I will teach you one more UNIX command: less. When experimenting with commands like cat and grep you might want to inspect intermediary results. With the less command you can page through output. Let’s try this out. Type:
$ cat Documents/war_and_peace.txt | less
Your screen will now show the first page of War and Peace. When you press the spacebar the next page will be displayed. Pressing the spacebar again shows the next page, and so on. This way you can slowly page through long result lists. When you press the ‘b’ key you will go one page back. To exit the less command press the ‘q’ key.
Some more examples of what we have learned.
Show all the lines which contain the word Bolkonski
$ cat Documents/war_and_peace.txt | grep Bolkonski
Or, use ‘less’ to page through the results:
$ cat Documents/war_and_peace.txt | grep Bolkonski | less
Count all the lines which contain the word Bolkonski
$ cat Documents/war_and_peace.txt | grep Bolkonski | wc
178 1848 11705
The answer is 178 lines.
Count the number of times Napoleon is mentioned
$ cat Documents/war_and_peace.txt | grep -ow Napoleon | wc
580 580 5220
The answer is 580 times.
Continue to Day 5: Editing text with nano >>
Day 3: Bash basics
Today we are going to teach you some UNIX commands. To get started you need to start the Virtual Catmandu application (hint: see our day 1 tutorial). When you see the Catmandu desktop you need to start the “UNIX prompt” terminal window (hint: see our day 2 tutorial). As a result you should see this on your screen:
The screen above is called the “UNIX prompt” or the “command line” and we will use it to execute UNIX commands. All Catmandu commands were written for this “command line” because it provides the powers of UNIX coupled with the Perl programming language.
Yesterday we saw the date command. Try it again: type ‘date’ in this window and press ‘enter’:
$ date
Wed Dec 3 07:34:45 UTC 2014
You will see the current date appearing in this window. With the w command you can see who is logged into the system. Type ‘w’ and press ‘enter’:
$ w
10:37:11 up 1:04, 2 users, load average: 0.08, 0.02, 0.01
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
catmandu tty1 :0 09:33 1:04m 10.13s 0.37s pam: gdm-autolo
catmandu pts/0 :0.0 10:25 0.00s 0.12s 0.11s w
You see from this output that there is a user named ‘catmandu’ working in your Virtual Catmandu. That is you!
UNIX provides thousands and thousands of commands you can execute this way. For every imaginable task there exists a command. In the coming days we will use the UNIX prompt to execute Catmandu commands to process library data. Before we can do this, we need to give you a little more context on how these UNIX commands work.
First, the commands date and w above are simple commands that don’t require any other input than typing their name. You execute these commands by hitting the return key at the end of the name. Some commands need one or more arguments. For instance, the echo command is used to print text on the screen. When you type ‘echo’ and hit return, it prints a blank line:
$ echo
$
(by the way, the $-sign you see above isn’t part of the command but an indication where the UNIX prompt is so that I don’t have to upload a screenshot for every example).
When you type “echo 123” the command will print ‘123’ as output.
$ echo 123
123
$
When you type “echo 123 abc def” the command will print “123 abc def” as output.
$ echo 123 abc def
123 abc def
$
These “123”, “abc”, “def” are called the arguments of the command. In the example above we ran the echo command with 3 arguments. All the arguments in our examples are words. You can also use sentences as arguments, but then you need to put the sentence in double quotes:
$ echo "Hello, my name is Catmandu" 123 abc def
Hello, my name is Catmandu 123 abc def
$
In the example above, the echo command had 4 arguments: 1 sentence and 3 words.
Some commands also accept options. With options you can change the behaviour of commands. Some commands have no options, some have many options. With Catmandu you will encounter commands that love to provide you with a lot of options. For now let’s keep things simple. The UNIX cal command can be used to output a calendar for the current month.
$ cal
   December 2014
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
$
When you provide a month and a year you can select for which month you would like to see the calendar, e.g. for May (month 5) of 2014:
$ cal 5 2014
      May 2014
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
$
The cal command also accepts a year with the -y option. E.g. the calendar for the year 1755 is:
$ cal -y 1755
To find out more about a specific command, call the man command. The documentation will show up, which you can browse with the arrow keys and quit by hitting the ‘q’ key:
$ man cal
That was enough for today. Tomorrow we will show you how to work with files and do a bit of text analysis.
Continue to Day 4: grep, less and wc >>
Day 2: Virtual Box introduction
Today we will give you a short introduction to Virtual Box so that you can find your way around in the next lessons. In the previous post you learned how to install a Virtual Catmandu and start the system. Do it now; we will show you some basic commands.
Ready? You should see this screen now:
On this desktop you will find some icons we will be using a lot in the next days. Double click on the icon named ‘LXTerminal’. You will be presented with what is called a “UNIX prompt”.
In this window you will be typing UNIX and catmandu commands in the next days. Type ‘date’ in this window and press enter. You will see that the computer calculated the current UTC date.
This is what IT-pros are doing the whole day: calculating the date. Next, we can try to close this window. You can do this with the little X-icon at the top right of the UNIX prompt screen. Click it and you’ll see the desktop again.
The next icon we are going to try is ‘Leafpad’. Double click on it and you will be presented with a text editor.
Now, in this window you don’t type UNIX commands but you will use it to create programming code. Let’s try this out! Type a nice poem in this window:
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you
This is so nice that we are going to save it as a file. To do this you need to go to the ‘File’ menu at the top and choose ‘Save As…’.
We are going to save this text as ‘poem.txt’ in the ‘catmandu’ folder. Type this filename, choose the catmandu folder and click on ‘Save’.
Voila, the poem is saved. You can now close the Leafpad window with the little x-icon at the top right.
And we are back on the desktop.
If you want to read the poem tomorrow, you need to open ‘Leafpad’ again (hint: double-click the icon). Go to the ‘File’ menu and choose ‘Open…’.
In the new window you see ‘poem.txt’; click on the name and then on ‘Open’ at the bottom.
As a result you will see again your poem in the ‘Leafpad’ window. Congratulations! Now that you know how to edit files you know how to edit programs. Tomorrow we will learn some UNIX commands.
Continue to Day 3: Bash basics >>

Day 1: Getting Catmandu
Catmandu is a data processing toolkit developed as part of the LibreCat project. Catmandu provides a command line tool and a suite of Perl modules to ease the import, storage, retrieval, export and transformation of library related data sets.
In this advent calendar we are going to teach you how to use UNIX command line tools to perform simple and advanced data processing tasks. To be able to follow our examples you need to have access to a UNIX account with Catmandu installed. If you are already familiar with UNIX and have access to a UNIX machine, the following command line instructions should install Catmandu:
https://github.com/LibreCat/Catmandu/wiki/Installation
If you are not familiar with UNIX, we prepared a virtual machine with Catmandu. Your local IT department should be able to get you up and running within a day. You have to provide them a short list of requirements: download and install VirtualBox and Catmandu.
When you have a Virtual Box with Catmandu running you should be welcomed by our Catmandu Cat.
Now that you know how to start Catmandu, let me show you how to stop the Catmandu machine. Go in the Virtual Catmandu screen to the menu item ‘System’ at the top and choose ‘Shutdown’. After a few seconds the Virtual Catmandu machine will stop and you can be proud of your first steps into UNIX data processing with Catmandu!
Continue to Day 2: Virtual Box introduction >>