
Day 9: Processing MARC with Catmandu

In the previous days we learned how to use the catmandu command to process structured data like JSON. Today we will use the same command to process MARC metadata records. Along the way we will see that MARC can be processed using JSON paths, but that this is a bit cumbersome. We will introduce MARCspec as an easier way to point to parts of a MARC record.

As always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).

In the Virtual Catmandu installation we provided a couple of example MARC files that we can inspect with the UNIX command cat or less. In the UNIX prompt inspect the file Documents/camel.usmarc, for instance, with cat:

$ cat Documents/camel.usmarc

You should see something like this:


Like JSON, the MARC file contains structured data, but the format is different. All the data is on one line, and at first sight there is no clear separation between fields and values. The field/value structure is there, but you need a MARC parser to extract this information. Catmandu contains a MARC parser which can be used to interpret this file. Type the following command to transform the MARC data into YAML (which we introduced in the previous posts):

$ catmandu convert MARC to YAML < Documents/camel.usmarc

You will see something like this:


When MARC is transformed into YAML, each record looks like a simple top-level field _id containing the identifier of the MARC record, plus a record field with a deeper array structure (or, more correctly, an array-of-arrays structure).

We can use catmandu to read the _id fields of the MARC record with the retain_field fix we learned in the Day 6 post:

$ catmandu convert MARC --fix 'retain_field(_id)' to YAML < Documents/camel.usmarc

You will see:

_id: 'fol05731351 '
_id: 'fol05754809 '
_id: 'fol05843555 '
_id: 'fol05843579 '

What is happening here? The MARC file Documents/camel.usmarc contains more than one MARC record. For every MARC record catmandu extracts the _id field.

Extracting data out of the MARC record itself is a bit more difficult. Because the record field is an array of arrays, you need indexes to extract the data. For instance, the MARC leader is usually in the first field of a MARC record. In the previous posts we learned that you need to use the 0 index to extract the first field out of an array:

$ catmandu convert MARC --fix 'retain_field(record.0)' to YAML < Documents/camel.usmarc
_id: 'fol05731351 '
- - LDR
  - ~
  - ~
  - _
  - 00755cam  22002414a 4500

The leader value itself is the fifth entry in the resulting array. So, we need index 4 to extract it:

$ catmandu convert MARC --fix 'copy_field(record.0.4,leader); retain_field(leader)' to YAML < Documents/camel.usmarc

We used here a copy_field fix to extract the value into a field called leader. The retain_field fix is used to keep only this leader field in the result. To process MARC data this way would be very verbose, plus you need to know at which index position the fields are that you are interested in. This is something you usually don’t know.

Catmandu introduces Carsten Klee’s MARCspec to ease the extraction of MARC values out of a record. With the marc_map fix the command above would read:
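A minimal sketch of these fixes, assuming marc_map accepts LDR as the tag for the leader (as the YAML output above suggests). Here the fixes are written into myfixes.txt with a shell here-document instead of nano:

```shell
# Write the two fixes to myfixes.txt.
# The quoted 'EOF' keeps the shell from changing anything in the text.
cat > myfixes.txt <<'EOF'
marc_map(LDR,leader)
retain_field(leader)
EOF

# Show what was written.
cat myfixes.txt
```

The first fix copies the leader into a field called leader, the second keeps only that field, just like the copy_field and retain_field pair above, but without any index numbers.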


I skipped writing the catmandu commands here (they will be the same every time). You can put these fixes into a file using nano (see the Day 5 post) and execute it as:

catmandu convert MARC --fix myfixes.txt to YAML < Documents/camel.usmarc

Where myfixes.txt contains the fixes above.

To extract the title fields, the field 245 remember? ;), you can write:


Or, if you are only interested in the $a subfield you could write:
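As a sketch of both variants, assuming marc_map takes the field tag, optionally followed by a subfield code, as its first argument. The file names title_all.txt and title_a.txt are made up for this example:

```shell
# Variant 1: map the whole 245 field (all subfields) to title.
cat > title_all.txt <<'EOF'
marc_map(245,title)
retain_field(title)
EOF

# Variant 2: map only subfield $a of field 245 to title.
cat > title_a.txt <<'EOF'
marc_map(245a,title)
retain_field(title)
EOF

# Show both fix files.
cat title_all.txt title_a.txt
```

Either file can be passed to catmandu with the --fix option in the same way as myfixes.txt.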


More elaborate mappings are possible. I’ll show you more complete examples in the next posts. As a warm-up, here is some code to extract all the record identifiers, titles and ISBN numbers in a MARC file into a CSV listing (which you can open in Excel).

Step 1, create a fix file myfixes.txt containing:
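A sketch of what this file could contain, pieced together from the explanation of the fixes further down (the field names title and isbn come from that explanation). It is written here with a shell here-document instead of nano:

```shell
# Create myfixes.txt with four fixes, one per line.
# The quoted 'EOF' makes sure the shell leaves $append alone.
cat > myfixes.txt <<'EOF'
marc_map(245,title)
marc_map(020a,isbn.$append)
join_field(isbn,",")
remove_field(record)
EOF

# Show what was written.
cat myfixes.txt
```

Only the _id, isbn and title fields are left after these fixes, which matches the three columns in the CSV output.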


Step 2, execute this command:

$ catmandu convert MARC --fix myfixes.txt to CSV < Documents/camel.usmarc

You will see this as output:

"fol05731351 ","0471383147 (paper/cd-rom : alk. paper)","ActivePerl with ASP and ADO /Tobias Martinsson."
"fol05754809 ",1565926994,"Programming the Perl DBI /Alligator Descartes and Tim Bunce."
"fol05843555 ",,"Perl :programmer's reference /Martin C. Brown."
"fol05843579 ",0072120002,"Perl :the complete reference /Martin C. Brown."
"fol05848297 ",1565924193,"CGI programming with Perl /Scott Guelich, Shishir Gundavaram & Gunther Birznieks."
"fol05865950 ",0596000138,"Proceedings of the Perl Conference 4.0 :July 17-20, 2000, Monterey, California."
"fol05865956 ",1565926099,"Perl for system administration /David N. Blank-Edelman."
"fol05865967 ",0596000278,"Programming Perl /Larry Wall, Tom Christiansen & Jon Orwant."
"fol05872355 ",013020868X,"Perl programmer's interactive workbook /Vincent Lowe."
"fol05882032 ","0764547291 (alk. paper)","Cross-platform Perl /Eric F. Johnson."

In the fix above we mapped the 245 field to the title. The ISBN is in the 020 field. Because MARC records can contain one or more 020 fields, we created an isbn array using the isbn.$append syntax. Next we turned the isbn array back into a comma-separated string using the join_field fix. As a last step we deleted all the fields we didn’t need in the output with the remove_field fix.

In this post we demonstrated how to process MARC data. In the next post we will show some examples how catmandu typically can be used to process library data.

Continue with Day 10: Working with CSV and Excel files >>


Day 6: Introduction into Catmandu

In the previous days we learned the UNIX commands grep, nano, ls and less. Today we will introduce you to a UNIX command we have created in the LibreCat project called catmandu. The catmandu command is used to process structured information. To demo this command, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).

In this tutorial we are going to process structured information. We call data structured when it is organised in such a way that it can easily be processed by computers. Previously we processed text documents like War and Peace, which is structured only in words and sentences: a computer doesn’t know which words are part of the title or which words contain names. We had to tell the computer that. Today we will download a weather report in a structured format called JSON and inspect it with the command catmandu.

At the UNIX prompt type in this command:

$ curl 'http://api.openweathermap.org/data/2.5/weather?q=Gent,be'

Update: as of the end of 2015 the OpenWeatherMap API requires an API key. Use this link to download a copy of the Ghent weather report instead:

$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json


You will see a JSON output like:

"sunrise":1417159365,"sunset":1417189422},"weather":[{"id":500,"main":"Rain","description":"light rain",
"icon":"10d"}],"base":"cmc stations","main":{"temp":281.15,"pressure":1006,"humidity":87,"temp_min":281.15,

All these fields tell something about the current weather in Gent, Belgium. You can recognise that there is a light rain and the temperature is 281.15 degrees Kelvin (about 8 degrees Celsius).  Write the output of this command to a file weather.json (using the ‘>’ sign we learned in the day 5 tutorial) so that we can use it in the next examples.

$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json > weather.json

When you type the ls command you should see the new file name weather.json appearing.

With the catmandu command you can process this file to make it a bit easier to read. For instance type:

$ catmandu convert JSON to YAML < weather.json

YAML is another format for structured information which is a bit easier to read for human eyes. Our weather report should now look like this:


Catmandu can be used to process structured information in the same way the UNIX grep command can process unstructured information. For instance, let’s try to filter out the name of this report. Type in this command:

$ catmandu convert JSON --fix 'retain_field(name)' to YAML < weather.json

You should end up with something like:

name: Gent

The --fix option in Catmandu is used to ‘massage’ the input weather.json, filtering the fields we would like to see. Only one fix was used, ‘retain_field’, which throws away all the data from the input except the ‘name’ field. By the way, the file weather.json wasn’t changed! We only read the file and displayed the output of the catmandu command.

The temperature in Gent is in the ‘temp’ part of the ‘main’ section in weather.json. To filter this out we need two retain_field fixes: one for the main section, one for the temp section:

$ catmandu convert JSON --fix 'retain_field(main); retain_field(main.temp)' to YAML < weather.json

You should now see something like this:

main:
  temp: 281.15

When massaging data you often need to create many fixes to process a data file in the format you need. With the nano command you can write all the fixes in a file. Start the nano editor with the command:

$ nano weather.fix

In nano type now the two fixes above:
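The two fixes go on separate lines, without the semicolon. As a sketch, the same file can also be created without nano by using a shell here-document (a technique not covered in these posts):

```shell
# Create weather.fix with the two fixes, one per line.
cat > weather.fix <<'EOF'
retain_field(main)
retain_field(main.temp)
EOF

# Show what was written.
cat weather.fix
```

Whichever way you create it, weather.fix should contain exactly these two lines.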


To exit nano type Ctrl-X, press Y to confirm the changes and press Enter to confirm the file name.

With this file it will be a bit easier to create many fixes. The name of the fix file can be used to repeat the commands above:

$ catmandu convert JSON --fix weather.fix to YAML < weather.json

To add more fixes we can again edit the weather.fix file. Type:

$ nano weather.fix

And add these lines after the two previous lines:

prepend(main.temp,"The temperature is ")
append(main.temp," degrees Kelvin")

Save the changes with Ctrl-X, Y, Enter and execute catmandu again:

$ catmandu convert JSON --fix weather.fix to YAML < weather.json

You should now see as output:

main:
  temp: The temperature is 281.15 degrees Kelvin

Catmandu contains many fixes to manipulate data. Check the documentation to get a complete list. This post only presented a short introduction into catmandu. In the next posts we will go deeper into its capabilities.

Continue to Day 7: Catmandu JSON paths >>

Day 5: Editing text with nano

Yesterday we looked at the commands grep, wc and less. Today we will show you how to store and edit files in UNIX. First, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial). Start up the UNIX prompt (hint: see our day 2 tutorial) and type in the command ‘nano’:

$ nano

You will be presented with the GNU nano text editor.


In this text editor you can type text or programs you can save on disk for later use. In this short tutorial I will guide you to some basic commands we will need in later tutorials. Type for instance a short text in this screen:

“Hello  world. My name is …”

When you want to save this text into a file type Ctrl-o (that is pressing the Ctrl-key and ‘o’ key on your keyboard). In the bottom of the screen nano will ask for a filename.



Type for instance ‘hello.txt’ as filename and press return. The file ‘hello.txt’ is now created on disk. We can test this with the commands we learned in the previous tutorial.

First exit the nano editor by typing Ctrl-x. And type ‘cat hello.txt’

$ cat hello.txt

You will see now the text created in the nano editor. With the UNIX command ‘ls‘ you can view all the filenames in the current directory.

$ ls

If you want to add more text to this file you can start again the nano editor with a file name.

$ nano hello.txt

You will again see the text you can edit and save again with Ctrl-o and exit nano with Ctrl-x.

Output of UNIX commands can also be written to a file. Let’s try to find all the lines in War and Peace that contain Bolkonski and inspect the results with nano:

$ cat Documents/war_and_peace.txt | grep Bolkonski > bolkonski.txt

Here we use the key ‘>’ to redirect the output of the command grep to a file named ‘bolkonski.txt’. Next we can use nano to inspect the contents of this file.
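A small self-contained sketch of the same redirection, using a made-up three-line sample text instead of the full novel (the file names sample.txt and bolkonski_sample.txt are only for this example). It also shows >>, which appends to a file instead of overwriting it:

```shell
# A tiny stand-in for Documents/war_and_peace.txt.
printf 'Prince Bolkonski entered.\nPierre smiled.\nBolkonski left again.\n' > sample.txt

# Redirect the grep output into a file instead of onto the screen.
cat sample.txt | grep Bolkonski > bolkonski_sample.txt

# A single > overwrites the file on every run; >> adds to the end instead.
echo 'end of results' >> bolkonski_sample.txt

# Show the result: the two matching lines plus the appended line.
cat bolkonski_sample.txt
```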

$ nano bolkonski.txt

By the way, you don’t need to type in the complete filenames in all the commands we have shown in the examples. When you type ‘bo’ and hit the tab-key then UNIX will autocomplete the file name to ‘bolkonski.txt’. I’m lazy and would type ‘cat bol’ and press tab .

Again you can use Ctrl-x to exit nano. You can view all the files with the ls command.

$ ls

bolkonski.txt  Documents  hello.txt  Pictures  Templates  Videos
Desktop        Downloads  Music      Public    test.fix

If you want to delete a file you can use the rm command. We can remove our bolkonski.txt file like this:

$ rm bolkonski.txt

This concludes our short excursion into UNIX. Monday we will be back with a new chapter: processing JSON with Catmandu. Have a nice weekend!

Continue with Day 6: Introduction to Catmandu >>

Day 4: grep, less and wc

Yesterday we gave you a little introduction into UNIX programming. Today we will show you how many words there are in Tolstoy’s War and Peace. First, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and open the UNIX prompt (hint: see our day 2 tutorial). As a result you should see this on your screen:



First we are going to teach you a new UNIX command, cat. In our Catmandu project the cat command is our favorite. With this command you can read War and Peace in 2 seconds! Let’s try it out: type ‘cat Documents/war_and_peace.txt‘ on the UNIX prompt and press ‘enter’.

$ cat Documents/war_and_peace.txt

“What was that?!” you might wonder? Well, that was the complete War and Peace running across your screen. We provided cat command with one argument ‘Documents/war_and_peace.txt’ which is the filename that contains the complete text of War and Peace (this text I downloaded from the Gutenberg Project for you).

In UNIX it is possible to glue the output from one command to the input of another command. This is called a pipe. To use a pipe in a command you have to find a funny little key on your computer which has this sign ‘|’. On my computer it looks like this:



With this pipe symbol ‘|’ you can glue the output of one command to another command. We will use a pipe to count the number of lines, words and characters in a file with the UNIX wc command like this:

$ cat Documents/war_and_peace.txt | wc
64620 563290 3272023

The output contains three numbers: 64620, 563290 and 3272023. The first number is the number of lines in the file Documents/war_and_peace.txt, the second is the number of words, and the third is the number of characters.

Five hundred sixty-three thousand two hundred and ninety words counted with one simple command! This is the power of command line processing.

The file Documents/war_and_peace.txt contains the English translation of War and Peace. We can count the number of times the word ‘war’ is mentioned in this novel. You need to use a new UNIX command, grep,  to do this trick. Type the following commands at the UNIX prompt and I will explain in a moment what happens.

$ cat  Documents/war_and_peace.txt | grep -ow war | wc
274     274    1096

We count 274 occurrences of the word ‘war’ in War and Peace. With the cat command we read the document Documents/war_and_peace.txt. With the pipe ‘|’ symbol we send all the text of this document to the grep command, where we use the -ow option to search for the word ‘war’. With the pipe ‘|’ symbol we send all the ‘war’ words to the wc command which will count the number of ‘war’-s.

You can experiment with these commands to better understand what happens. If you type:

$ cat  Documents/war_and_peace.txt

, then you will see the complete War and Peace in the output. When you type:

$ cat  Documents/war_and_peace.txt | grep -ow war

, then you will see many lines containing ‘war’, ‘war’, ‘war’ (one line for every occurrence of the word ‘war’ in War and Peace). When you type:

$ cat  Documents/war_and_peace.txt | grep -ow war | wc

, then you will count all these ‘war’ lines. Pretty neat, eh?

What about ‘war’ at the beginning of a sentence like ‘War’ or someone shouting ‘WAR!’? To have a correct count we need to be case-insensitive. You can do this by adding the -i option to the grep command like this:

$ cat  Documents/war_and_peace.txt | grep -i -ow war | wc
297     297    1188

We get 23 more occurrences.

Now we can try to find out if War and Peace is more about ‘war’ than ‘peace’ by counting the number of times ‘peace’ is mentioned:

$ cat  Documents/war_and_peace.txt | grep -i -ow peace | wc
110     110     660

This proves ‘peace’ is mentioned only 110 times and ‘war’ 297 times!
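To see exactly what the -o, -w and -i options do, it helps to try them on a text small enough to check by hand. The sketch below uses a made-up four-line sample (sample.txt is a hypothetical file name) and wc -l to count only lines:

```shell
# A tiny sample text with 'war' in different spellings.
printf 'War and peace.\nThe war was long.\nWarfare is not war.\nNo more WAR!\n' > sample.txt

# -o prints every match on its own line, -w matches whole words only,
# so the 'War' inside 'Warfare' is never counted.
# Case-sensitive: only the two lowercase 'war' words match.
cat sample.txt | grep -ow war | wc -l

# Adding -i ignores case: 'War', 'war', 'war' and 'WAR' all match.
cat sample.txt | grep -i -ow war | wc -l
```

The first command counts 2 matches and the second counts 4.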

To finish this tutorial I will teach you one more UNIX command: less. When experimenting with commands like cat and grep you might want to inspect intermediary results. With the less command you can page through output. Let’s try this out. Type:

$ cat  Documents/war_and_peace.txt | less

Your screen will now show the first page of War and Peace. When you press the spacebar the next page will be displayed. Pressing spacebar again you see again a next page, etc etc. This way you can slowly page through long result lists. When you press the ‘b’ key you will go one page back. To exit this less command press the ‘q’ key.

Some more examples of what we have learned.

Show all the lines which contain the word Bolkonski

$ cat  Documents/war_and_peace.txt | grep Bolkonski

Or, use ‘less’ to page through the results:

$ cat  Documents/war_and_peace.txt | grep Bolkonski | less

Count all the lines which contain the word Bolkonski

$ cat  Documents/war_and_peace.txt | grep Bolkonski | wc
178 1848 11705

The answer is 178 lines.

Count the number of times Napoleon is mentioned

$ cat  Documents/war_and_peace.txt | grep -ow Napoleon | wc
580 580 5220

The answer is 580 times.

Continue to Day 5: Editing text with nano >>

Day 3: Bash basics

Today we are going to teach you some UNIX commands. To get started you need to start the Virtual Catmandu application (hint: see our day 1 tutorial). When you see the Catmandu desktop you need to start the “UNIX prompt” terminal window (hint: see our day 2 tutorial). As a result you should see this on your screen:


The screen above is called the “UNIX prompt” or the “command line” and we will use it to execute UNIX commands. All Catmandu commands were written for this “command line” because it provides the powers of UNIX coupled with the Perl programming language.

Yesterday we saw the date command. Try it again: type ‘date’ in this window and press ‘enter’:

$ date
Wed Dec  3 07:34:45 UTC 2014

You will see the current date appearing in this window. With the w command you can see who is logged into the system. Type ‘w’ and press ‘enter’:

$ w
10:37:11 up 1:04, 2 users, load average: 0.08, 0.02, 0.01
catmandu tty1 :0 09:33 1:04m 10.13s 0.37s pam: gdm-autolo
catmandu pts/0 :0.0 10:25 0.00s 0.12s 0.11s w

You see from this output that there is a user named ‘catmandu’ working in your Virtual Catmandu. That is you!

UNIX provides thousands and thousands of commands you can execute this way. For every imaginable task there exists a command. In the coming days we will use the UNIX prompt to execute Catmandu commands to process library data. Before we can do this, we need to give you a little more context on how these UNIX commands work.

First, the commands date and w above are simple commands that don’t require any other input than typing their name. You execute these commands by hitting the return key after the name. Some commands need one or more arguments. For instance, the echo command is used to print text on the screen. When you type ‘echo’ and hit return, it prints a blank line:

$ echo


(by the way, the $-sign you see above isn’t part of the command but an indication where the UNIX prompt is so that I don’t have to upload a screenshot for every example).

When you type “echo 123” the command will print ‘123’ as output.

$ echo 123
123

When you type “echo 123 abc def” the command will print “123 abc def” as output.

$ echo 123 abc def
123 abc def

These “123”, “abc”, “def” are called the arguments of the command. In the example above we ran the echo command with 3 arguments. All the arguments in our examples are words. You can also use sentences as arguments, but then you need to put the sentence in double quotes:

$ echo "Hello, my name is Catmandu" 123 abc def
Hello, my name is Catmandu 123 abc def

In the example above, the echo command had 4 arguments: 1 sentence and 3 words.
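One way to check how the shell counts arguments is a small helper that prints the number of arguments it was given. This is only a sketch: count_args is a made-up name and shell functions are not covered in these posts.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
  echo "$#"
}

# Three words are three arguments.
count_args 123 abc def

# One quoted sentence plus three words are four arguments.
count_args "Hello, my name is Catmandu" 123 abc def

# Without the quotes every word of the sentence is a separate argument: five.
count_args Hello, my name is Catmandu
```

The three calls print 3, 4 and 5. The double quotes are what turn the whole sentence into a single argument.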

Some commands also accept options. With options you can change the behaviour of commands. Some commands have no options, some have many. With Catmandu you will encounter commands that love to provide you with a lot of options. For now let’s keep things simple. The UNIX cal command can be used to output a calendar for the current month.

$ cal
   December 2014
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

When you provide a month and a year you can select the month for which you would like to see the calendar. E.g. for May (month 5) of 2014:

$ cal 5 2014
      May 2014
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

The cal command also accepts a year with the -y option. E.g. the calendar for the year 1755 is:

$ cal -y 1755


To find out more about a specific command, call the man command. The documentation will show up; you can browse it with the arrow keys and quit by hitting the ‘q‘ key:

$ man cal

That was enough for today. Tomorrow we will show you how to work with files and do a bit of text analysis.

Continue to Day 4: grep, less and wc >>

Day 2: Virtual Box introduction

Today we will give you a short introduction to Virtual Box so that you can find your way in the next lessons. In the previous post you learned how to install a Virtual Catmandu and start the system. Do it now, and we will show you some basic commands.

Ready? You should see this screen now:


On this desktop you will find some icons we will be using a lot in the next days. Double click on the icon named ‘LXTerminal’. You will be presented with what is called a “UNIX prompt”.


In the next days you will be typing UNIX and catmandu commands in this window. Type ‘date’ in this window and press enter. You will see that the computer calculated the current UTC date.


This is what IT-pros are doing the whole day: calculating the date. Next, we can try to close this window. This you can do with the little X-icon at the top right of the UNIX prompt screen. Click it, you’ll see again the desktop.


The next icon we are going to try is ‘Leafpad’. Double click on it and you will be presented with a text editor.


Now, in this window you don’t type UNIX commands but you will use it to create programming code. Let’s try this out! Type a nice poem in this window:

Roses are red,
Violets are blue,
Sugar is sweet,
And so are you


This is so nice that we are going to save it as a file. To do this you need to go to the ‘File’ menu at the top and choose ‘Save As…’.


We are going to save this text as ‘poem.txt’ in the ‘catmandu’ folder. Type this filename, choose the catmandu folder and click on ‘Save’.


Voila, the poem is saved. You can now close the Leafpad window with the little x-icon at the top right.


And we are back on the desktop.

If you want to read the poem tomorrow, you need to open ‘Leafpad’ again (hint: double-click the icon). Go to the ‘File’ menu and choose ‘Open…’


In the new window you see ‘poem.txt’. Click on the name and then on ‘Open’ at the bottom.


As a result you will see again your poem in the ‘Leafpad’ window. Congratulations! Now that you know how to edit files you know how to edit programs. Tomorrow we will learn some UNIX commands.

Continue to Day 3: Bash basics >>


Day 1: Getting Catmandu

Catmandu is a data processing toolkit developed as part of the LibreCat project. Catmandu provides a command line tool and a suite of Perl modules to ease the import, storage, retrieval, export and transformation of library-related data sets.

In this advent calendar we are going to teach you how to use UNIX command line tools to perform simple and advanced data processing tasks. To be able to follow our examples you need to have access to a UNIX account with Catmandu installed. If you are already familiar with UNIX and have access to a UNIX machine, the following command line instructions should install Catmandu:


If you are not familiar with UNIX, we prepared a virtual machine with Catmandu. Your local IT department should be able to get you up and running within a day. You have to provide them with a short list of requirements: download and install VirtualBox and Catmandu.

When you have a Virtual Box with Catmandu running  you should be welcomed by our Catmandu Cat.


Now that you know how to start Catmandu, let me show you how to stop the Catmandu machine. In the Virtual Catmandu screen, go to the menu item ‘System’ at the top and choose ‘Shutdown’. After a few seconds the Virtual Catmandu machine will stop and you can be proud of your first steps into UNIX data processing with Catmandu!


Continue to Day 2: Virtual Box introduction >>