In the previous days we learned how to use the catmandu command to process structured data like JSON. Today we will use the same command to process MARC metadata records. Along the way we will see that MARC can be processed using JSON paths, but that this is a bit cumbersome. We will introduce MARCspec as an easier way to point to parts of a MARC record.
In the Virtual Catmandu installation we provided a couple of example MARC files that we can inspect with the UNIX commands cat or less. At the UNIX prompt, inspect the file Documents/camel.usmarc, for instance with cat:
$ cat Documents/camel.usmarc
You should see a dense, hard-to-read block of raw MARC data.
Like JSON, the MARC file contains structured data, but the format is different. All the data is on one line, and at first sight there isn’t a clear separation between fields and values. The field/value structure is there, but you need a MARC parser to extract this information. Catmandu contains a MARC parser which can be used to interpret this file. Type the following command to transform the MARC data into YAML (which we introduced in the previous posts):
$ catmandu convert MARC to YAML < Documents/camel.usmarc
You will see a long YAML listing. Each record in it has a simple top-level field _id containing the identifier of the MARC record, and a record field with a deeper array structure (or, more correctly, an array-of-arrays structure).
We can use catmandu to read the _id field of each MARC record with the retain_field fix we learned in the Day 6 post:
$ catmandu convert MARC --fix 'retain_field(_id)' to YAML < Documents/camel.usmarc
You will see:
---
_id: 'fol05731351 '
...
---
_id: 'fol05754809 '
...
---
_id: 'fol05843555 '
...
---
_id: 'fol05843579 '
...
What is happening here? The MARC file Documents/camel.usmarc contains more than one MARC record. For every MARC record catmandu extracts the _id field.
Extracting data out of the MARC record itself is a bit more difficult. The record is an array of arrays, so you need indexes to extract the data. For instance, the MARC leader is usually in the first field of a MARC record. In the previous posts we learned that you need to use the 0 index to extract the first field out of an array:
$ catmandu convert MARC --fix 'retain_field(record.0)' to YAML < Documents/camel.usmarc
---
_id: 'fol05731351 '
record:
- - LDR
  - ~
  - ~
  - _
  - 00755cam 22002414a 4500
...
The leader value itself is the fifth entry in the resulting array. So, we need index 4 to extract it:
$ catmandu convert MARC --fix 'copy_field(record.0.4,leader); retain_field(leader)' to YAML < Documents/camel.usmarc
Here we used a copy_field fix to extract the value into a field called leader. The retain_field fix is used to keep only this leader field in the result. Processing MARC data this way would be very verbose, and you would need to know the index position of every field you are interested in. This is something you usually don’t know.
Catmandu introduces Carsten Klee’s MARCspec to ease the extraction of MARC values out of a record. With the marc_map fix the command above would read:
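As a minimal sketch, reusing the leader field name from the copy_field example and assuming LDR (the tag shown in the YAML output above) is the MARC path for the leader:

```
marc_map("LDR",leader)
retain_field(leader)
```

No index positions are needed here: marc_map looks the field up by its tag, and retain_field keeps only the result.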
I skipped writing out the catmandu commands here (they will be the same every time). You can put these fixes into a file using nano (see the Day 5 post) and execute it as:
$ catmandu convert MARC --fix myfixes.txt to YAML < Documents/camel.usmarc
where myfixes.txt contains the fixes above.
To extract the title field (field 245, remember? ;)), you can write:
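A sketch along the same lines, using the same marc_map("245",title) mapping that appears in the CSV example later in this post; the retain_field step is an assumption added here so that only the title shows up in the output:

```
marc_map("245",title)
retain_field(title)
```

With this mapping all subfields of the 245 field end up in the single title value.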
Or, if you are only interested in the $a subfield you could write:
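Assuming subfields are addressed by appending the subfield code to the tag, as in the 020a path used later in this post, this might look like:

```
marc_map("245a",title)
retain_field(title)
```

The only change from the full-field mapping is the trailing a in the MARC path.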
More elaborate mappings are possible. I’ll show you more complete examples in the next posts. As a warm-up, here is some code to extract all the record identifiers, titles and ISBNs in a MARC file into a CSV listing (which you can open in Excel).
Step 1, create a fix file myfixes.txt containing:
marc_map("245",title)
marc_map("020a",isbn.$append)
join_field(isbn,",")
remove_field(record)
Step 2, execute this command:
$ catmandu convert MARC --fix myfixes.txt to CSV < Documents/camel.usmarc
You will see this as output:
_id,isbn,title
"fol05731351 ","0471383147 (paper/cd-rom : alk. paper)","ActivePerl with ASP and ADO /Tobias Martinsson."
"fol05754809 ",1565926994,"Programming the Perl DBI /Alligator Descartes and Tim Bunce."
"fol05843555 ",,"Perl :programmer's reference /Martin C. Brown."
"fol05843579 ",0072120002,"Perl :the complete reference /Martin C. Brown."
"fol05848297 ",1565924193,"CGI programming with Perl /Scott Guelich, Shishir Gundavaram & Gunther Birznieks."
"fol05865950 ",0596000138,"Proceedings of the Perl Conference 4.0 :July 17-20, 2000, Monterey, California."
"fol05865956 ",1565926099,"Perl for system administration /David N. Blank-Edelman."
"fol05865967 ",0596000278,"Programming Perl /Larry Wall, Tom Christiansen & Jon Orwant."
"fol05872355 ",013020868X,"Perl programmer's interactive workbook /Vincent Lowe."
"fol05882032 ","0764547291 (alk. paper)","Cross-platform Perl /Eric F. Johnson."
In the fix above we mapped the 245 field to the title. The ISBN is in the 020 field. Because a MARC record can contain more than one 020 field, we collected the values in an isbn array using the isbn.$append syntax. Next we turned the isbn array back into a comma-separated string using the join_field fix. As a last step we deleted the record field, which we didn’t need in the output, with the remove_field fix.
In this post we demonstrated how to process MARC data. In the next post we will show some examples of how catmandu is typically used to process library data.
Continue with Day 10: Working with CSV and Excel files >>