Today we will look a bit further into MARC processing with Catmandu. By now you should already know how to start up the Virtual Catmandu (hint: see day 1) and the UNIX command prompt (hint: see day 2). We already saw a bit of MARC processing on day 9, and today we will show you how to transform MARC records into Dublin Core, in preparation for creating RDF and Linked Data in later posts.
First I’m going to teach you how to process different types of MARC files. On the Virtual Catmandu system we provided five example MARC files. You can find them in your Documents folder:
When you examine these files with the UNIX less command you will see that each file has a slightly different format:
$ less Documents/camel.mrk
$ less Documents/camel.usmarc
$ less Documents/marc.xml
$ less Documents/rug01.sample
There are many ways in which MARC data can be written into a file. Every vendor likes to use its own format. You can compare this with the different ways a text document can be stored: as Word, OpenOffice, PDF or plain text. If we are going to process these files with catmandu, we need to tell the system exactly which format each file uses.
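To make this concrete, here is a small sketch that prints one catmandu command per sample file, each with a matching --type. The helper name marc_type and the mapping by file extension are assumptions that only fit these example files (catmandu itself does not guess the type from the file name); MARCMaker, USMARC, XML and ALEPHSEQ are importer types of the Catmandu MARC module.

```shell
# Hypothetical helper: pick a MARC importer type from the file name.
# The extension-to-type mapping is an assumption for the sample files only.
marc_type() {
  case "$1" in
    *.mrk)    echo MARCMaker ;;  # MarcEdit-style mnemonic text format
    *.usmarc) echo USMARC ;;     # binary ISO 2709 MARC
    *.xml)    echo XML ;;        # MARCXML
    *.sample) echo ALEPHSEQ ;;   # Aleph sequential export
    *)        echo USMARC ;;     # fallback, also an assumption
  esac
}

# Print (not run) the command you would type for each file.
for f in camel.mrk camel.usmarc marc.xml rug01.sample; do
  echo "catmandu convert MARC --type $(marc_type "$f") to YAML < Documents/$f"
done
```

The loop only prints the commands, so you can copy the one you need to the prompt.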
We will work today with the last example rug01.sample which is a small export out of the Aleph catalog from Ghent University Library. Ex Libris uses a special MARC format to structure their data which is called Aleph sequential. We need to tell catmandu not only that our input file is in MARC but also in this special Aleph format. Let’s try to create YAML to see what it gives:
$ catmandu convert MARC --type ALEPHSEQ to YAML < Documents/rug01.sample
To transform this MARC file into Dublin Core we need to create a fix file. You can use the UNIX command nano for this (hint: see day 5 for how to create files with nano). Create a file dublin.fix:
$ nano dublin.fix
And type into nano the following fixes:
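Going by the line-by-line walkthrough that follows, dublin.fix would contain eight fixes roughly like the ones in this sketch. The subfield codes 020a, 022a, 260b and 260c are assumptions based on common MARC 21 usage, and the heredoc is simply an alternative to typing the lines into nano.

```shell
# Write dublin.fix in one go; the quoted 'EOF' keeps $append literal.
# Subfield codes for ISBN, ISSN, publisher and date are assumptions.
cat > dublin.fix <<'EOF'
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(700,creator.$append)
marc_map(020a,isbn.$append)
marc_map(022a,issn.$append)
marc_map(260b,publisher)
marc_map(260c,date)
marc_map(650a,subject.$append)
EOF
```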
Every MARC record stores its title in the 245-field. In the first line, marc_map(245,title), we map the MARC-245 field to a new field in the record called title.
In the second and third lines we map authors to a field called creator. In the rug01.sample file the authors are stored in the MARC-100 and MARC-700 fields. Because there is usually more than one author in a record, we need to $append them to create an array (a list) of one or more creators.
In lines 4 and 5 we use the same trick to extract the ISBN and ISSN numbers from the record, which we store in separate fields isbn and issn (these are not Dublin Core fields; we will process them later).
In line 6 and line 7 we read the MARC-260 field which contains publisher and date information. Here we don’t need the $append trick because there is usually only one 260-field in a MARC record.
In line 8 the subjects are extracted from the 650-field using the same $append trick as above. Notice that we only extracted the $a subfields? If you want to add more subfields you can list them, as in marc_map(650abcdefgh,subject.$append).
Given the dublin.fix file above, we can run the conversion like this:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample
As always you can type | less at the end of this command to slow down the screen output, or store the results into a file with > results.txt. Hint:
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample | less
$ catmandu convert MARC --type ALEPHSEQ to YAML --fix dublin.fix < Documents/rug01.sample > results.txt
The results should look like this:
creator:
- Katz, Jerrold J.
isbn:
- '0855275103 :'
publisher: Harvester press,
subject:
- Proposition (Logic)
- Speech acts (Linguistics)
- Generative grammar.
- Competence and performance (Linguistics)
title: Propositional structure and illocutionary force :a study of the contribution of sentence meaning to speech acts /Jerrold J. Katz.
Congratulations, you’ve created your first mapping file to transform library data from MARC to Dublin Core! We still need a bit more cleaning to remove stray periods and commas here and there, but we already have a first working mapping.
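As a rough sketch of that cleaning step, the lines below append three replace_all fixes to dublin.fix. The regular expressions are assumptions aimed at the punctuation visible in the sample output (the trailing " :" on the ISBN, the trailing comma on the publisher, the final period on a subject) and may need tuning for other records.

```shell
# Append cleanup fixes after the marc_map lines (order matters:
# the fields must exist before they can be cleaned).
# The patterns are assumptions based on the sample output only.
cat >> dublin.fix <<'EOF'
replace_all(isbn.*,"[ :]+$","")
replace_all(publisher,"[ ,]+$","")
replace_all(subject.*,"[.]$","")
EOF
```

After appending, rerun the same catmandu convert command with --fix dublin.fix and compare the output with the earlier result.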
Below you’ll find a complete example. You can read more about our Fix language online.
marc_map(245,title, -join => " ")
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.
Every OAI server must provide metadata records in Dublin Core; other (bibliographic) formats like MARC may be supported in addition. Available metadata formats can be detected with “ListMetadataFormats”. You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
The OAI server may support selective harvesting, so OAI clients can get only subsets of records from a repository. Client requests can be limited by datestamps (--from, --until) or by set membership (--set).
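Under the hood these options end up as plain HTTP query parameters defined by OAI-PMH (verb, metadataPrefix, from, until, set). The sketch below only builds and prints two request URLs, using the University of Michigan base URL and datestamps from the examples in this post; the variable names are made up for illustration, and a set restriction would add a further &set=... parameter with a value specific to the repository.

```shell
# Base URL of the repository used in this post's examples.
base_url="http://quod.lib.umich.edu/cgi/o/oai/oai"

# Ask the server which metadata formats it offers.
formats_url="${base_url}?verb=ListMetadataFormats"

# Selective harvesting: Dublin Core records within a datestamp window.
records_url="${base_url}?verb=ListRecords&metadataPrefix=oai_dc&from=2014-12-01T07:00:00Z&until=2014-12-01T07:04:00Z"

echo "$formats_url"
echo "$records_url"
# To send a request yourself (needs network access): curl -s "$formats_url"
```

The response to such a request is XML, which is what the Catmandu OAI client parses for you.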
To get some Dublin Core records from the digital collection of the University of Michigan and convert them to JSON (the default), run the following catmandu command:
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler oai_dc
You can also harvest MARC data and store it in a file:
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml to MARC --type USMARC > umich.mrc
Instead of harvesting the full metadata records, you can fetch only the record identifiers (--listIdentifiers):
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --listIdentifiers 1 to YAML
You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):
$ nano simple.fix
Add the following fixes to the file:
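As an illustration only, a small simple.fix might pull a few fields out of the harvested MARC and drop the raw record before indexing. The choice of fields and subfield codes here is an assumption, not necessarily what this series uses; marc_map and remove_field are standard Fix functions.

```shell
# Hypothetical simple.fix: keep title, creators and date, drop raw MARC.
# The fields chosen are assumptions for illustration.
cat > simple.fix <<'EOF'
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
EOF
```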
Now you can run an ETL process (extract, transform, load) with one command:
$ catmandu import OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag umich
$ catmandu import OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag umich
The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats use the default handler (raw) or implement your own. See the documentation for further details.