Day 13: Harvest data with OAI-PMH
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.
Every OAI server must provide metadata records in Dublin Core; other (bibliographic) formats such as MARC may be supported in addition. Available metadata formats can be detected with the “ListMetadataFormats” request. You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
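ListMetadataFormats is one of the OAI-PMH verbs, and a verb is passed to the repository as a "verb" query parameter. As a rough sketch, the request URL can be assembled as below. The helper name oai_url is made up for this example and is not part of Catmandu, and the values are not URL-encoded.

```shell
# oai_url is a hypothetical helper, not a Catmandu command.
# It joins a base URL, an OAI-PMH verb and optional key=value arguments.
oai_url() {
  base="$1"
  verb="$2"
  shift 2
  url="${base}?verb=${verb}"
  for arg in "$@"; do
    url="${url}&${arg}"
  done
  printf '%s\n' "$url"
}

# Print the request URL that asks the Ghent repository for its formats.
oai_url https://lib.ugent.be/oai ListMetadataFormats

# With network access, the XML response could be fetched with curl:
# curl -s "$(oai_url https://lib.ugent.be/oai ListMetadataFormats)"
```

The metadataPrefix values listed in the response are the ones you can pass to --metadataPrefix.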
The OAI server may support selective harvesting, so OAI clients can fetch only a subset of the records in a repository. Client requests can be limited by datestamp (--from, --until) or by set membership (--set).
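As an illustration of selective harvesting, the sketch below combines a set with a date window. The wrapper name harvest_window and the dates are assumptions made for this example. The wrapper only prints the command, so nothing is harvested until you run the printed line yourself. Dates use day granularity (YYYY-MM-DD), which OAI-PMH repositories are required to support.

```shell
# harvest_window is a hypothetical wrapper that only PRINTS a catmandu
# command restricted to one set and one date window.
harvest_window() {
  from_date="$1"
  until_date="$2"
  printf 'catmandu convert OAI --url %s --metadataPrefix oai_dc --set flandrica --from %s --until %s\n' \
    "https://lib.ugent.be/oai" "$from_date" "$until_date"
}

# Example window (the dates are arbitrary).
harvest_window 2014-12-01 2014-12-31

# To actually harvest, pipe the printed command to a shell:
# harvest_window 2014-12-01 2014-12-31 | sh
```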
To get some Dublin Core records from the collection of Ghent University Library and convert them to JSON (the default), run the following catmandu command:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc
You can also harvest MARC data and store it in a file:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml to MARC --type USMARC > ugent.mrc
Instead of harvesting the full metadata records you can fetch only the record identifiers (--listIdentifiers):
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --listIdentifiers 1 to YAML
You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):
$ nano simple.fix
Add the following fixes to the file:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
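If you prefer not to open an editor, the same file can be written from the shell. This is a sketch that assumes you want simple.fix in the current directory; it overwrites any existing file of that name.

```shell
# Write the four fixes to simple.fix in the current directory.
# The quoted 'EOF' keeps the shell from expanding $append.
cat > simple.fix <<'EOF'
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
EOF

# Show the result to confirm the file content.
cat simple.fix
```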
Now you can run an ETL process (extract, transform, load) with one command:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag ugent
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag ugent
The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats use the default handler (raw) or implement your own. Read our documentation for further details.
Continue to Day 14: Set up your own OAI data service >>
Hi,
I’m new to Perl and Catmandu and am having some problems trying this out. I am not using the virtual machine, so it might be a problem with my Catmandu version (0.9209).
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler oai_dc
does not work for me, because:
Can’t locate object method “metadataPrefix” via package “oai_dc” (perhaps you forgot to load “oai_dc”?)
The list record example is working:
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --listIdentifiers 1 to YAML
But all commands that use the --metadataPrefix [oai_dc | marc21 | …] method result in the “Can’t locate object method” error.
How do I load the package oai_dc? I already installed the Harvester, and the XML::SAX::ExpatXS (just in case).
$ cpanm XML::SAX::ExpatXS
$ cpanm Net::OAI::Harvester
The test failed but it said that it was installed after using --force.
$ Net::OAI::Harvester is up to date. (1.15)
Any hints what I can try now?
Cheers,
Johannes
BTW: Thanks for your great Tutorials here!
Hi Johannes, thanks for joining us!
Several things may be happening here.
First, you don’t need the ‘handler’ option to be able to read the oai_dc output from Michigan. This should work:
catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z
You only need the handler option when you extract special formats other than oai_dc or marc from an OAI server. In those cases you may need to write your own Perl module to interpret the format.
Also make sure that the two characters in front of ‘url’, ‘metadataPrefix’, ‘from’ and ‘until’ are two minus signs (--). When you copy and paste from the web, these can be turned into a dash (–), which looks like a minus sign but means something else in Unix.
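A pasted command can be repaired before running it. The following is only a sketch: fix_dashes is a made-up helper name, and it replaces every en dash in the text, including any that were meant to be there.

```shell
# fix_dashes is a hypothetical helper: it replaces every en dash (–)
# in its argument with two ASCII hyphens.
fix_dashes() {
  printf '%s\n' "$1" | sed 's/–/--/g'
}

# Repair a command that was copied from a web page.
fix_dashes 'catmandu convert OAI –url http://quod.lib.umich.edu/cgi/o/oai/oai –metadataPrefix oai_dc'
```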
Cheers
Patrick
Hi Patrick,
thanks for the hint! I omitted the handler option and successfully got Dublin Core. I still ran into errors when harvesting other formats that require the handler option (e.g. marc21).
For instance:
$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml to MARC --type USMARC > umich.mrc
-> “Can’t locate object method “metadataPrefix” via package “marcxml” (perhaps you forgot to load “marcxml”?) at /Users/XYZ/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/Catmandu/Importer/OAI.pm line 63.”
Then I figured out that I need to install the OAI packages for Catmandu:
$ cpanm Catmandu::OAI
Now it’s working fine.
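A quick way to check for this situation is to ask Perl whether it can load the module. This is a sketch: have_perl_module is a made-up helper name, and it assumes that the distribution installs a loadable module called Catmandu::OAI.

```shell
# have_perl_module is a hypothetical helper: it succeeds only if
# Perl can load the named module.
have_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Assumption: the OAI packages are loadable as Catmandu::OAI.
if have_perl_module Catmandu::OAI; then
  echo "Catmandu::OAI is installed"
else
  echo "Catmandu::OAI is missing; try: cpanm Catmandu::OAI"
fi
```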
Cheers,
Johannes
I got mysterious errors after doing “sudo cpanm Catmandu Catmandu::Importer::OAI” on Fedora 23 (catmandu 1.0201), turns out I had forgotten “cpanm Catmandu::OAI” as well…
For instance
FYI
I see that https://air.unimi.it/oai/request has a broken SSL certificate. To be able to connect to such a host, you need to switch off verification of SSL certificates in your local environment (PERL_LWP_SSL_VERIFY_HOSTNAME=0) . E.g.
PERL_LWP_SSL_VERIFY_HOSTNAME=0 catmandu -I lib convert OAI --url https://air.unimi.it/oai/request --metadataPrefix oai_dc --listIdentifiers 1
Are you sure? The certificate looks correct on fedora 23 (with Firefox 46, curl 7.43.0, python-requests 2.10.0) and https://www.ssllabs.com/ssltest/analyze.html?d=air.unimi.it gives a B; probably you’re using Chrome which complains about SHA-1.
The error above was resolved for me by doing “cpanm Catmandu::OAI”.
I have trouble using “select marc_match()” with OAI: I’m able to retrieve all objects (sometimes after a few retries) with something like
catmandu convert OAI --url http://atena.beic.it/OAI-PUB --fix 'retain_field("_id")' to CSV
(nice way to list all IDs) or
catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21
but I’m not able to retrieve only a few of them:
catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21 --fix "select marc_match(852a,'.*Marucelliana.*'); retain_field('_id')"
ends either with nothing or with “converted 0 objects done”.
To work with MARC you have to add the handler option "--handler marcxml". The fix “marc_match” is a “condition”, which should be used like this:
if marc_match(852a,'.*Marucelliana.*')
add_field('Location','Marucelliana');
end
Thanks for a nice presentation! Unfortunately, running
cpanm --sudo Catmandu::OAI
I receive the following error:
Fetching http://www.cpan.org/authors/id/H/HO/HOCHSTEN/MODS-Record-0.11.tar.gz … OK
Configuring MODS-Record-0.11 … OK
Building and testing MODS-Record-0.11 … FAIL
! Installing MODS::Record failed.
From log:
Building and testing MODS-Record-0.11
cp lib/MODS/Record.pm blib/lib/MODS/Record.pm
# Testing MODS::Record 0.11
t/00_load.t …………. ok
# Failed test ‘from_xml’
# at t/01_mods.t line 74.
Can’t call method “get_mods” on an undefined value at t/01_mods.t line 75.
# Looks like your test exited with 255 just after 43.
t/01_mods.t ………….
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 22/64 subtests
t/release-pod-syntax.t .. skipped: these tests are for release candidate testing
I’m running CentOS 6.8. Appreciate any help on this.
Hi, just a note to say I managed to skip the tests by running:
cpanm --sudo --notest Catmandu::OAI
Now it works.
Thanks for the feedback. I will investigate this further.
Hello, I need to pass a specific user agent to the OAI importer, something like --user_agent='Mozilla/5.0', but I can’t work out how exactly. It is probably my fault, as my Perl knowledge is too limited (I have looked at the documentation on CPAN for the importer). Can you explain it to me with an example from the command line or from a configuration file? Thank you
Hi, to do this I needed to make a change in the HTTP::OAI library. A new release (4.12) is now on its way to CPAN [1]. If you install that version of HTTP::OAI, you can do something like this on the command line:
HTTP_OAI_AGENT="Mozilla/5.0" catmandu convert OAI --url https://biblio.ugent.be/oai
[1] https://metacpan.org/release/HOCHSTEN/HTTP-OAI-4.12/view/lib/HTTP/OAI.pm
Thank you very much Patrick, with your syntax I was able to make it work. I was able to harvest specific sets from my own DSpace. Now I’m working on how to convert to RDF, because my goal is to convert to RePEc. I already have a RePEc submission, and want to automatically convert and create the files with proper names.
Thanks again,
Greetings,
F.