Day 13: Harvest data with OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI-compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.

Every OAI server must provide metadata records in Dublin Core; other (bibliographic) formats such as MARC may be supported in addition. The formats a server offers can be listed with the “ListMetadataFormats” request. You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
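To make the “ListMetadataFormats” request concrete: an OAI-PMH request is an ordinary HTTP GET with a "verb" query parameter. The following is a minimal sketch that only builds and prints the request URL. The base URL is the one used in this post; the curl call is left commented out because it assumes curl is installed and the server is reachable.

```shell
#!/bin/sh
# Base URL of the OAI endpoint (the University of Michigan endpoint from this post).
base_url="http://quod.lib.umich.edu/cgi/o/oai/oai"

# ListMetadataFormats asks the server which metadataPrefix values it supports.
formats_url="${base_url}?verb=ListMetadataFormats"
echo "$formats_url"

# To send the request (assumes network access and curl):
# curl -s "$formats_url"
```

The metadataPrefix values in the XML response (for example oai_dc) are what you pass to --metadataPrefix.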

The OAI server may support selective harvesting, so OAI clients can request only a subset of the records in a repository. Requests can be limited by datestamp (--from, --until) or by set membership (--set).
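On the wire, these options correspond to the from, until and set arguments of an OAI-PMH ListRecords request. The sketch below builds such a URL by hand to show what a selective harvest looks like. The helper name encode_colons and the variable names are made up for this illustration, and percent-encoding the colons follows the protocol's URL encoding rules, although many servers also accept them unencoded.

```shell
#!/bin/sh
# Base URL and time window taken from the examples in this post.
base_url="http://quod.lib.umich.edu/cgi/o/oai/oai"
from_ts="2014-12-01T07:00:00Z"
until_ts="2014-12-01T07:04:00Z"

# Hypothetical helper: percent-encode the colons in a datestamp.
encode_colons() {
  printf '%s' "$1" | sed 's/:/%3A/g'
}

# ListRecords limited to a datestamp window; a "set" argument could be
# appended in the same way when the server organises its records into sets.
records_url="${base_url}?verb=ListRecords&metadataPrefix=oai_dc"
records_url="${records_url}&from=$(encode_colons "$from_ts")"
records_url="${records_url}&until=$(encode_colons "$until_ts")"
echo "$records_url"
```

The Catmandu client builds and sends these requests for you, so you normally only need the command-line options.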

To get some Dublin Core records from the digital collection of the University of Michigan and convert them to JSON (the default), run the following catmandu command:

$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler oai_dc

You can also harvest MARC data and store it in a file:

$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml to MARC --type USMARC > umich.mrc

Instead of harvesting the complete metadata records, you can fetch only the record identifiers (--listIdentifiers):

$ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --listIdentifiers 1 to YAML

You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):

$ nano simple.fix

Add the following fixes to the file:

marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260b,publisher)
marc_map(260c,date)
marc_map(650a,subject.$append)
remove_field(record)

Now you can run an ETL process (extract, transform, load) with one command:

$ catmandu import OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag umich
$ catmandu import OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag umich

The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats, use the default handler (raw) or implement your own. See the documentation for further details.

Continue to Day 14: Set up your own OAI data service >>


14 comments

  1. Pingback: Day 17: Exporting RDF data with Catmandu | LibreCat
  2. Joh

    Hi,
    I’m new to Perl and Catmandu and am having some problems trying this out. I’m not using the virtual machine, so it might be a problem with my Catmandu version (0.9209).

    $ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler oai_dc
    does not work for me, because:
    Can’t locate object method “metadataPrefix” via package “oai_dc” (perhaps you forgot to load “oai_dc”?)

    The list identifiers example works:
    $ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --listIdentifiers 1 to YAML

    But all commands that use the --metadataPrefix [oai_dc | marc21 | …] option result in the “Can’t locate object method” error.

    How do I load the package oai_dc? I already installed the Harvester, and the XML::SAX::ExpatXS (just in case).
    $ cpanm XML::SAX::ExpatXS
    $ cpanm Net::OAI::Harvester

    The tests failed, but it reported that the module was installed after I used --force.
    $ Net::OAI::Harvester is up to date. (1.15)

    Any hints what I can try now?

    Cheers,
    Johannes

    BTW: Thanks for your great Tutorials here!


    • hochstenbach

      Hi Johannes, thanks for joining us!

      Several things may be happening here.

      First, you don’t need the ‘handler’ option to be able to read the oai_dc output from Michigan. This should work:

      catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z

      You only need the handler option when you extract special formats other than oai_dc or marc from an OAI server. In those cases you may need to write your own Perl module to interpret these formats.

      Also make sure that the two characters in front of ‘url’, ‘metadataPrefix’, ‘from’ and ‘until’ are two minus signs (--). When copying and pasting from the web, they can be turned into a dash (–), a different character that looks like a minus sign but means something else in Unix.

      Cheers
      Patrick


      • Joh

        Hi Patrick,

        thanks for the hint! I omitted the handler option and successfully got Dublin Core. I still ran into errors when harvesting other formats that require the handler option (e.g. marc21).

        For instance:
        $ catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix marc21 --from 2014-12-01T07:00:00Z --until 2014-12-01T07:04:00Z --handler marcxml to MARC --type USMARC > umich.mrc

        -> “Can’t locate object method “metadataPrefix” via package “marcxml” (perhaps you forgot to load “marcxml”?) at /Users/XYZ/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/Catmandu/Importer/OAI.pm line 63.”

        Then I figured out that I need to install the OAI packages for Catmandu:

        $ cpanm Catmandu::OAI

        Now it’s working fine.

        Cheers,
        Johannes


    • nemobis

      I got mysterious errors after doing “sudo cpanm Catmandu Catmandu::Importer::OAI” on Fedora 23 (catmandu 1.0201), turns out I had forgotten “cpanm Catmandu::OAI” as well…

      For instance

      $ catmandu convert OAI --url https://air.unimi.it/oai/request --metadataPrefix oai_dc --listIdentifiers 1
      Odd number of elements in hash assignment at /usr/local/share/perl5/HTTP/OAI/UserAgent.pm line 32.
      Use of uninitialized value $args[0] in list assignment at /usr/local/share/perl5/HTTP/OAI/UserAgent.pm line 32.
      Invalid Request (use 'force' to force a non-conformant request): 
      No verb supplied
      []
      


  3. Pingback: Day 14: Set up your own OAI data service | LibreCat
  4. Pingback: Day 12: Index your data with Elasticsearch | LibreCat
  5. nemobis

    I have trouble using “select marc_match()” with OAI: I’m able to retrieve all objects (sometimes after a few retries) with something like catmandu convert OAI --url http://atena.beic.it/OAI-PUB --fix 'retain_field("_id")' to CSV (nice way to list all IDs) or catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21, but I’m not able to retrieve only a few of them: catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21 --fix "select marc_match(852a,'.*Marucelliana.*'); retain_field('_id')" ends either with nothing or with “converted 0 objects done”.


    • johrols

      To work with MARC you have to add the handler option "--handler marcxml". The fix “marc_match” is a “condition” which should be used like this:


      if marc_match(852a,'.*Marucelliana.*')
      add_field('Location','Marucelliana');
      end


  6. srl

    Thanks for a nice presentation! Unfortunately, running
    cpanm --sudo Catmandu::OAI
    I receive the following error:
    Fetching http://www.cpan.org/authors/id/H/HO/HOCHSTEN/MODS-Record-0.11.tar.gz … OK
    Configuring MODS-Record-0.11 … OK
    Building and testing MODS-Record-0.11 … FAIL
    ! Installing MODS::Record failed.

    From log:
    Building and testing MODS-Record-0.11
    cp lib/MODS/Record.pm blib/lib/MODS/Record.pm
    # Testing MODS::Record 0.11
    t/00_load.t …………. ok
    # Failed test ‘from_xml’
    # at t/01_mods.t line 74.
    Can’t call method “get_mods” on an undefined value at t/01_mods.t line 75.
    # Looks like your test exited with 255 just after 43.
    t/01_mods.t ………….
    Dubious, test returned 255 (wstat 65280, 0xff00)
    Failed 22/64 subtests
    t/release-pod-syntax.t .. skipped: these tests are for release candidate testing

    I’m running CentOS 6.8. Appreciate any help on this.

