Day 10: Working with CSV and Excel files

10_librecatCSV and Excel files are widely-used to store and exchange simple structured data. Many open datasets are published as CSV files, e.g. datahub.io. Within the library community CSV files are used for the distribution of title lists (KBART), e.g Knowledge Base+. Excel spreadsheets are often used to generate reports.

Catmandu implements importer and exporter for both formats. The CVS module is already part of the core system, the Catmandu::XLS and Catmandu::Exporter::Table modules may have to be installed separatly (note these steps are not required if you have the virtual catmandu box):

$ sudo cpanm Catmandu::XLS
$ sudo cpanm Catmandu::Exporter::Table

Get some CSV data to work with:

$ curl "https://www.kbplus.ac.uk/kbplus/publicExport/pkg/1?format=csv&omitHeader=Y" > springer-oa.csv

Now you can convert the data to different formats, like JSON, YAML and XML.

$ catmandu convert CSV to XML < springer-oa.csv
$ catmandu convert CSV to XLS --file springer-oa.xls < springer-oa.csv
$ catmandu convert XLS to JSON < springer-oa.xls
$ catmandu convert CSV to XLSX --file springer-oa.xlsx < springer-oa.csv
$ catmandu convert XLSX to YAML < springer-oa.xlsx

You can extract specified fields while converting to another tabular format. This is quite handy for analysis of specific fields or to generate reports.

$ catmandu convert CSV to Table --fields publication_title,online_identifier
$ catmandu convert CSV to CSV --fields 'identifier.jusp,publication_title,online_identifier' < springer-oa.csv
$ catmandu convert CSV to XLS --fields 'identifier.jusp,publication_title,online_identifier' --file springer-oa.xls < springer-oa.csv

The field names are read from the header line or must be given via the ‘fields’ parameter. In the second case you have to set the ‘header’ parameter to ‘0’.

$ echo '12157,"The Journal of Headache and Pain",2193-1801' | catmandu convert CSV --header 0 --fields 'id,title,issn'

By default Catmandu expects that CSV fields are separated by comma ‘,’ and strings are quoted with double qoutes ‘”‘. You can specify other characters as separator or quotes with the parameters ‘sep_char’ and ‘quote_char’:

$ echo '12157;$The Journal of Headache and Pain$;2193-1801' | catmandu convert CSV --header 0 --fields 'id,title,issn' --sep_char ';' --quote_char '$'

When exporting data a tabular format you can change the field names in the header or omit the header:

$ catmandu convert CSV to CSV --fields 'identifier.jusp,publication_title,online_identifier' --header 'id,title,issn' < springer-oa.csv
$ catmandu convert CSV to CSV --fields 'identifier.jusp,publication_title,online_identifier' --header 0 < springer-oa.csv

If you want to export complex/nested data structures to a tabular format, you must “flatten” the datastructure. This could be done with “Fixes“.

See Catmandu::Importer::CSV, Catmandu::Exporter::CSV and Catmandu::XLS for further documentation.

Continue in Day 11: Store your data in MongoDB >>

Advertisements

7 comments

  1. Pingback: Day 9: Processing MARC with Catmandu | LibreCat
  2. nemobis

    Forgive the stupid question, but what’s the point of converting from one spreadsheet format to another? ALl that can be done much more easily on LibreOffice, as far as I can see.

    Like

    • johrols

      Yes, if you like GUI programs, you are fine. The Catmandu XLS(X) importer and exporter offers much more:

      * you can convert your data to JSON, YAML, … and vice versa

      * you can import your spreadsheet to a store like MongoDB or Elasticsearch with one command

      * you can export data from your store to XLS(X) with one command

      Best regards,

      Johann

      Like

      • nemobis

        Yes, I certainly appreciate the ability to convert from a spreadsheet format to a non-spreadsheet format, and vice versa; I just don’t understand what’s the point of using Catmandu to convert from XLS to CSV or similar. But maybe I misunderstood the guide, because I’m not able to export from OAI-PMH to XLS or CSV either: what’s the best way to ask for support/report potential bugs?

        Like

  3. johrols

    Most of the time you have to fix the incoming data and then map the desired fields:

    catmandu convert OAI --url http://quod.lib.umich.edu/cgi/o/oai/oai --metadataPrefix oai_dc --from 2016-06-01T07:00:00Z --until 2016-06-02T07:04:00Z --handler oai_dc to CSV --fix 'join_field(title,", ")' --fields _identifier,title

    You can get more information about Catmandu here: http://librecat.org/Catmandu/ . You can ask questions via our mailing list mail.librecat.org/mailman/listinfo/librecat-dev. Bug reports should be submitted to the correspondent GitHub repository, e.g https://github.com/LibreCat/Catmandu-XLS/issues

    Like

    • nemobis

      Thank you. The mailing list server is currently unreachable, I’ll retry later; is the XLS repository about CSV export too? When I change my command, how can I tell when I’m going in the right direction or making things worse? Is there some “very verbose” or “debug” mode?

      I’ve tried various things but I never managed to get any output to CSV yet, other than header and empty lines (if I add –fields). One of the commands I tried is catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21 to CSV --fix "marc_map('852a','Location'); retain_field('Location'); retain_field('_id')" --fields _id,Location.

      http://librecat.org/Catmandu/ says «The is an ARRAY output, indicating that the colors field is nested. To fix this, a transformation needs to be provided», but I no longer get any ARRAY, so it seems I don’t need a join_field(). I don’t know what other variants to test; I guess I’ll try again next time I need to extract some data from OAI-PMH.

      Like

      • johrols

        Sorry, the correct link to the mailing list is librecat-dev.

        For working mit MARC data via OAI you have to use the corresponding “–handler” option. The “retain_field” fix deletes all fields except the specified one, so you couldn’t use it multiple times. Just specify the fields you want to output via the “–fields” option. This command will work:


        catmandu convert -v OAI --url http://atena.beic.it/OAI-PUB --metadataPrefix marc21 --handler marcxml to CSV --fix "marc_map('852a','Location'); retain_field('Location');" --fields Location

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s