
Introducing FileStores

Catmandu is always our tool of choice when working with structured data. Using the Elasticsearch or MongoDB Catmandu::Store-s it is quite trivial to store and retrieve metadata records. Storing and retrieving YAML, JSON (and by extension XML, MARC, CSV, …) files can be as easy as the commands below:

$ catmandu import YAML to database < input.yml
$ catmandu import JSON to database < input.json
$ catmandu import MARC to database < marc.data
$ catmandu export database to YAML > output.yml

A catmandu.yml configuration file with the connection parameters to the database is required:

$ cat catmandu.yml
---
store:
  database:
    package: ElasticSearch
    options:
       client: '1_0::Direct' 
       index_name: catmandu
...

Given these tools to import, export and even transform structured data, can they be extended to unstructured data? In institutional repositories like LibreCat we would like to manage metadata records and binary content (for example PDF files related to the metadata). Catmandu 1.06 introduces the Catmandu::FileStore as an extension to the already existing Catmandu::Store to manage binary content.

A Catmandu::FileStore is a Catmandu::Store where each Catmandu::Bag acts as a “container” or a “folder” that can contain zero or more records describing File content. The file records themselves contain pointers to a backend storage implementation capable of serialising and streaming binary files. Out of the box, one Catmandu::FileStore implementation is available: Catmandu::Store::File::Simple, or File::Simple for short, which stores files in a directory.

Some examples. To add a file to a FileStore, the stream command needs to be executed:


$ catmandu stream /tmp/myfile.pdf to File::Simple --root /data --bag 1234 --id myfile.pdf

In the command above, /tmp/myfile.pdf is the file to be uploaded to the FileStore. File::Simple is the name of the FileStore implementation, which requires one mandatory parameter: --root /data, the root directory where all files are stored. The --bag 1234 option selects the “container” or “folder” that holds the uploaded files (with the numeric identifier 1234), and --id myfile.pdf is the identifier of the newly created file record.

To download the file from the FileStore, the stream command needs to be executed in the opposite direction:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf to /tmp/file.pdf

or

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf > /tmp/file.pdf

On the file system the files are stored in a deeply nested structure so that a FileStore can be spread out over many disks: the numeric container identifier 1234 is zero-padded and split into path segments (000/001/234):


/data
  `--/000
      `--/001
          `--/234
              `--/myfile.pdf

A listing of all “containers” can be retrieved by requesting an export of the default (index) bag of the FileStore:


$ catmandu export File::Simple --root /data to YAML
_id: 1234
...

A listing of all files in the container “1234” can be done by adding the bag name to the export command:

$ catmandu export File::Simple --root /data --bag 1234 to YAML
_id: myfile.pdf
_stream: !!perl/code '{ "DUMMY" }'
content_type: application/pdf
created: 1498125394
md5: ''
modified: 1498125394
size: 883202
...

Each FileStore implementation supports at least the fields presented above:

  • _id: the name of the file
  • _stream: a callback function to retrieve the content of the file (requires an IO::Handle as input)
  • content_type: the MIME-Type of the file
  • created: a timestamp when the file was created
  • modified: a timestamp when the file was last modified
  • size: the byte length of the file
  • md5: an optional MD5 checksum
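
The same operations are available from Perl. Below is a minimal sketch based on the Catmandu::FileStore API (the index, files, upload, get and stream methods); consult the Catmandu::FileStore documentation for the authoritative details:

use Catmandu;
use IO::File;

# open the File::Simple store with the same root as in the CLI examples
my $store = Catmandu->store('File::Simple', root => '/data');

# the default (index) bag lists the containers; create container 1234
my $index = $store->index;
$index->add({ _id => '1234' });

# get the files bag of container 1234 and upload a file into it
my $files = $index->files('1234');
$files->upload(IO::File->new('/tmp/myfile.pdf', 'r'), 'myfile.pdf');

# fetch the file record and stream its content back to disk
my $file = $files->get('myfile.pdf');
$files->stream(IO::File->new('/tmp/copy.pdf', 'w'), $file);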

We envision that many FileStore implementations will be created for Catmandu, to store files in backends such as GitHub, BagIt, Fedora Commons and more.

Using the Catmandu::Plugin::SideCar plugin, Catmandu::FileStore-s and Catmandu::Store-s can be combined into one endpoint. Using Catmandu::Store::Multi and Catmandu::Store::File::Multi, many different implementations of Stores and FileStores can be combined, as in the sketch below.
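
As an illustration, a hypothetical catmandu.yml sketch for File::Multi could look like the following (the stores option lists the names of the configured sub-stores; see the Catmandu::Store::File::Multi documentation for the exact options):

---
store:
  files1:
    package: File::Simple
    options:
      root: /data1
  files2:
    package: File::Simple
    options:
      root: /data2
  repository:
    package: File::Multi
    options:
      stores:
        - files1
        - files2
...

With such a configuration, a command like catmandu stream /tmp/myfile.pdf to repository --bag 1234 --id myfile.pdf would store the file in both roots.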

This is a short introduction, but I hope you will experiment a bit with the new functionality and provide feedback to our project.

Catmandu 1.04

Catmandu 1.04 has been released with some nice new features. There are some new Fix routines that were requested by our community:

error

The “error” fix immediately stops the execution of the Fix script and throws an error. Use it to abort the processing of a data stream:

$ cat myfix.fix
unless exists(id)
    error("no id found?!")
end
$ catmandu convert JSON --fix myfix.fix < data.json

valid

The “valid” fix condition can be used to validate a record (or part of a record) against a JSONSchema. For instance we can select only the valid records from a stream:

$ catmandu convert JSON --fix 'select valid("", JSONSchema, schema:myschema.json)' < data.json
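
For reference, a minimal hypothetical myschema.json that only accepts records with a string _id could look like this:

$ cat myschema.json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": ["_id"],
  "properties": {
    "_id": { "type": "string" }
  }
}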

Or, create some logging:

$ cat myfix.fix
unless valid(author, JSONSchema, schema:authors.json)
    log("errors in the author field")
end
$ catmandu convert JSON --fix myfix.fix < data.json

rename

The “rename” fix can be used to recursively change the names of fields in your documents. For example, when you have this JSON input:

{
  "foo.bar": "123",
  "my.name": "Patrick"
}

you can transform all periods (.) in the key names to underscores with this fix:

rename('','\.','_')

The first parameter is the field “rename” should work on (in our case an empty string, meaning the complete record). The second and third parameters are the regex search and replace arguments. The result of this fix is:

{
  "foo_bar": "123",
  "my_name": "Patrick"
}

The “rename” fix will only work on the keys of JSON paths. For example, given the following path:

my.deep.path.x.y.z

The keys are:

  • my
  • deep
  • path
  • x
  • y
  • z

The second and third arguments search and replace these separate keys. When you want to change a path as a whole, take a look at the “collapse()” and “expand()” fixes in combination with the “rename” fix:

collapse()
rename('',"my\.deep","my.very.very.deep")
expand()

Now the generated path will be:

my.very.very.deep.path.x.y.z

Of course the example above could be written more simply as “move_field(my.deep,my.very.very.deep)”, but it serves as an example that powerful renaming is possible.

import_from_string

This Fix is a generalisation of the “from_json” Fix. It can transform a serialised string field in your data into an array of data. For instance, take the following YAML record:


---
foo: '{"name":"patrick"}'
...

The field ‘foo’ contains a JSON fragment. You can transform this JSON into real data using the following fix:


import_from_string(foo,JSON)

Which creates a ‘foo’ array containing the deserialised JSON:


---
foo:
- name: patrick

The “import_from_string” fix looks very much like the “from_json” fix, but you can use any Catmandu::Importer. It always creates an array of hashes. For instance, given the following YAML record:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

You can transform the CSV fragment in the ‘foo’ field into data by using this fix:


import_from_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

In the same way it can process MARC, XML, RDF, YAML or any other format supported by Catmandu.

export_to_string

The fix “export_to_string” is the opposite of “import_from_string” and is the generalisation of the “to_json” fix. Given the YAML from the previous example:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

You can create a CSV fragment in the ‘foo’ field with the following fix:


export_to_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

search_in_store

The fix “search_in_store” is a generalisation of the “lookup_in_store” fix. The latter is used to query the “_id” field in a Catmandu::Store and return the first hit. The former, “search_in_store”, can query any field in a store and return all (or a subset) of the results. For instance, given the YAML record:


---
foo: "(title:ABC OR author:dave) AND NOT year:2013"
...

then the following fix will replace the ‘foo’ field with the result of the query in a Solr index:


search_in_store('foo', store:Solr, url: 'http://localhost:8983/solr/catalog')

As a result, the document will be updated like:


---
foo:
    start: 0
    limit: 0
    hits: [...]
    total: 1000
...

where

  • start: the starting index of the search result
  • limit: the number of results per page
  • hits: an array containing the data from the result page
  • total: the total number of search results
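
Assuming the result layout above, the embedded hits can be post-processed with ordinary fixes. A hypothetical follow-up that keeps only the first hit and the total count could look like:

search_in_store('foo', store: Solr, url: 'http://localhost:8983/solr/catalog')
# hypothetical follow-ups, assuming the result layout above
copy_field(foo.hits.0, first_hit)
copy_field(foo.total, total_hits)
remove_field(foo)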

Every Catmandu::Store can have a different layout of the result page. Look at the documentation of the specific store implementation (for instance Catmandu::Store::Solr) for the details.

Thanks for all your support for Catmandu and keep on data converting 🙂

Metadata Analysis at the Command-Line

Last week I was at the ELAG 2016 conference in Copenhagen and attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora 4. One of the important steps required to migrate and model data to RDF is understanding what your data is about. Often old systems need to be converted for which little or no documentation is available. Instead of manually processing large XML or MARC dumps, tools like metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring article in Code4Lib on how this can be done in Python. In this blog post I’ll demonstrate how this can be done using a new Catmandu tool: Catmandu::Breaker.

To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:

$ sudo cpan Catmandu::Breaker

A breaker is a command that transforms data into a line format that can be easily processed with Unix command line tools such as grep, sort, uniq, cut and many more. If you need an introduction to Unix tools for data processing, please follow the examples Johan Rolschewski of Berlin State Library and I presented as an ELAG bootcamp.

As a simple example, let’s create a YAML file and demonstrate how this file can be analysed using Catmandu::Breaker:

$ cat test.yaml
---
name: John
colors:
 - black
 - yellow
 - red
institution:
  name: Acme
  years:
    - 1949
    - 1950
    - 1951
    - 1952

This example has a combination of simple name/value pairs, a list of colors and a deeply nested field. To transform this data into the breaker format, execute the command:

$ catmandu convert YAML to Breaker < test.yaml
1 colors[]  black
1 colors[]  yellow
1 colors[]  red
1 institution.name  Acme
1 institution.years[] 1949
1 institution.years[] 1950
1 institution.years[] 1951
1 institution.years[] 1952
1 name  John

The breaker format is a tab-delimited output with three columns:

  1. A record identifier: read from the _id field in the input data, or a counter when no such field is present.
  2. A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([]).
  3. A field value
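
Because the output is tab-delimited, a quick inventory of all field names in a file is one pipe away:

$ catmandu convert YAML to Breaker < test.yaml | cut -f 2 | sort -u
colors[]
institution.name
institution.years[]
name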

When you have a very large JSON or YAML file and need to find all the values of a deeply nested field, you could do something like:

$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"

Using Catmandu you can do this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel): just replace YAML by any of these formats and run the breaker command. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, ElasticSearch, Solr or even relational databases such as MySQL, Postgres and Oracle. For instance, to get a breaker format for an OAI-PMH repository, issue a command like:

$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

If your data is in a database you could issue an SQL query like:

$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker

Some formats, such as MARC, don’t provide a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists; running a breaker on MARC input you get this:

$ catmandu convert MARC to Breaker < t/camel.usmarc  | head
fol05731351     record[][]  LDR
fol05731351     record[][]  _
fol05731351     record[][]  00755cam  22002414a 4500
fol05731351     record[][]  001
fol05731351     record[][]  _
fol05731351     record[][]  fol05731351
fol05731351     record[][]  082
fol05731351     record[][]  0
fol05731351     record[][]  0
fol05731351     record[][]  a

The MARC fields are part of the data, not part of the field name. This can be fixed by adding a special ‘marc’ handler to the breaker command:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc  | head
fol05731351     LDR 00755cam  22002414a 4500
fol05731351     001 fol05731351
fol05731351     003 IMchF
fol05731351     005 20000613133448.0
fol05731351     008 000107s2000    nyua          001 0 eng
fol05731351     010a       00020737
fol05731351     020a    0471383147 (paper/cd-rom : alk. paper)
fol05731351     040a    DLC
fol05731351     040c    DLC
fol05731351     040d    DLC

Now all the MARC subfields are visible in the output.

You can use this format to find, for instance, all unique values in a MARC file. Let’s try to find all unique 008 values:

$ catmandu convert MARC to Breaker --handler marc < camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500

Catmandu::Breaker doesn’t only break input data into an easy format for command line processing, it can also do a statistical analysis on the breaker output. First, process some data into the breaker format and save the result in a file:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker

Now, use this file as input for the ‘catmandu breaker’ command:

$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |

As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:

  • The ‘001’ field is available in 10 records (see: count)
  • One record doesn’t contain a ‘020a’ subfield (see: zeros)
  • The ‘650a’ subfield is available in all records, at least once and at most 3 times (see: min, max)
  • Only 8 out of 10 ‘100a’ subfields have unique values (see: uniq)
  • The last column, ‘entropy’, provides a number indicating how interesting the field is for search engines: the higher the entropy, the more unique content can be found
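
Much of this can also be checked with plain Unix tools on the breaker file. For instance, to count how often each (sub)field occurs:

$ cut -f 2 result.breaker | sort | uniq -c | sort -rn | head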

I hope these tools are of some use in your projects!

Catmandu 1.00

After 4 years of programming and 88 minor releases we are finally there: the release of Catmandu 1.00! We have pushed the test coverage of the code to 93.97% and added and cleaned a lot of our documentation.

For the new features read our Changes file.

A few important changes should be noted.

By default Catmandu will read and write valid JSON files. In previous versions the default input format was (new)line delimited JSON records as in:


{"record":"1"}
{"record":"2"}
{"record":"3"}

instead of the valid JSON array format:


[{"record":"1"},{"record":"2"},{"record":"3"}]

The old format can still be used as input, but will be read much faster when using the --line_delimited option on the command line. Thus, write:


# fast
$ catmandu convert JSON --line_delimited 1  < lines.json.txt

instead of:


# slow
$ catmandu convert JSON < lines.json.txt

By default Catmandu will export in the valid JSON array format. If you still need the old format, provide the --line_delimited option on the command line:


$ catmandu convert YAML to JSON --line_delimited 1 < data.yaml
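
The two options can also be combined to migrate an old line-delimited file to the new array format in one go:

$ catmandu convert JSON --line_delimited 1 to JSON < lines.json.txt > array.json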

We thank all contributors for these wonderful four years of open source coding and we wish you all four new hacking years. Our thanks go to:

Importing files from a hotfolder directory

The Catmandu data processing toolkit facilitates many import, export, and conversion tasks by supporting common APIs (e.g. SRU, OAI-PMH) and databases (e.g. MongoDB, CouchDB, SQL…). But sometimes the best API and database is the file system. In this brief article I’ll show how to use a “hotfolder” to automatically import files into a Catmandu store.

A hotfolder is a directory in which files can be placed to automatically get processed. To facilitate the creation of such directories I created the CPAN module File::Hotfolder. Let’s first define a sample importer and store in a catmandu.yml configuration file:

---
importer:
  json:
    package: JSON
    options:
      multiline: 1
store:
  couchdb:
    package: CouchDB
    options:
      default_bag: import
...

We can now manually import JSON files into the import database of a local CouchDB like this:

catmandu import json to couchdb < filename.json

Manually calling such a command for each file can be slow and requires access to the command line. How about defining a hotfolder to automatically import all JSON files into CouchDB? Here is an implementation:

use Catmandu -all;
use File::Hotfolder;

my $hotfolder = "import";
my $importer  = "json";
my $suffix    = qr{\.json};

# the 'couchdb' store as defined in catmandu.yml
my $store = store("couchdb");

watch( $hotfolder,
    filter   => $suffix,
    scan     => 1,
    delete   => 1,
    print    => WATCH_DIR | FOUND_FILE | CATCH_ERROR,
    callback => sub {
        $store->add_many( importer($importer, file => shift) );
    },
    catch    => 1,
)->loop;

The import directory is first scanned for existing files with the extension .json and then watched for modified or new files. As soon as a file has been found, it is imported. The CATCH_ERROR option ensures that the program is not killed if an import fails, for instance because of invalid JSON.
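
To try it out, save the script under a name of your choice (hotfolder.pl is used here as a hypothetical example), run it in the background and drop files into the watched directory:

$ perl hotfolder.pl &
$ cp filename.json import/

Because of the delete option, each file is removed again after it has been imported.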

The current version of File::Hotfolder only works with Unix but it may be extended to other operating systems as well.

One Day of a Catmandu Developer

By Patrick Hochstenbach

At Ghent University Library we are using Catmandu these days in a project to create a new discovery interface for our Aleph catalog. Daily we export MARC sequential files from several ALEPH catalogs and store them in a MongoDB store. Into this store we also add records from our SFX server and our institutional repository Biblio.

We use the MongoDB store to clean our datasets and to do a FRBR-ized merge of records. This merge is logical in our setup: one collection contains the MARC records, another collection is used to create relations between these records. When the data is cleaned and merged, we export the data to a Solr indexer which is used by the BlackLight frontend.

In the image below the architecture is shown. The Catmandu trail is clearly visible. For importing MARC records into MongoDB we use Catmandu importers. When we have all the data in the store we run a bunch of Catmandu fixers to cleanup the data. At the end of the day we use Catmandu exporters to send the data as JSON files to Solr where we index the data and make it available in BlackLight.

[Image: 20130618_discovery — architecture diagram showing the Catmandu importers, fixes and exporters between the catalogs, MongoDB, Solr and BlackLight]