Introducing FileStores

Catmandu is always our tool of choice when working with structured data. Using the Elasticsearch or MongoDB Catmandu::Store-s it is quite trivial to store and retrieve metadata records. Storing and retrieving YAML, JSON (and by extension XML, MARC, CSV, …) files can be as easy as the commands below:

$ catmandu import YAML to database < input.yml
$ catmandu import JSON to database < input.json
$ catmandu import MARC to database < marc.data
$ catmandu export database to YAML > output.yml

A catmandu.yml configuration file is required with the connection parameters to the database:

$ cat catmandu.yml
---
store:
  database:
    package: ElasticSearch
    options:
       client: '1_0::Direct' 
       index_name: catmandu
...

Given these tools to import, export and even transform structured data, can this be extended to unstructured data? In institutional repositories like LibreCat we would like to manage metadata records and binary content (for example PDF files related to the metadata). Catmandu 1.06 introduces the Catmandu::FileStore as an extension to the already existing Catmandu::Store to manage binary content.

A Catmandu::FileStore is a Catmandu::Store where each Catmandu::Bag acts as a “container” or a “folder” that can contain zero or more records describing file content. The file records themselves contain pointers to a backend storage implementation capable of serialising and streaming binary files. Out of the box, one Catmandu::FileStore implementation is available: Catmandu::Store::File::Simple, or File::Simple for short, which stores files in a directory.

Some examples. To add a file to a FileStore, the stream command needs to be executed:


$ catmandu stream /tmp/myfile.pdf to File::Simple --root /data --bag 1234 --id myfile.pdf

In the command above: /tmp/myfile.pdf is the file to be uploaded to the File::Store. File::Simple is the name of the File::Store implementation, which requires one mandatory parameter, --root /data, the root directory where all files are stored. The --bag 1234 is the “container” or “folder” which contains the uploaded files (with a numeric identifier 1234). And the --id myfile.pdf is the identifier for the newly created file record.

To download the file from the File::Store, the stream command needs to be executed in the opposite direction:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf to /tmp/file.pdf

or

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf > /tmp/file.pdf

On the file system the files are stored in a deeply nested directory structure so that the File::Store can be spread out over many disks. The numeric container identifier is zero-padded and split into path segments (1234 becomes 000/001/234):


/data
  `--/000
      `--/001
          `--/234
              `--/myfile.pdf

A listing of all “containers” can be retrieved by requesting an export of the default (index) bag of the File::Store:


$ catmandu export File::Simple --root /data to YAML
_id: 1234
...

A listing of all files in the container “1234” can be retrieved by adding the bag name to the export command:

$ catmandu export File::Simple --root /data --bag 1234 to YAML
_id: myfile.pdf
_stream: !!perl/code '{ "DUMMY" }'
content_type: application/pdf
created: 1498125394
md5: ''
modified: 1498125394
size: 883202
...

Each File::Store implementation supports at least the fields presented above:

  • _id: the name of the file
  • _stream: a callback function to retrieve the content of the file (requires an IO::Handle as input; see the Perl sketch after this list)
  • content_type: the MIME-Type of the file
  • created: a timestamp when the file was created
  • modified: a timestamp when the file was last modified
  • size: the byte length of the file
  • md5: an optional MD5 checksum
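
From Perl, the _stream callback can be used to copy the binary content wherever it is needed. Below is a minimal sketch, assuming the File::Simple store from the examples above; see the Catmandu::FileStore documentation for the exact API:

use Catmandu;
use IO::File;

# open the File::Simple store used in the examples above
my $store = Catmandu->store('File::Simple', root => '/data');

# fetch the file record 'myfile.pdf' from the '1234' container
my $file = $store->bag('1234')->get('myfile.pdf');

# the _stream callback writes the binary content into any IO::Handle
my $io = IO::File->new('/tmp/file.pdf', 'w');
$file->{_stream}->($io);
$io->close;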

We envision that many FileStore implementations can be created for Catmandu to store files in GitHub, BagIt archives, Fedora Commons and other backends.

Using Catmandu::Plugin::SideCar, Catmandu::FileStore-s and Catmandu::Store-s can be combined into one endpoint. Using Catmandu::Store::Multi and Catmandu::Store::File::Multi, many different implementations of Stores and FileStores can be combined.
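
As an illustration, a catmandu.yml along these lines should combine a metadata store and a file store into one endpoint. This is a hypothetical sketch based on the Catmandu::Plugin::SideCar documentation; double-check the option names in the module’s POD:

$ cat catmandu.yml
---
store:
  # the FileStore holding the binary content
  files:
    package: File::Simple
    options:
      root: /data
  # the metadata store; the SideCar plugin ties its 'data' bag to 'files'
  database:
    package: ElasticSearch
    options:
      client: '1_0::Direct'
      index_name: catmandu
      bags:
        data:
          plugins:
            - SideCar
          sidecar: files
          sidecar_bag: data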

This is a short introduction, but I hope you will experiment a bit with the new functionality and provide feedback to our project.

Catmandu 1.04

Catmandu 1.04 has been released with some nice new features. There are some new Fix routines that were requested by our community:

error

The “error” fix immediately stops the execution of the Fix script and throws an error. Use this to abort the processing of a data stream:

$ cat myfix.fix
unless exists(id)
    error("no id found?!")
end
$ catmandu convert JSON --fix myfix.fix < data.json

valid

The “valid” fix condition can be used to validate a record (or part of a record) against a JSONSchema. For instance we can select only the valid records from a stream:

$ catmandu convert JSON --fix "select valid('', JSONSchema, schema:myschema.json)" < data.json

Or, create some logging:

$ cat myfix.fix
unless valid(author, JSONSchema, schema:authors.json)
log("errors in the author field")
end
$ catmandu convert JSON --fix myfix.fix < data.json
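
For illustration, a minimal schema:authors.json requiring every author to have a name could look like this — a hypothetical example in JSON Schema draft-04 syntax; any schema supported by JSONSchema will do:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": ["name"],
  "properties": {
    "name": { "type": "string" }
  }
}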

rename

The “rename” fix can be used to recursively change the names of fields in your documents. For example, when you have this JSON input:

{
"foo.bar": "123",
"my.name": "Patrick"
}

you can transform all periods (.) in the key names to underscores with this fix:

rename('','\.','_')

The first parameter is the fields “rename” should work on (in our case it is an empty string, meaning the complete record). The second and third parameters are the regex search and replace parameters. The result of this fix is:

{
"foo_bar": "123",
"my_name": "Patrick"
}

The “rename” fix will only work on the keys of JSON paths. For example, given the following path:

my.deep.path.x.y.z

The keys are:

  • my
  • deep
  • path
  • x
  • y
  • z

The second and third arguments search and replace these separate keys. When you want to change a path as a whole, take a look at the “collapse()” and “expand()” fixes in combination with the “rename” fix:

collapse()
rename('',"my\.deep","my.very.very.deep")
expand()

Now the generated path will be:

my.very.very.deep.path.x.y.z

Of course the example above could be written more simply as “move_field(my.deep,my.very.very.deep)”, but it serves as an example that powerful renaming is possible.

import_from_string

This Fix is a generalisation of the “from_json” Fix. It can transform a serialised string field in your data into an array of data. For instance, take the following YAML record:


---
foo: '{"name":"patrick"}'
...

The field ‘foo’ contains a JSON fragment. You can transform this JSON into real data using the following fix:


import_from_string(foo,JSON)

Which creates a ‘foo’ array containing the deserialised JSON:


---
foo:
- name: patrick

The “import_from_string” fix looks very much like the “from_json” fix, but you can use any Catmandu::Importer. It always creates an array of hashes. For instance, given the following YAML record:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

You can transform the CSV fragment in the ‘foo’ field into data by using this fix:


import_from_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

In the same way it can process MARC, XML, RDF, YAML or any other format supported by Catmandu.

export_to_string

The fix “export_to_string” is the opposite of “import_from_string” and is the generalisation of the “to_json” fix. Given the YAML from the previous example:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

You can create a CSV fragment in the ‘foo’ field with the following fix:


export_to_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

search_in_store

The fix “search_in_store” is a generalisation of the “lookup_in_store” fix. The latter is used to query the “_id” field in a Catmandu::Store and return the first hit. The former, “search_in_store”, can query any field in a store and return all (or a subset) of the results. For instance, given the YAML record:


---
foo: "(title:ABC OR author:dave) AND NOT year:2013"
...

then the following fix will replace the ‘foo’ field with the result of the query in a Solr index:


search_in_store('foo', store:Solr, url: 'http://localhost:8983/solr/catalog')

As a result, the document will be updated like:


---
foo:
    start: 0
    limit: 0
    hits: [...]
    total: 1000
...

where

  • start: the starting index of the search result
  • limit: the number of results per page
  • hits: an array containing the data from the result page
  • total: the total number of search results

Every Catmandu::Store can have a different layout of the result page. Look at the documentation of the specific store implementations for the details.

Thanks for all your support for Catmandu and keep on data converting 🙂

Metadata Analysis at the Command-Line

Last week I was at the ELAG 2016 conference in Copenhagen and attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora4. One of the important steps required to migrate and model data to RDF is understanding what your data is about. Often old systems need to be converted for which little or no documentation is available. Instead of manually processing large XML or MARC dumps, tools like metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring article in the Code4Lib Journal on how this can be done in Python. In this blog post I’ll demonstrate how this can be done using a new Catmandu tool: Catmandu::Breaker.

To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:

$ sudo cpan Catmandu::Breaker

A breaker is a command that transforms data into a line format that can be easily processed with Unix command line tools such as grep, sort, uniq, cut and many more. If you need an introduction to Unix tools for data processing, please follow the examples Johan Rolschewski of Berlin State Library and I presented as an ELAG bootcamp.

As a simple example, let’s create a YAML file and demonstrate how it can be analysed using Catmandu::Breaker:

$ cat test.yaml
---
name: John
colors:
 - black
 - yellow
 - red
institution:
  name: Acme
  years:
    - 1949
    - 1950
    - 1951
    - 1952

This example has a combination of simple name/value pairs, a list of colors and a deeply nested field. To transform this data into the breaker format, execute the command:

$ catmandu convert YAML to Breaker < test.yaml
1 colors[]  black
1 colors[]  yellow
1 colors[]  red
1 institution.name  Acme
1 institution.years[] 1949
1 institution.years[] 1950
1 institution.years[] 1951
1 institution.years[] 1952
1 name  John

The breaker format is a tab-delimited output with three columns:

  1. A record identifier: read from the _id field in the input data, or a counter when no such field is present.
  2. A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([])
  3. A field value
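
Because the output is plain tab-delimited text, the usual Unix tools apply directly. For example, to count how often each field occurs in the test file above, cut out the second column and feed it to sort and uniq:

$ catmandu convert YAML to Breaker < test.yaml | cut -f 2 | sort | uniq -c
      3 colors[]
      1 institution.name
      4 institution.years[]
      1 name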

When you have a very large JSON or YAML file and need to find all the values of a deeply nested field, you could do something like:

$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"

Using Catmandu you can do this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel). Just replace YAML with any of these formats and run the breaker command. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, ElasticSearch, Solr or even relational databases such as MySQL, Postgres and Oracle. For instance, to get a breaker format for an OAI-PMH repository, issue a command like:

$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

If your data is in a database you could issue an SQL query like:

$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker

Some formats, such as MARC, don’t provide a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists. Running a breaker on MARC input, you get this:

$ catmandu convert MARC to Breaker < t/camel.usmarc  | head
fol05731351     record[][]  LDR
fol05731351     record[][]  _
fol05731351     record[][]  00755cam  22002414a 4500
fol05731351     record[][]  001
fol05731351     record[][]  _
fol05731351     record[][]  fol05731351
fol05731351     record[][]  082
fol05731351     record[][]  0
fol05731351     record[][]  0
fol05731351     record[][]  a

The MARC fields are part of the data, not part of the field name. This can be fixed by adding a special ‘marc’ handler to the breaker command:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc  | head
fol05731351     LDR 00755cam  22002414a 4500
fol05731351     001 fol05731351
fol05731351     003 IMchF
fol05731351     005 20000613133448.0
fol05731351     008 000107s2000    nyua          001 0 eng
fol05731351     010a       00020737
fol05731351     020a    0471383147 (paper/cd-rom : alk. paper)
fol05731351     040a    DLC
fol05731351     040c    DLC
fol05731351     040d    DLC

Now all the MARC subfields are visible in the output.

You can use this format to find, for instance, all unique values in a MARC file. Let’s try to find all unique 008 values:

$ catmandu convert MARC to Breaker --handler marc < camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500

Catmandu::Breaker doesn’t only break input data into an easy format for command line processing, it can also do a statistical analysis on the breaker output. First, process some data into the breaker format and save the result in a file:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker

Now, use this file as input for the ‘catmandu breaker’ command:

$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |

As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:

  • The ‘001’ field is available in 10 records (see: count)
  • One record doesn’t contain a ‘020a’ subfield (see: zeros)
  • The ‘650a’ field is available in all records, at least once and at most 3 times (see: min, max)
  • Only 8 out of 10 ‘100a’ subfields have unique values (see: uniq)
  • The last column, ‘entropy’, provides a number indicating how interesting the field is for search engines: the higher the entropy, the more unique content can be found. The number after the slash is the maximum possible entropy, log2 of the number of observed values (for a field occurring once in each of the 10 records, log2(10) ≈ 3.3).

I hope these tools are of some use in your projects!

Catmandu 1.01

Catmandu 1.01 has been released today. There have been some speed improvements in processing fixes due to switching from the Data::Util to the Ref::Util package, which has better support on many Perl platforms.

For the command line there is now support for preprocessing Fix scripts. This means one can read in variables from the command line into a Fix script. For instance, when processing data you might want to keep some provenance data about your data sources in the output. This can be done with the following commands:


$ catmandu convert MARC --fix myfixes.fix --var source=Publisher1 \
    --var date=2014-2015 < data.mrc

with a myfixes.fix like:


add_field(my_source,{{source}})
add_field(my_data,{{date}})
marc_field(245,title)
marc_field(022,issn)
.
.
.
etc
.
.

Your JSON output will now contain the clean ‘title’ and ‘issn’ fields, but also for each record a ‘my_source’ field with value ‘Publisher1’ and a ‘my_data’ field with value ‘2014-2015’.

By using the Text::Hogan compiler, full support of the mustache language is available.
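
For example, mustache sections can make part of a Fix script conditional on whether a variable was provided on the command line. This is a sketch assuming the standard mustache section syntax that Text::Hogan implements:

# only record the provenance when a 'source' variable is given, e.g.
#   catmandu convert MARC --fix myfixes.fix --var source=Publisher1 < data.mrc
{{#source}}add_field(my_source,{{source}}){{/source}}
marc_field(245,title)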

In this new Catmandu version there have been also some new fix functions you might want to try out, see our Fixes Cheat Sheet for a full overview.


Parallel Processing with Catmandu

In this blog post I’ll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process 2 times as much data, you need 2 times as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:

$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added       100 (55/sec)
added       200 (76/sec)
added       300 (87/sec)
added       400 (92/sec)
added       500 (90/sec)
added       600 (94/sec)
added       700 (97/sec)
added       800 (97/sec)
added       900 (96/sec)
added      1000 (97/sec)

In the example above we process an ‘input.marc’ MARC file into an ‘output.json’ JSON file with some difficult data cleaning in the ‘heavy_load.fix’ Fix script. Using a single process we can reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records.

Can we make this any faster?

Computers nowadays come equipped with multiple processors. Using a single process, only one of these processors is used for calculations. One would get much more ‘bang for the buck’ if all the processors could be used. One technique to do that is called ‘parallel processing’.

To check the number of processors available on your machine, inspect the file ‘/proc/cpuinfo’ on your Linux system:

$ cat /proc/cpuinfo | grep processor
processor   : 0
processor   : 1

The example above shows two lines: I have two cores available to do processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way, our processing could be 2, 4, 8 or 16 times as fast (in principle).

To check if your computer is using all that calculating power, use the ‘uptime’ command:

$ uptime
11:15:21 up 622 days,  1:53,  2 users,  load average: 1.23, 1.70, 1.95

In the example above I ran ‘uptime’ on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case), the server is waiting for input. If the load average is equal to the number of cores, the server is using all the CPU power available. If the load is bigger than the number of cores, there is more work available than can be executed by the machine, and some processes need to wait.

Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I’m going to use a Unix tool called ‘GNU parallel’ to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:

$ sudo yum install parallel

The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are very many ways to cut your data into many parts. I’ll show you a trick we use at Ghent University library with the help of a MongoDB installation.

First, install MongoDB and the MongoDB Catmandu plugins (these examples are taken from our CentOS documentation):

$ sudo cat > /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF

$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB

Next, we are going to store our input data in a MongoDB database with the help of a Catmandu Fix script that adds some random numbers to the data:

$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc

With the ‘random.fix’ like:


random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")

The ‘random()’ Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory ‘lib/Catmandu/Fix’). This will make sure that every record in your input file contains five random numbers: ‘part.rand2’, ‘part.rand4’, ‘part.rand8’, ‘part.rand16’ and ‘part.rand32’. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts depending on the number of processors you have in your machine.

To access one chunk of your data, the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:

$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2
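
To double-check that the random keys split the data into two roughly equal chunks, you can count each part instead of exporting it. This is a sketch, assuming ‘catmandu count’ accepts the same store options and query as ‘catmandu export’:

$ catmandu count MongoDB --database_name data -q '{"part.rand2":0}'
$ catmandu count MongoDB --database_name data -q '{"part.rand2":1}'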

We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.

#!/bin/bash
# file: parallel.sh
CPU=$1

if [ "${CPU}" == "" ]; then
    /usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
elif [ "${CPU}" != "" ]; then
     catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi

The example script above shows how a conversion process could run on a 2-processor machine. The lines with ‘/usr/bin/parallel’ show how GNU parallel is used to call this script with two arguments, ‘0’ and ‘1’ (for the 2-processor example). The line with ‘catmandu export’ shows how a chunk of data is read from the database and processed with the ‘heavy_load.fix’ Fix script.

If you have a 32-processor machine, you would need to provide parallel an input which contains the numbers 0 to 31 and change the query to ‘part.rand32’.
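
For instance, with a sequence of chunk numbers this could be written as the one-liner below. This is an untested sketch; note that GNU parallel substitutes every literal ‘{}’ in the command with the input value, which here fills in both the query and the output file name:

$ seq 0 31 | parallel -u \
   "catmandu export MongoDB --database_name data -q '{\"part.rand32\":{}}' to JSON --line_delimited 1 --fix heavy_load.fix > result.{}.json"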

GNU parallel is a very powerful command. It gives you the opportunity to run many processes in parallel and even to spread out the load over many machines if you have a cluster. When all these machines have access to your MongoDB database, they can all receive chunks of data to be processed. The only task left is to combine all the results, which can be as easy as a simple ‘cat’ command:

$ cat result.*.json > final_result.json

Catmandu 1.00

After 4 years of programming and 88 minor releases we are finally there: the release of Catmandu 1.00! We have pushed the test coverage of the code to 93.97% and added and cleaned a lot of our documentation.

For the new features read our Changes file.

A few important changes should be noted.

By default Catmandu will read and write valid JSON files. In previous versions the default input format was (new)line-delimited JSON records, as in:


{"record":"1"}
{"record":"2"}
{"record":"3"}

instead of the valid JSON array format:


[{"record":"1"},{"record":"2"},{"record":"3"}]

The old format can still be used as input, but will be read much faster when using the --line_delimited option on the command line. Thus, write:


# fast
$ catmandu convert JSON --line_delimited 1  < lines.json.txt

instead of:


# slow
$ catmandu convert JSON < lines.json.txt

By default Catmandu will export in the valid JSON-array format. If you still need to use the old format, then provide the --line_delimited option on the command line:


$ catmandu convert YAML to JSON --line_delimited 1 < data.yaml

We thank all contributors for these wonderful four years of open source coding and we wish you all four new hacking years. Our thanks go to:

Catmandu Chat

On Friday June 26 2015, 16:00 CEST, we’ll provide a one-hour introduction/demo into processing data with Catmandu.

If you are interested, join us on the event page:

https://plus.google.com/hangouts/_/event/c6jcknos8egjlthk658m1btha9o

More instructions on the exact Google Hangout coordinates for this chat will follow on this web page on Friday June 26 at 15:45.

To enter the chat session, a working version of the Catmandu VirtualBox needs to be running on your system:

https://librecatproject.wordpress.com/get-catmandu/