Metadata Analysis at the Command-Line

Last week I was at the ELAG 2016 conference in Copenhagen and attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora 4. One of the important steps required to migrate and model data to RDF is understanding what your data is about. Often you need to convert data from old systems for which little or no documentation is available. Instead of manually processing large XML or MARC dumps, tools like metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring article in the Code4Lib Journal on how this can be done in Python. In this blog post I’ll demonstrate how the same analysis can be done using a new Catmandu tool: Catmandu::Breaker.

To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:

$ sudo cpan Catmandu::Breaker

A breaker is a command that transforms data into a line format that can be easily processed with Unix command-line tools such as grep, sort, uniq, cut and many more. If you need an introduction to Unix tools for data processing, please follow the examples that Johan Rolschewski of the Berlin State Library and I presented at an ELAG bootcamp.

As a simple example, let’s create a YAML file and demonstrate how it can be analysed using Catmandu::Breaker:

$ cat test.yaml
---
name: John
colors:
 - black
 - yellow
 - red
institution:
 name: Acme
 years:
  - 1949
  - 1950
  - 1951
  - 1952

This example has a combination of simple name/value pairs, a list of colors and a deeply nested field. To transform this data into the breaker format, execute the command:

$ catmandu convert YAML to Breaker < test.yaml
1 colors[]  black
1 colors[]  yellow
1 colors[]  red
1 institution.name  Acme
1 institution.years[] 1949
1 institution.years[] 1950
1 institution.years[] 1951
1 institution.years[] 1952
1 name  John

The breaker format is a tab-delimited output with three columns:

  1. A record identifier: read from the _id field in the input data, or a counter when no such field is present.
  2. A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([])
  3. A field value
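
Because each of these three columns is separated by a tab, you can slice out any column with standard Unix tools. For instance, counting the second column gives a quick overview of which fields occur and how often. This is a sketch using the test.yaml file from above; the exact alignment of the uniq counts may differ on your system:

$ catmandu convert YAML to Breaker < test.yaml | cut -f 2 | sort | uniq -c
   3 colors[]
   1 institution.name
   4 institution.years[]
   1 name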

When you have a very large JSON or YAML file and need to find all the values of a deeply nested field, you could do something like:

$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"
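
This prints only the matching breaker lines. To keep just the values themselves, extend the pipeline with cut and sort. A sketch, reusing the small test.yaml example from above:

$ catmandu convert YAML to Breaker < test.yaml | grep "institution.years" | cut -f 3 | sort -u
1949
1950
1951
1952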

Using Catmandu you can do this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel). Just replace the YAML with any of these formats and run the breaker command. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, ElasticSearch, Solr and even relational databases such as MySQL, Postgres and Oracle. For instance, to get a breaker format for an OAI-PMH repository, issue a command like:

$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

If your data is in a database you could issue an SQL query like:

$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker

Some formats, such as MARC, don’t provide a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists. Running the breaker on MARC input, you get this:

$ catmandu convert MARC to Breaker < t/camel.usmarc  | head
fol05731351     record[][]  LDR
fol05731351     record[][]  _
fol05731351     record[][]  00755cam  22002414a 4500
fol05731351     record[][]  001
fol05731351     record[][]  _
fol05731351     record[][]  fol05731351
fol05731351     record[][]  082
fol05731351     record[][]  0
fol05731351     record[][]  0
fol05731351     record[][]  a

The MARC fields are part of the data, not part of the field name. This can be fixed by adding a special ‘marc’ handler to the breaker command:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc  | head
fol05731351     LDR 00755cam  22002414a 4500
fol05731351     001 fol05731351
fol05731351     003 IMchF
fol05731351     005 20000613133448.0
fol05731351     008 000107s2000    nyua          001 0 eng
fol05731351     010a       00020737
fol05731351     020a    0471383147 (paper/cd-rom : alk. paper)
fol05731351     040a    DLC
fol05731351     040c    DLC
fol05731351     040d    DLC

Now all the MARC subfields are visible in the output.

You can use this format to find, for instance, all unique values in a MARC file. Let’s try to find all unique 008 values:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500
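
Another common question is which MARC (sub)fields are used most often. Counting the second column of the breaker output gives a quick overview. A sketch; for this sample file the ‘650a’ subfield (15 occurrences) ends up at the top of the list:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc | cut -f 2 | sort | uniq -c | sort -rn | head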

Catmandu::Breaker doesn’t only break input data into an easy format for command-line processing, it can also do a statistical analysis on the breaker output. First, process some data into the breaker format and save the result in a file:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker

Now, use this file as input for the ‘catmandu breaker’ command:

$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |

As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:

  • The ‘001’ field is available in 10 records (see: count)
  • One record doesn’t contain a ‘020a’ subfield (see: zeros)
  • The ‘650a’ subfield is available in all records, at least once and at most 3 times (see: min, max)
  • The 9 ‘100a’ subfields contain only 8 unique values (see: uniq)
  • The last column, ‘entropy’, provides a number indicating how interesting the field is for search engines: the higher the entropy, the more unique content can be found
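
You can double-check several of these numbers directly against the breaker file. A sketch, assuming the tab-delimited result.breaker file created above (the exact formatting of the wc output may differ on your system):

# number of '650a' values (the 'count' column)
$ awk -F'\t' '$2 == "650a"' result.breaker | wc -l
15

# number of unique '650a' values (the 'uniq' column)
$ awk -F'\t' '$2 == "650a"' result.breaker | cut -f 3 | sort -u | wc -l
6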

I hope these tools are of some use in your projects!

8 comments

  1. Péter Király

    Congrats, it is a nice feature! A question: how do you calculate uniqueness and entropy? Do you call an external service? I am working on a similar project (measuring metadata quality) where I use Solr’s term vectors to read TF-IDF values, but for that I have to set special configurations in Solr. I’m sure Elasticsearch is also able to return the same information, but it might also need special configuration.


    • hochstenbach

      The entropy H of a field is defined as:

      H = - SUM{x} p(x) * log( p(x) )

      where ‘x’ ranges over all unique values found in the field and ‘p(x)’ is:

      p(x) = number.of.times.value.x.occurs / total.number.of.values

      For instance, if records have a field ‘year’ you can make a list of all the years you find in all the records:

      year = { 1989, 1970, 1989, 1990, 1989, 2016, 2000 }

      There are 7 items in this list. The p(x) values for the unique years are:

      p(1970) = 1 / 7
      p(1989) = 3 / 7
      p(1990) = 1 / 7
      p(2000) = 1 / 7
      p(2016) = 1 / 7

      The entropy H is then:

      H = -1 * { p(1970)*log( p(1970) ) + p(1989)*log( p(1989) ) + p(1990)*log( p(1990) ) + p(2000)*log( p(2000) ) + p(2016)*log( p(2016) ) }
        = -1 * { 1/7 * log(1/7) + 3/7 * log(3/7) + 1/7 * log(1/7) + 1/7 * log(1/7) + 1/7 * log(1/7) }
        = -1 * { -0.401 - 0.524 - 0.401 - 0.401 - 0.401 }
        = 2.128

      The minimum possible entropy for a field is reached when all values are the same. E.g. if all 7 years were 1989, then p(1989) = 1 and H = -1 * { 1 * log(1) } = 0.

      The maximum possible entropy for a field is reached when all values are different. E.g. if you have 7 different years, each with p(x) = 1/7, this gives H = -1 * { 7 * 1/7 * log(1/7) } = -1 * log(1/7) = 2.807.

      All occurrences of ‘log’ here are log_2.
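
      If you want to reproduce this number yourself, here is a quick sketch as a Perl one-liner (just an illustration, not part of Catmandu::Breaker):

      $ perl -E '
          my @years = (1989, 1970, 1989, 1990, 1989, 2016, 2000);
          my %count;
          $count{$_}++ for @years;          # frequency of each value
          my $H = 0;
          for my $n (values %count) {
              my $p = $n / @years;          # p(x) = frequency / total
              $H -= $p * log($p) / log(2);  # log base 2
          }
          say $H;                           # prints ~2.128
      '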


  2. Péter Király

    Thanks! It is a similar method to what Ochoa and Duval described in their article “Automatic evaluation of metadata quality in digital repositories” (http://link.springer.com/article/10.1007%2Fs00799-009-0054-4). My problem with this approach is that it doesn’t seem scalable, so it might not work if you have a huge number of records (more gigabytes of data than you have memory) - at least that’s why I chose to use Lucene’s help. Do you have experience with larger data sets?
    What does “3.3/3.3” mean in the entropy column?
    One more note: the URL of Mark Phillips’ article is http://journal.code4lib.org/articles/7818 (right now it points somewhere else).


    • hochstenbach

      You mean the software itself is not scalable, but the definition of entropy is? Yes, indeed, the current version of Catmandu::Breaker (which uses Catmandu::Stat) requires that all the data fits into memory. But I don’t see many issues with using a database backend for temporary files (e.g. https://metacpan.org/pod/MLDBM or our own Catmandu::Store backends). I’ll check what is possible.


    • hochstenbach

      A new version of Catmandu::Breaker and Catmandu::Stat is available which scales with the size of the dataset. This version doesn’t require keeping all the data in memory and has a steady memory footprint of a few tens of megabytes on an input sample of MARC data tens of gigabytes in size. Use it as:

      # Create a data.breaker file and store all unique fields in data.fields
      $ catmandu convert MARC to Breaker --handler marc --fields data.fields data.breaker
      $ catmandu breaker -v --fields data.fields data.breaker

      Internally it uses the HyperLogLog algorithm (see: https://en.wikipedia.org/wiki/HyperLogLog) to create a pretty accurate estimate (for large datasets) of the number of unique values in a field. This is combined with some other algorithms to calculate an estimate of the entropy which, even for small datasets, is within 1-2% of the real value.

      As for your other question: “3.3/3.3” in the entropy column means that the calculated entropy for the field is 3.3 (the first value) and the maximum entropy for this field (if all the values were different) is 3.3. For this field the maximum entropy is reached: there are no duplicates.


  3. Emmanuel Di Pretoro

    Hello, I’m testing the code from your article, but unfortunately the following command is not working: catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

    I got the following error message: Odd number of elements in hash assignment at /Users/manu/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/HTTP/OAI/UserAgent.pm line 32.
    Use of uninitialized value $args[0] in list assignment at /Users/manu/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/HTTP/OAI/UserAgent.pm line 32.
    Invalid Request (use ‘force’ to force a non-conformant request):
    No verb supplied

    Can you tell me what went wrong?

    Thanks in advance!


    • hochstenbach

      Hi, can you tell me what versions of Catmandu, Catmandu::OAI and Catmandu::Breaker you are using?
      You can find this with the ‘catmandu info’ command. With the combination of Catmandu 1.0303, Catmandu::OAI 0.11 and Catmandu::Breaker 0.10 I don’t see any issue with the command in your reply.


      • Emmanuel Di Pretoro

        Hi, catmandu info didn’t give me the info. Here is the kind of info I received:
        Catmandu::Breaker | undef | Package that exports data in a Breaker format
        Catmandu::Cmd | undef | A base class for extending the Catmandu command line
        Catmandu::OAI | undef | Catmandu modules for working with OAI repositories

        But I ran a cpan-outdated | cpanm -n command and Catmandu::OAI was upgraded. My version of Catmandu was already the latest one (1.0303). And yes, it is working now!

        Thanks for your quick answer!

