Metadata Analysis at the Command-Line
Last week I was at the ELAG 2016 conference in Copenhagen, where I attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora 4. One of the important steps in migrating and modelling data as RDF is understanding what your data is about. Often legacy systems need to be converted for which little or no documentation is available. Instead of manually inspecting large XML or MARC dumps, tools called metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring Code4Lib article on how this can be done in Python. In this blog post I’ll demonstrate how this can be done using a new Catmandu tool: Catmandu::Breaker.
To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:
$ sudo cpan Catmandu::Breaker
A breaker is a command that transforms data into a line format that can be easily processed with Unix command-line tools such as grep, sort, uniq, cut and many more. If you need an introduction to Unix tools for data processing, please follow the examples Johan Rolschewski of Berlin State Library and I presented in an ELAG bootcamp.
As a simple example, let’s create a YAML file and demonstrate how it can be analysed using Catmandu::Breaker:
$ cat test.yaml
---
name: John
colors:
  - black
  - yellow
  - red
institution:
  name: Acme
  years:
    - 1949
    - 1950
    - 1951
    - 1952
This example has a combination of simple name/value pairs, a list of colors, and a deeply nested field. To transform this data into the breaker format, execute the command:
$ catmandu convert YAML to Breaker < test.yaml
1   colors[]             black
1   colors[]             yellow
1   colors[]             red
1   institution.name     Acme
1   institution.years[]  1949
1   institution.years[]  1950
1   institution.years[]  1951
1   institution.years[]  1952
1   name                 John
The breaker format is a tab-delimited output with three columns:
- A record identifier: read from the _id field in the input data, or a counter when no such field is present.
- A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([]).
- A field value.
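Because the format is plain tab-separated lines, it is also easy to process programmatically. As a quick sketch (a hypothetical helper, not part of Catmandu), here is how you could count field occurrences in breaker output with Python:

```python
from collections import Counter

def field_counts(breaker_lines):
    """Count how often each field name occurs in breaker (tab-separated) output."""
    counts = Counter()
    for line in breaker_lines:
        # Each line is: record identifier, field name, field value
        record_id, field, value = line.rstrip("\n").split("\t", 2)
        counts[field] += 1
    return counts

# Sample lines matching the YAML example above
lines = [
    "1\tcolors[]\tblack",
    "1\tcolors[]\tyellow",
    "1\tcolors[]\tred",
    "1\tname\tJohn",
]
print(field_counts(lines)["colors[]"])  # 3
```

The same idea works on millions of lines, since the input is processed one line at a time.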
When you have a very large JSON or YAML file and need to find all the values of a deeply nested field, you could do something like:
$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"
Using Catmandu you can run this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel): just replace YAML in the command above with any of these formats. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, Elasticsearch and Solr, and even relational databases such as MySQL, Postgres and Oracle. For instance, to get breaker output for an OAI-PMH repository, issue a command like:
$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker
If your data is in a database you could issue an SQL query like:
$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker
Some formats, such as MARC, don’t translate into a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists, so running a breaker on MARC input gives you this:
$ catmandu convert MARC to Breaker < t/camel.usmarc | head
fol05731351   record[][]   LDR
fol05731351   record[][]   _
fol05731351   record[][]   00755cam 22002414a 4500
fol05731351   record[][]   001
fol05731351   record[][]   _
fol05731351   record[][]   fol05731351
fol05731351   record[][]   082
fol05731351   record[][]   0
fol05731351   record[][]   0
fol05731351   record[][]   a
The MARC fields are part of the data, not part of the field names. This can be fixed by adding the special ‘marc’ handler to the breaker command:
$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc | head
fol05731351   LDR    00755cam 22002414a 4500
fol05731351   001    fol05731351
fol05731351   003    IMchF
fol05731351   005    20000613133448.0
fol05731351   008    000107s2000 nyua 001 0 eng
fol05731351   010a   00020737
fol05731351   020a   0471383147 (paper/cd-rom : alk. paper)
fol05731351   040a   DLC
fol05731351   040c   DLC
fol05731351   040d   DLC
Now all the MARC subfields are visible in the output.
You can use this format to find, for instance, all unique values in a MARC file. Let’s try to find all unique 008 values:
$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500
Catmandu::Breaker doesn’t only break input data into an easy format for command-line processing, it can also run a statistical analysis on the breaker output. First process some data into the breaker format and save the result in a file:
$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker
Now, use this file as input for the ‘catmandu breaker’ command:
$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:
- The ‘001’ field is present in 10 records (see: count).
- One record doesn’t contain a ‘020a’ subfield (see: zeros).
- The ‘650a’ subfield is present in every record, at least once and at most 3 times (see: min, max).
- Only 8 out of 10 ‘100a’ subfields have unique values (see: uniq).
- The last column, ‘entropy’, gives a measure of how interesting the field is for search engines: the higher the entropy, the more unique content can be found.
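The per-field statistics in that table (count, zeros, min, max, mean) can be reproduced from a breaker file with a few lines of Python. This is a sketch of the idea, not Catmandu’s actual implementation, and the sample lines are made up for illustration:

```python
from collections import defaultdict

def field_stats(breaker_lines):
    """For each field: occurrences per record, then summary statistics."""
    per_record = defaultdict(lambda: defaultdict(int))
    records = set()
    for line in breaker_lines:
        rec, field, _value = line.rstrip("\n").split("\t", 2)
        records.add(rec)
        per_record[field][rec] += 1
    n = len(records)
    stats = {}
    for field, occ in per_record.items():
        # How many times this field occurs in each record (0 if absent)
        counts = [occ.get(r, 0) for r in records]
        stats[field] = {
            "count": sum(counts),        # total occurrences
            "zeros": counts.count(0),    # records missing the field
            "min": min(counts),
            "max": max(counts),
            "mean": sum(counts) / n,
        }
    return stats

# Hypothetical breaker lines for two records
lines = [
    "rec1\t650a\tPerl", "rec1\t650a\tUnix", "rec2\t650a\tXML",
    "rec1\t020a\t0471383147",
]
s = field_stats(lines)
print(s["650a"])           # 650a occurs 3 times over 2 records
print(s["020a"]["zeros"])  # 1
```

The real ‘catmandu breaker’ command additionally reports median, mode, variance, standard deviation, the number of unique values and the entropy.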
I hope these tools are of some use in your projects!
Congrats, it is a nice feature! A question: how do you calculate uniqueness and entropy? Do you call an external service? I am working on a similar project (measuring metadata quality) where I use Solr’s term vectors to read TF-IDF values, but for that I have to set special configurations in Solr. I’m sure Elasticsearch is also able to return the same information, but it might also need special configuration.
The entropy H of a field is defined as:
H = - SUM{x} p(x) log p(x)
where ‘x’ ranges over the set of all possible values in the field and ‘p(x)’ is the relative frequency of value x:
p(x) = (number of occurrences of value x) / (total number of values)
For instance, if records have a field ‘year’, you can make a list of all years you find in all records:
year = { 1989, 1970 , 1989 , 1990 , 1989 , 2016 , 2000 }
There are 7 items in this list. The p(x) for all unique values are:
p(1970) = 1 / 7
p(1989) = 3 / 7
p(1990) = 1 / 7
p(2000) = 1 / 7
p(2016) = 1 / 7
The entropy H is then:
H = -1 * { p(1970)*log( p(1970) ) + p(1989)*log( p(1989) ) + p(1990)*log( p(1990) ) + p(2000)*log( p(2000) ) + p(2016)*log( p(2016) ) }
= -1 * { 1/7 * log(1/7) + 3/7 * log(3/7) + 1/7 * log(1/7) + 1/7 * log (1/7) + 1/7 * log(1/7) }
= -1 * { -0.401 - 0.524 - 0.401 - 0.401 - 0.401 }
= 2.128
The minimum possible entropy for a field occurs when all values are the same. E.g. if all 7 years were 1989, then p(1989) = 1 and H = -1 * { 1 * log(1) } = 0.
The maximum possible entropy occurs when all values are different. E.g. with 7 different years, each has p(x) = 1/7, which gives H = -1 * { 7 * 1/7 * log(1/7) } = -1 * log(1/7) = 2.807.
All ‘log’ here are log_2.
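The worked example above translates directly into code. Here is a short Python sketch using base-2 logarithms, as in the formulas (the rewriting p(x) * log2(1/p(x)) keeps all terms positive):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H = sum p(x) * log2(1/p(x)) over the value frequencies."""
    total = len(values)
    freq = Counter(values)
    return sum((c / total) * math.log2(total / c) for c in freq.values())

years = [1989, 1970, 1989, 1990, 1989, 2016, 2000]
print(round(entropy(years), 3))           # 2.128
print(entropy([1989] * 7))                # 0.0  (minimum: all values equal)
print(round(entropy(list(range(7))), 3))  # 2.807 (maximum: all values distinct)
```

The three printed values match the worked example, the minimum-entropy case and the maximum-entropy case described above.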
Thanks! It is similar to the method Ochoa and Duval described in their article “Automatic evaluation of metadata quality in digital repositories” (http://link.springer.com/article/10.1007%2Fs00799-009-0054-4). My problem with this approach is that it doesn’t seem scalable, so it might not work if you have a huge number of records (more GB than you have memory); at least that’s why I chose to use Lucene’s help. Do you have experience with larger data sets?
What does “3.3/3.3” mean in the entropy column?
One more note: the URL of Mark Phillips’ article is http://journal.code4lib.org/articles/7818 (the link in the post now points somewhere else).
You mean the software itself is not scalable, but the definition of entropy is? Yes, indeed: the current version of Catmandu::Breaker (which uses Catmandu::Stat) requires that all the data fits into memory. But I don’t see many obstacles to using a database backend for temporary files (e.g. https://metacpan.org/pod/MLDBM or our own Catmandu::Store backends). I’ll check what is possible.
A new version of Catmandu::Breaker and Catmandu::Stat is available that scales with the size of the dataset. This version doesn’t require keeping all the data in memory and has a steady memory footprint of a few tens of megabytes on an input sample of MARC data tens of gigabytes in size. Use it as:
# Create a data.breaker file and store all unique fields in data.fields
$ catmandu convert MARC to Breaker --handler marc --fields data.fields < data.mrc > data.breaker
$ catmandu breaker -v --fields data.fields data.breaker
Internally it uses the HyperLogLog algorithm (see: https://en.wikipedia.org/wiki/HyperLogLog) to create a pretty accurate estimate (for large datasets) of the number of unique values in a field. This is combined with some other algorithms to calculate an estimate of the entropy which, even for small datasets, is within 1-2% of the real value.
As for your other question: 3.3/3.3 in the entropy column means that the calculated entropy for the field is 3.3 (the first value), and the maximum possible entropy for this field (if all the values were different) is 3.3 (the second value). For this field the maximum entropy is reached: there are no duplicates.
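In other words, the second number is log2(n) for n values, so the ratio shows how close a field comes to being fully unique. A small sketch of that interpretation (the sample values are made up to mimic the ‘001’ and ‘003’ rows of the table above):

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Return (H, H_max) as shown in the breaker report's entropy column."""
    total = len(values)
    freq = Counter(values)
    h = sum((c / total) * math.log2(total / c) for c in freq.values())
    h_max = math.log2(total)  # entropy if all values were distinct
    return round(h, 1), round(h_max, 1)

# 10 distinct identifiers: maximum reached, like "3.3/3.3" in the table
print(entropy_ratio(["id%d" % i for i in range(10)]))  # (3.3, 3.3)
# 10 identical values: "0.0/3.3"
print(entropy_ratio(["IMchF"] * 10))                   # (0.0, 3.3)
```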
Hello, I’m testing the code of your article, but unfortunately the following command is not working: catmandu convert OAI --url http://lib.ugent.be/oai to Breaker
I got the following error message: Odd number of elements in hash assignment at /Users/manu/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/HTTP/OAI/UserAgent.pm line 32.
Use of uninitialized value $args[0] in list assignment at /Users/manu/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/HTTP/OAI/UserAgent.pm line 32.
Invalid Request (use 'force' to force a non-conformant request):
No verb supplied
Can you tell me what went wrong?
Thanks in advance!
Hi, can you tell me what versions of Catmandu, Catmandu::OAI and Catmandu::Breaker you are using?
You can find this with the ‘catmandu info’ command. With the combination Catmandu 1.0303, Catmandu::OAI 0.11 and Catmandu::Breaker 0.10 I don’t see any issue with the command in your reply.
Hi, catmandu info didn’t give me the version info. Here is the kind of output I received:
Catmandu::Breaker | undef | Package that exports data in a Breaker format
Catmandu::Cmd | undef | A base class for extending the Catmandu command line
Catmandu::OAI | undef | Catmandu modules for working with OAI repositories
But I ran a cpan-outdated | cpanm -n command and Catmandu::OAI was upgraded. My version of Catmandu was already the latest one (1.0303). And yes, it is working now!
Thanks for your quick answer!