Parallel Processing with Catmandu

In this blog post I’ll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process twice as much data, you need twice as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:

$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added       100 (55/sec)
added       200 (76/sec)
added       300 (87/sec)
added       400 (92/sec)
added       500 (90/sec)
added       600 (94/sec)
added       700 (97/sec)
added       800 (97/sec)
added       900 (96/sec)
added      1000 (97/sec)

In the example above we process an ‘input.marc’ MARC file into an ‘output.json’ JSON file, with some difficult data cleaning in the ‘heavy_load.fix’ Fix script. Using a single process we reach about 97 records per second. At that speed it would take 2.8 hours to process one million records and 28 hours to process ten million records.

Can we make this any faster?

Computers today come equipped with multiple processors. Using a single process, only one of these processors is used for calculations. One would get much more ‘bang for the buck’ if all the processors could be used. One technique to do that is called ‘parallel processing’.

To check the number of processors available on your machine, inspect the file ‘/proc/cpuinfo’ on your Linux system:

$ cat /proc/cpuinfo | grep processor
processor   : 0
processor   : 1

The example above shows two lines: I have two cores available for processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we do our calculations in a smart way, our processing could in principle be 2, 4, 8 or 16 times as fast.
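
You can also let grep do the counting for you, or use the ‘nproc’ command which is available on most Linux systems:

$ grep -c processor /proc/cpuinfo
2
$ nproc
2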

To check if your computer is using all that calculating power, use the ‘uptime’ command:

$ uptime
11:15:21 up 622 days,  1:53,  2 users,  load average: 1.23, 1.70, 1.95

In the example above I ran ‘uptime’ on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case), the server is waiting for input. If the load average equals the number of cores, the server is using all the CPU power available. If the load is bigger than the number of cores, there is more work available than the machine can execute, and some processes have to wait.

Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I’m going to use a Unix tool called ‘GNU parallel’ to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:

$ sudo yum install parallel
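
On Debian or Ubuntu systems the equivalent would be:

$ sudo apt-get install parallel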

The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are many ways to cut your data into parts. I’ll show you a trick we use at Ghent University library with the help of a MongoDB installation.

First, install MongoDB and the MongoDB catmandu plugins (these examples are taken from our CentOS documentation):

$ sudo tee /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF

$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB

Next, we are going to store our input data in a MongoDB database with the help of a Catmandu Fix script that adds some random numbers to the data:

$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc

With ‘random.fix’ looking like:

random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")

The ‘random()’ Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory ‘lib/Catmandu/Fix’). It makes sure that every record in your input file contains five random numbers: ‘part.rand2’, ‘part.rand4’, ‘part.rand8’, ‘part.rand16’ and ‘part.rand32’. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts, depending on the number of processors in your machine.
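
After the import, every record will contain a ‘part’ group looking something like this (the numbers shown are of course just one possible random outcome; ‘part.rand2’ ranges over 0–1, ‘part.rand4’ over 0–3, and so on):

part:
  rand2: 1
  rand4: 2
  rand8: 5
  rand16: 11
  rand32: 23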

To access one chunk of your data the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:

$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2
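
To verify that the chunks are roughly equal in size, you can count them (assuming your Catmandu installation provides the ‘count’ command for MongoDB stores):

$ catmandu count MongoDB --database_name data -q '{"part.rand2":0}'
$ catmandu count MongoDB --database_name data -q '{"part.rand2":1}'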

We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.

#!/bin/bash
# file: parallel.sh
CPU=$1

if [ "${CPU}" == "" ]; then
    # no argument given: let GNU parallel call this script
    # again, once for every chunk number
    /usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
else
    # argument given: export and process one chunk of the data
    catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi

The example script above shows how a conversion process could run on a 2-processor machine. The line with ‘/usr/bin/parallel’ shows how GNU parallel is used to call this script again, once for each of the two arguments ‘0’ and ‘1’ (for the 2-processor example). The line with ‘catmandu export’ shows how one chunk of data is read from the database and processed with the ‘heavy_load.fix’ Fix script.

If you have a 32-processor machine, you would need to feed parallel an input containing the numbers 0, 1, 2 up to 31 and change the query to use ‘part.rand32’, as the sketch below shows.
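
As a sketch, a generalized version of the script could look like this (the CHUNKS value is an assumption you adapt to your machine; it must match one of the ‘part.randN’ fields created above):

#!/bin/bash
# file: parallel_n.sh - a generalization of parallel.sh for N chunks
CHUNKS=32   # must match a part.randN field: 2, 4, 8, 16 or 32

CPU=$1

if [ "${CPU}" == "" ]; then
    # no argument: generate the chunk numbers 0..CHUNKS-1 and let
    # GNU parallel call this script once for each of them
    seq 0 $((CHUNKS - 1)) | /usr/bin/parallel -u $0 {}
else
    # one argument: export and process a single chunk
    catmandu export MongoDB --database_name data \
        -q "{\"part.rand${CHUNKS}\":${CPU}}" \
        to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi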

GNU parallel is a very powerful command. It makes it possible to run many processes in parallel, and even to spread the load over many machines if you have a cluster (see the sketch after the ‘cat’ example below). When all these machines have access to your MongoDB database, they can all receive chunks of data to process. The only task left is to combine all the results, which can be as easy as a simple ‘cat’ command:

$ cat result.*.json > final_result.json
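
Since we exported line-delimited JSON (one record per line), a quick sanity check is to compare the number of lines in the result with the number of records you imported:

$ wc -l final_result.json

And as a sketch of spreading the work over a cluster (the hostnames are hypothetical, and parallel_n.sh, Catmandu and access to the MongoDB database must be available on every machine), GNU parallel’s --sshlogin option together with --return can run the chunks remotely and copy the result files back:

$ seq 0 31 | parallel --sshlogin server1,server2 --return result.{}.json ./parallel_n.sh {}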

Day 7: Catmandu JSON paths

Yesterday we learned the command catmandu and how it can be used to parse structured information. Today we will go deeper into catmandu and describe how to pluck data out of structured information. As always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).

Today we will fetch a new weather report and store it in a new file weather2.json. Let’s try to download the report for Tokyo:

$ curl http://api.openweathermap.org/data/2.5/weather?q=Tokyo,jp > weather2.json

From the previous tutorials we know many commands to examine this data set. For instance, to get a quick overview of the content of weather2.json we can use the cat command:

$ cat weather2.json

Or, we could use the less command:

$ less weather2.json

Remember to type the ‘q’ key to exit less.

We could also use nano to inspect the data, but we skip that for now. Nano is a text editor and is not particularly suited for data.

To count the number of lines, words and characters in weather2.json we can use the wc command:

$ wc weather2.json
1 3 463

This output shows that weather2.json contains 1 line, 3 words and 463 characters. The 1 line is indeed correct: the file contains one big line of JSON. The 463 characters is also correct: when you count every character, including spaces, you get 463. But 3 words is obviously wrong. Generic UNIX programs like wc have trouble counting words in structured information: the command doesn’t know this file is in the JSON format, which contains fields and values. You need specialized tools like catmandu to make sense of this kind of data.

We also saw in the previous post how you can use catmandu to transform the JSON format into the YAML format which is easier to read and contains the same information:

$ catmandu convert JSON to YAML < weather2.json

[Screenshot: weather2.json converted to YAML]

We also learned some fixes to retrieve information out of the JSON file like retain_field(main.temp).

In this post we delve a bit deeper into ways how to point to fields in a JSON file.

This main.temp is called a JSON Path and points to a part of the JSON data you are interested in. The data, as shown above, is structured like a tree. There are top level simple fields like base, cod, dt and id which contain only text values or numbers. There are also fields like coord that contain a deeper structure, like lat and lon.

Using a JSON path you can point to every part of the JSON file using a dot-notation. For simple top level fields the path is just the name of the field:

  • base
  • cod
  • dt
  • id
  • name

For the fields with deeper structure you add a dot ‘.’ to point to the leaves:

  • clouds.all
  • coord.lat
  • coord.lon
  • main.temp
  • etc…

For example, if you had a deeply nested structure like this (shown here in YAML; the leaf value is just an illustration):

x:
  y:
    z:
      a:
        b:
          c: value

Then you would point to the c field with the JSON Path x.y.z.a.b.c.

There is one extra path structure I would like to explain, and that is when a field can have more than one value. This is called an array and looks like this in YAML (the first two colors are just an illustration):

my:
  colors:
  - yellow
  - green
  - red

In the example above you see a field my which contains a deeper field colors which has 3 values. To point to one of the colors you need to use an index. The first index in an array has value 0, the second value 1, the third value 2. So the JSON path of the color red would be:

  • my.colors.2

In almost all programming languages things get counted starting with 0. An old programming joke is:

There are 10 types of people in the world:
Those who understand binary,
Those who don’t,
And those who count from zero.

(hint: this is a double joke: 10 in binary == 2 if you start counting from 0, or 3 when you count from 1).

There is one array type in our JSON report and that is the weather field. To point to the description of the weather you need the JSON Path weather.0.description.
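For example, to pull just that description out of the report, you could combine it with the copy_field fix (one of the many fixes we haven’t covered yet; it copies a value from one path to another):

$ catmandu convert JSON --fix 'copy_field(weather.0.description,description); retain_field(description)' to YAML < weather2.json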

In this post we learned the JSON Path syntax and how it can be used to point to the parts of a JSON data set you want to manipulate. We explained JSON paths using a YAML transformation as example, because YAML is easier to read. YAML and JSON are two formats that contain the same informational content (and thus both work with JSON paths) but look different when written to a file.
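
You can check this for yourself by converting back and forth; nothing is lost in the round trip:

$ catmandu convert JSON to YAML < weather2.json > weather2.yaml
$ catmandu convert YAML to JSON < weather2.yaml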

Continue to Day 8: Processing JSON data from webservices >>

Day 6: Introduction into Catmandu

In the previous days we learned the UNIX commands grep, nano, ls and less. Today we will introduce you to a UNIX command we have created in the LibreCat project called catmandu. The catmandu command is used to process structured information. To demo this command, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and start up the UNIX prompt (hint: see our day 2 tutorial).

In this tutorial we are going to process structured information. We call data structured when it is organised in such a way that it is easily processable by computers. Previously we processed text documents like War and Peace, which are structured only in words and sentences; a computer doesn’t know which words are part of the title or which words contain names. We had to tell the computer that. Today we will download a weather report in a structured format called JSON and inspect it with the command catmandu.

At the UNIX prompt type in this command:

$ curl http://api.openweathermap.org/data/2.5/weather?q=Gent,be

[Update: as of end 2015 the OpenWeatherMap API requires an API key. Use this link to download a copy of the Ghent weather report:

$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json

]

You will see a JSON output like:

{"coord":{"lon":3.72,"lat":51.05},"sys":{"type":3,"id":4839,"message":0.0349,"country":"BE",
"sunrise":1417159365,"sunset":1417189422},"weather":[{"id":500,"main":"Rain","description":"light rain",
"icon":"10d"}],"base":"cmc stations","main":{"temp":281.15,"pressure":1006,"humidity":87,"temp_min":281.15,
"temp_max":281.15},"wind":{"speed":3.6,"deg":100},"rain":{"3h":0.5}
,"clouds":{"all":56},"dt":1417166878,"id":2797656,
"name":"Gent","cod":200}

All these fields tell something about the current weather in Gent, Belgium. You can see that there is light rain and the temperature is 281.15 degrees Kelvin (about 8 degrees Celsius). Write the output of this command to a file weather.json (using the ‘>’ sign we learned in the day 5 tutorial) so that we can use it in the next examples.

$ curl https://gist.githubusercontent.com/phochste/7673781b19690f66cada/raw/67050da98a7e04b3c56bb4a8bc8261839af57e35/weather.json > weather.json

When you type the ls command you should see the new file weather.json appear.

With the catmandu command you can process this file to make it a bit easier to read. For instance type:

$ catmandu convert JSON to YAML < weather.json

YAML is another format for structured information which is a bit easier to read for human eyes. Our weather report should now look like this:

---
base: cmc stations
clouds:
  all: 56
cod: 200
coord:
  lat: 51.05
  lon: 3.72
dt: 1417166878
id: 2797656
main:
  humidity: 87
  pressure: 1006
  temp: 281.15
  temp_max: 281.15
  temp_min: 281.15
name: Gent
rain:
  3h: 0.5
sys:
  country: BE
  id: 4839
  message: 0.0349
  sunrise: 1417159365
  sunset: 1417189422
  type: 3
weather:
- description: light rain
  icon: 10d
  id: 500
  main: Rain
wind:
  deg: 100
  speed: 3.6
...

Catmandu can be used to process structured information like the UNIX grep command can process unstructured information. For instance, let’s try to filter out the name of this report. Type in this command:

$ catmandu convert JSON --fix 'retain_field(name)' to YAML < weather.json

You should end up with something like:

---
name: Gent
...

The --fix option in Catmandu is used to ‘massage’ the input weather.json, filtering it down to the fields we would like to see. Only one fix was used, ‘retain_field’, which throws away all the data from the input except the ‘name’ field. By the way, the file weather.json wasn’t changed! We only read the file and displayed the output of the catmandu command.

The temperature in Gent is in the ‘temp’ part of the ‘main’ section of weather.json. To filter this out we need two retain_field fixes: one for the main section and one for the temp field inside it:

$ catmandu convert JSON --fix 'retain_field(main); retain_field(main.temp)' to YAML < weather.json

You should now see something like this:

---
main:
  temp: 281.15
...

When massaging data you often need many fixes to process a data file into the format you need. With the nano command you can write all the fixes in a file. Start the nano editor with the command:

$ nano weather.fix

In nano, type the two fixes above:

retain_field(main)
retain_field(main.temp)

To exit nano type Ctrl-X, press Y to confirm the changes and press Enter to confirm the file name.

With this file it will be a bit easier to create many fixes. The name of the fix file can be used to repeat the commands above:

$ catmandu convert JSON --fix weather.fix to YAML < weather.json

To add more fixes we can again edit the weather.fix file. Type:

$ nano weather.fix

And add these lines after the two previous lines:

prepend(main.temp,"The temperature is ")
append(main.temp," degrees Kelvin")

Save the changes with Ctrl-X, Y, Enter and execute catmandu again:

$ catmandu convert JSON --fix weather.fix to YAML < weather.json

You should now see as output:

---
main:
  temp: The temperature is 281.15 degrees Kelvin
...

Catmandu contains many fixes to manipulate data. Check the documentation for a complete list. This post presented only a short introduction to catmandu. In the next posts we will go deeper into its capabilities.
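
To give you a taste, here is a small sampler of other fixes you will encounter (check the documentation for their exact usage):

add_field(my.field,"some value")   # create a new field
copy_field(name,city)              # copy a field to a new name
remove_field(sys)                  # delete a field and its contents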

Continue to Day 7: Catmandu JSON paths >>

Day 5: Editing text with nano

Yesterday we looked at the commands grep, wc and less. Today we will show you how to store and edit files in UNIX. First, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial). Start up the UNIX prompt (hint: see our day 2 tutorial) and type in the command ‘nano’:

$ nano

You will be presented with the GNU nano text editor.

[Screenshot: the GNU nano text editor]

In this text editor you can type text or programs and save them on disk for later use. In this short tutorial I will guide you through some basic commands we will need in later tutorials. For instance, type a short text in this screen:

“Hello world. My name is …”

When you want to save this text into a file, type Ctrl-o (that is, pressing the Ctrl key and the ‘o’ key on your keyboard). At the bottom of the screen nano will ask for a filename.

[Screenshot: nano prompting for a file name]

Type for instance ‘hello.txt’ as filename and press return. The file ‘hello.txt’ is now created on disk. We can test this with the commands we learned in the previous tutorial.

First exit the nano editor by typing Ctrl-x, and type ‘cat hello.txt’:

$ cat hello.txt

You will now see the text created in the nano editor. With the UNIX command ‘ls’ you can view all the filenames in the current directory.

$ ls

If you want to add more text to this file you can start the nano editor again with a file name:

$ nano hello.txt

You will again see the text, which you can edit, save again with Ctrl-o, and exit with Ctrl-x.

Output of UNIX commands can also be written to a file. Let’s try to find all the lines in War and Peace that contain Bolkonski and inspect the results with nano:

$ cat Documents/war_and_peace.txt | grep Bolkonski > bolkonski.txt

Here we use the ‘>’ symbol to redirect the output of the grep command to a file named ‘bolkonski.txt’. Next we can use nano to inspect the contents of this file:

$ nano bolkonski.txt

By the way, you don’t need to type the complete filenames in all the commands we have shown in the examples. When you type ‘bo’ and hit the tab key, UNIX will autocomplete the file name to ‘bolkonski.txt’. I’m lazy and would type ‘cat bol’ and press tab.

Again you can use Ctrl-x to exit nano. You can view all the files with the ls command.

$ ls

bolkonski.txt  Documents  hello.txt  Pictures  Templates  Videos
Desktop        Downloads  Music      Public    test.fix

If you want to delete a file you can use the rm command. We can remove our bolkonski.txt file like this:

$ rm bolkonski.txt

This concludes our short excursion into UNIX. On Monday we will be back with a new chapter: processing JSON with Catmandu. Have a nice weekend!

Continue with Day 6: Introduction to Catmandu >>