Yesterday we gave you a short introduction to UNIX programming. Today we will show you how many words there are in Tolstoy’s War and Peace. First, as always, you need to start up your Virtual Catmandu (hint: see our day 1 tutorial) and open the UNIX prompt (hint: see our day 2 tutorial). You should now see the UNIX prompt on your screen.
First we are going to teach you a new UNIX command, cat. In our Catmandu project the cat command is our favorite. With this command you can read War and Peace in 2 seconds! Let’s try it out: type ‘cat Documents/war_and_peace.txt’ on the UNIX prompt and press ‘enter’.
$ cat Documents/war_and_peace.txt
“What was that?!” you might wonder. Well, that was the complete War and Peace running across your screen. We provided the cat command with one argument, ‘Documents/war_and_peace.txt’, which is the name of the file that contains the complete text of War and Peace (I downloaded this text from Project Gutenberg for you).
In UNIX it is possible to glue the output of one command to the input of another command. This is called a pipe. To use a pipe in a command you have to find a funny little key on your computer which has this sign: ‘|’.
With this pipe symbol ‘|’ you can glue the output of one command to another command. We will use a pipe to count the number of lines, words and characters in a file with the UNIX wc command like this:
$ cat Documents/war_and_peace.txt | wc
64620 563290 3272023
The output contains three numbers: 64620, 563290 and 3272023. The first is the number of lines in the file Documents/war_and_peace.txt, the second is the number of words, and the third is the number of characters.
Five hundred sixty-three thousand two hundred and ninety words counted with one simple command! This is the power of command line processing.
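If you only need one of the three numbers, you can ask wc for it directly. The sketch below is only an illustration: it builds a tiny throwaway file with mktemp (a stand-in for Documents/war_and_peace.txt, so the sample text is an assumption) and uses the wc options -l, -w and -c to get the lines, words and bytes one at a time. For plain English text without special characters, bytes and characters are the same thing.

```shell
# A tiny sample file that stands in for Documents/war_and_peace.txt.
sample=$(mktemp)
printf 'Well, Prince, so Genoa and Lucca\nare now just family estates.\n' > "$sample"

# Ask wc for one number at a time.
lines=$(cat "$sample" | wc -l)   # number of lines
words=$(cat "$sample" | wc -w)   # number of words
bytes=$(cat "$sample" | wc -c)   # number of bytes

echo "lines=$lines words=$words bytes=$bytes"

# Clean up the throwaway file.
rm -f "$sample"
```

With a file this small you can count by hand and check that wc agrees with you.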
The file Documents/war_and_peace.txt contains the English translation of War and Peace. We can count the number of times the word ‘war’ is mentioned in this novel. You need to use a new UNIX command, grep, to do this trick. Type the following command at the UNIX prompt and I will explain in a moment what happens.
$ cat Documents/war_and_peace.txt | grep -ow war | wc
274 274 1096
We count 274 occurrences of the word ‘war’ in War and Peace. With the cat command we read the document Documents/war_and_peace.txt. With the pipe ‘|’ symbol we send all the text of this document to the grep command, where we use the -ow option to search for the word ‘war’. With the pipe ‘|’ symbol we send all the ‘war’ words to the wc command which will count the number of ‘war’-s.
You can experiment with these commands to better understand what happens. If you type:
$ cat Documents/war_and_peace.txt
, then you will see the complete War and Peace in the output. When you type:
$ cat Documents/war_and_peace.txt | grep -ow war
, then you will see many lines containing ‘war’, ‘war’, ‘war’ (one line for every occurrence of the word ‘war’ in War and Peace). When you type:
$ cat Documents/war_and_peace.txt | grep -ow war | wc
, then you will count all these ‘war’ lines. Pretty neat, eh?
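To see what the two letters in -ow each do, here is a small sketch on a made-up sentence (the sample text is an assumption, not taken from the novel). The -o option makes grep print only the matching part, one match per line. The -w option makes grep match whole words only, so the ‘war’ hidden inside ‘warfare’ or ‘award’ is skipped.

```shell
# Made-up sample text with "war" as a word and inside other words.
text='war and warfare, award or war'

# -o alone: every occurrence of the letters "war", even inside other words.
any=$(echo "$text" | grep -o war | wc -l)

# -ow: only "war" as a whole word.
whole=$(echo "$text" | grep -ow war | wc -l)

echo "any=$any whole=$whole"
```

The first count comes out higher than the second, because it also picks up the matches inside ‘warfare’ and ‘award’.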
What about ‘war’ at the beginning of a sentence, like ‘War’, or someone shouting ‘WAR!’? To get a correct count we need to be case-insensitive. You can do this by adding the -i option to the grep command like this:
$ cat Documents/war_and_peace.txt | grep -i -ow war | wc
297 297 1188
We get 23 more occurrences.
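You can convince yourself of what -i does with a short made-up sentence (again an assumption, not a quote from the book). The sketch counts the whole word ‘war’ once with exact case and once ignoring case:

```shell
# Made-up sample text with different spellings of "war".
text='War! The war, the WAR and warfare.'

# Exact case: only the lowercase whole word "war" matches.
exact=$(echo "$text" | grep -ow war | wc -l)

# -i ignores case: "War", "war" and "WAR" all match, "warfare" still does not.
anycase=$(echo "$text" | grep -i -ow war | wc -l)

echo "exact=$exact anycase=$anycase"
```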
Now we can try to find out if War and Peace is more about ‘war’ than ‘peace’ by counting the number of times ‘peace’ is mentioned:
$ cat Documents/war_and_peace.txt | grep -i -ow peace | wc
110 110 660
This shows that ‘peace’ is mentioned only 110 times, against 297 times for ‘war’!
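If you want to compare several words without retyping the whole pipeline, you could wrap it in a small shell function. This goes a little beyond what we have covered so far, so treat it as a sketch: count_word is a hypothetical helper name of our own (not a standard UNIX command), and the sample file is made up so you can check the numbers by hand.

```shell
# count_word is a hypothetical helper, not a standard command.
# Usage: count_word WORD FILE
count_word() {
    cat "$2" | grep -i -ow "$1" | wc -l
}

# A tiny sample file that stands in for Documents/war_and_peace.txt.
sample=$(mktemp)
printf 'War and peace.\nPeace after war, then war again.\n' > "$sample"

# Print one count per word.
for word in war peace; do
    echo "$word: $(count_word "$word" "$sample")"
done

# Keep the counts in variables in case you want to compare them.
war_count=$(count_word war "$sample")
peace_count=$(count_word peace "$sample")

rm -f "$sample"
```

On the real file you would call it as ‘count_word war Documents/war_and_peace.txt’.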
To finish this tutorial I will teach you one more UNIX command: less. When experimenting with commands like cat and grep you might want to inspect intermediary results. With the less command you can page through output. Let’s try this out. Type:
$ cat Documents/war_and_peace.txt | less
Your screen will now show the first page of War and Peace. When you press the spacebar the next page will be displayed. Pressing the spacebar again shows the page after that, and so on. This way you can slowly page through long result lists. When you press the ‘b’ key you will go one page back. To exit the less command press the ‘q’ key.
Some more examples of what we have learned.
Show all the lines which contain the word Bolkonski:
$ cat Documents/war_and_peace.txt | grep Bolkonski
Or, use ‘less’ to page through the results:
$ cat Documents/war_and_peace.txt | grep Bolkonski | less
Count all the lines which contain the word Bolkonski:
$ cat Documents/war_and_peace.txt | grep Bolkonski | wc
178 1848 11705
The answer is 178 lines.
Count the number of times Napoleon is mentioned:
$ cat Documents/war_and_peace.txt | grep -ow Napoleon | wc
580 580 5220
The answer is 580 times.
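Notice that the last two examples count different things: without -o, grep prints every matching line once, so wc counts lines; with -ow, grep prints every match on its own line, so wc counts occurrences. A small sketch on a made-up file (the sample text is an assumption) shows the difference:

```shell
# A tiny sample file; the first line mentions Napoleon twice.
sample=$(mktemp)
printf 'Napoleon met Napoleon.\nNo emperor here.\nNapoleon again.\n' > "$sample"

# Without -o: each matching line is printed once, so this counts lines.
matching_lines=$(cat "$sample" | grep Napoleon | wc -l)

# With -ow: each match is printed on its own line, so this counts occurrences.
occurrences=$(cat "$sample" | grep -ow Napoleon | wc -l)

echo "lines=$matching_lines occurrences=$occurrences"

rm -f "$sample"
```

A line that mentions the name twice counts once in the first number and twice in the second.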
Continue to Day 5: Editing text with nano >>
Today we are going to teach you some UNIX commands. To get started you need to start the Virtual Catmandu application (hint: see our day 1 tutorial). When you see the Catmandu desktop you need to start the “UNIX prompt” terminal window (hint: see our day 2 tutorial). You should now see the UNIX prompt on your screen.
This screen is called the “UNIX prompt” or the “command line” and we will use it to execute UNIX commands. All Catmandu commands were written for this “command line” because it provides the powers of UNIX coupled with the Perl programming language.
Yesterday we saw the date command. Try it again: type ‘date’ in this window and press ‘enter’:
$ date
Wed Dec 3 07:34:45 UTC 2014
You will see the current date appearing in this window. With the w command you can see who is logged into the system. Type ‘w’ and press ‘enter’:
$ w
 10:37:11 up 1:04, 2 users, load average: 0.08, 0.02, 0.01
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
catmandu tty1 :0 09:33 1:04m 10.13s 0.37s pam: gdm-autolo
catmandu pts/0 :0.0 10:25 0.00s 0.12s 0.11s w
You see from this output that there is a user named ‘catmandu’ working in your Virtual Catmandu. That is you!
UNIX provides thousands and thousands of commands you can execute this way. For every imaginable task there exists a command. In the coming days we will use the UNIX prompt to execute Catmandu commands to process library data. Before we can do this, we need to give you a little more context about how these UNIX commands work.
First, the commands date and w above are simple commands that don’t require any other input than typing their name. You execute these commands by hitting the return key at the end of the name. Some commands need one or more arguments. For instance the echo command is used to print text on the screen. When you type ‘echo’ and hit return, it prints a blank line
(by the way, the $-sign you see above isn’t part of the command but an indication where the UNIX prompt is so that I don’t have to upload a screenshot for every example).
When you type “echo 123” the command will print ‘123’ as output.
$ echo 123
123
When you type “echo 123 abc def” the command will print “123 abc def” as output.
$ echo 123 abc def
123 abc def
These “123”, “abc” and “def” are called the arguments of the command. In the example above we ran the echo command with 3 arguments. All the arguments in our examples are words. You can also use a sentence as an argument, but then you need to put the sentence in double quotes:
$ echo "Hello, my name is Catmandu" 123 abc def
Hello, my name is Catmandu 123 abc def
In the example above, the echo command had 4 arguments: 1 sentence and 3 words.
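If you want to check for yourself how many arguments a command receives, here is a small sketch. It uses a shell function, which we have not covered, so take it as an illustration only: count_args is a hypothetical helper name of our own, and the special variable $# holds the number of arguments that were passed to it.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() {
    echo "$#"
}

# Without quotes every word is a separate argument.
unquoted=$(count_args Hello, my name is Catmandu 123 abc def)

# With double quotes the whole sentence is one argument.
quoted=$(count_args "Hello, my name is Catmandu" 123 abc def)

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
```

The double quotes are what turn the five words of the sentence into a single argument.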
Some commands also accept options. With options you can change the behaviour of commands. Some commands have no options, some have many. With Catmandu you will encounter commands that love to provide you with a lot of options. For now let’s keep things simple. The UNIX cal command can be used to output a calendar for the current month.
$ cal
   December 2014
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
When you provide a month and a year as arguments, you select the month whose calendar you would like to see. E.g. for May (month 5) of 2014:
$ cal 5 2014
      May 2014
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
With the -y option the cal command shows the calendar for a whole year. E.g. to see the calendar for the year 1755, type:
$ cal -y 1755
To find out more about a specific command, call the man command. The documentation for that command will show up. You can browse through it with the arrow keys and quit by pressing the ‘q’ key:
$ man cal
That was enough for today. Tomorrow we will show you how to work with files and do a bit of text analysis.
Continue to Day 4: grep, less and wc >>