Assignment I

Question 1

Suppose that you have a text file in the format used on Microsoft Windows and that you need to convert it to the format used on Macintoshen. Give the tr command for this.

Question 2

Give the tr command to remove everything other than lower case letters. Do this in a single invocation of tr with a total command line length (exclusive of input and output) of no more than 14 characters.

Question 3

The file udlr-catalan.txt contains the Universal Declaration of Linguistic Rights in Catalan, encoded in the Latin-1 character set (ISO-8859-1). This character set is an extension of ASCII in which codes with the high bit set (that is, with octal values greater than or equal to 200, decimal 128) are used for the non-ASCII letters that occur in a variety of Western European languages, including Catalan. This text contains a number of non-ASCII characters. You should be able to see them on most computer systems. If you cannot, you can always use the od command, as we have done in class, to see what is actually in the file or in your results.
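As a reminder of what od shows, here is a minimal sketch. The input is a made-up four-letter string, not the assignment file, and the -c option is only one of several display formats that od offers; the one we used in class may differ. The byte written as \351 is the Latin-1 code for e with an acute accent.

```shell
# Hypothetical sample: "caf" followed by the Latin-1 byte for e-acute
# (octal 351), then a newline. od -c prints printable ASCII bytes as
# characters and shows the high-bit byte as its octal value.
printf 'caf\351\n' | od -c
```

Any byte that od reports with an octal value of 200 or more is a non-ASCII character.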

(a) Give a tr command to remove all non-ASCII characters from this text.

(b) Give a tr command to remove all ASCII characters from this text. There are two ways to do this. Give both.

Question 4

The file udrl-eng-cipher.txt contains a cipher version of the English text of the Universal Declaration of Linguistic Rights created by means of a single, fairly simple, use of tr.

(a) What tr command was used to create the cipher text?

(b) What tr command will convert the cipher text back to the original?

(If your commands (exclusive of the input and output) take more than 34 characters, you are not doing this in the most elegant way.)

Question 5

The file wordcounts contains a list of words and the number of times they occur in a text. It is currently in alphabetical order. Give the command necessary to sort it into decreasing numerical order (so that the most commonly occurring word comes first). The manual page for sort contains the information necessary to do this. (Since the manual page is not crystal clear on this, note that sort breaks input lines into fields using whitespace much as AWK does.)
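To recall what "fields separated by whitespace" means in AWK, here is a small sketch on made-up input (the words and numbers are invented for illustration and are not taken from wordcounts). Runs of spaces or tabs count as a single separator, and leading whitespace is ignored.

```shell
# Two hypothetical lines with irregular spacing and a tab.
# NF is the number of fields AWK found on the line.
printf '  alpha   12\nbeta\t7\n' | awk '{ print NF, $1, $2 }'
# prints:
# 2 alpha 12
# 2 beta 7
```

How sort lets you choose which field to compare, and in what way, is for you to find in the manual page.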

Question 6

A common error in writing is typing the same word twice in a row. The file udhr-english-dups.txt is a modified version of the English text of the Universal Declaration of Human Rights in which there are errors of this type. Write a program that takes as input a text file and generates a list of words occurring twice (or more) in a row.

Question 7

Write a program that generates a list of the non-ASCII characters used in Catalan by extracting them from the Catalan text of the Universal Declaration of Linguistic Rights. In addition to the programs that we have already discussed, you will find it useful to familiarize yourself with the fold program.
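If you have not met fold before, the following sketch shows its basic behaviour on a made-up ASCII string. The width of 3 is an arbitrary choice for illustration; see the manual page for how fold treats other widths and non-ASCII bytes on your system.

```shell
# fold breaks long lines into pieces of at most the given width.
printf 'abcdefgh\n' | fold -w 3
# prints:
# abc
# def
# gh
```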

Question 8

Create a shell script that prints the words in a text and the number of times they occur, using tr, sort, and uniq, as we have discussed in class. The result of this will be a two-column table in which the first column gives the count and the second column gives the word.

Now add to your shell script the commands necessary to generate output in which the first column contains the words, the second column contains the counts, and the third column contains the percentage that it represents of the total. In other words, it should look like this:

the    381    7.576%
of     264    5.250%
to     201    3.997%
and    200    3.977%
in     145    2.883%

Your output should be sorted in decreasing order of frequency as in the example.

You will need to use sort and AWK to do this. There is one necessary aspect of AWK that we have not discussed, namely the use of the printf function. Study the AWK reference card or an AWK manual to learn about this. However, here is a hint. If the percentage that you want to print is in the variable pct, the following printf command will print it, followed by a percent-sign and a line-feed, with 3 decimal places of accuracy:

printf("%.3f%%\n",pct)
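To try the hint on its own, you can run it in a BEGIN block, which needs no input file. This is only a sketch: the value assigned to pct is a made-up number (one third, expressed as a percentage), not one from the assignment data.

```shell
# pct is a hypothetical value chosen for illustration.
# %.3f prints three decimal places; %% prints a literal percent sign.
awk 'BEGIN { pct = 100 * 1 / 3; printf("%.3f%%\n", pct) }'
# prints: 33.333%
```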
