Unix Tools

Introduction

There are a number of Unix utilities that allow one to do such things as break text files into pieces, combine text files together, extract bits of information from them, rearrange them, and transform their content. Taken together, these Unix tools provide a powerful system for obtaining linguistic information. Here is a brief summary of the relevant tools. In each case, the name of the program is a link to the manual page or other more detailed information. These programs are available on virtually all Unix systems and have roughly the same properties. The descriptions here, and the documentation to which links are provided, apply to the GNU versions.

Overview of the Tools

Determining How Much is in a File

It is often useful to know how much is in a file. This can help to determine whether it contains enough material to be worth the bother, whether it is so large as to require special handling, and whether it is in the expected or desired format. (For example, if a file has very few lines in comparison to the number of words or characters, it may come from a system with different end-of-line conventions.)

wc: Prints a count of lines, words, and characters. By default, wc simply counts bytes to produce its character count, but if the -m flag is used it counts characters according to the current locale, so that multibyte UTF-8 characters are counted correctly in a UTF-8 locale.
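As a small illustration of the counts described above, the following sketch runs wc on a throwaway file. The file contents and the variable names are made up for the example; they are not part of any real corpus.

```shell
# Build a small sample file (hypothetical content, for illustration only).
sample=$(mktemp)
printf 'the cat sat\non the mat\n' > "$sample"

# Reading from standard input keeps the file name out of the output.
line_count=$(wc -l < "$sample")
word_count=$(wc -w < "$sample")
byte_count=$(wc -c < "$sample")
char_count=$(wc -m < "$sample")   # same as byte_count for pure ASCII text

echo "lines=$line_count words=$word_count bytes=$byte_count chars=$char_count"
rm -f "$sample"
```

For text containing multibyte characters, the byte count and the character count will normally differ when a UTF-8 locale is in effect.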

Cutting a File Into Pieces

Several utilities allow one to cut a file into pieces.

cut: Extracts the specified fields from each input line. Fields may be specified by numerical position or in terms of character offset. By default, fields are taken to be separated by a TAB character. Another delimiter may be specified instead. The inverse of cut is paste.
head: Copies the first N lines of its input to the standard output. An option makes the unit bytes instead of lines. The opposite of head is tail.
tail: Copies the last N lines of its input to the standard output. An option makes the unit bytes instead of lines. The opposite of tail is head.
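The following sketch shows the three utilities on a tiny tab-separated word list. The lexicon contents and variable names are invented for the example. Note that cut splits on TAB unless told otherwise with -d.

```shell
# Hypothetical tab-separated lexicon: word, part of speech, gloss.
lexicon=$(mktemp)
printf 'chien\tN\tdog\nmanger\tV\teat\n' > "$lexicon"

# Fields 1 and 3 of every line (TAB is the default delimiter).
words_and_glosses=$(cut -f 1,3 "$lexicon")

# Characters 1 through 3 of every line.
prefixes=$(cut -c 1-3 "$lexicon")

# A different delimiter, here a colon.
second=$(printf 'a:b:c\n' | cut -d : -f 2)

# The first line only, and the last line only.
first_entry=$(head -n 1 "$lexicon")
last_entry=$(tail -n 1 "$lexicon")

rm -f "$lexicon"
```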

Note that head and tail used in combination allow one to extract any desired contiguous set of lines. For example, the command

head -n 20 | tail -n 5

extracts lines 16 through 20.
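The same idea can be generalized to any range of lines. In this sketch the variable names first and last are hypothetical, and seq merely stands in for a real input file by printing the numbers 1 to 30, one per line.

```shell
# Hypothetical bounds: extract lines 16 through 20.
first=16
last=20

# Keep everything up to line $last, then keep only the final
# (last - first + 1) lines of that.
extracted=$(seq 1 30 | head -n "$last" | tail -n "$((last - first + 1))")

echo "$extracted"
```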

Extracting Selected Lines from a File

Instead of cutting a file into pieces based purely on the position of the pieces, it is possible to extract material based on its content.

uniq: Given sorted input, writes to the standard output the unique lines, that is, one line in place of what may be multiple identical lines in the input. If desired, uniq will print a count of the repeated lines. Options provide for the printing only of lines that are not repeated or only of lines that are repeated.
grep: Copies to the standard output the lines of input that match a regular expression. An option allows the lines not matching a regular expression to be selected instead. GNU grep understands Unicode.
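A brief sketch of both programs on a made-up word list follows. The data and variable names are assumptions for illustration. Because uniq only collapses adjacent identical lines, the input is sorted first.

```shell
# Hypothetical word list, one word per line.
words=$(mktemp)
printf 'the\ncat\nthe\ndog\nthe\ncat\n' > "$words"

# Frequency list: sort so identical lines are adjacent, count them,
# then order by descending count.
freq=$(sort "$words" | uniq -c | sort -rn)

# Lines that are not repeated, and lines that are repeated.
hapaxes=$(sort "$words" | uniq -u)
repeated=$(sort "$words" | uniq -d)

# Lines matching a regular expression, and lines not matching it.
with_t=$(grep '^t' "$words")
without_t=$(grep -v '^t' "$words")

echo "$freq"
rm -f "$words"
```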

Combining Files

Given two or more files, it is possible to combine them into a single file either "horizontally" or "vertically", or on the basis of the contents of a particular field.

cat: Concatenates the files named on the command line and writes the result on the standard output.
paste: Writes lines consisting of sequentially corresponding lines of each input file on the standard output. By default, the "columns" taken from each input file are separated by a TAB character. Another delimiter may be specified. The inverse of paste is cut.
join: For each line in which the specified join field is the same in the two input files, writes to the standard output the concatenation of the two input lines. Both input files must be sorted on the join field. join provides a simple, text-based, relational database facility.
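The three kinds of combination can be sketched as follows. The file names and contents are invented for the example. The two files given to join are already sorted on their first field, which is the default join field.

```shell
# Hypothetical input files in a scratch directory.
dir=$(mktemp -d)
printf 'chien\nchat\n' > "$dir/french.txt"
printf 'dog\ncat\n'    > "$dir/english.txt"

# "Vertical" combination: one file after the other.
vertical=$(cat "$dir/french.txt" "$dir/english.txt")

# "Horizontal" combination: corresponding lines side by side, TAB-separated.
horizontal=$(paste "$dir/french.txt" "$dir/english.txt")

# Combination on a shared field: both files are sorted on field 1.
printf 'chat N\nchien N\n'     > "$dir/pos.txt"
printf 'chat cat\nchien dog\n' > "$dir/gloss.txt"
joined=$(join "$dir/pos.txt" "$dir/gloss.txt")

rm -rf "$dir"
```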

Rearranging a File

Most frequently we want to rearrange a file on the basis of the content of the pieces, for which we use sort. The standard sort program is very useful, but it is not capable of some of the kinds of sorting that arise in linguistic work. A more powerful sorting program designed specifically for linguistics is msort, which we will look at later when we deal with sorting in more detail. It is occasionally useful, however, to be able to reverse the order of the contents of a file, for which tac is available.

sort: Sorts its input and writes the result on the standard output. GNU sort does not understand Unicode.
tac: Concatenates the files named on the command line and writes the result on the standard output in reverse order, that is, the last record first, the next-to-last record second, and so forth. Records default to lines, but the record separator may be specified by a regular expression.
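Here is a minimal sketch of both programs on made-up data. It assumes the GNU tools, since tac is not available on every Unix system. LC_ALL=C is set for the plain sort only so that the result does not depend on the locale in effect.

```shell
# Hypothetical data: a word and a count on each line.
data=$(mktemp)
printf 'pear 3\napple 10\nfig 2\n' > "$data"

# Sort whole lines in plain byte order.
alphabetical=$(LC_ALL=C sort "$data")

# Sort numerically on the second field only.
by_count=$(sort -k 2,2n "$data")

# Reverse the order of the lines without sorting them.
reversed=$(tac "$data")

rm -f "$data"
```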

Comparing Two Files

cmp: Given two files, identifies the byte and line at which they differ, if they do. cmp is useful for finding out whether binary files are the same and, if they are different, finding out where to look for the difference. comm and diff are generally more useful for comparing human-readable files.
comm: Given two sorted files as input, writes on the standard output three columns: the lines that occur only in the first input file, the lines that occur only in the second input file, and the lines that are common to both inputs. Options allow any chosen combination of the three columns of output to be suppressed.
diff: Generates a description of how one input file differs from the other. Several output formats are available. Generally, they describe the differences in terms of the changes that must be made to derive the second file from the first.
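The three comparison programs can be tried on a pair of small sorted files, as in the sketch below. The file names and contents are hypothetical. The comm options -1, -2, and -3 each suppress the correspondingly numbered column of output.

```shell
# Two hypothetical sorted word lists that differ in one line.
dir=$(mktemp -d)
printf 'apple\nfig\npear\n'  > "$dir/a.txt"
printf 'apple\nkiwi\npear\n' > "$dir/b.txt"

# cmp -s prints nothing; its exit status is 0 only if the files are identical.
if cmp -s "$dir/a.txt" "$dir/b.txt"; then same=yes; else same=no; fi

# Suppress two of comm's three columns to keep just one.
common=$(comm -12 "$dir/a.txt" "$dir/b.txt")
only_a=$(comm -23 "$dir/a.txt" "$dir/b.txt")
only_b=$(comm -13 "$dir/a.txt" "$dir/b.txt")

# diff exits with status 1 when the files differ, hence the "|| true".
changes=$(diff "$dir/a.txt" "$dir/b.txt" || true)

echo "$changes"
rm -rf "$dir"
```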

Transforming a File

There are a variety of ways of transforming a file in a systematic way. These range from the specialized transformations provided by fold to the very general transformations provided by sed and awk.

fold

Breaks long input lines. The primary use is formatting, but fold is sometimes useful in linguistic text processing. For example, if you need to get each character onto a line by itself, the command

fold -w 1

which sets the line length to one character, will do the job. GNU fold understands Unicode.
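One use of putting each character on its own line is counting how often each character occurs. The sketch below does this for a single made-up word, using only ASCII input so that the result does not depend on how a particular version of fold treats multibyte characters.

```shell
# Each character of the input on its own line.
letters=$(printf 'cat\n' | fold -w 1)

# Character frequencies: split, sort, count, and order by descending count.
counts=$(printf 'banana\n' | fold -w 1 | sort | uniq -c | sort -rn)

echo "$letters"
echo "$counts"
```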

tr

Translates one set of characters into another. Can also delete specified characters and reduce each run of a repeated character to a single occurrence.
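The sketch below shows the three operations just mentioned, plus a common way of breaking text into one word per line. The sample sentence and the variable names are made up for the example, and the letter ranges assume plain ASCII text.

```shell
# Hypothetical sample text.
text='The cat   saw the Cat.'

# Translate upper case to lower case.
lower=$(printf '%s\n' "$text" | tr 'A-Z' 'a-z')

# Squeeze each run of spaces to a single space.
squeezed=$(printf '%s\n' "$text" | tr -s ' ')

# Delete the full stops.
no_punct=$(printf '%s\n' "$text" | tr -d '.')

# One word per line: every run of non-letters becomes a single newline.
tokens=$(printf '%s\n' "$text" | tr -cs 'A-Za-z' '\n')

echo "$tokens"
```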

sed

A powerful stream editor. Sed copies its input to its output, editing it in passing. Each portion of an input line matching a regular expression can be replaced with a fixed string or another portion of the input. Lines matching a regular expression can be deleted. GNU sed understands Unicode.
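The following sketch illustrates the three kinds of edit described above: replacement with a fixed string, replacement using another portion of the input, and deletion of matching lines. The input text is invented for the example.

```shell
# Hypothetical three-line input.
input=$(printf 'colour one\n# a comment\ncolour two\n')

# Replace the first match on each line with a fixed string.
replaced=$(printf '%s\n' "$input" | sed 's/colour/color/')

# Replace using portions of the input: \1 and \2 refer back to the
# two parenthesized groups, so the two words swap places.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# Delete every line that begins with "#".
no_comments=$(printf '%s\n' "$input" | sed '/^#/d')

echo "$no_comments"
```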

awk

Copies its input to standard output, performing specified actions whenever the input matches a specified pattern. awk automatically parses its input into records and the records into fields. By default, a record is a line, with fields separated by whitespace. However, both the field and record separators may be changed. awk is actually a full-fledged programming language, meaning both that there is a good deal to learn in order to use all of its capabilities and that it can be used for many purposes. With only a small amount of effort, however, it can be used to extract particular records and fields and to rearrange fields. For example, the command

awk '{print $3,$1}'

will extract the first and third fields from every input line and print them in reverse order, that is, the third field followed by the first field. GNU awk understands Unicode.
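A few further one-line awk programs, sketched on made-up records, show field rearrangement, selection of records by the content of a field, and a changed field separator. The data and variable names are assumptions for illustration.

```shell
# Hypothetical records: word, part of speech, gloss.
records=$(printf 'chien N dog\nmanger V eat\n')

# Third field followed by first field, for every record.
swapped=$(printf '%s\n' "$records" | awk '{print $3, $1}')

# First field of only those records whose second field is "V".
verbs=$(printf '%s\n' "$records" | awk '$2 == "V" {print $1}')

# A different field separator, here a colon.
second=$(printf 'a:b:c\n' | awk -F : '{print $2}')

echo "$swapped"
```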

Revised 2004/02/04.