Unix Tools
Introduction
There are a number of Unix utilities that allow one to do such things as
break text files into pieces, combine text files together, extract bits of information from them,
rearrange them, and transform their content. Taken together, these Unix tools
provide a powerful system for obtaining linguistic information.
Here is a brief summary of the relevant tools. In each case, the name of the program
is a link to the manual page or other more detailed information.
These programs are available on virtually all Unix systems and have roughly
the same properties. The descriptions here, and the documentation to which
links are provided, apply to the GNU versions.
Overview of the Tools
Determining How Much is in a File
It is often useful to know how much is in a file. This can help to determine whether
it contains enough material to be worth the bother, whether it is so large as to
require special handling, and whether it is in the expected or desired format.
(For example, if a file has very few lines in comparison to the number of words or
characters, it may come from a system with different end-of-line conventions.)
- wc
- Prints a count of lines, words, and characters.
By default, wc simply counts bytes to produce its character
count, but if the -m flag is used in a UTF-8 locale it will correctly count
UTF-8 Unicode characters.
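The difference between byte and character counts can be seen in a minimal sketch like the following. The file name sample.txt and its contents are made up for illustration, and the value reported by -m depends on the locale in which the command is run.

```shell
# Sample file with 3 lines and 6 words. \303\251 is the UTF-8 encoding of
# e-acute, so each accented letter is one character but two bytes.
tmpdir=$(mktemp -d)
printf 'le caf\303\251\nun th\303\251\ndeux mots\n' > "$tmpdir/sample.txt"

lines=$(wc -l < "$tmpdir/sample.txt")   # number of newline characters
words=$(wc -w < "$tmpdir/sample.txt")   # whitespace-delimited words
bytes=$(wc -c < "$tmpdir/sample.txt")   # bytes
chars=$(wc -m < "$tmpdir/sample.txt")   # characters, per the current locale

echo "lines=$lines words=$words bytes=$bytes chars=$chars"
rm -rf "$tmpdir"
```

In a UTF-8 locale the character count should come out two lower than the byte count, since the file contains two two-byte characters. In the C locale the two counts are the same.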
Cutting a File Into Pieces
Several utilities allow one to cut a file into pieces.
- cut
- Extracts the specified fields from each input line. Fields may be specified by numerical
position or in terms of byte or character offset. By default, fields are taken to be separated
by a TAB character. Another delimiter may be specified instead.
The inverse of cut is paste.
- head
- Copies the first N lines of its input to the standard output. An option makes the unit
bytes instead of lines. The opposite of head is tail.
- tail
- Copies the last N lines of its input to the standard output. An option makes the unit
bytes instead of lines. The opposite of tail is head.
Note that head and tail used in combination
allow one to extract any desired contiguous set of lines. For example, the command
head -n 20 | tail -n 5
extracts lines 16 through 20: head keeps the first 20 lines and tail then keeps the last 5 of those.
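The cutting tools in this section can be tried out with a short sketch along these lines. The file lexicon.tsv and its three columns (word, part of speech, gloss) are hypothetical, and seq is assumed to be available, as it is on GNU systems.

```shell
tmpdir=$(mktemp -d)
# Hypothetical tab-separated lexicon: word, part of speech, gloss.
printf 'chat\tN\tcat\nchien\tN\tdog\nmanger\tV\teat\n' > "$tmpdir/lexicon.tsv"

# cut by field: the second TAB-separated field of every line.
pos=$(cut -f 2 "$tmpdir/lexicon.tsv")

# cut with an explicit delimiter instead of TAB.
first=$(printf 'a:b:c\n' | cut -d : -f 1)

# cut by character offset: characters 1 through 4 of each line.
prefix=$(printf 'linguistics\n' | cut -c 1-4)

# head and tail combined: lines 16 through 20 of a 30-line input.
middle=$(seq 1 30 | head -n 20 | tail -n 5)

printf '%s\n' "$pos" "$first" "$prefix" "$middle"
rm -rf "$tmpdir"
```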
Extracting Selected Lines from a File
Instead of cutting a file into pieces based purely on the position of the pieces,
it is possible to extract material based on its content.
- uniq
- Given sorted input, writes to the standard output the unique lines, that is,
one line in place of what may be multiple identical lines in the input.
The input must be sorted because uniq compares only adjacent lines.
If desired, uniq will print a count of the repeated lines.
Options provide for the printing only of lines that are not repeated or
only of lines that are repeated.
- grep
- Copies to the standard output the lines of input that match a regular expression. An option allows the lines
not matching a regular expression to be selected instead. GNU grep
understands Unicode.
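As a rough illustration of content-based selection, the sketch below runs uniq and grep over a small made-up word list. The variable names are arbitrary, and LC_ALL=C is set only to make the sort order predictable.

```shell
# A made-up list of word tokens, one per line.
words=$(printf 'the\ncat\nthe\ndog\nthe\ncat\n')

# uniq compares adjacent lines only, so sort first.
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)

unique=$(printf '%s\n' "$sorted" | uniq)       # one copy of each line
counts=$(printf '%s\n' "$sorted" | uniq -c)    # each line prefixed by its count
once=$(printf '%s\n' "$sorted" | uniq -u)      # only lines that are not repeated
repeated=$(printf '%s\n' "$sorted" | uniq -d)  # only lines that are repeated

# grep: lines matching a regular expression, and with -v the lines that do not.
with_t=$(printf '%s\n' "$words" | grep '^t')
without_t=$(printf '%s\n' "$words" | grep -v '^t')

printf '%s\n' "$counts"
```

Sorting and then counting with uniq -c is the usual first step toward a frequency list.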
Combining Files
Given two or more files, it is possible to combine them into a single file
either "horizontally" or "vertically", or on the basis of the contents of a particular field.
- cat
- Concatenates the files named on the command line and writes the result on the standard output.
- paste
- Writes lines consisting of sequentially corresponding lines of each input file
on the standard output. By default, the "columns" taken from each input file
are separated by a TAB character. Another delimiter may be specified.
The inverse of paste is cut.
- join
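- For each pair of lines, one from each input file, in which the specified join field is the same,
writes to the standard output a single line consisting of the join field followed by the remaining
fields of the two input lines. Both files must be sorted on the join field. Join
provides a simple, text-based, relational database facility.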
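The three ways of combining files can be compared in a small sketch such as the following. All file names and contents are invented for the example, and the join inputs are already sorted on their first field.

```shell
tmpdir=$(mktemp -d)
printf 'chat\nchien\n' > "$tmpdir/words.txt"
printf 'cat\ndog\n'    > "$tmpdir/glosses.txt"

# "Vertical" combination: the second file follows the first.
stacked=$(cat "$tmpdir/words.txt" "$tmpdir/glosses.txt")

# "Horizontal" combination: corresponding lines side by side, TAB-separated.
side=$(paste "$tmpdir/words.txt" "$tmpdir/glosses.txt")

# The same with a colon as the delimiter.
colon=$(paste -d : "$tmpdir/words.txt" "$tmpdir/glosses.txt")

# join on the first field (the default), which both files share.
printf 'chat N\nchien N\n'     > "$tmpdir/pos.txt"
printf 'chat cat\nchien dog\n' > "$tmpdir/gloss.txt"
joined=$(LC_ALL=C join "$tmpdir/pos.txt" "$tmpdir/gloss.txt")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```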
Rearranging a File
Most frequently we want to rearrange a file on the basis of the content of the pieces, for which
we use sort. The standard sort program is
very useful, but it is not capable of some of the kinds of sorting that arise in
linguistic work. A more powerful sorting program designed specifically for linguistics
is msort,
which we will look at later when we deal with sorting in more detail.
It is occasionally useful, however, to be able to
reverse the order of the contents of a file, for which tac is available.
- sort
- Sorts its input and writes the result on the standard output.
GNU sort orders lines by the collating sequence of the current locale; beyond that
it offers no special handling of Unicode.
- tac
- Concatenates the files named on the command line and writes the result on the standard output in reverse order, that is, the last record first, the next-to-last record second, and so forth.
Records default to lines, but the record separator may be specified by a regular expression.
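A few common rearrangements are sketched below on made-up input. LC_ALL=C is used so that the ordering is plain byte order and does not vary with the locale. The tac command is assumed to be the GNU one.

```shell
input=$(printf 'pear\napple\nfig\n')

# Sort in ascending order.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)

# Sort in descending order.
descending=$(printf '%s\n' "$input" | LC_ALL=C sort -r)

# Numeric sort: compares by numeric value rather than character by character.
numeric=$(printf '10\n9\n100\n' | LC_ALL=C sort -n)

# tac reverses the order of the lines without sorting them.
reversed=$(printf '%s\n' "$input" | tac)

printf '%s\n' "$sorted" "$reversed"
```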
Comparing Two Files
- cmp
- Given two files, identifies the byte and line at which they differ, if they do.
cmp is useful for finding out whether binary files are the same and, if they
are different, finding out where to look for the difference.
comm and diff are generally more useful
for comparing human-readable files.
- comm
- Given two sorted files as input, writes on the standard output the lines
that are common to both inputs, the lines that occur only in the first input file,
and the lines that occur only in the second input file. Options allow any chosen
combination of the three columns of output to be suppressed.
- diff
- Diff generates a description of how one input file differs from the other.
Several output formats are available. Generally, they describe the differences
in terms of the changes that must be made to derive the second file from the first.
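The three comparison tools can be contrasted on a pair of small sorted files, as in the sketch below. The files a.txt and b.txt are invented for the example. Note that diff exits with a non-zero status when the files differ, which is why its result is guarded with "|| true" here.

```shell
tmpdir=$(mktemp -d)
printf 'apple\nfig\npear\n'  > "$tmpdir/a.txt"
printf 'apple\nkiwi\npear\n' > "$tmpdir/b.txt"

# cmp -s prints nothing and reports only through its exit status.
if cmp -s "$tmpdir/a.txt" "$tmpdir/b.txt"; then same=yes; else same=no; fi

# comm prints three columns; -1, -2 and -3 suppress the corresponding column.
common=$(LC_ALL=C comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")  # in both files
only_a=$(LC_ALL=C comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")  # only in a.txt
only_b=$(LC_ALL=C comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")  # only in b.txt

# diff in its default format: line 2 of a.txt changes into line 2 of b.txt.
changes=$(diff "$tmpdir/a.txt" "$tmpdir/b.txt" || true)

printf '%s\n' "$changes"
rm -rf "$tmpdir"
```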
Transforming a File
There are a variety of ways of transforming a file in a systematic way.
These range from the specialized transformations provided by fold
to the very general transformations provided by sed and
awk.
- fold
- Breaks long input lines. The primary use is formatting, but fold is sometimes
useful in linguistic text processing. For example, if you need to get each character
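onto a line by itself, the command
fold -w 1
sets the line length to one character and so will do the job. GNU fold understands Unicode,
though some builds count bytes rather than characters, so it is worth checking the output on non-ASCII input.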
- tr
- Translates one set of characters into another. Can also delete specified characters and
reduce sequences of multiple tokens of the same character to a single token.
- sed
- A powerful stream editor. Sed copies its input to its output, editing it in passing.
Each portion of an input line matching a regular expression can be replaced with a fixed
string or another portion of the input. Lines matching a regular expression can be deleted. GNU sed understands Unicode.
- awk
- Copies its input to standard output, performing specified actions whenever
the input matches a specified pattern. awk automatically parses its input
into records and the records into fields. By default, a record is a line, with
fields separated by whitespace. However, both the field and record separators may
be changed. awk is actually a full-fledged programming language, meaning
both that there is a good deal to learn in order to use all of its capabilities
and that it can be used for many purposes. With only a small amount of effort,
however, it can be used to extract particular records and fields and to rearrange
fields. For example, the command
awk '{print $3,$1}'
will extract the
first and third fields from every input line and print them in reverse order, that
is, the third field followed by the first field. GNU awk understands Unicode.
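The transformations described in this section can be illustrated with a handful of one-liners. This is only a sketch on made-up ASCII input. The final pipeline is one possible way of combining tr with sort and uniq to find the most frequent word, not a method prescribed by any of the tools.

```shell
# fold: one character per line (ASCII input, so bytes and characters coincide).
chars=$(printf 'cat\n' | fold -w 1)

# tr: map upper case to lower case, and squeeze runs of spaces.
lower=$(printf 'The Cat\n' | tr '[:upper:]' '[:lower:]')
squeezed=$(printf 'a   b\n' | tr -s ' ')

# sed: replace every match of a pattern, and delete lines matching a pattern.
replaced=$(printf 'colour and colour\n' | sed 's/colour/color/g')
kept=$(printf 'keep\n# comment\nkeep too\n' | sed '/^#/d')

# awk: print the third field followed by the first.
swapped=$(printf 'a b c\n' | awk '{print $3,$1}')

# Combined: put one word on each line, count the words, report the most frequent.
top=$(printf 'the cat saw the dog\n' | tr -s ' ' '\n' | LC_ALL=C sort |
      uniq -c | LC_ALL=C sort -rn | head -n 1 | awk '{print $2, $1}')

printf '%s\n' "$chars" "$lower" "$replaced" "$swapped" "$top"
```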
Revised 2004/02/04.