GNU/Linux Tutorial

This tutorial provides basic information about how to use the GNU/Linux systems in the Phonetics Lab. Full details are given only in a few simple and important cases. In general, the intention is to alert you to what you need to learn about. You will then need to read the appropriate documentation to find out the details.

At present, with the exception of two machines running Microsoft Windows, all of the machines in the Phonetics Lab are running GNU/Linux. This is one of a larger class of UNIX-type systems, including Solaris/SunOS, Macintosh OS X, and FreeBSD. Although details differ, in most respects this tutorial is applicable to all UNIX-type systems.

Strictly speaking, UNIX is a registered trademark of the Open Group and refers only to those descendants of the original Bell Labs UNIX licensed by AT&T. We here use UNIX in its standard colloquial sense of all operating systems modelled on the original UNIX, including lineal descendants not called UNIX, such as HP-UX and Solaris, and reimplementations not dependent on AT&T code, such as Minix, FreeBSD, and GNU/Linux.

Outline

  1. Getting Information
  2. Shells
  3. Terminating Programs
  4. Navigating the Filesystem
  5. Programs and Paths
  6. Locating Files
  7. Manipulating Files
  8. Printing
  9. File System Usage
  10. Wildcards
  11. Shell Goodies
  12. Some Important Utilities
  13. Some Other Useful Programs
  14. GNU/Linux

Getting Information

If you want to find out about a command, one way is to use the "man" command. Type "man" followed by the name of the command. This will generally produce information about the command. If you aren't sure of the name of the command you need, try "man -k" followed by a topic. For example, "man -k directory" will list all of the manual entries dealing with directories. The manual has 9 sections, of which the first is the one devoted to ordinary user commands. Other sections contain information primarily of interest to programmers. If you do "man -k directory", the result will list not only commands like "ls" that are of interest to you but library functions and system calls. Look for the "(1)" indicating the section number to tell you that a result is a user command.

Another source of information is the "info" command. This will bring up a kind of hypertext system. Just try typing "info"; there is a tutorial.

Many commands provide a synopsis when executed with a special option, such as -i (for information), -h, or --help (for help), or with impossible arguments. For example, the "man" command cannot do anything without any arguments or options, so if you just type "man" by itself, a synopsis or a short hint about what it expects will be printed.

Shells

When you are in a terminal window, the program that you are communicating with is a command interpreter known as a "shell". There are a number of shells, generally similar but different in detail. The common ones are csh, tcsh, bash, and sh. You can find out which one you are using by typing ps. This generates a list of the "processes" that you are running. One of the entries in the last column, headed CMD or COMMAND depending on the system, will be something like "tcsh". If you want to run a different shell temporarily, you can just run it as a child of your current shell. If, for example, you are running bash and you give the command tcsh, you will then be running tcsh. To get out of tcsh, give the command exit. (The logout command works only in a login shell.) This terminates tcsh and puts you back in bash. If you want to change the shell you are given when you log in, use the command chsh.

The various shells execute commands that you type at them. Some of the commands are built into the shell. Others are independent programs that the shell executes for you. The words following the command itself may be option flags or arguments. UNIX commands normally take options marked by a leading hyphen or in some cases two leading hyphens, e.g. ls -F. Here the command is ls and the option is described verbally as "the F option". Some options have their own arguments. For example, the command

tar -xf foo.tar

extracts files from the archive file foo.tar. The tar program can both read and write archives, so the x option flag is given in order to tell it to extract files in this case. The f flag indicates that the archive to extract the files from is the following file. foo.tar is therefore an argument to the f flag. Notice in this case that several options can be combined after a single hyphen. Some programs allow this, but others do not.
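As a minimal sketch of these option flags in use, the following sh/bash commands create an archive, list its contents, and extract it again. The directory name project and the file notes.txt are made up for the illustration, and everything happens in a scratch directory created by mktemp.

```shell
# Work in a scratch directory so that none of your own files are touched.
start=$(pwd)
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir project
echo "some data" > project/notes.txt

tar -cf foo.tar project      # c = create; f = the archive file name follows
rm -r project                # remove the original so we can watch it come back
listing=$(tar -tf foo.tar)   # t = list the table of contents
tar -xf foo.tar              # x = extract
restored=$(cat project/notes.txt)

echo "$listing"
echo "$restored"

cd "$start" && rm -r "$workdir"   # clean up the scratch directory
```

Note that in each case f comes last in the group of flags, since the archive name that follows is its argument.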

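The various shell programs all execute initialization files when they start up. csh and tcsh execute a file called ".cshrc" (tcsh uses ".tcshrc" instead if that file exists). bash executes ".bash_profile" when started as a login shell and ".bashrc" when started as an interactive non-login shell, and sh executes ".profile". You can put commands into these files to set up the environment the way you want it.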

Terminating Programs

Once you have executed a program, you may decide that you want to stop it. Programs intended to be interactive usually accept a command that tells them to stop. This varies from program to program, but quit and exit are likely candidates.

In general, you can terminate programs by typing Control-c to the shell from which you executed the program. That is, while holding down the Control key (labelled ctrl on many keyboards), you type c. If the program is intended to terminate in some other way, such as in response to a command, this approach may cause the program to terminate abnormally, with varying consequences. If, for example, you exit a text editor this way, it will probably not have saved the text you were working on and your file will be left in the state it was in when you last saved to it.

Sometimes, for one reason or another, you cannot stop a program this way. An alternative is to use the kill command. This command is used to send signals to processes. You have to specify the signal that you want to send and the process ID number of the process to which you want to send it. The signal is a number that follows a hyphen. The process ID follows this. For example, the command:

kill -9 2341

will send signal number 9 ("terminate with extreme prejudice") to process number 2341. To find out the process ID number give the ps command. You can execute the kill command from any window, not just the one from which you executed the program.
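The sequence can be tried out safely on a process of your own. The sketch below is written for sh/bash and uses sleep as a stand-in for a program that will not stop. It first sends the default signal (number 15, a polite request to terminate) and falls back on signal 9 only if the process is still there. The variable names pid and state are just for illustration.

```shell
sleep 300 &                  # start a long-running program in the background
pid=$!                       # the shell records its process ID in $!

# Signal 0 sends nothing; it only tests whether the process exists.
kill -0 "$pid" && echo "process $pid is running"

kill "$pid"                  # no signal number given: sends signal 15 (TERM)
wait "$pid" 2>/dev/null || true   # let the shell collect the dead process

if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"           # last resort: cannot be caught or ignored
    state="killed with signal 9"
else
    state="terminated"
fi
echo "$state"
```

Trying the plain kill first gives the program a chance to clean up after itself; a program killed with signal 9 gets no such chance.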

Many versions of UNIX provide a pkill command, which is easier to use. pkill takes a pattern as its argument. It kills any process whose name matches the pattern. You should, of course, be careful not to use this command with a pattern that might match processes that you do not want to kill.

If you just want to suspend a program, type Control-z in the window in which the program is running. The fg command ("foreground") will resume the program. The bg command ("background") will put the program into the background. This means that the program will continue to run but that the shell will move on and accept new commands from you.

Navigating the Filesystem

The UNIX filesystem is basically a tree structure. (Strictly speaking, since it is possible to create cycles, it is not a tree, which by definition is acyclic, but a directed graph.) As with other trees used in linguistics, it is upside-down, with the root at the top. The root is referred to as /. A file that contains other files is referred to as a directory. This is the equivalent of a folder on Windows and Macintosh systems.

At any given time, you have a current working directory. You can find out what your current working directory is by giving the command pwd. If you are really lost and are not sure what machine you are on, give the command hostname. If you are not sure which user you are logged in as, the command whoami will tell you. (The command who lists everyone who is logged in to the machine.)

When you log in, your current working directory is your home directory. This is your base of operations, and it is where things like initialization files should go. Your home directory will usually have a name like /home/jones, where jones is your user name. You can always get back to your home directory by giving the command cd (change directory) with no argument.

There are a few other abbreviations. The current working directory may always be referred to as . (that is, just a period). The parent of the current directory is .. (two periods). One's own home directory is ~ (tilde). The home directory of a named user is named by the user's username preceded by ~, e.g. ~myl.

References to files may be absolute or relative. An absolute filename gives the complete path from the root of the tree. For example:

/usr/bin/ls

is an absolute file name because it starts at the root. If our current working directory were /usr/bin, we could refer to ls simply as ls and to /usr/local as ../local. Both of these would be relative names since they give the location relative to the current working directory.

The cd command ("change directory") is used to move around the tree. It changes the current working directory to the named directory. For example,

cd /tmp

changes the current working directory to /tmp. With no argument, it changes the current working directory to the user's home directory.
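Here is a small sh/bash sketch of moving around with absolute and relative names. It builds a miniature tree in a scratch directory whose layout (usr/bin and usr/local) merely imitates the example above; the variable names are hypothetical.

```shell
start=$(pwd)                       # remember where we began
base=$(mktemp -d)
cd "$base" && base=$(pwd)          # normalize the name of the scratch directory
mkdir -p usr/bin usr/local

cd "$base/usr/bin"                 # absolute name: starts at the root
here=$(pwd)

cd ../local                        # relative name: up one level, then down
sibling=$(pwd)

cd ..                              # the parent of the current directory
parent=$(pwd)

echo "$here"
echo "$sibling"
echo "$parent"

cd "$start" && rm -r "$base"       # go back and clean up
```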

To find out what files are in a directory, you use the command ls. With no argument, this lists the ordinary files in your current working directory. "Ordinary" files are files whose names do not begin with a period. File names may begin with a period, but ls does not normally list them. Such names are used for files that one does not often want to work with, such as initialization files like .cshrc and files used by programs. If you want to list these files too, use the command ls -a.
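The difference is easy to see in a scratch directory. In this sh/bash sketch the file names data.txt and .cshrc are just examples.

```shell
dir=$(mktemp -d)
touch "$dir/data.txt" "$dir/.cshrc"

plain=$(ls "$dir")       # ordinary files only
all=$(ls -a "$dir")      # also names beginning with a period, including . and ..

echo "ls shows:"
echo "$plain"
echo "ls -a shows:"
echo "$all"

rm -r "$dir"
```

On some BSD-derived systems ls behaves as though an option like -a had been given when it is run by the superuser, so the first listing may differ there.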

Programs and Paths

When you type a command, the shell has to figure out what program you want it to execute. If there is a command built into the shell with the given name, it executes the built-in command. Otherwise, it searches a sequence of directories known as a path. You can find out what your path is currently set to by giving the command:

echo $path

You can set your own path by assigning appropriate values to the shell's path variable. For example, the command:

set path=(/usr/bin /bin /usr/local/bin)

sets the path to the three named directories. When the shell looks for the file containing the program to execute, it will look first for a suitably named executable file in /usr/bin, then in /bin, and then in /usr/local/bin. The system manager will generally set up an appropriate path for most users.

Most people keep their personal programs in a directory called bin in their home directory. If you do this, you will want to include your bin directory in your path.

Some people set up their path so that programs in the current directory will be executed first. This is done by including the directory . (just a period) in the path. Remember that ., referring to the current working directory, is a member of every directory. Whether you should do this, and if so, whether you should put the current working directory at the beginning or end of your path, is controversial. If you write a lot of programs of your own, you will find it convenient to have the current directory in your path. You may not want to go to the trouble of moving your programs to your bin directory, and it often makes sense to keep everything associated with a particular project in one directory. If you often modify programs that are already to be found in the system path or in your bin, you may find it tedious to remember to type ./foo to execute the version you are working on instead of the previous version. For this reason, programmers often add the current directory to their path and put it at the front, so that it is searched first.

The problem with this is that it is a potential security hole. It makes it easier for attackers to make use of what are known as "Trojan horses". A Trojan horse is a program that appears to be innocuous but will do damage when executed. The attacker writes a program that will do some sort of damage if executed by someone else with the appropriate privileges. If the attacker can arrange to put the Trojan horse in a directory that someone else will visit and in which the other person will execute a common command, by giving the Trojan horse the name of the common command, the attacker tricks the other user into executing it. For example, suppose that the Trojan horse is named ls and is really a script that deletes all of the files that it can. The person who writes the Trojan horse may not have the permissions necessary to delete other people's files. Instead, he or she puts the Trojan horse in a directory, such as /tmp, in which anyone can put files. When another person goes to /tmp and gives the ls command, if the current directory is first in his or her path, he or she will execute the Trojan horse, not the real ls command.

From a security point of view, therefore, the best thing is for the current directory not to be included in the path. If you do need to execute a program in the current directory, you can always refer to it directly as ./foo. The next best thing is for the current directory to be put at the end of the path, so that programs in the standard system directories always get first shot. Here again, if you need to execute a program from the current directory, you can do so by referring to the current directory directly as ./foo. The least secure approach is to put the current directory at the beginning of the path. For the most part, only programmers will want to do this. If you do, you should be aware of the potential security problems and know how to minimize them.
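The effect of the position of the current directory in the path can be demonstrated harmlessly. The following sh/bash sketch (in sh and bash the path is a colon-separated variable called PATH) puts a fake ls, which only prints a message, into a scratch directory and asks the shell which ls it would run. Nothing outside the scratch directory is modified.

```shell
start=$(pwd)
saved_path=$PATH
demo=$(mktemp -d)
cd "$demo" || exit 1

# A harmless stand-in for a Trojan horse: it only prints a message.
printf '#!/bin/sh\necho "this is NOT the real ls"\n' > ls
chmod +x ls

PATH=".:$saved_path"        # current directory searched first
hash -r                     # forget any remembered command locations
first=$(command -v ls)      # which file would be executed?
ran=$(ls)                   # runs the fake

PATH="$saved_path:."        # current directory searched last
hash -r
last=$(command -v ls)       # the real ls wins

PATH=$saved_path            # restore the original path
hash -r
echo "with . first: $first"
echo "with . last:  $last"

cd "$start" && rm -r "$demo"
```

With the current directory last, the fake is still reachable as ./ls, but a mistyped or unusual command name could still pick up a file from the current directory, which is why leaving it out altogether is the safest choice.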

Unless you are running your own system or are an expert, you should generally just add your own components to the path set up by the system manager. You can do this by referring to the value of the path variable in your path setting command. For example,

set path=(~/bin $path)

adds your personal bin directory to the front of the path set up by the system manager. The which command tells you what file would be executed if you were to give the name as a command. This is helpful in determining which version of a program you are using.
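For users of sh or bash, a rough counterpart looks like this. The sketch uses a scratch directory in place of ~/bin and a made-up program called hello-demo, and it uses command -v, which is built into these shells, where a tcsh user would type which.

```shell
saved_path=$PATH
mybin=$(mktemp -d)                      # stands in for your ~/bin

# A tiny program of our own.
printf '#!/bin/sh\necho "hello from my own program"\n' > "$mybin/hello-demo"
chmod +x "$mybin/hello-demo"

PATH="$mybin:$PATH"                     # sh/bash counterpart of: set path=(~/bin $path)

found=$(command -v hello-demo)          # what file would be executed?
output=$(hello-demo)                    # run it by name alone

echo "$found"
echo "$output"

PATH=$saved_path                        # restore the path and clean up
rm -r "$mybin"
```

To make such a setting permanent, the PATH assignment would go into your initialization file.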

Locating Files

The most general way to locate a file is to use the find command. The first argument is the name of the point in the tree at which to start searching. If you want to search the entire file tree, this argument should be /. Next, you indicate the criteria for identifying the file or files you want. Typically you give the -name flag followed by the name of the file you want or a shell wildcard pattern (quoted so that the shell does not expand it before find sees it). Finally, you give the -print flag, which tells find to print the names of matching files. For example, the command:

find / -name Nouns.tex -print

will search the entire tree for files named Nouns.tex. The find command can use other criteria, such as the modification date of the files, and it can take other actions than printing, such as deleting the files.
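Searching from / can take a long time, so it is worth experimenting on a small tree first. This sh/bash sketch builds one in a scratch directory (the file names are invented) and shows a search by exact name, by quoted wildcard pattern, and by modification time.

```shell
tree=$(mktemp -d)
mkdir -p "$tree/papers/drafts"
touch "$tree/papers/Nouns.tex" "$tree/papers/drafts/Verbs.tex" "$tree/papers/notes.txt"

# Exact name.
exact=$(find "$tree" -name Nouns.tex -print)

# Wildcard pattern, quoted so that find (not the shell) interprets it.
tex_count=$(find "$tree" -name '*.tex' -print | wc -l)

# Ordinary files modified less than one day ago.
recent=$(find "$tree" -type f -mtime -1 -print | wc -l)

echo "$exact"
echo "tex files: $tex_count, recently modified files: $recent"

rm -r "$tree"
```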

The whereis command lists the locations of files related to programs, including executable programs, manual pages, and libraries.

Manipulating Files

The cp command is used to copy files. In its basic form, with two arguments it copies the first file to the second, unless the second argument is the name of a directory, in which case the copy has the same base name as the original but is placed in the named directory. With more than two arguments, the last argument must be the name of the directory into which all of the other files are to be copied. With the -r option, copying is recursive. That is, not only ordinary files but directories are copied. The effect is to copy an entire sub-tree.

The mv command moves files, including directories. If the files are on the same filesystem, the effect is just to rename them. If they are on different filesystems, data must actually be moved, so the effect is like first copying the files, then deleting the originals.

The rm command deletes files. It is important to note that it does not just move them to a holding area like the Recycle Bin in Windows or the Trash on a Macintosh; once you delete a file using rm, it is gone.
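The following sh/bash sketch runs through the three commands in a scratch directory, so that nothing of yours can be lost. The directory and file names (src, backup, paper.txt and so on) are invented for the example.

```shell
work=$(mktemp -d)
mkdir "$work/src" "$work/backup"
echo "draft" > "$work/src/paper.txt"

cp "$work/src/paper.txt" "$work/src/copy.txt"     # two files: copy first to second
cp "$work/src/paper.txt" "$work/backup"           # second is a directory: same base name
cp -r "$work/src" "$work/src2"                    # recursive: copy the whole sub-tree
mv "$work/src/copy.txt" "$work/src/renamed.txt"   # same filesystem: just a rename
rm "$work/src/paper.txt"                          # gone for good

# List what is left, relative to the scratch directory.
result=$(cd "$work" && find . -type f | sort)
echo "$result"

rm -r "$work"
```

The listing should show backup/paper.txt, src/renamed.txt, src2/copy.txt, and src2/paper.txt. The copies in src2 survive because they were made before the rename and the deletion.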

Printing

To print a file, use the command lpr. On some systems lpr just sends a file to a printer and it is your job to make sure that the file you give it is suitable for printing on that printer. On most systems, however, you can give lpr the name of a plain text file and lpr will arrange for it to be converted into the language of the appropriate printer.

To specify the printer to use, you can use the -P option. For example,

lpr -P ling mydata

will cause the file mydata to be printed on the printer called ling, the printer in the Linguistics department office. If you generally use a particular printer, you can set an environment variable in your initialization file. For example, in your .cshrc file you would put the line:

setenv PRINTER ling

The command:

lpr mydata

would then send the file mydata to the printer ling. The printer in the phonetics lab is called speech.

File System Usage

To find out how much storage the files in the current directory (and those below it) take up, give the command:

du -s

This will report the number of one kilobyte blocks taken up by the files.

To find out how much space is available on a filesystem, use the df command. It will report the total size of a disk partition, how many blocks are in use, and how many are still available. Take note that the column headed Use% (Capacity on some systems) contains the percentage of storage that is in use, NOT the percentage still available for use. The df command with no arguments provides this information for all disk partitions currently mounted on the system. You can give it an argument naming the particular filesystem in which you are interested. An easy way to limit df to the filesystem you are using is to give the command:

df .

In case that was hard to read, that was: df [SPACE] [PERIOD]. The period is an abbreviation for the current directory.
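Both commands can be tried on a scratch directory. In this sh/bash sketch the -k option asks explicitly for one kilobyte units and the -P option asks df for a fixed, portable column layout, so that awk can pick out the fifth column (the percentage in use). The file name big is made up, and the exact numbers will depend on your filesystem.

```shell
dir=$(mktemp -d)

# Create a file of 100 blocks of 1024 bytes each.
dd if=/dev/zero of="$dir/big" bs=1024 count=100 2>/dev/null

# Total usage of the directory, in kilobytes.
usage=$(du -sk "$dir" | awk '{print $1}')

# The filesystem holding the directory: the second line holds the figures.
pct=$(df -Pk "$dir" | awk 'NR == 2 {print $5}')

echo "directory uses $usage KB"
echo "filesystem is $pct full"

rm -r "$dir"
```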

On many of the machines you will be using each user has a quota for each filesystem. This allows the system manager to limit each user's use of storage. Your quota consists of both a hard limit and a soft limit. The system will not let you use more than the hard limit. It will let you use more than the soft limit but will nag you about it. To find out what your quota is and how much of it you have used, give the command:

quota -v

If you need extra space temporarily, you can put files in the /tmp directory. This is a directory in which anyone can store files and is not subject to quotas. If you need more permanent space, you will have to negotiate with the system manager to increase your quota.

Wildcards

The shells understand a number of wildcards, which allow abbreviations to be expanded into long lists of files. For example, an asterisk (*) represents any number of any character. The command:

ls C*.wav

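will list all of the files in the current directory whose names begin with C and end in .wav. The wildcard patterns used by the shells resemble regular expressions but are not the same: in a shell pattern * by itself matches any string, whereas in a regular expression the equivalent is .* (a period followed by an asterisk). For details, see the documentation for your shell.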
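A few of the most common wildcards can be tried out with echo, which simply prints whatever the shell has expanded the pattern into. This sh/bash sketch uses invented file names in a scratch directory.

```shell
start=$(pwd)
dir=$(mktemp -d)
cd "$dir" || exit 1
touch C1.wav C2.wav Cat.txt D1.wav

matches=$(echo C*.wav)       # * : any number of any characters
single=$(echo C?.wav)        # ? : exactly one character
classes=$(echo [CD]1.wav)    # [CD] : one character, either C or D

echo "$matches"
echo "$single"
echo "$classes"

cd "$start" && rm -r "$dir"
```

Here the first two patterns both expand to C1.wav C2.wav, and the third expands to C1.wav D1.wav.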

Shell Goodies

The various shells all provide a number of nice facilities. The details of these vary from shell to shell. The examples here are for tcsh.

History

The shell has a history system, which keeps track of the commands that you type. You can use this to see what you have done. Giving the command history will generate a list of the last N commands that you typed. The number of commands recorded is determined by the value of the shell variable history. You can also re-execute previous commands, as is, or after editing them. For example, typing:

!!

will re-execute the previous command, and typing:

!23

will re-execute command number 23 in the history list. For details consult the documentation for your shell.

Aliases

Commands can be given aliases to reduce typing and to avoid the need to remember complicated options. For example, if you give the command:

alias h history

typing h will execute the history command. (If you use bash as your shell, the equivalent command is: alias h=history.) If you want to see "invisible" files when you list a directory and to have the shell identify files of different types, you can alias "ls" to "ls -aF". Aliases last only for the duration of the shell in which they are defined. If you want to have them permanently, put the alias commands into your init file (e.g. .cshrc).

I/O Redirection

Each process has associated with it three i/o streams: a standard input, a standard output, and a standard error output. A program that reads data from the terminal is really one that is reading input from its standard input, which is connected to the terminal. A program that prints on the terminal is writing to its standard output or standard error output. The shells provide facilities for re-directing these i/o channels. A > followed by a filename tells the shell to connect the program's standard output to the specified file. For example, if you would like to capture a directory listing in a file, you can type:

ls > mylist

and the listing will appear in the file mylist rather than on the terminal. Many programs write their error messages on the standard error output so that these messages will still appear on the terminal even if the ordinary output of the program is re-directed into a file. Similarly, < causes the standard input to be connected to a file.
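The sketch below shows the three streams being redirected in sh/bash. The 2> notation for the standard error output is specific to sh-type shells; tcsh has no way to redirect the standard error output alone and instead offers >& to send both output streams to one file. The file names are invented.

```shell
dir=$(mktemp -d)     # files to list
out=$(mktemp -d)     # where the captured output goes
touch "$dir/a.wav" "$dir/b.wav"

ls "$dir" > "$out/mylist"              # standard output goes to a file

# Ordinary output and error messages sent to different files.
ls "$dir/no-such-file" > "$out/listing" 2> "$out/errors" || true

lines=$(wc -l < "$out/mylist")         # standard input comes from a file
errors=$(cat "$out/errors")
listing=$(cat "$out/listing")

echo "lines captured: $lines"
echo "error message:  $errors"

rm -r "$dir" "$out"
```

The output files are kept in a separate directory because the shell creates the output file before ls runs; had mylist been created inside the directory being listed, it would have appeared in its own listing.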

Pipes

Another form of i/o redirection is the use of pipes. Pipes are used to connect one program's standard output to another program's standard input. For example, suppose that you want a list of all of the files in your current directory containing just the file name followed by its size in bytes. The ls command with the -l option will give you the information you need, but it will give you other information as well, and it will put the name of the file at the end. You can use the awk program to rearrange the fields and extract only the ones you want, like this:

ls -l | awk '{print $9,$5}'

Placing the pipe symbol | between the two programs causes the output of ls to be fed as input to awk.
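Pipes can be chained. The sh/bash sketch below repeats the example on two files of known size and then adds sort and tail to pick out the largest file. LC_ALL=C is set so that ls prints dates in the usual three-field form, which is what makes the file name the ninth field, and the condition NF >= 9 skips the "total" line at the top of the listing. The file names are made up.

```shell
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"            # 5 bytes
printf 'hello world\n' > "$dir/b.txt"    # 12 bytes

# Name followed by size, one file per line.
sizes=$(LC_ALL=C ls -l "$dir" | awk 'NF >= 9 {print $9, $5}')

# Size first, sorted numerically, keeping only the last (largest) line.
biggest=$(LC_ALL=C ls -l "$dir" | awk 'NF >= 9 {print $5, $9}' | sort -n | tail -n 1)

echo "$sizes"
echo "largest: $biggest"

rm -r "$dir"
```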

Shell Loops

The shells are really programming language interpreters and provide control structures such as conditionals and loops. For details consult the documentation for your shell.
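As a small taste, here is a loop with a conditional inside it, written in sh/bash syntax (the csh and tcsh equivalents, foreach and if, are spelled differently). It counts the .wav files in a scratch directory; the file and variable names are invented.

```shell
dir=$(mktemp -d)
touch "$dir/C1.wav" "$dir/C2.wav" "$dir/notes.txt"

count=0
for f in "$dir"/*; do                    # one pass per file in the directory
    if [ "${f##*.}" = "wav" ]; then      # ${f##*.} is what follows the last period
        count=$((count + 1))
        echo "found ${f##*/}"            # ${f##*/} strips the directory part
    fi
done

if [ "$count" -gt 0 ]; then
    summary="found $count wav files"
else
    summary="no wav files"
fi
echo "$summary"

rm -r "$dir"
```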

Order of Command Line Substitutions

It is important to understand the order in which substitutions are performed by the shell. The description here is based on the (t)csh shell. Substitutions take place in the following order.

History Substitution
The shell keeps track of the commands that you issue and allows you to refer back to previous commands. For example, the command line !23 reissues the command in event 23. The first thing the shell does in interpreting the command line is to resolve references to the command history. For example, suppose that the command in event 23 was awk -f script < $D/foo.txt > foo.out. Typing !23 will reissue the command in exactly that form, including the reference to the variable D. The variable reference will be replaced with the current value of the variable (not the value it had at event 23), because variable substitution takes place after history substitution.

Alias Substitution
Aliases are alternative names for commands that you may define using the alias command. Alias substitution takes place when the shell replaces aliases with the full command for which they stand. For example, if you give the command alias m more, thereafter when you type m at the beginning of the command line, it will be interpreted as an abbreviation for the command more. Since alias substitution follows history substitution, aliases in commands reissued via the history system are interpreted correctly.

Variable Substitution
The shell contains variables, that is, named pieces of information. Some of these are created by the shell itself as a way to provide information to the user. You may also create variables yourself. At this stage, the shell resolves references to variables and replaces them with the value of the variable.

Command Substitution
The backquote notation allows the output of a command to be used on the command line. For example, if the file list contains a list of files that you wish to process, in the place on the command line in which it is appropriate to supply filenames as arguments to a command, you could write: `cat list`. The command cat would normally write the names of the files on the standard output, by default, your terminal. The backquotes cause the names to become part of the command-line instead.

Filename Substitution
Certain special characters cause the shell to list the contents of the specified directory and match the argument containing the special characters against the list of files, passing on only those that match. This is referred to as filename globbing.
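The last three kinds of substitution can be seen at work in a script. The sketch below is written for sh/bash, whose rules differ in detail from those of (t)csh described above, so it should be taken only as an illustration of what each substitution does, not of the order. The file names and the variable D are invented.

```shell
start=$(pwd)
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'one\ntwo\n' > a.txt
printf 'three\n' > b.txt
printf 'a.txt\nb.txt\n' > list           # a file containing a list of file names

D=$dir                                   # create a variable of our own

# Variable substitution: $D is replaced by its value.
from_var=$(wc -l < "$D/a.txt")

# Command substitution: the output of "cat list" becomes the arguments of cat.
from_cmd=$(cat `cat list` | wc -l)

# Filename substitution: ?.txt is matched against the files in the directory.
from_glob=$(cat ?.txt | wc -l)

echo "$from_var $from_cmd $from_glob"

cd "$start" && rm -r "$dir"
```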

Some Important Utilities

grep
This program matches regular expressions against its input and either passes on only those strings that match or only those that do not.

sort
This program sorts text files in a variety of ways on specified fields.

paste
Paste is a standard UNIX utility that joins two or more files of text in a column-wise fashion. You can specify what to use as "glue".

sed
This is a stream editor similar to the ed line editor but designed for editing its input on the fly. It reads from standard input and writes on standard output. It can select input lines by matching regular expressions and can perform substitutions on regular expressions.

awk
AWK is a language designed for processing text. An Awk program automatically parses its input into records (by default lines) and parses each record into fields (by default separated by spaces or tabs). Most of a typical program consists of one or more patterns which are matched against the parsed input text. When a pattern is matched, the associated action is taken. For example, the complete Awk program to extract the second column from its input is {print $2}. This is an action with no condition, so it matches every input line. The action is to print the second field of the record. Many people use Awk to process experimental results. Awk is named after its creators, Al Aho, Peter Weinberger, and Brian Kernighan.

perl
Perl is similar to AWK and is considered by some to be its replacement. It has all sorts of additional facilities. Depending on your point of view these add great functionality to Perl or make it excessively kitchen-sink-ish. Perl manuals and introductions may be found in the Phonetics Lab library. The creator of Perl, Larry Wall, is a linguist. All sorts of information about Perl may be found at the Perl web site.
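To give a feel for these utilities, the sh/bash sketch below applies several of them to a tiny invented results file with three columns (speaker, vowel, duration) and a second invented file of group labels.

```shell
dir=$(mktemp -d)
printf 's1 a 312\ns2 i 87\ns3 a 145\n' > "$dir/results"
printf 'A\nB\nC\n' > "$dir/groups"

matching=$(grep ' a ' "$dir/results")                # lines that match
others=$(grep -v ' a ' "$dir/results")               # lines that do not match
sorted=$(sort -k 3,3n "$dir/results")                # numeric sort on field 3
renamed=$(sed 's/^s/speaker/' "$dir/results")        # substitution on each line
second=$(awk '{print $2}' "$dir/results")            # extract the second column
glued=$(paste -d : "$dir/results" "$dir/groups")     # join line by line, ":" as glue

echo "$sorted"
echo "$glued"

rm -r "$dir"
```

Since each of these programs reads its standard input when given no file name and writes on its standard output, they can also be strung together with pipes.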

Some Other Useful Programs

bc      calculator (with special functions and arbitrary precision)
cal     calendar (for any month of any year)
cat     concatenates files
chmod   sets file's read, write, and execute permissions
cut     extracts selected portions of each line from a text file
file    identifies file type on the basis of contents
more    views a file a page at a time
units   unit conversions (e.g. feet to meters)

GNU/Linux

The version of UNIX running on PC hardware is called GNU/Linux. This is a reimplementation of UNIX by a consortium of programmers. Most of the code was developed by the GNU Project (GNU is a recursive acronym standing for "GNU's Not UNIX"), a project initiated in 1984. The goal of the GNU Project is to liberate UNIX by creating a non-proprietary reimplementation whose source code can be freely shared and improved. This is part of a movement led by the Free Software Foundation. By the early 1990s, the GNU Project had completed a large part of the GNU software but was still at work on the kernel, the core of the operating system, which handles such things as process scheduling, memory management, and communications with disks and other such devices. In 1991 a Finnish student named Linus Torvalds wrote a kernel for his PC. In the strictest sense, it is Torvalds' kernel and its descendants that constitute Linux. Since the systems that we use in practice represent the combination of Torvalds' kernel and the GNU software, the system is properly known as GNU/Linux. Details of the relationship between GNU and Linux are available from the GNU Project.

In practice, almost everyone installs GNU/Linux from one of the many (nearly 200 at current count) distributions now available. Each distributor chooses a particular combination of software and provides its own system for installation and maintenance. Most of the software is itself available for download at no cost. When you buy a distribution, you are paying for CDs, for the manuals that may be provided, for the installation and maintenance systems, and possibly for some support services. For information on GNU/Linux distributions, go to the Linux Distribution List. The distribution installed in the Lab at present is Red Hat.