Tutorial on grep

There are numerous general tutorials for grep (an acronym based on "global regular expression print") - see the external links in the Wikipedia entry.

The following instructions are geared towards the search requirements for this course. Grep is line-based; since the coding strings are each on a separate line, grep searches each coding string separately, just as we want.

The general format of a grep command is:

grep search-pattern file(s)-to-be-searched
The search pattern (but not the names of the input files to be searched) should be enclosed in single or double quotes, so that the shell passes it to grep unchanged, like this:
grep "v:.:.:.:.:.:.:6:8:.:7:0:.:f:.:.:.:l" ling300.cod.ooo
Given the information about regular expressions below and the coding conventions for the coding strings, the search string above matches the coding strings for clauses with an ordinary main verb and do support (rather than simple negation) in private letters written in the 1700s by women born in the 1680s.

If you run the above command at the command line, you should get one match.
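
To try the general format without the course file, here is a minimal sketch. The file sample.cod and its three short coding strings are invented for illustration and do not follow the real coding conventions; only the shape of the grep command (quoted pattern, then file) mirrors the command above.

```shell
# Build a small made-up input file (hypothetical data, not taken from ling300.cod.ooo).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.cod"
printf '%s\n' 'v:a:1:f:l' 'v:b:2:m:l' 'D:a:1:f:D' > "$sample"

# Quoted search pattern first, then the file to be searched.
# Each period matches any single character in that position.
matches=$(grep 'v:.:.:f:l' "$sample")
echo "$matches"

# Remove the temporary directory again.
rm -r "$tmpdir"
```

Only the first of the three made-up lines has v in the first field, f in the fourth, and l in the fifth, so that is the one line grep prints.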

Search patterns in grep can contain literal characters as well as so-called regular expressions. The following table contains the regular expressions that you will need in order to search the coding strings for the syntax project.

Regular expression Explanation
. Period stands for any single character (including itself)
[aeiouy] Square brackets enclose alternatives. The expression to the left matches any single one of the English vowels.
[a-e] For digits and letters, alternatives can be specified as ranges of characters. The expression on the left is another way of searching for [abcde].
[0-9]
[a-z]
[A-Z]
These are the most commonly used ranges. The expressions on the left match, respectively, a single digit, a single lowercase letter, and a single uppercase letter.
[0-9a-z]
[a-zA-Z0-9]
[a-cg-im-os-t]
Ranges can be combined. The first expression matches a single digit or lowercase letter. The second expression matches a single digit or any letter. As the third expression shows, the ranges that are combined can be any well-formed range.

A hyphen right after the opening bracket or right before the closing bracket is interpreted literally. So, the following searches are equivalent.

grep '^[a-c-]' ling300.cod.ooo
grep '^[abc-]' ling300.cod.ooo
grep '^[-a-c]' ling300.cod.ooo
grep '^[-abc]' ling300.cod.ooo
^ The caret character has two different meanings, depending on where it occurs in a search string.

A caret as the first character of a search string "anchors" the search string to the beginning of an input line. In other words, there is a difference between the following two commands.

grep 'D' ling300.cod.ooo
grep '^D' ling300.cod.ooo
The first command finds lines with D anywhere in the coding string. Given the coding conventions for the coding strings, this would match negative sentences with main verb do, questions with main verb do, any coding string from a private diary, and a number of other sentence types - not a linguistically meaningful result! The second command finds lines with D as the first character on the input line. Given the coding conventions, this would match negative sentences with main verb do.

In order to find tokens from private diaries (regardless of their other properties), you'd say

grep '^.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:D' ling300.cod.ooo

When a caret immediately follows an opening square bracket, it has an entirely different meaning. In that context, it negates the material in square brackets: the expression matches any single character that is not listed. For instance, given the coding conventions, all of the following searches are equivalent.

grep '^[DHK]' ling300.cod.ooo
grep '^[^BVbdhkv-]' ling300.cod.ooo
grep '^[^BVa-z-]' ling300.cod.ooo
* An asterisk after an expression indicates zero or more instances of that expression (that is, the expression may be absent, occur once, or be repeated any number of times).
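
The expressions in the table above can be tried out on a small made-up file. The following is only a sketch: the file sample.txt and its six lines are invented for illustration and are not real coding strings.

```shell
# Hypothetical test data, invented for illustration.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' 'Dab' 'bDa' '-ab' 'Va' 'Vaaab' 'Vb' > "$sample"

# Anchoring caret plus bracket alternatives: lines that begin with D or V.
starts_dv=$(grep -c '^[DV]' "$sample")

# Negating caret inside the brackets: lines whose first character is neither D nor V.
not_dv=$(grep -c '^[^DV]' "$sample")

# Asterisk: V, then zero or more instances of a, then b.
v_star=$(grep '^Va*b' "$sample")

echo "$starts_dv"
echo "$not_dv"
echo "$v_star"

# Remove the temporary directory again.
rm -r "$tmpdir"
```

With this sample, the first search counts 4 lines (Dab, Va, Vaaab, Vb), the second counts 2 (bDa and -ab), and the third prints Vaaab and Vb. The line Va is not printed by the third search because no b follows.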

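Once we have ascertained that there are no errors in the coding strings, we generally don't care about the strings themselves; we're just interested in the number of times that strings of a particular form occur. In order to count matches, grep allows you to use a switch (= option) called -c, which prints the number of matching lines instead of the lines themselves. Since each coding string is on its own line, this is the number of matching coding strings. It is used like this: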

grep -c "V:.:.:.:.:.:.:6:8:.:7:0:.:f:.:.:.:l" ling300.cod.ooo
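
As a minimal sketch of what -c reports, the following uses a made-up file. The name sample.cod and its four shortened coding strings are assumptions for illustration only and do not follow the real coding conventions.

```shell
# Hypothetical shortened coding strings, one per line.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.cod"
printf '%s\n' 'V:6:8:f' 'V:6:8:m' 'V:6:9:f' 'D:6:8:f' > "$sample"

# -c prints the number of matching lines rather than the lines themselves.
count=$(grep -c '^V:6:8:.' "$sample")
echo "$count"

# Counting the printed matches with wc -l gives the same number
# (tr strips the padding that some versions of wc add).
count_wc=$(grep '^V:6:8:.' "$sample" | wc -l | tr -d ' ')
echo "$count_wc"

# Remove the temporary directory again.
rm -r "$tmpdir"
```

Both commands report 2 here, since only the first two made-up lines begin with V:6:8: followed by one more character.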

The output of grep -c can be entered into spreadsheets for further quantitative analysis. You could copy the counts by hand, but don't: it is both time-consuming and error-prone. Instead, see Shell scripts for saving searches for how to save your searches in a form that allows you to import the results into your spreadsheet program.