Searching and counting with grep

The following instructions tell you how to search and count coding strings with the Linux command grep.


The general format of a grep command is:

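grep search-pattern file(s)-to-be-searched

The search pattern (but not the input files to be searched) needs to be enclosed in quotes. Both double and single quotes appear in the examples on this page; single quotes are the safer choice when the pattern contains a dollar sign, since the shell leaves everything inside them untouched. A quoted search looks like this: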
grep "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo

A grep search pattern can contain literal characters as well as regular expressions. The following table contains the regular expressions that you will need in order to search coding strings for the syntax project. For some of the examples, you will need to refer to the coding conventions that are used in the coded version of the corpus.

Regular expression Explanation
. Period stands for any single character (including itself)
[abcde] Square brackets enclose alternatives. The expression to the left matches "a" or "b" or "c" or "d" or "e".
[a-e] For digits and letters, alternatives can be specified as ranges of characters. The expression on the left is another way of searching for [abcde].
[0-9], [a-z], [A-Z] Commonly used alternatives can be specified as ranges of characters. The expressions on the left match, respectively, a single digit, a single lowercase letter, a single uppercase letter.
[0-9a-z], [a-zA-Z0-9] Ranges can be combined. The first expression matches a single digit or lowercase letter. The second expression matches a single digit or letter.
^ A caret as the first character of a search string "anchors" the search string to the beginning of an input line. In other words, there is a difference between the following two commands.
grep 'D' file(s)-to-be-searched
grep '^D' file(s)-to-be-searched
The first command finds lines with D anywhere on an input line. Given the coding conventions in the coded parsed corpus, this would match negative sentences with main verb do, questions with main verb do, any coding string from a private diary, and a number of other sentence types - not a linguistically meaningful result! The second command finds lines with D as the first character on the input line. Given the coding conventions, this would match negative sentences with main verb do.

In order to find tokens from private diaries (regardless of their other properties), you'd say

grep '^.:.:.:.:.:.:.:.:.:.:.:.:D' $LING300/ppceme.cod.ooo

When a caret immediately follows an opening square bracket, it has an entirely different meaning. In that context, it negates the material in the square brackets, so that the expression matches any single character that is not listed. For instance, given the coding conventions, all of the following searches are equivalent.

grep '^[DHK]' $LING300/ppceme.cod.ooo
grep '^[^Vdhkv-]' $LING300/ppceme.cod.ooo
grep '^[^Va-z-]' $LING300/ppceme.cod.ooo
$ A dollar sign "anchors" the search string to the end of an input line. In contrast to the caret, the dollar sign doesn't have two meanings depending on its context. You probably won't use the dollar sign, but I include it here for completeness.
* An asterisk after an expression indicates zero or more instances of that expression (that is, the expression is optional and may be repeated).
+ A plus sign after an expression indicates one or more instances of that expression (that is, at least one instance of that expression). In standard grep, the plus sign has this meaning only when grep is run with the -E switch; without it, a plus sign is treated as a literal character.
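
As a minimal sketch of how the expressions in the table behave, the following commands run grep on a small made-up file. The lines in the file are invented for illustration and are not real coding strings from the corpus. Setting LC_ALL=C is an added precaution, not something the original instructions require; it makes ranges such as a-z follow plain ASCII order.

```shell
# Use ASCII ordering so that a range like a-z covers only lowercase letters.
export LC_ALL=C

# Create a small throwaway file of invented lines (not real corpus data).
sample=$(mktemp)
printf '%s\n' 'D:1:a' 'H:2:D' 'd:3:b' 'V:44:c' '-:5:D' > "$sample"

# Unanchored: D anywhere on the line.
anywhere=$(grep -c 'D' "$sample")          # D:1:a, H:2:D, -:5:D

# Anchored: D as the first character only.
initial=$(grep -c '^D' "$sample")          # D:1:a

# Negated brackets: the first character is anything except V,
# a lowercase letter, or a hyphen.
negated=$(grep -c '^[^Va-z-]' "$sample")   # D:1:a, H:2:D

# Dollar sign: D as the last character on the line.
final=$(grep -c 'D$' "$sample")            # H:2:D, -:5:D

# One or more digits between colons; -E makes + a repetition operator.
digits=$(grep -c -E ':[0-9]+:' "$sample")  # all five lines

echo "$anywhere $initial $negated $final $digits"   # prints: 3 1 2 2 5
rm -f "$sample"
```

The hyphen in the negated brackets is placed last so that it is read as a literal hyphen and not as part of a range.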

Given the information above and the coding conventions for the coded parsed corpus, you can see that the search at the beginning of this page, repeated here for convenience, returns all the coding strings for clauses with main verb have and simple negation (rather than do support) in nonprivate letters from the 1670s.

grep "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo

Once we have ascertained that there are no errors in the coded corpus, we generally don't care about the coding strings themselves; we're just interested in the number of times that strings of a particular form occur. In order to count matches, you can invoke a so-called switch on grep. Instead of using the simple grep command, you use grep -c, which prints the number of matching lines rather than the lines themselves. An equivalent method is to send the output of grep through another command called wc. wc ordinarily counts characters, words, and lines, but we can force it to report only lines by running its variant wc -l (the last character is an ell, for "line"). The concatenation of the grep and wc -l commands is indicated by the so-called pipe symbol (|). With the switch, the count for the search above is obtained like this:

grep -c "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo

The output of grep -c can be entered (by hand or, better, automatically) into spreadsheets for further quantitative analysis.
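
The two ways of counting can be compared on a small example. This is only a sketch: the sample lines and the three-field pattern are invented stand-ins, much shorter than real coding strings.

```shell
# Invented sample lines standing in for coding strings (not real corpus data).
sample=$(mktemp)
printf '%s\n' 'H:a:L' 'H:b:L' 'D:c:L' 'H:d:P' > "$sample"

# grep -c prints the number of matching lines instead of the lines themselves.
count_c=$(grep -c '^H:.:L' "$sample")

# Piping grep's output through wc -l counts the same lines.
# tr strips the padding that some versions of wc put around the number.
count_wc=$(grep '^H:.:L' "$sample" | wc -l | tr -d ' ')

echo "grep -c: $count_c, wc -l: $count_wc"   # prints: grep -c: 2, wc -l: 2
rm -f "$sample"
```

The two methods differ when several files are searched at once: grep -c then reports one count per file, whereas piping through wc -l gives a single total.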

To further expedite your work, you can write shell scripts that will generate series of numbers for you, and you can even generate files that can be imported into Excel (obviating the need for typing, with its attendant risk of typos). See Saving your searches for more details.
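
As a rough idea of what such a script might look like, here is a sketch that runs the same search several times, varying only the digit in the ninth field (the 7 in the example search), and writes the counts to a comma-separated file. It is not the method described in Saving your searches; the output file name counts.csv, the column header, and the sample lines are all hypothetical.

```shell
# Invented stand-in for the coded corpus; each line mimics a 13-field coding string.
corpus=$(mktemp)
printf '%s\n' \
  'H:x:x:x:x:x:x:6:7:x:x:x:L' \
  'H:x:x:x:x:x:x:6:7:x:x:x:L' \
  'H:x:x:x:x:x:x:6:8:x:x:x:L' \
  'D:x:x:x:x:x:x:6:7:x:x:x:L' > "$corpus"

# Hypothetical output file; Excel can open comma-separated files directly.
out=counts.csv
echo 'ninth_field,count' > "$out"

# Run the same search once per digit and append one row per search.
for digit in 6 7 8; do
    # grep exits with a nonzero status when nothing matches (it still prints 0);
    # "|| true" keeps the script going if it is run with set -e.
    count=$(grep -c "^H:.:.:.:.:.:.:6:${digit}:.:.:.:L" "$corpus" || true)
    echo "${digit},${count}" >> "$out"
done

cat "$out"
rm -f "$corpus"
```

Double quotes are used around the pattern here so that the shell fills in the value of the digit variable before grep sees the pattern. To search the real corpus, the temporary file would be replaced by $LING300/ppceme.cod.ooo.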