Tutorials and other links


Cat

In order to combine the .cod.ooo files for the two corpora into a single file in your account, say something like
cat $LING300/ppceme.cod.ooo $LING300/pceec.cod.ooo > both.cod.ooo
In your spreadsheet-generating programs, you can then specify both.cod.ooo as the file searched by your grep commands. Remember that the both.cod.ooo file is now in your account, not in $LING300.
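As a small illustration of what cat does here, the following sketch builds two tiny stand-in files and concatenates them. The file names and the coding strings in them are made up for the example; they are not taken from the corpora. The combined file simply contains the lines of the first file followed by the lines of the second.

```shell
# Work in a scratch directory so nothing in your account is touched.
tmpdir=$(mktemp -d)

# Two made-up stand-ins for the .cod.ooo files (contents are hypothetical).
printf 'H:a:b\nD:a:b\n' > "$tmpdir/first.cod.ooo"
printf 'K:a:b\n' > "$tmpdir/second.cod.ooo"

# Concatenate them into a single file, as with both.cod.ooo above.
cat "$tmpdir/first.cod.ooo" "$tmpdir/second.cod.ooo" > "$tmpdir/both.cod.ooo"

# The combined file has as many lines as the two inputs together.
combined_lines=$(wc -l < "$tmpdir/both.cod.ooo" | tr -d ' ')
echo "$combined_lines"
```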

Emacs

Text files on babel cannot be read or edited using Word or similar word processors. Instead, we will use a text editor called Emacs, which comes with a self-paced online tutorial that covers the basic commands. Once you've completed the tutorial, you can refresh your memory by googling "emacs" "cheat sheet".

To access the online tutorial, log on to your babel account.

Call up an Emacs window with emacs

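Here are two alternative ways to access the online tutorial from an open Emacs window. The first one is simpler, so try it first.

Type C-h t (hold down the Control key and press h, then release both and press t).

Type M-x help-with-tutorial and press Enter.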

If you get into trouble in an Emacs window (for instance, if it freezes up), a useful sequence to know is the "escape" sequence C-g.

Excel

Finding corpus examples

In general, the research for the syntax project involves tabulating the number of occurrences of various syntactic patterns, and you are not interested in the particular sentences that instantiate the patterns in question. At some point, however, you might wish to look at the sentences themselves, either to make sure that your searches are retrieving the examples you are interested in or to find illustrative examples for your paper. In order to do so, you will need to search the coded parsed corpora (the .cod files) in Emacs, rather than using grep from the command line to search only the coding strings (the .cod.ooo files). Open a coded parsed file with the following command, replacing corpusName.cod with the name of the corpus file (the two corpora are ppceme.cod and pceec.cod):

300
emacs corpusName.cod
In the .cod files, each coding string is embedded in the clause that it belongs with. In addition, the beginning of the .cod file contains the CorpusSearch coding query that generated the file. When you open a .cod file, you will see the query. Page past it, and eventually you will see the coded corpus.

Instead of paging down the file, you can use C-s to search for the string CODING (see the Emacs tutorial for further discussion of searches).

In addition to searching for ordinary strings, you can also search for regular expressions within an Emacs window (rather than from the command line). The command for regular expression searches in Emacs is Esc C-s (as opposed to C-s for ordinary searches). In an ordinary search, the last line in your Emacs window says "I-search:". In a regular expression search, the last line says "Regexp I-search:". If you need to escape from either type of search, say C-g.

The conventions regarding regular expression searches within Emacs are the same as for regular expression searches from the command line with grep. So searching for coding strings within a .cod file is essentially the same as searching the corresponding .cod.ooo file from the command line. However, since the coding strings in the .cod file are not at the beginning of a line, you shouldn't begin your regular expression searches within the .cod file with a caret to indicate beginning of line. Rather, you "anchor" your searches with the string (CODING. For instance, in order to search for all instances of the old grammar in negative sentences in the ppceme.cod file, your regular expression would be:

(CODING [DKV]
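The point about anchoring can also be tried out from the command line. The sketch below uses a single made-up line that imitates a coding string embedded in a clause; the exact layout of the real .cod files may differ, and the line is an assumption for illustration only. It compares a caret-anchored search with a search anchored on the literal string (CODING.

```shell
# A made-up line imitating how a coding string sits inside a .cod file
# (hypothetical; the layout of the real files may differ).
sample='  (CODING D:x:y) (IP-MAT (NP-SBJ he) (DOD did) (NEG not))'

# Anchoring with a caret finds nothing: the line does not start with the
# coding string. grep -c still prints 0 but exits with status 1, hence "|| true".
caret_hits=$(printf '%s\n' "$sample" | grep -c '^[DKV]' || true)

# Anchoring with the literal string "(CODING " finds the line.
coding_hits=$(printf '%s\n' "$sample" | grep -c '(CODING [DKV]' || true)

echo "caret: $caret_hits, (CODING: $coding_hits"
```

In a basic grep pattern, as in Emacs, an unescaped opening parenthesis is an ordinary literal character, which is why (CODING can be typed as is.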

Grep

The following instructions tell you how to search and count coding strings with the Linux command grep (short for "global regular expression print").

The general format of a grep command is:

grep search-pattern file(s)-to-be-searched
The search pattern (but not the input files to be searched) needs to be enclosed in (single or double) quotes, like this:
grep "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo

You can search both of the historical corpora available to you like this (the search below is slightly simplified in that it doesn't take into account the slight overlap between the two corpora; for more information, see the coding conventions).

grep "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo $LING300/pceec.cod.ooo

A grep search pattern can contain literal characters as well as special characters that form so-called regular expressions. The following table contains the regular expressions that you will need in order to search coding strings for the syntax project. For some of the examples, you will need to refer to the coding conventions that are used in the coded version of the corpus.

Regular expression Explanation
. Period stands for any single character (including itself)
[aeiouy] Square brackets enclose alternatives. The expression to the left matches the set of English vowels.
[a-e] For digits and letters, alternatives can be specified as ranges of characters. The expression on the left is another way of searching for [abcde].
[0-9], [a-z], [A-Z] Commonly used alternatives can be specified as ranges of characters. The expressions on the left match, respectively, a single digit, a single lowercase letter, a single uppercase letter.
[0-9a-z], [a-zA-Z0-9], [a-cg-im-os-t] Ranges can be combined. The first expression matches a single digit or lowercase letter. The second expression matches a single digit or any letter. As the third expression shows, the ranges that are combined can be any well-formed range.
^ A caret as the first character of a search string "anchors" the search string to the beginning of an input line. In other words, there is a difference between the following two commands.
grep 'D' file(s)-to-be-searched
grep '^D' file(s)-to-be-searched
The first command finds lines with D anywhere on an input line. Given the coding conventions in the coded parsed corpus, this would match negative sentences with main verb do, questions with main verb do, any coding string from a private diary, and a number of other sentence types - not a linguistically meaningful result! The second command finds lines with D as the first character on the input line. Given the coding conventions, this would match negative sentences with main verb do.

In order to find tokens from private diaries (regardless of their other properties), you'd say

grep '^.:.:.:.:.:.:.:.:.:.:.:.:D' $LING300/ppceme.cod.ooo

When a caret immediately follows a square bracket, it has an entirely different meaning. In that context, it negates the contents of the material in square brackets. For instance, given the coding conventions, all of the following searches are equivalent.

grep '^[DHK]' $LING300/ppceme.cod.ooo
grep '^[^BVbdhkv-]' $LING300/ppceme.cod.ooo
grep '^[^BVa-z-]' $LING300/ppceme.cod.ooo
$ A dollar sign "anchors" the search string to the end of an input line. In contrast to the caret, the dollar sign doesn't have two meanings depending on its context. You probably won't use the dollar sign, but I include it here for completeness.
* An asterisk after an expression indicates zero or more instances of that expression (that is, the expression may be absent or may be repeated any number of times).
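+ A plus sign after an expression indicates one or more instances of that expression (that is, at least one instance of that expression). With plain grep, the plus sign must be written \+ (a GNU extension) or the search must be run with grep -E; otherwise + is treated as a literal plus sign.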
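
To get a feel for the expressions in the table, you can try them on a handful of strings before running them on the corpus. The sketch below uses four made-up coding strings (not real corpus data) and a small helper function, count, which is a name invented for this example. LC_ALL=C is set so that ranges like [a-z] cover only lowercase letters regardless of your locale settings.

```shell
# Made-up coding strings, one per line (not real corpus data).
sample=$(printf '%s\n' 'D:1:a' 'H:2:b' 'k:3:D' '-:4:c')

# count PATTERN: number of sample lines matching the regular expression.
# grep -c exits with status 1 when the count is 0, hence "|| true".
count() { printf '%s\n' "$sample" | LC_ALL=C grep -c "$1" || true; }

count 'D'          # D anywhere on the line
count '^D'         # D as the first character
count '^[DHK]'     # first character is D, H, or K
count '^[^a-z-]'   # first character is neither a lowercase letter nor a hyphen
count '^.:[0-9]:'  # any first field, then a single digit in the second field
count 'D$'         # D as the last character
```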

Given the information above and the coding conventions for the coded parsed corpus, you can see that the search at the beginning of this section, repeated here for convenience, returns all the coding strings for clauses with main verb have and simple negation (rather than do support) in nonprivate letters from the 1670s.

grep "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo

Once we have ascertained that there are no errors in the coded corpus, we generally don't care about the coding strings themselves; we're just interested in the number of times that strings of a particular form occur. In order to count matches, you can invoke a so-called switch on grep. Instead of using the simple grep command, you use grep -c, like this:

grep -c "H:.:.:.:.:.:.:6:7:.:.:.:L" $LING300/ppceme.cod.ooo
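
One detail worth knowing: when grep -c is given more than one file, it prints a separate filename:count line for each file rather than a single total. The sketch below shows this with two made-up files (their names and contents are invented for the example) and also shows one way to get a combined count, by concatenating the files first.

```shell
tmpdir=$(mktemp -d)

# Two made-up files of coding strings (hypothetical contents).
printf 'H:6:7:L\nH:6:7:L\nD:6:7:L\n' > "$tmpdir/one.cod.ooo"
printf 'H:6:7:L\nK:5:2:D\n' > "$tmpdir/two.cod.ooo"

# One file: grep -c prints a bare count.
single=$(grep -c '^H:6:7:L' "$tmpdir/one.cod.ooo")

# Several files: grep -c prints one "filename:count" line per file.
per_file=$(grep -c '^H:6:7:L' "$tmpdir/one.cod.ooo" "$tmpdir/two.cod.ooo")

# For a single total, concatenate first and count the combined stream.
total=$(cat "$tmpdir/one.cod.ooo" "$tmpdir/two.cod.ooo" | grep -c '^H:6:7:L')

echo "$single"
echo "$per_file"
echo "$total"
```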

The output of grep -c can be entered into spreadsheets for further quantitative analysis. Obviously, this can be done by hand, but to save time and eliminate the possibility of input errors, you'll learn how to save your searches in a form that will allow you to import the results into your spreadsheet program. See Shell scripts for saving searches for more details.

IF statements in Excel

(1) gives the general form of an IF statement in Excel. When you type it into a cell, begin the formula with an equals sign (=). The spaces surrounding parens are for clarity; I think Excel doesn't care about them. Note the quotes around the second and third parts of the IF statement, which mark the character strings that Excel will insert depending on whether the test condition is met or not.

(1) IF ( condition, "condition_met", "else" )

(2) gives an example. You could have a column with this statement to divide the tokens into "early" tokens (before 1400) and "late" tokens (1400 and after).

(Here and in the following examples, "A" refers to the column for effective date of first attestation, "B" to the column for the number of syllables, and "C" to the column for stress.)

(2) IF ( A2<1400, "early", "late" )

The test condition can be complex. Often you'll impose conditions that involve an AND statement. Suppose you want to investigate the influence of effective date on stress shift. You could identify the tokens attested before 1100 that have retained the French stress with an IF statement like the following. First, the relevant AND statement:

(3) AND ( A2<1100, B2=2, C2=1 )

You could edit the earlier IF statement by pasting the AND statement in (3) over the condition in (2). You could then say "21" if the condition is met (mnemonic for 2-syllable word with stress on 1), and "-" for else.

(4) IF ( AND ( A2<1100, B2=2, C2=1 ), "21", "-" )

The IF statement in (5) would give you tokens with the same date of first attestation that have undergone stress shift. It differs from (4) only in the value tested for stress (C2=2) and in the label ("22").

(5) IF ( AND ( A2<1100, B2=2, C2=2 ), "22", "-" )

If you use simple IF statements like (4) and (5), you'll have to record the pre-1100 tokens that have undergone stress shift and the ones that haven't in separate columns. You can save space and record both in the same column by collapsing the two IF statements into a single recursive IF statement.

What I mean by an IF statement being recursive is that the "else" part can itself be an IF statement. This sounds simple, and it is simple, and yet nevertheless it is easy to get the parens all screwed up. In order to avoid tearing out your hair, try copying an IF statement into the clipboard and then pasting it over its own "else" statement in the formula bar. This will give you a legal IF statement that you can then edit to say what you want. For instance, you can take the IF statement in (4), repeated here as (6), and turn it into (7).

(6) IF ( AND ( A2<1100, B2=2, C2=1 ), "21", "-" )

(7) IF ( AND ( A2<1100, B2=2, C2=1 ), "21", IF ( AND ( A2<1100, B2=2, C2=1 ), "21", "-" ) )
In (7), the first IF is the original; the second IF, in the "else" position, is the pasted copy of the original.

In (8), I've edited the inner IF statement in (7) to give a recursive IF statement that lets you record the stress-shifted and non-stress-shifted tokens in a single column.

(8) IF ( AND ( A2<1100, B2=2, C2=1 ), "21", IF ( AND ( A2<1100, B2=2, C2=2 ), "22", "-" ) )
In (8), the first IF is the original; the second IF, in the "else" position, is the edited copy.

Of course, more complex examples with further embedded IF statements are possible. Older versions of Excel (2003 and earlier) allow at most 7 nested IF statements; Excel 2007 and later allow up to 64 levels of nesting. Notice that the number of final parens corresponds to the number of IFs.

Here are two examples. (9) is roughly analogous to (8), but for the three syllable case. (10) is a more general version for all polysyllabic words.

(9) IF ( AND ( A2<1100, B2=3, C2=1 ), "31", IF ( AND ( A2<1100, B2=3, C2=2 ), "32", IF ( AND ( A2<1100, B2=3, C2=3 ), "33", "-" ) ) )

(10) IF ( AND (A2<1100, B2>2, C2=1 ), "poly French", IF ( AND ( A2<1100, B2>2, C2=B2 ), "poly Germanic", IF ( AND (A2<1100, B2>2 ), "whoa", "-" ) ) )
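
If you find the nesting easier to follow as an if / else-if chain, here is the logic of (8) written as a small shell function. This is only an illustration of the control flow, not something Excel will accept; the function name classify and the sample values are made up for the example. The three arguments correspond to columns A, B, and C.

```shell
# classify DATE SYLLABLES STRESS: mirrors the recursive IF statement in (8).
classify() {
    if [ "$1" -lt 1100 ] && [ "$2" -eq 2 ] && [ "$3" -eq 1 ]; then
        echo "21"      # outer IF: condition met
    elif [ "$1" -lt 1100 ] && [ "$2" -eq 2 ] && [ "$3" -eq 2 ]; then
        echo "22"      # inner IF (the "else" of the outer IF): condition met
    else
        echo "-"       # "else" of the inner IF
    fi
}

classify 1050 2 1
classify 1050 2 2
classify 1250 2 1
```

Each additional embedded IF in Excel corresponds to one more elif branch here, which is why the number of closing parens in the Excel formula grows with the number of IFs.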

Linux/Unix

The operating system on babel is Linux, an open-source operating system modeled on Unix. In order to explore the online historical corpora that are stored on babel, you'll need to become familiar with some basic concepts and commands in Linux. You don't need to read all of the tutorials below. Pick the material that is written in a way that is most helpful to you, and please let me know of any other tutorials that you find helpful.

Meter and scanning

Query replace in Emacs

You can find and replace text in emacs (useful when writing shell scripts) by typing

M-x
This will be echoed in the minibuffer (the last line at the bottom of the Emacs window). In that line, type
query-replace
followed by Enter. You can then type the old text (the text you want to find and replace). Hit Enter and type the new text. The text will be replaced from where the cursor is onwards. You can replace text on a one-by-one basis by hitting SPACE (to replace) and "n" to skip on to the next instance, or you can hit "!" (exclamation point) to do a global replace.
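
If you want to make the same global replacement without opening Emacs, the command-line tool sed can do something similar. This is a different technique from query-replace (it is not interactive and replaces every match), and the file name and contents below are made up for the example.

```shell
tmpdir=$(mktemp -d)

# A made-up two-line script that searches ppceme.cod.ooo (hypothetical contents).
printf 'grep -c "^H" ppceme.cod.ooo\ngrep -c "^D" ppceme.cod.ooo\n' > "$tmpdir/myScript"

# Replace every occurrence of the old text with the new text and write
# the result to a new file, leaving the original untouched.
sed 's/ppceme\.cod\.ooo/both.cod.ooo/g' "$tmpdir/myScript" > "$tmpdir/myScript.new"

# Count the lines that now mention the new file name.
replaced=$(grep -c 'both\.cod\.ooo' "$tmpdir/myScript.new")
echo "$replaced"
```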

Shell scripts

Very often, it is convenient to save the various searches that you run on coded corpora with so-called shell scripts.

When you type commands at the keyboard, you are giving commands to the shell one by one. When you run (a.k.a. execute) a shell script, you are giving the commands in the script to the shell in a batch.

I've put together two sample shell scripts for you to edit and tailor to your own purposes. The first one executes a batch of grep searches and outputs the results so that you can enter them (by hand) into a spreadsheet program. The second one generates a file with results that you can import into a spreadsheet program directly. This obviates the need to type in the results, with its attendant risks of typos.

Sample shell script for executing a batch of grep commands

This shell script (batchGrep) illustrates how you can put together a list of searches that you want to perform, check them for typos, make sure that you're covering all the cases that you want to cover, remove and add searches, and so on.

Copy the script into your own account. If you want, you can make a separate directory to store this and other scripts. If so, you'll have to be in that directory to run the script, or you'll have to edit your .cshrc file to make it possible to run the script from other directories.


cp /htdocs/courses/Fall_2009/ling300/batchGrep .

You don't have to call your copy batchGrep. You could copy it and give it some different name, as illustrated below.


cp /htdocs/courses/Fall_2009/ling300/batchGrep myVeryOwnShellScript

Open the file in Emacs and get a sense of what it does.

emacs batchGrep

In order to run batchGrep, you'll have to turn it from an ordinary text file into an executable program (this is often called changing permissions on the file). To do this, close the file in Emacs. At the system prompt, type

chmod 755 batchGrep

You only need to use chmod once. Once the file is executable, it stays that way.

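From now on, you can run the script by simply typing its name. (If the shell answers that the command is not found, the current directory is not in your search path; type ./batchGrep instead.)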

batchGrep

You can save the output of the shell script by writing it to a file (as opposed to displaying it in the terminal window). The file is an ordinary file that can be viewed with Emacs or with more.

batchGrep > firstResults

Go back and forth between editing and running the script to learn how it works and to get it to do what you want.
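
To give a rough idea of the whole cycle (write a script, make it executable, run it, save the output), here is a minimal sketch. It is not the contents of batchGrep; the script name myBatch, the searches in it, and the sample data are all made up for the example.

```shell
tmpdir=$(mktemp -d)

# Made-up coding strings standing in for a .cod.ooo file.
printf 'H:6:7:L\nH:6:7:L\nD:6:7:L\n' > "$tmpdir/sample.cod.ooo"

# Write a tiny script; the name myBatch and its contents are hypothetical.
cat > "$tmpdir/myBatch" <<'EOF'
#!/bin/sh
# Each search is labeled with echo so the output is easy to read.
# The first argument is the file to search.
echo "first character H:"
grep -c '^H' "$1"
echo "first character D:"
grep -c '^D' "$1"
EOF

# Make the text file executable, then run it and save the output to a file.
chmod 755 "$tmpdir/myBatch"
"$tmpdir/myBatch" "$tmpdir/sample.cod.ooo" > "$tmpdir/firstResults"
cat "$tmpdir/firstResults"
```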

Sample shell script for generating a spreadsheet

Once you understand how shell scripts like batchGrep work, you can further expedite your work by writing shell scripts that generate files that can be directly imported into a spreadsheet program. You should use this second type of shell script because it eliminates the possibility of typos in transferring your data to your spreadsheet program. You can access a sample of such a script by typing
cp /htdocs/courses/Fall_2009/ling300/spreadsheetGenerator .

Once again, you can review and edit the file with Emacs and run it once you've changed its permissions with chmod. As with the earlier script, you can save the output by writing it to a file. For instance:

spreadsheetGenerator > aLittleSpreadsheet

You can use Emacs to view this file, too. But mostly, you will transfer it from babel to your laptop, and then import the file into your spreadsheet program.
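
For orientation, here is a minimal sketch of how a script might produce output that a spreadsheet program can import. It is not the contents of spreadsheetGenerator; the function name make_sheet, the comma-separated layout, the list of codes, and the sample data are assumptions made for the example.

```shell
tmpdir=$(mktemp -d)

# Made-up coding strings standing in for a .cod.ooo file.
printf 'H:6:7:L\nH:6:7:L\nD:6:7:L\n' > "$tmpdir/sample.cod.ooo"

# make_sheet FILE: print a header row, then one "code,count" row per code.
make_sheet() {
    echo "code,count"
    for code in D H K; do
        # grep -c still prints 0 when nothing matches but exits with
        # status 1; "|| true" keeps that from counting as a failure.
        n=$(grep -c "^$code" "$1" || true)
        echo "$code,$n"
    done
}

# Save the rows to a file that can be transferred and imported.
make_sheet "$tmpdir/sample.cod.ooo" > "$tmpdir/aLittleSpreadsheet.csv"
cat "$tmpdir/aLittleSpreadsheet.csv"
```

Writing the label and the count on the same line, separated by a comma, is what lets the spreadsheet program split each row into two columns on import.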

Transferring files

If you need to transfer files from babel (say, the results of a program that generates spreadsheet results) to your laptop, you can do so easily on a Windows machine once you download a program called FileZilla. See http://www.seas.upenn.edu/cets/answers/filezilla.html for more details, and let me know if you have problems.

On a Mac, no additional program is necessary. Just open a Terminal window (without logging on to babel) and use scp (= secure copy), replacing the placeholders (username and the pathname and filename expressions) as needed:

scp username@babel.ling.upenn.edu:pathnameOnBabel/filenameOnBabel pathnameOnMac/filenameOnMac

Example (copy with unchanged filename):
scp beatrice@babel.ling.upenn.edu:/home/beatrice/results Desktop/.

Example (copy and change filename at the same time):
scp beatrice@babel.ling.upenn.edu:/home/beatrice/results Desktop/moreResults

You can also transfer files in the other direction - for instance, if you want to compose and edit your shell scripts on your laptop in TextEdit rather than on babel in emacs. Transferring such scripts doesn't make them executable, so you'll have to turn them into executable scripts with chmod 755 as usual.

scp pathnameOnMac/filenameOnMac username@babel.ling.upenn.edu:pathnameOnBabel/filenameOnBabel

Example (copy with unchanged filename):
scp Desktop/myScript beatrice@babel.ling.upenn.edu:/home/beatrice/.

Example (copy and change filename; the chmod command is then run on babel, after logging on):
scp Desktop/myScript beatrice@babel.ling.upenn.edu:/home/beatrice/thisOneWorks
chmod 755 thisOneWorks

Word stress

Word stress in the examples below is indicated by an apostrophe preceding the stressed syllable.
Periods and equal signs indicate syllable and morpheme boundaries, respectively.
Unstressable syllables are in italics.

Germanic stress rule

Since English is a Germanic language, English words were originally subject to the Germanic stress rule in (1).

(1)     Germanic stress rule:
Germanic words consist of a stem, preceded and followed by optional unstressable affixes. Word stress falls on the stem-initial syllable.

(2) a. Stem-initial syllable is word-initial
'hea.ven
'lat.ter
'un.der.=ling
b. Stem-initial syllable is not word-initial
for.='get
mis.='take
mis.=un.der.='stand
un.der.='stand
with.='draw
c. Exceptions (very rare)
e.'le.ven

French stress rule

As a result of the Norman Conquest in 1066, many French words entered English, whose stress originally followed the rule in (3).

(3)     French stress rule:
In French, word stress falls on the final stressable syllable of the word. Syllables containing schwa are unstressable.
(4) a.   Final syllable is stressable
pen.'dant
tes.ta.'ment
b.   Final syllable containing schwa is unstressable
ad.ven.'tu.re
a.'zu.re
ma.ri.'a.ge

A complicating factor

Stress on verbs borrowed from French (and from Romance more generally) tends to remain on the stem, as illustrated in the morphologically related noun-verb pairs in (5). Given our focus on stress shift, we will exclude verbs from consideration for the purposes of the class.

    Noun Verb
(5) a.   'ad.mit ad-'mit
b.   'con.voy con-'vey (near-minimal)
c.   'in.cense in.'cense
d.   'per.mit per-'mit