Linguistics 300, F08, Assignment 6

Ongoing (M 9/29 - M 10/6)


Background on corpora

Parsed corpora

The historical corpora that we will be working with consist of parsed files that were constructed by Penn graduates both here at Penn and at York, England. Corpus construction begins with electronic texts. Fortunately, we did not have to type in all of our texts. Instead, we were able to take advantage of two already existing corpora of online texts: the diachronic part of the Helsinki Corpus, which includes texts from Old English to the early 1700s, and the Helsinki Corpus of Early English Correspondence, which includes letters from 1500 to about 1700. We extended the Early Modern English part of the Helsinki Corpus (1500-1712) to triple its original size, using texts by the same authors or texts as comparable as we could find. We typed in the additional texts by hand, because even now OCR software is not up to the accuracy we need, especially with older fonts. (While producing a follow-up corpus to the PPCEME, we learned that even professional data entry businesses still rely on human typists, deployed in clever ways, rather than on OCR software.) The extension resulted in about 1.8 million words of text from various genres. Following the Helsinki Corpus, we used short samples (about 2,000 words for most genres) so that we could sample more authors and genres. The Helsinki Corpus of Early English Correspondence, containing about 2.2 million words, was used as is. There is a slight overlap between the two corpora, so that together they contain slightly less than 4 million words. We will shortly give instructions for how to access a sample text from the PPCEME (alhatton).

In order to produce a parsed version of each text, we first ran each text through the best automatic part-of-speech tagger that we could find (the so-called Brill tagger, developed at the Penn Computer Science Department). State-of-the-art taggers like the Brill tagger are about 95% accurate, which sounds high until you realize that 95% accuracy means that 1 in 20 words is mistagged, resulting in 20-25 errors per typical printed page. So we had to hand-correct the tags on the output of the tagger. Later on, we give instructions for accessing the tagged version of alhatton.
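
The arithmetic behind the per-page estimate can be checked with a quick shell calculation. This is only a sketch: the page lengths of 400 and 500 words are assumptions chosen for illustration, since the text gives just the accuracy figure and the resulting range.

```shell
# 95% accuracy leaves a 5% error rate.
error_rate_percent=5

# Assumed page lengths in words (illustrative values, not from the corpus documentation).
for words_per_page in 400 500; do
    errors=$(( words_per_page * error_rate_percent / 100 ))
    echo "$words_per_page words per page: about $errors mistagged words"
done
```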

Finally, we ran the part-of-speech tagged files through the best automatic parser we could find (the so-called Collins/Bikel parser, once again developed here at the Penn Computer Science Department). Parsing is harder than part-of-speech tagging, both for people and for computers, and the output of even the best parsers is best regarded as a very rough draft. Once again, we hand-corrected the output of the relevant program - in this case, the parser. The historical corpora that we will be investigating are sets of such corrected parsed files.

Below are instructions for accessing the sample text alhatton at the three stages of linguistic annotation just described. Please take a look at the files, especially the parsed file, to get a sense of the data that we'll be investigating. The files are write-protected, so you don't have to worry about corrupting them.

The texts in the original Helsinki corpus from 1500-1710 were divided into three 70-year time periods. The e3 in the filename indicates that the text belongs to the third time period (1640-1710), and the h indicates that it belongs to the original Helsinki corpus (rather than to the text samples added at Penn).

  1. Log on to your babel account.
  2. Type 300
  3. You should be in the directory
    /home/migration/other/MIDENG/PPCEME/ling300-ppceme-coded-all
  4. Open an Emacs window with the text sample by typing
    emacs alhatton-e3-h.txt
    As usual, you can cut and paste the above command from the browser window into your terminal window.
  5. The part-of-speech tagged and the parsed versions of the file have the same basename (alhatton-e3-h), but different extensions (.pos for the tagged version, and .psd for the parsed version). So you can open an Emacs window with the tagged or parsed versions of the file by typing, respectively,
    emacs alhatton-e3-h.pos
    emacs alhatton-e3-h.psd
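
The filename conventions described above can be unpacked mechanically. The following is a minimal sketch using standard shell parameter expansion. The variable names are made up for illustration, and the date ranges for e1 and e2 are inferred from the statement that 1500-1710 is divided into three 70-year periods.

```shell
file="alhatton-e3-h.psd"

base="${file%.*}"        # strip the extension: alhatton-e3-h
ext="${file##*.}"        # keep only the extension: psd
text="${base%%-*}"       # text name: alhatton
rest="${base#*-}"        # everything after the first hyphen: e3-h
period="${rest%%-*}"     # time period code: e3
origin="${rest#*-}"      # origin code: h (original Helsinki corpus)

# Map the period code to its date range (e1 and e2 ranges are inferred).
case "$period" in
    e1) years="1500-1570" ;;
    e2) years="1570-1640" ;;
    e3) years="1640-1710" ;;
    *)  years="unknown" ;;
esac

# Map the extension to the stage of annotation.
case "$ext" in
    txt) stage="plain text" ;;
    pos) stage="part-of-speech tagged" ;;
    psd) stage="parsed" ;;
    *)   stage="unknown" ;;
esac

echo "$text: period $period ($years), origin $origin, $stage"
```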

Coded corpora

In order to be useful, historical corpora, especially parsed ones, need to be big. But being big, they can't be searched "by hand" (that is, by reading through the files and making note of or counting the examples of interest). In fact, even with a small corpus, this wouldn't be a good idea, since manual searches are very error-prone. Instead, we use automatic methods to search the corpora. When historical parsed corpora first started being constructed at Penn in the mid-1990s, there were no search programs powerful enough to perform the kinds of searches we needed to perform to do research in historical syntax. So we commissioned our own search program, CorpusSearch, which allows users to search parsed (and tagged) corpora. In addition, CorpusSearch facilitates corpus construction by providing a graphical user interface to correct the output of a parser and by enabling the automatic revision of already parsed corpora. CorpusSearch also allows us to convert an ordinary parsed corpus into one with coding strings of the type familiar from sociolinguistic research (more on that in a second). As far as I know, there is only one other program in the world that is comparable. It is called TigerSearch, and it allows users to search and construct corpora, but it doesn't include automatic revision (or at least not revisions as powerful as CorpusSearch's) or the coding feature. Work is currently underway at Penn to devise programs that will allow users to search the historical corpora on the web, but they are not yet available. That is why we've given you babel accounts; for the moment, it's the only way to access and search the historical corpora.

In principle, it would be possible for this class to search the historical corpora using the ordinary parsed files (like the alhatton-e3-h.psd file that you looked at earlier). However, even relatively simple searches of interest to us quickly become so complex that they aren't convenient to implement even for expert users of CorpusSearch. So instead of searching the ordinary parsed files, we will search the corpora in coded format.

In order to generate the coded corpora, I used a special type of CorpusSearch query called a coding query. Ordinary CorpusSearch queries retrieve only a single sort of example from a corpus (say, all negative sentences with main verb have), and their output is a list of the sentences in the corpus that match the query. By contrast, coding queries can contain many different queries, and the results of each query are expressed in a coding string of the type familiar from multivariate statistical analysis in sociolinguistics (VARBRUL). The CorpusSearch coding query that I wrote associates each finite clause in the corpus with 15 properties of interest, both internal (related to the syntactic structure of the clause) and external (related to the sociolinguistic properties of the author and text). The meanings of the 15 columns in the coding strings and their possible values are explained in Coding conventions.

You can access the coded versions of the PPCEME and the PCEEC as follows:

  1. Log on to babel.
  2. Go to the class directory by typing 300
  3. Access the coded version of the PPCEME by typing emacs ppceme.cod
  4. Access the coded version of the PCEEC by typing emacs pceec.cod

Both of the coded parsed files are huge, since they each contain about 2 million words of text, a part-of-speech tag for every word, and syntactic annotation over and above that - not to forget the coding strings. Emacs will ask you if you really want to open them. Respond with y for "yes".

Start by looking at the coded parsed version of the PPCEME (emacs ppceme.cod). At the very beginning of the file is the (very long) CorpusSearch query that I wrote in order to generate this particular coded version of the parsed corpus. Right after the query comes the alhatton text that you looked at earlier. You can find the text easily (without having to scroll down screen by screen) by searching for the string CODING. As you'll see, the coding strings are interspersed among the parsed sentences of the text. See Finding corpus examples for more detailed discussion of how to navigate the coded parsed corpora.

When performing searches on a coded corpus, we are sometimes interested in seeing the actual sentences that match our search criteria. But more often than not, we aren't interested in the individual sentences themselves, but only in how many times a certain type occurs. To facilitate the latter type of search, CorpusSearch allows users to extract the coding strings from the coded corpus. The resulting files consist simply of all the coding strings in a coded corpus. By convention, these files have the same name as the coded parsed corpora from which they are derived, but with an additional .ooo extension. You can take a look at the relevant files, which are not intended for regular human consumption, by typing
    emacs ppceme.cod.ooo
    emacs pceec.cod.ooo
Since these files are very much smaller than the corresponding .cod files, Emacs won't give you the warning about large file size.
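
Because a file of this kind contains nothing but coding strings, simple counts fall out of standard line-oriented tools. The sketch below uses a toy stand-in file rather than the real corpus. It assumes one coding string per line with colon-separated columns, and uses three made-up columns instead of the actual 15; check the real layout against the coding conventions before relying on it.

```shell
# Toy stand-in for a .ooo file: hypothetical 3-column coding strings, one per line.
toy=$(mktemp)
cat > "$toy" <<'EOF'
a:x:1
b:y:2
a:y:2
b:x:3
a:y:3
EOF

# With one coding string per clause, counting lines counts the coded clauses.
total=$(wc -l < "$toy" | tr -d ' ')
echo "coded clauses: $total"

# Tabulate how often each value occurs in the second column.
cut -d: -f2 "$toy" | sort | uniq -c

# How many distinct values the second column takes.
distinct=$(cut -d: -f2 "$toy" | sort -u | wc -l | tr -d ' ')

rm -f "$toy"
```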

First searches

As mentioned earlier, coding queries encode the results of potentially very many ordinary CorpusSearch queries as a single string of characters. In order to query the resulting coding strings, we need a search program that is dedicated to searching strings (rather than parsed structures). Luckily, Linux (babel's operating system) provides a command of exactly the right sort. It is called grep, a name that comes from the ed editor command g/re/p ('globally search for a regular expression and print the matching lines'), and not surprisingly, you need to learn a bit about regular expressions before you can use it. See Searching and counting with grep for a summary of the regular expressions that you will use.
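
To give a flavor of what such searches look like, here is a minimal sketch that counts matching coding strings with grep -c. The data are a toy file with three made-up colon-separated columns, not the real 15-column strings, and the column values are hypothetical. The regular expressions themselves use only standard grep syntax.

```shell
# Toy coding strings (hypothetical format: three colon-separated columns).
toy=$(mktemp)
cat > "$toy" <<'EOF'
a:x:1
b:y:2
a:y:2
b:x:3
a:y:3
EOF

# Strings whose first column is "a": ^ anchors the match at the start of the line.
first_a=$(grep -c '^a:' "$toy")

# Strings whose second column is "y", whatever the first column holds.
# [^:]* matches a run of non-colon characters, that is, one whole column.
second_y=$(grep -c '^[^:]*:y:' "$toy")

# Strings with "a" in column 1 and either 1 or 3 in column 3.
# [13] matches a single character from the set; $ anchors at the end of the line.
a_with_1_or_3=$(grep -c '^a:[^:]*:[13]$' "$toy")

echo "first column a: $first_a"
echo "second column y: $second_y"
echo "a with 1 or 3: $a_with_1_or_3"

rm -f "$toy"
```

Anchoring each pattern with ^ and spelling out every column up to the one of interest keeps a value in one column from accidentally matching the same character in another.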

Once you're somewhat familiar with the coding conventions and with grep, you can use your knowledge to answer the following sample questions.

A "live" search

Warner 2005 suggests that the decline in do support in the early 1600s is due to a stigmatization of the form do not (this would be comparable to the well-known stigmatization of ain't). An alternative worth considering and mentioned by Warner himself (2005:277) is that what was stigmatized was not do not, but rather clitic negation (n't). Warner was unable to test the two hypotheses on his data (Ellegård's corpus) because they included only sentences that might have contained do, but not ones with other auxiliary verbs or modals. Is there enough evidence in the PPCEME and/or PCEEC to decide between the two alternative hypotheses?

If there is enough evidence in the corpora to decide between the alternative hypotheses, this issue would make a good topic for your first class project.