Linguistics 300, F09, Assignment 9

In this assignment, you will be conducting some first searches of the parsed historical corpora that have been constructed at Penn and York, England. Before a description of the searches, I provide some background information about the parsed corpora that form the empirical basis of the work.


Background

Parsed corpora

The historical corpora that we will be working with consist of parsed files that were constructed by researchers both here at Penn and at York, England. Corpus construction begins with electronic texts. Fortunately, we did not have to type in all of our texts. Instead, we were able to take advantage of two existing corpora of electronic texts: the diachronic part of the Helsinki Corpus, which includes texts from Old English to the early 1700s, and the Helsinki Corpus of Early English Correspondence, which includes letters from 1500 to about 1700.

We extended the Early Modern English part of the Helsinki Corpus (1500-1712) to triple its original size, using texts by the same authors or texts as comparable as we could find. Following the Helsinki Corpus, we used short samples (about 2,000 words for most genres) so that we could sample more authors and genres. This resulted in about 1.8 million words of text of various genres. We typed in the additional texts ourselves, since OCR software is, even now, not accurate enough for our purposes, especially with older fonts. (In producing a followup corpus to the PPCEME, we learned that even professional data entry businesses still rely on human typists, used in clever ways, rather than on OCR software.) The Helsinki Corpus of Early English Correspondence, containing about 2.2 million words, was used as is. There is a slight overlap between the two corpora, so that together they contain slightly less than 4 million words.

In order to produce a parsed version of each text, we first ran each text through the best automatic part-of-speech tagger that we could find (the so-called Brill tagger, developed at the Penn Computer Science Department). State-of-the-art taggers like the Brill tagger are about 95% accurate, which sounds high until you realize that 95% accuracy means that 1 in 20 words is mistagged, resulting in 20-25 errors per typical printed page. So we had to hand-correct the tags on the output of the tagger.
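
The arithmetic behind the "20-25 errors per page" figure can be checked with a short shell sketch. It assumes a typical printed page holds roughly 400 to 500 words; that page size is an assumption made here for illustration, not a figure from the corpus documentation.

```shell
# Expected number of mistagged words on a page.
# Usage: errors_per_page WORDS_PER_PAGE ERROR_RATE_PERCENT
errors_per_page() {
    echo $(( $1 * $2 / 100 ))
}

# 95% accuracy means an error rate of 5%.
# The page sizes 400 and 500 are assumed values.
errors_per_page 400 5   # prints 20
errors_per_page 500 5   # prints 25
```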

Finally, we ran the part-of-speech tagged files through the best automatic parser we could find (the so-called Collins/Bikel parser, once again developed here at the Penn Computer Science Department). Parsing is harder than part-of-speech tagging, both for people and for computers, and the output of even the best parsers is best regarded as a very rough draft. As with the output of the tagger, we once again corrected the output of the parser, using a mixture of hand-correction and semi-automated techniques based on CorpusSearch (see below). The historical corpora that we will be investigating are simply collections of these corrected parsed files.

Below are instructions for accessing the sample text alhatton at the three stages of linguistic annotation just described. Please take a look at the files, especially the parsed file, to get a sense of the data that we'll be investigating. The files are write-protected, so you don't have to worry about corrupting them.

The texts in the original Helsinki Corpus from 1500-1710 were divided into three 70-year time periods. In the filenames used below (for example, alhatton-e3-h.txt), the e3 indicates that the text belongs to the third time period (1640-1710), and the h indicates that it belongs to the original Helsinki Corpus (rather than to the text samples added at Penn).

  1. Log on to your babel account.
  2. Go to the class directory by typing 300
  3. You should now be in the directory
    /home/migration/other/MIDENG/PPCEME/ling300-ppceme-coded-all
  4. Open an Emacs window with the text sample by typing
    emacs alhatton-e3-h.txt
    As usual, you can cut and paste the above command from the browser window into your terminal window.
  5. The part-of-speech tagged and the parsed versions of the file have the same basename (alhatton-e3-h), but different extensions (.pos for the tagged version, and .psd for the parsed version). So you can open an Emacs window with the tagged or parsed versions of the file by typing, respectively,
    emacs alhatton-e3-h.pos
    emacs alhatton-e3-h.psd
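
If you would rather glance at all three versions without opening an editor, something like the following sketch should work from the class directory. The function name peek_versions is made up for this illustration; head is a standard Linux command that prints the first lines of a file.

```shell
# Print the first few lines of each annotation stage of one text.
# Usage: peek_versions BASENAME [LINES]
# (peek_versions is a hypothetical helper, not part of CorpusSearch.)
peek_versions() {
    base=$1
    lines=${2:-5}
    for ext in txt pos psd; do
        file="$base.$ext"
        if [ -r "$file" ]; then
            echo "== $file =="
            head -n "$lines" "$file"
        else
            echo "== $file: not found or not readable =="
        fi
    done
}

# Run this from the class directory on babel.
peek_versions alhatton-e3-h 5
```

Because head only reads the files, this is as safe as opening the write-protected files in Emacs.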

Coded corpora

In order to be useful, historical corpora, especially parsed ones, need to be big. But being big, they can't be searched "by hand" (that is, by reading through the files and noting or counting the examples of interest). In fact, even with a small corpus, this wouldn't be a good idea, since manual searches are very error-prone. Instead, we use automatic methods to search the corpora. When historical parsed corpora first started being constructed at Penn in the 1990s, there were no search programs powerful enough to implement the kinds of searches we routinely want to perform in historical syntax. So we commissioned our own search program, CorpusSearch, which allows users to search parsed corpora. In addition, CorpusSearch facilitates corpus construction by providing a graphical user interface for correcting the output of a parser and by enabling the automatic revision of already parsed corpora. CorpusSearch also allows us to convert an ordinary parsed corpus into one with coding strings of the type familiar from sociolinguistic research (more on that below). As far as I know, there is only one other program in the world that is comparable. It is called TigerSearch, and it allows users to search and construct corpora, but it doesn't include automatic revision (at least not revisions as powerful as those of CorpusSearch) or the coding feature. Work is currently underway at Penn to devise programs that will allow users to search the historical corpora on the web, but they are not yet available. That is why we've given you babel accounts; for the moment, it's the only way to access and search the historical corpora that we have constructed.

In principle, it would be possible for this class to search the historical corpora using the ordinary parsed files (like the alhatton-e3-h.psd file that you looked at earlier). However, even relatively simple searches of interest to us quickly become so complex that they aren't convenient to implement even for expert users of CorpusSearch. So instead of searching the ordinary parsed files, we will search coded versions of the corpora.

In order to generate the coded corpora, I used a special type of CorpusSearch query called a coding query. Ordinary CorpusSearch queries retrieve only a single sort of example from a corpus (say, all negative declarative sentences with main verb have), and their output is a list of the sentences in the corpus that match the query. By contrast, coding queries can contain many different queries, and the results of each query are expressed in a coding string of the type familiar from sociolinguistics. The CorpusSearch coding query that I wrote associates each finite clause in the corpus with about 15 properties of interest, both internal (related to the syntactic structure of the clause) and external (related to the sociolinguistic properties of the author and text). The meanings of the columns in the coding strings and their possible values are explained in Coding conventions.

You can access the coded versions of the PPCEME and the PCEEC as follows:

  1. Log on to babel.
  2. Go to the class directory by typing 300
  3. Access the coded version of the PPCEME by typing emacs ppceme.cod
  4. Access the coded version of the PCEEC by typing emacs pceec.cod

Both of the coded parsed files are huge, since each contains about 2 million words of text, a part-of-speech tag for every word, and syntactic annotation over and above that - not to mention the coding strings. Emacs will ask you if you really want to open the file. Respond with y for "yes".

Start by looking at the coded parsed version of the PPCEME (emacs ppceme.cod). At the very beginning of the file is the (very long) CorpusSearch query that I wrote in order to generate this particular coded version of the parsed corpus. Right after the query comes the alhatton text that you looked at earlier. You can find the text easily (without having to scroll down screen by screen) by searching for the string CODING. As you'll see, the coding strings are interspersed among the parsed sentences of the text.
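
Inside Emacs, C-s starts an incremental search, so typing C-s followed by CODING jumps to the next occurrence of the string. The same lookup can be sketched from the command line with grep, whose -n option prefixes each matching line with its line number. The function name first_coding_lines is made up for this illustration.

```shell
# Show the first few lines of a file that contain the string CODING,
# each prefixed with its line number.
# Usage: first_coding_lines FILE [COUNT]
# (first_coding_lines is a hypothetical helper name.)
first_coding_lines() {
    grep -n 'CODING' "$1" | head -n "${2:-3}"
}

# Run this from the class directory on babel.
if [ -r ppceme.cod ]; then
    first_coding_lines ppceme.cod 3
fi
```

The line numbers tell you roughly how far into the file the query ends and the corpus text begins.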

When performing searches on a coded corpus, we are sometimes interested in seeing the actual sentences that match our search criteria (see Finding corpus examples for instructions if you ever want to do that). But more often than not, we aren't interested in the individual sentences themselves, but only in how many times a certain type occurs. To facilitate the latter type of search, CorpusSearch allows users to extract the coding strings from the coded corpus. The resulting files contain nothing but the coding strings of a coded corpus. By convention, these files have the same name as the coded parsed corpora from which they are derived, but with an additional .ooo extension. You can take a look at the relevant files, which are not intended for regular human consumption, by typing
    emacs ppceme.cod.ooo
    emacs pceec.cod.ooo
Since these files are much smaller than the corresponding .cod files, Emacs won't give you the warning about large file size.
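
To see the size difference for yourself, a sketch along the following lines reports the size of each file in bytes and lines, using the standard wc command. The function name size_report is made up here. Treating the line count of a .ooo file as the number of coding strings assumes one coding string per line, which is an assumption rather than something stated above.

```shell
# Report the size of each file in bytes and in lines.
# Usage: size_report FILE...
# (size_report is a hypothetical helper name.)
size_report() {
    for file in "$@"; do
        if [ -r "$file" ]; then
            bytes=$(wc -c < "$file" | tr -d ' ')
            lines=$(wc -l < "$file" | tr -d ' ')
            echo "$file: $bytes bytes, $lines lines"
        else
            echo "$file: not found or not readable"
        fi
    done
}

# Run this from the class directory on babel.
size_report ppceme.cod ppceme.cod.ooo
```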


Assignment

As mentioned earlier, coding queries encode the results of potentially very many ordinary CorpusSearch queries as a single string of characters. In order to query the resulting coding strings, we need a search program that is dedicated to searching strings (rather than parsed structures). Luckily, Linux (babel's operating system) provides a command of exactly the right sort. It is called grep (the name comes from the editor command g/re/p, 'globally search for a regular expression and print'), and not surprisingly, you need to learn a bit about regular expressions before you can use it. See the tutorial on grep for a summary of the regular expressions that you will use.
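
As a rough illustration of the kind of counting that grep makes possible, here is a minimal sketch. The coding strings in it are made up, and the layout (columns separated by colons, one string per line) is an assumption; the real columns and values are the ones given in Coding conventions. The -c option makes grep print the number of matching lines instead of the lines themselves.

```shell
# Count the lines of a file that match a regular expression.
# Usage: count_matches PATTERN FILE
# (count_matches is a hypothetical helper name.)
count_matches() {
    grep -c -- "$1" "$2"
}

# Made-up coding strings, NOT taken from the corpus.
toy=$(mktemp)
cat > "$toy" <<'EOF'
n:d:e3
n:q:e3
a:d:e1
a:d:e3
EOF

count_matches '^n:' "$toy"         # first column is n: prints 2
count_matches ':e3$' "$toy"        # last column is e3: prints 3
count_matches '^[^:]*:d:' "$toy"   # second column is d: prints 3
rm -f "$toy"
```

In these patterns, ^ anchors the match at the start of the line, $ anchors it at the end, and [^:]* matches any run of characters other than a colon, which is what lets a pattern pick out a particular column.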

Once you're somewhat familiar with the coding conventions and grep, you can address questions like the following.