In this assignment, you will be conducting some first searches of the parsed historical corpora that have been constructed at Penn and York, England. Before a description of the searches, I provide some background information about the parsed corpora that form the empirical basis of the work.
The historical corpora that we will be working with consist of parsed files that were constructed by researchers both here at Penn and at York, England. Corpus construction begins with electronic texts. Fortunately, we did not have to type in all of our texts. Instead, we were able to take advantage of two existing corpora of online texts: the diachronic part of the Helsinki Corpus, which includes texts from Old English to the early 1700s, and the Helsinki Corpus of Early English Correspondence, which includes letters from 1500 to about 1700.
We extended the Early Modern English part of the Helsinki Corpus (1500-1712) to triple its original size, using texts by the same authors or texts as comparable as we could find. We typed in the additional texts, because even now OCR software is not up to the accuracy we need, especially with older fonts. (While producing a follow-up corpus to the PPCEME, the Penn-Helsinki Parsed Corpus of Early Modern English, we learned that even professional data entry businesses still use human typists in clever ways rather than OCR software.) The resulting extended corpus contains about 1.8 million words of text of various genres. Following the Helsinki Corpus, we used short samples (about 2,000 words for most genres) so that we could sample more authors and genres. The Helsinki Corpus of Early English Correspondence, containing about 2.2 million words, was used as is. There is a slight overlap between the two corpora, so that together they contain slightly less than 4 million words.
In order to produce a parsed version of each text, we first ran each text through the best automatic part-of-speech tagger that we could find (the so-called Brill tagger, developed at the Penn Computer Science Department). State-of-the-art taggers like the Brill tagger are about 95% accurate, which sounds high until you realize that 95% accuracy means that 1 in 20 words is mistagged, resulting in 20-25 errors per typical printed page. So we had to hand-correct the tags on the output of the tagger.
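The arithmetic behind that estimate can be checked in the shell. The sketch below assumes a printed page of 400 to 500 words, a figure inferred from the error counts above rather than taken from any corpus documentation; the function name expected_errors is made up for illustration.

```shell
#!/bin/sh
# expected_errors WORDS ACCURACY_PERCENT
# Prints the expected number of mistagged words, using integer arithmetic.
expected_errors() {
    words=$1
    accuracy=$2
    echo $(( words * (100 - accuracy) / 100 ))
}

expected_errors 400 95   # shorter page (assumed length): prints 20
expected_errors 500 95   # longer page (assumed length): prints 25
```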
Finally, we ran the part-of-speech tagged files through the best
automatic parser we could find (the so-called Collins/Bikel parser, once
again developed here at the Penn Computer Science Department). Parsing
is harder than part-of-speech tagging, both for people and for
computers, and the output of even the best parsers should be treated as a
very rough draft. As with the output of the tagger, we
corrected the output of the parser, using a mixture of hand-correction
and semi-automated techniques based on CorpusSearch (see below). The
historical corpora that we will be investigating are simply collections
of these corrected parsed files.
Below are instructions for accessing the sample text
alhatton at the three stages of linguistic annotation just
described. Please take a look at the files, especially the parsed file,
to get a sense of the data that we'll be investigating. The files are
write-protected, so you don't have to worry about corrupting them.
The texts in the original Helsinki Corpus from 1500-1710 were divided into three 70-year time periods. The e3 in a filename such as alhatton-e3-h.psd indicates that the text belongs to the third time period (1640-1710), and the h indicates that it belongs to the original Helsinki Corpus (rather than to the text samples added at Penn).
In principle, it would be possible for this class to search the historical corpora using the ordinary parsed files (like the alhatton-e3-h.psd file that you looked at earlier). However, even relatively simple searches of interest to us quickly become so complex that they aren't convenient to implement even for expert users of CorpusSearch. So instead of searching the ordinary parsed files, we will search coded versions of the corpora.
In order to generate the coded corpora, I used a special type of CorpusSearch query called a coding query. Ordinary CorpusSearch queries retrieve only a single sort of example from a corpus (say, all negative declarative sentences with main verb "have"), and their output is a list of the sentences in the corpus that match the query. By contrast, coding queries can contain many different queries, and the results of each query are expressed in a coding string of the type familiar from sociolinguistics. The CorpusSearch coding query that I wrote associates each finite clause in the corpus with about 15 properties of interest, both internal (related to the syntactic structure of the clause) and external (related to the sociolinguistic properties of the author and text). The meanings of the columns in the coding strings and their possible values are explained in Coding conventions.
You can access the coded versions of the PPCEME and the PCEEC as follows:
Both of the coded parsed files are huge, since they each contain about 2 million words of text, a part-of-speech tag for every word, and syntactic annotation over and above that - not to forget the coding strings. Emacs will ask you if you really want to open the file. Respond with y for "yes".
Start by looking at the coded parsed version of the PPCEME (emacs ppceme.cod). At the very beginning of the file is the (very long) CorpusSearch query that I wrote in order to generate this particular coded version of the parsed corpus. Right after the query comes the alhatton text that you looked at earlier. You can find the text easily (without having to scroll down screen by screen) by searching for the string CODING. As you'll see, the coding strings are interspersed among the parsed sentences of the text.
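Outside Emacs, the same jump can be approximated with grep. What follows is a minimal sketch that runs on a tiny stand-in file: the file contents and bracketing are invented for illustration and only mimic the general shape described above (a query preamble followed by parsed sentences carrying coding strings). The helper name first_coding_line is hypothetical.

```shell
#!/bin/sh
# Build a toy stand-in for a coded parsed file. The contents are invented.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/* query preamble would appear here */
node: IP*
( (IP-MAT (CODING m:x:e3)
          (NP-SBJ (PRO I))
          (VBP thank)
          (NP-OB1 (PRO you))))
( (IP-MAT (CODING m:n:e3)
          (NP-SBJ (PRO I))
          (DOP do)
          (NEG not)
          (VB know)))
EOF

# first_coding_line FILE
# Prints the line number of the first line containing the string CODING.
first_coding_line() {
    grep -n 'CODING' "$1" | head -n 1 | cut -d: -f1
}

first_coding_line "$sample"   # prints 3 for the toy file above
rm -f "$sample"
```

On the real file, the reported line number could then be handed to Emacs, which accepts a +N argument before the filename to open the file at line N.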
When performing searches on a coded corpus, we are sometimes
interested in seeing the actual sentences that match our search criteria
(see Finding corpus examples
for instructions if you ever want to do that). But more often than not,
we aren't interested in the individual sentences themselves, but only in
how many times a certain type occurs. To facilitate the latter type of
search, CorpusSearch allows users to extract the coding strings from the
coded corpus. The resulting files are simply all the coding strings in
a coded corpus. By convention, these files have the same name as the
coded parsed corpora from which they are derived, but with an additional
.ooo extension. You can take a look at the relevant files,
which are not intended for regular human consumption, by typing
emacs ppceme.cod.ooo
emacs pceec.cod.ooo
Since these files are very much smaller than the corresponding
.cod files, Emacs won't give you the warning about large file size.
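If each coding string sits on its own line, counting a clause type reduces to counting matching lines. The sketch below is only an illustration under that assumption: it uses colon-separated columns with invented values (the real columns and values are the ones given in the coding conventions), and both the toy file and the helper count_matches are hypothetical.

```shell
#!/bin/sh
# Toy stand-in for a .cod.ooo file: one coding string per line (assumed layout).
# The three columns are invented: clause type, negation, time period.
ooo=$(mktemp)
cat > "$ooo" <<'EOF'
m:n:e3
m:p:e3
s:n:e2
m:n:e1
s:p:e3
EOF

# count_matches PATTERN FILE
# Prints the number of lines in FILE that match the regular expression PATTERN.
count_matches() {
    grep -c -- "$1" "$2"
}

count_matches '^m:n:' "$ooo"   # first column m, second column n: prints 2
count_matches ':e3$' "$ooo"    # last column e3: prints 3
rm -f "$ooo"
```

Anchoring the pattern with ^ or $ ties it to a particular column, so that a value is not matched by accident in a different column.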
Once you're somewhat familiar with the coding conventions and grep, you can address questions like the following.
Take a look at shell scripts for saving searches.
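As a rough illustration of the idea, a search can be wrapped in a small function that records both the pattern and its count in a results file. This is a sketch under assumptions, not the course's own script: the function name save_search, the tab-separated output layout, and the toy input are all invented.

```shell
#!/bin/sh
# save_search PATTERN INPUT OUTPUT
# Appends one line "PATTERN<TAB>COUNT" to OUTPUT, where COUNT is the number
# of lines in INPUT that match PATTERN.
save_search() {
    # grep -c exits with status 1 when nothing matches but still prints 0,
    # so "|| true" keeps the function usable under "set -e".
    count=$(grep -c -- "$1" "$2" || true)
    printf '%s\t%s\n' "$1" "$count" >> "$3"
}

# Demonstration on a toy file of invented coding strings.
workdir=$(mktemp -d)
printf 'm:n:e3\nm:p:e3\ns:n:e2\n' > "$workdir/toy.ooo"
save_search '^m:' "$workdir/toy.ooo" "$workdir/results.txt"
save_search ':n:' "$workdir/toy.ooo" "$workdir/results.txt"
cat "$workdir/results.txt"
rm -rf "$workdir"
```

Keeping the pattern next to its count makes the results file self-documenting, which helps when several searches are run in one session.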