The historical corpora that we will be working with consist of parsed files constructed by Penn graduates both here at Penn and at York, England. Corpus construction begins with electronic texts. Fortunately, we did not have to type in all of our texts. Instead, we were able to take advantage of two existing corpora of online texts: the diachronic part of the Helsinki Corpus, which includes texts from Old English to the early 1700s, and the Helsinki Corpus of Early English Correspondence, which includes letters from 1500 to about 1700. We extended the Early Modern English part of the Helsinki Corpus (1500-1712) to triple its original size, using texts by the same authors or texts as comparable as we could find. We typed in the additional texts ourselves; even now, OCR software does not achieve the accuracy we need, especially with older fonts. In connection with producing a followup corpus to the PPCEME, we learned that even professional data entry businesses still rely on human typists in clever ways rather than on OCR software. The result was about 1.8 million words of text of various genres. Following the Helsinki Corpus, we used short samples (about 2,000 words for most genres) so as to allow us to sample more authors and genres. The Helsinki Corpus of Early English Correspondence, containing about 2.2 million words, was used as is. There is a slight overlap between the two corpora, so together they contain slightly less than 4 million words. We will shortly give instructions for how to access a sample text from the PPCEME (alhatton).
In order to produce a parsed version of each text, we first ran each text through the best automatic part-of-speech tagger that we could find (the so-called Brill tagger, developed at the Penn Computer Science Department). State-of-the-art taggers like the Brill tagger are about 95% accurate, which sounds high until you realize that 95% accuracy means that 1 in 20 words is mistagged, resulting in 20-25 errors per typical printed page. So we had to hand-correct the tags on the output of the tagger. Later on, we give instructions for accessing the tagged version of alhatton.
Finally, we ran the part-of-speech tagged files through the best
automatic parser we could find (the so-called Collins/Bikel parser, once
again developed here at the Penn Computer Science Department). Parsing
is harder than part-of-speech tagging, both for people and for
computers, and the output of even the best parsers is best regarded as a
very rough draft. Once again, we hand-corrected the output of the
relevant program - in this case, the parser. The historical corpora
that we will be investigating are sets of such corrected parsed files.
Below are instructions for accessing the sample text
alhatton at the three stages of linguistic annotation just
described. Please take a look at the files, especially the parsed file,
to get a sense of the data that we'll be investigating. The files are
write-protected, so you don't have to worry about corrupting them.
Note: The texts in the original Helsinki corpus from 1500-1710 were divided into three 70-year time periods. The e3 in the filename indicates that the text belongs to the third time period (1640-1710), and the h indicates that it belongs to the original Helsinki corpus (rather than to the text samples added at Penn).
In principle, it would be possible for this class to search the historical corpora using the ordinary parsed files (like the alhatton-e3-h.psd file that you looked at earlier). However, even relatively simple searches of interest to us quickly become so complex that they aren't convenient to implement even for expert users of CorpusSearch. So instead of searching the ordinary parsed files, we will search the corpora in coded format.
In order to generate the coded corpora, I used a special type of CorpusSearch query called a coding query. Ordinary CorpusSearch queries retrieve only a single sort of example from a corpus (say, all negative sentences with main verb have), and their output is a list of the sentences in the corpus that match the query. By contrast, coding queries can contain many different queries, and the results of each query are expressed in a coding string of the type familiar from multivariate statistical analysis in sociolinguistics (VARBRUL). The CorpusSearch coding query that I wrote associates each finite clause in the corpus with 15 properties of interest, both internal (related to the syntactic structure of the clause) and external (related to the sociolinguistic properties of the author and text). The meanings of the 15 columns in the coding strings and their possible values are explained in Coding conventions.
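To make the idea concrete, here is a schematic illustration of how a coding string packs several query results into one string. The column values and their order below are invented for illustration only; the real 15-column layout is given in Coding conventions. Each coded property occupies one colon-separated field:

```shell
# A made-up coding string with five hypothetical columns:
# period:genre:author_sex:negation:verb
code='e3:letter:female:neg:do'

# Pull out a single column with cut; here the fourth field (negation):
echo "$code" | cut -d: -f4   # prints: neg
```

Because every coded clause carries a string of this shape, questions about the corpus reduce to questions about strings, which is what makes the coded format convenient to search.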
You can access the coded versions of the PPCEME and the PCEEC as follows:
Both of the coded parsed files are huge, since they each contain about 2 million words of text, a part-of-speech tag for every word, and syntactic annotation over and above that, not to mention the coding strings. Emacs will ask you if you really want to open them. Respond with y for "yes".
Start by looking at the coded parsed version of the PPCEME (emacs ppceme.cod). At the very beginning of the file is the (very long) CorpusSearch query that I wrote in order to generate this particular coded version of the parsed corpus. Right after the query comes the alhatton text that you looked at earlier. You can find the text easily (without having to scroll down screen by screen) by searching for the string CODING. As you'll see, the coding strings are interspersed among the parsed sentences of the text. See Finding corpus examples for more detailed discussion of how to navigate the coded parsed corpora.
When performing searches on a coded corpus, we are sometimes
interested in seeing the actual sentences that match our search
criteria. But more often than not, we aren't interested in the
individual sentences themselves, but only in how many times a certain
type occurs. To facilitate the latter type of search, CorpusSearch
allows users to extract the coding strings from the coded corpus. The
resulting files are simply all the coding strings in a coded corpus.
By convention, these files have the same name as the coded parsed
corpora from which they are derived, but with an additional
.ooo extension. You can take a look at the relevant files,
which are not intended for regular human consumption, by typing
Since these files are very much smaller than the corresponding .cod files, Emacs won't give you the warning about large file size.
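For instance, assuming one coding string per line in the .ooo file, counting the coded clauses in a corpus is just a matter of counting lines. The file below is a tiny made-up stand-in, not the real ppceme.ooo:

```shell
# A tiny stand-in for an .ooo file: three made-up coding strings.
printf 'e1:letter:neg\ne2:diary:pos\ne3:letter:neg\n' > tiny.ooo

# One coding string per line, so the line count is the clause count:
wc -l < tiny.ooo   # prints: 3
```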
As mentioned earlier, coding queries encode the results of potentially
very many ordinary CorpusSearch queries as a single string of
characters. In order to query the resulting coding strings, we need a
search program that is dedicated to searching strings (rather than
parsed structures). Luckily, Linux (babel's operating system) provides
a command of exactly the right sort. It is called grep, which
stands for 'global regular expression print', and not surprisingly, you
need to learn a bit about regular expressions before you can use it.
See Searching and counting
with grep for a summary of the regular expressions that you will need.
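As a concrete illustration of the kind of search grep supports (the coding strings below are invented, not the real PPCEME column layout), here is how to list and count strings whose first column has a particular value:

```shell
# A small file of made-up coding strings.
cat > sample.ooo <<'EOF'
e1:letter:neg:have
e2:sermon:pos:have
e3:letter:neg:do
e3:diary:pos:be
EOF

# List the strings whose first column is e3 (the ^ anchors the match
# to the beginning of the line):
grep '^e3:' sample.ooo

# Count the matches instead of listing them:
grep -c '^e3:' sample.ooo   # prints: 2
```

The -c option is what you will use most often, since for quantitative work we usually care about how many clauses match a pattern rather than about the clauses themselves.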
Once you're somewhat familiar with the coding conventions and with grep, you can use your knowledge to answer the following sample questions.
If there is enough evidence in the corpora to decide between the alternative hypotheses, this issue would make a good topic for your first class project.