A Brief History of Corpus Searching

contents of this chapter:

searching texts by hand
searching texts with coding strings
Penn-Helsinki Parsed Corpus of Middle English, Phase 1 (PPCME1)
Penn-Helsinki Parsed Corpus of Middle English, Phase 2 (PPCME2)
Why CorpusSearch was written for the PPCME2

searching texts by hand

The way linguists used to work with texts was to read through the text and write down every example of the particular structure they were looking for. This was tedious, inaccurate work, and the inevitable errors were hard to find and correct. Don Ringe, of the University of Pennsylvania, spent four years gathering the data for his PhD thesis The perfect tenses in Greek inscriptions, completed in 1983. He still has 10,000 index cards packed in boxes in his office, each bearing the reference for a Greek verb form.

searching texts with coding strings

In 1979 David Sankoff and Henrietta Cedergren developed a program called varbrul that performed multivariate analysis on linguistic data. Data was encoded for that system using "coding strings" (see below). The program was originally developed to analyze phonological data (that is, data having to do with the pronunciation of words).

In the early 1980s Anthony Kroch and Don Hindle (a graduate student in linguistics) started using varbrul to analyze syntactic data (that is, data having to do with the structure of sentences). Don Hindle and Susan Pintzuk wrote programs to manipulate coding strings to make varbrul more effective for syntactic analysis.

A corpus of selected texts was assembled and then coding strings were added to each sentence by someone reading through the sentences and writing the coding strings by hand. Here's a typical sentence (from Malory) with coding string:

(mvNpi ( Hit besemyth you nought .' ))

Each position in the coding string "mvNpi" is called a "column". Column 1 describes the clause type (in this example, "m" means "main clause"), column 2 describes the verb type ("v" means "besemyth" is a tensed verb), column 3 describes negation ("N" indicates the clause is negated, because of the word "nought"), column 4 describes the type of the object ("p" means the object "you" is a pronoun) and column 5 describes object position ("i" indicates that the object "you" immediately follows the verb "besemyth").

Examples prefaced by coding strings were searched using a program called "tsort" written by Susan Pintzuk in LISP. Here's typical input to tsort:

(and (col 1 m) (col 3 N) (col 4 p))

This will pull all examples that have "m" in column 1, "N" in column 3, and "p" in column 4 of the coding string (that is, all negative main clauses with a pronoun object), including the sample sentence above.
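The column-matching logic of such a query can be sketched in Python. This is an illustration only, not a reconstruction of tsort: the function names matches and search, the dictionary form of the query, and the second record (a made-up coding string with placeholder text) are all assumptions introduced for this sketch.

```python
# Each record pairs a coding string with its sentence text.
# The first record is the Malory example from the text; the second is a
# hypothetical record included only to show a non-match.
records = [
    ("mvNpi", "Hit besemyth you nought ."),
    ("sv-ni", "(hypothetical non-matching sentence)"),
]


def matches(coding, query):
    """Return True if every (column, code) pair in the query holds.

    Columns are numbered from 1, as in the text.
    """
    return all(
        len(coding) >= col and coding[col - 1] == code
        for col, code in query.items()
    )


def search(records, query):
    """Return the records whose coding string satisfies the query."""
    return [rec for rec in records if matches(rec[0], query)]


# Rough equivalent of (and (col 1 m) (col 3 N) (col 4 p))
query = {1: "m", 3: "N", 4: "p"}
hits = search(records, query)
for coding, sentence in hits:
    print(coding, sentence)
```

Only the Malory record satisfies all three column tests, so it is the only one printed. Columns 2 and 5 are left unconstrained, just as in the tsort query.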

The disadvantage of this system was that the automatic phase, searching on the coding strings, became possible only after a great deal of work (collecting the data and adding the coding strings) had already been done by hand.

Penn-Helsinki Parsed Corpus of Middle English, Phase 1 (PPCME1)

In 1991, Anthony Kroch and Ann Taylor began a pilot project to develop a syntactically annotated corpus of Middle English, funded by a National Science Foundation grant. This resulted in the publication of the first phase of the Penn-Helsinki Parsed Corpus of Middle English (PPCME1) in 1994.

This version of the corpus was constrained by the technology available at the time. To build a corpus, linguists need parsers to describe sentence structure and labellers (or "taggers") to label parts of speech (nouns, verbs, pronouns, etc.). At the time, there were no automatic parsers or labellers available that were robust enough to handle Middle English, with its spelling and word-order variations; they had all been designed for Modern English and could not be trained to handle other data sets.

Because of this, Ann Taylor wrote all the programs used in constructing the corpus herself, and much of the annotation was done by hand. This corpus had relatively flat parsing (not every piece of text was labelled, only the most important ones like subject, verb, or object).

There were two ways to search PPCME1.

One was to use a search program written by Ann Taylor in Perl called find-mideng. This program matched regular expressions. Here's an example of a search string that could be input to this program:

expletive-there,\[s\-\d\s+\+*[td]h*[ea][eai]*r

The label, "expletive-there", describes what the query is searching for. Here's an example of expletive there:

There is a unicorn in the garden.

In this sentence, "there" does not denote a place; it indicates existence only. By contrast, in the sentences

There's the unicorn!
The unicorn is there.

"there" denotes some particular place in the mind of the speaker, and thus is demonstrative rather than expletive.

Here's another example, from Gertrude Stein describing Los Angeles:

There is no there there.

Here, the first "there" is expletive, the second "there" is being used as a noun in a very nonstandard fashion, and the last "there" is demonstrative.

The rest of the query, \[s\-\d\s+\+*[td]h*[ea][eai]*r, is a regular expression in Perl. The output resulting from this query was a list of examples matching the regular expression. Here's a sample piece of output:

( [b So ] [t the meanwhyle ] [s-1 there ] [vt com ] [p into the courte ] [n-1 the Lady of the Laake ] , )(MALORY,48.113)

Or, as Malory wrote it:

So the meanwhyle there com into the courte the Lady of the Laake,

In the query, \[s\-\d\s+ matches "[s-1 ", which indicates a subject in the PPCME1 schema, and \+*[td]h*[ea][eai]*r matches the many different ways of spelling "there" in Middle English (including "there", "theire", "thair", and many more).
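To see the pattern at work, here is a small sketch that applies it with Python's re module instead of Perl. It is not find-mideng itself; the name EXPLETIVE_THERE and the test loop are assumptions made for illustration. The pattern and the sample line are copied from the text above.

```python
import re

# The regular expression from the query, unchanged. Python's re module
# accepts the same syntax for this particular pattern.
EXPLETIVE_THERE = re.compile(r"\[s\-\d\s+\+*[td]h*[ea][eai]*r")

# The sample output line quoted in the text.
line = ("( [b So ] [t the meanwhyle ] [s-1 there ] [vt com ] "
        "[p into the courte ] [n-1 the Lady of the Laake ] , )")

match = EXPLETIVE_THERE.search(line)
print(match.group(0) if match else "no match")

# Spelling variants named in the text, each wrapped in a subject bracket.
variants = ["there", "theire", "thair"]
results = [bool(EXPLETIVE_THERE.search("[s-1 " + v + " ]")) for v in variants]
print(results)
```

Note that the pattern is not anchored at the end of the word, so it matches any subject whose first word begins with one of these spellings. The brackets "[t the meanwhyle ]" and "[n-1 the Lady of the Laake ]" are not matched, because neither opens with "[s-" followed by a digit.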

The other way to use PPCME1 was to automatically generate a coding string using a program called "code" written by Ann Taylor, then use the previous methods to search on the coding strings.

Since its release, the PPCME1 has been downloaded by researchers and research groups from more than 100 universities in 22 countries: Australia, Austria, Belgium, Brazil, Britain, Canada, China, Denmark, Finland, France, Germany, India, Italy, Japan, Korea, the Netherlands, New Zealand, Norway, Portugal, Spain, Sweden, and the United States. At a recent Diachronic Generative Syntax conference held in York, UK (May 29-30, 1998), all but one of the eight papers on English made use of the PPCME1.

Penn-Helsinki Parsed Corpus of Middle English, Phase 2 (PPCME2)

The second phase of the corpus, PPCME2, began in 1995. By this time Eric Brill, a graduate student at Penn in computational linguistics, had written a trainable tagger. To use this tagger, the linguist would first write a training set of correctly labelled example sentences. From the training set the tagger would develop a lexicon and a set of rules for assigning tags. This made it possible to include part-of-speech tags in the second version of the corpus.

Once the part-of-speech tags were included, it was then possible to do automatic parsing, which is based on the part-of-speech tags. The first parser the linguists used, called "fidditch", was written by Don Hindle while working on a grant Tony Kroch had to study speech and writing. The Penn Treebank Project, a Modern English corpus project under the direction of Mitch Marcus, wound up using the fidditch parser, and also developed friendly interfaces for correcting the tagged and parsed output.

The disadvantage of fidditch was that it parsed only the constituents it was sure about, so its output was underdetermined. The rest of the parsing had to be done by hand, and a lot of correcting was needed.

Then Mike Collins, a graduate student in computational linguistics at Penn, wrote a trainable automatic parser that was extremely good for Modern English. Again, the key was that it was trainable. He was interested in seeing the training feature in action, so he did a lot of work to make the parser handle Middle English correctly. This provided parsing that was so much more accurate than fidditch that it cut the correcting time by a factor of 4.

A program called "tgrep", originally designed for the Penn Treebank Project, was used to search the PPCME2 (tgrep is described in more detail in "The Goals of CorpusSearch").

Why CorpusSearch was written for the PPCME2

The Penn-Helsinki Parsed Corpus of Middle English, Phase 2 (PPCME2) is a collection of machine-readable texts from the Middle English period (1150-1550 CE) which have been linguistically annotated to facilitate automatic search for linguistic structures. The corpus contains 1.5 million words and is balanced by time period, dialect and genre. For each time period (1150-1250, 1250-1350, 1350-1420, 1420-1550), texts were chosen to represent each dialect (North, South, East Midland, West Midland, and Kent) and a wide range of genres (sermon, biography, fiction, poetry, travelogue, science, legal, letters, etc.). Each word of the sentence is labelled by its part of speech (noun, verb, etc.). The words are then grouped into phrases, which are labelled for function (subject, object, etc.). The annotations, in conjunction with an appropriate search engine (CorpusSearch), make it possible for linguists to easily extract all sentences from the corpus which have a given linguistic structure (for example, all sentences in which the subject follows the verb, as in "Thus spake Zarathustra").

To illustrate, here is an unusually simple sentence from the corpus:

((IP 
     (CONJ and)
     (NP-SBJ 
             (PRO she))
     (BED was)
     (VAN clothe)
     (ADVP 
           (ADV rychly))
     (. ,)))

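which represents this tree (ignoring the punctuation node (. ,)):

[Figure: tree diagram of the sentence above. IP is the root; its daughters, from left to right, are CONJ (and), NP-SBJ (dominating PRO she), BED (was), VAN (clothe), and ADVP (dominating ADV rychly).]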

The syntax of a natural human language sentence is represented as a tree rather than as a string, because only representation as a tree (or its equivalent) can express precisely the relations between nonadjacent words (which are important in the syntax of every human language). There are two basic types of structure in the syntactic tree that linguists look for: dominance and precedence. "x dominates y" means that y is contained in the subtree headed by x. In the example, IP dominates everything else, NP-SBJ dominates PRO, and ADVP dominates ADV. "x precedes y" means that x is a previous sister of y. In the example, CONJ precedes NP-SBJ, BED precedes VAN, and BED precedes ADVP. Dominance and precedence are mutually exclusive: if x dominates y, then x does not precede y, and vice versa. Every relationship in the tree can be described in terms of dominance and precedence. The CorpusSearch program searches for dominance and precedence and their combinations and variations.
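The two relations can be made concrete with a short Python sketch that reads the bracketed sentence above and tests dominance and precedence exactly as they are defined in this chapter. This is not CorpusSearch's implementation: the Node class and the functions parse, nodes, find, dominates, and precedes are hypothetical names written for this illustration, and the real program's definitions may differ in detail.

```python
import re

TEXT = """
((IP
     (CONJ and)
     (NP-SBJ
             (PRO she))
     (BED was)
     (VAN clothe)
     (ADVP
           (ADV rychly))
     (. ,)))
"""


class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []  # daughter nodes, in left-to-right order
        self.words = []     # words directly under this node


def parse(text):
    """Parse a bracketed tree and return the node for the outermost bracket."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    top = Node("")  # artificial holder above the outermost bracket
    current = top
    just_opened = False
    for tok in tokens:
        if tok == "(":
            child = Node("", parent=current)
            current.children.append(child)
            current = child
            just_opened = True
        elif tok == ")":
            current = current.parent
            just_opened = False
        elif just_opened:
            current.label = tok  # first token after "(" is the label
            just_opened = False
        else:
            current.words.append(tok)
    return top.children[0]


def nodes(root):
    """Yield root and every node below it, left to right."""
    yield root
    for child in root.children:
        yield from nodes(child)


def find(root, label):
    """Return the first node carrying the given label."""
    return next(n for n in nodes(root) if n.label == label)


def dominates(x, y):
    """True if y is contained in the subtree headed by x (and y is not x)."""
    node = y.parent
    while node is not None:
        if node is x:
            return True
        node = node.parent
    return False


def precedes(x, y):
    """True if x is an earlier sister of y, as defined in the text."""
    if x.parent is None or x.parent is not y.parent:
        return False
    sisters = x.parent.children
    return sisters.index(x) < sisters.index(y)


tree = parse(TEXT)
ip, conj, sbj, pro = (find(tree, l) for l in ("IP", "CONJ", "NP-SBJ", "PRO"))
bed, van, advp, adv = (find(tree, l) for l in ("BED", "VAN", "ADVP", "ADV"))

print(dominates(ip, pro), dominates(sbj, pro), dominates(advp, adv))
print(precedes(conj, sbj), precedes(bed, van), precedes(bed, advp))
print(precedes(ip, pro), dominates(pro, sbj))
```

The first two printed lines check the examples given in the paragraph above; the last line checks two relations that do not hold. Under the sister-based definition used here, PRO does not precede BED, since the two nodes have different parents.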

From the programmer's point of view, this is a formal exercise. I have not had to deal with the issues of natural language, because the corpus has been parsed into purely formal syntactic terms. CorpusSearch can be used to search any corpus that has been parsed and labelled in the same general format as the PPCME2. Over the next few years, the corpus will be extended to Old English and Early Modern English, and CorpusSearch will be used to search those texts as well.