The Goals of CorpusSearch

contents of this chapter:

about tgrep
why CorpusSearch is an improvement
clearer command language.
searchable output.
more informative output.
more useful output.
support and maintenance
searching for corpus errors
handling corpus errors in a regular search
portability; web-friendliness

about tgrep

Prior to CorpusSearch, the linguists at the University of Pennsylvania used a search program called "tgrep". The purpose of tgrep is to search the Penn Treebank, which is a corpus of linguistically annotated text. tgrep was designed by computer scientists to aid other computer scientists in their study of natural language, so it is not surprising that the linguists found it less than optimal for their purposes.

why CorpusSearch is an improvement:

CorpusSearch was designed to be an improvement over tgrep as follows:

clearer command language.

tgrep was not designed by or for linguists, and its command language was counter-intuitive and took a long time to learn. The linguists were concerned that other linguists might be so put off by the difficulties of learning tgrep that they wouldn't use the Middle English corpus for their research.

Here's an example of the sort of thing linguists look for:

Subordinate clauses with two nominal constituents between the subject and the modal verb, and with the modal verb preceding the non-tensed verb.

as a tgrep command:

'/CP/ < (/IP/ < (NP-SBJ $. (/^NP/ $. (/^NP/ $. (/MD$/ $.. /VB$/)))))'

and as a CorpusSearch command:

node: CP*

query: ((((NP-SBJ iPrecedes 1NP*)
AND (1NP* iPrecedes 2NP*))
AND (2NP* iPrecedes *MD))
AND (*MD precedes *VB))

The tgrep language and the CorpusSearch language are equally expressive; the advantage of the CorpusSearch language is that it is phrased in terms that come naturally to linguists.
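
To make the two ordering relations concrete, here is a minimal sketch in Python (an illustration only, not CorpusSearch's own Java code) that evaluates the example query against a flat list of sister labels. It assumes that iPrecedes means "immediately precedes" and that precedes means "precedes at any distance", mirroring tgrep's $. and $.. operators; that * is a wildcard; and that the 1 and 2 prefixes only distinguish two different NP nodes. The function name and the flat-list representation are hypothetical simplifications.

```python
from fnmatch import fnmatchcase


def matches_query(labels):
    """Return True if the sister labels satisfy the example query.

    Assumed reading: NP-SBJ is immediately followed by two NP* nodes,
    the second NP* is immediately followed by a *MD node, and some
    *VB node occurs anywhere after that *MD node.
    """
    for i in range(len(labels) - 3):
        if (labels[i] == "NP-SBJ"
                and fnmatchcase(labels[i + 1], "NP*")    # NP-SBJ iPrecedes 1NP*
                and fnmatchcase(labels[i + 2], "NP*")    # 1NP* iPrecedes 2NP*
                and fnmatchcase(labels[i + 3], "*MD")    # 2NP* iPrecedes *MD
                and any(fnmatchcase(later, "*VB")        # *MD precedes *VB
                        for later in labels[i + 4:])):
            return True
    return False
```

For example, a clause whose daughters are NP-SBJ, NP-OB2, NP-OB1, MD, ADVP, VB satisfies this reading, because the verb only has to follow the modal somewhere; a clause with an adverb between the second NP and the modal does not.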

When the CorpusSearch project was originally proposed, the idea was that I would simply write a new interface, which would translate linguist-friendly commands into tgrep commands. As time went on, it became clear that I needed to write a new search program as well, for the following reasons.

searchable output.

In tgrep, output could not be searched directly: it first had to be filtered through "tprep", which prepared a searchable file. This was an annoyance, since tprep took time and had to be run again whenever the input file was changed in any way. In CorpusSearch, output or edited corpus files can be searched directly, with no intervening steps.

more informative output.

In tgrep, the default setting is to return only one example of the search string per sentence. If there is more than one such example, only the first is returned, with no indication that there are others. This is not what linguists need, especially since they are often interested in statistical analysis. Also, tgrep simply prints out the sentences (or sometimes the nodes) that contain the searched-for structure. Some of the corpus sentences are quite long and complex, so each sentence of the output has to be examined to locate the structure, which takes time. CorpusSearch prints out indices and exact descriptions, so that the number and location of the found structures can be determined at a glance.
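
The difference between "first match only" and "every match, with its location" can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not CorpusSearch's actual output format: each sentence is reduced to a flat list of node labels, * is treated as a wildcard, and the function name find_all is hypothetical.

```python
from fnmatch import fnmatchcase


def find_all(sentences, pattern):
    """Map each sentence number to the positions of every matching label."""
    hits = {}
    for number, labels in enumerate(sentences, start=1):
        positions = [i for i, label in enumerate(labels)
                     if fnmatchcase(label, pattern)]
        if positions:
            hits[number] = positions
    return hits
```

Because every position is kept, both the count of matches per sentence and their locations can be read straight off the result, which is what statistical work needs.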

more useful output.

CorpusSearch output gives the original text version of the corpus sentence, as well as the parsed and labelled version. With tgrep output, linguists had to go through and delete all the parentheses and labels by hand to show what the original sentence looked like!
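
Recovering the plain text from a bracketed parse is mechanical, which is why it should not be done by hand. The Python sketch below shows one way to do it; it is not CorpusSearch's own code. It assumes that every word appears in a leaf of the form (LABEL word) and it makes no attempt to filter out traces or other empty categories. The names LEAF and plain_text are hypothetical.

```python
import re

# A leaf is assumed to look like "(LABEL word)": a label, whitespace, then a
# single token, with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def plain_text(parsed):
    """Return the words of a bracketed parse, in order, separated by spaces."""
    return " ".join(word for _label, word in LEAF.findall(parsed))
```

Given the parse (IP (NP-SBJ (PRO he)) (MD wolde) (VB come)), this returns the three words "he wolde come".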

support and maintenance

tgrep is no longer supported; its author has moved on to other projects. The linguists' only recourse in understanding the program is the unwieldy documentation. Recently tgrep has stopped searching the entire corpus at once; the corpus must be broken into two batches for searching, for no apparent reason. Since I'm writing CorpusSearch in close communication with the linguists, they have a strong understanding of what CorpusSearch does and how closely it meets their specifications. I'll be available for at least a year to support the code.

searching for corpus errors

CorpusSearch includes a feature ("bug_hunt") that searches for errors in the corpus itself: for instance, phrase labels that have incorrectly been assigned to single words (or vice versa), missing parentheses, or bits of text left in inappropriate places. The bug-hunter is the only part of CorpusSearch that is label-dependent. (That is, it depends on the particular system of word-labels that was worked out for the Middle English corpus.) The code that deals with labels is written in an extremely simple form, to make it easy to add labels as the corpus is expanded to Early Modern English and Old English. Prior to this innovation, there was no automated way to search for such errors.
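
Two of the checks named above, missing parentheses and phrase labels on single words, can be sketched as follows. This is an illustrative Python sketch, not the bug_hunt code itself. The label set PHRASE_LABELS is a made-up stand-in for the corpus's real label system, and the names hunt_bugs and LEAF are hypothetical. Keeping the labels in one plain set reflects the idea that the label-dependent part should be trivial to extend.

```python
import re

# Hypothetical label set, for illustration only; the real list belongs to
# the corpus annotation scheme.
PHRASE_LABELS = {"NP", "PP", "IP", "CP", "ADVP"}

# A leaf is assumed to look like "(LABEL word)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def hunt_bugs(parsed):
    """Return a list of human-readable problems found in one parsed sentence."""
    problems = []
    depth = 0
    for position, char in enumerate(parsed):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at character {position}")
                depth = 0  # reset so later parentheses are still checked
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    for label, word in LEAF.findall(parsed):
        # Compare only the part before any dash tag, e.g. NP in NP-SBJ.
        if label.split("-")[0] in PHRASE_LABELS:
            problems.append(f"phrase label {label} on single word '{word}'")
    return problems
```

A well-formed sentence yields an empty list; a sentence such as (IP (NP-SBJ he) (VB come)) is reported because NP-SBJ sits directly on a word.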

handling corpus errors in a regular search

Sometimes corpus errors are encountered in the course of a regular search for information (that is, not a search explicitly seeking corpus errors). If CorpusSearch finds a badly formed sentence in the course of conducting a regular search, it gives an error message indicating where in the sentence the malformation was found, prints the badly formed sentence both to the screen and to the output file, and then continues searching on the next corpus sentence.
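
The report-and-continue behaviour can be sketched like this. It is a Python illustration under stated assumptions, not CorpusSearch's implementation: "badly formed" is reduced to unbalanced parentheses, the matcher is any caller-supplied function, and a single report callback stands in for writing to both the screen and the output file. The names search_all and first_imbalance are hypothetical.

```python
def search_all(sentences, matcher, report=print):
    """Apply matcher to each sentence; report malformed ones and keep going."""
    found = []
    for number, parsed in enumerate(sentences, start=1):
        error_at = first_imbalance(parsed)
        if error_at is not None:
            report(f"sentence {number}: malformed at character {error_at}: {parsed}")
            continue  # skip this sentence, carry on with the next one
        if matcher(parsed):
            found.append(number)
    return found


def first_imbalance(parsed):
    """Return the position of the first parenthesis problem, or None."""
    depth = 0
    for position, char in enumerate(parsed):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return position
    # Unclosed parentheses are only detectable at the end of the sentence.
    return len(parsed) if depth > 0 else None
```

The point of the design is that one bad sentence costs one message, not the whole search: the loop never raises, so the remaining sentences are still examined.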

portability; web-friendliness

Another problem with tgrep is that it is not portable. It has been running on unagi (one of the computers at Penn), but it won't run on other, upgraded machines. When unagi gets upgraded, tgrep might not run there anymore either. Since CorpusSearch is written in Java, it is portable. In the long run it might also be run over the web (perhaps searching an abbreviated demonstration-quality corpus).