CorpusSearch is a search program that finds linguistic structures in a parsed and labelled collection of texts, known as a "corpus". The project came about as a collaboration between me and two linguists at the University of Pennsylvania, Anthony Kroch and Ann Taylor, who have been working for some years on a parsed and labelled collection of Middle English texts. (In the next few years, the corpus project will be extended to Old English and Early Modern English.) Anthony Kroch originally asked me to write a new interface for the search program they were using at the time, but it became clear that what they really needed was a whole new search program.
For the search program, I invented a user-friendly query language whose terms are as close as possible to the ones linguists use when they talk about the structures they look for. I wrote a new search program that searches directly for the linguistic structures, without the intermediate step of translating the structures into regular expressions, as the previous search program had done. I provided the program with a Unix interface and designed a new output format, which is as concise and informative as I could make it. The output format includes new features that I designed, such as the original form of the sentence printed above its parsed form, and a summary block at the end of the output file giving statistics that describe the results of the search. The output of one search can be used as input to another search (a new feature requested by the linguists).
I also designed new ways to handle errors in the corpus itself. If an error is encountered during a search, CorpusSearch does not stop; it outputs the sentence with a message pinpointing where in the sentence the error was found, and the search then continues with the next input sentence. I also built a "bug_hunter", which searches explicitly for corpus errors and outputs them with exact error messages. This is a useful tool for the linguists who build the corpus.
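The error handling described above can be illustrated with a minimal sketch in Python. This is not the CorpusSearch implementation: the bracketed sentence format, the names parse, has_label, search and CorpusError, and the wording of the messages are all assumptions made for illustration. It shows only the general idea of reporting a malformed sentence with its position and then continuing with the next one.

```python
from typing import List, Tuple, Union

# Hypothetical tree shape: a leaf word, or [label, child, child, ...].
Tree = Union[str, list]


class CorpusError(Exception):
    """Raised for a malformed sentence; records the offending token index."""

    def __init__(self, message: str, position: int):
        super().__init__(message)
        self.position = position


def parse(sentence: str) -> Tree:
    """Parse one bracketed sentence such as "(IP (NP she) (VB left))"."""
    tokens = sentence.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise CorpusError("unexpected end of sentence", pos)
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise CorpusError("unmatched ')'", pos - 1)
        if tok != "(":
            return tok  # a leaf word
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise CorpusError("missing label after '('", pos)
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            children.append(node())
        if pos >= len(tokens):
            raise CorpusError("missing ')'", pos)
        pos += 1  # consume the closing bracket
        return [label] + children

    tree = node()
    if pos != len(tokens):
        raise CorpusError("extra text after sentence", pos)
    return tree


def has_label(tree: Tree, label: str) -> bool:
    """Return True if any node in the tree carries the given label."""
    if isinstance(tree, str):
        return False
    return tree[0] == label or any(has_label(c, label) for c in tree[1:])


def search(sentences: List[str], label: str) -> Tuple[List[str], List[str]]:
    """Search every sentence; report malformed ones instead of stopping."""
    hits, errors = [], []
    for number, sentence in enumerate(sentences, start=1):
        try:
            tree = parse(sentence)
        except CorpusError as err:
            # Record where the problem was found, then move on.
            errors.append(f"sentence {number}, token {err.position}: {err}")
            continue
        if has_label(tree, label):
            hits.append(sentence)
    return hits, errors
```

In this sketch, calling search on a list that contains one sentence with a missing closing bracket still returns the matches from the well-formed sentences, together with one error message for the bad sentence. A dedicated error hunter could reuse the same parse step and report only the errors.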
My overall goal has been to make CorpusSearch as user-friendly as possible. It is designed to be relatively easy for linguists to learn, and the output is as clear and informative as I could make it.
[Questionnaires, etc. Will fill in later.]