Conclusion

contents of this chapter:

what I've learned
a successful project
ideas for future improvements
    more logical operators
    definition files
    a presence on the Web
    drawing trees
    tools for corpus builders

what I've learned

I learned from this experience how important and how difficult real communication is. Many times I thought an issue was resolved, only to find that it wasn't, because I had misunderstood what was needed or my clients had misunderstood what I was doing. I also learned how important it is to have the program tested by real users. It wasn't enough for me to test CorpusSearch myself, since I couldn't foresee the kind of use a linguist would make of it. If I could change one thing about this project, I would start the testing phase earlier.

a successful project

In spite of the difficulties in building it, I am confident that CorpusSearch is a robust, useful tool for linguists. Every bug that's been found so far has been fixed. I look forward to the publication of the Middle English corpus and CorpusSearch.

ideas for future improvements:

more logical operators

The same-instance ramifications of AND were so complex that I had to shelve some of the other logical operators while I worked them all out. Now that same-instance is running smoothly, it should be possible to re-establish OR and NOT as applied to search-function calls.

definition files

The argument lists needed to describe certain objects can be long and complex. For instance, Ann Taylor uses this list to look for tensed verbs:

*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD

Instead of writing this cumbersome list, it would be useful to keep a "definition file" (let's call it "args.def") of argument lists and their aliases. The definition file would include this line:

set tensed_verb= *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD

Then you'd add a line to the command file to link in the definition file, perhaps something like this:

include args.def

When you wrote your query, you could write something like:

query: (NP-SBJ precedes tensed_verb)

and CorpusSearch would translate it to:

query: (NP-SBJ precedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
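To make the idea concrete, here is a minimal sketch of the alias expansion in Python. It is only an illustration of the proposal, not how CorpusSearch works: the function names load_definitions and expand_query are hypothetical, and the "set name= list" syntax is assumed from the example line above.

```python
import re

def load_definitions(lines):
    """Collect lines of the assumed form 'set name= arg|arg|...' into a dict."""
    definitions = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("set "):
            continue  # ignore anything that is not a definition
        name, _, value = line[len("set "):].partition("=")
        definitions[name.strip()] = value.strip()
    return definitions

def expand_query(query, definitions):
    """Replace each whole word that is a known alias with its argument list."""
    # A "word" here is a run of letters, digits, underscores, and hyphens,
    # so a label such as NP-SBJ is looked up as one unit.
    return re.sub(r"[\w-]+",
                  lambda m: definitions.get(m.group(0), m.group(0)),
                  query)

# Hypothetical contents of args.def
ARGS_DEF = ["set tensed_verb= *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD"]

definitions = load_definitions(ARGS_DEF)
print(expand_query("query: (NP-SBJ precedes tensed_verb)", definitions))
```

Words that are not aliases (such as NP-SBJ and precedes) pass through unchanged, so a query written without any aliases would be left exactly as it was.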

a presence on the Web

When this project began, the plan was to build a Web interface so that searches could be run over the Web. Unfortunately, security and storage issues have made this infeasible. However, it may still be possible to put a demo version of CorpusSearch, perhaps with a self-tutoring program, on the Web. The User's Manual is already on the Web.

drawing trees

When a sentence is parsed, it is parsed into tree form. The tree is represented by parentheses in the corpus because it's easier to store and print. Here's an example:

( (IP (CONJ and)
      (ADVP (ADV so))
      (NP-SBJ (PRO hit))
      (VBD londid)
      (PP (P undir)
          (NP (D that) (N rocche)))
      (E_S .)) )

However, the tree form is more intuitive for most human beings (including this one) to look at, as seen here:

[figure: the sentence above drawn as a tree]

It could be useful to have a way to show the tree form, perhaps as part of a Web interface. There could also be a command file option to include in the output the LaTeX code that prints the parsed sentence as a tree. This LaTeX code could then be copied and pasted into the eventual scholarly article.
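As a rough illustration of what such an option might do, the Python sketch below reads the parenthesized form, prints it as an indented outline, and emits bracket notation in the style of the LaTeX qtree package. Everything in it is an assumption made for the example: the function names are hypothetical, the choice of qtree is mine, and nothing here reflects how CorpusSearch itself stores or prints trees.

```python
import re

SENTENCE = """
( (IP (CONJ and)
      (ADVP (ADV so))
      (NP-SBJ (PRO hit))
      (VBD londid)
      (PP (P undir)
          (NP (D that) (N rocche)))
      (E_S .)) )
"""

def tokenize(text):
    # Each parenthesis is its own token; labels and words split on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse the node opening at tokens[pos]; return (node, next position).
    A node is (label, children); a child is either a node or a word string."""
    pos += 1                                # skip the opening "("
    label = ""
    if tokens[pos] not in ("(", ")"):       # the outer wrapper has no label
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1       # skip the closing ")"

def parse_sentence(text):
    node, _ = parse(tokenize(text))
    label, children = node
    # Drop the unlabeled wrapper around the sentence, if present.
    if label == "" and len(children) == 1 and not isinstance(children[0], str):
        return children[0]
    return node

def render(node, indent=0):
    """Indented outline: each level of the tree is two spaces deeper."""
    if isinstance(node, str):
        return " " * indent + node
    label, children = node
    lines = [" " * indent + label]
    lines += [render(child, indent + 2) for child in children]
    return "\n".join(lines)

def latex_escape(text):
    for ch in "_&%$#":                      # characters LaTeX treats specially
        text = text.replace(ch, "\\" + ch)
    return text

def to_qtree(node):
    """Bracket notation in the style of qtree: [.LABEL child child ]."""
    if isinstance(node, str):
        return "{" + latex_escape(node) + "}"   # braces keep a leaf like "." safe
    label, children = node
    inner = " ".join(to_qtree(child) for child in children)
    return "[." + latex_escape(label) + " " + inner + " ]"

tree = parse_sentence(SENTENCE)
print(render(tree))
print("\\Tree " + to_qtree(tree))
```

The last line printed is what would be pasted into an article, assuming the article loads the qtree package; a different tree-drawing package would need a different to_qtree.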

tools for corpus builders

It is quite likely that CorpusSearch will be used for corpora other than the Middle English corpus. It will certainly be used for the upcoming Old English and Early Modern English corpora, and it may also be used for Chinese and Korean corpora. So far, the main tool that I've made for the corpus-builder is the bug-hunter, which searches for badly formed corpus sentences. The bug-hunter is specific to one set of labels and one set of rules for how the labels should be used, in this case those of the Middle English corpus. It would be interesting to make a bug-hunter that could be easily customized to any corpus.
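One way to picture a customizable bug-hunter is a checker that takes the label set as a parameter instead of having it built in. The Python sketch below is only that, a picture: find_problems is a hypothetical name, and the two checks it makes (balanced parentheses, labels drawn from an allowed set) are stand-ins I chose for the example, not the rules the actual bug-hunter applies.

```python
import re

def find_problems(sentence, allowed_labels):
    """Return a list of problems found in one parenthesized corpus sentence.
    allowed_labels is supplied per corpus, so nothing here is tied to one."""
    problems = []
    depth = 0
    tokens = re.findall(r"\(|\)|[^\s()]+", sentence)
    for i, token in enumerate(tokens):
        if token == "(":
            depth += 1
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            # The token right after "(" is a label, unless the node is unlabeled.
            if following not in (None, "(", ")") and following not in allowed_labels:
                problems.append("unknown label: " + following)
        elif token == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing parenthesis")
                depth = 0
    if depth > 0:
        problems.append(str(depth) + " unclosed parenthesis(es)")
    return problems

# A hypothetical label set; a real one would come from the corpus documentation.
LABELS = {"IP", "NP-SBJ", "PRO", "VBD"}

print(find_problems("( (IP (NP-SBJ (PRO hit)) (VBD londid)) )", LABELS))
print(find_problems("( (IP (NP-SBJ (PRO hit)) (VBX londid)) )", LABELS))
```

Rules about how labels may combine could be passed in the same way, for example as a table of which labels are allowed beneath which, so that moving to a new corpus would mean writing a new table rather than a new program.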