Overview

contents of this chapter:

overall algorithm
stacks and recursion
a note on efficiency

overall algorithm

The overall algorithm of CorpusSearch is quite straightforward. First the command file is read, and the information it contains (query, printing instructions, etc.) is interpreted and stored. Then a sentence is read from the input file, converted to vector form to facilitate searching, and sent to the search functions as described by the query. If the result of the search functions is a non-empty result vector, the sentence and its result vector are printed in the output file. The search continues with the next sentence, and so on through the whole set of input files.

stacks and recursion

To understand the algorithms used in CorpusSearch, all you need to know is stacks and recursion.

To deal with text that contains parentheses (that is, the query and parsed sentences from the corpus), stacks are used to build vectors that contain the same information as the text but in a form that is more easily accessible to the program. In general, an open parenthesis causes a new sub-vector to be declared and put on the stack, and a close parenthesis causes the stack to be popped and its top element placed in the mother vector.

Recursion is used every time the sentence vector is traversed. This is a reflection of the recursion of natural language.

a note on efficiency

I have tried to follow the rule that if something only needs to be done once, I don't do it thousands of times. Some tasks can be performed once per search (for instance, interpreting the command), others are performed once per output sentence (for instance, printing out comments and the sentence), and still others must be performed once per input sentence (for instance, reading in the sentence and building its representative vector). Naturally, the biggest time-eaters are the tasks that must be performed once per input sentence. At the moment, the lion's share of the program's running time is taken by reading through the input file.