Handling Input Sentences

contents of this chapter:

disadvantages of text
advantages of vectors
how the sentence vector is built
traversing the vector through recursion

disadvantages of text

The input to CorpusSearch takes the form of text files. Text files are convenient for the human user because they are easy to read. On the other hand, the text version of a parsed sentence is impractical for a computer program because the text must be read sequentially. So, for instance, to find the next sister of a node, all the node's daughters must be read through first before the sister is encountered. Consider this sentence:

Thenne quene Igrayne waxid dayly gretter and gretter.

In Modern English, this would be:

Every day Queen Igrayne got bigger and bigger.

The sentence describes Igrayne's pregnancy with Arthur.

Here is the parsed sentence as it appears in the Middle English corpus:

(
(IP-MAT
        (ADVP-TMP (ADV Thenne))
        (NP-SBJ (NPR quene)
                (NPR Igrayne))
        (VBD waxid)
        (ADVP-TMP (ADV dayly))
        (ADJP (ADJR gretter)
              (CONJ and)
              (ADJR gretter))
        (E_S .))
(ID CMMALORY,5.120))

When this sentence is read as text, to find the next sister of NP-SBJ, you must read sequentially through (NPR quene) (NPR Igrayne), then find the ) that closes the NP-SBJ sub-tree, then read the sister, (VBD waxid).

advantages of the vector form

To make access to the parts of the sentence easier, CorpusSearch builds a vector representing the sentence. Daughter nodes are sub-vectors of a mother node. Two sister nodes are sub-vectors of the same mother-node. Every node receives an index, starting at 0. This makes it possible to distinguish between nodes with the same label (for instance, in this sentence the only way to tell the first (ADJR gretter) node from the second (ADJR gretter) node is by the indices.) Here's the vector form of the above sentence (indentation has been added to make the vector easier to read):

[[0],
[[1, IP-MAT],
[[2, ADVP-TMP],
[[3, ADV, Thenne]]],
[[4, NP-SBJ],
[[5, NPR, quene]],
[[6, NPR, Igrayne]]],
[[7, VBD, waxid]],
[[8, ADVP-TMP],
[[9, ADV, dayly]]],
[[10, ADJP],
[[11, ADJR, gretter]],
[[12, CONJ, and]],
[[13, ADJR, gretter]]],
[[14, E_S, .]]],
[[15, ID, CMMALORY,5.120]]]

Using the vector, we can find the next sister of NP-SBJ by simply looking for the next element in its mother-vector (the vector whose first element is [1, IP-MAT]). We no longer have to read through the daughters of NP-SBJ, (NPR quene) and (NPR Igrayne).

Notice that the vector does not contain parentheses as elements. This is because the system of parentheses, which describes the sentence as a tree, is implicit in the structure of the vector. Instead of reading and examining parentheses (which took up a non-negligible amount of time when reading the sentence as text) to understand the sentence's structure, CorpusSearch can simply traverse the vector. The pattern of sub-vectors contains all the information needed to reconstruct the tree.

how the sentence vector is built

To build the sentence vector, CorpusSearch uses a stack of vectors (begun empty), an index stack (begun empty), and an index variable (begun with value -1). It may seem redundant to use both a stack and a variable to keep track of indices, but in fact both of these structures are needed. The top of the index stack holds the current open node, and the index variable holds the highest index that has yet been reached. These are not always the same, as shown in the example below.

The index variable is incremented step by step from -1 to the final index of the sentence. (Starting at -1 means that the first index of the sentence will be 0.) Every time an open parenthesis is encountered, the index variable is incremented, and a copy of the incremented value is put on the index stack to show that this is the current open node.

Every time a close parenthesis is encountered in the sentence, the index stack is popped and the resulting element examined. When the result of popping the stack is zero, the end of the sentence has been reached. (This is how CorpusSearch defines the end of the sentence.)

The vector stack begins empty. When an open parenthesis is found in the sentence, a new vector is declared, the current index is put in the vector, and the vector is stored on the vector stack. When a label (e.g., "NPR") or piece of text (e.g. "Igrayne") is found in the sentence, it is added to the vector at the top of the stack. When a close parenthesis is found, the vector at the top of the stack is popped and added as one element to the vector immediately behind it on the stack. In this way finding a close parenthesis causes the vector stack to collapse in on itself like a telescope. When the close parenthesis that ends the sentence is found, the vector stack has collapsed for the last time. The stack now contains exactly one element, which is the sentence vector.

To illustrate, I will walk through the process, using the following extremely simple sentence which I invented for this purpose:

( (IP-MAT 
          (NP-SBJ (NPR Igrayne))
	  (VBD laughed)))

Before any text has been read, the variables have their initial values.

previously read text: <empty>
current text: <empty>

<empty> -1 <empty>
index stack index variable vector stack

previously read text:
current text: (

0 0 [0]
index stack index variable vector stack

previously read text: (
current text: (

1 [1]
0 1 [0]
index stack index variable vector stack

previously read text: ( (
current text: IP-MAT

1 [1, IP-MAT]
0 1 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT
current text: (

2 [2]
1 [1, IP-MAT]
0 2 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT (
current text: NP-SBJ

2 [2, NP-SBJ]
1 [1, IP-MAT]
0 2 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ
current text: (

3 [3]
2 [2, NP-SBJ]
1 [1, IP-MAT]
0 3 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ (
current text: NPR

3 [3, NPR]
2 [2, NP-SBJ]
1 [1, IP-MAT]
0 3 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR
current text: Igrayne

3 [3, NPR, Igrayne]
2 [2, NP-SBJ]
1 [1, IP-MAT]
0 3 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne
current text: )

2 [2, NP-SBJ, [3, NPR, Igrayne]]
1 [1, IP-MAT]
0 3 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne )
current text: )

1 [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]]]
0 3 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) )
current text: (

4 [4]
1 [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]]]
0 4 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) ) (
current text: VB

4 [4, VB]
1 [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]]]
0 4 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) ) ( VB
current text: laughed

4 [4, VB, laughed]
1 [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]]]
0 4 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) ) ( VB laughed
current text: )

1 [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]], [4, VB, laughed]]
0 4 [0]
index stack index variable vector stack

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) ) ( VB laughed )
current text: )

0 4 [0, [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]], [4, VB, laughed]]]
index stack index variable vector stack
Here, when the ) is first read, the top of the index stack is still 1, so we haven't yet reached the end of the sentence. After the 1 at the top of the stack is popped and examined, the top of the stack reads 0.

previously read text: ( ( IP-MAT ( NP-SBJ ( NPR Igrayne ) ) ( VB laughed ) )
current text: )

0 4 [0, [1, IP-MAT, [2, NP-SBJ, [3, NPR, Igrayne]], [4, VB, laughed]]]
index stack index variable vector stack
Here, because the top of the index stack is 0 when the ) is read, we know it's the end of the sentence.

traversing the vector through recursion

Once the vector is built, it is traversed many times. A vector that winds up being printed in the output file is traversed once for each search-function call contained in the query, and then again when it is printed. Every method that traverses the vector has the same basic recursive form.

Every vector contains two types of sub-vector; one, that I call a "phrase-vector", which itself contains multiple sub-vectors; and another, which I call a "word-vector", which contains one vector, containing only an index, a label, and a piece of text. For instance, in the vector above, the sub-vector

[[2, ADVP-TMP],
[[3, ADV, Thenne]]]

is a phrase-vector, and the sub-vector

[[3, ADV, Thenne]]

is a word-vector.

Then the basic recursive algorithm followed by every method that traverses the vector is as follows:

traverse (Vector phrase-vector) {

examine phrase-vector[0];
for (int i = 1; i < phrase-vector.size; i++) {
current-vector = phrase-vector[i];
if (current-vector.size > 1) // phrase-vector
traverse (current-vector); // recursive call.
else // word-vector
examine current-vector; // base case.
}
} // end traverse