the Function AND

contents of this chapter:

  • overview
  • about node boundaries
  • building a call to AND
  • how AND works
  • example
  • overview

    In contrast to the elegance of the sentence-vector algorithm, the algorithm for the AND function is a mass of picky detail. The problem is to take two result vectors, describing structures found by search-function calls, and combine them appropriately. There are many variables -- sometimes you need to match node boundaries, sometimes you don't; sometimes particular nodes must coincide (that is, be the same instance); sometimes particular nodes must not coincide.

    about node boundaries

    Consider this query:

    node: CP-REL query: ((1NP* iDominates PP) AND (2NP! iDominates ADVP))

    This query is looking for relative clauses (CP-REL) that contain two different noun phrases (NP*). One noun phrase immediately dominates a prepositional phrase (PP), and the other immediately dominates an adverbial phrase (ADVP).

    In this case, the node boundaries must coincide. That is, you don't want sentences where the noun-phrase structures are found in two different relative clauses.

    In general, if the node boundary is not the wild card (*), and two arguments must not coincide, the node boundary must coincide. Otherwise, the node boundary is not checked.

    building a call to AND

    Let's look again at this query:

    (( (1ADJP|ADVP-TMP iDominates ADV*|ADJ*)
    AND
    (ADV*|ADJ* dominates !*er))
    AND
    (2ADJP|ADVP-TMP iPrecedes NP-SBJ))

    In English, this query would be: "find sentences that contain two different ADJP|ADVP-TMP nodes. One ADJP|ADVP-TMP node immediately dominates an ADV*|ADJ* node, which does not dominate a word ending in "er". The other ADJP|ADVP-TMP node immediately precedes an NP-SBJ node."

    The query is recorded in this function-calls vector:

    0.) [iDominates, 18, [ADJP, ADVP-TMP], [ADV*, ADJ*]]
    1.) [dominates_not_y, 17, [ADV*, ADJ*], [*er]]
    2.) [AND, 1, [[true, 1, 2]]]
    3.) [iPrecedes, 9, [ADJP, ADVP-TMP], [NP-SBJ]]
    4.) [AND, 1, [[false, 0, 4]]]

    As CorpusSearch reads through the query to build the function-calls vector, it keeps a list of all the search-function arguments encountered so far. For instance, by the time the first call to AND is read, the argument list looks like this:

    1ADJP|ADVP-TMP, ADV*|ADJ*, ADV*|ADJ*, !*er

    When the call to AND is being built, the argument list is searched for same-instance specifications. In this case, there are two instances of ADV*|ADJ*, at indices 1 and 2 of the argument list. Because these two arguments match, the structures must coincide in the sentence vector. So the coincide value "true", along with the indices 1 and 2, are sent along to the AND function.

    In order to prevent the same match from being considered more than once, the second of the matching arguments is replaced with a dummy in the argument list. The dummy consists of the word "dummy" with its index concatenated to it. (If you just use the word "dummy", you get false matchings of "dummy" to "dummy"!)

    So this is how the argument list looks after the first call to AND has been put together:

    1ADJP|ADVP-TMP, ADV*|ADJ*, dummy2, !*er

    Here's the argument list when the second call to AND is being built:

    1ADJP|ADVP-TMP, ADV*|ADJ*, dummy2, !*er, 2ADJP|ADVP-TMP, NP-SBJ

    This time, when the argument list is searched, we find two arguments with mismatched prefix indices, namely 1ADJP|ADVP-TMP and 2ADJP|ADVP-TMP, at indices 0 and 4. Because the prefix indices are different, this is a case where the arguments must not coincide. So the coincide value "false", along with the indices 0 and 4, are sent along to the AND function.

    The whole process of recording same-instance specifications is done once at the beginning of the search and never again.

    example

    As an example, I'll work through the above query as applied to the sentence

    Thenne quene Igrayne waxid dayly gretter and gretter.

    [[0],
    [[1, IP-MAT],
    [[2, ADVP-TMP],
    [[3, ADV, Thenne]]],
    [[4, NP-SBJ],
    [[5, NPR, quene]],
    [[6, NPR, Igrayne]]],
    [[7, VBD, waxid]],
    [[8, ADVP-TMP],
    [[9, ADV, dayly]]],
    [[10, ADJP],
    [[11, ADJR, gretter]],
    [[12, CONJ, and]],
    [[13, ADJR, gretter]]],
    [[14, E_S, .]]],
    [[15, ID, CMMALORY,5.120]]]

    CorpusSearch takes the first search function call, as recorded in the function-calls vector:

    0.) [iDominates, 18, [ADJP, ADVP-TMP], [ADV*, ADJ*]]

    When the search function iDominates is called along with the above sentence, this result vector is produced (I'll call it "result I"), and stored on a results-stack for future reference:

    [[[1, IP-MAT], [2, ADVP-TMP], [3, ADV, Thenne]],
    [[1, IP-MAT], [8, ADVP-TMP], [9, ADV, dayly]],
    [[1, IP-MAT], [10, ADJP], [11, ADJR, gretter]],
    [[1, IP-MAT], [10, ADJP], [13, ADJR, gretter]]

    Then the next call is taken:

    1.) [dominates_not_y, 17, [ADV*, ADJ*], [*er]]

    producing this result vector (I'll call it "result II"):

    [[[1, IP-MAT], [3, ADV, Thenne], [3, ADV, Thenne]],
    [[1, IP-MAT], [9, ADV, dayly], [9, ADV, dayly]]

    When one node is recorded twice, as in:

    [[1, IP-MAT], [9, ADV, dayly], [9, ADV, dayly]]]

    it is because "dominates" will allow both x and y to be in the same node. In this case, "ADV" matches the ADV*|ADJ* argument, and "dayly" matches the !*er argument.

    The next call is the first call to AND:

    2.) [AND, 1, [[true, 1, 2]]]

    So the two result vectors above are taken from the results-stack and sent to AND along with the same-instance vector [[true, 1, 2]]].

    To see if a new result vector can be formed from the two input result vectors, the AND function takes the first sub-vector from result I and compares it to each sub-vector of result II. Then the second sub-vector from result I is compared to each sub-vector of result II, and so on until every relevant combination of sub-vectors has been examined. Here's the first potential match, between the first sub-vector of result I and the first sub-vector vector of result II:

    [[[1, IP-MAT], [2, ADVP-TMP], [3, ADV, Thenne]] (from result I)
    [[1, IP-MAT], [3, ADV, Thenne], [3, ADV, Thenne]] (from result II)

    The same-instance vector [true, 1, 2] tells us that the arguments at indices 1 and 2 of the argument list must coincide. Unfortunately, these indices are not the appropriate ones for the result vectors, because the result vectors don't just contain nodes representing the arguments; they also contain nodes representing the node boundaries. For every two arguments in the argument list, one node boundary has been added to the result sub-vector. So we use this formula to find the appropriate index in the result sub-vector (adding same/2 adds one for every two arguments, and adding 1 again accounts for the very first node boundary, which appears before the first pair of arguments):

    dex1 = same1 + (same1)/2 + 1

    For instance, where the same-instance value is 1, the sub-vector index is

    dex1 = 1 + 1/2 + 1 = 2

    so the argument that was at index 1 on the argument list (namely, ADV*|ADJ*) should be looked for at index 2 of the result I sub-vector [[[1, IP-MAT], [2, ADVP-TMP], [3, ADV, Thenne]].

    For the second same-instance value, 2, we use the formula again:

    dex2 = 2 + 2/2 + 1 = 4

    4 would be the correct index of the result sub-vector if the two sub-vectors were concatenated. Since they're not concatenated but separate, we need to subtract the length of the first result sub-vector, namely 3, to obtain the correct index.

    dex2 = same2 + (same2)/2 + 1 - 3 = 4 - 3 = 1

    so the argument that was at index 2 on the argument list (namely, ADV*|ADJ*) should be looked for at index 1 of the result II sub-vector [[1, IP-MAT], [3, ADV, Thenne], [3, ADV, Thenne]].

    Now that we've got the indices worked out, we know to compare [3, ADV, Thenne] from result I to [3, ADV, Thenne] from result II. The fact that both these nodes have index 3 proves they are the same, as required by the same-instance value "true". So the AND function concatenates the two sub-vectors and puts the result into the AND result vector.

    The final result vector output from the first call to AND, result III, looks like this:

    [[[1, IP-MAT], [2, ADVP-TMP], [3, ADV, Thenne], [1, IP-MAT], [3, ADV, Thenne], [3, ADV, Thenne],
    [[1, IP-MAT], [8, ADVP-TMP], [9, ADV, dayly], [1, IP-MAT], [9, ADV, dayly], [9, ADV, dayly]]]

    Notice that the result vector does not contain any reference to words ending in "er", as specified by the query.

    The next search-function call is:

    3.) [iPrecedes, 9, [ADJP, ADVP-TMP], [NP-SBJ]]

    which produces this result vector ("result IV"):

    [[[1, IP-MAT], [2, ADVP-TMP], [4, NP-SBJ]]]

    The next search-function call is:

    4.) [AND, 1, [[false, 0, 4]]]

    So the two most recent result vectors, III and IV, are taken off the stack and sent to the AND function. Again, we'll look at the first potential match, between the first sub-vector of result III and the only sub-vector of result IV:

    [[1, IP-MAT], [2, ADVP-TMP], [3, ADV, Thenne],
    [1, IP-MAT], [3, ADV, Thenne], [3, ADV, Thenne], (result III)
    [[[1, IP-MAT], [2, ADVP-TMP], [4, NP-SBJ]]] (result IV)

    Here, the first same-instance index is 0, which produces this result vector index:

    dex1 = 0 + 0/2 + 1 = 1

    and the second same-instance index is 4, which produces this result vector index:

    dex2 = 4 + 4/2 + 1 - (length(result III sub-vector)) = 7 - 6 = 1

    so we compare the index 1 node of the result III sub-vector ([2, ADVP-TMP]) with the index 1 node of the result IV sub-vector ([2, ADVP-TMP]). Since they both have index 2 we know they're the same node; but in this case the coincide value was false, so this match is rejected.

    This is the final result vector from the last call to AND:

    [[[1, IP-MAT], [8, ADVP-TMP], [9, ADV, dayly], [1, IP-MAT], [9, ADV, dayly], [9, ADV, dayly], [1, IP-MAT], [2, ADVP-TMP], [4, NP-SBJ]]]