CorpusSearch Logical Operators

contents of this chapter:

about logical operators
search-function operators vs. argument operators
AND (and search-function call)

time-saver
same-instance
same-instance with prefix indices
do node boundaries matter?

! (not argument)

not-argument reports last legitimate node
not one argument at a time
not before prefix indices

| (or argument)

negating a list

about logical operators

CorpusSearch supports the following logical operators:

AND (and search-function call)

! (not argument)

| (or argument)

Also, the printing command print_complement can be thought of as NOT applied to a query.

search-function operators vs. argument operators

AND acts on search-function calls; ! and | act on arguments to the search functions.

AND; time-saver

AND has a time-saving switch: if the first structure is not found in the sentence being searched, the second structure is not looked for. Therefore, if you know that one structure is rarer than the other, you can save time by listing the rarer structure first.
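The effect of the time-saving switch can be illustrated with a short Python sketch. This is only a model of the behaviour described above, not CorpusSearch's own code; the functions "rare" and "common" are hypothetical stand-ins for two search-function calls, and the sentences are made-up strings.

```python
def and_query(sentence, first, second):
    """Return True only if both searches succeed on the sentence.

    `first` and `second` are hypothetical callables standing in for
    search-function calls; `second` is skipped when `first` fails.
    """
    if not first(sentence):
        return False  # time-saver: the second structure is never looked for
    return second(sentence)


# Count how often each (hypothetical) search is actually run.
calls = {"rare": 0, "common": 0}


def rare(sentence):
    calls["rare"] += 1
    return "IP-SMC" in sentence


def common(sentence):
    calls["common"] += 1
    return "NP" in sentence


sentences = [
    "(IP-MAT (NP x))",
    "(IP-MAT (NP y) (IP-SMC (NP z)))",
    "(IP-MAT (NP w))",
]

# Listing the rarer structure first means `common` runs only once.
hits = [s for s in sentences if and_query(s, rare, common)]
```

With the rarer search listed first, the commoner search runs only for the one sentence that passed the first test; with the order reversed, both searches would run on every sentence.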

AND; same-instance

AND has been implemented with same-instance as a default. So

((IP iDomsNumber1 VBP|VBD) AND (IP iDomsNumber2 ADVP|PP*))

will return only sentences in which a single instance of IP has both the described first child and the described second child. A sentence containing one IP whose first child is VBP and some other IP whose second child is ADVP will not be returned.

Same-instance is triggered by matching argument strings. So

((ADVP precedes MD|HV*|VB*) AND (MD|HV*|VB* precedes NP-SBJ))

will return only sentences with the same instance of MD|HV*|VB*, but

((ADVP precedes MD|VB*|HV*) AND (MD|HV*|VB* precedes NP-SBJ))

will return sentences with the same instance or different instances, because the argument lists do not match as strings.
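The point that the trigger is a string match, not a match on the set of alternatives, can be shown with a small Python sketch. The function names are hypothetical and the comparison is only a model of the rule as stated here.

```python
def same_instance_triggered(arg_a: str, arg_b: str) -> bool:
    """Model of the rule above: same-instance applies only when the two
    argument strings are identical, character for character."""
    return arg_a == arg_b


def alternatives(arg: str) -> frozenset:
    """The set of alternatives in an argument list, ignoring their order."""
    return frozenset(arg.split("|"))


first = "MD|HV*|VB*"
reordered = "MD|VB*|HV*"

# Same alternatives, but a different string, so same-instance is not triggered.
same_alternatives = alternatives(first) == alternatives(reordered)
triggered = same_instance_triggered(reordered, first)
```

Here same_alternatives is True while triggered is False, which is why the second query above can return different instances.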

AND; same-instance with prefix indices

If you need to specify which arguments coincide (that is, refer to the same instance) and which don't, you can use prefix indices. Matching arguments with the same pre-index must coincide; matching arguments with different pre-indices must not coincide. Pre-indices must be enclosed in square brackets, "[" and "]".

For example, suppose you are looking for two noun-phrases which are sisters; each noun-phrase immediately dominates a pronoun. Use pre-indices as follows:

((([1]NP* precedes [2]NP*) AND ([1]NP* iDominates [3]PRO)) AND ([2]NP* iDominates [4]PRO))

to find sentences like this one:

/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/

/*
    1 IP-MAT: 5 NP-SBJ-1, 8 NP-OB2, 6 PRO it, 9 PRO him
*/

(0
   (1 IP-MAT (2 CONJ And)
             (3 ADVP-LOC (4 ADV +tere))
             (5 NP-SBJ-1 (6 PRO it))
             (7 VBD lykede)
             (8 NP-OB2 (9 PRO him))
             (10 IP-INF-1 (11 TO to)
                          (12 VB suffre)
                          (13 NP-OB1 (14 Q many)
                                     (15 NS repreuynges)
                                     (16 CONJP (17 CONJ and)
                                               (18 NX (19 NS scornes))))
                          (20 PP (21 P for)
                                 (22 NP (23 PRO vs)))))
      (ID CMMANDEV,1.4))

For another example, here's a query written by Ann Taylor:

query: ((((IP-SMC iDoms [1]NP*)
AND ([1]NP* iDoms [3]\**))
AND (IP-SMC iDoms [2]NP*))
AND ([2]NP* iDoms [4]\**))

This query searches for a node labelled IP-SMC which immediately dominates two different NP* nodes, each immediately dominating a trace. In this example, the two mentions of IP-SMC must coincide (be the same instance); [1]NP* and [2]NP* must not coincide (because of the different pre-indices); similarly, [3]\** and [4]\** must not coincide. [1]NP* and [3]\** are not forced either to coincide or not coincide, because the substrings following the indices ("NP*" and "\**") do not match as strings. Here's a sentence found by this query:

 
/~*
+After +t+am L+acedemonie gecuron him to ladteowe, Ircclidis w+as haten,
(OR4,1.53.30.12)
*~/

/*
    23 IP-SMC: 24 NP-NOM *-2, 25 NP-NOM-PRD *ICH*-1
    23 IP-SMC: 25 NP-NOM-PRD *ICH*-1, 24 NP-NOM *-2
*/


(0  (1 CODE )
  (2 IP-MAT
            (3 PP (4 P +After)
                  (5 NP-DAT (6 D^D +t+am)))
            (7 NP-NOM (8 NPR^N L+acedemonie))
            (9 VBDI gecuron)
            (10 NP-DAT-RFL-ADT (11 PRO|D him))
            (12 PP (13 P to)
                   (14 NP-DAT (15 N|D ladteowe)))
            (16 , ,)
            (17 IP-MAT-PRN (18 NP-NOM-2 *pro*)
                           (19 NP-NOM-1 (20 NPR^N Ircclidis))
                           (21 BEDI w+as)
                           (22 VBN haten)
                           (23 IP-SMC (24 NP-NOM *-2)
                                      (25 NP-NOM-PRD *ICH*-1)))
            (26 . ,))
  (27 ID OR4,1.53.30.12))
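The coincidence rules used in the two queries above can be summarized in a Python sketch. It is a model of the rules as stated in this section, not CorpusSearch's implementation; the function names are hypothetical, indices are assumed to be digits, and the case where only one of two matching arguments carries an index is not described here, so the sketch reports it as such.

```python
import re
from typing import Optional, Tuple

# Optional "[n]" pre-index followed by the argument string itself.
_ARG = re.compile(r"^(?:\[(\d+)\])?(.+)$")


def split_index(arg: str) -> Tuple[Optional[int], str]:
    """Split '[1]NP*' into (1, 'NP*'); an unindexed 'NP*' gives (None, 'NP*')."""
    match = _ARG.match(arg)
    if match is None:
        raise ValueError(f"empty argument: {arg!r}")
    index, label = match.groups()
    return (int(index) if index else None, label)


def relation(arg_a: str, arg_b: str) -> str:
    """Say how two arguments of an AND query are related."""
    index_a, label_a = split_index(arg_a)
    index_b, label_b = split_index(arg_b)
    if label_a != label_b:
        return "unconstrained"  # strings after the indices do not match
    if (index_a is None) != (index_b is None):
        return "not covered by this section"  # assumption: left undecided
    if index_a == index_b:
        return "must coincide"  # same pre-index, or none (the default)
    return "must not coincide"  # different pre-indices


# The arguments of the IP-SMC query above.
smc = relation("IP-SMC", "IP-SMC")
noun_phrases = relation("[1]NP*", "[2]NP*")
traces = relation("[3]\\**", "[4]\\**")
mixed = relation("[1]NP*", "[3]\\**")
```

For the IP-SMC query this gives "must coincide" for the two IP-SMC mentions, "must not coincide" for the two NP* arguments and for the two trace arguments, and "unconstrained" for [1]NP* against [3]\**.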

do node boundaries matter?

Note: this section concerns a subtle point which may not be of interest to beginning CorpusSearch users.

CorpusSearch keeps a boolean variable, "bounds_matter", that is set to "true" by default. The user cannot set it; CorpusSearch sets it automatically. When "bounds_matter" is true, the AND function checks that the node boundaries of the reported structures match. This is useful for queries like this one:

node: CP-REL
query: ((NP* precedes VB*) AND (PP* iDominates CONJ))

Here, CorpusSearch will only return sentences where both structures (NP* precedes VB*) and (PP* iDominates CONJ) are contained in the same CP-REL clause, not two different CP-REL clauses in the sentence.

Under certain conditions, CorpusSearch sets the "bounds_matter" variable to false. When "bounds_matter" is false, no attempt is made to match the node boundary.

1.) If CorpusSearch finds that the node boundary might coincide with a search-function argument, "bounds_matter" is automatically set to false. Here's why: consider this query:

node: IP*
query: ((C iPrecedes IP-SUB*) AND (IP-SUB iDominates NP*))

which finds (among others) this sentence:

/~*
and asked the kynge why he was seke.
(CMMALORY,3.43)
*~/

/*
    1 IP-MAT: 11 C 0, 12 IP-SUB, 14 NP-SBJ
*/

(0
   (1 IP-MAT (2 CONJ and)
             (3 NP-SBJ *con*)
             (4 VBD asked)
             (5 NP-OB2 (6 D the) (7 N kynge))
             (8 CP-QUE
                       (9 WADVP-1 (10 WADV why))
                       (11 C 0)
                       (12 IP-SUB (13 ADVP *T*-1)
                                  (14 NP-SBJ (15 PRO he))
                                  (16 BED was)
                                  (17 ADJP (18 ADJ seke))))
             (19 E_S .))
      (ID CMMALORY,3.43))

Here, the node boundary of the first reported structure, (C iPrecedes IP-SUB*), is (1, IP-MAT). The node boundary of the second reported structure, (IP-SUB iDominates NP*), is (12, IP-SUB), because the node boundary is allowed to coincide with the first argument of the search function. The two boundaries differ, so if "bounds_matter" had been true, this sentence would not have been reported.

A special example of this is when the node boundary is set to *. Then, the node boundary will always coincide with the first argument to the search function. For this reason, when the node boundary is *, "bounds_matter" is set to false.

2.) "bounds_matter" is automatically set to false when the search function refers to a node which is outside the parsed sentence, e.g., the ID node or CODING node, which may be searched using the functions "inID" and "column", respectively.
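The boundary check can be modelled with a short Python sketch, using the CMMALORY,3.43 example above. The class and function names are hypothetical, the sketch leaves out the same-instance check, and it does not show how CorpusSearch decides the value of "bounds_matter"; it only shows what the value changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    boundary: int  # index of the boundary node, e.g. 1 for "1 IP-MAT"
    nodes: tuple   # indices of the nodes matched by the search function


def and_hits(first, second, bounds_matter=True):
    """Pair up hits of two search-function calls.

    When bounds_matter is true, a pair is kept only if both hits were
    found inside the same boundary node.
    """
    pairs = []
    for a in first:
        for b in second:
            if bounds_matter and a.boundary != b.boundary:
                continue  # structures lie in different boundary nodes
            pairs.append((a, b))
    return pairs


# CMMALORY,3.43: boundaries (1, IP-MAT) and (12, IP-SUB).
first = [Hit(boundary=1, nodes=(11, 12))]    # C iPrecedes IP-SUB*
second = [Hit(boundary=12, nodes=(12, 14))]  # IP-SUB iDominates NP*

strict = and_hits(first, second, bounds_matter=True)
relaxed = and_hits(first, second, bounds_matter=False)
```

In this model strict is empty and relaxed contains the one pair, which matches the account above: the sentence is reported only because "bounds_matter" was set to false.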

! (not-argument)

! is used to negate the argument to a search function.

For instance, suppose you're looking for sentences whose subject does not immediately dominate a pronoun. You could use this query:

(NP-SBJ* iDominates !PRO*)

to obtain sentences like this:

/~*
a runde fot & +ticke bi-come+t an hors wel.
(CMHORSES,87.17)
*~/

/*
    1 IP-MAT: 2 NP-SBJ, 10 ADJ +ticke
*/

(0
   (1 IP-MAT
             (2 NP-SBJ (3 D a)
                       (4 ADJP (5 ADJ runde)
                               (6 CONJP *ICH*-1))
                       (7 N fot)
                       (8 CONJP-1 (9 CONJ &))
                       (10 ADJ +ticke))
             (11 VBP bi-come+t)
             (12 NP-OB1 (13 D an) (14 N hors))
             (15 ADVP (16 ADV wel))
             (17 E_S .))
      (ID CMHORSES,87.17))
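A minimal Python sketch of how a negated argument might be tested against node labels follows. It is an assumption-laden model, not CorpusSearch's matcher: "*" is treated as "any run of characters", escaped asterisks such as \* are not handled, and the function names are hypothetical. The labels are the immediate children of NP-SBJ in the tree above.

```python
import re


def label_matches(pattern: str, label: str) -> bool:
    """Match a label against a pattern in which '*' stands for any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, label) is not None


def arg_matches(arg: str, label: str) -> bool:
    """A leading '!' negates the argument."""
    if arg.startswith("!"):
        return not label_matches(arg[1:], label)
    return label_matches(arg, label)


# Immediate children of node 2 (NP-SBJ) in CMHORSES,87.17.
children = [(3, "D"), (4, "ADJP"), (7, "N"), (8, "CONJP-1"), (10, "ADJ")]

# Children that satisfy the negated argument !PRO*.
non_pro = [(index, label) for index, label in children if arg_matches("!PRO*", label)]
last_non_pro = non_pro[-1]
```

All five children satisfy !PRO*, and the last of them is node 10 (ADJ), which is the node named in the output above.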

! one argument at a time

CorpusSearch does not allow you to negate both arguments to a single search function. So this is *not* a legitimate command, and will abort the search:

(!NP-SBJ iPrecedes !VBD)
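If you generate queries with a script, a check like the following Python sketch can catch this case before the search is run. The function name is hypothetical and the check covers only the rule stated here.

```python
def is_legitimate(first_arg: str, second_arg: str) -> bool:
    """Return False when both arguments of one search function are negated."""
    return not (first_arg.startswith("!") and second_arg.startswith("!"))


both_negated = is_legitimate("!NP-SBJ", "!VBD")    # the query shown above
one_negated = is_legitimate("NP-SBJ*", "!PRO*")    # the earlier !PRO* query
```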

! before prefix indices

If you need to use both ! and prefix indices, put the ! before the indices.

For instance, suppose you're looking for sentences that contain a subject that precedes the object, and neither the subject nor the object contains a pronoun. You could use this query:

(((NP-SBJ* precedes NP-OB1*)
AND (NP-SBJ* iDominates ![1]PRO*))
AND (NP-OB1* iDominates ![2]PRO*))

to obtain sentences like these:

/~*
& +tat schal be a good hors.
(CMHORSES,85.9)
*~/

/*
    1 IP-MAT: 3 NP-SBJ, 7 NP-OB1, 4 D +tat, 10 N hors
*/

(0
   (1 IP-MAT (2 CONJ &)
             (3 NP-SBJ (4 D +tat))
             (5 MD schal)
             (6 BE be)
             (7 NP-OB1 (8 D a) (9 ADJ good) (10 N hors))
             (11 E_S .))
      (ID CMHORSES,85.9))

Notice that it is necessary to use pre-indices before the PRO* labels. Otherwise, CorpusSearch would try to find an NP-SBJ* and an NP-OB1* both dominating the *same* not-PRO* object, and would come up empty.
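The order "!, then index, then label" can be made concrete with a Python sketch that takes such an argument apart. It is a hypothetical helper, not part of CorpusSearch; it assumes indices are digits and handles only the documented order.

```python
import re
from typing import NamedTuple, Optional


class Argument(NamedTuple):
    negated: bool
    index: Optional[int]
    label: str


# Optional "!", then optional "[n]", then the label (documented order only).
_PATTERN = re.compile(r"^(!)?(?:\[(\d+)\])?(.+)$")


def parse_argument(text: str) -> Argument:
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"empty argument: {text!r}")
    bang, index, label = match.groups()
    return Argument(bang is not None, int(index) if index else None, label)


subject_side = parse_argument("![1]PRO*")
object_side = parse_argument("![2]PRO*")

# Same label, different indices: the two not-PRO* nodes must not coincide.
must_differ = (
    subject_side.label == object_side.label
    and subject_side.index != object_side.index
)
```

Because the two arguments share the label PRO* but carry different indices, they are required to be different nodes, which is what the query above needs.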

| (or argument)

Any number of arguments to a search function may be linked together into an argument list using |, which means "or". For instance,

(*VB*|*HV*|*BE*|*DO*|*MD* iPrecedes NP-SBJ*)

means "*VB* or *HV* or *BE* or *DO* or *MD* immediately precedes NP-SBJ*," and will find sentences like this:

/~*
+Tan was pompe & pryde cast down & leyd on syde.
(CMKEMPE,2.12)
*~/

/*
    2 IP-MAT-1: 5 BED was, 6 NP-SBJ
*/

(NODE
      (2 IP-MAT-1
                  (3 ADVP-TMP (4 ADV +Tan))
                  (5 BED was)
                  (6 NP-SBJ (7 N pompe) (8 CONJ &) (9 N pryde))
                  (10 VAN cast)
                  (11 RP down))
      (ID CMKEMPE,2.12))

negating a list

If a list is preceded by !, the entire list is negated. So,

(!*VB*|*HV*|*BE*|*DO*|*MD* iPrecedes NP-SBJ*)

means, "none of these (*VB* or *HV* or *BE* or *DO* or *MD*) immediately precedes NP-SBJ*", and finds sentences like this:

 
/~*
& sche wold not consentyn in no wey,
(CMKEMPE,3.34)
*~/

/*
    1 IP-MAT: 2 CONJ &, 3 NP-SBJ
*/

(0
   (1 IP-MAT (2 CONJ &)
             (3 NP-SBJ (4 PRO sche))
             (5 MD wold)
             (6 NEG not)
             (7 VB consentyn)
             (8 PP (9 P in)
                   (10 NP (11 Q no) (12 N wey)))
             (13 E_S ,))
      (ID CMKEMPE,3.34))
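How an argument list and its negation behave can be sketched in Python. As before, this is a model under assumptions rather than CorpusSearch's matcher: "*" is taken to mean "any run of characters", escaped asterisks are not handled, and the function names are hypothetical.

```python
import re


def label_matches(pattern: str, label: str) -> bool:
    """Match a label against a pattern in which '*' stands for any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, label) is not None


def arg_matches(arg: str, label: str) -> bool:
    """Test a label against an argument list; a leading '!' negates the whole list."""
    negated = arg.startswith("!")
    patterns = (arg[1:] if negated else arg).split("|")
    found = any(label_matches(pattern, label) for pattern in patterns)
    return not found if negated else found


VERBS = "*VB*|*HV*|*BE*|*DO*|*MD*"

# CMKEMPE,2.12: BED immediately precedes NP-SBJ.
bed_in_list = arg_matches(VERBS, "BED")
# CMKEMPE,3.34: CONJ immediately precedes NP-SBJ.
conj_in_negated_list = arg_matches("!" + VERBS, "CONJ")
bed_in_negated_list = arg_matches("!" + VERBS, "BED")
```

BED satisfies the plain list, so the first query reports it; CONJ satisfies the negated list because it matches none of the alternatives, so the second query reports it; BED does not satisfy the negated list.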
