CorpusSearch Search Functions

contents of this chapter:

x search-function y
search-function name variants
Search Functions:
exists
precedes
iPrecedes
anyPrecedes
dominates
iDominates
iDomsOnly
iDomsNumber#
iDomsLast#
domsWords#
domsWords<#
domsWords>#
iDomsTotal#
iDomsTotal<#
iDomsTotal>#
inID
column

x search-function y

I commonly refer to the first argument to a search function as "x", and the second argument as "y".

search-function name variants

To save typing, CorpusSearch allows shorthands and lower-case/upper-case variations for the names of search functions. Before the query is sent to the parser, all variant names are changed to their standard forms. This is why variant names appear in their standard form when the query is printed in the output file.
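
The renaming step can be pictured with a small Python sketch. The lookup table below is hypothetical: it is built from a few of the variant lists given in this chapter, and CorpusSearch's own table and tokenizer are internal and may well differ.

```python
# Hypothetical table: standard name -> variants, copied from this chapter.
STANDARD_NAMES = {
    "exists": ["Exists"],
    "precedes": ["Precedes", "pres", "Pres"],
    "iPrecedes": ["iprecedes", "i_Precedes", "i_precedes",
                  "i_Pres", "i_pres", "iPres", "ipres"],
    "dominates": ["Dominates", "doms", "Doms"],
    "iDominates": ["idominates", "iDoms", "idoms", "i_Dominates",
                   "i_dominates", "i_Doms", "i_doms"],
}

# Invert the table so that each variant points at its standard form.
VARIANT_TO_STANDARD = {
    variant: standard
    for standard, variants in STANDARD_NAMES.items()
    for variant in variants
}


def normalize_query(query: str) -> str:
    """Replace every variant function name with its standard form."""
    tokens = query.replace("(", " ( ").replace(")", " ) ").split()
    renamed = [VARIANT_TO_STANDARD.get(token, token) for token in tokens]
    return " ".join(renamed).replace("( ", "(").replace(" )", ")")


print(normalize_query("((MD i_pres NP-OB*) AND (NP-OB* ipres VB))"))
```

This sketch renames any token that happens to equal a variant name, so a search argument spelled "pres" would also be rewritten; a real implementation would only look at the function position.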

exists

variants: Exists

exists searches for label or text anywhere in the sentence. For instance, this query:

(MD0 exists)

will find this sentence:

/~*
but I fere me that I shal not conne wel goo thyder /
(CMREYNAR,14.261)
*~/

/*
    10 IP-SUB: 15 MD0 conne
*/

(NODE
      (10 IP-SUB
                 (11 NP-SBJ (12 PRO I))
                 (13 MD shal)
                 (14 NEG not)
                 (15 MD0 conne)
                 (16 ADVP (17 ADV wel))
                 (18 VB goo)
                 (19 ADVP-DIR (20 ADV thyder)))
      (ID CMREYNAR,14.261))

A common mistake is to use "exists" unnecessarily, as in this example:

((MD exists) AND (MD iPrecedes VB))

If a sentence contains the structure (MD iPrecedes VB), MD necessarily exists in the sentence. So this query would get the same result:

(MD iPrecedes VB)
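
As a rough illustration of what "exists" checks, here is a minimal Python sketch. The tuple encoding of the tree and the use of fnmatch for the "*" wildcard are assumptions made for this example, not CorpusSearch's implementation.

```python
from fnmatch import fnmatchcase

# Assumed encoding: a node is (label, child, child, ...);
# a leaf's single child is its text.
tree = ("IP-SUB",
        ("NP-SBJ", ("PRO", "I")),
        ("MD", "shal"),
        ("NEG", "not"),
        ("MD0", "conne"),
        ("ADVP", ("ADV", "wel")),
        ("VB", "goo"),
        ("ADVP-DIR", ("ADV", "thyder")))


def exists(node, pattern):
    """True if any label or text in the tree matches the pattern."""
    if isinstance(node, str):
        return fnmatchcase(node, pattern)
    label, *children = node
    return fnmatchcase(label, pattern) or any(
        exists(child, pattern) for child in children)


print(exists(tree, "MD0"))     # label search
print(exists(tree, "conne"))   # text search
print(exists(tree, "NP-OB*"))  # no object NP in this clause
```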

precedes

variants: Precedes, pres, Pres

precedes means "sister precedes". That is, x sister-precedes y when x and y are immediately dominated by the same node and x comes before y, whether or not other sisters stand between them. Either argument may be a label or text. So this query:

(VB precedes NP-OB*)

produces this output:

/~*
thenne have ye cause to make myghty werre upon hym. '
(CMMALORY,2.25)
*~/

/*
    9 IP-INF-PRP: 11 VB make, 12 NP-OB1
*/

(NODE
      (9 IP-INF-PRP (10 TO to)
                    (11 VB make)
                    (12 NP-OB1 (13 ADJ myghty)
                               (14 N werre)
                               (15 PP (16 P upon)
                                      (17 NP (18 PRO hym)))))
      (ID CMMALORY,2.25))
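
The sister relation can be sketched in Python as follows. The tree encoding and the rule that a leaf matches on either its label or its text are assumptions made for this illustration.

```python
from fnmatch import fnmatchcase

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
tree = ("IP-INF-PRP",
        ("TO", "to"),
        ("VB", "make"),
        ("NP-OB1", ("ADJ", "myghty"),
                   ("N", "werre"),
                   ("PP", ("P", "upon"),
                          ("NP", ("PRO", "hym")))))


def matches(node, pattern):
    """A node matches on its label or, for a leaf, on its text."""
    return any(fnmatchcase(part, pattern)
               for part in node if isinstance(part, str))


def precedes(node, x, y):
    """True if some node matching x has a later sister matching y."""
    kids = [child for child in node[1:] if not isinstance(child, str)]
    for i, left in enumerate(kids):
        if matches(left, x) and any(matches(right, y)
                                    for right in kids[i + 1:]):
            return True
    return any(precedes(kid, x, y) for kid in kids)


print(precedes(tree, "VB", "NP-OB*"))    # label, label
print(precedes(tree, "make", "NP-OB*"))  # text, label
print(precedes(tree, "VB", "N"))         # N is not a sister of VB
```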

iPrecedes

variants: iprecedes, i_Precedes, i_precedes, i_Pres, i_pres, iPres, ipres

iPrecedes means "immediately sister precedes." That is, x immediately sister-precedes y when x and y are immediately dominated by the same node and x comes immediately before y, with no sister between them. Notice that every "iPrecedes" match is also a "precedes" match. This query:

((MD iPrecedes NP-OB*) AND (NP-OB* iPrecedes VB))

produces this output:

/~*
But the kynges wold none receyve,
(CMMALORY,12.337)
*~/

/*
    1 IP-MAT: 6 MD wold, 7 NP-OB1, 9 VB receyve
*/

(0
   (1 IP-MAT (2 CONJ But)
             (3 NP-SBJ (4 D the) (5 NS kynges))
             (6 MD wold)
             (7 NP-OB1 (8 Q none))
             (9 VB receyve)
             (10 E_S ,))
      (ID CMMALORY,12.337))
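
To make the difference from "precedes" concrete, here is a small Python sketch of both relations over the clause above. The encoding and helper names are assumptions for this example, and each clause of the query is checked on its own rather than as a combined AND.

```python
from fnmatch import fnmatchcase

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
tree = ("IP-MAT",
        ("CONJ", "But"),
        ("NP-SBJ", ("D", "the"), ("NS", "kynges")),
        ("MD", "wold"),
        ("NP-OB1", ("Q", "none")),
        ("VB", "receyve"),
        ("E_S", ","))


def sister_pairs(node, immediate):
    """Yield (left, right) pairs of sisters, left before right."""
    kids = [child for child in node[1:] if not isinstance(child, str)]
    for i, left in enumerate(kids):
        rights = kids[i + 1:i + 2] if immediate else kids[i + 1:]
        for right in rights:
            yield left, right
    for kid in kids:
        yield from sister_pairs(kid, immediate)


def i_precedes(node, x, y):
    return any(fnmatchcase(left[0], x) and fnmatchcase(right[0], y)
               for left, right in sister_pairs(node, immediate=True))


def precedes(node, x, y):
    return any(fnmatchcase(left[0], x) and fnmatchcase(right[0], y)
               for left, right in sister_pairs(node, immediate=False))


print(i_precedes(tree, "MD", "NP-OB*"), i_precedes(tree, "NP-OB*", "VB"))
print(i_precedes(tree, "MD", "VB"), precedes(tree, "MD", "VB"))
```

MD and VB are sisters with NP-OB1 between them, so the pair satisfies "precedes" but not "iPrecedes".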

anyPrecedes

variants: anyprecedes, any_Precedes, any_precedes, any_pres, any_Pres, AnyPrecedes, Any_Precedes, Any_Pres, Any_pres

anyPrecedes means "precedes anywhere but does not dominate." That is, x precedes y somewhere in the sentence, but y is not contained in the sub-tree dominated by x. "anyPrecedes" is a superset of "precedes". So this query:

(*-LFD anyPrecedes *-RSP)

results in this output:

/~*
And who woll sey the contrary, I woll preve hit on hys body. '
(CMMALORY,36.1140)
*~/

/*
    1 IP-MAT-SPE: 3 NP-LFD, 24 NP-RSP
*/

(0
   (1 IP-MAT-SPE (2 CONJ And)
                 (3 NP-LFD
                           (4 CP-FRL
                                     (5 WNP-1 (6 WPRO who))
                                     (7 C 0)
                                     (8 IP-SUB (9 NP-SBJ *T*-1)
                                               (10 MD woll)
                                               (11 VB sey)
                                               (12 NP-OB1 (13 D the) (14 ADJ contrary)))))
                 (15 , ,)
                 (16 NP-SBJ (17 PRO I))
                 (18 MD woll)
                 (19 VB preve)
                 (20 NP-OB1 (21 PRO hit))
                 (22 PP (23 P on)
                        (24 NP-RSP (25 PRO$ hys) (26 N body)))
                 (27 E_S .)
                 (28 ' '))
      (ID CMMALORY,36.1140))
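
One way to picture "anyPrecedes" is to number the nodes in the order they are printed and require that x's whole sub-tree ends before y begins. The Python sketch below does this on an abridged version of the tree above; the encoding and function names are assumptions for this example.

```python
from fnmatch import fnmatchcase

# Abridged version of the sentence above; (label, child, ...) encoding.
tree = ("IP-MAT-SPE",
        ("CONJ", "And"),
        ("NP-LFD",
         ("CP-FRL", ("WNP-1", ("WPRO", "who")),
                    ("C", "0"),
                    ("IP-SUB", ("MD", "woll"), ("VB", "sey")))),
        ("NP-SBJ", ("PRO", "I")),
        ("MD", "woll"),
        ("VB", "preve"),
        ("PP", ("P", "on"),
               ("NP-RSP", ("PRO$", "hys"), ("N", "body"))))


def number_nodes(node, out=None):
    """Number labelled nodes in preorder; record where each sub-tree ends."""
    if out is None:
        out = []
    entry = {"label": node[0], "start": len(out), "end": len(out)}
    out.append(entry)
    for child in node[1:]:
        if not isinstance(child, str):
            number_nodes(child, out)
    entry["end"] = len(out) - 1
    return out


def any_precedes(tree, x, y):
    """True if a node matching x ends before a node matching y starts."""
    nodes = number_nodes(tree)
    return any(fnmatchcase(a["label"], x) and fnmatchcase(b["label"], y)
               and a["end"] < b["start"]
               for a in nodes for b in nodes)


print(any_precedes(tree, "*-LFD", "*-RSP"))
print(any_precedes(tree, "*-LFD", "CP-FRL"))  # domination, not precedence
```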

dominates

variants: Dominates, doms, Doms

dominates means "dominates to any generation." That is, y is contained in the sub-tree dominated by x. dominates will accept text as y, but text as x will always return an empty set (text never dominates a sub-tree). The following query uses the escape character, "\", to search for the literal text *arb*:

(IP-INF dominates \*arb*)

It returns this sentence:

/~*
And soo by the counceil of Merlyn the kyng lete calle his barons to counceil,
(CMMALORY,14.419)
*~/

/*
    18 IP-INF: 19 NP-SBJ *arb*
*/

(NODE
      (18 IP-INF (19 NP-SBJ *arb*)
                 (20 VB calle)
                 (21 NP-OB1 (22 PRO$ his) (23 NS barons))
                 (24 PP (25 P to)
                        (26 NP (27 N counceil))))
      (ID CMMALORY,14.419))
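
The following Python sketch illustrates both points: domination to any depth, and the escaped star. The pattern translator is a simplification written for this example (only "*" and "\*" are handled) and is not CorpusSearch's matcher.

```python
import re

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
tree = ("IP-INF",
        ("NP-SBJ", "*arb*"),
        ("VB", "calle"),
        ("NP-OB1", ("PRO$", "his"), ("NS", "barons")),
        ("PP", ("P", "to"),
               ("NP", ("N", "counceil"))))


def to_regex(pattern):
    """'*' is a wildcard; a backslash-escaped star is a literal '*'."""
    pieces = re.split(r"(\\\*|\*)", pattern)
    translated = []
    for piece in pieces:
        if piece == "*":
            translated.append(".*")
        elif piece == "\\*":
            translated.append(re.escape("*"))
        else:
            translated.append(re.escape(piece))
    return re.compile("".join(translated))


def descendants(node):
    """Yield every label and text below node, at any depth."""
    for child in node[1:]:
        if isinstance(child, str):
            yield child
        else:
            yield child[0]
            yield from descendants(child)


def dominates(node, x, y):
    if isinstance(node, str):
        return False  # text never dominates anything
    if to_regex(x).fullmatch(node[0]) and any(
            to_regex(y).fullmatch(item) for item in descendants(node)):
        return True
    return any(dominates(child, x, y) for child in node[1:])


print(dominates(tree, "IP-INF", r"\*arb*"))
print(dominates(tree, "IP-INF", "N"))   # two generations down
print(dominates(tree, "calle", "VB"))   # text as x finds nothing
```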

iDominates

variants: idominates, iDoms, idoms, i_Dominates, i_dominates, i_Doms, i_doms

iDominates means "immediately dominates". That is, x immediately dominates y if y is a child of x (exactly one generation below x). So this query:

((NP* iDominates FP) AND (FP iDominates ane))

finds this sentence:

/~*
Sythen he ledes +tam by +tar ane,
(CMROLLEP,118.978)
*~/

/*
    1 IP-MAT: 11 NP, 13 FP ane
*/

(0
   (1 IP-MAT
             (2 ADVP-TMP (3 ADV Sythen))
             (4 NP-SBJ (5 PRO he))
             (6 VBP ledes)
             (7 NP-OB1 (8 PRO +tam))
             (9 PP (10 P by)
                   (11 NP (12 PRO$ +tar) (13 FP ane)))
             (14 E_S ,))
      (ID CMROLLEP,118.978))

Notice that "iDominates" describes the relationship between a label and its associated text (e.g., "FP" and "ane").
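
A minimal Python sketch of the one-generation check, in which a leaf's text counts as the child of its label. The tree encoding is an assumption made for this example.

```python
from fnmatch import fnmatchcase

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
tree = ("PP", ("P", "by"),
              ("NP", ("PRO$", "+tar"), ("FP", "ane")))


def i_dominates(node, x, y):
    """True if a node matching x has a direct child (label or text) matching y."""
    if isinstance(node, str):
        return False
    children = node[1:]
    child_names = [c if isinstance(c, str) else c[0] for c in children]
    if fnmatchcase(node[0], x) and any(fnmatchcase(name, y)
                                       for name in child_names):
        return True
    return any(i_dominates(child, x, y) for child in children)


print(i_dominates(tree, "NP*", "FP"))   # label over label
print(i_dominates(tree, "FP", "ane"))   # label over its own text
print(i_dominates(tree, "NP*", "ane"))  # two generations apart
```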

iDomsOnly

variants: i_Doms_Only, i_doms_only, iDominatesOnly, i_dominates_only, idomsonly

iDomsOnly means "immediately dominates as an only child." That is, x immediately dominates y as an only child if x immediately dominates y and y is the only legitimate child of x. So this query:

(ADJP iDomsOnly Q*)

results in this output:

 
/~*
But after my lytyll wytt it semeth me, sauynge here reuerence, +tat is more.
(CMMANDEV,123.2992)
*~/

/*
    23 IP-SUB: 27 ADJP, 28 QR more
*/

(NODE
      (23 IP-SUB
                 (24 NP-SBJ (25 D +tat))
                 (26 BEP is)
                 (27 ADJP (28 QR more)))
      (ID CMMANDEV,123.2992))

iDomsNumber#

variants: iDomsNum, idomsnum, idomsnumber, IDomsNumber, IDomsNum

iDomsNumber means "immediately dominates as the #th child", where # is tacked on to the end of iDomsNumber (the parser must read "iDomsNumber#" as a single string, so leave no space before the number). That is, x immediately dominates y as the #th child if x immediately dominates y and y is the #th child of x. Notice that every iDomsOnly match is also an iDomsNumber1 match. This query:

(CP-DEG iDomsNumber1 C)

produces this output:

/~*
And Merlion was so disgysed that kynge Arthure knewe hym nat,
(CMMALORY,30.939)
*~/

/*
    1 IP-MAT: 9 CP-DEG, 10 C that
*/

(0
   (1 IP-MAT (2 CONJ And)
             (3 NP-SBJ (4 NPR Merlion))
             (5 BED was)
             (6 ADJP (7 ADVR so)
                     (8 VAN disgysed)
                     (9 CP-DEG (10 C that)
                               (11 IP-SUB
                                          (12 NP-SBJ (13 NPR kynge) (14 NPR Arthure))
                                          (15 VBD knewe)
                                          (16 NP-OB1 (17 PRO hym))
                                          (18 NEG nat))))
             (19 E_S ,))
      (ID CMMALORY,30.939))
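
The relation between iDomsNumber# and iDomsOnly can be sketched in Python as below. The encoding and function names are assumptions for this example, and every child is counted; whatever CorpusSearch treats as a non-legitimate child is not modelled here.

```python
from fnmatch import fnmatchcase

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
tree = ("ADJP",
        ("ADVR", "so"),
        ("VAN", "disgysed"),
        ("CP-DEG", ("C", "that"),
                   ("IP-SUB", ("NP-SBJ", ("NPR", "kynge"), ("NPR", "Arthure")),
                              ("VBD", "knewe"),
                              ("NP-OB1", ("PRO", "hym")),
                              ("NEG", "nat"))))


def walk(node):
    """Yield every labelled node in the tree."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from walk(child)


def child_names(node):
    return [c if isinstance(c, str) else c[0] for c in node[1:]]


def i_doms_number(tree, x, y, n):
    """True if some node matching x has a child matching y in position n (1-based)."""
    return any(fnmatchcase(node[0], x)
               and len(child_names(node)) >= n
               and fnmatchcase(child_names(node)[n - 1], y)
               for node in walk(tree))


def i_doms_only(tree, x, y):
    """True if some node matching x has exactly one child, and it matches y."""
    return any(fnmatchcase(node[0], x)
               and len(child_names(node)) == 1
               and fnmatchcase(child_names(node)[0], y)
               for node in walk(tree))


print(i_doms_number(tree, "CP-DEG", "C", 1))
print(i_doms_only(tree, "CP-DEG", "C"))      # C has a sister, IP-SUB
print(i_doms_only(tree, "NP-OB1", "PRO"))
```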

iDomsLast#

variants: idomslast, iDomsLast, Idomslast

iDomsLast is similar to iDomsNumber but it counts backward from the last child. So iDomsLast1 means "immediately dominates as the last child", iDomsLast2 means "immediately dominates as the second-to-last child", and so on. So this query, which looks for a sentence ending with three prepositional phrases:

(((IP* iDomsLast1 [1]PP) AND (IP* iDomsLast2 [2]PP)) AND (IP* iDomsLast3 [3]PP))

(notice the use of prefix indices since the three PPs must be different) produces this output:

/~*
and soo I went unto bed with hym as I ought to do with my lord;
(CMMALORY,5.128)
*~/

/*
    1 IP-MAT-SPE: 16 PP, 12 PP, 8 PP
*/

(0
   (1 IP-MAT-SPE (2 CONJ and)
                 (3 ADVP (4 ADV soo))
                 (5 NP-SBJ (6 PRO I))
                 (7 VBD went)
                 (8 PP (9 P unto)
                       (10 NP (11 N bed)))
                 (12 PP (13 P with)
                        (14 NP (15 PRO hym)))
                 (16 PP (17 P as)
                        (18 CP-ADV (19 WADVP-1 0)
                                   (20 C 0)
                                   (21 IP-SUB (22 ADVP *T*-1)
                                              (23 NP-SBJ (24 PRO I))
                                              (25 MD ought)
                                              (26 TO to)
                                              (27 DO do)
                                              (28 PP (29 P with)
                                                     (30 NP (31 PRO$ my) (32 N lord))))))
                 (33 E_S ;))
      (ID CMMALORY,5.128))
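
In the output above, the PP at node 16 is matched as the last child even though the E_S node follows it, which suggests that punctuation is skipped when counting. The Python sketch below reproduces that behaviour under this assumption; the IGNORED set is a guess made for the example, not a documented CorpusSearch setting, and the third PP is abridged.

```python
from fnmatch import fnmatchcase

# Assumption: these labels are skipped when counting children.
IGNORED = {"E_S", ",", "'", '"'}

# Assumed encoding: (label, child, ...); the third PP is abridged.
tree = ("IP-MAT-SPE",
        ("CONJ", "and"),
        ("ADVP", ("ADV", "soo")),
        ("NP-SBJ", ("PRO", "I")),
        ("VBD", "went"),
        ("PP", ("P", "unto"), ("NP", ("N", "bed"))),
        ("PP", ("P", "with"), ("NP", ("PRO", "hym"))),
        ("PP", ("P", "as"), ("CP-ADV", ("C", "0"))),
        ("E_S", ";"))


def counted_children(node):
    """Child nodes that take part in the count."""
    return [child for child in node[1:]
            if not isinstance(child, str) and child[0] not in IGNORED]


def i_doms_last(node, y, n):
    """True if the n-th counted child from the end matches y."""
    kids = counted_children(node)
    return len(kids) >= n and fnmatchcase(kids[-n][0], y)


ends_with_three_pps = all(i_doms_last(tree, "PP", n) for n in (1, 2, 3))
print(ends_with_three_pps)
print(i_doms_last(tree, "PP", 4))  # fourth from the end is VBD
```

Because each position from the end picks out a different child, the three PPs found here are necessarily distinct, which is what the prefix indices guarantee in the query itself.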

domsWords#

domsWords# counts the number of words dominated by the search-function argument. So "domsWords4" means "dominates 4 words", "domsWords2" means "dominates 2 words", and so on. A word in this case is defined as a leaf node that is not on the word_ignore_list. Here's the default word_ignore_list:

REMOVED|COMMENT|CODE|ID|LB|'|\"|,|E_S|0|\**

Thus, traces, 0 complementizers, punctuation, and comments are not counted as words.

So this query:

(NP-OB* domsWords3)

will return this structure (ignoring the trace *ICH*-1):

/~*
and by kynge Ban and Bors his counceile they lette brenne and destroy all the
contrey before them there they sholde ryde.
(CMMALORY,20.613)
*~/

/*
    24 NP-OB1: 27 N contrey
*/

(NODE
      (24 NP-OB1 (25 Q all)
                 (26 D the)
                 (27 N contrey)
                 (28 CP-REL *ICH*-1))
      (ID CMMALORY,20.613))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").

domsWords<#

domsWords<# is just like domsWords# except that it returns structures that dominate strictly fewer than the given number of words. For instance, this query:

(NP-OB* domsWords<3)

will return this structure:

/~*
for it was I myself that cam in the lykenesse.
(CMMALORY,5.131)
*~/

/*
    6 NP-OB1: 9 PRO$+N myself
*/

(NODE
      (6 NP-OB1 (7 PRO I)
                (8 NP-PRN (9 PRO$+N myself)))
      (ID CMMALORY,5.131))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").

domsWords>#

domsWords># is just like domsWords# except that it returns structures that dominate strictly more than the given number of words. For instance, this query:

(NP-OB* domsWords>3)

will return this structure:

/~*
for she was called a fair lady and a passynge wyse,
(CMMALORY,2.9)
*~/

/*
    9 NP-OB1: 20 ADJ wyse
*/

(NODE
      (9 NP-OB1
                (10 NP (11 D a) (12 ADJ fair) (13 N lady))
                (14 CONJP (15 CONJ and)
                          (16 NP (17 D a)
                                 (18 ADJP (19 ADV passynge) (20 ADJ wyse)))))
      (ID CMMALORY,2.9))

(only the NP-OB1 node was printed in this output because the query file included the line "node: NP*").
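
The three domsWords functions differ only in the comparison applied to the word count. The Python sketch below counts words in the three NP-OB1 nodes shown in this chapter. How the word_ignore_list is applied (to a leaf's label or its text, with the "\**" entry read as "text beginning with a star") is an assumption made for this example.

```python
import operator

# Entries of the default word_ignore_list, except the "\**" pattern,
# which is handled separately in is_word below.
IGNORE = {"REMOVED", "COMMENT", "CODE", "ID", "LB", "'", '"', ",", "E_S", "0"}


def is_word(label, text):
    """Assumption: a leaf is ignored if its label or its text is on the list."""
    if text.startswith("*"):  # traces such as *ICH*-1 or *T*-1
        return False
    return label not in IGNORE and text not in IGNORE


def count_words(node):
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return 1 if is_word(label, children[0]) else 0
    return sum(count_words(child) for child in children)


def doms_words(node, n, compare=operator.eq):
    """compare is operator.eq, operator.lt or operator.gt."""
    return compare(count_words(node), n)


contrey = ("NP-OB1", ("Q", "all"), ("D", "the"), ("N", "contrey"),
           ("CP-REL", "*ICH*-1"))
myself = ("NP-OB1", ("PRO", "I"), ("NP-PRN", ("PRO$+N", "myself")))
lady = ("NP-OB1",
        ("NP", ("D", "a"), ("ADJ", "fair"), ("N", "lady")),
        ("CONJP", ("CONJ", "and"),
                  ("NP", ("D", "a"),
                         ("ADJP", ("ADV", "passynge"), ("ADJ", "wyse")))))

print(doms_words(contrey, 3))              # domsWords3
print(doms_words(myself, 3, operator.lt))  # domsWords<3
print(doms_words(lady, 3, operator.gt))    # domsWords>3
```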

iDomsTotal#

iDomsTotal# counts the number of nodes immediately dominated by the search-function argument. So this query:

(NP-OB* iDomsTotal3)

results in this output:

/~*
And +tere it lykede him to suffre many repreuynges and scornes for vs
(CMMANDEV,1.4)
*~/

/*
    10 IP-INF-1: 13 NP-OB1, 16 CONJP
*/

(NODE
      (10 IP-INF-1 (11 TO to)
                   (12 VB suffre)
                   (13 NP-OB1 (14 Q many)
                              (15 NS repreuynges)
                              (16 CONJP (17 CONJ and)
                                        (18 NX (19 NS scornes))))
                   (20 PP (21 P for)
                          (22 NP (23 PRO vs))))
      (ID CMMANDEV,1.4))

Here, the 3 nodes immediately dominated by NP-OB1 are labelled Q, NS, and CONJP.

iDomsTotal<#

iDomsTotal<# is like iDomsTotal# except that it returns structures that immediately dominate strictly fewer than the given number of nodes. So this query:

(NP-OB* iDomsTotal<3)

results in this output:

/~*
& take of euereche iliche myche
(CMHORSES,125.397)
*~/

/*
    1 IP-IMP: 8 NP-OB1, 9 QP
*/

(0
   (1 IP-IMP (2 CONJ &)
             (3 VBI take)
             (4 PP (5 P of)
                   (6 NP (7 Q euereche)))
             (8 NP-OB1
                       (9 QP (10 ADV iliche) (11 Q myche))))
      (ID CMHORSES,125.397))

iDomsTotal>#

iDomsTotal># is like iDomsTotal# except that it returns structures that immediately dominate strictly more than the given number of nodes. So this query:

(NP-OB* iDomsTotal>3)

will produce this output:

/~*
& aftur tak an hot yre +tat is smal bi-fore
(CMHORSES,95.119)
*~/

/*
    1 IP-IMP: 6 NP-OB1, 10 CP-REL
*/

(0
   (1 IP-IMP (2 CONJ &)
             (3 ADVP-TMP (4 ADV aftur))
             (5 VBI tak)
             (6 NP-OB1 (7 D an)
                       (8 ADJ hot)
                       (9 N yre)
                       (10 CP-REL (11 WNP-1 0)
                                  (12 C +tat)
                                  (13 IP-SUB (14 NP-SBJ *T*-1)
                                             (15 BEP is)
                                             (16 ADJP (17 ADJ smal))
                                             (18 ADVP-LOC (19 ADV bi-fore))))))
      (ID CMHORSES,95.119))
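
As with domsWords, the three iDomsTotal functions apply a different comparison to the same count. A minimal Python sketch, using the three NP-OB1 nodes from the examples above (the last one abridged); the encoding is an assumption, and no children are skipped when counting.

```python
import operator

# Assumed encoding: (label, child, ...); a leaf's single child is its text.
repreuynges = ("NP-OB1", ("Q", "many"),
                         ("NS", "repreuynges"),
                         ("CONJP", ("CONJ", "and"),
                                   ("NX", ("NS", "scornes"))))
myche = ("NP-OB1", ("QP", ("ADV", "iliche"), ("Q", "myche")))
yre = ("NP-OB1", ("D", "an"), ("ADJ", "hot"), ("N", "yre"),
                 ("CP-REL", ("C", "+tat")))  # relative clause abridged


def i_doms_total(node, n, compare=operator.eq):
    """Compare the number of immediate children of node with n."""
    return compare(len(node) - 1, n)


print(i_doms_total(repreuynges, 3))        # iDomsTotal3: Q, NS, CONJP
print(i_doms_total(myche, 3, operator.lt)) # iDomsTotal<3: QP only
print(i_doms_total(yre, 3, operator.gt))   # iDomsTotal>3: D, ADJ, N, CP-REL
```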

inID

"inID" searches the ID node. Because the ID node is outside of the parsed sentence, it is not encountered by the other search functions. For instance, (ID iDominates *) will turn up empty.

Here's a typical ID node from the Malory corpus file:

(ID CMMALORY,3.41)

To isolate Malory sentences from an output file, you could use this query:

query:  (*MALORY* inID)

Because the ID node is outside the parsed sentence, when inID is called, the bounds_matter variable is automatically set to false.
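
In effect, inID is a wildcard match against the ID string alone. A small Python sketch of that idea, using ID strings that appear in this chapter; the function name in_id and the use of fnmatch for the wildcard are assumptions for this example.

```python
from fnmatch import fnmatchcase


def in_id(sentence_id, pattern):
    """True if the ID string matches the wildcard pattern."""
    return fnmatchcase(sentence_id, pattern)


ids = ["CMMALORY,3.41", "CMMANDEV,1.4", "CMMALORY,36.1140", "CMHORSES,95.119"]
malory_ids = [sentence_id for sentence_id in ids
              if in_id(sentence_id, "*MALORY*")]
print(malory_ids)
```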

column

"column" is used to search columns of the CODING node. If you don't happen to be coding, you don't need to use this function.

If you are coding, and, for instance, want to find sentences whose CODING node contains an "m" or "n" in the 7th column, use this query:

query:  (CODING column7 m|n)

If you want to find sentences whose CODING node does not contain a "p" or "q" in the 4th column, use this query:

query:  (CODING column4 !p|q)

Like the ID node, the CODING node is not part of the parsed sentence itself. For this reason, a call to "column" automatically sets the bounds_matter variable to false. Also, when you're searching CODING nodes, you may want to include this line in your command file:

nodes_only: false

to ensure that your output includes entire sentences or nodes and is thus searchable.
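
The two column queries above can be pictured with the Python sketch below. It assumes that the columns of a coding string are separated by colons and that values are compared exactly; neither detail is stated in this chapter, and the coding strings are made up for the example.

```python
def column(coding, number, pattern, sep=":"):
    """Check column `number` (1-based) of a coding string.

    pattern is a list of alternatives joined by "|"; a leading "!"
    negates the whole list. The separator is an assumption.
    """
    fields = coding.split(sep)
    negate = pattern.startswith("!")
    options = (pattern[1:] if negate else pattern).split("|")
    found = number <= len(fields) and fields[number - 1] in options
    return found != negate


# Hypothetical coding strings with seven columns each.
first = "a:b:c:p:e:f:m"
second = "a:b:c:x:e:f:z"

print(column(first, 7, "m|n"))    # (CODING column7 m|n)
print(column(first, 4, "!p|q"))   # (CODING column4 !p|q)
print(column(second, 4, "!p|q"))
```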
