Searching for Words

contents of this chapter:

labels and words
string variations
fuzzy tree structure

labels and words

"Labels" are the all-upper-case tags inserted by the linguists who prepared the corpus (e.g., "IP", "CONJ", "N"). "Words" are the mostly lower-case original words of the text (e.g., "so", "hit"). Every node in the sentence tree has a label, and the leaf nodes also have words. CorpusSearch can search on labels or on words. In practice, the vast majority of searches look for labels only.

string variations

CorpusSearch uses case-sensitive, character-by-character string matching to match search-function arguments against strings found in the input. Spelling variants and upper-case/lower-case variants must therefore be listed explicitly (usually with an argument list). For instance, this query searches for a complementizer whose associated text is "that" or "That":

(C iDominates that|That)

and finds sentences such as this:

/~*
and he shalle do yow remedy, that youre herte shal be pleasyd. '
(CMMALORY,3.47)
*~/

/*
    12 CP-ADV: 13 C that
*/

(NODE
      (12 CP-ADV (13 C that)
                 (14 IP-SUB
                            (15 NP-SBJ (16 PRO$ youre) (17 N herte))
                            (18 MD shal)
                            (19 BE be)
                            (20 VAN pleasyd)))
      (ID CMMALORY,3.47))
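
The matching rule described above can be illustrated with a minimal Python sketch. This is not CorpusSearch's implementation: the function name matches_argument is hypothetical, and the sketch assumes only that an argument list is a set of alternatives separated by "|", each compared exactly and case-sensitively. It ignores any other pattern features CorpusSearch may support.

```python
def matches_argument(argument: str, text: str) -> bool:
    """Return True if `text` is exactly one of the '|'-separated alternatives.

    Hypothetical helper for illustration only; the comparison is
    case-sensitive and character-by-character, as the text describes.
    """
    alternatives = argument.split("|")
    return text in alternatives


# "that|That" covers both spellings, but nothing else.
print(matches_argument("that|That", "that"))
print(matches_argument("that|That", "That"))
print(matches_argument("that|That", "THAT"))
# A single alternative does not match a differently cased word.
print(matches_argument("that", "That"))
```

Under these assumptions, every spelling or capitalization that should match has to appear in the argument list, which is why the query above names both "that" and "That".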

fuzzy tree structure

For the purposes of dominance, a word and its associated node label are considered separate objects. Thus, in the sentence below, "PRO" dominates "hit". For the purposes of precedence, a word and its associated label are considered to be one object. Thus, "that" sister-precedes "rocche" in this sentence, because the labels associated with "that" and "rocche" are sisters.

/~*
and so hit londid undir that rocche.
(CMMALORY,667.4861)
*~/

/*
    1 IP-MAT: 11 D that, 12 N rocche
*/

(0
   (1 IP-MAT (2 CONJ and)
             (3 ADVP (4 ADV so))
             (5 NP-SBJ (6 PRO hit))
             (7 VBD londid)
             (8 PP (9 P undir)
                   (10 NP (11 D that) (12 N rocche)))
             (13 E_S .))
      (ID CMMALORY,667.4861))
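
The two conventions can be sketched in Python on a hand-built copy of the tree above. This is an illustration under stated assumptions, not CorpusSearch's algorithm: the tuple representation and the function names i_dominates and sister_precedes are hypothetical, and node index numbers are omitted.

```python
# A phrase node is (label, [children]); a leaf node is (label, word).
TREE = ("IP-MAT", [
    ("CONJ", "and"),
    ("ADVP", [("ADV", "so")]),
    ("NP-SBJ", [("PRO", "hit")]),
    ("VBD", "londid"),
    ("PP", [("P", "undir"),
            ("NP", [("D", "that"), ("N", "rocche")])]),
    ("E_S", "."),
])


def names(node):
    """Precedence view: a leaf answers to its label and to its word."""
    label, content = node
    return {label, content} if isinstance(content, str) else {label}


def i_dominates(node, parent, child):
    """Dominance view: a label and its word are separate objects,
    so a leaf's label immediately dominates the leaf's word."""
    label, content = node
    if isinstance(content, str):
        return label == parent and content == child
    if label == parent and any(kid[0] == child for kid in content):
        return True
    return any(i_dominates(kid, parent, child) for kid in content)


def sister_precedes(node, first, second):
    """True if a node answering to `first` has a later sister
    answering to `second` somewhere in the tree."""
    label, content = node
    if isinstance(content, str):
        return False
    for i, left in enumerate(content):
        if first in names(left):
            if any(second in names(right) for right in content[i + 1:]):
                return True
    return any(sister_precedes(kid, first, second) for kid in content)


print(i_dominates(TREE, "PRO", "hit"))          # label dominates its word
print(sister_precedes(TREE, "that", "rocche"))  # words stand in for labels
print(sister_precedes(TREE, "rocche", "that"))  # order matters
```

In this sketch, "that" sister-precedes "rocche" only because each word is treated as interchangeable with its label ("D" and "N"), and those two labels are sisters under the same NP.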
