CorpusSearch Query Language

contents of this chapter:

about the query language
search function calls
wild cards
logical operators
a formal grammar of the query language
messages from the parser

about the query language

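The CorpusSearch query language has three basic components: search-function calls, wild cards, and logical operators. Each is described in its own section below.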

search function calls

The most basic query is a single search-function call. For instance, here is a query that searches for nodes labelled QP ("quantifier phrase") that immediately dominate nodes labelled CONJ ("co-ordinating conjunction"):

(QP iDominates CONJ)

and here is a sentence found by the query:

/~*
and so he is bo+te more and lasse to his seruaunt.
(CMWYCSER,351.2223)
*~/

/*
    1 IP-MAT: 9 QP, 10 CONJ bo+te
    1 IP-MAT: 9 QP, 12 CONJ and
*/

(0
   (1 IP-MAT (2 CONJ and)
             (3 ADVP (4 ADV so))
             (5 NP-SBJ (6 PRO he))
             (7 BEP is)
             (8 ADJP (9 QP (10 CONJ bo+te) (11 QR more) (12 CONJ and) (13 QR lasse))
                     (14 PP (15 P to)
                            (16 NP (17 PRO$ his) (18 N seruaunt))))
             (19 E_S .))
      (ID CMWYCSER,351.2223))
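To make "immediately dominates" concrete, here is a minimal Python sketch that reads the numbered tree from the output above and reports every QP node with a CONJ child. It is an illustration only, not the CorpusSearch implementation: the names parse and i_dominates are hypothetical, labels are compared literally (no wild cards), and the wrapper node 0 and the ID node are left out to keep the reader short.

```python
import re

# Numbered tree copied from the output above (wrapper node 0 and ID node omitted).
TREE = """
(1 IP-MAT (2 CONJ and) (3 ADVP (4 ADV so)) (5 NP-SBJ (6 PRO he)) (7 BEP is)
  (8 ADJP (9 QP (10 CONJ bo+te) (11 QR more) (12 CONJ and) (13 QR lasse))
          (14 PP (15 P to) (16 NP (17 PRO$ his) (18 N seruaunt))))
  (19 E_S .))
"""


def parse(text):
    """Parse one bracketed, numbered tree into nested (index, label, children) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        index, label = int(tokens[pos + 1]), tokens[pos + 2]
        pos += 3
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1  # a word of text; not needed for label matching
        pos += 1  # consume the closing parenthesis
        return (index, label, children)

    return node()


def i_dominates(tree, parent_label, child_label):
    """Yield (parent index, child index) for each parent that has such a child."""
    index, label, children = tree
    for child in children:
        if label == parent_label and child[1] == child_label:
            yield (index, child[0])
        yield from i_dominates(child, parent_label, child_label)


print(list(i_dominates(parse(TREE), "QP", "CONJ")))  # [(9, 10), (9, 12)]
```

The two pairs correspond to the two lines of the output summary: node 9 (QP) immediately dominates node 10 (CONJ bo+te) and node 12 (CONJ and).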

Any number of search-function calls may be combined into more complex queries using AND.

wild cards

CorpusSearch supports two wild cards, namely * and #.

*

* stands for any sequence of symbols, including the empty sequence; it behaves like the shell wild card * (or like .* in a regular expression) rather than like the regular-expression * operator. For instance, "CP*" means any label beginning with the letters CP (e.g. CP, CP-ADV, CP-QUE-SPE), "*-SPE" means any label ending with "-SPE", and "*hersum*" means any string containing the substring "hersum" (e.g. "hersumnesse", "unhersumnesse"). * by itself matches any string. * may be used anywhere in a function argument: beginning, middle, or end.

\*

Some labels, for example "*con*" ("subject elided under conjunction"), contain the character '*'. If you're looking for such a label, use \ (escape character) to show that you're searching for * and not using it as a wild card. For instance, to search for *con* dominated by a noun phrase, you could use this query:

(NP* dominates \*con\*)

to find (among others) this sentence:

/~*
ne did euyll.
(CMMANDEV,1.14)
*~/

/*
    1 IP-MAT: 3 NP-SBJ *con*
*/

(0
   (1 IP-MAT (2 CONJ ne)
             (3 NP-SBJ *con*)
             (4 DOD did)
             (5 NP-OB1 (6 N euyll))
             (7 E_S .))
      (ID CMMANDEV,1.14))

#

# is the wild card for digits. For instance, to find prepositions divided into parts, you could use this query:

(PP iDominates P#) 

to find sentences like this:

/~*
Anone there $with all arose sir Gawtere
(CMMALORY,199.3135)
*~/

/*
    1 IP-MAT: 4 PP, 7 P21 $with
    1 IP-MAT: 4 PP, 8 P22 all
*/

(0
   (1 IP-MAT
             (2 ADVP-TMP (3 ADV Anone))
             (4 PP
                   (5 ADVP (6 ADV there))
                   (7 P21 $with)
                   (8 P22 all))
             (9 VBD arose)
             (10 NP-SBJ (11 NPR sir) (12 NPR Gawtere)))
      (ID CMMALORY,199.3135))
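The behaviour of the wild cards and the escape character can be summarised in a short Python sketch that translates a pattern into a regular expression. This is not how CorpusSearch itself is implemented, and the function names are hypothetical. One assumption is made loudly: # is treated as "one or more digits", which is consistent with P# matching P21 and P22 above, but the exact rule is not stated in this chapter.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a CorpusSearch-style wild-card pattern into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escape character: the next character is taken literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")    # any string, including the empty one
        elif ch == "#":
            out.append(r"\d+")  # ASSUMPTION: one or more digits
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out))


def matches(pattern, label):
    """True if the whole label matches the wild-card pattern."""
    return wildcard_to_regex(pattern).fullmatch(label) is not None


print(matches("CP*", "CP-QUE-SPE"))         # True
print(matches("*hersum*", "unhersumnesse")) # True
print(matches(r"\*con\*", "*con*"))         # True
print(matches("P#", "P21"))                 # True
print(matches("P#", "PP"))                  # False
```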

logical operators

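Search-function calls may be combined using the logical operator AND. Search-function calls must be appended to the query one at a time, so that each AND joins the query built so far to exactly one new call, and each such pair is enclosed in its own parentheses: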

(((NP-SBJ iDomsLast N) AND (VBD|VBG iPrecedes NEG)) AND (C dominates that))
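If you generate queries from a script, the one-at-a-time rule amounts to nesting to the left. The following Python sketch (the helper name combine_with_and is hypothetical, not part of CorpusSearch) builds the query above from its three calls:

```python
def combine_with_and(calls):
    """Append search-function calls one at a time, nesting to the left."""
    if not calls:
        raise ValueError("at least one search-function call is required")
    query = calls[0]
    for call in calls[1:]:
        # Wrap the query built so far together with the next call.
        query = f"({query} AND {call})"
    return query


print(combine_with_and([
    "(NP-SBJ iDomsLast N)",
    "(VBD|VBG iPrecedes NEG)",
    "(C dominates that)",
]))
# (((NP-SBJ iDomsLast N) AND (VBD|VBG iPrecedes NEG)) AND (C dominates that))
```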

AND acts on search-function calls. Two further logical operators act on the arguments to search functions: |, which means "or" for a list of arguments (e.g. "MD*|HV*" means "MD* or HV*"), and !, which negates an argument or a list of arguments (e.g. "NP-SBJ dominates !N" returns cases where NP-SBJ does not dominate N).
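As a rough illustration of how | and ! apply to a single node label, here is a small Python sketch. It rests on two assumptions that go beyond this chapter: a leading ! is taken to negate the whole |-separated list, and only the * wild card is handled (no # and no escaped asterisks). The name label_matches is hypothetical, and the sketch says nothing about how CorpusSearch evaluates a negated argument inside a complete search.

```python
import re


def label_matches(arg, label):
    """Test one node label against an argument that may use ! and |.

    Simplified: only the * wild card is supported here.
    """
    negated = arg.startswith("!")  # ASSUMPTION: ! applies to the whole list
    if negated:
        arg = arg[1:]
    alternatives = arg.split("|")
    hit = any(
        re.fullmatch(re.escape(alt).replace(r"\*", ".*"), label)
        for alt in alternatives
    )
    return hit != negated


print(label_matches("MD*|HV*", "HVD"))  # True
print(label_matches("MD*|HV*", "VBD"))  # False
print(label_matches("!N", "NPR"))       # True
print(label_matches("!N", "N"))         # False
```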

a formal grammar of the query language

arg -- an argument to a search function. Examples: NP-SBJ, NP*, !NPR.

un -- a unary search function. Examples: exists, domsWords#, iDomsTotal#.

bin -- a binary search function. Examples: iDomsLast#, iPrecedes, precedes, iDomsNumber#.

AND -- binary logical operator AND.

<stmt> -> <call>
| (<stmt> <append>)

<append> -> AND <call>

<call> -> (arg bin arg)
| (arg un)
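The grammar is small enough to check with a recursive-descent parser. The Python sketch below follows the three rules above directly. It is not the CorpusSearch parser: the function sets BINARY and UNARY are a small illustrative subset of names that appear in this chapter, the names parse_query and QueryError are hypothetical, and the wording of the error messages is invented for the example.

```python
import re

# Illustrative subset only; the real list of search functions is longer.
BINARY = {"iDominates", "dominates", "iPrecedes", "precedes", "iDomsLast"}
UNARY = {"exists"}


class QueryError(ValueError):
    """Raised when a query does not fit the grammar."""


def parse_query(query):
    """Parse a query into nested tuples, following <stmt>, <append>, <call>."""
    tokens = re.findall(r"[()]|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(value):
        nonlocal pos
        if peek() != value:
            raise QueryError(f"{value} expected next, found {peek()!r}")
        pos += 1

    def take_arg():
        nonlocal pos
        tok = peek()
        if tok is None or tok in ("(", ")"):
            raise QueryError(f"argument expected next, found {tok!r}")
        pos += 1
        return tok

    def call():
        # <call> -> (arg bin arg) | (arg un)
        nonlocal pos
        expect("(")
        first = take_arg()
        func = peek()
        if func in UNARY:
            pos += 1
            result = (func, first)
        elif func in BINARY:
            pos += 1
            result = (func, first, take_arg())
        else:
            raise QueryError(f"search function expected next, found {func!r}")
        expect(")")
        return result

    def stmt():
        # <stmt> -> <call> | (<stmt> <append>), where <append> -> AND <call>.
        # Two opening parentheses in a row can only begin the second form.
        if peek() == "(" and pos + 1 < len(tokens) and tokens[pos + 1] == "(":
            expect("(")
            left = stmt()
            expect("AND")
            right = call()
            expect(")")
            return ("AND", left, right)
        return call()

    tree = stmt()
    if pos != len(tokens):
        raise QueryError(f"end of query expected, found {peek()!r}")
    return tree


print(parse_query("(QP iDominates CONJ)"))
# ('iDominates', 'QP', 'CONJ')
print(parse_query("((NP* iDominates PP*) AND (PP* iDominates CONJ))"))
# ('AND', ('iDominates', 'NP*', 'PP*'), ('iDominates', 'PP*', 'CONJ'))
```

Because <stmt> recurses only on the left of AND, a query such as ((A) AND ((B) AND (C))) is rejected by this grammar; calls have to be appended one at a time.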

messages from the parser

To make sure that the query is correctly formed, it is sent to the CorpusSearch parser. If you write every query perfectly the first time, the parser will be invisible to you. Most users, however, will write the occasional badly-formed query and get an error message from the parser. Here's a typical error message:

This much of the query passed the parser:
((NP* iDominates PP* )
 AND (PP* iDominates CONJ )
) expected next.

This is the query as found in command file Igrayne.q:
 ((NP* iDominates PP*) AND (PP* iDominates CONJ)

Here, the query was missing a ) at the end. Notice that the parser tells you how much of the query parsed correctly, and what was expected but not found. Query errors always abort the search.
