Search Tips

contents of this chapter:

about this chapter
about the author
The correct query format
Using definition files
Using *
The "exists" function
Same instance
Ignoring certain nodes
Searching for traces
Finding non-pronominal NPs
When to use complement files
Restricting searches to a single IP
Counting words and remove_nodes

about this chapter

This chapter gives tips on a number of common problems and errors that arise when using CorpusSearch. The reader is assumed to have a general familiarity with the rest of the CorpusSearch manual. Many of the example queries assume a standard definition file containing definitions for at least finite_verb and non_finite_verb.

about the author

The author of this chapter is Ann Taylor. Ann was instrumental in the design of CorpusSearch and has used the program more than anyone.

The correct query format

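Queries must be right-branching: each further search function is joined with AND to the bracketed group of all the functions before it, as in the following template.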

query: (((X FUNCTION Y)
AND (Y FUNCTION Z))
AND (Z FUNCTION W))

All other commands must precede the query.
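
The bracketing in the template above can be produced mechanically. The following Python sketch is only an illustration: the helper name build_query is hypothetical and not part of CorpusSearch, which simply reads the resulting text. It folds a list of search functions into the nested form shown in the template.

```python
from functools import reduce


def build_query(clauses):
    """Nest search functions in the bracketing shown in the template.

    `clauses` is a list of strings such as "X FUNCTION Y", without outer
    brackets. Hypothetical helper, for illustration only.
    """
    if not clauses:
        raise ValueError("a query needs at least one search function")
    # Bracket each search function on its own.
    wrapped = [f"({clause})" for clause in clauses]
    # Join each new function to the bracketed group of everything before it.
    nested = reduce(lambda acc, nxt: f"({acc}\nAND {nxt})", wrapped)
    return "query: " + nested


if __name__ == "__main__":
    print(build_query(["X FUNCTION Y", "Y FUNCTION Z", "Z FUNCTION W"]))
```

Generating the brackets this way avoids the most common slip in hand-written queries, a missing or misplaced parenthesis once three or more functions are combined.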

Using definition files

The following are useful definitions to include in a definition file:

finite_verb:  *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD
non_finite_verb:  *VB|V*N|*HV|H*N|*DO|D*N|*BE|BEN
non-pronominal_NP: *N*|D*|Q*|ADJ*|CONJ*|*ONE*|*OTHER*|CP*

(For another way to find non-pronominal NPs, see Finding non-pronominal NPs.)

If you have definitions which include search functions rather than just lists, such as:

pronominal_subject: NP-SBJ* iDomsOnly PRO

be careful how you combine them with the other elements of a query. The following query will not work:

query: (pronominal_subject precedes finite_verb)

The substitution for the query is:

((NP-SBJ* iDomsOnly PRO) precedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)

which is an illegal query which will be rejected by CorpusSearch. The correct format is:

query: ((pronominal_subject)
AND (NP-SBJ* precedes finite_verb))
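
The effect of substitution can be modeled with a few lines of Python. This is a sketch under stated assumptions, not CorpusSearch's own code: the functions expand and function_used_as_argument are hypothetical, substitution is treated as plain text replacement, and only AND is assumed to join bracketed search functions. It reproduces the expansion shown above and flags the case where a whole search function ends up as the argument of another function.

```python
import re

# The two definitions used in the text above.
DEFINITIONS = {
    "finite_verb": "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD",
    "pronominal_subject": "NP-SBJ* iDomsOnly PRO",
}


def expand(query, definitions=DEFINITIONS):
    """Textually substitute definition names in a query (illustrative model)."""
    for name, body in definitions.items():
        # Match the name only where it stands alone as a term.
        term = rf"(?<![\w-]){re.escape(name)}(?![\w-])"
        if " " in body:
            # The body is a search function: keep exactly one pair of brackets.
            query = re.sub(rf"\(\s*{term}\s*\)", lambda m: f"({body})", query)
            query = re.sub(term, lambda m: f"({body})", query)
        else:
            # The body is a plain list of labels.
            query = re.sub(term, lambda m: body, query)
    return query


def function_used_as_argument(expanded):
    """True if a bracketed function is followed by anything other than AND."""
    return re.search(r"\)\s+(?!AND\b)\S", expanded) is not None


if __name__ == "__main__":
    bad = expand("(pronominal_subject precedes finite_verb)")
    good = expand("((pronominal_subject)\nAND (NP-SBJ* precedes finite_verb))")
    print(bad, function_used_as_argument(bad))
    print(good, function_used_as_argument(good))
```

Expanding a query by hand in this way before running it is a quick check that every definition containing a search function stands in a bracketed slot of its own.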

A common error is to forget to use the define command to specify the definition file when using definitions. No error message will be generated, but the search will result in no output.

Using *

Be liberal in using *. Using NP-SBJ as a search term will only find a subset of subjects. Some subjects are resumptive (NP-SBJ-RSP), some are coindexed to a clause or to a trace in a lower clause (NP-SBJ-1), and some may have other additional labels. Using NP-SBJ* will find all the subjects labelled in this way, no matter what might be added on to the end of the label. In general, only leave off the * if you are sure you don't want it.

When you want to refer to all the labels referred to by, for instance, ADVP*, except one, you have to use a list and list all the options you are interested in, as for instance ADVP|ADVP-LOC|ADVP-TMP (this omits ADVP-DIR which would be included in ADVP*). This is what definition files are for; you only have to write it once.

Note that if you want to refer to an actual * in a search (all traces start with *), escape it with a backslash \ . The following query finds subjects which dominate traces. The first * in \** is escaped and thus refers to an actual *, while the second is not and thus matches anything that follows the *; this will match, for instance, *con*, *exp*, *T-1* and others.

query: (NP-SBJ* iDoms \**)
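
The matching behaviour described in this section (* as "match anything", | as a list separator, and \* as a literal asterisk) can be modeled in Python by translating a search term into a regular expression. This is a minimal sketch for checking your own patterns, assuming whole-label matching; the function names are hypothetical and the code is not taken from CorpusSearch.

```python
import re


def pattern_to_regex(pattern):
    """Translate a search term into a compiled regex (illustrative model)."""
    alternatives = []
    current = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # An escaped character (e.g. \*) stands for itself.
            current.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            current.append(".*")  # unescaped * matches anything
        elif ch == "|":
            alternatives.append("".join(current))  # start the next list item
            current = []
        else:
            current.append(re.escape(ch))
        i += 1
    alternatives.append("".join(current))
    return re.compile("|".join(f"(?:{alt})" for alt in alternatives))


def matches(pattern, label):
    """True if the whole label (or leaf text) matches the search term."""
    return pattern_to_regex(pattern).fullmatch(label) is not None


if __name__ == "__main__":
    for term, label in [("NP-SBJ", "NP-SBJ-RSP"), ("NP-SBJ*", "NP-SBJ-RSP"),
                        ("ADVP|ADVP-LOC|ADVP-TMP", "ADVP-DIR"),
                        (r"\**", "*T-1*"), (r"\**", "Arthur")]:
        print(term, label, matches(term, label))
```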

The "exists" function

A common error is to overuse the exists function. Using a search term forces that term to exist; it is not necessary to specify this separately. Thus the following is an inefficient query, although it is not ill-formed.

query: ((NP-SBJ* exists)
AND (IP* iDoms NP-SBJ*))

The second part of the query alone will accomplish the same thing and use fewer resources.

Same instance

Same instance works by literal match. Thus NP-SBJ does not match NP-SBJ*, and MD|VBD does not match VBD|MD; that is, in neither case would same instance be invoked between the two terms.

When two search terms match, they are forced to apply to the same node. Thus two uses of NP-OB* will require that, if for instance, NP-OB2 is found as an instance of the first NP-OB*, then the next use of NP-OB* will also apply to the same NP-OB2 (not, for instance, an NP-OB1 which may also be in the vicinity).

When two search terms do not match but might refer to the same node, as for instance, NP-SBJ and NP-SBJ*, or MD|VBD and VBD|MD, same instance is not forced, but neither is it ruled out; that is, same instance may or may not apply.

In order to force non-same instance, use index numbers. [1]NP-SBJ* and [2]NP-SBJ* cannot apply to the same NP-SBJ* node.
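
The rules so far can be summarized in a small Python sketch. It is a model of the behaviour described in this section, not CorpusSearch's implementation, and the function names are hypothetical. The text does not discuss a term with an index compared against the same label without one; the sketch simply applies the literal-match rule to that case.

```python
import re

# A term may carry a leading index such as [1] or [2].
INDEXED = re.compile(r"^\[(\d+)\](.+)$")


def split_index(term):
    """Return (index or None, label) for a term such as "[1]NP-SBJ*"."""
    m = INDEXED.match(term)
    if m:
        return int(m.group(1)), m.group(2)
    return None, term


def same_instance(term_a, term_b):
    """Classify two search terms as 'forced', 'excluded' or 'not forced'."""
    if term_a == term_b:
        # Literal match: both uses must apply to the same node.
        return "forced"
    idx_a, label_a = split_index(term_a)
    idx_b, label_b = split_index(term_b)
    if idx_a is not None and idx_b is not None and label_a == label_b:
        # Same label, different index numbers: must be different nodes.
        return "excluded"
    # No literal match: same instance may or may not apply.
    return "not forced"


if __name__ == "__main__":
    print(same_instance("NP-OB*", "NP-OB*"))
    print(same_instance("NP-SBJ", "NP-SBJ*"))
    print(same_instance("MD|VBD", "VBD|MD"))
    print(same_instance("[1]NP-SBJ*", "[2]NP-SBJ*"))
```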

A common error is to forget that impossible (to the linguist) cases of same instance will nonetheless be interpreted this way by CorpusSearch. Thus, for instance, a query such as the following will produce no results:

query: ((NP-SBJ* iDoms PRO)
AND (NP-OB1* iDoms PRO))

Although it is impossible for these PROs to refer to the same node, since they are dominated by different nodes, CorpusSearch will assume they do, and consequently will find no matches. Traces and zeros also need to be differentiated, as in the following:

query: ((MD iDoms [1]!\**)
AND (VB iDoms [2]!\**))

or

query: ((WNP iDoms [1]0)
AND (C iDoms [2]0))

An easier way to accomplish the former is to add traces to the ignore list.

Labels contained in definitions which match labels used in other parts of the query will trigger same instance. Thus if the definition file contains the following:

pronominal_subject: NP-SBJ* iDomsOnly PRO

and we use the following query:

query: ((pronominal_subject)
AND (NP-SBJ* precedes finite_verb))

the NP-SBJ* referred to in the second part of the query will be the same instance of NP-SBJ* as that referred to in the definition pronominal_subject.

Ignoring certain nodes

A default "ignore list" is supplied with CorpusSearch. It contains such things as punctuation and various meta labels that are not part of the text. If you want to search for punctuation, for instance, or line breaks, then you must provide your own ignore list which does not include the items you want to be able to access.

Although the ignore list is primarily a way to avoid non-text annotations, linguistic labels can also be added to the ignore list, in which case CorpusSearch will simply act as if they are not there. Thus for instance, if you add NEG to the ignore list, you can find cases in which nothing but negation intervenes between the subject and the finite verb.

add_to_ignore: NEG
query: (NP-SBJ* iPrecedes finite_verb)

This will find the following two sentences:

Arthur loves Guinevere
Arthur ne loves Guinevere

but not:

Arthur madly loves Guinevere
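
The effect of the ignore list on iPrecedes can be illustrated with a toy model in Python. The sketch below is not how CorpusSearch is implemented: it treats a clause as a flat list of sister nodes, the node labels given to the three example sentences are simplified assumptions, and the function names are hypothetical. Python's fnmatch is used for the * wildcard, so escaped asterisks (\*) are not handled here.

```python
from fnmatch import fnmatchcase

FINITE_VERB = "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD"


def label_matches(pattern, label):
    """True if the label matches any item of a |-separated list."""
    return any(fnmatchcase(label, alt) for alt in pattern.split("|"))


def i_precedes(nodes, first, second, ignore=""):
    """True if a `first` node immediately precedes a `second` node.

    `nodes` is a list of (label, word) sisters in surface order; nodes whose
    label matches `ignore` are skipped as if they were not there.
    """
    visible = [label for label, _word in nodes
               if not (ignore and label_matches(ignore, label))]
    return any(label_matches(first, a) and label_matches(second, b)
               for a, b in zip(visible, visible[1:]))


# Simplified, assumed annotation of the three example sentences.
plain = [("NP-SBJ", "Arthur"), ("VBP", "loves"), ("NP-OB1", "Guinevere")]
negated = [("NP-SBJ", "Arthur"), ("NEG", "ne"), ("VBP", "loves"),
           ("NP-OB1", "Guinevere")]
adverb = [("NP-SBJ", "Arthur"), ("ADVP", "madly"), ("VBP", "loves"),
          ("NP-OB1", "Guinevere")]

if __name__ == "__main__":
    for name, clause in [("plain", plain), ("negated", negated),
                         ("adverb", adverb)]:
        print(name, i_precedes(clause, "NP-SBJ*", FINITE_VERB, ignore="NEG"))
```

Without NEG in the ignore argument, the negated sentence no longer matches, which is the difference that add_to_ignore makes in the query above.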

Using the ignore list is also helpful in looking for V2. In many cases, the verb is not technically the second node in the IP because of initial conjunction. Adding CONJ (and possibly some other things, such as INTJ*, and NP-VOC) to the ignore list will solve this problem (or at least reduce it). The query below will find all the following:

The sword desired Lancelot
And the sword desired Lancelot
Gramercy, Arthur, the sword desired Lancelot

add_to_ignore: INTJ*|NP-VOC|CONJ
query: ((IP* iDomsNumber1 NP-OB*)
AND (IP* iDomsNumber2 finite_verb))

Searching for traces

Traces (which all start with * in the PPCME2) are treated as text by CorpusSearch, and thus can be searched for. In order to differentiate the * which means "match anything" from the * that is part of the text of a trace, use \* to refer to the latter. The string \** will match any trace.

In the more common case, in which you want to simply ignore traces, add them to the ignore list as follows:

add_to_ignore: \**

This means that any node that contains nothing but a trace will not be found. Thus a query such as (NP* exists) will not find any NPs which contain only traces.

Finding non-pronominal NPs

Do not search for non-pronominal NPs with the following query:

(NP* iDoms !PRO)

This will also eliminate cases like "Robin and me" and "he and I", since these contain a PRO. Instead either use the non-pronominal_NP definition, or use the following query:

print_complement: t
query: (NP* iDomsOnly PRO)

The .cmp (complement) file of this query will contain every NP that does not contain only a single PRO. (But see the notes under When to use complement files: if you run this query directly on a corpus file, the complement file will also contain every token which, for instance, includes no NPs at all. First produce an output file which contains only tokens with NPs, then run this query on that file to divide the tokens into those with a pronominal NP and those with a non-pronominal NP.)

When to use complement files

Never run a query with print_complement set to true on a corpus file. The output will include such interesting tokens as the page numbers, and other useless things, because the .cmp file contains absolutely everything that does not match the query. Use the print_complement function on already filtered output files that contain some set of data that you want to divide into subsets, as for instance, to divide all clauses with objects into those with nominal and those with pronominal objects; or to divide all clauses with a finite verb into those that also contain a non-finite verb and those which don't.
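
The reason for filtering first can be seen in a toy model. The Python sketch below uses invented token records (the IDs and fields are hypothetical, not corpus data) and a hypothetical split_by_query function standing in for a query run with print_complement set to true. It only illustrates that the complement is "everything that did not match", whatever that happens to be.

```python
def split_by_query(tokens, matches_query):
    """Return (output, complement) for one query, as with print_complement."""
    output = [t for t in tokens if matches_query(t)]
    complement = [t for t in tokens if not matches_query(t)]
    return output, complement


# Invented tokens: a page number, two clauses with objects, one without.
corpus = [
    {"id": "P.12", "has_object": False, "object_is_pronoun": False},
    {"id": "S.1", "has_object": True, "object_is_pronoun": True},
    {"id": "S.2", "has_object": True, "object_is_pronoun": False},
    {"id": "S.3", "has_object": False, "object_is_pronoun": False},
]


def has_object(token):
    return token["has_object"]


def has_pronominal_object(token):
    return token["object_is_pronoun"]


if __name__ == "__main__":
    # Run directly on the corpus: the complement is polluted.
    _, raw_complement = split_by_query(corpus, has_pronominal_object)
    print([t["id"] for t in raw_complement])

    # Filter first, then split: the complement is the intended subset.
    with_objects, _ = split_by_query(corpus, has_object)
    _, nominal = split_by_query(with_objects, has_pronominal_object)
    print([t["id"] for t in nominal])
```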

Restricting searches to a single IP

CorpusSearch specifies a node boundary in which to search. The default is set as IP*. The node includes everything under the node, no matter how deeply embedded. Thus if an IP contains a subordinate clause, the contents of the embedded subordinate clause are also within the node. A common error is to write a query such as

query: ((IP* iDomsNumber1 NP-OB*)
AND (finite_verb iPrecedes NP-SBJ*))

with the intent of finding V2 clauses with a topicalized object. The first function looks for IPs which have an object as the first element; the second for a finite verb immediately preceding the subject. This query will, in fact, find V2 clauses with a topicalized object, but it may also find some other clauses as well. It will find (if there are any) IPs which contain one clause in which the first element is an object, and another, different clause in which the finite verb immediately precedes the subject. Either one of these clauses will be the main clause and the other an embedded clause, or they will both be embedded IPs within an IP.

There are two ways to avoid this error and force all parts of the query to apply within the same IP.

  1. Make use of the built-in same instance feature. Same instance means that if you use a node label in the query more than once in exactly the same form, CorpusSearch assumes that you intend each use to apply to the same instance of that node. Thus the query (ADVP precedes ADVP) will give you nothing because CorpusSearch will try to find instances of an ADVP preceding itself, an obvious impossibility. You can use same instance to keep all the queries inside the same IP (for instance, or any other node) by "tying" one term of the query to the node, as in the first element of the query above, and then making sure that in every subsequent search function, either that "tied" term or the node is used. For instance, we could fix the query above, by writing it as:

    query: (((IP* iDomsNumber1 NP-OB*)
    AND (NP-OB* iPrecedes finite_verb))
    AND (finite_verb  iPrecedes NP-SBJ*))
    

    or alternatively:

    query: (((IP* iDomsNumber1 NP-OB*)
    AND (IP* iDoms finite_verb))
    AND (finite_verb  iPrecedes NP-SBJ*))
    

    The repeated instances of NP-OB* in the first example and IP* in the second refer to the same instance of NP-OB* and IP* respectively, thus forcing all parts of the query to be immediately dominated by the node.

  2. The second solution is to use the remove_nodes function. The default setting for remove_nodes is false, so to activate it you must include the line remove_nodes: t in the query file. Removing nodes is the equivalent of the unpack function of the PPCME1; it removes any embedded node that matches the specified node. When no node is specified (so the node is set as the default IP*), all embedded IPs will be removed. If the node is set as NP*, all NPs embedded within another NP will be removed. Note that all that is required for a match is that the part of the label before the hyphen matches. Thus, if the node is IP-MAT*, any node whose label starts with IP will be removed, including in this case, IP-SUB, IP-SMC, IP-PPL, etc. Thus, for instance, you cannot set the node as IP-MAT* and not have IP-SUBs removed. When "remove nodes" is in force, any node that doesn't match the query is removed completely; any embedded node that matches the query is removed from its matrix and printed below it.

    To solve our problem the "remove nodes" way, we would first create a file with only single clauses with all embedded nodes removed, by a query such as

    remove_nodes: t
    query: (IP* iDoms finite_verb)
    

    This query will produce a file in which every token is an IP containing a finite verb with all embedded IPs removed. The following query:

    query: ((IP* iDomsNumber1 NP-OB*)
    AND (finite_verb iPrecedes NP-SBJ*))
    

    can then be used on the output of the first query and will yield only the cases intended. (But note that this query is not actually going to produce all V2 clauses with a topic object anyway, since many such clauses begin with a conjunction or other introductory type word and thus the object will be the second element in the IP*; for a solution to this problem, see Ignoring certain nodes).
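
The label matching described for remove_nodes above, in which only the part of the label before the hyphen has to match, can be modeled as follows. This Python sketch is an assumption-laden illustration of that one rule (the function names are hypothetical) and says nothing about how CorpusSearch performs the removal itself.

```python
def label_stem(label):
    """Return the part of a label before the first hyphen, without any *."""
    return label.split("-", 1)[0].rstrip("*")


def removed_when_embedded(node_setting, label):
    """True if an embedded node with this label matches the node setting."""
    return label_stem(node_setting) == label_stem(label)


if __name__ == "__main__":
    for label in ["IP-SUB", "IP-SMC", "IP-PPL", "NP-SBJ"]:
        print(label, removed_when_embedded("IP-MAT*", label))
```

The model makes the consequence noted above easy to see: with the node set as IP-MAT*, every label whose stem is IP counts as a match, so IP-SUB cannot be kept.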

Counting words and remove_nodes

Note that if you have remove_nodes turned on, the word REMOVED counts as text, so you can search for it. It will not, however, be counted as a word when doing word counts (like traces, which likewise are not counted). But if you count the number of words in a node that contains REMOVED, you will, of course, get the wrong answer, since REMOVED replaces a clause full of words. To avoid this, either don't use remove_nodes when counting, or use a query like the following, which won't count any node containing REMOVED. Nodes containing REMOVED can then be counted separately.

query: (((IP* iDoms NP-OB*)
AND (NP-OB* domsWords3))
AND (NP-OB* doms !REMOVED))

Another way to do this is to add REMOVED to the ignore list and then, as before, count the nodes containing REMOVED separately.

add_to_ignore: REMOVED
query: ((IP* iDoms NP-OB*)
AND (NP-OB* domsWords3))
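
The counting problem can be made concrete with a short Python sketch. It assumes a node is represented by the list of its leaf strings, that traces are recognized by a leading *, and that REMOVED is the placeholder left by remove_nodes; count_words is a hypothetical helper, not a CorpusSearch function.

```python
def count_words(leaves):
    """Return (word_count, contains_removed) for the leaves of one node.

    Traces (leaves starting with *) and the placeholder REMOVED are not
    counted as words. A node containing REMOVED is flagged, because its
    count leaves out the words of the removed clause.
    """
    contains_removed = "REMOVED" in leaves
    words = [leaf for leaf in leaves
             if leaf != "REMOVED" and not leaf.startswith("*")]
    return len(words), contains_removed


if __name__ == "__main__":
    print(count_words(["the", "good", "knight"]))
    print(count_words(["the", "sword", "that", "REMOVED"]))
```

Nodes for which the second value is true are the ones to set aside and count separately, as described above.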
