CorpusSearch General Principles

contents of this chapter:

node boundary
nodes to ignore
searching output

node boundary command

The node boundary command tells CorpusSearch which kind of node must contain the structures described in the query. If the command file doesn't include a "node:" command, CorpusSearch uses the default node boundary IP*.

The same instance of a label can serve both as the node boundary and as an argument to a search function, as in:

node: PP*
query: (PP iDomsNumber1 RP)

If you don't have a particular node in mind, use the node command "*".

CorpusSearch will accept a list of nodes for the node boundary command. For instance, this is a valid command:

node: PP*|NP*|ADJP*
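
The way a node list such as the one above selects labels can be pictured with a short Python sketch. This is not CorpusSearch's own code: the function name label_matches is hypothetical, and the sketch assumes that "*" stands for any run of characters (including none), that "|" separates alternatives, and that the pattern contains no escaped characters.

```python
import re

def label_matches(pattern_list: str, label: str) -> bool:
    """Return True if `label` matches any alternative in a |-separated list.

    Assumption: '*' matches any run of characters, including none.
    Escaped characters (such as a literal star) are not handled here.
    """
    for alternative in pattern_list.split("|"):
        # Turn each '*' into "any characters"; treat everything else literally.
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in alternative)
        if re.fullmatch(regex, label):
            return True
    return False

print(label_matches("PP*|NP*|ADJP*", "NP-SBJ"))
print(label_matches("PP*|NP*|ADJP*", "IP-MAT"))
```

Under these assumptions the bare pattern "*" matches every label, which is why it can stand in when no particular node is wanted.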

With the node boundary PP, the query below finds the following structure, which is contained in the rocche sentence:

node: PP
query: (NP iDominates N)

(PP (P undir)
    (NP (D that) (N rocche)))
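
To make the role of the node boundary concrete, here is a rough Python sketch that parses the bracketed structure above and looks for an NP that immediately dominates an N inside a boundary node. The helper names (parse, subtrees, i_dominates, search) are hypothetical, labels are compared exactly with no wildcards, and none of this is meant to mirror how CorpusSearch is implemented.

```python
def parse(text):
    """Parse one bracketed tree into (label, children); words stay as strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word of the text
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()

def subtrees(tree):
    """Yield the tree itself and every node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def i_dominates(tree, parent, child):
    """True if some node labelled `parent` has a daughter labelled `child`."""
    return any(
        label == parent
        and any(isinstance(c, tuple) and c[0] == child for c in kids)
        for label, kids in subtrees(tree)
    )

def search(sentence, boundary, parent, child):
    """Return the boundary nodes that contain the described structure."""
    return [t for t in subtrees(sentence)
            if t[0] == boundary and i_dominates(t, parent, child)]

pp = parse("(PP (P undir) (NP (D that) (N rocche)))")
hits = search(pp, "PP", "NP", "N")
print(len(hits), hits[0][0])
```

The point of the sketch is that the boundary node (here PP) is what gets returned, while the query only has to be satisfied somewhere inside it.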

By default, only the nodes specified in the node command will be printed out (not the entire sentence containing them). To print the entire parsed sentence, include this line in your command file:

nodes_only: false

nodes to ignore

Some nodes in the corpus are not usually considered part of the structure of the sentence, for instance punctuation, line breaks, page numbers, and comments. CorpusSearch ignores all nodes whose labels are contained in the "ignore-list". This is the default version of the ignore-list:

COMMENT|CODE|ID|LB|'|\"|,|E_S|/

For instance, suppose you run this query:

query: (NP* iPrecedes PP*)

This sentence will be returned:

****************************************************************begin_comments

 1 IP-MAT-SPE: 5 NP-1, 9 PP

******************************************************************end_comments

*****************************************************************begin_ur_text

There ar two bretheren beyond the see,
(CMMALORY,15.439)

*******************************************************************end_ur_text

 (0
(1 IP-MAT-SPE
              (2 NP-SBJ-1 (3 EX There))
              (4 BEP ar)
              (5 NP-1 (6 NUM two)
                      (7 NS bretheren))
              (8 CODE <P_15>)
              (9 PP (10 P beyond)
                    (11 NP (12 D the)
                           (13 N see)))
              (14 E_S ,))
(15 ID CMMALORY,15.439))

Notice that NP-1 immediately precedes PP in spite of the intervening node (8 CODE <P_15>). This is because CODE is on the default ignore-list.
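
The effect of the ignore-list on immediate precedence can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it looks only at the daughters of IP-MAT-SPE as a flat list of labels, it uses only the word-like labels from the default ignore-list, and the function name immediately_precedes is hypothetical rather than part of CorpusSearch.

```python
# Word-like labels from the default ignore-list (punctuation labels left out).
IGNORED = {"COMMENT", "CODE", "ID", "LB", "E_S"}

def immediately_precedes(siblings, first, second, ignored=IGNORED):
    """True if `second` is the next non-ignored sibling after `first`."""
    kept = [label for label in siblings if label not in ignored]
    return any(a == first and b == second for a, b in zip(kept, kept[1:]))

# Daughters of IP-MAT-SPE in the sentence above, in order.
daughters = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]

print(immediately_precedes(daughters, "NP-1", "PP"))
print(immediately_precedes(daughters, "NP-1", "PP", ignored=set()))
```

With CODE filtered out, NP-1 and PP become neighbours; with an empty ignore set, CODE sits between them and the check fails.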

To add labels to the default ignore-list, include this command in your command_file:

add_to_ignore: <list_of_labels>

For instance, if you want to ignore traces, include this command in your command_file:

add_to_ignore: \**

To replace the default ignore-list with your own ignore-list, include this command in your command_file:

ignore_nodes: <your_ignore_list>

To tell CorpusSearch not to ignore any nodes, include this command in your command_file:

ignore_nodes: null
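
The three commands above can be summarized in a short Python sketch that computes the ignore-list in effect. This is only a model of the behaviour described here, not CorpusSearch code: the function name effective_ignore_list is hypothetical, appending new labels at the end of the list is an assumption, and because this chapter does not say how add_to_ignore and ignore_nodes combine, the sketch refuses that case instead of guessing.

```python
# The default ignore-list, copied verbatim from this chapter.
DEFAULT_IGNORE = r"COMMENT|CODE|ID|LB|'|\"|,|E_S|/"

def effective_ignore_list(add_to_ignore=None, ignore_nodes=None):
    """Return the ignore-list string implied by the two commands.

    add_to_ignore: labels to append to the default list, or None.
    ignore_nodes:  a replacement list, the string "null", or None.
    """
    if add_to_ignore is not None and ignore_nodes is not None:
        # The chapter does not describe this combination.
        raise ValueError("combination of add_to_ignore and ignore_nodes not modelled")
    if ignore_nodes == "null":
        return ""  # ignore nothing
    if ignore_nodes is not None:
        return ignore_nodes  # replace the default list
    if add_to_ignore is not None:
        return DEFAULT_IGNORE + "|" + add_to_ignore  # extend the default list
    return DEFAULT_IGNORE

print(effective_ignore_list(add_to_ignore=r"\**"))
```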

I will sometimes refer to nodes that are not to be ignored as "legitimate" nodes.

searching output

The output of one search may be used directly as input to the next search. CorpusSearch recognizes output files as those ending in ".out" or ".cmp".
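
When chaining searches from a script, it can help to check in advance which files would be treated as output files. The following Python sketch applies the extension rule stated above; the function name looks_like_output_file and the example file names are hypothetical, and the check assumes the comparison is case-sensitive.

```python
from pathlib import Path

def looks_like_output_file(name: str) -> bool:
    """True if the file name ends in ".out" or ".cmp"."""
    return Path(name).suffix in {".out", ".cmp"}

for name in ["first_search.out", "first_search.cmp", "corpus.psd"]:
    print(name, looks_like_output_file(name))
```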