The CorpusSearch Command File

contents of this chapter:

optional commands
boolean shorthand
nodes to ignore
search commands
    add_to_ignore:
    ignore_nodes:
    node:
    query:
printing commands
    begin_remark:, end_remark
    nodes_only:
    only_ur_text:
    print_complement:
    print_indices:
    print_ur_text:
    remove_nodes:
    set_margin:
debugging commands
    debug_function_calls:
    hunt_bugs:
comments

optional commands:

Optional (non-query) commands must be written *before* the query. All the optional commands have default values which are used if no value is found in the command file.

boolean shorthand

For commands that take a boolean argument, CorpusSearch will accept any of these strings: "true", "TRUE", "T", "t", or "false", "FALSE", "F", "f".
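The accepted spellings can be captured in a small lookup. The Python sketch below is illustrative only: the name parse_bool is hypothetical and not part of CorpusSearch, and rejecting any other string with an error is an assumption, since this chapter does not say how CorpusSearch treats unrecognized values.

```python
# Hypothetical helper: maps the boolean spellings listed above to Python booleans.
TRUE_STRINGS = {"true", "TRUE", "T", "t"}
FALSE_STRINGS = {"false", "FALSE", "F", "f"}

def parse_bool(text):
    value = text.strip()
    if value in TRUE_STRINGS:
        return True
    if value in FALSE_STRINGS:
        return False
    # Assumption: anything else is rejected; the manual does not specify this case.
    raise ValueError(f"not a recognized boolean: {text!r}")
```

Note that mixed-case spellings such as "True" are not in the list above, so this sketch rejects them.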

nodes to ignore

There are some nodes in the corpus that linguists usually don't want to consider as part of the structure of the sentence, for instance, punctuation, line breaks, page numbers, and comments. CorpusSearch will ignore all nodes whose labels are contained in the "ignore-list". This is the default version of the ignore-list:

CODE|ID|LB|'|\"|,|E_S|/

For instance, if you run this query:

(NP* iPrecedes PP*)

This sentence will be returned:

/*
 1 IP-MAT-SPE: 5 NP-1, 9 PP
*/
/~*
There ar two bretheren beyond the see,
(CMMALORY,15.439)
*~/

 (0
(1 IP-MAT-SPE
              (2 NP-SBJ-1 (3 EX There))
              (4 BEP ar)
              (5 NP-1 (6 NUM two) (7 NS bretheren))
              (8 CODE <P_15>)
              (9 PP (10 P beyond)
                    (11 NP (12 D the) (13 N see)))
              (14 E_S ,))
(15 ID CMMALORY,15.439))

Notice that NP-1 immediately precedes PP in spite of the intervening node (8 CODE <P_15>). This is because CODE is on the default ignore-list.

I will sometimes refer to nodes that are not to be ignored as "legitimate" nodes.
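The effect of the ignore-list on iPrecedes can be sketched in Python. This is a simplified illustration, not the CorpusSearch implementation: it works on a flat list of sibling labels rather than a full tree, it uses a reduced ignore-list of literal labels, and the function name immediate_precedence_pairs is hypothetical. Wildcards are matched with Python's fnmatchcase, which is only an approximation of CorpusSearch's own label matching.

```python
from fnmatch import fnmatchcase

# Assumption: a reduced ignore-list containing literal labels only.
IGNORE = {"CODE", "ID", "LB", "E_S"}

def immediate_precedence_pairs(sibling_labels, x_pattern, y_pattern, ignore=IGNORE):
    # Keep only the "legitimate" nodes, then compare each node with its neighbour.
    legitimate = [label for label in sibling_labels if label not in ignore]
    return [
        (x, y)
        for x, y in zip(legitimate, legitimate[1:])
        if fnmatchcase(x, x_pattern) and fnmatchcase(y, y_pattern)
    ]

# Daughters of IP-MAT-SPE in the example sentence above.
siblings = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]
print(immediate_precedence_pairs(siblings, "NP*", "PP*"))
# prints [('NP-1', 'PP')]
```

Passing an empty ignore set (ignore=set()) makes the CODE node count as an intervening node, so the same call then finds no pairs.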

search commands

add_to_ignore: (String label_list)

default "" (empty string)

adds given labels to the ignore_list. For instance,

add_to_ignore:  \**

will tell CorpusSearch to ignore traces for this search.

ignore_nodes: (String ignore_list)

default COMMENT|CODE|ID|LB|'|\"|,|E_S|/

tells CorpusSearch what nodes to ignore.

To replace the default ignore-list with your own ignore-list, include this command in your command_file:

ignore_nodes:  <your_ignore_list>

To tell CorpusSearch not to ignore any nodes, include this command in your command_file:

ignore_nodes:   null
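How the two commands might combine can be sketched as follows. This is a guess at the bookkeeping, not the CorpusSearch source: the function name effective_ignore_list is hypothetical, the default string is the one given for ignore_nodes above, and the assumption that add_to_ignore is appended after ignore_nodes has been applied is mine.

```python
# Default taken from the ignore_nodes entry above; \" is kept as a two-character label.
DEFAULT_IGNORE = "COMMENT|CODE|ID|LB|'|\\\"|,|E_S|/"

def effective_ignore_list(ignore_nodes=None, add_to_ignore=""):
    if ignore_nodes is None:
        # No ignore_nodes command: fall back to the default list.
        labels = DEFAULT_IGNORE.split("|")
    elif ignore_nodes.strip() == "null":
        # "null" means: ignore nothing.
        labels = []
    else:
        labels = ignore_nodes.strip().split("|")
    # Assumption: add_to_ignore labels are appended to whatever list is in force.
    if add_to_ignore.strip():
        labels = labels + add_to_ignore.strip().split("|")
    return labels
```

For example, effective_ignore_list(add_to_ignore="\\**") returns the default labels followed by the trace label from the add_to_ignore example.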

If you try to search for an item that is on the ignore_list, you'll get a warning message. For instance, this query:

(NP-SBJ* iPrecedes CODE)

generates this message:

WARNING!  CODE in y_argument to iPrecedes is on the ignore_list.

    To make the ignore_list empty, add this line to your command file:

        ignore_nodes: null

    To write your own ignore_list, add this line to your command file:

        ignore_nodes: <your_ignore_list>

The program goes ahead and runs as usual, but if you don't get the results you were looking for you probably need to change the ignore_list.

node: (String node_boundary)

default IP*|NODE

gives CorpusSearch a node boundary to search within. The default list gives boundaries that any structure you search for will fall within; IP* describes all the basic sentence divisions in the corpus, and NODE is the outermost boundary label of nodes_only output.

The choice of node boundary determines how hits are counted and, when nodes_only is true, which nodes are printed.

To illustrate this, I ran the same query with different node boundaries on a simple file containing one sentence. First I ran the query with the default node boundary, IP*|NODE:

node:   IP*|NODE
query:  (NP* iDominates PRO*)

Here's the output; notice that 1 hit is counted because there was one IP* node (1 IP-MAT) containing both NP* nodes:

/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/

/*
    1 IP-MAT: 3 NP-SBJ, 4 PRO he
    1 IP-MAT: 6 NP-OB2, 7 PRO them
*/

(0 (1 IP-MAT (2 CONJ and)
             (3 NP-SBJ (4 PRO he))
             (5 VBD made)
             (6 NP-OB2 (7 PRO them))
             (8 NP-OB1 (9 ADJ grete) (10 N chere))
             (11 ADVP (12 ADV out)
                      (13 PP (14 P of)
                             (15 NP (16 N mesure)))))
   (ID CMMALORY,2.13))

/*
    FOOTER
    source file:  CMMALORY
    hits found:  1
    sentences containing the hits:  1
    total sentences searched:  1
*/

Next I ran the query with node boundary NP*:

node:   NP*
query:  (NP* iDominates PRO*)

Here's the output; this time 2 hits are counted, because there are two distinct NP* nodes, (3 NP-SBJ) and (6 NP-OB2). Because nodes_only is true by default, only the NP* nodes are printed:

/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/

/*
    3 NP-SBJ: 4 PRO he
    6 NP-OB2: 7 PRO them
*/

(NODE
      (3 NP-SBJ (4 PRO he))
      (ID CMMALORY,2.13))

(NODE
      (6 NP-OB2 (7 PRO them))
      (ID CMMALORY,2.13))


/*
    FOOTER
    source file:  CMMALORY
    hits found:  2
    sentences containing the hits:  1
    total sentences searched:  1
*/
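The difference in hit counts between the two runs can be reproduced with a short Python sketch. It is a model of the counting only, under stated assumptions: the tree is hand-encoded as nested tuples, wildcards are approximated with fnmatchcase, a hit is counted once per boundary node that contains at least one match, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase

# (label, [children]) for phrases, (label, word) for leaves.
TREE = ("IP-MAT", [
    ("CONJ", "and"),
    ("NP-SBJ", [("PRO", "he")]),
    ("VBD", "made"),
    ("NP-OB2", [("PRO", "them")]),
    ("NP-OB1", [("ADJ", "grete"), ("N", "chere")]),
    ("ADVP", [("ADV", "out"),
              ("PP", [("P", "of"), ("NP", [("N", "mesure")])])]),
])

def children(node):
    return node[1] if isinstance(node[1], list) else []

def subtree(node):
    # Yield the node itself and every node below it.
    yield node
    for child in children(node):
        yield from subtree(child)

def contains_match(node, parent_pat, child_pat):
    # True if some node in this subtree matches parent_pat and
    # immediately dominates a node matching child_pat.
    return any(
        fnmatchcase(n[0], parent_pat)
        and any(fnmatchcase(c[0], child_pat) for c in children(n))
        for n in subtree(node)
    )

def count_hits(root, boundary_pat, parent_pat, child_pat):
    # Assumption: one hit per boundary node that contains a match.
    return sum(
        1 for n in subtree(root)
        if fnmatchcase(n[0], boundary_pat) and contains_match(n, parent_pat, child_pat)
    )

print(count_hits(TREE, "IP*", "NP*", "PRO*"))  # prints 1
print(count_hits(TREE, "NP*", "NP*", "PRO*"))  # prints 2
```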

query: (String query)

default ERROR

Every command file must contain a query, although it need not contain anything else. The query must be the last item in the command file.
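The layout rule (optional commands first, query last) can be illustrated with a rough Python sketch of a command-file reader. This is not how CorpusSearch parses its input; the function parse_command_file is hypothetical, and it assumes one command per line, treats everything after "query:" as the query, and does not handle remarks or /* comments.

```python
def parse_command_file(text):
    # Everything before "query:" holds the optional commands;
    # everything after it is taken to be the query.
    head, separator, query = text.partition("query:")
    if not separator or not query.strip():
        raise ValueError("command file must contain a query")
    commands = {}
    for line in head.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and // comments
        name, colon, value = line.partition(":")
        if colon:
            commands[name.strip()] = value.strip()
    # Collapse whitespace so a multi-line query becomes one string.
    commands["query"] = " ".join(query.split())
    return commands

print(parse_command_file("node:   NP*\nquery:  (NP* iDominates PRO*)\n"))
# prints {'node': 'NP*', 'query': '(NP* iDominates PRO*)'}
```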

printing commands:

These commands do not in any way influence the current search. They only give instructions about how the results of the current search should be printed. However, because these commands can cause the output of the current search to take different forms, they may influence future searches which will take as their input the output of the current search.

begin_remark: (String remark) end_remark

default "" (empty string)

tells CorpusSearch to print user's remark in the output Preface. This is a way for the user to write a note to herself, for instance to remember the goal of the search.

For instance, the command file "pro-obj.q" contains this command:

begin_remark: 
	pronoun objects
end_remark

which is printed in the output preface like this:

/*
    PREFACE:  regular output file.
    CorpusSearch copyright Beth Randall 1999.
    Date:  Wed Nov 03 19:12:03 EST 1999

    command file:       pro-obj.q
    input file:         ipmat-2vb.out
    output file:        pro-obj.out

    remark:
        pronoun objects

    node:   IP*
    query:  (NP-OB* iDominates PRO)
*/

nodes_only: (boolean true or false)

default true

If true, CorpusSearch prints out only the nodes that contain the structure described in "query".

If false, CorpusSearch prints out the entire sentence that contains the structure described in "query".

For instance, suppose you have this query:

node:  ADVP* 
query: (ADVP* iDominates ADVP*)

Here's what a piece of the output looks like with nodes_only true.

/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.574)
*~/

/*
 2 ADVP: 3 ADVP
*/

(NODE (ADVP
            (ADVP (ADV certayn))
            (CONJP (CONJ and)
                   (PP (P wit-owte)
                       (NP (N doute))))
            (, ,))(ID CMAELR3,45.574))

And here's the same piece of output with nodes_only false:

/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.589)
*~/

/*
 2 ADVP: 3 ADVP
*/

(
(IP-MAT
        (ADVP
              (ADVP (ADV certayn))
              (CONJP (CONJ and)
                     (PP (P wit-owte)
                         (NP (N doute)))))
        (, ,)
        (NP-OB1 (NPR Ihon))
        (BEP is)
        (NP-SBJ (PRO$ is) (N name))
        (E_S .))
(ID CMAELR3,45.589))

only_ur_text: (boolean true or false)

default false

If true, CorpusSearch prints out only the ur_text version of the sentences containing the searched-for structure. It also prints the ur_text version of the nodes in which the structures were found. This could be a useful step at the very end of a search, providing a file full of sentences ready to be copied into a research paper.

NOTE: Since the output of an only_ur_text search contains no parsed sentences, it cannot be used as the input to a new search.

Here's a piece of only_ur_text output resulting from this query:

node:  ADVP* 
query:  (ADVP* iDominates ADVP*)

/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.589)

ADVP:   certayn and wit-owte doute
*~/

print_complement: (boolean true or false)

default false

The idea behind print_complement is to split the input file into two complementary sets, the output file and the complement file. If print_complement is true, CorpusSearch prints a separate file containing all the sentences found in the input that did *not* contain the searched-for structure. The name of the complement file is the same as the name of the output file, but with ".cmp" replacing ".out".
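The split and the naming rule can be sketched in Python. Both functions are hypothetical illustrations: the first assumes output file names end in ".out", as in the examples in this chapter, and the second stands in for the search itself with a caller-supplied test.

```python
def complement_file_name(output_name):
    # Assumption: output names end in ".out".
    if not output_name.endswith(".out"):
        raise ValueError("expected an output file name ending in .out")
    return output_name[: -len(".out")] + ".cmp"

def split_by_hit(sentences, has_hit):
    # Every input sentence lands in exactly one of the two sets.
    output = [s for s in sentences if has_hit(s)]
    complement = [s for s in sentences if not has_hit(s)]
    return output, complement

print(complement_file_name("pro-obj.out"))  # prints pro-obj.cmp
```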

print_indices: (boolean true or false)

default true

tells CorpusSearch whether or not to print indices in the output.

Indices start at 0 and are used to label every node in the tree. CorpusSearch uses indices to distinguish, for instance, between several different NP nodes in the same sentence.

Here's a piece of an output sentence with indices:

             (10 NP-OB1 (11 NPR Morgan)
                        (12 NPR le)
                        (13 NPR Fay)

Here's how it looks without indices:

                  (NP-PRN (NPR Morgan)
                          (NPR le)
                          (NPR Fey)))
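In the indexed examples in this chapter, the numbers follow the order in which nodes open when the tree is read left to right, starting from 0 at the outermost wrapper. The Python sketch below reproduces that numbering on a hand-encoded tree. The pre-order scheme is inferred from the examples rather than stated by the manual, and the function number_nodes is hypothetical.

```python
def number_nodes(node, counter=None):
    # node is (label, [children]) for a phrase or (label, word) for a leaf.
    # Returns (index, label, body) with indices assigned in pre-order.
    if counter is None:
        counter = [0]
    index = counter[0]
    counter[0] += 1
    label, body = node
    if isinstance(body, list):
        body = [number_nodes(child, counter) for child in body]
    return (index, label, body)

# The empty label stands for the unlabeled outermost wrapper.
tree = ("", [("IP-MAT-SPE", [("NP-SBJ-1", [("EX", "There")]),
                             ("BEP", "ar")])])
print(number_nodes(tree))
```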

remove_nodes: (boolean true or false)

default false

removes nodes of the same species as the node boundary that did *not* contain the searched-for structure.

The purpose of this is to make it easier to search output. For instance, if you were looking for IP nodes containing a certain structure, remove_nodes will ensure that your output contains only IP nodes with that structure, and no other IP nodes.

CorpusSearch uses this algorithm to find the node species: start with the node boundary. If the node boundary contains a '-', the node species is the substring of the node boundary up to the first hyphen, with a '*' tacked on. If the node boundary does not contain a '-', the node species is simply the node boundary with a '*' tacked on if the node boundary didn't already have one.

For instance, if the node boundary is IP-PRN*, the node species is IP*.
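The rule described above translates directly into a few lines of Python. This is a sketch of the stated algorithm, not the CorpusSearch source, and the function name node_species is hypothetical.

```python
def node_species(node_boundary):
    if "-" in node_boundary:
        # Take everything up to the first hyphen and tack on '*'.
        return node_boundary.split("-", 1)[0] + "*"
    # No hyphen: add '*' only if it is not already there.
    if node_boundary.endswith("*"):
        return node_boundary
    return node_boundary + "*"

print(node_species("IP-PRN*"))  # prints IP*
```

A boundary such as NP gives NP*, and IP* is returned unchanged.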

For example, consider this command file:

remove_nodes: true
query: (NP-OB* iDoms PRO)

Here's a piece of the output:

/~*
'And I shall defende the,' seyde the knyght.
(CMMALORY,39.1264)
*~/

/*
 1 IP-MAT-SPE: 8 NP-OB1, 9 PRO the
*/

 (0 (1 IP-MAT-SPE (2 ' ')
                 (3 CONJ And)
                 (4 NP-SBJ (5 PRO I))
                 (6 MD shall)
                 (7 VB defende)
                 (8 NP-OB1 (9 PRO the))
                 (10 , ,)
                 (11 ' ')
                 (12 IP-MAT-PRN REMOVED)
                 (13 E_S .))
     (ID CMMALORY,39.1264))

Notice that the sub-sentence "seyde the knyght" has been removed from the parsed sentence. A search on this output will be a search only on IP* nodes that contain a pronoun object, and on no other nodes.

set_margin: (int margin)

default 78

sets margin for CorpusSearch comments and ur_text, but not for parsed sentences, which wrap around the screen.

debugging commands:

The debugging commands are intended for the use of Corpus-Mistresses. The average user probably has no cause to use these commands.

debug_function_calls: (boolean true or false)

default false

tells CorpusSearch to print the function calls vector to the screen.

hunt_bugs: (boolean true or false)

default false

For use by the Corpus-Mistress. Sends the input files to the bug-hunter, and outputs any errors discovered. The bug-hunter is the one piece of CorpusSearch that is label-dependent.

comments

Comments may be added to the command file using // or /*. Do not add comments after the query!
