Understanding the Output

contents of this chapter:

general form of the output
a typical output file
preface
header
result block with output sentence
footer
hits/tokens/total
summary block
using nodes_only and remove_nodes

general form of the output

Output files have this general form:

1 per output file    1 per input file    1 set per output sentence
Preface
                     Header
                                         ur_text sentence
                                         result block
                                         parsed sentence
                     Footer
Summary

Since the output file can become input to a subsequent search, everything except parsed sentences is surrounded by comment markers /* and */ (the ur_text block has slightly different markers).
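Because everything except the parsed sentences sits inside comment markers, the parsed sentences can be recovered by deleting the commented spans. Here is a minimal Python sketch of that idea. It is not CorpusSearch's own reader: the function name strip_comments and the abbreviated sample text are illustrative assumptions, and the sketch assumes that the marker sequences never occur inside the commented text itself.

```python
import re

# The ur_text block uses /~* ... *~/; every other non-sentence block uses /* ... */.
UR_TEXT = re.compile(r"/~\*.*?\*~/", re.DOTALL)
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

def strip_comments(output_text):
    """Return only the parsed sentences from the text of an output file."""
    without_ur_text = UR_TEXT.sub("", output_text)
    return COMMENT.sub("", without_ur_text).strip()

# Abbreviated sample in the shape of an output file (not real program output).
sample = """/*
    HEADER:
    source file:  cmcapchr.m4.psd
*/
/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/
/*
    1 IP-MAT: 7 VBD kepte
*/
(0 (1 IP-MAT (7 VBD kepte)) (ID CMCAPCHR,32.13))
"""

parsed_only = strip_comments(sample)
print(parsed_only)
```

The ur_text pattern is applied first so that its markers are gone before the ordinary comment pattern runs.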

a typical output file

As an example, I'll walk through a typical output file, from a search done by Ann Taylor. The query was designed to search for inverted pronoun subjects, that is, pronoun subjects that appear after the tensed verb.

To make this example easier to follow, this line was added to the command file:

nodes_only: f 

I will discuss nodes_only and remove_nodes below.

preface

/*
    PREFACE:  regular output file.
    CorpusSearch copyright Beth Randall 2000.
    Date:  Sun Apr 30 07:05:51 EDT 2000

    command file:       under.q
    output file:        under.out

    remark:   this query searches for inverted pronoun subjects.

    node:   IP*
    query:  ((((NP*|ADJP*|ADVP*|PP* iPrecedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
            AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))
            AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*))
            AND (NP-SBJ* iDominates PRO|MAN))
*/

The preface begins with a label identifying this as a regular output file, that is, not a complement file. This is followed by a copyright declaration and the date and time of the search.

The names of the command file and output file are listed. If this search had been performed using an output file as input (instead of a corpus file), the name of the output-as-input file would also have been listed in this block. But because the input file is a corpus file, the header and summary blocks contain all the necessary information (for more on searching output files, see below).

The remark was found in the command file. It serves as a reminder of the purpose of the query.

The beginning of the query,

          ((NP*|ADJP*|ADVP*|PP* iPrecedes  *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD)
          AND (NP*|ADJP*|ADVP*|PP* iDominates !\*T*))

requires a constituent (NP*|ADJP*|ADVP*|PP*) that immediately precedes the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD). The constituent is required not to contain a trace (\*T*), that is, a placeholder for a word that would appear in that position under some circumstances but in fact appears elsewhere in this particular sentence. This requirement precludes questions (such as "Kepte he his fadir scheep ful mekly?"), where nothing precedes the inverted pronoun subject except the tensed verb. In Middle English statements, one constituent must precede the tensed verb, as the first two lines of the query describe.

The last two lines of the query,

          AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iPrecedes NP-SBJ*))
          AND (NP-SBJ* iDominates PRO|MAN))

describe the tensed verb (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD), which immediately precedes the subject noun phrase (NP-SBJ*), which in turn immediately dominates a pronoun (PRO|MAN); that is, the subject is a pronoun.
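To make the logic of the four clauses concrete, here is a rough Python sketch that tests the same conditions on the daughters of a single clause. It is a deliberate simplification and not how CorpusSearch evaluates queries: it looks only at adjacent sisters, it represents each daughter as a hypothetical (label, children) pair, and it borrows Python's fnmatch wildcards as a stand-in for the query's "*" patterns.

```python
from fnmatch import fnmatchcase

# Label alternatives copied from the query; "|" separates alternatives.
FIRST = "NP*|ADJP*|ADVP*|PP*"
TENSED = "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD"
SUBJECT = "NP-SBJ*"
PRONOUN = "PRO|MAN"

def matches(label, alternatives):
    """True if label fits any of the "|"-separated wildcard patterns."""
    return any(fnmatchcase(label, pat) for pat in alternatives.split("|"))

def has_inverted_pronoun_subject(daughters):
    """daughters: (label, child labels or words) pairs in surface order."""
    for i in range(len(daughters) - 2):
        (first, first_kids), (verb, _), (subj, subj_kids) = daughters[i:i + 3]
        if (matches(first, FIRST)
                and any(not kid.startswith("*T*") for kid in first_kids)
                and matches(verb, TENSED)
                and matches(subj, SUBJECT)
                and any(matches(kid, PRONOUN) for kid in subj_kids)):
            return True
    return False

# "His fadir scheep kepte he ful mekly;" (object first, then verb, then subject)
statement = [("NP-OB1", ["NP-POS", "N"]), ("VBD", ["kepte"]),
             ("NP-SBJ", ["PRO"]), ("ADVP", ["ADVR", "ADV"]), ("E_S", [";"])]
# "Kepte he his fadir scheep ful mekly?" (nothing before the verb)
question = [("VBD", ["kepte"]), ("NP-SBJ", ["PRO"]),
            ("NP-OB1", ["NP-POS", "N"]), ("ADVP", ["ADVR", "ADV"]), ("E_S", ["?"])]

statement_fits = has_inverted_pronoun_subject(statement)
question_fits = has_inverted_pronoun_subject(question)
print(statement_fits, question_fits)
```

In this toy model the statement fits and the question does not, because the question has no constituent in front of the tensed verb.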

header

/*
    HEADER:
    source file:  cmcapchr.m4.psd
*/

Here, the source file is listed as its name appears in the corpus directory. If the input had been an output file, the source file would have been listed as its name appears in the ID node of each sentence, that is, CMCAPCHR (for more on searching output files, see below).

result block with output sentence

Here's an example of an output sentence, first presented as the original text, followed by a result block, followed by the sentence in its parsed form:

/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/

/*
    1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he
*/

(0
   (1 IP-MAT
             (2 NP-OB1
                       (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))
             (7 VBD kepte)
             (8 NP-SBJ (9 PRO he))
             (10 ADVP (11 ADVR ful) (12 ADV mekly))
             (13 E_S ;))
      (ID CMCAPCHR,32.13))

Notice that the default word order would be "He kepte his fadir scheep ful mekly", but in this case the object "his fadir scheep" has been moved to the beginning of the sentence. Since only one constituent can precede the verb, the subject "he" must follow the verb "kepte"; that is, subject and verb have been inverted.

Notice that the original text is surrounded by special markers, "/~*" and "*~/". When a search is run on the output file, CorpusSearch will find and record this block as the original text of the output sentence. In this way the entire original text is conserved, even when only bits and pieces of the original parsed sentence appear in the output.
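A script that post-processes output files can use the same markers to pull out the original text. The following Python sketch is one way to do that. The function name ur_text_blocks is an assumption, and the sketch assumes that the last line inside each block is the parenthesized sentence ID, as it is in the example above.

```python
import re

def ur_text_blocks(output_text):
    """Return (original text, sentence ID) pairs for every /~* ... *~/ block."""
    blocks = []
    for body in re.findall(r"/~\*(.*?)\*~/", output_text, re.DOTALL):
        lines = [line.strip() for line in body.strip().splitlines()]
        sentence_id = lines[-1].strip("()")      # e.g. "(CMCAPCHR,32.13)"
        blocks.append((" ".join(lines[:-1]), sentence_id))
    return blocks

sample = """/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/
"""

blocks = ur_text_blocks(sample)
print(blocks)
```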

The first item in the list of indices and structures is the boundary node (in this case, 1 IP-MAT), which fits the "node: " line of the command file. It is followed by a colon to separate it from the rest of the list, which details the structures that correspond to the "query: " line of the command file. The list of indices and structures has been pruned so that no node is reported more than once.
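The layout of a result-block line is regular enough to split apart mechanically. Here is a small Python sketch, assuming the "boundary: item, item, ..." layout shown above; parse_result_line is a hypothetical helper, and it would need more care for leaves whose text is itself a comma.

```python
def parse_result_line(line):
    """Split one result-block line into its boundary node and reported nodes."""
    boundary_part, _, rest = line.strip().partition(": ")
    boundary_index, boundary_label = boundary_part.split()
    nodes = []
    for item in rest.split(", "):
        index, label, *words = item.split()
        # Phrasal nodes have no word, so their third field is empty.
        nodes.append((int(index), label, " ".join(words)))
    return (int(boundary_index), boundary_label), nodes

line = "    1 IP-MAT: 2 NP-OB1, 7 VBD kepte, 6 N scheep, 8 NP-SBJ, 9 PRO he"
boundary, nodes = parse_result_line(line)
print(boundary)
print(nodes)
```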

For some queries, there may be many nodes that fit one search-function argument. In these cases CorpusSearch always reports the last legitimate fitting node. For instance, look at this part of the query:

(NP*|ADJP*|ADVP*|PP* iDominates !\*T*)

In the sentence above, 2 NP-OB1 immediately dominates the following structure, in which neither 3 NP-POS nor 6 N scheep is \*T*:

                       (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))

Both daughters fit, so it is the last one, 6 N scheep, that is reported in the result block.

The parsed version of the output sentence is indented to show the structure of the tree. Sisters have the same indentation (for instance, 2 NP-OB1 and 7 VBD kepte). Daughters are indented further than their mothers. Leaves are printed on the same line to save space.
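Since the indentation is only a display aid, the structure itself is carried by the parentheses. The Python sketch below reads the bracketed sentence into nested (head, children) pairs. It is a generic bracket parser written for illustration, not CorpusSearch code, and the name parse_tree is an assumption.

```python
import re

def parse_tree(text):
    """Parse one bracketed sentence into nested (head, children) pairs.

    head is the list of plain tokens inside a node, such as ["7", "VBD", "kepte"];
    children is the list of nested nodes.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "every node starts with an opening parenthesis"
        pos += 1
        head, children = [], []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                head.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (head, children)

    return node()

SENTENCE = """
(0
   (1 IP-MAT
             (2 NP-OB1
                       (3 NP-POS (4 PRO$ His) (5 N$ fadir))
                       (6 N scheep))
             (7 VBD kepte)
             (8 NP-SBJ (9 PRO he))
             (10 ADVP (11 ADVR ful) (12 ADV mekly))
             (13 E_S ;))
      (ID CMCAPCHR,32.13))
"""

tree = parse_tree(SENTENCE)
ip_mat = tree[1][0]
sisters = [child[0][1] for child in ip_mat[1]]
print(sisters)
```

The labels collected in sisters are the daughters of 1 IP-MAT, the nodes that share one indentation level in the printed tree.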

footer

/*
    FOOTER
    source file:  cmcapchr.m4.psd
    hits found:  220
    tokens containing the hits:  220
    total tokens searched:  4175
*/

The footer gives the statistics for hits, tokens, and total as found in that input file. The same information appears again as one line of the summary block.
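To show how the footer and the summary line relate, here is a short Python sketch that reads the "label: value" lines of the footer above and rebuilds the hits/tokens/total form used in the summary block. The helper name parse_footer is an assumption, and the sketch relies only on the field labels visible in the example.

```python
def parse_footer(block):
    """Collect the "label: value" lines of a footer block into a dictionary."""
    fields = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the comment markers
            fields[key.strip()] = value.strip()
    return fields

footer = """/*
    FOOTER
    source file:  cmcapchr.m4.psd
    hits found:  220
    tokens containing the hits:  220
    total tokens searched:  4175
*/"""

fields = parse_footer(footer)
counts = "/".join([fields["hits found"],
                   fields["tokens containing the hits"],
                   fields["total tokens searched"]])
print(fields["source file"], counts)
```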

hits/tokens/total

CorpusSearch reports these statistics:

hits:    number of distinct boundary nodes containing the searched-for structure.
tokens:  number of independent parsed objects in which hits occurred.
total:   total number of independent parsed objects searched.

When you're searching a corpus file, "tokens" means "sentences", since each independent parsed object in the corpus is a sentence. In these searches it's very common to have "hits" greater than "tokens", since one sentence may contain many distinct boundary nodes.
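The difference between hits and tokens can be seen in a few lines of Python. This sketch applies the definitions above literally to made-up data: each inner list holds the boundary-node index from every line of one sentence's result block. The function name and the data are illustrative assumptions.

```python
def count_hits_and_tokens(result_blocks):
    """result_blocks: one list of boundary-node indices per output sentence."""
    # A hit is a distinct boundary node, so repeated indices count once.
    hits = sum(len(set(boundaries)) for boundaries in result_blocks)
    # A token is a sentence in which at least one hit occurred.
    tokens = sum(1 for boundaries in result_blocks if boundaries)
    return hits, tokens

result_blocks = [
    [1],       # one structure in one boundary node
    [1, 1],    # two structures inside the same boundary node
    [1, 16],   # structures in two different boundary nodes
]

hits, tokens = count_hits_and_tokens(result_blocks)
print(hits, tokens)  # 4 3
```

Only the third sentence pushes hits above tokens, because it is the only one with two distinct boundary nodes.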

But suppose you follow these steps:

  1. Run a search on the corpus, using "nodes_only" and "remove_nodes". Call the output of this search "1.out".
  2. Now, run a search on "1.out". Call the output of this second search "2.out".

In "2.out", "hits" and "tokens" will be the same number, because each token in "1.out" contained exactly one boundary node and thus can contain at most one hit.

summary block

/*
    SUMMARY:  regular output file.

    command file:       	invert.q
    output file:        	invert.out

    source files, hits/tokens/total:
        cmaelr4.m4.psd          46/46/766
        cmcapchr.m4.psd         220/220/4175
        cmcapser.m4.psd         12/12/91
        cmedmund.m4.psd         2/2/300
        cmfitzja.m4.psd         14/14/228
        cmgregor.m4.psd         14/14/2631
        cminnoce.m4.psd         6/6/208
        cmkempe.m4.psd          203/202/3851
        cmmalory.m4.psd         214/213/4995
        cmreynar.m4.psd         36/36/547
        cmreynes.m4.psd         0/0/245
        cmsiege.m4.psd          6/6/731
    grand total hits :  773
    grand total tokens:  771
    grand total tokens searched:  18772
*/

The summary, like the preface, is labelled "regular output file" to show that it is not the summary of a complement file.

The summary block gives the same information as the footer blocks for each input file, but brought together in one place. This summary block was produced by a search on all corpus files whose names contain "m4", meaning they are from the fourth chronological period (1420-1500).

using nodes_only and remove_nodes

Consider this query file, called ipmat-2vb.q:

begin_remark:
    This query searches for matrix clauses which contain a
    subject and at least two verbs.  The subject precedes
    both verbs.
end_remark

node:  IP-MAT*
query: (((((IP-MAT* iDoms NP-SBJ*)
AND (NP-SBJ* precedes *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD))
AND (NP-SBJ* precedes VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN))
AND (*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD iDoms ![1]\**))
AND (VB|VAN|VBN|HV|HAN|HVN|DO|DAN|DON|BE|BEN iDoms ![2]\**))

Because remove_nodes and nodes_only are true by default, the output will contain only the boundary nodes that contain the structure, and irrelevant boundary nodes will be removed. This ensures that subsequent searches are conducted only on the matrix clauses that contain a subject preceding two verbs. Here's a sample output sentence. In Modern English, it would read: "He would have told you more if you had allowed him to."

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/

(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (IP-SUB REMOVED)))
                 (24 E_S .))(ID CMMALORY,35.1106))

Notice that the IP-SUB clause, "$ye wolde a suffirde hym", has been removed.

Suppose we run this output through a search for pronoun objects, using this query file, called "pro-obj.q":

begin_remark:
pronoun objects
end_remark

add_to_ignore: \**
print_complement: t
query: (NP-OB* iDoms PRO)

The "suffirde" sentence shows up again, because it has a pronoun object "you":

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
*/

(0 (1 IP-MAT-SPE (2 CONJ and)
                 (3 NP-OB1 (4 QR more))
                 (5 NP-SBJ (6 PRO he))
                 (7 MD wolde)
                 (8 HV a)
                 (9 VBN tolde)
                 (10 NP-OB2 (11 PRO you))
                 (12 PP (13 P and)
                        (14 CP-ADV (15 C 0)
                                   (16 IP-SUB REMOVED)))
                 (17 E_S .))(ID CMMALORY,35.1106))

Notice that the result block describes one structure,

1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you

This structure will be counted as one hit in the final summary block.

Now suppose we run the same series of searches, but this time we add this line to the command files:

nodes_only: f

Setting nodes_only to false automatically makes remove_nodes false as well.
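The interaction between the two settings can be written down as a small piece of Python. This is only a sketch of the rule as the text states it (both settings default to true, and switching nodes_only off switches remove_nodes off). The function name is hypothetical, and accepting both "t" and "true" as spellings of true is an assumption.

```python
def effective_settings(command_file_text):
    """Resolve nodes_only and remove_nodes from the text of a command file."""
    settings = {"nodes_only": True, "remove_nodes": True}  # defaults per the text
    for line in command_file_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in settings:
            settings[key] = value.strip().lower() in ("t", "true")
    # Turning nodes_only off forces remove_nodes off as well.
    if not settings["nodes_only"]:
        settings["remove_nodes"] = False
    return settings

with_flag = effective_settings("nodes_only: f\nquery: (NP-OB* iDoms PRO)")
without_flag = effective_settings("query: (NP-OB* iDoms PRO)")
print(with_flag)
print(without_flag)
```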

Here's how the "suffirde" sentence looks after running ipmat-2vb.q with nodes_only and remove_nodes false:

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 8 HV a
 1 IP-MAT-SPE: 5 NP-SBJ, 7 MD wolde, 9 VBN tolde
*/

(0
(1 IP-MAT-SPE (2 CONJ and)
              (3 NP-OB1 (4 QR more))
              (5 NP-SBJ (6 PRO he))
              (7 MD wolde)
              (8 HV a)
              (9 VBN tolde)
              (10 NP-OB2 (11 PRO you))
              (12 PP (13 P and)
                     (14 CP-ADV (15 C 0)
                                (16 IP-SUB
                                           (17 NP-SBJ (18 PRO $ye))
                                           (19 MD wolde)
                                           (20 HV a)
                                           (21 VBN suffirde)
                                           (22 NP-OB1 (23 PRO hym)))))
              (24 E_S .))
(25 ID CMMALORY,35.1106))

Notice that the clause "$ye wolde a suffirde hym" is printed out in full.

Now we run pro-obj.q on this output. Here's the "suffirde" sentence resulting from this search:

/~*
and more he wolde a tolde you and $ye wolde a suffirde hym.
(CMMALORY,35.1106)
*~/
/*
 1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you
 16 IP-SUB: 22 NP-OB1, 23 PRO hym
*/

(0
(1 IP-MAT-SPE (2 CONJ and)
              (3 NP-OB1 (4 QR more))
              (5 NP-SBJ (6 PRO he))
              (7 MD wolde)
              (8 HV a)
              (9 VBN tolde)
              (10 NP-OB2 (11 PRO you))
              (12 PP (13 P and)
                     (14 CP-ADV (15 C 0)
                                (16 IP-SUB
                                           (17 NP-SBJ (18 PRO $ye))
                                           (19 MD wolde)
                                           (20 HV a)
                                           (21 VBN suffirde)
                                           (22 NP-OB1 (23 PRO hym)))))
              (24 E_S .))
(25 ID CMMALORY,35.1106))

Notice that here the result block contains two different structures,

1 IP-MAT-SPE: 10 NP-OB2, 11 PRO you 
16 IP-SUB: 22 NP-OB1, 23 PRO hym

The structure

16 IP-SUB: 22 NP-OB1, 23 PRO hym

is reported in this case because remove_nodes was false in the previous search. The pronoun object "hym" was found in a subordinate clause, not in the matrix clause that the previous search was designed to isolate.

Because the structures occur in two distinct boundary nodes (1 IP-MAT-SPE and 16 IP-SUB), this will count as two hits in the summary block, in contrast to the one hit counted when remove_nodes was true. This explains why the "remove_nodes: true" version of the search reports fewer hits than the "remove_nodes: false" version.

Here's the summary block from the "remove_nodes: true" version:

/*
    SUMMARY:  regular output file.

    command file:       pro-obj.q
    input file:         ipmat-2vb.out
    output file:        pro-obj.out

    source files, hits/tokens/total:
        CMMALORY                177/176/875
    grand total hits :  177
    grand total tokens:  176
    grand total tokens searched:  875
*/

And here's the summary block from the "remove_nodes: false" version:

/*
    SUMMARY:  regular output file.

    command file:       pro-obj.q
    input file:         ipmat-2vb.out
    output file:        pro-obj.out

    source files, hits/tokens/total:
        CMMALORY                290/249/875
    grand total hits :  290
    grand total tokens:  249
    grand total tokens searched:  875
*/
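The two summary lines can be compared directly. The Python sketch below splits each hits/tokens/total entry into numbers and takes the difference. The helper name parse_counts is an assumption, and the reading of the difference follows the explanation given earlier in this section.

```python
def parse_counts(summary_line):
    """Split a 'source hits/tokens/total' summary line into its parts."""
    name, counts = summary_line.split()
    hits, tokens, total = (int(n) for n in counts.split("/"))
    return name, hits, tokens, total

removed = parse_counts("CMMALORY                177/176/875")  # remove_nodes: true
kept = parse_counts("CMMALORY                290/249/875")     # remove_nodes: false

extra_hits = kept[1] - removed[1]
extra_tokens = kept[2] - removed[2]
print(extra_hits, extra_tokens)  # 113 73
```

The total is the same in both runs (875); only the hits and tokens rise when the embedded clauses are left in the input.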
