How to Make Your Corpus Compatible with CorpusSearch

contents of this chapter:

your corpus
parse completely
labels must be single words
labels must not begin with digits
no square brackets
tree must be described with round parentheses
wrap your sentences
use identification nodes
give corpus file names a standard ending
the corpus bug-hunter is label-dependent
an example of an incompatible corpus

your corpus

With the invention of trainable parsers, more corpora are being built. So far, CorpusSearch has been used to search Middle English, Old English, Chinese, Korean and Yiddish corpora. If you're building a corpus, here's what you need to know to ensure that you can search it with CorpusSearch.

parse completely

CorpusSearch expects sentences to be completely parsed. That is, every piece of text is expected to have a label affixed to it. If your sentence is only partially parsed, CorpusSearch won't break, but you won't have any way to search the partially parsed areas of text.

labels must be single words

CorpusSearch expects each label to be a single string, that is, to contain no spaces (" "). If your label consists of multiple strings, the first string will be interpreted as the label and the next string will be ignored (in the case of a phrase label), or picked up as original text (in the case of a word label). For instance, if you try to use "NOUN PHRASE" as a label, CorpusSearch will interpret "NOUN" as the label and ignore "PHRASE". On the other hand, "NOUN_PHRASE" will be interpreted as a label and can be found using CorpusSearch.

labels must not begin with digits

Labels must not begin with digits ("0", "1", ..., "9"). Digits before labels will be interpreted as indices left over from a previous search, and so will be ignored. Labels are allowed to *end* with digits, though. So "PP1" is an acceptable label, but "1PP" is not.
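The two label rules above (no spaces, no leading digit) are easy to check mechanically before you run a search. The following is a minimal sketch in Python; `label_problems` is a hypothetical helper written for this illustration and is not part of CorpusSearch.

```python
import re

# Hypothetical helper, not part of CorpusSearch: reports which of the two
# label rules described above a candidate label breaks.
def label_problems(label):
    problems = []
    if re.search(r"\s", label):
        problems.append("contains whitespace")
    if re.match(r"[0-9]", label):
        problems.append("begins with a digit")
    return problems

for candidate in ["NOUN_PHRASE", "NOUN PHRASE", "PP1", "1PP"]:
    print(candidate, label_problems(candidate))
```

An empty list means the label passes both checks.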

no square brackets

Square brackets ("[" and "]") are used in CorpusSearch to enclose prefix indices. They were a safe choice because the Middle English corpus doesn't contain any square brackets to search for. If your corpus contains square brackets, they will probably have to be changed, or they will be difficult to search for.

tree must be described with round parentheses

CorpusSearch expects the structure of the sentence to be described with round parentheses ("(", ")"). If your tree is described with "{" or "[" or some other system, you will have to convert it to "(" and ")".
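If the brackets in your files are used only to mark tree structure, the conversion can be a plain character substitution. This is a sketch under that assumption; `to_round_parens` is a hypothetical name, and the function would also rewrite any bracket that occurs as ordinary text.

```python
# Map alternative bracket characters to round parentheses.
# Assumption: these characters mark tree structure only and never
# appear as part of the text itself.
TO_ROUND = str.maketrans({"[": "(", "]": ")", "{": "(", "}": ")"})

def to_round_parens(tree):
    return tree.translate(TO_ROUND)

print(to_round_parens("[v0 lakht ] [s er ]"))
```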

wrap your sentences

CorpusSearch expects every sentence to have a "wrapper", that is, a pair of parentheses surrounding the sentence. The wrapper is a useful place to store items that are extraneous to the sentence but linked to it, for instance ID nodes. Here's an example: the "wrapper" consists of the first and last parentheses seen here:

(
  (IP-MAT
          (ADVP-TMP (ADV Thenne))
          (NP-SBJ (NPR quene) (NPR Igrayne))
          (VBD waxid)
          (ADVP-TMP (ADV dayly))
          (ADJP (ADJR gretter) (CONJ and) (ADJR gretter))
          (E_S .))
      (ID CMMALORY,5.120))
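One way to check that every sentence has its own wrapper is to split the file into top-level parenthesized groups and look at each one. The sketch below assumes that parentheses never occur as text in the corpus; `top_level_groups` is a hypothetical helper, not a CorpusSearch function.

```python
# Hypothetical helper: split corpus text into its top-level "( ... )" groups.
# Assumption: parentheses are used only for tree structure.
def top_level_groups(text):
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i          # a new wrapper opens here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at offset {i}")
            if depth == 0:
                groups.append(text[start:i + 1])
    if depth != 0:
        raise ValueError("unclosed '('")
    return groups

wrapped = "( (IP-MAT (VBD waxid) (E_S .)) (ID CMMALORY,5.120))"
print(len(top_level_groups(wrapped)))
```

A sentence whose ID node sits outside the wrapper shows up as two groups instead of one, which makes the problem easy to spot.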

use identification nodes

Although CorpusSearch can function without identification nodes (labelled "ID"), it's better to have them. When CorpusSearch searches the output of a previous search, it uses the ID nodes to keep statistics for the header, footer and summary blocks. Here's an example of an ID node:

(ID CMMALORY,5.120)

Here, CMMALORY identifies the source file, 5 is the page number, and 120 is the sentence number in that file. In general, an ID node should have this form:

(ID <source_name>,<free_space>.<sentence_number>)

The information between the source_name and the sentence_number is actually not referenced by CorpusSearch. It could be used to store page numbers (as in the Middle English Corpus), or some other information, or not used at all. The important thing is that the ID_string must begin with a string followed by a comma (to be picked up as the source_name), and end with a "." followed by a sentence number.

Notice that there are no spaces (" ") in the information following the label "ID". This is crucial, because it ensures all the information will be picked up as one string.

CorpusSearch expects to find the ID node just after the sentence ending but inside the sentence wrapper.
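The ID form described above can be checked with a regular expression. This is a minimal sketch of the stated rule, not CorpusSearch's own parsing code; `parse_id` is a hypothetical helper, and allowing an empty middle part is an assumption.

```python
import re

# <source_name>,<free_space>.<sentence_number> with no whitespace anywhere.
# Assumption: the middle part may be empty, as in "RO,.3".
ID_PATTERN = re.compile(r"([^,\s]+),(\S*)\.(\d+)")

def parse_id(id_string):
    """Return (source_name, middle, sentence_number), or None if malformed."""
    match = ID_PATTERN.fullmatch(id_string)
    if match is None:
        return None
    source, middle, number = match.groups()
    return source, middle, int(number)

print(parse_id("CMMALORY,5.120"))
print(parse_id("RO,1"))
```

The first call returns the three parts of the ID; the second returns `None` because the string has no "." followed by a sentence number.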

give corpus file names a standard ending

CorpusSearch expects corpus file names to have a standard ending (or "extension"). As a default, CorpusSearch understands ".psd" (for "parsed") to indicate an original corpus file. If you've used a different ending, add this line to your command files:

corpus_file_extension:  <your_extension>

If an input file name does not end with the corpus_file_extension, it is presumed to be an output file and treated somewhat differently. For instance, when searching output, CorpusSearch uses the ID nodes to keep statistics for the header, footer, and summary blocks. If you see "NO_FILE_ID" listed in the header, footer and summary blocks, it may be because your corpus files don't have names ending with a recognized corpus_file_extension and don't contain ID nodes.
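To see in advance which of your input files would be presumed to be output files, you can apply the same naming rule yourself. The sketch below only mirrors the rule as described here; `split_inputs` and the file names are hypothetical, and writing the extension with its leading dot is an assumption.

```python
# Hypothetical helper mirroring the rule described above; this is not
# CorpusSearch's own code. Assumption: the extension includes its
# leading dot, as in the default ".psd".
def split_inputs(file_names, corpus_file_extension=".psd"):
    corpus, output = [], []
    for name in file_names:
        if name.endswith(corpus_file_extension):
            corpus.append(name)
        else:
            output.append(name)   # would be presumed to be search output
    return corpus, output

print(split_inputs(["cmmalory.psd", "results.out"]))
```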

the corpus bug-hunter is label-dependent

The only part of CorpusSearch that is dependent on a particular set of labels is the corpus bug-hunter. This is the part of CorpusSearch that responds to errors in the corpus itself (as opposed to, for instance, errors in the query). When CorpusSearch encounters a corpus error, it sends the suspicious sentence to the corpus bug-hunter, which prints out an error message followed by the suspicious sentence. If your corpus uses a different set of labels from the Middle English corpus, the error message might not be completely appropriate. However, the fact that an error message has appeared means that CorpusSearch found *some* problem with that sentence.

If you have a private copy of CorpusSearch and you're familiar with Java programming, you can try your hand at customizing the list of labels that the corpus bug-hunter responds to. The list is in a class called "Tags.java" and the code is quite straightforward.

an example of an incompatible corpus

In 1994, Beatrice Santorini of the University of Pennsylvania built a corpus of parsed and annotated Yiddish texts. Like Phase 1 of the Middle English corpus, the Yiddish corpus was parsed only to the first level of constituents. This "flat parsing" was searchable using Perl scripts that matched regular expressions.

One passage from the corpus tells a joke that begins this way:

"When you tell a story to a peasant, he laughs three times. He laughs the first time when someone tells him the story. The second time, when it is explained to him. And the third time, when he understands the story."

I'll examine one sentence from that passage:

"He laughs the first time when someone tells him the story."

Here it is as it appears in the corpus. (For this discussion, we don't need the definitions of the words and their labels, so I have put them in a separate file.)

 (
   [t dem ershtn mol ] [v0 lakht ] [s er ] ,
   [B [c ven ] [s men ] [v0 dertseylt ] [i im ] [d di mayse ] , B]
   )
   (RO,1)

The first problem here is the existence of square brackets ("[", "]"), which CorpusSearch doesn't recognize. So the first task is to convert the square brackets to round parentheses:

 (
   (t dem ershtn mol ) (v0 lakht ) (s er ) ,
  (B (c ven ) (s men ) (v0 dertseylt ) (i im ) (d di mayse ) , B)
   )
   (RO,1)

This form of the sentence can be partly searched by CorpusSearch. For instance, this query:

node: *
query:  (v0 iPrecedes s)

will find the structure (v0 lakht) (s er), as expected. Notice that the node boundary had to be set to *; if you leave the node boundary at its default, IP*|NODE, nothing will be found, because the sentence does not contain IP* or NODE.

However, the sentence is still not fully compatible with CorpusSearch because it is not completely parsed. For instance, the phrase "dem ershtn mol" ("the first time") has been parsed as one object. So if you run this query:

node: *
query:  (ershtn precedes mol)

the structure will not be found. This is because CorpusSearch expects every leaf node to contain exactly two objects: a label and a single-string piece of text. Any extra information will be stored as part of the node but it will usually not be examined by the search functions. These extra pieces of information (in this case, the strings "ershtn" and "mol") behave as useless baggage that is carried along by the sentence vector but never opened.
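Leaf nodes that hold more than a label and one string of text can be located before searching. The following sketch reads a parenthesized sentence into nested lists and reports such leaves; `tokenize`, `read_node` and `overfull_leaves` are hypothetical helpers written for this illustration, and the code assumes balanced parentheses.

```python
def tokenize(text):
    # Put spaces around parentheses so that split() isolates them.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read_node(tokens, pos):
    """Read the node starting at tokens[pos]; return (nested list, next pos)."""
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    node, pos = [], pos + 1
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = read_node(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1

def overfull_leaves(node):
    """Yield leaves holding more than a label and one string of text."""
    children = [c for c in node if isinstance(c, list)]
    if not children:
        if len(node) > 2:
            yield node
    else:
        for child in children:
            yield from overfull_leaves(child)

sentence = """( (t dem ershtn mol ) (v0 lakht ) (s er ) ,
  (B (c ven ) (s men ) (v0 dertseylt ) (i im ) (d di mayse ) , B) )"""
tree, _ = read_node(tokenize(sentence), 0)
for leaf in overfull_leaves(tree):
    print(leaf)
```

For this sentence the check reports the `t` and `d` leaves, which are the two phrases that still need internal structure.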

Similarly, the ", B" that marks the end of the B-labelled clause, and the "," that separates the B-labelled clause from the rest of the sentence, are never actually referenced, so they may as well be removed. The parentheses are enough to convey the information that the B-labelled clause ends, and that the B-labelled clause is separate from the rest of the sentence.

Here is the sentence, fully parsed, and with extraneous labels removed:

 (
   (t (det dem) (adj ershtn) (n mol)) (v0 lakht ) (s er ) 
  (B (c ven ) (s men ) (v0 dertseylt ) (i im ) (d (det di) (n mayse)))
   )
   (RO,1)

Now, the query

node: *
query:  (ershtn precedes mol)

will find the structure as expected:

/~*
dem ershtn mol lakht er ven men dertseylt im di mayse
(RO,1.3)
*~/

/*
    1 t: 3 adj ershtn, 4 n mol
*/


(0
  (1 t (2 det dem) (3 adj ershtn) (4 n mol))
  (5 v0 lakht)
  (6 s er)
  (7 B (8 c ven)
       (9 s men)
       (10 v0 dertseylt)
       (11 i im)
       (12 d (13 det di) (14 n mayse)))
  (15 ID RO,1.3))

Finally, there is the node (RO,1). This identifies the sentence as being part of the first story told by informant Royte Pomerantsen. This needs to be given the standard CorpusSearch ID node form and stuck inside the wrapper. I'll make it sentence number 3:

 (
   (t (det dem) (adj ershtn) (n mol)) (v0 lakht ) (s er ) 
  (B (c ven ) (s men ) (v0 dertseylt ) (i im ) (d (det di) (n mayse)))
   (ID RO,1.3)
   )

and our sentence is now fully compatible with CorpusSearch.
