What is an annotated corpus?

An annotated corpus is not God's truth

One possible annotation ... and an alternative
( (CP-QUE (WNP-1 (WPRO What))
          (IP-SUB (NP-OB1 *T*-1)
                  (BEP is)
                  (NP-SBJ (D an)
                          (VAN annotated)
                          (N corpus)))
          (. ?)))

( (CP-QUE (WNP-1 (WPRO What))
          (BEP-2 is)
          (IP-SUB (NP-SBJ (D an)
                          (ADJ annotated)
                          (N corpus))
                  (VP (BEP *T*-2)
                      (NP-PRD *T*-1)))
          (. ?)))

Structural ambiguity is endemic

High adverb attachment ... or low?
( (TP (NP-SBJ (PRO They))
      (T' (ADVP (ADV always))
          (T' (T past)
              (VP (V' (VBD admitted)
                      (NP-OB1 any error)))))
      (. .)))

( (TP (NP-SBJ (PRO They))
      (T' (T past)
          (VP (V' (ADVP (ADV always))
                  (V' (VBD admitted)
                      (NP-OB1 any error)))))
      (. .)))

( (TP (NP-SBJ (PRO They))
      (T' (ADVP (ADV always))
          (T' (T would)
              (VP (V' (VB admit)
                      (NP-OB1 any error)))))
      (. .)))

( (TP (NP-SBJ (PRO They))
      (T' (T would)
          (VP (V' (ADVP (ADV always))
                  (V' (VB admit)
                      (NP-OB1 any error)))))
      (. .)))

Some noteworthy annotation conventions in the Penn Historical Corpora

( (IP-MAT (NP-SBJ (PRO They))
          (ADVP-TMP (ADV always))
          (VBD admitted)
          (NP-OB1 (Q any) (N error))
          (. .)))

( (CP-QUE (WNP-1 (WPRO What))
          (IP-SUB (NP-OB1 *T*-1)
                  (BEP is)
                  (NP-SBJ (D an)
                          (VAN annotated)
                          (N corpus)))
          (. ?)))

Searching corpora with CorpusSearch

A corpus without a search program is like the Internet without a search engine. (Beth Randall)

Funding acknowledgments

Some simple queries

Literal matches


( (IP-MAT (NP-SBJ (PRO He))
          (VBD said)
          (CP-THT (C that)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG coming)))
          (. .)))

node: CP-THT

// next line matches "that"
query: (C iDoms that)


( (IP-MAT (NP-SBJ (PRO He))
          (VBD said)
          (CP-THT (C 0)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG coming)))
          (. .)))

node: CP-THT

// next line matches "0"
query: (C iDoms 0)

Matching alternatives


( (IP-MAT (NP-SBJ (PRO He))
          (VBD sayde)
          (CP-THT (C +tat)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG cumyng)))
          (. .)))

node: CP-THT

// "pipe" matches alternatives
query: (C iDoms that|+tat)


( (IP-MAT (NP-SBJ (PRO He))
          (VBD said)
          (CP-THT (C That)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG coming)))
          (. .)))

node: CP-THT

// square brackets also match alternatives
query: (C iDoms [tT]hat|+[tT]at)

Reverse matches


( (IP-MAT (NP-SBJ (PRO He))
          (VBD sayde)
          (CP-THT (C +tatt)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG cumyng)))
          (. .)))

node: CP-THT

// "bang" matches anything except what follows it
query: (C iDoms !0)
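The three pattern devices seen so far (pipe, square brackets, bang) can be tried out on individual word forms with a small Python sketch. This is only a rough analogy built on Python's `fnmatch` module; the helper `matches` is hypothetical, is not part of CorpusSearch, and makes no attempt to reproduce all of CorpusSearch's matching rules.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, word: str) -> bool:
    """Rough Python analogue of a CorpusSearch word pattern (hypothetical helper).

    "|" separates alternatives, "[...]" matches one character from a set,
    "*" matches any run of characters, and a leading "!" negates the pattern.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    # The word is accepted if any one of the alternatives matches it.
    hit = any(fnmatchcase(word, alt) for alt in pattern.split("|"))
    return hit != negate


if __name__ == "__main__":
    for word in ["that", "That", "+tat", "+tatt", "0"]:
        print(word, matches("[tT]hat|+[tT]at", word), matches("!0", word))
```

The spelling "+tatt" is rejected by the enumerated pattern but accepted by "!0", which is why the negated pattern is the safer way to find overt complementizers.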

Finding the denominator directly

node:  CP-THT

query: (C iDoms [tT]hat|+[tT]at|0)

node:  CP-THT

query: (C iDoms 0|!0)

node:  CP-THT

query: (C exists)
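Once the denominator (all complementizers) and the numerator (silent complementizers) can both be retrieved, the proportion of silent complementizers is a simple ratio. The following Python sketch illustrates the arithmetic on three of the toy trees above. It uses a regular expression over the bracketed text rather than CorpusSearch itself, and the names `TREES`, `C_LEAF` and `silent_proportion` are assumptions made for illustration.

```python
import re

# Toy data: the embedded clauses of three example trees from this section.
TREES = [
    "(CP-THT (C that) (IP-SUB (NP-SBJ (PRO he)) (BED was) (VAG coming)))",
    "(CP-THT (C 0) (IP-SUB (NP-SBJ (PRO he)) (BED was) (VAG coming)))",
    "(CP-THT (C +tatt) (IP-SUB (NP-SBJ (PRO he)) (BED was) (VAG cumyng)))",
]

# A complementizer leaf: the label C followed by exactly one word.
C_LEAF = re.compile(r"\(C ([^\s()]+)\)")


def silent_proportion(trees):
    """Return (silent, total, proportion) for the complementizers in the trees."""
    comps = [m.group(1) for tree in trees for m in C_LEAF.finditer(tree)]
    total = len(comps)                            # analogue of (C exists)
    silent = sum(1 for c in comps if c == "0")    # analogue of (C iDoms 0)
    return silent, total, (silent / total if total else 0.0)


if __name__ == "__main__":
    print(silent_proportion(TREES))
```

Counting every C node directly means that unexpected spellings such as "+tatt" still end up in the denominator.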

A first result: Silent vs. overt Comp

Proportion of silent complementizers in ordinary and degree complement clauses over time
Period        Ordinary complement clauses    Degree complement clauses
              p       Total N                p       Total N
before 1149   0.01    16,718                 0.00    675
1150-1249     0.05     1,487                 0.01    274
1250-1349     0.02       371                 0.02     87
1350-1419     0.07     3,379                 0.03    366
1420-1499     0.31     2,060                 0.06    296
1500-1569     0.27     5,258                 0.08    530
1570-1639     0.38     5,378                 0.09    478
1640-1699     0.46     5,952                 0.11    653
1700-1749     0.53     2,488                 0.08    307
1750-1799     0.33     1,397                 0.01    107
1800-1849     0.37     2,089                 0.07    116
1850-1914     0.26     2,332                 0.02    115
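As a quick way of reading the figures, the short Python sketch below copies the two rows of proportions from the table and reports the period in which each one peaks. The variable and function names are assumptions made for illustration; no data beyond the table is used.

```python
PERIODS = [
    "before 1149", "1150-1249", "1250-1349", "1350-1419", "1420-1499",
    "1500-1569", "1570-1639", "1640-1699", "1700-1749", "1750-1799",
    "1800-1849", "1850-1914",
]

# Proportions of silent complementizers, copied from the table.
P_ORDINARY = [0.01, 0.05, 0.02, 0.07, 0.31, 0.27, 0.38, 0.46, 0.53, 0.33, 0.37, 0.26]
P_DEGREE = [0.00, 0.01, 0.02, 0.03, 0.06, 0.08, 0.09, 0.11, 0.08, 0.01, 0.07, 0.02]


def peak(values):
    """Return (period, proportion) for the highest proportion in a row."""
    i = max(range(len(values)), key=values.__getitem__)
    return PERIODS[i], values[i]


if __name__ == "__main__":
    print("ordinary:", peak(P_ORDINARY))
    print("degree:  ", peak(P_DEGREE))
```

In every period the proportion for ordinary complement clauses is at least as high as the one for degree complement clauses.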

More complex queries

CED violations

The trace in (1) violates the Condition on Extraction Domains (CED) (Huang 1982), which rules out movement out of adjuncts.

(1) ... Italy would have emerged as a reasonably respectable nation, capable of determining her own future,
a country which adventurous foreigners would think twice before attacking _t_.
(Luigi Barzini. 1964. The Italians. 298.)

(CP-REL (WNP-1 which)
        (IP-SUB (NP-SBJ adventurous foreigners)
                (MD would)
                (VB think)
                (ADVP-TMP twice)
                (PP (P before)
                    (IP-PPL (NP-ACC *T*-1)
                            (VAG attacking)))))

node: CP-REL

query:     (CP-REL iDoms WNP-1)

       AND (CP-REL iDoms IP-SUB)

       AND (IP-SUB iDoms PP)

       AND (PP iDoms IP-PPL)

       AND (IP-PPL iDoms NP-ACC)

       // backslash "escapes" special characters
       AND (NP-ACC iDoms \*T\*-1)

(CP-REL (WNP-1 which)
        (IP-SUB (NP-SBJ adventurous foreigners)
                (MD would)
                (VB think)
                (ADVP-TMP twice)
                (PP (P before)
                    (IP-PPL (NP-ACC *T*-1)
                            (VAG attacking)))))

// finds (some) CED violations 
node: CP-*

// unescaped asterisk matches any character any number of times
query:     (CP-* iDoms W*)

       AND (CP-* iDoms IP-SUB)

       // iDomsMod allows paths to contain optional nodes
       // pipe, as before, separates alternatives
       AND (IP-SUB iDomsMod PP IP-PPL|IP-INF-PRP)

       AND (IP-PPL|IP-INF-PRP doms \*T*)

       AND (W* sameIndex \*T*)

( (CP-QUE (WNP-99 Which country)
          (IP-SUB (HVD had)
                  (NP-SBJ he)
                  (BEN been)
                  (VAN sent)
                  (IP-INF-PRP (NP-ACC *T*-99)
                              (TO to)
                              (VB govern)))
          (. ?)))
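For readers who want to see the logic of the general CED query spelled out step by step, here is a Python sketch that applies roughly the same conditions to a bracketed tree. It is an approximation under stated assumptions, not CorpusSearch's implementation: all function names are hypothetical, trees are given without the outer wrapper parentheses, and the optional path of `iDomsMod` is simplified to "directly, or through one PP".

```python
import re
from fnmatch import fnmatchcase


def parse(text):
    """Parse one bracketed tree into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word or a trace
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def subtrees(tree):
    yield tree
    for child in tree[1:]:
        if isinstance(child, list):
            yield from subtrees(child)


def leaves(tree):
    for child in tree[1:]:
        if isinstance(child, list):
            yield from leaves(child)
        else:
            yield child


def kids(tree, *patterns):
    """Daughters of tree whose label matches one of the patterns."""
    return [c for c in tree[1:] if isinstance(c, list)
            and any(fnmatchcase(c[0], p) for p in patterns)]


def ced_candidates(tree):
    """Return (wh label, adjunct label) pairs that look like CED violations."""
    hits = []
    for cp in subtrees(tree):
        if not fnmatchcase(cp[0], "CP-*"):
            continue
        for wh in kids(cp, "W*"):
            m = re.search(r"-(\d+)$", wh[0])   # index of the wh-phrase
            if not m:
                continue
            for ip in kids(cp, "IP-SUB"):
                adjuncts = kids(ip, "IP-PPL", "IP-INF-PRP")
                for pp in kids(ip, "PP"):
                    adjuncts += kids(pp, "IP-PPL", "IP-INF-PRP")
                for adj in adjuncts:
                    # analogue of doms \*T* plus sameIndex
                    if f"*T*-{m.group(1)}" in leaves(adj):
                        hits.append((wh[0], adj[0]))
    return hits


if __name__ == "__main__":
    tree = parse("(CP-QUE (WNP-99 Which country) (IP-SUB (HVD had) (NP-SBJ he) "
                 "(BEN been) (VAN sent) (IP-INF-PRP (NP-ACC *T*-99) (TO to) "
                 "(VB govern))) (. ?))")
    print(ced_candidates(tree))
```

Like the query, the sketch only finds candidates; whether a given hit really is an extraction out of an adjunct still has to be judged by a human reader.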

The second query retrieves 119 matches from all of the historical corpora of English (423,970 tokens) in a little under 3 minutes.
The results of the query need to be reviewed "by hand". They include the following kinds of hits.

Clearly irrelevant

(2)     "Of" clearly heads a complement.
We have voted many qualities to be virtues, now, ___ that they never thought (PP of (IP-PPL calling _t_ virtues formerly.))
(PPCMBE, COLMAN-1805,22.119)

Possibly irrelevant

(3) a.   "Of" clearly heads a complement, but does "against" head a complement or an adjunct?
it's a - in fact - an emergency ___ I never thought (PP of (IP-PPL providing against _t_.))
(PPCMBE, BROUGHAM-1861,30.1115)
b.   Is "which" extracted out of the gerund clause (violating the CED)?
after this he detach'd the Band of Persians called the Immortal Regiment,
which _t_ (IP-PPL meeting with the same Success, Xerxes is said to have leapt three times out of his Throne ... )
(PPCMBE, HIND-1707,310.150)
Or does "which" pied-pipe the gerund clause (vacuously respecting the CED)?
after this he detach'd the Band of Persians called the Immortal Regiment,
which meeting with the same Success , _t_ Xerxes is said to have leapt three times out of his Throne ...

Clearcut examples

(4) a.   by envy of that power or honour, which they have in vain laboured to acquire _t_ ;
(PPCMBE, BURTON-1762,2,22.325)
b.   these Five-and-thirty Articles, which Garde-des-Sceaux is waxing hoarse with reading _t_
(PPCMBE, CARLYLE-1837,1,140.93)
c.   and could not find the Town of Teguantepeque, which they went to seek _t_ .
(PPCMBE, COOKE-1712,1,442.370)
d.   But where is the young Gentlewoman ___ that we came to drink with _t_ :
(PPCMBE, DAVYS-1716,24.38)
e.   and then she repeated over a Catalogue of Names and Titles, many of which we might, perhaps, be guilty of a Breach of Privilege by inserting _t_ .
(PPCMBE, FIELDING-1749,3,11.408)
f.   ... Canning who completed by his death his victory in the country ___ he had been sent to govern _t_
(PPCMBE, TROLLOPE-1882,179.382)
(5)     ... whatever obstacles the small remnant of his followers may assist him in throwing _t_ in the way of Government
(PPCMBE, WOLLASTON-1793,23.184)
(6) a.   a poor Town, call'd Santa Maria, which having taken _t_ , they found nothing but a Parcel of wretched thatch'd Houses.
(PPCMBE, COOKE-1712,1,438.298)
b.   Two raised seats of cushions had been prepared, towards which the Regent waving his hand _t_ , with a very significant look, directed us to be seated.
(PPCMBE, TURNER2-1800,237.123)

Parasitic gaps

A subset of the hits retrieved by the CED violation query contain parasitic gaps, that is, gaps that would ordinarily violate the CED but that are licensed by a second, ordinary trace. If we wanted to retrieve only parasitic gaps, we could restrict the CED violation query as follows.

( (CP-QUE (WNP-1 which article)
          (IP-SUB (NP-ACC *T*-1)
                  (DOD did)
                  (NP-SBJ you)
                  (VB file)
                  (PP (P without)
                      (IP-PPL (NP-ACC *T*-1)
                              (VAG reading))))
          (. ?)))

// finds (some) parasitic gaps 
node: CP-*

query:     (CP-* iDoms W*)

       AND (CP-* iDoms IP-SUB)

       AND (IP-SUB iDomsMod PP IP-PPL|IP-INF-PRP)

       // next two lines find the parasitic gap
       // identical to CED query except for indices in square brackets
       AND (IP-PPL|IP-INF-PRP doms [2]\*T*)

       AND (W* sameIndex [2]\*T*)

       // next two lines find the licensing trace
       AND (IP-SUB doms [1]\*T*)

       AND (W* sameIndex [1]\*T*)

The query retrieves 3 matches from all of the historical corpora of English (423,970 tokens). Once again, the results of the query need to be reviewed "by hand".
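The core of the restriction is that two distinct traces share the index of the wh-phrase. A much cruder version of that idea can be sketched in Python as a filter over the bracketed text. The sketch is an assumption-laden heuristic, not a substitute for the query: it ignores where the traces sit, so it states a necessary condition only, and the names `TRACE` and `shared_trace_indices` are hypothetical.

```python
import re
from collections import Counter

# A wh-trace with its numeric index, e.g. *T*-1
TRACE = re.compile(r"\*T\*-(\d+)")


def shared_trace_indices(tree_text):
    """Return the indices carried by two or more *T* traces in a tree."""
    counts = Counter(TRACE.findall(tree_text))
    return sorted(index for index, n in counts.items() if n >= 2)


if __name__ == "__main__":
    pg = ("(CP-QUE (WNP-1 which article) (IP-SUB (NP-ACC *T*-1) (DOD did) "
          "(NP-SBJ you) (VB file) (PP (P without) (IP-PPL (NP-ACC *T*-1) "
          "(VAG reading)))) (. ?))")
    print(shared_trace_indices(pg))
```

Because across-the-board extraction also yields two traces with one index, hits from a filter like this need the same manual review as the hits from the query.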

Irrelevant ATB extraction

(7)     for I am at more leasure, which having _t_ , and not using _t_ , I might seeme to neglect my frend,
(PCEEC, HOLLES,III,397.110.3069)

Parasitic gap sentences

(8) a.   it was particularly so to the temper of the Jews, a people not more distinguished by uncommon felicities, than by their ingratitude: an untoward disposition!
which their inspired Lawgiver foreseeing _pg_ , endeavoured to prevent and discourage _t_ by variety of cautions and admonitions.
(PPCMBE, BURTON-1762,1,8.46)
b.   Performance error? Note the ATB violation.
all things ... which haveing made the most of your inquiries in _pg_ ,
I should think it would be of most use to me that you should prepare your selfe for _t_ ,
and hasten your returne.
(PCEEC, PEPYS,74.040.505)

The extreme rarity of parasitic gap sentences clearly shows that there are limits to what we can learn from corpora - at least ones of the sizes that are currently available.

Coding queries

Four conventional queries to investigate Doubly Filled Comp phenomena

wh
(NP (D a) (N man)
    (CP-REL (WNP-1 (WPRO who))
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

query:     (CP-REL iDoms W*)
       AND (W* iDomsMod W* !0)
       AND (CP-REL iDoms C)
       AND (C iDoms 0)

that
(NP (D a) (N man)
    (CP-REL (WNP-1 0)
            (C that)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

query:     (CP-REL iDoms W*)
       AND (W* iDomsMod W* 0)
       AND (CP-REL iDoms C)
       AND (C iDoms !0)

wh + that
("doubly filled COMP")
(NP (D a) (N man)
    (CP-REL (WNP-1 (WPRO who))
            (C that)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

query:     (CP-REL iDoms W*)
       AND (W* iDomsMod W* !0)
       AND (CP-REL iDoms C)
       AND (C iDoms !0)

zero-marked
(NP (D a) (N man)
    (CP-REL (WNP-1 0)
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

query:     (CP-REL iDoms W*)
       AND (W* iDomsMod W* 0)
       AND (CP-REL iDoms C)
       AND (C iDoms 0)

Four conventional queries recast as two coding queries

The redundancy in the above four queries can be factored out by a special sort of query, the so-called coding query.

The idea of coding queries is due to Susan Pintzuk and Ann Taylor, and their implementation was funded by the U.K. Arts and Humanities Council.

node: CP-REL

coding_query:

// status of Spec(CP)
1: {
     w:   (CP-REL iDomsMod W* W*) AND (W* iDoms !0)
     z:   (CP-REL iDomsMod W* W*) AND (W* iDoms  0)
     -:   ELSE
   }

// status of complementizer
2: {
     c:   (CP-REL iDoms C) AND (C iDoms !0)
     z:   (CP-REL iDoms C) AND (C iDoms  0)
     -:   ELSE
   }
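The decision logic of the two columns can be paraphrased in a few lines of Python. This is a minimal sketch of the logic only, assuming that the content of Spec(CP) and of C has already been extracted as a word (or `None` if the node is absent); the function `code_relative` is hypothetical and is not how CorpusSearch evaluates coding queries.

```python
from typing import Optional


def code_relative(spec_word: Optional[str], comp_word: Optional[str]) -> str:
    """Return a two-column coding string for one relative clause."""
    # Column 1: status of Spec(CP): w = overt wh-word, z = silent, - = no W* node
    if spec_word is None:
        spec = "-"
    else:
        spec = "z" if spec_word == "0" else "w"
    # Column 2: status of complementizer: c = overt, z = silent, - = no C node
    if comp_word is None:
        comp = "-"
    else:
        comp = "z" if comp_word == "0" else "c"
    return f"{spec}:{comp}"


if __name__ == "__main__":
    for spec, comp in [("who", "0"), ("0", "that"), ("who", "that"), ("0", "0")]:
        print(spec, comp, code_relative(spec, comp))
```

Each column is decided independently of the other, which is what removes the redundancy of writing one query per combination.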

Output of coding query

wh
(NP (D a) (N man)
    (CP-REL (CODING w:z)
            (WNP-1 (WPRO who))
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

that
(NP (D a) (N man)
    (CP-REL (CODING z:c)
            (WNP-1 0)
            (C that)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

wh + that
("doubly filled COMP")
(NP (D a) (N man)
    (CP-REL (CODING w:c)
            (WNP-1 (WPRO who))
            (C that)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

zero-marked
(NP (D a) (N man)
    (CP-REL (CODING z:z)
            (WNP-1 0)
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))
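Once every clause carries a coding string, counting the combinations is straightforward. The Python sketch below tallies the four coding strings shown above and computes the share of doubly filled COMP clauses; the list `CODINGS` and both function names are assumptions made for illustration, and the sketch does not use any CorpusSearch facility.

```python
from collections import Counter

# The coding strings of the four example clauses above.
CODINGS = ["w:z", "z:c", "w:c", "z:z"]


def tally(codings):
    """Count each combination of column values."""
    return Counter(tuple(c.split(":")) for c in codings)


def share_doubly_filled(codings):
    """Proportion of clauses with overt wh-word and overt complementizer."""
    counts = tally(codings)
    total = sum(counts.values())
    return counts[("w", "c")] / total if total else 0.0


if __name__ == "__main__":
    print(tally(CODINGS))
    print(share_doubly_filled(CODINGS))
```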

A real-life coding query

( (IP-MAT (NP-SBJ (PRO He))
          (VBD said)
          (CP-THT (C that)
                  (IP-SUB (NP-SBJ (PRO he))
                          (BED was)
                          (VAG coming)))
          (. .))
  (ID FRANKLIN-1776,1.58))

Period                  Example file name   Example ID
Old English             cobenrul.o3         (ID cobenrul.o3,BenR:73.133.20.1288)
Middle English          cmntest-m3          (ID CMNTEST-M3,11,40J.1152)
Early Modern English    wpaston-e1-h        (ID WPASTON-E1-H,76.37)
Modern British English  priestley-1769      (ID PRIESTLEY-1769,183.318)

node: CP-*

coding_query:

// time period
1: {
// first line sets aside translations and archaic texts
      x:  (AUTHNEW*|AUTHOLD*|BOETHEL*|ERV-*|NEWCOME-*|PURVER-* inID)
      a:  (co* inID)
      b:  (*-M1,* inID)
      c:  (*-M2,* inID)
      d:  (*-M3,* inID)
      e:  (*-M4,* inID)
      f:  (*-E1-* inID)
      g:  (*-E2-* inID)
      h:  (*-E3-* inID)
// plain "*" after square brackets doesn't work as intended,
// so use the more explicit ".*"
      i:  (*-17[01234].* inID)
      j:  (*-17[56789].* inID)
      k:  (*-18[01234].* inID)
      l:  (*-18[56789].* inID)
// texts from 1900-1914 share label l with 1850-1899
      l:  (*-19[01].* inID)
      -:  ELSE
   }

// clause type
2: {
     a:  (CP-ADV* iDoms IP-SUB*)
     c:  (CP-CAR* iDoms IP-SUB*)
     d:  (CP-DEG* iDoms IP-SUB*)
     f:  (CP-FRL* iDoms IP-SUB*)
     l:  (CP-CLF* iDoms IP-SUB*)
     m:  (CP-CMP* iDoms IP-SUB*)
     q:  (CP-QUE* iDoms IP-SUB*)
     r:  (CP-REL* iDoms IP-SUB*)
     t:  (CP-THT* iDoms IP-SUB*)
     -:  ELSE
   }

// Spec(CP) silent or overt?
3: {
     w:  (CP-* iDomsMod W* W*) AND (W* iDoms !0)
     z:  (CP-* iDomsMod W* W*) AND (W* iDoms  0)
     -:  ELSE
   }

// complementizer silent or overt?
4: { 
     c:  (CP-* iDoms C) AND (C iDoms !0)
     z:  (CP-* iDoms C) AND (C iDoms  0)
     -:  ELSE
   }

// type of gap
5: {
// subject gap
     s:     (CP-* iDoms WNP*)
        AND (CP-* iDoms IP-SUB*)
        AND (IP-SUB* iDoms NP-SBJ*)
        AND (NP-SBJ* iDoms \*T*)
        AND (WNP* sameIndex \*T*)

// object gap
     o:     (CP-* iDoms WNP*)
        AND (CP-* iDoms IP-SUB*)
        AND (IP-SUB* iDoms NP-OB1*)
        AND (NP-OB1* iDoms \*T*)
        AND (WNP* sameIndex \*T*)

// PP with pied piping
     p:     (CP-* iDoms WPP*)
        AND (CP-* iDoms IP-SUB*)
        AND (IP-SUB* iDoms PP*)
        AND (PP* iDoms \*T*)
        AND (WPP* sameIndex \*T*)

// PP with preposition stranding
     P:     (CP-* iDoms WNP*)
        AND (CP-* iDoms IP-SUB*)
        AND (IP-SUB* iDoms PP*)
        AND (PP* iDoms NP)
        AND (NP iDoms \*T*)
        AND (WNP* sameIndex \*T*)
     -:     ELSE
   }

// length of wh- constituent
6: {
// digits need to be "escaped" by backslash
      \1:  (CP-* iDoms W*) AND (W* domsWords 1)
      \2:  (CP-* iDoms W*) AND (W* domsWords 2)
      \3:  (CP-* iDoms W*) AND (W* domsWords 3)
      \4:  (CP-* iDoms W*) AND (W* domsWords 4)
      \5:  (CP-* iDoms W*) AND (W* domsWords 5)
      \6:  (CP-* iDoms W*) AND (W* domsWords 6)
      \7:  (CP-* iDoms W*) AND (W* domsWords 7)
      \8:  (CP-* iDoms W*) AND (W* domsWords 8)
      \9:  (CP-* iDoms W*) AND (W* domsWords 9)
// 0 stands for 10 or more words
      \0:  (CP-* iDoms W*) AND (W* domsWords> 9)
       -:  ELSE
   }
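The first column of this query derives a time period from the token ID alone. The Python sketch below imitates that column with regular expressions, trying the rules in order so that the first match wins. It is an approximation under stated assumptions: `RULES` and `period_code` are hypothetical names, and the patterns are regular-expression paraphrases of the ID patterns above, slightly stricter in that the year patterns require a fourth digit.

```python
import re

# (label, regular expression) pairs, tried in order; first match wins.
RULES = [
    ("x", r"^(AUTHNEW|AUTHOLD|BOETHEL|ERV-|NEWCOME-|PURVER-)"),  # set aside
    ("a", r"^co"),            # Old English file names start with "co"
    ("b", r"-M1,"),
    ("c", r"-M2,"),
    ("d", r"-M3,"),
    ("e", r"-M4,"),
    ("f", r"-E1-"),
    ("g", r"-E2-"),
    ("h", r"-E3-"),
    ("i", r"-17[0-4]\d"),
    ("j", r"-17[5-9]\d"),
    ("k", r"-18[0-4]\d"),
    ("l", r"-18[5-9]\d"),
    ("l", r"-19[01]\d"),      # 1900-1914 shares a label with 1850-1899
]


def period_code(token_id: str) -> str:
    """Return the period label for a token ID, or "-" if no rule applies."""
    for label, pattern in RULES:
        if re.search(pattern, token_id):
            return label
    return "-"


if __name__ == "__main__":
    for token_id in ["cobenrul.o3,BenR:73.133.20.1288", "CMNTEST-M3,11,40J.1152",
                     "WPASTON-E1-H,76.37", "PRIESTLEY-1769,183.318"]:
        print(token_id, period_code(token_id))
```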

Advanced uses

Adding information to partially coded corpora

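Any serious study of the various ways of marking relative clauses in English needs to take into account two important properties that are not encoded in the existing corpora: whether the antecedent is human or nonhuman, and whether the relative clause is restrictive or nonrestrictive.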

It is possible to use coding queries to produce a copy of a corpus that contains coding strings that can be edited by hand.

Sample coding query

coding_query:

// status of Spec(CP)
1: {
     w:   (CP-REL* iDomsMod W* W*) AND (W* iDoms !0)
     z:   (CP-REL* iDomsMod W* W*) AND (W* iDoms  0)
     -:   ELSE
}

// status of complementizer
2: {
     c:   (CP-REL* iDoms C) AND (C iDoms !0)
     z:   (CP-REL* iDoms C) AND (C iDoms  0)
     -:   ELSE
}

// human vs. nonhuman antecedent
3: {
     // guess that "who(m)" is only used with human antecedents
     h:   (CP-REL* iDomsMod W* W*) AND (W* iDoms [wW]ho|[wW]hom)
     // guess that "which" is only used with nonhuman antecedents
     n:   (CP-REL* iDomsMod W* W*) AND (W* iDoms [wW]hich)
     -:   ELSE
}

// restrictive vs. nonrestrictive
4: {
     // don't even try to guess
     -:   ELSE
}

Sample output

(NP (D a) (N man)
    (CP-REL (CODING w:z:h:-)
            (WNP-1 (WPRO who))
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))

(NP (D an) (N issue)
    (CP-REL (CODING w:z:n:-)
            (WNP-1 which)
            (C 0)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arises))))

(NP (D any) (N person)
    (CP-REL (CODING z:c:-:-)
            (WNP-1 (WPRO 0))
            (C that)
            (IP-SUB (NP-SBJ *T*-1)
                    (VBP arrives)
                    (ADVP-TMP late))))
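Before editing such a file by hand, it helps to list the tokens whose coding strings still contain a "-" placeholder. The Python sketch below does this with a regular expression over the text of a coded file. The names `CODING`, `SAMPLE` and `needs_hand_coding` are hypothetical, and the abbreviated sample is made up for illustration.

```python
import re

# A CODING node and its colon-separated coding string.
CODING = re.compile(r"\(CODING ([^\s()]+)\)")

# Abbreviated toy input: only the CODING nodes matter here.
SAMPLE = """
(CP-REL (CODING w:z:h:-) (WNP-1 (WPRO who)) (C 0))
(CP-REL (CODING z:c:-:-) (WNP-1 0) (C that))
"""


def needs_hand_coding(corpus_text, columns=(3, 4)):
    """Return (position, coding) pairs with "-" in any of the 1-based columns."""
    todo = []
    for n, m in enumerate(CODING.finditer(corpus_text), start=1):
        fields = m.group(1).split(":")
        if any(len(fields) >= c and fields[c - 1] == "-" for c in columns):
            todo.append((n, m.group(1)))
    return todo


if __name__ == "__main__":
    print(needs_hand_coding(SAMPLE))                # restrictiveness: all tokens
    print(needs_hand_coding(SAMPLE, columns=(3,)))  # antecedent not yet guessed
```

Restricting the check to column 3 separates the tokens for which the query could guess the antecedent type from those for which it could not.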

Building corpora with revision queries

It is possible to use CorpusSearch in connection with part-of-speech tagged corpora to produce draft parsed corpora that are more or less comparable with the output of ordinary parsers. We have done this to produce training corpora for languages for which we had none, and we have sometimes continued to use CorpusSearch as a "poor man's parser" even when we had sufficient training data because it is very transparent compared to ordinary parsers.

Related links

Annotation manual for the Penn Historical Corpora and the PCEEC

CED violations in Modern English

CorpusSearch

Doubly filled COMP sentences in Modern English

PCEEC (York-Helsinki Parsed Corpus of Early English Correspondence)

Penn Parsed Corpora of Historical English

YCOE (York-Toronto-Helsinki Parsed Corpus of Old English Prose)