Annotation differences among the Penn historical corpora

  • Cipher
  • Collective nouns
  • Concessive clauses
  • Disfluencies
  • DO (causative versus periphrastic)
  • Foreign languages
  • ELSE (see Post-head modifiers)
  • ENOUGH (see Post-head modifiers)
  • LIKE and similar verbs (LACK, NEED, WANT)
  • Nonfinite clausal complementation
  • Post-head modifiers
  • Quantifiers and quantified expressions
  • Splitting and joining words

    Our general strategy in extending the guidelines for the PPCME2 to the later historical corpora has been to minimize the number of changes between the corpora. In particular, we have attempted to change the original annotation scheme only when forced to do so by distributional changes in the texts. In some instances (notably, in connection with quantifiers), we have modified the annotation guidelines because they proved too difficult to implement consistently. Finally (in connection with the post-head modifier rule), we enforce the annotation guidelines more strictly in the later historical corpora than in the PPCME2.

    In subsequent editions of the corpora, we hope to further minimize the differences described below.


    Cipher

    Text used as cipher is tagged as CIPHER in the PCEEC. In the other corpora, such text, if it occurs, is tagged according to its ordinary part of speech (N, NUM).

    Collective nouns

    See also Singular, collective, and plural nouns.

    In the PPCME2, collective nouns (FOLK, HORS, PEOPLE, etc.) are tagged as N. In early texts, before the universalization of plural -S, it can be quite difficult to distinguish reliably between singular and plural. For texts from the Middle English period M1, we have therefore tried to follow the translation that accompanies the text used, or when this is lacking, a separate translation. For details, consult the information for the individual texts.

    In the later corpora, PEOPLE is tagged as singular (N) when preceded by an unambiguously singular determiner (A, THAT, THIS), and as plural (NS) elsewhere.
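    As a rough illustration, the rule for PEOPLE in the later corpora can be sketched in Python. The function name, the lower-casing, and the three spellings in the determiner set are assumptions made for this sketch; historical spelling variants of the determiners are not handled.

```python
# Hypothetical helper, not part of the corpus tooling.
# Assumption: the determiner is given in a normalized modern spelling.
SINGULAR_DETERMINERS = {"a", "that", "this"}


def tag_people(preceding_word):
    """Return the tag for PEOPLE in the later corpora.

    'N' if the word immediately before PEOPLE is an unambiguously
    singular determiner (A, THAT, THIS), and 'NS' elsewhere.
    """
    if preceding_word is not None and preceding_word.lower() in SINGULAR_DETERMINERS:
        return "N"
    return "NS"


if __name__ == "__main__":
    print(tag_people("this"))  # singular determiner
    print(tag_people("the"))   # not unambiguously singular
```

    Under these assumptions, THE PEOPLE and bare PEOPLE both fall under the "elsewhere" case and receive NS.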

    Concessive clauses

    In the PPCME2, ALL BE IT (THAT) and SO BE IT (THAT) clauses are treated similarly to V1 conditionals. HOW BE IT (THAT) clauses are treated as adverbial free relatives.

    In the later corpora, ALL BE IT and HOW BE IT (though not SO BE IT) come to be used absolutely. Moreover, regardless of whether they appear absolutely or introduce subordinate clauses, these items come to be spelled as single words (ALBEIT, HOWBEIT). We therefore treat them as unitary adverbs or prepositions. In the later corpora, SO BE IT clauses cease to pattern with ALL BE IT clauses. Instead, they are simply word order variants of IT BE SO and are annotated accordingly.

    See below for further details and examples.


    In the PPCME2, ALL BE IT (THAT) clauses, like SO BE IT (THAT) clauses, are treated similarly to V1 conditionals. ALL is POS-tagged Q, surrounded by ADVP brackets, and treated as a daughter of CP-ADV. This is not intended as the correct analysis of the construction, but rather to fit in with the annotation of V1 conditionals.
    ( (IP-MAT (CONJ and)
              (PP (P atte)
                  (NP (N risyng)
                      (PP (P of)
                          (NP (D the) (N sonne)))))
              (NP-SBJ (PRO I))
              (VBD fond)
              (NP-OB1 (D the) (ADJ secunde) (N degre)
                      (PP (P of)
                          (NP (NPR Aries))))
              (IP-PPL (VAG sittyng)
                      (PP (P upon)
                          (NP (PRO$ myn) (N est) (N orisonte))))
              (, ,)
              (CP-ADV (ADVP (Q all))			← ALL BE IT
                      (IP-SUB (BEP be)
                              (NP-SBJ-1 (PRO it))
                              (CP-THT-1 (C that)
                                        (IP-SUB (NP-SBJ (PRO it))
                                                (BEP was)
                                                (ADJP (FP but) (ADJ litel))))))
              (. .))
      (ID CMASTRO,673.C1.364))
    ( (IP-MAT-SPE (CONJ And)
                  (, ,)
                  (CP-ADV (ADVP (Q al))			← ALL BE IT
                          (IP-SUB (BED were)
                                  (NP-SBJ-1 (PRO it))
                                  (ADVP (ADV so))
                                  (CP-THT-1 (C that)
                                            (IP-SUB (NP-SBJ (PRO she))
                                                    (ADVP-TMP (ADV right) (ADV now))
                                                    (BED were)
                                                    (ADJP (ADJ deed))))))
                  (, ,)
                  (NP-SBJ (PRO ye))
                  (NEG ne)
                  (MD oughte)
                  (NEG nat)
                  (, ,)
                  (PP (P as)
                      (PP (P for)
                          (NP (PRO$ hir) (N deeth))))
                  (, ,)
                  (NP-OB1 (PRO$+N youreself))
                  (TO to)
                  (VB destroye)
                  (. .))
      (ID CMCTMELI,217.C1b.18))

    In the later corpora, ALBEIT (like HOWBEIT) is treated as a unitary adverb (when used absolutely) or as a unitary preposition (when introducing a subordinate clause).

    (NODE (CP-CAR (WNP-1 (WPRO Which))
                  (C 0)
                  (IP-SUB (PP (P in)
                              (NP (NP-POS (D the) (N$ kinges))
                                  (NS daies)))
                          (, ,)
                          (PP-LFD (P albeit)		← ALBEIT
                                  (CP-ADV (C 0)
                                          (IP-SUB (NP-SBJ (PRO he))
                                                  (BED was)
                                                  (ADVP (ADV sore))
                                                  (VAN ennamored)
                                                  (PP (P vpon)
                                                      (NP (PRO her))))))
                          (, ,)
                          (ADVP-RSP (ADV yet))
                          (NP-SBJ-RSP=1 (PRO he))
                          (VBD forbare)
                          (NP-OB1 (PRO her))
                          (, ,)
                          (PP (CONJ either)
                              (PP (P for)
                                  (NP (N reuerence)))
                              (, ,)
                              (CONJP (CONJ or)
                                     (PP (P for)
                                         (NP (D a) (ADJ certain) (ADJ frendly) (N faithfulnes)))))))
          (ID MORERIC,55.118))
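
    To check how an item such as ALBEIT is tagged in a tree like the one above, a small bracket reader is enough. The following is a minimal sketch, not the corpora's own search tool: the function names are hypothetical, the input is assumed to be one well-formed bracketed tree with any marginal comments removed, and words are assumed to contain no parentheses.

```python
import re


def parse(text):
    """Parse one bracketed tree into nested lists.

    Each node becomes [label, child, ...]; a leaf is [tag, word].
    Assumption: the input is a single well-formed tree.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    return read()


def words_with_tag(tree, tag):
    """Yield every word whose POS tag equals `tag`."""
    is_leaf = (
        len(tree) == 2
        and isinstance(tree[0], str)
        and isinstance(tree[1], str)
    )
    if is_leaf:
        if tree[0] == tag:
            yield tree[1]
        return
    for child in tree:
        if isinstance(child, list):
            yield from words_with_tag(child, tag)


if __name__ == "__main__":
    fragment = (
        "(PP-LFD (P albeit) "
        "(CP-ADV (C 0) (IP-SUB (NP-SBJ (PRO he)) (BED was))))"
    )
    print(list(words_with_tag(parse(fragment), "P")))
```

    Applied to the fragment shown, the search for tag P returns the single word albeit, which matches its treatment as a unitary preposition when it introduces a subordinate clause.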


    In the PPCME2, HOW BE IT (THAT) clauses are treated as adverbial free relatives.
    ( (IP-MAT (CONJ And)	
    	  (ADVP (CP-FRL (WADVP-1 (WADV how))		← HOW BE IT
    			(C 0)
    			(IP-SUB (ADVP *T*-1)
    				(BEP be)
    				(NP-SBJ-2 (PRO it))
    				(CP-THT-2 (C 0)
    					  (IP-SUB (IP-SUB-3 (NP-SBJ (PRO thou))
    							    (HVP hast)
    							    (ADVP-TMP (ADV often))
    							    (ADVP-TMP (ADV before))
    							    (PP (P in)
    								(NP (NP (PRO$ thy) (ADJ yonge) (N age))
    								    (CONJP (CONJ and)
    									   (NP (ADJ myddell) (N age)))))
    							    (VBN dyvydyd)
    							    (NP-OB1 (PRO$ thy) (N lyfe))
    							    (NP-TMP (Q+N somtyme))
    							    (PP (P to)
    								(NP (N vertue))))
    						  (, ,)
    						  (IP-SUB=3 (NP-TMP (Q+N somtyme))
    							    (PP (P to)
    								(NP (N vyce)))))))))
    	  (, ,)
    	  (NP-SBJ (PRO ye))
    	  (PP (P as)
    	      (ADVP-TMP (ADV now)))
    	  (PP (P in)
    	      (NP (PRO$ thy) (ADJR latter) (N age)))
    	  (VBP kepe)
    	  (IP-SMC (NP-SBJ (PRO$ thy) (N lyfe))
    		  (ADJP (ADJ holy)))
    	  (PP (P in)
    	      (NP (N vertue)))
    	  (. .))
      (ID CMINNOCE,11.189))

    In the later corpora, HOWBEIT (like ALBEIT) is treated as a unitary adverb (when used absolutely) or as a unitary preposition (when introducing a subordinate clause).

    ( (IP-MAT-SPE (ADVP (ADV (ADV31 How) (ADV32 be) (ADV33 it)))	← HOWBEIT
    	      (NP-SBJ (PRO he)
    		      (CP-REL-SPE (WNP-1 0)
    				  (C that)
    				  (IP-SUB-SPE (NP-SBJ *T*-1)
    					      (HVP hath)
    					      (VBN receaved)
    					      (NP-OB1 (PRO$ hys) (N testimonye)))))
    	      (HVP hath)
    	      (VBN set)
    	      (RP to)
    	      (NP-OB1 (PRO$ his)
    		      (N seale)
    		      (CP-THT-SPE (C that)
    				  (IP-SUB-SPE (NP-SBJ (NPR God))
    					      (BEP is)
    					      (ADJP (ADJ true)))))
    	      (. .))
      (ID TYNDNEW,III,20J.218))


    In the PPCME2, SO BE IT (THAT) clauses, like ALL BE IT (THAT) clauses, are treated similarly to V1 conditionals. SO is POS-tagged ADV, surrounded by ADVP brackets, and treated as a daughter of CP-ADV. Again, this is not intended as the correct analysis of the construction, but rather to fit in with the annotation of V1 conditionals.
    ( (IP-MAT-SPE (CONJ and)
                  (NP-SBJ (Q all))
                  (MD $shal)
                  (BE be)
                  (VAN delyverde)
                  (, ,)
                  (PP (P so)
                      (CP-ADV (C 0)
                              (IP-SUB (NP-SBJ (PRO thou))
                                      (MD wolte)
                                      (VB telle)
                                      (NP-OB2 (PRO me))
                                      (NP-OB1 (PRO$ thy) (N name)))))
                  (, ,)
                  (CP-ADV (ADVP (ADV so))				← SO BE IT
                          (IP-SUB (BEP be)
                                  (NP-SBJ-1 (PRO hit))
                                  (CP-THT-1 (C that)
                                            (IP-SUB (NP-SBJ (PRO thou))
                                                    (BEP be)
                                                    (NEG nat)
                                                    (NP-OB1 (NPR sir) (NPR Launcelot))))))
                  (. .)
                  (' '))
      (ID CMMALORY,191.2824))

    In the later corpora, SO BE IT is a word order variant of IT BE SO and is tagged accordingly.

    ( (IP-MAT (PP-LFD (P IF)
    		  (CP-ADV (C 0)
    			  (IP-SUB (ADVP (ADV SO))		← SO BE IT
    				  (BEP BE)
    				  (NP-SBJ=1 (PRO IT))
    				  (, ,)
    				  (CP-THT-1 (C THAT)
    					    (IP-SUB (PP (P IN)
    							(NP (Q ANY) (N TRIANGLE)))
    						    (, ,)
    						    (NP-SBJ (D THE)
    							    (N SQUARE)
    							    (PP (P OF)
    								(NP (D THE) (ONE ONE) (N SYDE))))
    						    (BEP BE)
    						    (ADJP (ADJ =L)
    							  (PP (P TO)
    							      (NP (D THE)
    								  (NUM .IJ.)
    								  (NS SQUARES)
    								  (PP (P OF)
    								      (NP (D THE) (OTHER OTHER) (NUM IJ.) (NS SIDES)))))))))))
    	  (, ,)
    	  (ADVP-RSP (ADV THAN))
    	  (MD MUST)
    	  (NP-ADV (N NEDES))
    	  (NP-SBJ (D THAT) (N CORNER))
    	  (BE BE)
    	  (NP-OB1 (D A)
    		  (ADJ RIGHT)
    		  (N CORNER)
    		  (, ,)
    		  (CP-REL (WNP-3 (WPRO WHICH))
    			  (C 0)
    			  (IP-SUB (NP-SBJ *T*-3)
    				  (BEP IS)
    				  (VAN CONTEINED)
    				  (PP (P BETWENE)
    				      (NP (D THOSE) (NUM TWO) (ADJR LESSER) (NS SYDES))))))
    	  (. .))
      (ID RECORD,2.E4V.296))


    Disfluencies

    For the moment, disfluencies are indicated only in the PPCMBE.


    DO (causative versus periphrastic)

    In Middle English, DO can be ambiguous between a causative (ECM) main verb and a periphrastic auxiliary. The default in the PPCME2 is to treat ambiguous cases as causative except when a causative reading is impossible. Causative DO dies out in the course of Middle English, and so instances of DO in the later corpora that could in principle be treated as ambiguous, and hence causative by default, are instead uniformly treated as periphrastic.

    Foreign languages

    Sequences of more than two foreign words are labelled by language (FRENCH, ITALIAN, LATIN, SPANISH, etc.) in the Penn historical corpora (PPCME2, PPCEME, and PPCMBE). In the PCEEC, such sequences are labelled indiscriminately as FOREIGN.
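
    The labelling difference for foreign sequences can be summarized in a short sketch. The function name and the use of None are assumptions made for illustration; this section does not say how sequences of one or two foreign words are treated, so the sketch leaves them undecided. The correspondence corpus is referred to here as PCEEC.

```python
# Hypothetical helper, for illustration only.
PENN_CORPORA = {"PPCME2", "PPCEME", "PPCMBE"}


def foreign_sequence_label(corpus, language, n_words):
    """Return the label for a sequence of foreign words.

    Only sequences of more than two words are covered by the rule;
    shorter sequences return None because their treatment is not
    described in this section.
    """
    if n_words <= 2:
        return None
    if corpus in PENN_CORPORA:
        return language.upper()  # e.g. LATIN, FRENCH
    if corpus == "PCEEC":
        return "FOREIGN"
    raise ValueError(f"unknown corpus: {corpus}")


if __name__ == "__main__":
    print(foreign_sequence_label("PPCEME", "Latin", 4))
    print(foreign_sequence_label("PCEEC", "Latin", 4))
```

    A cross-corpus search for Latin passages would therefore need to match LATIN in the Penn corpora but the undifferentiated FOREIGN label in the PCEEC.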

    LIKE and similar verbs (LACK, NEED, WANT)

    As is well known, LIKE and similar verbs (LACK, NEED, WANT) occur in two constructions in the history of English. In the earlier construction (ME LIKE(N) PEARS), these verbs are parallel to modern PLEASE, and the subject is the theme argument. In the modern construction that replaces it (I LIKE PEARS), the subject is the experiencer.

    In the older construction, which continues into Early Modern English with LIKE, the theme is labelled NP-SBJ, and the experiencer NP-OB1. This contrasts with the annotation of impersonal copular constructions of the type IT/THERE IS NEED (TO) ME, where the experiencer is labelled NP-OB2 because of the presence of the verb BE (see NP-OB2 in copular constructions).

    (NODE (IP-SUB (NP-SBJ (D this) (ADJ wise) (N man))
                  (VBD saugh)
                  (CP-THT (C that)
                          (IP-SUB (NP-OB1 (PRO hym))
                                  (VBD wanted)
                                  (NP-SBJ (N audience)))))
          (ID CMCTMELI,219.C2.95))
    (NODE (PP (P if)
              (CP-ADV (C 0)
                      (IP-SUB (NP-SBJ (D +tat))
                              (NP-OB1 (PRO +gow))
                              (VBP nede+t))))
          (ID CMBRUT3,51.1503))

    In the modern construction, the experiencer is labelled NP-SBJ, and the theme NP-OB1.

    Post-head modifiers

    In the PPCME2, ELSE and ENOUGH in post-head position are surrounded by POS brackets, but not by phrasal brackets, contrary to the general rule that post-head modifiers are always bracketed as phrases.

    In the later corpora, the general rule is applied consistently, and these items are surrounded by both types of brackets.

    PPCME2						Later corpora
    (NP (Q no) (N thynge) (ADJ elles))		(NP (Q+N nothing)
    						    (ADJP (ADJ else)))
    (ADVP-LOC (Q+WADV anywhere) (ADV elles))	(ADVP-LOC (Q+N anywhere)
    							  (ADVP (ADV else)))
    (NP (N blisse) (ADJR inoh))			(NP (N bliss)
    						    (ADJP (ADJR enough)))
    (ADJP (ADJ rich) (ADVR ynow))			(ADJP (ADJ rich)
    						      (ADVP (ADVR enough)))
    (ADVP (ADV quickly) (ADVR ynow))		(ADVP (ADV quickly)
    						      (ADVP (ADVR enough)))

    In this connection, it is worth noting that the PPCME2 and the later corpora do not always agree on which instances of postnominal ELSE and ENOUGH are tagged as adjectival (ADJR) or as adverbial (ADVR).
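
    The bracketing rule applied to post-head ELSE and ENOUGH in the later corpora can be sketched as follows. The mapping from POS tag to phrase label is read off the examples in the table above; the function name is hypothetical, and the sketch covers only the four tags that appear there.

```python
# Phrase label for each POS tag, as seen in the examples above.
PHRASE_FOR_POS = {
    "ADJ": "ADJP",
    "ADJR": "ADJP",
    "ADV": "ADVP",
    "ADVR": "ADVP",
}


def bracket_post_head(pos, word):
    """Wrap a post-head modifier in POS brackets and phrasal brackets."""
    phrase = PHRASE_FOR_POS[pos]
    return f"({phrase} ({pos} {word}))"


if __name__ == "__main__":
    print(bracket_post_head("ADJ", "else"))
    print(bracket_post_head("ADVR", "enough"))
```

    The PPCME2 treatment corresponds to emitting only the inner POS bracket, for example (ADJ elles), without the surrounding phrase.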

    Quantifiers and quantified expressions


    In the PPCME2, LESS, LEAST and MUCH, MORE, MOST are generally tagged as quantifiers (Q, QR, QS), but as adjectives (ADJ, ADJR, ADJS) under conditions described below. The distinction between the adjectival use and the pure quantifier use is not always easy to make in a consistent way and becomes more difficult over time. In the later corpora, these items are therefore uniformly tagged as quantifiers (Q, QR, QS).

    LESS, LEAST and MUCH, MORE, MOST are treated as adjectives (ADJ, ADJR, ADJS) in the PPCME2 under the following conditions. See Comparative adjectives as heads of ADJP and Superlative adjectives as heads of ADJP for further relevant discussion.
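
    For searches that span the PPCME2 and the later corpora, one option is to normalize the adjectival tags of these five items to the quantifier tags before comparing results. The sketch below assumes that lemmas are already available in a normalized modern spelling; the function and table names are hypothetical.

```python
# Items that the later corpora tag uniformly as quantifiers.
QUANTIFIER_LEMMAS = {"LESS", "LEAST", "MUCH", "MORE", "MOST"}

# Adjectival tag -> corresponding quantifier tag.
ADJ_TO_Q = {"ADJ": "Q", "ADJR": "QR", "ADJS": "QS"}


def later_corpora_tag(lemma, tag):
    """Map a PPCME2-style tag to the tag used in the later corpora.

    Only the five quantifier lemmas are affected; every other
    lemma keeps its tag unchanged.
    """
    if lemma.upper() in QUANTIFIER_LEMMAS:
        return ADJ_TO_Q.get(tag, tag)
    return tag


if __name__ == "__main__":
    print(later_corpora_tag("more", "ADJR"))
    print(later_corpora_tag("rich", "ADJ"))
```

    The mapping runs in one direction only: the adjectival versus quantifier distinction made in the PPCME2 cannot be recovered from the uniform tags of the later corpora.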

    Measure phrases

    Quantified expressions functioning as clause-level measure phrases are tagged NP-ADV in the PPCME2 (see below), but as NP-MSR in the later corpora. See also Q+N, Q+WPRO.

    (NODE (IP-SUB (NP-SBJ (D +te) (N water))
                  (MD wolde)
                  (NP-ADV (Q+N no+ting))				← nothing
                  (DO done)
                  (NP-OB1 (PRO$ his) (N commandement)))
          (ID CMBRUT3,123.3740))
    ( (IP-MAT (CONJ &)
              (NP-SBJ (D +tis) (NPR Harolde))
              (HVD hade)
              (NP-ADV (Q+N no+ting))				← nothing
              (NP-OB1 (NP (D +te) (NS condicions))
                      (CONJP (CONJ ne)
                             (NP (NS maners)))
                      (PP (P of)
                          (NP (NPR Kyng) (NPR Knoght)
                              (CP-REL (WNP-1 0)
                                      (C +tat)
                                      (IP-SUB (NP-SBJ *T*-1)
                                              (BED was)
                                              (NP-OB1 (PRO$ his) (N fader)))))))
              (. ,))
      (ID CMBRUT3,124.3772))
    (NODE (IP-SUB (PP *ICH*-2)
                  (NP-SBJ (D the) (N werre))
                  (VBP liketh)
                  (NP-OB1 (PRO yow))
                  (NP-ADV (Q no) (N thyng)))			← nothing
          (ID CMCTMELI,235.C1.699))
    ( (IP-MAT (NEG Ne)
              (MD +terf)
              (NP-SBJ (D +tt) (ADJ seli) (N meiden)
                      (CP-REL (WNP-1 0)
                              (C +tt)
                              (IP-SUB (NP-SBJ *T*-1)
                                      (HVP haue+d)
                                      (ADVP (Q al))
                                      (DON idon)
                                      (NP-OB1 (PRO hire))
                                      (PP (RP ut) (P of)
                                          (NP (ADJ +tullich) (N +teowdom)))
                                      (PP (P as)
                                          (NP (NP (NPR$ godes) (ADJ freo) (N dohter))
                                              (CONJP (CONJ &)
                                                     (NP (NP-POS (PRO$ his) (N$ sunes))
                                                         (N spuse))))))))
              (, .)
              (VB drehe)
              (NP-ADV (Q+N nawiht))					← nought
              (NP-OB1 (SUCH swucches))
              (. .))
      (ID CMHALI,157.417))
    ( (IP-MAT-SPE (NP-OB1 (Q Alle) (PRO$ +tine) (CODE <P_47>) (NS +treates))
                  (NEG ne)
                  (VBP drede)
                  (NP-SBJ (PRO ich))
                  (IP-MAT-PRN (VBD q+d)
                              (NP-SBJ (PRO ha)))
                  (NP-ADV (QP (ADV riht) (Q noht)))			← nought
                  (. .))
      (ID CMKATHE,47.442))
    (NODE (NP (D $+te) (CODE {TEXT:bi+te}) (N mu+d)
              (CP-REL (WNP-2 0)
                      (C +tt)
                      (IP (NP-SBJ *T*-2)
                          (NP-ADV (Q+N eawicht))			← ought
                          (VBP (VBP21 mis) (VBP22 sei+d))
                          (NP-OB1 (PRO +te)))))
          (ID CMANCRIW,II.100.1211))
    (NODE (IP-SUB (NP-ADV (Q oghte))				← ought
                  (NP-SBJ (PRO it))
                  (BEP es)
                  (NP-OB1 (QP (ADVR swa) (Q lyttill))
                          (CONJP (CONJ and)
                                 (ADJP (ADVR swa) (ADJ schorte))))
                  (, ,)
                  (PP (P for)
                      (NP (OTHER othire) (NS thoghtes)
                          (CP-REL (WNP-1 0)
                                  (C +tat)
                                  (IP-SUB (NP-SBJ *T*-1)
                                          (BEP are)
                                          (PP (P in)
                                              (NP (PRO thaym))))))))
          (ID CMROLLTR,9.249))
    (NODE (IP-SUB (NP-SBJ (PRO ic))
                  (NP-OB1 (PRO hit))
                  (NP-ADV (Q ouht))					← ought
                  (VBP wite)
                  (, ,)
                  (PP (P to)
                      (NP (OTHER o+der) (NS +tinge))))
          (ID CMVICES1,53.588))
    (NODE (IP-SUB (NP-SBJ (PRO ha))
                  (BED wes)
                  (NP-ADV (Q+N sumdel))				← somedeal
                  (VAN (VAN offruht) (CONJ ant) (VAN offert)))
          (ID CMKATHE,29.161))
    (NODE (IP-SUB (NP-SBJ (PRO ich))
                  (NP-OB1 (D +tis))
                  (IP-MAT-PRN (VBP sei+d)
                              (NP-SBJ (N warschipe)))
                  (NP-ADV (Q+N sumdel))
                  (VBP understonde))
          (ID CMSAWLES,182.236))
    ( (IP-IMP (CONJ and)
              (PP (P among)
                  (NP (QP (ADVR so) (Q muche))
                      (N ioye)))
              (VBI antermete)
              (NP-OB1 (PRO +te))
              (NP-ADV (Q+WPRO sumwhat))				← somewhat
              (. ,))
      (ID CMAELR3,40.410))
    (NODE (IP-SUB (NP-SBJ (D +tis) (N worde)
                          (NP-PRN (N Gaste)))
                  (VBP sownnes)
                  (NP-ADV (Q+WPRO sumwhate))			← somewhat
                  (PP (P into)
                      (NP (N fellenes))))
          (ID CMEDTHOR,48.744))


    To facilitate searches, quantified expressions of the form Q+N, Q+WPRO (e.g., SOMETHING, SOMEWHAT) below the clause level are always enclosed in NP-MSR brackets in the later corpora, regardless of whether they are spelled as one word or two.
    PPCME2					Later corpora
    (ADJP (Q+WPRO somewhat)			(ADJP (NP-MSR (Q+WPRO somewhat))
          (ADJ late))			      (ADJ late))
    (ADJP (NP-MSR (Q some) (WPRO what))	(ADJP (NP-MSR (Q some) (WPRO what))
          (ADJ late))			      (ADJ late))
    (QP (Q+WPRO sumdele) (QR moor))		(QP (NP-MSR (Q+WPRO somewhat))
    					    (QR more))
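
    The one-word case in the table above can be sketched as a string rewrite. This is an illustration under several assumptions: the input is a single phrase below the clause level (clause-level measure phrases already carry their own NP label and should not be passed in), tag and word are separated by one space, and the function name is hypothetical.

```python
import re

# A bare (Q+N word) or (Q+WPRO word) that is not already the first
# daughter of an NP-MSR node.
_BARE_Q = re.compile(r"(?<!\(NP-MSR )\((Q\+(?:N|WPRO)) ([^\s()]+)\)")


def wrap_measure(phrase):
    """Enclose bare Q+N / Q+WPRO items in NP-MSR brackets."""
    return _BARE_Q.sub(r"(NP-MSR (\1 \2))", phrase)


if __name__ == "__main__":
    print(wrap_measure("(ADJP (Q+WPRO somewhat) (ADJ late))"))
```

    Because of the lookbehind, applying the function to a phrase that already has the NP-MSR bracket leaves it unchanged.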

    Splitting and joining words

    In keeping with our general strategy of minimizing changes to the annotation guidelines, most items are split or joined in the same way in the later corpora as in the PPCME2. However, some items that are treated as PPs in the PPCME2 (like AFTERNOON, TODAY, and TONIGHT) have a wider distribution in Modern English (the afternoon, today's lecture) and are therefore reclassified as unitary nouns. By contrast, the distribution of fused forms like ALIVE and ASLEEP continues to reflect their phrasal origin (*an asleep child), and so these items continue to be tagged with complex (+) tags.

    In a few cases (for instance, ALMIGHTY, BETIME(S)), we have changed the treatment of items in the later corpora for sheer convenience.

    AFTERNOON
        PPCME2: Always phrasal.
            (PP (P after)
                (NP (N noon)))
        Later corpora: Unitary noun. Note: the afternoon
            (N afternoon)
            (N (N21 after) (N22 noon))

    ALBEIT (see Concessive clauses)
        PPCME2: Always phrasal.
            (Q all) (BEP be) (PRO it)
        Later corpora: Unitary adverb or preposition.
            (ADV albeit)
            (ADV (ADV31 al) (ADV32 be) (ADV33 it))
            (P albeit)
            (P (P31 al) (P32 be) (P33 it))

    ALIVE
        PPCME2: Treated inconsistently. Mostly unitary ADJ, sometimes P+N.
            (ADJ alive)
            (P+N alive)
        Later corpora: Always treated as a fused form.

    ALMIGHTY
        PPCME2: Treated as written; spelling varies.
            (ADJ allmighty)
            (ADJP (Q all) (ADJ mighty))
        Later corpora: Unitary adjective.
            (ADJ almighty)
            (ADJ (ADJ21 al) (ADJ22 mighty))

    BETIME(S)
        PPCME2: Treated as written; spelling varies.
            (ADVP-TMP (ADV betimes))
            (PP (P be)
                (NP (NS times)))
        Later corpora: Unitary adverb.
            (ADVP-TMP (ADV betimes))
            (ADVP-TMP (ADV (ADV21 be) (ADV22 times)))

    HOWBEIT (see Concessive clauses)
        PPCME2: Always phrasal.
            (WADV how) (BEP be) (PRO it)
        Later corpora: Unitary adverb or preposition.
            (ADV howbeit)
            (ADV (ADV31 how) (ADV32 be) (ADV33 it))
            (P howbeit)
            (P (P31 how) (P32 be) (P33 it))

    TODAY
        PPCME2: Always phrasal.
            (PP (P to)
                (NP (N day)))
        Later corpora: Unitary noun.
            (N today)
            (N (N21 to) (N22 day))

    TONIGHT
        PPCME2: Always phrasal.
            (PP (P to)
                (NP (N night)))
        Later corpora: Unitary noun.
            (N tonight)
            (N (N21 to) (N22 night))

    WELCOME
        PPCME2: Always spelled together.
            (ADJ welcome), (N welcome)
        Later corpora: Unitary adjective or unitary noun.
            (ADJ welcome), (N welcome)
            (ADJ (ADJ21 well) (ADJ22 come)), (N (N21 well) (N22 come))
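
    The numbered tags used above for unitary items that are written as several words appear to follow a simple pattern: the tag, then the total number of parts, then the position of the part (N21, N22; ADV31, ADV32, ADV33). The sketch below builds such nodes. The pattern is inferred from the examples in this section rather than from a stated rule, and the function name is hypothetical.

```python
def split_word_node(tag, parts):
    """Build the bracketing for a unitary item written as `parts`.

    A single part gives a plain (TAG word) node. Several parts give
    (TAG (TAGn1 w1) ... (TAGnn wn)), where n is the number of parts.
    """
    if not parts:
        raise ValueError("at least one part is required")
    if len(parts) == 1:
        return f"({tag} {parts[0]})"
    n = len(parts)
    inner = " ".join(
        f"({tag}{n}{i} {word})" for i, word in enumerate(parts, start=1)
    )
    return f"({tag} {inner})"


if __name__ == "__main__":
    print(split_word_node("N", ["to", "day"]))
    print(split_word_node("ADV", ["how", "be", "it"]))
```

    With this pattern, a search for the unitary tag (N, ADV, P) finds the item whether the text spells it as one word or as several.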