General introduction

As with the Penn Historical Corpora, the primary goal of the annotation system described here is to facilitate automated searches rather than to give a correct linguistic analysis of each sentence, which in many cases is unworkable and in some cases (due to structural and morphosyntactic ambiguity, inaudible material, or other reasons) downright impossible.

As with the historical corpora, we have tried to construct the system so that at each stage of the annotation, information can be added in a monotonic way. That is, we want any future revisions of the bracketed structures always to add information, never to change it. This goal requires us to avoid judgments that are subjective or error-prone.

More so than in the past, we have attempted to simplify the present guidelines, in the sense of streamlining the principles involved rather than minimizing the number of nodes in the annotated structures. For instance, with very few exceptions, phrases under the present guidelines have a unique head, which is sometimes assumed implicitly but more often than not explicitly included in the annotation. Although the resulting structures may look quite complex, the simplification of the principles improves the accuracy of the annotation process, as well as its speed when (semi)automated methods of corpus construction, quality control, revision and maintenance are used. In this connection, it is useful to recall the main purpose of the annotation, which is not to produce a corpus for human review, but to facilitate computer-assisted searches. In addition to improving the annotation process, the simpler annotation principles should result in search queries that are simpler and hence less prone to errors of logic.

A final important advantage of simplifying the annotation principles is, we hope, that the simpler guidelines will extend more easily to other languages. In particular, the hope is that languages will differ mainly with regard to the "Individual words and phrases" section of the manual, but that syntactic annotation above the lexical level will be markedly more uniform cross-linguistically. Of course, we do not expect the present system to extend to other languages in every detail; in particular, including a VP is more sensible for some languages than for others. For Appalachian English, including a VP level simplifies decisions that arise in connection with the rampant syncretism in the verbal morphology, but for languages with productive scrambling, assuming a VP level would likely cause more trouble than it is worth.

Our focus on simplifying the annotation principles pushes the structures in the corpus in the direction of normalization. In particular, when it is sensible to do so, we prefer to give linguistic variants the same syntactic structure at the price of freely assuming and including empty categories in the annotation, as in the following examples.

( (PP (ADVP (ADV back))		← like this
      (P in / 0)
      (NP (D them)
	  (NS days))))

( (PP (ADVP (ADV back))		← not like this
      (NP-TMP (D them)
	      (NS days))))
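
The effect of this normalization on searching can be illustrated with a minimal Python sketch. It is not part of the corpus tooling: the helper names `parse` and `is_pp_with_p_and_np` are hypothetical, the trees are written without the outer wrapper brackets, and the silent preposition is assumed to be written as the leaf `0`.

```python
import re

def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening bracket
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (or the empty leaf 0)
                pos += 1
        pos += 1                      # skip the closing bracket
        return (label, children)

    return node()

def is_pp_with_p_and_np(tree):
    """True if the tree is a PP that directly dominates both a P and an NP."""
    label, children = tree
    child_labels = [c[0] for c in children if isinstance(c, tuple)]
    return label == "PP" and "P" in child_labels and "NP" in child_labels

# Normalized annotation: overt and silent preposition share one structure.
overt = parse("(PP (ADVP (ADV back)) (P in) (NP (D them) (NS days)))")
silent = parse("(PP (ADVP (ADV back)) (P 0) (NP (D them) (NS days)))")
# Rejected alternative: no P node, and a differently labelled NP.
unnormalized = parse("(PP (ADVP (ADV back)) (NP-TMP (D them) (NS days)))")

print(is_pp_with_p_and_np(overt))         # True
print(is_pp_with_p_and_np(silent))        # True
print(is_pp_with_p_and_np(unnormalized))  # False
```

Under these assumptions a single query retrieves both variants, whereas the structure marked "not like this" would need a second, differently shaped query.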

In many cases, assuming empty categories makes little difference one way or the other to the task of annotation, but often they simplify the task significantly. For instance, assuming an empty HV node in the example below makes the choice of POS tag for "been" trivial. But if the empty HV node is eliminated, the proper POS tag is not clear.

( (FRAG (VP (VAN suppose)		
	    (IP-INF (TO to)
		    (VP (HV have / 0)
			(VP (BEN been)
			    (NP-PRD (NP-MSR (NUMP (NUM three))
					    (NS days'))
				    (N work))))))
	(PUNC .)))

( (FRAG (VP (VAN suppose)
	    (IP-INF (TO to)
		    (VP (?? been)
			(NP-PRD (NP-MSR (NUMP (NUM three))
					(NS days'))
				(N work)))))
	(PUNC .)))

For further examples, see especially the treatment of degree and comparative constructions.

In any annotation system, it is necessary to deal with indeterminate and ambiguous cases.

POS ambiguity. In the case of ambiguous POS tags, it is sometimes possible to constrain the ambiguity to a choice between two (or very rarely three) options. When that is possible, we list those options rather than simply using the blanket tag X to indicate that the POS tag is not clear. The variant tags are joined with an underscore, as in the following examples. Unless there is a compelling reason not to do so, we list the options in alphabetical order.

( (CP-QUE-MAT (WADVP (WADV How@))
	      (IP-SUB (DOD_MD @'d)
		      (NP-SBJ (PRO you))
		      (VP (DO do)
			  (NP-OB1 (D that))))
	      (PUNC ?)))

( (CP-QUE-MAT (WADVP (WADV How@))
	      (IP-SUB (HVD_MD @'d)
		      (NP-SBJ (PRO they))
		      (VP (VB_VBN set)
			  (NP-OB1 (D that))
			  (RP up)))
	      (PUNC ?)))
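
The naming convention for ambiguous tags can be sketched in Python as follows. This is only an illustration: the function names `ambiguous_tag` and `tag_matches` are hypothetical, and the sketch assumes that no individual POS tag itself contains an underscore. It also always sorts alphabetically, so it does not model the cases where there is a compelling reason to use another order.

```python
def ambiguous_tag(*options):
    """Join candidate POS tags with an underscore, in alphabetical order."""
    if len(options) < 2:
        raise ValueError("an ambiguous tag needs at least two options")
    return "_".join(sorted(options))

def tag_matches(tag, wanted):
    """True if `wanted` is the tag itself or one of its listed options."""
    return wanted in tag.split("_")

print(ambiguous_tag("MD", "DOD"))    # DOD_MD
print(ambiguous_tag("VBN", "VB"))    # VB_VBN
print(tag_matches("HVD_MD", "MD"))   # True
print(tag_matches("HVD_MD", "HV"))   # False
```

Splitting on the underscore, rather than testing for a substring, keeps a search for one tag from accidentally matching a longer tag that merely begins with the same letters.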

Syntactic ambiguity. For syntactic ambiguity, the various options cannot be indicated as easily as for POS tags, and so we establish default annotations for difficult or unclear cases. These are discussed in connection with the relevant topics. The long table of contents also contains an entry under Defaults that provides a crib sheet.