Division into sentence tokens


In the general case, sentence tokens are simple (= non-conjoined) instances of one of the following categories:

Sentence tokens are distinct from orthographic sentences (sequences of words delimited by sentence-final punctuation such as a period, question mark, or exclamation point). An orthographic sentence can, and often does, consist of several sentence tokens. In the canonical case, sentence tokens are matrix clauses, that is, clauses not in the scope of a (possibly silent) subordinating conjunction. A new sentence token begins whenever a matrix clause has a subject that either differs in reference from a previous matrix subject or is overt (whether or not it is coreferential with a previous matrix subject). Conversely, VP conjuncts do not constitute sentence tokens (unless they are instances of FRAG).


( (IP-MAT (NP-SBJ (PRO I))
          (VP (VBD left))
          (PUNC ,)))

( (IP-MAT (CONJ and)
          (NP-SBJ (PRO she))            ← switch reference
          (VP (VBD arrived))
          (PUNC ,)))


( (IP-MAT (NP-SBJ (PRO I))
          (VP (VBD arrived))
          (PUNC ,)))

( (IP-MAT (CONJ and)
          (NP-SBJ (PRO I))              ← no switch reference, but subject is overt
          (VP (VBD cleaned) (RP up))
          (PUNC ,)))

( (IP-MAT (CONJ and)
          (ADVP-TMP (ADV then))
          (NP-SBJ (PRO I))              ← no switch reference, but subject is overt
          (VP (VBD left))
          (PUNC ,)))


( (IP-MAT (NP-SBJ (PRO I))
          (VP (VP (VBD arrived))        ← VP conjuncts
              (PUNC ,)
              (CONJP (VP (VBD cleaned) (RP up))
                     (PUNC ,)
                     (CONJP (CONJ and)
                            (VP (ADVP-TMP (ADV then))
                                (VP (VBD left))))))
          (PUNC ,)))


( (CP-QUE-MAT (IP-SUB (DOD Did)
                      (NP-SBJ (PRO you))
                      (VP (VB come)))
              (PUNC ?)))                ← see Question mark

( (CP-QUE-MAT (CONJ Or)
              (IP-SUB (DOD did)
                      (NP-SBJ (PRO you))    ← no switch reference, but overt
                      (VP (VB go)))
              (PUNC ?)))


( (CP-QUE-MAT (IP-SUB (DOD Did)
                      (NP-SBJ (PRO you))
                      (VP (VP (VB come))    ← VP conjuncts
                          (CONJP (CONJ or)
                                 (VP (VB go)))))
              (PUNC ?)))


( (CP-QUE-MAT (WNP-1 (WPRO What))
              (IP-SUB (DOD did)
                      (NP-SBJ (PRO you))
                      (VP (DO do)
                          (NP-OB1 *T*-1)))
              (PUNC ?)))

( (FRAG (VP (VBD planted)
            (NP-OB1 (NS potatoes)))
        (PUNC .)))
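
The contrast between overt-subject clauses and VP conjuncts can be checked mechanically. The following is a minimal Python sketch, not part of any corpus toolchain: the helper names parse and words are hypothetical, and the sketch assumes that every sentence token is wrapped in one unlabeled outer bracket, as in the examples above.

```python
import re

def parse(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    stack, roots = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            (stack[-1] if stack else roots).append(node)
        else:
            stack[-1].append(tok)
    # Assumption: each sentence token sits in an unlabeled wrapper; unwrap it.
    return [root[0] for root in roots]

def words(node):
    """Return the terminal strings of a node, left to right."""
    out = []
    for child in node[1:]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

overt_subjects = """
( (IP-MAT (NP-SBJ (PRO I)) (VP (VBD arrived)) (PUNC ,)))
( (IP-MAT (CONJ and) (NP-SBJ (PRO I)) (VP (VBD cleaned) (RP up)) (PUNC ,)))
( (IP-MAT (CONJ and) (ADVP-TMP (ADV then)) (NP-SBJ (PRO I)) (VP (VBD left)) (PUNC ,)))
"""

vp_conjuncts = """
( (IP-MAT (NP-SBJ (PRO I))
          (VP (VP (VBD arrived))
              (PUNC ,)
              (CONJP (VP (VBD cleaned) (RP up))
                     (PUNC ,)
                     (CONJP (CONJ and)
                            (VP (ADVP-TMP (ADV then))
                                (VP (VBD left))))))
          (PUNC ,)))
"""

for name, text in [("overt subjects", overt_subjects), ("VP conjuncts", vp_conjuncts)]:
    tokens = parse(text)
    print(name, "->", len(tokens), "sentence token(s)")
    for tok in tokens:
        print("   ", tok[0], " ".join(words(tok)))
```

Under these assumptions, the version with overt subjects yields three sentence tokens and the version with VP conjuncts yields one.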

Interjections, vocatives, and dangling categories

The following items are generally included in adjacent sentence tokens by the same speaker:

However, in the absence of a close connection between these items and an adjacent clause, these items stand alone. For instance, an interjection such as YES or NO may form a complete answer to a question, with subsequent material bearing no relation to the question, in which case the interjection and the subsequent material are annotated as separate tokens.

Direct speech

In multi-clause sequences of direct speech, the first direct speech clause belongs to the same sentence token as the governing matrix verb. Subsequent direct speech clauses are separate sentence tokens. Depending on the conventions of individual projects, the META constituent may be redundantly set off by double quotes.
( (IP-MAT (NP-SBJ (PRO He))
          (VP (VBD said)
	      (PUNC ,)
	      (QTP (IP-MAT (NP-SBJ (PRO I))
			   (VP (VBD came)))))
          (PUNC ,)))

( (QTP (IP-MAT (NP-SBJ (PRO I))
	       (VP (VBD saw)))
       (PUNC ,)))

( (QTP (IP-MAT (NP-SBJ (PRO I))
	       (VP (VBD conquered)))
       (PUNC .)))
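
Because a single stretch of direct speech is spread over several sentence tokens, a user may want to reassemble it. The following Python sketch illustrates one way to do so. The helper names (parse, words, outermost, quotation) are hypothetical, and the sketch assumes that continuation tokens are rooted in QTP, as in the example above.

```python
import re

def parse(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    stack, roots = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            (stack[-1] if stack else roots).append(node)
        else:
            stack[-1].append(tok)
    # Assumption: each sentence token sits in an unlabeled wrapper; unwrap it.
    return [root[0] for root in roots]

def words(node):
    """Return the terminal strings of a node, left to right."""
    out = []
    for child in node[1:]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

def outermost(node, label):
    """Yield the outermost nodes bearing `label` (no descent into a match)."""
    if node[0] == label:
        yield node
        return
    for child in node[1:]:
        if not isinstance(child, str):
            yield from outermost(child, label)

def quotation(tokens):
    """Collect the quoted words of a direct speech sequence starting at tokens[0]."""
    quoted = [w for qtp in outermost(tokens[0], "QTP") for w in words(qtp)]
    for tok in tokens[1:]:
        if tok[0] != "QTP":  # the sequence ends at the first token not rooted in QTP
            break
        quoted.extend(words(tok))
    return quoted

tokens = parse("""
( (IP-MAT (NP-SBJ (PRO He))
          (VP (VBD said) (PUNC ,)
              (QTP (IP-MAT (NP-SBJ (PRO I)) (VP (VBD came)))))
          (PUNC ,)))
( (QTP (IP-MAT (NP-SBJ (PRO I)) (VP (VBD saw))) (PUNC ,)))
( (QTP (IP-MAT (NP-SBJ (PRO I)) (VP (VBD conquered))) (PUNC .)))
""")
print(" ".join(quotation(tokens)))
```

Note that the comma after the first direct speech clause is not collected, since in the example it is attached to the governing IP-MAT rather than to the QTP.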

Exceptional cases

Asyndetic degree head clause

Close pragmatic links between two clauses are possible not only when those clauses are explicitly conjoined, as in the case of pseudo-imperatives, but also with asyndetic clauses (ones not explicitly conjoined). When a matrix clause is interpreted as the semantic complement of a degree head in a following matrix clause, the second clause is given its usual structure, enclosed in PAREN brackets, and included in the same sentence token as the first clause.
( (IP-MAT (NP-SBJ (D The) (N river))
	  (VP (VBD froze))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO it))
			 (VP (BED was)
			     (ADJP-PRD (ADVP (ADVR so))
				       (ADJ cold)))))
	  (PUNC .)))

Integrated adnominal IP-MAT

Another type of close but asyndetic discourse link is the case where a clause in the form of an IP-MAT functions very much like a relative clause, modifying a noun in a preceding clause. In this case, too, the second matrix clause is given its usual structure. It is attached in the same way as an ordinary relative clause would be: either as a sister of the noun, or (if the clause is not adjacent to the noun) as close and low to the noun as structurally possible. To explicitly indicate the second clause's function, its IP-MAT label is given the additional dash tag -REL.
( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD met)
	      (NP-OB1 (D a) (N guy)
		      (PUNC ,)
		      (IP-MAT-REL (NP-SBJ (PRO he))
				  (VP (BED was)
				      (ADJP-PRD (ADJ able)
						(IP-INF (TO to)
							(VP (VB show)
							    (NP-OB2 (PRO them)))))))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD met)
	      (NP-OB1 (D a) (N guy)
		      (PUNC ,)
		      (IP-MAT-REL (NP-SBJ (PRO I))
				  (VP (MD ca@)
				      (NEG @n't)
				      (VP (VB remember)
					  (NP-OB1 (NP-POS (PRO$ his))
						  (N name))
					  (PP (P at)
					      (NP (D the) (N moment))))))))
	  (PUNC .)))

We adopt the IP-MAT-REL label over other annotation alternatives (see below) because we wish to make the retrieval of these cases as convenient as possible. In particular, we wish to facilitate cross-linguistic study of the phenomenon, which has been noted in German, where it is very striking given that language's V2 character.

(1) a. Es war einmal ein König, der sieben Kinder hatte.
       expl-it was once a king who/this-one seven children had
       Ordinary relative: 'Once, there was a king who had seven children.'
    b. Es war einmal ein König, der hatte sieben Kinder.
       Integrated V2 relative: 'Once, there was a king; he had seven children.'

The two annotation alternatives that we do not adopt are as follows. First, we could simply enclose the second clause in PAREN brackets. But this would not allow a targeted retrieval of the cases of interest, as parenthetical clauses can attach as sisters of N without being adnominal modifiers.

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD met)
	      (NP-OB1 (D a) (N guy)
		      (PUNC ,)
		      (PAREN (IP-MAT (NP-SBJ (PRO it))
				     (VP (BED was)
					 (NP-PRD (ADJP (ADJ sheer))
						 (N coincidence)))))
		      (PUNC ,)
		      (PP (P with)
			  (NP (D the)
			      (ADJP (ADJ right))
			      (N kind)
			      (PP (P of)
				  (NP (N equipment)))))))
	  (PUNC .)))

Second, we could annotate IP-MAT-RELs as zero-marked relative clauses containing a resumptive pronoun. Though this would allow targeted retrieval, the IP-MAT-REL label is more convenient.
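
The retrieval argument can be illustrated with a small Python sketch. It is not a CorpusSearch query. The helper names parse and hits are hypothetical, and the search simply looks for a node with a given label whose mother's label begins with NP.

```python
import re

def parse(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    stack, roots = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            (stack[-1] if stack else roots).append(node)
        else:
            stack[-1].append(tok)
    # Assumption: each sentence token sits in an unlabeled wrapper; unwrap it.
    return [root[0] for root in roots]

def hits(node, label, mother_prefix, mother=None):
    """Yield nodes labeled `label` whose mother's label starts with `mother_prefix`."""
    if node[0] == label and mother is not None and mother[0].startswith(mother_prefix):
        yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from hits(child, label, mother_prefix, node)

adnominal = parse("""
( (IP-MAT (NP-SBJ (PRO They))
          (VP (VBD met)
              (NP-OB1 (D a) (N guy)
                      (PUNC ,)
                      (IP-MAT-REL (NP-SBJ (PRO he))
                                  (VP (BED was)
                                      (ADJP-PRD (ADJ able)
                                                (IP-INF (TO to)
                                                        (VP (VB show)
                                                            (NP-OB2 (PRO them)))))))))
          (PUNC .)))
""")[0]

parenthetical = parse("""
( (IP-MAT (NP-SBJ (PRO They))
          (VP (VBD met)
              (NP-OB1 (D a) (N guy)
                      (PUNC ,)
                      (PAREN (IP-MAT (NP-SBJ (PRO it))
                                     (VP (BED was)
                                         (NP-PRD (ADJP (ADJ sheer))
                                                 (N coincidence)))))
                      (PUNC ,)
                      (PP (P with)
                          (NP (D the) (ADJP (ADJ right)) (N kind)
                              (PP (P of) (NP (N equipment)))))))
          (PUNC .)))
""")[0]

# The dedicated label picks out the adnominal clause and nothing else.
print(len(list(hits(adnominal, "IP-MAT-REL", "NP"))))
print(len(list(hits(parenthetical, "IP-MAT-REL", "NP"))))
# A PAREN daughter of NP also matches the non-adnominal parenthetical.
print(len(list(hits(parenthetical, "PAREN", "NP"))))
```

In this sketch the IP-MAT-REL search matches only the adnominal example, whereas a search for PAREN under NP also matches the "sheer coincidence" parenthetical, which does not modify the noun.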

Parenthetical question

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBP like)
	      (NP-OB1 (PRO that)))
	  (PUNC ,)
          (PAREN (CP-QUE-MAT (IP-SUB (DOP do)
				     (NP-SBJ (PRO they)))))
	  (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))
          (VP (MD would)
              (VP (VB like)
                  (NP-OB1 (PRO that))))
	  (PUNC ,)
          (PAREN (CP-QUE-MAT (IP-SUB (DOP do@)
				     (NEG @n't)
				     (NP-SBJ (PRO you))
				     (VP (VB think)))))
	  (PUNC ?)))

Pseudo-imperative and similar cases

In the so-called pseudo-imperative, an imperative followed by a matrix declarative corresponds pragmatically to the protasis (IF clause) and apodosis (THEN clause) of a conditional construction. In order to represent the tight link between the two clauses and to allow these cases to be retrieved using CorpusSearch, pseudo-imperatives are exceptionally annotated as single sentence tokens, even though they consist of two independent conjoined clauses. In doubtful cases, the default is to annotate as a single token, since otherwise the examples would not be retrievable with CorpusSearch.
( (IP-IMP (IP-IMP (VP (VBI Cross)
		      (NP-OB1 (D the)
			      (N line))))
	  (PUNC ,)
	  (CONJP (CONJ and)
		 (IP-MAT (NP-SBJ (PRO I))
			 (VP (VBP shoot))))
	  (PUNC .)))

( (IP-IMP (IP-IMP (VP (VBI Stay)
		      (PP (P behind)
			  (NP (D the)
			      (N line)))))
	  (PUNC ,)
	  (CONJP (CONJ or)
		 (IP-MAT (NP-SBJ (PRO I))
			 (VP (VBP shoot))))
	  (PUNC .)))

Pragmatically analogous cases where the first conjunct belongs to some category other than IP-IMP are treated the same way.

( (FRAG (NP (NP (NUMP (NUM One))
		(QP (QR more))
		(N step))
	    (PUNC ,)
	    (CONJP (CONJ and)
		   (IP-MAT (NP-SBJ (PRO I))
			   (VP (VBP shoot)))))
	(PUNC .)))

( (FRAG (FRAG (NP (NUMP (NUM One))
		  (N step))
	      (ADVP (ADVR closer)))
	(PUNC ,)
	(CONJP (CONJ and)
	       (IP-MAT (NP-SBJ (PRO I))
		       (VP (VBP shoot))))
	(PUNC .)))
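
Annotating these cases as single tokens means that a search confined to one token can find them. The following Python sketch shows the idea. It is not a CorpusSearch query, and the helper names (parse, nodes, has_clausal_second_conjunct) are hypothetical. As a rough heuristic, the test looks for any node other than an IP-MAT that has a CONJP daughter immediately dominating an IP-MAT.

```python
import re

def parse(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    stack, roots = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            (stack[-1] if stack else roots).append(node)
        else:
            stack[-1].append(tok)
    # Assumption: each sentence token sits in an unlabeled wrapper; unwrap it.
    return [root[0] for root in roots]

def nodes(node):
    """Yield `node` and all of its non-terminal descendants."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def has_clausal_second_conjunct(token):
    """True if a non-IP-MAT node has a CONJP daughter immediately dominating IP-MAT."""
    for node in nodes(token):
        if node[0].startswith("IP-MAT"):
            continue
        for child in node[1:]:
            if isinstance(child, str) or child[0] != "CONJP":
                continue
            if any(not isinstance(g, str) and g[0] == "IP-MAT" for g in child[1:]):
                return True
    return False

tokens = parse("""
( (IP-IMP (IP-IMP (VP (VBI Cross) (NP-OB1 (D the) (N line))))
          (PUNC ,)
          (CONJP (CONJ and) (IP-MAT (NP-SBJ (PRO I)) (VP (VBP shoot))))
          (PUNC .)))
( (FRAG (NP (NP (NUMP (NUM One)) (QP (QR more)) (N step))
            (PUNC ,)
            (CONJP (CONJ and) (IP-MAT (NP-SBJ (PRO I)) (VP (VBP shoot)))))
        (PUNC .)))
( (IP-MAT (NP-SBJ (PRO I)) (VP (VBD left)) (PUNC ,)))
( (IP-MAT (CONJ and) (NP-SBJ (PRO she)) (VP (VBD arrived)) (PUNC ,)))
""")

for tok in tokens:
    print(tok[0], has_clausal_second_conjunct(tok))
```

The two ordinary matrix clauses at the end are separate sentence tokens and are not matched. Had the pseudo-imperative been split in the same way, no search within a single token could relate its two clauses.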

Reprise construction

( (IP-MAT (NP-SBJ (D The)
                  (NS kids))
	  (VP (VBP like)
	      (NP-OB1 (PRO that)))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO they))
			 (VP (DOP do))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBP like)
	      (NP-OB1 (PRO that)))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (D the)
				 (NS kids))
			 (VP (DOP do))))
	  (PUNC .)))

In order to facilitate retrievability and comparison, "reverse" reprise constructions are also treated as single tokens.

( (IP-MAT (NP-SBJ (D The)
                  (NS kids))
          (VP (DOP do))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO they))
			 (VP (VBP like)
			     (NP-OB1 (PRO that)))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (PRO They))
          (VP (DOP do))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (D the)
				 (NS kids))
			 (VP (VBP like)
			     (NP-OB1 (PRO that)))))
	  (PUNC .)))

Tag question

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBP like)
	      (NP-OB1 (PRO that)))
	  (PUNC ,)
          (PAREN (CP-QUE-TAG (IP-SUB (DOP do@)
				     (NEG @n't)
				     (NP-SBJ (PRO they)))))
	  (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))
          (VP (MD would)
              (VP (VB like)
                  (NP-OB1 (PRO that))))	
	  (PUNC ,)
          (PAREN (CP-QUE-TAG (IP-SUB (MD would@)
				     (NEG @n't)
				     (NP-SBJ (PRO they)))))
	  (PUNC ?)))