Disfluencies and parentheticals

The following labels and dash tags are used to annotate disfluencies. The treatment of false starts is based on Hindle 1983.

Hindle, Don. 1983. Deterministic parsing of syntactic non-fluencies. In: Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics. 123–128. https://doi.org/10.3115/981311.981336.

FS (false start)

FS indicates a false start and is the default annotation for syntactic disfluencies. False starts can involve incomplete words, copies of single words and strings, and incomplete structures followed by a reprise containing (largely) parallel structures. False starts are determined by eliminating disfluent material until the result is a grammatical structure. When there is more than one way to eliminate disfluent material, as there generally is, it is early material that is eliminated rather than late material (hence, the term "false start"). In other words, the last copy in a sequence of copies counts as the text; the earlier copies count as false starts (Hindle 1983, p. 125, section 4.1). Material that is part of a false start remains available for computation in other modules (Hindle 1983, p. 126).

In practical terms, the most efficient way to determine the boundaries of a false start is to start at the end of a sentence token and work backwards.
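
As a rough illustration of this right-to-left procedure, the following Python sketch marks the earlier copy of any adjacent repeated word sequence as a false start. It is a string-matching approximation only. The function names are hypothetical, interjections and punctuation are assumed to have been removed beforehand, and the grammaticality judgments that actual annotation relies on are not modelled.

```python
def matches(early, late):
    """An earlier word copies a later one if the two are identical, or if the
    earlier word is a hyphen-final fragment that the later word begins with."""
    if early.endswith("-") and len(early) > 1:
        return late.lower().startswith(early[:-1].lower())
    return early.lower() == late.lower()


def mark_false_starts(words):
    """Return the sorted indices of words judged to belong to false starts.

    Assumption: `words` holds only the words of one sentence token
    (no punctuation, no interjections).
    """
    kept = list(range(len(words)))   # indices still counted as text
    flagged = set()
    i = len(kept) - 1                # start at the end of the token
    while i > 0:
        removed = False
        # Try the longest possible copy first, then shorter ones.
        for n in range(min(i, len(kept) - i), 0, -1):
            early, late = kept[i - n:i], kept[i:i + n]
            if all(matches(words[a], words[b]) for a, b in zip(early, late)):
                flagged.update(early)    # the early copy is the false start
                del kept[i - n:i]
                i -= n                   # stay on the same (later) copy
                removed = True
                break
        if not removed:
            i -= 1                       # move one word to the left
    return sorted(flagged)


words = ["They", "said", "sh-", "she", "she", "had", "the", "bold", "hives"]
print([words[i] for i in mark_false_starts(words)])   # ['sh-', 'she']
```

Because it compares strings only, the sketch cannot tell a false start from a deliberate repetition; that distinction is left to the annotator.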

FS can be used as a POS label to tag incomplete words, which are marked as such with a trailing hyphen. In general, however, FS encloses sequences of words, including ones labelled as FS. False starts attach as high as is structurally possible. (This is to eliminate needlessly time-consuming attachment decisions that will inevitably result in inconsistencies.)

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD said)
	      (CP-THT (C 0)
		      (FS (FS sh-) (PRO she))
		      (IP-SUB (NP-SBJ (PRO she))
			      (VP (HVD had)
				  (NP-OB1 (D the)
                                          (N-COMP (N bold) (NS hives)))))))
	  (PUNC .)))
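
Since the material left after eliminating false starts is meant to be a grammatical structure, the fluent text of a token can be recovered by skipping every FS node. The following Python sketch shows one way to do this for the tree above. The parser and the helper names are hypothetical, and treating "0" and *-initial items as empty categories rather than words is an assumption.

```python
import re


def parse(text):
    """Parse one bracketed token into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                          # consume ")"
        return items

    return node()


def terminal_words(tree, skip=("FS",)):
    """Collect terminal words, omitting subtrees whose label is in `skip`."""
    head = tree[0] if tree and isinstance(tree[0], str) else None
    if head in skip:
        return []
    out = []
    for child in (tree[1:] if head else tree):
        if isinstance(child, list):
            out.extend(terminal_words(child, skip))
        elif child != "0" and not child.startswith("*"):
            out.append(child)             # assumption: "0" and "*..." are empty
    return out


TREE = """
( (IP-MAT (NP-SBJ (PRO They))
          (VP (VBD said)
              (CP-THT (C 0)
                      (FS (FS sh-) (PRO she))
                      (IP-SUB (NP-SBJ (PRO she))
                              (VP (HVD had)
                                  (NP-OB1 (D the)
                                          (N-COMP (N bold) (NS hives)))))))
          (PUNC .)))
"""

print(" ".join(terminal_words(parse(TREE))))            # They said she had the bold hives .
print(" ".join(terminal_words(parse(TREE), skip=())))   # They said sh- she she had the bold hives .
```

Passing an empty `skip` keeps the disfluent material, which reflects the point that false starts remain available in the annotation rather than being deleted.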

Interjections do not count as part of false starts.

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD said)
	      (CP-THT (C 0)
		      (FS (FS sh-) (PRO she))
		      (PUNC ,)
		      (INTJ uh)
		      (PUNC ,)
		      (FS (PRO she) (HVD had))
		      (IP-SUB (NP-SBJ (PRO she))
			      (VP (HVD had)
				  (NP-OB1 (D the)
                                          (N-COMP (N bold) (NS hives)))))))
	  (PUNC .)))

With some exceptions in the case of very long false starts, the internal constituent structure of false starts is not annotated.


( (IP-MAT (INTJ Uh)
	  (PUNC ,)
	  (FS (NP-SBJ (PRO I))
	      (VP (VBP think)
		  (CP-THT (C 0)
			  (IP-SUB (NP-SBJ (D the)
					  (N story)
					  (PP (P of)
					      (NP (CP-FRL (WNP-1 (WPRO what@))
							  (IP-SUB (NP-SBJ *T*-1)
								  (VP (HVP @'s)
								      (VP (VBN happened)
									  (PP (P to)
									      (NP (PRO me))))))))))
				  (VP (MD will)
				      (VP (VB make)
					  (NP-OB1 (D a)
						  (ADJP (ADVP (ADV very))
							(FS in-)))))))))
	  (INTJ oh)
	  (PUNC ,)
	  (NP-SBJ (PRO I))
	  (VP (VBP know)
	      (CP-THT (C 0)
		      (IP-SUB (NP-SBJ (PRO it@))
			      (VP (MD @'ll)
				  (VP (VB make)
				      (NP-OB1 (D a)
					      (ADJP (ADVP (ADV very))
						    (ADJ interesting))
					      (N book)))))))
	  (PUNC .)))

BREAK

BREAK indicates that a constituent is broken off without the correction or reprise that would make it a false start. The material after BREAK is generally a separate sentence token, though token-internal BREAK is attested.

( (IP-MAT (NP-SBJ (PRO They))
          (VP (MD $could)
	      (NEG $n't)
	      (VP (VB believe)
		  (CP-THT (C 0)
			  (IP-SUB (NP-SBJ (PRO we))
				  (VP (VBD bought)
				      (NP-OB1 (D a) (CODE <BREAK>)))))))))	← token-final BREAK

( (IP-MAT (NP-SBJ (D the)
		  (N-COMP (N church) (N house)))
	  (VP (BED was)
	      (NP-MSR (FP just)
		      (D a)
		      (ADJP (ADJ little))
		      (CODE <BREAK>))						← token-internal BREAK
	      (CODE <$$RNapier_xmax=746.16>)
	      (CODE )
	      (ADVP-LOC (CP-FRL (WADVP-1 (WADV where))
				(IP-SUB (NP-SBJ (PRO I))
					(VP (BED =uz)
					    (VP (ADVP-LOC *T*-1)
						(VAN raised)))))))
	  (PUNC .)))
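
The difference between token-final and token-internal BREAK can be checked mechanically by asking whether any word follows the <BREAK> code within the same token. The Python sketch below does this with a regular expression over the leaves of a bracketed token. The function names are hypothetical, and the decision to ignore PUNC, CODE, "0" and *-initial leaves when looking for following words is an assumption. The two strings are abridged versions of the trees above.

```python
import re

# A leaf is a bracket holding exactly a tag and a word, e.g. (D a).
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def is_word(tag, word):
    """Assumption: punctuation, codes, zero elements and traces are not words."""
    return tag not in ("PUNC", "CODE") and word != "0" and not word.startswith("*")


def break_positions(token_text):
    """Classify each <BREAK> in one sentence token as 'final' or 'internal'."""
    leaves = LEAF.findall(token_text)
    result = []
    for i, (tag, word) in enumerate(leaves):
        if tag == "CODE" and word == "<BREAK>":
            followed = any(is_word(t, w) for t, w in leaves[i + 1:])
            result.append("internal" if followed else "final")
    return result


FINAL = "(VP (VBD bought) (NP-OB1 (D a) (CODE <BREAK>)))"
INTERNAL = (
    "(VP (BED was) (NP-MSR (FP just) (D a) (ADJP (ADJ little)) (CODE <BREAK>)) "
    "(ADVP-LOC (CP-FRL (WADVP-1 (WADV where)) (IP-SUB (NP-SBJ (PRO I)) "
    "(VP (BED =uz) (VP (ADVP-LOC *T*-1) (VAN raised)))))))"
)

print(break_positions(FINAL))      # ['final']
print(break_positions(INTERNAL))   # ['internal']
```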

Ordinary ellipsis is not annotated with BREAK.

( (IP-MAT (NP-SBJ (PRO I))
	  (VP (VBD said)
	      (CP-THT (C that)
		      (IP-SUB (NP-SBJ (PRO I))
			      (VP (MD would)
				  (VP (VB help))))))
	  (PUNC ,)))

( (IP-MAT (CONJ and)
	  (NP-SBJ (PRO I))
	  (VP (DOD did))			← no explicit indication of elided main verb 
	  (PUNC ,)))

ELAB (elaboration)

Parenthetical constituents that elaborate on a preceding constituent (without being exact or even close copies) are marked as ELAB. In contrast to false starts, the internal structure of elaborations is fully annotated.

Elaborations attach as daughters of the constituent they are construed with if that is structurally possible. Otherwise, they attach as low and close as they can to the constituent. Eventually, the annotation will include an *ICH* trace.

It is sometimes difficult to distinguish between elaborations and conjuncts. In general, conjunction structures are marked with at least one overt conjunction (if there is only one, it generally introduces the last conjunct). Asyndetic phrases are therefore more likely to be elaborations than conjuncts. Accordingly, unless a phrase is easily interpreted as a non-final conjunct, cases that are ambiguous between elaboration and conjunction structure are annotated as ELAB by default.

( (IP-MAT (NP-SBJ (PRO It@))
  	  (VP (BEP @s)
	      (NP-PRD (D a) (N problem)
		      (ELAB (NP (D a)
				(ADJP (ADJ real))
				(N problem)))))
	  (PUNC .)))

( (IP-MAT (CONJ and)
	  (NP-TMP (D a)
		  (N lot)
		  (PP (P of)
		      (NP (NS times))))
	  (NP-SBJ (PRO hit@))
	  (VP (MD @'ll)
	      (VP (VP (VB fall))
		  (CONJP (CONJ and)
			 (VP (VB kill)
			     (NP-OB1 (D the) (NS people))))
		  (PUNC ,)
		  (ELAB (VP (GT get)
			    (ADJP-PRD (ADJ loose))))))
	  (PUNC ,)))

( (IP-MAT (INTJ Well)
	  (PUNC ,)
	  (PP (P for)
	      (NP (D a) (N while)))
	  (PUNC ,)
	  (NP-SBJ (PRO they@))
	  (VP (HVD @'d)
	      (VP (A a=)
		  (VBN used)
		  (NP-OB1 (NS mules))
		  (PUNC ,)
		  (ELAB (VP (VBN pulled)
			    (NP-OB1 (PRO it))
			    (RP out)
			    (PP (P with) (CODE <BREAK>))))))
	  (PUNC ,)))

( (IP-MAT (INTJ Hmm)
	  (PUNC ,)
	  (INTJ well)
	  (PUNC ,)
	  (CP-ADV (C after)
		  (IP-SUB (NP-SBJ (PRO they))
			  (VP (VBD quit)
			      (NP-OB1 (N work)))))
	  (PUNC ,)
	  (NP-SBJ (PRO I))
	  (VP (VBP imagine)
	      (CP-THT (C 0)
		      (IP-SUB (NP-SBJ (D the)
				      (QP (QS most))
				      (PP (P of)
					  (NP (PRO them))))
			      (VP (VBD drifted)
				  (ADVP-DIR (ADV away))
				  (PUNC ,)
				  (ELAB (VP (VBD left)))))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (NPR Indianapolis)
		  (PUNC ,)
		  (ELAB (PP (P to)
			    (NP (NPR Indiana)))))
	  (PUNC ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO I))
			 (VP (VBP imagine))))
	  (VP (MD =ud)
	      (VP (BE be)
		  (NP-PRD (ADVP (ADV about))
			  (D the)
			  (ADJP (ADJS farthest))
			  (N place))))
	  (PUNC .)))

REP (repetition)

When a constituent (including a clause) is repeated for emphasis or other rhetorical reasons, the second and subsequent copies are marked as repetitions (REP). As with false starts, the internal structure of repetitions is not annotated.

The repetition must be exact; near repetitions are annotated as elaborations (ELAB). Otherwise, the conventions for false starts (FS) are applied.

The attachment of repetitions obeys the same rules as for elaborations.

In order to enable searches for exact sentential repetition, clauses can be enclosed in REP.

( (IP-MAT (NP-SBJ (PRO It@))
  	  (VP (BEP @s)
	      (NP-PRD (D a) (N problem)
		      (PUNC ,)
		      (REP (D a) (N problem))
		      (PUNC ,)
		      (REP (D a) (N problem))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (PRO I))
  	  (VP (DOP do))
	  (PUNC ,)
	  (REP (PRO I) (DOP do))
	  (PUNC .)))
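
The exactness requirement lends itself to a simple consistency check: the words inside a REP node should be identical to the words that immediately precede it. The Python sketch below walks a bracketed token and reports each REP together with the preceding words it is compared against. The function name is hypothetical, punctuation and codes are ignored by assumption, and the second tree is a constructed example of a near repetition that would be better labelled ELAB.

```python
import re

NON_WORD_TAGS = ("PUNC", "CODE")   # assumption: these leaves are not words


def check_repetitions(text):
    """Return (rep_words, preceding_words, is_exact) for every REP node."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    words = []      # lower-cased words seen so far, in order
    stack = []      # (label, number of words seen when the node was opened)
    results = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if i + 1 < len(tokens) and tokens[i + 1] not in ("(", ")"):
                stack.append((tokens[i + 1], len(words)))
                i += 1                           # the label has been consumed
            else:
                stack.append(("", len(words)))   # unlabelled wrapper bracket
        elif tok == ")":
            label, start = stack.pop()
            if label == "REP":
                rep = words[start:]
                lo = start - len(rep)
                prev = words[lo:start] if lo >= 0 else []
                results.append((rep, prev, rep == prev))
        elif stack[-1][0] not in NON_WORD_TAGS:
            words.append(tok.lower())            # a terminal word
        i += 1
    return results


EXACT = "( (IP-MAT (NP-SBJ (PRO I)) (VP (DOP do)) (PUNC ,) (REP (PRO I) (DOP do)) (PUNC .)))"
NEAR = ("( (IP-MAT (NP-SBJ (PRO It@)) (VP (BEP @s) (NP-PRD (D a) (N problem) "
        "(REP (D a) (ADJ real) (N problem)))) (PUNC .)))")

print(check_repetitions(EXACT))   # [(['i', 'do'], ['i', 'do'], True)]
print(check_repetitions(NEAR))    # [(['a', 'real', 'problem'], ['@s', 'a', 'problem'], False)]
```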

PAREN (parenthetical)

Parenthetical expressions other than those marked by ELAB or REP are enclosed in PAREN brackets. The following list is representative, but not intended to be exhaustive.

I DON'T KNOW, I WOULD SAY, IT SEEMS (LIKE), LET'S SEE, LOOK, SEE, WAIT, YOU KNOW, YOU SEE

PAREN is not the default annotation; in other words, when verbs in these expressions can be interpreted as taking ordinary complements, the default is to annotate them that way.

( (IP-MAT (NP-SBJ (PRO It))
	  (VP (VBP seems)
	      (CP-THT (C 0)                            ← default - no PAREN
		      (IP-SUB (NP-SBJ (D that@))
			      (VP (BEP @'s)
				  (NP-PRD (D a)
					  (N problem))))))
	  (. .)))

( (IP-MAT (NP-SBJ (D that@))
	  (VP (BEP @'s)
	      (NP-PRD (D a)
		      (N problem)))
	  (, ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO it))
			 (VP (VBP seems))))
	  (. .)))

( (IP-MAT (NP-SBJ (D that))
	  (, ,)
	  (PAREN (IP-MAT (NP-SBJ (PRO it))
			 (VP (VBP seems))))
	  (, ,)
	  (VP (BEP is)
	      (NP-PRD (D a)
		      (N problem)))
	  (. .)))

PAREN also encloses the following constructions. Again, the list is intended to be illustrative rather than exhaustive.

( (IP-MAT (NP-SBJ (PRO they@))
	  (VP (MD @'d)
	      (VP (VB issue)
		  (NP-OB1 (D that))
		  (PUNC ,)
		  (PAREN (IP-MAT (NP-SBJ (PRO you))
				 (VP (VBP know))))
		  (PUNC ,)
		  (PP (P at)
		      (NP (D the) (N office)))))
	  (PUNC .)))

( (CP-QUE-MAT (CONJ And)
	      (PUNC ,)
	      (PAREN (IP-IMP (VP (VBI let@)
				 (IP-ECM (NP-SBJ (PRO @'s))
					 (VP (VB see))))))
	      (PUNC ,)
	      (WNP-1 (WD what)
		     (ADJP (ADJ other))
		     (N-COMP (N childhood) (NS diseases)))
	      (IP-SUB (IGNORE-BEP-2 are)
		      (NP-SBJ (EX there))
		      (VP (BEP *-2)
			  (NP-LGS *T*-1)))
	      (PUNC ?)))

Instances of PAREN are generally clauses (IP or CP) or instances of parenthetical gapping. There are a handful of exceptions, which are arguably better treated as elaborations (ELAB).

( (PP (PP (P to)
	  (NP (NPR Verda)))
      (PUNC ,)
      (CONJP (CONJ or)
	     (NP (ADJP (ADJR further))
		 (PUNC ,)
		 (ELAB (PP (ADVP (ADV almost))
			   (PAREN (NP (QP (Q some))
				      (PP (P of)
					  (NP (PRO them)))))
			   (P from)
			   (NP (NUMP (ADVP (ADV about))
				     (NUM seven))
			       (NS mile))))))))
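
Exceptions of this kind can be located by listing PAREN nodes whose first daughter is not a clause. The Python sketch below is one way to do so. The function name is hypothetical, and only the first daughter is inspected, so anything it returns is a candidate for review rather than an error (parenthetical gapping, for instance, may also be reported).

```python
import re

# Capture the label of the first daughter of each PAREN node.
PAREN_DAUGHTER = re.compile(r"\(PAREN\s+\(([^\s()]+)")


def nonclausal_parens(text):
    """Return first-daughter labels of PAREN nodes that are not IP or CP."""
    labels = PAREN_DAUGHTER.findall(text)
    return [label for label in labels if not label.startswith(("IP", "CP"))]


CLAUSAL = "(PAREN (IP-MAT (NP-SBJ (PRO it)) (VP (VBP seems))))"
PHRASAL = "(PAREN (NP (QP (Q some)) (PP (P of) (NP (PRO them)))))"

print(nonclausal_parens(CLAUSAL))   # []
print(nonclausal_parens(PHRASAL))   # ['NP']
```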

SPCH (speaker change)

In general, overlapping speech and interruptions are edited in the parse so that material from one speaker is followed by material from another. In some cases, though, the overlapping speech or interruption is a backchannel cue or similarly minor material, and inserting a token break raises more problems than it solves. There are also cases where the general editing convention leads to a very odd-sounding sequence (as when the first speaker reacts to an interruption from the second). We therefore allow short overlapping sequences or interruptions to be marked by SPCH (speaker change) and to form part of a surrounding single sentence token.
( (IP-MAT (NP-SBJ (PRO They))
	  (VP (MD 0)
	      (VP (FP just)
		  (CODE <$$MJohnson_xmax=347.34>)
		  (SPCH (CODE <JReynolds_xmin=345.9>) (INTJ Right) (PUNC ,) (CODE <$$JReynolds_xmax=346.18>))
		  (CODE <$$overlap>)
		  (CODE <MJohnson_xmin=347.95>)
		  (VB take)
		  (NP-OB1 (D the) (N test))))
	  (PUNC ,)))