See also Using punctuation to aid the parser.

Punctuation and symbols


For expository simplicity, we refer to the punctuation marks of interest in this section by name in uppercase (for example, COMMA rather than ",").

The following discussion assumes a distinction between orthographic sentences and (sentence) tokens. Orthographic sentences are word strings delimited by PERIOD, QUESTION MARK, or EXCLAMATION POINT (but not COMMA). Sentence tokens are structures delimited in accordance with the guidelines in Division into sentence tokens. Orthographic sentences may coincide with tokens, but often span more than one sentence token. Conversely, tokens need not be full sentences (as in the case of short answers to questions, other sentence fragments, and so on); see FRAG for details.

Contrary to conventional usage, punctuation is split off from words and tagged as PUNC except in the following cases:

By default, punctuation precedes timestamps.

By default, punctuation attaches as high as is structurally possible. Attempting to put punctuation where it belongs is too difficult to be worthwhile. Default high attachment is easy to enforce automatically, so it is not worth spending time on during the correction process.
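As a rough illustration of how attachment height might be inspected automatically, the following Python sketch parses one bracketed token and reports how deeply each PUNC node is embedded. The helper names `parse` and `punc_depths` are hypothetical and not part of any corpus tool. The sketch assumes that the tree contains no parentheses inside words and no trailing annotations.

```python
def parse(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    symbols = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening parenthesis
        items = []
        while symbols[pos] != ")":
            if symbols[pos] == "(":
                items.append(node())
            else:
                items.append(symbols[pos])
                pos += 1
        pos += 1  # skip the closing parenthesis
        return items

    return node()


def punc_depths(node, depth=0):
    """Yield (mark, depth) for every PUNC leaf; depth 1 is a daughter of the token root."""
    if node[0] == "PUNC":
        yield node[1], depth
        return
    for child in node[1:]:
        if isinstance(child, list):
            yield from punc_depths(child, depth + 1)


tree = parse(
    '( (IP-MAT (NP-SBJ (PRO She)) (VP (VBD said) (PUNC ,) (PUNC ") '
    '(QTP (IP-MAT (NP-SBJ (PRO I@)) (VP (BEP @m) (VP (VAG coming)))))) '
    '(PUNC ") (PUNC .)))'
)
root = tree[0]  # unwrap the outer, unlabeled parentheses
report = list(punc_depths(root))
print(report)  # [(',', 2), ('"', 2), ('"', 1), ('.', 1)]
```

A depth greater than 1 is not necessarily an error, since token-medial punctuation may have no higher attachment site available. The report is only meant to single out candidates for review.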

Finally, though it is tempting to do so, users of the corpus should not take punctuation to be a reliable proxy for intonation. This is a special case of the general maxim that the transcribed version of the corpus should not be used as the (only) basis for phonological analysis.


Period (.)

PERIOD (.) is the default sentence-final punctuation. But sentence tokens are often also delimited by COMMA.

( (IP-MAT (NP-SBJ (D The) (N cat))
          (VP (BED was)
              (ADJP-PRD (ADJP (ADJ black))
                        (CONJP (CONJ and)
                               (ADJP (ADJ fluffy)))))
          (PUNC .)))

In AAPCAppE and CoNYCE, PERIOD is not used for acronyms or initials.

( (NP (N-COMP (NPR T)
              (NPR S)
              (NPR Eliot))))

( (NP (NPR YMCA)))

( (NP (N-COMP (NPR P)
              (NPR S)
              (NPR Ten))))

Comma (,)

COMMA (,) is the most commonly used punctuation mark. It is the default punctuation mark everywhere except sentence-finally. Following standard (American) convention, COMMA is used for long sentences consisting of more than one token, notably in cases of clausal conjunction.

( (IP-MAT (NP-SBJ (D The)
		  (ADJP (ADJ black))
		  (N cat))
          (VP (VBD played)
              (PP (P with)
		  (NP (D the)
		      (ADJP (ADJ gray))
		      (N one))))
          (PUNC ,)))

( (IP-MAT (NP-SBJ (D the)
		  (ADJP (ADJ white))
		  (N cat))
          (VP (VBD groomed)
	      (NP-OB1 (D the)
		      (N puppy)))
          (PUNC ,)))

( (IP-MAT (CONJ and)
	  (NP-SBJ (D the)
		  (ADJP (ADJ Siamese))
		  (N cat))
          (VP (VBD slept)
              (PP (P on)
                  (NP (D the) (N radiator))))
          (PUNC .)))
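The relation between sentence tokens and orthographic sentences can be sketched in Python as follows. The three tokens above end in COMMA, COMMA, and PERIOD, and so together form a single orthographic sentence. The function names are hypothetical, and the trees are abbreviated versions of the examples in this section. The sketch assumes that the last PUNC node in a token is its delimiter.

```python
import re

# PERIOD, QUESTION MARK, EXCLAMATION POINT; COMMA is deliberately excluded
SENTENCE_FINAL = {".", "?", "!"}


def final_punc(tree_text):
    """Return the last PUNC mark in a bracketed token, or None if there is none."""
    marks = re.findall(r"\(PUNC\s+([^\s()]+)\)", tree_text)
    return marks[-1] if marks else None


def group_into_orthographic_sentences(tokens):
    """Group consecutive tokens until one ends in sentence-final punctuation."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if final_punc(token) in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without sentence-final punctuation
        sentences.append(current)
    return sentences


tokens = [
    "( (IP-MAT (NP-SBJ (D The) (N cat)) (VP (VBD played)) (PUNC ,)))",
    "( (IP-MAT (NP-SBJ (D the) (N cat)) (VP (VBD groomed)) (PUNC ,)))",
    "( (IP-MAT (CONJ and) (NP-SBJ (D the) (N cat)) (VP (VBD slept)) (PUNC .)))",
    "( (IP-MAT (CONJ And) (ADVP-TMP (ADV then)) (NP-SBJ (PRO she)) (VP (VBD left)) (PUNC .)))",
]
sentences = group_into_orthographic_sentences(tokens)
print([len(s) for s in sentences])  # [3, 1]
```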

Token-medially, COMMA serves a wide variety of functions in general accordance with standard (American) conventions. Some of the more noteworthy rules for comma placement are listed here.

Question mark (?)

See also the entry on Question mark in Tips for corpus builders.

QUESTION MARK (?) marks tokens with interrogative discourse function, intended to elicit an answer or (possibly nominal) response, regardless of whether they exhibit interrogative morphosyntactic form.

( (CP-QUE-MAT (WNP-1 (WPRO What))			← interrogative form
              (IP-SUB (MD would)
	              (NP-SBJ (PRO you))
		      (VP (VB like)
			  (NP-OB1 *T*-1)))
	      (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))				← declarative form with interrogative force
	  (VP (VBD liked)
	      (NP-OB1 (D the)
		      (N show)))
	  (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD liked)
	      (NP-OB1 (D the)
		      (N show)))
	  (PUNC ,)
	  (PAREN (CP-QUE-MAT (DOP do)			← parenthetical question
			     (NP-SBJ (PRO you))
                             (VP (VB remember))))
	  (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD liked)
	      (NP-OB1 (D the)
		      (N show)))
	  (PUNC ,)
	  (PAREN (CP-QUE-TAG (DOD did@)			← tag question
			     (NEG @n't)
			     (NP-SBJ (PRO they))))
	  (PUNC ?)))

( (IP-MAT (NP-SBJ (PRO They))				← echo question
	  (VP (VBD saw)
	      (WNP-OB1 (WPRO who)))
	  (PUNC ?)))

( (FRAG (NP (QP (Q Any))				← sentence fragment with interrogative force
	    (NS siblings))
	(PUNC ?)))

To facilitate the process of annotation, QUESTION MARK delimits each interrogative token separately, replacing COMMA or PERIOD, contrary to standard convention.

( (CP-QUE-MAT (WNP-1 (WPRO What))
              (IP-SUB (MD would)
	              (NP-SBJ (PRO you))
		      (VP (VB like)
			  (NP-OB1 *T*-1)))
	      (PUNC ?)))				← like this; not COMMA or PERIOD

( (CP-QUE-MAT (CONJ And)
	      (WADVP-1 (WADV when))
	      (IP-SUB (MD would)
		      (NP-SBJ (PRO you))
		      (VP (VB like)
			  (IP-ECM (NP-SBJ (PRO it))
				  (VP (ADVP-TMP *T*-1)
				      (VAN delivered)))))
	      (PUNC ?)))

QUESTION MARK does not mark indirect questions, even ones with direct question syntax (since they do not have interrogative force).

( (IP-MAT (NP-SBJ (PRO They))
	  (VP (VBD asked)
	      (CP-QUE-SUB (WNP-1 (WPRO what))
			  (IP-SUB (MD would)
				  (NP-SBJ (PRO I))
				  (VP (VB like)
				      (NP-OB1 *T*-1)))))
	  (PUNC .)))
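The conventions above suggest a simple consistency check, sketched here in Python: a token whose root is a direct question (a CP-QUE label at the root) is expected to end in QUESTION MARK, whereas an indirect question embedded under a declarative root is not. The function names are hypothetical. The check is a heuristic only, and it says nothing about declaratives or fragments with interrogative force, which also take QUESTION MARK.

```python
import re


def root_label(tree_text):
    """Return the label of the token root, i.e. the first label inside the wrapper."""
    match = re.match(r"\s*\(\s*\(([^\s()]+)", tree_text)
    return match.group(1) if match else None


def final_punc(tree_text):
    """Return the last PUNC mark in a bracketed token, or None if there is none."""
    marks = re.findall(r"\(PUNC\s+([^\s()]+)\)", tree_text)
    return marks[-1] if marks else None


def missing_question_mark(tree_text):
    """True if the root is a CP-QUE clause but the last PUNC is not QUESTION MARK."""
    label = root_label(tree_text)
    return bool(label) and label.startswith("CP-QUE") and final_punc(tree_text) != "?"


direct = ("( (CP-QUE-MAT (WNP-1 (WPRO What)) (IP-SUB (MD would) (NP-SBJ (PRO you)) "
          "(VP (VB like) (NP-OB1 *T*-1))) (PUNC ?)))")
indirect = ("( (IP-MAT (NP-SBJ (PRO They)) (VP (VBD asked) (CP-QUE-SUB (WNP-1 (WPRO what)) "
            "(IP-SUB (MD would) (NP-SBJ (PRO I)) (VP (VB like) (NP-OB1 *T*-1))))) (PUNC .)))")
suspect = direct.replace("(PUNC ?)", "(PUNC ,)")  # constructed counterexample

print(missing_question_mark(direct))    # False
print(missing_question_mark(indirect))  # False
print(missing_question_mark(suspect))   # True
```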

QUESTION MARK can be used for cases where the speaker expresses pronounced uncertainty, usually in response to a question. It should be used sparingly and not to indicate every case of rising intonation.

( (ADVP (ADV Maybe))
  (PUNC ?))

Exclamation point (!)

EXCLAMATION POINT (!) can be used as token-final punctuation for extremely vehement exclamations.

EXCLAMATION POINT is deprecated because it has a special meaning in the syntax of many operating system shells and hence complicates automatic processing. See Tips for corpus builders for further discussion.

( (IP-MAT (NP-SBJ (D That@))
	  (VP (BEP @'s)
	      (NP-PRD (D a)
		      (N lie)))
	  (PUNC !)))

( (FRAG (PP (P Of)
	    (NP (N course)))
	(NEG not)
	(PUNC !)))
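One way to sidestep shell quoting when looking for EXCLAMATION POINT is to search the trees from within a script rather than from a shell command line. The following Python sketch is a minimal illustration. The function name is hypothetical, and the trees are assumed to be available as strings.

```python
import re

# Matches a PUNC node whose mark is the EXCLAMATION POINT
EXCLAMATION = re.compile(r"\(PUNC\s+!\)")


def tokens_with_exclamation(trees):
    """Return the indices of tokens that contain an EXCLAMATION POINT."""
    return [i for i, tree in enumerate(trees) if EXCLAMATION.search(tree)]


trees = [
    "( (IP-MAT (NP-SBJ (D That@)) (VP (BEP @'s) (NP-PRD (D a) (N lie))) (PUNC !)))",
    "( (IP-MAT (NP-SBJ (D The) (N cat)) (VP (BED was) (ADJP-PRD (ADJ black))) (PUNC .)))",
    "( (FRAG (PP (P Of) (NP (N course))) (NEG not) (PUNC !)))",
]
print(tokens_with_exclamation(trees))  # [0, 2]
```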

Double quotes (" ")

See also the entry on Quotation marks in Tips for corpus builders.

CoNYCE uses DOUBLE QUOTES as described below. In AAPCAppE, direct speech is indicated by QTP alone, rather than by the redundant combination of QTP and punctuation.

DOUBLE QUOTES (" ") enclose direct speech. In tokens that dominate direct speech along with other material, DOUBLE QUOTES are used in the standard way.

Contrary to the standard modern eye-candy convention, sentence-final punctuation follows DOUBLE QUOTES when associated with the matrix clause.

( (IP-MAT (NP-SBJ (PRO She))
	  (VP (VBD said)
	      (PUNC ,)
	      (PUNC ")
	      (QTP (IP-MAT (NP-SBJ (PRO I@))
			   (VP (BEP @m)
			       (VP (VAG coming))))))
	  (PUNC ")					← non-standard order
	  (PUNC .)))

In multi-token sequences of direct speech, each individual token is separately marked by DOUBLE QUOTES, contrary to standard orthographic convention, as highlighted in the example below. See the section on quotations in Tips for corpus builders for issues related to implementing this convention.

( (IP-MAT (NP-SBJ (PRO She))
	  (VP (VBD said)
	      (PUNC ,)
              (PUNC ")						← conventional
	      (QTP (IP-MAT-INV (ADVP-LOC (ADV Here))
			       (BEP is)
			       (NP-SBJ (NUMP (NUM one))))))
	  (PUNC ")						← additional
	  (PUNC ,)))

( (QTP (PUNC ")							← additional
       (FRAG (CONJ and)
	     (ADVP-TMP (ADV now))
	     (NP (D a)
		 (ADJP (ADJ second))
		 (N one)))
       (PUNC ")							← additional
       (PUNC ,)))

( (QTP (PUNC ")							← additional
       (IP-MAT-INV (CONJ and)
		   (ADVP-LOC (ADV here))
		   (BEP is)
		   (NP-SBJ (D a)
			   (ADJP (ADJ third))
			   (N one)))
       (PUNC ")							← conventional
       (PUNC .)))

( (IP-MAT (CONJ And)						← continuation of narrative
	  (ADVP-TMP (ADV then))
	  (NP-SBJ (PRO she))
	  (VP (VBD left))
	  (PUNC .)))
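Because each token of direct speech carries its own pair of DOUBLE QUOTES, a token with an odd number of quote marks is a likely sign that the convention was not applied. The following Python sketch flags such tokens. The function name is hypothetical, and the second tree deliberately omits its closing quote to show the check firing.

```python
import re

# Matches a PUNC node whose mark is a double quote
QUOTE = re.compile(r'\(PUNC\s+"\)')


def unpaired_quote_tokens(trees):
    """Return the indices of tokens whose number of DOUBLE QUOTES is odd."""
    return [i for i, tree in enumerate(trees) if len(QUOTE.findall(tree)) % 2 == 1]


tokens = [
    '( (IP-MAT (NP-SBJ (PRO She)) (VP (VBD said) (PUNC ,) (PUNC ") '
    '(QTP (IP-MAT-INV (ADVP-LOC (ADV Here)) (BEP is) (NP-SBJ (NUMP (NUM one)))))) '
    '(PUNC ") (PUNC ,)))',
    # closing quote omitted on purpose
    '( (QTP (PUNC ") (FRAG (CONJ and) (ADVP-TMP (ADV now)) '
    '(NP (D a) (ADJP (ADJ second)) (N one))) (PUNC ,)))',
]
print(unpaired_quote_tokens(tokens))  # [1]
```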

Single quotes (' ')

CoNYCE uses SINGLE QUOTES as described below. In AAPCAppE, mention vs. use is indicated by META alone, rather than by the redundant combination of META and punctuation.

SINGLE QUOTES (' ') are used to highlight items that are mentioned rather than used, notably titles of books, songs, movies, TV shows, and so on. See META for discussion and examples.