Splitting and joining words

Summary overview

                      Spelled together              Spelled apart

Split                 Emendation                    Separate tags
                      (MD $can) (NEG $n't)          (MD can) (NEG not)
                      (CODE {TEXT:can't})

Treated as unitary    Simple tag                    Numbered tag
                      (ADJ blue-eyed)               (ADJ (ADJ21 blue) (ADJ22 eyed))

Treated as compound   Complex (+) tag               Separate tags
                      (ADJ+NS gentlemen)            (ADJ gentle) (NS men)

Treated as written    Relationship between tagging for variant spellings not necessarily transparent
                      (NPRS Englishmen)             (ADJ English) (NS men)

Fused form            Phrase and complex (+) tag    Phrase and separate tags
                      (PP (P+N bicause))            (PP (P be)
                                                        (NP (N cause)))

Items that are split

When an orthographic word in the original text belongs to different constituents (as defined by our annotation guidelines), the word is split into relevant parts, which are marked as emendations. As is usual with emendations, the original form is enclosed in (CODE {TEXT:...}).
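As a minimal sketch of the split-and-emend convention, the following Python helper renders a split word in the shape shown in the summary overview. The function name `split_emendation` is hypothetical, and two details are assumptions read off the `can't` example rather than stated rules: that each split part carries a leading `$`, and that the CODE node follows the split parts.

```python
def split_emendation(original, parts):
    """Render a split word as tagged parts plus a CODE node.

    original: the orthographic word as it appears in the text.
    parts:    list of (pos_tag, form) pairs, one per constituent.
    """
    # Assumption: each split part is marked with a leading "$",
    # as in (MD $can) (NEG $n't) in the summary overview.
    tagged = [f"({tag} ${form})" for tag, form in parts]
    # The original form is preserved inside (CODE {TEXT:...}).
    # Assumption: the CODE node is placed after the split parts.
    tagged.append(f"(CODE {{TEXT:{original}}})")
    return " ".join(tagged)


print(split_emendation("can't", [("MD", "can"), ("NEG", "n't")]))
# (MD $can) (NEG $n't) (CODE {TEXT:can't})
```

The sketch only builds the bracketed string; it does not decide whether a given word should be split, which depends on the constituent structure as described in this section.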

Some combinations, such as a pronoun and a modal (e.g., 'twill), always belong to separate constituents and are therefore always separated. A systematic exception to the above concerns prepositions and single-word complements when they are spelled together (e.g., abed, on't, therewith); see below. Other combinations, such as determiner-modifier combinations (e.g., tother), do not always belong to distinct constituents in the sense of the annotation guidelines and are therefore not always split; see below.

In the later corpora, we attempt to regularize the spelling of split forms to the standard modern equivalent (if there is one). However, in two exceptional texts (stevenson, udall), the split forms are not standardized, but reflect the characteristic dialect forms used elsewhere in these texts. To facilitate searches, we distinguish contracted and non-contracted forms in the emendations (see "Modal plus negation" for examples).

The following cases of split words are particularly common:

Exceptionally not split. Although prepositions and their complements always belong to different constituents according to our guidelines, prepositions are exceptionally not split from single-word complements if both are spelled together. Most frequently, these single-word complements are R-pronouns or a contracted form of IT. The entire sequence is treated as a PP or WPP.

(PP (ADV+P heretofore))		(PP (ADV+P therefore))
(WPP (WADV+P wherewith))

(PP (P+PRO for't))		(PP (P+PRO in't))
(PP (P+PRO on't))		(PP (P+PRO too't))

(PP (P+NS acneon))		(PP (P+N areawe))
(PP (P+N ibedde))		(PP (P+N iwit))

Split or not depending on syntactic context. Some common cases are:

Items treated as unitary

Items in this category may be spelled as one orthographic word or several. When written together, they are given a simple POS tag. When written apart, each part of the multiword sequence is surrounded by a numbered POS tag. The first number indicates the total number of parts; the second number indicates each part's place within the entire sequence. In order to facilitate CorpusSearch queries, an additional POS tag (unnumbered) surrounds the entire sequence in the parsed files.

(ADV nevertheless)		(ADV (ADV31 never) (ADV32 the) (ADV33 less))
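The numbering scheme can be illustrated with a short Python sketch. The helper name `tag_unitary` is hypothetical, and the restriction to sequences of two to nine parts is an assumption made here so that each number stays a single digit; the manual does not state an upper limit.

```python
def tag_unitary(pos, parts):
    """Tag a unitary item given its orthographic parts.

    pos:   the POS tag of the item as a whole, e.g. "ADV".
    parts: the orthographic words making up the item.
    """
    total = len(parts)
    if total == 1:
        # Written together: a simple POS tag.
        return f"({pos} {parts[0]})"
    if not 2 <= total <= 9:
        # Assumption of this sketch: single-digit part counts only.
        raise ValueError("expected between 1 and 9 parts")
    # Written apart: first digit = total number of parts,
    # second digit = position of this part in the sequence.
    inner = " ".join(
        f"({pos}{total}{i} {word})" for i, word in enumerate(parts, start=1)
    )
    # The unnumbered outer tag wraps the whole sequence.
    return f"({pos} {inner})"


print(tag_unitary("ADV", ["nevertheless"]))
# (ADV nevertheless)
print(tag_unitary("ADV", ["never", "the", "less"]))
# (ADV (ADV31 never) (ADV32 the) (ADV33 less))
```

Because the outer tag is the same in both spellings, a search for the outer tag alone picks up the item regardless of how it is written.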

Although our treatment of fused forms generally reflects their phrasal origin, certain such items must be treated as unitary because of their syntactic distribution. For instance, UNDERHAND must be treated as an adjective because it can appear as a prenominal modifier.

(NP (ADJ underhand) (NS courses))

Once an item is treated as unitary in one context, it is treated that way consistently.

(ADVP (ADV secretly) (CONJ and) (ADV underhand))	← not (PP (P+N underhand))

For items that go the other way (e.g., ALIVE, ASLEEP), see Fused forms.

Historical changes in distribution can lead to differences in the way that items are treated in the PPCME2 and in the later corpora.

Items treated as compounds

When they are spelled together, items that are treated as compounds receive a complex POS tag, consisting of two or more POS tags joined by "+".

When items treated as compounds are spelled apart, each part receives a simple POS tag. No additional pair of POS brackets is added to indicate the item's compound character (unlike in the case of unitary items).

(NP (ADJ+NS gentlemen))		(NP (ADJ gentle) (NS men))          ← no added NS

Phrasal brackets, on the other hand, are added as appropriate.

(NP (D the)                     (NP (D the)
    (ADV+VAN aforesayde)            (ADJP (ADV afore) (VAN sayde))  ← added ADJP
    (N matter))                     (N matter))
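The contrast between the two spellings of a compound can be sketched in Python as follows. The function names `compound_together` and `compound_apart` are hypothetical. The sketch covers only the POS level; it does not add phrasal brackets such as the ADJP above, which depend on the syntactic context.

```python
def compound_together(tags, word):
    """Compound spelled as one word: a complex tag joined by "+"."""
    return f"({'+'.join(tags)} {word})"


def compound_apart(tags, parts):
    """Compound spelled apart: one simple POS tag per part.

    Unlike unitary items, no extra POS bracket is wrapped
    around the sequence.
    """
    if len(tags) != len(parts):
        raise ValueError("need exactly one tag per part")
    return " ".join(f"({tag} {part})" for tag, part in zip(tags, parts))


print(compound_together(["ADJ", "NS"], "gentlemen"))
# (ADJ+NS gentlemen)
print(compound_apart(["ADJ", "NS"], ["gentle", "men"]))
# (ADJ gentle) (NS men)
```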

The first part of a compound is tagged as N if that is possible given the meaning of the compound (EVIL-DOERS, ILL-BODING). Otherwise (EVIL-FAVOURED, ILL-DISPOSED, WELL-DOERS), the first part is tagged with the appropriate POS tag (here, ADV).

Items treated as written

This category is used largely for fused forms, but also includes the following items.

ALMIGHTY and BETIME(S) are treated differently in the PPCME2 and in the later corpora.

Spelled together            Spelled apart

(P into)                    (RP in) (P to)                                  ← unlike unitary unto

(P up-on)                   (RP up) (P on)                                  ← unlike unitary apon

(NPR Englishman)            (NP (ADJ English) (N man))                      ← similarly DUTCHMAN, FRENCHMAN
(NPRS Englishmen)           (NP (ADJ English) (NS men))

(NUM fifty-three)           (NUMP (NUM fifty) (NUM three))

(ADJ one-and-fiftieth)      (ADJP (ONE one) (CONJ and) (ADJ fiftieth))
(ADJ three-and-fiftieth)    (ADJP (NUM three) (CONJ and) (ADJ fiftieth))
(ADJ fifty-first)
(ADJ fifty-third)

Fused forms

Certain items in later English are fusions of earlier multi-word phrases. Given the time coverage of our diachronic corpora and the fact that word division in early texts is not always well represented, these items are very difficult to treat in a consistent way. The strategy we have adopted is as follows.

See Items treated as unitary for the distinction between cases like ALIVE and ASLEEP (fused forms) and UNDERHAND (unitary adjective or adverb).