 | Spelled together | Spelled apart
---|---|---
Split | Emendation | Separate tags
Treated as unitary | Simple tag | Numbered tag
Treated as compound | Complex (+) tag | Separate tags
Treated as written (relationship between tagging for variant spellings not necessarily transparent) | (NPRS Englishmen) | (ADJ English) (NS men)
Fused form | Phrase and complex (+) tag | Phrase and separate tags, e.g. (PP (P be) (NP (N cause)))
When an orthographic word in the original text belongs to different constituents (as defined by our annotation guidelines), the word is split into relevant parts, which are marked as emendations. As is usual with emendations, the original form is enclosed in (CODE {TEXT:...}).
Some combinations, such as a pronoun and a modal (e.g., 'twill), always belong to separate constituents and are therefore always separated. A systematic exception to the above concerns prepositions and single-word complements when they are spelled together (e.g., abed, on't, therewith); see below. Other combinations, such as determiner-modifier combinations (e.g., tother), do not always belong to distinct constituents in the sense of the annotation guidelines and are therefore not always split; see below.
In the later corpora, we attempt to regularize the spelling of split forms to the standard modern equivalent (if there is one). However, in two exceptional texts (stevenson, udall), the split forms are not standardized, but reflect the characteristic dialect forms used elsewhere in these texts. To facilitate searches, we distinguish contracted and non-contracted forms in the emendations (see "Modal plus negation" for examples).
The following cases of split words are particularly common:
$shall_MD $be_BE {TEXT:shalbe}_CODE
$shall_MD $be_BE {TEXT:shallbee}_CODE
$will_MD $be_BE {TEXT:wolbe}_CODE
$will_MD $be_BE {TEXT:wylbe}_CODE

$can_MD $not_NEG {TEXT:cannot}_CODE ← non-contracted form
$can_MD $n't_NEG {TEXT:cant}_CODE ← contracted form
$can_MD $n't_NEG {TEXT:can't}_CODE
$shall_MD $n't_NEG {TEXT:shant}_CODE
$wo_MD $n't_NEG {TEXT:won't}_CODE ← "wo" rather than "will"

$grin_VBI $it_PRO {TEXT:grinit}_CODE
$keth_VBP $he_PRO {TEXT:ketha}_CODE ← spelling variant of QUOTHA
$maist_MD $tow_PRO {TEXT:maistow}_CODE
$Pray_VBP $thee_PRO {TEXT:Prithee}_CODE
$Pray_VBP $thee_PRO {TEXT:Prethe}_CODE
$quoth_VBP $he_PRO {TEXT:quotha}_CODE

$ich_PRO $challe_MD {TEXT:ichalle}_CODE ← dialect form retained
$it_PRO $'s_BEP {TEXT:its}_CODE
$it_PRO $'s_BEP {TEXT:it's}_CODE
$me_PRO $thynketh_VBP {TEXT:methynketh}_CODE ← original spelling in Middle English emendation
$me_PRO $thinks_VBP {TEXT:methinkes}_CODE ← regularized spelling in Modern English emendation
$there_EX $'s_BEP {TEXT:thers}_CODE
$they_PRO $'ll_MD {TEXT:they'l}_CODE
$'t_PRO $is_BEP {TEXT:'tis}_CODE ← position of apostrophe invariant in emendation
$'T_PRO $is_BEP {TEXT:T'is}_CODE
$'t_PRO $will_MD {TEXT:twil}_CODE ← apostrophe added in emendation
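The emendation format above is regular enough to be read mechanically. The following is a minimal sketch in Python, not part of the corpus tooling: the function names `parse_emendation` and `is_contracted` and both regular expressions are assumptions made for illustration. It recovers the split parts and the original spelling from one emendation, and checks whether the negation is contracted.

```python
import re

# Hypothetical patterns for the word_TAG format shown above:
# each split part looks like "$form_TAG", and the original
# spelling is recorded as "{TEXT:original}_CODE".
PART = re.compile(r"\$(\S+)_([A-Z$]+)")
ORIGINAL = re.compile(r"\{TEXT:(\S+)\}_CODE")

def parse_emendation(text):
    """Return (parts, original) for a single emendation string.

    parts is a list of (form, tag) pairs; original is the spelling
    found in the source text, or None if no {TEXT:...} is present.
    """
    parts = PART.findall(text)
    match = ORIGINAL.search(text)
    original = match.group(1) if match else None
    return parts, original

def is_contracted(parts):
    """True if the negation part uses the contracted form n't."""
    return any(form == "n't" and tag == "NEG" for form, tag in parts)

parts, original = parse_emendation("$can_MD $n't_NEG {TEXT:can't}_CODE")
print(parts, original, is_contracted(parts))
```

Because contracted and non-contracted forms are kept distinct in the emendation itself, a check like `is_contracted` needs to look only at the split parts, not at the original spelling.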
See Dollar tag, Possessive clitic.
Exceptionally not split. Although prepositions and their complements always belong to different constituents according to our guidelines, prepositions are exceptionally not split from single-word complements if both are spelled together. Most frequently, these single-word complements are R-pronouns or a contracted form of IT. The entire sequence is treated as a PP or WPP.
(PP (ADV+P heretofore)) (PP (ADV+P therefore)) (WPP (WADV+P wherewith)) (PP (P+PRO for't)) (PP (P+PRO in't)) (PP (P+PRO on't)) (PP (P+PRO too't)) (PP (P+NS acneon)) (PP (P+N areawe)) (PP (P+N ibedde)) (PP (P+N iwit))
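The bracketing of these unsplit forms can be sketched as follows. This is an illustrative Python helper, not part of the annotation software: the name `unsplit_pp` is hypothetical, and choosing WPP whenever one of the tags begins with W is an assumption inferred from the WHEREWITH example above.

```python
def unsplit_pp(word, tags):
    """Bracket a preposition spelled together with its one-word complement.

    tags lists the POS tags of the parts in the order in which they
    appear in the word, e.g. ["P", "PRO"] for "on't" or
    ["ADV", "P"] for "therefore".
    """
    # Assumption: a wh-element among the parts makes the phrase a WPP.
    label = "WPP" if any(tag.startswith("W") for tag in tags) else "PP"
    complex_tag = "+".join(tags)
    return f"({label} ({complex_tag} {word}))"

print(unsplit_pp("on't", ["P", "PRO"]))
print(unsplit_pp("wherewith", ["WADV", "P"]))
```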
Split or not depending on syntactic context. Some common cases are:
Combinations of a preposition and a determiner are not split when the determiner is the head of the NP (following the rule that prepositions are not split from single-word complements).
( (CP-QUE (IP-SUB (BEP Are) (NP-SBJ (PRO you)) (VAN a-uis'd) (PP (P+D o'that))) (. ?)))
Otherwise, they are split.
(PP (P $on) (NP (PRO$ $my) (CODE {TEXT:o'my}) (N life)))
Combinations of a determiner with a following noun, adjective, or other element are not split if that element does not form a phrasal constituent with any following words. See also Items treated as compounds.
(NP (D+ADJ thilke) (NPR Iuditha)) (NP (D+N th'emperour)) (NP (D+N thestate)) (NP (D+ADJ thilke) (N matter)) (NP (D+ADV+VAN th'aforesayde) (N matter))
Otherwise, they are split.
(NP (D $the) (ADJP (ADV $right) (CODE {TEXT:theright}) (ADJ honourable)) (N Earle) (PP (P of) (NP (NPR Atholl))))
(NP (D $th') (ADJP (ADV $afore) (CODE {TEXT:th'afore}) (VAN sayde)) (N matter))
Items in this category may be spelled as one orthographic word or several. When written together, they are given a simple POS tag. When written apart, each part of the multiword sequence is surrounded by a numbered POS tag. The first number indicates the total number of parts; the second number indicates each part's place within the entire sequence. In order to facilitate CorpusSearch queries, an additional POS tag (unnumbered) surrounds the entire sequence in the parsed files.
(ADV nevertheless) (ADV (ADV31 never) (ADV32 the) (ADV33 less))
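The numbering scheme can be made concrete with a short Python sketch. The function name `tag_unitary` is hypothetical, and the sketch assumes a sequence has fewer than ten parts, so that each number is a single digit.

```python
def tag_unitary(tag, parts):
    """Bracket a unitary item given its orthographic parts.

    One part: a simple POS tag. Several parts: each part gets the tag
    plus two digits (total number of parts, then position), and an
    unnumbered tag surrounds the whole sequence.
    """
    if len(parts) == 1:
        return f"({tag} {parts[0]})"
    total = len(parts)
    inner = " ".join(
        f"({tag}{total}{position} {part})"
        for position, part in enumerate(parts, start=1)
    )
    return f"({tag} {inner})"

print(tag_unitary("ADV", ["nevertheless"]))
print(tag_unitary("ADV", ["never", "the", "less"]))
```

The second call reproduces the three-part bracketing shown above, with ADV31, ADV32, and ADV33 inside an outer ADV.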
Although our treatment of fused forms generally reflects their phrasal origin, certain such items must be treated as unitary because of their syntactic distribution. For instance, UNDERHAND must be treated as an adjective because it can appear as a prenominal modifier.
(NP (ADJ underhand) (NS courses))
Once an item is treated as unitary in one context, it is treated that way consistently.
(ADVP (ADV secretly) (CONJ and) (ADV underhand)) ← not (PP (P+N underhand))
For items that go the other way (e.g., ALIVE, ASLEEP), see Fused forms.
Historical changes in distribution can lead to differences in the way that items are treated in the PPCME2 and in the later corpora.
This category does not include:
(ADJ alone) (ADJ (ADJ21 a) (ADJ22 lone))
(ADJ backward) (ADJ (ADJ21 back) (ADJ22 ward))
(ADJ gladful) (ADJ (ADJ21 glad) (ADJ22 ful))
(ADJ innermost) (ADJ (ADJ21 inner) (ADJ22 most))
similarly: other adjectives ending in -MOST, other adjectives ending in -WARD
(ADJ derworthy)
(ADJ red-hot)
(ADJ selfsame)
(ADJ sevenfold)
(NP (ADJ upright) (NS men)) (also adverb)
(ADJ welcome)

This category includes apparent compounds with 'false participles':
(ADJ feather-footed) (ADJ (ADJ21 feather) (ADJ22 footed))
(ADJ mild-hearted) (ADJ (ADJ21 mild) (ADJ22 hearted))
(ADJ two-toothed) (ADJ (ADJ21 two) (ADJ22 toothed))
(ADJ ill-natured), but (ADV+VAN ill-favoured)
The following adverbs and prepositions are treated as unitary.
adverbs ending in -MOST, a+det, about, above, abroad, afore, again, against, almost, already, although(inwith), altogether, always, alwhatamong, amore, anon, apon (but not upon), aright, asswa, away, before, behind, beneath, beside(s), between, betwixt, beyond, bimong, eftsoon, evermore, for+ti, fornigh, forthright, forto, fromward(tofore), furthermore, furtherover, henceforward, intil, inwith, la(n)hure, maybe, mayfortune, mayhap, moreover, na+gtuor+tan, natforthi, ne+taget, nethelatter & variants, nevermore & variants, nevertheless & variants, nonetheless & variants, notwithstanding, onward, outake(n), overal, overmete, peradventure, percase, perchance, perhaps, thenceforth, there(to)against, throughout, tilinto, tilto, toeke(n), tofore(hand), togains, together, toward, towhether, umbestunde, underhand (also adjective), upright (also adjective), unto (but not into), whatforthi, withal, within, without(forth), +te+get, +tewhether, +tohhswa+tehh
Certain items are treated differently in the PPCME2 and in the later corpora (e.g., AFTERNOON, TODAY, and TONIGHT).
Common items in this category include:
(N ado) (N (N21 a) (N22 do)) ← A = northern infinitival marker
farewell
inside
outside
(N todo) (N (N21 to) (N22 do))
(N to-morrow) (N (N21 to) (N22 morrow))
(N$ tomorrow's) (N$ (N$21 to) (N$22 Morrows))
(N yesterday) (N (N21 yester) (N22 day))
(N$ yesterdays) (N$ (N$21 yester) (N$22 day's))
(N yesternight) (N (N21 yester) (N22 night))
(NPR Wednesday) (NPR (NPR21 Wadenes) (NPR22 day))
(VAN (VAN21 fore) (VAN22 said)) ← FORE treated as prefix, not P, because of meaning
(VB (VB21 with) (VB22 say))
(VBD (VBD21 by) (VBD22 shone))
(VBD (VBD21 to) (VBD22 brake))
(VBD (VBD21 a) (VBD22 resunede)) (VBD (VBD21 a) (VBD22 seide))
(VBP (VBP21 a) (VBP22 kel+t)) (VBP (VBP21 a) (VBP22 turne+t))
(VAN (VAN21 y) (VAN22 cleped))
(VB (VB21 i) (VB22 heren))
(VBP (VBP21 +ge) (VBP22 bette))
When items treated as compounds are spelled apart, each part receives a simple POS tag. No additional pair of POS brackets is added to indicate the item's compound character (unlike in the case of unitary items). Phrasal brackets, on the other hand, are added as appropriate.
(NP (ADJ+NS gentlemen)) (NP (ADJ gentle) (NS men)) ← no added NS
(NP (D the) (ADV+VAN aforesayde) (N matter)) (NP (D the) (ADJP (ADV afore) (VAN sayde)) (N matter)) ← added ADJP
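The contrast between the two spellings of a compound can be sketched in Python. The helper name `tag_compound` is hypothetical; the sketch covers only the POS level and deliberately leaves out the phrasal brackets (such as the added ADJP), which depend on syntactic context.

```python
def tag_compound(parts, spelled_together):
    """Tag a compound given its parts as (tag, form) pairs.

    Spelled together: one complex tag joined with "+".
    Spelled apart: each part keeps its own simple tag, with no
    extra POS bracket around the sequence.
    """
    if spelled_together:
        complex_tag = "+".join(tag for tag, _ in parts)
        word = "".join(form for _, form in parts)
        return f"({complex_tag} {word})"
    return " ".join(f"({tag} {form})" for tag, form in parts)

gentlemen = [("ADJ", "gentle"), ("NS", "men")]
print(tag_compound(gentlemen, spelled_together=True))
print(tag_compound(gentlemen, spelled_together=False))
```

Unlike a unitary item, the spelled-apart output carries no numbered tags and no enclosing NS.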
The first part of a compound is tagged as N if that is possible given the meaning of the compound (EVIL-DOERS, ILL-BODING). Otherwise (EVIL-FAVOURED, ILL-DISPOSED, WELL-DOERS), the first part is tagged with the appropriate POS tag (here, ADV).
(ADVR+ADV assone) = as soon (ADVR+ADV a-swythe) = as swythe (quickly)
(NEG+HVD nade) = NE + had (NEG+HVP nave) = NE + have (NEG+MD nolde) = NE + wolde (NEG+VBD nyst) = NE + wist
This category includes compounds in which the first part is a noun or some other category.
(N+N alderman) (N+N bishopric) (N+N eortheware) (N+NS evil-doers) (N+N godfather) (N+N household) (N+N lifetime) (N+N mankind)
(ADJ+NS gentlemen) (ADJ+N grandsire) (ADJ+NPR Halichurche) (ADJ+NS noblemen) (ADJ+N vainglory) (ADV+N hidercume) (NP-TMP (ADV+N oftesy+de)) (NP-TMP (ADV+N ofte-tide)) (NP-TMP (ADV+N(S) often-tyme(s))) (NP-TMP (ADV+N often-while)) (ADV+NS well-doers) (ADV+NS well-wishes) (NP-TMP (ADV+N afor-tyme)) (NP-TMP (ADV+N beforetime)) (NP-ADV (OTHER+NS othergates)) (NP-ADV (OTHER+N otherwise)) (PP (P+N beforehand))
(ADVR+Q overmanie) (ADVR+Q overmuch) (ADVR+ADJ overproud)
(ADV+VAN abouesaide) (ADV+VAN aforn-seyd) (ADV+VAN be-forn-wretyn) (ADV+VAG everlasting) (ADV+VAN ill-disposed) (ADV+VAN new-born) (ADV+VBN new-come) (ADV+VAN well-knowyn)
(N+VAG alms-willing) (NPR+VAG god-fearing) (N+VAG ill-boding) (N+VAN self-conceited) (N+VAN wind-driven)
See Dollar tag, Possessive clitic.
Quantified adverbs are treated as compounds of ANY, EVERY, NO, SOME, etc. (Q) + HOW, WHERE, etc. (WADV).
(ADVP (Q+WADV anyhow)) (ADVP (Q any) (WADV how)) (ADVP-LOC (Q+WADV anywhere)) (ADVP-LOC (Q any) (WADV where)) similarly: everywhere nowhere somehow somewhere
Quantified nouns are treated as compounds of ANY, EVERY, NO, SOME, etc. (Q) + ONE, PLACE, THING, TIME(S), WHAT, WIHT, etc. (N, NS, ONE).
(NP (Q+ONE anyone)) (NP (Q any) (N thing))
(NP-LOC (Q+N anyplace)) (NP-LOC (Q any) (N place))
(NP (Q+N eawiht))
(NP (Q+ONE echone))
(NP (Q+ONE ilkane))
(NP (Q+N somdel)) ← -MSR, -OB1, -SBJ, etc. according to function
(NP-TMP (Q+N sometime)) (NP-TMP (Q some) (N time))
(NP-TMP (Q+NS sometimes)) (NP-TMP (Q some) (NS times))
(NP (Q+N somewhat)) ← -MSR, -OB1, -SBJ, etc. according to function
similarly (some items repeated here for convenience): anyone, anyplace, anything, anytime, everydel, everyone, everyplace, everything, everytime, no-one, noplace, nothing, someone, someplace, something, sometime, sometimes
(FOR for) (TO+VB tabyde) (TO+VB tappeal) (TO+VB toffrenn) (TO+VB toslenne)
Combinations with -WARD that are used as adjectives or prepositions are treated as unitary items.
All other uses and occurrences are tagged WARD, which is either part of a complex (+) POS tag (ADV+WARD, N+WARD, NPR+WARD, RP+WARD, etc.) or a separate tag, depending on whether -WARD is spelled as a separate orthographic word.
(ADVP-TMP (ADV+WARD afterward)) (ADVP-TMP (ADV after) (WARD ward))
(ADVP-DIR (ADV+WARD backward)) (ADVP-DIR (ADV back) (WARD ward))
like adverbial BACKWARD: FORWARD
(ADVP-DIR (RP+WARD downward)) (ADVP-DIR (RP down) (WARD ward))
like adverbial DOWNWARD: INWARD, ONWARD, OUTWARD, TOWARD, UPWARD
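The two-way decision for -WARD combinations (unitary when used as an adjective or preposition, WARD otherwise) can be summarized in a small Python sketch. The function name `tag_ward` and its `use` parameter are assumptions made for illustration; the sketch produces POS-level brackets only and omits phrasal labels such as ADVP-DIR.

```python
def tag_ward(base_tag, base, use, spelled_together):
    """Tag a combination of `base` + "ward".

    use is the category the whole combination has in context,
    e.g. "ADJ", "P", or "ADV". base_tag is the tag of the first part.
    """
    if use in ("ADJ", "P"):
        # Adjectival and prepositional uses are unitary items:
        # simple tag when spelled together, numbered tags when apart.
        if spelled_together:
            return f"({use} {base}ward)"
        return f"({use} ({use}21 {base}) ({use}22 ward))"
    # All other uses take WARD, either inside a complex tag
    # or as a separate tag.
    if spelled_together:
        return f"({base_tag}+WARD {base}ward)"
    return f"({base_tag} {base}) (WARD ward)"

print(tag_ward("ADV", "back", use="ADJ", spelled_together=False))
print(tag_ward("RP", "down", use="ADV", spelled_together=True))
```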
(WPRO+ADV whatever) (WPRO+ADV+ADV whatsoever) (WADV+ADV+ADV wheresomever) (WPRO+ADV whoso)
ALMIGHTY and BETIME(S) are treated differently in the PPCME2 and in the later corpora.
Spelled together | Spelled apart | 
---|---|---
(P into) | (RP in) (P to) | unlike unitary unto
(P up-on) | (RP up) (P on) | unlike unitary apon
(NPR Englishman) (NPRS Englishmen) | (NP (ADJ English) (N man)) (NP (ADJ English) (NS men)) | similarly DUTCHMAN, FRENCHMAN
(NUM fifty-three) | (NUMP (NUM fifty) (NUM three)) | 
(ADJ one-and-fiftieth) (ADJ three-and-fiftieth) (ADJ fifty-first) (ADJ fifty-third) | (ADJP (ONE one) (CONJ and) (ADJ fiftieth)) (ADJP (NUM three) (CONJ and) (ADJ fiftieth)) | 
See Items treated as unitary for distinction between cases like ALIVE and ASLEEP (fused forms) and UNDERHAND (unitary adjective or adverb).
Spelled together | Spelled apart | 
---|---|---
(PP (P+ADV+WARD abackward)) | (PP (P a) (ADVP (ADV+WARD backward))) | 
ABOARD | | 
(PP (P+RP adown)) | (PP (P a) (ADVP (RP down))) | 
(PP (P+N ahunting)) | (PP (P a) (NP (N hunting))) | 
(PP (P+N alive)) | (PP (P a) (NP (N live))) | 
AMID | | 
(PP (P+N asleep)) | (PP (P a) (NP (N sleep))) | 
(PP (P+ADJ asunder)) | (PP (P a) (ADJP (ADJ sunder))) | 
(PP (P+NUM atwo)) | (PP (P a) (NP (NUM two))) | 
similarly: abed, aday, afire, afoot, afresh, amorrow, anight, apace, aside, a+tre | | 
AFTERNOON | | 
ALBEIT | | 
ALMIGHTY | | 
(PP (P+ADV atonce)) | (PP (P at) (ADVP (ADV once))) | 
(PP (P+N bycaus) (CP-ADV ...)) | (PP (P be) (NP (N cause) (CP-THT ...))) | 
(PP (P+N beforehand)) | (PP (P before) (NP (N hand))) | similarly: AFOREHAND, BEHINDHAND
(NP-TMP (ADV+N beforetime)) | (NP-TMP (ADV before) (N time)) | 
(NP-TMP (ADV+NS beforetimes)) | (NP-TMP (ADV before) (NS times)) | 
BETIME(S) | | 
FORASMUCH | | 
(PP (P+ADV forever)) | (PP (P for) (ADVP (ADV ever))) | 
(PP (P+N forsooth)) | (PP (P for) (NP (N sooth))) | similarly: INSOOTH
(PP (P+D forthi) (CP-ADV ...)) | (PP (P for) (D thi) (CP-ADV ...)) | 
(PP (P+WADV forwhi) (CP-ADV ...)) | (PP (P for) (WADV whi) (CP-ADV ...)) | when used as subordinator
HOWBEIT | | 
INASMUCH (like FORASMUCH) | | 
(PP (P+N indeed)) | (PP (P in) (NP (N deed))) | 
(PP (P+N instead)) | (PP (P in) (NP (N stead))) | 
(PP (P+N o'clock)) | (PP (P o') (NP (N clock))) | 
(N percent) | (LATIN (FW per) (FW cent)), (LATIN (FW per) (FW cent.)) | 
TODAY | | 
TONIGHT | | 