Lemmatization documentation for the PPCHE

Hezekiah Bacovin

We intend to lemmatize the PPCHE according to the following guidelines. A beta lemmatized version of the PPCMBE2 is complete.

In CorpusSearch queries, the values of OEDIDs need to be "escaped" with a leading backslash (like any other numbers):

query: (OEDID iDoms \4)
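
When query lines are generated by a script, the escaping can be handled in one place. The following Python sketch is illustrative only: the function name oedid_query is hypothetical, it builds nothing beyond the query line shown above, and passing non-numeric values such as NA through unchanged is an assumption.

```python
def oedid_query(oedid):
    """Build a CorpusSearch query line that matches one OEDID value.

    Purely numeric values get a leading backslash, as described above.
    Other values (for example NA) are passed through unchanged; that
    behaviour is an assumption of this sketch.
    """
    value = str(oedid)
    if value.isdigit():
        value = "\\" + value
    return f"query: (OEDID iDoms {value})"


print(oedid_query(4))
```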

Structure of lemmatized words

In a non-lemmatized corpus, the words in the text are represented as terminal nodes. These are unique daughters of preterminal nodes representing the word's POS (= part-of-speech) tag (= morphosyntactic category).
(NS men)

In the lemmatized PPCMBE2 (and eventually the remaining corpora in the PPCHE), the preterminal nodes have additional internal structure. For example:

(NS (ORTHO men)
    (METAWORD (LEMMA (HEADWORD man)
                     (OEDID 113198))))

Each POS tag immediately dominates two nodes:


ORTHO dominates overt phonological material as it appears in the text, including punctuation. Information about traces and other silent categories, which are part of the annotated corpus but not the original text, is represented under ALT-ORTHO.


METAWORD dominates other information about the word under three possible daughters. One daughter, LEMMA, is obligatorily present, but not the other two, ALT-ORTHO and PLUSTAG.


ALT-ORTHO dominates phonologically null material, and it is incompatible with LEMMA or PLUSTAG. For instance, compare the overt complementizer that with its silent counterpart:
(C (ORTHO that)
   (METAWORD (LEMMA (HEADWORD that)
                    (OEDID 200179))))


Under the current syntactic annotation guidelines, most silent categories are not dominated by POS tags, but directly by phrasal categories. In such cases, METAWORD and ALT-ORTHO are associated with the relevant phrasal category.

(NP (METAWORD (ALT-ORTHO *con*)))           ← silent conjoined subject

(WNP-1 (METAWORD (ALT-ORTHO 0)))            ← silent wh- antecedent

(NP-SBJ (METAWORD (ALT-ORTHO *T*-1)))       ← trace


LEMMA contains information about a word's lemma. Currently, a word can be associated with multiple LEMMA nodes. Over time, we aim to cull non-relevant LEMMA nodes. The goal is a one-to-one association of words with lemmas.


Each LEMMA node immediately dominates exactly one HEADWORD node, which has the relevant lemma spelling. As a general rule, this is the spelling of the OED headword. In the absence of an OED headword, the spelling is copied from ORTHO, or other guidelines apply (notably, in the case of number words).


Each LEMMA node also immediately dominates an OEDID node, which contains an Oxford English Dictionary ID. (In cases of homography, there may be several OEDID nodes, each with its own ID.) These IDs can be added to the end of a URL like the following to access the lexical entry in the online OED. Details are discussed below.

There are four OEDID types, which are usefully thought of as being arranged on a scale:


The current annotation guidelines for the PPCHE allow complex POS tags, which reflect the etymological origins of morphologically complex words.
(ADJ+N gentleman)

In the future, we aim to replace these so-called plus tags with appropriate simple tags in order to facilitate searches and the use of the corpora for computational purposes. Nevertheless, the plus tags may be useful for some projects, so we preserve them under the PLUSTAG node.

(N (ORTHO gentleman)
   (METAWORD (LEMMA (HEADWORD gentleman)
                    (OEDID 77673))
             (PLUSTAG ADJ+N)))
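
The node structure described above is easy to read programmatically. The Python sketch below is one possible way to do it, not part of the corpus tooling: the names parse, find and word_info are hypothetical, the example tree follows the ORTHO / METAWORD / LEMMA layout described in this section, and the tokenizer assumes that terminals contain no whitespace or parentheses.

```python
import re


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1                              # skip ")"
        return [label] + children

    return node()


def find(tree, label):
    """Yield the terminal text directly under every node with this label."""
    if tree[0] == label:
        yield " ".join(c for c in tree[1:] if isinstance(c, str))
    for child in tree[1:]:
        if isinstance(child, list):
            yield from find(child, label)


def word_info(tree):
    """Summarize one lemmatized preterminal node as a dict."""
    return {
        "pos": tree[0],
        "ortho": next(find(tree, "ORTHO"), None),
        "headwords": list(find(tree, "HEADWORD")),
        "oedids": list(find(tree, "OEDID")),
        "plustag": next(find(tree, "PLUSTAG"), None),
    }


example = """
(N (ORTHO gentleman)
   (METAWORD (LEMMA (HEADWORD gentleman)
                    (OEDID 77673))
             (PLUSTAG ADJ+N)))
"""
print(word_info(parse(example)))
```

Because a word may currently carry several LEMMA nodes, the sketch collects HEADWORD and OEDID values as lists rather than single values.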

Lemmatization guidelines

General principles

The OED is our lemma authority. In other words, we adopt the OED's decisions concerning lemmatization in all cases, even when alternative analyses are possible (or even preferred). In particular, the POS tag FW (foreign word) is restricted to words without an OED entry. However, the OED is not our authority for POS tags in general. For instance, we continue to treat subordinating conjunctions as P, and much as a quantifier (Q) rather than as an adverb (ADV).

The two guiding principles for our lemmatization are:

Cases of conflict between these principles are resolved by adjusting orthographic word boundaries - that is, by joining or splitting orthographic words. Joining is indicated by "_" (underscore), and splitting by "@" (at sign). By default, splits occur after hyphens. For examples, see the discussion below.

Number words are a systematic exception to both principles and are discussed in a separate section.

As just mentioned, we generally respect the orthography of the text. In particular, cases (especially in earlier stages of English) that are (in principle) ambiguous between a phrasal and a compound word analysis are handled as they are spelled. For instance, a two-word sequence like gentle man is not joined, regardless of meaning.

(NP (ADJ (ORTHO gentle)                         ← phrase
         (METAWORD (LEMMA (HEADWORD gentle)
                          (OEDID 77666))))
    (N (ORTHO man)
       (METAWORD (LEMMA (HEADWORD man)
                        (OEDID 113198)))))

(N (ORTHO gentle_man)                          ← compound noun
   (METAWORD (LEMMA (HEADWORD gentleman)
                    (OEDID 77673))))

When compound words have no main OED entry, but are listed as derived forms, they receive a derived OEDID (derived ID outranks NA).

Compounds without a main or derived entry still receive a derived OEDID if:

Joining and splitting

As mentioned above, the two lemmatization principles sometimes come into conflict, in which case the word boundaries of the original text are adjusted, whether by joining or splitting. Joining is frequent in connection with abbreviations and expressions from Latin (a. m., P. S., p. m., per annum, per cent, and so on), where at least some of the individual parts have no OED entry, but the joined word does.
(ADV (ORTHO per_Annum)                       ← this way
     (METAWORD (LEMMA (HEADWORD per_annum)
                      (OEDID 236980))))

(P (ORTHO per)
   (METAWORD (LEMMA (HEADWORD per)
                    (OEDID 237088))))
(FW (ORTHO Annum)
    (METAWORD (LEMMA (HEADWORD Annum)
                     (OEDID NA))))              ← not this way

Conversely, orthographic words without an OED entry are split and treated as phrases, if that is possible.

(ADJP (Q (ORTHO little-@)                         ← this way
         (METAWORD (LEMMA (HEADWORD little)
                          (OEDID 109250))))
      (VAN (ORTHO @heard)
           (METAWORD (LEMMA (HEADWORD hear)

(ADJ (ORTHO little-heard)
     (METAWORD (LEMMA (HEADWORD little-heard)
                      (OEDID NA))))              ← not this way
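
Since "_" and "@" are editorial additions, the orthographic words of the original text can be recovered from the ORTHO values. The following Python sketch shows one way this might be done; the function name original_words is hypothetical, and the sketch assumes that neither symbol occurs literally in the text and that a split is marked on both halves, as in little-@ and @heard above.

```python
def original_words(ortho_values):
    """Undo joining ("_") and splitting ("@") on a sequence of ORTHO values.

    Returns the orthographic words as they would have appeared in the
    text before word boundaries were adjusted.
    """
    words = []
    glue = False                      # True if the previous value ended in "@"
    for value in ortho_values:
        starts_split = value.startswith("@")
        ends_split = value.endswith("@")
        parts = value.strip("@").split("_")   # joined parts were separate words
        if glue and starts_split and words:
            words[-1] += parts[0]     # re-attach the two halves of a split word
            words.extend(parts[1:])
        else:
            words.extend(parts)
        glue = ends_split
    return words


print(original_words(["per_Annum"]))
print(original_words(["little-@", "@heard"]))
```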

Special cases

Number words

According to the morphosyntactic annotation guidelines for cardinal numbers, number words are annotated as nouns (N, NS) in certain contexts. The following discussion concerns only number words annotated as NUM (henceforth, NUM words). An exception is made for ordinal numbers, which are tagged as ADJ. These are treated as if they were the corresponding cardinal number, unless they represent FIRST or SECOND. (Numbers tagged as LS are not lemmatized, as expected given their POS tag.)

As mentioned earlier, NUM words are a systematic exception to both of our lemmatization principles. Their OEDID is always NA, but unlike other expressions without an OEDID, their ORTHO form is not copied over as their HEADWORD. Rather, their HEADWORD is the number value (possibly a fraction) represented by the NUM word.
                      (OEDID NA))))

                      (OEDID NA))))

(NUM (ORTHO twenty-five)
     (METAWORD (LEMMA (HEADWORD 25)
                      (OEDID NA))))

(NUM (ORTHO five_and_twenty)
     (METAWORD (LEMMA (HEADWORD 25)
                      (OEDID NA))))

(NUM (ORTHO 20.5)
                      (OEDID NA)))

(NUM (ORTHO 1_1$$4)                          ← '$$' represents slash
                      (OEDID NA)))

(NUM (ORTHO threescore)
     (METAWORD (LEMMA (HEADWORD 60)
                      (OEDID NA))
               (PLUSTAG NUM+NUM)))

(NUM (ORTHO iii=xx=)
     (METAWORD (LEMMA (HEADWORD 60)
                      (OEDID NA))))

If necessary, sequences of orthographic words representing numbers are joined into a single orthographic word. AND regularly forms part of such complex number words, and punctuation does so in rare cases like the last example below.

(NUM (ORTHO one_hundred)
     (METAWORD (LEMMA (HEADWORD 100)
                      (OEDID NA))))

(NUM (ORTHO two_dozen)
     (METAWORD (LEMMA (HEADWORD 24)
                      (OEDID NA))))

(NUM (ORTHO fourscore_and_seven)
     (METAWORD (LEMMA (HEADWORD 87)
                      (OEDID NA))))

(NUM (ORTHO four_millions_,_three_hundred_twenty_thousand_five_hundred_and_sixty-eight)    ← see Cardinal numbers for the treatment of MILLIONS as NUM
     (METAWORD (LEMMA (HEADWORD 4,320,568)
                      (OEDID NA))))
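
The mapping from a joined number word to its HEADWORD value can be approximated mechanically. The Python sketch below is a rough illustration under stated assumptions, not the procedure used to build the corpus: the function name num_headword and the word tables are hypothetical, only spelled-out modern English number words are handled (no digits, fractions, Roman numerals, HALF, or historical spellings), and the comma-separated output format is inferred from the 4,320,568 example above.

```python
SMALL = {
    "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
GROUPS = {"dozen": 12, "score": 20, "hundred": 100}   # multiply the running group
SCALES = {"thousand": 1_000, "million": 1_000_000}    # close the running group


def num_headword(ortho):
    """Return the number value of a joined NUM word as a HEADWORD string."""
    words = ortho.lower().replace("-", "_").split("_")
    total, current = 0, 0
    for word in words:
        if word in ("and", ",", ""):
            continue                          # AND and punctuation add nothing
        if word not in SMALL and word.endswith("s"):
            word = word[:-1]                  # millions -> million
        if word in SMALL:
            current += SMALL[word]
        elif word in GROUPS:
            current = max(current, 1) * GROUPS[word]
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        elif word.endswith("score") and word[:-5] in SMALL:
            current += SMALL[word[:-5]] * 20  # threescore, fourscore
        else:
            raise ValueError(f"unhandled number word: {word!r}")
    return format(total + current, ",")


print(num_headword("fourscore_and_seven"))
print(num_headword(
    "four_millions_,_three_hundred_twenty_thousand_five_hundred_and_sixty-eight"
))
```

Cases of gapping, where the lemma reflects the intended meaning rather than the ORTHO form, cannot be resolved from a single word and would still need manual attention.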

Number expressions beginning in HALF A are lemmatized according to whether they are associated with an OED entry. HALF A DOZEN is joined, whereas HALF A HUNDRED, HALF A MILLION, etc. are treated as phrases.

(NUM half_a_dozen) (NS cups)              ← complex number

(NP (NUMP (Q half) (NUM a_hundred))   ← number phrase
    (NS servants))

In other instances of A(N) followed by a NUM word, A(N) is treated as a variant of ONE (reflecting its etymology) and joined.

(NUM a_hundred) (NS cups)            ← cf. 100 cups, *100 of cups

(NUM a_dozen) (NS cups)              ← cf. 12 cups, *12 of cups

In cases of gapping like forty or fifty_thousand, the lemma reflects the intended meaning, not the ORTHO form.

(NUMP (NUM (ORTHO forty)
           (METAWORD (LEMMA (HEADWORD 40,000)          ← 40,000, not 40
                            (OEDID NA))))
      (CONJ (ORTHO or)
            (METAWORD (LEMMA (HEADWORD or)
                             (OEDID 132129))))
      (NUM (ORTHO fifty_thousand)
           (METAWORD (LEMMA (HEADWORD 50,000)
                            (OEDID NA)))))

Hours and minutes in expressions of clock time are split since the numbers count different units of time. For instance, the unlemmatized expression 10.30 becomes:

                      (OEDID NA))))
(, (ORTHO .)
                    (OEDID NA))))
                      (OEDID NA))))

No lemma

As mentioned earlier, the OED has no main entries for proper names. Certain other material also receives no lemma from the OED, as indicated in the following table. In these cases, the material from ORTHO is copied into HEADWORD, and the value for OEDID is NA.

Category                     POS tags
punctuation                  . , ' "
metalinguistic categories    CODE ID LB LS META
proper names                 NPR NPR$ NPRS NPRS$
foreign words                FW
unknown category             X

Fixed lemma

The following POS tags determine the lemma of the associated word.

Lemma    POS tag
also     ALSO
else     ELSE
there    EX
for      FOR
such     SUCH
to       TO
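
The two tables above (no lemma and fixed lemma) amount to simple lookups keyed on the POS tag. The Python sketch below illustrates this; the names NO_LEMMA_TAGS, FIXED_LEMMA and default_lemma are hypothetical, and because the fixed-lemma table gives no OEDID values, the sketch returns None for the OEDID in that case.

```python
# POS tags whose words receive no lemma from the OED (HEADWORD = ORTHO, OEDID = NA)
NO_LEMMA_TAGS = {
    ".", ",", "'", '"',                   # punctuation
    "CODE", "ID", "LB", "LS", "META",     # metalinguistic categories
    "NPR", "NPR$", "NPRS", "NPRS$",       # proper names
    "FW",                                 # foreign words
    "X",                                  # unknown category
}

# POS tags that determine the lemma by themselves
FIXED_LEMMA = {
    "ALSO": "also", "ELSE": "else", "EX": "there",
    "FOR": "for", "SUCH": "such", "TO": "to",
}


def default_lemma(pos, ortho):
    """Return (HEADWORD, OEDID) when the POS tag alone settles the lemma.

    Returns None when the tag is in neither table. For fixed-lemma tags
    the OEDID is returned as None because it is not listed in the table.
    """
    if pos in FIXED_LEMMA:
        return (FIXED_LEMMA[pos], None)
    if pos in NO_LEMMA_TAGS:
        return (ortho, "NA")
    return None


print(default_lemma("NPR", "London"))
print(default_lemma("EX", "There"))
```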

Closed-class items

The following POS tags represent closed-class categories, each with a fixed list of acceptable HEADWORDs:

C as, how, that
CONJ and, both, either, neither, or
D a, that, the, them (nonstandard), these, this, those
FP just, only
MD can, could, dare, have (better), may, might, need, ought, shall, should, will, would
NEG not
P a (< on), about, above, across, afore, after, again, against, albeit, along, although, amid, amidst, among, amongst, around, as, at, athwart,
because, before, behind, behither, below, beneath, beside, besides, between, betwixt, beyond, but, by,
concerning, cross,
despite, down, during,
ere, except, excepting,
for, from,
if, in, inside, instead, into,
least, lest, like,
near, notwithstanding,
of, off, on, opposite, or (= before), out, outside, over,
past, per,
sans, save, saving, since, sith, so,
than, though, through, throughout, thwart, till, times, to, towards,
unless, under, underneath, until, unto, up, upon,
when, whereas, while, whilst, with, withal, within, without
PRO I, me, thou, thee, he, him, she, her, it, we, us, ye, you, they, them
PRO$ my, mine, thy, thine, his, her, hers, its, our, your, yours, their, theirs
Q($) all, any, aught, both, each, every, either, few, half, ilk, little, many, much, naught, neither, no, none, nothing, nought, several, some
QR fewer, less, more
QS fewest, least, most
RP about, across, by, down, fro, in, off, out, over, through, to, up
WADV as, how, when, whence, where, whereabouts, wherefore, whither, why
WARD seldom a free morpheme
WD what, whatever, whether, which, whichever
WPRO what, whatever, which, whichever, who, whoever, whom, whomever
WPRO$ whose
WQ if, whether
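
Fixed lists like these lend themselves to automatic consistency checks. The Python sketch below shows the idea for a handful of the tags above; the names CLOSED_CLASS and check_headword are hypothetical, and only a few of the shorter lists are copied in to keep the example brief.

```python
# A subset of the closed-class table above, copied for illustration
CLOSED_CLASS = {
    "C": {"as", "how", "that"},
    "CONJ": {"and", "both", "either", "neither", "or"},
    "FP": {"just", "only"},
    "NEG": {"not"},
    "QR": {"fewer", "less", "more"},
    "QS": {"fewest", "least", "most"},
    "WPRO$": {"whose"},
    "WQ": {"if", "whether"},
}


def check_headword(pos, headword):
    """Check a HEADWORD against the closed-class list for its POS tag.

    Returns True or False for tags in CLOSED_CLASS, and None for tags
    that this sketch does not cover.
    """
    allowed = CLOSED_CLASS.get(pos)
    if allowed is None:
        return None
    return headword in allowed


print(check_headword("C", "that"))
print(check_headword("CONJ", "but"))
```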

Known issues