General introduction

Philosophy and goals

File formats

Each text in the corpus comes in three different formats, each with a characteristic filename extension:

Text files (.txt)

Text files have the extension .txt. Besides the text itself, they contain the Helsinki text-level codes, converted into HTML-type codes as outlined in Text markup. The original page layout is not retained. Instead, the text is divided into tokens, each of which generally corresponds to a main clause together with any subordinate clauses that it contains. Each token is associated with a token ID, enclosed in parentheses, which contains the name of the file, a page reference to the printed text (possibly including a volume reference), and a running token number that locates the token within the computer file. Tokens may also consist entirely of text-level codes. Such tokens do not have IDs, but they are still counted by the token counter, which can lead to gaps in the running token numbers. Punctuation in text files is separated from the words to simplify searches.


<P_2>

<heading>

I . (CMMALORY,2.3)

Merlin (CMMALORY,2.4)

</heading>


HIT befel in the dayes of Uther Pendragon , when he was kynge of all
Englond and so regned , that there was a myghty duke in Cornewaill that
helde warre ageynst hym long tyme . (CMMALORY,2.6)

and the duke was called the duke of Tyntagil . (CMMALORY,2.7)

And so by meanes kynge Uther send for this duk chargyng hym to brynge
his wyf with hym . (CMMALORY,2.8)

for she was called a fair lady and a passynge wyse . (CMMALORY,2.9)

and her name was called Igrayne . (CMMALORY,2.10)

So whan the duke and his wyf were comyn unto the kynge , by the meanes
of grete lordes they were accorded bothe . (CMMALORY,2.11)
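The token ID layout described above (file name, page reference, running token number) can be picked apart with a short script. The following is only a minimal sketch: the names `TOKEN_ID`, `split_tokens` and `split_token` are hypothetical, and the pattern assumes that tokens are separated by blank lines, that the ID closes the token, and that the running number follows the last period inside the parentheses.

```python
import re

# Assumed ID shape: "(FILE,PAGE.NUMBER)" at the very end of a token.
# PAGE may itself contain periods (for example a volume reference),
# so the running number is whatever follows the last period.
TOKEN_ID = re.compile(
    r"\((?P<file>[^,()\s]+),(?P<page>[^()]*)\.(?P<number>\d+)\)\s*$"
)


def split_tokens(file_text):
    """Split the contents of a .txt file into tokens at blank lines."""
    return [t for t in re.split(r"\n\s*\n", file_text.strip()) if t.strip()]


def split_token(token):
    """Return (text, file, page, number), or None if the token has no ID."""
    match = TOKEN_ID.search(token)
    if match is None:
        return None  # e.g. a token consisting only of text-level codes
    # Join the wrapped lines of the token into one space-separated string.
    text = " ".join(token[:match.start()].split())
    return text, match["file"], match["page"], int(match["number"])
```

For instance, `split_token("Merlin (CMMALORY,2.4)")` would yield the text `Merlin`, the file name `CMMALORY`, the page reference `2` and the token number `4`, while a token without an ID yields `None`.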

Part-of-speech (POS) tagged files (.pos)

Part-of-speech (POS) tagged texts have the extension .pos. They contain the material in the text files with a POS tag added to each word. Editorial material is given the tag CODE. Text elements are separated from their POS tags by an underscore. The text is divided into tokens in the same way as in the text files. Also, as in the text files, tokens consisting entirely of CODE material do not receive a token ID, but are counted by the token counter.





HIT_PRO befel_VBD in_P the_D dayes_NS of_P Uther_NPR Pendragon_NPR ,_,
when_P he_PRO was_BED kynge_N of_P all_Q Englond_NPR and_CONJ so_ADV
regned_VBD ,_, that_C there_EX was_BED a_D myghty_ADJ duke_N in_P
Cornewaill_NPR that_C helde_VBD warre_N ageynst_P hym_PRO long_ADJ
tyme_N ,_. CMMALORY,2.6_ID

and_CONJ the_D duke_N was_BED called_VAN the_D duke_N of_P Tyntagil_NPR
._. CMMALORY,2.7_ID

And_CONJ so_ADV by_P meanes_NS kynge_NPR Uther_NPR send_VBD for_P
this_D duk_N chargyng_VAG hym_PRO to_TO brynge_VB his_PRO$ wyf_N with_P
hym_PRO ,_. CMMALORY,2.8_ID

for_CONJ she_PRO was_BED called_VAN a_D fair_ADJ lady_N and_CONJ a_D
passynge_ADV wyse_ADJ ,_. CMMALORY,2.9_ID

and_CONJ her_PRO$ name_N was_BED called_VAN Igrayne_NPR ._.
CMMALORY,2.10_ID

So_ADV whan_P the_D duke_N and_CONJ his_PRO$ wyf_N were_BED comyn_VBN
unto_P the_D kynge_N ,_, by_P the_D meanes_NS of_P grete_ADJ lordes_NS
they_PRO were_BED accorded_VAN bothe_Q ._. CMMALORY,2.11_ID
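Because each text element is joined to its tag by an underscore, a POS-tagged token can be broken into word/tag pairs with ordinary string operations. The sketch below is illustrative only: `parse_pos_token` is a hypothetical helper, and it assumes that items are separated by whitespace and that the tag is whatever follows the last underscore, so that elements which themselves contain an underscore stay intact.

```python
def parse_pos_token(token):
    """Split one .pos token into ([(word, tag), ...], token_id or None)."""
    pairs = []
    token_id = None
    for item in token.split():
        # Split at the LAST underscore only: the text element may
        # contain underscores of its own, the tag is assumed not to.
        word, sep, tag = item.rpartition("_")
        if not sep:
            raise ValueError(f"no tag found in {item!r}")
        if tag == "ID":
            token_id = word  # the token ID is kept apart from the words
        else:
            pairs.append((word, tag))
    return pairs, token_id
```

Applied to the token ending in `CMMALORY,2.9_ID` above, this returns twelve word/tag pairs, the last being the token-final punctuation with the tag `.`, together with the ID `CMMALORY,2.9`.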

Parsed files (.psd)

Parsed files have the extension .psd. They contain a labelled bracketing of the text, with the first set of labelled parentheses around a word repeating the information from the POS-tagged files. The division into tokens in the parsed files is the same as in the text and POS files. Each token is enclosed with its ID in a set of unlabelled parentheses.

( (CODE <P_2>))

( (CODE <heading>))

        (. .))
  (ID CMMALORY,2.3))

( (NP (NPR Merlin))
  (ID CMMALORY,2.4))

( (CODE </heading>))

( (IP-MAT (NP-SBJ-1 (PRO HIT))
          (VBD befel)
          (PP (P in)
              (NP (D the) (NS dayes)
                  (PP (P of)
                      (NP (NPR Uther) (NPR Pendragon)))))
          (, ,)
          (PP (P when)
              (CP-ADV (C 0)
                      (IP-SUB (IP-SUB (NP-SBJ (PRO he))
                                      (BED was)
                                      (NP-OB1 (N kynge)
                                              (PP (P of)
                                                  (NP (Q all) (NPR Englond)))))
                              (CONJP (CONJ and)
                                     (IP-SUB (NP-SBJ *con*)
                                             (ADVP (ADV so))
                                             (VBD regned))))))
          (, ,)
          (CP-THT-1 (C that)
                    (IP-SUB (NP-SBJ-2 (EX there))
                            (BED was)
                            (NP-2 (D a) (ADJ myghty) (N duke)
                                  (CP-REL *ICH*-3))
                            (PP (P in)
                                (NP (NPR Cornewaill)))
                            (CP-REL-3 (WNP-4 0)
                                      (C that)
                                      (IP-SUB (NP-SBJ *T*-4)
                                              (VBD helde)
                                              (NP-OB1 (N warre))
                                              (PP (P ageynst)
                                                  (NP (PRO hym)))
                                              (NP-MSR (ADJ long) (N tyme))))))
          (. ,))
  (ID CMMALORY,2.6))

( (IP-MAT (CONJ and)
          (NP-SBJ-1 (D the) (N duke))
          (BED was)
          (VAN called)
          (IP-SMC (NP-SBJ *-1)
                  (NP-OB1 (D the) (N duke)
                          (PP (P of)
                              (NP (NPR Tyntagil)))))
          (. .))
  (ID CMMALORY,2.7))

( (IP-MAT (CONJ And)
          (ADVP (ADV so))
          (PP (P by)
              (NP (NS meanes)))
          (NP-SBJ (NPR kynge) (NPR Uther))
          (VBD send)
          (PP (P for)
              (NP (D this) (N duk)))
          (IP-PPL (VAG chargyng)
                  (NP-OB1 (PRO hym))
                  (IP-INF (TO to)
                          (VB brynge)
                          (NP-OB1 (PRO$ his) (N wyf))
                          (PP (P with)
                              (NP (PRO hym)))))
          (. ,))
  (ID CMMALORY,2.8))
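The labelled bracketing can be read into a nested structure with a small stack-based parser. The code below is a sketch under stated assumptions, not a tool supplied with the corpus: the names `parse_bracketing`, `token_id` and `leaves` are hypothetical, and it assumes that words and labels never contain parentheses or whitespace.

```python
import re

# A piece is an opening parenthesis, a closing parenthesis, or a run of
# characters containing neither parentheses nor whitespace.
PIECE = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketing(text):
    """Parse one bracketed token into nested lists of strings."""
    stack = [[]]
    for piece in PIECE.findall(text):
        if piece == "(":
            stack.append([])
        elif piece == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced closing parenthesis")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(piece)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one complete token")
    return stack[0][0]


def token_id(tree):
    """Return the ID stored directly under the unlabelled wrapper, if any."""
    for child in tree:
        if isinstance(child, list) and child and child[0] == "ID":
            return child[1]
    return None


def leaves(node):
    """Yield (label, word) pairs in text order, skipping the ID node."""
    if len(node) == 2 and all(isinstance(part, str) for part in node):
        if node[0] != "ID":
            yield (node[0], node[1])
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)
```

Note that `leaves` also yields empty elements and traces such as `*con*` or `*-1` with the label of the node that contains them; a caller interested only in the words of the text would need to filter these out.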

Text markup

In general, it has not been possible to retain the markup conventions of the Helsinki Corpus in their original form because of conflicts with the annotation system. The major changes made are as follows: