Corpus description, PPCME2

General information

The PPCME2 text samples are based largely on the Middle English section of the Diachronic Part of the Helsinki Corpus of English Texts (available from ICAME), with certain additions and deletions. The PPCME2 samples, however, are considerably larger than the Helsinki ones. For the earliest Helsinki time period, all texts are sampled exhaustively. For each later Helsinki time period, two texts have been expanded to 50,000 words; the remaining texts are represented by the Helsinki Corpus sample.

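The main Helsinki time periods are M1-M4, each covering approximately one hundred years. In addition, texts composed in one period whose earliest manuscript dates from a later period are given two-digit period designations: the first digit is the period of composition and the second is the period of the manuscript (for example, M23). Where the composition date is unknown, X takes the place of the first digit (MX1, MX4). Table 1 lists all Helsinki periods as they appear in the corpus file names.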

Table 1: Helsinki periods
Period designation   Composition date   Manuscript date
MX1                  unknown            1150-1250
M1                   1150-1250          1150-1250
M2                   1250-1350          1250-1350
M23                  1250-1350          1350-1420
M24                  1250-1350          1420-1500
M3                   1350-1420          1350-1420
M34                  1350-1420          1420-1500
MX4                  unknown            1420-1500
M4                   1420-1500          1420-1500
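
The period designations in Table 1 follow a regular pattern, so they can be decoded mechanically. The following is a minimal Python sketch, not part of the corpus distribution; the function name decode_period and the constant MAIN_PERIODS are hypothetical, and the date ranges are copied from Table 1. It also assumes that a composition period never follows the manuscript period, which holds for every row of the table.

```python
# Date ranges for the four main periods, copied from Table 1.
MAIN_PERIODS = {
    "1": (1150, 1250),
    "2": (1250, 1350),
    "3": (1350, 1420),
    "4": (1420, 1500),
}


def decode_period(designation):
    """Return (composition_range, manuscript_range) for a period designation.

    Each range is a (start_year, end_year) tuple. The composition range is
    None when the composition date is unknown (designations such as MX1).
    """
    code = designation.strip().upper()
    if not code.startswith("M") or len(code) not in (2, 3):
        raise ValueError(f"not a period designation: {designation!r}")

    body = code[1:]
    if len(body) == 1:
        # One digit: composition and manuscript fall in the same period.
        comp, ms = body, body
    else:
        # Two characters: composition period first, manuscript period second.
        comp, ms = body[0], body[1]

    if ms not in MAIN_PERIODS:
        raise ValueError(f"unknown manuscript period in {designation!r}")
    if comp == "X":
        return None, MAIN_PERIODS[ms]
    if comp not in MAIN_PERIODS or comp > ms:
        # Assumption: composition cannot be later than the manuscript.
        raise ValueError(f"unknown composition period in {designation!r}")
    return MAIN_PERIODS[comp], MAIN_PERIODS[ms]
```

For example, decode_period("M23") yields the composition range 1250-1350 and the manuscript range 1350-1420, matching the M23 row of Table 1.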

The current edition of the PPCME2 includes a total of roughly 1.2 million words of running text. Each of the 56 text samples in the corpus is available in three forms: parsed, part-of-speech tagged, and unannotated text. In addition, there is a file with philological and bibliographical information about each text.

Wordcount information

Wordcounts for the individual text samples, along with date and genre information, are contained in the file WORDCOUNT-PPCME2 in the current directory. The wordcounts exclude punctuation and extralinguistic material such as page numbers or token ID numbers.

The file is plain text with fields separated by the space character, so it can be imported into any spreadsheet program.
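
As an alternative to a spreadsheet import, the file can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name read_wordcount_rows is hypothetical, the default encoding is a guess, and no field is assumed to contain a space. The order and meaning of the columns are not described here, so the sketch returns the raw fields of each line without naming them.

```python
def read_wordcount_rows(path, encoding="utf-8"):
    """Read a space-separated text file into a list of rows.

    Each row is the list of fields found on one non-empty line.
    Assumptions: the encoding default is a guess, and no field
    contains a space character.
    """
    rows = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            fields = line.split()  # splits on runs of whitespace
            if fields:             # skip blank lines
                rows.append(fields)
    return rows


# Typical use, with the file in the current directory:
# rows = read_wordcount_rows("WORDCOUNT-PPCME2")
```

Every field is returned as a string; the wordcount column would need to be converted with int() once its position has been confirmed by inspecting the file.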