Corpus description, PPCMBE2

General information

The Penn Parsed Corpus of Modern British English, 2nd edition (PPCMBE2), consisting of almost 2.8 million words, is part of an overarching project at the University of Pennsylvania and the University of York to produce syntactically annotated corpora for all stages of the history of English. Each of the 275 text samples in the corpus is available in three forms: parsed, part-of-speech tagged, and unannotated text. In addition, there is a file with philological and bibliographical information about each text.

The second edition includes the 101 text samples from the first edition, together with 174 further text samples. Errors in the samples from the first edition that we have found ourselves or that have been reported to us are corrected in this edition. As with the first edition, the genre composition of the second edition has been kept as close as possible to that of the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), except that we have added no further statutes to the second edition. The samples from the first and the second editions are distinguished by a trailing "-1" or "-2", respectively, in their filenames.

Wordcount information

Wordcounts for the individual text samples, along with date and genre information, are contained in the file WORDCOUNT-PPCMBE2 in the current directory. The wordcounts exclude punctuation and extralinguistic material such as page numbers or token ID numbers.

The file is a plain-text file suitable for importing into any spreadsheet program; fields are separated by the space character.
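
For readers who prefer a script to a spreadsheet, the following is a minimal sketch of reading a space-separated file of this kind in Python. The column order of WORDCOUNT-PPCMBE2 is not described here, so the sample rows, the column positions, and the function names read_space_separated and total_words are all assumptions made for illustration; check the actual file before relying on any column index.

```python
import io


def read_space_separated(handle):
    """Return each non-empty line as a list of fields.

    str.split() with no argument splits on any run of whitespace,
    so repeated spaces used for alignment do not produce empty fields.
    """
    rows = []
    for line in handle:
        fields = line.split()
        if fields:
            rows.append(fields)
    return rows


def total_words(rows, count_column):
    """Sum the integers in one column.

    Rows in which that field is missing or is not a plain number
    (for example a header line) are skipped.
    """
    total = 0
    for fields in rows:
        if count_column < len(fields) and fields[count_column].isdigit():
            total += int(fields[count_column])
    return total


# Invented sample data. The layout (name, date, genre, wordcount) is
# hypothetical and is NOT taken from the real WORDCOUNT-PPCMBE2 file.
sample = io.StringIO(
    "text-a 1750 GENRE_X 1200\n"
    "text-b 1801 GENRE_Y 800\n"
)
rows = read_space_separated(sample)
grand_total = total_words(rows, count_column=3)
```

To apply the same functions to the real file, open WORDCOUNT-PPCMBE2 with open() in place of the in-memory sample and pass whichever column index actually holds the wordcount.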

Conventions governing filenames

General conventions

The texts in the corpus are generally named after their author. Different authors with the same surname are distinguished by appending Arabic numerals to the shared surname (for instance, "turner1", "turner2", "turner3"). Works by authors whose names are not known are either named for some salient feature of the text, such as the profession of the author ("midshipman", "officer") or the title of the work ("erv", "grafting", "statutes"), or they are named "anon1", "anon2", etc.

The filenames for the texts also include the year of composition or publication. For texts that span several years within a decade, the final digit of the year is replaced by "x". Texts from separate decades are generally given their own files. For instance, nightingale-188x and nightingale-189x contain Florence Nightingale's letters from the 1880s and 1890s, respectively. However, when the material from a decade is not very extensive, it is subsumed in the file for an adjacent decade. For instance, forster2-191x contains a few letters from 1909.

As noted above, the second edition of the PPCMBE contains more text samples than the first edition. In some cases, the additional samples extend the original sample by a given author; in other cases, the additional samples are by authors not represented in the first edition. In both cases, the filenames for the samples from the first edition contain a trailing "-1"; the samples that have been added to the second edition contain a trailing "-2".
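
The conventions above can be sketched as a small filename parser in Python. This is an illustration under stated assumptions, not a tool supplied with the corpus: the names SampleName and parse_sample_name are hypothetical, the pattern assumes lowercase labels without internal hyphens, and any file extension is simply discarded because extensions are not specified here. An optional descriptive part after the year is allowed so that a name such as "victoria-186x-private-letters" is also accepted.

```python
import re
from typing import NamedTuple, Optional


class SampleName(NamedTuple):
    label: str              # author surname or other label, e.g. "forster2"
    year: str               # four characters, e.g. "1840" or "188x"
    spans_decade: bool      # True when the last digit is replaced by "x"
    extra: Optional[str]    # optional descriptive part, e.g. "private-letters"
    edition: Optional[int]  # 1 or 2 when a trailing "-1"/"-2" is present


# Assumed shape: label-year[-extra][-edition]
_NAME = re.compile(
    r"(?P<label>[a-z]+\d*)"
    r"-(?P<year>\d{3}[\dx])"
    r"(?:-(?P<extra>[a-z]+(?:-[a-z]+)*))?"
    r"(?:-(?P<edition>[12]))?"
)


def parse_sample_name(name):
    """Split a sample filename into its parts, or return None if it
    does not fit the assumed pattern."""
    # Drop anything after the first dot (a hypothetical file extension).
    stem = name.lower().split(".", 1)[0]
    match = _NAME.fullmatch(stem)
    if match is None:
        return None
    edition = match.group("edition")
    return SampleName(
        label=match.group("label"),
        year=match.group("year"),
        spans_decade=match.group("year").endswith("x"),
        extra=match.group("extra"),
        edition=int(edition) if edition else None,
    )


parsed = parse_sample_name("frost-1840-2")
```

Returning None for a name that does not fit, rather than raising an error, makes it easy to list any files that depart from the assumed pattern and inspect them by hand.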

Special cases

The sample of Queen Victoria's private letters from the first edition has been renamed from "victoria-186x" to "victoria-186x-private-letters" in order to explicitly distinguish it from the contemporaneous sample of official letters added in the second edition.

The trial proceedings in frost-1840 were not available in time for the first edition of the PPCMBE. The current sample is divided into two parts. The first (frost-1840-1) is the sample that would have been included in the first edition, had it been available. The second sample (frost-1840-2) augments the first.

Name vs. title

Following the conventions of the Helsinki Corpus, authors are identified by name rather than by title. Sovereigns of England are identified by their given name. For instance, George III and Victoria are identified as "george" and "victoria". Members of the nobility or gentry are identified by their surname. For instance, John William Strutt, third Baron Rayleigh, and Arthur Wellesley, first duke of Wellington, appear in the corpus as "strutt" and "wellesley", even though they are better known as Lord Rayleigh and the Duke of Wellington.

Women's names

The confusing issues regarding women's names that arose in the PPCEME do not arise in the present corpus.