Corpus description, PPCEME

General information

The Penn-Helsinki Parsed Corpus of Early Modern English, consisting of over 1.7 million words, is part of an overarching project at the University of Pennsylvania and the University of York to produce syntactically annotated corpora for all stages of the history of English. Each of the 448 text samples in the corpus is available in three forms: parsed, part-of-speech tagged, and unannotated text, as explained in detail in the annotation guidelines. In addition, the corpus is divided into three subcorpora:
  1. The Helsinki directories, consisting of 147 text samples with roughly 550,000 words, contain the Helsinki Corpus in parsed, POS-tagged, and unannotated form.

  2. The Penn1 directories, consisting of 152 text samples with roughly 600,000 words, contain a first supplement to the Helsinki Corpus. As far as possible, we have used material by the same authors and from the same editions as the material in the Helsinki Corpus. Where necessary (where the Helsinki Corpus contains an exhaustive sample of a text), we have added new material as summarized below.

  3. The Penn2 directories, consisting of 149 text samples with roughly 590,000 words, contain a second supplement to the Helsinki Corpus. Again, we have tried to use material by the same authors and from the same editions as the material in the Helsinki Corpus. However, as might be expected, the Penn2 directories contain more new material than the Penn1 directories.
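As a quick consistency check on the figures above, the following sketch sums the sample counts and the approximate word counts of the three subcorpora. The dictionary name and layout are illustrative assumptions; the numbers are those quoted in the list, and the word counts are rounded.

```python
# Figures quoted in the list above; word counts are rounded ("roughly").
SUBCORPORA = {
    "Helsinki": {"samples": 147, "words": 550_000},
    "Penn1": {"samples": 152, "words": 600_000},
    "Penn2": {"samples": 149, "words": 590_000},
}

total_samples = sum(s["samples"] for s in SUBCORPORA.values())
total_words = sum(s["words"] for s in SUBCORPORA.values())

print(total_samples)  # 448
print(total_words)    # 1740000
```

The totals agree with the 448 text samples and the "over 1.7 million words" given under General information.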

Wordcount information

Wordcounts for the individual text samples, along with date and genre information, are contained in the file WORDCOUNT-PPCEME in the current directory. The wordcounts exclude punctuation and extralinguistic material such as page numbers or token ID numbers.

The file is a text file that is suitable for importing into any spreadsheet program; the field separator is the space character.
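For readers who prefer a script to a spreadsheet, here is a minimal Python sketch for reading a space-separated file of this kind. The column layout is not described here, so the function simply returns each row as a list of raw fields. The function names, the default path, and the UTF-8 encoding are assumptions, and the sketch assumes that no field contains a space.

```python
from typing import Iterable, List


def read_space_separated(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-empty line into fields on whitespace."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # str.split() with no argument treats runs of whitespace as one separator.
        rows.append(line.split())
    return rows


def load_wordcounts(path: str = "WORDCOUNT-PPCEME") -> List[List[str]]:
    """Read the wordcount file; the path and encoding are assumptions."""
    with open(path, encoding="utf-8") as handle:
        return read_space_separated(handle)
```

Calling `load_wordcounts()` from the directory that contains the file yields one list of strings per text sample. Converting the wordcount field to an integer is left to the caller, since its column position should be checked against the file itself.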

Conventions governing filenames

General conventions

As in the Helsinki Corpus, the filenames for the texts contain an indication of the time period to which they belong. See Philological information for more details about the individual texts.

In addition, the filenames in the PPCEME contain an indication of which subcorpus they belong to.

A few examples:


In tripling the size of the samples from the Helsinki Corpus, we have sometimes had to include texts by new authors (either because the Helsinki Corpus sample for an author was itself already exhaustive, or because we ran out of text in the course of tripling the sample size). In what follows, we describe the conventions that we have followed in assigning filenames to these new authors. Our general rule has been to leave Helsinki Corpus filenames unchanged, but we have sometimes slightly modified the original Helsinki filenames for clarity and consistency. Table 1 at the end of this section lists these modifications and shows which PPCEME files supplement which Helsinki Corpus files.

Name vs. title

Following the conventions of the Helsinki Corpus, authors are identified by name rather than by title. Sovereigns of England are identified by their given name. For instance, Charles II is identified as charles. Other members of the nobility, including members of the royal family, are identified by their surname. For instance, Thomas Howard, earl of Surrey, 2nd duke of Norfolk, is identified as thoward (not norfolk), and Mary Tudor (Henry VIII's sister, not to be confused with his daughter, Mary I, who is not represented in the corpus) as mtudor.

In one or two cases, the Helsinki Corpus uses a title rather than a surname as the basis for a filename. For instance, Eleanor Clifford, countess of Cumberland, is identified as ecumberl (not clifford). In such cases, we retain the Helsinki filename in order to minimize confusion.

Women's names

As a general rule, women are identified by their surname at the time of writing. Generally (though not always), this is a married name. In order to minimize confusion, we do not change filenames to reflect a later marriage. Two examples:

In the correspondence of important families (such as that of the Barringtons, the Hattons, or the Plumptons), the Helsinki Corpus tends to identify women by their birthname, and we retain those filenames. So Anne Finch, countess of Nottingham, née Hatton, is identified as anhatton (not finch).

In one or two cases, a woman appears in the Helsinki Corpus under her married name despite belonging to one of the important correspondence families. For instance, Joan Everard and Elizabeth Masham, both née Barrington, are identified as everard (not jobarring) and masham (not ebarring). In such cases, we use the Helsinki filenames in order to minimize confusion.

Modifications of Helsinki Corpus filenames

Under certain circumstances, we have modified the filenames in the Helsinki Corpus for clarity and consistency. The conventions governing these modifications are given here, and the correspondence between the old and new filenames is set out in Table 1 at the end of the section.

Table 1: Summary of filename modifications and PPCEME-Helsinki correspondences
Helsinki filename | PPCEME filename (if different from Helsinki) | Supplemented by
alhatton | --- | alhatton2, ehatton2
bedyll | --- | friar, russell
boyle | --- | boylecol
clowes | --- | clowesobs
conway | --- | rich
counc | --- | dell
ebeaum | --- | mtudor-1510, mtudor-1520
ecumberl | --- | manners, delapole
ehatton | --- | mhatton, montague
eliz1, eliz2 | included in eliz-1590 | eliz-1560, eliz-1570, eliz-1580
eoxinden | included in eoxinden-1660 | dering, eoxinden-1650, eoxinden-1680, jackson, zouch
essex | --- | essexstate
everard | --- | jubarring
fhatton | --- | mhatton
harley | --- | harleyedw
henry1, henry2 | included in henry-1520 | henry-1530
hooker1 | included in hooker-a | ---
hooker2 | included in hooker-b | ---
hoxinden | hoxinden-1660 | hoxinden-1640, hoxinden-1650
jetaylor | --- | jetaylormeas
jpinney | --- | southard, part of jopinney
knyvett | included in knyvett-1620 | knyvett-1630
kscrope | kscrope-1530 | grey, kscrope-1580, mhoward
lords | --- | interview, marches, surety
morelet1, morelet2 | --- | part of mroper (see Remarks therein)
mowntayne | --- | underhill
nhadd | included in nhadd-1700 | nhadd-1710
osborne | --- | conway2
pettit | --- | pettit2
peyton | --- | moxinden
Plumpton correspondence | --- | abott, apoole, epoole, gascoigne, gpoole, nevill, rplumpt2, savill
proud | proud-1620 | proud-1630
raleigh | --- | judall
rferrar | --- | part of nferrar
rhaddsr | included in rhaddsr-1670 and rhaddsr-1700 | rhaddsr-1650, rhaddsr-1710
roxinden | included in roxinden-1620 | roxinden-1600, roxinden2
somers | --- | drummond
stat3 | stat-1500, included in stat-1540; see info for stat-period1-e1 | stat-1510, stat-1530, stat-1550, stat-1560
stat4 | stat-1590, included in stat-1600; see info for stat-period2-e2 | stat-1570, stat-1580, stat-1620, stat-1640
stat7 | included in stat-1690; see info for stat-period3-e3 | stat-1660
stevenso | --- | part of udall
strype | --- | joxinden
thoward | --- | dacre
throckm | --- | thoward2
tillots | divided into tillots-a, tillots-b | tillots-c
torkingt | --- | chaplain
trincoll | --- | hatcher, talbot
tunstall | --- | ambass
turner | --- | turnerherb
wcecil | included in wcecil-1580 | wcecil-1560
wpaston2 | --- | joxinden
wplumpt1 | wplumpt-1500 | ---
wplumpt2 | wplumpt-1510 | ---
wplumpt3 | wplumpt-1530 | ---
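
The correspondences in Table 1 can be treated as a lookup table. The following Python sketch copies a handful of rows from the "Supplemented by" column into a dictionary and inverts it, so that one can ask which Helsinki Corpus files a given supplement belongs to. The variable and function names are illustrative assumptions, and only a subset of the table is included.

```python
from collections import defaultdict

# A few rows copied from Table 1: Helsinki filename -> supplementing files.
SUPPLEMENTED_BY = {
    "alhatton": ["alhatton2", "ehatton2"],
    "ehatton": ["mhatton", "montague"],
    "fhatton": ["mhatton"],
    "strype": ["joxinden"],
    "wpaston2": ["joxinden"],
    "thoward": ["dacre"],
}


def invert(mapping):
    """Build the reverse index: supplement file -> Helsinki files it supplements."""
    reverse = defaultdict(list)
    for helsinki, supplements in mapping.items():
        for supplement in supplements:
            reverse[supplement].append(helsinki)
    return dict(reverse)


SUPPLEMENTS = invert(SUPPLEMENTED_BY)
print(SUPPLEMENTS["mhatton"])   # ['ehatton', 'fhatton']
print(SUPPLEMENTS["joxinden"])  # ['strype', 'wpaston2']
```

The reverse index makes it visible that a single supplement, such as mhatton or joxinden, can be listed against more than one Helsinki Corpus file.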