Corpus description

General information

The Penn Parsed Corpus of Modern British English (PPCMBE), consisting of just under one million words, is part of a larger ongoing project at the University of Pennsylvania and the University of York to produce syntactically annotated corpora for all stages of the history of English. The genre composition of the corpus has been kept as close as possible to that of the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME).

Wordcount information

Like the PPCEME, the PPCMBE spans roughly 210 years (1700-1914) and can thus be divided into three time periods of roughly 70 years each, analogous to the e1, e2, and e3 time periods of the PPCEME. Table 1 contains a wordcount summary by time period. All wordcounts exclude punctuation and extralinguistic material such as page numbers or token ID numbers.

Table 1: Wordcount summary by time period
Period      Wordcount
1700-1769     298,764
1770-1839     368,804
1840-1914     281,327
Total         948,895
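
The period boundaries in Table 1 can be turned into a small lookup. The following is a minimal sketch, not part of the corpus distribution; the function name period_for_year is hypothetical, and the period labels are simply the year ranges from the table.

```python
# Period boundaries copied from Table 1. The function name is a
# hypothetical helper, not something shipped with the corpus.
PERIODS = [
    (1700, 1769),
    (1770, 1839),
    (1840, 1914),
]


def period_for_year(year: int) -> str:
    """Return the Table 1 period (as "start-end") that contains `year`."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"{year} is outside the corpus range 1700-1914")


if __name__ == "__main__":
    # Years taken from Table 3: hind-1707, dickens-1837, weathers-1913.
    for year in (1707, 1837, 1913):
        print(year, period_for_year(year))
```

Texts whose date is a range or is only partly known (for instance "1805-1808" or "184x" in Table 3) would need to be reduced to a single year before such a lookup applies.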

Table 2 contains a wordcount summary by Helsinki Corpus text genre.

Table 2: Wordcount summary by text genre
Text genre                  Number of words   Percentage
Bible                                52,909         5.6%
Biography, autobiography             25,880         2.7%
Biography, other                     30,072         3.2%
Diary                                69,584         7.3%
Drama, comedy                        70,338         7.4%
Educational treatise                 64,839         6.8%
Fiction                              65,626         6.9%
Handbook, other                      63,557         6.7%
History                              61,621         6.5%
Law                                  65,748         6.9%
Letters, non-private                 33,826         3.6%
Letters, private                     66,362         7.0%
Philosophy                           17,108         1.8%
Proceedings, trials                  58,973         6.2%
Science, medicine                    23,147         2.4%
Science, other                       53,449         5.6%
Sermon                               54,711         5.8%
Travelogue                           71,145         7.5%
Total                               948,895         100%
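
The percentages in Table 2 appear to be each genre's wordcount divided by the corpus total, rounded to one decimal place. The sketch below reproduces that arithmetic for a few rows; the helper name percentage is hypothetical, and the rounding rule is an assumption inferred from the table rather than something the documentation states.

```python
# Wordcounts copied from Table 2 (a subset, for illustration only).
TOTAL_WORDS = 948_895
GENRE_WORDS = {
    "Bible": 52_909,
    "Letters, private": 66_362,
    "Philosophy": 17_108,
}


def percentage(words: int, total: int = TOTAL_WORDS) -> float:
    """Share of the corpus in percent, rounded to one decimal place."""
    return round(100 * words / total, 1)


if __name__ == "__main__":
    for genre, words in GENRE_WORDS.items():
        print(f"{genre}: {percentage(words):.1f}%")
```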

Finally, Table 3 gives wordcounts by individual text. The information in the table is also contained in the file WORDCOUNT-PPCMBE in the current directory. The file is suitable for importing into a spreadsheet program; fields are separated by the space character.

Table 3: Wordcount summary by individual text
Text Date Genre Wordcount
albin-1736 1736 SCIENCE_OTHER 8,837
anon-1711 1711 EDUC_TREATISE 6,092
austen-180x 1805-1808 LETTERS_PRIV 9,650
bain-1878 1878 EDUC_TREATISE 9,095
barclay-1743 1743 EDUC_TREATISE 9,422
bardsley-1807 1807 SCIENCE_MEDICINE 7,694
benson-1908 1908 EDUC_TREATISE 9,042
benson-190x 1905-1906 DIARY 9,986
boethja-1897 1897 PHILOSOPHY 7,935
boethri-1785 1785 PHILOSOPHY 9,173
boswell-1776 1776 DIARY 9,887
bradley-1905 1905 TRAVELOGUE 10,292
brightland-1711 1711 EDUC_TREATISE 1,341
brougham-1861 1861 DRAMA_COMEDY 10,049
burton-1762 1762 SERMON 9,110
butler-1726 1726 SERMON 9,099
carlyle-1835 1835 LETTERS_PRIV 9,343
carlyle-1837 1837 HISTORY 8,752
chapman-1774 1774 EDUC_TREATISE 9,027
cibber-1740 1740 BIOGRAPHY_AUTO 10,046
collier-1835 1835 DRAMA_COMEDY 9,459
colman-1805 1805 DRAMA_COMEDY 10,161
cook-1776 1776 TRAVELOGUE 10,148
cooke-1712 1712 TRAVELOGUE 10,027
davys-1716 1716 DRAMA_COMEDY 10,294
defoe-1719 1719 FICTION 9,378
dickens-1837 1837 FICTION 9,437
doddridge-1747 1747 BIOGRAPHY_OTHER 10,432
drummond-1718 1718 HANDBOOK_OTHER 7,905
erv-new-1881 1881 BIBLE 10,964
erv-old-1885 1885 BIBLE 10,292
faraday-1859 1859 SCIENCE_OTHER 8,821
fayrer-1900 1900 BIOGRAPHY_AUTO 7,754
fielding-1749 1749 FICTION 9,385
fleming-1886 1886 HANDBOOK_OTHER 9,038
froude-1830 1830 SERMON 9,254
george-1763 1763 LETTERS_NON-PRIV 4,941
gibbon-1776 1776 HISTORY 8,804
gladstone-1873 1873 LETTERS_NON-PRIV 11,240
godwin-1805 1805 FICTION 9,343
goldsmith-1773 1773 DRAMA_COMEDY 10,385
grafting-1780 1780 HANDBOOK_OTHER 9,130
haydon-1808 1808 DIARY 10,015
herschel-1797 1797 SCIENCE_OTHER 9,156
hind-1707 1707 HISTORY 8,791
holmes-letters-1749 1749 LETTERS_NON-PRIV 6,535
holmes-trial-1749 1749 PROCEEDINGS_TRIAL 20,707
johnson-1775 1775 LETTERS_PRIV 9,525
kimber-1742 1742 HISTORY 8,829
lancaster-1806 1806 EDUC_TREATISE 9,214
lind-1753 1753 SCIENCE_MEDICINE 7,734
long-1866 1866 HISTORY 8,851
lyell-1830 1830 SCIENCE_OTHER 8,934
maxwell-1747 1747 HANDBOOK_OTHER 10,271
meredith-1895 1895 FICTION 9,322
montagu-1718 1718 LETTERS_PRIV 9,344
montefiore-1836 1836 TRAVELOGUE 10,195
newcome-new-1796 1796 BIBLE 11,033
nightingale-188x 1888-1889 LETTERS_PRIV 3,302
nightingale-189x 1890 LETTERS_PRIV 6,201
officer-1744 1744 TRAVELOGUE 10,032
okeeffe-1826 1826 BIOGRAPHY_AUTO 8,080
oman-1895 1895 HISTORY 8,851
poore-1876 1876 SCIENCE_MEDICINE 7,719
priestley-1769 1769 SCIENCE_OTHER 8,911
purver-new-1764 1764 BIBLE 11,099
purver-old-1764 1764 BIBLE 9,521
pusey-186x 1865-1866 SERMON 9,022
reade-1863 1863 TRAVELOGUE 10,369
reeve-1777 1777 FICTION 9,432
ruskin-1835 1835 DIARY 9,882
ryder-1716 1716 DIARY 9,916
skeavington-184x 184x HANDBOOK_OTHER 9,132
southey-1813 1813 BIOGRAPHY_OTHER 9,829
statutes-171x 1715-1716 LAW 9,315
statutes-1745 1745 LAW 9,320
statutes-1775 1775 LAW 9,436
statutes-1805 1805 LAW 9,440
statutes-1835 1835 LAW 9,370
statutes-1865 1865 LAW 9,456
statutes-1895 1895 LAW 9,411
stevens-1745 1745 DRAMA_COMEDY 10,277
strutt-1890 1890 SCIENCE_OTHER 8,790
talbot-1901 1901 SERMON 9,138
thring-187x 1870-1872 DIARY 9,997
tindall-1814 1814 HANDBOOK_OTHER 9,044
townley-1746 1746 PROCEEDINGS_TRIAL 9,995
trollope-1882 1882 BIOGRAPHY_OTHER 9,811
turner1-1799 1799 HISTORY 8,743
turner2-1800 1800 TRAVELOGUE 10,082
victoria-186x 1863-1865 LETTERS_PRIV 9,368
walpole-174x 1740-1747 LETTERS_PRIV 9,629
watson-1817 1817 PROCEEDINGS_TRIAL 28,271
weathers-1913 1913 HANDBOOK_OTHER 9,037
webster-1718 1718 EDUC_TREATISE 2,328
wellesley-1815 1815 LETTERS_NON-PRIV 11,110
wesley-174x 1744-1745 DIARY 9,901
whewell-1837 1837 EDUC_TREATISE 9,278
wilde-1895 1895 DRAMA_COMEDY 9,713
wollaston-1793 1793 SERMON 9,088
yonge-1865 1865 FICTION 9,329
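
Rows in the layout of Table 3 can be read and aggregated with a few lines of code. The following is a minimal sketch under stated assumptions: it assumes that each line holds four space-separated fields (text, date, genre, wordcount) as in the table above, and that wordcounts may contain thousands separators. The actual layout of WORDCOUNT-PPCMBE may differ, and the names TextRecord, parse_wordcount_lines, and totals_by_genre are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TextRecord(NamedTuple):
    text: str
    date: str       # kept as a string: may be a range ("1805-1808") or "184x"
    genre: str
    wordcount: int


def parse_wordcount_lines(lines: Iterable[str]) -> List[TextRecord]:
    """Parse lines in the assumed 'text date genre wordcount' layout."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines and anything not in the assumed layout
        text, date, genre, count = fields
        try:
            wordcount = int(count.replace(",", ""))
        except ValueError:
            continue  # skip a header row such as "Text Date Genre Wordcount"
        records.append(TextRecord(text, date, genre, wordcount))
    return records


def totals_by_genre(records: Iterable[TextRecord]) -> Dict[str, int]:
    """Sum wordcounts per genre label."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.genre] += record.wordcount
    return dict(totals)


# A few rows copied from Table 3, used here instead of the real file.
SAMPLE = [
    "Text Date Genre Wordcount",
    "boethja-1897 1897 PHILOSOPHY 7,935",
    "boethri-1785 1785 PHILOSOPHY 9,173",
    "austen-180x 1805-1808 LETTERS_PRIV 9,650",
]

if __name__ == "__main__":
    print(totals_by_genre(parse_wordcount_lines(SAMPLE)))
```

Summing the two PHILOSOPHY rows gives 17,108, which matches the Philosophy row of Table 2; the same check can be run for every genre once all rows are loaded.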

Conventions governing filenames

General conventions

The texts in the corpus are generally named after their author. Multiple authors with the same surname are distinguished by appending Arabic numerals to the name (for instance, "turner1" vs. "turner2"). In cases of anonymous or multiple authors, the name for the text is based either on the title of the work, as in the case of the English Revised Version ("erv"), the statutes ("statutes"), and a manual on grafting ("grafting"), or on the profession of the author ("officer"). In one case, we call the text "anon". Should the need arise for extended versions of the corpus, distinct anonymous authors would be identified as "anon2", "anon3", and so on.

The filename for a text also includes the year of composition or publication. Texts that span several years within a decade have "x" in place of the last digit of the year (for instance, austen-180x covers 1805-1808). Texts from separate decades are given their own files. For instance, nightingale-188x and nightingale-189x contain Florence Nightingale's letters from the 1880s and 1890s, respectively.
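
The naming convention described above can be checked mechanically. The sketch below splits a text name into its author or title part and its year part; the regular expression and the names TextId and parse_text_id are hypothetical, and the pattern assumes that the year is always the final hyphen-separated component, as it is for every entry in Table 3.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: anything, then a hyphen, then three digits followed by
# either a fourth digit or "x" (a text spanning several years of a decade).
TEXT_ID_RE = re.compile(r"^(?P<name>.+)-(?P<year>[0-9]{3}[0-9x])$")


class TextId(NamedTuple):
    name: str             # author- or title-based part, e.g. "holmes-letters"
    year: Optional[int]   # exact year, or None when the last digit is "x"
    decade: int           # first year of the decade, e.g. 1880


def parse_text_id(text_id: str) -> TextId:
    """Split a text name such as 'nightingale-188x' into its parts."""
    match = TEXT_ID_RE.match(text_id)
    if match is None:
        raise ValueError(f"unexpected text name: {text_id!r}")
    year_field = match.group("year")
    decade = int(year_field[:3]) * 10
    year = None if year_field.endswith("x") else int(year_field)
    return TextId(match.group("name"), year, decade)


if __name__ == "__main__":
    for text_id in ("nightingale-188x", "holmes-letters-1749", "turner1-1799"):
        print(parse_text_id(text_id))
```

Because the name part is matched greedily, names that themselves contain hyphens (such as "holmes-letters" or "erv-new") stay intact and only the final component is read as the year.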

Name vs. title

Following the conventions of the Helsinki Corpus, authors are identified by name rather than by title. Sovereigns of England are identified by their given name. For instance, George III and Victoria are identified as "george" and "victoria". Members of the nobility or gentry are identified by their surname. For instance, John William Strutt, third Baron Rayleigh, and Arthur Wellesley, first Duke of Wellington, appear in the corpus as "strutt" and "wellesley", even though they are better known as Lord Rayleigh and the Duke of Wellington.

Women's names

The confusing issues regarding women's names that arose in the PPCEME do not arise in the present corpus.