Penn Corpora of Historical English
The Penn Corpora of Historical English, including the Penn-Helsinki Parsed
Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki
Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed
Corpus of Modern British English (PPCMBE), are running texts and text samples
of British English prose across its history, from the earliest Middle English
documents up to the First World War. The texts come
in three forms: simple text, part-of-speech tagged text and syntactically
annotated text. The syntactic annotation (parsing) permits searching not
only for words and word sequences, but also for syntactic structure. All of
the annotation has been carefully checked by expert human annotators for
accuracy and consistency.
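As a rough illustration of what searching for syntactic structure means, the sketch below reads one parsed sentence written in labelled-bracket notation and finds every clause that directly contains a direct object. It is a minimal sketch under stated assumptions, not the corpora's own search software: the example sentence is invented, its labels (IP-MAT, NP-SBJ, NP-OB1 and so on) are used for illustration only, and the function names are hypothetical.

```python
# Toy structure search over one labelled-bracket parse tree.
# The sentence, its labels, and all function names are illustrative assumptions.

EXAMPLE = "(IP-MAT (NP-SBJ (PRO he)) (VBD saw) (NP-OB1 (D the) (N abbey)))"


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def find_nodes(tree, label):
    """Yield every subtree whose label equals `label`."""
    node_label, children = tree
    if node_label == label:
        yield tree
    for child in children:
        if isinstance(child, tuple):
            yield from find_nodes(child, label)


def immediately_dominates(tree, parent, child):
    """Return the `parent` nodes that have a `child` node as a direct daughter."""
    return [
        node
        for node in find_nodes(tree, parent)
        if any(isinstance(c, tuple) and c[0] == child for c in node[1])
    ]


def words(tree):
    """Return the words of a subtree in their original order."""
    out = []
    for child in tree[1]:
        if isinstance(child, tuple):
            out.extend(words(child))
        else:
            out.append(child)
    return out


if __name__ == "__main__":
    tree = parse_tree(EXAMPLE)
    for clause in immediately_dominates(tree, "IP-MAT", "NP-OB1"):
        print(" ".join(words(clause)))
```

A plain word search could find "abbey", but only the parsed form can answer a question such as "which clauses have an overt direct object", which is the kind of query the syntactic annotation makes possible.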
The corpora are designed for the use of students and scholars of the
history of English, especially the historical syntax of the language,
and they are publicly available to individuals, research groups and libraries.
The Penn Corpora of Historical English are distributed on CD-ROM, along with
software to retrieve words and structures of interest to the user. This
software searches all three forms of text on the CD and also provides
sophisticated coding, editing and display facilities.
Various classes of corpus site license are available to individuals, to
academic departments or research groups and to libraries. See the
corpus Order Form for charges. Corpus license
fees go toward improving the corpora and increasing them in size. Upgrades,
when completed, are available to corpus license holders at modest cost.
The PPCHE CD-ROM (version 3.1) now contains a local web server which allows
users to search the corpora for syntactic structures or part-of-speech tagged
text from a web browser. A demonstration of this capability can be explored
at the PPCHE web demo site.
License holders of the current corpus version (version 3) may obtain a free
copy of the web server software by contacting the corpus project at: krochATlingDOTupennDOTedu.
The search program included with the Penn Historical Corpora, CorpusSearch2,
was written by Beth Randall and has been released as open source
software. The most current version is always downloadable from its
project web site.
- The PPCME2 was created with the support of the National Science Foundation
(Grants BNS89-19701 and SBR95-11368), with supplementary support from the
University of Pennsylvania Research Foundation.
- The PPCEME was created with the support of the National Endowment for
the Humanities (Grant PA23382-99) and the National Science Foundation (Grant BCS99-05488).
- The PPCMBE was created with the support of the National Science
Foundation (Grant BCS05-08731).
Yorkshire. It was at abbeys like Byland, throughout Britain, that the
manuscripts on which our knowledge of Middle English is based were largely
written, copied and preserved. The monastic orders that built and inhabited
these monasteries were dissolved by Henry the Eighth, whereupon the buildings
were dismantled for building materials by the landlords who succeeded to
the monastic estates. Most of the abbeys' manuscripts were lost, but some
came into private hands and so survived. Photo © A. Kroch 1998.