PPCHE RELEASE NUMBER: 5
RELEASE DATE: TBA

Last updated: April 4, 2025
(/htdocs/ppche/ppche-release-2016 on babel)

Penn Parsed Corpora of Historical English

The Penn Parsed Corpora of Historical English are running texts and text samples of British English prose across its history - from the earliest Middle English documents up to the First World War. They include three corpora:

  • the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2),
  • the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and
  • the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2).

The texts come in three forms: simple text, part-of-speech tagged text, and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for abstract syntactic structures. All of the annotation has been carefully reviewed by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are available to individuals, research groups, and libraries.
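
To illustrate what searching for an abstract syntactic structure can mean in practice, here is a minimal sketch in Python. It assumes a labeled-bracketing representation of a parsed sentence; the sample sentence is invented for illustration and is not taken from the corpora, and the labels (IP-MAT, NP-SBJ, NP-OB1, VBD) are used here only as examples of Penn-style category labels. The helper names parse, find, and words are hypothetical. Consult the annotation guidelines for the actual file format and tag set.

```python
import re

def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.
    Leaves (words) are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def find(tree, parent_label, child_label):
    """Yield every child_label node immediately dominated by a parent_label node."""
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            if label == parent_label and child[0] == child_label:
                yield child
            yield from find(child, parent_label, child_label)

def words(tree):
    """Collect the words under a node, left to right."""
    out = []
    for child in tree[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out

# Invented sample sentence; labels are illustrative assumptions.
SAMPLE = "(IP-MAT (NP-SBJ (D the) (N king)) (VBD sent) (NP-OB1 (D a) (N letter)) (. .))"

tree = parse(SAMPLE)
subjects = [" ".join(words(n)) for n in find(tree, "IP-MAT", "NP-SBJ")]
print(subjects)
```

The query here is structural rather than lexical: it asks for subject noun phrases directly under a matrix clause, whatever words they contain. A plain-text or tagged-text search cannot express that relation.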

The 2016 release adds 2 million words to the Modern British English corpus, for a total of 3 million words, and includes a substantial number of corrections to the other corpora in the series. In addition, several small changes were made to streamline the annotation guidelines (2016 version of guidelines).

As of July 2024, there is a new release of the PPCHE, which corrects errors and inconsistencies in the 2016 release and further streamlines the annotation. The corpora themselves are no longer available on GitHub, as posting them there infringed on the distribution rights held by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. The new release will be distributed by the LDC. For details, please contact the LDC directly at ldc AT ldc DOT upenn DOT edu.

Supporting material for the new release is available on GitHub:

Please direct questions concerning the PPCHE, and especially reports of annotation errors, to Beatrice Santorini at beatrice DOT santorini AT gmail DOT com, so that we can continue to improve the quality of the corpora.

The 2016 release continues to be available from the LDC under catalog number LDC2020T16. Details concerning that release are available here.

Acknowledgments

  • The PPCME2 was created with the support of the National Science Foundation (Grants BNS 89-19701 and SBR 95-11368), with supplementary support from the University of Pennsylvania Research Foundation.
  • The PPCEME was created with the support of the National Endowment for the Humanities (Grant PA 23382-99) and the National Science Foundation (Grant BCS 99-05488).
  • The PPCMBE2 was created with the support of the National Science Foundation (Grants BCS 05-08731 and BCS 11-47499).

With respect to the above-listed grants, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Endowment for the Humanities or the National Science Foundation.

Byland Abbey, Yorkshire. It was at abbeys like Byland, throughout Britain, that the manuscripts on which our knowledge of Middle English is based were largely written, copied, and preserved. The monastic orders that built and inhabited these monasteries were dissolved by Henry VIII, whereupon the buildings were dismantled for building materials by the landlords who succeeded to the monastic estates. Most of the abbeys' manuscripts were lost, but some came into private hands and so survived. Photo © A. Kroch 1998.