PPCHE RELEASE NUMBER: 5
RELEASE DATE: TBA

Last updated: July 15, 2025
(babel: /htdocs/ppche/ppche-README)

Penn Parsed Corpora of Historical English

The Penn Parsed Corpora of Historical English (PPCHE) consist of running texts and text samples of British English prose across its history, from the earliest Middle English documents up to the First World War. They include three corpora:

  • the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2),
  • the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and
  • the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2).

The texts come in three forms: unannotated, part-of-speech tagged, and syntactically annotated (parsed). The syntactic annotation permits searches not only for words and word sequences, but also for abstract syntactic structures. All of the annotation has been carefully reviewed by expert human annotators for accuracy and consistency. The corpora are available for non-profit use by individuals, departments, research groups, and libraries. They were originally designed for use by students and scholars of the history of English, especially the historical syntax of the language. More recently, they have proven useful to computational linguists for research in domain adaptation.
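
To illustrate what searching for an abstract syntactic structure can look like, here is a minimal Python sketch. It is not the corpora's own search tool, and the sample sentence is invented for illustration rather than taken from any corpus file. The node labels (IP, NP, PP, and so on) and the assumption that every bracketed node begins with a label are simplifications; the real files may include additional wrappers and metadata that this sketch does not handle.

```python
from typing import Iterator, List, Tuple

# A tree is either a leaf word (str) or a list: [label, child, child, ...].

def tokenize(text: str) -> List[str]:
    """Split a bracketed string into parentheses, labels, and words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens: List[str], pos: int = 0) -> Tuple[list, int]:
    """Parse one labeled node starting at tokens[pos]; return (tree, next position)."""
    assert tokens[pos] == "(", "a node must start with an opening bracket"
    node = [tokens[pos + 1]]          # assumption: the label follows the bracket
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        node.append(child)
    return node, pos + 1

def subtrees(tree) -> Iterator[list]:
    """Yield every labeled node in the tree, parents before children."""
    if isinstance(tree, list):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)

def words(tree) -> List[str]:
    """Collect the leaf words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]

def find(tree, parent_label: str, child_label: str) -> Iterator[list]:
    """Yield nodes labeled parent_label that immediately dominate a child_label node."""
    for node in subtrees(tree):
        if node[0] == parent_label and any(
            isinstance(c, list) and c[0] == child_label for c in node[1:]
        ):
            yield node

# Invented example sentence with hypothetical labels.
SAMPLE = (
    "(IP (NP (D the) (N king)) (VBD sent) (NP (D a) (N letter)) "
    "(PP (P to) (NP (D the) (N abbot))))"
)

tree, _ = parse(tokenize(SAMPLE))
hits = [" ".join(words(node)) for node in find(tree, "PP", "NP")]
print(hits)
```

The query here asks for every prepositional phrase that immediately dominates a noun phrase, regardless of which words it contains. A plain word search cannot express that kind of pattern.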

A 2016 release of PPCHE added 2 million words to the Modern British English corpus, for a total of 3 million words, and included a substantial number of corrections to all three corpora. In addition, the 2016 annotation guidelines were slightly streamlined relative to earlier versions.

As of July 2025, the 2016 release is superseded by PPCHE2, which again corrects annotation errors and inconsistencies and streamlines the annotation guidelines yet further. Unlike earlier releases, PPCHE2 contains only tagged and parsed versions of the texts. It is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania under catalog number LDC2025T09. The 2016 release remains available under catalog number LDC2020T16.

For questions concerning distribution, please contact LDC (ldc AT ldc DOT upenn DOT edu). For other issues, contact Beatrice Santorini (beatrice DOT santorini AT gmail DOT com). We especially welcome reports of annotation errors or inconsistencies, so that we can continue to improve the quality of the corpora.

For a short time, PPCHE2 was available in its entirety on GitHub. This is no longer the case, because posting the texts infringed on LDC's prior distribution rights. Supporting material for PPCHE2 continues to be available on GitHub.

Acknowledgments

With respect to the above-listed grants, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Endowment for the Humanities or the National Science Foundation.

Byland Abbey, Yorkshire. It was at abbeys like Byland, throughout Britain, that the manuscripts on which our knowledge of Middle English is based were largely written, copied, and preserved. The monastic orders that built and inhabited these monasteries were dissolved by Henry VIII, whereupon the buildings were dismantled for building materials by the landlords who succeeded to the monastic estates. Most of the abbeys' manuscripts were lost, but some came into private hands and so survived. Photo © A. Kroch 1998.