Annotation manual for the Penn Parsed Corpora of Historical English and the Parsed Corpus of Early English Correspondence 2

Beatrice Santorini
(January 2022)

This annotation manual revises the previous versions (2004, 2016). It is heavily indebted to the original document developed by Ann Taylor and Anthony Kroch for Middle English (Kroch and Taylor 2000) as well as to the spirit of the guidelines for the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993).

The substance of the guidelines remains largely unchanged, but the annotation scheme has been streamlined (see Changes), and the differences between the annotation of Middle English and (Early) Modern English have been reduced. Some differences remain, however, because the syntax of Middle English differs from that of later stages of the language. Except where necessary, the examples in the body of the manual are from (Early) Modern English.

This version of the manual also contains guidelines concerning lemmatization, which has been carried out for PPCEME and PPCMBE2 using the Oxford English Dictionary (OED) as a lemma authority.

The present guidelines are in force for the following corpora:

- Penn Parsed Corpora of Historical English (Kroch 2020)
  - Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2) (Kroch and Taylor 2000)
  - Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) (Kroch, Santorini, and Delfs 2004)
  - Penn Parsed Corpus of Modern British English, second edition (PPCMBE2) (Kroch, Santorini, and Diertani 2016)
- Parsed Corpus of Early English Correspondence, second edition (PCEEC2)

The guidelines were developed for English, but they have been used as a foundation for annotation guidelines for parsed corpora of various other Germanic and Romance languages. The general idea is that the present guidelines apply as a default except where overruled by language-particular considerations that are set out in a corpus-particular manual.

Suggestions for improvement may be sent to Beatrice Santorini (beatrice DOT santorini AT gmail DOT com).

Acknowledgments

Thanks are due to the following institutions and individuals for support and assistance:

The National Endowment for the Humanities for financial support under NEH Grant PA 23382-99.
The National Science Foundation for financial support under NSF Grant BCS 99-05488.
The National Science Foundation for financial support under NSF Grants BCS 05-08731 and BCS 11-47499.
The users of the Penn Parsed Corpora of Historical English for their ongoing financial support in purchasing the corpora.
Anthony Kroch and Ann Taylor for many helpful discussions concerning the original guidelines for the PPCME2 and their adaptation to later stages of English.

References

Kroch, Anthony. 2020. Penn Parsed Corpora of Historical English LDC2020T16. Web download. Philadelphia: Linguistic Data Consortium, https://catalog.ldc.upenn.edu/LDC2020T16.
Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Distributed as part of Kroch 2020. Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCEME-RELEASE-3.
Kroch, Anthony, Beatrice Santorini, and Ariel Diertani. 2016. Penn Parsed Corpus of Modern British English, second edition (PPCMBE2). Distributed as part of Kroch 2020. Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCMBE-RELEASE-1.
Kroch, Anthony, and Ann Taylor. 2000. Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2). Distributed as part of Kroch 2020. Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCME2-RELEASE-4.
Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 313–330. Reprinted in Susan Armstrong, ed., 1994, Using large corpora. Cambridge, MA: MIT Press. 273–290.
Parsed Corpus of Early English Correspondence, second edition, parsed version. 2022. Revised and corrected by Beatrice Santorini. Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. https://github.com/beatrice57/pceec2