Annotation manual for the Penn Parsed Corpora of Historical English
and the Parsed Corpus of Early English Correspondence 2
Beatrice Santorini
(January 2022)
This annotation manual revises the previous versions (2004, 2016). It is
heavily indebted to the original document developed
by Ann Taylor and Anthony Kroch for Middle English (Kroch and Taylor 2000)
as well as to the spirit of the guidelines for the Penn Treebank (Marcus,
Santorini, and Marcinkiewicz 1993).
The substance of the guidelines remains largely unchanged, but the
annotation scheme has been streamlined
(see Changes), and the differences between the
annotation of Middle English and (Early) Modern English have been reduced.
Certain differences remain, however, which
are occasioned by the syntactic differences between Middle English and
later stages of the language. Except where necessary, the examples in the
body of the manual are from (Early) Modern English.
This version of the manual also contains guidelines
concerning lemmatization,
which has been carried out for PPCEME and PPCMBE2 using the Oxford English
Dictionary (OED) as a lemma authority.
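The present guidelines are in force for the following corpora:
- Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2)
- Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
- Penn-Helsinki Parsed Corpus of Modern British English, second edition (PPCMBE2)
- Parsed Corpus of Early English Correspondence, second edition (PCEEC2)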
The guidelines were developed for English, but they have been used as
a foundation for annotation guidelines for parsed corpora of various other
Germanic and Romance languages. The general idea is that the present
guidelines apply as a default except where overruled by
language-particular considerations that are set out in a corpus-particular
manual.
Suggestions for improvement may be sent to Beatrice Santorini (beatrice
DOT santorini AT gmail DOT com).
Acknowledgments
Thanks are due to the following institutions and individuals for support
and assistance:
- The National Endowment for the Humanities for financial support under NEH Grant PA 23382-99.
- The National Science Foundation for financial support under NSF Grant BCS 99-05488.
- The National Science Foundation for financial support under NSF Grants BCS 05-08731 and BCS 11-47499.
- The users of the Penn Parsed Corpora of Historical English for their
ongoing financial support in purchasing the corpora.
- Anthony Kroch and Ann Taylor for many helpful discussions concerning
the original guidelines for the PPCME2 and their adaptation to later
stages of English.
References
- Kroch, Anthony. 2020. Penn Parsed Corpora of Historical English
LDC2020T16. Web download. Philadelphia: Linguistic Data Consortium,
https://catalog.ldc.upenn.edu/LDC2020T16.
- Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004.
Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME).
Distributed as part of Kroch 2020.
Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCEME-RELEASE-3.
- Kroch, Anthony, Beatrice Santorini, and Ariel Diertani. 2016.
Penn-Helsinki Parsed Corpus of Modern British English, second edition (PPCMBE2).
Distributed as part of Kroch 2020.
Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCMBE-RELEASE-1.
- Kroch, Anthony, and Ann Taylor. 2000.
Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2).
Distributed as part of Kroch 2020.
Individual website: http://www.ling.upenn.edu/ppche-release-2016/PPCME2-RELEASE-4.
- Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993.
Building a large annotated corpus of English: The Penn Treebank.
Computational linguistics 19, 313–330.
Reprinted in Susan Armstrong, ed., 1994, Using large corpora.
Cambridge, MA: MIT Press. 273–290.
- Parsed Corpus of Early English Correspondence, second edition, parsed
version. 2022. Revised and corrected by Beatrice Santorini.
Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk,
and Terttu Nevalainen. Compiled by the CEEC Project Team.
https://github.com/beatrice57/pceec2