Computational Linguistics

Computational linguistics is a field at the intersection of linguistics and computer science that applies methods from artificial intelligence and machine learning to problems involving language.

Computational linguistics is exceptionally well represented at Penn, in both the Department of Linguistics and the Department of Computer and Information Science. Weekly meetings such as "Clunch" (computational linguistics and lunch) and XTAG (for ongoing work in tree adjoining grammar), together with the Institute for Research in Cognitive Science, give students and faculty the opportunity to work together and exchange ideas on current research topics. Penn also benefits from its proximity to the Linguistic Data Consortium.

Faculty in computational linguistics often hold joint positions in Linguistics and Computer and Information Science. Aravind Joshi, the inventor of tree adjoining grammar (TAG), has been working with us since the 1950s and has done seminal work in a very wide variety of subfields.

Mitch Marcus developed the first computationally tractable parser that reflects the findings of syntactic theory. He also participated in creating the first hand-parsed corpus, the English Penn Treebank, which had a significant impact on the field of computational linguistics. The project has continued ever since, branching out to include a number of other languages (such as Chinese) within the past decade; the Treebank corpora have been used to train automatic taggers and parsers as well as in linguistic research.

Fernando Pereira (Research Director at Google, formerly Penn CIS Professor), whose earlier work highlighted the connections between parsing and deduction, is now a leading figure in the field of machine learning. These colleagues have devised and teach a full program of courses in computational linguistics, attended by students from both linguistics and computer science. Robin Clark, Anthony Kroch, Mark Liberman, and other colleagues also teach relevant courses, and the programs in linguistics and computer science have trained large numbers of graduate students with substantial expertise in both areas.

In addition to holding a secondary appointment in the Department of Computer and Information Science, Mark Liberman is director of the Linguistic Data Consortium. The LDC constructs online corpora of diverse types in many languages, maintains a digital archive of research papers in computational linguistics, and hosts a variety of seminars and conferences. Liberman has published extensively on the theoretical and practical underpinnings of the LDC's work, especially on the construction of corpora and of formal frameworks for linguistic annotation.

Charles Yang is interested in computational models of language acquisition and language change. Specifically, he studies the interaction between the representation of linguistic information and the mechanisms of language processing and learning, with a strong commitment to the empirical findings in the psychology of language.


Aravind Joshi
Mathematical and processing models of language

Mark Liberman
Phonetics, prosody, natural language processing, speech communication

Mitch Marcus
Natural language processing, corpus-based and statistical models for NLP

Charles Yang
Language acquisition, language change, computational linguistics, morphology, psycholinguistics


Spencer Caplan
Computational Linguistics, NLP, Psycholinguistics

Andrea Ceolin
Computational Linguistics, Historical Linguistics

Jordan Kodner
Computational Linguistics, Language Acquisition

Robert Wilder
Psycho/Neuro Linguistics, Theoretical Syntax, Computational Phonology