Freshman Seminar: Big Data in Linguistics


Instructor: Mark Liberman

What we'll do

In this seminar, we'll examine the nature of speech, language, and communication, and the development of techniques for automatic analysis of (and interaction with) the streams and archives of digital text, speech, video, and sensor data that are an increasing part of our world.

We'll use a case study method, in which we take up a series of specific problems and use them to illustrate and examine the wide range of methods now in use or under development, and the concepts that underlie them.

There are no course texts as such -- all readings will be available online, linked into the schedule.

How we'll do it

As we work through each specific case, we'll read broad survey articles and also specific technical works. No particular mathematical or computational background will be assumed, although this means that you may sometimes need to accept a plain-language paraphrase of some jargon or a few equations. Everyone will learn about some basic concepts and techniques in linguistics, information theory, machine learning, and so on; individual students will be encouraged to delve more deeply into the areas that interest them.

In most cases, we'll start with a simple problem that you can solve (or imagine solving) "by hand", and then we'll learn how (and why) to evaluate the solution, how to automate the process, and how to generalize the solution.

We'll also discuss the social, political, and ethical implications of the developments that we learn about.

What we'll cover

Terminology for text-related techniques includes information retrieval, information extraction and "text analytics", machine translation, word-sense disambiguation, authorship identification, stylistics, sentiment analysis and "opinion mining", topic classification, topic detection and tracking, "e-discovery", question answering, summarization, spell checking, parsing, grammar checking, text generation, and many others. Speech-based techniques include speech activity detection, speaker recognition and verification, "diarization", language recognition, speech to text, text to speech, speech recognition, emotion detection, disfluency detection, and so forth. And then there are techniques for video processing, social network analysis, sensor fusion, and on and on.

These techniques can be applied in medicine, law, business, politics, sociology, history, psychology, literary analysis, and life in general.

Any one of these techniques or applications could serve as the focus for a whole course -- or a life's work. Although we'll touch on many if not all of them, how deeply we go into each one will depend on the interests and abilities of seminar participants. However, you should emerge from the seminar knowing what to do in order to learn more about any of these areas, and confident in your ability to do it.

What about assignments and grades?

There will be reading assignments every week, and about six research/writing assignments, culminating in a term project. Your grade will be 40% class participation (so you really should plan to do the reading, and to come to class most of the time!), 30% writing assignments, and 30% term project. There is no "curve" -- if everyone does excellent work, everyone will get an excellent grade.