Data and Annotations for Sociolinguistics: A Corpus-Based Approach to Sociolinguistic Research

The Data and Annotations for Sociolinguistics (DASL) project investigates best practices in the use of digital speech corpora to address problems in sociolinguistic theory. The quantitative study of linguistic variation is necessarily based upon empirical observation and statistical description of linguistic behavior, so the collection and annotation of databases play a crucial role in quantitative sociolinguistics. The current state of computing technology encourages the collection, annotation, analysis and even summarization and presentation of linguistic behavior wholly within the digital domain. Digital data is easily shared, which in turn encourages a whole range of positive practices. However, the use of speech corpora in sociolinguistics also raises both theoretical and methodological questions. The goal of the DASL Project is to begin to address these issues through a case study: the analysis of a well-documented sociolinguistic variable as it appears (or does not) in several large, well-documented speech corpora.

This paper reports on the first phase of DASL: an investigation of the process of -t/d deletion in four large digital speech corpora spanning a range of speaking styles, from read speech to casual conversation between intimates. The corpora were collected for purposes other than sociolinguistic research but can be re-annotated to fit our needs. -t/d deletion is a well-understood, stable variable common to multiple varieties of English, and it shows similar patterns of stratification across the many diverse speech communities in which it has been studied. A team of non-specialist annotators, working under the direction of sociolinguists, identifies and codes tokens of potential deletion. The team approach allows for the evaluation of inter-annotator consistency in coding. The interface used to conduct the annotation allows linguists to interact with the corpora and the resulting annotations via the World Wide Web, so the project can expand to include multiple sites.

The structure of the DASL Project also encourages collaborative data development and analysis by providing sociolinguists with raw and annotated data, along with tools for browsing, searching, (re)annotating and distributing that data via the Internet. While we are contributing the four corpora annotated for -t/d deletion to DASL, other researchers will be encouraged to do the same via "data exchange": those who make their own contributions of data and annotations will have access to the entire pool of data. The ability to share digital data easily encourages collaboration via:

* the comparison of results across studies
* the use of stable data to benchmark new or competing models and methodologies
* the re-annotation and reuse of existing data for new purposes
* the measurement of inter-annotator consistency
* the reduction of impediments facing new participants in the research community

Sharing data does not, however, diminish the value of ongoing data collection. Both the researcher and the research community benefit from new contributions: the researcher gains new skills and a unique appreciation of the subject pool, while the research community gains not only a new data set but also new perspectives and new methodological approaches. The DASL Project hopes to encourage data sharing and the re-annotation and reuse of published data as an important complement to first-hand fieldwork.
The paper reports results from annotation of the first corpus, the TIMIT Acoustic-Phonetic Continuous Speech Corpus of read speech. This data set consists of over 600 speakers, each reading a set of 10 phonetically rich sentences selected from a larger pool. The corpus (along with the other corpora to be analyzed) has already been transcribed and segmented so that individual speaker turns can be retrieved separately. Before coding begins, we use a custom-designed sociolinguistic annotation interface to search the orthographic transcripts via a regular expression query, identifying potential tokens of interest. Other filters are then applied to reduce the list further, excluding words that erroneously look like candidates for deletion (e.g., would); a minimal sketch of such a filter appears at the end of this section. Using this approach, the 54,387-word TIMIT corpus was quickly reduced to a review list of 2,059 words, from which annotators identified 1,578 actual -t/d tokens.

Once the corpora have been concordanced, filtered and prepared for annotation, an interactive web-based display allows annotators to view each token, listen to the utterance, view the corresponding waveform, access demographic data and code linguistic factors. The annotator can simply click on a word to hear it spoken. Following each token, the interface displays the factors to be coded. Each factor is shown as a radio button, and coding a token entails clicking on the button corresponding to the relevant factor within each factor group. A comment field also appears after each token so the annotator can record notes. Results are easily exported to a spreadsheet or statistical analysis package.

Using this approach, annotators have completed coding the 1,578 tokens of potential -t/d deletion in the TIMIT corpus with respect to four factor groups: status of the dependent variable, morphological category, preceding segment and following segment. A VARBRUL analysis of the TIMIT data considers social factors (speaker age, sex, region, education) along with the linguistic factors. Additionally, 5% of the tokens in TIMIT have been re-coded by an independent annotator in order to establish a measure of inter-annotator consistency; sketches of an agreement measure and a regression-style analysis also appear at the end of this section.

In addition to the empirical study of -t/d deletion and the methodological questions concerning the use of published speech corpora in sociolinguistics, the paper addresses several other questions:

* How do the corpora used in this study relate to the data most commonly used in quantitative sociolinguistics, namely recordings of sociolinguistic interviews?
* Do the insights gained from the large-scale study of a geographically diffuse subject pool differ qualitatively from speech community studies?
* What is the rate of inter-annotator consistency for the task of coding -t/d deletion?
* Can studies of similar variables be organized on a large scale with teams of non-specialist annotators?
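The candidate-identification step described above can be approximated with a short script. The following is a minimal sketch, assuming a plain-text orthographic transcript, an invented exclusion list and a simple orthographic pattern; it is not the DASL interface's actual query.

```python
import re

# Hypothetical sketch: scan an orthographic transcript for words ending in
# -t or -d preceded by another consonant, then drop known "lookalike" words
# such as "would". The pattern and the exclusion list are illustrative
# assumptions, not DASL's actual filter.
CANDIDATE = re.compile(r"\b\w*[bcdfghjklmnpqrsvwxz][td]\b", re.IGNORECASE)
EXCLUDE = {"would", "could", "should"}  # assumed stop list

def find_candidates(transcript: str) -> list[str]:
    """Return potential -t/d-deletion tokens for annotator review."""
    return [w for w in CANDIDATE.findall(transcript) if w.lower() not in EXCLUDE]

if __name__ == "__main__":
    line = "She kept the old list and would just read it first"
    print(find_candidates(line))
    # -> ['kept', 'old', 'list', 'and', 'just', 'first']
```

In practice the surviving candidates would still be reviewed by annotators, since orthography alone cannot distinguish genuine final consonant clusters from lookalikes.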
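The paper does not specify which statistic is used to quantify consistency over the re-coded 5% of tokens; one common choice for two annotators is Cohen's kappa. The sketch below computes it from two invented coding sequences standing in for the doubly coded tokens.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented codes for illustration only.
coder1 = ["deleted", "retained", "retained", "deleted", "retained"]
coder2 = ["deleted", "retained", "deleted", "deleted", "retained"]
print(round(cohen_kappa(coder1, coder2), 3))  # ~0.615, agreement above chance
```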
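A VARBRUL analysis is, at its core, a logistic regression of a binary dependent variable on categorical factor groups. The sketch below fits an analogous model with statsmodels on randomly generated placeholder data; the factor names, levels and simulated deletion rates are assumptions, not the project's actual coding scheme or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data: in practice this table would be the annotation export,
# one row per -t/d token, with columns for the coded factor groups.
rng = np.random.default_rng(0)
n = 200
tokens = pd.DataFrame({
    "morph": rng.choice(["monomorpheme", "past_tense"], n),
    "following": rng.choice(["consonant", "vowel", "pause"], n),
    "sex": rng.choice(["female", "male"], n),
})
# Simulate more deletion before consonants, purely for illustration.
p = np.where(tokens["following"] == "consonant", 0.6, 0.3)
tokens["deleted"] = rng.binomial(1, p)

# Logistic regression over categorical factor groups, analogous in spirit
# to a VARBRUL run combining linguistic and social factors.
model = smf.logit("deleted ~ C(morph) + C(following) + C(sex)", data=tokens)
print(model.fit(disp=False).summary())
```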