The TALANA annotated corpus for French: some experimental results

We present the first linguistic results exploiting the newly available annotated corpus for French developed at Talana-Paris 7 (Abeille & al 00). The corpus comprises 1 million words annotated (with longitudinal human validation and double checks) with parts of speech, inflectional morphology, compounds and lemmas, and partially annotated with constituency. It is representative of contemporary normalized written French and covers a variety of authors and subjects, with extracts ranging from January 1990 to 1994. With compounds amalgamated and not counting punctuation marks, it comprises 870 000 tokens, 33 000 independent sentences and 17 000 different lemmas. The average number of words per sentence is 26 and the average number of phrases per sentence (after parsing the tagged corpus) is 20 (some phrases are unary).

After explaining how this corpus was built (i.e. tools used and hand-correction procedures), we will present the first linguistic results we obtained when examining the corpus, and explain why we think some of these results shed new light on some psycholinguistic hypotheses.

Here is a non-exhaustive list of the points we have focused on:

- We first checked the well-known preference for idiomatic interpretation. We took all the sequences that are potentially ambiguous between a compound and a non-compound reading (there are more than 1000 of them) and computed the number of occurrences of each reading. Examples of such pairs would be:

    EN FAIT : en fait:adv (in fact) OR en:clitic fait:v (make it)
    D'AILLEURS : d'ailleurs:adv OR d':prep + ailleurs:n (from elsewhere)

  We found an overwhelming proportion of uses as compounds (more than 93% on average, 100% in some cases).
  We verified that this is a lexical preference rather than an effect of overall frequency, because the total number of occurrences of compounds is much lower than that of non-compound words (10% vs 90%). More detailed results on this point are given in the annex below.

- We found a strong lexical preference for grammatical over lexical categories. We define grammatical categories as function words introducing major constituents (determiners, prepositions, clitic and relative pronouns, subordinating and coordinating conjunctions), whereas lexical ones are V, Adj, N, Adv and other pronouns. For each word form that is ambiguous between a functional and a lexical category (there are more than 500 of them in the corpus), we compared its frequency of occurrence as a function word with its frequency of occurrence as a lexical word. Examples of such pairs would be:

    CAR : car:conjunction (since) OR car:n (bus)
    OUTRE : outre:prep (in addition to) OR outre:n (drinking container)
    ENTRE : entre:prep (between) OR entre:v (enter)

  We found an overwhelming proportion of uses as grammatical categories (more than 95% on average, sometimes 100%). Again, we verified that this is a lexical preference, because the total number of occurrences of grammatical categories is not significantly higher than that of lexical categories in the corpus as a whole (53% vs 47%), with the most common category being N.

- We then tested a preference for grammatical valence (as an auxiliary, as a modal or as a raising verb) versus full valence (with all syntactic arguments receiving a semantic role) for potentially ambiguous verbs (avoir = auxiliary or possess, devoir = must or owe, etc.). Because valence is not yet marked in the corpus, this was done by manually checking a subset of the corpus for a subset of 50 verbs. Again there was a significantly higher proportion of grammatical valence (more than 70%), which again seems to be a lexical preference, since grammatical valence frames are no more frequent in the corpus as a whole than full valence frames.
- We also replicated for French the Keenan-Comrie hypothesis on the accessibility hierarchy (Keenan & Hawkins 87), i.e. the fact that non-canonical realizations (such as relativization or cliticization) are more difficult for less accessible functions and hence more rarely encountered in corpora. We will develop this point more extensively in the final version of our paper.

We are also currently exploring a slightly broader notion of function words (including some negative and degree adverbs, plus auxiliary verbs), for which these results also seem to hold.

This is the first time this type of hypothesis can be verified on a large scale for French.

------------
References :
------------

Abeille A., Clement L., Kinyon A. 2001. Building a Treebank for French. In Treebanks: building and using syntactically annotated corpora. Kluwer Academic Publishers.

Keenan, Hawkins. 1987. The psychological validity of the accessibility hierarchy. In Universal Grammar: 15 essays. Keenan (ed). Routledge, London.

------------
Annex : Occurrences of compounds in cases of ambiguity compound/non compound
------------

Sequence           simple word    compound word
il y a                 215          232 (52%)
plus de                193          299 (61%)
le plus                179          106 (37%)
le [mM]onde            150          325 (68%)
en fait                 19          113 (86%)
alors que               15          305 (95%)
ainsi que               24          124 (84%)
d'abord                  6          183 (97%)
d'ailleurs              17          137 (89%)
sans doute              10          198 (95%)
pomme de terre           0            1 (100% !) (The rarest compound !)
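
The percentages in the annex can be recomputed from its two count columns: for each sequence, the share of compound readings is the compound count divided by the sum of both counts. The following is a minimal Python sketch of that arithmetic, not the procedure actually used to produce the table. The counts are copied from the annex; the names ANNEX_COUNTS, compound_share and share_table are introduced here for illustration only.

```python
# Counts copied from the annex table:
# sequence -> (occurrences as simple words, occurrences as a compound).
ANNEX_COUNTS = {
    "il y a": (215, 232),
    "plus de": (193, 299),
    "le plus": (179, 106),
    "le [mM]onde": (150, 325),
    "en fait": (19, 113),
    "alors que": (15, 305),
    "ainsi que": (24, 124),
    "d'abord": (6, 183),
    "d'ailleurs": (17, 137),
    "sans doute": (10, 198),
    "pomme de terre": (0, 1),
}


def compound_share(simple: int, compound: int) -> float:
    """Return the percentage of occurrences analysed as a compound."""
    total = simple + compound
    if total == 0:
        raise ValueError("the sequence has no occurrences")
    return 100.0 * compound / total


def share_table(counts: dict) -> dict:
    """Map each sequence to its compound share, rounded to a whole percent."""
    return {
        seq: round(compound_share(simple, compound))
        for seq, (simple, compound) in counts.items()
    }


if __name__ == "__main__":
    for seq, pct in share_table(ANNEX_COUNTS).items():
        print(f"{seq:<16}{pct:>4}%")
```

The same function applies unchanged to the function-word versus lexical-word comparison, provided the two counts per word form are available.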