LING0001     Lecture 21    Languages of the World


How many languages are spoken in the world today? The 1996 edition of Ethnologue listed 6,703 living languages, with their original locations divided geographically as follows:

Region          Living languages   Percentage
The Americas               1,000          15%
Africa                     2,011          30%
Europe                       225           3%
Asia                       2,165          32%
The Pacific                1,302          19%
TOTAL                      6,703
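As a quick check on the arithmetic in this table, here is a minimal Python sketch that recomputes the total and the rounded percentages from the five regional counts. The variable names are just for illustration.

```python
# Regional counts copied from the 1996 Ethnologue table above.
counts = {
    "The Americas": 1000,
    "Africa": 2011,
    "Europe": 225,
    "Asia": 2165,
    "The Pacific": 1302,
}

total = sum(counts.values())

# Share of each region, rounded to the nearest whole percent.
shares = {region: round(100 * n / total) for region, n in counts.items()}

for region, n in counts.items():
    print(f"{region:<13}{n:>6,}{shares[region]:>5}%")
print(f"{'TOTAL':<13}{total:>6,}")
```

Because each share is rounded separately, the percentages need not add up to exactly 100.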

The 2005 edition listed 6,912 living languages -- this is not because 209 languages have been created in the past 9 years, but because of a combination of a more complete inventory and some decisions about how many speech communities to distinguish as "languages".

In terms of number of speakers, the current edition shows a range from Mandarin Chinese, with 920 million native speakers in 12 countries, down to languages like Arikara, a Caddoan language of North Dakota, with 10 speakers as of 2007; or Coos (in Southern Oregon) whose last native speaker died in 1972. A graphical representation of this distribution of sizes can be seen in the figure below, which plots the number of languages with N or more speakers, for N from one to one billion.

As of the 1999 revision, Ethnologue assigns three-letter "language codes" to 6,783 "languages", for 6,059 of which an estimate of number of speakers is given. (In some cases, the languages without a speaker-count estimate are extinct; in other cases, there are no "mother-tongue speakers"; and in some cases, the number of speakers is unknown.)
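The quantity plotted in the figure, the number of languages with N or more speakers, is easy to compute once you have a list of speaker counts. Here is a minimal Python sketch. The list is a toy sample of eight figures quoted in these notes, not the Ethnologue data set, and the function name is made up for illustration.

```python
from bisect import bisect_left

# Toy sample: a few speaker counts quoted in these notes (Arikara, Oneida,
# Hawaiian, Navaho, Irish Gaelic, Yiddish, Italian, Mandarin Chinese).
# This is NOT the full Ethnologue data set.
speakers = [10, 250, 1_000, 150_000, 260_000, 3_000_000, 37_000_000, 920_000_000]

def languages_with_at_least(n, counts):
    """Return how many entries in `counts` are greater than or equal to n."""
    ordered = sorted(counts)
    # bisect_left gives the number of entries strictly below n.
    return len(ordered) - bisect_left(ordered, n)

# One row per power of ten, from one to one billion.
for exponent in range(10):
    n = 10 ** exponent
    print(f"{n:>13,}  {languages_with_at_least(n, speakers)}")
```

With the real data, plotting these pairs on logarithmic axes would give a curve of the kind described above.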

Here's a table with some of the data from this plot. I've included all the powers of 10, along with a few familiar languages to illustrate various parts of the range:

Number of speakers S
Number of languages with S or more speakers
1 (= Plains Miwok)
4 (= Pawnee)
10 (= Wichita)
18 (= Kiowa Apache)
250 (= Oneida)
305 (= Aleut)
854 (= Comanche)
1,000 (= Hawaiian)
1,721 (= Cheyenne)
5,264 (= Hopi)
6,213 (= Muskogee, including Creek and Seminole)
11,905 (= Cherokee)
20,355 (= Dakota)
150,000 (= Navaho)
260,000 (= Irish Gaelic)
600,000 (= Tetun, a lingua franca of East Timor)
2,000,000 (= Gheg Albanian)
3,000,000 (= Yiddish)
4,848,000 (= Paraguayan Guarani)
7,372,000 (= Haitian Creole French)
9,472,000 (= Somali)
17,000,000 (= Igbo)
21,000,000 (= Serbo-Croatian)
37,000,000 (= Italian)

Many of the 6,000-odd "living" languages cited in Ethnologue are endangered or nearly extinct. Those represented in the left half of the graph above, with 10,000 or fewer speakers, are especially vulnerable. Roughly half of the world's languages are moribund, in the sense that new generations of children are not being raised to speak them. Within a century, it is likely that the number of living languages will be cut at least in half, and may well be fewer than 1,000. Thus the current rate of extinction for languages is much greater than the rate of extinction for biological species. Most people believe that this loss of linguistic and cultural diversity is a bad thing. Language preservation is difficult, but there are some success stories. For languages that can't be saved, it is still possible to document them for scientific purposes and for the sake of future generations who might want to study or even revive them. For further discussion, see the Ethnologue page on Endangered Languages, or the web sites of the Endangered Languages Project, the Foundation for Endangered Languages, and the Institute for Endangered Languages.

Looking at the other end of the distribution, the "top 20" languages in terms of number of native speakers (again from the 1996 edition of Ethnologue) are:

Rank  Language                  Native speakers (in millions)
  1   Mandarin Chinese          885
  2   English                   322
  3   Spanish                   266
  4   Bengali                   189
  5   Hindi                     182
  6   Portuguese                170
  7   Russian                   170
  8   Japanese                  125
  9   German                     98
 10   Wu Chinese                 77
 11   Javanese                   76
 12   Korean                     75
 13   French                     72
 14   Vietnamese                 68
 15   Telugu                     66
 16   Yue Chinese (Cantonese)    66
 17   Marathi                    65
 18   Tamil                      63
 19   Turkish                    59
 20   Urdu                       57

All of these counts are subject to question. One may doubt the census figures and also (especially in multilingual cases) the decisions about who counts as a speaker of which language. For instance, the 1996 edition of Ethnologue cites 266 million native speakers of Spanish, with 352 million including second language speakers. The 1999 revisions increase the number of native speakers of Spanish to 332 million, moving Spanish past English into second place. This does not represent a 25% population increase in 3 years, nor even a 25% increase in available census data, but rather (apparently) a revision in who is counted as a Spanish speaker.

Another set of questions has to do with what counts as a language. For instance, you may be surprised to see Arabic -- certainly one of the world's major languages -- missing entirely from this list of the "top twenty." In fact, Arabic (in all its varieties) has 202 million speakers world-wide, and with this count would be #4 on the list above. However, Ethnologue considers the local colloquial varieties of Arabic to be separate languages, and the largest single colloquial is Egyptian, with 42.5 million speakers.

This is not unreasonable, since different Arabic colloquials are not mutually intelligible, or at least not entirely so. Algerian Colloquial Arabic (for instance) is roughly as different from Egyptian Colloquial Arabic as Portuguese is from Spanish. If we considered Portuguese and Spanish as a single language -- called "Iberian" or something like that -- then the combined language would have 436 million speakers (or 502 million by Ethnologue's 1999 definitions), far ahead of English in second place.

On the other side of the argument, educated people in all the Arabic-speaking countries can speak, read and understand "Modern Standard Arabic", which is also the language used in news broadcasts, newspapers, and so on. Thus an educated Egyptian in Algeria can read the paper, understand the TV news, and converse easily with an educated Algerian. In some sense, they remain part of the same linguistic community, in a way that speakers of Spanish and Portuguese may not.

To take another example, Hindi and Urdu are essentially the same language. For historical and political reasons, they have different writing systems, and some different strata of borrowed vocabulary, but ordinary speakers are likely to be able to understand one another quite well. Combining their counts would give us 182 + 57 = 239 million, roughly a 30% increase for Hindi, putting the Hindi/Urdu combination into fourth place -- though Hindi was already in fifth place.

For a third example, consider Turkish, which is listed with 59 million speakers (46 million in Turkey). Speakers of closely related languages that are mutually intelligible with Turkish (at least to some extent) include 13.9 million South Azerbaijani, 7 million North Azerbaijani, 5.4 million Turkmen, 500,000 Gagauz Turkish, 400,000 Khorasani, 300,000 Crimean Turkish, 200,000 Qashqa'i, and 55,000 Salar: about 87 million in total. This 47% increase in the count for Turkish would move it to 10th position from 19th position.
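The "lumping" arithmetic in the last few examples can be replayed mechanically. Here is a minimal Python sketch that merges entries in the 1996 top-20 list and recomputes the rank of the result. The helper names `lump` and `rank` and the labels for the merged entries are invented for illustration; the numbers are the ones quoted in these notes.

```python
# Native-speaker counts in millions, from the 1996 top-20 list.
top20 = {
    "Mandarin Chinese": 885, "English": 322, "Spanish": 266, "Bengali": 189,
    "Hindi": 182, "Portuguese": 170, "Russian": 170, "Japanese": 125,
    "German": 98, "Wu Chinese": 77, "Javanese": 76, "Korean": 75,
    "French": 72, "Vietnamese": 68, "Telugu": 66,
    "Yue Chinese (Cantonese)": 66, "Marathi": 65, "Tamil": 63,
    "Turkish": 59, "Urdu": 57,
}

def lump(table, new_name, members, extra=0.0):
    """Return a copy of `table` with `members` merged into one entry.

    `extra` adds speakers (in millions) of varieties not listed in `table`.
    """
    merged = {name: n for name, n in table.items() if name not in members}
    merged[new_name] = sum(table[m] for m in members) + extra
    return merged

def rank(table, name):
    """1-based position of `name` when entries are sorted largest first."""
    ordered = sorted(table, key=table.get, reverse=True)
    return ordered.index(name) + 1

# Arabic counted as one language with 202 million speakers.
with_arabic = dict(top20, Arabic=202)
print("Arabic:", rank(with_arabic, "Arabic"))

# Spanish and Portuguese counted as a single "Iberian" language.
iberian = lump(top20, "Iberian", ["Spanish", "Portuguese"])
print("Iberian:", iberian["Iberian"], rank(iberian, "Iberian"))

# Hindi and Urdu counted together.
hindi_urdu = lump(top20, "Hindi/Urdu", ["Hindi", "Urdu"])
print("Hindi/Urdu:", hindi_urdu["Hindi/Urdu"], rank(hindi_urdu, "Hindi/Urdu"))

# Turkish plus the related varieties listed above (millions of speakers).
turkic_extra = 13.9 + 7 + 5.4 + 0.5 + 0.4 + 0.3 + 0.2 + 0.055
turkic = lump(top20, "Turkish (lumped)", ["Turkish"], extra=turkic_extra)
print("Turkish:", round(turkic["Turkish (lumped)"]), rank(turkic, "Turkish (lumped)"))
```

Each call changes only one grouping decision, which makes the point of this section concrete: the ranking depends as much on where we draw language boundaries as on the census figures themselves.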

A recent and striking example is the change in Serbo-Croatian from 30 years ago to today. Once Serbo-Croatian was considered a single language, documented in single grammars, with single dictionaries. It was well known that there were two ways of writing it -- with Roman characters in Croatia, and with Cyrillic characters in Serbia -- and that there was a continuum of dialect variation from Serbia in the east to Croatia in the west -- but this did not change the obvious "fact" that it was a single language. Now there are three languages -- Serbian, Croatian, and Bosnian. This is not just an idle matter of nomenclature -- at least in Croatia and in Serbia, strenuous efforts are underway to purge the language of elements that are felt to reflect pollution from linguistic or cultural elements associated with the other end of the geographical, political and historical conflict.

Language families

All of these issues stem from the fact that languages are not a set of distinct and unrelated items, but rather a hierarchy (or tree structure). We can continue to split categories almost down to the level of the individual speaker, or we can lump categories together as long as we can find evidence for common historical origin. (In the lecture on language change, we'll learn something about why common historical origin is a useful and interesting basis of classification for languages).

In practice, the tendency is to categorize languages at a level of grouping that depends on several factors:

  • mutual intelligibility
  • speaker attitudes
  • existence of a nation-state

Each of these has its problems. For instance, intelligibility is not always a symmetric relationship -- sometimes speakers of A can understand speakers of B, but not vice versa -- and it is certainly not transitive -- the fact that languages/dialects A and B are mutually intelligible, and B and C are also mutually intelligible, does not imply that A and C are mutually intelligible.
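To make the points about symmetry and transitivity concrete, here is a minimal Python sketch that treats intelligibility as a set of directed pairs. The varieties A, B, C and D and the judgments about them are hypothetical placeholders, not real data, and the function names are invented for illustration.

```python
from itertools import product

# Hypothetical judgments: (x, y) means "speakers of x understand speakers of y".
# A, B, C and D are placeholder varieties, not real data.
understands = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("D", "A")}

def mutually_intelligible(x, y, relation):
    """True if understanding runs in both directions between x and y."""
    return (x, y) in relation and (y, x) in relation

def is_symmetric(relation):
    """True if every (x, y) in the relation is matched by (y, x)."""
    return all((y, x) in relation for x, y in relation)

def is_transitive(relation):
    """True if x->y and y->z always imply x->z (for distinct x and z)."""
    return all(
        (x, z) in relation
        for (x, y), (y2, z) in product(relation, repeat=2)
        if y == y2 and x != z
    )

print(mutually_intelligible("A", "B", understands))
print(mutually_intelligible("B", "C", understands))
print(mutually_intelligible("A", "C", understands))
print(is_symmetric(understands), is_transitive(understands))
```

In this toy relation, D understands A but not the reverse, so symmetry fails; A and B, and B and C, are mutually intelligible while A and C are not, so transitivity fails too. Grouping varieties into "languages" by intelligibility alone therefore has no unique answer.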

The last two factors are clearly ideological and political, rather than linguistic in nature. Their practical importance can be seen in the cynical dictum, originally due to Max Weinreich, that "a language is a dialect with an army and a navy."

In any case, this perspective dissolves any intellectual difficulties that we might have had in counting languages or counting speakers, by replacing the original (and naive) questions -- how many languages are there? how many speakers does each one have? -- with more sophisticated questions -- what is the structure of language families? how old are the splits? what sort of communication across the divisions has remained? and so on.

Let's apply this sort of reasoning to the comparison between Egyptian and Algerian colloquial Arabic, on one hand, and Spanish and Portuguese, on the other.

The current Arabic colloquials have all developed since the spread of Islam, which took place during the period between 632 and about 750 (in the Western way of reckoning). Egypt was conquered in 640, and the Maghreb (including present-day Algeria) between 670 and 700. Before then, no Arabic of any kind was spoken in either place; Egyptians spoke Coptic, Greek, Aramaic and so on, while residents of the area now called Algeria spoke mainly Berber languages, with a few speakers of Latin (since it had been a Roman colony) and of Germanic dialects (since it had been conquered by Goths and Vandals). Since the seventh century, Arabic has become the dominant language in each place (though about 14% of Algerians still have some kind of Berber as their first language). During the intervening 1,300 years, the kind of Arabic spoken in Egypt and the kind of Arabic spoken in Algeria have both changed (as all living languages do), but in different ways, to the point of losing mutual intelligibility.

With respect to Spanish and Portuguese, which we used as a point of comparison, their development as separate languages is closely related to the same set of historical events. As of 220 B.C., the southern part of the Iberian peninsula was a Carthaginian colony. Its people spoke various Celtic dialects, Basque and perhaps some of its now-extinct relatives, and Punic (a Semitic language related to Arabic and Hebrew). Rome conquered southern Iberia in 206 B.C., and the rest of the peninsula somewhat later. Roman colonists and administrators gradually imposed the Latin language. The varieties of Latin spoken in various parts of the Roman Empire changed over time, giving rise to the modern "Romance" languages -- Italian, Spanish, French, Romanian and so on.

The vernacular version of Latin -- "vulgar Latin" -- was diverging geographically and socially even before the period when the literary models of classical Latin were established by authors like Cicero and Virgil. However, the Iberian variants of vulgar Latin did not simply remain in place to develop peacefully over the centuries. The Iberian peninsula was conquered by the (Germanic) Visigoths and Vandals in the 5th century. Then the southern four fifths of the peninsula was conquered by Arabic-speaking Islamic forces between 700 and 750. The reconquest by Christians was not complete until the 15th century, though the region occupied by present-day Portugal was reconquered in the 12th century, and the political source of modern standard Spanish can be dated to Alfonso VI's kingdom of Leon and Castile in the 11th century.

The local variants of Latin that became Spanish and Portuguese developed in the frontier regions on the northern fringe of the Arab conquest, during the early middle ages. It is likely that some of the peoples involved had continued to speak Celtic dialects until they were converted to Christianity, which in some cases did not occur until after the Arab invasion. Certainly many of the people who became Castilians were originally Basque speakers. In any case, it is not until the 10th or 11th century that we begin to see the crystallization of what became Spanish and Portuguese, out of local variants of Vulgar Latin, with Basque and perhaps Celtic substrates, and a substantial influence from Arabic. The forms of speech that became Spanish and Portuguese were not separated and isolated, but were part of a larger continuum of Latin-derived dialects called Ibero-Romance, with a largely shared vocabulary, similar sound changes, and so on. It was not until considerably later that particular dialects, associated with royal courts and later with modern nation-states, were given a clearly separate identity and a separate line of formal development as Spanish and Portuguese. It is an accident of history that these particular forms -- as opposed to Aragonese, Leonese, or other local variants -- became national languages.

Thus to sum up the history, Spanish and Portuguese developed as separate Ibero-Romance dialects over roughly the same historical period -- the past 1,300 years -- that the variants of colloquial Arabic developed. Spanish and Portuguese became established as national languages and spread (via colonization) around the world. The colloquial variants of Arabic remain, to this day, unwritten vernacular forms, in a context where "Modern Standard Arabic" -- a modern approximation to the language of the Quran -- remains the medium of formal discourse. The result in each case is a sort of family tree, of which relevant portions are given below:

Ibero-Romance language family
    Castilian ("Spanish")
      . . .

Arabic language family
    . . .

From the point of view of internal linguistic description, the different varieties of Arabic are at least as different as the different varieties of Romance. From the point of view of linguistic attitudes, however, the situation is very different. Spanish speakers certainly feel themselves to be members of a different linguistic community from Portuguese speakers -- and speakers of Catalan or Galician feel equally separate, although these are minority languages of Spain rather than languages associated with their own nation-state. By contrast, speakers of different varieties of Arabic generally feel themselves to be members of the same linguistic community, tied together by their common language of formal discourse, which is essentially the language of the Quran.


Independent of language family associations, we can characterize languages on various dimensions where correlations among values recur.

Typological classification can be done at any level of linguistic description, but the commonest forms are phonological, morphological and syntactic.

Phonological typologies deal with issues such as

  • phoneme inventory
  • syllable structure
  • prosody

Thus we might ask how many distinct vowels a language has, and how they are arranged; what sorts of syllable-final consonants a language has, if any; whether the language has word-stress, and if so, whether its location is predictable from the structure of the word.

In general, the answers to these questions tend to follow predictable patterns. For example, if a language allows stop consonants to occur in syllable-final position -- as in the English word "dip" -- it will generally also allow nasal consonants in the same position -- as in English "dim". It is often helpful to think of these patterns in terms of a hierarchy of more or less "marked" (i.e. unusual or unexpected) configurations. Thus a (very) partial hierarchy of markedness for syllable structures would be

Consonant Vowel >> Consonant Vowel Nasal >> Consonant Vowel Stop

As a rule, a language that has more "marked" patterns also has less marked ones.
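The implication in this rule can be written as a small consistency check. Here is a minimal Python sketch. It assumes that the hierarchy above is read from least marked on the left to most marked on the right, and it uses the made-up abbreviations CV, CVN (nasal in final position) and CVS (stop in final position). Since the rule is a tendency rather than an exceptionless law, a failed check means "unexpected", not "impossible".

```python
# Syllable types from the partial hierarchy above, least marked first.
# CVN and CVS are ad hoc abbreviations for "Consonant Vowel Nasal" and
# "Consonant Vowel Stop"; the left-to-right reading is an assumption.
HIERARCHY = ["CV", "CVN", "CVS"]

def respects_hierarchy(inventory, hierarchy=HIERARCHY):
    """True if no syllable type is allowed while a less marked one is missing."""
    seen_gap = False
    for syllable_type in hierarchy:
        present = syllable_type in inventory
        if present and seen_gap:
            # A more marked type is allowed although a less marked one is not.
            return False
        if not present:
            seen_gap = True
    return True

# English-like inventory: "dip" (stop) and "dim" (nasal) are both possible.
print(respects_hierarchy({"CV", "CVN", "CVS"}))
# Final nasals but no final stops: consistent with the hierarchy.
print(respects_hierarchy({"CV", "CVN"}))
# Final stops but no final nasals: the unexpected pattern.
print(respects_hierarchy({"CV", "CVS"}))
```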

In the case of morphology, an old but still useful typological taxonomy refers to languages as isolating (words lack affixes, and grammatical relationships are mainly signaled by word order), inflecting (words are marked with affixes to indicate their grammatical function), agglutinative (words incorporate long sequences of affixal elements), and polysynthetic (whole sentences may be expressed as single words, with several stems and various functional elements expressing their relationship).

The most basic syntactic typology has to do with the normal order of subject, verb and object in simple sentences. English has the order S V O:

Kim opened the book. 
 S    V         O 

There are six possible orders of subject, verb and object, and at least a few of the world's languages exemplify each possibility. You can find out what they are by checking out what the World Atlas of Language Structures (WALS) has to say on this subject.
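The count of six is just the number of ways to arrange three items (3 x 2 x 1). Here is a minimal Python sketch that lists the orders and, purely for illustration, rearranges the pieces of the English example sentence in each one. The rearranged strings are not claims about any particular language.

```python
from itertools import permutations

# The three constituents of the example sentence above.
parts = {"S": "Kim", "V": "opened", "O": "the book"}

# All orderings of subject, verb and object.
orders = ["".join(p) for p in permutations("SVO")]

for order in orders:
    words = " ".join(parts[label] for label in order)
    print(f"{order}: {words}")
```

Real languages with these orders differ from English in many other ways as well, so the rearranged English strings only show the pattern.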






