Linguistics 520: Lecture Notes 2


Topics for the first sequence of lectures:

1. What is sound? Be able to define and calculate frequency, period, amplitude, wavelength, power...
2. How does the human vocal tract create and shape sounds?
3. How does human speech encode words?
4. What else is in speech besides words?

The human vocal organs

Or again:

Basic sound production in the vocal tract: buzz, hiss and pop

There are three basic modes of sound production in the human vocal tract that play a role in speech: the buzz of vibrating vocal cords, the hiss of air in turbulent flow past a constriction, and the pop of a closure released.

Laryngeal buzz

The larynx is a rather complex little structure of cartilage, muscle and connective tissue, sitting on top of the trachea. It is what lies behind your "adam's apple." The original role of the larynx is to seal off the airway, in order to prevent aspiration of food or liquid, and also to permit the thorax to be pressurized to provide a more rigid framework for heavy lifting and pushing.

Part of the airway-sealing system in the larynx is a pair of muscular flaps, the vocal cords or vocal folds,  which can be brought together to form a seal, or moved apart to permit free motion of air in and out of the lungs. When any elastic seal is not quite strong enough to resist the pressurized air it restricts, the result is an erratic release of the pressure through the seal, creating a sound. Some homely examples are the Bronx cheer, where the leaky seal is provided by the lips; the belch, where the opening of the esophagus provides the leaky seal; or the rude noises made by grade school boys with their hands under their armpits.

The mechanism of this sound production is very simple and general: the air pressure forces an opening, through which air begins to flow; the flow of air generates a so-called Bernoulli force at right angles to the flow, which combines with the elasticity of the tissue to close the opening again; and then the cycle repeats, as air pressure again forces an opening. In many such sounds, the pattern of opening and closing is irregular, producing a belch-like sound without a clear pitch. However, if the circumstances are right, a regular oscillation can be set up, giving a periodic sound that we perceive as having a pitch. Many animals have developed their larynges so as to be able to produce particularly loud sounds, often with a clear pitch that they are able to vary for expressive purposes.

The hiss of turbulent flow

Another source of sound in the vocal tract -- for humans and for other animals -- is the hiss generated when a volume of air is forced through a passage that is too small to permit it to flow smoothly. The result is turbulence, a complex pattern of swirls and eddies at a wide range of spatial and temporal scales. We hear this turbulent flow as some sort of hiss.

In the vocal tract, turbulent flow can be created at many points of constrictions. For instance, the lower teeth can be pressed against the upper lip -- if air is forced past this constriction, it makes the sound associated with the letter (and IPA symbol) [f].

When this kind of turbulent flow is used in speech, phoneticians call it frication, and sounds that involve frication are called fricatives.

The pop of closure and release

When a constriction somewhere in the vocal tract is complete, so that air can't get past it as the speaker continues to breath out, pressure is built up behind the constriction. If the constriction is abruptly released, the sudden release of pressure creates a sort of a pop. When this kind of closure and release is used as a speech sound, phoneticians call it a stop (focusing on the closure) or a plosive (focusing on the release).

As with frication, a plosive constriction can be made anywhere along the vocal tract, from the lips to the larynx. However, it is difficult to make a firm enough seal in the pharyngeal region to make a stop, although a narrow fricative constriction in the pharynx is possible.

Articulation of consonants: Place and manner

Places of articulation: which active articulator is responsible, and what part of the upper vocal tract is involved -- bilabial, labiodental, dental, alveolar, retroflex, palato-alveolar, palatal, velar, etc.

Manners of articulation: complete closure of the vocal tract ("stop"), narrowing ("approximant"), narrowing close enough to create turbulent flow ("fricative"), closure on one side but not the other ("lateral"), closure with a fricated release ("affricate"), brief ballistic closure ("tap" or "flap"), etc.

Labial Places:

Bilabial place: made with upper and lower lip together (pie, buy, my).

Labiodental place: lower lip and upper front teeth (fie, vie).










Coronal Places:

Dental: tongue tip or blade and upper front teeth (the, thigh).

Interdental: the tip of the tongue protrudes between the upper and lower teeth.

Alveolar: tongue tip or blade and the alveolar ridge (tie, die, nigh, sign, zeal, lie).

Retroflex: tongue tip "curled back" against the back of the alveolar ridge

Palato-alveolar (post-alveolar): tongue blade and the back of the alveolar ridge (shy, she, show).

Dorsal Places:

Palatal: the front of the tongue and the hard palate (you). Palatal sounds are sometime classified as coronal.

Velar: back or "body" of the tongue and the soft palate (back, bag, bang). Velar place of articulation covers a relatively large fraction of the length of the vocal tract, and some languages distinguish between back velars and front velars.








Manners of articulation: stops

Stop: Complete closure of articulators, so no air escapes through the mouth.

Oral stop: In addition, the soft palate is raised so that the nasal tract is blocked off, and no air escapes through the nose. Air pressure builds up behind the closure, and may be released abruptly in a "burst".

Nasal stop: Soft palate is lowered, coupling the nasal cavity as a resonator and permitting air to espace through the nose.

Manners of articulation: fricatives, affricates, approximants, laterals, taps, trills

Fricative: Close approximation of two articulators, resulting in turbulent airflow between them.
Affricate: Oral stop followed by a fricative release.
Approximant: Close approximation of two articulators, without turbulent airflow. Includes "glides".
Lateral approximant: Closure in the middle or on the side(s) of the vocal tract, with free flow through the region left open.
Tap or flap: Tongue tip makes a quick ballistic tap against the alveolar ridge.
Trill: Rapid oscillation of tongue tip (or lips) based on tissue elasticity and aerodynamic forces.

Vowel color and nasality

Between the larynx and the world at large is about 14 to 15 centimeters of throat and mouth. This passageway acts as an acoustic resonator, enhancing some frequencies and attenuating others. The properties of this resonator depend on the position of the tongue and lips, and also on whether the velum (the "soft palate", essentially) is lowered so as to open a side passage to the nasal cavities. Some examples of shapes in a computer model of the human vocal tract, the corresponding resonance patterns, and the sounds that result when a laryngeal buzz in shaped by these resonances, can be found here.

Different positions of the tongue and lips make the difference between one vowel sound and another. As you can easily determine for yourself by experiment, you can combine any vowel sound with any pitch -- or with a whisper, which is a hiss created by turbulent flow at the vocal folds.

Classifying vowel articulation

One traditional way to classify vowel sounds is in terms of the position of the highest point of the tongue in two dimensions -- front to back, and high to low (or "close" to "open") -- along with a third dimension based on the shape of the lips -- rounded vs unrounded.

Coordination of articulators:

In fluent speech, the articulators are moving about as fast as they are physically capable of moving, while at the same time moving with considerable precision of location and coordination. We'll learn more later about this coordination and how it's controlled. Meanwhile, you can see (some aspects of) the coordination of the articulators in this old x-ray movie:

Laryngeal articulations:

Here's a video showing (some of) what happens in your larynx as you talk:

Here's a high-speed video that gives a better sense of how the vocal folds generate air-pressure variation at the time scale of voice pitch (about 60 to 600 oscillations per second):

And, if you're interested, here's a bit more about the anatomy:

Phonetic syllables: the scale and cycle of sonority

Human speech, like many animal vocalizations, tends to involve repetitive cycles of opening and closing the vocal tract. In human speech, we call these cycles syllables. A syllable typically begins with the vocal tract in a relatively closed position -- the syllable onset -- and procedes through a relatively open nucleus. The degree of vocal tract openness correlates with the loudness of the sound that can be made. Speech sounds differ on a scale of sonority, with vowels at one end (the most sonorous end!) and stop consonants at the other end. In between are fricatives, nasal consonants like [m] and [n], and so on. Languages tend to arrange their syllables so that the least sonorous sounds are restricted to the margins of the syllable -- the onset in the simplest case -- and the most sonorous sounds occur in the center of the syllable.

However, there are some cases where the same -- or at least very similar -- sounds can occur in several different syllabic roles. For example, the glides (sometimes called approximants) that begin syllables like "you" and "we" are almost exactly like vowels, except for their syllabic position. In fact, the mouth position and acoustic content of the "consonant" at the start of "you" and of the "vowel" at the end of "we" are just about exactly the same.

In the International Phonetic Alphabet (IPA), the English word "you" (in standard pronunciations) would be written something like [ju], where the [j] refers to the sound we usually write as "y", and the [u] refers to the vowel as in "boo" or "pool". The English word "we" would be written in the IPA as [wi], where the [w] is familiar, and the [i] refers to the vowel found in "see" or "eat".

In fact, the articulation and sound of IPA [j] is quite a lot like the articulation and sound of IPA [i], while the articulation and sound of IPA [w] is quite like that of IPA [u]. What is different is the role in the syllabic cycle -- [j] and [w] are consonants, while [i] and [u] are vowels.

This means that the English words "you" and "we" are something like a phonetic palindrome -- though "you" played backwards sounds more like "oowee" than "we". More important, this underlines that point that phonetics is the study of speech sounds, not just the study of vocal noises.


The International Phonetic Alphabet


In the mid-19th century, Melville Bell invented a writing system that he called "Visible Speech." Bell was a teacher of the deaf, and he intended his writing system to be a teaching and learning tool for helping deaf students learn spoken language.  However, Visible Speech was more than a pedagogical tool for deaf education -- it was the first system for notating the sounds of speech independent of the choice of particular language or dialect. This was an extremely important step -- without this step, it is nearly impossible to study the sound systems of human languages in any sort of general way.

In the 1860's, Melville Bell's three sons -- Melville, Edward and Alexander -- went on a lecture tour of Scotland, demonstrating the Visible Speech system to appreciative audiences. In their show, one of the brothers would leave the auditorium, while the others brought volunteers from the audience to perform interesting bits of speech -- words or phrases in a foreign language, or in some non-standard dialect of English. These performances would be notated in Visible Speech on a blackboard on stage.

When the absent brother returned, he would imitate the sounds produced by the volunteers from the audience, solely by reading the Visible Speech notations on the blackboard. In those days before the phonograph, radio or television, this was interesting enough that the Scots were apparently happy to pay money to see it!

[There are some interesting connections between the "visible speech" alphabet and the later career of one of the three performers,  Alexander Graham Bell, who began following in his father's footsteps as a teacher of the dear, but then went on to invent the telephone. For example, look at the discussion of Bell's "Ear Phonautograph" and artificial vocal tract.]

Phonetic notation for elocution lessons -- and for linguistic description

After Melville Bell's invention, notations like Visible Speech were widely used in teaching students (from the provinces or from foreign countries) how to speak with a standard accent. This was one of the key goals of early phoneticians like Henry Sweet (said to have been the model for Henry Higgins, who teaches Eliza Doolittle to speak "properly" in Shaw's Pygmalion and its musical adaptation My Fair Lady).

The International Phonetic Association (IPA) was founded in 1886 in Paris, and has been ever since the official keeper of the Inernational Phonetic Alphabet (also IPA), the modern equivalent of Bell's Visible Speech. Although the IPA's emphasis has shifted in a more descriptive direction, there remains a lively tradition in Great Britain of teaching standard pronunciation using explicit training in the IPA. 

The IPA and the dimensions of speech production

If you look at the IPA's table of "pulmonic" consonants (roughly, those made while exhaling normally), you will see that it is organized along two main dimensions. 

The columns are labelled by positions of constriction, moving from the lips (bilabial) past the teeth (dental) and the hard palate (palatal) and soft palate (velar) to the larynx (glottal). The rows are labelled by the type of manner of constriction: plosive,nasalfricative, and so forth. The side-by-side pairs of plosives and fricatives are differentiated by whether layrngeal buzz is present during the constriction. You can feel the difference yourself if you put your finger on your adam's apple while saying an  extended [s] or [z].

The IPA vowel quadrilateral is similarly organized along articulatory dimension of high vs. low, front vs. back, rounded vs. unrounded:

Thus the dimensions along which the IPA is organized are basically the physical and functional dimensions of the human vocal tract, as shown in the diagrams earlier on this page. The same was true of Bell's Visible Speech. We'll survey the IPA in more detail in a later lecture and lab.

Apparent design features of human spoken language

We can list a few characteristics of human spoken languages:

1. Large vocabulary: 10,000-100,000 items
2. Open vocabulary: new items are added easily
3. Variation in space and time: different languages and "local accents"
4. Messages are typically structured sequences of vocabulary items

Compare what is known about the "referential" part of the vocal signaling system of other primates:

1. Small vocabulary: ~10 items
2. Closed vocabulary: new "names" or similar items are not added
3. System is fixed across space and time: widely separated populations use the same signals
4. Messages are usually single items, perhaps with repetition

Some general characteristics of other primate vocalizations that are retained by human speech:

1. Vocalizations communicate individual identity
2. Vocalizations communicate attitude and emotional state

Some potential advantages of the human innovations:

1. Easy naming of new people, groups, places, etc.
2. Signs for arbitrarily large inventory of abstract concepts
3. Language learning is a large investment in social identity

How can it work?

Experiments on vocabulary sizes at different ages suggest that children must learn an average of more than 10 items per day, day in and day out, over long periods of time.

A sample calculation:

40,000 items learned in 10 years

10 x 365 = 3,650

40,000 / 3,650 = 10.96

Most of this learning is without explicit instruction, just from hearing the words used in meaningful contexts. Usually, a word is learned after hearing only a handful of examples. Experiments have shown that young children can often learn a word (and retain it for at least a year) from hearing just one casual use.

Let's put aside the question of how to figure out the meaning of a new word, and focus on how to learn its sound.

You only get to hear the word a few times -- maybe only once. You have to cope with many sources of variation in pronunciation: individual, social and geographical, attitudinal and emotional. Any particular performance of a word simultaneously expresses the word, the identity of the speaker, the speaker's attitude and emotional state, the influence of the performance of adjacent words, and the structure of the message containing the word.  Yet you have tease these factors apart so as to register the sound of the word in a way that will let you produce it yourself, and understand it as spoken by anyone else, in any style or state of mind or context of use.

In subsequent use, you (and those who listen to you speak) need to distinguish this one word accurately from tens of thousands of others.

(The perceptual error rate for spoken word identification is about one percent, where words are chosen at random and spoken by arbitrary and previously-unknown speakers. In more normal and natural contexts, performance is better).

Let's call this the pronunciation learning problem. If every word were an arbitrary pattern of sound, this problem would probably be impossible to solve.

What makes it work? 

The Phonological Principle

In human spoken languages, the sound of a word is not defined directly (in terms of mouth gestures and noises). Instead, it is mediated by encoding in terms of a discrete phonological system:

A word's pronunciation is defined as a structured combination of a small set of elements. The available phonological elements and structures are the same for all words (though each word uses only some of them).

This phonological system is defined in terms of patterns of mouth gestures and noises -- this "grounding" of the system is called phonetic interpretation, and the principles of phonetic interpretation are the same for all words.

How does the phonological principle help solve the pronunciation learning problem? Basically, by splitting it into two problems, each one easier to solve:

1. How to "spell" words in terms of a phonological system
2. How to relate the phonological system to sound: "phonetic interpretation"

Because phonetic interpretation is general, i.e. independent of word identity, it follows that every performance of every word by every member of the speech community helps teach phonetic interpretation, because it applies to the phonological system as a whole, rather than to any particular word.

And because phonological representations are digital, i.e. made up of discrete elements in discrete structural relations, it follows that copying can be exact: members of a speech community can share identical phonological representations; and within the performance of a given word on a particular occasion, the (small) amount of information relevant to the identity of the word is clearly defined.

A useful short-hand distinction: phonetics deals with signals -- continuous functions of time -- while phonological deals with symbols -- structured sequences of combinations of features with a small number of discrete values.

"Exemplar theory"

There are many gradient aspects of speech production that are not determined by the phonological representation of words: speaking rate; precision of articulation; formality; vocal effort; emotional and attitudinal loading; intonational patterns and pitch range; regional, ethnic and class-related variation; etc.

Some of these extra-phonological dimensions have predictable lexical associations: common or recent words tend to be faster and less precisely articulated; rare or unexpected words are slower and more precise; it's common to contrastively emphasize aspects of a word that are likely to be confused in the context of utterance; words that are associated with particular ethnic groups may regularly exhibit more of the speech patterns of the associated group; and so on.

These extra-phonological lexical associations were traditionally viewed as marginal phenomena, but recently, a range of ideas known collectively as "exemplar theory" has focused attention on them.

One extreme version: every instance of every word that you've ever heard or spoken is stored as an eidetic memory, and every time you hear or say a word, you interpret it or create by reference to the properties of this extensive set of memories.

Less extreme versions may associate each word with some sort of continuous distribution over phonetic dimensions, perhaps related in some way to the expected phonetic value of its phonological representation. Among psychologists, these would generally be called "feature frequency" theories rather than "exemplar" theories.

There's some evidence that lexical representations become more abstract as primary language learning proceeds, with the process continuing much later in life than previously thought.

Most of the information content in speech production and perception is not related to word identity -- we can see that by considering that the information-theoretic perplexity of words in conversational speech is on the order of 100-1000, i.e. about 7 to 10 bits per word, or 14-20 bits per second, while our best attempts at reasonable-fidelity audio coding for speech are more like 16,000 bits per second. But this extra-phonological information is almost entirely also extra-lexical, in my estimation -- and thus the traditional view was substantially correct. The extra-phonological aspects of word pronunciation are interesting, but marginal in comparison to the phonetic interpretation of the phonological system.

There's one apparent exception that deserves further scrutiny. It's clear that common word sequences may have "pre-compiled" phonological and phonetic properties that are not found in different words with a similar sequence of phonological entities and structures.

Consider the pronunciation of "going to = gonna", "want to = wanna", etc., and compare "showing to", "sent to", etc. These phenomena can be treated purely phonetically -- in terms of reduction, coarticulation, etc., in the phonetic realization of the associated symbolic phonological representation -- or purely phonologically -- in terms of deletions and substitutions in the (symbolic) phonological features and structures. It seems clear that there can be a historical sequence that starts with changes in signals and ends with changes in signals, with a long period in between in which both treatments are simultaneously relevant.