Adding phonetics to speech-community dynamics

[A problem is posed at the end of the discussion.]

1. Reciprocal learning of continuous random variables

Instead of using a linear-learning model to learn the probability of a set of discrete outcomes, suppose we use a similar approach to learning (some parametric representation of) a continuous probability distribution. In one simple case, we could try to learn the mean value:

% Learner's belief starts at 1.0; the "teacher" outputs are drawn from N(0,1)
AM1 = 0; ASD1 = 1; gamma = 0.05;
Ntrials = 300; Ameans = zeros(Ntrials,1); Ameans(1) = 1.0;
Outputs = random('normal', AM1, ASD1, [Ntrials 1]);
for n = 2:Ntrials
    % linear update: mix the old belief with the new input
    Ameans(n) = (1-gamma)*Ameans(n-1) + gamma*Outputs(n);
end
plot(1:Ntrials, Ameans, '-r', 1:Ntrials, 0*(1:Ntrials), ':b')

As before, we can use the learning-rate parameter to adjust the speed of learning and the stability of the result. And we could smooth the results by updating on the basis of more than one input (and/or more than one previous belief state).
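For concreteness, here is a sketch of the block-averaged variant in Python/NumPy (Python rather than Matlab purely for illustration; the block size K and all numerical settings are illustrative choices, not values from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, K, nsteps = 0.05, 5, 300

est = 1.0            # initial belief about the mean, as in the Matlab example
history = [est]
for _ in range(nsteps - 1):
    block = rng.normal(0.0, 1.0, K)                  # K "teacher" outputs from N(0,1)
    est = (1 - gamma) * est + gamma * block.mean()   # update toward the block average
    history.append(est)

# averaging K inputs per step reduces the variance of the stationary estimate
```

With K = 1 this reduces to the single-input update above; larger K smooths the trajectory without slowing convergence of the mean.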

We could use a similar approach to estimate the variance of a Gaussian probability distribution, or the various parameters of other pdfs.
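The same linear-update trick applied to a variance estimate might look like this Python sketch (the teacher's N(3, 4) distribution and the learning rate are illustrative; the squared-deviation update mirrors the one used for Avars in the two-learner Matlab code below):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, nsteps = 0.05, 2000

mean_est, var_est = 0.0, 1.0
for _ in range(nsteps):
    x = rng.normal(3.0, 2.0)   # "teacher" outputs from N(3, 4)
    # squared deviation is taken from the previous mean, as in the Matlab code
    var_est = (1 - gamma) * var_est + gamma * (x - mean_est) ** 2
    mean_est = (1 - gamma) * mean_est + gamma * x

# mean_est drifts toward 3, var_est toward roughly 4
```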

Suppose that we have two such learners, learning mean and variance from one another's outputs:

AM1 = 0; ASD1 = 1; BM1 = 1; BSD1 = 1;
gamma = 0.01; NSD = 6;   % reject inputs more than NSD standard deviations away
Ntrials = 4000;
Ameans = zeros(Ntrials,1); Avars = zeros(Ntrials,1);
Bmeans = zeros(Ntrials,1); Bvars = zeros(Ntrials,1);
Ameans(1) = AM1; Avars(1) = ASD1^2;   % Avars holds a variance, so square the initial SD
Bmeans(1) = BM1; Bvars(1) = BSD1^2;
for n = 2:Ntrials
    AOutput = random('normal', Ameans(n-1), sqrt(Avars(n-1)), 1);
    BOutput = random('normal', Bmeans(n-1), sqrt(Bvars(n-1)), 1);
    % A updates mean and variance from B's output, unless it is an extreme outlier
    if( abs(BOutput-Ameans(n-1)) <= NSD*sqrt(Avars(n-1)) )
       Ameans(n) = (1-gamma)*Ameans(n-1) + gamma*BOutput;
       Avars(n) = (1-gamma)*Avars(n-1) + gamma*((BOutput-Ameans(n-1))^2);
    else
        Ameans(n) = Ameans(n-1);
        Avars(n) = Avars(n-1);
    end
    % B updates from A's output in the same way
    if( abs(AOutput-Bmeans(n-1)) <= NSD*sqrt(Bvars(n-1)) )
       Bmeans(n) = (1-gamma)*Bmeans(n-1) + gamma*AOutput;
       Bvars(n) = (1-gamma)*Bvars(n-1) + gamma*((AOutput-Bmeans(n-1))^2);
    else
        Bmeans(n) = Bmeans(n-1);
        Bvars(n) = Bvars(n-1);
    end
end
plot(1:Ntrials, Ameans, '-r', 1:Ntrials, Avars, ':b');
hold on
plot(1:Ntrials, Bmeans, '-g', 1:Ntrials, Bvars, ':k');
hold off

The two learners quickly harmonize their mean and variance estimates.

But note that the subsequent history is not at all like what happens with reciprocal learning of discrete random variables, where the community (whether two or more) is trapped by one of the attractors at the corners of the hypercube [0|1 0|1 ... 0|1], and then stays there forever. Instead, here the two reciprocal learners wander around randomly together in the continuous parameter space.

The same general thing is true of larger communities of reciprocal continuous-pdf learners. The community as a whole takes a random walk in (a relatively compact region of) the parameter space, with the obvious caveat that the spread of parameter values tends to increase with increasing community size.

With gamma of .01 and 10 "people":

gamma = 0.01; Ntrials = 5000; Npeople = 10;
Ameans = zeros(Ntrials, Npeople);
% random initial beliefs, uniform on [0, 10]
for n = 1:Npeople
    Ameans(1,n) = 10*rand(1);
end
for n = 2:Ntrials
    for p = 1:Npeople
        % ring topology: person p's output is heard by person p+1 (mod Npeople)
        j = mod(p, Npeople) + 1;
        Output = random('normal', Ameans(n-1,p), 1, 1);
        Ameans(n,j) = (1-gamma)*Ameans(n-1,j) + gamma*Output;
    end
end
plot(Ameans)

[Figures omitted: analogous plots for gamma = 0.01 with 20 and 40 "people", and for gamma = 0.1 with 10, 20, and 40 "people".]

2. Splitting and merging, strengthening and weakening

Given the possibility of multiple "features" with continuous pdfs of this kind, we can allow that (given the right kind of evidence) we can create a new feature rather than adapting the parameters of an old one. (This is a bit like the choice, in on-line clustering algorithms, between adding a new cluster and modifying an old one.)

We could also merge two features if they become similar enough, or if there is inadequate evidence for their distinctness. Features that don't get used can also just gradually fade away, rather than being merged or absorbed into similar features.
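A minimal Python sketch of this create/adapt/merge logic; the thresholds (new_thresh, merge_thresh) and the equal-weight merge rule are illustrative assumptions, not anything specified above:

```python
import numpy as np

new_thresh = 3.0    # assumed: create a new feature if the input is > 3 SDs from every existing one
merge_thresh = 0.5  # assumed: merge two features whose means are closer than this
gamma = 0.05

features = [{"mean": 0.0, "var": 1.0}]   # start with one feature

def observe(x):
    # find the nearest existing feature, in z-score terms
    zs = [abs(x - f["mean"]) / np.sqrt(f["var"]) for f in features]
    i = int(np.argmin(zs))
    if zs[i] > new_thresh:
        features.append({"mean": x, "var": 1.0})          # create a new feature
    else:
        f = features[i]
        f["var"] = (1 - gamma) * f["var"] + gamma * (x - f["mean"]) ** 2
        f["mean"] = (1 - gamma) * f["mean"] + gamma * x   # adapt the old one
    # merge any adjacent pair of features that have drifted together
    features.sort(key=lambda f: f["mean"])
    j = 0
    while j < len(features) - 1:
        a, b = features[j], features[j + 1]
        if abs(a["mean"] - b["mean"]) < merge_thresh:
            a["mean"] = (a["mean"] + b["mean"]) / 2
            a["var"] = (a["var"] + b["var"]) / 2
            del features[j + 1]
        else:
            j += 1

observe(0.2)      # near the existing feature: adapts it
observe(10.0)     # far from everything: creates a second feature
```

Fading of unused features could be added by decaying a per-feature usage count and dropping features whose count falls below some floor.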

However, none of this is going to happen in (the simplest version of) what comes next.

3. Modeling phonology and phonetics together?

In an earlier discussion, we considered the idea that an individual's belief about the pronunciation of a word could be modeled, not as a phonological string, but rather as a probability distribution over phonological strings. We observed that a community of reciprocal learners, updating such beliefs as a linear combination of their prior beliefs and their recent experience of one another's pronunciation, will converge on a shared "lexicon" in which all of the probability mass is concentrated in one of the corners of the hypercube [0|1 0|1 ... 0|1] -- something that looks like a conventional pronouncing dictionary.

But people don't communicate by exchanging strings of discrete symbols (at least they didn't before texting was invented). Instead, they transmit these symbol strings by using continuous sound-sequences that are variably related to them.

A simple way to model this process would be to combine the HW5a idea of "word pronunciation" -- as a distribution over discrete symbol values -- with a second step in which a "speaker" encodes or implements a given symbol or symbol-sequence as a set of continuous values drawn from a parametrically-defined probability density function.

Then a "listener" has to use statistical pattern recognition to decide which "symbol" was intended. To be concrete and simple, this might just involve calculating the Mahalanobis distance to each possible symbol (or just the z-score in a uni-dimensional case), and picking the closest one.

On a given time-step t, once the best-matching pseudo-phonemic representation s_t is chosen as the apparent pronunciation of word w_t, then the listener's learning steps are just those we've already seen -- adapting the phonetic parameters of s_t on the basis of the current experience (in the simplest case, just revising the mean values to be a linear combination of the old means and the current values), and adapting w_t's dictionary entry (to make w_t's pronunciation as s_t more probable).

Problem: Implement the simplest possible version of this idea.

We have a "speech community" consisting of two "people" A1 and A2. There is just one "word" w, two "phonemes" ph1 and ph2, and one "phonetic dimension".

Each individual's "lexicon" is then defined by

1) A probability P(ph1) that w will be pronounced as ph1. Initialize this as a random number between 0 and 1, i.e. rand(1) in Matlab. P(ph2) is just 1-P(ph1).

2) Mean values for ph1 and ph2 on the single phonetic dimension. Initialize these as random numbers between 0 and 10 (i.e. 10*rand(1) in Matlab).

On each "turn", A1 and A2 each "pronounce" w. Each then "perceives" the other's pronunciation and updates her lexicon appropriately.

To "pronounce" w, you flip an (unfair) coin to decide whether to say ph1 or ph2, using your current value for P(ph1). Then you generate a random phonetic value using the chosen phoneme's current mean and standard deviation (assume for simplicity that the SD stays constant at 1.0).

To "perceive" the result, you determine whether the number is closer to your current mean value for ph1 or for ph2. Whichever it is, update your belief about the mean phonetic value for that "phoneme"; and also update your belief about the probability of pronouncing w using that "phoneme".

What happens?