Simple models of speech community dynamics

[There are three problems interspersed among these notes.]

1. Probability matching by linear learning

At least since Bush & Mosteller 1951, it's been understood that a "linear learner", which updates a probabilistic estimate as a linear function of its past value and recent experience, will approximate an environmental probability:

Nturns=500; gamma=0.01;       % number of turns and learning constant
A=zeros(Nturns,1);            % the learner's estimate over time
Prob = 0.6;                   % the environmental probability to be learned
A(1) = 0.0;
for n=2:Nturns
  Output = rand(1)<Prob;      % a random binary observation
  A(n) = (1.0-gamma)*A(n-1) + gamma*Output;   % linear update
end
plot(1:Nturns,A,'r',[1 Nturns], [Prob Prob],'b:')

If the learning constant incorporates a larger fraction of recent experience relative to past belief, the learner will be faster but also more unstable. With gamma=0.1 in the previous code fragment, the estimate approaches the target probability more quickly, but fluctuates around it more widely.

Such models are a good qualitative and even quantitative fit to many results in probability learning, foraging theory, and so on.

2. Reciprocal probability learning

If instead of learning from an environmental random variable, a learner adapts to the outputs of another learner, who is in turn adapting to the outputs of the first learner, the result is convergence of both to probability 1 or probability 0:

Nturns=500; gamma=0.1;
A=zeros(Nturns,1); B=zeros(Nturns,1);     % the two learners' estimates
A(1) = 1.0; B(1) = 0.0;                   % they start with opposite beliefs
for n=2:Nturns
  Aoutput = rand(1)<A(n-1); Boutput = rand(1)<B(n-1);   % each produces a binary output
  B(n) = (1.0-gamma)*B(n-1) + gamma*Aoutput;   % B learns from A's output
  A(n) = (1.0-gamma)*A(n-1) + gamma*Boutput;   % A learns from B's output
end
plot(1:Nturns,A,'r',1:Nturns,B,'b')
 

Whether the joint outcome is 0 or 1 depends on the random choices made along the way.

But the positive feedback created by reciprocal linear learning guarantees that sooner or later, both learners will be trapped by one or the other categorical outcome.

3. Probability learning in a network

If we replace the reciprocal learners by a community of learners, accommodating to one another's behavior, the general result is the same -- the whole community converges on 1 or 0. The speed of convergence may now depend not only on the learning constant ("gamma" in our code fragments) but also on who learns from whom -- the "social network" of learning.

Problem (1): Generalize the reciprocal-learning code given above to a network of learners. One general way to do this would be to set up an Npeople by Nturns "belief matrix" B such that B(i,t) is the "belief" (i.e. estimated probability) of person i at time-step t; and an Npeople by Npeople "communication matrix" C such that person i's output is used as learning input by all people j such that C(i,j) is 1. Try this when C is used to represent a circle, in which everyone learns from their next-door neighbor:

C = circshift(eye(Npeople),[0 1]);

And try it when C is used to represent a public meeting, where each person's output is everyone else's input:

C = 1-eye(Npeople);

How does the typical rate of convergence change with community size for these two types of network?

[You can estimate convergence rate simply by re-running the program a few times and noting by hand how long convergence takes; or you could set up an outer loop to check this automatically for a suitable number of runs.]
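
As a starting point for Problem (1), here is one possible skeleton, following the belief matrix B and communication matrix C described above; it's a sketch only, with an arbitrary community size, learning constant, and random initial beliefs, and with the convergence check left out:

% One possible skeleton for the network version -- a sketch, not the only answer.
% B(i,t) is person i's estimated probability at time-step t;
% C(i,j)==1 means person j learns from person i's output.
Npeople = 20; Nturns = 500; gamma = 0.1;
B = zeros(Npeople,Nturns);
B(:,1) = rand(Npeople,1);              % random initial beliefs
C = circshift(eye(Npeople),[0 1]);     % the circle; try C = 1-eye(Npeople) too
for t = 2:Nturns
  Output = rand(Npeople,1) < B(:,t-1); % everyone produces a binary output
  for j = 1:Npeople
    teachers = find(C(:,j));           % the people j learns from (assumed non-empty)
    B(j,t) = (1-gamma)*B(j,t-1) + gamma*mean(Output(teachers));
  end
end
plot(B')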

4. Learning non-binary discrete random variables

Problem (1a) [optional]. There is nothing special about binary variables in this area -- linear learning works in a similar way for random variables with multiple discrete outcomes. If this isn't obvious to you, try it. It may help you to have a convenient way to throw an unfair n-sided die:

%  Generate a random index in 1, 2, ..., n from the distribution P
function [index] = randindex(P)
r = rand;
i = 1;
s = P(1);
while ((r > s) && (i < length(P)))
  i = i+1;
  s = s+P(i);
end
index = i;
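
For example, here is a minimal sketch of linear learning over a three-outcome distribution, assuming randindex is saved as randindex.m; the target distribution and learning constant are arbitrary illustrative choices:

% Linear learning of a 3-outcome distribution -- a sketch.
Nturns = 1000; gamma = 0.02;
P = [0.5 0.3 0.2];                 % the environmental distribution
A = ones(1,3)/3;                   % the learner starts with uniform beliefs
Atrace = zeros(Nturns,3);
for n = 1:Nturns
  obs = zeros(1,3);
  obs(randindex(P)) = 1;           % indicator vector for the observed outcome
  A = (1-gamma)*A + gamma*obs;     % the same linear update, applied componentwise
  Atrace(n,:) = A;
end
plot(Atrace)                       % each component converges towards the corresponding P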

5. Yang's model of grammar competition

Here is a much-simplified version of Charles Yang's model of grammar competition.

We assume that there are two sound-classes, x and y, which are merged for some speakers and separate for others. Speakers of the merged dialect pronounce both x-words and y-words using the sound x; but they accept either the sound x or the sound y as valid for words in the merged x/y class.

This model follows the tradition of assuming that grammars are completely deterministic systems, with no optionality or variability. Variation always means that two or more deterministic grammars are in stochastic competition. On any occasion of speech, a speaker flips an unfair coin and picks a grammar to use; and similarly for a hearer.

Yang's learning model assumes that if a hearer gets input that is consistent with the chosen grammar, that grammar's probability is moved towards 1.0 by some factor gamma; if the input is inconsistent with the chosen grammar, its probability is moved towards 0 by the same factor.
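
In update-rule form (matching the implementation below), the two cases look like this, where p is the probability the hearer assigns to the chosen grammar and the particular values of p and gamma are just illustrative:

% The two Yang-style linear reward/penalty updates:
p = 0.5; gamma = 0.15;
p_consistent   = p + gamma*(1 - p);   % input fits the chosen grammar: p moves towards 1
p_inconsistent = (1 - gamma)*p;       % input conflicts with it: p moves towards 0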

[Given multiple sources of variation, the number of competing grammars will increase exponentially, which seems like a reductio ad absurdum of the dueling-deterministic-grammars approach; but when we consider only one variable at a time, it's isomorphic to more sensible models that allow variation into the grammar...]

For present purposes, we assume that there are two grammars in competition, U (for "unmerged") and M (for "merged"). Since P(M) = 1-P(U), we can track only the probability of U, without loss of generality. For users of grammar U, there are two word classes x and y that are always pronounced in separate and distinct ways. Users of grammar M pronounce words of both types as if they belonged to class x, and accept pronunciations of either type x or type y as a possible way of pronouncing words of either class.

For any speaker or hearer, the two grammars are in competition. On a given occasion of speech, a speaker i will choose U or M with probability P(i,U) or 1-P(i,U); and a hearer j will similarly choose U or M with probability P(j,U) or 1-P(j,U).

We assume that hearers always know what word a speaker meant to say. [Relaxing this assumption doesn't change anything qualitatively -- try this for yourself if you want...] But the pronunciation may be consistent or inconsistent with the hearer's chosen grammar. If the pronunciation is consistent, then the probability that the hearer assigns to that grammar is incremented (in the normal linear-learning manner). If the pronunciation is inconsistent, then the probability is decreased similarly.

As in the previous model, we assume that the members of the community learn from one another, sometimes functioning as speakers and sometimes as hearers. [Note to avoid confusion: Yang treats a single learner in a fixed environment.]

Here's a simple Matlab implementation of this model:

% Npeople "people", 1 binary "feature", Nturns turns, gamma=0.15
% In grammar U, x and y are "unmerged";
% in grammar M, they're "merged" as x.
Npeople = 50; Nturns = 250; gamma = 0.15;
% U(i,t) is the probability assigned by person i to grammar U at turn t
% (the probability assigned to grammar M is just 1-U)
% We start with about 80% of people unmerged:
U = zeros(Npeople,Nturns);
for p=1:Npeople
  U(p,1) = rand(1)<0.8;
end
%
% On each turn t, person i "speaks" to neighbor j,
% where we set j = mod(i,Npeople)+1.
% Persons i and j each select a (merged or unmerged) grammar
% according to U(i,t) and U(j,t).
% Person i chooses an x-word or a y-word at random
% (for now we assume that they're equally likely)
% and "pronounces" it according to their selected grammar.
% Person j interprets it according to their current grammar
% and then updates their grammar probabilities appropriately.
% We assume that merged speakers accept either x or y.
for t=1:(Nturns-1)
  for i=1:Npeople
    j = mod(i,Npeople)+1;
    wordx = rand(1)<0.5;
    % select grammars -- 1 is unmerged, 0 is merged
    Gi = rand(1)<U(i,t);
    Gj = rand(1)<U(j,t);
    if (Gi==0 && Gj==0)      % both merged - j affirms merged
      U(j,t+1) = (1-gamma)*U(j,t);        % U moves towards 0 by factor gamma
    elseif (Gi==1 && Gj==1)  % both unmerged - j affirms unmerged
      U(j,t+1) = U(j,t)+gamma*(1-U(j,t)); % U moves towards 1 by factor gamma
    elseif (Gi==1 && Gj==0)  % i unmerged, j merged - j affirms merged
      U(j,t+1) = (1-gamma)*U(j,t);
    elseif (Gi==0 && Gj==1)  % i merged, j unmerged
      if wordx==1            % word is type x, j affirms unmerged
        U(j,t+1) = U(j,t)+gamma*(1-U(j,t));
      else                   % word is type y, j affirms merged
        U(j,t+1) = (1-gamma)*U(j,t);
      end
    end
  end % end of people loop
end
%
plot(U');
axis([1 Nturns -.1 1.1])

In Yang's model, there's a "winner take all" feature of the learning process that creates a non-linear "tipping point". Here, the bias in favor of merger arises simply from the inherent asymmetry of the production/perception model -- by the way things are set up, merged speakers can never encounter falsifying evidence, whereas unmerged speakers will tend to move towards a merged grammar whenever they encounter a merged speaker producing a word of type y (pronounced as if it were of type x).

Problem (2): This bias is great enough that even a single merged speaker will almost always "infect" the entire community -- try it! Note that you might need a larger number of turns to reach convergence.
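
For example, one way to seed the simulation with a single merged speaker is to replace the initialization in the code above with something like the following (a sketch of the setup only; the particular values are illustrative):

% Everyone starts unmerged except person 1;
% with only one merged speaker, convergence typically takes many more turns.
Npeople = 50; Nturns = 2000; gamma = 0.15;
U = zeros(Npeople,Nturns);
U(:,1) = 1;     % all unmerged...
U(1,1) = 0;     % ...except person 1, who is merged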

(Extra credit) What would happen in this model if merged speakers were to take unmerged pronunciation of words of type y as counter-evidence to their current choice of grammar?

6. Modeling "culture" as multiple random variables

The dynamics of arbitrarily complex cultural states can be modeled by adapting sets of random variables in parallel. Thus we might use a set of 20 or so binary random variables to represent the notion "possible pronunciation of a word". A particular individual's belief about the pronunciation of a particular word is then a 20-dimensional vector of probabilities. If there are 1,000 words, then each community member's lexicon is a 1,000-by-20 belief matrix, whose rows are "learned" just as a single binary random variable would be. If the 1,000 lexical dimensions and 20 phonological dimensions are treated as independent, then the generalization is trivial.
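
For instance, here is a minimal sketch of one linear-learning step applied to a whole belief matrix at once; the sizes, the uniform 0.5 initialization, and the random "speaker" are all arbitrary illustrative choices:

% One learning step for a lexicon of independent binary variables.
% Bel(w,k) is the learner's probability that feature k of word w has value 1.
Nwords = 1000; Nfeatures = 20; gamma = 0.05;
Bel = 0.5*ones(Nwords,Nfeatures);           % the learner's belief matrix
Speaker = rand(Nwords,Nfeatures);           % another speaker's belief matrix
Output = rand(Nwords,Nfeatures) < Speaker;  % that speaker's pronunciation of every word
Bel = (1-gamma)*Bel + gamma*Output;         % the usual linear update, applied entrywise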

But interesting things can happen if parts of an individual's belief space can influence one another in some way. For example, we could model phonological neighborhood effects: learners could avoid (or prefer) moving their beliefs about word-pronunciations into denser phonological neighborhoods.

Or we could model "markedness" effects: learners could prefer simpler syllable structures, or sound sequences that agree (or contrast) in certain features.

We might also introduce a notion of social affinity: it's clear that learners are more disposed to imitate some people than others, and they might go so far as to actively differentiate themselves in some cases. We could model that with an "affinity matrix", where A(i,j) would modulate person j's learning constant with respect to experience derived from person i (and perhaps even change it from attraction to repulsion, by moving the learner's belief away from the experience rather than towards it).
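
Here is a minimal sketch of the affinity idea; the network structure, the affinity values, and the clipping to [0,1] are all illustrative choices, and a negative affinity turns attraction into repulsion:

% Affinity-modulated learning -- a sketch.
% A(i,j) scales person j's learning constant for input from person i;
% a negative entry makes j move away from i's output instead of towards it.
Npeople = 20; Nturns = 500; gamma = 0.1;
B = zeros(Npeople,Nturns);
B(:,1) = rand(Npeople,1);
A = ones(Npeople);             % uniform affinity...
A(1,2) = -1;                   % ...except person 2 actively avoids person 1
for t = 2:Nturns
  B(:,t) = B(:,t-1);                     % carry beliefs forward
  i = randi(Npeople);                    % a random speaker...
  j = mod(i,Npeople)+1;                  % ...addresses their neighbor
  Output = rand(1) < B(i,t-1);           % the speaker's binary output
  g = gamma*A(i,j);                      % affinity-scaled learning constant
  B(j,t) = max(0, min(1, (1-g)*B(j,t-1) + g*Output));   % linear update, clipped to [0,1]
end
plot(B')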

Problem (3): Pick one or two of these ideas, or others like them, and try them out.

Whatever you do, keep it simple.