Problem set #3 -- COGS 502

Due 2/21/2008

1. Download these files

text100_1.nhist, text100_2.nhist, text100_3.nhist, text100_1-2.nhist,
text100_1.nr, text100_1-2.nr

and put them somewhere that Matlab can find them.

2. Read in the .nhist files via calls like this:

% read in lexical histograms for first three 100K blocks of CTS
% each line is
%
% COUNT WORD WORDNUM
%
% where WORDNUM is an arbitrary unique-per-word integer identifier

fid = fopen('text100_1.nhist'); H1 = textscan(fid, '%f %s %f', 10000); fclose(fid);
NH1 = H1{1}; SH1 = H1{2}; VH1 = H1{3};

These files contain the per-word counts for the first three 100K-word segments of the Conversational Telephone Speech corpus discussed in the LNRE notes.

The file text100_1-2.nhist is the same thing for the first 200K words in combined form, for convenience.

3. Read in the .nr files via calls like this:

% read in nr, r values
% each line is
%
% NR R

fid = fopen('text100_1.nr');
NR1 = textscan(fid, '%f %f', 250);
fclose(fid);
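
If you want column variables parallel to step 2, you can split out the columns of the cell array -- the names nr1 and r1 below are just suggestions, not part of the assignment:

nr1 = NR1{1};   % nr: number of word types occurring exactly r times
r1  = NR1{2};   % r: the corresponding frequency values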

4. In the LNRE notes, we showed that the Good-Turing method does a decent job of predicting the probability that new "animals" will belong to previously-unseen "species"; and we sketched Bill Gale's "Simple Good-Turing" method for smoothing the nr values so that the Good-Turing method can also be used to estimate the probabilities of known word-types.

The first step in that process is an "averaging transform", which calculates Zr values to replace the nr values in the "frequency of frequencies" distribution. This transform deals with the quantization of such counts by replacing nr with

Zr = nr / (0.5*(next - previous))

where previous and next are the nearest lower and higher values of r for which nr is non-zero. As Gale explains,

In other words, we estimate the expected nr by the density of nr for large r. For small r, there is no difference, because the length of the intervals is unity.

Do this for the two .nr files.

Plot log(r) against log(nr) and against log(Zr).
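
For reference, here is one way the averaging transform and the plots might look in Matlab -- a minimal sketch, not the only way to do it. It continues the nr1/r1 naming from step 3, and uses the endpoint conventions I believe Gale intends (previous taken as 0 below the smallest observed r, and a symmetric interval above the largest):

% averaging transform: Zr = nr / (0.5*(next - previous))
% assumes r1 and nr1 are column vectors, sorted here by increasing r
[r1, ix] = sort(r1); nr1 = nr1(ix);
m = length(r1);
Zr1 = zeros(m, 1);
for i = 1:m
    % previous non-zero r (taken as 0 below the smallest observed r)
    if i == 1
        prev = 0;
    else
        prev = r1(i-1);
    end
    % next non-zero r (a symmetric interval above the largest observed r)
    if i == m
        nxt = 2*r1(i) - prev;
    else
        nxt = r1(i+1);
    end
    Zr1(i) = nr1(i) / (0.5*(nxt - prev));
end

% plot log(r) against log(nr) and against log(Zr)
figure;
plot(log(r1), log(nr1), 'o', log(r1), log(Zr1), '.');
xlabel('log(r)'); ylabel('log(nr), log(Zr)');
legend('log(nr)', 'log(Zr)');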

5. The Zr alone are not enough to let us use the Good-Turing equation

r* = (r+1) * n_{r+1} / n_r

to calculate r*, because the Zr are only defined for the r values for which we had a non-zero nr to start with. If n_{r+1} is zero, we are still out of luck. A minimal approach to dealing with this is piecewise linear interpolation among the Zr -- since they were created on the basis of the local density of the nr.

Do this, and use the results to calculate r* and the associated probabilities.
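
A rough sketch of this step, continuing the variables above and reading "interpolation among the Zr" as linear interpolation of log(Zr) against log(r):

% interpolate log(Zr) against log(r), so that Z(r+1) is defined even where n_{r+1} was zero
Zof = @(x) exp(interp1(log(r1), log(Zr1), log(x), 'linear', 'extrap'));

rstar = (r1 + 1) .* Zof(r1 + 1) ./ Zr1;   % Good-Turing adjusted counts r*
N  = sum(r1 .* nr1);                      % total number of tokens in the sample
pr = rstar / N;                           % estimated probability of one word type seen r times
p0 = nr1(r1 == 1) / N;                    % total probability mass for unseen words (n1/N)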

Do the calculated probabilities add up to 1? (Don't forget to include the estimated probability of unseen words!) If not, why not?

Use "renormalization" to make sure that the estimated probabilities do sum to 1. Then check the quality of the predictions -- you can check text100_1 against text100_2 and/or text100_3; and you can check text100_1-2 against text100_3.

You can plot the predicted frequencies against the observed frequencies -- or for an overall evaluation, try the "RMS log error" method discussed in Gale 1994 (p. 11).
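
If it helps to see the shape of such a check, here is one possible version -- it reuses NH1 and SH1 from step 2, treats text100_2 (the same size as text100_1) as the test set, and its "RMS log error" line is just one plausible reading of Gale's measure, which may differ in detail from his formula:

% renormalize so that the seen-word probabilities plus the unseen mass sum to one
pr_norm = pr * (1 - p0) / sum(nr1 .* pr);

% read the next 100K block and index its counts by word
fid = fopen('text100_2.nhist'); H2 = textscan(fid, '%f %s %f', 10000); fclose(fid);
cmap = containers.Map(H2{2}, H2{1});

% for each r, average the text100_2 counts of the text100_1 words seen r times
% (the two blocks are the same size, so raw counts are directly comparable)
emp = zeros(size(r1));
for i = 1:length(r1)
    words = SH1(NH1 == r1(i));         % text100_1 words with count r
    c = zeros(size(words));
    for j = 1:length(words)
        if isKey(cmap, words{j})
            c(j) = cmap(words{j});     % its count in text100_2 (0 if unseen there)
        end
    end
    emp(i) = mean(c);
end

% compare predicted r* with the empirical averages on a log scale
ok = emp > 0;                          % skip r values with no matches, to avoid log(0)
rms_log_err = sqrt(mean((log(rstar(ok)) - log(emp(ok))).^2))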

5a. [extra credit] Good 1953 suggests (p. 243)

that smoothing can be done by making only local assumptions, for example, that the square root of E(nr | H), as a function of r, is approximately 'parabolic' for any nine consecutive values of r.

Try turning this into a distribution-neutral local smoother that is somewhat more sophisticated than piecewise linear interpolation among the Zr values. Does it work better?
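
One possible starting point is sketched below: slide a nine-point window over the (r, sqrt(Zr)) pairs and fit a parabola by least squares at each position. (This reads "nine consecutive values of r" loosely as nine consecutive observed points, which is a simplification where the non-zero r get sparse.)

% local parabolic smoothing of sqrt(Zr), loosely following Good's suggestion
sq = sqrt(Zr1);
sm = sq;                                    % the four points at each end keep their raw values
for i = 5:(length(r1) - 4)
    w = (i-4):(i+4);                        % nine consecutive observed points
    c = polyfit(r1(w) - r1(i), sq(w), 2);   % least-squares parabola, centered at the current r
    sm(i) = polyval(c, 0);                  % fitted value at the center
end
Zr1_parab = sm.^2;                          % back to the Zr scale; compare with the interpolation in (5)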

6. The "Simple Good-Turing" method of Gale 1994 proposes a combination of two methods:

the first proposal is to make an extremely simple smooth of the nr ... For the small values of r, simply using the Turing estimate may well be more accurate than the simple smooth. Thus, the second proposal is a rule for switching from the Turing estimates at low frequencies to the [smoothed] estimates for higher frequencies.

By "the Turing estimate", he means just the original formula

r* = (r+1) * n_{r+1} / n_r

The proposed "simple smooth" is just linear regression of log(Zr) on log(r). And

The rule for choosing between the Turing estimate and the LGT estimate is to use the Turing estimates so long as they are significantly different from the LGT estimates. Once we use an LGT estimate, then we continue to use them. Thus this is a rule that may have a few Turing estimates of r* first, but then switches to the LGT estimates. Turing estimates are considered "significantly different" from LGT estimates if their difference exceeds 1.65 times the standard deviation (the square root of the variance) of the Turing estimate.

Gale's formula (p. 8) for estimating the variance of the Turing estimate is

Var(r*_T) ≈ (r+1)^2 * (n_{r+1} / n_r^2) * (1 + n_{r+1} / n_r)

Use the "Simple Good-Turing" method on text100_1 and text100_1-2. Check its predictions (after renormalizing!). Compare it to the piecewise linear interpolation method from (5) above. [Extra credit: compare the ELE and "add-tiny" methods discussed in Gale 1994, p. 10.]

Note that Gale's estimate for Var(r*_T) is not defined for cases where n_{r+1} is zero. If we switch to the LGT regime before we run out of non-zero n_{r+1} values, there won't be any problem. If you run into a zero along the way, try using the Zr local-averaging trick.
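
To make the moving parts concrete, here is one possible shape for the whole procedure, continuing the variable names above. The switching test uses the variance formula quoted in (6); where n_{r+1} is missing from the raw table, this sketch just switches to the LGT estimate, leaving the Zr fallback as an exercise.

% "simple smooth" (LGT): linear regression of log(Zr) on log(r)
coef = polyfit(log(r1), log(Zr1), 1);        % coef = [slope, intercept]
S = @(x) exp(polyval(coef, log(x)));         % smoothed stand-in for nr

% walk up the (sorted) r values, using Turing estimates until they stop
% being significantly different from the LGT estimates
rstar_sgt = zeros(size(r1));
useLGT = false;
for i = 1:length(r1)
    r = r1(i);
    r_LGT = (r+1) * S(r+1) / S(r);           % LGT estimate of r*
    k = find(r1 == r+1, 1);                  % row holding n_{r+1}, if any
    if ~useLGT && ~isempty(k)
        r_T  = (r+1) * nr1(k) / nr1(i);      % Turing estimate of r*
        varT = (r+1)^2 * (nr1(k) / nr1(i)^2) * (1 + nr1(k) / nr1(i));
        useLGT = abs(r_T - r_LGT) <= 1.65 * sqrt(varT);   % not significantly different: switch
    else
        useLGT = true;                       % no non-zero n_{r+1} available: switch
    end
    if useLGT
        rstar_sgt(i) = r_LGT;
    else
        rstar_sgt(i) = r_T;
    end
end

% unseen mass and renormalized probabilities, as in step 5
N  = sum(r1 .* nr1);
p0 = nr1(r1 == 1) / N;
pr_sgt = (rstar_sgt / N) * (1 - p0) / sum(nr1 .* rstar_sgt / N);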