Problem set #1 -- COGS 502

(Due week 3)

Download the files English1.chist and Italian1.chist, and put them in a location where Matlab can find them. These files contain the frequency counts for the 26 letters of the alphabet, compiled from newspaper texts in English and Italian, respectively.

(The texts were harvested from the web on January 21, 2008, and cover similar topics in the two languages -- Pakistan, Darfur, etc. Non-alphabetic characters were ignored; uppercase letters were mapped to lowercase; and accents or other diacritics were removed. The counts are based on the first 20,000 characters of the concatenated texts in each language.)

Read the files into Matlab, and put the two ordered lists of counts into appropriately-named vectors, e.g.

fid = fopen('English1.chist');      % open the counts file for reading
EnglishC = fscanf(fid, '%d %*s');   % read the numeric counts, skipping the non-numeric field on each line
fclose(fid);

1. Turn the list of English letter counts into an (estimated) discrete probability distribution, e.g.

EnglishP = EnglishC/sum(EnglishC);

What is the entropy of this distribution?
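
If it helps, here is a minimal sketch of one way to compute it in Matlab, assuming base-2 logarithms so that the answer comes out in bits:

p = EnglishP(EnglishP > 0);      % drop any zero entries, since 0*log(0) is taken as 0
EnglishH = -sum(p .* log2(p));   % entropy in bits per letter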

2. In this sample of English text, what is the estimated probability of an alphabetic character being a vowel as opposed to a consonant? What is the entropy of that (two-way) choice?
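
A minimal sketch of the bookkeeping, assuming the vowels are 'a', 'e', 'i', 'o', 'u', which sit at alphabetic positions 1, 5, 9, 15, and 21:

vowelIdx = [1 5 9 15 21];          % alphabetic positions of a, e, i, o, u
pVowel = sum(EnglishP(vowelIdx));  % estimated probability that a letter is a vowel
pCons = 1 - pVowel;                % estimated probability that it is a consonant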

3. In this English sample, if you know that a letter is a vowel, what is the probability distribution (based on the counts in English1.chist) over the alternatives 'a', 'e', 'i', 'o', 'u'? What is the entropy of this (five-way) choice?
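
Again a sketch, reusing the vowelIdx vector from the previous sketch; the consonant case in (4) works the same way with setdiff(1:26, vowelIdx):

VowelP = EnglishC(vowelIdx) / sum(EnglishC(vowelIdx));   % estimated P(letter | letter is a vowel)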

4. How about the corresponding choice among consonant letters?

5. According to the recipe in Shannon's monograph, how should you calculate the per-letter entropy of English alphabetic text by combining the entropy of the consonant/vowel distinction with the entropy of the choice among consonants and the entropy of the choice among vowels? Compare the result to the answer that you got in (1).

6. Do what you did in (1)-(5) for the Italian counts.

7. In next week's reading, we'll learn about Alan Turing's recipe for combining the "weight of evidence" from various sources of information in choosing between two alternative hypotheses. This recipe is based on a quantity known as the "log likelihood ratio":

weight of evidence = log( P(evidence | hypothesis1) / P(evidence | hypothesis2) )

In the files English1.ttest and Italian1.ttest, you'll find a sequence of alphabetic characters drawn from the same samples described above. Each line presents a single character, represented twice -- first as a number from 0 to 25 (the character's position in the alphabet), and second as the character itself.

(Note that the text has been processed as previously described -- all non-alphabetic characters are omitted, uppercase is mapped to lowercase, etc. The sample starts arbitrarily after 20,000 characters of the original collection, so it's likely to begin in the middle of a sentence, or even in the middle of a word.)

Read the first 10,000 lines of these files into Matlab, e.g.

fid = fopen('English1.ttest');
English1T = fscanf(fid, '%d %*s', 10000);   % read the first 10,000 numeric codes, skipping the characters themselves
fclose(fid);
English1T = English1T + 1;

I've added one to the numbers because Matlab arrays are indexed from 1 rather than 0.

For each of the characters in the resulting arrays, what "weight of evidence" does it provide for the hypothesis that the source was English as opposed to Italian?

Express the result as an array of "weights of evidence".
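
One way to set this up in Matlab, sketched under the assumption that EnglishP and ItalianP are the probability vectors from (1) and (6), that English1T is the index vector read above, and that natural logarithms are used (any base works, as long as you use it consistently in (9)):

% Weight of evidence per character for English vs. Italian; if either probability
% is zero for some letter, the ratio blows up, and smoothing the counts is one fix.
WoE_English = log(EnglishP(English1T) ./ ItalianP(English1T));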

8. What is the relationship between the weights of evidence in favor of English and the weights of evidence in favor of Italian?

9. If you calculate the cumulative sum of these "weights of evidence", starting from the beginning of the test samples, how many letters do you need to add up before the odds become 50-to-1 in favor of one hypothesis or the other? 100-to-1?
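
A sketch of one way to locate the crossing points, assuming the WoE_English vector from the previous sketch (in natural logs) and equal prior odds, so that 50-to-1 odds correspond to a cumulative weight of log(50):

cumWoE = cumsum(WoE_English);             % running total of the weights of evidence
n50 = find(abs(cumWoE) >= log(50), 1);    % first letter at which the odds reach 50-to-1 either way
n100 = find(abs(cumWoE) >= log(100), 1);  % first letter at which the odds reach 100-to-1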