Linguistics 300, F08, Solution to Assignment 4

Here is a simple solution for Assignment 4. Some of your solutions were more sophisticated. For more details concerning the results in the tables below, see the top sheet (Sheet 5) of a4-results.xls.


The tables below were constructed as follows. Some details are included for convenient reference; I didn't expect you to include them in your assignments.

  1. I worked with a copy of Sheet 1 of a4-correct.xls, on which I deleted the three texts that were to be excluded. This yielded a total of 229 texts (rows 2-230).

  2. For each text, I calculated the mean word length by dividing CharsRaw by WordTokens. The values ranged from 4.56 (proud-1630) to 6.06 (trincoll-e2).

  3. For each text, I calculated the type/token ratio by dividing WordTypesRaw by WordTokensRaw. The values ranged from 0.24 (stat-1690) to 0.91 (nevill).

  4. The values for mean word length and type/token ratio aren't directly comparable, so I converted the values for each text into ranks, which can be compared directly.

  5. Clearly, the ranks have to run in the same direction. It's easy to see that low lexical complexity correlates with low mean word length (simple texts use shorter words). Low lexical complexity also correlates with low type/token ratio. You can see this from the example below, where the toy texts are arranged from least to most complex.

    Toy text    Types    Tokens    Type/token ratio
    a a a       1        3         0.33
    a a b       2        3         0.67
    a b c       3        3         1.00

  6. Using Excel's RANK function, I ranked the texts by mean word length as well as by type/token ratio, in ascending order in both cases, so that the lowest value receives rank 1. I then constructed a measure of lexical complexity by taking the mean of the two ranks. I divided the resulting mean by the total number of texts to give an easily interpretable number between 0 and 1.

    Instead of ranking the texts absolutely, as I did, some of you ranked the texts with respect to the maximum value for the relevant property. For instance, given a maximum mean word length of 6.06, a text with a mean word length of 3.03 would be associated with 0.50 (3.03/6.06). Similarly, given a maximum type/token ratio of 0.91, the text with the minimum type/token ratio would be associated with 0.27 (0.24/0.91). The simple ranking procedure seems to me more intuitive as the lowest value is close to 0, and the highest close to 1. With the other procedure, the maximum is necessarily 1 (max/max), but the minimum doesn't have to be particularly close to 0. However, either procedure yields a number between 0 and 1, which is both sensible and easy to interpret.

  7. In order to divide the texts into ones with high and low lexical complexity, I next calculated the median lexical complexity. There is no reason to assume that lexical complexity changes over time, so I calculated the median for the entire set of 229 texts. The median lexical complexity (based on my procedure) turns out to be 0.50. (It might look as though this would have to be the case, but the measure is the mean of two ranks rather than a percentile rank, so the median might in principle have differed from 0.50.)

  8. In order to divide the texts into ones with high and low rates of auxiliary do, I calculated the median percentage of auxiliary do. The rate of auxiliary do rose sharply over the two centuries under consideration, so (following Warner) I calculated two separate medians, one for 1500-1575 and one for 1600-1719. In order to do this, I first sorted the entire spreadsheet by date of composition, hiding the texts between 1576 and 1599 for convenience. In calculating the median, the texts without any negative sentences are irrelevant. In order to get them out of the way, I separately sorted the two ranges (1500-1575 and from 1600 onwards) by percentage of do support. The medians for the early and the late texts (based on rows 2-51 and rows 108-195, respectively) were 0.07 and 0.60, respectively.

    Incidentally, the means for the two time periods, calculated on the basis of the percentages for the individual texts, are 0.20 (early) and 0.59 (late). The means for the two time periods, calculated on the basis of the aggregate data, are 0.19 (early) and 0.61 (late). So as it turns out, it wouldn't matter much which type of average we use. However, following Warner, I used the median, as it is less sensitive to extreme values.

  9. For simplicity, I made another copy of the worksheet, so that I could sort it with impunity without affecting the medians calculated up to now. I sorted the sheet by date, sorted the early texts and the late texts separately by lexical complexity, and noted the row numbers that I would need to reference in order to calculate the cells of the two tables. I treated the median value itself (0.50) as high because this resulted in a more even split in the number of texts.

    1500-1575, low lexical complexity     rows 2 to 39
    1500-1575, high lexical complexity    rows 40 to 86
    1600-1719, low lexical complexity     rows 108 to 172
    1600-1719, high lexical complexity    rows 173 to 230

  10. Using Excel's COUNTIF function, I counted the number of texts in each subset of the data with a high percentage of do and a low percentage of do. I treated the median values themselves (0.07 and 0.60, respectively) as low because this resulted in a more even split of the number of texts than treating them as high. For completeness, I also counted the number of texts containing no negative sentences whatsoever.

  11. Finally, I calculated the mean percentage of auxiliary do for each of the four groups of texts, based on the aggregate number of negative sentences (rather than on the percentages by text).
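
For those who would rather script the procedure than use a spreadsheet, steps 2 to 6 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the procedure I actually used (which was carried out in Excel): the four texts and their counts are invented, the class and function names are hypothetical, and ties are handled the way an ascending Excel RANK handles them (tied values share the lowest rank of their group). The last function shows the alternative max-based scaling that some of you used.

```python
from dataclasses import dataclass


@dataclass
class Text:
    name: str
    chars: int   # corresponds to CharsRaw
    tokens: int  # corresponds to WordTokensRaw
    types: int   # corresponds to WordTypesRaw


def ascending_ranks(values):
    """Rank 1 = smallest value; tied values share the same rank."""
    return [1 + sum(other < v for other in values) for v in values]


def lexical_complexity(texts):
    """Mean of the two ascending ranks, divided by the number of texts."""
    mean_word_length = [t.chars / t.tokens for t in texts]
    type_token_ratio = [t.types / t.tokens for t in texts]
    rank_length = ascending_ranks(mean_word_length)
    rank_ratio = ascending_ranks(type_token_ratio)
    n = len(texts)
    return {
        t.name: (r1 + r2) / 2 / n
        for t, r1, r2 in zip(texts, rank_length, rank_ratio)
    }


def max_normalized(values):
    """Alternative scaling: each value divided by the maximum value."""
    top = max(values)
    return [v / top for v in values]


# Invented counts, for illustration only (not taken from a4-correct.xls).
sample = [
    Text("A", chars=450, tokens=100, types=30),
    Text("B", chars=520, tokens=100, types=60),
    Text("C", chars=600, tokens=100, types=50),
    Text("D", chars=500, tokens=100, types=90),
]

scores = lexical_complexity(sample)
for name, score in scores.items():
    print(name, round(score, 2))
```

With the rank-based measure, the lowest score is close to 0 only when there are many texts; with four texts it is 1/4. The max-based scaling always assigns 1 to the largest value, but its minimum depends on how far apart the raw values are.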


Table 1: Occurrence of do in texts of low versus high lexical complexity, 1500-1575

                          Lexical complexity
                          Low     High    Total
High DO %                 15      10      25
Low DO %                  15      10      25
no negative sentences      8      27      35
Total                     38      47      85
Mean DO % 0.17

Table 2: Occurrence of do in texts of low versus high lexical complexity, 1600-1719

                          Lexical complexity
                          Low     High    Total
High DO %                 30      13      43
Low DO %                  29      16      45
no negative sentences      6      29      35
Total                     65      58      123
Mean DO % 0.65
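
The counting in steps 7 to 11 can be sketched in Python as well. Again, this is a hedged illustration rather than the spreadsheet procedure itself: the seven rows are invented, the function names are hypothetical, and the code handles a single time period (the tables above apply the same logic separately to 1500-1575 and 1600-1719). As in the procedure above, a lexical complexity equal to the median counts as high, a do rate equal to the median counts as low, and texts without negative sentences are left out of the median.

```python
from statistics import mean, median

# One tuple per text: (lexical complexity, sentences with do, negative sentences).
# Invented numbers, for illustration only.
period = [
    (0.20, 0, 10),
    (0.30, 1, 10),
    (0.40, 0, 0),   # no negative sentences
    (0.50, 3, 10),
    (0.60, 2, 20),
    (0.70, 0, 0),   # no negative sentences
    (0.80, 9, 10),
]


def crosstab(texts, complexity_median):
    """Count texts by complexity (low/high) and do rate (high/low/none)."""
    # Texts without negative sentences are irrelevant to the median do rate.
    do_median = median(d / n for _, d, n in texts if n > 0)
    counts = {
        (col, row): 0
        for col in ("low", "high")
        for row in ("high_do", "low_do", "no_neg")
    }
    for complexity, d, n in texts:
        # A complexity equal to the median is treated as high.
        col = "high" if complexity >= complexity_median else "low"
        if n == 0:
            row = "no_neg"
        elif d / n > do_median:
            row = "high_do"
        else:
            # A do rate equal to the median is treated as low.
            row = "low_do"
        counts[(col, row)] += 1
    return do_median, counts


def aggregate_rate(texts):
    """Do rate based on pooled counts rather than on per-text percentages."""
    return sum(d for _, d, _ in texts) / sum(n for _, _, n in texts)


def mean_of_rates(texts):
    """Do rate as the mean of the per-text percentages."""
    return mean(d / n for _, d, n in texts if n > 0)


do_median, counts = crosstab(period, complexity_median=0.50)
print("median do rate:", do_median)
for key, value in counts.items():
    print(key, value)
print("aggregate rate:", round(aggregate_rate(period), 2))
print("mean of per-text rates:", round(mean_of_rates(period), 2))
```

The last two functions correspond to the two kinds of mean mentioned in step 8: the aggregate rate weights each text by its number of negative sentences, whereas the mean of per-text rates weights every text equally, so the two can differ.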


I didn't necessarily expect you to provide discussion of your results. I include the following for general interest.