A Heuristic Algorithm for Article Insertion

Jason Baldridge
December 1996

Because many languages lack one or more articles, a problem faced by many machine translation (MT) systems is deciding how to correctly insert articles when translating from a language which does not have them into one that does. Currently, the insertion task is handled by human posteditors who transform the "bad English" of machine-translated text into "good English"; however, the fact that English articles often appear in highly predictive contexts suggests that the problem may lend itself to statistical methods. The importance of getting article insertion right becomes apparent when one considers the strain of reading an English text which lacks articles. Even though statistical methods alone cannot insert articles with human accuracy, they can save a great deal of time by predicting the majority of the insertions correctly. A human posteditor can then produce a fully corrected version with less effort and time.

Automatic article insertion could be handled in two ways. One is to build an algorithm within a particular MT system (for example, one which translates from Japanese to English). The other is to have a stand-alone, post-translation article inserter which sees only the English text. The first method would have the advantage of perhaps benefiting from cues in the source language and possibly having access to a discourse model. A discourse model based on the source language would presumably be more accurate than one produced for article-less English text because a discourse modeler for English would generally rely heavily on the placement of articles for its decisions.

Despite this, an automatic posteditor as proposed by Knight and Chander (1994) would have some important advantages, the primary one being portability: the same module could be used by any MT system, because it works only with English and is thus independent of the source language. Also, relatively high accuracy can be achieved without discourse considerations, and the statistical computations over local contexts are quite fast and efficient. In a very limited experiment, Knight and Chander (K&C) found that human subjects who were given the head noun plus four words of the local context were able to predict the author's choice of article 83-88% of the time. This suggests that articles actually do not convey much information in a majority of instances (conversely, consider trying to predict nouns, verbs, etc.) and that brute statistics should be highly effective.

This paper assumes the posteditor approach to article insertion. I have undertaken the insertion task with the same major simplification that K&C assume in their implementation, namely that we know where an insertion decision must be made and that the decision is only between the and a/an. The text being processed has been produced from an original document, except that a, an, and the have been replaced by a marker. My algorithm looks only two words to either side of the determiner; K&C's looks at the head noun, any words between the head noun and the determiner, and up to two words to the left of the determiner. K&C's human subjects in the above-mentioned experiment faced the same task, achieving an accuracy of 83-88%. Using a decision tree approach, K&C report a success rate of 78%. These figures provide a good benchmark and, likewise, a good motivation for the present experiment. Of course, the restrictions assumed here would eventually have to be relaxed so that a marker is no longer necessary and so that the zero article and some/any could be considered in the insertion decision. Only then would an automated article inserter actually be of any practical use.

Nonetheless, the simplified task is important in determining what might be the best way to tackle the higher goal. K&C built decision trees which model the interaction of head nouns and surrounding features in predicting the article, where features are either lexical features (words which occur in a given position more than once) or abstract features, such as part of speech, plural marking, tense, and subcategory. A tree is built for each noun; however, building these trees is computationally expensive. Because of this, K&C built trees for only the 1,600 most common head nouns, which nonetheless manage to cover 77% of all noun occurrences. They achieved 81% accuracy for insertion on these; on the rest of the nouns they simply guess the, yielding an overall accuracy of 78%.

At first, one might think that blindly guessing the would wreak greater havoc than this on the overall accuracy. The reason it doesn't is that, statistically, the occurs 67% of the time. Of course, as K&C point out, this provides a clear lower bound for an automatic article inserter's performance. In order to determine an upper bound, K&C also conducted a small experiment in which subjects were given a full text with dummy markers replacing the articles. Being able to incorporate their discourse knowledge boosted the subjects' performance to 94-96%.

The approach I have taken is quite different from K&C's. The basic idea is to look at the contexts immediately surrounding the articles to determine probabilities for each individual context co-occurring with the or a/an (from here on I shall refer to a/an simply as a). The insertion program trains itself on a set of data from the on-line Wall Street Journal tagged corpus, creating databases which store the probabilities for each particular context in a class of contexts. The contexts are drawn from the words and tags which are located within two positions of the article. In other words, for every article, the program looks to find the following:

  W/T   W/T   ART   W/T   W/T

Here, "W" refers to the word, "T" to the tag, and "ART" to the article. From this, ten context classes are used:

"_" indicates a position that is being ignored. So, for instance, "year earlier" is a particular context in the class of contexts "the next two words to the right" (ART W W).

The distribution of "year earlier" in the Wall Street Journal corpus is 365 a/an's to 20 the's, which is roughly the same statistic the program will derive in the course of its training.
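As a rough sketch of the training step, the following counts how often each particular context in one class, the two words to the right of the article (_ _ ART W W), co-occurs with a versus the. The token-list representation, the function name, and the omission of tags are simplifications of this sketch rather than details of the actual program.

    from collections import defaultdict

    ARTICLES = {"a", "an", "the"}

    def train_right_bigram_counts(tokens):
        # Count a/an versus the for the context class "the two words to the
        # right of the article"; tokens is a list of lowercased words.
        counts = defaultdict(lambda: {"a": 0, "the": 0})
        for i, tok in enumerate(tokens):
            if tok in ARTICLES and i + 2 < len(tokens):
                context = (tokens[i + 1], tokens[i + 2])
                counts[context]["the" if tok == "the" else "a"] += 1
        return counts

    # counts[("year", "earlier")] should come out heavily in favor of a,
    # mirroring the 365-to-20 split reported above.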

Having gathered the statistics for the various contexts, the program then proceeds to the insertion process. After reading in the text file to be tested, it creates an untagged (i.e., readable) text file against which the text file with reinserted articles will be compared later. The test file is not part of the training set. Next, the program masks the articles in the array which holds the file by replacing them with the marker "##". The program then begins the process of reinserting the articles. As it scans through the file, it must make a decision whenever it encounters a marker.
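In outline, the masking step amounts to something like the following (the token-list representation and the function name are assumptions of this sketch):

    def mask_articles(tokens, marker="##"):
        # Replace every a, an, and the with the marker, remembering the
        # original choices so the reinserted articles can be scored later.
        masked, originals = [], []
        for tok in tokens:
            if tok.lower() in ("a", "an", "the"):
                originals.append(tok.lower())
                masked.append(marker)
            else:
                masked.append(tok)
        return masked, originals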

When considering a particular article insertion, the program looks at each of the ten aforementioned context classes. For each, the probability of a occurring in that context is computed (remember that this is a binary decision, so the probability of the need not be explicitly computed), and this value is sent to a subroutine which increments or decrements the unknown article's "score", which has an initial value of zero. The pseudo-code below sketches this heuristic process.
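The thresholds and increments here are illustrative stand-ins; the values actually used were settled on by trial and error, as discussed below.

    def update_score(score, p_a):
        # p_a is the estimated probability of a for the context found in
        # this class (so the probability of the is 1 - p_a).
        if p_a > 0.55:      # a context need only mildly favor a ...
            score += 3      # ... to earn a relatively large boost
        elif p_a > 0.40:
            score += 1
        elif p_a < 0.15:    # a context must strongly favor the ...
            score -= 2      # ... to subtract, and then by a smaller amount
        elif p_a < 0.30:
            score -= 1
        return score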

Thus, the score is increased if the context favors a and lowered if it favors the. Notice that the scoring is lopsided in favor of a: not only are higher absolute values assigned to a-favoring contexts, but those contexts also do not need to meet as stringent probability requirements as the-favoring ones. This is to offset the natural bias towards the noted earlier. Indeed, most contexts favor the, so when ten different contexts each cast a vote, eight which mildly favor the might drown out two which heavily favor, and correctly predict, a. The values actually used were chosen simply by running the program with different values and seeing what produced the highest accuracy. More tweaking would probably bring higher results.

There are actually two sets of score assignments, one for word contexts and one for tag contexts. Because tag contexts exhibit an even greater bias toward the, they are not allowed to affect the overall score as much as the word contexts. Even so, this restriction is not essential. More important is not to consider single-tag contexts at all, as they tend not to be very predictive. A move which I have not made here, but which might be productive, is to consider mixed word/tag contexts.

After each context has affected the score, the insertion point is passed to a filter which checks it against some highly predictive elements, such as idioms and certain contexts such as "the two words following the article are 'year ago'" (which nearly always predicts a). If a match is found, the score is incremented or decremented by a specified (usually large) amount.
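A toy rendering of such a filter, with invented entries and weights, might look like this:

    # Hand-built table: (two words after the marker) -> score adjustment.
    # Large positive values push toward a, large negative ones toward the.
    PREDICTIVE_CONTEXTS = {
        ("year", "ago"): +50,         # "a year ago" almost always takes a
        ("united", "states"): -50,    # "the United States"
    }

    def apply_filter(score, right_words):
        # right_words: the two tokens immediately following the insertion point
        return score + PREDICTIVE_CONTEXTS.get(tuple(right_words), 0)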

The insertion point is then sent to the only part of the program which extends past the local context. This very minimal discourse contribution maintains a file of the last five nouns encountered. If the article in question has any of those nouns following it, its score is decreased to reflect the likelihood of a second mention noun co-occurring with the. This part of the program would be much more effective if it knew what the head noun was.
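A minimal sketch of this check, assuming the noun following the marker is available (the penalty value is illustrative):

    from collections import deque

    recent_nouns = deque(maxlen=5)    # the last five nouns encountered

    def apply_recency(score, next_noun, penalty=5):
        # Lower the score (i.e., favor the) when the noun after the marker
        # was recently mentioned, since second mentions tend to take the.
        if next_noun in recent_nouns:
            score -= penalty
        recent_nouns.append(next_noun)
        return score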

The last subroutine the insertion point passes through calculates the average probability of all the contexts together. It also notes the maximum probability. At this point, the decision is made whether to choose a or the. An a is inserted if either of the following holds:

  1. the score is greater than ten
  2. the score is greater than zero, the average probability is greater than 0.3, and the maximum probability is greater than 0.6

Failing both of these criteria results in the being inserted.
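Putting the decision together, with the thresholds stated above (the gathering of the per-class probabilities is shown schematically):

    def choose_article(score, probs):
        # probs: the estimated probabilities of a gathered from the context classes
        avg_p = sum(probs) / len(probs)
        max_p = max(probs)
        if score > 10:
            return "a"
        if score > 0 and avg_p > 0.3 and max_p > 0.6:
            return "a"
        return "the"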

Though its thresholds and weights may seem a bit arbitrary, the program achieved 82.1% accuracy on 3,016 insertion decisions from texts which were used neither to train it nor to develop it. This bests K&C's decision tree approach by several percentage points and equals their average performance for predicting the articles of nouns for which they have built a tree. My method might be likened to a democratic decision made by citizens with unequal weights attached to their votes. This inequality is offset by the fact that those whose votes count less are more likely to be able to cast their votes. And at times (e.g., with idioms), the powers that be cast a vote overriding all others.

Each of the subroutines contributes a little toward the whole. Taking out the filter reduced the accuracy to 81.73%. This subroutine would be much more influential if it knew what the head noun was and had a serious database to work with; as it stands, its table is far from complete. Removing the recent-noun tracker yielded 82.06%. This contribution is even smaller, but the subroutine would be greatly enhanced if it knew what the head noun was and if it kept track of all head nouns, not just those preceded by an article. Without the probability statistics, the accuracy dropped to 81.60%. Using the probability statistics alone, without the score, yielded 79.41%. And finally, taking out all of these subroutines and just using the scores given by the straight contexts gave an accuracy of 81.23%. While the differences are not great, given that these subroutines are not fully developed and yet still manage to increase the accuracy, I feel that they could be quite important when the wider task of inserting any article in any location is attempted.

A few things must be said about how to interpret a success rate such as 82.1%. It should be kept in mind that the program trains and tests itself on the Wall Street Journal; attempting to insert articles into other kinds of texts is likely to bring the accuracy down a great deal. Also, even though the influence of the tags is currently negligible, I think they can perhaps be better exploited. However, presumably the ultimate goal is to be able to process raw text, so a posteditor such as this one would have to ride on the back of a POS tagger or perhaps forgo the luxury of tags. A good basal noun phrase detector would be even more useful, and I suspect that dropping the tags and using noun phrases would ultimately be more productive.

Two other things should also be kept in mind. First, the accuracy reported here is a bit unforgiving: an insertion is considered correct only if it matches the author's choice, yet in many cases it makes little difference which of the two articles is used. This is reflected in the experiment in which human subjects achieved only 94-96% even when given the full text. Second, there are also many cases where the article actually is important in determining the information conveyed by a sentence. Consider the difference between a pair of sentences like the following:

  1. I was taking the bus home when the driver suddenly braked.
  2. I was taking the bus home when a driver suddenly braked.

In the first sentence, we know that the driver was the one driving the bus that the speaker was taking. In the second, we know that it was not the bus driver. There are many other such examples which cannot be determined by brute statistics. These are cases where the information status of a particular noun phrase depends on the article. Quite commonly, these are of the Inferrable type characterized by Prince (1981). It is at this point that the statistical posteditor can pass the text along to another posteditor, be it human or computer, and not feel bad about the fact that it cannot capture these more meaningful uses of the articles.

References

Hawkins, J.A. 1978. Definiteness and Indefiniteness. Atlantic Highlands, NJ: Humanities Press.

Knight, K. and I. Chander. 1994. Automated Postediting of Documents. In Proceedings of the National Conference on Artificial Intelligence (AAAI-94). Seattle, WA.

Prince, E.F. 1981. Toward a taxonomy of given/new information. In Cole, P. (ed.), Radical Pragmatics, pp. 223-255. New York: Academic Press.