# Language Modelling

## Bayesian Inference for Speech Recognition

The basic task of a speech recogniser is to find the most probable sequence of words W which would sound like acoustic data A when spoken. If P(W|A) is the probability that the words W were spoken as A, the speech recogniser should select the most likely word sequence Ŵ given by

 Ŵ = argmax_W P(W|A)

Bayes' formula can be used to substitute more readily estimated probabilities: P(W), the probability that W is uttered, and P(A|W), the probability that W sounds like A, giving

 Ŵ = argmax_W P(W)P(A|W) / P(A)

P(A), the average probability of the acoustic data, is constant, so it can be removed from the maximisation:

 Ŵ = argmax_W P(W)P(A|W)

This formula succinctly expresses the contribution of the two types of statistical model used in speech recognition. An acoustic model is used to estimate P(A|W), and a language model is used to estimate P(W).
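As a toy illustration of this decision rule, the maximisation can be carried out over a short list of candidate transcriptions, working in log space to avoid floating-point underflow. The candidate strings and scores below are invented for the example:

```python
# Invented candidate transcriptions with illustrative log-probabilities:
# "lm" stands for log P(W) and "ac" for log P(A|W).
candidates = [
    ("recognise speech", {"lm": -7.2, "ac": -12.1}),
    ("wreck a nice beach", {"lm": -9.8, "ac": -11.9}),
]

def decode(candidates):
    """Return the hypothesis maximising log P(W) + log P(A|W).

    Working in log space turns the product P(W)P(A|W) into a sum;
    P(A) is omitted because it is the same for every hypothesis.
    """
    return max(candidates, key=lambda cand: cand[1]["lm"] + cand[1]["ac"])[0]

print(decode(candidates))  # recognise speech
```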

## N-grams

For a sequence of n words, W = w1, w2,...,wn, the probability of the sequence can be expressed by the chain rule as the product of the probabilities of each successive word:-

 P(W) = Π_{i=1}^{n} P(wi|w1,...,wi-1)

where P(wi|w1,...,wi-1) is the probability of word wi given all the preceding words w1,...,wi-1. The history hi = w1,...,wi-1 starts with a length of zero and increases by one for each successive word. In practice, the number of possible histories would be far too great for this to be a feasible way of estimating word probabilities. The number of possibilities has to be reduced by defining a practical number of equivalence classes which can be used to categorise the histories. The dominant approach is to use a trigram model for this purpose. A trigram is a unique sequence of three word types. It is sometimes necessary to distinguish between word types, which are distinct words in the vocabulary, and word tokens, which are instances of word types occurring in a given utterance. It is usually clear, however, which sense is meant.

In a trigram model, histories are equivalent if they end in the same two words, so

 P(W) = Π_{i=1}^{n} P(wi|wi-2,wi-1)

A not insignificant detail (as emerges below) is that the first two terms of this product are actually P(w1) and P(w2|w1), as there are not yet sufficient previous words for the general term to apply.

A maximum likelihood estimate of the trigram probabilities is readily obtained by counting the frequencies of trigrams and bigrams occurring in a set of utterances which is used to train the model:-

 P(wi|wi-2,wi-1) = C(wi-2,wi-1,wi) / C(wi-2,wi-1)

The trigram probabilities can be substituted in the above product to give an estimate of the probability of the word sequence W. This overall probability is not actually calculated by a speech recogniser, which only uses the trigram probabilities of the sequences explored by the dynamic search algorithm.

 P(W) = [C(w1)/C( )] × [C(w1,w2)/C(w1)] × [C(w1,w2,w3)/C(w1,w2)] × [C(w2,w3,w4)/C(w2,w3)] × ... × [C(wn-2,wn-1,wn)/C(wn-2,wn-1)]
      = [C(w1,w2,w3)/C( )] × [C(w2,w3,w4)/C(w2,w3)] × ... × [C(wn-2,wn-1,wn)/C(wn-2,wn-1)]

The resulting formula does, however, help resolve a paradox. The probability of W had appeared to depend upon a conditional chain which followed the temporal sequence in which the words were uttered, whereas this direction of dependency was not inherent in the model. A trigram count such as C(w1,w2,w3), for example, can represent the frequency of w1 occurring before (w2,w3) as well as the frequency of w3 occurring after (w1,w2). After cancellation of the first two "start-up" terms of the product, the remaining terms are symmetrical with respect to a forward or backward direction of dependency. C( ) represents the count of all word tokens in the training set.
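A minimal sketch of maximum likelihood trigram training by relative frequency, using invented toy sentences (a real recogniser would train on a large corpus and handle sentence boundaries more carefully):

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Maximum likelihood trigram estimates from raw counts."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1      # trigram counts C(u,v,w)
        for i in range(len(words) - 1):
            bi[tuple(words[i:i + 2])] += 1       # bigram counts C(u,v)

    def p(w, u, v):
        """P(w | u, v) = C(u, v, w) / C(u, v)."""
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return p

# Toy training data (invented).
p = train_trigram_mle([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(p("sat", "the", "cat"))  # C(the,cat,sat) / C(the,cat) = 1/2
```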

## Smoothing

For large vocabularies, the number of possible trigrams is very large and many of them are unlikely to occur in the training set. If they are assigned zero probability, however, this precludes recognition of words with corresponding histories. It is therefore necessary for the model to adjust the probability estimates for trigrams which are absent from or under-represented in the training set. This process is called smoothing.

A smoothing algorithm discounts some non-zero counts in order to free some probability mass, which it reallocates to n-grams with zero (or low) counts. There are algorithms with varying degrees of sophistication:-

• Add-one smoothing: add one to all the counts.
• Linear smoothing: use a weighted sum of trigram, bigram, and unigram frequencies.
• Back-off: back off to a lower-order n-gram if the higher-order n-gram has a zero count.
• Witten-Bell discounting: estimate the probability of seeing a new n-gram from the number of n-gram types in the training set.
• Good-Turing discounting: discount according to the number of n-grams occurring with each count 0,1,2... (frequency of frequencies).

The comparative performance of these smoothing algorithms can be evaluated by using the smoothed model probabilities to calculate the cross-entropy of each model, using as test data a sufficiently long sample of the target language. The cross-entropy (and the perplexity) of each model can be no less than the entropy of the language, so the model with the least cross-entropy is the best. Any of the algorithms could obviously be optimal if the smoothed probabilities it predicted happened to match the actual distribution of the language sample. In practice, the best smoothing algorithm will be the one with the best underlying assumptions about the distribution of probability distributions across all possible languages.

Various smoothing algorithms, and their underlying assumptions, will now be considered in turn.

### Interpolation

Each count is incremented by one before calculating probabilities by relative frequency. This deceptively simple algorithm seems rather arbitrary, and it is sometimes proposed that a different constant, for example 0.5 or 0.001, is added to each count. The value of one, however, was shown by Laplace to give the correct Bayesian estimate on the prior assumption that each possible combination of frequency counts is equally likely. This is known as Laplace's Law of Succession.

In the unigram case, with a vocabulary of V word types, and count C(wi) for each word type in a sample of N word tokens, the smoothed probabilities are given by

 PLA(wi) = (C(wi) + 1) / (N + V)

Generalising this to n-grams gives

 PLA(wi|wi-n+1,...,wi-1) = (C(wi-n+1,...,wi) + 1) / (C(wi-n+1,...,wi-1) + V)

For unigrams the conditioning history is empty, so the denominator count reduces to C( ) = N.

In a typical corpus, V will be quite large (~50,000 word types), and will dominate low bigram or unigram counts in the denominator, with drastic effects on the corresponding relative frequencies. Add-one smoothing is not very effective in this case.

Addition of a constant less than one is known as Lidstone's Law of Succession:-

 PLI(wi|wi-n+1,...,wi-1) = (C(wi-n+1,...,wi) + δ) / (C(wi-n+1,...,wi-1) + Vδ)

This can be revealed as a linear interpolation of the maximum likelihood estimate and the uniform prior 1/V, by substituting

 λ = C(wi-n+1,...,wi-1) / (C(wi-n+1,...,wi-1) + Vδ)

giving

 PLI(wi|wi-n+1,...,wi-1) = λ [C(wi-n+1,...,wi) / C(wi-n+1,...,wi-1)] + (1 - λ)(1/V) = λ P(wi|wi-n+1,...,wi-1) + (1 - λ)(1/V)
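The equivalence can be checked numerically; the counts in this sketch are invented for the illustration:

```python
def lidstone(count, context_count, V, delta):
    """Lidstone estimate: (C(h,w) + delta) / (C(h) + V*delta)."""
    return (count + delta) / (context_count + V * delta)

def interpolated(count, context_count, V, delta):
    """The same estimate as a linear interpolation of the MLE with 1/V."""
    lam = context_count / (context_count + V * delta)
    mle = count / context_count
    return lam * mle + (1 - lam) / V

# Invented numbers: C(h,w) = 3, C(h) = 50, V = 10,000, delta = 0.5.
a = lidstone(3, 50, 10_000, 0.5)
b = interpolated(3, 50, 10_000, 0.5)
print(abs(a - b) < 1e-12)  # the two formulations agree
```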

#### Linear Smoothing

In this method, due to Jelinek and Mercer, the probability is estimated from a linear combination of trigram, bigram, and unigram relative frequencies:-

 PJM(wi|wi-2,wi-1) = λ3 P(wi|wi-2,wi-1) + λ2 P(wi|wi-1) + λ1 P(wi)

This formulation only avoids zero probabilities if the vocabulary is confined to words in the training set, or to those words plus a catch-all unknown word. To allow for a larger vocabulary, the interpolation should include the uniform distribution as a zero-order model:-

 PJM(wi|wi-2,wi-1) = λ3 P(wi|wi-2,wi-1) + λ2 P(wi|wi-1) + λ1 P(wi) + λ0 (1/V)

It is clear that Laplace's and Lidstone's Laws of Succession are just special cases of linear interpolation with the intermediate weights set to zero.

The weights are estimated using held-out data which is different from the data used to calculate the n-gram frequencies. The Baum-Welch algorithm can be used to estimate the values of the weights which maximise the probability of the held-out data. Although linear smoothing performs quite well with a single weight for each order of n-gram, it is usually implemented with multiple weights. In the general case, the algorithm can be expressed recursively:-

 PJM(wi|wi-n+1,...,wi-1) = λh P(wi|wi-n+1,...,wi-1) + (1 - λh) PJM(wi|wi-n+2,...,wi-1), where h = wi-n+1,...,wi-1

It is not really feasible, however, to estimate separate weights for each n-gram, so all the n-grams with the same count C(wi-n+1,...,wi-1) are given the same weight. In practice the n-grams are partitioned into a few hundred buckets, each of which contains n-grams with counts in a certain range, and all the n-grams in each bucket are allocated the same weight.
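A sketch of the recursive formulation with a single illustrative weight per order; a real implementation would bucket the weights by history count and train them with Baum-Welch on held-out data, so both the weight and the counts below are invented:

```python
def jm_prob(word, history, counts, V, lam=0.8):
    """Recursive Jelinek-Mercer estimate with one illustrative weight per order.

    `counts` maps n-gram tuples of all orders to frequencies; the recursion
    bottoms out in the uniform distribution 1/V.
    """
    if not history:
        n_tokens = sum(c for k, c in counts.items() if len(k) == 1)
        mle = counts.get((word,), 0) / n_tokens if n_tokens else 0.0
        return lam * mle + (1 - lam) / V
    c_h = counts.get(tuple(history), 0)
    mle = counts.get(tuple(history) + (word,), 0) / c_h if c_h else 0.0
    return lam * mle + (1 - lam) * jm_prob(word, history[1:], counts, V, lam)

# Toy counts for all orders in one dict (invented).
counts = {("the",): 2, ("cat",): 2, ("sat",): 1,
          ("the", "cat"): 2, ("cat", "sat"): 1,
          ("the", "cat", "sat"): 1}
print(jm_prob("sat", ("the", "cat"), counts, V=10_000))  # ≈ 0.4864
```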

For some obscure reason, this method is often referred to as deleted interpolation. When implementing Jelinek-Mercer smoothing, however, Chen and Goodman used two different methods of training the weights: one using held-out data, and another involving actual deletion of words from the model. It would seem that only the latter method should be called deleted interpolation.

### Discounting

In a set of n-gram counts, there will usually be several unseen n-grams with a zero count, several singletons with a count of one, and so on. Let Nc be the frequency of count value c (c = 0,1,2,...). Smoothing will modify each instance of count c by the same amount to give c*. It is usual to normalise the counts so that the total count remains the same, that is

 Σ_c cNc = Σ_c c*Nc = N, with Σ_c Nc = V

This allows the smoothed n-gram probabilities to be obtained simply by substituting the modified counts for the original counts in the maximum likelihood estimates. For add-one smoothing, normalisation of the smoothed counts gives

 c* = (c + 1) N / (N + V)

After normalisation of add-one smoothing, some counts (c < N/V) will increase and some (c > N/V) will decrease. For bigram and trigram counts, where the number of possible n-grams V becomes the square or cube of the vocabulary size, N << V for most training sets, and all counts except c = 0 will decrease. It is natural, therefore, to regard smoothing as a discounting of the non-zero counts for redistribution to the zero counts. The discount coefficient dc is the factor which reduces the original count to the modified count:

 c* = c dc

Although any method of smoothing can be regarded as discounting, the term is usually reserved for methods which are conceived directly in terms of discounts. These methods are not used by themselves, but are combined with back-off methods, as will be described below.

#### Linear Discounting

Each count is reduced by subtracting a fraction of the count:

dc = 1 - α

The amount redistributed to zero counts by this method is αN. To assign the same probability to n-grams with zero counts as Good-Turing discounting (described below), α can be set to give

 dc = 1 - N1/N

#### Absolute Discounting

Each count is reduced by subtracting a constant b:

 dc = (c - b)/c

The amount redistributed to zero counts by this method is b(V - N0).
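A quick numerical check of this redistributed mass, using invented unigram counts:

```python
# Toy counts over a vocabulary of V = 6 word types (invented).
counts = {"a": 3, "b": 2, "c": 1, "d": 1, "e": 0, "f": 0}
b = 0.5                        # the absolute discount
V = len(counts)
N0 = sum(1 for c in counts.values() if c == 0)   # number of zero counts

# Subtract b from every non-zero count.
discounted = {w: (c - b if c > 0 else 0.0) for w, c in counts.items()}

freed = sum(counts.values()) - sum(discounted.values())
print(freed == b * (V - N0))  # the freed mass is b(V - N0)
```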

#### Good-Turing Discounting

For each count value c, which occurs with frequency Nc, the total of the n-gram counts with this value is cNc. Good-Turing discounting redistributes the total count for singletons to the zero counts, redistributes the total count for doubletons to singletons, and so on for increasing count values. This is done by calculating the smoothed counts according to

 c* = (c + 1) N_{c+1} / N_c

This redistribution is clearly shown in the table below, where the entire column of values labelled cNc is shifted up by one row in the column labelled c*Nc. This leaves the total n-gram count N the same, as shown at the foot of these columns, so the smoothed counts need no further normalisation. The table also reveals an anomaly that occurs at the bottom of the table: the most frequent count (c = 10) is smoothed to zero. The data in this table has been arbitrarily truncated to give a more compact example. There are presumably a few hundred more counts in the actual (unpublished) data before the frequency finally falls to zero. But the problem exemplified in this table will always occur. In practice, as the frequencies become very low there may be more than one count with zero frequency before the maximum count value is reached.

Implementations of Good-Turing discounting avoid the zero frequency problem by only discounting the counts up to a certain threshold (usually c = 5). This is also justifiable on the basis that the larger counts are more reliable. There are two approaches to maintaining normalisation of the smoothed counts when Good-Turing discounting is confined to the range up to and including the threshold value k. Both methods maintain the assignment to the zero counts of a total count equal to the singleton frequency N1 (which results from c*Nc = N1 for c = 0).

Jelinek leaves the smoothed counts alone and normalises the total count back to N by discounting the counts above the threshold by α, where

 α = Σ_{c>k+1} cNc / Σ_{c>k} cNc, and cJ = αc for c > k

This method is shown in columns cJ and cJNc of the table, which only differ from columns c* and c*Nc below the line representing the threshold.

Katz leaves the unsmoothed counts alone, and normalises the total count back to N by discounting the relative Good-Turing discounts up to and including the threshold by μ, where

 μ = N1 / (N1 - (k + 1)N_{k+1}), and the Good-Turing discount coefficient is dc* = c*/c.

Note that the discount coefficient μ is applied to the relative discount 1 - dc*, which is a discount in the more usual sense of 10% off, for example, not a discount coefficient as used elsewhere:-

 1 - dcK = μ(1 - dc*), where dcK is the Katz discount coefficient giving cK = c dcK, for 0 < c ≤ k.

This method is shown in columns cK and cKNc of the table, which only differ from columns c and cNc above the line representing the threshold. The table also shows the Good-Turing and Katz relative discounts in the rightmost columns.

#### Discounting Example

The effects of the various implementations of Good-Turing discounting are shown in the following table using bigram data computed by Church and Gale from 22 million words collected from the Associated Press newswire. This is the same example used to illustrate Good-Turing discounting by Jelinek and by Jurafsky and Martin.

| Nc | c | cNc | c* | c*Nc | cJ | cJNc | cK | cKNc | 1 - dc* | 1 - dcK |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 74,671,100,000 | 0 | 0 | 0.000027 | 2,018,046 | 0.000027 | 2,018,046 | 0.000027 | 2,018,046 | | |
| 2,018,046 | 1 | 2,018,046 | 0.446 | 899,442 | 0.446 | 899,442 | 0.35 | 712,368 | 55.4% | 64.7% |
| 449,721 | 2 | 899,442 | 1.26 | 566,799 | 1.26 | 566,799 | 1.14 | 511,168 | 37.0% | 43.2% |
| 188,933 | 3 | 566,799 | 2.24 | 422,672 | 2.24 | 422,672 | 2.11 | 398,568 | 25.4% | 29.7% |
| 105,668 | 4 | 422,672 | 3.24 | 341,895 | 3.24 | 341,895 | 3.11 | 328,386 | 19.1% | 22.3% |
| 68,379 | 5 | 341,895 | 4.23 | 289,140 | 4.23 | 289,140 | 4.10 | 280,317 | 15.4% | 18.0% |
| 48,190 | 6 | 289,140 | 5.19 | 249,963 | 4.48 | 216,132 | 6 | 289,140 | 13.5% | |
| 35,709 | 7 | 249,963 | 6.21 | 221,680 | 5.23 | 186,848 | 7 | 249,963 | 11.3% | |
| 27,710 | 8 | 221,680 | 7.24 | 200,520 | 5.98 | 165,706 | 8 | 221,680 | 9.5% | |
| 22,280 | 9 | 200,520 | 8.25 | 183,810 | 6.73 | 149,889 | 9 | 200,520 | 8.3% | |
| 18,381 | 10 | 183,810 | 0 | 0 | 7.48 | 137,398 | 10 | 183,810 | | |
| 0 | 11 | 0 | | | | | | | | |
| Totals | | 5,393,967 | | 5,393,967 | α = 0.747 | 5,393,967 | μ = 1.17 | 5,393,967 | | |

The line representing the threshold k = 5 falls between the rows for c = 5 and c = 6.

The count frequencies have been arbitrarily set to zero after c = 10, causing c* to fall to zero for c = 10.
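The derived columns of the table can be recomputed from the Nc column alone, using the formulas above. The following sketch reproduces the Good-Turing counts and the Jelinek and Katz normalisers:

```python
# Count-of-counts Nc taken from the table's first column.
N = {1: 2018046, 2: 449721, 3: 188933, 4: 105668, 5: 68379,
     6: 48190, 7: 35709, 8: 27710, 9: 22280, 10: 18381}
k = 5  # discounting threshold

def c_star(c):
    """Good-Turing smoothed count c* = (c + 1) N[c+1] / N[c]."""
    return (c + 1) * N.get(c + 1, 0) / N[c]

# Jelinek's normaliser for the counts above the threshold.
alpha = sum(c * N[c] for c in N if c > k + 1) / sum(c * N[c] for c in N if c > k)

# Katz's normaliser for the relative discounts up to the threshold.
mu = N[1] / (N[1] - (k + 1) * N[k + 1])

print(round(c_star(1), 3))        # 0.446, as in the c* column
print(f"{alpha:.4f}, {mu:.4f}")   # ≈ 0.747 and 1.17 in the table
```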

### Backoff

Like interpolation models, backoff models combine counts from a hierarchy of distributions, for example trigram counts, bigram counts, and unigram counts. They differ from interpolation models by only using lower order distributions for n-grams with zero counts. A trigram backoff model can naively be represented by

 PB(wi|wi-2,wi-1) =
   αP(wi|wi-2,wi-1)   if P(wi|wi-2,wi-1) > 0
   βP(wi|wi-1)        if P(wi|wi-2,wi-1) = 0 and P(wi|wi-1) > 0
   γP(wi)             if P(wi|wi-2,wi-1) = 0 and P(wi|wi-1) = 0

As before, P() represents a maximum likelihood estimate of conditional probability obtained from relative frequency. The coefficient α is needed to discount the non-zero trigram model probabilities, which sum to one when wi ranges over all word types. The coefficients β and γ normalise the probabilities obtained from the bigram and unigram models to ensure that the combined probabilities sum to one.

Backoff implementations invariably use discounting, so the coefficient α is not needed, as the probabilities for the non-zero counts have already been discounted in favour of the zero counts, and no longer sum to one. In the canonical back-off method, due to Katz, only the counts below the threshold will have been discounted: above the threshold the maximum likelihood estimates are used directly.

The other coefficients β and γ are not constant, and depend on the context of wi, which is wi-2,wi-1 for trigrams and wi-1 for bigrams. In a trigram model, for example, different coefficients β(up,and) and β(down,and) would be needed, reflecting the different subsets of the bigram probabilities P(wi|and) to which backoff would occur for these two different contexts.

Incorporating these observations about the coefficients, and using a generalised recursive formulation, the Katz model for backoff with discounting is

 PKATZ(wi|wi-n+1,...,wi-1) =
   PK(wi|wi-n+1,...,wi-1)                         if C(wi-n+1,...,wi) > 0
   α(wi-n+1,...,wi-1) PKATZ(wi|wi-n+2,...,wi-1)   if C(wi-n+1,...,wi) = 0

PKATZ() is a probability calculated using counts cK discounted by the Katz algorithm described above.

When using this method, all of the trigram, bigram and unigram counts are first smoothed using frequencies obtained from the training set. Conditional probabilities are then calculated using subsets of these smoothed counts.
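The backoff-weight calculation can be sketched for the bigram case. All counts below are invented, and a flat absolute discount of 0.25 stands in for the Katz discounted counts cK; the point being illustrated is that the backoff weight makes each conditional distribution sum to one:

```python
def backoff_factory(bigrams, unigrams, discounted):
    """Sketch of a bigram backoff model with pre-discounted counts.

    `bigrams`/`unigrams` are raw counts; `discounted` maps each seen
    bigram to its discounted count. The weight alpha(h) spreads the
    freed mass over words unseen after history h.
    """
    N = sum(unigrams.values())

    def p_unigram(w):
        return unigrams.get(w, 0) / N

    def alpha(h):
        left = 1.0 - sum(discounted[(h2, w)] for (h2, w) in bigrams if h2 == h) / unigrams[h]
        norm = 1.0 - sum(p_unigram(w) for (h2, w) in bigrams if h2 == h)
        return left / norm if norm else 0.0

    def p(w, h):
        if (h, w) in bigrams:
            return discounted[(h, w)] / unigrams[h]
        return alpha(h) * p_unigram(w)

    return p

# Invented toy counts; flat 0.25 discount stands in for Katz's cK.
unigrams = {"the": 3, "cat": 2, "sat": 1}
bigrams = {("the", "cat"): 2, ("cat", "sat"): 1}
discounted = {bg: c - 0.25 for bg, c in bigrams.items()}
p = backoff_factory(bigrams, unigrams, discounted)
total = p("the", "the") + p("cat", "the") + p("sat", "the")
print(round(total, 10))  # probabilities after "the" sum to 1
```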

### Kneser-Ney Smoothing

This is a form of absolute discounting which can be used either in a backoff model, as originally described by Kneser and Ney, or in an interpolation model, as implemented by Chen and Goodman. The special feature is the conditioning of lower-order models by context. Some unigrams, such as "Francisco", will only occur in a limited number of contexts such as "San Francisco". If these contexts are common in the training set, the unigram count will be correspondingly high, even though the word is unlikely to occur in any other context. Kneser-Ney smoothing bases the backoff distribution on the number of contexts in which a word occurs, instead of the number of occurrences of the word. For a bigram model, the formula is:

 PKN(wi|wi-1) = (C(wi-1,wi) - D) / C(wi-1) + λ(wi-1) |{v : C(v,wi) > 0}| / Σ_w |{v : C(v,w) > 0}|

where D is the absolute discount, λ(wi-1) is to normalise the probabilities, and |{v|C(vwi) > 0}| is the number of words v that can be the context of wi.
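A bigram sketch of the interpolated form, with invented counts chosen so that "Francisco" has a high count but occurs in only one context:

```python
from collections import Counter

def kneser_ney_factory(bigram_counts, D=0.75):
    """Interpolated bigram Kneser-Ney sketch (D is the absolute discount).

    The lower-order distribution is the continuation probability: the
    number of distinct left-contexts of w, normalised over all words.
    """
    c_bi = Counter(bigram_counts)
    c_hist = Counter()            # C(v) summed over bigrams starting with v
    contexts = Counter()          # |{v : C(v, w) > 0}| for each w
    for (v, w), c in c_bi.items():
        c_hist[v] += c
        contexts[w] += 1
    total_contexts = sum(contexts.values())

    def p(w, v):
        cont = contexts[w] / total_contexts
        if c_hist[v] == 0:
            return cont
        lam = D * len([1 for (v2, _) in c_bi if v2 == v]) / c_hist[v]
        return max(c_bi[(v, w)] - D, 0) / c_hist[v] + lam * cont

    return p

# Invented data: "Francisco" is frequent but only ever follows "San",
# while "cat" appears in two distinct contexts.
p = kneser_ney_factory({("San", "Francisco"): 4, ("the", "cat"): 2, ("a", "cat"): 1})
print(p("Francisco", "San") + p("cat", "San"))  # sums to 1 over the seen words
```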

### Other Smoothing Methods

Some other methods of smoothing n-gram counts are described by Chen and Goodman, including Witten-Bell smoothing, which was developed for text compression, and Church-Gale smoothing, which combines Good-Turing smoothing with bucketing. Kneser and Ney observed that most smoothing algorithms, including all of the interpolation and discounting algorithms mentioned above, can be described by a unified recursive formula

 Psmooth(wi|wi-n+1,...,wi-1) =
   α(wi|wi-n+1,...,wi-1)                            if C(wi-n+1,...,wi) > 0
   γ(wi-n+1,...,wi-1) Psmooth(wi|wi-n+2,...,wi-1)   if C(wi-n+1,...,wi) = 0

The appropriate functions α() and γ() for each algorithm are tabulated by Chen and Goodman. They show that it is quite easy to implement an interpolation version of a backoff algorithm, and vice versa.

## Evaluation of Language Models

### Entropy

The entropy of a language is the average amount of information conveyed by all sequences of words in the language:

 H(L) = lim n→∞ -(1/n) Σ p(w1,...,wn) log2 p(w1,...,wn) bits per word

where the summation is over all sequences of words of length n.

If the language is stationary and ergodic (which is assumed to be the case), the entropy can be estimated by taking the average logprob of a single sufficiently long sequence of words in the language:

 H(p) = lim n→∞ -(1/n) Σ_{i=1}^{n} log2 p(wi|hi) bits per word

For a given language model, the probabilities m(wi|hi) predicted by the model will differ from the true probabilities, which are unknown. When the entropy of a word sequence is calculated with the model probabilities instead of the true probabilities, it is called the cross-entropy of the model:

 H(p,m) = lim n→∞ -(1/n) Σ_{i=1}^{n} log2 m(wi|hi) bits per word

The cross-entropy is an upper bound on the true entropy, and the model with the lowest cross-entropy is closest to the entropy of the language, and therefore the best model.

In language modelling, the usual metric is perplexity, defined as 2H(p,m). The perplexity indicates the average number of equiprobable choices for the next word in a sequence. When comparing language models, perplexity is calculated using the same test set, which must be distinct from the training set used to construct the model. For the comparison to be valid, each model should have the same vocabulary.
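Both quantities are straightforward to compute from model probabilities. As a sanity check in the sketch below, a uniform model over V words has perplexity exactly V:

```python
import math

def cross_entropy(model_prob, test_words):
    """Per-word cross-entropy: -(1/n) * sum of log2 m(wi | hi)."""
    n = len(test_words)
    logprob = sum(math.log2(model_prob(w, tuple(test_words[:i])))
                  for i, w in enumerate(test_words))
    return -logprob / n

def perplexity(model_prob, test_words):
    """Perplexity = 2 ** cross-entropy: average number of equiprobable choices."""
    return 2 ** cross_entropy(model_prob, test_words)

# A uniform model over a vocabulary of V words has perplexity V.
V = 10_000
uniform = lambda w, h: 1 / V
print(perplexity(uniform, ["any", "test", "words"]))  # ≈ 10000
```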

### Toolkits

Several toolkits have been developed which can be used to build language models from text files provided by the user. They perform the mechanics of counting n-grams, discounting, calculating weights, and optimising parameters. They support many of the smoothing algorithms described above. To evaluate the models, they calculate the perplexity on a test set. The most widely used toolkits are the CMU-Cambridge Statistical Language Modeling Toolkit and the SRI Language Modeling Toolkit. There is a good description of the CMU-Cambridge toolkit by Clarkson and Rosenfeld.

### Results

Chen and Goodman performed an extremely thorough evaluation of language models, including the smoothing algorithms described above and their variants. They carefully implemented each algorithm, resolving and describing many points of detail which were not always clear from the published descriptions of their authors. They evaluated each smoothing algorithm using data from four corpora:

| Corpus | Size | Vocabulary | Content |
|---|---|---|---|
| Brown | 1 million words | 53,850 word types | |
| North American Business News | 243 million words | 20,000 word types | newspaper text |
| Switchboard | 3 million words | 9,800 word types | telephone conversation transcriptions |
| Broadcast News | 130 million words | 50,000 word types | TV and radio news transcriptions |

The best smoothing algorithm was Kneser-Ney smoothing, the authors' variant of which consistently outperformed the other algorithms. Next best were the algorithms of Katz and Jelinek-Mercer. Add-one smoothing and additive smoothing performed poorly, only approaching the baseline version of Jelinek-Mercer for very large training sets of 10 million sentences.

## Beyond N-grams

N-gram models perform very well, but many researchers, especially linguists, believe there is room for improvement. Frederick Jelinek was famously quoted as saying

“Anytime a linguist leaves the [IBM speech] group, the recognition rate goes up.”

An n-gram model cannot capture many significant features of speech, such as:-

• remote references
• constraints of syntax
• meaning

In a research proposal, for example, researchers at UCL Department of Phonetics and Linguistics disparage N-gram smoothing and continue:

“...our greatest criticism is the treatment of words as empty symbols, devoid of linguistic function. Words form classes of meaning and function which also affect their collocational probabilities”

Improvements which have been proposed include:-

### Class Models

Predict the probability of a word based on a mapping of word histories to equivalence classes. Although N-grams are examples of equivalence classes, class models are usually understood to involve other types of equivalence classes such as parts of speech, or states of a language parser.

### Dynamic Models

Build a dynamic model by caching recently observed words to reflect the current topic of the discourse.

### Mixture Models

Combine models by linear combination of estimates trained on different clusters of texts.

### Discourse Models

Use knowledge of the current topic or the current speech act.

### Refinements of the N-gram model

• Use n-grams with n > 3, or variable-length n-grams.
• Use long-distance triggers to supplement local n-grams.
• Count collocations like "Sydney Morning Herald" as single items.

## References

The basic concepts of language modelling, in the context of speech recognition, are well explained in text books such as Jelinek (chapter 4) and Jurafsky and Martin (chapter 6). There is also a good survey on the web by Roukos in the NSF Survey of the State of the Art in Human Language Technology.

Joshua Goodman's home page has a link to his PowerPoint slides, which give an up-to-date presentation of "The State of the Art of Language Modeling" in April, 2000.

The definitive reference, however, is Chen and Goodman's excellent paper, which contains a comprehensive tutorial on smoothing algorithms, and presents the results of their extensive testing of different algorithms and models. (Be sure to refer to the 1998 version of this paper, which is considerably expanded and updated, compared to the 1996 version.)

### Books

F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, USA, 1998.

Daniel Jurafsky and James H. Martin, SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, New Jersey, 2000.

### Papers

S. F. Chen, J. T. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", Computer Speech and Language, vol. 13, pp. 359-394, October 1999.

J. T. Goodman, "Putting it all Together: Language Model Combination", ICASSP-2000, Istanbul, June 2000.

S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recogniser", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, no. 3, pp. 400-401, March 1987.

R. Kneser and H. Ney, "Improved Backing-off for m-gram Language Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 181-184, 1995.

P. R. Clarkson and R. Rosenfeld, "Statistical Language Modeling Using the CMU-Cambridge Toolkit", Proceedings ESCA Eurospeech, 1997.

E. S. Ristad, "A Natural Law of Succession", Research Report CS-TR-495-95, Princeton University, 1995.

### People

The following researchers in the Language modelling field have home pages: