Language model
A statistical '''language model''' assigns a [[probability]] to a sequence of ''m'' words by means of a [[probability distribution]].
Language modeling is used in many [[natural language processing]] applications such as [[speech recognition]], [[machine translation]], [[part-of-speech tagging]], [[parsing]] and [[information retrieval]].
In [[speech recognition]] and in [[data compression]], such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.
When used in information retrieval, a separate language model ''M<sub>d</sub>'' is associated with each [[document]] ''d'' in a collection. Given a query ''Q'' as input, retrieved documents are ranked by the probability ''P''(''Q''|''M<sub>d</sub>'') that the document's language model would generate the terms of the query.
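The ranking step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each document model is an unsmoothed unigram model and that query terms are generated independently; the function names and the toy documents are hypothetical and not taken from the cited papers.

```python
from collections import Counter


def unigram_model(text):
    """Unsmoothed unigram model: P(term | M_d) = count(term) / document length."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}


def query_likelihood(query, model):
    """P(Q | M_d), assuming the query terms are generated independently."""
    p = 1.0
    for term in query.lower().split():
        # Without smoothing, a term missing from the document gives probability 0.
        p *= model.get(term, 0.0)
    return p


def rank(query, docs):
    """Return document ids sorted by descending P(Q | M_d)."""
    scores = {doc_id: query_likelihood(query, unigram_model(text))
              for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "the red house stands on the hill",
    "d2": "the red car passed the red house",
    "d3": "a blue boat on the lake",
}
ranking = rank("red house", docs)
print(ranking)
```

In this sketch a document that lacks any query term scores zero, which is one reason practical systems smooth the document models.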
Estimating the probability of sequences is difficult because [[phrase]]s or [[Sentence (linguistics)|sentence]]s can be arbitrarily long, so many sequences are never observed in the [[corpora]] used for [[training]] the language model (the [[data sparseness problem]], which is related to [[overfitting]]). For that reason these models are often approximated using smoothed [[N-gram]] models.
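The effect of sparseness, and of one simple remedy, can be sketched as follows. Add-one (Laplace) smoothing is used here only as an example of a smoothing technique; the text does not say which smoothing method is meant, and the helper names and the one-sentence training text are hypothetical.

```python
from collections import Counter

# Hypothetical one-sentence training text.
tokens = "i saw the red house".split()

bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
contexts = Counter(tokens[:-1])             # how often each word has a successor
vocab_size = len(set(tokens))


def mle(prev, word):
    """Unsmoothed estimate: unseen pairs get probability 0."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0


def add_one(prev, word):
    """Add-one smoothing: every pair in the vocabulary gets a nonzero probability."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


print(mle("the", "house"), add_one("the", "house"))
```

The pair "the house" never occurs in the training text, so the unsmoothed estimate is zero, while the smoothed estimate reserves some probability mass for it.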
== N-gram models ==
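In an ''n''-gram model, the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence ''w<sub>1</sub>'',...,''w<sub>m</sub>'' is approximated as
:<math>P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})</math>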
Here, it is assumed that the probability of observing the ''i''th word ''w<sub>i</sub>'' in the context history of the preceding ''i''−1 words can be approximated by the probability of observing it in the shortened context history of the preceding ''n''−1 words (an (''n''−1)th-order [[Markov property]]).
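The conditional probability can be calculated from ''n''-gram frequency counts:
:<math>P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>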
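A minimal sketch of this count-based estimate is given below. The function names and the toy corpus are hypothetical; the corpus is treated as one flat token sequence, and no smoothing is applied.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(word, history, tokens):
    """Estimate P(word | history) as count(history + word) / count(history).

    The model order n is implied by the history length (n = len(history) + 1).
    """
    history = tuple(history)
    if not history:
        raise ValueError("this sketch needs at least one history word (n >= 2)")
    n = len(history) + 1
    joint = ngram_counts(tokens, n)[history + (word,)]
    context = ngram_counts(tokens, n - 1)[history]
    # An unseen history gives no evidence; return 0 rather than divide by zero.
    return joint / context if context else 0.0


# Hypothetical toy corpus.
corpus = "the red house and the red car and the blue house".split()

print(conditional_probability("red", ["the"], corpus))         # bigram estimate
print(conditional_probability("car", ["the", "red"], corpus))  # trigram estimate
```

Recounting the corpus on every call keeps the sketch short; a practical implementation would compute the counts once and reuse them.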
The terms '''bigram''' and '''trigram''' language model denote ''n''-gram language models with ''n''=2 and ''n''=3, respectively.
=== Example ===
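In a bigram (''n''=2) language model, the probability of the sentence ''I saw the red house'' is approximated as
:<math>P(\mathrm{I},\mathrm{saw},\mathrm{the},\mathrm{red},\mathrm{house}) \approx P(\mathrm{I})\, P(\mathrm{saw} \mid \mathrm{I})\, P(\mathrm{the} \mid \mathrm{saw})\, P(\mathrm{red} \mid \mathrm{the})\, P(\mathrm{house} \mid \mathrm{red})</math>
whereas in a trigram (''n''=3) language model, the approximation is
:<math>P(\mathrm{I},\mathrm{saw},\mathrm{the},\mathrm{red},\mathrm{house}) \approx P(\mathrm{I})\, P(\mathrm{saw} \mid \mathrm{I})\, P(\mathrm{the} \mid \mathrm{I},\mathrm{saw})\, P(\mathrm{red} \mid \mathrm{saw},\mathrm{the})\, P(\mathrm{house} \mid \mathrm{the},\mathrm{red})</math>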
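The two approximations can be compared numerically with a small sketch. The three training sentences are hypothetical, the history is simply truncated at the start of the sentence, and no smoothing is applied, so any unseen ''n''-gram gives a probability of zero.

```python
from collections import Counter


def count_ngrams(sentences, max_n):
    """Count all k-grams for k = 1..max_n without crossing sentence boundaries."""
    counts = Counter()
    for sent in sentences:
        for k in range(1, max_n + 1):
            for i in range(len(sent) - k + 1):
                counts[tuple(sent[i:i + k])] += 1
    return counts


def sentence_probability(sentence, sentences, n):
    """Product over i of P(w_i | up to n-1 preceding words), estimated from counts."""
    counts = count_ngrams(sentences, n)
    total_tokens = sum(len(s) for s in sentences)
    p = 1.0
    for i, word in enumerate(sentence):
        # Shortened history: at most n-1 words, fewer at the sentence start.
        history = tuple(sentence[max(0, i - (n - 1)):i])
        joint = counts[history + (word,)]
        context = counts[history] if history else total_tokens
        if joint == 0 or context == 0:
            return 0.0  # unseen n-gram and no smoothing
        p *= joint / context
    return p


# Hypothetical training sentences.
training = [
    "i saw the red house".split(),
    "i saw the blue house".split(),
    "you own the red car".split(),
]
sentence = "i saw the red house".split()

p_bigram = sentence_probability(sentence, training, 2)
p_trigram = sentence_probability(sentence, training, 3)
print(p_bigram, p_trigram)
```

With these sentences the two models disagree on the factor for ''red'': the bigram model conditions only on ''the'', while the trigram model conditions on ''saw the'', which is followed by ''red'' less often in the training data.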
== See also ==
* [[Factored language model]]
== References ==
*{{cite conference | author=J M Ponte and W B Croft | url=http://citeseer.ist.psu.edu/ponte98language.html | title=A Language Modeling Approach to Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1998 | pages=275-281}}
*{{cite conference | author=F Song and W B Croft | url=http://citeseer.ist.psu.edu/song99general.html | title=A General Language Model for Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1999 | pages=279-280}}
[[Category:Statistical natural language processing]]
{{compu-AI-stub}}
[[ca:Model de llenguatge]]
[[zh:語言模型]]