Models & Optimisation and Mathematical Analysis Journal
Volume 6, Number 1, Pages 10-14

Textual Data Selection Based On Mean Square Difference Probability For Language Modeling

Authors: Mezzoudj Fréha, Benyettou Abdelkader


The language model (LM) is an important module in many applications that produce natural language text, such as automatic speech recognition and machine translation systems. Generally, the amount of data suitable for training language models dedicated to a specific target task is limited. Because this kind of textual data is costly to produce, textual data selected from other domains can be useful. This paper investigates the Mean Square Difference Probability (MSDP) criterion between two models, representing in-domain and out-of-domain data respectively, for textual data selection. The technique is analyzed and tested on French broadcast news and TV show transcription data. Results show that data selection based on Mean Square Difference Probability is competitive with other state-of-the-art criteria such as Difference Cross-Entropy (dXent) data selection.


Keywords: cross-entropy; data selection; mean square difference probability; n-gram language model; perplexity; textual corpus
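
The abstract does not give the MSDP formula, so it is not reproduced here. As a reference point, the baseline it compares against, cross-entropy difference (dXent) selection, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes add-one smoothed unigram models in place of the n-gram models the paper refers to, and the names train_unigram, dxent_score and select are hypothetical.

```python
import math
from collections import Counter

UNK = "<unk>"  # assumed placeholder token for out-of-vocabulary words

def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (word -> probability)."""
    counts = Counter(w if w in vocab else UNK for s in sentences for w in s.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def dxent_score(p_in, p_out, sentence):
    """Per-word cross-entropy difference H_in(s) - H_out(s).

    Lower scores mean the sentence looks more like in-domain data.
    """
    words = sentence.split()
    if not words:
        return math.inf  # empty sentences are ranked last
    diff = 0.0
    for w in words:
        a = p_in.get(w, p_in[UNK])
        b = p_out.get(w, p_out[UNK])
        diff += math.log(b) - math.log(a)
    return diff / len(words)

def select(in_domain, out_domain, candidates, k):
    """Keep the k candidate sentences with the lowest dXent score."""
    vocab = {UNK} | {w for s in in_domain + out_domain for w in s.split()}
    p_in = train_unigram(in_domain, vocab)
    p_out = train_unigram(out_domain, vocab)
    ranked = sorted(candidates, key=lambda s: dxent_score(p_in, p_out, s))
    return ranked[:k]

# Toy data, invented for illustration only.
in_domain = ["the minister spoke on the news", "news of the election tonight"]
out_domain = [
    "buy cheap shoes online",
    "the election news tonight",
    "cheap flights and hotels online",
]
selected = select(in_domain, out_domain, out_domain, k=1)
```

On this toy data the news-like sentence is the one selected from the out-of-domain pool. A different criterion, such as MSDP, would replace only the scoring function, leaving the ranking and cut-off steps unchanged.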