حوسبة اللغة العربية (Arabic Language Computing)
Volume 2, Issue 1, Pages 1-17

Automatic POS Tagging of Arabic Words Using the Yamcha Machine Learning Tool

Author: أحمد عبد الغني


Automatic part-of-speech (POS) tagging is the process of assigning the appropriate POS tag to each word in a text according to its context. It is an essential step in most NLP applications, such as automatic summarization, information retrieval, and machine translation. This research presents an Arabic POS tagger based on a statistical approach and machine learning. The machine learning system used is YAMCHA (Yet Another Multipurpose CHunk Annotator), an open-source tool that performs many language processing tasks, such as automatic POS tagging, named entity recognition, and syntactic analysis. YAMCHA uses the Support Vector Machine (SVM) algorithm, which classifies data accurately and efficiently because its decision boundary depends only on a subset of the training data (the support vectors). The tool also allows changing the range and type of linguistic information used in learning (the feature set and window size). The proposed system therefore requires a large POS-annotated corpus for training. The corpus used in this research contains 100,039 words and was divided 70% for training and 30% for testing; the resulting training corpus contains 64,608 words and the testing corpus 35,431 words (about 65% and 35% of the corpus by word count). The tag set used in training and testing comprises 48 morphological tags. The system was trained several times with different ranges of linguistic information, and new texts were then tagged and evaluated to find the configuration that gives the best results. The lowest error rate achieved was 11.4%, obtained when the previous word was considered in training without its POS tag (F:-1..0:0..).
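
To illustrate what the best-performing window means in practice, the following minimal Python sketch builds, for each token, the features "previous word" and "current word" with no previous-tag feature. It does not call YAMCHA; it only mimics a window of positions -1 to 0 over a single word column, which is how the notation in the abstract is read here. The function names, the `<PAD>` boundary symbol, the transliterated example sentence, and its tags are all hypothetical and are not taken from the paper's corpus or its 48-tag set.

```python
from typing import List, Tuple

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def window_features(tokens: List[str], i: int, start: int = -1, end: int = 0) -> List[str]:
    """Return word features for positions i+start .. i+end (inclusive).

    With the defaults this yields the previous word and the current word,
    and deliberately no feature for the previous word's tag.
    """
    feats = []
    for offset in range(start, end + 1):
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else PAD
        feats.append(f"w[{offset}]={word}")
    return feats


def sentence_to_rows(tagged: List[Tuple[str, str]]) -> List[Tuple[List[str], str]]:
    """Turn a tagged sentence into (features, tag) training rows."""
    tokens = [word for word, _ in tagged]
    return [(window_features(tokens, i), tag) for i, (_, tag) in enumerate(tagged)]


# Hypothetical example sentence with illustrative tags.
example = [("kataba", "VERB"), ("al-waladu", "NOUN"), ("al-darsa", "NOUN")]

for feats, tag in sentence_to_rows(example):
    print(feats, "->", tag)
```

Widening the window (for example `start=-2, end=2`) changes only the arguments, which mirrors the way the experiments vary the range of linguistic information between training runs.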

Keywords

YAMCHA; automatic POS tagging; machine learning; training corpus