AL-Lisaniyyat
Volume 22, Number 2, Pages 18-22
2016-05-30

An Extensible Schema For Building Large Weakly-labeled Semantic Corpora

Authors: English S. Matthew

Abstract

In NLP, data drives research, as evidenced by the frequency with which seminal works of database engineering such as the Penn Treebank have been employed as a basis for experimentation. Traditionally, large-scale, expertly annotated corpora have been expensive and time-consuming to produce. This cost drove researchers to adopt automated methods for generating labelled data from available resources such as Freebase, DBpedia, and the "infoboxes" found on Wikipedia pages. These knowledge bases have been, or are in the process of being, subsumed by Wikidata, an initiative to consolidate such disparate data repositories in an organized, machine-readable format. This resource is an important research tool. In this paper, we review our experience using Wikidata to construct a large annotated corpus under distant supervision. We also make our materials, including the code used to generate our annotations, freely available to all interested parties.

Keywords

Wikidata - Semantic Corpora
