Hungarian word sense disambiguated corpus

Introduction

To create the first Hungarian word sense disambiguation (WSD) corpus, 39 word forms were selected. These word forms are frequent in Hungarian, and each has more than one well-defined sense. The texts were selected from the Hungarian National Corpus and its Heti Világgazdaság (HVG) subcorpus. The result is a fine-grained lexical sample corpus. It follows the format used for corpora prepared for the WSD tasks of the SensEval/SemEval international workshops organized by the Association for Computational Linguistics.
First, the set of senses to be distinguished was defined for each word form, and each sense was given a short description (definition). Following international standards, annotation was carried out by two independent linguists. Finally, a third, independent annotator checked the cases where the two annotations differed and finalized the tags of these samples.
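The annotation procedure described above can be illustrated with a minimal sketch. It assumes each annotator's work is available as a mapping from instance id to sense label. The function names, instance ids, and sense labels are hypothetical and do not come from the corpus or its tooling.

```python
from typing import Dict, List, Tuple


def split_by_agreement(
    first: Dict[str, str], second: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate instances the two annotators agree on from disputed ones."""
    agreed: Dict[str, str] = {}
    disputed: List[str] = []
    for instance_id in sorted(first):
        if first[instance_id] == second.get(instance_id):
            agreed[instance_id] = first[instance_id]
        else:
            disputed.append(instance_id)
    return agreed, disputed


def finalize(
    agreed: Dict[str, str], disputed: List[str], adjudicator: Dict[str, str]
) -> Dict[str, str]:
    """Keep agreed tags and take the third annotator's tag for disputed ones."""
    final = dict(agreed)
    for instance_id in disputed:
        final[instance_id] = adjudicator[instance_id]
    return final


# Hypothetical annotations for three instances of one word form.
annotator_a = {"w.1": "sense_1", "w.2": "sense_2", "w.3": "sense_1"}
annotator_b = {"w.1": "sense_1", "w.2": "sense_1", "w.3": "sense_1"}
annotator_c = {"w.2": "sense_2"}  # the third annotator only sees disputed cases

agreed, disputed = split_by_agreement(annotator_a, annotator_b)
final_tags = finalize(agreed, disputed, annotator_c)
observed_agreement = len(agreed) / len(annotator_a)
```

Here only the disputed instance is passed to the third annotator, and `observed_agreement` is the plain proportion of instances on which the first two annotators chose the same sense.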

Reference

Vincze, Veronika, Szarvas, György, Almási, Attila, Szauter, Dóra, Ormándi, Róbert, Farkas, Richárd, Hatvani, Csaba, Csirik, János: Hungarian Word-sense Disambiguated Corpus. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.

For further information, please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).

Downloads

The corpus is available in the SemEval XML format.
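As a rough guide to working with the download, the sketch below reads a lexical sample file with Python's standard library. It assumes the layout commonly used in SensEval lexical sample data (`lexelt`, `instance`, `answer`, `context`, and `head` elements); the actual element and attribute names in this corpus may differ and should be checked against the files. The sample sentences, ids, and sense labels are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample in an assumed SensEval-style layout (not real corpus data).
SAMPLE = """<corpus lang="hu">
  <lexelt item="nap">
    <instance id="nap.1">
      <answer instance="nap.1" senseid="sun"/>
      <context>Ma süt a <head>nap</head> .</context>
    </instance>
    <instance id="nap.2">
      <answer instance="nap.2" senseid="day"/>
      <context>Három <head>nap</head> múlva indulunk .</context>
    </instance>
  </lexelt>
</corpus>"""


def read_instances(root):
    """Collect one record per annotated instance."""
    rows = []
    for lexelt in root.iter("lexelt"):
        for instance in lexelt.iter("instance"):
            answer = instance.find("answer")
            context = instance.find("context")
            head = context.find("head") if context is not None else None
            # Join all text inside <context>, normalizing whitespace.
            text = ""
            if context is not None:
                text = " ".join("".join(context.itertext()).split())
            rows.append({
                "item": lexelt.get("item"),
                "id": instance.get("id"),
                "sense": answer.get("senseid") if answer is not None else None,
                "head": head.text if head is not None else None,
                "text": text,
            })
    return rows


instances = read_instances(ET.fromstring(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`, where `path` is wherever the downloaded file was saved.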