The Hungarian forum corpus for Opinion Mining

Introduction

This corpus is the first database dedicated to opinion mining in Hungarian. The data were gathered from the posts of a forum topic on the Hungarian government portal that dealt with the referendum on dual citizenship.

We downloaded all 1294 forum posts from the three-month period preceding the referendum, and two linguists annotated them independently. The annotators were asked to label each post according to the most likely vote of its author. On this basis we initially defined three categories of comments: irrelevant, supporting and rejecting. However, preliminary results showed that a significant proportion of the posts belonged to another class: their authors stated that they would intentionally cast an invalid vote because they objected to such a question being put to a referendum. We therefore classified the posts into four groups (irrelevant, supporting, rejecting and invalid).

Comments labeled differently by the two annotators were given to a third linguist, who made the final decision on the disputed annotations. The resulting disambiguated dataset consists of 1294 documents from 85 authors.
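The adjudication step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used to build the corpus: the function name adjudicate, the dictionary layout (post ID mapped to label) and the toy post IDs are all hypothetical. Only the four label names come from the description above.

```python
# The four final categories named in the corpus description.
LABELS = ("irrelevant", "supporting", "rejecting", "invalid")


def adjudicate(first, second, third):
    """Merge two annotators' labels into one final label per post.

    first, second: dicts mapping a post ID to each annotator's label.
    third: dict with the third linguist's decision for disputed posts.
    (Hypothetical data layout, chosen only for this sketch.)
    """
    final = {}
    for post_id, label_a in first.items():
        label_b = second[post_id]
        # Keep the label when both annotators agree; otherwise
        # fall back to the third linguist's decision.
        label = label_a if label_a == label_b else third[post_id]
        if label not in LABELS:
            raise ValueError(f"unknown label {label!r} for post {post_id!r}")
        final[post_id] = label
    return final


# Toy example with made-up post IDs; p2 is the only disputed post.
first = {"p1": "supporting", "p2": "rejecting", "p3": "invalid"}
second = {"p1": "supporting", "p2": "irrelevant", "p3": "invalid"}
third = {"p2": "rejecting"}

gold = adjudicate(first, second, third)
print(gold)
```

In this sketch the third linguist only needs to supply labels for the posts on which the first two annotators disagree, which mirrors the procedure described above.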

Reference

Gábor Berend and Richárd Farkas: Opinion Mining in Hungarian based on textual and graphical clues. In: Proceedings of the 4th Intern. Symposium on Data Mining and Intelligent Information Processing, Santander, 2008.

For further information please contact Gábor Berend (berendg AT inf.u-szeged.hu).

Downloads

The corpus in the SemEval XML format.