Hungarian sentiment corpus (HuSent)

Introduction

HuSent is a deeply annotated Hungarian sentiment corpus. It consists of Hungarian opinion texts about different types of products, published on the website [http://divany.hu/]. The corpus contains 154 opinion texts, comprising approximately 17 thousand sentences and 251 thousand tokens.

HuSent was created by Precognox Ltd and the MTA-SZTE Research Group on Artificial Intelligence (RGAI).

Corpus annotation principles

The database was processed according to the following main annotation principles:

Sentiment fragments: First, each whole construction expressing a positive or negative opinion was annotated in the raw texts of the corpus.
Sentiment words: Next, the sentiment words expressing a positive or negative opinion were annotated at the lexeme level.
Targets: The targets of the sentiment words were also annotated. Entities and their aspects were annotated with different tags (Target 1-20), and the same tag was applied consistently to a given target within a given document. Product names that functioned as a title were annotated as topics.
Modifiers: Elements that modify the prior polarity (also called semantic orientation) of the sentiment words were also annotated. These modifiers, also called sentiment shifters, include negators, irreals, and increasing and decreasing intensifiers.
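
The four annotation layers above can be pictured as a small data model. The following is only a minimal sketch and not the corpus's actual file format: the class names, the token-offset convention, and the example sentence are all hypothetical and chosen for illustration. Only the modifier tag names and the Target 1-20 range come from the corpus description.

```python
from dataclasses import dataclass, field
from enum import Enum


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class ModifierType(Enum):
    # Modifier tag names as used in the corpus description.
    NEGATION = "Negation"
    INTENSIFIER_PLUS = "IntensifierPlus"
    INTENSIFIER_MINUS = "IntensifierMinus"
    IRREAL = "Irreal"
    OTHER_SHIFTER = "OtherShifter"


@dataclass
class Span:
    # Hypothetical convention: token offsets, end is exclusive.
    start: int
    end: int


@dataclass
class SentimentWord:
    span: Span
    polarity: Polarity  # prior polarity at the lexeme level


@dataclass
class Target:
    span: Span
    target_id: int  # Target 1-20, reused consistently within one document

    def __post_init__(self):
        if not 1 <= self.target_id <= 20:
            raise ValueError("target_id must be between 1 and 20")


@dataclass
class Modifier:
    span: Span
    kind: ModifierType


@dataclass
class SentimentFragment:
    span: Span
    polarity: Polarity  # polarity of the whole construction
    words: list = field(default_factory=list)
    targets: list = field(default_factory=list)
    modifiers: list = field(default_factory=list)


# Invented example sentence: "Nem jó a kamera" ("The camera is not good").
tokens = ["Nem", "jó", "a", "kamera"]

fragment = SentimentFragment(
    span=Span(0, 4),
    polarity=Polarity.NEGATIVE,
    words=[SentimentWord(Span(1, 2), Polarity.POSITIVE)],
    targets=[Target(Span(3, 4), target_id=1)],
    modifiers=[Modifier(Span(0, 1), ModifierType.NEGATION)],
)


def text_of(span):
    """Return the tokens covered by a span."""
    return tokens[span.start:span.end]
```

In this invented example a positive sentiment word sits inside a negative fragment because a negator shifts its polarity, which is the kind of case the separate fragment-level and word-level annotations are able to capture.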

Basic statistical data of the corpus

tag	total frequency	frequency in positive sentiment fragments	frequency in negative sentiment fragments
PosSentiment	7200	-	-
NegSentiment	8442	-	-
SentiWordPos	8100	6247	1853
SentiWordNeg	8090	1347	6743
Topic	1371	-	-
Target	7867	3743	4124
Negation	3347	1385	1962
IntensifierPlus	5218	2538	2680
IntensifierMinus	1151	327	824
Irreal	942	273	669
OtherShifter	722	388	334
Total	52455	16248	19189
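
A table like this can be sanity-checked mechanically. The sketch below is one possible way to do so, assuming the table is available as tab-separated text; the function and variable names are hypothetical. It verifies that, for each tag with a breakdown, the positive and negative counts add up to the total frequency, and it compares the column sums with the Total row.

```python
# The statistics table as tab-separated text (values copied from the table).
TABLE = (
    "tag\ttotal\tpositive\tnegative\n"
    "PosSentiment\t7200\t-\t-\n"
    "NegSentiment\t8442\t-\t-\n"
    "SentiWordPos\t8100\t6247\t1853\n"
    "SentiWordNeg\t8090\t1347\t6743\n"
    "Topic\t1371\t-\t-\n"
    "Target\t7867\t3743\t4124\n"
    "Negation\t3347\t1385\t1962\n"
    "IntensifierPlus\t5218\t2538\t2680\n"
    "IntensifierMinus\t1151\t327\t824\n"
    "Irreal\t942\t273\t669\n"
    "OtherShifter\t722\t388\t334\n"
    "Total\t52455\t16248\t19189\n"
)


def parse_table(text):
    """Map each tag to (total, positive, negative); '-' becomes None."""
    rows = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        tag, *cells = line.split("\t")
        rows[tag] = tuple(None if c == "-" else int(c) for c in cells)
    return rows


rows = parse_table(TABLE)
total_row = rows.pop("Total")

# Row check: positive + negative should equal the total frequency.
row_mismatches = [
    tag
    for tag, (total, pos, neg) in rows.items()
    if pos is not None and pos + neg != total
]

# Column check: sum every column over the individual tags.
column_sums = tuple(
    sum(r[i] for r in rows.values() if r[i] is not None) for i in range(3)
)

print("rows with inconsistent breakdown:", row_mismatches)
print("column sums:", column_sums)
print("listed totals:", total_row)
```

With the figures as listed, every per-tag breakdown is consistent and the positive and negative columns match the Total row. The total-frequency column sums to 52450, which is 5 less than the listed 52455; the sketch only reports this difference and does not try to decide which figure is off.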

The corpus was annotated by two annotators with a 65.02% agreement rate.
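
The description does not state how the agreement rate was computed. As an illustration only, one simple and common measure is raw percent agreement: the share of items to which both annotators assigned the same label. The sketch below assumes token-level labels; the label sequences are invented, and "O" is a hypothetical label for unannotated tokens.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of positions where two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Invented labels for five tokens; the annotators disagree on the second one.
annotator_a = ["SentiWordPos", "SentiWordNeg", "Negation", "Target", "O"]
annotator_b = ["SentiWordPos", "SentiWordPos", "Negation", "Target", "O"]

rate = percent_agreement(annotator_a, annotator_b)
print(f"{rate:.2f}%")
```

Raw percent agreement does not correct for agreement expected by chance, so it is not directly comparable with chance-corrected measures such as Cohen's kappa.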

Licensing and reference

The database can be used free of charge for research and educational purposes.

When writing a paper or producing a software application, tool, or interface based on HuSent, please cite the following paper:

Szabó, M. K., Vincze, V., Simkó, K. I., Varga, V., & Hangya, V. (2016). A Hungarian Sentiment Corpus Manually Annotated at Aspect Level. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), pp. 2873-2878.

Downloads

The HuSent corpus.

For further information please contact Martina Katalin Szabo (martina@inf.u-szeged.hu).