Hungarian sentiment corpus (HuSent)
Introduction
HuSent is a deeply annotated Hungarian sentiment corpus. It consists of Hungarian opinion texts about different types of products, published on the website [http://divany.hu/]. The corpus contains 154 opinion texts, comprising approximately 17 thousand sentences and 251 thousand tokens.
The HuSent corpus was created by Precognox Ltd and the MTA-SZTE Research Group on Artificial Intelligence (RGAI).
Corpus annotation principles
The database was processed according to the following main annotation principles:
- Sentiment fragments: First, we annotated in the raw texts each whole construction that expresses a positive or negative opinion.
- Sentiment words: Next, we annotated the sentiment words, which express positive or negative opinion at the lexeme level.
- Targets: We also annotated the targets of the sentiment words. Entities and their aspects were annotated with different tags (Target 1-20) and we applied the same tag for a given target in a given document of the corpus consistently. Product names that functioned as a title were annotated as topics.
- Modifiers: We also annotated the elements that modify the prior polarity (also called semantic orientation) of the sentiment words. These elements, also called sentiment shifters, include negators, irreals, and increasing and decreasing intensifiers.
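The four annotation layers above can be pictured as a small data model. The following Python sketch is only an illustration: the class names, field names, and modifier labels are hypothetical and do not reflect the actual file format or tag names of the corpus. The scoring rule (negators flip the sign, irreals neutralise, intensifiers scale by an arbitrary factor) is likewise an assumption chosen to show how modifiers could act on prior polarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical modifier labels; the corpus's actual tag names may differ.
NEGATOR = "negator"
IRREAL = "irreal"
INTENSIFIER_UP = "intensifier_up"
INTENSIFIER_DOWN = "intensifier_down"


@dataclass
class SentimentWord:
    """A lexeme-level sentiment word with its target and modifiers."""
    text: str
    prior_polarity: int                # +1 (positive) or -1 (negative)
    target_tag: Optional[str] = None   # same tag for the same target within a document
    modifiers: List[str] = field(default_factory=list)


@dataclass
class SentimentFragment:
    """A whole construction expressing a positive or negative opinion."""
    text: str
    polarity: int                      # +1 or -1
    words: List[SentimentWord] = field(default_factory=list)


def contextual_score(word: SentimentWord) -> float:
    """Apply modifiers to the prior polarity (illustrative rule, not the corpus's)."""
    score = float(word.prior_polarity)
    for modifier in word.modifiers:
        if modifier == NEGATOR:
            score = -score             # negation flips the polarity
        elif modifier == IRREAL:
            score = 0.0                # unreal or hypothetical context: no asserted opinion
        elif modifier == INTENSIFIER_UP:
            score *= 1.5               # arbitrary strengthening factor
        elif modifier == INTENSIFIER_DOWN:
            score *= 0.5               # arbitrary weakening factor
    return score


# "nem jó" ("not good"): a positive sentiment word under negation.
jo = SentimentWord("jó", +1, target_tag="Target1", modifiers=[NEGATOR])
fragment = SentimentFragment("nem jó", -1, [jo])
print(contextual_score(jo))  # -1.0
```

Keeping the prior polarity on the word and the modifiers in a separate list mirrors the separation between the sentiment word and modifier layers described above.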
Basic statistical data of the corpus
The corpus was annotated by two annotators with a 65.02% agreement rate.
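The description above does not state how the agreement rate was calculated. As a point of reference only, the sketch below computes simple percent agreement between two annotators over aligned items; the function name and the label values are hypothetical, and the measure actually used for HuSent may differ.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of aligned items (in percent) on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy example with made-up labels: the annotators disagree on one of four items.
annotator_1 = ["positive", "negative", "none", "positive"]
annotator_2 = ["positive", "negative", "positive", "positive"]
print(percent_agreement(annotator_1, annotator_2))  # 75.0
```

Percent agreement does not correct for chance agreement, so it is usually read alongside a chance-corrected measure when one is available.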
Licensing and reference
The database can be used free of charge for research and educational purposes.
When writing a paper or producing a software application, tool, or interface based on HuSent, it is necessary to properly cite the following paper:
Szabó, M. K., Vincze, V., Simkó, K. I., Varga, V., & Hangya, V. (2016). A Hungarian Sentiment Corpus Manually Annotated at Aspect Level. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), pp. 2873-2878.
Downloads
The HuSent corpus.
For further information please contact Martina Katalin Szabo (martina@inf.u-szeged.hu).