Hungarian sentiment corpus (HuSent)

Introduction

HuSent is a deeply annotated Hungarian sentiment corpus. It consists of Hungarian opinion texts about different types of products, published on the website [http://divany.hu/]. The corpus contains 154 opinion texts, comprising approximately 17 thousand sentences and 251 thousand tokens.

HuSent was created by Precognox Ltd and the MTA-SZTE Research Group on Artificial Intelligence (RGAI).

Corpus annotation principles

The corpus was annotated according to the following main principles:

  • Sentiment fragments: First, we annotated in the raw texts each whole construction expressing a positive or negative opinion.
  • Sentiment words: Next, we annotated the sentiment words, which express a positive or negative opinion at the lexeme level.
  • Targets: We also annotated the targets of the sentiment words. Entities and their aspects were annotated with different tags (Target 1-20), and the same tag was applied consistently to a given target within a given document. Product names that functioned as titles were annotated as topics.
  • Modifiers: We also annotated the elements that modify the prior polarity (also called semantic orientation) of the sentiment words. These elements, also called sentiment shifters, include negators, irreals, and increasing and decreasing intensifiers.
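The annotation layers listed above can be pictured as a small data structure. The following Python sketch is only an illustration: the class and field names, the example phrase, the `Target1` label spelling, and the polarity-flipping rule are assumptions made here, not the corpus's actual file format or the annotators' procedure. Only the tag names (SentiWordPos, Negation, NegSentiment) are taken from the corpus statistics.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One annotated span; `tag` holds a corpus tag name."""
    text: str
    tag: str


@dataclass
class SentimentFragment:
    """Hypothetical container for one annotated sentiment fragment."""
    text: str
    tag: str  # "PosSentiment" or "NegSentiment"
    sentiment_words: List[Span] = field(default_factory=list)
    targets: List[Span] = field(default_factory=list)
    modifiers: List[Span] = field(default_factory=list)


def shifted_polarity(word: Span, modifiers: List[Span]) -> str:
    """Assumed rule, for illustration only: each Negation flips the prior polarity."""
    polarity = "positive" if word.tag == "SentiWordPos" else "negative"
    for modifier in modifiers:
        if modifier.tag == "Negation":
            polarity = "negative" if polarity == "positive" else "positive"
    return polarity


# Invented example: "a kanapé nem jó" is Hungarian for "the sofa is not good".
fragment = SentimentFragment(
    text="a kanapé nem jó",
    tag="NegSentiment",
    sentiment_words=[Span("jó", "SentiWordPos")],
    targets=[Span("kanapé", "Target1")],
    modifiers=[Span("nem", "Negation")],
)
print(shifted_polarity(fragment.sentiment_words[0], fragment.modifiers))
```

In this invented case a positive sentiment word sits inside a negative fragment because a negator shifts its prior polarity, which is the kind of interaction the modifier layer is meant to capture.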

Basic statistical data of the corpus

tag               total frequency   frequency in positive sentiment fragments   frequency in negative sentiment fragments
PosSentiment      7200              -                                           -
NegSentiment      8442              -                                           -
SentiWordPos      8100              6247                                        1853
SentiWordNeg      8090              1347                                        6743
Topic             1371              -                                           -
Target            7867              3743                                        4124
Negation          3347              1385                                        1962
IntensifierPlus   5218              2538                                        2680
IntensifierMinus  1151              327                                         824
Irreal            942               273                                         669
OtherShifter      722               388                                         334
Total             52455             16248                                       19189
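
To make the proportions behind these counts easier to read, the short Python sketch below computes, for each tag that has a positive/negative breakdown, the share of its occurrences that fall in positive and in negative sentiment fragments. The numbers are copied from the statistics above; the variable and function names are illustrative choices, not part of the corpus distribution.

```python
# Counts copied from the statistics above:
# tag -> (total frequency, in positive fragments, in negative fragments)
counts = {
    "SentiWordPos": (8100, 6247, 1853),
    "SentiWordNeg": (8090, 1347, 6743),
    "Target": (7867, 3743, 4124),
    "Negation": (3347, 1385, 1962),
    "IntensifierPlus": (5218, 2538, 2680),
    "IntensifierMinus": (1151, 327, 824),
    "Irreal": (942, 273, 669),
    "OtherShifter": (722, 388, 334),
}


def shares(tag: str) -> tuple:
    """Return (positive share, negative share) of a tag as percentages."""
    total, positive, negative = counts[tag]
    return 100 * positive / total, 100 * negative / total


for tag in counts:
    pos_share, neg_share = shares(tag)
    print(f"{tag:<18}{pos_share:6.1f}% positive {neg_share:6.1f}% negative")
```

For every tag listed here, the positive and negative counts add up to the tag's total frequency, so the two shares sum to 100%.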

The corpus was annotated by two annotators with a 65.02% agreement rate.

Licensing and reference

The database can be used free of charge for research and educational purposes.

When writing a paper or producing a software application, tool, or interface based on HuSent, please cite the following paper:

Szabó, M. K., Vincze, V., Simkó, K. I., Varga, V., & Hangya, V. (2016). A Hungarian Sentiment Corpus Manually Annotated at Aspect Level. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), pp. 2873-2878.

Downloads

The HuSent corpus.

For further information please contact Martina Katalin Szabo (martina@inf.u-szeged.hu).