HunOr: A Hungarian-Russian Parallel Corpus

Introduction

The HunOr corpus currently comprises approximately 800 thousand words and is being continuously enlarged. Its texts come from various sources, such as printed editions and electronic publications. All corpus texts are morphologically analyzed, and parts of the corpus are manually aligned and annotated for Named Entities.

Composition of the corpus

The HunOr corpus consists of three subcorpora, divided by text genre: literary, scientific and official. A newspaper subcorpus is to be added in the near future.

Literary texts

  • Boris Akunin - Grigory Chartishvili: Kladbisenskie istorii 'Cemetery Stories'
  • Fyodor Mikhaylovich Dostoevsky: Zapiski iz podpolya 'Notes from Underground'
  • Ilya Ilf, Yevgeny Petrov: Dvenadtsat stulyev 'The Twelve Chairs'
  • Isaak Emmanuilovich Babel: Konarmija 'Red Cavalry'
  • Nikolay Vasilyevich Gogol: Zapiski sumasshedshego 'Diary of a Madman'
  • Frigyes Karinthy: Tanár úr, kérem 'Please Sir'
  • Ferenc Móra: Aranykoporsó 'The Gold Coffin'
  • Géza Gárdonyi: Egri csillagok 'Stars of Eger'
  • Kálmán Mikszáth: A fekete város 'The Black Town'
  • Jenő Rejtő: A tizennégy karátos autó 'The 14-carat roadster'

Scientific texts

  • Vitaly Orlov: Hranitel nenuzhnih veshey 'The keeper of needless things'
  • Nikolay Berdyaev: O vecno-babyom v russkoy duse 'About the "eternal femininity" in the Russian soul'

Official texts

  • A magyar kultúra ezer esztendeje 'One thousand years of Hungarian culture'
  • Nemzeti jelképek, nemzeti ünnepek 'National symbols, national days'
  • Magyar Nobel-díjasok egy jobb világért 'Nobel laureates from Hungary for a better world'
  • Törvény a szomszédos államokban élő magyarokról: érdekek és célok 'Act on Hungarians living in neighbouring countries: interests and goals'

Statistical data on the corpus

Text genre   Tokens                 Sentences
             Russian    Hungarian   Russian   Hungarian
Literature   789,001    798,641     67,021    61,505
Scientific     6,683      7,228        370       348
Official      14,774     13,522        668       568
Total        810,458    819,391     68,059    62,421
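
The figures in the table above can be used to derive further statistics, such as average sentence length per language. The following is only a minimal sketch: the counts are copied from the table, while the dictionary layout, the language codes "ru" and "hu", and the function names are assumptions made for illustration and are not part of the corpus distribution.

```python
# Token and sentence counts copied from the statistics table.
# The structure and the "ru"/"hu" keys are illustrative assumptions.
STATS = {
    "Literature": {
        "tokens": {"ru": 789_001, "hu": 798_641},
        "sentences": {"ru": 67_021, "hu": 61_505},
    },
    "Scientific": {
        "tokens": {"ru": 6_683, "hu": 7_228},
        "sentences": {"ru": 370, "hu": 348},
    },
    "Official": {
        "tokens": {"ru": 14_774, "hu": 13_522},
        "sentences": {"ru": 668, "hu": 568},
    },
}


def total(stats, measure, lang):
    """Sum one measure ("tokens" or "sentences") over all genres."""
    return sum(genre[measure][lang] for genre in stats.values())


def avg_sentence_length(stats, lang):
    """Average number of tokens per sentence for one language."""
    return total(stats, "tokens", lang) / total(stats, "sentences", lang)


if __name__ == "__main__":
    for lang in ("ru", "hu"):
        print(
            lang,
            total(STATS, "tokens", lang),
            total(STATS, "sentences", lang),
            round(avg_sentence_length(STATS, lang), 2),
        )
```

The summed values reproduce the Total row of the table, which gives a quick consistency check before computing any ratios.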

 

Named Entities

Named Entities   Russian   Hungarian
Person             1,704       1,656
Location             732         603
Organization         148         116
Miscellaneous        327         253
Total              2,910       2,628

Downloads

Reference

  • Szabó, Martina Katalin; Vincze, Veronika; Nagy T., István 2012: HunOr: A Hungarian-Russian Parallel Corpus. Accepted to: LREC 2012.

For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).