HunOr: A Hungarian-Russian Parallel Corpus
Introduction
The HunOr corpus currently comprises approximately 800 thousand words in each language and is being continuously extended. The texts come from various sources, such as printed and electronic publications. All texts are morphologically analyzed, and parts of the corpus are manually aligned and annotated for Named Entities.
Composition of the corpus
The HunOr corpus consists of three subcorpora, divided by text genre: literary, scientific and official texts. A newspaper subcorpus is planned to be added in the near future.
Literary texts
- Boris Akunin - Grigory Chartishvili: Kladbisenskie istorii 'Cemetery Stories'
- Fyodor Mikhaylovich Dostoevsky: Zapiski iz podpolya 'Notes from Underground'
- Ilya Ilf, Yevgeny Petrov: Dvenadtsat stulyev 'The Twelve Chairs'
- Isaak Emmanuilovich Babel: Konarmija 'Red Cavalry'
- Nikolay Vasilyevich Gogol: Zapiski sumasshedshego 'Diary of a Madman'
- Frigyes Karinthy: Tanár úr, kérem 'Please Sir'
- Ferenc Móra: Aranykoporsó 'The Gold Coffin'
- Géza Gárdonyi: Egri csillagok 'Stars of Eger'
- Kálmán Mikszáth: A fekete város 'The Black Town'
- Jenő Rejtő: A tizennégy karátos autó 'The 14-carat roadster'
Scientific texts
- Vitaly Orlov: Hranitel nenuzhnih veshey 'The keeper of needless things'
- Nikolay Berdyaev: O vecno-babyom v russkoy duse 'About the "eternal femininity" in the Russian soul'
Official texts
- A magyar kultúra ezer esztendeje 'One thousand years of Hungarian culture'
- Nemzeti jelképek, nemzeti ünnepek 'National symbols, national days'
- Magyar Nobel-díjasok egy jobb világért 'Nobel laureates from Hungary for a better world'
- Törvény a szomszédos államokban élő magyarokról: érdekek és célok 'Act on Hungarians living in neighbouring countries: interests and goals'
Statistical data on the corpus
Text genre | Tokens (Russian) | Tokens (Hungarian) | Sentences (Russian) | Sentences (Hungarian) |
---|---|---|---|---|
Literature | 789,001 | 798,641 | 67,021 | 61,505 |
Scientific | 6,683 | 7,228 | 370 | 348 |
Official | 14,774 | 13,522 | 668 | 568 |
Total | 810,458 | 819,391 | 68,059 | 62,421 |
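The figures in the table can be cross-checked and turned into a rough average sentence length per language. The snippet below is only an illustrative sketch: the counts are copied by hand from the table, and the names `COUNTS`, `totals` and `mean_sentence_length` are hypothetical helpers that are not part of the corpus distribution.

```python
# Per-genre counts copied from the table above.
# Each entry is (tokens, sentences) for one language.
COUNTS = {
    "Literature": {"Russian": (789_001, 67_021), "Hungarian": (798_641, 61_505)},
    "Scientific": {"Russian": (6_683, 370), "Hungarian": (7_228, 348)},
    "Official": {"Russian": (14_774, 668), "Hungarian": (13_522, 568)},
}


def totals(language):
    """Sum tokens and sentences over all genres for one language."""
    tokens = sum(row[language][0] for row in COUNTS.values())
    sentences = sum(row[language][1] for row in COUNTS.values())
    return tokens, sentences


def mean_sentence_length(language):
    """Average number of tokens per sentence for one language."""
    tokens, sentences = totals(language)
    return tokens / sentences


if __name__ == "__main__":
    for language in ("Russian", "Hungarian"):
        tokens, sentences = totals(language)
        print(f"{language}: {tokens:,} tokens, {sentences:,} sentences, "
              f"{mean_sentence_length(language):.1f} tokens per sentence")
```

The per-genre sums reproduce the Total row of the table for both languages.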
Named Entities
Named Entities | Russian | Hungarian |
---|---|---|
Person | 1704 | 1656 |
Location | 732 | 603 |
Organization | 148 | 116 |
Miscellaneous | 327 | 253 |
Total | 2910 | 2628 |
Downloads
- The HunOr corpus.
Reference
- Szabó, Martina Katalin; Vincze, Veronika; Nagy T., István 2012: HunOr: A Hungarian-Russian Parallel Corpus. Accepted to: LREC 2012.
For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).