SzegedParalell Corpus

Introduction

The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria. It includes both sentences that illustrate the grammar of the given language (usually taken from language books) and authentic texts, which maintains a balance between artificially constructed and natural language structures.

Both paragraph and sentence alignment were checked and corrected manually.

The number of sentence alignment units per text type is shown below:

Text type	Sentence alignment units
language book sentences	3,496
texts on the European Union	1,518
Horizon Magazine	3,980
Resource Ingatlan Info	1,340
literature	88,716
miscellaneous	695
TOTAL	99,745
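
The figures in the table can be checked with a short script. This is only a sketch: the counts are copied from the table above, while the dictionary `counts` and the helper `share` are names introduced here for illustration and are not part of the corpus distribution.

```python
# Sentence alignment unit counts copied from the table above.
counts = {
    "language book sentences": 3496,
    "texts on the European Union": 1518,
    "Horizon Magazine": 3980,
    "Resource Ingatlan Info": 1340,
    "literature": 88716,
    "miscellaneous": 695,
}

total = sum(counts.values())


def share(name):
    """Return the percentage of all alignment units contributed by one text type."""
    return 100.0 * counts[name] / total


# Print one tab-separated row per text type, then the total.
for name, n in counts.items():
    print(f"{name}\t{n:,}\t{share(name):.1f}%")
print(f"TOTAL\t{total:,}")
```

The sum reproduces the stated total of 99,745, and the literature texts account for roughly 89% of all alignment units.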

Reference

Krisztina Tóth, Richárd Farkas, András Kocsor: Hybrid algorithm for sentence alignment of Hungarian-English parallel corpora. Acta Cybernetica 18(3):463-478. (2008)

For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).

Downloads

The corpus in "one alignment in a row" format.
The corpus with separate English and Hungarian files.
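
For the version with separate English and Hungarian files, a minimal sketch of re-pairing the two sides might look like the following. It assumes that line i of the English file corresponds to line i of the Hungarian file, which is not stated on this page and should be checked against the downloaded data. The function name `read_pairs` and the sample sentences are illustrative only and are not taken from the corpus.

```python
from itertools import zip_longest


def read_pairs(en_lines, hu_lines):
    """Pair line i of the English side with line i of the Hungarian side.

    Assumption: both sides have one alignment unit per line, in the same order.
    Raises ValueError if one side has more lines than the other.
    """
    pairs = []
    for i, (en, hu) in enumerate(zip_longest(en_lines, hu_lines), start=1):
        if en is None or hu is None:
            raise ValueError(f"the two sides differ in length at line {i}")
        pairs.append((en.rstrip("\n"), hu.rstrip("\n")))
    return pairs


# Illustrative in-memory input; these sentences are not from the corpus.
demo = read_pairs(["Good morning.\n", "Thank you.\n"],
                  ["Jó reggelt.\n", "Köszönöm.\n"])
for en, hu in demo:
    print(f"{en}\t{hu}")
```

With real data, the two arguments could be open file objects, for example `open("en.txt", encoding="utf-8")` and `open("hu.txt", encoding="utf-8")`. Both file names and the UTF-8 encoding are assumptions here. Raising an error on a length mismatch, rather than silently truncating, makes a misaligned download easy to notice.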