Szeged Treebank


The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six domains, with approximately 200,000 words from each. The domains are the following:

  • fiction
  • compositions written by pupils aged 14 to 16
  • newspaper articles (from the newspapers Népszabadság, Népszava, Magyar Hírlap, HVG)
  • texts in informatics
  • legal texts
  • business and financial news

The treebank exists in three versions:

  • Szeged Treebank 1.0 is annotated for noun phrases and clauses;
  • Szeged Treebank 2.0 contains a deep phrase-structured syntactic analysis for all sentences;
  • Szeged Dependency Treebank contains dependency-style annotation of all sentences.

A morphologically reannotated version of the corpus, Szeged Corpus 2.5, has just been released. In this version, participles and causative, frequentative and modal verbs are distinctively marked, unknown or misspelled words have been corrected, and some minor morphological modifications have been made.
If you are interested in Szeged Corpus 2.5, please contact Veronika Vincze.

Baseline experiments

We conducted baseline experiments on the Szeged Dependency Treebank with three state-of-the-art dependency parsers: MALT (Nivre et al. 2004), MST (McDonald et al. 2005) and the Bohnet parser (Bohnet 2010). Results are presented in Farkas et al. (2012). The training/development/test and cross-validation splits can also be obtained by sending a licence agreement (see below). A detailed classification of parsing errors can be downloaded here.

Conversion from constituency to dependency

Because the same set of texts carries manual annotations for both constituency and dependency syntax, it is possible to evaluate the quality of a rule-based automatic conversion from constituency to dependency trees. We automatically converted the constituency treebank into dependency trees following the principles described here. The accuracy of the conversion was 96.51% (ULA) and 93.85% (LAS). For a detailed error analysis, please refer to Simkó et al. (2014).
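As an illustration of how such scores are typically computed, the sketch below compares automatically converted trees with the manual dependency annotation token by token. It assumes the usual definitions, which this page does not spell out: ULA is the share of tokens whose head is correct, and LAS is the share of tokens whose head and dependency label are both correct. The function name, the data layout and the labels in the example are hypothetical and do not reflect the treebank's actual file format or label set. Whether punctuation marks were counted in the reported figures is not stated here, so the sketch counts every token it is given.

```python
from typing import List, Tuple

# One token's annotation: (index of its head, dependency label).
# Head index 0 stands for the artificial root. This layout is an assumption.
Arc = Tuple[int, str]


def attachment_scores(gold: List[List[Arc]],
                      converted: List[List[Arc]]) -> Tuple[float, float]:
    """Return (ULA, LAS) as percentages over all tokens."""
    if len(gold) != len(converted):
        raise ValueError("both inputs must contain the same number of sentences")
    total = head_ok = both_ok = 0
    for gold_sent, conv_sent in zip(gold, converted):
        if len(gold_sent) != len(conv_sent):
            raise ValueError("sentences must have the same number of tokens")
        for (g_head, g_label), (c_head, c_label) in zip(gold_sent, conv_sent):
            total += 1
            if g_head == c_head:
                head_ok += 1            # head correct: counts towards ULA
                if g_label == c_label:
                    both_ok += 1        # head and label correct: counts towards LAS
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * head_ok / total, 100.0 * both_ok / total


# Toy example with made-up labels: one three-token sentence in which
# the converted tree has the right heads but one wrong label.
gold = [[(2, "SUBJ"), (0, "ROOT"), (2, "OBJ")]]
converted = [[(2, "SUBJ"), (0, "ROOT"), (2, "OBL")]]
ula, las = attachment_scores(gold, converted)
print(f"ULA = {ula:.2f}, LAS = {las:.2f}")  # prints "ULA = 100.00, LAS = 66.67"
```

Since a token can only count towards LAS if its head is already correct, LAS can never exceed ULA, which is consistent with the figures reported above.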

Coreference-annotated version

A section of the Szeged Treebank has been manually annotated for coreference relations. It is freely available for research and educational purposes. If you are interested in this version, please contact Veronika Vincze.


To obtain access to the corpora, please send a signed licence agreement to Veronika Vincze (vinczev AT or fax number: +3662546737).