Szeged Dependency Treebank

Introduction

The Szeged Dependency Treebank is the dependency-tree version of the Szeged Treebank. The original treebank is phrase-structured; we converted it to dependency trees automatically, then checked and corrected the output manually, creating the first manually annotated dependency corpus for Hungarian.

The corpus contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six domains, with approximately 200,000 words from each. The domains are the following:

  • fiction
  • compositions by pupils aged 14 to 16
  • newspaper articles (from the newspapers Népszabadság, Népszava, Magyar Hírlap, HVG)
  • texts in informatics
  • legal texts
  • business and financial news

The format of the database follows the CoNLL-2009 Shared Task conventions. For a detailed description of the dependency relations applied, please see this paper.
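As a rough illustration of what the CoNLL-2009 layout implies for users of the corpus, the following Python sketch reads tab-separated CoNLL-2009 lines and lists the dependency edges of each sentence. It assumes the standard CoNLL-2009 column order (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED) with blank lines between sentences. The function names and the sample sentence are invented for illustration; the sample is not taken from the corpus, and its tags and relation labels are placeholders rather than the treebank's actual label set.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# First 14 columns of the CoNLL-2009 Shared Task format.
# Any further columns (APRED1, APRED2, ...) are ignored by this sketch.
CONLL09_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]

Token = Dict[str, str]


def read_conll09(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of column-name -> value dicts."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL09_COLUMNS, fields)))
    if sentence:
        yield sentence


def dependency_edges(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples; head 0 maps to ROOT."""
    forms = {tok["ID"]: tok["FORM"] for tok in sentence}
    return [
        (tok["FORM"], tok["DEPREL"], forms.get(tok["HEAD"], "ROOT"))
        for tok in sentence
    ]


# Invented toy sentence ("A kutya ugat ." = "The dog barks ."), NOT corpus data.
# Tags and relation labels are placeholders chosen for illustration.
SAMPLE = "\n".join([
    "1\tA\ta\ta\tT\tT\t_\t_\t2\t2\tDET\tDET\t_\t_",
    "2\tkutya\tkutya\tkutya\tN\tN\t_\t_\t3\t3\tSUBJ\tSUBJ\t_\t_",
    "3\tugat\tugat\tugat\tV\tV\t_\t_\t0\t0\tROOT\tROOT\t_\t_",
    "4\t.\t.\t.\t.\t.\t_\t_\t3\t3\tPUNCT\tPUNCT\t_\t_",
    "",
])

for sent in read_conll09(SAMPLE.splitlines()):
    for dependent, relation, head in dependency_edges(sent):
        print(f"{dependent} --{relation}--> {head}")
```

To apply the same functions to a corpus file, pass an open file object to `read_conll09` instead of the sample lines; the file name and its text encoding depend on the distribution you receive and are not specified here.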

Reference

Vincze, Veronika; Szauter, Dóra; Almási, Attila; Móra, György; Alexin, Zoltán; Csirik, János 2010: Hungarian Dependency Treebank. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.

Download

To obtain access to the corpus, please send a signed licence agreement to Veronika Vincze (e-mail: vinczev AT inf.u-szeged.hu, fax: +3662546737).