Language Resources

The main achievements of the Human Language Technology group at the University of Szeged in the field of language resource construction are coordinating the construction of the Szeged TreeBank and of the Hungarian Ontology. Besides these datasets, several other language resources are freely available (see downloads).

Szeged TreeBank 2.0

Syntactic analysis and annotation, that is, the marking of syntactic units (e.g. nominal or adjectival phrases, postpositional structures, verbs and their arguments), is a key step in natural language processing. Treebanks that describe the syntactic structure of sentences already exist for most Western European languages and for a number of Central and Eastern European languages, so it was time to create a precisely analyzed Hungarian treebank.

The Group relied on established sources and existing theories when designing the treebank. After studying and comparing them, our linguists developed a consistent system of syntactic rules. The defined syntactic units were first marked by an automatic pre-annotation module on the texts of the Szeged Corpus 2.0; linguists then checked and corrected the annotated structures. Szeged TreeBank 2.0 is based on Szeged TreeBank 1.0.

The resulting database provides a reliable basis for developing various computer applications. Identifying the annotated syntagmas and the relationships between them supports further linguistic processing, including the semantic analysis of texts. The entire Szeged Corpus 2.0 has been annotated: 82,000 sentences (1.2 million word entries plus 250,000 punctuation marks). Treebank files are stored in XML format; their internal structure is described by the TEI P4 DTD (Document Type Definition).

After the TreeBank had been completed, the Group started developing the system of rules required by the syntactic parser. This rule system draws on two main sources.

One part of the rules was written by the linguists of the consortium; this set was then supplemented with rules derived from the annotated treebank by machine learning methods. The developments were carried out in co-operation with MorphoLogic Ltd. and the HAS Research Institute for Linguistics.
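To illustrate how rules might be read off an annotated treebank, here is a minimal sketch in Python. It uses simple relative-frequency counting over toy trees, which is an assumption made for illustration; the text above does not say which learning method the Group actually used. The tree encoding, the category labels, and the function names are all hypothetical and do not reflect the Szeged TreeBank format.

```python
from collections import Counter

# Toy annotated trees as nested tuples: (label, child, child, ...).
# Leaves are plain strings (word forms). Labels and sentences are
# illustrative only, not taken from the Szeged TreeBank.
TREES = [
    ("S",
     ("NP", ("N", "Péter")),
     ("VP", ("V", "olvas"), ("NP", ("Det", "egy"), ("N", "könyvet")))),
    ("S",
     ("NP", ("N", "Mari")),
     ("VP", ("V", "alszik"))),
]


def extract_rules(tree, counts):
    """Count one rule (parent label -> child labels) per phrasal node."""
    label, *children = tree
    child_labels = tuple(c[0] for c in children if isinstance(c, tuple))
    if child_labels:  # nodes that dominate only a word form are skipped
        counts[(label, child_labels)] += 1
    for child in children:
        if isinstance(child, tuple):
            extract_rules(child, counts)


def rule_probabilities(trees):
    """Estimate P(rule | left-hand side) by relative frequency."""
    counts = Counter()
    for tree in trees:
        extract_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


if __name__ == "__main__":
    for (lhs, rhs), p in sorted(rule_probabilities(TREES).items()):
        print(f"{lhs} -> {' '.join(rhs)}  p={p:.2f}")
```

Rules induced in this way could in principle be merged with a hand-written rule set, which is the kind of two-source combination described above.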

Hungarian ontology - the Hungarian WordNet

Developing computer applications for the Hungarian language calls for a Hungarian vocabulary database that can be handled by automated processes. In computational linguistics, an ontology can be defined as a data structure of formally defined concepts and relations, by means of which semantic inferences can be drawn. So-called language ontologies form an important subclass of computational ontologies.
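The definition above (concepts, relations, and inferences drawn from them) can be made concrete with a small sketch. The concepts, the single hypernym ("is a kind of") relation, and the function names below are hypothetical examples chosen for illustration; they are not taken from the Hungarian WordNet.

```python
# Hypothetical mini-ontology: each concept maps to its direct hypernyms.
HYPERNYMS = {
    "kutya": {"emlős"},      # dog -> mammal
    "macska": {"emlős"},     # cat -> mammal
    "emlős": {"állat"},      # mammal -> animal
    "állat": {"élőlény"},    # animal -> living being
    "élőlény": set(),
}


def ancestors(concept, hypernyms):
    """Return every concept reachable by following hypernym links upward."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in hypernyms.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(concept, candidate, hypernyms):
    """Infer whether `concept` is a kind of `candidate` by transitivity."""
    return candidate in ancestors(concept, hypernyms)


if __name__ == "__main__":
    print(is_a("kutya", "állat", HYPERNYMS))
    print(is_a("állat", "kutya", HYPERNYMS))
```

The fact that a dog is an animal is never stored directly; it follows from chaining two stored relations, which is the simplest form of the semantic inference mentioned above.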

The project has three objectives. The first is to create a semantically structured, general-purpose Hungarian concept set based on the results and formalism of the EuroWordNet language ontology. The second is to supplement this ontology with a special sub-language already examined by the consortium, that is, a domain-specific ontology containing expressions of business language. The third is to present a potential application of the resulting concept network in the field of information extraction.

The main result of the project is a large, strictly structured natural language concept set (ontology), which will help in solving several important scientific and technological problems. As for its scientific significance, it is important to emphasize that the work concerns the semantics of Hungarian, a language that differs significantly in typology and morphology from the other European languages investigated.

As the structure of WordNet ontologies is much more complex than that of a simple lexicon or thesaurus, their potential applications are far richer. As a mental encyclopedia of native speakers of Hungarian, a Hungarian WordNet ontology could, to a large extent, assist language teaching in schools. Its standardized interconnection with the other WordNets makes it applicable in teaching foreign languages as well. For example, properly acquiring the lexical material of the foreign language being studied may significantly contribute to the learner's clear understanding of the differences and similarities between their native language and the target language. In addition, the concept network of WordNet may play a significant role in psycholinguistic experiments on the Hungarian language.

Beyond its purely scientific uses, a Hungarian WordNet may also open new vistas for language technology applications. The efficiency of search engines increases considerably if they have reliable access to the semantic environment of the search expression, which may lead to future search engines that satisfy user needs to a greater extent. A WordNet may also increase the efficiency of information extraction and machine translation technologies by providing information about the semantic attributes of the analyzed text. Automated processes supported by ontologies can take into account the context of the information to be extracted or translated, and are therefore likely to produce more reliable results than mere pattern matching or word-by-word translation.
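As a rough illustration of how access to the semantic environment of a search expression can help retrieval, the sketch below expands a query with related words before matching. The synonym table, the documents, and the function names are hypothetical; a real system would draw the related words from the WordNet itself and would need morphological analysis to handle inflected Hungarian word forms, which is omitted here.

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "autó": {"gépkocsi", "személygépkocsi"},
    "gépkocsi": {"autó", "személygépkocsi"},
}

# Toy document collection (lower-cased, whitespace-separated words).
DOCS = [
    "eladó használt gépkocsi szegeden",
    "új kerékpár akció",
]


def expand_query(terms, synonyms):
    """Add the known synonyms of each query term to the query."""
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded


def search(docs, terms):
    """Return indices of documents containing at least one query term."""
    wanted = set(terms)
    return [i for i, doc in enumerate(docs) if wanted & set(doc.lower().split())]


if __name__ == "__main__":
    query = ["autó"]
    print(search(DOCS, query))                            # plain matching
    print(search(DOCS, expand_query(query, SYNONYMS)))    # with expansion
```

With plain matching the query word must appear literally in a document; with expansion, a document that uses a synonym can also be found.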

The developments were carried out in co-operation with MorphoLogic Ltd. and the HAS Research Institute for Linguistics.