Natural Language Processing

The Natural Language Processing group has been involved in human language technology research since 1998 and has become one of the leading research groups in Hungarian computational linguistics. The group mostly processes Hungarian and English texts. Its general objective is to develop language-independent or easily adaptable technologies.

Recently, members of the group have focused on three main research topics. First, they have been working on the development of magyarlanc, a morphological and syntactic parser for Hungarian. Currently, the analysis of non-standard texts is the main challenge in this field, and novel datasets and techniques have been developed to improve performance. Second, several topics in computational semantics have been investigated within the group for some years. For instance, the group has developed machine-learning-based solutions for identifying non-factual (e.g. negated or speculative) text spans. Moreover, the researchers' interests include multiword expressions (linguistic units that consist of several tokens but have special syntactic and semantic characteristics): several datasets and machine learning tools have been created within the framework of an international cooperation involving more than 20 countries. Third, the group has for some years been investigating the role of natural language processing tools in the detection of dementia, together with speech technology experts and psychiatrists.

The group has organized several national and international conferences: the annual Hungarian Computational Linguistics Conference (MSZNY) since 2003, the Fourth Global WordNet Conference (GWC2008), the CoNLL-2010 Shared Task, the Second International Workshop on Computational Linguistics for Uralic Languages (IWCLUL2016) in 2016, and the PARSEME Shared Tasks in 2017 and 2018.

People

Veronika Vincze (contact), László Vidács, Gábor Berend, László Tóth, András Kicsi, Péter Pusztai, Martina Katalin Szabó

Research topics

Linguistic processing of Hungarian
Hungarian is a prototypical example of a morphologically rich language with free word order. We are constantly improving linguistic processing tools for Hungarian. Our group coordinated the development of the Szeged Treebank and the Szeged Dependency Treebank. We also harmonized two Hungarian morphological coding systems in order to make the available morphological analysers and corpora compatible with each other. The magyarlanc toolkit performs the basic linguistic processing of Hungarian texts, i.e. language-specific segmentation, POS-tagging and dependency parsing. The toolkit consists solely of Java modules (there are no wrappers for other programming languages), which ensures its platform independence and makes it easy to integrate into bigger systems (e.g. web servers). We hope that the free availability of the toolkit fosters research not just on Hungarian but on morphologically rich languages in general.

Automatic keyphrase extraction
Free text tagging is the task of assigning a few natural language phrases to documents which summarize them and semantically represent their content. The Group has developed a solution for the automatic tagging of the [origo] news archive's 400K articles.

To process scientific papers effectively, keywords should ideally be supplied by the authors. However, manually assigned keyphrases are rarely provided, and creating them by hand would be costly and time-consuming. The constant growth in the number of publications has boosted interest in automatic keyphrase generation. We developed an automatic keyphrase extraction system and achieved top rank at a SemEval-2 challenge. We also adapted the keyphrase extraction system to extract the reasons behind authors' opinions (pro and con phrases) from product reviews collected from the epinions.com site. The datasets for two fairly different product review domains were constructed semi-automatically.
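
As a rough illustration of what keyphrase extraction involves, the sketch below ranks candidate n-grams of a document by a TF-IDF score. It is not the Group's system: the stopword list, the candidate filter and the function names (`candidates`, `extract_keyphrases`) are assumptions made for this example only.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with", "are"}


def candidates(text, max_len=3):
    """Yield lower-cased n-grams (n <= max_len) that neither start nor end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)


def extract_keyphrases(doc, corpus, top_k=3):
    """Return the top_k candidate phrases of doc, ranked by term frequency times smoothed IDF."""
    tf = Counter(candidates(doc))
    doc_sets = [set(candidates(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for phrase, freq in tf.items():
        df = sum(phrase in s for s in doc_sets)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # smoothed inverse document frequency
        scores[phrase] = freq * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


if __name__ == "__main__":
    corpus = [
        "parsing parsing parsing of the text",
        "text of the news",
        "news text",
    ]
    print(extract_keyphrases(corpus[0], corpus, top_k=2))
```

Phrases that are frequent in one document but rare in the rest of the collection rise to the top; a production system would typically add linguistic filtering and a trained ranker on top of such candidates.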

Multiword expressions
Multiword expressions are lexical units that consist of two or more words (tokens) but exhibit special syntactic, semantic, pragmatic or statistical features. From an NLP point of view, their treatment is not free of problems. On the one hand, the system should recognize that they count as one lexical unit (and not as two or more connected words), so it is advisable to store them as one unit in the lexicon. On the other hand, special rules for their treatment should also be included in the system.
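
The idea of storing a multiword expression as one lexical unit can be sketched as a greedy longest-match lookup against a lexicon. This is a minimal sketch under stated assumptions: the lexicon entries and the function name `merge_mwes` are hypothetical, and only contiguous expressions are handled, whereas real systems also need rules for discontinuous or inflected variants.

```python
def merge_mwes(tokens, lexicon, max_len=4):
    """Greedily join the longest lexicon match starting at each position into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword match starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy lexicon of multiword expressions (lower-cased token tuples).
LEXICON = {("take", "a", "decision"), ("new", "york"), ("by", "and", "large")}

if __name__ == "__main__":
    print(merge_mwes("They take a decision in New York".split(), LEXICON))
```

After merging, downstream components see `take_a_decision` as a single unit rather than three independent words.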

NLP for user-generated content
Recently, we have started experimenting with the adaptation of standard natural language processing tools to user-generated content (blogs, Facebook, tweets). We won RepLab 2013, an evaluation campaign for online reputation management systems, where the task was to develop target-oriented sentiment analysis tools for tweets.

Uncertainty and negation detection
In information extraction, many applications seek to extract factual information from text. It is therefore highly important to distinguish uncertain and/or negated text spans from factual information. In most cases, what users need is factual information, so uncertain or negated text spans should be treated in a special way. Depending on the given task, the system should either ignore such spans or separate them from factual information (the user can later decide whether they are needed). For the training and evaluation of such systems, corpora annotated for negation and speculation are necessary.
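
To make the "separate them from factual information" step concrete, here is a minimal cue-word sketch that sorts sentences into factual, negated and speculative buckets. It is a simplification for illustration only: the cue lists and function names are assumptions, it works at sentence level rather than on text spans, and it does not handle contractions such as "doesn't". Systems trained on annotated corpora learn cues and their scope instead of relying on a fixed list.

```python
import re

# Illustrative cue lists (assumptions; real systems learn cues from annotated corpora).
NEGATION_CUES = {"no", "not", "never", "without"}
SPECULATION_CUES = {"may", "might", "possibly", "suggest", "suggests", "likely"}


def classify_sentence(sentence):
    """Label a sentence as 'negated', 'speculative' or 'factual' based on cue words."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    if tokens & NEGATION_CUES:
        return "negated"
    if tokens & SPECULATION_CUES:
        return "speculative"
    return "factual"


def split_by_factuality(sentences):
    """Group sentences so that non-factual ones can be ignored or shown separately."""
    buckets = {"factual": [], "negated": [], "speculative": []}
    for sentence in sentences:
        buckets[classify_sentence(sentence)].append(sentence)
    return buckets


if __name__ == "__main__":
    report = [
        "The X-ray shows a fracture.",
        "There is no evidence of pneumonia.",
        "The findings may indicate infection.",
    ]
    for label, items in split_by_factuality(report).items():
        print(label, items)
```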

Information Extraction / Web Mining
Due to the exponentially growing number of publications, the necessity for automatic information extraction is strong in the biomedical domain. The Group's activities in this field mainly focus on the disambiguation of biological terms and the detection of uncertain and negative assertions.

In medical documents (e.g. findings or case histories) there is a huge amount of information encoded in free text format. Automated processing of these texts would make these data easily accessible. The Group's results in this field involve automatic coding of radiological findings and anonymization of medical documents.

Scientific social web mining aims at extracting global patterns in the network of researchers of a given field. The Group has developed a method for analyzing full text regions of homepages of researchers, with the help of which scientific social information can be automatically acquired.

A Named Entity (NE) is a phrase in the text which uniquely refers to an entity of the world. The identification and classification of NEs serve as a basis for several NLP applications (especially in information extraction and machine translation). The group has constructed several manually annotated NE corpora for Hungarian and has developed a NER system, which has been successfully applied to Hungarian and English business news and English clinical texts.

Disambiguating person names is a challenging task that can be seen as a special case of word sense disambiguation. On the one hand, names are ambiguous: thousands of people can share a first name or surname. On the other hand, certain names tend to occur in several variants. As a result, query results contain homepages that belong to different people with the same name, while some homepages that do belong to the person sought are not returned.

Language Resources
Our group members are experienced in constructing language resources and corpora: besides the two main language resources (Szeged Treebank and the Hungarian WordNet), they have built several other databases as well.

Projects

PARSEME
The IC1207 COST Action, PARSEME, is an interdisciplinary scientific network devoted to the role of multi-word expressions (MWEs) in parsing. It gathers experts from several disciplines (linguists, computational linguists, computer scientists, psycholinguists, and industry practitioners) from 31 countries. It addresses different methodologies (symbolic, probabilistic and hybrid parsing) and language technology applications (machine translation, information retrieval, etc.). The most important results of the action include the creation of a manually annotated database of verbal multiword expressions for 20 languages and the organization of several workshops and two shared tasks.

e-magyar
e-magyar is a new toolset for the analysis of Hungarian texts. It was produced as a collaborative effort of the Hungarian language technology community, integrating the best state-of-the-art tools, enhancing them where necessary, making them interoperable and releasing them with a clear license. It is a free, open, modular text processing pipeline which is integrated in the GATE system, offering further prospects of interoperability. It analyses Hungarian texts from tokenization to syntactic parsing, together with named entity recognition. Members of the Szeged NLP team were involved in the development of the morphological and syntactic parsing modules.

Information retrieval from Hungarian radiology reports
In this project, information retrieval from MR reports is carried out using manual annotations of anonymized reports. Machine learning is applied to the annotated reports to label body parts, changes and their properties in free text. In the near future, we plan to apply deep learning and ontology-based methods to make the analysis more reliable. Building on this phase, the project will seek answers to questions related to specific illnesses.

Classification of non-functional requirements
Requirements engineering is one of the very first tasks of the software development process, and it fundamentally influences the quality of the software under development. Requirements, both functional and non-functional, are mostly given in natural language form. Non-functional requirements are the foundation of the quality aspects of the software, such as security, usability and reliability. Classifying non-functional requirements is one of the most important tasks of software engineering. The objective of the project is to develop machine-learning (and deep-learning) based methods and tools which can support system analysts in classifying non-functional requirements given in natural language form. The collection of classified non-functional requirements can be used in both the analysis and design phases.

Multi-label classification for tagging user feedback given in natural language form
When users or customers express their expectations of the software, they use natural language. Sentences of feedback or requirements often contain more than one aspect of these expectations, so they can be assigned to more than one class. The objective of the project is to develop machine-learning (and deep-learning) based methods which can be applied to multi-label classification, and to develop a tagger tool based on these methods. The method is also to be extended to support the multi-label tagging of sentences.
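
Multi-label classification can be illustrated with the binary relevance approach, in which one independent yes/no model is trained per label and a sentence receives every label whose model fires. The sketch below is not the project's method: it swaps in a toy scorer (difference of document frequencies between positive and negative examples) for a real learner, and the labels, training sentences and function names are assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace; each word counts once per sentence."""
    return set(text.lower().split())


def train_binary_relevance(examples):
    """Train one word-weight model per label.

    examples: list of (sentence, set_of_labels) pairs.
    Returns {label: {word: weight}}, where weight is the share of positive
    sentences containing the word minus the share of negative ones.
    """
    labels = set().union(*(lbls for _, lbls in examples))
    models = {}
    for label in labels:
        pos = [tokenize(s) for s, lbls in examples if label in lbls]
        neg = [tokenize(s) for s, lbls in examples if label not in lbls]
        pos_df = Counter(w for doc in pos for w in doc)
        neg_df = Counter(w for doc in neg for w in doc)
        models[label] = {
            w: pos_df[w] / max(len(pos), 1) - neg_df[w] / max(len(neg), 1)
            for w in set(pos_df) | set(neg_df)
        }
    return models


def predict_labels(models, sentence, threshold=0.0):
    """Return every label whose summed word weights exceed the threshold."""
    words = tokenize(sentence)
    return {
        label
        for label, weights in models.items()
        if sum(weights.get(w, 0.0) for w in words) > threshold
    }


# Hypothetical toy training data: a sentence may carry zero, one or several labels.
TRAINING = [
    ("passwords must be encrypted", {"security"}),
    ("the menu is hard to navigate", {"usability"}),
    ("encrypted login with a simple menu", {"security", "usability"}),
    ("the report loads slowly", set()),
]

if __name__ == "__main__":
    models = train_binary_relevance(TRAINING)
    print(predict_labels(models, "encrypted menu"))
```

Because each label is decided independently, a single sentence can end up with several tags or with none, which is the behaviour the tagging task requires.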

Former projects

Telemedicine-focused research (2013-2015)
Under the umbrella of the TOMI project (Telemedicine-focused research activities in the fields of Mathematics, Informatics and Medical sciences, TÁMOP-4.2.2.A-11/1/KONV-2012-0073), we developed computational linguistic tools for identifying early-stage Alzheimer's disease from transcripts.

futurICT.hu (2012-2014)
The project futurICT.hu (Infocommunicational technologies and the society of the future, TÁMOP-4.2.2.C-11/1/KONV-2012-0013) dedicated a subproject to developing and validating novel algorithms in the field of natural language processing. Our methods primarily focus on extracting information from huge textual and speech databases, with special emphasis on user-generated content (blogs, Facebook, tweets). We work on novel techniques for parsing morphologically rich languages, processing non-canonical texts (like tweets), summarizing by keywords, and detecting multiword expressions. We are going to demonstrate the value of these methods through three applications: opinion mining from tweets (going beyond positive/negative classification), intelligent communication via smartphones, and deep analysis of scholarly papers.

Criminal Information Extraction (2010-2013)
We developed an information extraction system for the Coordination Centre Against Organized Crime.

Nexum KarrierPortal (2013-2014)
We developed a high-recall CV parser and semantic candidate ranker for Nexum Ltd.

Maszeker (2009-2012)
The MASZEKER Project (TECH_08_A2/2-2008-0092) started in 2009 with the aim of developing a model-based semantic search system, primarily for English and Hungarian patents and folklore texts. In the first year, the members of the consortium selected the natural language parsers (POS-tagger for English, dependency parser, NE-recognizer) to be used in the search system and adapted them to the features of the subtasks and domains. During the second year of the project, our colleagues developed the prototype of the syntactic parser for English patents. They also created a morphological parser that exploits the harmonized MSD-KR coding system and a POS-tagger that is built on the output of that parser. Moreover, a dependency parser for Hungarian was also being implemented. In the last phase of the project, we concentrated primarily on the semantic processing of patents: a word sense disambiguation module was constructed and a semantic lexicon was built.

Belami (2008-2011)
Within the framework of the BELAMI project, in 2009, we focused on the text mining problems in Ambient Assisted Living applications. In a study, the research group identified the two most important related data-mining problems: the syntactic and semantic analysis of transcripts of audio material (speech recognition output), and the automatic collection of information from web sources. In 2010, emphasis was put on developing domain adaptation models. One of the basic assumptions in machine learning is that the data used in training and testing are drawn from the same distribution. During the training phase, the model captures the patterns and connections that are characteristic of each class, so if the trained model is applied to data from another distribution, the classifier's performance decreases to a great extent. For domain adaptation problems, we developed a novel algorithm with the core idea of model transformation. The machine learning algorithm implemented in this way was tested on synthetic databases and an opinion mining problem. By applying easily adaptable machine learning techniques, we provided adequate solutions to the text mining problems that hinder the development of Ambient Assisted Living applications.

Textrend (2007-2010)
The objective of the Textrend project was to develop a business and governmental decision support toolbox using trend- and text-analysis tools. Within the framework of the project, text analysis tools for English were developed and adapted for the Textrend toolbox. Integrability was achieved using the UIMA toolkit. During the last year of the project, we integrated the text processing tools developed for Hungarian into the toolbox and conducted research on automatic keyphrase assignment and topic monitoring of document sets, distance-based visualizations of tagsets, automatic keyphrase extraction and the extraction of persons' attributes from the web. The project concluded successfully in November 2010.

Examination of National and Ethnic Identity from narratives (2006-2008)
The project focused on the examination of historical narratives pertaining to traumatic events of the Hungarian past (Trianon, World War II, Holocaust, '56) with respect to historically changing identity construction strategies. Such examinations require the application, adaptation and development of natural language analysis and processing methods. Therefore, the objective of the project was to implement software that enables researchers to extract information and draw conclusions pertaining to group identity and inter-group relations in narrative texts.

Hungarian WordNet Ontology (2005-2007)
The objective of the project was to create the framework of a unified national ontology, which contains a freely available top concept set and a domain specific (telecommunication and information services) ontology. The network of concepts is founded on a freely available ontology infrastructure, with its own ontology management methodology, tool set, practical guidelines and the cooperative institutional system necessary for the maintenance of the framework. The developed ontology infrastructure can be utilised in many other fields of research and application in the future. This is due to the fact that any newly developed domain concept set can easily be concatenated to the developed top concept set.

Hungarian-English Machine Translation System (2005-2007)
The main aim of the project was to implement a Hungarian-English machine translation system. The system is based on the prototypes of three applications: (1) an example sentence translator, (2) software supporting free text comprehension, and (3) a form-filler translator. The system supports the filling of official forms and the translation of business letters into English, and facilitates the appearance of Hungarian companies on the international scene. The developed system is generally expected to enhance the country's international integration, to make EU development resources more available, to increase the competitiveness of certain economic operators on the international market, and to encourage the innovation activities of state-financed organisations.

Information Extraction for Genomic Research (2004-2006)
The main objective of the present project was to create and develop an independent system and software capable of three things: firstly, of processing data from scientific literature (Medline abstracts) linguistically; secondly, of extracting information from the processed texts; and thirdly, of identifying correlations by the utilisation of a graph-based, analytic representation of the extracted information. Questions raised by DNA-chip technology are a new challenge for bioinformatics these days. As opposed to static information stored in DNA databases, DNA-chip experiments provide information in large quantities on dynamic changes in the expression of thousands of genes. Our major objective was to utilise both sets of information in extracting new types of results and correlations, which is part of the new vistas that have opened up in the bioinformatics field of genomic research.

Unified Hungarian Ontology (2004-2006)
The objective of the project was to create the framework of a unified national ontology, which contains a freely available top concept set and a domain specific (telecommunication and information services) ontology. The network of concepts is founded on a freely available ontology infrastructure, with its own ontology management methodology, tool set, practical guidelines and the cooperative institutional system necessary for the maintenance of the framework. The developed ontology infrastructure can be utilised in many other fields of research and application in the future. This is due to the fact that any newly developed domain concept set can easily be concatenated to the developed top concept set.

Knowledge based semantic search system (2003-2004)
The objective of the project was to develop a knowledge-based Hungarian semantic search engine, which eliminates the shortcomings of state-of-the-art search technologies (typically reduced to operating with surface technologies) by enabling in-depth understanding of texts dealing with special subjects. The proposed technology has been implemented as part of an intelligent traumatological information system in order that the National Traumatology and Emergency Institute can actively benefit from it. The practical objective of the project was, therefore, to enable traumatologists and nurses to formulate their queries in free text, and to enable the system to answer these questions in the same form on the basis of documents available in the medical protocol and case repository.

Learning Full Hungarian Syntax (2002-2004)
The project's main objective was to develop a syntactic analysis method supported by machine learning algorithms. A further objective was to implement the method in the form of a program prototype. The development presupposed the existence of a syntactically fully analysed textual database, i.e. a treebank (see Szeged Treebank 2.0). In addition, the consortium endeavoured to develop a technology that is capable of recognising and managing special tokens and named entities (e.g. proper names, dates, figures, formulae, internet and e-mail addresses).

Information Extraction from Short Business News (2001-2003)
The project set three major objectives: firstly, to develop an information extraction technology for economic and business news; secondly, to implement the technology thus developed in the form of a program prototype; finally, (as a prerequisite for the former two points) to build a syntactically and semantically annotated database (see Szeged Treebank 1.0), which covers the given domain in the most representative way possible.

Hungarian POS tagging (2000-2002)
The main objective of the project was to develop an effective Hungarian POS tagging method and a program prototype implementing it. As a prerequisite, it was necessary to develop a suitably sized, morpho-syntactically annotated and disambiguated textual database (see Szeged Corpus 1.0 and 2.0). The corpus served as the training database for the machine learning algorithms that constituted the core of the automatic disambiguation method.

Selected publications

Csuvik V, Horváth D, Horváth F, Vidács L. 2020. Utilizing Source Code Embeddings to Identify Correct Patches. Proceedings of the Second International Workshop on Intelligent Bug Fixing (IBF 2020). :18-25.

Berend G. 2020. Massively Multilingual Sparse Word Representations. Eighth International Conference on Learning Representations - ICLR 2020. :1-16.

Tóth L, Nagy B, Gyimóthy T, Vidács L. 2020. Why Will My Question Be Closed? NLP-Based Pre-Submission Predictions of Question Closing Reasons on Stack Overflow. Proceedings of the 42nd International Conference on Software Engineering, NIER Track (ICSE 2020). :105-108.

Kicsi A, Csuvik V, Vidács L, Horváth F, Beszédes Á, Gyimóthy T, Kocsis F. 2019. Feature Analysis using Information Retrieval, Community Detection and Structural Analysis Methods in Product Line Adoption. Journal of Systems and Software. 155:70-90.

Tóth L, Nagy B, Janthó D, Vidács L, Gyimóthy T. 2019. Towards an Accurate Prediction of the Question Quality at Stack Overflow Using a Deep-Learning-Based NLP Approach. Proceedings of ICSOFT 2019, 14th International Conference on Software Technologies. :631-639.

Tóth L, Vidács L. 2018. Study of Various Classifiers for Identification and Classification of Non-Functional Requirements. Proceedings of the 18th International Conference on Computational Science and Its Applications (ICCSA 2018). 10964:492-503.

Kicsi A, Vidács L, Csuvik V, Horváth F, Beszédes Á, Kocsis F. 2018. Supporting Product Line Adoption by Combining Syntactic and Textual Feature Extraction. New Opportunities for Software Reuse - 17th International Conference on Software Reuse (ICSR 2018). :1-16.

Kicsi A, Tóth L, Vidács L. 2018. Exploring the Benefits of Utilizing Conceptual Information in Test-to-Code Traceability. Proceedings of the IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2018 @ ICSE).

Zsolt S, Alex S-N, István NT, Ádám C-K, Vincze V, Farkas R. 2018. Relevance Segmentation of Long Documents. XIV. Magyar Számítógépes Nyelvészeti Konferencia. :405-412.

Vincze V, Farkas R, Szántó Z, Simkó KI. 2017. Universal Dependencies and Morphology for Hungarian - and on the Price of Universality. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers. :356-365.

Berend G. 2016. Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling. CoRR. abs/1612.07130