Development of a Part-of-Speech Tagging Method for Hungarian by using Machine Learning Algorithms
Duration: 1 September 2000 - 30 June 2002
Funding: Ministry of Education
- University of Szeged, Department of Informatics, HLT Group (coordinator) [link]
- MorphoLogic Ltd. Budapest [link]
Brief project summary
POS tagging plays a central role in natural language processing (NLP) and is the starting point of all further language analysis. As in many other languages, Hungarian words can belong to more than one part of speech (e.g., the word "ég" means "sky" as a noun and "burn" as a verb). For natural language computer systems it is therefore essential to determine the part of speech of a word in its given context. For instance, an intelligent, web-based dictionary lookup program runs more efficiently on POS tagged text, since it can decide which translation of the word "ég" to offer to the user. To address this problem, the partners of the IKTA 27/2000 project developed a morpho-syntactically annotated and disambiguated Hungarian corpus. Having examined texts containing several million word entries, the consortium members found that in Hungarian every second word is morpho-syntactically ambiguous. This made it necessary to develop a technology capable of analyzing the words of an unknown text and tagging them with the appropriate POS tags. A further argument for implementing such a technology was that syntactic and semantic analysis of natural language texts is not viable without appropriate morpho-syntactic disambiguation.
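The dictionary lookup example above can be illustrated with a minimal sketch. The tiny lexicon, the tag labels and the function names below are invented for demonstration only and are not part of the project's software:

```python
# Minimal illustrative sketch (not project code): choosing a dictionary
# translation based on a word's part-of-speech tag.
TRANSLATIONS = {
    ("ég", "NOUN"): "sky",
    ("ég", "VERB"): "burn",
}

def translate(word: str, pos_tag: str) -> str:
    """Return the translation matching the word's POS tag, if any."""
    return TRANSLATIONS.get((word, pos_tag), "<no entry>")

# A POS-tagged sentence lets the lookup pick the right sense:
print(translate("ég", "VERB"))   # -> "burn"  (e.g. "A tűz ég." / "The fire is burning.")
print(translate("ég", "NOUN"))   # -> "sky"   (e.g. "Kék az ég." / "The sky is blue.")
```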
Preliminaries
The collection of special text files (corpora) started in the early 1970s in Hungary. These texts, however, were only sorted according to thematic principles; no linguistic analysis or morpho-syntactic annotation was carried out on them. The first significant change was brought about by the "MULTEXT" initiative, which created a general morpho-syntactic encoding scheme, the so-called MSD (Morpho-Syntactic Description) code set, applicable to several languages. Within the scope of the 1995-97 "MULTEXT-EAST" Copernicus Project, the encoding scheme was adapted to Central and Eastern European languages as well. The participants of the project built an annotated corpus of the above-mentioned languages to demonstrate the behavior and applicability of the MSD encoding scheme. The corpus - later known as the TELRI corpus - was compiled on the basis of George Orwell's novel 1984. Its texts were tagged with the appropriate morpho-syntactic (MSD) codes by manual annotation.
The roots of the current project go back to 1998, when certain members of the consortium started their research into the applicability of Inductive Logic Programming (ILP) methods in morpho-syntactic analysis. Experiments were carried out on the above-mentioned TELRI corpus within the scope of a European ESPRIT project. The Hungarian version of the corpus, however, did not fully implement the MSD encoding scheme; that is, it did not classify pronouns, adverbs, numerals and conjunctions. Because of these deficiencies, and because its size proved too small for further research, the present project was organized with the main goal of creating a suitably large training corpus for machine learning algorithms. The project partners aimed not only at increasing the amount of corpus text, but also at improving the quality of the annotation, meaning full conformity to the MSD encoding scheme and accurate manual tagging. The creation of a POS tagger prototype was also included among the goals of the project.
Szeged Corpus 1.0 - the foundations
When selecting texts for the corpus, the main criterion was that they should be thematically representative of different text types and thus derive from largely different genres. As a result, the corpus contains texts from five genres, each comprising circa 200,000 words. Naturally, this quantity (1 million word entries plus punctuation marks) is still insufficient to cover an entire written language, but owing to its variability it serves as a good training database for machine learning algorithms and as reference material for future research applications. The genres included in the first version of the Szeged Corpus are the following:
- fiction (selections from Jenő Rejtő's Piszkos Fred, a kapitány (Dirty Fred, the Captain), Antal Szerb's Utas és holdvilág (Journey by Moonlight), and George Orwell's 1984)
- short essays of 14-16-year-old students
- newspaper articles (excerpts from three daily papers and one weekly paper)
- computer-related texts (excerpts from Balázs Kis's Windows 2000 manual and from issues of the ComputerWorld Számítástechnika magazine)
- legal texts (excerpts from laws on economic enterprises and copyright)
Corpus annotation
After compiling the corpus, the consortium partners outlined the annotation strategy. In order to meet international standards, the Hungarian version of the MSD encoding scheme was selected for tagging the words. The corpus files are available in XML format [link], and their inner structure is described by the TEI (xLite and P4) DTD (Document Type Definition) [link].
After selecting the encoding scheme and determining the format of the corpus files, the first step was to divide the texts into units suitable for processing. The texts were divided into divisions (between <div> and </div> tags), where one division comprises a chapter of a novel, a newspaper article, a single short essay, etc.; paragraphs (marked by <p> and </p> tags); sentences (between <s> and </s>); word entries (identified by <w> and </w>); and punctuation marks (marked by <c> and </c> tags). The table below shows how many of each unit the corpus contains; a minimal parsing sketch follows the table.
Tag | Number of occurrences |
<div> | 3,365 |
<p> | 17,144 |
<s> | 68,932 |
<w> | 1,009,024 |
<c> | 203,005 |
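The following sketch shows how the element structure described above could be read programmatically. The file name is hypothetical, the script assumes un-namespaced element names, and real TEI-conformant files may use namespaces and additional attributes not shown here:

```python
# Illustrative sketch only: counting the <div>/<p>/<s>/<w>/<c> units in a
# corpus file with the structure described above.
import xml.etree.ElementTree as ET
from collections import Counter

def count_units(path: str) -> Counter:
    """Count structural units (divisions, paragraphs, sentences, words, punctuation)."""
    tree = ET.parse(path)
    counts = Counter()
    for elem in tree.iter():
        if elem.tag in {"div", "p", "s", "w", "c"}:
            counts[elem.tag] += 1
    return counts

if __name__ == "__main__":
    print(count_units("corpus_sample.xml"))  # hypothetical file name
```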
The next step of the process was morpho-syntactic parsing. Word entries were collected in a lexicon and morpho-syntactically pre-analyzed by the HuMor morphological parser (a product of MorphoLogic Ltd. Budapest). The lexicon contained 163,000 different word entries and 15,000 named entities, mainly proper nouns. The parser determined the possible morpho-syntactic labels of the lexicon entries. Since the HuMor software cannot produce all of the attributes relevant to the MSD encoding scheme, linguists had to check each entry of the lexicon manually and create a relatively large list of exceptions. Most of this work was based on the Hungarian Explanatory Dictionary (J. Juhász, I. Szőke, G. O. Nagy, M. Kovalovszky (eds.), Budapest, Akadémiai Press, 1972); however, the linguists had to rely on their intuition for a large number of neologisms. Finally, the whole text was re-parsed using the exception dictionary, and the morpho-syntactic labels were converted to comply with the MSD encoding scheme. As a result, the ambiguous version of the corpus was created.
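A schematic sketch of this pre-analysis step is given below. It is not the actual HuMor/MorphoLogic pipeline: the data structures, the example entries and the MSD codes are illustrative assumptions only.

```python
# Schematic pre-analysis: each word form is mapped to its set of candidate MSD
# codes, with a manually built exception dictionary overriding the automatic
# analysis. Entries and codes are illustrative, not taken from the real lexicon.
AUTOMATIC_ANALYSIS = {
    "ég": {"Nc-sn", "Vmip3s"},   # noun ("sky") vs. verb ("burns") reading
    "vár": {"Nc-sn", "Vmip3s"},  # noun ("castle") vs. verb ("waits") reading
}
EXCEPTIONS = {
    # corrections added by linguists where the automatic analysis was incomplete
}

def candidate_msd_codes(word: str) -> set[str]:
    """Return the (possibly ambiguous) set of candidate MSD codes for a word form."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return AUTOMATIC_ANALYSIS.get(word, {"X"})  # "X" = unknown word form

print(candidate_msd_codes("ég"))  # -> ambiguous entry with two candidate codes
```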
After the pre-processing described above, the entire corpus was manually disambiguated (POS tagged) by linguists. During POS tagging, the annotators selected from the set of possible labels the MSD label that best applied to the word entry in its context. A dedicated software tool (the so-called Tagging Assistant) was developed by the Department of Informatics to support manual annotation. Senior linguists and computer programs checked the quality and consistency of the annotators' work.
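One kind of automated check that can be run on such manually disambiguated data is sketched below; the actual checks used in the project are not described in detail here, and the function and data layout are assumptions for illustration.

```python
# Illustrative consistency check: the MSD code chosen by the annotator must be
# one of the candidate codes produced in pre-analysis.
def check_annotation(word: str, chosen: str, candidates: set[str]) -> bool:
    """Flag annotations whose chosen MSD code is not among the candidates."""
    ok = chosen in candidates
    if not ok:
        print(f"Inconsistent annotation for {word!r}: {chosen} not in {sorted(candidates)}")
    return ok

check_annotation("ég", "Nc-sn", {"Nc-sn", "Vmip3s"})   # consistent
check_annotation("ég", "Afp-sn", {"Nc-sn", "Vmip3s"})  # reported as inconsistent
```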
Application of machine learning algorithms
In previous studies, researchers of the consortium had investigated the applicability of several machine learning algorithms for learning POS tagging rules. The 1-million-word POS tagged corpus presented here proved to be a sufficiently large learning database for machine learning methods. The taggers are designed to process the learning database, generate learning tasks for the algorithms, analyze the output of the algorithms, and finally create accurate POS tagging rules. The researchers experimented with different kinds of POS tagging methods and compared their results in terms of accuracy. Brill's transformation-based learning method achieved 96.52% per-word accuracy when trained and tested on the corpus. The HMM-based TnT tagger achieved 96.18%, while the RGLearn rule-based tagger produced 94.54% accuracy. The researchers also experimented with combinations of the different learning methods in order to increase accuracy. The best result, delivered by the combined POS taggers, was 96.95%.
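One common way to combine taggers is simple per-token majority voting; the project's actual combination scheme may have differed, and the tagger outputs below are invented purely for illustration.

```python
# Minimal sketch of tagger combination by majority voting per token.
from collections import Counter

def combine_by_voting(predictions: list[list[str]]) -> list[str]:
    """Pick, for each token position, the tag proposed by the most taggers."""
    combined = []
    for tags_for_token in zip(*predictions):
        combined.append(Counter(tags_for_token).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three taggers on a three-token sentence:
brill = ["Nc-sn", "Vmip3s", "Afp-sn"]
tnt   = ["Nc-sn", "Vmip3s", "Nc-sn"]
rg    = ["Nc-sn", "Nc-sn",  "Afp-sn"]
print(combine_by_voting([brill, tnt, rg]))  # -> ['Nc-sn', 'Vmip3s', 'Afp-sn']
```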
The results clearly indicate that, in spite of the agglutinative nature of the Hungarian language and the structural differences between Hungarian and Indo-European languages, transformation-based, statistical and rule-based learning methods can all be applied effectively to the problem of POS tagging. The applied and developed methods use internationally accepted morpho-syntactic classes (the MSD coding system) and also leave room for forming language-specific classes. From a technical point of view, the POS taggers are discrete program modules that can be integrated into any application requiring linguistic analysis.
Program prototype
In the last phase of the project, the consortium developed a program prototype that performs POS tagging. The prototype utilizes the rules generated by the aforementioned methods. The program is capable of segmenting texts (into divisions, sentences, words, etc.), pre-analyzing them and disambiguating them. The developers tested the prototype on different text samples and evaluated the results. The prototype is available in the form of a Windows DLL, which allows it to be integrated into any Windows application. To demonstrate the usability of the developed disambiguation method, MorphoLogic Ltd. integrated the program module into its MoBiMouse dictionary program. The results of the project were also utilized in the consortium's subsequent NKFP 2/017/2001 project.
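The three stages mentioned above (segmentation, pre-analysis, disambiguation) can be outlined schematically as follows; this is not the actual DLL interface, and all function bodies are simplified placeholders.

```python
# Schematic outline of the prototype's processing stages, for illustration only.
import re

def segment(text: str) -> list[list[str]]:
    """Very rough sentence and token segmentation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def pre_analyse(token: str) -> set[str]:
    """Stand-in for morphological pre-analysis returning candidate MSD codes."""
    return {"Nc-sn", "Vmip3s"} if token == "ég" else {"X"}

def disambiguate(token: str, candidates: set[str]) -> str:
    """Stand-in for the learned tagging rules; here it just picks one candidate."""
    return sorted(candidates)[0]

for sentence in segment("Kék az ég. A tűz ég."):
    print([(tok, disambiguate(tok, pre_analyse(tok))) for tok in sentence])
```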