Information Extraction from Short Business News
Duration: 1 July 2001 - 31 July 2003
Funding: Ministry of Education
- MorphoLogic Ltd. Budapest (coordinator) [link]
- University of Szeged, Department of Informatics, HLT Group [link]
- Research Institute for Linguistics at HAS, Department of Corpus Linguistics [link]
Brief project summary
The goal of the project was to create a system able to perform information extraction (IE) from short business news. During the IE process, the textual data (natural language text) first has to be parsed for relevant information; the identified information then has to be extracted and stored in a pre-defined structure. It is important that the system disregard irrelevant information, and that the structured data can be easily managed and queried by automated means. To accomplish this goal, participants represented the most typical events of business life by so-called semantic frames. The recognition of semantic frames was supported by shallow syntactic parsing methods. Consortium members applied machine learning algorithms to determine shallow syntactic rules. The learning process was conducted on the Szeged Treebank 1.0, which already contained hierarchical noun phrase (NP) annotation and clause boundary marking.
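As an illustration only, the sketch below shows how such a pre-defined target structure might look in code: a hypothetical SaleFrame whose slots correspond to the kind of semantic roles used in the project (seller, buyer, product, price, date). The class name and the example values are invented, not taken from the project.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical target structure for one business event; slot names mirror
    # the kind of semantic roles used in the project, the values are invented.
    @dataclass
    class SaleFrame:
        seller: Optional[str] = None
        buyer: Optional[str] = None
        product: Optional[str] = None
        price: Optional[str] = None
        date: Optional[str] = None

        def is_complete(self) -> bool:
            # Irrelevant text simply leaves slots empty; only frames whose
            # core slots were filled are worth reporting.
            return all([self.seller, self.buyer, self.product])

    frame = SaleFrame(seller="Alpha Kft.", buyer="Beta Rt.",
                      product="25% stake", price="HUF 1.2 billion",
                      date="2002-05-03")
    print(frame.is_complete())  # True -> ready to be stored and queried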
Preliminaries
Results of the IKTA 27/2000 R&D project (Development of a Part-of-Speech Tagging Method for Hungarian by using Machine Learning Algorithms) were used in the project described here. Consortium members had to conduct basic and applied research on NP annotation and recognition, and on identifying the most relevant components of Hungarian sentences. This was necessary for defining the semantic frames that can describe a particular event in the short business news domain.
Szeged Corpus 2.0 - the extended version
Szeged Corpus 1.0, developed in the framework of the IKTA 27/2000 R&D project, was already at hand when the consortium started working on its new NLP project. However, the existing genres (a selection of fiction, short essays by 14-16-year-old students, newspaper articles, scientific and legal texts) did not fully cover the topic area selected for the new project. Therefore, it was necessary to extend the corpus with a section of short business news. The consortium added an extra 200-thousand-word text sample of business-related articles to the existing 1 million words. The source of the short news was the archive of the Hungarian News Agency.
Like the texts already included in the corpus, the newly added 200 thousand words were morpho-syntactically annotated and POS tagged, resulting in the second version of the Szeged Corpus. This database served as the experimental data set for the information extraction system to be built in the project.
From corpus to treebank: the creation of the Szeged Treebank 1.0 and the semantic lexicon
To be able to perform automated IE, it was necessary to add new features to the existing texts by extending their annotation with syntactic and semantic information. The task, therefore, was to attach structural and content-related features to words, word structures, partial sentence structures, grammatical relations, etc.
IE systems do not aim at the detailed analysis and full understanding of texts; they focus on and analyze only the relevant items. Therefore, there was a need for an analysis technology able to locate, extract and organize these sparse items and define them formally. This selective kind of analysis is supported by shallow parsing, in which sentences are parsed only for certain structures, elements or types of phrases. Effective shallow parsing had to be preceded by thorough research on the syntax of Hungarian sentences and on rules covering the recognition of phrases. The research showed that in Hungarian, nominal structures typically bear the most significant meaning (semantic content) within a sentence. For that reason, the consortium decided to mark noun phrases (NPs) first. Annotation was carried out manually on the entire Szeged Corpus 2.0, on automatically pre-parsed sentences. Pre-parsing was completed with the help of the CLaRK programme, in which syntactic rules for the recognition of NPs were defined by linguistic experts. Since the CLaRK parser did not fully cover the occurring NP structures (its coverage was around 70%), manual validation and correction could not be avoided. In total, 250 thousand highest-level NPs were found, and the deepest NP structure contained 9 NPs embedded into each other; the majority of the hierarchical NP structures were between 1 and 3 NPs deep.
As a continuation of shallow parsing, the clause structure (CP) of the corpus sentences was marked. Labelling clauses followed the same approach as the earlier annotation phases: automatic pre-annotation followed by manual correction and supplementation. With this, the first - shallow parsed - version of the treebank was created. Files are stored in XML [link], and their inner structure is described by the TEI (xLite or P4) DTD [link] scheme.
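The snippet below is a minimal sketch of how such a shallow-parsed sentence could be inspected for highest-level and embedded NPs. The element names (s, NP, w) and the example sentence are simplifications introduced for illustration; they do not reproduce the actual TEI DTD layout.

    import xml.etree.ElementTree as ET

    # Simplified stand-in for one shallow-parsed treebank sentence; the real
    # files follow the TEI (xLite/P4) DTD, so these element names are assumptions.
    sentence_xml = """
    <s>
      <NP><w>A</w><NP><w>budapesti</w><w>cég</w></NP></NP>
      <w>megvásárolta</w>
      <NP><w>a</w><w>gyárat</w></NP>
      <w>.</w>
    </s>
    """

    def np_depth(np):
        """Depth of the deepest chain of NPs embedded in this NP (>= 1)."""
        children = np.findall("NP")
        return 1 + max((np_depth(c) for c in children), default=0)

    root = ET.fromstring(sentence_xml)
    # Highest-level NPs are the direct NP children of the sentence element.
    for np in root.findall("NP"):
        words = " ".join(w.text for w in np.iter("w"))
        print(f"NP (depth {np_depth(np)}): {words}")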
To store semantic information related to word meanings, the consortium chose to apply a basic lexicon format, due to the lack of Hungarian ontologies. The complete identification of a word's "meaning" presupposes the detailed representation of human knowledge, which could not be accomplished in the framework of a 2-year project. Instead, different concepts, i.e. possible semantic roles, were associated with each word and stored in a lexicon along with its morpho-syntactic and lexico-semantic features. Roles were defined only for the short business news topic area, e.g. SELLER, BUYER, PRODUCT, PRICE, DATE, etc. Manual annotation was completed for the short business news section of the corpus. The relations between the different semantic roles were represented by so-called semantic frames. Possible frames were defined manually by linguistic experts. These semantic frames allowed mapping between the lexical representation and the semantic role of a word.
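A minimal sketch of the lexicon-plus-frames idea is given below; the lexicon entries and the frame inventory are invented examples, not the project's actual resources.

    # Minimal sketch of the lexicon-plus-frames idea; entries are invented
    # examples, not the project's actual resources.
    SEMANTIC_LEXICON = {
        # lemma       possible semantic roles in the business-news domain
        "vállalat":  {"SELLER", "BUYER"},   # 'company'
        "részvény":  {"PRODUCT"},           # 'share'
        "forint":    {"PRICE"},             # currency word -> price expression
        "január":    {"DATE"},              # 'January'
    }

    # A frame relates roles to each other; here only the slot inventory is kept.
    FRAMES = {
        "SALE": {"required": {"SELLER", "BUYER", "PRODUCT"},
                 "optional": {"PRICE", "DATE"}},
    }

    def candidate_roles(lemma):
        """Look up the possible semantic roles of a word, if any."""
        return SEMANTIC_LEXICON.get(lemma, set())

    print(candidate_roles("részvény"))  # {'PRODUCT'}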
The resulting shallow parsed and semantically annotated database - forming the Szeged Treebank 1.0 - was sufficient to serve as a learning base for training an information extraction application.
Application of machine learning algorithms
In previous studies, consortium members investigated the applicability of machine learning algorithms for learning NP recognition rules and rules for mapping semantic frames. The 200-thousand-word business news section of the Szeged Treebank 1.0, described above, served as the learning database for the machine learning methods. Members experimented with different kinds of learning algorithms and compared their results based on the accuracy of the generated rules. Experience showed that the ILP-based C4.5 algorithm, maximum entropy models and language modelling methods are equally capable of accurate recognition.
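For illustration, the following sketch sets up NP recognition as an IOB chunk-tagging task and trains a decision tree on toy data with scikit-learn. It only mirrors the general learning setup; the actual learners, features and training data of the project are not reproduced here.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Toy training data: each token is described by simple features and labelled
    # with an IOB chunk tag (B-NP / I-NP / O). The project learned its rules from
    # the 200-thousand-word business-news section of the treebank instead.
    train = [
        ({"pos": "DET",  "prev_pos": "BOS"},  "B-NP"),
        ({"pos": "ADJ",  "prev_pos": "DET"},  "I-NP"),
        ({"pos": "NOUN", "prev_pos": "ADJ"},  "I-NP"),
        ({"pos": "VERB", "prev_pos": "NOUN"}, "O"),
        ({"pos": "DET",  "prev_pos": "VERB"}, "B-NP"),
        ({"pos": "NOUN", "prev_pos": "DET"},  "I-NP"),
    ]
    X, y = zip(*train)

    # A plain decision tree stands in for the rule learners used in the project.
    chunker = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
    chunker.fit(X, y)

    print(chunker.predict([{"pos": "ADJ", "prev_pos": "DET"}]))  # ['I-NP']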
Combining the NP recognition rules learned from the annotated corpus with the manually defined expert rules, researchers developed an automated NP recognition tool. Semantic mapping rules were also acquired by machine learning algorithms, using the manually annotated semantic roles as their learning source.
The trained mapping tool takes a morpho-syntactically annotated and shallow-parsed piece of text and performs two operations. In the first step, it processes the NPs already identified by the NP recognition tool and assigns semantic roles to them using the semantic lexicon described previously. The second operation determines the relationships between the roles, i.e. maps semantic frames onto the existing structures. Semantic mapping is realized by simple pattern-matching methods using the frames previously defined by experts. Based on the results of these operations, the mapping tool builds a semantic representation of the input text that already contains the required information.
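The two operations can be pictured with the rough sketch below; the lexicon, the frame definition and the helper functions are invented for illustration and do not reflect the tool's real implementation.

    # Rough sketch of the two mapping steps; lexicon, frame and helpers are
    # invented for illustration.
    LEXICON = {"eladó": {"SELLER"}, "vevő": {"BUYER"}, "üzletrész": {"PRODUCT"}}
    SALE_FRAME = {"SELLER", "BUYER", "PRODUCT"}        # required roles only

    def assign_roles(nps):
        """Step 1: attach candidate semantic roles to each recognized NP."""
        return [(np, LEXICON.get(head, set())) for np, head in nps]

    def match_frame(role_assignments):
        """Step 2: simple pattern matching -- is every required slot filled?"""
        filled = {}
        for np, roles in role_assignments:
            for role in roles:
                filled.setdefault(role, np)
        return filled if SALE_FRAME <= filled.keys() else None

    # NPs arrive as (surface form, lexical head) pairs from the NP recognizer.
    nps = [("az eladó", "eladó"), ("a vevő", "vevő"), ("az üzletrészt", "üzletrész")]
    print(match_frame(assign_roles(nps)))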
Program prototype
As a final step of the project, an IE program prototype called NewsPro was developed by the consortium. NewsPro performs morpho-syntactic annotation, shallow syntactic parsing and semantic structure identification on the input document, and extracts the relevant information from the available data. The program was implemented as a sequential, modular, bottom-up analysis pipeline. Its first main module, HumorEsk (a product of MorphoLogic Ltd, Budapest), performs morpho-syntactic annotation and shallow syntactic parsing, including NP recognition and the identification of verbs and their argument structure. (HumorEsk can also be used as a stand-alone natural language analysis tool.) The second module is responsible for the identification of semantic roles and the mapping of semantic frames.
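The overall module structure can be sketched roughly as follows; the stub functions merely stand in for HumorEsk and the semantic mapper, whose real interfaces are not described here, and the XML layout of the output is invented.

    import xml.etree.ElementTree as ET

    # Illustrative skeleton of the two-module, bottom-up pipeline; the stubs
    # stand in for HumorEsk and the semantic mapper, the XML layout is invented.
    def shallow_parse(text):
        """Module 1 stub: morpho-syntactic annotation and shallow parsing."""
        return {"text": text, "nps": ["az Alpha Kft.", "a Beta Rt. üzletrészét"]}

    def map_semantics(parsed):
        """Module 2 stub: semantic role assignment and frame mapping."""
        return {"frame": "SALE", "BUYER": parsed["nps"][0], "PRODUCT": parsed["nps"][1]}

    def extract(text):
        """Run both modules and serialize the extracted event as XML."""
        result = map_semantics(shallow_parse(text))
        root = ET.Element("event", attrib={"frame": result.pop("frame")})
        for role, value in result.items():
            ET.SubElement(root, "slot", attrib={"role": role}).text = value
        return ET.tostring(root, encoding="unicode")

    print(extract("Az Alpha Kft. megvásárolta a Beta Rt. üzletrészét."))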
The prototype is available as a static Win32 library and represents the result of the analysis process in an XML structure. It is able to process large amounts of text in the short business news domain, and the extracted information can be arranged in a structured database for further analysis. The results produced by the program prototype were tested against the manually annotated corpus.