Natural Language Processing, Information Extraction and Graph-based Analytic Infrastructure for the Support of Genomic Research
Duration: 1 December 2004 - 30 November 2006
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
- Southern Great Plain Bio-innovation Centre, Genomic Centre (coordinator) [link]
- University of Szeged, Department of Informatics, HLT Group [link]
- Data Explorer Ltd. [link]
- University of Szeged, Gyula Juhász Teacher Training College, Department of Computer Science [link]
- Szeged Biological Research Centre of HAS [link]
Brief project summary
The issues raised by DNA-chip technologies have recently become a major challenge for bioinformatics. In contrast to the static information stored in DNA databases, DNA-chip experiments yield vast amounts of information on the dynamic changes in the expression of several thousand genes. Our overall objective is to use both types of information source to retrieve new kinds of findings and interrelations that fall within the new perspectives of the bioinformatics branch of genomic research. Accordingly, the main objective of the present project is to develop an application that linguistically processes textual data (abstracts of scientific publications in the Internet-based MEDLINE database), extracts information from the abstracts, and then, using a graph-based analytic infrastructure over the experimental and literature results, identifies new relationships and systems of interconnections.
The system is intended to support genetic research; that is, the extracted information relates to genes, inter-gene relations and gene function. With the system, researchers can query the occurrence of specific genes or gene groups in publications. Drawing on a pre-built (but continuously updated) database, the system explores and displays the graph of inter-gene relations for the genes in the query. Two genes are considered related if they appear in the same publication, and the strength of the relation is given by the number of such publications. The system visualizes not only the interrelations of the queried genes but also their connections to other genes. Use of the system is interactive: the user can narrow or widen the scope of the search and can add new genes to the displayed connections.
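The co-occurrence relation described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the dictionary layout and the toy publication data are all assumptions made for illustration. The sketch counts, for every pair of genes, the number of publications that mention both, and then lists the edges that touch a queried gene.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(publications):
    """Map each unordered gene pair to the number of publications mentioning both."""
    edge_weights = Counter()
    for genes in publications.values():
        # Deduplicate and sort so each pair is counted once per publication
        # and always appears in the same (alphabetical) order.
        for a, b in combinations(sorted(set(genes)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


def neighbours(edge_weights, query_genes, min_weight=1):
    """Return the edges touching any queried gene, strongest relation first."""
    query = set(query_genes)
    hits = [
        (pair, weight)
        for pair, weight in edge_weights.items()
        if weight >= min_weight and (pair[0] in query or pair[1] in query)
    ]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: publication id -> gene symbols found in its abstract.
publications = {
    "pub1": ["TP53", "MDM2", "BRCA1"],
    "pub2": ["TP53", "MDM2"],
    "pub3": ["BRCA1", "BRCA2"],
}

graph = build_cooccurrence_graph(publications)
result = neighbours(graph, ["TP53"])
```

Raising `min_weight` narrows the displayed graph to stronger relations, and adding genes to `query_genes` widens it, which loosely mirrors the interactive narrowing and widening of the search scope.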
For bioinformatics, a young branch of science, the quantitative analysis of such diverse information is a huge task, since recognising novel data patterns requires handling all the available data, IT resources and the content of all literature databases of genomic research. This calls for modern analytic techniques (pattern recognition, data mining).
The aim of computerized text processing is to understand written text. Recent international research investigates how implementing an artificial conceptual structure (a language ontology) and then mapping the concepts in a sentence to this ontology might improve the understanding of the text. This requires adequate preliminary sentence analysis. The members of the consortium aimed to implement the steps of the required linguistic analysis and content processing in the different work phases.
Beyond automated information extraction, exploring new implications and connections requires the most accurate possible description of the interrelationships among the data. To achieve this, we developed data-mining and data-visualization methods and integrated them into a software environment. For the efficient implementation of these methods, we built a so-called graph-based analytic infrastructure.
The end product of the project is a system that could considerably reduce the costs of pharmaceutical research. Based on experience gained in DNA-chip research, we estimate that such a system might reduce these costs by up to 10%, since the number of experiments can be substantially reduced and the time needed to obtain basic information can be greatly shortened.
Preliminaries
The HLT Group has been successfully applying machine learning algorithms to solve linguistic problems in an automated way, and therefore employed this technology in this project as well. The previous national R&D projects directly connected to the present work are the IKTA 37/2002 project (Machine Learning of Syntax Rules) and the NKFP 2/017/2001 project (Information Extraction from Short Business News). These projects resulted in a general-purpose syntactic parser and a domain-specific information extraction system.
Several preliminary studies have also been carried out in the field of biotechnology, including the BIO-0005/2000 project, which examined the use and effect of DNA chips in gene-expression monitoring of melanomas, and the biotechnology project BIO-0006/2002 on the development and chemical-genomic use of the Ligand-chip, which examined the effects of chemicals on gene expression.
Results accomplished
The main aspects of the linguistic work included the creation of an English technical terminology database and a gene-expression data store, and the design and implementation of a syntactic parser built on them. The English syntactic parser is designed to divide texts into sentences and words, recognise the syntactic units of each sentence and, based on the relationships among them, build the syntactic tree of the sentence.
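The first step of such a parser, dividing text into sentences and words, can be sketched as follows. This is a deliberately naive, regex-based illustration and not the project's parser: the patterns and function names are assumptions, abbreviations such as "e.g." would be split incorrectly, and the later steps (recognising syntactic units and building the tree) are not covered.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is a run of word characters (optionally joined by hyphens)
# or a single punctuation character.
_TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at the naive boundaries defined above."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


# Hypothetical example text, not taken from the project's corpus.
text = "BRCA1 interacts with TP53. The p53-dependent pathway is activated."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Keeping hyphen-joined words such as "p53-dependent" as single tokens is a design choice made here because biomedical terms are often hyphenated; a real system would more likely rely on the terminology database to decide token boundaries.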
Another important preliminary task was the implementation of a genomic ontology (conceptual network). This is indispensable for enhancing the efficiency of the end product of the project, i.e. the information extraction mechanism to be built. The genomic ontology organises the concepts of the field into a manageable system on the basis of pre-defined relationships. This conceptual network helps to implement the semantic analysis that supplements the syntactic analysis.