Information extraction from biological publications

Achievements in biological research worldwide are recorded in patents and publications; the MedLine database alone currently contains over 12 million publications. This exponentially growing set of documents contains a great deal of useful information, which is, however, "hidden" in the text. The objective of computational linguistics (text mining) is to extract this information automatically.

The target information usually consists of biological entities (genes, proteins, DNA, etc.) and the relationships among them. Our system for biomolecular event extraction participated in the BioNLP'09 Shared Task on Event Extraction.

Detecting uncertain and negative assertions is essential in most text mining tasks, where the general aim is to derive factual knowledge from textual data. This is especially true in the biomedical (medical and biological) domain, where these language forms are used extensively to express impressions, hypothesised explanations of experimental results, or negative findings. In biological interaction extraction, the aim is to mine textual evidence for biological entities that stand in certain relations to each other. An uncertain relation, or the non-existence of a relation, may also be of interest to an end user, but such information must not be confused with real textual evidence. To support the recognition of speculative and negated content in biological papers and medical documents, we constructed the BioScope corpus, which is also the training corpus of the CoNLL-2010 Shared Task ("Learning to detect hedges and their scope in natural language text").
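The distinction between factual, speculative and negated statements can be illustrated with a minimal Python sketch. It is not the method used to build or exploit the BioScope corpus: it only looks up words in two small cue lists, which are hypothetical and chosen for illustration, and it does not attempt to determine the scope of a cue. The function names are likewise assumptions.

```python
import re

# Hypothetical cue lists for illustration only; they are not taken from
# the BioScope annotation guidelines.
SPECULATION_CUES = {"may", "might", "suggest", "suggests", "possibly", "putative"}
NEGATION_CUES = {"not", "no", "without", "neither", "absence"}


def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def classify_assertion(sentence):
    """Return a pair (is_speculative, is_negated) based on cue lookup.

    A real system would also resolve the scope of each cue; this sketch
    flags the whole sentence as soon as any cue word occurs in it.
    """
    tokens = set(tokenize(sentence))
    return bool(tokens & SPECULATION_CUES), bool(tokens & NEGATION_CUES)


def is_factual(sentence):
    """True if the sentence contains neither a speculation nor a negation cue."""
    speculative, negated = classify_assertion(sentence)
    return not (speculative or negated)


if __name__ == "__main__":
    for s in [
        "Protein A binds protein B.",
        "These results suggest that A may regulate B.",
        "A does not interact with B.",
    ]:
        print(s, classify_assertion(s))
```

An interaction extraction pipeline could use a filter such as `is_factual` to keep speculative or negated sentences apart from real textual evidence. Word-level lookup of this kind over-flags sentences in which the cue does not cover the relation of interest, which is why scope annotation matters.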

For reliable information extraction (IE), disambiguation of biological and other terminology is the primary preprocessing step. Its importance comes from the fact that specialist fields and communities adopt certain words as technical terms with a specific sense. For example, research teams use the human gene name MMP-25 to refer to three different genes. The disambiguation of such expressions, and the recognition of linguistic forms and attitudes such as speculation (conditional), negation, and past or future reference, are therefore of key importance for efficient processing and IE applications, since the very objective of IE is to collect facts and data from textual documents.

References

Richárd Farkas: The strength of co-authorship in gene name disambiguation. BMC Bioinformatics 2008, 9:69.
György Szarvas, Veronika Vincze, Richárd Farkas, György Móra and János Csirik: The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts. BMC Bioinformatics 2008, 9(Suppl 11):S9.
György Móra, Richárd Farkas, György Szarvas and Zsolt Molnár: Exploring ways beyond the simple supervised learning approach for biological event extraction. In: Proceedings of BioNLP 2009 (NAACL workshop).