Clinical Information Extraction


Medical institutions usually store a considerable amount of valuable patient information as free text. Such information has great potential to aid disease research and to improve the quality of medical care. Given the size of these document repositories, automated processing in a cost-efficient and timely manner is increasingly important. The intelligent processing of clinical texts is the main goal of Natural Language Processing for the medical domain.

Automatic coding of radiological findings

An open competition organized by American hospitals and research institutes in the spring of 2007 aimed at automatically tagging radiological findings with ICD-9-CM codes (a coding system used for billing, consistent with the International Classification of Diseases, ICD).
The importance of developing computational methods that make automatic classification of findings possible is well illustrated by the fact that in the United States, 25 billion dollars are spent yearly on coding textual documents of the medical domain and on correcting coding errors.
The submitted systems suggest that efficient, almost human-level processing of clinical documents is within reach of the tools available today. The Group achieved first place in this competition.
In 2008, a similar shared task focused on developing automatic systems that analyzed clinical discharge summaries and addressed the following question: "Who's obese and what co-morbidities do they (definitely/most likely) have?". Target diseases included obesity and its 15 most frequent comorbidities, while the target labels corresponded to expert judgments based on textual evidence and on intuition (given separately). Our system placed second in this challenge.




To use case histories in data mining tasks, it is crucial to protect personal data. For this purpose, names of persons (doctor, patient), phone numbers, addresses, names of hospitals, etc. have to be anonymized before a medical database is published. A consortium of American hospitals and research institutes launched an open, international competition for this task, and the Group's general NER system, customized for medical texts, handled it successfully. The Group's statistical NER system makes use of several word-level features, such as regular expressions, word form features, the context of relevant words, frequency lists, etc. On the basis of these features, various statistical machine learning methods were tested and applied, with which the precision of our model reached 99.75%. In this competition, the Group achieved second place.
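The kinds of word-level features listed above can be illustrated with a minimal Python sketch. It is not the Group's actual feature set: the patterns, the small word lists, and the names `word_shape` and `token_features` are all hypothetical, chosen only to show how regular expressions, word form, context, and list lookups might be turned into features for a statistical tagger.

```python
import re

# Hypothetical patterns; the text only states that regular expressions were used.
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}(/\d{2,4})?$")

# Hypothetical trigger words and a toy stand-in for a frequency list.
TITLE_TRIGGERS = {"dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}
COMMON_WORDS = {"the", "a", "of", "and", "patient", "called", "was"}


def word_shape(token):
    """Collapse a token to a coarse shape, e.g. 'Smith' -> 'Xx', '555-1234' -> 'd-d'."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        # Merge runs of the same character class into one symbol.
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {
        "lower": token.lower(),                         # word form
        "shape": word_shape(token),                     # word form
        "is_capitalized": token[:1].isupper(),          # word form
        "looks_like_phone": bool(PHONE_RE.match(token)),  # regular expression
        "looks_like_date": bool(DATE_RE.match(token)),    # regular expression
        "prev": prev_tok,                               # context
        "next": next_tok,                               # context
        "prev_is_title": prev_tok in TITLE_TRIGGERS,    # context trigger word
        "is_common_word": token.lower() in COMMON_WORDS,  # list lookup
    }
```

Feature dictionaries of this kind could then be passed to any statistical sequence labeller or classifier; which learning method works best is an empirical question that the sketch does not address.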


Smoking status identification

Anonymized databases make it possible for researchers to compile statistics and analyses about patients and their illnesses. Useful statistics of this kind include comparing patients' smoking habits, e.g. with regard to a given illness, and mapping the effects of smoking. Final reports usually mention patients' addictions insofar as they are revealed in the course of examinations or may be related to the patient's complaints.
Since addictions such as smoking or alcohol consumption are usually recorded in the running text of the report, automatic identification of smoking status is a good test of how efficiently facts and usable, structured information can be extracted from hospital documents.
American hospitals and research institutes announced a competition for this task in the summer of 2006, in which reports were to be classified into one of the categories below:

  • unknown: the document contains no information on the patient's smoking habits
  • non-smoker
  • active smoker: currently smoking, or has quit within the past year
  • past-smoker: has not smoked for at least a year
  • smoker: the document does not make clear whether the patient is an active smoker or an ex-smoker.
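As an illustration of this label set, the following Python sketch maps a report to one of the five categories with a few keyword rules. It is only a toy baseline under stated assumptions, not the Group's classification system: the cue phrases and the names `SmokingStatus`, `RULES`, and `classify` are hypothetical, and the one-year boundary between active and past smokers is not modelled.

```python
import re
from enum import Enum


class SmokingStatus(Enum):
    UNKNOWN = "unknown"
    NON_SMOKER = "non-smoker"
    ACTIVE_SMOKER = "active smoker"
    PAST_SMOKER = "past-smoker"
    SMOKER = "smoker"


# Hypothetical cue phrases, ordered from most specific to least specific.
# Note: the rules ignore how long ago the patient quit, so the
# "quit within the past year" case of the active category is not handled.
RULES = [
    (SmokingStatus.NON_SMOKER,
     re.compile(r"\b(non-?smoker|never smoked|denies (smoking|tobacco)|does not smoke)\b", re.I)),
    (SmokingStatus.PAST_SMOKER,
     re.compile(r"\b(ex-?smoker|former smoker|quit smoking|stopped smoking)\b", re.I)),
    (SmokingStatus.ACTIVE_SMOKER,
     re.compile(r"\b(current(ly)? (smoker|smokes|smoking)|continues to smoke)\b", re.I)),
    # Any remaining mention of smoking: status cannot be narrowed down.
    (SmokingStatus.SMOKER,
     re.compile(r"\b(smok\w*|tobacco)\b", re.I)),
]


def classify(report):
    """Return the label of the first rule whose pattern occurs in the report."""
    for status, pattern in RULES:
        if pattern.search(report):
            return status
    # No smoking-related cue found at all.
    return SmokingStatus.UNKNOWN
```

For example, `classify("She quit smoking five years ago.")` returns `SmokingStatus.PAST_SMOKER`, while a report with no smoking-related wording falls through to `SmokingStatus.UNKNOWN`. A statistical classifier trained on annotated reports would replace the hand-written rules in a realistic system.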

At this competition, the Group's classification system achieved an outstanding result.