CoNLL-2010 Shared Task

Introduction

The CoNLL-2010 Shared Task aimed at identifying hedges and their scope in natural language texts. 23 teams from all over the world participated in the shared task, and 22, 16 and 13 teams submitted output for Task 1 (biological papers), Task 1 (Wikipedia) and Task 2, respectively.

In Natural Language Processing (NLP), and in particular in Information Extraction (IE), many applications aim at extracting factual information from text. In order to distinguish facts from unreliable or uncertain information, linguistic devices such as hedges (indicating that authors do not or cannot back up their opinions/statements with facts) have to be identified. Applications should therefore handle detected speculative parts in a different manner.
Hedge detection has recently received considerable interest in the biomedical NLP community, including research papers addressing the detection of hedge devices in biomedical texts and some recent work on detecting the in-sentence scope of hedge cues. Exploiting the hedge-scope-annotated BioScope corpus and publicly available Wikipedia weasel annotations, the goals of the Shared Task were:

Task 1: Detecting uncertain information

The aim of this task is to identify sentences in texts which contain unreliable or uncertain information. In particular, the task is a binary classification problem, i.e. distinguishing factual from uncertain sentences.
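To illustrate the sentence-level decision, the following minimal Python sketch labels a sentence as uncertain whenever it contains a hedge cue from a purely hypothetical cue list; real systems are expected to learn such cues from the training data described below.

  # Minimal sketch of the binary decision for Task 1: a sentence is labeled
  # "uncertain" if it contains a known hedge cue, otherwise "certain".
  # The cue list is illustrative only, not part of the task definition.

  ILLUSTRATIVE_CUES = {"may", "might", "suggest", "possibly", "appears to"}

  def classify_sentence(sentence, cues=ILLUSTRATIVE_CUES):
      """Return 'uncertain' if any cue phrase occurs in the sentence, else 'certain'."""
      lowered = sentence.lower()
      # Naive substring matching; a real system would at least tokenize.
      return "uncertain" if any(cue in lowered for cue in cues) else "certain"

  print(classify_sentence("These results suggest that the protein may be involved."))  # uncertain
  print(classify_sentence("The protein binds DNA."))                                   # certain
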
As training data

  • biological abstracts and full articles from the BioScope (biomedical domain) corpus and
  • paragraphs from Wikipedia possibly containing weasel information

are provided. Both types of texts were annotated manually for hedge/weasel cues by two independent linguists; differences between the two annotations were later resolved by a third annotator. The annotation of weasel/hedge cues was carried out on the phrase level: sentences containing at least one cue are considered uncertain, while sentences with no cues are considered factual.

Since uncertainty cues play an important role in detecting sentences containing uncertainty, they are tagged in the training data to enhance training. However, they will not be given in the evaluation dataset, since cue tagging in submissions is not mandatory (although we encourage participants to provide it).
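As an illustration of how a cue-tagged training sentence could be consumed, the sketch below assumes an XML layout with a per-sentence certainty attribute and ccue elements marking the cue phrases; the exact element and attribute names are assumptions based on this description, not a specification of the released format.

  # Reading one Task 1 training sentence (illustrative format only):
  # the sentence carries a "certainty" attribute and <ccue> elements mark the cues.
  import xml.etree.ElementTree as ET

  EXAMPLE = (
      '<sentence id="S1" certainty="uncertain">'
      'These findings <ccue>suggest</ccue> that the two proteins '
      '<ccue>may</ccue> interact.'
      '</sentence>'
  )

  def read_sentence(xml_string):
      """Return (gold label, list of cue phrases) for one annotated sentence."""
      node = ET.fromstring(xml_string)
      label = node.get("certainty")                   # "certain" or "uncertain"
      cues = [cue.text for cue in node.iter("ccue")]  # phrase-level cue annotations
      return label, cues

  print(read_sentence(EXAMPLE))  # ('uncertain', ['suggest', 'may'])
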
Additionally, unannotated (but pre-processed) paragraphs from Wikipedia are offered as well. These data do not contain any annotation for weasel cues or uncertainty, but they enable sampling from a large pool of Wikipedia articles. Since the evaluation will be partly carried out on Wikipedia paragraphs, the exploitation of raw Wikipedia texts other than those offered here is PROHIBITED when training the systems.
Evaluation will be carried out on the sentence level, i.e. whether a sentence contains hedge/weasel information or not (the F-measure of the uncertain class will be employed as the chief evaluation metric). In the submitted system outputs, we expect the certainty attribute of each sentence to be filled; the official evaluation will be based on these certainty attribute values (sentence-level evaluation). Providing ccue tags for Task 1 as in the training data (i.e. the linguistic evidence supporting the sentence-level decision) is NOT mandatory; however, we will evaluate them for those who submit them. This will be used for information only, and the official ranking will be based on the sentence-level F-measure of the "uncertain" class.
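The chief metric can be sketched as follows (a simplified re-implementation of the definition above, not the official scorer): precision, recall and F-measure are computed for the uncertain class over the per-sentence labels.

  # Sentence-level evaluation sketch: F-measure of the "uncertain" class.
  def uncertain_f_measure(gold, predicted):
      """gold, predicted: equal-length lists of 'certain'/'uncertain' labels."""
      tp = sum(g == p == "uncertain" for g, p in zip(gold, predicted))
      fp = sum(g == "certain" and p == "uncertain" for g, p in zip(gold, predicted))
      fn = sum(g == "uncertain" and p == "certain" for g, p in zip(gold, predicted))
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
      return precision, recall, f

  gold = ["uncertain", "certain", "uncertain", "certain"]
  pred = ["uncertain", "uncertain", "certain", "certain"]
  print(uncertain_f_measure(gold, pred))  # (0.5, 0.5, 0.5)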

Evaluation will be carried out

  • in-domain, where only the labeled data provided by us and ANY unlabeled data are allowed to be used, separately for each domain (i.e. biomedical training data for the biomedical test set and Wikipedia training data for the Wikipedia test set). No manually crafted resources of uncertainty information (e.g. lists, annotated data, etc.) may be used in either domain. On the other hand, tools exploiting manual annotation of linguistic phenomena not related to uncertainty (such as POS taggers, parsers, etc. trained on labeled corpora) are allowed.
  • cross-domain, where only the labeled data provided by us and ANY unlabeled data are allowed to be used, for both domains (i.e. Wikipedia training data for the biomedical test set, biomedical training data for the Wikipedia test set, or the union of the Wikipedia and biomedical training data for both test sets). No manually crafted resources of uncertainty information (e.g. lists, annotated data, etc.) may be used in either domain. On the other hand, tools exploiting manual annotation of linguistic phenomena not related to uncertainty (such as POS taggers, parsers, etc. trained on labeled corpora) are allowed.
  • open, where ANY data and/or ANY additional manually created information and resources (which may be related to uncertainty) are allowed for both domains. However, the exploitation of raw Wikipedia texts other than those provided by us is prohibited, and annotating the test set is also prohibited.

The motivation behind the cross-domain and open challenges is to assess whether adding extra (i.e. non-domain-specific) information to the systems can contribute to performance.
The biological evaluation set will consist of biomedical full articles (i.e. no abstracts are included in the evaluation dataset).

Task 2: Resolution of in-sentence scopes of hedge cues

For the second task, in-sentence scope resolvers have to be developed. Biological scientific texts from the BioScope corpus, in which instances of speculative language (that is, hedge keywords and their scope) are annotated manually, serve as the training data. This task falls within the scope of semantic analysis of sentences exploiting syntactic patterns, since hedge scopes can usually be determined on the basis of syntactic patterns dependent on the keyword.
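As a rough, purely illustrative sketch of such a pattern-based approach (a baseline-style heuristic, not the annotation guideline), one can extend the scope from the cue to the right until a clause-like boundary or the end of the sentence; real systems condition the boundary on the syntactic structure governed by the keyword.

  # Naive scope resolver sketch: extend the scope rightwards from the cue until
  # a clause-like boundary (or the sentence end). Illustration only.
  import re

  def naive_scope(sentence, cue_start, cue_end):
      """Return (start, end) character offsets of the predicted scope."""
      boundary = re.search(r"[,;:.]", sentence[cue_end:])
      scope_end = cue_end + boundary.start() if boundary else len(sentence)
      return cue_start, scope_end

  s = "These results indicate that the protein may regulate transcription, as reported earlier."
  start = s.index("may")
  print(s[slice(*naive_scope(s, start, start + len("may")))])  # may regulate transcription
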
Task 2 involves the annotation of "cue" + "xcope" tags in sentences. We expect the systems to add cue and corresponding xcope tags linked together by unique IDs, as in the training data. The scope-level F-measure will be used as the chief metric, where true positives are scopes which match both the gold-standard cue words AND the gold-standard scope boundaries assigned to those cue words. That is, correct scope boundaries with incorrect cue annotation AND correct cue words with bad scope boundaries will BOTH be considered errors (see the FAQ for examples).
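Under the matching rule above, the scope-level scoring can be sketched as follows (a simplified illustration operating on offset tuples; the official scorer works on the submitted XML annotation).

  # Scope-level evaluation sketch: a predicted scope is a true positive only if
  # both the cue and the scope boundaries match a gold-standard annotation.
  def scope_f_measure(gold_scopes, predicted_scopes):
      """Arguments: sets of (cue_start, cue_end, scope_start, scope_end) tuples."""
      tp = len(gold_scopes & predicted_scopes)
      fp = len(predicted_scopes - gold_scopes)  # wrong cue OR wrong boundaries
      fn = len(gold_scopes - predicted_scopes)
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
      return precision, recall, f

  gold = {(40, 43, 40, 67)}
  pred = {(40, 43, 40, 70)}            # correct cue, wrong right boundary -> error
  print(scope_f_measure(gold, pred))   # (0.0, 0.0, 0.0)
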
Evaluation will be carried out using the same biomedical full articles we use for Task 1 (but the level of analysis required for Task 2 is different).

Software tools (accessible without registration)

The scorers have been extended to give cue-level reports as well.

Trial Data (accessible without registration)