CoNLL-2010 Shared Task: Learning to detect hedges and their scope in natural language text

Frequently Asked Questions

Should I participate in both tasks?
Participation in both tasks is not mandatory, but we encourage it. We believe that real-world NLP applications should use a scope-level uncertainty recogniser (i.e. the main goal is to develop systems for Task2), but this requires a good keyword detector (i.e. a system for Task1).

Should I tag cues in Task1? If so, what is their role?
The evaluation of Task1 will be carried out using the sentence-level F-measure of the uncertainty class, so tagging cues is not mandatory. On the other hand, cues may play an important role in solving Task1: we will distribute the training datasets with the actual cues marked to enhance learning, and participants are welcome to submit marked cues as well (this would enable us to derive interesting statistics on them).

Why not evaluate Task1 on the cue level?
The motivation for the sentence-level binary classification evaluation comes from the application side. We think that once the problem of identifying the linguistic scopes of cues is disregarded, practically the only useful information for an application (typically an IE system) is at the sentence level; the application can then decide what to do with each uncertain sentence (disregard it, treat it separately, etc.).

Where is the test data extracted from?
The biological evaluation set will consist of full biomedical articles (i.e. no abstracts are included in the evaluation dataset).

What should the submission format look like?
The scorers take two arguments: the gold-standard XML and the prediction XML. The format of the predictions must be that of the training data files. More precisely, the document ids must be the same (their order is arbitrary), the order of sentences within a document must not be changed, and unique ids must be used for scopes and their cues (the format of the ids is arbitrary).
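The constraints above can be checked before submission. A minimal sketch follows; the element names `Document` and `sentence` and the `id` attributes are assumptions based on the XML snippets elsewhere in this FAQ, so adjust them to the actual DTD of the distributed files.

```python
# Sanity-check a prediction XML file against the gold-standard XML file:
# same document ids (any order), unchanged sentence order within each
# document, and unique scope ids. Element/attribute names are assumptions.
import xml.etree.ElementTree as ET

def check_submission(gold_file, pred_file):
    gold = ET.parse(gold_file).getroot()
    pred = ET.parse(pred_file).getroot()

    # Document ids must match as a set (their order is arbitrary).
    gold_ids = {d.get("id") for d in gold.iter("Document")}
    pred_ids = {d.get("id") for d in pred.iter("Document")}
    assert gold_ids == pred_ids, "document ids differ"

    # Sentence order within each document must be unchanged.
    gold_docs = {d.get("id"): d for d in gold.iter("Document")}
    for doc in pred.iter("Document"):
        gold_sents = [s.get("id") for s in gold_docs[doc.get("id")].iter("sentence")]
        pred_sents = [s.get("id") for s in doc.iter("sentence")]
        assert gold_sents == pred_sents, "sentence order changed"

    # Scope ids must be unique within the prediction file.
    scope_ids = [x.get("id") for x in pred.iter("xcope")]
    assert len(scope_ids) == len(set(scope_ids)), "duplicate scope ids"
```

Running this check locally catches formatting mistakes that would otherwise make the official scorers misalign documents or sentences.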

  • The Task1Scorer calculates F-measure (along with true positive, false positive and false negative frequencies) for the uncertain class.
  • The Task2Scorer calculates F-measure (along with true positive, false positive and false negative frequencies) for the scopes, where a predicted scope counts as a true positive only if both its cue word(s) and its scope boundaries match the gold standard.
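Both scorers report the same statistic. As a reminder, this is the standard precision/recall/F-measure arithmetic over the reported frequencies:

```python
# F-measure from true positive (tp), false positive (fp) and
# false negative (fn) counts, as reported by both scorers.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For example, 3 true positives with 1 false positive and 1 false negative give precision = recall = 0.75, hence F = 0.75.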

What are Collection readers and CAS consumers?
These components read and write CoNLL2010st/Bioscope XML files for UCompare and UIMA. Several processing tools have been adapted to these frameworks, so they can easily be used in a pipeline as preprocessors for uncertainty detection. Note that these modules work best within the UCompare system, as the UIMA components also use UCompare types. They can still be used in plain UIMA toolchains if ucompare.jar is present on the classpath.

On what corpora are the biological abstracts in the training data based?
The texts of the biological abstracts match, character for character, the texts used in the GENIA Treebank. The reason for including GENIA texts in the training set is that participants may want to exploit the rich linguistic annotation in the GENIA Treebank when training their systems.

What does the scope-level F-measure of Task 2 mean?
Here are several examples that illustrate the evaluation metric:

  • True positives
    • Correct cue annotation with correct scope boundaries:
      <sentence id="S1.95">Thus, it was examined <xcope id="X1.95.1"><cue type="speculation" ref="X1.95.1">whether</cue> the iORFs have homologous regions in other genomes</xcope>.</sentence>
  • False positives
    • Annotation of a cue and scope where no annotation is necessary (i.e. the sentence is not an instance of hedging):
      <sentence id="S3.301"><xcope id="X3.301.1">The probability of seeing m at least C(m) times in the regulon <cue type="speculation" ref="X3.301.1">can</cue> be approximated by the Poisson distribution</xcope>:</sentence>
    • Gold standard annotation:
      <sentence id="S3.301">The probability of seeing m at least C(m) times in the regulon can be approximated by the Poisson distribution:</sentence>
    • Incorrect cue annotation with correct scope boundaries (note that this also yields a false negative, as the gold-standard scope is missed):
      <sentence id="S1.166">This cluster <xcope id="X1.166.1"><cue type="speculation" ref="X1.166.1">may represent</cue> a novel selenoprotein family</xcope>.</sentence>
    • Gold standard annotation:
      <sentence id="S1.166">This cluster <xcope id="X1.166.1"><cue type="speculation" ref="X1.166.1">may</cue> represent a novel selenoprotein family</xcope>.</sentence>
    • Correct cue annotation with incorrect scope boundaries (note that this also yields a false negative, as the gold-standard scope is missed):
      <sentence id="S3.248">A second constraint associated with a hierarchical ensemble learning method is the multiplicative increase in the number of parameters associated with it, though this problem <xcope id="X3.248.1"><cue type="speculation" ref="X3.248.1">may</cue> be ameliorated by the use of parameter-free algorithms that employ restricted search spaces</xcope>.</sentence>
    • Gold standard annotation:
      <sentence id="S3.248">A second constraint associated with a hierarchical ensemble learning method is the multiplicative increase in the number of parameters associated with it, though <xcope id="X3.248.1">this problem <cue type="speculation" ref="X3.248.1">may</cue> be ameliorated by the use of parameter-free algorithms that employ restricted search spaces</xcope>.</sentence>
  • False negatives
    • No cue or scope marked where one is necessary:
      <sentence id="S2.108">Previous studies have indicated that quite stringent joint E-values must be used to transfer interactions safely between organisms 3435.</sentence>
    • Gold standard annotation:
      <sentence id="S2.108">Previous studies have <xcope id="X2.108.1"><cue type="speculation" ref="X2.108.1">indicated that</cue> quite stringent joint E-values must be used to transfer interactions safely between organisms</xcope> 3435.</sentence>
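The matching logic behind these examples can be sketched as follows. Scopes are modelled here as `(cue_start, cue_end, scope_start, scope_end)` offset tuples; this is an illustration of the true-positive criterion, not the official scorer implementation, which works directly on the XML.

```python
# Task2 matching sketch: a predicted scope is a true positive only if its
# cue span AND its scope boundaries exactly match a gold-standard scope.
# A prediction with the right boundaries but the wrong cue (or vice versa)
# therefore counts as both a false positive and a false negative.
def score_scopes(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)   # exact matches on cue and boundaries
    fp = len(pred - gold)   # predicted scopes with any mismatch
    fn = len(gold - pred)   # gold scopes that were missed
    return tp, fp, fn
```

This reproduces the behaviour in the examples above: predicting the cue "may represent" where the gold cue is "may" (with identical scope boundaries) scores one false positive and one false negative, not a partial credit.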