Evaluation results
Congratulations to all participants for submitting their predictions! The definitive results are presented in the following tables.
BBcat
Team   Precision
BOUN   0.679
LIMSI  0.503
Team      F1     Recall  Precision
VERSE     0.558  0.615   0.510
TurkuNLP  0.521  0.448   0.623
LIMSI     0.485  0.646   0.388
HK        0.474  0.392   0.599
whunlpre  0.471  0.407   0.559
UMS       0.463  0.399   0.551
DUTIR     0.457  0.382   0.568
WXU       0.455  0.383   0.560
UTS       0.451  0.382   0.551
Team   Mean precision  Predictions  False Negatives
LIMSI  0.771           393          48
Team    SER    Mismatches  Matches  Insertions  Deletions  Recall  Precision  Predictions
TagIt   0.628  209.446     465.554  86          347        0.456   0.612      761
LIMSI   0.827  198.159     368.841  192         455        0.361   0.486      759
whunlp  0.901  228.066     278.934  178         515        0.273   0.407      685
Team      F1     Recall  Precision  SER    Mismatches  Matches  Insertions  Deletions  Predictions
LIMSI     0.192  0.191   0.193      1.558  15.267      59.733   234         237        309
UTS       0.190  0.133   0.331      1.042  29.310      41.690   55          242        126
whunlpre  0.182  0.111   0.498      0.984  5.126       34.874   30          273        70
BBkb+ner
Team   Mean precision  Predictions  False Negatives
LIMSI  0.202           95           130
Evaluation service
The evaluation service is available!
General evaluation algorithm
The evaluation is performed in three steps:
 Pairing of reference and predicted annotations.
 Filtering of reference-prediction pairs.
 Computation of measures.
There may be additional filtering or rearrangement steps in order to accommodate a specific subtask or to compute alternate scores. The description of each task details the specifics of the evaluation.
Pairing
The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function S_{p} that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of S_{p} over all pairs, under the constraint that no pair has an S_{p} equal to zero. S_{p} is specific to the task; refer to the description of the evaluation of each subtask for its specification.
 A pair where S_{p} equals 1 is called a True Positive, or a Match.
 A pair where S_{p} is below 1 is called a Partial Match, or a Substitution.
 A reference annotation that has not been paired is called a False Negative, or a Deletion.
 A predicted annotation that has not been paired is called a False Positive, or an Insertion.
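The pairing and the four outcome categories above can be sketched as follows. This is a minimal illustration, not the evaluation service's implementation: the names `best_pairing`, `classify` and `overlap` are hypothetical, the exhaustive search only suits a handful of annotations, and the toy similarity compares (start, end) spans.

```python
from functools import lru_cache


def best_pairing(refs, preds, sim):
    """Return (ref_index, pred_index, score) triples maximizing the summed score.

    Exhaustive search with memoization; pairs with a zero score are never formed.
    """
    scores = [[sim(r, p) for p in preds] for r in refs]

    @lru_cache(maxsize=None)
    def solve(i, used):
        # Best (total, pairs) for references i.. given a bitmask of used predictions.
        if i == len(refs):
            return 0.0, ()
        best = solve(i + 1, used)  # option: leave reference i unpaired
        for j, s in enumerate(scores[i]):
            if s > 0 and not used & (1 << j):
                total, pairs = solve(i + 1, used | (1 << j))
                if s + total > best[0]:
                    best = (s + total, ((i, j, s),) + pairs)
        return best

    return list(solve(0, 0)[1])


def classify(refs, preds, pairs):
    """Sort pairs and leftovers into the four categories named in the text."""
    paired_refs = {i for i, _, _ in pairs}
    paired_preds = {j for _, j, _ in pairs}
    return {
        "matches": [(i, j) for i, j, s in pairs if s == 1],
        "substitutions": [(i, j) for i, j, s in pairs if s < 1],
        "deletions": [i for i in range(len(refs)) if i not in paired_refs],
        "insertions": [j for j in range(len(preds)) if j not in paired_preds],
    }


def overlap(a, b):
    """Toy similarity (an assumption): shared characters over covered characters."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union


refs = [(0, 10), (20, 30), (40, 50)]
preds = [(0, 10), (22, 30), (60, 70)]
pairs = best_pairing(refs, preds, overlap)
outcome = classify(refs, preds, pairs)
print(outcome)
```

Here the first reference is matched exactly, the second is a partial match, the third reference is a deletion and the third prediction is an insertion.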
The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all subtasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance on different annotation types.
Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, Partial Matches, or an aggregate of these scores like Recall, Precision, F1, Slot Error Rate, etc.
Each subtask has a different set of measures. Participants are ranked by the first measure.
In the event subtask, the pairing matches reference Lives_In events with predicted Lives_In events. The matching similarity function is defined as:
If
the Bacteria argument in the reference and the prediction events are the same entity or equivalent entities,
and the Location argument in the reference and the prediction events are the same entity or equivalent entities
then S_{p} = 1.
Otherwise S_{p} = 0.
The submissions are measured using Recall, Precision and F1.
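Since S_{p} is either 0 or 1 here, the pairing reduces to counting events whose two arguments agree. The sketch below illustrates this under stated assumptions: events are (Bacteria, Location) pairs of entity identifiers, equivalent entities are mapped to a shared representative through a hypothetical `equivalences` dictionary, and the identifiers are made up.

```python
from collections import Counter


def score_events(reference, predicted, equivalences=None):
    """Recall, Precision and F1 for events with a 0/1 similarity."""
    equivalences = equivalences or {}

    def key(event):
        # Replace each argument by the representative of its equivalence set.
        bacteria, location = event
        return (equivalences.get(bacteria, bacteria),
                equivalences.get(location, location))

    ref_counts = Counter(key(e) for e in reference)
    pred_counts = Counter(key(e) for e in predicted)
    # Multiset intersection: each reference event pairs with at most one prediction.
    matches = sum((ref_counts & pred_counts).values())
    recall = matches / len(reference) if reference else 0.0
    precision = matches / len(predicted) if predicted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


reference = [("T1", "T2"), ("T1", "T3"), ("T4", "T5")]
predicted = [("T1", "T2"), ("T1", "T6"), ("T7", "T5")]
# T6 is declared equivalent to T3, so the second prediction counts as a match.
recall, precision, f1 = score_events(reference, predicted, {"T6": "T3"})
print(recall, precision, f1)
```

Two of the three predictions match, so Recall, Precision and F1 all come out at two thirds in this toy case.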
Two additional alternate evaluations are computed: one for only Lives_In events where the Location argument has type Habitat, and another if it has type Geographical.
As in the event subtask, the pairing matches reference Lives_In events with predicted Lives_In events. However, the entities, which are the potential arguments of events, are not given as input. The pairing similarity function therefore takes into account how well the boundaries and types of the arguments match.
S_{p} = S_{arg}(Bacteria) . S_{arg}(Location)
Where S_{arg} is a similarity function between two entities, and S_{arg}(role) is the similarity between the arguments filling that role in the reference and predicted Lives_In events:
S_{arg} = T . B
Where T is the type similarity function, and B is the boundaries similarity function:
If both entities have the same type, then T = 1. Otherwise T = 0.
B = I_{c} / U_{c}
I_{c} is the number of characters covered by both of the two entities, and U_{c} is the number of characters covered by either of the two entities. B can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
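As an illustration of these definitions, the sketch below computes B as a Jaccard index over character offsets and combines it with T, assuming that S_{arg} is the product of T and B and that B = I_{c} / U_{c}. The entity representation (a dictionary with a type and a list of (start, end) fragments, end exclusive) and all function names are hypothetical.

```python
def covered(fragments):
    """Set of character offsets covered by (start, end) fragments, end exclusive."""
    return {i for start, end in fragments for i in range(start, end)}


def boundary_similarity(ref_fragments, pred_fragments):
    """B: characters covered by both entities over characters covered by either."""
    ref_chars, pred_chars = covered(ref_fragments), covered(pred_fragments)
    union = ref_chars | pred_chars
    return len(ref_chars & pred_chars) / len(union) if union else 0.0


def type_similarity(ref_type, pred_type):
    """T: 1 for identical types, 0 otherwise."""
    return 1.0 if ref_type == pred_type else 0.0


def argument_similarity(ref, pred):
    """S_arg, assumed here to be the product T * B."""
    return (type_similarity(ref["type"], pred["type"])
            * boundary_similarity(ref["fragments"], pred["fragments"]))


def event_similarity(ref_event, pred_event):
    """S_p = S_arg(Bacteria) * S_arg(Location)."""
    return (argument_similarity(ref_event["Bacteria"], pred_event["Bacteria"])
            * argument_similarity(ref_event["Location"], pred_event["Location"]))


ref_event = {
    "Bacteria": {"type": "Bacteria", "fragments": [(0, 10)]},
    "Location": {"type": "Habitat", "fragments": [(20, 30)]},
}
pred_event = {
    "Bacteria": {"type": "Bacteria", "fragments": [(0, 10)]},
    "Location": {"type": "Habitat", "fragments": [(22, 30)]},  # misses 2 characters
}
s_p = event_similarity(ref_event, pred_event)
print(s_p)
```

The predicted Location shares 8 of the 10 characters covered by either entity, so B = 0.8 for that argument and the whole event scores 0.8. Working on sets of offsets rather than a single span keeps the sketch valid for entities made of several fragments.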
As with the event subtask, submissions are measured using Recall, Precision and F1, with Recall and Precision redefined in order to take partial matches into account:
Recall = ΣS_{p} / N
Precision = ΣS_{p} / M
Where N is the number of reference events, and M the number of predicted events.
Alternate evaluations measure the submissions for events with a Location argument of type Habitat, and for events with a Location argument of type Geographical.
Alternate evaluations are also provided by ignoring the effect of boundary errors. In these evaluations B is still used for the pairing algorithm but partial matches are considered as full true positives.
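The redefined Recall and Precision, together with the boundary-insensitive variant, can be sketched as below. The function name, the `ignore_boundaries` flag and the toy numbers are assumptions made for illustration; the input is the list of S_{p} values of the pairs kept by the pairing step.

```python
def partial_match_scores(pair_similarities, n_reference, n_predicted,
                         ignore_boundaries=False):
    """Recall = sum(S_p) / N, Precision = sum(S_p) / M, and their F1.

    With ignore_boundaries=True every pair counts as a full true positive,
    which mirrors the alternate evaluation described in the text.
    """
    if ignore_boundaries:
        total = float(len(pair_similarities))
    else:
        total = sum(pair_similarities)
    recall = total / n_reference if n_reference else 0.0
    precision = total / n_predicted if n_predicted else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return recall, precision, f1


# Toy case: 3 pairs, 5 reference events (N) and 4 predicted events (M).
similarities = [1.0, 0.8, 0.5]
strict = partial_match_scores(similarities, 5, 4)
relaxed = partial_match_scores(similarities, 5, 4, ignore_boundaries=True)
print(strict)
print(relaxed)
```

In the toy case the summed similarity is 2.3, giving a Recall of 0.46 and a Precision of 0.575; ignoring boundary errors raises them to 0.6 and 0.75.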
In the cat subtask, entities are given as input; the predictions consist of the normalization of the entities. The pairing step is therefore skipped, since the normalization of each entity can be evaluated independently. The similarity between the reference and the predicted normalization is defined differently for Habitat and Bacteria entities. For Habitat entities, the Wang similarity is used with a weight of 0.65. The similarity for Bacteria is stricter: it is equal to 1 if the taxon identifiers are the same, and equal to 0 if they are different.
The submissions are evaluated with their Precision, defined as:
Precision = ΣS_{p} / N
Where N is the number of entities.
Two alternate evaluations evaluate only the normalization of Habitat and Bacteria entities respectively.
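The following sketch shows how such a Precision could be computed. It assumes the usual formulation of the Wang similarity (Wang et al., 2007), in which each is_a step towards the root multiplies a term's contribution by the weight; whether the evaluation service follows exactly this formulation is an assumption. The tiny ontology and the identifiers are invented and are not taken from OntoBiotope or the NCBI Taxonomy.

```python
def s_values(term, parents, weight):
    """Contribution of `term` and each of its ancestors to the semantics of `term`."""
    values = {term: 1.0}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            candidate = weight * values[current]
            # Keep the strongest contribution when several paths reach a parent.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                stack.append(parent)
    return values


def wang_similarity(a, b, parents, weight=0.65):
    """Shared contributions of common ancestors over the total contributions."""
    sa, sb = s_values(a, parents, weight), s_values(b, parents, weight)
    common = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in common) / (sum(sa.values()) + sum(sb.values()))


def cat_precision(entities, parents):
    """Precision = sum(S_p) / N over (type, reference_id, predicted_id) triples."""
    total = 0.0
    for entity_type, ref_id, pred_id in entities:
        if entity_type == "Bacteria":
            total += 1.0 if ref_id == pred_id else 0.0  # strict identifier match
        else:
            total += wang_similarity(ref_id, pred_id, parents)
    return total / len(entities) if entities else 0.0


# Hypothetical is_a hierarchy: child -> list of parents.
parents = {
    "forest soil": ["soil"],
    "soil": ["environment"],
    "water": ["environment"],
    "environment": ["root"],
}
entities = [
    ("Bacteria", "taxon:A", "taxon:A"),   # correct taxon
    ("Bacteria", "taxon:A", "taxon:B"),   # wrong taxon
    ("Habitat", "forest soil", "soil"),   # parent concept predicted
]
habitat_sim = wang_similarity("forest soil", "soil", parents)
precision = cat_precision(entities, parents)
print(habitat_sim, precision)
```

Predicting the parent concept of the reference habitat earns partial credit (about 0.77 in this toy hierarchy), whereas a wrong taxon earns nothing.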
In the cat+ner subtask, the entities are paired using a similarity function that takes into account the boundaries accuracy as well as the normalization accuracy:
S_{p} = B . C
Where B is the boundaries similarity function, and C is the normalization similarity function:
B = I_{c} / U_{c}
For Bacteria: If the predicted taxon identifier is the same as the reference C = 1, otherwise C = 0.
For Habitat: C = Wang(0.65).
I_{c} is the number of characters covered by both of the two entities, and U_{c} is the number of characters covered by either of the two entities. B can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
For the normalization of Habitat entities, the Wang similarity is used with a weight of 0.65. The similarity for Bacteria is stricter as it is equal to 1 if the taxon identifiers are the same, and equal to 0 if the taxon identifiers are different.
Submissions are evaluated using the Slot Error Rate (Makhoul et al., 1999). Alternate scores are provided to measure the contribution of boundaries accuracy and normalization accuracy, for both Habitat and Bacteria entities.
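As a rough check of how the Slot Error Rate relates to the counts reported in the results, the sketch below assumes the common definition SER = (substitutions + insertions + deletions) / N, with N the number of reference entities. It also assumes that the fractional Matches and Mismatches are the sums of S_{p} and of 1 - S_{p} over the pairs; that interpretation is not stated in the text, but it reproduces the reported figures. The function name is hypothetical.

```python
def ser_summary(mismatches, matches, insertions, deletions):
    """Slot Error Rate, Recall and Precision from aggregate counts."""
    # Every reference entity is either paired (match + mismatch share) or deleted.
    n_reference = matches + mismatches + deletions
    # Every predicted entity is either paired or inserted.
    n_predicted = matches + mismatches + insertions
    return {
        "ser": (mismatches + insertions + deletions) / n_reference,
        "recall": matches / n_reference,
        "precision": matches / n_predicted,
        "predictions": n_predicted,
    }


# LIMSI figures from the results table that reports SER, Mismatches and Matches.
limsi = ser_summary(mismatches=198.159, matches=368.841, insertions=192, deletions=455)
print({name: round(value, 3) for name, value in limsi.items()})
```

Under these assumptions the LIMSI counts give values that round to the SER of 0.827, Recall of 0.361, Precision of 0.486 and 759 predictions listed for that team. Because insertions and deletions are both counted in the numerator, SER can exceed 1.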
The evaluation of the kb and kb+ner subtasks is based on the capacity of submissions to populate a knowledge base (KB). For the evaluation, two knowledge bases are built: one derived from the reference annotations, and one derived from the predicted annotations. The submissions are evaluated by comparing the predicted KB with the reference KB.
To build the KB, each Lives_In event is turned into an association between a bacterial taxon and a habitat concept from OntoBiotope. The bacterial taxon is the NCBI_Taxonomy normalization of the Bacteria argument. The habitat is the OntoBiotope normalization of the Location argument. In other words, the Lives_In event is turned into a taxon-habitat association by getting rid of the text-bound entity.
All associations are collected, and redundant associations removed. In the evaluation service they are named KB_Lives_In.
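The construction of the KB can be sketched as follows. The event and normalization structures, the function name and the identifiers are hypothetical, and skipping events whose arguments lack a normalization is an assumption rather than documented behaviour.

```python
def build_kb(events, taxon_of, habitat_of):
    """Turn Lives_In events into a set of (taxon, habitat) associations.

    events:     (bacteria_entity, location_entity) pairs of text-bound entities.
    taxon_of:   maps a Bacteria entity to its taxon identifier.
    habitat_of: maps a Location entity to its habitat concept identifier.
    """
    kb = set()  # a set drops redundant associations automatically
    for bacteria, location in events:
        taxon = taxon_of.get(bacteria)
        habitat = habitat_of.get(location)
        # Assumption: events without both normalizations are left out.
        if taxon is not None and habitat is not None:
            kb.add((taxon, habitat))
    return kb


events = [("T1", "T2"), ("T3", "T2"), ("T1", "T4"), ("T1", "T5")]
taxon_of = {"T1": "taxon:A", "T3": "taxon:A"}        # two mentions of one taxon
habitat_of = {"T2": "habitat:X", "T4": "habitat:Y"}  # T5 has no habitat concept
kb = build_kb(events, taxon_of, habitat_of)
print(sorted(kb))
```

The first two events mention the same taxon and habitat through different text-bound entities, so they collapse into a single association.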
Reference and predicted KB_Lives_In are paired in the same way annotations would be paired, using a similarity function S_{p}. However, the pairing algorithm searches for the closest reference association for each predicted association. In this way each reference association can be paired with several predicted associations; the score associated with this reference association is the mean of the similarities to each paired predicted association. The final score is the sum of the reference association scores.
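A minimal sketch of this many-to-one pairing is given below. The similarity is passed in as a function, ties are broken by taking the first closest reference, and predictions whose best similarity is zero are left unpaired; the last two points are assumptions. The toy habitat similarity table and the identifiers are invented.

```python
from collections import defaultdict


def kb_score(reference_kb, predicted_kb, sim):
    """Sum, over reference associations, of the mean similarity of their predictions."""
    paired = defaultdict(list)
    for pred in sorted(predicted_kb):
        best_ref, best_sim = None, 0.0
        for ref in sorted(reference_kb):
            s = sim(ref, pred)
            if s > best_sim:  # strict comparison keeps the first closest reference
                best_ref, best_sim = ref, s
        if best_ref is not None:
            paired[best_ref].append(best_sim)
    ref_scores = {ref: sum(values) / len(values) for ref, values in paired.items()}
    return sum(ref_scores.values()), ref_scores


# Toy habitat similarity (assumption): identical concepts score 1, listed pairs
# score the given value, everything else scores 0.
HABITAT_SIM = {("habitat:X", "habitat:X2"): 0.5}


def toy_sim(ref, pred):
    same_taxon = 1.0 if ref[0] == pred[0] else 0.0
    if ref[1] == pred[1]:
        habitat = 1.0
    else:
        habitat = HABITAT_SIM.get((ref[1], pred[1]), 0.0)
    return same_taxon * habitat


reference_kb = {("taxon:A", "habitat:X"), ("taxon:B", "habitat:Y")}
predicted_kb = {("taxon:A", "habitat:X"), ("taxon:A", "habitat:X2"), ("taxon:C", "habitat:Y")}
total, ref_scores = kb_score(reference_kb, predicted_kb, toy_sim)
print(total, ref_scores)
```

The first reference association receives two predictions with similarities 1.0 and 0.5, so its score is their mean, 0.75; the second reference association receives none, and the prediction with the wrong taxon stays unpaired.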
The similarity function is defined as:
S_{p} = C_{Bacteria} . C_{Habitat}
C_{Bacteria}: If the reference and predicted taxon identifiers are equal, then C_{Bacteria} = 1, otherwise C_{Bacteria} = 0.
