Evaluation service

If you use the evaluation, please cite the following publication:

@inproceedings{bb2016overview,
  title={Overview of the bacteria biotope task at bionlp shared task 2016},
  author={Del{\'e}ger, Louise and Bossy, Robert and Chaix, Estelle and Ba, Mouhamadou and others},
  booktitle={Proceedings of the 4th BioNLP Shared Task Workshop},
  year={2016}
}

Evaluation results

Congratulations to all participants for submitting their predictions! The definitive results are presented in the following tables.
(Results tables for the subtasks, including BB-cat and BB-kb+ner.)
General evaluation algorithm

The evaluation is performed in three steps: pairing, filtering and measures.
Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function S_{p} that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of S_{p} over all pairs, such that no pair has an S_{p} equal to zero. S_{p} is specific to each subtask; refer to the description of the evaluation of each subtask for its specification.
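The pairing objective can be sketched as follows with a brute-force search over all possible pairings. This is workable only for small annotation sets and is illustrative, not the service's actual implementation:

```python
from itertools import permutations

def best_pairing(references, predictions, sim):
    """Exhaustively search for the pairing that maximizes the sum of
    similarities, keeping only pairs with a non-zero similarity.
    `sim` is the task-specific similarity function S_p."""
    # Pad the shorter side with None so every permutation is total.
    n = max(len(references), len(predictions))
    refs = references + [None] * (n - len(references))
    best_pairs, best_score = [], 0.0
    for perm in permutations(predictions + [None] * (n - len(predictions))):
        pairs = [(r, p) for r, p in zip(refs, perm)
                 if r is not None and p is not None and sim(r, p) > 0]
        score = sum(sim(r, p) for r, p in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs, best_score
```

In practice this maximum-weight bipartite matching would be solved with an assignment algorithm rather than enumeration.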
Filtering

The filtering step selects a subset of reference–predicted pairs, from which the scores will be computed. In all subtasks the main score is computed on all pairs, without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance on different annotation types.
Measures

Measures are scores computed from the reference–predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches and Partial Matches, or aggregate these counts into scores such as Recall, Precision, F1 or Slot Error Rate.
Each subtask has a different set of measures. Participants are ranked by the first measure.
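The three steps above can be summarized in a small driver sketch; the callables `pair`, `keep` and `measure` are hypothetical names standing in for the task-specific components:

```python
def evaluate(reference, predicted, pair, keep, measure):
    """Three-step evaluation: pair reference and predicted
    annotations, filter the pairs, then compute the measures."""
    pairs = pair(reference, predicted)          # step 1: pairing
    kept = [p for p in pairs if keep(p)]        # step 2: filtering
    return measure(kept, reference, predicted)  # step 3: measures
```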
Event subtask

The pairing matches reference Lives_In events with predicted Lives_In events. The matching similarity function is defined as:
If the Bacteria arguments of the reference and predicted events are the same entity or equivalent entities, and the Location arguments of the reference and predicted events are the same entity or equivalent entities, then S_{p} = 1. Otherwise S_{p} = 0.
The submissions are measured using Recall, Precision and F1. Two additional alternate evaluations are computed: one restricted to Lives_In events whose Location argument has type Habitat, and another restricted to those whose Location argument has type Geographical.
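A minimal sketch of this all-or-nothing similarity, assuming events are represented as (Bacteria, Location) entity-identifier pairs and equivalences as a mapping from an identifier to its set of equivalent identifiers (both representations are assumptions for illustration):

```python
def event_similarity(ref, pred, equiv):
    """S_p for the event subtask: 1 if both arguments are the same
    entity or equivalent entities, 0 otherwise."""
    same = lambda a, b: a == b or b in equiv.get(a, set())
    return 1.0 if same(ref[0], pred[0]) and same(ref[1], pred[1]) else 0.0
```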
Event+ner subtask

As for the event subtask, the pairing matches reference Lives_In events with predicted Lives_In events. However, the entities, which are the potential arguments of events, are not given as input. The pairing similarity function therefore takes into account how well the boundaries and types of the arguments match.
S_{p} = S_{arg}(Bacteria) · S_{arg}(Location)
Where S_{arg} is a similarity function between two entities: S_{arg}(role) is the similarity between the arguments with the given role in the reference and predicted Lives_In events:
S_{arg} = T · B
Where T is the type similarity function, and B is the boundaries similarity function:
If both entities have the same type, then T = 1. Otherwise T = 0.
B = I_{c} / U_{c}
I_{c} is the number of characters covered by both of the two entities, and U_{c} is the number of characters covered by either of the two entities. B can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
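Assuming entity boundaries are (start, end) character offsets with an exclusive end, B can be computed as:

```python
def boundary_similarity(ref_span, pred_span):
    """Jaccard-style boundary similarity B = I_c / U_c over
    character offsets; spans are (start, end), end exclusive."""
    (rs, re), (ps, pe) = ref_span, pred_span
    inter = max(0, min(re, pe) - max(rs, ps))  # characters covered by both
    union = (re - rs) + (pe - ps) - inter      # characters covered by either
    return inter / union if union else 0.0
```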
As with the event subtask, submissions are measured using Recall, Precision and F1, with Recall and Precision redefined in order to take partial matches into account:
Recall = ΣS_{p} / N
Precision = ΣS_{p} / M

Where N is the number of reference events, and M is the number of predicted events.
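These partial-match measures can be sketched as follows, given the list of pair similarities produced by the pairing step:

```python
def scores(pair_similarities, n_reference, n_predicted):
    """Recall, Precision and F1 with partial matches: each paired
    similarity S_p contributes its value instead of a full count."""
    total = sum(pair_similarities)
    recall = total / n_reference if n_reference else 0.0
    precision = total / n_predicted if n_predicted else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return recall, precision, f1
```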
Alternate evaluations measure the submissions for events with a Location argument of type Habitat, and for events with a Location argument of type Geographical.
Alternate evaluations are also provided that ignore the effect of boundary errors: in these evaluations B is still used by the pairing algorithm, but partial matches are counted as full true positives.

Cat subtask

In the cat subtask, entities are given as input; the predictions consist of the normalization of these entities. The pairing step is therefore skipped, since the normalization of each entity can be evaluated independently. The similarity between the reference and the predicted normalization is different for Habitat and Bacteria entities. For Habitat entities, the Wang similarity is used with a weight of 0.65. The similarity for Bacteria is stricter: it is equal to 1 if the taxon identifiers are the same, and equal to 0 if they are different.
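A hedged sketch of the Wang semantic similarity with contribution factor w = 0.65, assuming it refers to the hybrid ontology-based measure of Wang et al. (2007) and that the ontology is given as a child-to-parents mapping (both assumptions, not confirmed by this page):

```python
def wang_similarity(a, b, parents, w=0.65):
    """Wang-style semantic similarity between two ontology concepts.
    `parents` maps a concept id to the ids of its direct parents."""
    def s_values(term):
        # S-value of each ancestor: w per edge, max over paths,
        # propagated upward from the term itself (S-value 1.0).
        sv = {term: 1.0}
        frontier = [term]
        while frontier:
            t = frontier.pop()
            for p in parents.get(t, ()):
                contrib = w * sv[t]
                if contrib > sv.get(p, 0.0):
                    sv[p] = contrib
                    frontier.append(p)
        return sv
    sa, sb = s_values(a), s_values(b)
    common = set(sa) & set(sb)
    denom = sum(sa.values()) + sum(sb.values())
    return sum(sa[t] + sb[t] for t in common) / denom if denom else 0.0
```

Identical concepts score 1; concepts sharing only distant ancestors score close to 0.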
The submissions are evaluated with their Precision, defined as:
Precision = ΣS_{p} / N
Where N is the number of entities. Two alternate evaluations assess the normalization of Habitat and Bacteria entities respectively.
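Assuming each entity is represented as a (type, reference identifier, predicted identifier) triple, the cat-subtask Precision can be sketched as follows; the `wang` parameter stands in for the Wang(0.65) similarity on OntoBiotope concepts:

```python
def normalization_precision(entities, wang):
    """Mean per-entity normalization similarity: strict identifier
    match for Bacteria, Wang(0.65) similarity for Habitat."""
    total = 0.0
    for etype, ref_id, pred_id in entities:
        if etype == "Bacteria":
            total += 1.0 if ref_id == pred_id else 0.0  # strict taxon match
        else:  # Habitat
            total += wang(ref_id, pred_id)              # Wang(0.65) similarity
    return total / len(entities) if entities else 0.0
```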
Cat+ner subtask

In the cat+ner subtask, the entities are paired using a similarity function that takes into account the boundaries accuracy as well as the normalization accuracy:

S_{p} = B · C

Where B is the boundaries similarity function, and C is the normalization similarity function:
B = I_{c} / U_{c}
For Bacteria: if the predicted taxon identifier is the same as the reference, then C = 1; otherwise C = 0. For Habitat: C is the Wang similarity with a weight of 0.65.
B, I_{c} and U_{c} are defined as in the event+ner subtask, and the normalization similarities are the same as in the cat subtask. Submissions are evaluated using the Slot Error Rate (Makhoul et al., 1999). Alternate scores are provided to measure the contribution of boundaries accuracy and normalization accuracy, for both Habitat and Bacteria entities.
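A hedged sketch of a partial-match Slot Error Rate, assuming paired entities contribute a substitution cost of 1 − S_p and unpaired reference and predicted entities count as deletions and insertions; the exact cost scheme used by the service is not specified here:

```python
def slot_error_rate(pair_similarities, n_reference, n_predicted):
    """Slot Error Rate (after Makhoul et al., 1999) adapted to
    partial matches: SER = (substitutions + deletions + insertions) / N."""
    n_paired = len(pair_similarities)
    substitutions = sum(1.0 - s for s in pair_similarities)
    deletions = n_reference - n_paired   # unmatched reference entities
    insertions = n_predicted - n_paired  # unmatched predicted entities
    return (substitutions + deletions + insertions) / n_reference
```

Unlike F1, lower is better, and the rate can exceed 1 when predictions are very noisy.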
Kb and kb+ner subtasks

The evaluation of the kb and kb+ner subtasks is based on the capacity of submissions to populate a knowledge base. For the evaluation, two knowledge bases are built: one derived from the reference annotations, and one derived from the predicted annotations. The submissions are evaluated by comparing the predicted KB with the reference KB.
To build a KB, each Lives_In event is turned into an association between a bacterial taxon and a habitat concept from OntoBiotope. The bacterial taxon is the NCBI_Taxonomy normalization of the Bacteria argument. The habitat is the OntoBiotope normalization of the Location argument. In other words, the Lives_In event is turned into a taxon–habitat association by getting rid of the text-bound entities.
All associations are collected, and redundant associations are removed. In the evaluation service they are named KB_Lives_In.
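Assuming each event carries its two normalizations (the field names below are illustrative), building the deduplicated KB_Lives_In set reduces to:

```python
def build_kb(lives_in_events):
    """Turn Lives_In events into the set of taxon-habitat
    associations: keep only the NCBI_Taxonomy and OntoBiotope
    normalizations, dropping the text-bound entities; using a
    set removes redundant associations."""
    return {(e["bacteria_taxid"], e["habitat_concept"])
            for e in lives_in_events}
```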
Reference and predicted KB_Lives_In associations are paired in the same way annotations would be, using a similarity function S_{p}. However, the pairing algorithm searches for the closest reference association for each predicted association. In this way each reference association can be paired with several predicted associations; the score associated with a reference association is the mean of its similarities to each paired predicted association. The final score is the sum of the reference association scores.
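This scoring scheme can be sketched as follows, with the KB similarity function passed in as a parameter:

```python
def kb_score(reference_kb, predicted_kb, sim):
    """Pair each predicted association with its closest reference
    association; each reference scores the mean similarity of the
    predictions paired to it, and the final score is the sum of
    the reference scores."""
    paired = {}  # reference association -> similarities of its predictions
    for pred in predicted_kb:
        best_ref = max(reference_kb, key=lambda ref: sim(ref, pred))
        paired.setdefault(best_ref, []).append(sim(best_ref, pred))
    return sum(sum(sims) / len(sims) for sims in paired.values())
```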
The similarity function is defined as:
