bb3-eval

Evaluation results

Congratulations to all participants on submitting their predictions! The definitive results are presented in the following tables.

BB3-results

BB-event

Download detailed results.

BB-kb

Download detailed results.

Evaluation service

The evaluation service is available!

General evaluation algorithm

The evaluation is performed in three steps:

1. Pairing of reference and predicted annotations.
2. Filtering of reference-prediction pairs.
3. Computation of measures.

There may be additional filtering or rearrangement steps in order to accommodate a specific sub-task or to compute alternate scores. The description of each task details the specifics of the evaluation.

Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function S_p that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of S_p over all pairs, under the constraint that no pair has an S_p equal to zero. S_p is specific to each sub-task; refer to the description of the evaluation of each sub-task for its specification.

- A pair where S_p equals 1 is called a True Positive, or a Match.
- A pair where S_p is below 1 (and, by construction, above 0) is called a Partial Match, or a Substitution.
- A reference annotation that has not been paired is called a False Negative, or a Deletion.
- A predicted annotation that has not been paired is called a False Positive, or an Insertion.
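The pairing and the four outcome categories above can be sketched as follows. This is a minimal illustration, not the evaluation service's implementation: the function names `best_pairing` and `classify` are hypothetical, the similarity values are made up, and the search is exhaustive, so it is only practical for a handful of annotations.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[int, int, float]  # (reference index, prediction index, S_p)


def best_pairing(refs: Sequence, preds: Sequence,
                 sim: Callable[[object, object], float]) -> List[Pair]:
    """Exhaustively search for the one-to-one pairing with the highest sum of S_p.

    Pairs whose similarity is zero are never formed.
    """
    best_total = -1.0
    best_pairs: List[Pair] = []
    used = set()            # indices of predictions already paired
    current: List[Pair] = []

    def search(i: int, total: float) -> None:
        nonlocal best_total, best_pairs
        if i == len(refs):
            if total > best_total:
                best_total, best_pairs = total, list(current)
            return
        # Option 1: leave reference i unpaired.
        search(i + 1, total)
        # Option 2: pair reference i with any free prediction with S_p > 0.
        for j, pred in enumerate(preds):
            if j in used:
                continue
            s = sim(refs[i], pred)
            if s > 0:
                used.add(j)
                current.append((i, j, s))
                search(i + 1, total + s)
                current.pop()
                used.remove(j)

    search(0, 0.0)
    return best_pairs


def classify(n_refs: int, n_preds: int, pairs: List[Pair]) -> Dict[str, int]:
    """Count Matches, Partial Matches, False Negatives and False Positives."""
    matches = sum(1 for _, _, s in pairs if s == 1)
    return {
        "matches": matches,
        "partial_matches": len(pairs) - matches,
        "false_negatives": n_refs - len(pairs),
        "false_positives": n_preds - len(pairs),
    }


# Made-up similarity values for two references and three predictions.
table = {("r0", "p0"): 1.0, ("r0", "p1"): 0.5,
         ("r1", "p0"): 0.8, ("r1", "p1"): 0.4}
refs, preds = ["r0", "r1"], ["p0", "p1", "p2"]
pairs = best_pairing(refs, preds, lambda r, p: table.get((r, p), 0.0))
counts = classify(len(refs), len(preds), pairs)
```

In this example, pairing r0 with p0 and r1 with p1 (sum 1.4) beats pairing r0 with p1 and r1 with p0 (sum 1.3), even though r1 taken alone is closer to p0. The prediction p2 stays unpaired and counts as a False Positive.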

Filtering

The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance of different annotation types.

Measures

Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, Partial Matches, or an aggregate of these scores like Recall, Precision, F1, Slot Error Rate, etc.

Each sub-task has a different set of measures. Participants are ranked by the first measure.

Sub-tasks evaluations

event

The pairing matches reference Lives_In events with predicted Lives_In events. The matching similarity function is defined as follows:

If the Bacteria argument in the reference and the predicted events are the same entity or equivalent entities, and the Location argument in the reference and the predicted events are the same entity or equivalent entities, then S_p = 1.

Otherwise S_p = 0.

The submissions are measured using Recall, Precision and F-1.

Two alternate evaluations are also computed: one restricted to Lives_In events whose Location argument has type Habitat, and another restricted to those whose Location argument has type Geographical.

event+ner

As for the event sub-task, the pairing matches reference Lives_In events with predicted Lives_In events. However the entities, which are the potential arguments of events, are not given as input. The pairing similarity function takes into account how much the boundaries and types of the arguments match.

S_p = S_arg(Bacteria) . S_arg(Location)

Where S_arg is a similarity function between two entities, and S_arg(role) is the similarity between the arguments that fill the given role in the reference and predicted Lives_In events:

S_arg = T . B

Where T is the type similarity function, and B is the boundaries similarity function:

If both entities have the same type, then T = 1. Otherwise T = 0.

B = I_c / U_c

I_c is the number of characters covered by both of the two entities, and U_c is the number of characters covered by either of the two entities. B can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
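The boundary similarity B and the argument similarity S_arg = T . B can be sketched as below. The representation of an entity as a type plus a list of (start, end) character fragments, with an exclusive end offset, is an assumption made for this illustration, as are the function names.

```python
from typing import List, Set, Tuple

Fragment = Tuple[int, int]  # (start, end) character offsets, end exclusive


def covered(fragments: List[Fragment]) -> Set[int]:
    """Return the set of character offsets covered by an entity's fragments."""
    chars: Set[int] = set()
    for start, end in fragments:
        chars.update(range(start, end))
    return chars


def boundary_similarity(ref: List[Fragment], pred: List[Fragment]) -> float:
    """B = I_c / U_c: characters covered by both over characters covered by either."""
    ref_chars, pred_chars = covered(ref), covered(pred)
    union = ref_chars | pred_chars
    if not union:
        return 0.0
    return len(ref_chars & pred_chars) / len(union)


def argument_similarity(ref_type: str, ref: List[Fragment],
                        pred_type: str, pred: List[Fragment]) -> float:
    """S_arg = T . B, where T is 1 for identical types and 0 otherwise."""
    t = 1.0 if ref_type == pred_type else 0.0
    return t * boundary_similarity(ref, pred)


# Reference covers offsets 10..19, prediction covers 15..24:
# 5 shared characters out of 15 covered in total.
b = boundary_similarity([(10, 20)], [(15, 25)])
```

Working on sets of offsets rather than on a single start and end keeps the computation valid for entities made of several non-contiguous fragments.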

As with the event sub-task, submissions are measured using Recall, Precision and F-1, with Recall and Precision redefined in order to take partial matches into account:

Recall = ΣS_p / N

Precision = ΣS_p / M

Where N is the number of reference events, and M the number of predicted events.
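As a worked illustration of these redefined measures, the sketch below computes Recall, Precision and F1 from the S_p values of the paired events. F1 is taken here to be the usual harmonic mean of Recall and Precision, which is an assumption since the page does not spell it out; the function name and the numbers are made up.

```python
from typing import Iterable, Tuple


def partial_match_scores(pair_similarities: Iterable[float],
                         n_reference: int,
                         n_predicted: int) -> Tuple[float, float, float]:
    """Return (recall, precision, f1) with partial matches weighted by S_p."""
    total = sum(pair_similarities)
    recall = total / n_reference if n_reference else 0.0
    precision = total / n_predicted if n_predicted else 0.0
    denominator = recall + precision
    f1 = 2 * recall * precision / denominator if denominator else 0.0
    return recall, precision, f1


# Three pairs (one full match, two partial matches),
# 4 reference events and 5 predicted events.
recall, precision, f1 = partial_match_scores([1.0, 0.5, 0.5], 4, 5)
```

Here the sum of S_p is 2.0, so Recall is 2.0 / 4 = 0.5 and Precision is 2.0 / 5 = 0.4. When every pair has S_p = 1 the formulas reduce to the usual count-based Recall and Precision.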

Alternate evaluations measure the submissions for events with a Location argument of type Habitat, and for events with a Location argument of type Geographical.

Alternate evaluations are also provided by ignoring the effect of boundary errors. In these evaluations B is still used for the pairing algorithm but partial matches are considered as full true positives.

cat

In the cat sub-task, entities are given as input; the predictions consist of the normalization of the entities. The pairing step is therefore skipped, as the normalization of each entity can be evaluated independently. The similarity between the reference and the predicted normalization is defined differently for Habitat and Bacteria entities. For Habitat entities, the Wang similarity is used with a weight of 0.65. The similarity for Bacteria is stricter: it is equal to 1 if the taxon identifiers are the same, and equal to 0 if they are different.
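The sketch below illustrates one reading of the Habitat similarity. It assumes that "Wang similarity" denotes the ontology-based semantic similarity of Wang et al. (2007), and that the weight 0.65 is the contribution factor applied to each is_a edge; neither point is stated explicitly on this page. The toy hierarchy and its concept names are invented and are not taken from OntoBiotope.

```python
from typing import Dict, List


def s_values(term: str, parents: Dict[str, List[str]],
             weight: float = 0.65) -> Dict[str, float]:
    """Semantic contribution of `term` and each of its ancestors to `term`.

    The term itself contributes 1; an ancestor contributes `weight` times
    the best contribution among its children on a path down to `term`.
    """
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            contribution = weight * values[child]
            if contribution > values.get(parent, 0.0):
                values[parent] = contribution
                frontier.append(parent)  # revisit ancestors with the better value
    return values


def wang_similarity(a: str, b: str, parents: Dict[str, List[str]],
                    weight: float = 0.65) -> float:
    """Shared contributions over total contributions of both terms."""
    sa, sb = s_values(a, parents, weight), s_values(b, parents, weight)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


# Invented toy hierarchy: child concept -> list of parent concepts.
toy_parents = {
    "soil": ["environment"],
    "water": ["environment"],
    "lake": ["water"],
    "environment": [],
}
sim_siblings = wang_similarity("soil", "water", toy_parents)
```

Under these assumptions, "soil" and "water" share only the ancestor "environment", which contributes 0.65 to each of them, giving (0.65 + 0.65) / (1.65 + 1.65), roughly 0.39. A concept compared with itself always scores 1.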

The submissions are evaluated with their Precision, defined as:

Precision = ΣS_p / N

Where N is the number of entities.

Two alternate evaluations evaluate only the normalization of Habitat and Bacteria entities respectively.

cat+ner

In the cat+ner sub-task, the entities are paired using a similarity function that takes into account the boundaries accuracy as well as the normalization accuracy:

S_p = B . C

Where B is the boundaries similarity function, and C is the normalization similarity function:

B = I_c / U_c

For Bacteria: if the predicted taxon identifier is the same as the reference, then C = 1; otherwise C = 0.

For Habitat: C = Wang(0.65).

For the normalization of Habitat entities, the Wang similarity is used with a weight of 0.65. The similarity for Bacteria is stricter as it is equal to 1 if the taxon identifiers are the same, and equal to 0 if the taxon identifiers are different.

Submissions are evaluated using the Slot Error Rate (Makhoul et al., 1999). Alternate scores are provided to measure the contribution of boundaries accuracy and normalization accuracy, for both Habitat and Bacteria entities.
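For orientation, the sketch below computes the Slot Error Rate in its common form, (substitutions + deletions + insertions) divided by the number of reference slots. How this sub-task weights partial matches when counting substitutions is not specified on this page, so the function simply takes the three error totals as inputs; the function name and the numbers are illustrative.

```python
def slot_error_rate(substitutions: float, deletions: float,
                    insertions: float, n_reference: int) -> float:
    """SER = (S + D + I) / N, where N is the number of reference entities."""
    if n_reference <= 0:
        raise ValueError("SER is undefined without reference entities")
    return (substitutions + deletions + insertions) / n_reference


# Made-up totals: 2 substitutions, 1 deletion, 3 insertions, 10 reference entities.
ser = slot_error_rate(2, 1, 3, 10)
```

Unlike F1, a lower SER is better, and the value can exceed 1 when a submission produces many insertions.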

kb and kb+ner

The evaluation of the kb and kb+ner sub-tasks is based on the capacity of submissions to populate a knowledge base (KB). For the evaluation, two knowledge bases are built: one derived from the reference annotations, and one derived from the predicted annotations. The submissions are evaluated by comparing the predicted KB with the reference KB.

To build the KB, each Lives_In event is turned into an association between a bacterial taxon and a habitat concept from OntoBiotope. The bacterial taxon is the NCBI_Taxonomy normalization of the Bacteria argument. The habitat is the OntoBiotope normalization of the Location argument. In other words, the Lives_In event is turned into a taxon-habitat association by discarding the text-bound entities.

All associations are collected, and redundant associations removed. In the evaluation service they are named KB_Lives_In.
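The construction of the KB can be sketched as a simple deduplication of (taxon, habitat) pairs. The `LivesIn` record, its field names and the identifiers used in the example are hypothetical; the sketch also assumes that every event already carries both normalizations.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class LivesIn(NamedTuple):
    """A text-bound Lives_In event together with the normalization of its arguments."""
    bacteria_text: str
    taxon_id: str       # NCBI_Taxonomy normalization of the Bacteria argument
    location_text: str
    habitat_id: str     # OntoBiotope normalization of the Location argument


def build_kb(events: Iterable[LivesIn]) -> Set[Tuple[str, str]]:
    """Drop the text-bound parts and keep each (taxon, habitat) association once."""
    return {(event.taxon_id, event.habitat_id) for event in events}


# Placeholder identifiers, not real NCBI_Taxonomy or OntoBiotope entries.
events = [
    LivesIn("B. example", "taxon:1", "human gut", "habitat:A"),
    LivesIn("Bacterium example", "taxon:1", "gut", "habitat:A"),
    LivesIn("B. example", "taxon:1", "soil", "habitat:B"),
]
kb = build_kb(events)
```

The first two events differ only in their surface text, so they collapse into a single association and the resulting KB holds two entries.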

Reference and predicted KB_Lives_In associations are paired in the same way annotations would be paired, using a similarity function S_p. However, the pairing algorithm searches for the closest reference association for each predicted association. In this way each reference association can be paired with several predicted associations; the score of a reference association is the mean of its similarities to each of its paired predicted associations. The final score is the sum of the reference association scores.

The similarity function is defined as:

S_p = C_Bacteria . C_Habitat

C_Bacteria: If the reference and predicted taxon identifiers are equal, then C_Bacteria = 1, otherwise C_Bacteria = 0.

C_Habitat = Wang(0.65)

For habitats, the Wang similarity is used with a weight of 0.65.
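The KB comparison described above can be sketched as follows. The habitat similarity is passed in as a function so that the example stays self-contained; the toy similarity used here is a made-up stand-in for the Wang similarity, the identifiers are placeholders, and leaving predictions with no non-zero similarity unpaired is an assumption carried over from the general pairing rule.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Association = Tuple[str, str]  # (taxon identifier, habitat identifier)


def kb_similarity(ref: Association, pred: Association,
                  habitat_sim: Callable[[str, str], float]) -> float:
    """S_p = C_Bacteria . C_Habitat."""
    c_bacteria = 1.0 if ref[0] == pred[0] else 0.0
    return c_bacteria * habitat_sim(ref[1], pred[1])


def kb_score(reference: Sequence[Association], predicted: Sequence[Association],
             habitat_sim: Callable[[str, str], float]) -> float:
    """Sum, over reference associations, of the mean similarity of their paired predictions."""
    paired: Dict[Association, List[float]] = defaultdict(list)
    for pred in predicted:
        best_ref, best_sim = None, 0.0
        for ref in reference:
            s = kb_similarity(ref, pred, habitat_sim)
            if s > best_sim:
                best_ref, best_sim = ref, s
        if best_ref is not None:  # predictions with no non-zero similarity stay unpaired
            paired[best_ref].append(best_sim)
    return sum(sum(sims) / len(sims) for sims in paired.values())


def toy_habitat_sim(a: str, b: str) -> float:
    """Made-up stand-in for the Wang similarity between two habitat concepts."""
    if a == b:
        return 1.0
    return 0.5 if {a, b} == {"habitat:A", "habitat:A1"} else 0.0


reference_kb = [("taxon:1", "habitat:A"), ("taxon:2", "habitat:B")]
predicted_kb = [("taxon:1", "habitat:A"), ("taxon:1", "habitat:A1"), ("taxon:3", "habitat:B")]
score = kb_score(reference_kb, predicted_kb, toy_habitat_sim)
```

In this example the first two predictions are both closest to the first reference association, which therefore scores (1.0 + 0.5) / 2 = 0.75. The third prediction has a different taxon from every reference association and contributes nothing, and the second reference association receives no prediction, so the final score is 0.75.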
