# bb3-eval

## Evaluation results

**Congratulations to all participants** for submitting their predictions! The definitive results are presented in the following tables.

### BB-event

Download detailed results.

### BB-kb

Download detailed results.

### BB-cat+ner

### BB-event+ner

### BB-kb+ner

## Evaluation service

**The evaluation service is available!**

## General evaluation algorithm

The evaluation is performed in three steps:

1. Pairing of reference and predicted annotations.
2. Filtering of reference-prediction pairs.
3. Computation of measures.

There may be additional filtering or rearrangement steps in order to accommodate a specific sub-task or to compute alternate scores. The description of each task details the specifics of the evaluation.

### Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function *S*_{p} that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of *S*_{p} over all pairs, under the constraint that no pair has an *S*_{p} equal to zero. *S*_{p} is specific to each sub-task; refer to the description of the evaluation of each sub-task for its specification.

- A pair where *S*_{p} equals 1 is called a *True Positive*, or a *Match*.
- A pair where *S*_{p} is below 1 is called a *Partial Match*, or a *Substitution*.
- A reference annotation that has not been paired is called a *False Negative*, or a *Deletion*.
- A predicted annotation that has not been paired is called a *False Positive*, or an *Insertion*.
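The pairing step can be illustrated with a minimal sketch. It assumes a one-to-one pairing and uses an exhaustive, memoized search that is only practical for small inputs; the function name `best_pairing` and its arguments are hypothetical and do not describe how the evaluation service is implemented.

```python
from functools import lru_cache


def best_pairing(refs, preds, sim):
    """Return (total, pairs) maximizing the sum of sim over one-to-one pairs.

    Pairs are (reference index, prediction index) tuples.
    Pairs whose similarity is zero are never kept.
    """
    n, m = len(refs), len(preds)
    scores = [[sim(r, p) for p in preds] for r in refs]

    @lru_cache(maxsize=None)
    def solve(i, used):
        # i: index of the next reference; used: bitmask of taken predictions.
        if i == n:
            return 0.0, ()
        # Option 1: leave reference i unpaired (a False Negative).
        best_total, best_pairs = solve(i + 1, used)
        # Option 2: pair reference i with a free prediction of non-zero similarity.
        for j in range(m):
            if used & (1 << j) or scores[i][j] == 0:
                continue
            total, pairs = solve(i + 1, used | (1 << j))
            total += scores[i][j]
            if total > best_total:
                best_total, best_pairs = total, ((i, j),) + pairs
        return best_total, best_pairs

    return solve(0, 0)


def exact(ref, pred):
    # Toy similarity: 1 for identical annotations, 0 otherwise.
    return 1.0 if ref == pred else 0.0


total, pairs = best_pairing(["a", "b"], ["b", "a", "c"], exact)
paired_predictions = {j for _, j in pairs}
false_positives = [j for j in range(3) if j not in paired_predictions]
```

In this toy run both references are matched and the third prediction stays unpaired, so it counts as a False Positive. For realistic sizes an assignment algorithm would be needed instead of exhaustive search.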

### Filtering

The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance of different annotation types.

### Measures

Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, Partial Matches, or an aggregate of these scores like Recall, Precision, F1, Slot Error Rate, etc.

Each sub-task has a different set of measures. **Participants are ranked by the first measure**.

## Sub-tasks evaluations

### event

The pairing matches reference *Lives_In* events with predicted *Lives_In* events. The matching similarity function is defined as:

- *S*_{p} = 1 if the *Bacteria* arguments of the reference and predicted events are the same entity or equivalent entities, and the *Location* arguments of the reference and predicted events are the same entity or equivalent entities.
- *S*_{p} = 0 otherwise.
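As an illustration, the similarity for the event sub-task could be sketched as follows. The event representation (a dictionary of entity identifiers) and the `canonical` mapping used to express entity equivalence are assumptions made for this example only.

```python
def event_similarity(ref_event, pred_event, canonical=None):
    """Return 1.0 if both arguments are the same or equivalent entities, else 0.0.

    `canonical` is a hypothetical mapping from an entity identifier to the
    identifier that represents its equivalence set.
    """
    canonical = canonical or {}

    def same(a, b):
        return canonical.get(a, a) == canonical.get(b, b)

    if (same(ref_event["Bacteria"], pred_event["Bacteria"])
            and same(ref_event["Location"], pred_event["Location"])):
        return 1.0
    return 0.0


# T2 is declared equivalent to T1 in this toy example.
equivalences = {"T2": "T1"}
reference = {"Bacteria": "T1", "Location": "T5"}
prediction = {"Bacteria": "T2", "Location": "T5"}
score = event_similarity(reference, prediction, equivalences)
```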

The submissions are measured using *Recall*, *Precision* and *F-1*.

Two additional alternate evaluations are computed: one for only *Lives_In* events where the *Location* argument has type *Habitat*, and another if it has type *Geographical*.

### event+ner

As for the event sub-task, the pairing matches reference *Lives_In* events with predicted *Lives_In* events. However the entities, which are the potential arguments of events, are not given as input. The pairing similarity function takes into account how much the boundaries and types of the arguments match.

S_{p} = S_{arg}(Bacteria) . S_{arg}(Location)

Where *S*_{arg} is a similarity function between two entities, and *S*_{arg}(*role*) is the similarity between the arguments that fill *role* in the reference and predicted *Lives_In* events:

*S*_{arg} = *T* . *B*

Where *T* is the type similarity function, and *B* is the boundaries similarity function:

If both entities have the same type, then *T* = 1. Otherwise *T* = 0.

*B* = *I*_{c} / *U*_{c}

*I*_{c} is the number of characters covered by **both** of the two entities, and *U*_{c} is the number of characters covered by **either** of the two entities. *B* can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
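The definitions of *T*, *B*, *S*_{arg} and *S*_{p} above can be sketched in a few lines. This is an illustrative sketch that assumes each entity is a dictionary with a `type` and a single contiguous `span` given as `(start, end)` character offsets with an exclusive end; these names and this representation are hypothetical.

```python
def boundary_similarity(ref_span, pred_span):
    """Jaccard index over the character offsets covered by two spans."""
    ref_chars = set(range(*ref_span))
    pred_chars = set(range(*pred_span))
    union = ref_chars | pred_chars
    return len(ref_chars & pred_chars) / len(union) if union else 0.0


def argument_similarity(ref_entity, pred_entity):
    """S_arg = T * B, where T is 1 for identical types and 0 otherwise."""
    t = 1.0 if ref_entity["type"] == pred_entity["type"] else 0.0
    return t * boundary_similarity(ref_entity["span"], pred_entity["span"])


def event_ner_similarity(ref_event, pred_event):
    """S_p = S_arg(Bacteria) * S_arg(Location)."""
    return (argument_similarity(ref_event["Bacteria"], pred_event["Bacteria"])
            * argument_similarity(ref_event["Location"], pred_event["Location"]))


ref_event = {
    "Bacteria": {"type": "Bacteria", "span": (0, 10)},
    "Location": {"type": "Habitat", "span": (20, 30)},
}
pred_event = {
    "Bacteria": {"type": "Bacteria", "span": (0, 10)},
    "Location": {"type": "Habitat", "span": (25, 30)},
}
similarity = event_ner_similarity(ref_event, pred_event)  # 1.0 * (5 / 10)
```

Representing spans as sets of character offsets keeps the sketch short; it would also extend to discontinuous entities by taking the union of several ranges.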

As with the event sub-task, submissions are measured using *Recall*, *Precision* and *F-1*, with *Recall* and *Precision* redefined in order to take partial matches into account:

Recall = ΣS_{p} / N

Precision = ΣS_{p} / M

Where *N* is the number of reference events, and *M* the number of predicted events.
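A short sketch of these redefined measures, assuming the pairing step has already produced the list of *S*_{p} values for the paired events (the function and parameter names are hypothetical). *F-1* is computed here as the usual harmonic mean of *Precision* and *Recall*.

```python
def partial_match_scores(pair_similarities, n_reference, n_predicted):
    """Return (recall, precision, f1) where each pair contributes its similarity."""
    total = sum(pair_similarities)
    recall = total / n_reference if n_reference else 0.0
    precision = total / n_predicted if n_predicted else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# One full match and one partial match, 4 reference events, 2 predicted events.
recall, precision, f1 = partial_match_scores([1.0, 0.5], n_reference=4, n_predicted=2)
```

With only full matches (every *S*_{p} equal to 1), these formulas reduce to the standard definitions of *Recall* and *Precision*.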

Alternate evaluations measure the submissions for events with a *Location* argument of type *Habitat*, and for events with a *Location* argument of type *Geographical*.

Alternate evaluations are also provided by ignoring the effect of boundary errors. In these evaluations *B* is still used for the pairing algorithm but partial matches are considered as full true positives.

### cat

In the cat sub-task, entities are given as input; the predictions consist of the normalization of the entities. Therefore the pairing step is skipped, as the normalization of each entity can be evaluated independently. The similarity between the reference and the predicted normalization is defined differently for *Habitat* and *Bacteria* entities. For *Habitat* entities, the Wang similarity is used with a weight of 0.65. The similarity for *Bacteria* is stricter: it is equal to 1 if the taxon identifiers are the same, and equal to 0 if the taxon identifiers are different.
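For readers unfamiliar with the Wang similarity, the sketch below follows its commonly described formulation: each concept is characterized by itself and its ancestors, an ancestor's contribution decays by the weight at every step up the hierarchy, and two concepts are compared through their shared ancestors. The toy hierarchy and the function names are hypothetical, and the evaluation service may differ in details such as the handling of relation types.

```python
def s_values(term, parents, weight=0.65):
    """Contribution of `term` and of each of its ancestors to the semantics of `term`."""
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for parent in parents.get(current, ()):
            candidate = weight * values[current]
            # Keep the largest contribution when several paths reach an ancestor.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(a, b, parents, weight=0.65):
    """Similarity in [0, 1] between two concepts of the same hierarchy."""
    sa = s_values(a, parents, weight)
    sb = s_values(b, parents, weight)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Hypothetical toy hierarchy: child -> list of parents.
toy_parents = {
    "soil": ["terrestrial"],
    "sand": ["terrestrial"],
    "terrestrial": ["habitat"],
}
sibling_similarity = wang_similarity("soil", "sand", toy_parents)
```

In this toy hierarchy the two sibling concepts share their parent and the root, which gives a similarity of roughly 0.52; identical concepts always score 1.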

The submissions are evaluated with their Precision, defined as:

Precision = ΣS_{p} / N

Where *N* is the number of entities.

Two alternate evaluations evaluate only the normalization of *Habitat* and *Bacteria* entities respectively.

### cat+ner

In the cat+ner sub-task, the entities are paired using a similarity function that takes into account the boundaries accuracy as well as the normalization accuracy:

S_{p} = B . C

Where *B* is the boundaries similarity function, and *C* is the normalization similarity function:

B = I_{c} / U_{c}

For *Bacteria*: if the predicted taxon identifier is the same as the reference, then C = 1; otherwise C = 0.

For *Habitat*: C = Wang(0.65).

*I*_{c} is the number of characters covered by **both** of the two entities, and *U*_{c} is the number of characters covered by **either** of the two entities. *B* can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.

For the normalization of *Habitat* entities, the Wang similarity is used with a weight of 0.65. The similarity for *Bacteria* is stricter as it is equal to 1 if the taxon identifiers are the same, and equal to 0 if the taxon identifiers are different.

Submissions are evaluated using the Slot Error Rate (Makhoul *et al.*, 1999). Alternate scores are provided to measure the contribution of boundaries accuracy and normalization accuracy, for both *Habitat* and *Bacteria* entities.
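As a rough illustration, the Slot Error Rate divides the total of substitutions, deletions and insertions by the number of reference entities. The sketch below additionally assumes that each paired entity contributes 1 - S_{p} to the substitution count; this weighting is an assumption of the example and is not stated in the text above.

```python
def slot_error_rate(pair_similarities, n_reference, n_predicted):
    """SER = (substitutions + deletions + insertions) / number of reference entities.

    Assumption: a pair with similarity s counts as (1 - s) of a substitution.
    """
    paired = len(pair_similarities)
    substitutions = sum(1.0 - s for s in pair_similarities)
    deletions = n_reference - paired    # reference entities left unpaired
    insertions = n_predicted - paired   # predicted entities left unpaired
    return (substitutions + deletions + insertions) / n_reference


# Two pairs (one exact, one partial), 3 reference and 3 predicted entities.
ser = slot_error_rate([1.0, 0.5], n_reference=3, n_predicted=3)
```

Unlike *Recall* and *Precision*, lower values are better, and the rate can exceed 1 when a submission contains many insertions.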

### kb and kb+ner

The evaluation of the kb and kb+ner sub-tasks is based on the capacity of submissions to populate a knowledge base (KB). For the evaluation, two knowledge bases are built: one derived from the reference annotations, and one derived from the predicted annotations. The submissions are evaluated by comparing the predicted KB with the reference KB.

To build the KB, each *Lives_In* event is turned into an association between a bacterial taxon and a habitat concept from OntoBiotope. The bacterial taxon is the *NCBI_Taxonomy* normalization of the *Bacteria* argument. The habitat is the *OntoBiotope* normalization of the *Location* argument. In other words, the *Lives_In* event is turned into a taxon-habitat association by discarding the text-bound entities.

All associations are collected, and redundant associations removed. In the evaluation service they are named *KB_Lives_In*.

Reference and predicted *KB_Lives_In* associations are paired in the same way annotations would be paired, using a similarity function *S*_{p}. However, the pairing algorithm searches for the closest reference association for each predicted association. In this way, each reference association can be paired with several predicted associations. The score of a reference association is the mean of its similarities to each paired predicted association. The final score is the sum of the reference association scores.

The similarity function is defined as:

S_{p} = C_{Bacteria} . C_{Habitat}

C_{Bacteria}: If the reference and predicted taxon identifiers are equal, then C_{Bacteria} = 1, otherwise C_{Bacteria} = 0.

C_{Habitat} = Wang(0.65)

For habitats, the Wang similarity is used with a weight of 0.65.
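The scoring of the kb and kb+ner sub-tasks described in this section can be sketched as follows. Associations are represented as `(taxon, habitat)` tuples and the habitat similarity is passed in as a function; these choices, the function names and the toy similarity used in the example are all hypothetical, and the treatment of ties and of predictions with zero similarity is an assumption.

```python
from collections import defaultdict


def kb_similarity(ref, pred, habitat_similarity):
    """S_p = C_Bacteria * C_Habitat for two (taxon, habitat) associations."""
    ref_taxon, ref_habitat = ref
    pred_taxon, pred_habitat = pred
    c_bacteria = 1.0 if ref_taxon == pred_taxon else 0.0
    return c_bacteria * habitat_similarity(ref_habitat, pred_habitat)


def kb_score(reference_kb, predicted_kb, habitat_similarity):
    """Sum, over reference associations, of the mean similarity of their paired predictions."""
    paired = defaultdict(list)
    for pred in sorted(predicted_kb):
        best_ref, best_sim = None, 0.0
        # Find the closest reference association for this prediction.
        for ref in sorted(reference_kb):
            s = kb_similarity(ref, pred, habitat_similarity)
            if s > best_sim:
                best_ref, best_sim = ref, s
        # Assumption: a prediction with no non-zero similarity stays unpaired.
        if best_ref is not None:
            paired[best_ref].append(best_sim)
    return sum(sum(sims) / len(sims) for sims in paired.values())


def toy_habitat_similarity(a, b):
    # Stand-in for the Wang similarity: 1 if identical, 0.5 if same prefix, else 0.
    if a == b:
        return 1.0
    return 0.5 if a[:2] == b[:2] else 0.0


reference_kb = {("t1", "h1"), ("t2", "h2")}
predicted_kb = {("t1", "h1"), ("t1", "h1b"), ("t3", "h2")}
score = kb_score(reference_kb, predicted_kb, toy_habitat_similarity)
```

In this toy run the first reference association receives two predictions with similarities 1 and 0.5, so it scores 0.75; the second reference association receives none, and the prediction with an unknown taxon is left unpaired.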