# SeeDev evaluation

## Evaluation results

**Congratulations to all participants** for submitting their predictions! The definitive results are presented in the following table.

* Results hidden upon request of the team

**Download detailed results.**

## Evaluation service

**The evaluation service is available!**

If you use the evaluation, please cite the following publication:

@inproceedings{seedev2016overview, title={Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016.}, author={Chaix, Estelle and Dubreucq, Bertrand and Fatihi, Abdelhak and Valsamou, Dialekti and Bossy, Robert and Ba, Mouhamadou and Del{\'e}ger, Louise and Zweigenbaum, Pierre and Bessieres, Philippe and Lepiniec, Loic and others}, booktitle={Proceedings of the 4th BioNLP Shared Task Workshop}, pages={1--11}, year={2016} }

## General evaluation algorithm

The evaluation is performed in three steps:

1. Pairing of reference and predicted annotations.
2. Filtering of reference-prediction pairs.
3. Computation of measures.

There may be additional filtering or rearrangement steps in order to accommodate a specific sub-task or to compute alternate scores. The description of each task details the specifics of the evaluation.

### Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function *S*_{p} that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of *S*_{p} over all pairs, under the constraint that no pair has an *S*_{p} equal to zero. *S*_{p} is specific to each task; refer to the description of the evaluation of each sub-task for its specification.

* A pair where *S*_{p} equals 1 is called a *True Positive*, or a *Match*.
* A pair where *S*_{p} is below 1 is called a *Partial Match*, or a *Substitution*.
* A reference annotation that has not been paired is called a *False Negative*, or a *Deletion*.
* A predicted annotation that has not been paired is called a *False Positive*, or an *Insertion*.
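The pairing step and the four outcomes above can be sketched in Python. This is a minimal illustration, not the code of the evaluation service: the names `best_pairing`, `classify` and `toy_similarity` are hypothetical, and the exhaustive search is only practical for small inputs (a real implementation would more likely use an assignment algorithm).

```python
from functools import lru_cache


def best_pairing(refs, preds, sim):
    """Return (pairs, false_negatives, false_positives).

    pairs holds (ref_index, pred_index, score) tuples with score > 0, chosen
    so that the sum of scores is maximal. Exhaustive search, small inputs only.
    """
    scores = [[sim(r, p) for p in preds] for r in refs]

    @lru_cache(maxsize=None)
    def solve(i, used):
        # Best (total, pairs) for references i.. given a bitmask of used predictions.
        if i == len(refs):
            return 0.0, ()
        best = solve(i + 1, used)  # option: leave reference i unpaired
        for j in range(len(preds)):
            s = scores[i][j]
            if s > 0 and not used & (1 << j):  # zero-score pairs are never formed
                total, rest = solve(i + 1, used | (1 << j))
                if s + total > best[0]:
                    best = (s + total, ((i, j, s),) + rest)
        return best

    _, pairs = solve(0, 0)
    paired_refs = {i for i, _, _ in pairs}
    paired_preds = {j for _, j, _ in pairs}
    false_negatives = [i for i in range(len(refs)) if i not in paired_refs]
    false_positives = [j for j in range(len(preds)) if j not in paired_preds]
    return list(pairs), false_negatives, false_positives


def classify(score):
    """Name the outcome of a pair from its similarity score."""
    return "Match" if score == 1 else "Partial Match"


def toy_similarity(ref, pred):
    # Hypothetical similarity used only for this example.
    if ref == pred:
        return 1.0
    return 0.5 if ref[0] == pred[0] else 0.0


pairs, fn, fp = best_pairing(["a", "b", "c"], ["a", "b2", "x"], toy_similarity)
for i, j, s in pairs:
    print(i, j, s, classify(s))
print("False Negatives:", fn, "False Positives:", fp)
```

In this toy run the first reference is a Match, the second a Partial Match, the third reference is a False Negative and the third prediction a False Positive.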

### Filtering

The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance of different annotation types.
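As an illustration of filtering by annotation type, the sketch below keeps only the pairs of one type. The data layout is an assumption made for the example: each pair is a `(reference, prediction, score)` tuple in which an unpaired side is `None`, and each annotation is a dictionary with a `"type"` key. The function names are hypothetical.

```python
def annotation_type(pair):
    """Type of a pair, taken from the reference or, if absent, the prediction."""
    ref, pred, _score = pair
    return (ref if ref is not None else pred)["type"]


def filter_by_type(pairs, wanted_type):
    """Keep only the pairs whose annotation type is wanted_type."""
    return [pair for pair in pairs if annotation_type(pair) == wanted_type]


all_pairs = [
    ({"type": "Is_Linked_To"}, {"type": "Is_Linked_To"}, 1.0),  # Match
    ({"type": "Has_Sequence_Identical_To"}, None, 0.0),         # False Negative
    (None, {"type": "Is_Linked_To"}, 0.0),                      # False Positive
]
linked = filter_by_type(all_pairs, "Is_Linked_To")
print(len(linked))
```

The measures would then be computed on `linked` alone to obtain the alternate score for that type.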

### Measures

Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, Partial Matches, or an aggregate of these scores like Recall, Precision, F1, Slot Error Rate, etc.

Each sub-task has a different set of measures. **Participants are ranked by the first measure**.
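The aggregate measures can be sketched as follows. The exact formulas used by the service are not given in this section, so the ones below are assumptions: Recall and Precision give each pair a credit equal to its similarity score, and the Slot Error Rate uses the common definition (Substitutions + Deletions + Insertions) divided by the number of reference annotations. The function name `measures` is hypothetical.

```python
def measures(pair_scores, n_false_negatives, n_false_positives):
    """Compute Recall, Precision, F1 and Slot Error Rate.

    pair_scores lists the similarity score (> 0) of each retained pair.
    Assumed formulas, see the accompanying text.
    """
    n_pairs = len(pair_scores)
    matches = sum(1 for s in pair_scores if s == 1)
    substitutions = n_pairs - matches
    n_reference = n_pairs + n_false_negatives
    n_predicted = n_pairs + n_false_positives
    credit = sum(pair_scores)  # partial matches earn partial credit

    recall = credit / n_reference if n_reference else 0.0
    precision = credit / n_predicted if n_predicted else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    ser = ((substitutions + n_false_negatives + n_false_positives) / n_reference
           if n_reference else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1, "ser": ser}


result = measures([1.0, 1.0, 0.5], n_false_negatives=1, n_false_positives=2)
print(result)
```

A stricter variant would count only full Matches in the numerators of Recall and Precision; which variant applies depends on the sub-task.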

## Sub-task evaluations

### SeeDev-binary

The pairing similarity function of SeeDev-binary is defined as:

If the reference and predicted events have the same type and if the two arguments are the same, then S_{p} = 1; otherwise S_{p} = 0.

The submissions are evaluated with *Recall*, *Precision* and *F-1*. Note that the events of type *Is_Linked_To*, *Has_Sequence_Identical_To* and *Is_Functionally_Equivalent_To* are considered commutative: the two arguments can be reversed. Events of all other types are not commutative.

Alternate scores are provided for each event type.
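The SeeDev-binary similarity, including the commutative event types, can be sketched as below. The event representation, a `(type, first_argument, second_argument)` tuple, is an assumption made for the example; the function name `sp_binary`, the entity identifiers and the type `Directed_Example` are hypothetical.

```python
# Event types whose two arguments may be swapped, as listed above.
COMMUTATIVE_TYPES = frozenset({
    "Is_Linked_To",
    "Has_Sequence_Identical_To",
    "Is_Functionally_Equivalent_To",
})


def sp_binary(ref, pred):
    """Return 1.0 if the events have the same type and arguments, else 0.0."""
    ref_type, ref_a, ref_b = ref
    pred_type, pred_a, pred_b = pred
    if ref_type != pred_type:
        return 0.0
    if (ref_a, ref_b) == (pred_a, pred_b):
        return 1.0
    # Reversed arguments are accepted only for commutative types.
    if ref_type in COMMUTATIVE_TYPES and (ref_a, ref_b) == (pred_b, pred_a):
        return 1.0
    return 0.0


print(sp_binary(("Is_Linked_To", "T1", "T2"), ("Is_Linked_To", "T2", "T1")))
print(sp_binary(("Directed_Example", "T1", "T2"), ("Directed_Example", "T2", "T1")))
```

The first call scores a reversed commutative event as a match; the second shows that the same reversal is rejected for a non-commutative type.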

### SeeDev-full

The pairing similarity function of SeeDev-full is derived from that of SeeDev-binary. In addition, it tolerates mistakes in the negation and in the optional arguments:

S_{p} = S_{p-binary} × S_{Neg} × S_{Opt}

Where *S*_{p-binary} is the pairing function of SeeDev-binary described above. Therefore, in order to be paired, a reference and a predicted event must have the same type and the same mandatory arguments.

*S*_{Neg} is the negation similarity:

If both the reference and the predicted events are negated, or if neither of them is negated, then S_{Neg} = 1; otherwise S_{Neg} = 0.5.

If the predicted event is negated where the reference event is not (or vice-versa), then *S*_{Neg} applies a penalty halving the score.

*S*_{Opt} is the similarity for optional arguments:

S_{Opt} = 1 - (E_{Opt} / N)

*E*_{Opt} is the number of errors in the optional arguments, and *N* is the cardinality of the union of all optional arguments in the reference and predicted events. The errors are counted as follows:

* A missing optional argument counts as 1 error.
* An extra optional argument counts as 1 error.
* A wrong optional argument counts as 2 errors.
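The negation and optional-argument similarities, and their product with the binary score, can be sketched as follows. Several points are assumptions made for the example: optional arguments are represented as a dictionary from role to value with at most one value per role, *N* is taken as the number of distinct (role, value) pairs across both events, and *S*_{Opt} is set to 1 when neither event has optional arguments. The function names and the role and entity labels are hypothetical.

```python
def s_neg(ref_negated, pred_negated):
    """1.0 when both events agree on negation, 0.5 otherwise."""
    return 1.0 if ref_negated == pred_negated else 0.5


def s_opt(ref_opt, pred_opt):
    """Similarity of optional arguments: 1 - E_Opt / N."""
    errors = 0
    for role in set(ref_opt) | set(pred_opt):
        if role not in pred_opt:
            errors += 1  # missing optional argument
        elif role not in ref_opt:
            errors += 1  # extra optional argument
        elif ref_opt[role] != pred_opt[role]:
            errors += 2  # wrong optional argument
    # Assumed reading of N: distinct (role, value) pairs in either event.
    n = len(set(ref_opt.items()) | set(pred_opt.items()))
    return 1.0 if n == 0 else 1 - errors / n


def sp_full(binary_score, ref_negated, pred_negated, ref_opt, pred_opt):
    """Product of the binary, negation and optional-argument similarities."""
    return binary_score * s_neg(ref_negated, pred_negated) * s_opt(ref_opt, pred_opt)


ref_opt = {"RoleA": "T5", "RoleB": "T6"}
pred_opt = {"RoleA": "T5", "RoleB": "T7", "RoleC": "T8"}
print(s_opt(ref_opt, pred_opt))                      # 3 errors over N = 4
print(sp_full(1.0, False, True, ref_opt, pred_opt))  # negation mismatch halves it
```

Under this reading, a wrong argument contributes two elements to the union and two errors, so *S*_{Opt} always stays between 0 and 1.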

The submissions are evaluated with *Recall*, *Precision* and *F-1*. The service also computes the *Slot Error Rate*.

Alternate scores are provided for each event type.