SeeDev evaluation

Evaluation results

Congratulations to all participants for submitting their predictions! The definitive results are presented in the following table.

Teams F1 Recall Precision
LitWay 0.432 0.448 0.417
UniMelb 0.364 0.386 0.345
VERSE 0.342 0.458 0.273
-- 0.335 0.245 0.533
ULISBOA 0.306 0.256 0.379
LIMSI 0.255 0.318 0.212
DUTIR* -- -- --
* Results hidden upon request of the team

Evaluation service

General evaluation algorithm

The evaluation is performed in three steps:

  1. Pairing of reference and predicted annotations.
  2. Filtering of reference-prediction pairs.
  3. Computation of measures.
There may be additional filtering or rearrangement steps in order to accommodate a specific sub-task or to compute alternate scores. The description of each task details the specifics of the evaluation.

Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function Sp that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of Sp over all pairs, such that no pair has an Sp equal to zero. Sp is specific to each sub-task; refer to the description of the evaluation of each sub-task for its specification.

  • A pair where Sp equals 1 is called a True Positive, or a Match.
  • A pair where Sp is below 1 is called a Partial Match, or a Substitution.
  • A reference annotation that has not been paired is called a False Negative, or a Deletion.
  • A predicted annotation that has not been paired is called a False Positive, or an Insertion.
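The pairing and the four outcome categories above can be sketched as follows. This is a minimal illustration, not the evaluation service's implementation: it assumes that the pairing is one-to-one (each prediction is paired with at most one reference), it searches exhaustively and therefore only suits a handful of annotations, and the names best_pairing and classify are hypothetical. The similarity function sp is passed in by the caller.

```python
from itertools import permutations


def best_pairing(refs, preds, sp):
    """Return the pairs (ref_index, pred_index, score) maximizing the sum of sp.

    Assumption: one-to-one pairing. Exhaustive search, exponential time.
    """
    # Pad with None so that any reference may remain unpaired.
    slots = list(range(len(preds))) + [None] * len(refs)
    best_score, best_pairs = -1.0, []
    for chosen in permutations(slots, len(refs)):
        pairs = [(i, j, sp(refs[i], preds[j]))
                 for i, j in enumerate(chosen) if j is not None]
        # No pair may have a similarity equal to zero.
        pairs = [pair for pair in pairs if pair[2] > 0]
        score = sum(s for _, _, s in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


def classify(refs, preds, pairs):
    """Split a pairing into matches, substitutions, deletions and insertions."""
    matches = [(i, j) for i, j, s in pairs if s == 1]
    substitutions = [(i, j) for i, j, s in pairs if s < 1]
    paired_refs = {i for i, _, _ in pairs}
    paired_preds = {j for _, j, _ in pairs}
    deletions = [i for i in range(len(refs)) if i not in paired_refs]
    insertions = [j for j in range(len(preds)) if j not in paired_preds]
    return matches, substitutions, deletions, insertions
```

For realistic input sizes, a polynomial-time assignment algorithm would be a more sensible choice than the exhaustive search shown here.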

Filtering

The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance of different annotation types.
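As an illustration of filtering by annotation type, the sketch below groups evaluation outcomes per event type so that separate counts can be derived for each. The input format, a list of (event_type, label) tuples, and the function name counts_by_type are assumptions made for this example; the service's internal representation is not described here.

```python
from collections import Counter


def counts_by_type(outcomes):
    """Count outcome labels separately for each event type.

    outcomes: iterable of (event_type, label) tuples, where label is a
    string such as "match", "deletion" or "insertion" (hypothetical format).
    """
    per_type = {}
    for event_type, label in outcomes:
        per_type.setdefault(event_type, Counter())[label] += 1
    return per_type
```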

Measures

Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, Partial Matches, or an aggregate of these scores like Recall, Precision, F1, Slot Error Rate, etc.
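The aggregate measures can be sketched with the standard definitions of Precision, Recall and F1 (the harmonic mean of the two). The sketch assumes that only full Matches count as correct, which corresponds to the case where Sp is either 0 or 1; how Partial Matches are weighted is not specified here and is left out. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(matches, false_positives, false_negatives):
    """Compute (precision, recall, F1) from outcome counts.

    Assumption: only full Matches are counted as correct.
    """
    predicted = matches + false_positives   # all predicted annotations
    reference = matches + false_negatives   # all reference annotations
    precision = matches / predicted if predicted else 0.0
    recall = matches / reference if reference else 0.0
    return precision, recall, f1_score(precision, recall)
```

Applied to the LitWay row of the results table (Recall 0.448, Precision 0.417), f1_score gives approximately 0.432, which agrees with the reported F1.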

Each sub-task has a different set of measures. Participants are ranked by the first measure.

Sub-task evaluations

SeeDev-binary

The pairing similarity function of SeeDev-binary is defined as:

If the reference and predicted events have the same type and if the two arguments are the same, then Sp = 1; otherwise Sp = 0.

The submissions are evaluated with Recall, Precision and F1. Note that events of type Is_Linked_To, Has_Sequence_Identical_To and Is_Functionally_Equivalent_To are considered commutative: the two arguments can be reversed. Events of all other types are not commutative.
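The SeeDev-binary similarity, including the commutative event types, can be sketched as below. The event representation, a tuple (type, first argument, second argument) holding entity identifiers, is an assumption made for this example, and the function name sp_binary is hypothetical.

```python
# Event types whose two arguments may be reversed.
COMMUTATIVE_TYPES = {
    "Is_Linked_To",
    "Has_Sequence_Identical_To",
    "Is_Functionally_Equivalent_To",
}


def sp_binary(ref, pred):
    """Return 1.0 if the events have the same type and arguments, else 0.0.

    ref, pred: (event_type, arg1, arg2) tuples (hypothetical representation).
    """
    ref_type, ref_a, ref_b = ref
    pred_type, pred_a, pred_b = pred
    if ref_type != pred_type:
        return 0.0
    if (ref_a, ref_b) == (pred_a, pred_b):
        return 1.0
    # Reversed arguments are accepted only for commutative types.
    if ref_type in COMMUTATIVE_TYPES and (ref_a, ref_b) == (pred_b, pred_a):
        return 1.0
    return 0.0
```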

Alternate scores are provided for each event type.

SeeDev-full

The pairing similarity function of SeeDev-full is derived from that of SeeDev-binary; additionally, it allows for mistakes in the optional arguments:

Sp = Sp-binary × SNeg × SOpt

Where Sp-binary is the pairing function of SeeDev-binary described above. Therefore, in order to be paired, a reference and a predicted event must have the same type and the same mandatory arguments.

SNeg is the negation similarity:

If both the reference and predicted events are negated, or if neither of them is negated, then SNeg = 1; otherwise SNeg = 0.5.

If the predicted event is negated but the reference event is not (or vice versa), SNeg applies a penalty that halves the score.
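The negation similarity reduces to a comparison of two flags. In the sketch below, the function name s_neg and the boolean representation of negation are assumptions made for illustration.

```python
def s_neg(ref_negated: bool, pred_negated: bool) -> float:
    """Return 1.0 when the negation status agrees, 0.5 otherwise."""
    return 1.0 if ref_negated == pred_negated else 0.5
```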

SOpt is the similarity for optional arguments:

SOpt = 1 - (EOpt / N)

EOpt is the number of errors in the optional arguments, and N is the cardinality of the union of all optional arguments in the reference and predicted events. The errors are counted as follows:

  • A missing optional argument counts as 1 error.
  • An extra optional argument counts as 1 error.
  • A wrong optional argument counts as 2 errors.
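The error counting above can be sketched as follows. Several points are assumptions made for this example and not stated in the description: optional arguments are represented as a dictionary from role name to entity identifier, N is taken as the number of distinct (role, value) pairs across both events, and SOpt is set to 1 when neither event has optional arguments (to avoid a division by zero). The function name s_opt and the role names used in any call are hypothetical.

```python
def s_opt(ref_opt, pred_opt):
    """Similarity of optional arguments: 1 - (errors / N).

    ref_opt, pred_opt: dicts mapping role name -> entity identifier
    (hypothetical representation).
    """
    # N: cardinality of the union of optional arguments of both events.
    n = len(set(ref_opt.items()) | set(pred_opt.items()))
    if n == 0:
        return 1.0  # assumption: no optional arguments means no penalty
    errors = 0
    for role in set(ref_opt) | set(pred_opt):
        if role not in pred_opt:
            errors += 1  # missing optional argument
        elif role not in ref_opt:
            errors += 1  # extra optional argument
        elif ref_opt[role] != pred_opt[role]:
            errors += 2  # wrong optional argument
    return 1.0 - errors / n
```

Under these assumptions, a wrong argument adds two entries to N and two errors, so SOpt never falls below 0. As a worked case of the full similarity: with Sp-binary = 1, SNeg = 0.5 and SOpt = 0.5, the product Sp is 0.25.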

The submissions are evaluated with Recall, Precision and F1. The service also computes the Slot Error Rate.

Alternate scores are provided for each event type.