Evaluation results
Congratulations to all participants for submitting their predictions! The final results are presented in the table below.
Team       F1     Recall  Precision
LitWay     0.432  0.448   0.417
UniMelb    0.364  0.386   0.345
VERSE      0.342  0.458   0.273
           0.335  0.245   0.533
ULISBOA    0.306  0.256   0.379
LIMSI      0.255  0.318   0.212
DUTIR*     -      -       -

* Results hidden at the request of the team
Evaluation service
General evaluation algorithm
The evaluation is performed in three steps:
- Pairing of reference and predicted annotations.
- Filtering of reference-prediction pairs.
- Computation of measures.
There may be additional filtering or rearrangement steps in order to accommodate a specific subtask or to compute alternate scores. The description of each task details the specifics of the evaluation.
Pairing
The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function S_{p} that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of S_{p} over all pairs, such that no pair has an S_{p} equal to zero. S_{p} is specific to the task; refer to the description of the evaluation of each subtask for its specification.
- A pair where S_{p} equals 1 is called a True Positive, or a Match.
- A pair where S_{p} is below 1 is called a Partial Match, or a Substitution.
- A reference annotation that has not been paired is called a False Negative, or a Deletion.
- A predicted annotation that has not been paired is called a False Positive, or an Insertion.
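As an illustration, the pairing step can be sketched as an exhaustive search for the assignment that maximizes the sum of S_{p}; the function name and interface below are hypothetical, not part of the official scorer:

```python
from itertools import permutations

def best_pairing(n_ref, n_pred, sim):
    """Exhaustive search for the pairing that maximizes the sum of S_p.

    sim(i, j) is the similarity between reference annotation i and
    predicted annotation j. Pairs with zero similarity are dropped,
    leaving those annotations unpaired (Deletions / Insertions).
    Exponential in the number of annotations: for tiny examples only.
    """
    # Pad the predictions with None so every reference may stay unpaired.
    preds = list(range(n_pred)) + [None] * max(0, n_ref - n_pred)
    best_score, best = -1.0, []
    for perm in permutations(preds, n_ref):
        pairs = [(i, j) for i, j in enumerate(perm)
                 if j is not None and sim(i, j) > 0.0]
        score = sum(sim(i, j) for i, j in pairs)
        if score > best_score:
            best_score, best = score, pairs
    return best
```

In practice this assignment problem would be solved efficiently, for example with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`).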
Filtering
The filtering step selects a subset of reference-prediction pairs, from which the scores are computed. In all subtasks, the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance on different annotation types.
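For instance, per-type alternate scores can be obtained by keeping only the pairs of a given annotation type; a minimal sketch, where the dictionary encoding of annotations is an assumption:

```python
def filter_by_type(pairs, annotation_type):
    """Keep only the reference-prediction pairs whose reference
    annotation has the requested type; alternate scores are then
    computed from this subset. Annotations are modelled here as
    dictionaries with a "type" key, an illustrative choice."""
    return [(ref, pred) for ref, pred in pairs
            if ref["type"] == annotation_type]
```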
Measures
Measures are scores computed from the reference-prediction annotation pairs after filtering. They may count False Positives, False Negatives, Matches and Partial Matches, or aggregate these counts into scores such as Recall, Precision, F1 and Slot Error Rate.
Each subtask has a different set of measures. Participants are ranked by the first measure.
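Assuming the common convention that each retained pair contributes its S_{p} as fractional credit (the official scorer may weight partial matches differently), the aggregate measures can be sketched as:

```python
def measures(sp_values, n_ref, n_pred):
    """Recall, Precision and F1 from the similarities of the retained
    pairs. A Partial Match counts less than a Match via its S_p value;
    unpaired annotations (Deletions, Insertions) contribute only to
    the denominators n_ref and n_pred."""
    total = sum(sp_values)
    recall = total / n_ref if n_ref else 0.0
    precision = total / n_pred if n_pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```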
Subtasks evaluations
SeeDev-binary
The pairing similarity function of SeeDev-binary is defined as:
If the reference and predicted events have the same type and if the two arguments are the same, then S_{p} = 1; otherwise S_{p} = 0.
The submissions are evaluated with Recall, Precision and F1. Note that events of the types Is_Linked_To, Has_Sequence_Identical_To and Is_Functionally_Equivalent_To are considered commutative: their two arguments can be reversed. Events of all other types are not commutative.
Alternate scores are provided for each event type.
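The definition above can be written down directly; the tuple encoding of events is an illustrative assumption:

```python
# Types whose two arguments may be reversed without penalty.
COMMUTATIVE = {"Is_Linked_To", "Has_Sequence_Identical_To",
               "Is_Functionally_Equivalent_To"}

def sp_binary(ref, pred):
    """S_p for SeeDev-binary: 1 if the events have the same type and
    the same arguments (order-insensitive for commutative types),
    otherwise 0. Events are modelled as (type, arg1, arg2) tuples."""
    r_type, r1, r2 = ref
    p_type, p1, p2 = pred
    if r_type != p_type:
        return 0
    if (r1, r2) == (p1, p2):
        return 1
    if r_type in COMMUTATIVE and (r1, r2) == (p2, p1):
        return 1
    return 0
```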
SeeDev-full
The pairing similarity function of SeeDev-full is derived from that of SeeDev-binary; additionally, it allows for mistakes in the optional arguments:
S_{p} = S_{binary} · S_{Neg} · S_{Opt}
Where S_{binary} is the pairing function of SeeDev-binary described above. Therefore, in order to be paired, a reference and a predicted event must have the same type and the same mandatory arguments.
S_{Neg} is the negation similarity:
If both the reference and predicted events are negated, then S_{Neg} = 1; if neither is negated, then S_{Neg} = 1; otherwise S_{Neg} = 0.5.
In other words, if the predicted event is negated where the reference event is not (or vice versa), S_{Neg} applies a penalty halving the score.
S_{Opt} is the similarity for optional arguments:
S_{Opt} = 1 − (E_{Opt} / N)
E_{Opt} is the number of errors in the optional arguments, and N is the cardinality of the union of all optional arguments in the reference and predicted events. The errors are counted as follows:
- A missing optional argument counts as 1 error.
- An extra optional argument counts as 1 error.
- A wrong optional argument counts as 2 errors.
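One way to realize these counts is to encode the optional arguments of each event as a set of (role, filler) pairs and take N as the size of the union of the two sets; both the encoding and this reading of N are assumptions, not the official definition:

```python
def s_opt(ref_args, pred_args):
    """S_Opt = 1 - E_Opt / N, with optional arguments encoded as sets
    of (role, filler) pairs. A missing or extra argument is one
    unmatched pair (1 error); a wrong filler for a shared role is one
    unmatched pair on each side (2 errors), matching the counts above."""
    union = ref_args | pred_args
    if not union:
        return 1.0  # no optional arguments on either side
    errors = len(ref_args - pred_args) + len(pred_args - ref_args)
    return 1.0 - errors / len(union)
```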
The submissions are evaluated with Recall, Precision and F1. The service also computes the Slot Error Rate.
Alternate scores are provided for each event type.
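Putting the three factors together, the SeeDev-full pairing function can be sketched as follows, with the binary and optional-argument similarities assumed precomputed (the function name is hypothetical):

```python
def sp_full(s_binary, ref_negated, pred_negated, s_opt_value):
    """Combine the three factors of the SeeDev-full pairing function:
    S_p = S_binary * S_Neg * S_Opt. S_Neg is 1 when the two negation
    flags agree (both negated or neither) and 0.5 otherwise."""
    s_neg = 1.0 if ref_negated == pred_negated else 0.5
    return s_binary * s_neg * s_opt_value
```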