SeeDev

Event extraction of genetic and molecular mechanisms involved in plant seed development

Information Extraction Goal

1. Promote complex event extraction on regulations in plants.

2. Assess the performance of event extraction systems in this domain.

Motivation in biology

A comprehensive understanding of the molecular network underlying the regulation of seed development is a major scientific challenge with high potential impact on fundamental research, agriculture and industry. Seed development requires the coordinated growth of different tissues, which involves complex genetic and environmental regulation. Most of this knowledge is scattered across thousands of articles. The SeeDev task focuses on seed storage and reserve accumulation, a critical issue in agriculture, in the model organism Arabidopsis thaliana.

The SeeDev task is based on the Gene Regulation Network for Arabidopsis (GRNA) knowledge model, which meets the needs of text mining (i.e. manual annotation of texts and automatic information extraction), of experimental data indexing and retrieval, and of reuse in other plant systems. It is also expected to support the integration of knowledge from text with knowledge derived from experimental data, with a view to modeling in systems biology.

Representation and Task setting

The SeeDev corpus annotation follows the BioNLP-ST2013 representation.

Entities

The GRNA model defines 16 different types of entities.

Events

The GRNA model defines five sets of event types that may be combined in complex events.

Where and When

• Presence_In_Genotype

• Occurrence_In_Genotype

• Presence_At_Stage

• Occurrence_During

• Localization

Function

• Involvement_In_Process

• Transcription_Or_Translation

• Functional_Equivalence

Regulation

• Regulation_Of_Accumulation

• Regulation_Of_Development_Phase

• Regulation_Of_Expression

• Regulation_Of_Molecule_Activity

• Regulation_Of_Process

• Regulation_Of_Tissue_Development

Composition and Membership

• Primary_Structure_Composition

• Protein_Complex_Composition

• Protein_Domain_Composition

• Family_Membership

• Sequence_Identity

Interaction

• Interaction

• Binding

Each event type can be associated with the Negation modality. The formal representation with the role names can be found here.

The arguments of the events are strongly typed, which means that not every entity type is allowed as an argument of a given event. The possible combinations of entity types per event, i.e. the event signatures, are specified here.
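As a rough illustration of what strong typing implies for a system's output, the sketch below checks an event against a signature table. The table is an assumption made for this example: it lists only the argument types that appear in the worked example further down this page, not the official signature file, and the function name is hypothetical.

```python
# Illustrative signature table (assumption): only the roles and entity types
# seen in the worked example on this page, not the official signatures.
SIGNATURES = {
    "Regulation_Of_Process": {
        "Agent": {"Gene"},
        "Process": {"Regulatory_Network"},
        "Organism_Genotype": {"Genotype"},
    },
}


def is_well_typed(event_type, args, entity_types, signatures=SIGNATURES):
    """Return True if every role of the event is filled by an allowed type.

    args maps role name -> entity id; entity_types maps entity id -> type.
    """
    signature = signatures.get(event_type)
    if signature is None:
        return False  # event type not covered by the table
    for role, entity_id in args.items():
        allowed = signature.get(role)
        if allowed is None or entity_types[entity_id] not in allowed:
            return False
    return True


entity_types = {"T1": "Genotype", "T3": "Gene",
                "T4": "Regulatory_Network", "T6": "Tissue"}
good = {"Agent": "T3", "Process": "T4", "Organism_Genotype": "T1"}
bad = {"Agent": "T3", "Process": "T6"}  # a Tissue is not listed for Process
```

A filter of this kind can discard ill-typed candidate events before scoring; with the real signature file the table would simply be larger.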

Events and entities of the SeeDev task in the AlvisAE editor.

Evaluation and criteria

There are two subtasks, binary relation extraction and full event extraction, which use the same datasets. The labels are the same, except for Is_Linked_To, which is specific to the binary framework. An online evaluation service will soon be available for each subtask.

1. Binary relation extraction

Participant systems are evaluated for their capacity to extract relations that involve two entity arguments.

Input: document texts, gold entity annotations. List of argument types for each event.

To be predicted: binary events between all types of entities. The representation is the same as the training data.

Evaluation: The evaluation measures will be Recall, Precision and F1-measure of predicted events against gold events.

Download: training - development.

Relation names - Relation signatures
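The three measures named above can be sketched as follows. This is a minimal illustration that assumes a predicted relation counts as correct only when its type and both arguments match a gold relation exactly; the matching criteria of the official evaluation service are not described on this page and may differ.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted relations against gold relations by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A relation is represented here as (type, (role, entity), (role, entity)).
gold = {
    ("Regulates_Process", ("Agent", "T3"), ("Process", "T4")),
    ("Regulates_Development_Phase", ("Agent", "T3"), ("Development", "T7")),
}
predicted = {
    ("Regulates_Process", ("Agent", "T3"), ("Process", "T4")),  # correct
    ("Regulates_Process", ("Agent", "T2"), ("Process", "T4")),  # not in gold
}
p, r, f = precision_recall_f1(predicted, gold)
```

In this toy case one of two predictions is correct and one of two gold relations is found, so precision, recall and F1 are all 0.5.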

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation.

.a1

T1 Genotype 4 15 Arabidopsis

T2 Gene 16 32 LEAFY COTYLEDON1

T3 Gene 34 38 LEC1

T4 Regulatory_Network 65 100 specification of cotyledon identity

T5 Development_Phase 82 100 cotyledon identity

T6 Tissue 82 91 cotyledon

T7 Development_Phase 109 140 completion of embryo maturation

T8 Tissue 123 129 embryo

.a2

E1 Regulates_Development_Phase Agent:T3 Development:T7

E2 Regulates_Process Agent:T3 Process:T4

E3 Is_Functionally_Equivalent_To Element1:T3 Element2:T2

E4 Exists_In_Genotype Molecule:T3 Genotype:T1

E5 Regulates_Development_Phase Agent:T1 Development:T7

E6 Regulates_Process Agent:T1 Process:T4
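The example files above can be read with a few lines of code. The sketch below is an assumption-laden illustration, not an official reader: it splits fields on any whitespace (so it accepts both the layout shown here and tab-separated files), it does not handle discontinuous entity spans, and the function names are hypothetical. It also checks that each entity's offsets select its surface form in the text.

```python
TEXT = ("The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the "
        "specification of cotyledon identity and the completion of embryo maturation.")

A1 = """T1 Genotype 4 15 Arabidopsis
T2 Gene 16 32 LEAFY COTYLEDON1
T3 Gene 34 38 LEC1
T4 Regulatory_Network 65 100 specification of cotyledon identity
T7 Development_Phase 109 140 completion of embryo maturation"""

A2 = """E1 Regulates_Development_Phase Agent:T3 Development:T7
E2 Regulates_Process Agent:T3 Process:T4
E3 Is_Functionally_Equivalent_To Element1:T3 Element2:T2"""


def parse_entities(a1_text):
    """Parse lines of the form 'T1 Genotype 4 15 Arabidopsis'."""
    entities = {}
    for line in a1_text.splitlines():
        line = line.strip()
        if not line:
            continue
        ent_id, ent_type, start, end, surface = line.split(None, 4)
        entities[ent_id] = (ent_type, int(start), int(end), surface)
    return entities


def parse_relations(a2_text):
    """Parse lines of the form 'E1 Type Role:Tn Role:Tm'."""
    relations = {}
    for line in a2_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        rel_id, rel_type, *arg_fields = fields
        args = [tuple(field.split(":", 1)) for field in arg_fields]
        relations[rel_id] = (rel_type, args)
    return relations


entities = parse_entities(A1)
relations = parse_relations(A2)
# Offsets are 0-based and end-exclusive, so text[start:end] is the mention.
offsets_ok = all(TEXT[start:end] == surface
                 for _, start, end, surface in entities.values())
```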

2. Full event extraction

Participant systems are evaluated for their capacity to extract all types of events. The number of arguments varies between two and eight; it is three in most cases. There is no trigger word in the SeeDev event representation.

Input: document texts with the gold entities. List of argument types for each event type.

To be predicted: events of all types and negation modalities. The events relate either entities of all types or other events. The representation is the same as the training data.

Evaluation

There are two kinds of evaluation measures, text-bound and biological.

(1) The text-bound evaluation will evaluate the predictions by Recall and Precision of predicted events against gold events.

(2) The "biological" evaluation measures how much of the information content of the corpus can be extracted automatically. Duplicate information will be counted only once. The normalization of the text entities with respect to standard nomenclatures will be provided.
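The idea of counting duplicate information only once can be illustrated with a small sketch. Everything specific in it is an assumption: the normalization table is hypothetical (it simply maps the full gene name to its symbol), events are reduced to their type and normalized arguments, and the actual scoring of the biological evaluation is not specified on this page.

```python
def distinct_facts(events, normalize):
    """Replace each mention by its normalized name and drop duplicate events."""
    facts = set()
    for event_type, args in events:
        normalized_args = tuple(
            (role, normalize.get(mention, mention)) for role, mention in args
        )
        facts.add((event_type, normalized_args))
    return facts


# Hypothetical normalization table: both mentions denote the same gene.
normalize = {"LEAFY COTYLEDON1": "LEC1"}

events = [
    ("Regulation_Of_Process",
     [("Agent", "LEC1"), ("Process", "specification of cotyledon identity")]),
    ("Regulation_Of_Process",
     [("Agent", "LEAFY COTYLEDON1"), ("Process", "specification of cotyledon identity")]),
]
facts = distinct_facts(events, normalize)
```

After normalization the two events above describe the same fact, so only one is kept.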

Download: training - development.

Event names - Event signatures

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation.

.a1

T1 Genotype 4 15 Arabidopsis

T2 Gene 16 32 LEAFY COTYLEDON1

T3 Gene 34 38 LEC1

T4 Regulatory_Network 65 100 specification of cotyledon identity

T5 Development_Phase 82 100 cotyledon identity

T6 Tissue 82 91 cotyledon

T7 Development_Phase 109 140 completion of embryo maturation

T8 Tissue 123 129 embryo

.a2

E1 Regulation_Of_Development_Phase Agent:T3 Development:T7 Organism_Genotype:T1

E2 Regulation_Of_Process Agent:T3 Process:T4 Organism_Genotype:T1

E3 Functional_Equivalence Element1:T3 Element2:T2

Note that the n-ary events E1 and E2 in the full event example correspond to five binary relations in the binary example above (E1, E2, E4, E5 and E6). The general rewriting principle is: (1) the first two arguments of the event, its main arguments, are kept in a binary relation that carries the binary counterpart of the event name (e.g. Regulates_Process for Regulation_Of_Process); (2) additional binary relations are generated to link the secondary arguments to the main arguments (E4, E5 and E6 in the binary example).

For instance, in event E1, the genotype T1 is linked to the gene T3 by Exists_In_Genotype and to the development phase T7 by Regulates_Development_Phase.
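The rewriting described above can be sketched for the one secondary role that appears in the example, Organism_Genotype. This is an illustration under stated assumptions, not the organizers' conversion procedure: the name table is read off the two examples on this page, the gene-genotype link uses the Exists_In_Genotype name given in the sentence above with the Molecule and Genotype roles, and other secondary roles are deliberately left unhandled.

```python
# Event name -> binary relation name, read off the examples (assumption).
BINARY_NAME = {
    "Regulation_Of_Development_Phase": "Regulates_Development_Phase",
    "Regulation_Of_Process": "Regulates_Process",
}


def to_binary(event_type, args, genotype_link="Exists_In_Genotype"):
    """Rewrite one n-ary event into a list of binary relations.

    args is an ordered list of (role, entity_id); the first two entries are
    the main arguments. Only Organism_Genotype is handled as secondary role.
    """
    name = BINARY_NAME[event_type]
    (role1, ent1), (role2, ent2) = args[0], args[1]
    relations = [(name, (role1, ent1), (role2, ent2))]
    for role, entity in args[2:]:
        if role != "Organism_Genotype":
            raise ValueError(f"no rewriting rule sketched for role {role}")
        # Link the genotype to the first main argument...
        relations.append((genotype_link, ("Molecule", ent1), ("Genotype", entity)))
        # ...and to the second one, with the genotype in the first role.
        relations.append((name, (role1, entity), (role2, ent2)))
    return relations


events = [
    ("Regulation_Of_Development_Phase",
     [("Agent", "T3"), ("Development", "T7"), ("Organism_Genotype", "T1")]),
    ("Regulation_Of_Process",
     [("Agent", "T3"), ("Process", "T4"), ("Organism_Genotype", "T1")]),
]
binary = []
for event_type, args in events:
    for relation in to_binary(event_type, args):
        if relation not in binary:  # the gene-genotype link is shared by both events
            binary.append(relation)
```

Each event yields three relations, and the gene-genotype link is produced twice, which is why the two events together give five distinct binary relations.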

Corpus

Documents

The SeeDev corpus is a set of 86 paragraphs from 20 full articles on seed development of Arabidopsis thaliana, manually selected by domain experts.

Annotation

The whole corpus was annotated by four biologists, three of whom are experts in seed development. The entities were automatically pre-annotated by the Alvis Suite, and an annotator then revised all entity annotations. The annotation of events was performed in a double-blind manner. The annotators, together with a third biologist, built a consensus gold annotation.

The guidelines document details the annotation principles for each entity and event type and provides examples and counterexamples.

History

The SeeDev task has goals and a representation similar to those of previous tasks on molecular information extraction (e.g. LLL, Genia, GRN, GRO, CG).