
SeeDev

Event extraction of genetic and molecular mechanisms involved in plant seed development


Downloads
 SeeDev-full  Training  Development  Test
 SeeDev-binary  Training  Development  Test





Information Extraction Goal 

1. Promote complex event extraction on regulations in plants.

2. Assess the performance of event extraction systems in this subject.

Motivation in biology

A comprehensive understanding of the molecular network underlying the regulation of seed development is a major scientific challenge with high potential impact on fundamental research, agriculture and industry. Seed development requires the coordinated growth of different tissues, which involves complex genetic and environmental regulation. Most of this knowledge is scattered across thousands of articles. The SeeDev task focuses on seed storage and reserve accumulation, a critical issue in agriculture, in the model organism Arabidopsis thaliana.

The SeeDev task is based on the knowledge model Gene Regulation Network for Arabidopsis (GRNA), which meets the needs of text-mining (i.e. manual annotation of texts and automatic information extraction), experimental data indexing and retrieval, and reuse in other plant systems. It is also expected to support the integration of knowledge extracted from text with knowledge derived from experimental data, with a view to modeling in systems biology.

Representation and Task setting

The SeeDev corpus annotation follows the BioNLP-ST2013 representation.

Entities

The GRNA model defines 16 different types of entities.

Molecule
    DNA
        Gene: "LEC1" "APETALA2"
        Gene_Family: "LEC AP2-like"
        Box: "5'-GCATCG-3'"
        Promoter: "BCCP2"
    DNA Product
        RNA: "FUS3 transcript"
    Amino acid sequence
        Protein: "WRI1"
        Protein_Family: "SSPs"
        Protein_Complex: "SIN3/HDAC"
        Protein_Domain: "MADS-domain"
    Hormone: "ABA"

Dynamic Process
    Regulatory_Network: "embryonic process"
    Metabolic pathway: "FA biosynthesis"

Context
    Biological context
        Genotype: "fus3 mutant"
        Tissue: "embryo"
        Development_Phase: "meristem formation"
    Environmental_Factor: "in vitro"

Events

The GRNA model defines five sets of event types that may be combined in complex events. 


Where and When

• Presence_In_Genotype
• Occurrence_In_Genotype
• Presence_At_Stage
• Occurrence_During
• Localization

Function

• Involvement_In_Process
• Transcription_Or_Translation
• Functional_Equivalence

Regulation

• Regulation_Of_Accumulation
• Regulation_Of_Development_Phase
• Regulation_Of_Expression
• Regulation_Of_Molecule_Activity
• Regulation_Of_Process
• Regulation_Of_Tissue_Development

Composition and Membership

• Primary_Structure_Composition
• Protein_Complex_Composition
• Protein_Domain_Composition
• Family_Membership
• Sequence_Identity

Interaction

• Interaction
• Binding

Each event type can be associated with the Negation modality. The formal representation with the role names can be found here.

The arguments of each event are strongly typed, which means that not every entity type is allowed as an argument of a given event. The possible combinations of entity types per event, i.e. the event signatures, are specified here.
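As a rough illustration of what strong typing implies for a participant system, the sketch below checks an event against a signature table. The table is an assumption made for this example: it lists only the roles and entity types that appear in the LEC1 example on this page, whereas the official signatures cover every event type. The function name `signature_errors` is likewise hypothetical.

```python
# Illustrative signature table (an assumption, not the official one):
# event type -> role -> set of allowed entity types.
SIGNATURES = {
    "Regulation_Of_Process": {
        "Agent": {"Gene"},
        "Process": {"Regulatory_Network"},
        "Organism_Genotype": {"Genotype"},
    },
}


def signature_errors(event_type, arguments, entity_types):
    """Return a list of typing problems; an empty list means well typed.

    arguments    -- dict role -> entity id, e.g. {"Agent": "T3"}
    entity_types -- dict entity id -> entity type, e.g. {"T3": "Gene"}
    """
    signature = SIGNATURES.get(event_type)
    if signature is None:
        return [f"unknown event type {event_type}"]
    errors = []
    for role, entity_id in arguments.items():
        allowed = signature.get(role)
        if allowed is None:
            errors.append(f"role {role} not allowed for {event_type}")
        elif entity_types[entity_id] not in allowed:
            errors.append(f"{entity_types[entity_id]} not allowed as {role}")
    return errors


entity_types = {"T1": "Genotype", "T3": "Gene",
                "T4": "Regulatory_Network", "T6": "Tissue"}
ok = signature_errors("Regulation_Of_Process",
                      {"Agent": "T3", "Process": "T4"}, entity_types)
bad = signature_errors("Regulation_Of_Process",
                       {"Agent": "T6", "Process": "T4"}, entity_types)
```

A filter of this kind can be applied to candidate events before classification, so that only argument combinations permitted by the signatures are considered.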


Events and entities of the SeeDev task in the AlvisAE editor.

Evaluation and criteria

There are two subtasks, binary relation extraction and full event extraction, which use the same datasets. The labels are the same, except Is_Linked_To, which is specific to the binary framework. An online evaluation service will soon be available for each task.

1.  Binary relation extraction

Participant systems are evaluated for their capacity to extract relations that involve two entity arguments.

Input: document texts, gold entity annotations. List of argument types for each event. 

To be predicted: binary events between all types of entities. The representation is the same as the training data.

Evaluation: The evaluation measures will be Recall, Precision and F1-measure of predicted events against gold events.

Download: training - development.
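The measures named above can be sketched as follows. The matching criterion is an assumption made for this example: a predicted relation counts as correct only if its type and both arguments are identical to a gold relation. The official evaluation service may apply different matching rules.

```python
def precision_recall_f1(predicted, gold):
    """Compute Precision, Recall and F1 over two collections of relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Relations are (type, first argument, second argument) tuples.
gold = {
    ("Regulates_Development_Phase", "T3", "T7"),
    ("Regulates_Process", "T3", "T4"),
    ("Is_Functionally_Equivalent_To", "T3", "T2"),
    ("Occurs_In_Genotype", "T3", "T1"),
}
predicted = {
    ("Regulates_Development_Phase", "T3", "T7"),
    ("Regulates_Process", "T3", "T4"),
    ("Regulates_Process", "T3", "T6"),  # not in gold: a false positive
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of the three predictions are correct and two of the four gold relations are found, giving a precision of 2/3, a recall of 1/2 and an F1 of 4/7.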

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation.

.a1

T1       Genotype 4 15           Arabidopsis

T2       Gene 16 32    LEAFY COTYLEDON1

T3       Gene 34 38    LEC1

T4       Regulatory_Network 65 100            specification of cotyledon identity

T5       Development_Phase 82 100            cotyledon identity

T6       Tissue 82 91  cotyledon

T7       Development_Phase 109 140         completion of embryo maturation

T8       Tissue 123 129         embryo

.a2

E1       Regulates_Development_Phase    Agent:T3    Development:T7

E2       Regulates_Process    Agent:T3    Process:T4

E3       Is_Functionally_Equivalent_To    Element1:T3    Element2:T2

E4       Occurs_In_Genotype    Molecule:T3    Genotype:T1

E5       Regulates_Development_Phase    Agent:T1    Development:T7

E6       Regulates_Process    Agent:T1    Process:T4
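The example files above can be read with a few lines of code. This is a minimal sketch that assumes whitespace-separated fields as displayed on this page; the function names are hypothetical. It also checks that each entity's character offsets select the annotated mention in the text.

```python
TEXT = ("The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the "
        "specification of cotyledon identity and the completion of embryo "
        "maturation.")

A1 = """
T1 Genotype 4 15 Arabidopsis
T2 Gene 16 32 LEAFY COTYLEDON1
T3 Gene 34 38 LEC1
T4 Regulatory_Network 65 100 specification of cotyledon identity
T5 Development_Phase 82 100 cotyledon identity
T6 Tissue 82 91 cotyledon
T7 Development_Phase 109 140 completion of embryo maturation
T8 Tissue 123 129 embryo
"""

A2 = """
E1 Regulates_Development_Phase Agent:T3 Development:T7
E2 Regulates_Process Agent:T3 Process:T4
"""


def parse_entities(a1):
    """Return {id: (type, start, end, mention)} from .a1 content."""
    entities = {}
    for line in a1.strip().splitlines():
        # At most four splits, so the mention keeps its internal spaces.
        ent_id, ent_type, start, end, mention = line.strip().split(None, 4)
        entities[ent_id] = (ent_type, int(start), int(end), mention)
    return entities


def parse_relations(a2):
    """Return {id: (type, {role: entity id})} from .a2 content."""
    relations = {}
    for line in a2.strip().splitlines():
        rel_id, rel_type, *args = line.split()
        relations[rel_id] = (rel_type, dict(a.split(":", 1) for a in args))
    return relations


entities = parse_entities(A1)
relations = parse_relations(A2)
# End offsets are exclusive: TEXT[start:end] is exactly the mention.
offsets_ok = all(TEXT[s:e] == m for _, s, e, m in entities.values())
```

Note that entities may overlap or nest: T5 and T6 both fall inside the span of T4.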


2.  Full event extraction

Participant systems are evaluated for their capacity to extract all types of events. The number of arguments varies between two and eight; it is three in most cases. There are no trigger words in the SeeDev event representation.

Input: document texts with the gold entities. List of argument types for each event type.

To be predicted: events of all types and negation modalities. The events relate either entities of all types or other events. The representation is the same as the training data.

Evaluation

Two kinds of evaluation measures are used: text-bound and biological.

(1) The text-bound evaluation will evaluate the predictions by Recall and Precision of predicted events against gold events. 

(2) The "biological" evaluation measures how much of the information content of the corpus can be extracted automatically. Duplicate information will be counted only once. The normalization of the text entities with respect to standard nomenclatures will be provided.
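The idea of counting duplicate information only once can be illustrated as follows. This is only a sketch: the normalization table is hypothetical (the actual normalization of text entities is provided by the organizers), and the way the official biological measure scores the resulting facts is not reproduced here.

```python
def unique_facts(events, normalization):
    """Map each argument to its normalized name and drop duplicate events."""
    return {
        (event_type, tuple(normalization.get(arg, arg) for arg in args))
        for event_type, args in events
    }


# Hypothetical normalization: both mentions of the gene share one name.
normalization = {"LEAFY COTYLEDON1": "LEC1", "LEC1": "LEC1"}

events = [
    ("Regulation_Of_Process",
     ("LEC1", "specification of cotyledon identity")),
    ("Regulation_Of_Process",
     ("LEAFY COTYLEDON1", "specification of cotyledon identity")),
]
facts = unique_facts(events, normalization)
```

The two text-bound events differ only in the surface form of the gene name, so after normalization they collapse into a single fact.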

Download: training - development.

Event names - Event signatures

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation. 

.a1

T1       Genotype 4 15           Arabidopsis

T2       Gene 16 32    LEAFY COTYLEDON1

T3       Gene 34 38    LEC1

T4       Regulatory_Network 65 100            specification of cotyledon identity

T5       Development_Phase 82 100            cotyledon identity

T6       Tissue 82 91  cotyledon

T7       Development_Phase 109 140         completion of embryo maturation

T8       Tissue 123 129         embryo

.a2

E1       Regulation_Of_Development_Phase    Agent:T3 Development:T7    Organism_Genotype:T1

E2       Regulation_Of_Process    Agent:T3 Process:T4    Organism_Genotype:T1

E3       Functional_Equivalence_To    Element1:T3    Element2:T2

Note that the n-ary events E1 and E2 in the full event example are rewritten in the binary representation above into five binary relations. The general rewriting principle is: (1) the first two main arguments of the event are kept in a binary relation with the same name as the event; (2) additional binary relations are generated to link the secondary arguments to the main arguments (E4, E5 and E6 in the binary example).

For instance, in event E1, the genotype T1 is linked to the gene T3 by Exists_In_Genotype and to the development phase T7 by Regulates_Development_Phase.
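The rewriting principle can be sketched for the case shown in the examples, where the only secondary argument is a genotype. Several points are assumptions read off the two examples rather than an official specification: the event-to-relation name table, the role names of the genotype link, and the link label itself, which defaults to the one used in relation E4 of the binary example and can be overridden. The function name `to_binary` is hypothetical.

```python
# Event name -> binary relation name, as observed in the two examples.
BINARY_NAME = {
    "Regulation_Of_Development_Phase": "Regulates_Development_Phase",
    "Regulation_Of_Process": "Regulates_Process",
}


def to_binary(event_type, main_args, genotype=None,
              genotype_link="Occurs_In_Genotype"):
    """Rewrite one event into binary relations.

    main_args -- the two main arguments, [(role, id), (role, id)]
    genotype  -- entity id of an optional Organism_Genotype argument
    """
    name = BINARY_NAME[event_type]
    (role1, arg1), (role2, arg2) = main_args
    # (1) Keep the two main arguments under the event's own relation name.
    relations = [(name, (role1, arg1), (role2, arg2))]
    if genotype is not None:
        # (2) Link the secondary argument to each main argument.
        relations.append(
            (genotype_link, ("Molecule", arg1), ("Genotype", genotype)))
        relations.append((name, (role1, genotype), (role2, arg2)))
    return relations


events = [
    ("Regulation_Of_Development_Phase",
     [("Agent", "T3"), ("Development", "T7")], "T1"),
    ("Regulation_Of_Process",
     [("Agent", "T3"), ("Process", "T4")], "T1"),
]
binary = []
for event_type, main_args, genotype in events:
    for relation in to_binary(event_type, main_args, genotype):
        if relation not in binary:  # the genotype link is shared by E1 and E2
            binary.append(relation)
```

Each event yields three relations, but the link between T3 and T1 is produced twice, which leaves the five distinct binary relations of the example.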

Corpus 

Documents

The SeeDev corpus is a set of 86 paragraphs from 20 full-text articles on seed development of Arabidopsis thaliana, manually selected by domain experts.

Annotation

The whole corpus was annotated by four biologists, three of whom are experts in seed development. The entities were automatically pre-annotated by the Alvis Suite, and an annotator then revised all entity annotations. The events were annotated in a double-blind manner. The annotators, together with a third biologist, built a consensus gold annotation.
The guidelines document details the annotation principles for each entity and event type and provides examples and counter-examples.

History

The SeeDev task shares its goals and representation with previous tasks on molecular information extraction (e.g. LLL, Genia, GRN, GRO, CG).