BB3

This dataset is now obsolete.

Check out the new iteration of the Bacteria Biotope task in BioNLP Open Shared Tasks 2019.

Bacteria Biotope - Event extraction of microorganisms and habitats with ontologies and their linking

Information Extraction Goal

1. Promote information extraction on microorganism biodiversity.

2. Assess the performance of automatic categorization and relation extraction systems.

Motivation in biology

Bacteria biotope is critical information for studying the interaction mechanisms of bacteria with their environment from genetic, phylogenetic and ecological perspectives. Information on the habitats where bacteria live is particularly critical in applied microbiology, such as food processing and safety, health sciences and waste processing. Fundamental research also requires this knowledge, for example metagenomics studies and phylogeography/phyloecology.

No database supplies the habitats of bacteria in a comprehensive and normalized way. Instead, this information is spread across free text in scientific papers and numerous databases, such as sequence databases (e.g. SRA, GenBank, GOLD), biological sample banks and collections (e.g. ATCC, DSMZ), and biodiversity surveys (e.g. GBIF).

IE goal

To fulfill these needs, as in the previous Shared Tasks in 2011 and 2013, the IE systems should be able to: 1. detect mentions of habitats and species; 2. categorize them with large ontologies; 3. extract events between bacteria and their habitats. Once the habitats are identified, they must be normalized so that they can be compared. The OntoBiotope ontology is used for microorganism biotope description; it has been successfully used in previous BB tasks. The well-recognized NCBI taxonomy is used for organism classification and normalization.

The concept "seedling" of the "growing plant part" branch of OntoBiotope is used to tag "axenic rice seedling" in the text.

Figure: events, entities and categories of the BB task in the AlvisAE editor.

Representation and Task setting

The BB3 corpus annotation follows the BioNLP-ST 2013 representation, with a few modifications to cope with discontinuous entities and with categorization using concepts from an ontology. An updated version of the habitat part of OntoBiotope is provided to participants in OBO format. It contains concepts organized in a hierarchy of is-a relations. OntoBiotope concepts may have more than one parent, and concept labels have synonyms. The NCBI taxonomy can be downloaded here.
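
To illustrate the ontology's structure, here is a minimal sketch of reading concepts from an OBO file. The field names (id, name, is_a, synonym) follow the standard OBO flat-file format; the stanza below is invented for illustration, not taken from the actual OntoBiotope release.

```python
# Illustrative OBO stanza (hypothetical IDs, not from the real ontology).
OBO_SAMPLE = """\
[Term]
id: OBT:001234
name: seedling
synonym: "plant seedling" EXACT []
is_a: OBT:001200 ! growing plant part
"""

def parse_obo(text):
    """Return {concept_id: {"name": ..., "parents": [...], "synonyms": [...]}}."""
    concepts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"name": None, "parents": [], "synonyms": []}
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            if key == "id":
                concepts[value] = current
            elif key == "name":
                current["name"] = value
            elif key == "is_a":
                # drop the trailing "! label" comment
                current["parents"].append(value.split(" !")[0].strip())
            elif key == "synonym":
                # synonym lines look like: "text" SCOPE []
                current["synonyms"].append(value.split('"')[1])
    return concepts

concepts = parse_obo(OBO_SAMPLE)
print(concepts["OBT:001234"]["parents"])  # ['OBT:001200']
```

Because concepts may have several is_a lines, parents are kept as a list, which is what allows the multiple-parent hierarchy mentioned above.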

Entities

The BB task includes three types of entities: bacteria, habitats and geographical places.

Bacteria

Bacteria entities are annotated as contiguous spans of text that contain a full, unambiguous prokaryote taxon name; the type label is Bacteria. A Bacteria entity denotes a taxon at any taxonomic level, from phylum (Eubacteria) to strain. The category assigned to each text entity is the most specific, unique category of the NCBI taxonomy resource. When a given strain or group of strains is not referenced by NCBI, it is assigned the closest taxid in the taxonomy.

Habitat

Habitat entities are annotated as spans of text that contain a complete mention of a potential habitat for bacteria; the type label is Habitat. Habitat entities are assigned one or several concepts from the habitat subpart of the OntoBiotope ontology. The assigned concepts are as specific as possible. OntoBiotope defines the most relevant microorganism habitats from all areas considered by microbial ecology (hosts, natural environments, anthropized environments, food, medical settings, etc.). Habitat entities are rarely referential entities; they are usually noun phrases including properties and modifiers. In rare cases, habitats are referred to with adjectives or verbs. The spans are generally contiguous, but some are discontinuous in order to cope with conjunctions.

Geographical

Geographical entities are geographical locations and organization places denoted by their official names.

Events

The BB task considers a single type of event.

Lives_In

The Lives_In event has two mandatory arguments: the bacterium and the location where it lives (either a Habitat or a Geographical entity). Whenever the habitat is a host part and the host is mentioned in the text, both habitats participate in two distinct events. The Lives_In event has no trigger words.
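
A trigger-less event of this kind can be read from BioNLP-ST-style standoff lines as sketched below. The relation line shape (entity IDs, Bacterium and Location roles) follows the BioNLP-ST 2013 conventions as I understand them; the concrete IDs, offsets and sentence are invented.

```python
# Hypothetical standoff annotation for the sentence
# "Bifidobacterium longum colonizes the human intestine".
A2_LINES = """\
T1\tBacteria 0 22\tBifidobacterium longum
T2\tHabitat 37 52\thuman intestine
R1\tLives_In Bacterium:T1 Location:T2
"""

def lives_in_pairs(standoff):
    """Return (bacterium text, location text) pairs from standoff lines."""
    entities, pairs = {}, []
    for line in standoff.strip().splitlines():
        fields = line.split("\t")
        if fields[0].startswith("T"):
            # T-line: id, "Type start end", surface text
            entities[fields[0]] = fields[2]
        elif fields[0].startswith("R"):
            # R-line: id, "Lives_In Bacterium:Tx Location:Ty"
            _, bact, loc = fields[1].split()
            pairs.append((entities[bact.split(":")[1]],
                          entities[loc.split(":")[1]]))
    return pairs

print(lives_in_pairs(A2_LINES))
# [('Bifidobacterium longum', 'human intestine')]
```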

Evaluation and criteria

We propose three sub-tasks with two modalities each. Each sub-task has a plain modality where named entities are given as input, so participants are not required to perform named-entity recognition. In the second modality, named entities are not provided; participant methods must perform named-entity recognition, and submissions will be partly evaluated on their boundary accuracy.

Examples for each task are available here.

1. Bacteria and habitat detection and categorization (BB-cat and BB-cat+ner)

Participant systems are evaluated for their capacity to categorize two kinds of entities, Bacteria and Habitat, with NCBI Taxonomy taxa and OntoBiotope habitat concepts respectively.

Input: document texts, title and paragraph segmentation, the NCBI Taxonomy and the OntoBiotope habitat ontology. Entities (BB-cat).

To be predicted: Bacteria and Habitat entity categories. Entity boundaries (BB-cat+ner).

Evaluation

The evaluation will focus on the accuracy of the predicted categories compared to the gold reference. The measures will be similar to those of BB'13. A concept distance measure has been designed to penalize over-generalization and over-specialization fairly. Note that if an entity has several categories, they form a conjunction: all categories must be predicted.
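
The idea behind such a distance measure can be illustrated on a toy hierarchy. This is not the official BB measure, just a sketch of the principle: the further the predicted concept is from the reference along is-a links, the lower the score, with the discount factor chosen arbitrarily here.

```python
# Toy is-a hierarchy (child -> parent); invented for illustration.
PARENT = {
    "seedling": "growing plant part",
    "growing plant part": "plant",
    "plant": "living organism",
}

def ancestors(c):
    """Chain from a concept up to the root, including the concept itself."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def similarity(pred, gold, w=0.65):
    """w**d where d is the is-a distance through the lowest common ancestor."""
    pa, ga = ancestors(pred), ancestors(gold)
    common = [c for c in pa if c in ga]
    if not common:
        return 0.0
    lca = common[0]
    d = pa.index(lca) + ga.index(lca)
    return w ** d

print(similarity("seedling", "seedling"))  # 1.0: exact prediction
print(similarity("plant", "seedling"))     # 0.65**2: over-generalization penalized
```

An exact prediction scores 1.0, while predicting the over-general "plant" for the reference "seedling" is discounted by the two is-a steps separating them.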

For cat+ner, boundary accuracy will be factored into the evaluation, since the inclusion or exclusion of modifiers can change the meaning and the categorization of phrases.

2. Entity and event extraction (BB-event and BB-event+ner)

Participant systems are evaluated for their capacity to extract Lives_In events among Bacteria, Habitat and Geographical entities.

Input: document texts, title and paragraph segmentation. Bacteria, Habitat and Geographical entities (BB-event)

To be predicted: Lives_In events. Bacteria, Habitat and Geographical entities (BB-event+ner).

Evaluation

The evaluation measures will be Recall and Precision of predicted events against gold events. Gold coreferences will be used to credit equivalent relations.
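
As a minimal sketch, event-level Recall and Precision can be computed by treating each event as a (bacterium, location) pair; the real evaluation additionally expands gold coreference sets into equivalent pairs, which is omitted here.

```python
def precision_recall(predicted, gold):
    """Set-based Precision and Recall over (bacterium, location) event pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # events found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical predictions and reference (names are illustrative only).
pred = {("E. coli", "gut"), ("E. coli", "soil")}
ref  = {("E. coli", "gut"), ("B. subtilis", "soil")}
print(precision_recall(pred, ref))  # (0.5, 0.5)
```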

For event+ner, boundary accuracy will be factored into the evaluation.

3. Knowledge Base extraction (BB-kb and BB-kb+ner)

Participant systems are evaluated for their capacity to build a knowledge base from the corpus. The knowledge base is the set of Lives_In events together with the categories of their Bacteria and Habitat arguments. This sub-task combines the cat and event sub-tasks, since participants must predict both entity categorizations and Lives_In events.

The goal of the task is to measure how much of the information content of the corpus can be extracted automatically.

Input: document texts, title and paragraph segmentation. Bacteria and Habitat entities (BB-kb)

To be predicted: Bacteria and Habitat entity categories, Lives_In events. Bacteria and Habitat entities (BB-kb+ner).

Evaluation

The evaluation measures will be Recall and Precision of predicted events against gold events.

For kb+ner, boundary accuracy will be factored into the evaluation.

Corpus

Documents

The corpus is composed of scientific paper abstracts. In previous BB tasks, the corpus was composed of general-purpose texts (mostly web pages of genomics projects) that were easily understandable by task participants. Scientific paper abstracts are more useful sources of detailed scientific information for biologists. The corpus is a subset of a corpus of 1.16 million PubMed references that mention bacteria and habitats; 215 documents were automatically selected in order to reflect the diversity of the habitats.

Annotation

The manual annotation was performed by 7 annotators of the Bibliome group, from the biology (2) and computer science (4) domains, in a double-blind way after an automatic pre-annotation by the Alvis Suite. The annotators used the AlvisAE editor for annotation and adjudication.

History of the task

The three sub-tasks are similar to the previous BB tasks (BB'11, BB'13), except for the knowledge base task and taxon normalization. The official results of BB'13 were: 0.66 SER for habitat detection and categorization, and 0.14 F1-measure on event and entity extraction. More recent results may be found in [BMC Bioinformatics, vol. 16, supp. 10 and 16].

Results on the new taxon categorization task have been reported in previous work, notably on the SPECIES and ORGANISMS software [1], which achieved over 85% F-measure.

Contact

Robert Bossy

Bibliome-MaIAGE

Inra, France

e-mail: robert (dot) bossy (at) jouy (dot) inra (dot) fr

[1] Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390.