Supporting Resources - backup
The supporting resources for the BioNLP Shared Task 2016 provide task participants with annotations from state-of-the-art automated tools. The aim is to reduce the time investment needed to participate in the shared task and to let participants experiment with ways of leveraging the automated analyses produced by existing Natural Language Processing systems. [responsibility...]
Resources and Formats
Available resources are listed below. Each resource is a downloadable file containing the output of one tool, described alongside it, run on the train and dev datasets.
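Since every resource is distributed as a zip archive, a first step is usually to look inside it and unpack it. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and nothing is assumed here about the internal layout of the archives, which is not described on this page.

```python
import zipfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the sorted names of the files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Directory entries end with a slash; keep regular files only.
        return sorted(name for name in archive.namelist() if not name.endswith("/"))


def extract_archive(archive_path, target_dir):
    """Unpack the archive into target_dir and return that directory as a Path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target
```

For example, `list_archive_members("genia-tagger_train+dev_resources.zip")` would show which files the GENIA Tagger archive contains before anything is written to disk.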
POS Tagging
GENIA Tagger is a tool for part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text. If you make use of the tagging from GENIA tagger, please cite: Tsuruoka, Y., Tateishi, Y., Kim, J. D., Ohta, T., McNaught, J., Ananiadou, S., & Tsujii, J. I. (2005). Developing a robust part-of-speech tagger for biomedical text. Advances in informatics, 382-392.
genia-tagger_train+dev_resources.zip: the resources produced by GENIA tagger on the train and dev datasets
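As an illustration of how such tagger output might be consumed, the sketch below reads the column format that GENIA Tagger prints by default: one token per line with five tab-separated fields (word, base form, POS tag, chunk tag, named entity tag) and a blank line between sentences. Whether the files in the archive follow exactly this layout is an assumption to check against the downloaded data; the function name and field names are hypothetical.

```python
def read_genia_output(text):
    """Group tab-separated token lines into sentences of token dictionaries."""
    # Assumed column order of GENIA Tagger's default output.
    fields = ("word", "base", "pos", "chunk", "ne")
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences
```

Each sentence is returned as a list of dictionaries, so the POS tag of the first token of the first sentence is `sentences[0][0]["pos"]`.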
Parsing
Stanford Parser is a widely used statistical parser. If you make use of the parses from the Stanford Parser, please cite: Klein, D. and Manning, C. (2002). Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems.
stanford-parser_train+dev_resources.zip : the resources produced by Stanford Parser on the train and dev datasets.
The Enju parser is a robust syntactic parser for English, based on a probabilistic HPSG grammar. If you make use of the Enju parses, please cite: Miyao, Y. and Tsujii, J. (2008). Feature forest models for probabilistic HPSG parsing. Computational Linguistics.
enju-parser_train+dev_resources.zip : the resources produced by Enju Parser on the train and dev datasets
The C&C CCG Parser is a dependency parser. If you make use of the CCG parses, please cite: Clark, S., & Curran, J. R. (2007). Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4), 493-552.
ccg-parser_train+dev_resources.zip : the resources produced by CCG parser on the train and dev datasets.
Term Extraction
BioYaTeA is an extended version of the YaTeA (Aubin and Hamon, 2006) term extractor adapted to the biomedical domain. If you make use of the BioYaTeA resources, please cite: Golik, W., Bossy, R., Ratkovic, Z., & Nédellec, C. (2013). Improving term extraction with linguistic analysis in the biomedical domain. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing13), Special Issue of the journal Research in Computing Science (pp. 24-30).
bioyatea_train+dev_resources.zip : the resources produced by BioYaTeA on the train and dev datasets
Named Entity Recognition
Stanford NER is a named entity recognition tool for person, organization and location entities. If you make use of the Stanford NER annotations, please cite: Finkel, J. R., Grenager, T., & Manning, C. (2005, June). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 363-370). Association for Computational Linguistics.
stanfordner_train+dev_resources.zip : the resources produced by Stanford NER on the train and dev datasets
LINNAEUS is a software tool for species name recognition and normalization. If you make use of the LINNAEUS annotations, please cite: Gerner, M., Nenadic, G., & Bergman, C. M. (2010). LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics, 11(1), 85.
linnaeus_train+dev_resources.zip : the resources produced by LINNAEUS on the train and dev datasets
OrganismTagger is a hybrid rule-based/machine-learning system that extracts organism mentions from the biomedical literature, normalizes them to their scientific name, and provides grounding to the NCBI Taxonomy database. If you make use of the OrganismTagger annotations, please cite: Naderi, N., Kappler, T., Baker, C. J., & Witte, R. (2011). OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics, 27(19), 2721-2729.
organismtagger_train+dev_resources.zip : the resources produced by OrganismTagger on the train and dev datasets.
SR4GN is a software tool that provides species recognition for gene normalization. If you make use of the SR4GN annotations, please cite: Wei, C. H., Kao, H. Y., & Lu, Z. (2012). SR4GN: a species recognition software tool for gene normalization. PloS one, 7(6), e38460.
sr4gn_train+dev_resources.zip : the resources produced by SR4GN on the train and dev datasets
SPECIES identifies taxonomic mentions in documents and maps them to corresponding NCBI Taxonomy entries. If you make use of the SPECIES annotations, please cite: Pafilis, E., Frankild, S.P., Fanini, L., Faulwetter, S., Pavloudi, C., Vasileiadou, A., Arvanitidis, C. and Jensen, L.J. (2013). The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS One, 8(6), p.e65390.
SPECIES_train+dev_resources.zip : the resources produced by SPECIES on the train and dev datasets
Sentence Splitting & Tokenization
Sentence splitting and tokenization were performed using in-house tools developed at INRA as part of the AlvisNLP suite.
segmentation_train+dev_resources.zip : the resources produced by segmentation on the train and dev datasets
Data Visualization
BRAT is a tool for visualization of annotations. Manual annotations from the shared task corpora are provided in the BRAT format to enable participants to visualize the annotated corpora.
brat_train+dev_resources.zip : train and dev datasets in the BRAT format.
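For participants who want to read the BRAT files programmatically rather than only view them, the sketch below parses the text-bound ("T") lines of BRAT standoff `.ann` content, where each line holds an identifier, a type followed by character offsets, and the covered text, separated by tabs. It is a simplified reader: other annotation kinds (relations, events, notes) are skipped, and the function and class names are hypothetical.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str   # e.g. "T1"
    label: str    # annotation type
    spans: list   # list of (start, end) character offsets
    text: str     # text covered by the annotation


def parse_text_bound(ann_content):
    """Parse the text-bound ("T") lines of BRAT standoff content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes, etc. are ignored in this sketch
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations list several fragments separated by ";".
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, text))
    return entities
```

Keeping the spans as a list lets the same structure hold both continuous and discontinuous annotations, so offsets can be checked against the accompanying `.txt` file in either case.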