This project forked from dterg/biomedical_corpora

A table of biomedical corpora available for named entity recognition (some are also suitable for association detection). It was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152. If you would like to add other corpora (including your own), please submit a pull request and I'll happily approve it.

Corpus Year Format Documents Original Publication Downloaded From Other URLs
Ab3P (Abbreviation Plus Pseudo-Precision) 2008 BioC 1250 PubMed Abstracts https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ http://bioc.sourceforge.net/
AIMed 2005 BioC ~ 1000 MEDLINE abstracts (200 abstracts) http://www.sciencedirect.com/science/article/pii/S0933365704001319 http://corpora.informatik.hu-berlin.de/ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.3218&rep=rep1&type=pdf
AnatEM (Anatomical entity mention recognition) 2013 CONLL, standoff 1212 docs (500 docs from AnEM + 262 from MLEE + 450 others) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM
AnEM 2012 BioC 500 docs (PubMed and PMC); abstracts and full text drawn randomly http://www.nactem.ac.uk/anatomy/docs/ohta2012opendomain.pdf http://corpora.informatik.hu-berlin.de/
AZDC (Arizona Disease Corpus) 2009 IeXML, .txt 2856 PubMed abstracts (2775 sentences); another source reports 794 PubMed abstracts https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/IeXML/goldcorpus/azdc-1.xml http://diego.asu.edu/downloads/AZDC_6-26-2009.txt
BEL (BioCreative V5 BEL Track) 2016 BioC https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ https://wiki.openbel.org/display/BIOC/Datasets
BioADI 2009 BioC 1201 PubMed abstracts https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/ http://bioc.sourceforge.net/
BioCause 2013 standoff 19 full-text documents http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-2 http://www.nactem.ac.uk/biocause/download.php
BioCreative-PPI XML https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html
BioGRID 2017 BioC 120 full text articles https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225395/ http://bioc.sourceforge.net/BioC-BioGRID.html
BioInfer 2007 BioC 1100 sentences from biomedical literature http://www.biomedcentral.com/1471-2105/8/50 http://corpora.informatik.hu-berlin.de/ http://mars.cs.utu.fi/BioInfer
BioMedLat 2016 standoff 643 BioASQ questions/factoids https://www.semanticscholar.org/paper/BioMedLAT-Corpus-Annotation-of-the-Lexical-Answer-Neves-Kraus/b0f09f94015771c31bd2483efdd8f0f86996384e https://github.com/mariananeves/BioMedLAT
BioText 2004 txt 100 titles and 40 abstracts http://biotext.berkeley.edu/papers/acl04-relations.pdf https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html
CDR (BioCreative V) BioC http://bioc.sourceforge.net/
CellFinder 1.0 2012 BioC 10 full documents from PMC from (Loser et al. 2009) on "Human Embryonic Stem Cell Lines and Their Use in International Research" http://www.nactem.ac.uk/biotxtm2012/presentations/Neves-pres.pdf http://corpora.informatik.hu-berlin.de/ http://cellfinder.de/about/annotation/
CG Cancer-Genetics (BioNLP-ST 2013) 2013 BioC, standoff http://aclweb.org/anthology/W/W13/W13-2008.pdf http://2013.bionlp-st.org/tasks/cancer-genetics
CHEMDNER (BioCreative IV Track 2) 2013 BioC / standoff http://www.biocreative.org/media/store/files/2013/bc4_v2_1.pdf http://www.biocreative.org/tasks/biocreative-iv/chemdner/
Chemical Patent Corpus 2014 standoff 200 patents http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477 http://biosemantics.org/index.php/resources/chemical-patent-corpus
CoMAGC 2013 XML 821 sentences on prostate, breast and ovarian cancer http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323 http://biopathway.org/CoMAGC/
CRAFT 2012 97 full OA biomedical articles http://bionlp-corpora.sourceforge.net/CRAFT/
Craven (Wisconsin corpus) 1999 other 1,529,731 sentences (automated) https://www.biostat.wisc.edu/~craven/ie/ReadMe https://www.biostat.wisc.edu/~craven/ie/
CTD (BioCreative IV Track 3) BioC http://www.biocreative.org/tasks/biocreative-iv/track-3-CTD/
DDICorpus 2011 2013 BioC 792 texts from DrugBank and 233 Medline abstracts https://www.ncbi.nlm.nih.gov/pubmed/23906817 http://bioc.sourceforge.net/ http://corpora.informatik.hu-berlin.de/ http://labda.inf.uc3m.es/ddicorpus
DIP-PPI (Database of Interacting Proteins) other Only proteins from yeast. https://www2.informatik.hu-berlin.de/~hakenber/corpora/
EBI:diseases 2008 other 856 sentences from 624 abstracts http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S3-S3 https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases
eFIP 2012 2015 xlsx https://www.ncbi.nlm.nih.gov/pubmed/23221174 https://www.ncbi.nlm.nih.gov/pubmed/25833953 http://research.bioinformatics.udel.edu/iprolink/corpora.php
EMU (Extractor of Mutations) 2011 other https://www.ncbi.nlm.nih.gov/pubmed/21138947 http://bioinf.umbc.edu/EMU/ftp/
EU-ADR 2012 other 300 PubMed abstracts (drug-disorder, drug-target, gene-disorder, SNP-disorder) http://www.sciencedirect.com/science/article/pii/S1532046412000573 http://biosemantics.org/index.php/resources/euadr-corpus
Exhaustive PTM (BioNLP 2011) http://dl.acm.org/citation.cfm?id=2002902.2002920 https://github.com/dterg/exhaustive-ptm
FlySlip 2007 CONLL 82 abstracts, 5 full papers https://www.ncbi.nlm.nih.gov/pubmed/17990496 http://compbio.ucdenver.edu/ccp/corpora/obtaining.shtml http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources
FSU-PRGE 2010 IeXML 3236 MEDLINE abstracts (35,519 sentences) http://aclweb.org/anthology/W/W10/W10-1838.pdf http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html
GAD 2015 csv http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0472-9 http://ibi.imim.es/research-lines/biomedical-text-mining/corpora/
GeneReg 2010 BioC 314 Abstracts http://www.lrec-conf.org/proceedings/lrec2010/pdf/407_Paper.pdf http://corpora.informatik.hu-berlin.de/ http://www.julielab.de/Resources/GeneReg.html
GeneTag (BioCreative II Gene Mention) 2005 BioC 20,000 MEDLINE sentences https://www.ncbi.nlm.nih.gov/pubmed/15960837 https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html http://bioc.sourceforge.net/
GENIA (BioNLP Shared Task 2009) http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/detail.shtml#downloads
GENIA (BioNLP Shared Task 2011) BioC, standoff https://sites.google.com/site/bionlpst/home/epigenetics-and-post-translational-modifications http://2011.bionlp-st.org http://corpora.informatik.hu-berlin.de/
GENIA (term annotation) 2003 BioC, XML http://corpora.informatik.hu-berlin.de/ http://www.nactem.ac.uk/aNT/genia.html
GETM 2010 BioC, standoff http://dl.acm.org/citation.cfm?id=1869970 http://corpora.informatik.hu-berlin.de/ http://getm-project.sourceforge.net/
GREC (Gene Regulation Event Corpus) 2009 BioC, standoff, XML 240 MEDLINE abstracts (167 on E. coli and 73 on human) http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-349 http://corpora.informatik.hu-berlin.de/ http://www.nactem.ac.uk/GREC/
HIMERA 2016 standoff http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717 http://www.nactem.ac.uk/himera/
HPRD50 (Human Protein Reference Database) 2004 BioC 50 abstracts https://www.ncbi.nlm.nih.gov/pubmed/14681466 http://corpora.informatik.hu-berlin.de/ http://www2.bio.ifi.lmu.de/publications/RelEx/
IDP4+ 2017 anndoc 860 abstracts/full-texts https://academic.oup.com/bioinformatics/article/33/12/1852/2991428 https://www.tagtog.net/-corpora/IDP4+
IEPA 2002 BioC slightly over 300 MEDLINE abstracts https://www.ncbi.nlm.nih.gov/pubmed/11928487 http://corpora.informatik.hu-berlin.de/ http://orbit.nlm.nih.gov/resource/iepa-corpus
iHOP 2004 other ~ 160 sentences https://www.ncbi.nlm.nih.gov/pubmed/15226743 http://www.ihop-net.org/UniPub/iHOP/info/gene_index/manual/1.html
iProLINK / RLIMS 2004 other, XML, BioC https://www.ncbi.nlm.nih.gov/pubmed/15556482 http://research.bioinformatics.udel.edu/iprolink/corpora.php
iSimp 2014 BioC 130 MEDLINE abstracts (1199 sentences) https://www.ncbi.nlm.nih.gov/pubmed/24850848 http://research.bioinformatics.udel.edu/isimp/corpus.html
Linnaeus 2010 standoff http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85 https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html http://linnaeus.sourceforge.net/
LLL (Learning Language in Logic) 2005 BioC https://www.cs.york.ac.uk/aig/lll/lll05/lll05-nedellec.pdf http://corpora.informatik.hu-berlin.de/ http://genome.jouy.inra.fr/texte/LLLchallenge/
MEDSTRACT BioC 199 PubMed citations https://www.ncbi.nlm.nih.gov/pubmed/11604766 http://bioc.sourceforge.net/
MedTag 2005 other https://www.researchgate.net/publication/234785358_MedTag_a_collection_of_biomedical_annotations ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz https://sourceforge.net/projects/medtag/
Metabolite and Enzyme 2011 BioC, XML 296 abstracts http://link.springer.com/article/10.1007%2Fs11306-010-0251-6 http://www.nactem.ac.uk/metabolite-corpus/ http://argo.nactem.ac.uk/bioc/
miRTex 2015 BioC, standoff 350 abstracts (200 development, 150 test) http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004391 http://research.bioinformatics.udel.edu/iprolink/corpora.php
MLEE 2012 CONLL, standoff 262 PubMed abstracts on molecular mechanisms of cancer (specifically relating to angiogenesis) https://academic.oup.com/bioinformatics/article/28/18/i575/249872/Event-extraction-across-multiple-levels-of http://nactem.ac.uk/MLEE/
mTOR pathway event corpus (BioNLP 2011) 2011 standoff http://dl.acm.org/citation.cfm?id=2002919 https://github.com/dterg/mtor-pathway/tree/master/original-data
MutationFinder 2007 other 305 abstracts (development set), 508 abstracts (test set) https://www.ncbi.nlm.nih.gov/pubmed/17495998 http://mutationfinder.sourceforge.net/ https://github.com/rockt/SETH
Nagel XML, standoff http://sourceforge.net/projects/bionlp-corpora/files/ProteinResidue/
NCBI Disease 2012 other 6881 sentences in 793 PubMed abstracts https://www.ncbi.nlm.nih.gov/pubmed/24393765 http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html
OMM (Open Mutation Miner) 2012 other 40 full texts https://www.ncbi.nlm.nih.gov/pubmed/22759648 http://www.semanticsoftware.info/open-mutation-miner
OSIRIS 2008 BioC, XML, standoff 105 articles https://www.ncbi.nlm.nih.gov/pubmed/18251998 http://corpora.informatik.hu-berlin.de/ https://sites.google.com/site/laurafurlongweb/databases-and-tools/corpora
PC (Pathway Curation) (BioNLP-ST 2013) 2013 BioC http://argo.nactem.ac.uk/bioc/ http://2013.bionlp-st.org/tasks/pathway-curation
PennBioIE-oncology 2004 IeXML 1414 PubMed abstracts on cancer http://www.aclweb.org/anthology/W04-3111 http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html
pGenN (Plant-GN) 2015 BioC 104 MEDLINE abstracts http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135305 http://research.bioinformatics.udel.edu/iprolink/corpora.php
PICAD 2011 XML 1037 sentences from PubMed http://dl.acm.org/citation.cfm?doid=2147805.2147853 http://ani.stat.fsu.edu/~jinfeng/resources/PICAD.txt http://corpora.informatik.hu-berlin.de/
PolySearch (includes v1 and v2) other https://www.ncbi.nlm.nih.gov/pubmed/25925572 http://polysearch.cs.ualberta.ca/downloads
ProteinResidue other http://bionlp-corpora.sourceforge.net/
SCAI_Klinger 2008 CONLL https://academic.oup.com/bioinformatics/article/24/13/i268/235854/Detection-of-IUPAC-and-IUPAC-like-chemical-names https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html
SCAI_Kolarik 2008 CONLL http://www.lrec-conf.org/proceedings/lrec2008/workshops/W4_Proceedings.pdf#page=55 https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html
SETH 2016 standoff 630 publications from The American Journal of Human Genetics and Human Mutation https://www.ncbi.nlm.nih.gov/pubmed/?term=27256315 https://github.com/rockt/SETH/tree/master/resources/SETH-corpus
SH (Schwartz and Hearst) 2003 BioC 1000 PubMed Abstracts https://www.ncbi.nlm.nih.gov/pubmed/12603049 http://bioc.sourceforge.net/
SNPCorpus 2011 BioC 296 MEDLINE abstracts https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194196/ http://corpora.informatik.hu-berlin.de/ http://www.scai.fraunhofer.de/snp-normalization-corpus.html
Species 2013 standoff 800 PubMed abstracts http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390 http://species.jensenlab.org/ http://species.jensenlab.org/
T4SS (Type 4 Secretion System) 2011 CONLL http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780
T4SS Event Extraction (BioNLP 2010) 2010 other http://dl.acm.org/citation.cfm?id=1869961.1869980 https://github.com/dterg/t4ss-event
tmVar 2013 BioC 500 PubMed abstracts https://www.ncbi.nlm.nih.gov/pubmed/23564842 https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#tmVar https://github.com/rockt/SETH
VariomeCorpus (hvp) 2013 BioC https://www.ncbi.nlm.nih.gov/pubmed/23584833 http://corpora.informatik.hu-berlin.de/ http://www.opennicta.com/home/health/variome
Yapex 2002 other 99 training, 101 test MEDLINE abstracts https://www.ncbi.nlm.nih.gov/pubmed/12460631 http://www.rostlab.org/~nlprot/yapex.txt https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html
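
Several corpora in the table are listed with a "standoff" format. As a rough illustration of what reading such annotations can look like, here is a minimal sketch that assumes the brat-style `.ann` layout (tab-separated lines, with entity lines starting with `T`). Individual corpora may deviate from this layout, so check each corpus's own documentation. The names `Entity` and `parse_standoff` and the sample annotations are hypothetical and are not taken from any corpus listed here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    """One text-bound annotation (a 'T' line) from a standoff file."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # character offsets; end is exclusive
    text: str


def parse_standoff(ann_text: str) -> List[Entity]:
    """Collect entity annotations from brat-style standoff text.

    Assumed line layout: ID <TAB> TYPE START END[;START END...] <TAB> TEXT
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes and notes
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed or truncated line
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        spans = []
        for fragment in offsets.split(";"):  # discontinuous spans use ';'
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Hypothetical document and annotations, for illustration only.
document = "BRCA1 is linked to breast cancer"
sample = (
    "T1\tProtein 0 5\tBRCA1\n"
    "T2\tDisease 19 32\tbreast cancer\n"
    "R1\tAssociation Arg1:T1 Arg2:T2\n"
)

for entity in parse_standoff(sample):
    start, end = entity.spans[0]
    print(entity.ann_id, entity.label, document[start:end])
```

The sketch keeps only entity lines because those are what named entity recognition needs. The relation line (`R1`) is skipped, but it could be parsed in a similar way for the corpora that also support association detection.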
