comorbidity / medgen-umls

NCBI Medical Genetics, UMLS Unified Medical Language System Concepts and PubMed linked citations

License: Apache License 2.0

Python 8.03% Shell 68.04% Makefile 23.94%
ctakes ctakes-clinical-pipeline gene-ontology genetic-testing genetic-variants genetics human-phenotype-ontology mysql ncbi phenotypes

medgen-umls's Introduction

This package greatly simplifies the creation of local mirrors for NLM (National Library of Medicine) sources, which currently include:

  • NCBI Medical Genetics linked sources
  • UMLS Unified Medical Language System
  • PubMed annotated content

Mirrored datastores are then converted automatically to SQL databases (MySQL is currently supported, with plans to support generic SQL).

medgen-umls was made with simplicity and automation in mind. Over 100 URLs (and counting) have been rounded up, their data normalized for database manipulation, to provide ease of access to as much open access medical genomics data as possible.


medgen-umls is a free and open source library under the [Apache 2.0 License](http://www.apache.org/licenses/), a copy of which is included within the repository.

All questions, concerns, support, and curse words should be directed to package maintainers Andy McMurry ( [email protected] ) and medgen-mysql contributors.

Contributions to this library are encouraged via fork and pull request. Diffs may be accepted when attached to nicely written emails.


Any Unix-like operating system (including OS X) will run medgen-umls.

Medgen-umls downloads are automated via Makefile and run entirely within bash scripts. Thus the requirements are small:

  • bash
  • wget
  • mysql (optional)
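
A quick way to confirm these tools are available before running anything is sketched below. The helper name missing_tools is hypothetical and not part of this repository; it only wraps the standard command -v lookup.

```shell
# Hypothetical helper (not part of medgen-umls): print each named tool
# that cannot be found on PATH.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    exit 0
)

# bash and wget are required; mysql is only needed for the database steps.
required_missing=$(missing_tools bash wget)
if [ -n "$required_missing" ]; then
    echo "missing required tools:" $required_missing
else
    echo "required tools found"
fi
[ -z "$(missing_tools mysql)" ] || echo "mysql not found: download-only use is still possible"
```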

It's possible to use this repository for downloading purposes only. See "USAGE" for details.


Clone this repository using Git:

git clone [email protected]:comorbidity/medgen-umls.git
cd medgen-umls

If you only want to download medgen-umls files, you do NOT need MySQL. If you want to save files to MySQL, you need a running server and a mysql user (see https://dev.mysql.com/doc/refman/8.0/en/create-user.html).

The first time you use this repository, you must run the database scripts that create a mysql user that will be able to load the medgen databases:

make user

(Note that MySQL must be running and you must have the ability to use the "root" superuser.)

make all

If you want every database downloaded and installed, simply run make all. Done.

Note that due to the size of some of these databases, it could take days to run everything the first time. Successive runs will take far less time since only newer files will be downloaded to your local mirror.

make <dbname>

The Makefile in the root of the medgen-mysql directory provides a make <dbname> target for each supported database. (See below for the complete list.)

For each database desired, type make <dbname> to complete all of the tasks associated with downloading, extracting, and loading these particular sources into MySQL.

For example, make clinvar will complete the following steps:

./mirror.sh clinvar/urls
./unpack.sh clinvar
./create_database.sh clinvar
./load_database.sh clinvar
./index_database.sh clinvar

All of the above steps can be run individually on the command line, so if you only want to run the download script, run ./mirror.sh <dbname>/urls, which puts downloaded content into <dbname>/mirror.
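
The five steps listed for clinvar can be chained by hand for any dataset. The following is a minimal sketch, not the Makefile's actual recipe: build_dataset and the DRY_RUN variable are hypothetical names introduced here for illustration, and only the five script names come from the list above.

```shell
# Hypothetical helper mirroring the five steps listed for `make clinvar`.
# Set DRY_RUN=1 to print each command instead of executing it.
build_dataset() (
    db=$1
    for step in "./mirror.sh $db/urls" \
                "./unpack.sh $db" \
                "./create_database.sh $db" \
                "./load_database.sh $db" \
                "./index_database.sh $db"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$step"
        else
            # Stop at the first failing step so later steps never run on bad input.
            $step || { echo "failed: $step" >&2; exit 1; }
        fi
    done
)

# Preview the commands for clinvar without touching the network or MySQL.
DRY_RUN=1 build_dataset clinvar
```

Without DRY_RUN, the function runs the scripts in order from the repository root and returns non-zero as soon as one of them fails.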

Note that already-downloaded files will not be re-downloaded, as long as wget is convinced that the remote and local files are identical. If they differ, wget downloads that file again.

This conservative updating means that you can schedule regular updates of your medical genetics databases without overusing your connection.
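
The conservative updating described above matches wget's time-stamping mode. The sketch below shows how such a download step could look; it is an assumption about the approach, not a copy of mirror.sh, and the names mirror_urls and WGET are hypothetical. The wget options used (--timestamping, --directory-prefix, --input-file) are standard.

```shell
# Hypothetical stand-in for the download step; the real mirror.sh may differ.
# WGET can be overridden (for example WGET=echo) to preview the invocation.
mirror_urls() (
    urls=$1     # file with one URL per line, e.g. clinvar/urls
    dest=$2     # target directory, e.g. clinvar/mirror
    mkdir -p "$dest" || exit 1
    # --timestamping skips a file unless the remote copy is newer
    # or differs in size from the local one.
    ${WGET:-wget} --timestamping --directory-prefix="$dest" --input-file="$urls"
)
```

For scheduled updates, a crontab entry that changes into the repository directory and runs make <dbname> is usually enough; the schedule and the checkout path depend on your system.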

Note also that datasets vary widely in how much disk space they require. Some datasets are EXTREMELY LARGE. Average use is usually ~50GB.
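
To see how much space each local mirror takes, and how much is left on the disk, something like the following can be run from the repository root. This is a sketch: mirror_sizes is a hypothetical helper, and it only assumes the <dbname>/mirror layout described earlier.

```shell
# Hypothetical helper: print the size of every <dbname>/mirror directory
# under the given repository root (default: current directory).
mirror_sizes() (
    cd "${1:-.}" || exit 1
    for dir in */mirror; do
        # The glob stays literal when nothing matches, so check before measuring.
        if [ -d "$dir" ]; then
            du -sh "$dir"
        fi
    done
    exit 0
)

mirror_sizes .
# Free space on the filesystem holding the mirrors.
df -h .
```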

PubTator: NCBI Text Mined mutations for all PubMed abstracts
clinvar: NCBI Clinical Variants
GTR: NCBI Genetic Testing Reference
gene: NCBI Entrez Gene database
GeneReviews: NCBI Gene Reviews
GO: http://GeneOntology.org
hugo: http://GeneNames.org
medgen: NCBI Medical Genetics
disgenet: Disease Gene Network
hpo: Human Phenotype Ontology
orphanet: Rare diseases
PubMed PMID linkages to the above sources

example1: mirror NCBI Medical Genetics with primary sources

$./mirror.sh medgen/urls
$./mirror.sh gene/urls
$./mirror.sh GTR/urls
$./mirror.sh clinvar/urls
$./mirror.sh hpo/urls
$./mirror.sh GeneReviews/urls
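
The same sequence can be written as a loop that keeps going if one source fails and reports an overall status at the end. This is a sketch under stated assumptions: mirror_all and the MIRROR override are hypothetical names, and the loop assumes each dataset keeps its URL list at <dbname>/urls as in the commands above.

```shell
# Hypothetical wrapper: mirror several datasets, continue past failures,
# and return non-zero if any dataset failed.
# MIRROR can be overridden (for example MIRROR=echo) to preview the calls.
mirror_all() (
    status=0
    for db in "$@"; do
        ${MIRROR:-./mirror.sh} "$db/urls" || {
            echo "mirror failed for $db" >&2
            status=1
        }
    done
    exit "$status"
)
```

Run from the repository root, mirror_all medgen gene GTR clinvar hpo GeneReviews would cover the six sources of example1.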

example2: mirror PubMed annotations containing gene mutations with primary sources

$./mirror.sh PubTator/urls
$./mirror.sh gene/urls
$./mirror.sh pubmed/urls

example: create mysql database for PubTator

$./create_database.sh PubTator

example: unzip PubTator mirrored flat files

$./unpack.sh PubTator

example: load PubTator database with mirrored flat files

$./load_database.sh PubTator

  • $mysql_dataset: opens a mysql client for the current dataset
  • processlist (call ps;): shows active SQL commands with elapsed time (selects, DML, indexes)
  • info (call info;): shows the table schema with load statistics

example: open a mysql client for the PubTator database

source ./PubTator/db.config
$mysql_dataset

example: show PubTator tables and statistics. Make sure you have sufficient MEMORY for the indexes! To check on the status of the load, see processlist.

mysql> call info;
+--------------+--------+-------------------+------------+---------+----------+----------+-----------------+
| table_schema | ENGINE | TABLE_NAME        | TABLE_ROWS | million | data_MB  | index_MB | TABLE_COLLATION |
+--------------+--------+-------------------+------------+---------+----------+----------+-----------------+
| PubTator     | InnoDB | chemical2pubtator |   27453916 | 27.45   | 1549.00M | 0.00M    | utf8_unicode_ci |
| PubTator     | InnoDB | disease2pubtator  |   27825311 | 27.83   | 1870.00M | 0.00M    | utf8_unicode_ci |
| PubTator     | InnoDB | gene2pubtator     |   10800507 | 10.80   | 657.00M  | 0.00M    | utf8_unicode_ci |
| PubTator     | InnoDB | log               |         36 | 0.00    | 0.02M    | 0.00M    | utf8_unicode_ci |
| PubTator     | InnoDB | mutation2pubtator |     537030 | 0.54    | 29.56M   | 23.08M   | utf8_unicode_ci |
| PubTator     | InnoDB | README            |         11 | 0.00    | 0.02M    | 0.00M    | utf8_general_ci |
| PubTator     | InnoDB | species2pubtator  |   16563014 | 16.56   | 805.00M  | 0.00M    | utf8_unicode_ci |
+--------------+--------+-------------------+------------+---------+----------+----------+-----------------+

example: show active SQL commands (processlist) running for this dataset. NOTE: some datasets take a very long time to load and index.

mysql> call ps;
+-----+----------+-----------+----------+---------+------+-------+-----------+
| ID  | USER     | HOST      | DB       | COMMAND | TIME | STATE | INFO      |
+-----+----------+-----------+----------+---------+------+-------+-----------+
| 115 | pubtator | localhost | PubTator | Query   |   74 | NULL  |           |
|                                                                            |
|   load data local infile 'mirror/gene2pubtator'                            |
|   into table gene2pubtator                                                 |
|   fields terminated by '\t' ESCAPED BY ''                                  |
|   lines terminated by '\n' ignore 1 lines                                  |
|                                                                            |
+-----+----------+-----------+----------+---------+------+-------+-----------+
