ncbi-hackathons / complexphenotypes

Complex Phenotypes Team at the NCBI Hackathon

License: MIT License

R 1.18% Python 0.93% HTML 97.89%

ncbi dbgap alpha-release phenotype

complexphenotypes's Introduction

dbGaPdb

A searchable database of sequencing and phenotype data

Hackathon Team: David McGaughey, Filip Cvetkovski, Michelle Miron, Robert Butler, Sean King, Luning Hoa, Sean Davis and Ben Busby

Intro

The Complex Phenotypes database is a relational database that helps users find which data sets are available for download from NCBI's Sequence Read Archive (SRA) and Database of Genotypes and Phenotypes (dbGaP), based on the phenotype and type of data they are interested in. These are among the largest public repositories of phenotype and sequencing data, but finding data of interest by phenotype is currently challenging. Searchable Complex Phenotypes makes this metadata more easily accessible.

This repository contains an R package that gives you access to all public metadata so you can explore what data is available. You can do this in two ways: query the database directly in R, or use a Shiny app to query the database.

Example Query

Quick start

Query examples in R
Shiny app via RStudio or web

Web Query

Installation

Dependencies

Further Use

complexphenotypes's People

seanking94 michellemiron seandavi b-rich

complexphenotypes's Issues

Also use GEOmetadb to attempt to match run info to dbGaP

https://www.bioconductor.org/packages/devel/bioc/html/GEOmetadb.html

GEO holds SNP arrays, right? I assume there are a lot of those in dbGaP.

Work on CDE integration

Migrating R package to "clean" repository?

Does anyone mind if I move the R package to a new repo? This will reduce file size (important for fast install) and put the R package as the top-level directory making development a bit easier.

Looks like we still need the high-level study metadata?

The equivalent of this:

ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000001/phs000001.v3.p1/GapExchange_phs000001.v3.p1.xml
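The path above appears to follow a regular pattern built from the study number, version, and participant set. The sketch below reconstructs it under that assumption; the pattern is inferred from this single example, and the function name `gapexchange_url` is hypothetical rather than part of any dbGaP tool.

```python
def gapexchange_url(study_number, version, participant_set):
    """Build a GapExchange XML URL, assuming the layout seen for phs000001.v3.p1."""
    # Assumption: accessions are 'phs' followed by a zero-padded 6-digit number.
    accession = f"phs{study_number:06d}"
    # Assumption: the versioned name is <accession>.v<version>.p<participant_set>.
    versioned = f"{accession}.v{version}.p{participant_set}"
    base = "ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies"
    return f"{base}/{accession}/{versioned}/GapExchange_{versioned}.xml"


if __name__ == "__main__":
    print(gapexchange_url(1, 3, 1))
```

Whether every study exposes a file at the resulting location would need to be checked against the FTP site itself.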

dbGaP study info dump JSON file is available on FTP

The following is the email sent to David. Put it over here for the record.
James

From: James L. Hao [email protected]
Date: Wed, Aug 16, 2017 at 8:04 PM
Subject: Re: empty files ftpDownload dbgapr
To: "McGaughey, David (NIH/NEI) [E]" [email protected]

Hi David,

The dbGaP study info dump JSON is now available at the following FTP location:
ftp://ftp.ncbi.nlm.nih.gov/dbgap/r-tool/public_datadump/

You may go through the sample file again. It includes 4 studies: 2 of them are root studies and the other 2 are sub-studies.
sample_dbgap_study_info_dump_pretty.json

The fields 'is_root', 'has_child', and 'has_parent' can help to identify parent-child relationships.

The sample and subject counts of the chip info will be added later.

I looked into the empty FTP files issue for phs000803.v1.p1. It turns out that the respective database tables are not loaded. This happens occasionally for all kinds of reasons. You may simply ignore them in this case. In most cases they should be populated sometime later.

If you search for phs000803 through the Advanced Search (URL below), you will see 0 variables returned, which confirms that the variable-related table is indeed empty.
https://www.ncbi.nlm.nih.gov/projects/gapsolr/facets.html

Please do not hesitate to write back if you have any questions.

Keep in touch.

Cheers,
Luning
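As a rough illustration of how the 'is_root', 'has_child', and 'has_parent' fields from the email might be used, the sketch below separates root studies from sub-studies. Only those three field names come from the email. The list-of-objects layout, the 'accession' key, the boolean values, and the placeholder study names are assumptions, so the real dump may need different access code.

```python
import json

# Hypothetical sample: 2 root studies and 2 sub-studies, mirroring the
# composition described in the email. The layout and 'accession' key are assumed.
SAMPLE_JSON = """
[
  {"accession": "study_a", "is_root": true,  "has_child": true,  "has_parent": false},
  {"accession": "study_b", "is_root": true,  "has_child": false, "has_parent": false},
  {"accession": "study_c", "is_root": false, "has_child": false, "has_parent": true},
  {"accession": "study_d", "is_root": false, "has_child": false, "has_parent": true}
]
"""


def split_root_and_sub_studies(studies):
    """Return (roots, subs) based on the 'is_root' and 'has_parent' flags."""
    roots = [s for s in studies if s.get("is_root")]
    subs = [s for s in studies if s.get("has_parent")]
    return roots, subs


studies = json.loads(SAMPLE_JSON)
roots, subs = split_root_and_sub_studies(studies)

if __name__ == "__main__":
    print("root studies:", [s["accession"] for s in roots])
    print("sub-studies:", [s["accession"] for s in subs])
```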

Writing dbGaP metadata includes a newline, which splits a record

phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz

readLines('phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz')[14002:14005]

Note how the last record in the output below is split into two lines. This breaks reading the file as TSV. Records may need to be written as CSV with quoted fields, or have newlines stripped before writing. Alternatively (and this might be better), if dbGaPR pulls data directly into R without file creation, just use that rather than writing files.

[1] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t1\tNORMAL\t2"
[2] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1"
[3] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT"
[4] " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4"
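One possible workaround when reading such a file, sketched here in Python, is to rejoin physical lines until each record reaches the expected number of tab-separated fields. This is an illustration rather than the project's fix: the function name `repair_split_records` is hypothetical, the field count of 9 is taken from the complete records shown above, and the approach assumes stray newlines never occur between two complete records.

```python
def repair_split_records(lines, expected_fields=9, joiner=""):
    """Merge consecutive lines until each record has `expected_fields` columns."""
    records = []
    buffer = None
    for line in lines:
        line = line.rstrip("\n")
        # Start a new record, or glue this fragment onto the pending one.
        buffer = line if buffer is None else buffer + joiner + line
        if buffer.count("\t") + 1 >= expected_fields:
            records.append(buffer)
            buffer = None
    if buffer is not None:
        # Trailing fragment that never reached the expected width; keep it as-is.
        records.append(buffer)
    return records


# The four lines from the readLines() output above.
raw = [
    "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t1\tNORMAL\t2",
    "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1",
    "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT",
    " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4",
]
fixed = repair_split_records(raw)

if __name__ == "__main__":
    for record in fixed:
        print(record.split("\t"))
```

The joiner is empty here because the continuation line already begins with a space. Writing the data with quoted fields in the first place would avoid the need for this kind of repair.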