kevinhu / cancer_data

A unified downloader+preprocessor for cancer genomics datasets

Home Page: https://cancer_data.kevinhu.io

License: MIT License

Topics: cancer-research, cancer-genomics, tcga, ccle, gtex, genomics, data-science, dataset, hdf5-format


cancer_data

This package provides unified methods for accessing popular datasets used in cancer research.

Full documentation

Installation

pip install cancer_data

System requirements

The raw downloaded files occupy approximately 15 GB, and the processed HDF5 files take up about 10 GB. On a relatively recent machine with a fast SSD, processing all of the files after download takes about 3-4 hours. At least 16 GB of RAM is recommended for handling the large splicing tables.

Datasets

A complete description of the datasets may be found in schema.csv.

Collection | Datasets | Portal
---------- | -------- | ------
Cancer Cell Line Encyclopedia (CCLE) | Many (see portal) | https://portals.broadinstitute.org/ccle/data (registration required)
Cancer Dependency Map (DepMap) | Genome-wide CRISPR-Cas9 and RNAi screens, gene expression, mutations, and copy number | https://depmap.org/portal/download/
The Cancer Genome Atlas (TCGA) | Mutations, RNAseq expression and splicing, and copy number | https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
The Genotype-Tissue Expression (GTEx) Project | RNAseq expression and splicing | https://gtexportal.org/home/datasets

Features

The goal of this package is to make statistical analysis and coordination of these datasets easier. To that end, it provides the following features:

  1. Harmonization: datasets within a collection have their sample IDs normalized to a common format. For instance, all CCLE+DepMap datasets have been modified to use Achilles/Arxspan IDs rather than cell line names.
  2. Speed: processed datasets are all stored in the high-performance HDF5 format, allowing large tables to be loaded orders of magnitude faster than from CSV or TSV files.
  3. Space: tables of purely numerical values (e.g. gene expression, methylation, drug sensitivities) are stored in half-precision format. Compression is used for all tables, resulting in size reductions by factors of over 10 for sparse matrices such as mutation tables, and over 50 for highly redundant tables such as gene-level copy number estimates.
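For example, loading a processed table might look like the following minimal sketch (the load accessor name is an assumption here; consult the full documentation for the exact API):

import cancer_data

# Load a processed dataset by its schema id; large HDF5-backed tables
# load far faster than their CSV/TSV equivalents.
# (`load` is an assumed accessor name.)
df = cancer_data.load("ccle_proteomics")
print(df.shape)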

How it works

The schema serves as the reference point for all datasets used. Each dataset is identified by a unique value in the id column, which also serves as its access identifier.
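The schema can be inspected directly via the schema() accessor, which returns a pandas DataFrame indexed by dataset id (as used in the issues below):

import cancer_data

# The schema is indexed by dataset id.
schema = cancer_data.schema()

# Look up the download and dependency metadata for one dataset.
print(schema.loc["ccle_proteomics", ["download_url", "downloaded_md5", "dependencies"]])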

Datasets are downloaded from the location specified in download_url, after which they are checked against the provided downloaded_md5 hash.
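A minimal sketch of this verification step (the helpers below are hypothetical; the package's internal implementation may differ):

import hashlib

def file_md5(path, chunk_size=2 ** 20):
    # Stream the file in chunks so multi-gigabyte downloads are
    # never held in memory all at once.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def is_valid_download(path, expected_md5):
    # expected_md5 comes from the schema's downloaded_md5 column.
    return file_md5(path) == expected_md5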

The next steps depend on the type of the dataset:

  • reference datasets, such as the hg19 FASTA files, are left as-is.
  • primary_dataset objects are preprocessed and converted into HDF5 format.
  • secondary_dataset objects are derived from primary_dataset objects. These are likewise processed and converted into HDF5 format.

To keep track of which datasets are necessary for producing another, the dependencies column lists the dataset ids required to build each dataset. For instance, the ccle_proteomics dataset depends on the ccle_annotations dataset for converting cell line names to Achilles IDs. When running the processing pipeline, the package automatically checks that dependencies are met and raises an error if any are missing.
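A sketch of this dependency check (the function name and delimiter are assumptions; the pipeline's actual logic may differ):

import pandas as pd

def check_dependencies(schema, dataset_id, available_ids):
    # Assumes the dependencies column holds a delimited list of dataset ids.
    deps = schema.loc[dataset_id, "dependencies"]
    if pd.isna(deps):
        return  # no dependencies to satisfy
    missing = [d for d in str(deps).split(",") if d and d not in available_ids]
    if missing:
        raise ValueError(f"{dataset_id} is missing dependencies: {missing}")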

Notes

Some datasets have filtering applied to reduce their size. These are listed below:

  • CCLE, GTEx, and TCGA splicing datasets have been filtered to remove splicing events with many missing values as well as those with low standard deviations.
  • When constructing binary mutation matrices (depmap_damaging and depmap_hotspot), a minimum mutation frequency is used to remove especially rare mutations (those present in fewer than four samples); see the sketch after this list.
  • The TCGA MX splicing dataset is extremely large (approximately 10,000 rows by 900,000 columns), so it has been split column-wise into 8 chunks.
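As referenced above, a minimal sketch of the frequency filter for binary mutation matrices (assuming samples as rows and mutations as columns):

import pandas as pd

def drop_rare_mutations(binary_matrix, min_samples=4):
    # Keep only mutation columns present in at least `min_samples` samples.
    counts = binary_matrix.sum(axis=0)
    return binary_matrix.loc[:, counts >= min_samples]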

cancer_data's People

Contributors

abearab, deepsourcebot, dependabot-preview[bot], dependabot[bot], dillon-zephyrai, imgbotapp, jrouly, kevinhu


cancer_data's Issues

Collisions in downloaded_name break caching

Hi, it looks like in cases where two schema entries share the same downloaded_name, any operation that touches both will show md5 mismatches and rerun the download step.

You could add a download_as field to the schema, or build the local filename from a combination of fields guaranteed to be unique, then rename the file after download.
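A sketch of the second suggestion, deriving a collision-free local filename from fields that are unique in the schema (the naming scheme is hypothetical):

def local_download_name(dataset_id, downloaded_name):
    # Prefix with the unique dataset id so two schema entries sharing
    # a downloaded_name no longer collide in the download cache.
    return f"{dataset_id}__{downloaded_name}"

The downloader would then save to this name (or rename after download) and run the md5 check against it.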

tcga_normalized_gene_expression fails to download due to md5sum mismatch

Hi, thanks for putting this repo together, it looks very handy.

On cancer_data version 0.1.0, I tried to download the tcga_normalized_gene_expression dataset via

cancer_data.download("tcga_normalized_gene_expression")

but this failed with the message:

243iB [00:00, 58.9kiB/s]
EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz does not match provided md5sum. Attempting second download.
Downloading https://pancanatlas.xenahubs.net/download/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz
243iB [00:00, 36.1kiB/s]
Second download of EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz failed. Recommend manual inspection.

Yet when I manually inspect the md5sum of the .gz, everything looks ok:

import hashlib

import cancer_data

# Expected hash recorded in the schema for this dataset.
schema_md5 = cancer_data.schema().loc['tcga_normalized_gene_expression']['downloaded_md5']

# Path to the manually downloaded file.
fname = "/Users/pat/Downloads/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz"

# Compute the md5 of the file on disk.
with open(fname, "rb") as f:
    data = f.read()
    observed_md5 = hashlib.md5(data).hexdigest()

assert schema_md5 == observed_md5  # both are "5fbfb5a4854a2cfc8a95c3ada5379fd4"

Am I doing something silly? Thanks in advance.
