
GCP Variant Transforms


Overview

This is a tool for transforming and processing VCF files in a scalable manner. It is based on Apache Beam and uses Dataflow on Google Cloud Platform.

It can be used to load VCF files directly into BigQuery, supporting hundreds of thousands of files, millions of samples, and billions of records. It also provides a preprocessor that validates VCF files so that inconsistencies can be easily identified.

Please see this presentation for a high-level overview of BigQuery and how to use Variant Transforms and BigQuery effectively. Please also read the blog post about how a GCP customer used Variant Transforms for breakthrough clinical data science with BigQuery.

Prerequisites

  1. Set up a Google Cloud account and create a project.
  2. Sign up for and install the Google Cloud SDK.
  3. Enable the Genomics, Compute Engine, Cloud Storage, and Dataflow APIs.
  4. Open the billing page for the project you have selected or created, and click Enable billing.
  5. Create a new BigQuery dataset by visiting the BigQuery web UI, clicking on the down arrow icon next to your project name in the navigation, and clicking on Create new dataset. A command-line alternative for steps 3 and 5 is sketched after this list.
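
If you prefer the command line, steps 3 and 5 can also be done with the gcloud and bq tools. This is a minimal sketch; the API service names and dataset name below are assumptions, so verify them against the API Library and your own project:

# Enable the required APIs (service names assumed; verify in the API Library).
gcloud services enable \
    genomics.googleapis.com \
    compute.googleapis.com \
    storage-component.googleapis.com \
    dataflow.googleapis.com

# Create the BigQuery dataset that will hold the output table.
bq mk --dataset my_project:my_bigquery_dataset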

Loading VCF files to BigQuery

Using docker

The easiest way to run the VCF to BigQuery pipeline is to use the docker image and run it with the Google Genomics Pipelines API, as the image has the binaries and all dependencies pre-installed.

First, set up the pipeline configuration shown below and save it as vcf_to_bigquery.yaml. The parameters that you need to replace are:

  • my_project: The name of the project that contains your BigQuery dataset.
  • gs://my_bucket/vcffiles/*.vcf: A location in Google Cloud Storage where the VCF files are stored. You may specify a single file or provide a pattern to load multiple files at once. Please refer to the Variant Merging documentation if you want to merge samples across files. The pipeline supports gzip, bzip, and uncompressed VCF formats. However, it runs slower for compressed files as they cannot be sharded.
  • my_bigquery_dataset: Your BigQuery dataset to store the output.
  • my_bigquery_table: This can be any ID you like (e.g. vcf_test).
  • gs://my_bucket/staging and gs://my_bucket/temp: These can be any folder in Google Cloud Storage that your project has write access to. These are used to store temporary files needed for running the pipeline.
name: vcf-to-bigquery-pipeline
docker:
  imageName: gcr.io/gcp-variant-transforms/gcp-variant-transforms
  cmd: |
    /opt/gcp_variant_transforms/bin/vcf_to_bq \
      --project my_project \
      --input_pattern gs://my_bucket/vcffiles/*.vcf \
      --output_table my_project:my_bigquery_dataset.my_bigquery_table \
      --staging_location gs://my_bucket/staging \
      --temp_location gs://my_bucket/temp \
      --job_name vcf-to-bigquery \
      --runner DataflowRunner

Next, run the following command to launch the pipeline. Replace my_project with your project name and gs://my_bucket/temp/runner_logs with a Cloud Storage folder for storing the pipeline logs.

gcloud alpha genomics pipelines run \
    --project my_project \
    --pipeline-file vcf_to_bigquery.yaml \
    --logging gs://my_bucket/temp/runner_logs \
    --zones us-west1-b \
    --service-account-scopes https://www.googleapis.com/auth/bigquery

Please note the operation ID returned by the above script. You can track the status of your operation by running:

gcloud alpha genomics operations describe <operation-id>

The returned data will have done: true when the operation has completed. A detailed description of the Operation resource can be found in the API documentation.
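
If you want to wait for completion from a script, a minimal polling sketch (reusing the <operation-id> placeholder above) is:

# Poll every 60 seconds until the describe output reports done: true.
while ! gcloud alpha genomics operations describe <operation-id> | grep -q 'done: true'; do
  sleep 60
done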

The underlying pipeline uses Cloud Dataflow. You can navigate to the Dataflow Console to see a more detailed view of the pipeline (e.g. the number of records being processed, the number of workers, and more detailed error logs).

Running from github

In addition to using the docker image, you may run the pipeline directly from source. First install git, python, pip, and virtualenv:

sudo apt-get install -y git python-pip python-dev build-essential
sudo pip install --upgrade pip
sudo pip install --upgrade virtualenv

Run virtualenv, clone the repo, and install pip packages:

virtualenv venv
source venv/bin/activate
git clone https://github.com/googlegenomics/gcp-variant-transforms.git
cd gcp-variant-transforms
pip install --upgrade .

You may use the DirectRunner (aka local runner) for small (e.g. 10,000 records) files or DataflowRunner for larger files. Files should be stored on Google Cloud Storage if using Dataflow, but may be stored locally for DirectRunner.

Example command for DirectRunner:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern gcp_variant_transforms/testing/data/vcf/valid-4.0.vcf \
  --output_table projectname:bigquerydataset.tablename

Example command for DataflowRunner:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern gs://my_bucket/vcffiles/*.vcf \
  --output_table my_project:my_bigquery_dataset.my_bigquery_table \
  --project my_project \
  --staging_location gs://my_bucket/staging \
  --temp_location gs://my_bucket/temp \
  --job_name vcf-to-bigquery \
  --setup_file ./setup.py \
  --runner DataflowRunner

Running VCF files preprocessor

The VCF files preprocessor validates datasets so that inconsistencies can be easily identified. It can be used as a standalone validator to check the validity of VCF files, or as a helper tool for the VCF to BigQuery pipeline. Please refer to the VCF files preprocessor documentation for more details.
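
For illustration only, a standalone validation run might look like the sketch below. The module name and flags are assumptions based on the pipeline's naming conventions; the VCF files preprocessor documentation is the authoritative reference:

# Assumed module name and flags; check the preprocessor documentation before use.
python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern gs://my_bucket/vcffiles/*.vcf \
  --report_path gs://my_bucket/preprocess/report.tsv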

Running jobs in a particular region/zone

You may need to constrain Cloud Dataflow job processing to a specific geographic region to meet your project's security and compliance needs. See the Running in a particular zone/region doc.
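
For example, assuming the standard Dataflow pipeline options --region and --zone, the DataflowRunner command shown earlier could be constrained as follows (the region and zone values are placeholders):

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern gs://my_bucket/vcffiles/*.vcf \
  --output_table my_project:my_bigquery_dataset.my_bigquery_table \
  --project my_project \
  --staging_location gs://my_bucket/staging \
  --temp_location gs://my_bucket/temp \
  --job_name vcf-to-bigquery \
  --setup_file ./setup.py \
  --runner DataflowRunner \
  --region europe-west1 \
  --zone europe-west1-b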

Additional topics

Development
