

spark-vcf

A VCF data source for Apache Spark, implemented natively in Spark.

Introduction

Spark VCF lets you load VCF files natively into an Apache Spark DataFrame/Dataset. To get started, clone or download this repository, run mvn package, and use the resulting jar. The library is also available on Maven Central.

Because spark-vcf is written specifically for Spark, it avoids some of the overhead of general-purpose VCF readers and can perform better in many areas.

Installation

spark-vcf can be packaged from source or added as a dependency to your Maven-based project.

To install spark-vcf, add the following to your pom:

<dependency>
  <groupId>com.lifeomic</groupId>
  <artifactId>spark-vcf</artifactId>
  <version>0.3.0</version>
</dependency>

For sbt:

libraryDependencies += "com.lifeomic" % "spark-vcf" % "0.3.0"

If you are using gradle, the dependency is:

compile group: 'com.lifeomic', name: 'spark-vcf', version: '0.3.0'

Getting Started

Getting started with Spark VCF is as simple as:

val myVcf = spark.read
    .format("com.lifeomic.variants")
    .load("src/test/resources/example.vcf")

The schema contains the standard VCF columns, with options to expand the INFO and/or FORMAT columns. An example schema from 1000 Genomes is shown below:

 |-- chrom: string (nullable = true)
 |-- pos: long (nullable = true)
 |-- start: long (nullable = true)
 |-- stop: long (nullable = true)
 |-- id: string (nullable = true)
 |-- ref: string (nullable = true)
 |-- alt: string (nullable = true)
 |-- qual: string (nullable = true)
 |-- filter: string (nullable = true)
 |-- info: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- gt: string (nullable = true)
 |-- sampleid: string (nullable = true)
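
As a rough illustration of working with this schema, the sketch below filters by chromosome and pulls one key out of the info map. It assumes the DataFrame has the schema shown above (in particular that info is a map of string to string); the chromosome value "1" and the "AF" key are hypothetical and may not be present in the example file, in which case the af column is simply null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object QueryVcfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-vcf-sketch")
      .master("local[*]")
      .getOrCreate()

    val myVcf = spark.read
      .format("com.lifeomic.variants")
      .load("src/test/resources/example.vcf")

    // Assumes the schema above: `info` is a map<string,string>.
    // "1" and "AF" are illustrative values, not guaranteed to exist in the file.
    val chr1 = myVcf
      .filter(col("chrom") === "1")
      .select(col("pos"), col("ref"), col("alt"), col("info").getItem("AF").as("af"))

    chr1.show(5)
    spark.stop()
  }
}
```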

Additional options control how the FORMAT and INFO columns are read. To return the format fields as a single map instead of separate columns, set the use.format.map option to true. This can speed up the Spark job further, because the reader no longer has to consult the VCF header for type and column information.

val mappedFormat = spark.read
    .format("com.lifeomic.variants")
    .option("use.format.map", "true")
    .load("src/test/resources/example.vcf")

You can also keep the format fields as plain strings, rather than typed values, by setting use.format.type to false.

Note that although the core of spark-vcf is written as a Spark data source, it is still advisable to use the BGZFEnhancedGzipCodec from Hadoop-BAM when reading bgzip files, so that Spark can split and partition them properly. For example:

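import org.apache.spark.SparkConf

// Register the Hadoop-BAM codec so that bgzip files can be split
val sparkConf = new SparkConf()
    .setAppName("testing")
    .setMaster("local[8]")
    .set("spark.hadoop.io.compression.codecs", "org.seqdoop.hadoop_bam.util.BGZFEnhancedGzipCodec")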
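
A minimal sketch of how that configuration might be used end to end is shown below. It assumes Hadoop-BAM and spark-vcf are both on the classpath; the input path is hypothetical, and the partition count printed at the end depends on the file and the cluster.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object BgzipVcfSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("testing")
      .setMaster("local[8]")
      .set("spark.hadoop.io.compression.codecs",
        "org.seqdoop.hadoop_bam.util.BGZFEnhancedGzipCodec")

    // Build the session from the configuration above.
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    val vcf = spark.read
      .format("com.lifeomic.variants")
      .load("path/to/example.vcf.gz") // hypothetical path to a bgzipped VCF

    // With a splittable codec the file can be read as more than one partition.
    println(s"partitions: ${vcf.rdd.getNumPartitions}")
    spark.stop()
  }
}
```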

TODO

  • Provide performance benchmarks compared to other libraries
  • Get Travis CI set up

License

The MIT License

Copyright 2017 Lifeomic

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

spark-vcf's People

Contributors

cfbevan, dependabot[bot], dmmiller612, joedimarzio, loscm, mjtieman


spark-vcf's Issues

Invalid value of AF read; Flag interpretation

Hi,
I ran into surprising behaviour while loading this simple VCF with the 2.1 version.
The source VCF looks like:

##fileformat=VCFv4.2
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of called genotypes">
##INFO=<ID=MAJ,Number=1,Type=String,Description="Major allele is REF or ALT">
##contig=<ID=MT,length=16569>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10109   rs376007522     A       T       22940.40        .       AC=86;AN=176;AF=0.4886;DP=17307;MAJ=REF;DB
1       10150   rs371194064     C       T       2276.70 .       AC=31;AN=176;AF=0.1761;DP=15279;MAJ=REF;DB

I read it using com.lifeomic.variants; the result is shown below.

scala> val myVcf = spark.read.format("com.lifeomic.variants").option("use.info.map", "true").option("use.info.type", "true").load("....vcf")
scala> myVcf.show
+-----+-----+-----+-----+-----------+---+---+--------+------+---+---+-----+-----+----+----+
|chrom|  pos|start| stop|         id|ref|alt|    qual|filter| an|maj|   dp|   af|  ac|  db|
+-----+-----+-----+-----+-----------+---+---+--------+------+---+---+-----+-----+----+----+
|    1|10109|10108|10109|rs376007522|  A|  T|22940.40|     .|176|REF|17307|[0.0]|[86]|null|
|    1|10150|10149|10150|rs371194064|  C|  T| 2276.70|     .|176|REF|15279|[0.0]|[31]|null|
+-----+-----+-----+-----+-----------+---+---+--------+------+---+---+-----+-----+----+----+

What is surprising is:

  1. The AF data is lost. It is set to 0.0 for no visible reason.
  2. The DB flag is not interpreted correctly. I would expect db to have a boolean value, but the DB data is lost completely here.

Thank you for your time, best wishes,
Piotr
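
To make the expected behaviour concrete, here is a small plain-Scala sketch of how the INFO string from the report could be interpreted. It is not spark-vcf's implementation; the object and function names are hypothetical and only illustrate the semantics the reporter describes (a Flag key that is present reads as true, and a Number=A Float field reads as a list of doubles).

```scala
object InfoFieldSketch {
  // Parse "AC=86;AN=176;AF=0.4886;DB" into key -> optional raw value.
  // A key without "=" (a Flag such as DB) maps to None.
  def parseInfo(info: String): Map[String, Option[String]] =
    info.split(";").filter(_.nonEmpty).map { entry =>
      entry.split("=", 2) match {
        case Array(key, value) => (key, Option(value))
        case Array(key)        => (key, Option.empty[String])
      }
    }.toMap

  // Flag semantics: the key being present means true.
  def flag(parsed: Map[String, Option[String]], key: String): Boolean =
    parsed.contains(key)

  // Number=A, Type=Float: comma-separated values, one per ALT allele.
  def floats(parsed: Map[String, Option[String]], key: String): Seq[Double] =
    parsed.get(key).flatten.toSeq.flatMap(_.split(",").toSeq).map(_.toDouble)
}
```

Applied to the first record above, flag(parsed, "DB") would be true and floats(parsed, "AF") would contain the single value 0.4886.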

Map Headers Before Collect

Currently, when headerMap is created we collect the dataframe before mapping the rows to header key->value tuples. We should perform the mapping on executors before collecting.

Support for . values

Hi,

Thank you for a great tool. Could you please consider supporting the value "." for numeric fields, which in VCF means (AFAIK) no value (perhaps null)?

For example, UnifiedGenotyper (GATK), after declaring a field:

##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
can sometimes put:

UG_RPA=.

in the INFO field.
Thank you,
Piotr
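
One way the requested behaviour could look is sketched below in plain Scala. This is an assumption about a possible approach, not code from spark-vcf; the object and function names are hypothetical. Each "." entry is treated as a missing value and becomes None, while other entries are parsed as integers.

```scala
object MissingValueSketch {
  // "." is the VCF missing-value marker; map it to None instead of failing.
  // Any other malformed entry still throws NumberFormatException.
  def parseInts(raw: String): Seq[Option[Int]] =
    raw.split(",").toSeq.map {
      case "." => None
      case s   => Some(s.toInt)
    }
}
```

In a Spark schema, None would naturally correspond to a null entry in a nullable integer column or array.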
