Home Page: https://picovcf.readthedocs.io/

License: MIT License

picovcf

Single-header C++ library for fast/low-memory VCF (Variant Call Format) parsing. Gzipped VCF (.vcf.gz) is optionally supported.

There are a lot of great tools for processing VCF files out there, but not many C++ libraries that are small (only parsing, no extra functionality) and easy to use. picovcf attempts to fill this niche by providing a header-only library using modern C++ (C++11) that allows clients to be selective about which parts of the VCF file get parsed.

Features:

  • Fast and easy to use VCF(.GZ) parsing.
  • Convert VCF(.GZ) to Indexable Genotype Data (IGD) format, a very simple format that is more than 3x smaller than VCF.GZ at Biobank scale and more than 15x faster to read.
  • Fast and easy to use IGD parsing.

More details can be found in the supplement of our preprint "Genotype Representation Graph" paper.

Using the library

Either copy the latest header file (picovcf.hpp) into your project directly, or make use of something like git submodules to include https://github.com/aprilweilab/picovcf.

See vcfpp.cpp for an example of how to use the APIs, and read the docs for an overview of the API.

When building code that uses picovcf.hpp, define VCF_GZ_SUPPORT=1 (-DVCF_GZ_SUPPORT=1 on most compiler command lines) to enable zlib support for compressed VCF files.

Build and run the tests/tools

picovcf does not need to be built to be used, since it is a single header that gets built as part of your project. However, if you want to build the tests and tools:

cd picovcf
mkdir build && cd build
cmake .. -DENABLE_VCF_GZ=ON
make

NOTE: -DENABLE_VCF_GZ=ON is optional, and links against libz in case you want to support .vcf.gz (compressed) files in the tools.

To convert from a .vcf or .vcf.gz file to .igd, run:

./vcfconv <vcf filename> <output IGD filename>

To view basic statistics for an IGD file, use igdpp. Some commands to try are ./igdpp stats <igd file> or ./igdpp range_stats <igd file>.

Finally, to run the unit tests:

EXAMPLE_VCFS=../test/example_vcfs/ ./picovcf_test

There is a Dockerfile that encodes all the build steps and dependencies, including the documentation build.

Build the documentation

Building the documentation requires the Python packages sphinx, sphinx-rtd-theme, and breathe, as well as Doxygen.

From the same build/ directory as above:

DOC_BUILD_DIR=$PWD sphinx-build -c ../doc/ -b html -Dbreathe_projects.picovcf=$PWD/doc/xml ../doc/ $PWD/doc/sphinx/

Indexable Genotype Data (IGD)

picovcf also defines an extremely simple binary file format that can be used for fast access to genotype data. Most other genotype data formats are not directly indexable: that is, you cannot jump to the one millionth variant without first scanning all of the (almost one million) variants before it. IGD has the following properties:

  • Indexable. You can use math to figure out where the ith variant will be in the file.
  • Uncompressed. No need to link in compression libraries.
  • Simple format: all variants are expanded into binary variants. If a variant has N alternate alleles, IGD stores it as N rows containing 0 (reference allele) or 1 (alternate allele). Each of these binary variants is stored as either a bitvector (non-sparse) or a list of sample indexes (sparse). A flag in the index indicates which way each variant is stored.
  • Very small. Often smaller than compressed formats like .vcf.gz or .bgen. The more low-frequency mutations there are (as with very large sample sizes), the smaller the file, assuming you use the default behavior of dynamically choosing between the sparse and non-sparse representations.

For example, the following are from chromosome 22 of a real dataset:

  • .vcf: 11GB
  • .vcf.gz: 203MB
  • .bgen: 256MB
  • .igd: 183MB

Converting the .vcf.gz to .bgen (via qctool) took 23 minutes, but converting to .igd only took 3 minutes. Furthermore, iteratively accessing all the variants (and genotype data) in the .igd file was approximately 15x faster than accessing the same data in the .vcf.gz file (using picovcf). On Biobank-scale real datasets, IGD is on average 3.5x smaller than .vcf.gz.

How do I use IGD in my project?

  • Clone picovcf and follow the instructions in this README to build the example tools for that library.
    • If you want to be able to convert .vcf.gz (compressed VCF) to IGD, make sure you build with -DENABLE_VCF_GZ=ON
  • One of the built tools will be vcfconv, which converts from VCF to IGD. Run vcfconv <vcf file> <igd file> to convert your data to IGD.
  • Do one of the following:
