This repository includes scripts and pipelines for genomic analyses in Apache Spark, an open-source cluster-computing framework for big data processing. The workflow is mostly based on Hail, an open-source, scalable framework for exploring and analyzing genomic data. Hail is exposed through Python and backed by distributed algorithms built on top of Apache Spark.
vcf_filtering_tutorial contains a small tutorial on VCF filtering. It also includes a principal component analysis (PCA) to explore whether there is any structure in the data. The tutorial was implemented in Databricks.
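
To give a feel for what VCF filtering means before opening the tutorial, here is a minimal plain-Python sketch. It is not the tutorial's code (the tutorial uses Hail on Spark) and it only illustrates the idea on a tiny in-memory VCF. The sample data, the function names `call_rate` and `filter_vcf`, and the thresholds `min_qual=30.0` and `min_call_rate=0.9` are all assumptions made up for this example.

```python
# Illustrative only: a tiny hand-written VCF, not data from the tutorial.
# Columns are tab-separated, as the VCF format requires.
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t200\trs2\tC\tT\t10\tPASS\t.\tGT\t0/1\t0/1\t0/0",
    "1\t300\trs3\tG\tA\t60\tPASS\t.\tGT\t./.\t./.\t0/1",
    "1\t400\trs4\tT\tC\t70\tPASS\t.\tGT\t0/0\t0/0\t0/0",
])


def call_rate(genotypes):
    """Return the fraction of samples with a non-missing genotype call."""
    if not genotypes:
        return 0.0
    # The GT value is the first colon-separated subfield; "." marks a missing allele.
    called = [g for g in genotypes if "." not in g.split(":")[0]]
    return len(called) / len(genotypes)


def filter_vcf(text, min_qual=30.0, min_call_rate=0.9):
    """Return the IDs of variants that pass the (hypothetical) thresholds."""
    kept = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        qual = fields[5]
        genotypes = fields[9:]  # sample columns follow the FORMAT column
        if qual == "." or float(qual) < min_qual:
            continue  # drop variants with missing or low quality
        if call_rate(genotypes) < min_call_rate:
            continue  # drop variants with too many missing calls
        kept.append(fields[2])
    return kept


kept_ids = filter_vcf(VCF_TEXT)
print(kept_ids)  # ['rs1', 'rs4']
```

Here rs2 is dropped for low quality and rs3 for a low call rate. On real cohorts the same kind of per-variant filter is run in a distributed fashion, which is why the tutorial relies on Hail and Spark.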