Intro
FASTQ files are widely used in bioinformatics to store raw sequencing data. One FASTQ entry contains information about the sequencing machine, the nucleotides (A, T, C, G, or N), and the per-base sequencing quality. They are text files organized as follows:
@IDENTIFIER.1 various_info_about_sequencing_machine_for_example
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
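To make the four-line structure concrete, here is a minimal sketch of a FASTQ parser in plain Python (the function and record names are my own, not from any library); it also decodes the quality line using the standard Phred+33 convention, where each character's ASCII code minus 33 is the quality score:

```python
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class FastqRecord:
    identifier: str       # header line, without the leading "@"
    sequence: str         # nucleotides: A, T, C, G or N
    qualities: List[int]  # per-base Phred quality scores


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, four lines per entry."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of stream
        sequence = handle.readline().strip()
        handle.readline()  # the "+" separator line, ignored here
        quality = handle.readline().strip()
        # Phred+33 encoding: score = ASCII code of the character - 33
        scores = [ord(c) - 33 for c in quality]
        yield FastqRecord(header.lstrip("@"), sequence, scores)
```

Note how the parser is forced to read the file strictly line by line: a reader dropped at a random byte offset cannot tell a quality line from a sequence line, which is part of what makes the format awkward to split across workers.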
The drastic drop in sequencing costs has led computational biologists to the Big Data playground.
Now imagine the FASTQ format with millions or billions of entries.
This textual format is highly inefficient on modern, multi-core computers.
Meanwhile, tremendous efforts have been made to handle the data deluge. One of the most recent examples is the Apache Spark ecosystem, which distributes tasks across an arbitrary number of nodes, from GPU-enabled supercomputers to commodity PCs.
Further developments in bioinformatics may require such infrastructure, so we need to transform the old FASTQ standard into a new, Spark-friendly format.
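To give a feel for what a Spark-friendly layout looks like, the sketch below pivots four-line FASTQ records into column arrays, the same idea that columnar formats such as Parquet apply on disk. It is plain Python with no Spark dependency, and all names are hypothetical:

```python
def fastq_to_columns(lines):
    """Pivot flat FASTQ lines (4 per record) into a column-oriented dict.

    Columnar layouts (as used by Parquet) store each field contiguously,
    so they compress better and let an engine read only the columns a
    query actually needs, instead of scanning every line of every record.
    """
    columns = {"identifier": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        columns["identifier"].append(lines[i].lstrip("@"))
        columns["sequence"].append(lines[i + 1])
        # lines[i + 2] is the "+" separator, dropped in the columnar form
        columns["quality"].append(lines[i + 3])
    return columns
```

In a real pipeline this pivot would be done by Spark itself when writing a DataFrame out as Parquet; the point here is only to show the shape of the data once it leaves the row-oriented text format.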
Unordered resources
Formats
Spark-related
- Spark documentation
- Setting up Spark with Maven
- Spark 2.0.1 Docker container with Java 8 and Hadoop 2.7.2
- ADAM: BAM/SAM serialization using Apache Avro
- Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers
- Changing Spark's default Java serialization to Kryo
- Writing efficient Spark jobs