keuv-grvl / fastq-serializer

Spark RDD-friendly serialized fastq files

Java 100.00%

fastq-serializer's Introduction

Intro

Fastq files are widely used in bioinformatics to store raw sequencing data. One fastq entry contains information about the sequencing machine, the nucleotides (A, T, C, G or N) and the sequencing quality. Fastq files are text files organized as follows:

@IDENTIFIER.1 various_info_about_sequencing_machine_for_example
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
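
To make the layout above concrete, here is a minimal parsing sketch in plain Java. It assumes exactly four lines per entry (header starting with `@`, sequence, `+` separator, quality) and no multi-line sequences. `FastqEntry` and `FastqTextParser` are hypothetical names used for illustration only; they are not classes from this project or from BioJava.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one fastq entry (name and fields are illustrative). */
final class FastqEntry {
    final String id;        // header text after '@', up to the first space
    final String info;      // remainder of the header line, possibly empty
    final String sequence;  // nucleotides: A, T, C, G or N
    final String quality;   // one quality character per nucleotide

    FastqEntry(String id, String info, String sequence, String quality) {
        if (sequence.length() != quality.length()) {
            throw new IllegalArgumentException("Sequence and quality lengths differ for " + id);
        }
        this.id = id;
        this.info = info;
        this.sequence = sequence;
        this.quality = quality;
    }
}

final class FastqTextParser {
    /** Parses fastq text, assuming exactly four lines per entry. */
    static List<FastqEntry> parse(String text) {
        String[] lines = text.split("\\r?\\n");
        List<FastqEntry> entries = new ArrayList<>();
        int i = 0;
        while (i < lines.length) {
            if (lines[i].isEmpty()) { // tolerate blank lines between entries
                i++;
                continue;
            }
            if (i + 3 >= lines.length) {
                throw new IllegalArgumentException("Truncated entry at line " + (i + 1));
            }
            String header = lines[i];
            String separator = lines[i + 2];
            if (!header.startsWith("@") || !separator.startsWith("+")) {
                throw new IllegalArgumentException("Malformed entry at line " + (i + 1));
            }
            // Split the header into identifier and free-form information.
            int space = header.indexOf(' ');
            String id = space < 0 ? header.substring(1) : header.substring(1, space);
            String info = space < 0 ? "" : header.substring(space + 1);
            entries.add(new FastqEntry(id, info, lines[i + 1], lines[i + 3]));
            i += 4;
        }
        return entries;
    }
}
```

Parsing a whole file into one `String` does not scale to large inputs; the sketch only illustrates the four-line structure.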

A drastic drop in sequencing cost has led computational biologists to the Big Data playground. Now imagine the fastq format for millions or billions of entries. This textual format is incredibly inefficient on ~~modern~~ computers (with more than 1 CPU).

Besides, tremendous efforts have been made to handle the data deluge. One of the most recent examples is the Apache Spark ecosystem. It distributes tasks over an arbitrary number of nodes, from GPU-enabled supercomputers to commodity PCs.

Further development in bioinformatics may require such infrastructure, so we need to transform the old fastq standard into a new Spark-friendly format.

Run the project

First, you will need Git, Java and Maven. This project was set up with Git 2.7.4, Java 1.8 and Maven 3.3.9.

Then,

git clone https://github.com/keuv-grvl/fastq-serializer.git
cd fastq-serializer/
mvn clean compile package assembly:single
java -cp target/fastqserializer-*-jar-with-dependencies.jar fr.isima.fastqserializer.HelloSpark
# or
java -jar target/fastqserializer-*-jar-with-dependencies.jar

Unordered resources

Formats and tools

Spark-related

fastq-serializer's Issues

Run test programs on Spark

Hello, World
WordCount
Any other interesting ones

Convert fastq file to fqrdd

Convert a fastq file to fqrdd.

Example: myprog import --input input.fastq --output output.fqrdd

Read compressed files

Voluminous fastq files are often compressed. Reading them without decompressing them first would fit many use cases.

So we should be able to read:

standard textual fastq
gzipped fastq
bzip2ed fastq

More compression algorithms might be added later.
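
One possible approach, sketched below, is to peek at the first two bytes of the stream and pick a decoder from the magic number. The sketch uses only the JDK, so it handles plain text and gzip; bzip2 is merely detected, since the JDK ships no bzip2 decoder (a third-party one such as Apache Commons Compress would be needed). `FastqOpener` is a hypothetical name, not a class from this project.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

final class FastqOpener {
    /** Wraps a raw stream in a reader, transparently decoding gzip input. */
    static BufferedReader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        // Peek at the first two bytes, then rewind so the decoder sees them too.
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset();

        boolean gzip = first == 0x1f && second == 0x8b; // gzip magic number
        boolean bzip2 = first == 'B' && second == 'Z';  // bzip2 magic number
        if (bzip2) {
            // Assumption: bzip2 support would come from an external library.
            throw new IOException("bzip2 input detected: no decoder in the JDK");
        }
        InputStream decoded = gzip ? new GZIPInputStream(in) : in;
        return new BufferedReader(new InputStreamReader(decoded, StandardCharsets.US_ASCII));
    }
}
```

Detecting the format from content rather than from the file extension means a misnamed file is still read correctly.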

Install Spark

Standalone Spark should be enough

Detect encoding

Given a fastq file, we should automatically detect the encoding, so that we can load the data without knowing the encoding beforehand.
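
A rough heuristic for this is sketched below. It assumes that "encoding" means the quality-score encoding and that the candidates are the three commonly seen ASCII offsets: Phred+33 (lowest character `!`), Solexa+64 (lowest character `;`) and Phred+64 (lowest character `@`). Both the candidate set and the names `QualityEncodingDetector` and `Encoding` are assumptions made for illustration; this issue does not fix them.

```java
final class QualityEncodingDetector {
    /** Candidate encodings assumed for this sketch. */
    enum Encoding { PHRED33, SOLEXA64, PHRED64, UNKNOWN }

    /** Guesses the encoding from the lowest quality character observed. */
    static Encoding detect(Iterable<String> qualityLines) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (String line : qualityLines) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c < min) min = c;
                if (c > max) max = c;
            }
        }
        if (min == Integer.MAX_VALUE) return Encoding.UNKNOWN;  // no quality data at all
        if (min < 33 || max > 126) return Encoding.UNKNOWN;     // outside printable ASCII
        if (min < 59) return Encoding.PHRED33;   // characters below ';' only exist with offset 33
        if (min < 64) return Encoding.SOLEXA64;  // ';' to '?' are negative Solexa scores
        // Heuristic: a file holding only high-quality Phred+33 reads would also land here.
        return Encoding.PHRED64;
    }
}
```

The more quality lines are scanned, the less likely the ambiguous last case becomes, because low scores eventually show up in real data.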

Possible encodings are:

Trim reads

Some sequencing reads may have bad quality, especially at the beginning and the end of the read.
Let's suppose a read where most nucleotides have a quality of 30 (symbolized by .) but some nucleotides near the end have a quality of 5 (symbolized by !):

AAAAATTTTTGGGGGCCCCC
...............!!!..

We compute the mean quality for each window (window size = 5):

AAAAA: (5*30)/5     = 30
AAAAT: (5*30)/5     = 30
AAATT: (5*30)/5     = 30
AATTT: (5*30)/5     = 30
ATTTT: (5*30)/5     = 30
TTTTT: (5*30)/5     = 30
TTTTG: (5*30)/5     = 30
TTTGG: (5*30)/5     = 30
TTGGG: (5*30)/5     = 30
TGGGG: (5*30)/5     = 30
GGGGG: (5*30)/5     = 30
GGGGC: (4*30+1*5)/5 = 25
GGGCC: (3*30+2*5)/5 = 20
GGCCC: (2*30+3*5)/5 = 15
GCCCC: (2*30+3*5)/5 = 15
CCCCC: (2*30+3*5)/5 = 15

One could trim the end of the read to enhance its quality. Here we trim as soon as the mean quality within the window drops below 20 (window GGCCC), so we trim GGCCCCC. We obtain:

AAAAATTTTTGGG
.............
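
The worked example above can be sketched in plain Java as follows. The sketch slides a fixed-size window from the start of the read and cuts at the first position of the first window whose mean quality falls below the threshold. It only trims the end of the read and does not claim to reproduce the exact behavior of any existing trimmer. `SlidingWindowTrimmer` and its methods are hypothetical names; reads shorter than the window are kept unchanged, which is an assumption of this sketch.

```java
final class SlidingWindowTrimmer {
    /** Returns how many leading bases to keep. */
    static int keepLength(int[] quals, int windowSize, double minWindowQual) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        if (quals.length < windowSize) {
            return quals.length; // assumption: reads shorter than the window are kept as is
        }
        // Sum of the first window; later windows are derived incrementally.
        long sum = 0;
        for (int i = 0; i < windowSize; i++) {
            sum += quals[i];
        }
        for (int start = 0; ; start++) {
            if ((double) sum / windowSize < minWindowQual) {
                return start; // cut where the first failing window begins
            }
            int next = start + windowSize;
            if (next >= quals.length) {
                return quals.length; // every window passed: keep the whole read
            }
            sum += quals[next] - quals[start]; // slide the window by one base
        }
    }

    /** Trims a sequence according to keepLength. */
    static String trim(String sequence, int[] quals, int windowSize, double minWindowQual) {
        if (sequence.length() != quals.length) {
            throw new IllegalArgumentException("sequence and quality lengths differ");
        }
        return sequence.substring(0, keepLength(quals, windowSize, minWindowQual));
    }
}
```

With the read above (15 bases at quality 30, 3 at quality 5, 2 at quality 30), a window size of 5 and a threshold of 20, `keepLength` returns 13, which matches the trimmed read shown.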

Example: myprog trim --window-size 5 --min-window-qual 20 --input input.fqrdd --output output.fqrdd

Here is an example of trimming software which uses sliding windows: https://github.com/najoshi/sickle
We will only focus on single-end reads for the moment.

Statistics about fqrdd

Input is a fqrdd file.

We want to know:

number of:
- entries
- nucleotides (A, T, G, C, N)
sequence length:
- for each entry
- distribution
mean quality:
- per entry
- per nucleotide position for all entries
- distribution
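
As a rough illustration of how these numbers could be accumulated in a single pass, here is a plain-Java sketch. It works on one entry at a time and assumes quality scores are already decoded to integers. `FqStats` is a hypothetical name, and the sketch leaves out the mean-quality distribution and any Spark-specific aggregation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FqStats {
    long entries = 0;
    final Map<Character, Long> nucleotideCounts = new TreeMap<>();
    final Map<Integer, Long> lengthDistribution = new TreeMap<>();
    // One cell per read position: {sum of qualities, number of reads covering it}.
    private final List<long[]> perPosition = new ArrayList<>();

    /** Registers one entry and returns its mean quality (NaN for an empty read). */
    double add(String sequence, int[] quals) {
        if (sequence.length() != quals.length) {
            throw new IllegalArgumentException("sequence and quality lengths differ");
        }
        entries++;
        lengthDistribution.merge(sequence.length(), 1L, Long::sum);
        long sum = 0;
        for (int i = 0; i < quals.length; i++) {
            nucleotideCounts.merge(sequence.charAt(i), 1L, Long::sum);
            if (i == perPosition.size()) {
                perPosition.add(new long[2]); // first read reaching this position
            }
            perPosition.get(i)[0] += quals[i];
            perPosition.get(i)[1]++;
            sum += quals[i];
        }
        return quals.length == 0 ? Double.NaN : (double) sum / quals.length;
    }

    /** Mean quality at a 0-based position, over all entries long enough to cover it. */
    double meanQualityAt(int position) {
        long[] cell = perPosition.get(position);
        return (double) cell[0] / cell[1];
    }
}
```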

Compare several serializers

There are many serializers available (see here for examples).

We should test some of them and compare resulting files.

Read fastq file

Read a simple fastq file using BioJava.

Here is a very small data set (250 sequences, 22 kB):

http://molb7621.github.io/workshop/_downloads/SP1.fq

Serialize RDD

Once fastq entries are loaded in an RDD, we should save the RDD to disk, then load the file back into an RDD.

See:

Files might have a .fqrdd extension for the moment.

Sample fqrdd

We want to randomly select N entries without replacement from the input file and write them to the output file.

Example: myprog sample --number 10000 --input input.fqrdd --output output.fqrdd

Input and output are fqrdd files.
The number of entries is mandatory and user-supplied. It cannot be greater than the total number of entries.
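
One way to sample without replacement in a single pass is reservoir sampling (Algorithm R), sketched below in plain Java. This is an illustration of the requirement, not the project's implementation. `ReservoirSampler` is a hypothetical name, and the sketch rejects a request for more entries than the input holds, as required above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class ReservoirSampler {
    /** Picks n entries uniformly at random, without replacement, in one pass. */
    static <T> List<T> sample(Iterator<T> entries, int n, Random rng) {
        if (n < 0) {
            throw new IllegalArgumentException("n must not be negative");
        }
        List<T> reservoir = new ArrayList<>(n);
        long seen = 0;
        while (entries.hasNext()) {
            T entry = entries.next();
            seen++;
            if (reservoir.size() < n) {
                reservoir.add(entry); // fill the reservoir with the first n entries
            } else {
                // Replace a random slot with probability n / seen.
                long slot = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (slot < n) {
                    reservoir.set((int) slot, entry);
                }
            }
        }
        if (seen < n) {
            throw new IllegalArgumentException(
                "Requested " + n + " entries but the input only has " + seen);
        }
        return reservoir;
    }
}
```

The total number of entries does not need to be known in advance, which suits large inputs. On the Spark side, `takeSample(false, n)` on an RDD may already cover this need and is worth evaluating first.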

Load fastq in a RDD

Load a fastq file as a Collection, then convert it to a JavaRDD.

http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections

Refactor file structure

The project structure is currently messy on testBranch.

We should choose a common file structure. Here is a suggestion:

├── data/
│   ├── SP1.fq
│   └── SP2.fq
├── docs/
│   └── expected-features.md
├── lib/
├── src/
│   ├── main/
│   │   └── java/
│   │       └── com/
│   │           └── jonas/
│   │               └── artifact/
│   │                   └── App.java
│   └── test/
│       └── ...
├── .gitignore
├── pom.xml
└── README.md

Filter fqrdd

We want to filter the input file based on:

read identifiers (user-supplied)
min and/or max read length
min and/or max read mean quality
nucleotide composition

Filtering by composition discards reads that contain any nucleotide other than A, T, G or C (often labeled as N).

Examples:

By composition: myprog filter --only-atgc --input input.fqrdd --output output.fqrdd
By length: myprog filter --min-length 100 --max-length 999 --input input.fqrdd --output output.fqrdd
By quality: myprog filter --min-qual 20 --max-qual 50 --input input.fqrdd --output output.fqrdd
By identifiers: myprog filter --id-file list_of_ids.txt --input input.fqrdd --output output.fqrdd
By composition & length & quality & identifiers: myprog filter --only-atgc --min-length 100 --max-length 999 --min-qual 20 --max-qual 50 --id-file list_of_ids.txt --input input.fqrdd --output output.fqrdd
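
Since the criteria can be combined freely, one option is to model each of them as a `java.util.function.Predicate` and chain them with `and`. The sketch below assumes inclusive bounds, upper-case nucleotides and a mean quality that has already been computed per read. `FilterableRead` and `ReadFilters` are hypothetical names for illustration.

```java
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical minimal view of a read, holding only what the filters need. */
final class FilterableRead {
    final String id;
    final String sequence;
    final double meanQuality;

    FilterableRead(String id, String sequence, double meanQuality) {
        this.id = id;
        this.sequence = sequence;
        this.meanQuality = meanQuality;
    }
}

final class ReadFilters {
    /** Keeps reads made only of A, T, G and C (for example, no N). */
    static Predicate<FilterableRead> onlyAtgc() {
        return r -> r.sequence.matches("[ATGC]*");
    }

    /** Keeps reads whose length lies in [min, max]. */
    static Predicate<FilterableRead> lengthBetween(int min, int max) {
        return r -> r.sequence.length() >= min && r.sequence.length() <= max;
    }

    /** Keeps reads whose mean quality lies in [min, max]. */
    static Predicate<FilterableRead> meanQualityBetween(double min, double max) {
        return r -> r.meanQuality >= min && r.meanQuality <= max;
    }

    /** Keeps reads whose identifier is in the user-supplied set. */
    static Predicate<FilterableRead> idIn(Set<String> ids) {
        return r -> ids.contains(r.id);
    }
}
```

The combined example would then read `ReadFilters.onlyAtgc().and(ReadFilters.lengthBetween(100, 999)).and(ReadFilters.meanQualityBetween(20, 50))`, with `idIn` chained the same way when an identifier list is given.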

Export fqrdd

We want to export a fqrdd input file to fastq or fasta format.
This is necessary for pipelining our tool with existing software, which relies heavily on the fastq/fasta formats.
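
The two target formats could be produced along the following lines. This is a minimal sketch that assumes an entry is available as an identifier, a sequence and a quality string; how entries are read back from a fqrdd file is left out. `FastqExporter` is a hypothetical name. The fasta output drops the quality line and wraps the sequence at a caller-chosen width.

```java
final class FastqExporter {
    /** Formats one entry as four fastq lines. */
    static String toFastq(String id, String sequence, String quality) {
        if (sequence.length() != quality.length()) {
            throw new IllegalArgumentException("sequence and quality lengths differ");
        }
        return "@" + id + "\n" + sequence + "\n+\n" + quality + "\n";
    }

    /** Formats one entry as fasta: a '>' header, then the sequence wrapped at lineWidth. */
    static String toFasta(String id, String sequence, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder out = new StringBuilder(">").append(id).append('\n');
        for (int start = 0; start < sequence.length(); start += lineWidth) {
            int end = Math.min(start + lineWidth, sequence.length());
            out.append(sequence, start, end).append('\n');
        }
        return out.toString();
    }
}
```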