Intro
FASTQ files are widely used in bioinformatics to store raw sequencing data. One FASTQ entry contains information about the sequencing machine, the nucleotides (A, T, C, G, or N), and the per-base sequencing quality. They are text files organized as follows:
@IDENTIFIER.1 various_info_about_sequencing_machine_for_example
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
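To make the four-line structure concrete, here is a minimal sketch of a FASTQ parser in plain Python (the function and record names are my own, not from any library); it also decodes the quality line using the standard Phred+33 convention, where each character's ASCII code minus 33 is the quality score:

```python
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class FastqRecord:
    identifier: str       # header line, without the leading "@"
    sequence: str         # nucleotides: A, T, C, G or N
    qualities: List[int]  # per-base Phred quality scores


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, four lines per entry."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of stream
        sequence = handle.readline().strip()
        handle.readline()  # the "+" separator line, ignored here
        quality = handle.readline().strip()
        # Phred+33 encoding: score = ASCII code of the character - 33
        scores = [ord(c) - 33 for c in quality]
        yield FastqRecord(header.lstrip("@"), sequence, scores)
```

Note how the parser is forced to read the file strictly line by line: a reader dropped at a random byte offset cannot tell a quality line from a sequence line, which is part of what makes the format awkward to split across workers.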
The drastic drop in sequencing costs has led computational biologists to the Big Data playground.
Now imagine the FASTQ format with millions or billions of entries.
This textual format is highly inefficient on modern, multi-core computers.
Meanwhile, tremendous efforts have been made to handle the data deluge. One of the most recent examples is the Apache Spark ecosystem, which distributes tasks across an arbitrary number of nodes, from GPU-enabled supercomputers to commodity PCs.
Further developments in bioinformatics may require such infrastructure, so we need to transform the old FASTQ standard into a new, Spark-friendly format.
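To give a feel for what a Spark-friendly layout looks like, the sketch below pivots four-line FASTQ records into column arrays, the same idea that columnar formats such as Parquet apply on disk. It is plain Python with no Spark dependency, and all names are hypothetical:

```python
def fastq_to_columns(lines):
    """Pivot flat FASTQ lines (4 per record) into a column-oriented dict.

    Columnar layouts (as used by Parquet) store each field contiguously,
    so they compress better and let an engine read only the columns a
    query actually needs, instead of scanning every line of every record.
    """
    columns = {"identifier": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        columns["identifier"].append(lines[i].lstrip("@"))
        columns["sequence"].append(lines[i + 1])
        # lines[i + 2] is the "+" separator, dropped in the columnar form
        columns["quality"].append(lines[i + 3])
    return columns
```

In a real pipeline this pivot would be done by Spark itself when writing a DataFrame out as Parquet; the point here is only to show the shape of the data once it leaves the row-oriented text format.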
Unordered resources
Formats
Spark-related
- Spark documentation
- Setting up Spark with Maven
- Spark 2.0.1 Docker container with Java 8 and Hadoop 2.7.2
- ADAM: BAM/SAM serialization using Apache Avro
- Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers
- Changing Spark's default Java serialization to Kryo
- Writing efficient Spark jobs