fastq-serializer's Introduction

Intro

Fastq files are widely used in bioinformatics to store raw sequencing data. Each fastq entry contains information about the sequencing machine, the nucleotides (A, T, C, G or N) and the sequencing quality. They are text files organized as follows:

@IDENTIFIER.1 various_info_about_sequencing_machine_for_example
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
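
To make the four-line layout concrete, here is a minimal parsing sketch. The names `FastqRecord` and `FastqParser` are hypothetical and not part of this project. It assumes each entry spans exactly four lines (no wrapped sequences) and counts lines instead of searching for `@`, since `@` can also appear as a quality character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value class for one four-line fastq entry.
class FastqRecord {
    final String id;        // header line without the leading '@'
    final String sequence;  // nucleotides (A, T, C, G or N)
    final String quality;   // one quality character per nucleotide

    FastqRecord(String id, String sequence, String quality) {
        if (sequence.length() != quality.length()) {
            throw new IllegalArgumentException("sequence and quality lengths differ for " + id);
        }
        this.id = id;
        this.sequence = sequence;
        this.quality = quality;
    }
}

// Hypothetical parser; assumes exactly four lines per entry.
class FastqParser {
    static List<FastqRecord> parse(List<String> lines) {
        List<FastqRecord> records = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            if (lines.get(i).isEmpty()) {   // tolerate blank lines between entries
                i++;
                continue;
            }
            if (i + 3 >= lines.size()) {
                throw new IllegalArgumentException("truncated entry at line " + (i + 1));
            }
            String header = lines.get(i);
            String separator = lines.get(i + 2);
            if (!header.startsWith("@") || !separator.startsWith("+")) {
                throw new IllegalArgumentException("malformed entry at line " + (i + 1));
            }
            records.add(new FastqRecord(header.substring(1), lines.get(i + 1), lines.get(i + 3)));
            i += 4;
        }
        return records;
    }

    public static void main(String[] args) {
        List<FastqRecord> records = parse(Arrays.asList("@IDENTIFIER.1 info", "ATGC", "+", "III/"));
        System.out.println(records.get(0).id + " has " + records.get(0).sequence.length() + " nucleotides");
    }
}
```

Taking a list of lines keeps the sketch short; a real reader for millions of entries would stream lines rather than hold them all in memory.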

The drastic drop in sequencing cost has led computational biologists to the Big Data playground. Now imagine a fastq file holding millions or billions of entries. A plain-text format like this is inefficient to process on modern multi-CPU computers.

Meanwhile, tremendous efforts have been made to handle the data deluge. One of the most recent examples is the Apache Spark ecosystem, which distributes tasks over an arbitrary number of nodes, from GPU-enabled supercomputers to commodity PCs.

Further development in bioinformatics may require such infrastructure, so we need to transform the old fastq standard into a new Spark-friendly format.

Run the project

First, you will need Git, Java and Maven. This project was set up with Git 2.7.4, Java 1.8 and Maven 3.3.9.

Then run:

git clone https://github.com/keuv-grvl/fastq-serializer.git
cd fastq-serializer/
mvn clean compile package assembly:single
java -cp target/fastqserializer-*-jar-with-dependencies.jar fr.isima.fastqserializer.HelloSpark
# or
java -jar target/fastqserializer-*-jar-with-dependencies.jar

Unordered resources

Formats and tools

Spark-related


Join the chat at https://gitter.im/fastq-serializer/Lobby

fastq-serializer's People

Contributors

keuv-grvl, tjelysee

fastq-serializer's Issues

Read compressed files

Voluminous fastq files are often compressed. Reading them without uncompressing them first would fit many use cases.

So we should be able to read:

  • standard textual fastq
  • gzipped fastq
  • bzip2ed fastq

More compression algorithms might be added later.
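
As a rough sketch of the idea, the reader below picks a decoder from the file extension. `FastqOpener` is a hypothetical name, and choosing by extension (rather than by inspecting the file contents) is an assumption. Only gzip is handled here because the JDK ships `GZIPInputStream`; the JDK has no bzip2 decoder, so that case would need a third-party library (Apache Commons Compress is one option) and is left unimplemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: opens plain or gzipped fastq files as text.
class FastqOpener {
    static BufferedReader open(Path path) throws IOException {
        String name = path.getFileName().toString().toLowerCase();
        if (name.endsWith(".bz2")) {
            // Assumption: bzip2 support would come from an external library.
            throw new UnsupportedOperationException("bzip2 is not handled in this sketch");
        }
        InputStream raw = Files.newInputStream(path);
        try {
            InputStream in = name.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        } catch (IOException e) {
            raw.close();   // do not leak the file handle if the gzip header is invalid
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Prints the first line (the first header) of the given file.
        try (BufferedReader reader = open(Paths.get(args[0]))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Because the caller only sees a `BufferedReader`, the rest of the code does not need to know whether the input was compressed.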

Trim reads

Some sequencing reads may have bad quality, especially at the beginning and the end of the read.
Suppose a read where most nucleotides have a quality of 30 (symbolized by .) but a few nucleotides near the end have a quality of 5 (symbolized by !):

AAAAATTTTTGGGGGCCCCC
...............!!!..

We compute the mean quality for each window (window size = 5):

AAAAA: (5*30)/5     = 30
AAAAT: (5*30)/5     = 30
AAATT: (5*30)/5     = 30
AATTT: (5*30)/5     = 30
ATTTT: (5*30)/5     = 30
TTTTT: (5*30)/5     = 30
TTTTG: (5*30)/5     = 30
TTTGG: (5*30)/5     = 30
TTGGG: (5*30)/5     = 30
TGGGG: (5*30)/5     = 30
GGGGG: (5*30)/5     = 30
GGGGC: (4*30+1*5)/5 = 25
GGGCC: (3*30+2*5)/5 = 20
GGCCC: (2*30+3*5)/5 = 15
GCCCC: (2*30+3*5)/5 = 15
CCCCC: (2*30+3*5)/5 = 15

One could trim the end of the read to improve its overall quality. Here we trim at the first window whose mean quality is below 20 (window GGCCC), so we remove GGCCCCC and obtain:

AAAAATTTTTGGG
.............

Example: myprog trim --window-size 5 --min-window-qual 20 --input input.fqrdd --output output.fqrdd

Here is an example of trimming software that uses sliding windows: https://github.com/najoshi/sickle
We will focus only on single-end reads for the moment.
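
The sliding-window walkthrough above could be sketched as follows. `WindowTrimmer` is a hypothetical name, qualities are passed as plain integers to sidestep any encoding question, and leaving reads shorter than the window untouched is an assumption, not something this issue specifies.

```java
import java.util.Arrays;

// Hypothetical sliding-window trimmer for the 3' end of a single-end read.
class WindowTrimmer {
    // Returns how many leading bases to keep: the start index of the first
    // window whose mean quality is below minWindowQual, or the full length
    // if no window falls below the threshold.
    static int keepLength(int[] quals, int windowSize, double minWindowQual) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        if (quals.length < windowSize) {
            return quals.length;   // assumption: reads shorter than the window are kept as-is
        }
        long sum = 0;
        for (int i = 0; i < windowSize; i++) {
            sum += quals[i];
        }
        for (int start = 0; ; start++) {
            if ((double) sum / windowSize < minWindowQual) {
                return start;
            }
            int next = start + windowSize;
            if (next >= quals.length) {
                return quals.length;   // last window reached without dropping below the threshold
            }
            sum += quals[next] - quals[start];   // slide the window by one base
        }
    }

    public static void main(String[] args) {
        String sequence = "AAAAATTTTTGGGGGCCCCC";
        int[] quals = new int[20];
        Arrays.fill(quals, 30);
        quals[15] = 5;
        quals[16] = 5;
        quals[17] = 5;
        int keep = keepLength(quals, 5, 20);
        System.out.println(sequence.substring(0, keep));   // prints AAAAATTTTTGGG
    }
}
```

The running sum avoids recomputing each window from scratch, so the cost stays linear in the read length whatever the window size.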

Statistics about fqrdd

Input is a fqrdd file.

We want to know:

  • number of:
    • entries
    • nucleotides (A, T, G, C, N)
  • sequence length:
    • for each entry
    • distribution
  • mean quality:
    • per entry
    • per nucleotide position for all entries
    • distribution
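
A few of these statistics could be computed as in the sketch below. `FastqStats` is a hypothetical name, the methods work on in-memory lists of sequence and quality strings rather than on a real fqrdd file, and the Phred+33 quality encoding is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical statistics helpers; Phred+33 quality encoding is assumed.
class FastqStats {
    static final int PHRED_OFFSET = 33;

    // Number of occurrences of each nucleotide over all entries.
    static Map<Character, Long> nucleotideCounts(List<String> sequences) {
        Map<Character, Long> counts = new TreeMap<>();
        for (String sequence : sequences) {
            for (char c : sequence.toCharArray()) {
                counts.merge(c, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Sequence length distribution: length -> number of entries.
    static Map<Integer, Long> lengthDistribution(List<String> sequences) {
        Map<Integer, Long> distribution = new TreeMap<>();
        for (String sequence : sequences) {
            distribution.merge(sequence.length(), 1L, Long::sum);
        }
        return distribution;
    }

    // Mean quality of one entry (0.0 for an empty quality string).
    static double meanQuality(String quality) {
        if (quality.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (char c : quality.toCharArray()) {
            sum += c - PHRED_OFFSET;
        }
        return (double) sum / quality.length();
    }

    // Mean quality per nucleotide position over all entries; entries shorter
    // than a given position do not contribute to that position.
    static double[] meanQualityPerPosition(List<String> qualities) {
        int maxLength = 0;
        for (String quality : qualities) {
            maxLength = Math.max(maxLength, quality.length());
        }
        long[] sums = new long[maxLength];
        long[] counts = new long[maxLength];
        for (String quality : qualities) {
            for (int i = 0; i < quality.length(); i++) {
                sums[i] += quality.charAt(i) - PHRED_OFFSET;
                counts[i]++;
            }
        }
        double[] means = new double[maxLength];
        for (int i = 0; i < maxLength; i++) {
            means[i] = (double) sums[i] / counts[i];
        }
        return means;
    }

    public static void main(String[] args) {
        System.out.println(nucleotideCounts(Arrays.asList("ATGN", "AA")));
        System.out.println(meanQuality("IIII"));
    }
}
```

Each of these is a simple per-entry computation followed by a merge, which is the shape that would map naturally onto distributed map and reduce steps.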

Sample fqrdd

We want to randomly select N entries without replacement from the input file and write them to the output file.

Example: myprog sample --number 10000 --input input.fqrdd --output output.fqrdd

Input and output are fqrdd files.
The number of entries is mandatory and user-supplied. It cannot be greater than the total number of entries in the input.
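
One possible way to sample without replacement in a single pass is reservoir sampling (Algorithm R), sketched below. This is a swapped-in technique chosen for illustration, not necessarily what the project will use, and `ReservoirSampler` is a hypothetical name. Because the total number of entries is only known after the pass, the "N too large" check happens at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical single-pass sampler (reservoir sampling, Algorithm R).
class ReservoirSampler {
    static <T> List<T> sample(Iterator<T> entries, int n, Random rng) {
        if (n < 0) {
            throw new IllegalArgumentException("n must not be negative");
        }
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        while (entries.hasNext()) {
            T entry = entries.next();
            seen++;
            if (reservoir.size() < n) {
                reservoir.add(entry);   // fill the reservoir with the first n entries
            } else {
                // Pick a slot uniformly in [0, seen); replace only if it falls in the reservoir.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < n) {
                    reservoir.set((int) slot, entry);
                }
            }
        }
        if (seen < n) {
            throw new IllegalArgumentException("requested " + n + " entries but input has only " + seen);
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        System.out.println(sample(ids.iterator(), 2, new Random()));
    }
}
```

On a Spark RDD, the built-in `takeSample` method with `withReplacement` set to false may be a more natural fit than a hand-written sampler.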

Refactor file structure

The project structure is currently messy on testBranch.

We should choose a common file structure. Here is a suggestion:

├── data/
│   ├── SP1.fq
│   └── SP2.fq
├── docs/
│   └── expected-features.md
├── lib/
├── src/
│   ├── main/
│   │   └── java/
│   │       └── com/
│   │           └── jonas/
│   │               └── artifact/
│   │                   └── App.java
│   └── test/
│       └── ...
├── .gitignore
├── pom.xml
└── README.md

Filter fqrdd

We want to filter the input file based on:

  • read identifiers (user-supplied)
  • min and/or max read length
  • min and/or max read mean quality
  • nucleotide composition

Filtering by composition discards reads that contain any nucleotide other than A, T, G or C (such nucleotides are often labeled N).

Examples:

  • By composition: myprog filter --only-atgc --input input.fqrdd --output output.fqrdd
  • By length: myprog filter --min-length 100 --max-length 999 --input input.fqrdd --output output.fqrdd
  • By quality: myprog filter --min-qual 20 --max-qual 50 --input input.fqrdd --output output.fqrdd
  • By identifiers: myprog filter --id-file list_of_ids.txt --input input.fqrdd --output output.fqrdd
  • By composition & length & quality & identifiers: myprog filter --only-atgc --min-length 100 --max-length 999 --min-qual 20 --max-qual 50 --id-file list_of_ids.txt --input input.fqrdd --output output.fqrdd
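
The combined filter could be expressed as one predicate per read, as in this sketch. `ReadFilter` and its field names are hypothetical and only mirror the command-line options above. Treating all bounds as inclusive, assuming uppercase nucleotides, and assuming Phred+33 qualities are choices made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter; every criterion is optional and defaults to "accept all".
class ReadFilter {
    boolean onlyAtgc = false;            // mirrors --only-atgc
    int minLength = 0;                   // mirrors --min-length (inclusive, by assumption)
    int maxLength = Integer.MAX_VALUE;   // mirrors --max-length (inclusive, by assumption)
    double minQual = 0.0;                // mirrors --min-qual (inclusive, by assumption)
    double maxQual = Double.MAX_VALUE;   // mirrors --max-qual (inclusive, by assumption)
    Set<String> ids = null;              // mirrors --id-file; null means any identifier

    // A read is kept only if it passes every active criterion.
    boolean accept(String id, String sequence, String quality) {
        if (ids != null && !ids.contains(id)) {
            return false;
        }
        int length = sequence.length();
        if (length < minLength || length > maxLength) {
            return false;
        }
        if (onlyAtgc && !sequence.matches("[ATGC]*")) {
            return false;
        }
        double mean = meanQuality(quality);
        return mean >= minQual && mean <= maxQual;
    }

    // Mean quality assuming Phred+33 encoding (0.0 for an empty string).
    static double meanQuality(String quality) {
        if (quality.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (char c : quality.toCharArray()) {
            sum += c - 33;
        }
        return (double) sum / quality.length();
    }

    public static void main(String[] args) {
        ReadFilter filter = new ReadFilter();
        filter.onlyAtgc = true;
        filter.ids = new HashSet<>(Arrays.asList("r1", "r2"));
        System.out.println(filter.accept("r1", "ATGC", "IIII"));
        System.out.println(filter.accept("r2", "ATGN", "IIII"));
    }
}
```

Checking the cheap criteria (identifier, length) before the mean quality avoids scanning the quality string for reads that are rejected anyway.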

Export fqrdd

We want to export a fqrdd input file to fastq or fasta format.
This is necessary for pipelining our tool into existing software that relies heavily on fastq/fasta formats.

Examples:

  • myprog export --format fastq --input input.fqrdd --output output.fastq
  • myprog export --format fasta --input input.fqrdd --output output.fasta
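
For a single entry, the two output formats could be produced as in this sketch. `FastqExporter` is a hypothetical name, and how entries are actually read back from a fqrdd file is left out. The fasta form keeps the identifier and sequence and drops the quality line, since fasta has no place for it.

```java
// Hypothetical per-entry formatter for the two export formats.
class FastqExporter {
    // Four-line fastq entry: '@' header, sequence, '+' separator, quality.
    static String toFastq(String id, String sequence, String quality) {
        return "@" + id + "\n" + sequence + "\n+\n" + quality + "\n";
    }

    // Two-line fasta entry: '>' header, sequence. Quality is discarded.
    static String toFasta(String id, String sequence) {
        return ">" + id + "\n" + sequence + "\n";
    }

    // Dispatches on the value given to --format.
    static String export(String format, String id, String sequence, String quality) {
        switch (format) {
            case "fastq":
                return toFastq(id, sequence, quality);
            case "fasta":
                return toFasta(id, sequence);
            default:
                throw new IllegalArgumentException("unknown format: " + format);
        }
    }

    public static void main(String[] args) {
        System.out.print(export("fastq", "IDENTIFIER.1", "ATGC", "III/"));
        System.out.print(export("fasta", "IDENTIFIER.1", "ATGC", "III/"));
    }
}
```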
