Spark-DNAligning

Advances in sequencing technology generate massive amounts of data, and the volume of biological data has grown exponentially in recent years, which introduces several computational challenges. DNA short read alignment is an important problem in bioinformatics, and the rapid growth in the number of short reads has increased the need for a platform that can accelerate the alignment process. Apache Spark is a cluster-computing framework that provides data parallelism and fault tolerance. Spark-DNAligning is a Spark-based algorithm that accelerates DNA short read alignment.

How to use it?

  1. Prepare the DNA reference file by running the toLine.py script. You can find it in the helper_scripts folder.

  2. Compress the short reads file by running:

bzip2 short_reads_file_name

  3. Create an Amazon S3 bucket and upload the following files to it:
  • The DNA reference file.
  • The short reads file.
  • The jar file.

  4. The jar file and the DNA reference file must be stored on your Amazon EMR cluster, so download the contents of your Amazon S3 bucket by running the following command on the cluster:

aws s3 cp s3://path/to/your/s3/bucket . --recursive

  • Note that the short reads file is read from the S3 bucket, so it does not need to be on your Amazon EMR cluster.

  5. Start Spark-DNAligning by running the following command:

spark-submit --class com.ku.Aligning.DNACluster --driver-memory 4g --executor-memory 4g --executor-cores 3 --num-executors 3 dna.jar 16 36 /home/hadoop/s_suisLine.fa path/to/your/s3/bucket/100kGood.fa.bz2 path/to/your/s3/bucket/ Streptococcus_suis

  6. Download all the output files by running the following command:

aws s3 sync s3://path/to/your/s3/bucket .

  7. Merge all the output files by running the toSam script. You can find it in the helper_scripts folder.

./toSam
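
To make the final merge step more concrete, here is a minimal Python sketch of what combining the downloaded output files could look like. It is not the toSam script itself, which may do more (for example, adding SAM header lines). It assumes the output files follow Spark's usual part-* naming; the function name merge_parts and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def merge_parts(input_dir, output_path, pattern="part-*"):
    """Concatenate the files matching `pattern` in `input_dir` into `output_path`.

    Assumption: the job wrote Spark-style part files (part-00000, part-00001, ...).
    Files are merged in sorted name order. Returns the number of files merged.
    The output file name should not itself match `pattern`.
    """
    # Collect the part files before the output file is created.
    parts = sorted(p for p in Path(input_dir).glob(pattern) if p.is_file())

    with open(output_path, "w") as out:
        for part in parts:
            with open(part) as src:
                for line in src:
                    # Make sure the last record of each file ends with a
                    # newline so records from adjacent files stay separate.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(parts)
```

For example, merge_parts(".", "aligned.sam") would combine the part files in the current directory into a single file. Marker files such as _SUCCESS are skipped because they do not match the pattern.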

Contributors

maryom
