miracle-yao / rnaseq-workflow-mapping-assembly-and-differential-gene-expression-analysis

This project forked from groverkaushal/rnaseq-workflow-mapping-assembly-and-differential-gene-expression-analysis

This project uses a workflow pipeline to map and assemble RNAseq reads against a reference genome. It then generates count data and identifies differentially expressed genes between 2 conditions.

License: MIT License

Shell 100.00%

RNAseq Workflow: Mapping, Assembly, and Differential Gene Expression Analysis

Overview

This project uses an RNAseq workflow pipeline to generate count data and identify differentially expressed genes from sequencing reads. The reads are mapped to a reference genome. The workflow consists of the following steps:

  1. Fastq files downloaded using SRA-Toolkit
  2. Quality Assessment with FASTQC
  3. High Quality Read Filtering using FastP
  4. Reference Genome Mapping using HiSAT2
  5. Assembly using StringTie
  6. Counts data generated using Cufflinks
  7. Differentially Expressed Genes calculated using CuffDiff
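Step 1 of the list above might look roughly like the sketch below. It is not the repository's own script: it assumes that `prefetch` and `fasterq-dump` from SRA-Toolkit are on the PATH, uses the run accessions listed in the Datasets section, and writes to a hypothetical `raw_fastq` directory. By default it only prints the commands.

```shell
#!/bin/sh
set -eu

# Run accessions from the Datasets section of this README.
RUNS="SRR19383303 SRR19383302 SRR19383301 SRR19383300"
OUT_DIR="raw_fastq"        # hypothetical output directory, not from the original scripts
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands (default), 0 = execute them

# Print a command and execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Fetch one run and convert it to FASTQ.
download_run() {
    run_cmd prefetch "$1"
    run_cmd fasterq-dump --outdir "$OUT_DIR" "$1"
}

for run in $RUNS; do
    download_run "$run"
done
```

Running the sketch with `DRY_RUN=0` executes the downloads instead of printing them.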



Fig: Flowchart of the workflow followed in my project

Datasets

This project involves transcriptomic analysis to compare the salinity stress response in salinity-tolerant genotypes of chickpea. The analysis was conducted on a salinity-tolerant chickpea genotype under both control and saline environments. The dataset includes RNAseq sequencing reads from two control group samples and two saline group samples.

  • BioProject: PRJNA842022
  • SRA Study: SRP376874

  • Run ID: SRR19383303 (Control Sample 1)
  • Run ID: SRR19383302 (Control Sample 2)
  • Run ID: SRR19383301 (Saline Sample 1)
  • Run ID: SRR19383300 (Saline Sample 2)

  • Year of Experiment: 2022
  • Instrument: Illumina NextSeq 500
  • Layout: Single
  • Organism: Cicer arietinum (Chickpea)
  • Total Bases per Sample: ~510 Mb (Million Bases)
  • No. of reads per Sample: ~10.4 Million reads
  • Estimated Genome Size: 500 Mb
  • Estimated Read Coverage per Sample: ~1X
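The coverage estimate in the list above follows from dividing the total bases per sample by the estimated genome size. A minimal sketch of that arithmetic, using the approximate figures from the list (the mean read length is derived here and is not stated in the original):

```shell
#!/bin/sh
# Approximate per-sample values copied from the dataset list above.
total_bases=510000000     # ~510 Mb per sample
read_count=10400000       # ~10.4 million reads per sample
genome_size=500000000     # ~500 Mb estimated genome size

# coverage = total sequenced bases / genome size
coverage=$(awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.2f", b / g }')

# mean read length = total sequenced bases / number of reads
mean_len=$(awk -v b="$total_bases" -v n="$read_count" 'BEGIN { printf "%.0f", b / n }')

echo "coverage=${coverage}X mean_read_length=${mean_len}bp"
```

With these inputs the coverage comes out at about 1X, matching the figure in the list.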

System Requirements

  • Python 3
  • Conda

Installing Dependencies

  1. Create a conda environment and activate it:

    conda create -n grover
    conda activate grover
  2. Install FastQC:

    sudo apt -y install fastqc
  3. Install Fastp:

    conda install -c bioconda fastp
  4. Install ncbi_datasets:

    conda install -c conda-forge ncbi-datasets-cli
    
  5. Install HiSAT2:

    git clone https://github.com/DaehwanKimLab/hisat2.git
    cd hisat2
    make
    echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc
    source ~/.bashrc
    cd ..
    
  6. Install SamTools:

    wget https://github.com/samtools/samtools/releases/download/1.20/samtools-1.20.tar.bz2
    tar -xf samtools-1.20.tar.bz2 
    rm samtools-1.20.tar.bz2
    cd samtools-1.20/
    make
    echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc
    source ~/.bashrc
    cd ..
    
  7. Install StringTie:

    git clone https://github.com/gpertea/stringtie
    cd stringtie
    make release
    echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc
    source ~/.bashrc
    cd ..
    
  8. Install CuffLinks:

    wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
    tar -xf cufflinks-2.2.1.Linux_x86_64.tar.gz
    rm cufflinks-2.2.1.Linux_x86_64.tar.gz
    cd cufflinks-2.2.1.Linux_x86_64
    echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc
    source ~/.bashrc
    cd ..
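After the installation steps, a quick check that every tool is reachable can save debugging time later. The sketch below is an illustration only. The executable names are assumptions based on the packages installed above (for example, `datasets` is the command provided by ncbi-datasets-cli).

```shell
#!/bin/sh
# Executables the installation steps are expected to put on PATH (assumed names).
TOOLS="fastqc fastp datasets hisat2 samtools stringtie cufflinks cuffdiff"

# Print every name from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# $TOOLS is intentionally unquoted so that it splits into separate names.
missing=$(missing_tools $TOOLS)
if [ -z "$missing" ]; then
    echo "All tools found on PATH."
else
    echo "Not found on PATH:" $missing
fi
```

If a tool is reported as missing, open a new shell or re-run `source ~/.bashrc` before checking again.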
    

Workflow

Quality Assessment

The first step evaluates the quality of each sample using FASTQC. The summary statistics from FASTQC cover various quality metrics and help identify potential issues in the sequencing data. Fastp was then used to remove duplicated reads, trim low-quality ends, remove low-quality reads, trim adapter sequences, remove low-complexity sequences, and trim poly-G tails. After filtering for high-quality (HQ) reads, the FastQC reports were generated again.

chmod +x Fastp_and_FastQC.sh
./Fastp_and_FastQC.sh
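The contents of Fastp_and_FastQC.sh are not shown here, so the following is only a sketch of what the filtering described above could look like for single-end reads. The directory names and file naming are hypothetical. The fastp options map onto the listed operations, and adapter trimming is left at fastp's default, which is enabled for single-end data. By default the sketch only prints the commands.

```shell
#!/bin/sh
set -eu

RUNS="SRR19383303 SRR19383302 SRR19383301 SRR19383300"
RAW_DIR="raw_fastq"         # hypothetical: raw FASTQ files, one per run
CLEAN_DIR="clean_fastq"     # hypothetical: filtered reads and fastp reports
QC_DIR="fastqc_reports"     # hypothetical: FastQC output
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print the commands (default), 0 = execute them

# Print a command and execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Filter one run with fastp, then re-run FastQC on the filtered reads.
filter_run() {
    # --dedup: drop duplicated reads; --cut_tail: trim low-quality 3' ends;
    # --low_complexity_filter: drop low-complexity reads; --trim_poly_g: trim poly-G tails.
    run_cmd fastp \
        --in1 "$RAW_DIR/$1.fastq" \
        --out1 "$CLEAN_DIR/$1.clean.fastq" \
        --dedup --cut_tail --low_complexity_filter --trim_poly_g \
        --html "$CLEAN_DIR/$1.fastp.html" \
        --json "$CLEAN_DIR/$1.fastp.json"
    run_cmd fastqc --outdir "$QC_DIR" "$CLEAN_DIR/$1.clean.fastq"
}

run_cmd mkdir -p "$CLEAN_DIR" "$QC_DIR"
for run in $RUNS; do
    filter_run "$run"
done
```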

Mapping

Following the quality assessment, the reads are mapped to a reference genome using HiSAT2. First, the reference genome fasta file, gtf file and gff file were downloaded. Then HiSAT2 was used to generate 4 sam mapping files from the 4 SRR Fastq files.

chmod +x mapping.sh
./mapping.sh
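As an illustration of the mapping step (not the contents of mapping.sh), the sketch below builds a HiSAT2 index and aligns each single-end sample. The reference file name, index base name, directories and thread count are all hypothetical. The `--dta` option, which tailors alignments for transcript assemblers such as StringTie, is an assumption and may not be used in the original script. By default the sketch only prints the commands.

```shell
#!/bin/sh
set -eu

RUNS="SRR19383303 SRR19383302 SRR19383301 SRR19383300"
GENOME_FA="reference/genome.fa"        # hypothetical path to the reference fasta
INDEX_BASE="reference/genome_index"    # hypothetical HiSAT2 index base name
CLEAN_DIR="clean_fastq"                # hypothetical: filtered reads
SAM_DIR="sam"                          # hypothetical: alignment output
THREADS=4                              # hypothetical thread count
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print the commands (default), 0 = execute them

# Print a command and execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Build the HiSAT2 index from the reference fasta.
build_index() {
    run_cmd hisat2-build "$GENOME_FA" "$INDEX_BASE"
}

# Align one run: -U takes single-end reads, -S names the SAM output.
map_run() {
    run_cmd hisat2 -p "$THREADS" --dta -x "$INDEX_BASE" \
        -U "$CLEAN_DIR/$1.clean.fastq" -S "$SAM_DIR/$1.sam"
}

run_cmd mkdir -p "$SAM_DIR"
build_index
for run in $RUNS; do
    map_run "$run"
done
```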

Assembly

First, the 4 sam files were sorted and compressed to 4 bam (binary) files using SamTools. Next, the 4 bam files were assembled individually using StringTie, with the reference gff file given as input, to generate 4 assembled gtf files. These 4 gtf files were then merged into a single "merged.gtf" file.

chmod +x assembly.sh
./assembly.sh
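The three stages described above (sort, assemble per sample, merge) could be sketched as follows. This is an illustration under assumed file names, not the contents of assembly.sh: the reference gff path, the directories and the thread count are hypothetical. By default the sketch only prints the commands.

```shell
#!/bin/sh
set -eu

RUNS="SRR19383303 SRR19383302 SRR19383301 SRR19383300"
REF_GFF="reference/genome.gff"   # hypothetical path to the reference annotation
SAM_DIR="sam"                    # hypothetical: HiSAT2 output
BAM_DIR="bam"                    # hypothetical: sorted bam files
GTF_DIR="gtf"                    # hypothetical: per-sample assemblies
THREADS=4                        # hypothetical thread count
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands (default), 0 = execute them

# Print a command and execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Sort one sam file by coordinate and write it as bam.
sort_run() {
    run_cmd samtools sort -@ "$THREADS" -o "$BAM_DIR/$1.sorted.bam" "$SAM_DIR/$1.sam"
}

# Assemble transcripts for one sample, guided by the reference annotation.
assemble_run() {
    run_cmd stringtie "$BAM_DIR/$1.sorted.bam" -G "$REF_GFF" -p "$THREADS" -o "$GTF_DIR/$1.gtf"
}

# Merge the per-sample gtf files passed as arguments into merged.gtf.
merge_gtfs() {
    run_cmd stringtie --merge -G "$REF_GFF" -o merged.gtf "$@"
}

run_cmd mkdir -p "$BAM_DIR" "$GTF_DIR"
gtfs=""
for run in $RUNS; do
    sort_run "$run"
    assemble_run "$run"
    gtfs="$gtfs $GTF_DIR/$run.gtf"
done
# $gtfs is intentionally unquoted so that it splits into separate file names.
merge_gtfs $gtfs
```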

Differentially Expressed Genes

Finally, the differentially expressed genes (DEGs) were calculated between the 2 conditions, control and saline stress, each with 2 samples. For this we use the 4 sorted bam files produced from the HiSAT2 alignments and the merged.gtf file generated by StringTie. We use the CuffDiff tool from the Cufflinks package to calculate the DEGs. The results were stored in the "gene_exp.diff" file.

chmod +x deg.sh
./deg.sh
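A rough sketch of this comparison is shown below. It is not the contents of deg.sh. The bam file names, output directory and thread count are hypothetical, and the `control`/`saline` labels are chosen here for illustration. The helper that lists significant genes assumes the usual 14-column, tab-separated layout of gene_exp.diff, with the gene ID in column 2, the log2 fold change in column 10, the q-value in column 13 and the yes/no significance flag in column 14. By default the CuffDiff command is only printed.

```shell
#!/bin/sh
set -eu

BAM_DIR="bam"               # hypothetical: sorted bam files
DIFF_DIR="cuffdiff_out"     # hypothetical: CuffDiff output directory
THREADS=4                   # hypothetical thread count
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print the commands (default), 0 = execute them

# Replicates of one condition are joined with commas, as cuffdiff expects.
CONTROL_BAMS="$BAM_DIR/SRR19383303.sorted.bam,$BAM_DIR/SRR19383302.sorted.bam"
SALINE_BAMS="$BAM_DIR/SRR19383301.sorted.bam,$BAM_DIR/SRR19383300.sorted.bam"

# Print a command and execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Compare the two conditions against the merged annotation.
run_cuffdiff() {
    run_cmd cuffdiff -o "$DIFF_DIR" -p "$THREADS" -L control,saline \
        merged.gtf "$CONTROL_BAMS" "$SALINE_BAMS"
}

# Print gene ID, log2 fold change and q-value for rows flagged as significant.
significant_genes() {
    awk -F '\t' 'NR > 1 && $14 == "yes" { print $2 "\t" $10 "\t" $13 }' "$1"
}

run_cuffdiff
if [ -f "$DIFF_DIR/gene_exp.diff" ]; then
    significant_genes "$DIFF_DIR/gene_exp.diff"
fi
```

The significance flag is taken as reported by CuffDiff. Any additional fold-change cutoff would be a separate choice that the original text does not specify.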

Contact

For any questions or further information, please contact Kaushal Grover at [email protected].


License

This project is licensed under the MIT License - see the LICENSE file for details.
