diffExprProject

Description

This is a project for Loyola University Chicago's computational biology class (COMP 383). I chose track 1, which is differential expression.
This project presents a pipeline that lets a user determine the most differentially expressed genes between samples of sequencing reads and then search a genetic database for other species of a specified family that express those genes. The pipeline is configurable: each step can be adjusted through command line flags, or the individual scripts can be run separately.
The project requires the following to be preinstalled:

kallisto - command line application to quantify transcript abundance in transcripts per million (TPM) from a set of reads pseudoaligned to a reference transcriptome index
ncbi-blast+ - command line application to run a BLAST search of a query against a locally downloaded genome database
fastq-dump - command line application to split SRA objects into forward and reverse reads
R packages - sleuth and dplyr, for statistical analysis of TPMs to find the most differentially expressed transcripts between the samples
Python packages - listed in requirements.txt; see the Set-up section below
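
Before running the pipeline, it can help to confirm that the command line tools above are on your PATH. The following is a minimal sketch and is not part of the repository. The executable names (kallisto, blastn, makeblastdb, fastq-dump, Rscript) are assumptions about what each package installs, and missing_tools is a hypothetical helper.

```python
import shutil

# Assumed executable names; adjust these to match your installation.
REQUIRED_TOOLS = ["kallisto", "blastn", "makeblastdb", "fastq-dump", "Rscript"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required command line tools were found.")
```

This only checks that an executable with the given name exists; it does not check versions or the R packages.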

Set-up

First, clone this repository locally with the following command:
- git clone [HTTPS]
Once cloned, move into the main project folder so that your current working directory is the diffExprProject folder:
- cd diffExprProject
Install the Python requirements:
- create a virtual environment or conda environment, or install locally; below is an example using a Python virtual environment
  - python3 -m venv venv -> this creates a virtual environment
  - source venv/bin/activate -> this activates it; you should see (venv) in your prompt
- install the requirements from requirements.txt
  - pip3 install -r requirements.txt

Test Run

RUN

To run:
- python3 run.py -e [EMAIL] -t testData # make sure the testData folder is entered exactly as shown, without slashes or relative paths (i.e. . or ..)
For more information on flags and arguments:
- python3 run.py --help

Make sure the email is one that can be used to access NCBI sequences, and ensure this is run from the main project directory.
The test folder contains sample fastq files, a Betaherpesvirinae genome database folder to blast the most significantly expressed genes against, and a metatable that stores information about the sample data.
The links folder contains the SRA links that the data comes from.
If you rename the testData folder or input a different one, make sure you update the metatable (-m) and links (-s) flags accordingly; they default to paths inside the test folder but can be adjusted.
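
The restriction on the -t argument (a bare folder name, with no slashes or relative paths) can be checked before launching the pipeline. This is a hypothetical sketch and is not taken from run.py; is_plain_folder_name is an illustrative helper.

```python
import os


def is_plain_folder_name(name):
    """Return True if `name` is a bare folder name: no slashes, not '.' or '..'."""
    if name in ("", ".", ".."):
        return False
    # Reject anything containing a path separator, e.g. "testData/" or "./testData".
    return "/" not in name and os.sep not in name
```

With this helper, "testData" is accepted, while "testData/", "./testData", and ".." are rejected.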

Output

The outputs are saved in the PipelineProject_Rohan_Sethi directory: results are saved in results, and data used as input to the various functions is saved in data.
After a test run, the log file contains information from that run.
The log file already in this repo is from the full run, not from the test sample data.

More Complicated Run

You can run run.py as a whole or run each script separately to fit your needs; use the --help flag to get more information on each script you want to run separately.
All flags in brackets are optional and default to the following (you may want to change them if not running the test data):
- -s = testData/links/fileLinks.txt
- -i = NC_006273.2
- -e = no default, need to specify one
- -m = testData/metatable.tsv
- -l = ./PipelineProject_Rohan_Sethi/PipelineProject.log
- -n = Betaherpesvirinae
- -b = None
- -u = 10
- -t = None
usage: python3 run.py [-h] [-s INPUT] [-i INDEX] -e EMAIL [-m METATABLE] [-l LOGFILE] [-n NAME] [-b BLASTDB] [-u NUMSELECT] [-t TESTDATA]
options:
- -h, --help show this help message and exit
- -s INPUT, --input INPUT input file with NCBI links for the SRA sample data to download from
- -i INDEX, --index INDEX input accession id for index to assemble the reads
- -e EMAIL, --email EMAIL input email for NCBI access via biopython
- -m METATABLE, --metatable METATABLE tab-delimited metatable containing information about the samples; an example is in testData
- -l LOGFILE, --logfile LOGFILE name/path of log file to store important output information tab delimited
- -n NAME, --name NAME name of species to blast against to see what other species the most differentially expressed genes are expressed in
- -b BLASTDB, --blastdb BLASTDB path to an existing blast genome fasta file; use this to skip the download step; the file must match the name parameter; if not provided, the pipeline downloads whatever is passed to the name parameter
- -u NUMSELECT, --numSelect NUMSELECT number of blast results to store from the blast search
- -t TESTDATA, --testData TESTDATA input test data folder name; only if wanting to run test run, else ignore
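
The flags and defaults listed above can be expressed as an argparse parser. This is a sketch reconstructed from the usage text, not the actual code in run.py; build_parser is a hypothetical name, the help strings are shortened, and parsing -u as an integer is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults of run.py."""
    parser = argparse.ArgumentParser(description="differential expression pipeline")
    parser.add_argument("-s", "--input", default="testData/links/fileLinks.txt",
                        help="input file with NCBI links for the SRA sample data")
    parser.add_argument("-i", "--index", default="NC_006273.2",
                        help="accession id for the index")
    # The email has no default, so it is the only required flag.
    parser.add_argument("-e", "--email", required=True,
                        help="email for NCBI access via biopython")
    parser.add_argument("-m", "--metatable", default="testData/metatable.tsv",
                        help="tab-delimited metatable describing the samples")
    parser.add_argument("-l", "--logfile",
                        default="./PipelineProject_Rohan_Sethi/PipelineProject.log",
                        help="name/path of the log file")
    parser.add_argument("-n", "--name", default="Betaherpesvirinae",
                        help="name to blast against")
    parser.add_argument("-b", "--blastdb", default=None,
                        help="path to an existing blast genome fasta file")
    # Assumption: the number of blast results is parsed as an integer.
    parser.add_argument("-u", "--numSelect", default=10, type=int,
                        help="number of blast results to store")
    parser.add_argument("-t", "--testData", default=None,
                        help="test data folder name, only for a test run")
    return parser
```

For example, build_parser().parse_args(["-e", "user@example.com", "-t", "testData"]) mirrors the test run command and leaves every other flag at its default.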
