- This is a project for Loyola University Chicago's computational biology class (COMP 383). I chose track 1, which is differential expression.
- This project presents a pipeline that enables a user to determine the most differentially expressed genes between different samples of sequencing reads and search a genetic database for other species of a specified family that express those genes. Command line flags let a user customize the input data, the reference index, and the BLAST database to their needs.
- The project requires the following tools to be preinstalled:
- kallisto - command line application to quantify transcript abundance in transcripts per million (TPM) by pseudoaligning a set of reads to a reference transcriptome index
- ncbi-blast+ - command line application to run a blast search of a query against a locally downloaded genome database
- fastq-dump - command line application (part of the SRA Toolkit) to convert SRA objects into fastq files split into forward and reverse reads
- R packages - sleuth, dplyr for statistical analysis of TPMs to find the most differentially expressed transcripts between the samples
- Python packages - see the following set-up section
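
Before running the pipeline, it can help to confirm that the command line tools listed above are on your PATH. The following is a minimal sketch, not part of this repository: the helper name `missing_tools` is hypothetical, and the executable names (in particular which BLAST+ programs the pipeline calls) are assumptions that you may need to adjust.

```python
import shutil

# Executable names are assumptions; adjust to match what the pipeline actually calls.
REQUIRED_TOOLS = ["kallisto", "blastn", "makeblastdb", "fastq-dump", "Rscript"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required command line tools were found.")
```

This only checks that the executables exist; it does not check versions or the R packages.
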
- First clone this repository locally with the following command:
- git clone [HTTPS]
- Once cloned, move into the main project folder such that your current working directory is the diffExprProject folder
- cd diffExprProject
- Install python requirements
- create a virtual environment, conda environment, or install locally; below is an example of using python virtual environments
- python3 -m venv venv -> this creates a virtual environment
- source venv/bin/activate -> this activates it; you should see (venv) in your prompt
- install requirements from requirements.txt
- pip3 install -r requirements.txt
- To run
- python3 run.py -e [EMAIL] -t testData # make sure that the testData folder is passed exactly as shown, without any slashes or relative paths (i.e. . or ..)
- For more information on flags and arguments
- python3 run.py --help
- make sure email is one that can be used to access NCBI sequences; ensure this is run from the main project directory
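
The restriction on the -t argument (a bare folder name, with no slashes and no relative path components) can be checked before launching the pipeline. This is a small sketch under that reading of the note above; the function name `is_plain_folder_name` is hypothetical and is not something run.py is known to provide.

```python
def is_plain_folder_name(name):
    """Return True if `name` is a bare folder name: no slashes and not '.' or '..'."""
    if name in ("", ".", ".."):
        return False
    # Reject both forward and backward slashes so paths like 'testData/' fail.
    return "/" not in name and "\\" not in name
```

For example, `is_plain_folder_name("testData")` is accepted, while `"testData/"`, `"./testData"`, and `".."` are rejected.
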
- The test folder contains sample fastq files, a Betaherpesvirinae genome database folder to blast the most significantly differentially expressed genes against, and a metatable that stores information about the sample data.
- The links folder contains the SRA links that the data comes from.
- If you change the name of the testData folder or input a different one, make sure you also set the metatable and/or links flags to the appropriate paths; they default to the test folder but can be adjusted.
- The outputs will be saved in the PipelineProject_Rohan_Sethi directory: results are saved in results, and data used as input to the various functions is saved in data.
- The log file contains information from the test run
- The log file already in this repo is the one from the full run and not from the test sample data
- You can run the run.py file or run each script separately as needed; use the --help flag for more information on any script you want to run separately.
- All flags in brackets are optional and default to the following (you may want to change them if not running the test data):
- -s = testData/links/fileLinks.txt
- -i = NC_006273.2
- -e = no default, need to specify one
- -m = testData/metatable.tsv
- -l = ./PipelineProject_Rohan_Sethi/PipelineProject.log
- -n = Betaherpesvirinae
- -b = None
- -u = 10
- -t = None
- usage: python3 run.py [-h] [-s INPUT] [-i INDEX] -e EMAIL [-m METATABLE] [-l LOGFILE] [-n NAME] [-b BLASTDB] [-u NUMSELECT] [-t TESTDATA]
- options:
- -h, --help show this help message and exit
- -s INPUT, --input INPUT input file with NCBI links for the SRA sample data to download from
- -i INDEX, --index INDEX input accession id for index to assemble the reads
- -e EMAIL, --email EMAIL input email for NCBI access via biopython
- -m METATABLE, --metatable METATABLE metatable tab deliminated containing information about the samples; an example in testData
- -l LOGFILE, --logfile LOGFILE name/path of log file to store important output information tab delimited
- -n NAME, --name NAME name of species to blast against to see what other species the most differentially expressed genes are expressed in
- -b BLASTDB, --blastdb BLASTDB if already have blast genome fasta file, then input path to it here; this is if you want to skip the downloaded step if already have a blast fasta; must match the name parameter if using; if not inputed then ignores and downloads what is passed to name parameter
- -u NUMSELECT, --numSelect NUMSELECT number of blast results to store from the blast search
- -t TESTDATA, --testData TESTDATA input test data folder name; only if wanting to run test run, else ignore
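
The usage line and defaults above map naturally onto Python's argparse. The following is a sketch reconstructed from the help text, not the actual contents of run.py: the function name `build_parser` is hypothetical, and parsing -u as an integer is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults (a reconstruction)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("-s", "--input", default="testData/links/fileLinks.txt",
                        help="input file with NCBI links for the SRA sample data")
    parser.add_argument("-i", "--index", default="NC_006273.2",
                        help="accession id for the index")
    # The email has no default and must always be supplied.
    parser.add_argument("-e", "--email", required=True,
                        help="email for NCBI access via biopython")
    parser.add_argument("-m", "--metatable", default="testData/metatable.tsv",
                        help="tab-delimited metatable describing the samples")
    parser.add_argument("-l", "--logfile",
                        default="./PipelineProject_Rohan_Sethi/PipelineProject.log",
                        help="name/path of the log file")
    parser.add_argument("-n", "--name", default="Betaherpesvirinae",
                        help="name to blast against")
    parser.add_argument("-b", "--blastdb", default=None,
                        help="path to an existing blast fasta file, if any")
    # Assumption: the number of blast results is parsed as an integer.
    parser.add_argument("-u", "--numSelect", type=int, default=10,
                        help="number of blast results to store")
    parser.add_argument("-t", "--testData", default=None,
                        help="test data folder name, only for a test run")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Because only -e is marked as required, a test run needs just the email and the test folder, for example `-e [EMAIL] -t testData`.
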