

License: GNU General Public License v3.0

Python 6.50% Shell 29.05% Jupyter Notebook 62.71% R 1.74%


Cell-Differentiation-Classifier

A logistic regression workflow that takes RNA-seq data to assess the differentiation potential of a cell line.

Project Homepage: https://github.com/StevenWingett/Cell-Differentiation-Classifier

Repository Data Files

data_to_download_metadata.tsv - detailed information on the HipSci datasets analysed.

download_files_list_wget.txt - use with the wget command to download the HipSci FASTQ data files of interest.

dataset_summary.tsv - summary of the datasets, including differentiation scores.

Repository Scripts

cell_diff_class_map.def - Definitions file to build a [Singularity](https://sylabs.io/singularity/) container for consistent QC/trimming/mapping of RNA-seq FASTQ files.

create_classifier_datafile.py - Python3 script that collates the Kallisto mapping results into one file. The script calculates the log10(tpm + 1) value for each transcript in each accession (one cell line may encompass multiple accessions). It also reports the mean and standard deviation of log10(tpm + 1) for each transcript across all accessions. To compare transcript levels between accessions, the script then calculates a z-score for each transcript, i.e. it compares the expression of a transcript in a given accession against the expression of that same transcript in the other accessions.
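The transformation described above can be sketched in plain Python. This is a minimal illustration, not the code from create_classifier_datafile.py: the function name `zscores_per_transcript`, the nested-dictionary layout and the use of the sample standard deviation are all assumptions made for the example.

```python
import math
import statistics


def zscores_per_transcript(tpm_by_accession):
    """Return {accession: {transcript: z-score}} from raw TPM values.

    tpm_by_accession maps accession -> {transcript_id: tpm}.
    Assumes every accession reports the same set of transcripts.
    """
    # Step 1: log10(tpm + 1) for each transcript in each accession.
    log_values = {
        acc: {tx: math.log10(tpm + 1) for tx, tpm in txs.items()}
        for acc, txs in tpm_by_accession.items()
    }

    transcripts = list(next(iter(log_values.values())).keys())
    zscores = {acc: {} for acc in log_values}
    for tx in transcripts:
        # Step 2: mean and standard deviation across all accessions.
        values = [log_values[acc][tx] for acc in log_values]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD (an assumption)
        for acc in log_values:
            # Step 3: z-score; undefined when the transcript never varies.
            if sd == 0:
                zscores[acc][tx] = float("nan")
            else:
                zscores[acc][tx] = (log_values[acc][tx] - mean) / sd
    return zscores


# Made-up TPM values for three hypothetical accessions.
example = {
    "ACC1": {"tx1": 9.0, "tx2": 0.0},
    "ACC2": {"tx1": 99.0, "tx2": 0.0},
    "ACC3": {"tx1": 999.0, "tx2": 0.0},
}
result = zscores_per_transcript(example)
```

A transcript with identical expression in every accession (tx2 here) has a standard deviation of zero, so its z-score is returned as NaN rather than raising a division error.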

This script takes as input the *.abundance.tsv files generated by Kallisto and a metadata file, which by default will be named 'dataset_summary.tsv'.

python3 create_classifier_datafile.py -m [metadata file] [Kallisto abundance files]

This project's GitHub repository contains the metadata file 'dataset_summary.tsv' used for processing the selected HipSci datasets, which should also be used as a template for processing additional datasets. If constructing your own metadata file, you will need to include 'Accession', 'Cell_line', 'Diff_efficiency' and 'Retention_group' columns. (The Retention_group is simply a way of filtering which accessions are analysed by logistic regression. The unedited Jupyter notebook (logistic_regression_classifier.) will only process accessions in Retention_group==0.)
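When building your own metadata file, a quick check for the required columns can catch problems early. The sketch below assumes the file is tab-separated (as the .tsv extension suggests); the function name `load_retained_accessions` and the example rows are hypothetical and are not taken from the repository.

```python
import csv
import io

REQUIRED_COLUMNS = {"Accession", "Cell_line", "Diff_efficiency", "Retention_group"}


def load_retained_accessions(handle, retention_group="0"):
    """Read a tab-separated metadata file and return rows in one retention group."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Metadata file is missing columns: {sorted(missing)}")
    # Values are read as strings, so the group is compared as text.
    return [row for row in reader if row["Retention_group"] == retention_group]


# Made-up rows for illustration only; real values come from dataset_summary.tsv.
demo = io.StringIO(
    "Accession\tCell_line\tDiff_efficiency\tRetention_group\n"
    "ACC1\tline_a\t0.8\t0\n"
    "ACC2\tline_b\t0.1\t1\n"
)
retained = load_retained_accessions(demo)
retained_ids = [row["Accession"] for row in retained]  # only ACC1 is in group 0
```

With a real file, replace the `io.StringIO` object with `open("dataset_summary.tsv", newline="")`.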

build_logistic_regression_classifier.ipynb - Jupyter notebook that builds the logistic regression classifier model. Takes as input the data file generated by the create_classifier_datafile.py script.

run_logistic_regression_classifier.ipynb - Jupyter notebook that runs the logistic regression classifier on mapped data once the logistic regression model has been established. Takes as input the data file generated by the create_classifier_datafile.py script and the model coefficients generated by build_logistic_regression_classifier.ipynb.
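Applying an established logistic regression model amounts to a weighted sum of the input features passed through the logistic function. The sketch below illustrates that general calculation only; it is not taken from the notebooks. The coefficient layout, the names, the example values and the 0.5 cut-off are all assumptions.

```python
import math


def predict_probability(zscores, coefficients, intercept):
    """Logistic regression prediction: sigmoid(intercept + sum(w_i * x_i))."""
    linear = intercept + sum(
        weight * zscores[transcript] for transcript, weight in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and z-scores, for illustration only.
coefficients = {"tx1": 1.5, "tx2": -0.5}
intercept = -0.25
accession_zscores = {"tx1": 0.5, "tx2": 1.0, "tx3": 2.0}  # tx3 is not in the model

probability = predict_probability(accession_zscores, coefficients, intercept)
predicted_class = int(probability >= 0.5)  # the 0.5 cut-off is an assumption
```

Only transcripts that carry a coefficient contribute to the prediction, so any extra transcripts in the input are ignored.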

map_fastq_file.sh - Bash script for QC/trimming/mapping of RNA-seq FASTQ files. The script processes all files with extension *.fastq.gz in the current working directory. The Kallisto transcriptome reference file should be passed as an argument.

map_fastq_file.sh [Kallisto transcriptome reference file] 

Misc Folder

The Misc folder contains additional helper scripts which are not part of the main processing workflow, but which may be helpful when analysing data.

basic_prediction_summay.ipynb - Jupyter Notebook to give a general summary (e.g. confusion matrix, Cohen's Kappa) of already-classified results (binary format).
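For readers unfamiliar with these summary statistics, the following sketch shows how a confusion matrix and Cohen's Kappa can be computed from binary results. It is an independent illustration in plain Python and does not reproduce the notebook; the function names and the example labels are made up.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    n = tn + fp + fn + tp
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of each label.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    if expected == 1:
        return float("nan")  # undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Made-up labels: observed classes and predicted classes.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
matrix = confusion_matrix(y_true, y_pred)
kappa = cohens_kappa(y_true, y_pred)
```

Kappa corrects raw accuracy for the agreement expected by chance, which matters when one class is much more common than the other.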

transcript_gene_lookup.ipynb - Jupyter Notebook to take a file of data which includes a 'target_id' column (corresponding to transcript IDs) and a transcript-gene lookup file (columns:'target_id' and 'gene_id'), and then generate a new file which contains the input data and a new column listing the corresponding gene names.
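The lookup step is essentially a join on 'target_id'. A minimal sketch of that idea is given below, assuming both files are tab-separated; the function name `add_gene_column`, the handling of unmatched transcripts and the example identifiers are assumptions rather than the notebook's actual behaviour.

```python
import csv
import io


def add_gene_column(data_handle, lookup_handle, out_handle):
    """Append a 'gene_id' column to tab-separated data keyed on 'target_id'."""
    lookup = {
        row["target_id"]: row["gene_id"]
        for row in csv.DictReader(lookup_handle, delimiter="\t")
    }
    reader = csv.DictReader(data_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=list(reader.fieldnames) + ["gene_id"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Transcripts missing from the lookup get an empty gene field.
        row["gene_id"] = lookup.get(row["target_id"], "")
        writer.writerow(row)


# Made-up identifiers and values, for illustration only.
data = io.StringIO("target_id\tcoef\ntx1\t0.5\ntx9\t0.1\n")
lookup_table = io.StringIO("target_id\tgene_id\ntx1\tgeneA\ntx2\tgeneB\n")
out = io.StringIO()
add_gene_column(data, lookup_table, out)
```

Every input row is kept, including rows whose transcript has no match, so the output has the same number of rows as the input.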

Workflow overview

To improve consistency between samples (and any future samples), map only the forward read of paired-end data, or use single-end data. This choice was made because future samples may be single-end. If the classification model had been built from data mapped as paired-end, it would not be possible to process all the data uniformly, since single-end data cannot be mapped as paired-end data.

The sections below give more detail on the QC, trimming and mapping processes.

QC

Quality control is performed with [FastQC v0.11.9](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC v1.11](https://multiqc.info/).

Trimming

Trimming uses [TrimGalore! v0.6.3](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), powered by [Cutadapt version 3.1](http://journal.embnet.org/index.php/embnetjournal/article/view/200).

Reads are trimmed with standard parameters, and then re-trimmed to a length of 50 nt (TrimGalore! parameter --hardtrim5 50).

Mapping

Mapping uses [Kallisto v0.46.1](https://pachterlab.github.io/kallisto/):

kallisto quant -i [Kallisto index] -o $F.kallisto --single -l 200 -s 20 -t 20 [FASTQ FILE] 

The Homo sapiens GRCh38 (v102) cDNA sequences were downloaded from Ensembl to build the Kallisto index.

Processing data

We recommend using the Singularity container for QC/trimming/mapping. Run the command below to build the container using the definitions file. (The --bind $PWD:/mnt argument makes your current working directory ($PWD) visible to the container inside the container's /mnt folder).

sudo singularity build --bind $PWD:/mnt cell_diff_class_map.sif cell_diff_class_map.def

To perform the QC/trimming/mapping, run the command:

singularity run --bind $PWD:/mnt cell_diff_class_map.sif [Kallisto index]

Note: This command requires the metadata file dataset_summary.tsv to be in your current working directory. The FASTQ files for processing and the Kallisto index should also be in your current working directory. FASTQ files with the extension *.fastq.gz will be processed.

Should you wish to enter the Singularity container (e.g. to use Kallisto to generate a Kallisto transcriptome index), run the command:

singularity shell --bind $PWD:/mnt cell_diff_class_map.sif

Steven Wingett, The MRC-Laboratory of Molecular Biology, Cambridge, UK

cell-differentiation-classifier's Issues

Process extra data

Process:

  1. Retained datasets
  2. Datasets to check if Jerber model gives similar results to this classifier
  3. Magda's cell lines

Check pivot

Check that the default sorting applied by the pivot command will not cause problems in any of the scripts. Re-sort the affected data if necessary to correct for this.

Check ROC and PR curves

Check these plots are using the correct input data (e.g. not binary data?) in the Jupyter Notebooks.

Export coefficients from 100 logistic regressions

The coefficients (+intercepts) from the stability modelling could be exported to a separate file and then used by the run classifier script as an additional piece of analysis to assess the differentiation potential of a cell line.

This file will be large. Maybe this could be saved on OSF for people to download, rather than saving in the Git repo.

Create a simple summary script

Write a script that produces a confusion matrix etc. from an already generated table of results, e.g.

X y
1 0
1 1
0 0
0 1
1 1
