A Single-Cell Data Analysis for Crohn’s Disease with COTAN

2023 Computational Health Laboratory project - MD Computer Science University of Pisa.

Group Members

Project description

This project performs single-cell RNA sequencing analysis with COTAN. In detail, it covers data cleaning, cell clustering, and gene set enrichment. Given multiple analyzed patients, it can output, for each cell type, the best intersection of genes across the patients. It can also find the genes exclusive to each condition (healthy, inflamed, and non-inflamed samples), dividing them into new genes found by the enrichment and known genes from a given reference. Finally, for each patient it can perform pathway enrichment on the genes found by the previous steps.

How to run

Clone the repository

git clone https://github.com/aprs3/CHL_Project.git

Install COTAN and its dependencies in R

devtools::install_github("seriph78/COTAN")

Download the dataset files into the corresponding dataset folder. Compress scp.features.tsv, scp.barcodes.tsv, and scp.raw.mtx with gzip, then rename them to features.tsv.gz, barcodes.tsv.gz, and matrix.mtx.gz respectively.
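
The compress-and-rename step can be scripted. The following is a minimal shell sketch, not part of the repository: the function name prepare_dataset is hypothetical, and the mapping of each downloaded file to its final name (scp.barcodes.tsv to barcodes.tsv.gz, scp.features.tsv to features.tsv.gz, scp.raw.mtx to matrix.mtx.gz) is an assumption based on the file names.

```shell
# prepare_dataset DIR
# Hypothetical helper: gzip the three downloaded files in DIR and give them
# the names expected in the dataset folder. The source-to-target mapping is
# assumed from the file names, not taken from the repository.
prepare_dataset() {
    dir="$1"
    for pair in \
        "scp.barcodes.tsv:barcodes.tsv.gz" \
        "scp.features.tsv:features.tsv.gz" \
        "scp.raw.mtx:matrix.mtx.gz"
    do
        src="$dir/${pair%%:*}"   # part before the colon
        dst="$dir/${pair##*:}"   # part after the colon
        if [ ! -f "$src" ]; then
            echo "missing file: $src" >&2
            return 1
        fi
        # gzip replaces SRC with SRC.gz; mv then applies the final name.
        gzip -f "$src" && mv "$src.gz" "$dst" || return 1
    done
}
```

With the folder layout used in this README, it would be called as prepare_dataset TI_IMM/dataset. The uncompressed originals are not kept, since gzip replaces each file with its compressed version.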

Scripts and helper files description

The R scripts are listed below; usage notes are in the comments at the top of each script.

main.R: Performs data cleaning, clustering, and gene set enrichment analysis using COTAN. Outputs multiple cluster cuts from a given range as .csv files. It can be run either by passing the arguments on the command line or by setting the "args" array manually, so that its first and second elements are the dataset folder name and the patient's ID, respectively. The script creates all the folders automatically. If one of the args is set to -1, the script interactively prompts the user for the variable values. The script also computes the clustering of the targeted enrichment genes for every number of clusters in the range between the two variables "start" and "end".
venn.R: Given a list of patients, computes the optimal number of enriched-gene clusters for each patient (within the range set by the variables start and end), such that the clusters in which a cell type's markers are placed vary as little as possible. Finally, it saves the optimal number of clusters for each patient into a .txt file, together with the .csv file containing the intersections found with the method described in section 5.1.1. To run it, set the variable "dataset" to the selected dataset name (for example TI_IMM), the list "to_load" to the patients' IDs, and the variables start and end to the range in which to search for the optimal number of clusters.
find_state_markers.R: Loads the .csv files containing, for each patient group, the per-cell-type intersection of the optimal clusters (as calculated by venn.R), that is, the genes that behaved like that cell type's markers. It then removes from each patient group the union of the genes that appear in the other two groups' intersections. The resulting files contain, for each cell type, the genes that behave like that cell type's markers only when the patient presents a specific condition (inflamed, non-inflamed, or healthy).
separate_state_markers.R: Loads the .csv files generated by find_state_markers.R and moves into a separate column the genes that were already present in the respective known_cells_genes.csv. What is left are the genes that behave like a cell type's markers only under a specific patient condition (healthy, inflamed, or non-inflamed) and that were not already known.
utils.R: Various utilities for reading files and displaying the plots.
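
The set logic described for find_state_markers.R and separate_state_markers.R can be illustrated outside R. The sketch below is not the scripts' implementation: it assumes each gene list is a plain text file with one gene symbol per line (a simplification of the .csv files the scripts read), and the function names exclusive_genes and split_known are hypothetical.

```shell
# Sketch only: gene lists are assumed to be plain text files with one gene
# symbol per line, a simplification of the .csv files used by the R scripts.

# exclusive_genes TARGET OTHER...
# Print the genes in TARGET that appear in none of the OTHER files
# (at least one OTHER file is required).
exclusive_genes() {
    target="$1"
    shift
    others=$(mktemp) || return 1
    # Union of the other groups, sorted and de-duplicated for comm.
    LC_ALL=C sort -u "$@" > "$others"
    # comm -23 keeps the lines found only in the first input (the target).
    LC_ALL=C sort -u "$target" | LC_ALL=C comm -23 - "$others"
    rm -f "$others"
}

# split_known GENES KNOWN
# Print each gene in GENES prefixed with "known" or "new", depending on
# whether it is listed in KNOWN. Known genes are printed first.
split_known() {
    known=$(mktemp) || return 1
    LC_ALL=C sort -u "$2" > "$known"
    # comm -12 keeps the lines common to both inputs.
    LC_ALL=C sort -u "$1" | LC_ALL=C comm -12 - "$known" | sed 's/^/known /'
    LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$known" | sed 's/^/new /'
    rm -f "$known"
}
```

For example, with hypothetical files inflamed.txt, non_inflamed.txt, and healthy.txt, the call exclusive_genes inflamed.txt non_inflamed.txt healthy.txt prints the genes seen only in the inflamed group, and split_known applied to that result and a list of known markers labels each gene as known or new.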

List of helper files:

known_cells_genes.csv: a copy of this file is present in each dataset. It contains the cell types found in the datasets by Kong et al., as well as their marker genes.
enrichment_list.txt: the list of genes on which we were asked to perform gene enrichment analysis.

Folder structure and file descriptions

The folder structure is the same for each dataset. For simplicity, we will use the TI_IMM dataset as an example, with only one patient and their files.

TI_IMM
├── dataset #(not included in this repository as the datasets are too large)
│   ├── features.tsv.gz: features data
│   ├── barcodes.tsv.gz: barcodes data
│   └── matrix.mtx.gz:   dataset matrix
└── H101694
    ├── enrichment_csv
    │   └── TI_IMM_H101694_EnrichmentGenesClusters_x.csv: clustering data (with "x" clusters) of the enrichment target genes
    ├── clustering
    │   └── Results of COTAN's clustering and merging procedures.
    └── plot
        ├── TI_IMM_H101694__00_ECDPlot.pdf: ECD plot
        ├── TI_IMM_H101694__01_CellSizePlot.pdf: original library size plot
        ├── TI_IMM_H101694__02_GenesSizePlot.pdf: original genes count plot
        ├── TI_IMM_H101694__03_MitocondrialPlot.pdf: original mitochondrial genes' percentages plot
        ├── TI_IMM_H101694__04_CellSizePlot_cut_n.pdf: library size plot after n cuts
        ├── TI_IMM_H101694__05_GeneCountPlot_cut_n.pdf: gene size plot after n cuts
        ├── TI_IMM_H101694__06_MitocondrialCount_cut_n.pdf: mitochondrial cells plot after n cuts
        ├── TI_IMM_H101694__07_PCACells.pdf: plots the "A" and "B" clusters
        ├── TI_IMM_H101694__08_PCACellsBRemoval.pdf: PCA after the removal of cluster "B" (if removed)
        ├── TI_IMM_H101694__09_CleanPlotGenes.pdf: B cell group genes mean expression
        ├── TI_IMM_H101694__10_CleanPlotUDE.pdf: "Nu" correlation plot
        ├── TI_IMM_H101694__11_CleanPlotNu.pdf: plot of the "Nu" values
        ├── TI_IMM_H101694__12_PCACells_cut_n.pdf: Plotting again PCA after cluster B removal (if removed)
        ├── TI_IMM_H101694__13_CleanPlotGenes_cut_n.pdf: B cell group genes mean expression after n cuts
        ├── TI_IMM_H101694__14_CleanPlotUDE_cut_n.pdf: "Nu" correlation plot after n cuts
        ├── TI_IMM_H101694__15_CleanPlotNu_cut_n.pdf: "Nu" values after n cuts
        ├── TI_IMM_H101694__16_GDIPlot.pdf: GDI plot
        ├── TI_IMM_H101694__22_FineClustersSummary.pdf: Summary of the (not merged yet) clusters' statistics
        ├── TI_IMM_H101694__23_MergedClustersSummary.pdf: Summary of the merged clusters' statistics (not reliable because of a bug in the COTAN library)
        ├── TI_IMM_H101694__26_clustersMarkersHeatmapPlot.pdf: Heatmap which associates to each cluster the enrichment of the markers of certain known cell types
        ├── TI_IMM_H101694__27_enrichmentHm.pdf: Enrichment heatmap with the targeted genes' dendrogram (unclustered)
        ├── TI_IMM_H101694__27_enrichmentHm_n.pdf: Enrichment heatmap with the targeted genes' dendrogram (clustered with "n" clusters)
        └── TI_IMM_H101694__28_enrichmentHmUnclustered.pdf: Enrichment heatmap (no clustering applied)
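
Because main.R writes one enrichment .csv per cluster count, it can be handy to check which cluster counts are available for a patient before running venn.R. The following is a small shell sketch under stated assumptions: the function name list_cluster_cuts is hypothetical, and the file name pattern (dataset name, patient ID, then EnrichmentGenesClusters_x.csv) is inferred from the single example in the tree above.

```shell
# list_cluster_cuts DATASET_DIR PATIENT
# Hypothetical helper: print the cluster counts ("x") for which an enrichment
# CSV exists, one per line in ascending order. The path pattern is inferred
# from the example tree, not taken from the scripts.
list_cluster_cuts() {
    dataset=$(basename "$1")
    prefix="$1/$2/enrichment_csv/${dataset}_${2}_EnrichmentGenesClusters_"
    for f in "$prefix"*.csv; do
        [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
        n=${f#"$prefix"}          # strip the leading path and name prefix
        echo "${n%.csv}"          # strip the extension, leaving the count
    done | sort -n
}
```

For the example patient, list_cluster_cuts TI_IMM H101694 would print one number per available cut.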

Package requirements

The project requires the packages listed below; their own dependencies are installed automatically along with them.

COTAN
Zeallot
Seurat
VennDiagram
limma
clusterProfiler
ReactomePA
