chl_project's Introduction

Single-cell data analysis for Crohn's disease with COTAN

2023 Computational Health Laboratory project - MD Computer Science University of Pisa.

Group Members

Project description

This project performs single-cell RNA sequencing analysis with COTAN, plus some additional steps. In detail, it performs data cleaning, cell clustering, and gene set enrichment. Given multiple patients, it can output, for each cell type, the best intersection of genes across the patients. It can also find the genes exclusive to each condition (healthy, inflamed, and not-inflamed samples), separating the new genes found by the enrichment from the known genes listed in a given reference. Finally, for each patient, it can perform pathway enrichment on the genes found by the previous steps.

How to run

Clone the repository

git clone https://github.com/aprs3/CHL_Project.git

Install COTAN and its dependencies in R

devtools::install_github("seriph78/COTAN")

Download the dataset files into the correct dataset folder. Compress scp.barcodes.tsv, scp.features.tsv, and scp.raw.mtx with gzip, then rename them to barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz respectively.
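The compression and renaming step can also be done from R. The following is a minimal sketch using base R only; the helper names gzip_file and prepare_dataset are hypothetical and not part of the repository, and the dataset path in the usage comment is an assumption based on the folder structure described in this README. The original uncompressed files are left in place.

```r
# Hypothetical helper: gzip-compress `src` into `dest` using base R only.
gzip_file <- function(src, dest) {
  con_in <- file(src, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- gzfile(dest, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    # Copy the file in chunks of up to one million bytes.
    chunk <- readBin(con_in, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, con_out)
  }
  invisible(dest)
}

# Hypothetical helper: compress the three downloaded files and give the
# compressed copies the names listed above.
prepare_dataset <- function(dataset_dir) {
  renames <- c(
    "scp.barcodes.tsv" = "barcodes.tsv.gz",
    "scp.features.tsv" = "features.tsv.gz",
    "scp.raw.mtx"      = "matrix.mtx.gz"
  )
  for (src in names(renames)) {
    src_path <- file.path(dataset_dir, src)
    if (!file.exists(src_path)) stop("Missing input file: ", src_path)
    gzip_file(src_path, file.path(dataset_dir, renames[[src]]))
  }
  invisible(file.path(dataset_dir, unname(renames)))
}

# Usage (path assumed from the folder structure in this README):
# prepare_dataset(file.path("TI_IMM", "dataset"))
```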

Scripts and helper files description

This is the list of R scripts; the usage of each one is documented in a comment at the top of the script.

  • main.R: Performs data cleaning, clustering, and gene set enrichment analysis using COTAN. Outputs multiple cluster cuts from a given range as .csv files. It can be run either by passing the arguments on the command line or by setting the "args" array manually, so that its first and second elements are the dataset folder name and the patient's ID, respectively. The script creates all the folders automatically. If one of the args is set to -1, the script prompts the user interactively for that value. The script also computes the clustering of the targeted enrichment genes for every number of clusters between two variables, "start" and "end".
  • venn.R: Given a list of patients, it computes, for each patient, the optimal number of clusters of the enriched genes (within the range set by the variables start and end), chosen so that the clusters in which a cell type's markers are placed vary as little as possible. It then saves a .txt file with the optimal number of clusters for each patient and a .csv file with the intersections found with the method described in section 5.1.1. To run it, set the variable "dataset" to the selected dataset name (for example TI_IMM), the list "to_load" to the patients' IDs, and the variables start and end to the range in which to search for the optimal number of clusters.
  • find_state_markers.R: This script loads the .csv files that contain, for each patient group and each cell type, the intersection of the optimal clusters (as calculated by venn.R), i.e. the genes that behaved like that cell type's markers. From each patient group it then removes the union of the genes that appear in the other two intersections. The resulting files contain the genes that behave, for each cell type, like that cell type's markers only when the patient presents a specific condition (inflamed, not inflamed, or healthy).
  • separate_state_markers.R: This script loads the .csv files generated by find_state_markers.R and moves into a separate column the genes that were already present in the respective known_cells_genes.csv. What is left are genes that behave like a cell type's markers only under a specific patient condition (healthy, inflamed, or not inflamed) and that were not already known.
  • utils.R: Various utilities for reading files and for displaying the plots.
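The argument handling described for main.R (two arguments, with -1 triggering an interactive prompt) could look roughly like the sketch below. This is not the script's actual code: the function name resolve_args, the prompt texts, and the injectable ask parameter are all assumptions made for illustration.

```r
# Hypothetical helper: resolve the dataset folder name and the patient ID.
# `ask` is injectable so that the prompt can be replaced in scripted runs.
resolve_args <- function(args = commandArgs(trailingOnly = TRUE),
                         ask = function(prompt) readline(prompt)) {
  if (length(args) < 2) {
    stop("Expected two arguments: <dataset folder name> <patient ID>")
  }
  prompts <- c("Dataset folder name: ", "Patient ID: ")
  values <- as.character(args[1:2])
  for (i in seq_along(values)) {
    # A value of -1 means "ask the user for this argument".
    if (identical(values[i], "-1")) values[i] <- ask(prompts[i])
  }
  list(dataset = values[1], patient = values[2])
}

# Usage with manually set arguments:
# args <- c("TI_IMM", "H101694")
# resolve_args(args)
```

Note that readline() only waits for input in interactive sessions. When running under Rscript, one option is to pass ask = function(prompt) { cat(prompt); readLines(file("stdin"), n = 1) } instead.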

List of helper files:

  • known_cells_genes.csv: a copy of this file is present in each dataset. It contains the set of cell types found in the datasets by Kong et al., together with their gene markers.
  • enrichment_list.txt: the list of genes on which we were asked to perform gene enrichment analysis.
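The set operations behind find_state_markers.R and separate_state_markers.R can be illustrated with a small sketch. The gene names, the input list, and the helper names exclusive_genes and split_known are invented for illustration; the real scripts read their inputs from .csv files whose layout is not described here.

```r
# Toy input (invented): per-condition gene intersections for one cell type.
intersections <- list(
  healthy      = c("GENE_A", "GENE_B", "GENE_C", "GENE_G"),
  inflamed     = c("GENE_B", "GENE_D", "GENE_E"),
  not_inflamed = c("GENE_C", "GENE_E", "GENE_F")
)

# For each condition, keep only the genes absent from the other two.
exclusive_genes <- function(sets) {
  result <- lapply(names(sets), function(cond) {
    others <- unlist(sets[names(sets) != cond], use.names = FALSE)
    setdiff(sets[[cond]], others)
  })
  names(result) <- names(sets)
  result
}

# Split a gene vector into already-known markers and new candidates.
split_known <- function(genes, known_markers) {
  is_known <- genes %in% known_markers
  list(known = genes[is_known], new = genes[!is_known])
}

exclusive <- exclusive_genes(intersections)

# Placeholder for the markers that would come from known_cells_genes.csv.
known_markers <- c("GENE_A", "GENE_D")
separated <- lapply(exclusive, split_known, known_markers = known_markers)
```

In this toy example, GENE_A and GENE_G are exclusive to the healthy condition, and only GENE_G ends up in the "new" group because GENE_A is listed among the known markers.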

Folder structure and file descriptions

The folder structure is the same for each dataset. For simplicity, we will use the TI_IMM dataset as an example, with only one patient and their files.

TI_IMM
├── dataset #(not included in this repository as the datasets are too heavy)
│   ├── features.tsv.gz: features data
│   ├── barcodes.tsv.gz: barcodes data
│   └── matrix.mtx.gz:   dataset matrix
└── H101694
    ├── enrichment_csv
    │   └── TI_IMM_H101694_EnrichmentGenesClusters_x.csv: Clustering data (with "x" clusters) of the enrichment target genes
    ├── clustering
    │   └── Results of COTAN's clustering and merging procedures.
    └── plot
        ├── TI_IMM_H101694__00_ECDPlot.pdf: ECD plot
        ├── TI_IMM_H101694__01_CellSizePlot.pdf: original library size plot
        ├── TI_IMM_H101694__02_GenesSizePlot.pdf: original genes count plot
        ├── TI_IMM_H101694__03_MitocondrialPlot.pdf: original mitochondrial genes' percentages plot
        ├── TI_IMM_H101694__04_CellSizePlot_cut_n.pdf: library size plot after n cuts
        ├── TI_IMM_H101694__05_GeneCountPlot_cut_n.pdf: gene size plot after n cuts
        ├── TI_IMM_H101694__06_MitocondrialCount_cut_n.pdf: mitochondrial cells plot after n cuts
        ├── TI_IMM_H101694__07_PCACells.pdf: plots the "A" and "B" clusters
        ├── TI_IMM_H101694__08_PCACellsBRemoval.pdf: PCA after the removal of cluster "B" (if removed)
        ├── TI_IMM_H101694__09_CleanPlotGenes.pdf: B cell group genes mean expression
        ├── TI_IMM_H101694__10_CleanPlotUDE.pdf: "Nu" correlation plot
        ├── TI_IMM_H101694__11_CleanPlotNu.pdf: plot of the "Nu" values
        ├── TI_IMM_H101694__12_PCACells_cut_n.pdf: Plotting again PCA after cluster B removal (if removed)
        ├── TI_IMM_H101694__13_CleanPlotGenes_cut_n.pdf: B cell group genes mean expression after n cuts
        ├── TI_IMM_H101694__14_CleanPlotUDE_cut_n.pdf: "Nu" correlation plot after n cuts
        ├── TI_IMM_H101694__15_CleanPlotNu_cut_n.pdf: "Nu" values after n cuts
        ├── TI_IMM_H101694__16_GDIPlot.pdf: GDI plot
        ├── TI_IMM_H101694__22_FineClustersSummary.pdf: Summary of the (not merged yet) clusters' statistics
        ├── TI_IMM_H101694__23_MergedClustersSummary.pdf: Summary of the merged clusters' statistics (not reliable because of a bug in the COTAN library)
        ├── TI_IMM_H101694__26_clustersMarkersHeatmapPlot.pdf: Heatmap which associates to each cluster the enrichment of the markers of certain known cell types
        ├── TI_IMM_H101694__27_enrichmentHm.pdf: Enrichment heatmap with the targeted genes' dendrogram (unclustered)
        ├── TI_IMM_H101694__27_enrichmentHm_n.pdf: Enrichment heatmap with the targeted genes' dendrogram (clustered with "n" clusters)
        └── TI_IMM_H101694__28_enrichmentHmUnclustered.pdf: Enrichment heatmap (no clustering applied)
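As an illustration of how one EnrichmentGenesClusters_x.csv file per cluster count could be produced for a range between "start" and "end", here is a minimal sketch on toy data. It uses base R hierarchical clustering (hclust on Euclidean distances) as a stand-in, since the distance and linkage used by main.R are not specified in this README. The function name write_cluster_cuts, the toy matrix, and the column names of the output are assumptions.

```r
# Toy data (invented) standing in for the enrichment genes' profiles.
set.seed(1)
toy <- matrix(rnorm(40), nrow = 8,
              dimnames = list(paste0("GENE_", 1:8), NULL))

# Hypothetical helper: cut one dendrogram at every k in start:end and
# write each cut to <root>/<dataset>/<patient>/enrichment_csv/.
write_cluster_cuts <- function(mat, dataset, patient, start, end, root = ".") {
  out_dir <- file.path(root, dataset, patient, "enrichment_csv")
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  tree <- hclust(dist(mat))
  paths <- character(0)
  for (k in start:end) {
    clusters <- cutree(tree, k = k)
    out <- data.frame(gene = names(clusters), cluster = unname(clusters))
    file_name <- sprintf("%s_%s_EnrichmentGenesClusters_%d.csv",
                         dataset, patient, k)
    path <- file.path(out_dir, file_name)
    write.csv(out, path, row.names = FALSE)
    paths <- c(paths, path)
  }
  invisible(paths)
}

# Writes three files (2, 3, and 4 clusters) under a temporary directory.
paths <- write_cluster_cuts(toy, "TI_IMM", "H101694",
                            start = 2, end = 4, root = tempdir())
```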

Package requirements

The dependencies of these packages are installed automatically when the packages themselves are installed.

  • COTAN
  • zeallot
  • Seurat
  • VennDiagram
  • limma
  • clusterProfiler
  • ReactomePA
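Before running the scripts, it can be useful to check which of these packages are still missing. The sketch below uses only base R; the helper name missing_packages is hypothetical and not part of the repository.

```r
# Packages listed in this README.
required <- c("COTAN", "zeallot", "Seurat", "VennDiagram",
              "limma", "clusterProfiler", "ReactomePA")

# Hypothetical helper: return the names of the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage:
# missing_packages(required)
```

An empty result means that every listed package is available. Note that requireNamespace() loads the namespace of each package it finds, so the check can take a few seconds.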

Contributors

asduffo, aprs3