becavin-lab / checkatlas

One liner tool to check the quality of your single-cell atlases.

Home Page: https://checkatlas.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.17% Python 7.17% R 0.03% HTML 92.63%
control quality scanpy seurat single-cell multiqc python

checkatlas's Introduction

CheckAtlas

CheckAtlas is a one-liner tool to check the quality of your single-cell atlases. For every atlas, it produces quality control tables and figures, which can then be processed by MultiQC. CheckAtlas can check the quality of Scanpy, Seurat, and CellRanger files.

More information is available on the Read the Docs page.

Summary

Powered by Nextflow, checkatlas can be run with a single command:

nextflow run nf-core-checkatlas -r dev --path search_folder/

The checkatlas workflow starts with a fast crawl through your working directory. It detects Seurat (.rds), Scanpy (.h5ad) and CellRanger (.h5) atlas files.

Then, it goes through all atlas files and produces summary information:
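As an illustration of the crawl step, here is a minimal sketch that maps the three file extensions listed above to atlas types. The function name `find_atlas_files` and the mapping dictionary are hypothetical and are not taken from the checkatlas code base; the real detection may apply further checks (for example, not every .h5 file is a CellRanger output).

```python
from pathlib import Path

# Extension-to-type mapping taken from the README text; the names are illustrative.
ATLAS_EXTENSIONS = {".rds": "seurat", ".h5ad": "scanpy", ".h5": "cellranger"}


def find_atlas_files(search_folder):
    """Recursively collect candidate atlas files, grouped by atlas type."""
    found = {kind: [] for kind in ATLAS_EXTENSIONS.values()}
    for path in sorted(Path(search_folder).rglob("*")):
        kind = ATLAS_EXTENSIONS.get(path.suffix.lower())
        if kind is not None and path.is_file():
            found[kind].append(path)
    return found
```

Calling `find_atlas_files("search_folder/")` would return a dictionary with one list of paths per atlas type.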

  • All basic QC (nRNA, nFeature, ratio_mito)
  • General information (nbcells, nbgenes, nblayers)
  • All elements in atlas files (obs, obsm, uns, var, varm)
  • Reductions (pca, umap, tsne)
  • All metrics (clustering, annotation, dimreduction, specificity)

All tables and figures are saved in the checkatlas_files folder inside your search folder.

A single html report is produced, using MultiQC, in checkatlas_files/Checkatlas-MultiQC.html.

Checkatlas workflow

Examples

  • Evaluate and compare different scanpy atlases: Example 1

  • Evaluate different versions of one atlas: Example 2

  • Evaluate Scanpy, Seurat and CellRanger objects in your folder: Example 3

  • Evaluate an integrated Scanpy atlas with the corresponding raw CellRanger atlases: Example 4

  • Evaluate different CellRanger atlases with multiple chemistry versions and CellRanger versions: Example 5

Installation

CheckAtlas comes in two parts: the checkatlas Python module, which can be installed from PyPI, and the checkatlas workflow, which can be downloaded with Nextflow.

pip install checkatlas
nextflow pull becavin-lab/nf-core-checkatlas

You also need to install a version of MultiQC with checkatlas capability (for the moment). This version of MultiQC is available on the checkatlas branch of github.com:becavin-lab/MultiQC.

git clone git@github.com:becavin-lab/MultiQC.git
cd MultiQC/
git checkout checkatlas
pip install .

Finally, checkatlas comes with rpy2 to interface between Python and R, but it does not automatically install Seurat. If you want to screen Seurat atlases, you need to perform this last installation step:

% R
> install.packages('Seurat')
> library(Seurat)

Development

This project is in a very early development phase. All help is welcome. Please contact us or submit an issue.

Read the CONTRIBUTING.md file.

Checkatlas has two repositories:

It has a module on MultiQC

The checkatlas package is available on PyPI

The bioconda recipe has been submitted

This project was developed thanks to the project template: https://github.com/rochacbruno/python-project-template/

checkatlas's People

Contributors

drbecavin, paolaporracciolo

Forkers

ryan2han

checkatlas's Issues

Parse additional cellranger outputs

Thank you for the nice looking tool. Would you consider adding to the multiQC reports parsing outputs from additional cellranger files? I'm thinking the metrics_summary.csv and (less easily) the web_summary.html for each sample. The .csv files could be aggregated into a table with basic quality scores, mapping metrics etc. The web_summary.html file is harder to parse I guess, but ideally could capture the knee plots of UMI counts for a quick comparison across samples.

Best wishes, Chris
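The aggregation of metrics_summary.csv files suggested in this issue could be sketched as follows. This is not part of checkatlas: the function name `aggregate_metrics_summaries` is hypothetical, and the sketch assumes each metrics_summary.csv holds a single row of metrics, with sample labels derived from the folder path.

```python
from pathlib import Path

import pandas as pd


def aggregate_metrics_summaries(root):
    """Stack every metrics_summary.csv found under `root` into one table."""
    root = Path(root)
    rows = []
    for csv_path in sorted(root.rglob("metrics_summary.csv")):
        # Assumption: the folder path relative to `root` identifies the sample.
        label = csv_path.parent.relative_to(root).as_posix()
        # thousands="," turns values such as "1,234" into integers;
        # percentage strings such as "95.3%" are left untouched.
        row = pd.read_csv(csv_path, thousands=",").iloc[0].rename(label)
        rows.append(row)
    return pd.DataFrame(rows)
```

The resulting table has one row per sample and one column per metric, which is the shape a MultiQC general-statistics table would typically need.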

QC figures too big

The QC figures are too big for the MultiQC HTML file.
Reduce their size when they are produced in checkatlas by:
atlas.create_qc_plots(adata, atlas_path, atlas_info, fig_path)

Add SpatialData management

  • Add detection of spatial data
  • Add fast screening of SpatialData QC metrics
  • Display spatial data in MultiQC

Add Kruskal stress calculation

Add knee plots

Hi @drbecavin,

this package looks great! We are currently considering adding it to the nf-core scrnaseq workflow (see nf-core/scrnaseq#80).

One feature I'd love to see are knee plots for QC metrics. I find them superior over the current violin plots for finding inflection points and they would also be easy to render for many samples simultaneously.

In particular, I think the following plots would be useful:

  • cell rank vs. total counts
  • cell rank vs. detected genes
  • cell rank vs. mitochondrial fraction

Here's an example from the cellranger report

image

Here's another example from some custom python script I usually use for single cell QC with scanpy:

image
(y-axis= cell rank, n_genes_by_counts = number of detected genes, red lines indicate cutoffs I chose)

The knee plots could be (as opposed to the violin plots) easily combined into a single, interactive MultiQC figure. This helps identify low-quality outliers when working with many single-cell samples. Here's an example of such a plot from the nf-core/rnaseq MultiQC report:

image
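The rank-versus-value curves requested in this issue can be computed in a few lines. The sketch below is illustrative only; `knee_curve` is a hypothetical helper, not a checkatlas or scanpy function. It takes any per-cell QC vector (total counts, detected genes, mitochondrial fraction) and returns the coordinates of the curve.

```python
import numpy as np


def knee_curve(values):
    """Return (rank, value) coordinates with cells sorted from highest to lowest value."""
    sorted_values = np.sort(np.asarray(values, dtype=float))[::-1]
    ranks = np.arange(1, sorted_values.size + 1)
    return ranks, sorted_values
```

Plotting `ranks` against `sorted_values` on logarithmic axes gives the usual knee plot, and overlaying one curve per sample gives the multi-sample comparison described above.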

Create github with all metrics

Create a separate GitHub repository named singlecell-metrics to host all metrics and document them outside of checkatlas.

Goal: increased visibility!

Add test datasets

Add the list of datasets from scPermut to the checkatlas tutorial.
Use the cellxgene tools for that!

Dimensionality reduction metrics

Add metrics for dim reduction analysis

  • Categorize local and global estimation metrics
  • Choose atlas with "specific" umap and tsne.
  • Add mammuth test tool from "The spurious art of ..."
  • Implement benchmark
  • Add human estimation on good and bad UMAP
  • compare metrics to human estimation

Dim reduction metric management for seurat object

Describe the bug
This does not work because the sparse matrix in Python cannot be converted to R.

To Reproduce
Run the Kruskal stress calculation.

To fix
Implement the distance calculation in R and return the distance matrix, not the count matrix.
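For reference, a stress computation on dense arrays can be sketched as below. This is an assumption-laden illustration rather than the checkatlas implementation: it computes a raw stress from Euclidean distances (without the monotone regression step used in Kruskal's non-metric MDS), and the names `pairwise_distances` and `kruskal_stress` are hypothetical.

```python
import numpy as np


def pairwise_distances(points):
    """Dense Euclidean distance matrix; memory grows with the square of the cell count."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def kruskal_stress(high_dim, low_dim):
    """sqrt(sum (d_high - d_low)^2 / sum d_high^2) over all pairs of cells."""
    d_high = pairwise_distances(high_dim)
    d_low = pairwise_distances(low_dim)
    upper = np.triu_indices(d_high.shape[0], k=1)  # count each pair once
    numerator = ((d_high[upper] - d_low[upper]) ** 2).sum()
    denominator = (d_high[upper] ** 2).sum()
    return float(np.sqrt(numerator / denominator))
```

A value of 0 means the embedding preserves all pairwise distances exactly. The result is undefined when all high-dimensional points coincide, since the denominator is then zero.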

Seurat import

Is your feature request related to a problem? Please describe.
For the moment, the user has to manually install Seurat in their environment, or use only conda to install checkatlas.

Describe the solution you'd like
It would be nice to install Seurat with pip (but this seems impossible).

UMAP with seurat_clusters does not display clusters

When seurat_clusters is present, it is displayed as a numerical variable instead of a categorical one.
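One possible fix is to cast the column to a categorical type before plotting, since Scanpy plotting functions generally color categorical columns as discrete groups. The sketch below works on a plain pandas DataFrame standing in for `adata.obs`; the helper name `as_categorical_clusters` is hypothetical.

```python
import pandas as pd


def as_categorical_clusters(obs, key="seurat_clusters"):
    """Return a copy of `obs` in which `key` holds string categories instead of numbers."""
    out = obs.copy()
    # Casting to str first gives labels such as "0", "1", "2".
    # Note that the resulting categories are ordered lexicographically.
    out[key] = out[key].astype(str).astype("category")
    return out
```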

Integration with nf-core pipelines

Hi @drbecavin,

I am an active contributor to the nf-core project and have been working on the scRNA-seq and spatialtranscriptomics pipelines in the past. For both pipelines, we are considering integrating checkatlas to generate MultiQC reports (see nf-core/scrnaseq#80 and nf-core/spatialvi#40).

From what I understood, the checkatlas architecture is rather complex, consisting of

  • a python library that takes a h5ad object and computes various QC metrics
  • a nextflow workflow that executes the different parts of the python library via CLI wrappers. The Nextflow workflow itself is wrapped in another Python CLI script.
  • a MultiQC module that reads the outputs of this workflow to generate a report
  • an R script to convert Seurat to h5ad.

To integrate checkatlas in one of our pipelines, we need to define a nextflow module that takes h5ad files as input, and generates files that can be ingested by a downstream MultiQC process. In addition we need a standalone container including all required dependencies (see also #25).

While it would be totally possible to create a container that contains both the Python dependencies, nextflow+java and R dependencies it seems a bit convoluted to run a nextflow workflow that starts a docker container that runs a python script that runs a nextflow workflow that runs another python script. It's also suboptimal in terms of resource management, because the checkatlas-nextflow running in the container cannot make use of the cluster/cloud scheduler the "outer" nextflow pipeline was configured to run with.

From our perspective, it would be better to separate the python library from the nextflow workflow in checkatlas. That way we could have a lightweight container for the python part, and build a "checkatlas" nextflow (sub)workflow that can be integrated in both pipelines. If necessary, conversion from Seurat to h5ad would run in a separate process with a separate container -- avoiding manual installation of R packages (mitigating issues like #24). In general, I think it is best to have nextflow as the outermost layer, to let it handle all dependencies and take advantage of its flexible resource management (local vs. hpc vs cloud).

Let me know what you think!

Cheers,
Gregor

CC @fasterius @cavenel (nf-core/spatialtranscriptomics), @fmalmeida (nf-core/scrnaseq)

Multiome seurat object bad conversion

Multiome Seurat objects are not converted correctly to Scanpy objects.
Should they be converted to two or three objects, depending on the number of assays?

Tutorial for adding metrics in checkatlas

Update and fix the protocol for adding metrics in checkatlas

Create a tutorial that clearly defines the steps for adding a metric:

  • Define which group it belongs to
  • Add the metric to the checkatlas code
  • Add brief documentation (where?)
  • Add a wiki page about the metric to the checkatlas doc

Improve Mito and ribo QC calculation

For some atlases, the following code:

# mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

in atlas.create_qc_tables(adata, atlas_path, atlas_info, args)

does not work because the gene annotation is in Ensembl format (ENSG0000...) or another format.
This needs to be fixed by adding MT and ribo annotation!

Or should an issue be opened directly in scanpy?
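One possible direction, sketched under the assumption that gene symbols are available in a separate var column when var_names hold Ensembl IDs: the helper name `flag_mito_ribo` and the column name `gene_symbols` are hypothetical, and the function operates on a plain pandas DataFrame standing in for `adata.var`. Matching is case-insensitive so that mouse-style symbols (mt-Nd1, Rps6) are also caught.

```python
from typing import Optional

import pandas as pd


def flag_mito_ribo(var: pd.DataFrame, symbol_col: Optional[str] = None) -> pd.DataFrame:
    """Return a copy of `var` with boolean 'mt' and 'ribo' columns."""
    # Use the given symbol column if provided, otherwise fall back to the index.
    symbols = var[symbol_col] if symbol_col is not None else var.index.to_series()
    symbols = symbols.astype(str)
    out = var.copy()
    # Same prefixes as the original snippet, matched case-insensitively.
    out["mt"] = symbols.str.match(r"(?i)^mt-").to_numpy()
    out["ribo"] = symbols.str.match(r"(?i)^rp[sl]").to_numpy()
    return out
```

With Ensembl IDs in the index and no symbol column, every gene is flagged False, which reproduces the problem described above. Passing `symbol_col` restores the expected flags. Prefix matching remains a heuristic and can produce false positives such as RPS6KA1.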

Automatic/Semi-automatic search of celltype key

When searching for the obs feature that describes the cell type, more obs keywords need to be supported.
For the moment, everything is in

atlas.OBS_CLUSTERS = [
    "CellType",
    "celltype",
    "seurat_clusters",
    "orig.ident",
]

This needs to be fixed!

Maybe add an argument to the checkatlas software so that the user can tell us which obs key is the right one?
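A semi-automatic search along these lines might look like the sketch below. It is only an illustration: `find_cluster_keys` and its `user_keys` parameter are hypothetical and do not exist in checkatlas. User-supplied keys take priority; otherwise the obs columns are matched case-insensitively against the default list.

```python
OBS_CLUSTERS = ["CellType", "celltype", "seurat_clusters", "orig.ident"]


def find_cluster_keys(obs_columns, user_keys=None, candidates=OBS_CLUSTERS):
    """Pick the obs columns to treat as cell type or cluster annotations."""
    columns = list(obs_columns)
    if user_keys:
        # Explicit keys from the user win, but they must exist in obs.
        missing = [key for key in user_keys if key not in columns]
        if missing:
            raise KeyError(f"obs keys not found: {missing}")
        return list(user_keys)
    wanted = {candidate.lower() for candidate in candidates}
    return [column for column in columns if column.lower() in wanted]
```

The `user_keys` value could come from a command-line argument, so that atlases with unusual column names can still be processed.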

Implement metrics object

Implement a metric calculation structure for 4 types of metrics

  • count distribution
  • clustering
  • specificity
  • dimensionality reduction
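One way such a structure could be organized is a small registry keyed by metric name, with each metric assigned to one of the four groups above. Everything in this sketch (the `Metric` and `MetricRegistry` classes and the group identifiers) is hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four groups listed above, written as identifiers (names are illustrative).
METRIC_GROUPS = ("count_distribution", "clustering", "specificity", "dimreduction")


@dataclass(frozen=True)
class Metric:
    name: str
    group: str
    func: Callable[..., float]


class MetricRegistry:
    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, name: str, group: str, func: Callable[..., float]) -> None:
        """Add a metric, rejecting unknown groups and duplicate names."""
        if group not in METRIC_GROUPS:
            raise ValueError(f"unknown metric group: {group}")
        if name in self._metrics:
            raise ValueError(f"metric already registered: {name}")
        self._metrics[name] = Metric(name, group, func)

    def by_group(self, group: str) -> List[Metric]:
        return [m for m in self._metrics.values() if m.group == group]

    def compute(self, name: str, *args, **kwargs) -> float:
        return self._metrics[name].func(*args, **kwargs)
```

Adding a new metric then reduces to a single `register` call, which also makes the group of each metric explicit.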
