oscar-franzen / alona

analysis of single cell RNA sequencing data and cell type annotation

License: GNU General Public License v3.0

Python 97.07% R 0.29% Perl 2.65%

alona's Introduction

What is "alona", exactly?

alona (yes, spelled with a lowercase 'a') is a scientific analysis software for high-dimensional gene expression data from single cells, so-called scRNA-seq data. The typical input dataset to alona contains thousands of columns and rows. Each column represents one biological cell and rows are measured features (in our case, genes). alona automates almost all steps in the analysis of the input data.

Detailed description

alona is a Python-based software for analysis of scientific data generated using a technology called single cell RNA sequencing (scRNA-seq) (1). alona is command-line centered and performs normalization, quality control, clustering, cell type annotation and visualization. alona integrates many state-of-the-art algorithms in scRNA-seq analysis into a single convenient tool. In comparison with other scRNA-seq analysis pipelines, alona is designed as a command-line tool and not primarily as an importable library. Nevertheless, alona is modular and specific functions can be imported into existing Python projects. The goal of alona is to facilitate fast and consistent single cell data analysis without requiring new code to be written; at the same time, alona is written in portable and easily modifiable Python code, which makes it easy to adapt to new projects. Running alona to analyze scRNA-seq data is simple, fast and requires very little tweaking, as most parameters have sensible defaults. The clustering method used in alona is similar to that of the R package Seurat; i.e., it first computes k nearest neighbors in PCA space and then generates a shared nearest neighbor graph. alona uses the Leiden algorithm (an improved version of the Louvain algorithm) to identify tightly connected communities in the graph.
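
The first two clustering steps described above (k nearest neighbors in PCA space, then a shared nearest neighbor graph weighted by the Jaccard index) can be sketched roughly as follows. This is an illustration only, not alona's implementation: `snn_edges` is a hypothetical helper, the brute-force distance computation is only suitable for tiny inputs, and the Leiden step is left out. The default values for `k` and `prune` mirror the `--nn_k` and `--prune_snn` defaults.

```python
import numpy as np

def snn_edges(pcs, k=10, prune=0.067):
    """Return {(i, j): jaccard} for cell pairs whose kNN sets overlap."""
    pcs = np.asarray(pcs, dtype=float)
    n = pcs.shape[0]
    # Brute-force pairwise Euclidean distances (small demo data only).
    diff = pcs[:, None, :] - pcs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Neighbour set of each cell: its k closest cells, itself included.
    order = np.argsort(dist, axis=1)[:, :k]
    neigh = [set(row.tolist()) for row in order]
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neigh[i] & neigh[j])
            if shared == 0:
                continue
            jaccard = shared / len(neigh[i] | neigh[j])
            # Edges with a Jaccard index below the threshold are dropped.
            if jaccard >= prune:
                edges[(i, j)] = jaccard
    return edges

# Two well-separated groups of three cells in a 2-component "PCA" space.
pcs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(snn_edges(pcs, k=3))
```

The resulting weighted edge list is the kind of graph that a community detection method such as Leiden would then partition.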

alona also exists as a parallel cloud-based service (2), which contains a subset of the features of the standalone version.

Is alona a replacement for my favorite scRNA package XYZ?

It depends on your goals. The goal of alona is to enable fast exploratory analysis of new datasets by automating as many steps as possible. alona generates data tables which can easily be loaded into your analysis toolkit of choice, where more in-depth statistical analysis can take place.

An example

Installation

Requirements

Linux or MacOS (should in principle work on Windows too, but it has not been tested)
Python (version >= 3.6)

Dependencies

alona relies heavily on numpy, pandas, matplotlib, scipy and others. Complete list of dependencies (missing dependencies are installed if pip3 is used for installation, see below): click, matplotlib, numpy, pandas, scipy, scikit-learn, leidenalg, umap-learn, statsmodels, igraph, and seaborn.
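
To see which of the listed dependencies are already importable in the current environment, something like the following could be used. It is a minimal sketch and not part of alona; `missing_modules` is a hypothetical helper, and the import names (for example `sklearn` for scikit-learn and `umap` for umap-learn) are the usual ones for these packages.

```python
import importlib.util

# Package name as listed above -> module name used in `import`.
DEPENDENCIES = {
    "click": "click",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "pandas": "pandas",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "leidenalg": "leidenalg",
    "umap-learn": "umap",
    "statsmodels": "statsmodels",
    "igraph": "igraph",
    "seaborn": "seaborn",
}

def missing_modules(modules=DEPENDENCIES):
    """Return the package names whose module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]

print("Missing:", missing_modules() or "none")
```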

Install using git and pip3

The fastest way to install alona is to first clone the GitHub repository and then use pip to install it. pip is a package manager for Python. If you don't have pip installed, it can be installed by the following command on Debian-based systems (e.g. Ubuntu):

sudo apt-get install python3-pip

Then simply...

# Clone the repository
git clone https://github.com/oscar-franzen/alona/

# Enter the directory
cd alona

# Install the package
python3 -m pip install . --user

# Clean
cd ..
rm -rfv alona

Input data files

Species

alona works on scRNA-seq data from any organism. However, cell type prediction methods only work with human and mouse data. If your organism is neither mouse nor human, use the flag --species other to indicate this.

FASTQ files

Pre-processing (mapping, read counting, etc) of FASTQ files is not a step included in alona, because processing of FASTQ files is usually performed on high performance clusters whereas the analysis can be performed on a laptop or desktop computer. We provide a short tutorial on how to preprocess FASTQ files, please see here.

Formats

The input file is a single gene expression matrix in plain text format. The header row of the matrix contains the barcodes and the first column contains the gene symbols. Fields should be separated by tabs, commas or spaces (but not a mix). The file can be compressed with zip, gzip or bzip2. Alternatively, data can be in Matrix Market format (a format popular in NCBI GEO), consisting of three files (one file for the actual data values, a second file for barcodes and a third file for gene symbols), which must be bundled together in a tar file (optionally compressed with gzip).
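
As a rough illustration of the plain-text format, the sketch below opens an optionally compressed matrix file and guesses the field separator from its header line. It is not how alona detects the delimiter; `open_matrix` and `guess_delimiter` are hypothetical helpers, zip archives and Matrix Market bundles are not handled, and picking the most frequent candidate character is an assumption made for this example.

```python
import bz2
import gzip

def open_matrix(path):
    """Open a plain, gzip- or bzip2-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")

def guess_delimiter(line):
    """Pick the candidate separator that occurs most often in one line."""
    candidates = ["\t", ",", " "]
    counts = {d: line.count(d) for d in candidates}
    best = max(candidates, key=lambda d: counts[d])
    if counts[best] == 0:
        raise ValueError("no tab, comma or space found in line")
    return best

# Header line of a tab-separated matrix: first field, then cell barcodes.
print(repr(guess_delimiter("gene\tAAACCTG\tAAACGGG")))
```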

Note on ERCC Spikes

ERCC spikes can be included and will be automatically detected and handled. Make sure ERCC "genes" are labeled with the prefix ERCC_ or ERCC-.
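
The prefix rule can be illustrated with a few lines of Python. This is a sketch of the naming convention only, not alona's detection code, and `split_ercc` is a hypothetical helper. Note that a gene such as ERCC1 does not match, because it lacks the `_` or `-` after the prefix.

```python
def split_ercc(gene_ids):
    """Separate ERCC spike-in identifiers from ordinary gene identifiers."""
    prefixes = ("ERCC_", "ERCC-")
    spikes = [g for g in gene_ids if g.startswith(prefixes)]
    genes = [g for g in gene_ids if not g.startswith(prefixes)]
    return genes, spikes

genes, spikes = split_ercc(["Actb", "ERCC-00002", "ERCC_00003", "ERCC1"])
print(genes)   # ['Actb', 'ERCC1']
print(spikes)  # ['ERCC-00002', 'ERCC_00003']
```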

Usage example

Here is one example of calling the pipeline using the data set GSM3689776.

python3 -m alona \
        --species human \
        --embedding tSNE \
        --hvg_n 1000 \
        --leiden_res 0.1 \
        --output ~/Temp/test \
        --header yes \
        --minexpgenes 0.001 \
        --remove_mito yes GSM3689776_mouse_10X_matrix.txt.gz

Output files

Here is an example of the directory structure and files created by alona in the output directory test.

rand[~/alona/python-alona/test]> tree
.
├── csvs
│   ├── clusters_leiden.csv
│   ├── CTA_RANK_F
│   │   ├── cell_type_pred_best.txt
│   │   └── cell_type_pred_full_table.txt
│   ├── embeddings.csv
│   ├── highly_variable_genes.csv
│   ├── Mahalanobis.csv
│   ├── median_exp.csv
│   ├── pca.csv
│   ├── snn_graph.csv
│   └── SVM
│       ├── SVM_cell_type_pred_best.txt
│       └── SVM_cell_type_pred_full_table.txt
├── input.mat
├── input.mat.C
├── normdata_ERCC.joblib
├── normdata.joblib
├── plots
│   ├── 2d_plot_tSNE.pdf
│   ├── barplot_ge.pdf
│   └── barplot_rrc.pdf
├── settings.txt
└── unmappable.txt

4 directories, 20 files

All command line options

rand[~/]> python3 -m alona --help
Usage: alona.py [OPTIONS] FILENAME

Options:
  -o, --output TEXT               Specify name of output directory
  -df, --dataformat [raw|rpkm|log2]
                                  How the input data have been processed. For
                                  example, if data are raw read counts, then
                                  select "raw".  [default: raw]
  -mr, --minreads INTEGER         Minimum number of reads per cell to keep the
                                  cell.  [default: 1000]
  -mg, --minexpgenes FLOAT        Pre-filter the data matrix and remove genes
                                  according to this threshold. Can be
                                  specified either as a fraction of all cells
                                  or as an an integer (translates to the
                                  absolute number of cells that at minimum
                                  must express the gene).  [default: 0.01]
  --qc_auto BOOLEAN               Automatic filtering of low quality cells.
                                  [default: True]
  --mrnafull                      Data come from a full-length protocol, such
                                  as SMART-seq2.  [default: False]
  --exclude_gene TEXT             Remove any gene matching this regular
                                  expression.
  -d, --delimiter [auto|tab|space]
                                  Data delimiter. The character used to
                                  separate data values. The default setting is
                                  to autodetect this character.  [default:
                                  auto]
  -h, --header [auto|yes|no]      Data has a header line. The default setting
                                  is to autodetect if a header is present or
                                  not.  [default: auto]
  -m, --remove_mito [yes|no]      Remove mitochondrial genes from analysis
                                  [default: no]
  --hvg [seurat|Brennecke2013|scran|Chen2016|M3Drop_smartseq2|M3Drop_UMI]
                                  Method to use for identifying highly
                                  variable genes.  [default: seurat]
  --hvg_n INTEGER                 Number of top highly variable genes to use.
                                  [default: 1000]
  --pca [irlb|regular]            PCA method to use.  [default: irlb]
  --pca_n INTEGER                 Number of PCA components to use.  [default:
                                  75]
  --nn_k INTEGER                  k in the nearest neighbour search.
                                  [default: 10]
  --prune_snn FLOAT               Threshold for pruning the SNN graph, i.e.
                                  the edges with lower value (Jaccard index)
                                  than this will be removed. Set to 0 to
                                  disable pruning. Increasing this value will
                                  result in fewer edges in the graph.
                                  [default: 0.067]
  --leiden_partition [RBERVertexPartition|ModularityVertexPartition]
                                  Partitioning algorithm to use. Can be
                                  RBERVertexPartition or
                                  ModularityVertexPartition.  [default:
                                  RBERVertexPartition]
  --leiden_res FLOAT              Resolution parameter for the Leiden
                                  algorithm (0-1).  [default: 0.8]
  --ignore_small_clusters INTEGER
                                  Ignore clusters with fewer or equal to N
                                  cells.  [default: 10]
  --annotations PATH              An optional file containing gene
                                  descriptions. The first column in this file
                                  must be gene identifiers and the second
                                  column is any string. The two columns should
                                  be separated by a tab character. No header
                                  is allowed. Gene symbols must match gene
                                  symbols of the input data matrix.
  --custom_clustering PATH        An optional file containing a pre-generated
                                  clustering. This option can be used if
                                  clustering has already been performed
                                  externally. The file should contain two
                                  columns, delimited by a tab character,
                                  without header. The first column should
                                  contain cell identifiers and the second
                                  column should contain the cluster.
  --embedding [tSNE|UMAP]         Method used for data projection. Can be
                                  either tSNE or UMAP.  [default: tSNE]
  --perplexity INTEGER            The perplexity parameter in the t-SNE
                                  algorithm.  [default: 30]
  -s, --species [human|mouse|other]
                                  Species your data comes from.  [default:
                                  mouse]
  --dark_bg                       Use dark background in scatter plots.
                                  [default: False]
  --de_direction [any|up|down]    Direction for differential gene expression
                                  analysis.
  --add_celltypes TEXT            Add markers for these additional cell types
                                  to the heatmap plot. Separate multiple cell
                                  types with commas.
  --overlay_genes TEXT            Generate scatter plots in 2d space (using
                                  method specified by --embedding), where gene
                                  expression is overlaid on cells. Specify
                                  multiple genes by comma separating gene
                                  symbols.
  --highlight_specific_cells TEXT
                                  Specific cells can be highlighted in scatter
                                  plots in 2d space (using method specified by
                                  --embedding). Specify multiple cells by
                                  comma separating cell identifiers (usually
                                  barcodes).
  --violin_top INTEGER            Generate violin plots for the specified
                                  number of top expressed genes per cluster.
                                  [default: 10]
  --timestamp                     Add timestamp label to plots.  [default:
                                  False]
  -lf, --logfile TEXT             Name of log file. Set to /dev/null if you
                                  want to disable logging to a file.
                                  [default: alona.log]
  -ll, --loglevel [regular|debug]
                                  Set how much runtime information is written
                                  to the log file.  [default: regular]
  -n, --nologo                    Hide the logo.
  --seed INTEGER                  Set seed to get reproducible results.
  --version                       Display version number.
  --help                          Show this message and exit.
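
The `--minexpgenes` threshold listed above accepts either a fraction of cells or an absolute number of cells. A minimal sketch of how such a threshold could be interpreted is given below. It is not alona's code: `filter_genes` is a hypothetical helper, and it assumes that values below 1 are fractions, that values of 1 or more are cell counts, and that a gene counts as expressed in a cell when its value is greater than zero.

```python
import numpy as np

def filter_genes(counts, minexpgenes=0.01):
    """Return a boolean mask of genes (rows) that pass the threshold."""
    counts = np.asarray(counts)
    n_cells = counts.shape[1]
    # Assumption: < 1 means "fraction of all cells", >= 1 means "number of cells".
    if minexpgenes < 1:
        min_cells = minexpgenes * n_cells
    else:
        min_cells = minexpgenes
    # Number of cells in which each gene has a non-zero value.
    expressing = (counts > 0).sum(axis=1)
    return expressing >= min_cells

# Rows are genes, columns are cells.
counts = [[0, 0, 0, 0],
          [1, 0, 0, 0],
          [2, 3, 0, 1]]
print(filter_genes(counts, 0.5))  # at least half of the cells
print(filter_genes(counts, 1))    # at least one cell
```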

Detailed help for command line options

option	detailed description
`-o, --output [TEXT]`	Specify the name of the output directory. If this is not given, a directory named alona_out_N, where N is an 8-letter random string, will be created in the current working directory.
`-df, --dataformat [raw\|rpkm\|log2]`	Specifies how the input data have been normalized. There are currently three options: `raw` means input data are raw read counts (alona will take care of normalization steps); `rpkm` means input data are normalized as RPKM but not logarithmized, and alona will not perform any further normalization except for taking the logarithm; `log2` means that input data have been normalized and logarithmized and alona will not perform these steps. Default: raw
`--minexpgenes [float\|int], -mg`	Pre-filter the data matrix and remove genes according to this threshold. Can be specified either as a fraction of all cells or as an integer (translates to the absolute number of cells that at minimum must express the gene).
`--mrnafull`	Data come from a full-length protocol, such as SMART-seq2. This option is important if data represent full mRNAs. Drop-seq/10X and similar protocols sequence the ENDS of an mRNA, so it is not necessary to normalize for gene LENGTH. However, if the complete mRNA is sequenced, measurements must also be normalized for the length of the gene, since longer genes have more mapped reads. If this option is not set, cell type prediction may give unexpected results when analyzing full-length mRNA data. Default: False
`--hvg [method]`	Method used to identify highly variable genes (HVG); must be one of: seurat, Brennecke2013, scran, Chen2016, M3Drop_smartseq2, or M3Drop_UMI. `seurat` is the method implemented in the Seurat R package (3). It bins genes according to average expression, then calculates dispersion for each bin as the variance-to-mean ratio. Within each bin, Z-scores are calculated and returned. Z-scores are ranked and the top N are selected. `Brennecke2013` refers to the method proposed by Brennecke et al (4). `Brennecke2013` estimates and fits technical noise using RNA spikes (technical genes) by fitting a generalized linear model with a gamma function and identity link and the parameterization w = a_1/u + a_0. It then uses a chi2 distribution to test the null hypothesis that the squared coefficient of variation does not exceed a certain minimum. FDR<0.10 is considered significant. Currently, `Brennecke2013` uses all the genes to estimate noise. `scran` fits a polynomial regression model to technical noise by modeling the variance versus mean gene expression relationship of ERCC spikes (the original method used local regression) (5). It then decomposes the variance of each biological gene by subtracting the technical variance component and returning the biological variance component. `Chen2016` (6) uses linear regression, subsampling, polynomial fitting and Gaussian maximum likelihood estimates to derive a set of HVG. `M3Drop_smartseq2` models the dropout rate and mean expression using the Michaelis-Menten equation to identify HVG (7). `M3Drop_smartseq2` works well with SMART-seq2 data but not with UMI data; the former are often sequenced to saturation, so zeros are more likely to be dropouts than the result of unsaturated sequencing. `M3Drop_UMI` is the corresponding M3Drop method for UMI data. Default: `seurat`
`--pca [irlb\|regular]`	The PCA method to use. Does not have a big impact on the results. The number of components to use is specified with the `--pca_n` flag (default is the first 75).
`--hvg_n [int]`	Number of highly variable genes to use. If the method is `Brennecke2013`, then `--hvg_n` determines how many of the significant genes will be used. Default: 1000
`--qc_auto [True\|False]`	Automatically filters low quality cells using five quality metrics and Mahalanobis distances. Three standard deviations from the mean is considered an outlier and will be removed. Default: True
`--embedding [tSNE\|UMAP]`	The method used to project the data to a 2d space. Only used for visualization purposes. t-SNE is more commonly used in scRNA-seq analysis. UMAP may be better at preserving the global structure of the data. Default: tSNE
`--seed [int]`	Set a seed for the random number generator. This setting is used to generate plots and results that are numerically identical. Algorithms such as t-SNE and Fast Truncated Singular Value Decomposition need random numbers. Setting a seed guarantees that the random numbers are the same across sessions.
`--overlay_genes [TEXT]`	Can be used to specify one or more genes for which gene expression will be overlaid on the 2d embedding. The option is useful for examining the expression of individual genes in relation to clusters and cell types. Multiple genes can be given by separating them with comma. If multiple genes are specified, one plot will be generated for each gene.
`--highlight_specific_cells [TEXT]`	Sometimes it can be useful to highlight where a specific cell is falling on the 2d embedding. This option is used to highlight such cells in the scatter plot. Cell identifiers refer to those present in the header of the data matrix. Multiple cell identifiers can be entered separated by commas.
`--violin_top [int]`	Generates violin plots for the top genes of every cluster. The argument specifies how many of the top expressed genes of every cluster are included. "Top" is defined by ranking on the mean within every cluster.
`--timestamp`	Adds a small timestamp to the bottom left corner of every plot. Can be useful when sharing plots in order to distinguish different versions.
`--exclude_gene [TEXT]`	Sometimes we want to exclude certain genes from the analysis. For example tRNA genes or rRNA. This flag can be used to specify a regular expression pattern, which will be matched to the input data and the corresponding genes excluded.
`--annotations [PATH]`	Use this flag to specify a file containing gene annotations. The file should contain two tab-separated columns: one for the genes and one for the annotations. Gene annotations will be added as an additional column in the differential expression analysis files. This option can be useful if the genome uses systematic gene identifiers rather than gene symbols.
`--de_direction [any\|up\|down]`	Specifies the direction of the differential gene expression analysis. Default is `up`, because usually we want to find genes that are more expressed in one cluster compared to the other.
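
The outlier rule described for `--qc_auto` in the table above (Mahalanobis distances on a set of per-cell quality metrics, with a three standard deviation cutoff) could look roughly like this. It is a sketch under stated assumptions, not alona's implementation: `mahalanobis_outliers` is a hypothetical helper, the metrics matrix is made up because the five metrics are not named here, and the cutoff is taken to be the mean distance plus three standard deviations of the distances.

```python
import numpy as np

def mahalanobis_outliers(metrics, n_sd=3.0):
    """Return (distances, outlier mask) for a cells x metrics matrix."""
    x = np.asarray(metrics, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    # Pseudo-inverse keeps the sketch usable if metrics are collinear.
    inv_cov = np.linalg.pinv(cov)
    # Quadratic form (x - mean)^T S^-1 (x - mean), one value per cell.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    d = np.sqrt(np.clip(d2, 0.0, None))
    # Assumption: cells beyond mean + n_sd * SD of the distances are outliers.
    cutoff = d.mean() + n_sd * d.std()
    return d, d > cutoff

# 200 ordinary cells and one extreme cell, five made-up quality metrics.
rng = np.random.default_rng(0)
metrics = np.vstack([rng.normal(size=(200, 5)), np.full((1, 5), 50.0)])
distances, is_outlier = mahalanobis_outliers(metrics)
print(is_outlier[-1])
```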

Differential gene expression analysis

A common goal is to define genes that are differentially expressed between cell clusters. alona implements linear models for DE discovery similar to the R package limma. DE analysis is performed by default. A linear model y~x is fitted between gene expression and clusters, and t statistics and p-values are calculated for the coefficients. P-values are two-sided if the direction is set to any (otherwise one-sided). The final output of the DE analysis is written to three tables:

1. csvs/all_t_tests.csv

This file contains p-values in a matrix format. The number of rows is equal to the number of input genes. Number of columns is equal to the number of comparisons. The column header contains the performed comparisons, e.g. 1_vs_0 indicates that cluster 1 is compared to cluster 0.

2. csvs/all_t_tests_long.tsv

This is in long format, separated by tabs, and is generated for easy filtering based on selected cutoffs. Every row corresponds to one hypothesis test. Columns correspond to:

Column	What it is
comparison_A_vs_B	The comparison that was performed.
gene	Gene that was tested.
p_val	P-value of the test from the t statistic.
FDR	False Discovery Rate based on the Benjamini-Hochberg procedure.
t_stat	t statistic
logFC	log fold change
mean.A	mean expression of the gene in cluster A
mean.B	mean expression of the gene in cluster B
annotations	this column is only added if the `--annotations` flag is specified

3. csvs/markers.tsv

Contains marker genes by combining p-values of the pairwise tests. P-values are combined with Simes' method.
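
The two multiple-testing procedures named in this section, the Benjamini-Hochberg FDR adjustment and Simes' method for combining p-values, are standard and can be written in a few lines. The sketch below shows the textbook formulas; it is not taken from alona's source, and how alona groups the pairwise tests before combining them is not shown here. `benjamini_hochberg` and `simes` are hypothetical helper names.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def simes(pvals):
    """Combine p-values with Simes' method: min over i of n * p_(i) / i."""
    n = len(pvals)
    ranked = sorted(pvals)
    return min(1.0, min(p * n / (i + 1) for i, p in enumerate(ranked)))

# Four p-values, e.g. one gene tested in four pairwise comparisons.
pvals = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(pvals))
print(simes(pvals))
```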

Bugs

Please file a bug report through Github.

Contact and Support

Limited support is available through e-mail:

Oscar Franzen [email protected]

Cite

A manuscript has been submitted.

License

GPLv3

alona's Issues

The GSM3689776 dataset in the example seems not working with alona

Hi, I can run alona successfully with my own data, so I find it quite odd that alona fails with the example case. It seems that GSM3689776 is from mouse rather than human, so I changed the species parameter to mouse, but it still throws errors at the CTA_RANK_F step. The row values within each column of the median_expr_Z matrix all seem to be the same, and ret is an empty list. Here is the error message:

/home/z013/miniconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
return (a < x) & (x < b)
/home/z013/miniconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
return (a < x) & (x < b)
/home/z013/miniconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:1827: RuntimeWarning: invalid value encountered in greater_equal
cond2 = (x >= np.asarray(_b)) & cond0
/home/z013/miniconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= _a)
Traceback (most recent call last):
File "/home/z013/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/z013/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/z013/.local/lib/python3.7/site-packages/alona/__main__.py", line 9, in <module>
main()
File "/home/z013/.local/lib/python3.7/site-packages/alona/__main__.py", line 5, in main
run(prog_name='alona.py')
File "/home/z013/miniconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/z013/miniconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/z013/miniconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/z013/miniconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/z013/.local/lib/python3.7/site-packages/alona/alona.py", line 187, in run
alonacell.analysis()
File "/home/z013/.local/lib/python3.7/site-packages/alona/cell.py", line 439, in analysis
self.CTA_RANK_F(marker_plot=True)
File "/home/z013/.local/lib/python3.7/site-packages/alona/celltypes.py", line 177, in CTA_RANK_F
_df = pd.DataFrame(ret[k].to_list()) #refactored
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/series.py", line 910, in __getitem__
return self._get_with(key)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/series.py", line 958, in _get_with
return self.loc[key]
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1768, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1954, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1595, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1547, in _get_listlike_indexer
indexer = ax.get_indexer_for(key)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4501, in get_indexer_for
return self.get_indexer(target, **kwargs)
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2729, in get_indexer
target, method=method, limit=limit, tolerance=tolerance
File "/home/z013/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2753, in get_indexer
indexer = self._engine.get_indexer(target._ndarray_values)
File "pandas/_libs/index.pyx", line 287, in pandas._libs.index.IndexEngine.get_indexer
File "pandas/_libs/hashtable_class_helper.pxi", line 1668, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'dict'

Installation warning

Hi,

I wanted to install but got a permission issue. See output below. This is not necessarily an issue, but I use brew to install a lot of the packages, like Python3. Why is the permission needed?

Best,

Sander

git clone https://github.com/oscar-franzen/alona/
Cloning into 'alona'...
remote: Enumerating objects: 164, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 2235 (delta 115), reused 101 (delta 52), pack-reused 2071
Receiving objects: 100% (2235/2235), 32.77 MiB | 9.53 MiB/s, done.
Resolving deltas: 100% (1549/1549), done.

swvanderlaan@Sanders-MBP ~/git
$ cd alona/

swvanderlaan@Sanders-MBP ~/git/alona
$ pip3 install .
Processing /Users/swvanderlaan/git/alona
Collecting click>=7.0 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
     |████████████████████████████████| 81kB 3.9MB/s
Collecting matplotlib>=3.0.3 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/c3/8b/af9e0984f5c0df06d3fab0bf396eb09cbf05f8452de4e9502b182f59c33b/matplotlib-3.1.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (14.4MB)
     |████████████████████████████████| 14.4MB 14.0MB/s
Requirement already satisfied: numpy>=1.16.3 in /usr/local/lib/python3.7/site-packages (from alona-oscarf==0.1) (1.16.4)
Requirement already satisfied: pandas>=0.24.2 in /usr/local/lib/python3.7/site-packages (from alona-oscarf==0.1) (0.24.2)
Requirement already satisfied: scipy>=1.2.1 in /usr/local/lib/python3.7/site-packages (from alona-oscarf==0.1) (1.3.0)
Requirement already satisfied: scikit-learn>=0.21.0 in /usr/local/lib/python3.7/site-packages (from alona-oscarf==0.1) (0.21.2)
Collecting leidenalg>=0.7.0 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/b6/cc/d76baf78a3924ba6093a3ce8d14e2289f1d718bd3bcbb8252bb131d12daa/leidenalg-0.7.0.tar.gz (92kB)
     |████████████████████████████████| 102kB 9.5MB/s
Collecting umap-learn>=0.3.9 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/ad/92/36bac74962b424870026cb0b42cec3d5b6f4afa37d81818475d8762f9255/umap-learn-0.3.10.tar.gz (40kB)
     |████████████████████████████████| 40kB 19.5MB/s
Collecting statsmodels>=0.9.0 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/b5/b8/50f9b86bbd87b1de961f439c2b93dfc41dd0cb9d65f6b7d824b287b50b21/statsmodels-0.10.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (10.5MB)
     |████████████████████████████████| 10.5MB 6.0MB/s
Collecting python-igraph>=0.7.1 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/0f/a0/4e7134f803737aa6eebb4e5250565ace0e2599659e22be7f7eba520ff017/python-igraph-0.7.1.post6.tar.gz (377kB)
     |████████████████████████████████| 378kB 24.6MB/s
Collecting seaborn>=0.9.0 (from alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl (208kB)
     |████████████████████████████████| 215kB 4.6MB/s
Collecting kiwisolver>=1.0.1 (from matplotlib>=3.0.3->alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/df/93/8bc9b52a8846be2b9572aa0a7c881930939b06e4abe1162da6a0430b794f/kiwisolver-1.1.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (113kB)
     |████████████████████████████████| 122kB 26.1MB/s
Collecting cycler>=0.10 (from matplotlib>=3.0.3->alona-oscarf==0.1)
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=3.0.3->alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/11/fa/0160cd525c62d7abd076a070ff02b2b94de589f1a9789774f17d7c54058e/pyparsing-2.4.2-py2.py3-none-any.whl (65kB)
     |████████████████████████████████| 71kB 25.8MB/s
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/site-packages (from matplotlib>=3.0.3->alona-oscarf==0.1) (2.8.0)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.7/site-packages (from pandas>=0.24.2->alona-oscarf==0.1) (2019.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/site-packages (from scikit-learn>=0.21.0->alona-oscarf==0.1) (0.13.2)
Collecting numba>=0.37 (from umap-learn>=0.3.9->alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/bb/0d/db1d84ec79b223dedb72d7f1823a77797494348fea4b1809d92affea720a/numba-0.45.1-cp37-cp37m-macosx_10_9_x86_64.whl (1.9MB)
     |████████████████████████████████| 1.9MB 28.2MB/s
Collecting patsy>=0.4.0 (from statsmodels>=0.9.0->alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/ea/0c/5f61f1a3d4385d6bf83b83ea495068857ff8dfb89e74824c6e9eb63286d8/patsy-0.5.1-py2.py3-none-any.whl (231kB)
     |████████████████████████████████| 235kB 12.1MB/s
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=3.0.3->alona-oscarf==0.1) (41.0.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from cycler>=0.10->matplotlib>=3.0.3->alona-oscarf==0.1) (1.12.0)
Collecting llvmlite>=0.29.0dev0 (from numba>=0.37->umap-learn>=0.3.9->alona-oscarf==0.1)
  Downloading https://files.pythonhosted.org/packages/d7/45/9216cdbf71b94ae8eefe24cfbdf4c1b9045a58f297c6e2eab5cb8d05faf3/llvmlite-0.29.0-cp37-cp37m-macosx_10_9_x86_64.whl (15.9MB)
     |████████████████████████████████| 15.9MB 9.7MB/s
Building wheels for collected packages: alona-oscarf, leidenalg, umap-learn, python-igraph
  Building wheel for alona-oscarf (setup.py) ... done
  Stored in directory: /private/var/folders/kc/c5rw4cb94c1c149gsm2ygfc00000gn/T/pip-ephem-wheel-cache-343trjkl/wheels/3e/d9/a1/924c5305a0d16ca57576f982b5f166408578d9034b85d44012
  Building wheel for leidenalg (setup.py) ... done
  Stored in directory: /Users/swvanderlaan/Library/Caches/pip/wheels/29/55/48/5a04693a10f50297bcda23819ca23ab3470a61dd911851c8bd
  Building wheel for umap-learn (setup.py) ... done
  Stored in directory: /Users/swvanderlaan/Library/Caches/pip/wheels/d0/f8/d5/8e3af3ee957feb9b403a060ebe72f7561887fef9dea658326e
  Building wheel for python-igraph (setup.py) ... done
  Stored in directory: /Users/swvanderlaan/Library/Caches/pip/wheels/41/d6/02/34eebae97e25f5b87d60f4c0687e00523e3f244fa41bc3f4a7
Successfully built alona-oscarf leidenalg umap-learn python-igraph
Installing collected packages: click, kiwisolver, cycler, pyparsing, matplotlib, python-igraph, leidenalg, llvmlite, numba, umap-learn, patsy, statsmodels, seaborn, alona-oscarf
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/local/setup.cfg'
Consider using the `--user` option or check the permissions.

ValueError: DataFrame constructor not properly called!

Hi there,

Thank you for providing such a great tool for scRNA-seq analysis. When I run it on the GSM3689776 data, an error occurs and the run cannot finish. The output is:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/.local/lib/python3.7/site-packages/alona/__main__.py", line 9, in <module>
    main()
  File "/.local/lib/python3.7/site-packages/alona/__main__.py", line 5, in main
    run(prog_name='alona.py')
  File "/.local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.local/lib/python3.7/site-packages/alona/alona.py", line 187, in run
    alonacell.analysis()
  File "/.local/lib/python3.7/site-packages/alona/cell.py", line 439, in analysis
    self.CTA_RANK_F(marker_plot=True)
  File "/.local/lib/python3.7/site-packages/alona/celltypes.py", line 177, in CTA_RANK_F
    _df = pd.DataFrame(k)
  File "/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 528, in __init__
    raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!
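
For anyone hitting the same message: pandas raises it when `pd.DataFrame(...)` is handed something that is not tabular, such as a bare string or scalar. The traceback does not show what `k` held in `CTA_RANK_F`, so the snippet below is only a minimal sketch of how the error can arise, not a diagnosis of alona. The helper `try_dataframe` and the example values are hypothetical.

```python
import pandas as pd


def try_dataframe(obj):
    """Build a DataFrame from obj, or return the error text if pandas refuses.

    Hypothetical helper for illustration; it is not part of alona.
    """
    try:
        return pd.DataFrame(obj)
    except ValueError as err:
        return str(err)


# A bare string is not tabular, so the constructor raises ValueError.
result_scalar = try_dataframe("not tabular")
print(result_scalar)  # DataFrame constructor not properly called!

# A list of records (made-up values) is accepted without complaint.
result_records = try_dataframe([{"cluster": 0, "score": 1.5}])
print(type(result_records).__name__)
```

Printing `type(k)` and `repr(k)` just before the failing line in `celltypes.py` would likely show which kind of object reached the constructor for this dataset.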

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Hi,

alona seems to be a very good tool for scRNA-seq analysis. When I run it on the GSM3689776 data, there is a problem in the QC step. The output is:

INFO 3,013 columns detected.
INFO 28,692 genes detected
INFO 224 gene(s) were not mappable.
INFO These have been written to unmappable.txt
INFO Current dimensions: (28468, 3013)
INFO detected and removed 13 mitochondrial gene(s)
INFO Keeping 2961 out of 3013 cells
INFO Removing 15,314 genes.
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/x/alona/alona/__main__.py", line 7, in <module>
    main()
  File "/home/x/alona/alona/__main__.py", line 4, in main
    run(prog_name='alona.py')
  File "/opt/miniconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/miniconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/miniconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/x/alona/alona/alona.py", line 218, in run
    alonacell.load_data()
  File "/home/x/alona/alona/cell.py", line 424, in load_data
    self.find_low_quality_cells()
  File "/home/x/alona/alona/cell.py", line 334, in find_low_quality_cells
    robust_cov = MinCovDet(random_state=seed).fit(qc_mat)
  File "/opt/miniconda3/lib/python3.7/site-packages/sklearn/covariance/robust_covariance.py", line 639, in fit
    X = check_array(X, ensure_min_samples=2, estimator='MinCovDet')
  File "/opt/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/opt/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
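
The traceback shows scikit-learn rejecting `qc_mat` because at least one entry is NaN or infinite, but it does not say which cells are affected or why. A minimal sketch for locating such rows before the `MinCovDet` fit is given below. The function `report_non_finite` and the toy matrix are hypothetical, and the real layout of `qc_mat` in alona is assumed here to be one row per cell.

```python
import numpy as np


def report_non_finite(qc_mat):
    """Return the indices of rows that contain NaN or +/-inf.

    Hypothetical helper; assumes one row per cell, one column per QC metric.
    """
    arr = np.asarray(qc_mat, dtype=float)
    row_is_finite = np.isfinite(arr).all(axis=1)
    return np.where(~row_is_finite)[0]


# Toy QC matrix with made-up values: 4 cells, 2 metrics each.
qc_mat = np.array([
    [3.2, 0.05],
    [np.nan, 0.10],  # a metric that could not be computed
    [2.9, np.inf],   # e.g. the result of a division by zero
    [3.0, 0.07],
])

bad = report_non_finite(qc_mat)
print("rows with non-finite values:", bad.tolist())

# Dropping the offending rows leaves a matrix that passes the finiteness check.
clean = np.delete(qc_mat, bad, axis=0)
print("remaining shape:", clean.shape)
```

Whether dropping such cells is the right fix depends on how the metrics are computed; inspecting the flagged rows first would show whether they come from empty cells or from a transformation step.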
