Coder Social home page Coder Social logo

theislab / diffxpy Goto Github PK

View Code? Open in Web Editor NEW
181.0 6.0 23.0 1.75 MB

Differential expression analysis for single-cell RNA-seq data.

Home Page: https://diffxpy.rtfd.io

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
differential-expression single-cell-rna-seq gene-set-enrichment bioinformatics transcriptomics tensorflow

diffxpy's Introduction

PyPI Docs

Fast and scalable differential expression analysis on single-cell RNA-seq data

diffxpy covers a wide range of differential expression analysis scenarios encountered in single-cell RNA-seq scenarios and integrates into scanpy workflows. Import diffxpy as import diffxpy.api as de to access the following differential expression analysis-related tools:

  1. differential expression analysis in the module de.test.*
  2. gene set enrichment analysis based on differential expression calls in the module de.enrich.*

Refer to the documentation and the tutorials for details of these modules.

diffxpy's People

Contributors

adkinsrs avatar davidsebfischer avatar dsm-72 avatar gokceneraslan avatar hoeze avatar zethson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

diffxpy's Issues

coefficient standard deviation in summary has wrong shape

File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/stats.py", line 186, in wald_test
raise ValueError('stats.wald_test(): theta_mle and theta_sd have to contain the same number of entries')
ValueError: stats.wald_test(): theta_mle and theta_sd have to contain the same number of entries

theta_sd has length of parameters, instead this should be the parameter which is tested x genes, so should be length genes

de.two_sample input data bug

import os
import diffxpy.base as de
from batchglm.api.models.nb_glm import Simulator, Estimator, InputData
import abc
import logging
from typing import Union, Dict, Tuple, List, Set
import time
import pandas as pd
import xarray as xr
import numpy as np

sim = Simulator(num_observations=1000, num_features=1000)
sim.generate_sample_description(num_confounders=0, num_batches=2)
sim.generate()
start = time.time()
res = de.two_sample(data=sim.input_data, grouping="batch", noise_model="nb")
end = time.time()

yields:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 1035, in two_sample
    X = _parse_data(data, gene_names)
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 568, in _parse_data
    X = xr.DataArray(data, dims=("observations", "features"))
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/dataarray.py", line 228, in __init__
    variable = Variable(dims, data, attrs, encoding, fastpath=True)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/variable.py", line 263, in __init__
    self._dims = self._parse_dimensions(dims)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/variable.py", line 424, in _parse_dimensions
    % (dims, self.ndim))
ValueError: dimensions ('observations', 'features') must have the same length as the number of data dimensions, ndim=0

I get a similar bug when feeding in an adata object and with test_lrt.

selection of noise model in base.py

I m not sure whether the choice of noise model via an import statement within the function is so clean. Try something along the lines of creating an instance within a switch and then call the TF session code on this instance, this should be independent of the type of model? The hyperparameters might be model specific, you could also define them as variables within the switch.

Divide model fitting of pairwise and vsrest into separate run

Right now this is divided into one fit per group. Add optional fit of a single model which fits all dispersion parameters in one go (means are MLE anyway). We will use this for runtime benchmarking to make sure we dont loose too much on overheads here. Expose option to fit separate or together to user.

diagnostic plots

add function diagnostic plots to DE-single output objects

  1. volcano plot: plot lfc vs log pvalue
  2. scatter inferred lfc vs observed lfc (ie log ratio of averages) for two sample test (just dont compute this ratio for other tests so that this plotting function returns None)

de.test.design_matrix() not compatible with adata.obs

function call:
dmat = de.test.design_matrix(sample_description=adata.obs, formula='~1+condition+sample+chip')

Error:

TypeError Traceback (most recent call last)
in ()
----> 1 dmat = de.test.design_matrix(sample_description=adata_counts.obs, formula='~1+condition+sample+chip')

~/github_packages/diffxpy/diffxpy/testing/base.py in design_matrix(data, sample_description, formula, dmat)
1349 if dmat is None:
1350 sample_description = _parse_sample_description(data, sample_description)
-> 1351 dmat = data_utils.design_matrix(sample_description=sample_description, formula=formula)
1352
1353 return dmat

~/github_packages/batchglm/batchglm/data.py in design_matrix(sample_description, formula, as_categorical, return_type)
125 sample_description[col] = sample_description[col].astype("category")
126
--> 127 dmat = patsy.highlevel.dmatrix(formula, sample_description)
128
129 if return_type == "dataframe":

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None

~/anaconda3/lib/python3.6/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:

~/anaconda3/lib/python3.6/site-packages/patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
446 cat_sniffers[factor] = CategoricalSniffer(NA_action,
447 factor.origin)
--> 448 done = cat_sniffers[factor].sniff(value)
449 if done:
450 examine_needed.remove(factor)

~/anaconda3/lib/python3.6/site-packages/patsy/categorical.py in sniff(self, data)
196 # fastpath to avoid doing an item-by-item iteration over boolean
197 # arrays, as requested by #44
--> 198 if hasattr(data, "dtype") and safe_issubdtype(data.dtype, np.bool_):
199 self._level_set = set([True, False])
200 return True

~/anaconda3/lib/python3.6/site-packages/patsy/util.py in safe_issubdtype(dt1, dt2)
679 if safe_is_pandas_categorical_dtype(dt1):
680 return False
--> 681 return np.issubdtype(dt1, dt2)
682
683 def test_safe_issubdtype():

~/anaconda3/lib/python3.6/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2

TypeError: data type not understood

add a multitest mapreduce operation

Allow user to set up a test such as a Wald test on the effect of condition in a model ~1+condition and broadcast this across a range of subset in each of which this test is supposed to be carried our for each gene. We could for example define a class "mapreduce_single" (or similar) which takes the data set and defines the subsetting in the constructor. This class would be handed the data object and the observation meta data vector which conains assignments to louvain groups. Secondly, this class contains a wrapper for each of the test_single methods which calls this test on each gene and each subset of the data, yielding number of subsets many p-values and a test_multi object. This would simplify such tasks for the user and we could generate better overview/enrichment information for such scenarios.

I think we can keep model fitting compartmentalised in the subsets for now, in principle one could fit one global model for the scenarios based on wald tests (no intercept!), this is a bit similar to how test_pairwise is fit.

Enable leaving dispersion at MME

It would be good to give the user the option to not model variance beyond the MME. We could trigger this when design_scale="~0". One use case would be two-sample tests: We can directly compute the model in closed form if we leave the dispersion at the MME, which we means that this is pretty much the same speed as t-test and that the z-test migth even be faster than that. Here it might also make sense to check out the closed form hessian so that we never have to call the TensorFlow backend. It would definitely be good to have this as a benchmark together with T-test and full MLE Wald / z-test.

two sample validate pvalues via t-test

add function compare_to_ttest() to output object which were generated with two_sample which calls two sample again with a t-test and plots a scatter with pvalues of given method (ie wald) vs t-test pvalues. this would be good as an additional convergence guarantee that everybody can easily do within one line.

update examples

Examples code /diffxpy/diffxpy/examples doesn't work, likely related to input data bug #3

properties of _Estimation should be fisher_loc_inv and fisher_scale_inv

I.e. the inverse of the negative hessian and not the negative hessian. Need this as BFGS directly approximates the inv of the hessian. I already changed DifferentialExpressionTestWald._test() to assume that fisher_loc contains the inverse, still have to change this in batchglm and rename this property.

benchmarks

for each N_cells in c(1e4, 1e5, 1e6): run time of wilcoxon, t-test, wald(TF), wald(bfgs)
for each N_cells in c(1e4, 1e5, 1e6): run time of wilcoxon vs_rest, t-test vs_rest, wald(TF) vs_rest, fast_wald(TF)

lrt with dispersion model = ~1

This yields an error, likely because the dispersion model isnt set up correctly?

sim.generate_sample_description(num_batches=0, num_confounders=0)
sim.generate()


logging.getLogger("tensorflow").setLevel(logging.WARNING)
logging.getLogger("batchglm").setLevel(logging.INFO)
logging.getLogger("diffxpy").setLevel(logging.INFO)

random_sample_description = pd.DataFrame({
    "condition": np.random.randint(2, size=sim.num_observations)
})

training_strategy = [{
	"learning_rate": 1e-4,
	"convergence_criteria": "t_test",
	"stop_at_loss_change": 1e-8,
	"loss_window_size": 20,
	'use_batching': False, 
	'optim_algo': 'GD'
}]
test = de.test_lrt(
    data = sim.X, 
    full_formula_loc="~ 1 + condition", 
    reduced_formula_loc="~ 1", 
    full_formula_scale="~ 1", 
    reduced_formula_scale="~ 1", 
    sample_description=random_sample_description,
    training_strategy=training_strategy
)

This is the full model estimation output:

Estimating model...
Using closed-form MLE initialization for mean
Using closed-form MME initialization for dispersion
Traceback (most recent call last):
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1605, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 2. Shapes are [1,1000] and [2,1000]. for 'output/b' (op: 'Select') with input shapes: [2,1000], [1,1000], [1,1000].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 803, in test_lrt
    training_strategy=training_strategy,
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 676, in _fit
    estim = test_model.Estimator(input_data=input_data, init_model=init_model, **constructor_args)
  File "/Users/david.fischer/gitDevelopment/batchglm/batchglm/train/tf/nb_glm/estimator.py", line 811, in __init__
    extended_summary=extended_summary
  File "/Users/david.fischer/gitDevelopment/batchglm/batchglm/train/tf/nb_glm/estimator.py", line 504, in __init__
    name="b"
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2608, in where
    return gen_math_ops.select(condition=condition, x=x, y=y, name=name)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 6931, in select
    "Select", condition=condition, t=x, e=y, name=name)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3259, in create_op
    op_def=op_def)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1769, in __init__
    control_input_ops)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1608, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 2. Shapes are [1,1000] and [2,1000]. for 'output/b' (op: 'Select') with input shapes: [2,1000], [1,1000], [1,1000].```

Make /diffxpy/base.py functional

Work on TODOs that I left in the code such:

  1. Think about reducing redundant code, in particular design matrix building and model fitting.
  2. Check out new DifferentialExpressionTest meta class.
  3. Improve api for many to many tests (not DifferentialExpressionTest).
  4. Build sgdlgm interface that allows separate location and scale parameter models. This is the python naming, for nb loc=mean und scale=dispersion model.
  5. Check out function annotation: I think this is comprehensive, this is the level of detail that we need for user exposed functions.

wilcoxon exception all numbers are the same

Scipy throws an error in this case, this should be caught by the wrapper in the daughter of DifferentialExpressionTestSingle.

`/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in pairwise(data, grouping, test, gene_names, sample_description, noise_model, pval_correction, batch_size, training_strategy, keep_full_test_objs, **kwargs)
1632 batch_size=batch_size,
1633 training_strategy=training_strategy,
-> 1634 **kwargs
1635 )
1636 pvals[i, j] = de_test_temp.pval

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in two_sample(data, grouping, test, gene_names, sample_description, noise_model, batch_size, training_strategy, **kwargs)
1461 data=X,
1462 gene_names=gene_names,
-> 1463 grouping=grouping,
1464 )
1465 else:

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description)
1311 data=data,
1312 grouping=grouping,
-> 1313 gene_ids=gene_names,
1314 )
1315

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in init(self, data, grouping, gene_ids)
607 x0, x1 = _split_X(data, grouping)
608
--> 609 self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
610 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
611 q = self.qval

/opt/conda/lib/python3.6/site-packages/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals

/opt/conda/lib/python3.6/site-packages/diffxpy/stats/stats.py in (.0)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals

/opt/conda/lib/python3.6/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
4895 T = tiecorrect(ranked)
4896 if T == 0:
-> 4897 raise ValueError('All numbers are identical in mannwhitneyu')
4898 sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
4899

ValueError: All numbers are identical in mannwhitneyu`

Add gene ID conversion

Allow user to map between ID classes, such as ENSG and HGNC. Base this on input tables that user has to supply as this changes fairly recently and those changes should not be confounded with diffxpy versioning.

source example jupyter notebooks out in separate repo?

They cause a lot of volume in the commits which is annoying. Secondly, they automatically write if the notebook server is running when a branch is switched, I fear this may cause some issues in the future. Opinions? They also have that for scanpy, I assume for similar reasons? @flying-sheep

t-test fails with indexing bug if gene_names are not given

`~/gitDevelopment/diffxpy/diffxpy/testing/base.py in t_test(data, grouping, gene_names, sample_description)
1532 grouping=grouping,
1533 gene_ids=gene_names,
-> 1534 )
1535
1536 return de_test

~/gitDevelopment/diffxpy/diffxpy/testing/base.py in init(self, data, grouping, gene_ids)
594 idx_tt = np.where(np.logical_and(self._ave_geq_zero == True, self._var_geq_zero == True))[0]
595 pval = np.zeros([self._gene_ids.shape[0]]) + np.nan
--> 596 pval[idx_tt] = stats.t_test_raw(x0=x0[:, idx_tt], x1=x1[:, idx_tt])
597 self._pval = pval
598 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data

IndexError: tuple index out of range`

adata wrapper: issue with xarray conversion


ValueError Traceback (most recent call last)
in ()
----> 1 de.test.pairwise(adata, test='z-score',grouping='louvain')

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in pairwise(data, grouping, test, gene_names, sample_description, noise_model, pval_correction, batch_size, training_strategy, keep_full_test_objs, **kwargs)
1565 # Do not store all models but only p-value and q-value matrix:
1566 # genes x groups x groups
-> 1567 X = _parse_data(data, gene_names)
1568 gene_names = _parse_gene_names(data, gene_names)
1569 sample_description = _parse_sample_description(data, sample_description)

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in _parse_data(data, gene_names)
893
894 def _parse_data(data, gene_names):
--> 895 X = data_utils.xarray_from_data(data, dims=("observations", "features"))
896 if gene_names is not None:
897 X.coords["features"] = gene_names

/opt/conda/lib/python3.6/site-packages/batchglm/data.py in xarray_from_data(data, dims)
65 if scipy.sparse.issparse(data.X):
66 X = _sparse_to_xarray(data.X, dims=dims)
---> 67 X.coords[dims[0]] = data.obs_names
68 X.coords[dims[1]] = data.var_names
69 else:

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in setitem(self, key, value)
23
24 def setitem(self, key, value):
---> 25 self.update({key: value})
26
27 @Property

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in update(self, other)
91 coords = merge_coords([self.variables, other_vars],
92 priority_arg=1, indexes=self.indexes)
---> 93 self._update_coords(coords)
94
95 def _merge_raw(self, other):

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in _update_coords(self, coords)
234 dims = calculate_dimensions(coords_plus_data)
235 if not set(dims) <= set(self.dims):
--> 236 raise ValueError('cannot add coordinates with new dimensions to '
237 'a DataArray')
238 self._data._coords = coords

ValueError: cannot add coordinates with new dimensions to a DataArray

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.