The diffxpy from theislab

Add option to change between batch and full gradient

Needed for perfect convergence.

pvalue distribution under null model for fast_wald in test_vs_rest is not uniform

Check whether test statistics is correct.

coefficient standard deviation in summary has wrong shape

File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/stats.py", line 186, in wald_test
raise ValueError('stats.wald_test(): theta_mle and theta_sd have to contain the same number of entries')
ValueError: stats.wald_test(): theta_mle and theta_sd have to contain the same number of entries

theta_sd has length of parameters, instead this should be the parameter which is tested x genes, so should be length genes

de.two_sample input data bug

import os
import diffxpy.base as de
from batchglm.api.models.nb_glm import Simulator, Estimator, InputData
import abc
import logging
from typing import Union, Dict, Tuple, List, Set
import time
import pandas as pd
import xarray as xr
import numpy as np

sim = Simulator(num_observations=1000, num_features=1000)
sim.generate_sample_description(num_confounders=0, num_batches=2)
sim.generate()
start = time.time()
res = de.two_sample(data=sim.input_data, grouping="batch", noise_model="nb")
end = time.time()

yields:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 1035, in two_sample
    X = _parse_data(data, gene_names)
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 568, in _parse_data
    X = xr.DataArray(data, dims=("observations", "features"))
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/dataarray.py", line 228, in __init__
    variable = Variable(dims, data, attrs, encoding, fastpath=True)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/variable.py", line 263, in __init__
    self._dims = self._parse_dimensions(dims)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/xarray/core/variable.py", line 424, in _parse_dimensions
    % (dims, self.ndim))
ValueError: dimensions ('observations', 'features') must have the same length as the number of data dimensions, ndim=0

I get a similar bug when feeding in an adata object and with test_lrt.

selection of noise model in base.py

I m not sure whether the choice of noise model via an import statement within the function is so clean. Try something along the lines of creating an instance within a switch and then call the TF session code on this instance, this should be independent of the type of model? The hyperparameters might be model specific, you could also define them as variables within the switch.

Divide model fitting of pairwise and vsrest into separate run

Right now this is divided into one fit per group. Add optional fit of a single model which fits all dispersion parameters in one go (means are MLE anyway). We will use this for runtime benchmarking to make sure we dont loose too much on overheads here. Expose option to fit separate or together to user.

clean up how FDR correction type is handled

Check that arguments are passed correctly and that this is retained (also in DETestSingle) if it is changed once.

diagnostic plots

add function diagnostic plots to DE-single output objects

volcano plot: plot lfc vs log pvalue
scatter inferred lfc vs observed lfc (ie log ratio of averages) for two sample test (just dont compute this ratio for other tests so that this plotting function returns None)

merge to master once no_lrt_release is achieved

This will be the alpha release if we cannot fix lrt until then.

Change return type of pairwise and vsrest p-values

Currently it's not easy to check which coefficients were tested in which field:

de.test.design_matrix() not compatible with adata.obs

function call:
dmat = de.test.design_matrix(sample_description=adata.obs, formula='~1+condition+sample+chip')

Error:

TypeError Traceback (most recent call last)
in ()
----> 1 dmat = de.test.design_matrix(sample_description=adata_counts.obs, formula='~1+condition+sample+chip')

~/github_packages/diffxpy/diffxpy/testing/base.py in design_matrix(data, sample_description, formula, dmat)
1349 if dmat is None:
1350 sample_description = _parse_sample_description(data, sample_description)
-> 1351 dmat = data_utils.design_matrix(sample_description=sample_description, formula=formula)
1352
1353 return dmat

~/github_packages/batchglm/batchglm/data.py in design_matrix(sample_description, formula, as_categorical, return_type)
125 sample_description[col] = sample_description[col].astype("category")
126
--> 127 dmat = patsy.highlevel.dmatrix(formula, sample_description)
128
129 if return_type == "dataframe":

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,

~/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None

~/anaconda3/lib/python3.6/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:

~/anaconda3/lib/python3.6/site-packages/patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
446 cat_sniffers[factor] = CategoricalSniffer(NA_action,
447 factor.origin)
--> 448 done = cat_sniffers[factor].sniff(value)
449 if done:
450 examine_needed.remove(factor)

~/anaconda3/lib/python3.6/site-packages/patsy/categorical.py in sniff(self, data)
196 # fastpath to avoid doing an item-by-item iteration over boolean
197 # arrays, as requested by #44
--> 198 if hasattr(data, "dtype") and safe_issubdtype(data.dtype, np.bool_):
199 self._level_set = set([True, False])
200 return True

~/anaconda3/lib/python3.6/site-packages/patsy/util.py in safe_issubdtype(dt1, dt2)
679 if safe_is_pandas_categorical_dtype(dt1):
680 return False
--> 681 return np.issubdtype(dt1, dt2)
682
683 def test_safe_issubdtype():

~/anaconda3/lib/python3.6/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2

TypeError: data type not understood

add multiple testing correction

add a multitest mapreduce operation

Allow user to set up a test such as a Wald test on the effect of condition in a model ~1+condition and broadcast this across a range of subset in each of which this test is supposed to be carried our for each gene. We could for example define a class "mapreduce_single" (or similar) which takes the data set and defines the subsetting in the constructor. This class would be handed the data object and the observation meta data vector which conains assignments to louvain groups. Secondly, this class contains a wrapper for each of the test_single methods which calls this test on each gene and each subset of the data, yielding number of subsets many p-values and a test_multi object. This would simplify such tasks for the user and we could generate better overview/enrichment information for such scenarios.

I think we can keep model fitting compartmentalised in the subsets for now, in principle one could fit one global model for the scenarios based on wald tests (no intercept!), this is a bit similar to how test_pairwise is fit.

make multi-test results more accessible

Define a specific summary function. Also adapt getters to pull pairs directly #32

Clean up output object of scipy.optimize.minimize and adapt to style used in _Estimation

Did some of that already, have to stop working on this now though. log_probs() doesnt work the same way yet and the try: catch: code in DifferentialExpressionTestWald is ugly.

Enable leaving dispersion at MME

It would be good to give the user the option to not model variance beyond the MME. We could trigger this when design_scale="~0". One use case would be two-sample tests: We can directly compute the model in closed form if we leave the dispersion at the MME, which we means that this is pretty much the same speed as t-test and that the z-test migth even be faster than that. Here it might also make sense to check out the closed form hessian so that we never have to call the TensorFlow backend. It would definitely be good to have this as a benchmark together with T-test and full MLE Wald / z-test.

Make per-group test fit only one model

See #35, this is currently done by fitting separate models per group.

two sample validate pvalues via t-test

add function compare_to_ttest() to output object which were generated with two_sample which calls two sample again with a t-test and plots a scatter with pvalues of given method (ie wald) vs t-test pvalues. this would be good as an additional convergence guarantee that everybody can easily do within one line.

Add continuous DE test

Use spline basis transform of contnuous covariate via http://patsy.readthedocs.io/en/latest/spline-regression.html#natural-and-cyclic-cubic-regression-splines and feed transformed feature space to batchglm so that batchglm does not know anything specific happenend.

Fix the chi^2 test convergence

temporary workaround: e418a01

update examples

Examples code /diffxpy/diffxpy/examples doesn't work, likely related to input data bug #3

properties of _Estimation should be fisher_loc_inv and fisher_scale_inv

I.e. the inverse of the negative hessian and not the negative hessian. Need this as BFGS directly approximates the inv of the hessian. I already changed DifferentialExpressionTestWald._test() to assume that fisher_loc contains the inverse, still have to change this in batchglm and rename this property.

Make stats.py comatible with xarray

If a xarray is passed, a xarray should be returned as well.

Related: #5 (comment)

reserve package name on pypi?

wald should expect inverted fisher information instead of fisher information

diffxpy/diffxpy/base.py

Line 441 in 49755fd

def _test(self):

there is a sqrt(1/fim) here, bfgs only gives inverted fim via fisher_inv, fim_diags is wrongly named and should be fim__inv_diags. _test should then assume that it is given inverted fisher information and only extract theta_sd = sqrt(fim_inv.diagonal())

add gene set enrichment functionality

add tutorial for training_strategies

benchmarks

for each N_cells in c(1e4, 1e5, 1e6): run time of wilcoxon, t-test, wald(TF), wald(bfgs)
for each N_cells in c(1e4, 1e5, 1e6): run time of wilcoxon vs_rest, t-test vs_rest, wald(TF) vs_rest, fast_wald(TF)

Consider merging intercept and slope

See theislab/batchglm#9

lrt with dispersion model = ~1

This yields an error, likely because the dispersion model isnt set up correctly?

sim.generate_sample_description(num_batches=0, num_confounders=0)
sim.generate()


logging.getLogger("tensorflow").setLevel(logging.WARNING)
logging.getLogger("batchglm").setLevel(logging.INFO)
logging.getLogger("diffxpy").setLevel(logging.INFO)

random_sample_description = pd.DataFrame({
    "condition": np.random.randint(2, size=sim.num_observations)
})

training_strategy = [{
	"learning_rate": 1e-4,
	"convergence_criteria": "t_test",
	"stop_at_loss_change": 1e-8,
	"loss_window_size": 20,
	'use_batching': False, 
	'optim_algo': 'GD'
}]
test = de.test_lrt(
    data = sim.X, 
    full_formula_loc="~ 1 + condition", 
    reduced_formula_loc="~ 1", 
    full_formula_scale="~ 1", 
    reduced_formula_scale="~ 1", 
    sample_description=random_sample_description,
    training_strategy=training_strategy
)

This is the full model estimation output:

Estimating model...
Using closed-form MLE initialization for mean
Using closed-form MME initialization for dispersion
Traceback (most recent call last):
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1605, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 2. Shapes are [1,1000] and [2,1000]. for 'output/b' (op: 'Select') with input shapes: [2,1000], [1,1000], [1,1000].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 803, in test_lrt
    training_strategy=training_strategy,
  File "/Users/david.fischer/gitDevelopment/diffxpy/diffxpy/base.py", line 676, in _fit
    estim = test_model.Estimator(input_data=input_data, init_model=init_model, **constructor_args)
  File "/Users/david.fischer/gitDevelopment/batchglm/batchglm/train/tf/nb_glm/estimator.py", line 811, in __init__
    extended_summary=extended_summary
  File "/Users/david.fischer/gitDevelopment/batchglm/batchglm/train/tf/nb_glm/estimator.py", line 504, in __init__
    name="b"
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2608, in where
    return gen_math_ops.select(condition=condition, x=x, y=y, name=name)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 6931, in select
    "Select", condition=condition, t=x, e=y, name=name)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3259, in create_op
    op_def=op_def)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1769, in __init__
    control_input_ops)
  File "/Users/david.fischer/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1608, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 2. Shapes are [1,1000] and [2,1000]. for 'output/b' (op: 'Select') with input shapes: [2,1000], [1,1000], [1,1000].```

add regress out function

Make /diffxpy/base.py functional

Work on TODOs that I left in the code such:

Think about reducing redundant code, in particular design matrix building and model fitting.
Check out new DifferentialExpressionTest meta class.
Improve api for many to many tests (not DifferentialExpressionTest).
Build sgdlgm interface that allows separate location and scale parameter models. This is the python naming, for nb loc=mean und scale=dispersion model.
Check out function annotation: I think this is comprehensive, this is the level of detail that we need for user exposed functions.

Add option to change the batch size

See theislab/batchglm#4

clean up unit tests for DE wrappers

Rewrite so that they test whether pvalue distribution under null model is uniform and report only pvalue.

de.test.design_matrix(): allow no absortion of first factor level into offset

Relevant for constraint scenarios.

wilcoxon exception all numbers are the same

Scipy throws an error in this case, this should be caught by the wrapper in the daughter of DifferentialExpressionTestSingle.

`/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in pairwise(data, grouping, test, gene_names, sample_description, noise_model, pval_correction, batch_size, training_strategy, keep_full_test_objs, **kwargs)
1632 batch_size=batch_size,
1633 training_strategy=training_strategy,
-> 1634 **kwargs
1635 )
1636 pvals[i, j] = de_test_temp.pval

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in two_sample(data, grouping, test, gene_names, sample_description, noise_model, batch_size, training_strategy, **kwargs)
1461 data=X,
1462 gene_names=gene_names,
-> 1463 grouping=grouping,
1464 )
1465 else:

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description)
1311 data=data,
1312 grouping=grouping,
-> 1313 gene_ids=gene_names,
1314 )
1315

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in init(self, data, grouping, gene_ids)
607 x0, x1 = _split_X(data, grouping)
608
--> 609 self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
610 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
611 q = self.qval

/opt/conda/lib/python3.6/site-packages/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals

/opt/conda/lib/python3.6/site-packages/diffxpy/stats/stats.py in (.0)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals

/opt/conda/lib/python3.6/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
4895 T = tiecorrect(ranked)
4896 if T == 0:
-> 4897 raise ValueError('All numbers are identical in mannwhitneyu')
4898 sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
4899

ValueError: All numbers are identical in mannwhitneyu`

Rename test_t_test method?

somewhat redundant...
Maybe test_t() is better?

base.py:test_t_test() stable to zero variance genes?

Check if this causes an error and catch if it does.

Add size_factors into wald and lrt wrappers

Communicate these to batchglm.

add warning that lrt does not work yet

when lrt is called

Add gene ID conversion

Allow user to map between ID classes, such as ENSG and HGNC. Base this on input tables that user has to supply as this changes fairly recently and those changes should not be confounded with diffxpy versioning.

source example jupyter notebooks out in separate repo?

They cause a lot of volume in the commits which is annoying. Secondly, they automatically write if the notebook server is running when a branch is switched, I fear this may cause some issues in the future. Opinions? They also have that for scanpy, I assume for similar reasons? @flying-sheep

t-test fails with indexing bug if gene_names are not given

`~/gitDevelopment/diffxpy/diffxpy/testing/base.py in t_test(data, grouping, gene_names, sample_description)
1532 grouping=grouping,
1533 gene_ids=gene_names,
-> 1534 )
1535
1536 return de_test

~/gitDevelopment/diffxpy/diffxpy/testing/base.py in init(self, data, grouping, gene_ids)
594 idx_tt = np.where(np.logical_and(self._ave_geq_zero == True, self._var_geq_zero == True))[0]
595 pval = np.zeros([self._gene_ids.shape[0]]) + np.nan
--> 596 pval[idx_tt] = stats.t_test_raw(x0=x0[:, idx_tt], x1=x1[:, idx_tt])
597 self._pval = pval
598 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data

IndexError: tuple index out of range`

check anndata and sparse matrix support in diffxpy/base/_parse_data.py

adata wrapper: issue with xarray conversion

ValueError Traceback (most recent call last)
in ()
----> 1 de.test.pairwise(adata, test='z-score',grouping='louvain')

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in pairwise(data, grouping, test, gene_names, sample_description, noise_model, pval_correction, batch_size, training_strategy, keep_full_test_objs, **kwargs)
1565 # Do not store all models but only p-value and q-value matrix:
1566 # genes x groups x groups
-> 1567 X = _parse_data(data, gene_names)
1568 gene_names = _parse_gene_names(data, gene_names)
1569 sample_description = _parse_sample_description(data, sample_description)

/opt/conda/lib/python3.6/site-packages/diffxpy/testing/base.py in _parse_data(data, gene_names)
893
894 def _parse_data(data, gene_names):
--> 895 X = data_utils.xarray_from_data(data, dims=("observations", "features"))
896 if gene_names is not None:
897 X.coords["features"] = gene_names

/opt/conda/lib/python3.6/site-packages/batchglm/data.py in xarray_from_data(data, dims)
65 if scipy.sparse.issparse(data.X):
66 X = _sparse_to_xarray(data.X, dims=dims)
---> 67 X.coords[dims[0]] = data.obs_names
68 X.coords[dims[1]] = data.var_names
69 else:

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in setitem(self, key, value)
23
24 def setitem(self, key, value):
---> 25 self.update({key: value})
26
27 @Property

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in update(self, other)
91 coords = merge_coords([self.variables, other_vars],
92 priority_arg=1, indexes=self.indexes)
---> 93 self._update_coords(coords)
94
95 def _merge_raw(self, other):

/opt/conda/lib/python3.6/site-packages/xarray/core/coordinates.py in _update_coords(self, coords)
234 dims = calculate_dimensions(coords_plus_data)
235 if not set(dims) <= set(self.dims):
--> 236 raise ValueError('cannot add coordinates with new dimensions to '
237 'a DataArray')
238 self._data._coords = coords

ValueError: cannot add coordinates with new dimensions to a DataArray

initialisatio of full model if full model consists of discrete groups

In cases where full model is also a case of 'discrete groups', change initalisation to:

init reduced with MLE+MME
train reduced (only r)
init full with with MLE+MME
train full (only r)

check whether unit_test/test_lrt.py should be deprecated and only unit_test/test_single used?

I dont mind, your choice.

theislab / diffxpy Goto Github PK

diffxpy's Introduction

Fast and scalable differential expression analysis on single-cell RNA-seq data

diffxpy's People

Contributors

Stargazers

Watchers

Forkers

diffxpy's Issues

Error:

Recommend Projects

Recommend Topics

Recommend Org