theislab / autogenes

License: MIT License

Python 5.96% Jupyter Notebook 94.04%

autogenes's Issues

AutoGeneS+ Query and Documentation

Hi Hana,

Do you have documentation for AutoGeneS+?

I optimized my single-cell data of 13 cell types and found some highly correlated cell types. I then ran the optimization on those cell types alone, which added 10 more genes to the signature matrix. I am not sure how to deconvolve the bulk data now. I was trying something along these lines: [ag.AutoGeneS(data=signature_matrix_np), ag.deconvolve(numeric_bulk.T, model='nusvr')], but ag is picking up values from the new optimization, which covers only 2 cell types.

save and restore results

Hi there,

In my hands, saving the current state with ag.save after a gen=5000 run produces huge files, the saving process never finishes, and the resulting file can't be read afterwards. So I wonder whether there is a slimmer solution. In the end I only need the list of selected genes, right? But how would I feed that to ag.deconvolve? Providing the gene names via key= does not work. Could you give directions here? I would like to try out runs with several options on a cluster and then just load the results.

Many thanks

Kristian
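
If only the gene selection is needed, one slimmer option is to store the selected gene names as plain text instead of pickling the whole object. The sketch below uses only NumPy. `gene_names` and `selection` are hypothetical stand-ins for the genes given to the optimizer and the boolean mask of the chosen solution, and no AutoGeneS function is called. Whether ag.deconvolve accepts a reloaded selection directly is not confirmed here.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical stand-ins: gene names passed to the optimizer and the
# boolean mask of the solution picked from the Pareto front.
gene_names = np.array(["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"])
selection = np.array([True, False, True, True, False])

# Save only the selected gene names, one per line.
out_file = Path(tempfile.mkdtemp()) / "selected_genes.txt"
np.savetxt(out_file, gene_names[selection], fmt="%s")

# Later (e.g. after a cluster job): reload the names and rebuild the mask.
reloaded = np.atleast_1d(np.loadtxt(out_file, dtype=str))
mask = np.isin(gene_names, reloaded)
```

The rebuilt mask can then be used to subset a signature matrix and the bulk matrix to the same genes before running any regression.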

Can AutoGeneS utilise MicroArray data also?

Thank you very much for this tool!

I was wondering if the tool can only be used to deconvolve bulk RNA-seq data or if you can also use it to deconvolve MicroArray data? Have you done any testing on MicroArray data?

Best regards, Clara

Error when loading with AutoGenes.load

Hi,

I've been having difficulties reloading ag pickle files after generating them. I've been trying to follow the "SaveLoadTest.ipynb" notebook but have the same error message each time. I'm not quite sure what's going wrong.
Thanks very much!

import numpy as np
import pandas as pd
import sys
import importlib
import pickle
from autogenes import AutoGenes

import os.path

%load_ext autoreload
%autoreload 1
%aimport autogenes

importlib.reload(autogenes)

ag = AutoGenes.load('ag.pickle')


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-412bf52ad40b> in <module>
      1 os.chdir('/home/ngr18/hcaskin/theis_decon')
----> 2 healthy_ag = AutoGenes.load('eczema_ag.pickle')

~/.local/lib/python3.6/site-packages/autogenes/core.py in load(file)
    300 
    301   def load(file):
--> 302     return pickle.load(open(file, 'rb'))
    303 
    304   #

~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in load(file, ignore, **kwds)
    268 def load(file, ignore=None, **kwds):
    269     """unpickle an object from a file"""
--> 270     return Unpickler(file, ignore=ignore, **kwds).load()
    271 
    272 def loads(str, ignore=None, **kwds):

~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in load(self)
    470 
    471     def load(self): #NOTE: if settings change, need to update attributes
--> 472         obj = StockUnpickler.load(self)
    473         if type(obj).__module__ == getattr(_main_module, '__name__', '__main__'):
    474             if not self._ignore:

~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in find_class(self, module, name)
    460             return type(None) #XXX: special case: NoneType missing
    461         if module == 'dill.dill': module = 'dill._dill'
--> 462         return StockUnpickler.find_class(self, module, name)
    463 
    464     def __init__(self, *args, **kwds):

AttributeError: Can't get attribute 'IndividualGA' on <module 'deap.creator' from '/home/ngr18/anaconda3/envs/hcaskin/lib/python3.6/site-packages/deap/creator.py'>

Versions
deap 1.3.1; python 3.6.9; cachetools 4.0.0; dill 0.3.1.1
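
The AttributeError is consistent with how pickle restores classes that were created at runtime: it stores only the module and class name, so the class must exist under that name again before loading. The sketch below reproduces the mechanism with the standard library only. `fake_creator` and `create` are hypothetical stand-ins for deap.creator and creator.create, and whether a given autogenes version recreates IndividualGA on import is not confirmed here.

```python
import pickle
import sys
import types

# Hypothetical stand-in for deap.creator: a module that holds classes
# created at runtime rather than defined in source code.
creator = types.ModuleType("fake_creator")
sys.modules["fake_creator"] = creator


def create(name, base):
    """Mimic creator.create: build a class and attach it to the module."""
    cls = type(name, (base,), {"__module__": "fake_creator"})
    setattr(creator, name, cls)
    return cls


create("IndividualGA", list)
payload = pickle.dumps(creator.IndividualGA([1, 0, 1]))

# Simulate a fresh session in which the class was never created.
delattr(creator, "IndividualGA")
try:
    pickle.loads(payload)
    load_failed = False
except AttributeError as err:
    load_failed = True
    message = str(err)

# Re-creating the class under the same name before loading resolves it.
create("IndividualGA", list)
restored = pickle.loads(payload)
```

If this is the cause, the same classes would need to be created in the new session (with the same names) before calling load.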

Error occurs in ag.init step

Hi,
I was trying AutoGeneS for the first time on a public dataset, after running the tutorial without errors.
However, an error occurs in the ag.init step.
Here is my code:

adata_norm = sc.pp.normalize_per_cell(coh1_adata, copy=True)
adata_log = sc.pp.log1p(adata_norm,copy=True)
sc.pp.highly_variable_genes(adata_log, flavor = "seurat_v3", n_top_genes=4000)
adata_proc = adata_norm[:, adata_log.var[adata_log.var['highly_variable']==True].index]
# I assume the matrix is just the average expression matrix of celltypes
res = pd.DataFrame(columns=adata_proc.var_names, index=adata_proc.obs['new_celltype'].cat.categories)
for x in adata_proc.obs.new_celltype.cat.categories:
    res.loc[x]=adata_proc[adata_proc.obs['new_celltype'].isin([x]),:].X.mean(0)

centroids_sc_hv = res.T
centroids_sc_hv.shape
ag.init(centroids_sc_hv.T)

And the error is:

TypeError Traceback (most recent call last)
/tmp/ipykernel_1304110/259709457.py in
----> 1 ag.init(res)
2 ag.optimize(ngen=5000,seed=0,nfeatures=400,mode='fixed',offspring_size=100,verbose=True)

~/miniconda3/lib/python3.9/site-packages/autogenes/interface.py in init(self, data, celltype_key, genes_key, use_highly_variable, **kwargs)
86 self.data = data.values
87 self.data_genes = data.columns.values
---> 88 self.main = AutoGeneS(self.data)
89 self.pre_selection = np.full((data.shape[1],),True)
90

~/miniconda3/lib/python3.9/site-packages/autogenes/core.py in init(self, data)
40 raise ValueError("Number of columns (genes) must be >= number of rows (cell types)")
41
---> 42 if not np.isfinite(self.data).all():
43 raise ValueError("Some entries in data are not scalars")
44

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Do you know how to fix it and what caused it?

Thanks!
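
A likely cause, judging from the code above: a DataFrame created from labels only (columns= and index= without data) has dtype object, and np.isfinite does not accept object arrays. The sketch below rebuilds such a frame with hypothetical gene and cell type labels and converts it to float64. `as_float_frame` is an illustrative helper, not part of AutoGeneS.

```python
import numpy as np
import pandas as pd

# A frame created from labels only starts out with dtype object.
res = pd.DataFrame(columns=["g1", "g2", "g3"], index=["A", "B"])
res.loc["A"] = np.array([1.0, 2.0, 3.0])
res.loc["B"] = np.array([0.5, 0.0, 4.0])


def as_float_frame(df):
    """Return a copy whose values are float64, so np.isfinite accepts them."""
    return df.astype(np.float64)


fixed = as_float_frame(res)
all_finite = bool(np.isfinite(fixed.values).all())
```

Applied to the code in this issue, that would mean passing something like res.astype(float) to ag.init.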

Recommendation for max.iter parameter

Hi,

Thanks for the nice tool :)

I am having some issues with AutoGeneS: sometimes the process does not finish (even after 5 days) on the same dataset on which it previously finished in less than 1 h. I believe this might be due to the number of iterations, with the algorithm failing to terminate for some reason. When I set the max.iter parameter to something other than -1 (the default), the issue no longer appeared.
My question now is: what value would be a reasonable choice here? I think it can be quite high; the point is just to have a limit so that AutoGeneS terminates in every circumstance.

Best,
Alex

Extracting Reference profile

Hi developers,
Thanks for a great tool!
I was wondering if you could extract the Reference Profile prior to the actual deconvolution.
I am currently trying to make the deconvolution work, however, it fails to estimate certain cell types (probably due to an insufficient estimation of marker genes). Can I extract the matrix and change it?
Best,
Peter

Do I have to start with highly variable genes?

Based on the example, it seems that AutoGeneS selects genes among highly variable genes.
But can I select genes from the whole transcriptome?

single cell reference data

Data structure - how to obtain the signature matrix?

Hi,

So, starting on a good note: very nice tool you've provided here! However, after running through it I'm struggling a bit with the data structure. Let's say I've run the modules within autogenes and have obtained my cell proportions. How can I export the signature matrix? Is it saved within the autogenes object, or would I have to remake it in another way?

Best
Mike
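
Independently of what the autogenes object stores, a signature matrix can be rebuilt from the single-cell data as per-cell-type mean expression restricted to the selected genes. The sketch below does this with pandas on hypothetical toy data. `expr`, `cell_types` and `selected_genes` are illustrative names, and the result is not guaranteed to match AutoGeneS's internal matrix.

```python
import pandas as pd

# Hypothetical toy data: 6 cells x 4 genes, plus one cell type label per cell.
expr = pd.DataFrame(
    [[5, 0, 1, 2], [7, 0, 1, 2], [0, 4, 1, 0],
     [0, 6, 1, 0], [1, 1, 9, 3], [1, 1, 7, 3]],
    columns=["g1", "g2", "g3", "g4"], dtype=float,
)
cell_types = pd.Series(["T", "T", "B", "B", "NK", "NK"])

# Centroids: mean expression per cell type (rows) and gene (columns).
centroids = expr.groupby(cell_types.values).mean()

# Hypothetical marker selection; keep only those columns.
selected_genes = ["g1", "g2", "g3"]
signature = centroids[selected_genes]
```

The resulting frame can be written out with signature.to_csv(...) for inspection or manual editing.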

Error in ag.init()

Hi again!

When I run ag.init() using the anndata object (as shown here: https://autogenes.readthedocs.io/en/latest/getting_started.html), I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1653                 blocks = [
-> 1654                     make_block(values=blocks[0], placement=slice(0, len(axes[0])))
   1655                 ]

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype)
   3046 
-> 3047     return klass(values, ndim=ndim, placement=placement)
   3048 

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2594 
-> 2595         super().__init__(values, ndim=ndim, placement=placement)
   2596 

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    124             raise ValueError(
--> 125                 f"Wrong number of items passed {len(self.values)}, "
    126                 f"placement implies {len(self.mgr_locs)}"

ValueError: Wrong number of items passed 1, placement implies 11945

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-10-a6f87f4b968d> in <module>
      5 
      6 # Initialise autogenes object
----> 7 ag.init(adata_snrna_raw, genes_key='selected', celltype_key='annotation_1')
      8 
      9 # do not run the gene selection

~/.local/lib/python3.7/site-packages/autogenes/interface.py in init(self, data, celltype_key, genes_key, use_highly_variable, **kwargs)
     68         raise ValueError(f"AnnData has no obs column '{celltype_key}'")
     69 
---> 70       self._adata = self.__compute_means(data,celltype_key)
     71       self.data_genes = data.var_names.values
     72 

~/.local/lib/python3.7/site-packages/autogenes/interface.py in __compute_means(self, adata, celltype_key)
    403     if celltype_key not in adata.obs:
    404       raise ValueError("Key not found")
--> 405     sc_means = pd.DataFrame(data=adata.X, columns=adata.var_names)
    406     sc_means['cell_types'] = pd.Series(data=adata.obs[celltype_key].values,index=sc_means.index)
    407     sc_means = sc_means.groupby('cell_types').mean()

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    486                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    487                 else:
--> 488                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    489             else:
    490                 mgr = init_dict({}, index, columns, dtype=dtype)

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    208         block_values = [values]
    209 
--> 210     return create_block_manager_from_blocks(block_values, [columns, index])
    211 
    212 

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1662         blocks = [getattr(b, "values", b) for b in blocks]
   1663         tot_items = sum(b.shape[0] for b in blocks)
-> 1664         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   1665 
   1666 

/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1692     if block_shape[0] == 0:
   1693         raise ValueError("Empty data passed with indices specified.")
-> 1694     raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
   1695 
   1696 

ValueError: Shape of passed values is (19797, 1), indices imply (19797, 11945)

I use the latest AutoGeneS version from pip.
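
The shape mismatch (19797, 1) versus (19797, 11945) is consistent with adata.X being a SciPy sparse matrix that the pandas version in the traceback does not unpack into columns. A minimal sketch of densifying before building the frame follows; `to_dense` is an illustrative helper and `x_sparse` is a hypothetical stand-in for adata.X.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_dense(x):
    """Return a dense ndarray whether x is sparse or already dense."""
    return x.toarray() if sparse.issparse(x) else np.asarray(x)


# Hypothetical stand-in for adata.X stored as a sparse matrix.
x_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
genes = ["g1", "g2", "g3"]

frame = pd.DataFrame(data=to_dense(x_sparse), columns=genes)
```

If this is the cause, converting adata.X to a dense array before calling ag.init may avoid the error, assuming the dense matrix fits in memory.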

Input data

Dear Theis lab,
Thanks for this amazing tool to deconvolve bulk data. However, I was wondering (maybe I missed it in the documentation) what kind of expression data you would suggest using as input for the bulk data. We tried baseMean expression from DESeq with mixed results. Also, how many genes would you suggest using when working with highly variable genes only?
Thank you so much

normalization of gene expression data

Hi,

Do the scRNA-seq and bulk data passed to AutoGeneS both need to be normalized using the same approach? I can't tell from the documentation.

Thanks so much in advance.

How is the cell type information being inputted into the regression model?

Hello,

Thank you for creating such a useful tool. After reading the preprint and examining the code for the deconvolve function I'm still unsure of how AutoGeneS is leveraging the user inputted cell types to produce the regression coefficients/estimated proportions.

In the case of the nnls or nusvr model it looks to me like the inputs are X (bulk sample by gene expression matrix subsetted to n genes selected as solution to genetic algorithm) to predict y (vector of those n gene names). The function returns the coefficient estimates from the fitted model in a bulk sample by cell type matrix.

How is the cell type information being inputted into the regression model?
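
In the generic regression formulation of deconvolution, the cell type information enters through the design matrix: each column holds the mean expression of one cell type over the selected genes, the target is one bulk sample's expression over the same genes, and the fitted coefficients are read as cell type weights. The sketch below shows this with scipy.optimize.nnls on synthetic data. It illustrates the formulation only and is not a copy of AutoGeneS's deconvolve; all names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_types = 20, 3

# Signature matrix: rows = selected genes, columns = cell types.
signature = rng.uniform(0.0, 10.0, size=(n_genes, n_types))

# Synthetic bulk sample: a known mixture of the cell type profiles.
true_props = np.array([0.5, 0.3, 0.2])
bulk_sample = signature @ true_props

# One regression per bulk sample; coefficients are the cell type weights.
coef, residual = nnls(signature, bulk_sample)
```

Because the synthetic sample is noise-free, the coefficients recover the mixing proportions; with real data they are estimates and need not sum to one.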

where is the 'GSE75748_sc_cell_type_ec.csv'?

Dear authors,
Thanks for your excellent work.
How can I download the file 'GSE75748_sc_cell_type_ec.csv'? It is read in this notebook:
https://github.com/theislab/AutoGeneS/blob/master/deconv_example/bulkDeconvolution_using_singleCellReferenceProfiles.ipynb
In [2]:
#read single-cell data
file = './data/GSE75748_sc_cell_type_ec.csv'
adata = sc.read(file, cache=True).transpose()
adata

Looking forward to your reply,
Siyu

Definition of matrix in deconvolution based on formula

Hi,
Thanks for your useful tool!
I am wondering where I can find matrix S, the k × k diagonal matrix mentioned in the Methods section of your paper.
Besides, I am not sure what "average number of mRNAs in cell type l (also called cell size)" means.
Does each entry of matrix S summarize the average number of mRNAs across all genes in one cell type?

Best,
Tasha

The hierarchical AutoGeneS

Dear Author,

Thanks for this new API.

Your paper has a paragraph on hierarchical optimization for highly correlated cell types:
"we ran AutoGeneS separated CD4+ and CD8+ T cells ......" (referred to as AutoGeneS*).
I would like to run it on my data, since some cell types in my reference seem highly correlated, e.g. memory B vs. naive B cell subtypes.
Among the low-correlation Pareto-optimal solutions, I found very few markers.
My initial reference has about 100,000 cells and over 30 cell types. I regrouped some cell types to make deconvolution easier, but it doesn't work very well.

Now I want to use AutoGeneS*. Would you share your code?

Very nice feature selection method using GA.
Thanks in advance
Chuang

Healthy vs Patient signature

Hi,

Thanks for the valuable tool.

I have a single cell RNA-seq dataset that includes:

3 samples healthy == 3 batches
3 samples patients == 3 batches

Can I build up a signature matrix by comparing the 3 healthy Vs 3 patients?

My first issue:
I have two levels of batch effect:
1- Within the 3 samples of each condition
2- Between the two conditions

My second issue:
I have bulk RNA-seq data with 4 samples, 2 healthy and 2 patients, and I want to check whether, for instance, the 2 healthy bulk samples are more similar to the healthy samples among the scRNA-seq samples or not.

So this is another batch-correction issue, this time when comparing the scRNA-seq dataset to the bulk RNA-seq data.

N.B. the bulk data don't have any batch.

Many thanks in advance,
Mohamed

Is the deconvolution result highly dependent on the cell type annotation of the scRNA-seq reference data?

Hi, thanks for developing this tool.

I was wondering about the question in the title.
Here is what led me to it.

I am currently using scRNA-seq data from a published breast cancer cohort to deconvolve the TCGA-BRCA bulk RNA-seq data.
After generating deconvolution proportion matrices using a general cell annotation and a detailed cell annotation (i.e. more immune cell subgroups), I compared their cancer cell proportions (the annotation of cancer cells is the same in both) with TCGA purity.
It turns out that the proportion matrices differ somewhat: the cancer cell proportion from the detailed annotation is relatively lower than the other. And the Pearson correlation with purity is totally different.

Given how the signature matrix is generated, I guess it is important to have a reasonably correct annotation of the scRNA-seq data before using it for deconvolution. Am I right?

Thanks for any response.

Couldn't find documentation of `ag.run`

Very excited to test your package! Could you please point me to the description of ag.run arguments?

Cell type assignment in "coef"

Hello,

I have a very basic question!

After running the "init" and "optimize" steps, I obtain the coefficients with the deconvolve function:

ag.init(adata, celltype_key='Celltype')
ag.optimize(ngen=5000, nfeatures=400, seed=0, mode="fixed")
coef=ag.deconvolve(bulk_data, model="nnls")

coef doesn't contain headers or row names, and I was wondering how one can be sure which cell types and which samples correspond to which columns / rows?
I assume that, for the samples, the order will not change from the bulk's row names, but what about the different cell types? How can I assign names there?

Thank you!!
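
One way to keep track of labels is to wrap the coefficient array in a DataFrame. The sketch below assumes rows follow the bulk sample order and columns follow the order in which the per-cell-type means were computed. The traceback in the "Error in ag.init()" issue shows a groupby(...).mean(), which sorts string labels; a categorical obs column would follow its category order instead. All names and values here are hypothetical, and the column order should be checked against your AutoGeneS version.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: per-cell annotations and bulk sample names.
cell_annotations = pd.Series(["T cell", "B cell", "NK", "B cell", "T cell"])
bulk_samples = ["sample_1", "sample_2"]

# groupby sorts its keys, so per-cell-type means come out in sorted label order.
cell_type_order = sorted(cell_annotations.unique())

# Hypothetical deconvolve output: rows = bulk samples, columns = cell types.
coef = np.array([[0.2, 0.1, 0.7], [0.5, 0.3, 0.2]])
coef_df = pd.DataFrame(coef, index=bulk_samples, columns=cell_type_order)
```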

How to interpret the score

Hi,

I was wondering about the properties of AutoGeneS' cell-type score. I couldn't find any information in the manuscript if the score

is relative to the total immune cell fraction (like CIBERSORT), which only allows comparisons between cell-types
is an absolute cell-type fraction, which allows comparisons between both samples and cell-types (like EPIC)
is in arbitrary units, which allows comparisons between samples, but not cell-types.

For more background-information, see also the introduction of our benchmark paper of "1st-gen" deconvolution methods.

Best regards,
Gregor

Reference matrix (scRNAseq)

Many thanks for developing such a great tool.

I have one question regarding the reference matrix using scRNAseq data: is it required to use healthy tissue - derived data?
To my understanding, the purpose of the reference matrix is to provide a database of cell types that are thought to be present in the tissue of interest, but I am wondering about the above-mentioned matter. For example, the RNA-seq data I want to deconvolve is derived from chronic hepatitis B, in which I expect hepatocyte populations with different fibrotic states, and there is no way to estimate these populations using a reference matrix from a normal human liver scRNA-seq dataset. Is it OK to use a reference matrix from an scRNA-seq dataset of livers with different degrees of fibrosis?

Many thanks,
Dien

Could you make it possible to pass different Nu and C values?

AutoGeneS/autogenes/interface.py

Line 268 in 5627144

model = NuSVR(nu=0.5,C=0.5,kernel='linear')

Normalization of cell type proportions

When using autogenes, the output of the deconvolve function can contain negative numbers. I found the method def normalize_proportions(data,copy): in your example at https://github.com/theislab/AutoGeneS/blob/master/deconv_example/bulkDeconvolution_using_singleCellReferenceProfiles.ipynb, but since neither the documentation nor the paper mentions how to convert the output to cell type proportions that add up to one, I was wondering whether that is the recommended way to do it.
Another possibility would be normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) } followed by scaling to sum to one, which would not create as many zero-proportion cell types as normalize_proportions.

What would be your recommended way of converting the output of deconvolve() to cell type proportions that add up to one for each sample?
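
For comparison, the two options discussed above can be sketched in NumPy. `clip_and_rescale` assumes that normalize_proportions sets negative values to zero and then rescales each row, as the description in this issue suggests; `minmax_and_rescale` is a row-wise Python version of the R one-liner. Both are illustrative helpers, not the authors' recommendation.

```python
import numpy as np


def clip_and_rescale(coef):
    """Set negative coefficients to zero, then scale each row to sum to one."""
    clipped = np.clip(coef, 0.0, None)
    return clipped / clipped.sum(axis=1, keepdims=True)


def minmax_and_rescale(coef):
    """Min-max scale each row to [0, 1], then scale each row to sum to one."""
    lo = coef.min(axis=1, keepdims=True)
    hi = coef.max(axis=1, keepdims=True)
    scaled = (coef - lo) / (hi - lo)
    return scaled / scaled.sum(axis=1, keepdims=True)


# Hypothetical deconvolve output: rows = samples, columns = cell types.
coef = np.array([[-0.1, 0.2, 0.5], [0.1, 0.3, 0.6]])
by_clip = clip_and_rescale(coef)
by_minmax = minmax_and_rescale(coef)
```

Note that min-max scaling maps the smallest coefficient in every row to zero, even when all coefficients are positive, so it changes samples that clipping would leave untouched.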