theislab / autogenes
License: MIT License
Hi Hana,
Do you have documentation for AutoGeneS+?
I optimized my single-cell data of 13 cell types and found highly correlated cell types. I then ran the optimization on just those cell types, adding 10 more genes to the signature matrix, but I am not sure how to deconvolve the bulk data now. I was trying something along these lines: [ag.AutoGeneS(data=signature_matrix_np), ag.deconvolve(numeric_bulk.T, model='nusvr')], but ag is picking up values from the new optimization, which covers only 2 cell types.
Hi there,
In my hands, saving the current state using ag.save after a gen=5000 run results in huge files; the saving process never finishes, and the resulting file can't be read afterwards. So I wonder whether there is a slimmer solution. In the end I only need the list of selected genes, right? But how would I feed that to ag.deconvolve? Providing the gene names via key= does not work. Could you give directions here? I would like to try out runs with several options on a cluster and then just load the results.
Many thanks
Kristian
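A minimal sketch of the "slimmer" idea raised above: persist only the names of the selected genes and use them later to subset matrices by name. The gene names, the boolean mask and the helper names `save_selection` / `load_selection` are hypothetical; how to obtain the selection from `ag`, or how to hand it back to `ag.deconvolve`, is not shown because the thread does not confirm it.

```python
import json
import os
import tempfile

import numpy as np


def save_selection(path, gene_names, mask):
    """Write the names of the selected genes to a small JSON file."""
    selected = [str(g) for g, keep in zip(gene_names, mask) if keep]
    with open(path, "w") as fh:
        json.dump(selected, fh)


def load_selection(path):
    """Read the list of selected gene names back from disk."""
    with open(path) as fh:
        return json.load(fh)


# Hypothetical gene names and selection mask, for illustration only
gene_names = np.array(["GATA1", "CD3E", "CD19", "ACTB"])
mask = np.array([True, False, True, False])

path = os.path.join(tempfile.mkdtemp(), "selected_genes.json")
save_selection(path, gene_names, mask)
restored = load_selection(path)

# Rebuild a boolean mask by name, e.g. to subset a signature or bulk matrix
keep = np.isin(gene_names, restored)
```

A file like this stays a few kilobytes regardless of how many generations were run, since it stores none of the optimizer state.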
Hi
Thank you very much for this tool!
I was wondering whether the tool can only be used to deconvolve bulk RNA-seq data, or whether it can also be used to deconvolve microarray data. Have you done any testing on microarray data?
Best regards, Clara
Hi,
I've been having difficulties reloading ag pickle files after generating them. I've been trying to follow the "SaveLoadTest.ipynb" notebook but have the same error message each time. I'm not quite sure what's going wrong.
Thanks very much!
import numpy as np
import pandas as pd
import sys
import importlib
import pickle
from autogenes import AutoGenes
import os.path
%load_ext autoreload
%autoreload 1
%aimport autogenes
importlib.reload(autogenes)
ag = AutoGenes.load('ag.pickle')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-412bf52ad40b> in <module>
1 os.chdir('/home/ngr18/hcaskin/theis_decon')
----> 2 healthy_ag = AutoGenes.load('eczema_ag.pickle')
~/.local/lib/python3.6/site-packages/autogenes/core.py in load(file)
300
301 def load(file):
--> 302 return pickle.load(open(file, 'rb'))
303
304 #
~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in load(file, ignore, **kwds)
268 def load(file, ignore=None, **kwds):
269 """unpickle an object from a file"""
--> 270 return Unpickler(file, ignore=ignore, **kwds).load()
271
272 def loads(str, ignore=None, **kwds):
~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in load(self)
470
471 def load(self): #NOTE: if settings change, need to update attributes
--> 472 obj = StockUnpickler.load(self)
473 if type(obj).__module__ == getattr(_main_module, '__name__', '__main__'):
474 if not self._ignore:
~/anaconda3/envs/hcaskin/lib/python3.6/site-packages/dill/_dill.py in find_class(self, module, name)
460 return type(None) #XXX: special case: NoneType missing
461 if module == 'dill.dill': module = 'dill._dill'
--> 462 return StockUnpickler.find_class(self, module, name)
463
464 def __init__(self, *args, **kwds):
AttributeError: Can't get attribute 'IndividualGA' on <module 'deap.creator' from '/home/ngr18/anaconda3/envs/hcaskin/lib/python3.6/site-packages/deap/creator.py'>
Versions
deap 1.3.1; python 3.6.9; cachetools 4.0.0; dill 0.3.1.1
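One plausible reading of this error, offered as a guess rather than a diagnosis: `deap.creator` builds classes at runtime, so a class such as `IndividualGA` only exists in a session after it has been created there, and unpickling in a fresh session fails when the class is missing. The sketch below reproduces that behaviour with the standard library only. The module name `fake_creator` and the helper `create` are hypothetical stand-ins; they are not DEAP or AutoGeneS code, and the arguments AutoGeneS actually uses to create the class are not known from this thread.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a module that hosts runtime-created classes
creator = types.ModuleType("fake_creator")
sys.modules["fake_creator"] = creator


def create(name, base):
    """Create a class at runtime and attach it to the stand-in module."""
    cls = type(name, (base,), {"__module__": "fake_creator"})
    setattr(creator, name, cls)
    return cls


Individual = create("IndividualGA", list)
blob = pickle.dumps(Individual([1, 0, 1]))

# Simulate a fresh session in which the class was never created
delattr(creator, "IndividualGA")
try:
    pickle.loads(blob)
    load_failed = False
except AttributeError:
    load_failed = True

# Recreating the class under the same name lets the same bytes load again
create("IndividualGA", list)
restored = pickle.loads(blob)
```

If this is what is happening, running the code path that defines the class before calling the load function might be worth trying.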
Hi,
I was trying AutoGeneS for the first time on a public dataset, after running the tutorial without errors.
But an error occurs in the ag.init step.
Here is my code:
adata_norm = sc.pp.normalize_per_cell(coh1_adata, copy=True)
adata_log = sc.pp.log1p(adata_norm,copy=True)
sc.pp.highly_variable_genes(adata_log, flavor = "seurat_v3", n_top_genes=4000)
adata_proc = adata_norm[:, adata_log.var[adata_log.var['highly_variable']==True].index]
# I assume the matrix is just the average expression matrix of celltypes
res = pd.DataFrame(columns=adata_proc.var_names, index=adata_proc.obs['new_celltype'].cat.categories)
for x in adata_proc.obs.new_celltype.cat.categories:
    res.loc[x]=adata_proc[adata_proc.obs['new_celltype'].isin([x]),:].X.mean(0)
centroids_sc_hv = res.T
centroids_sc_hv.shape
ag.init(centroids_sc_hv.T)
TypeError Traceback (most recent call last)
/tmp/ipykernel_1304110/259709457.py in
----> 1 ag.init(res)
2 ag.optimize(ngen=5000,seed=0,nfeatures=400,mode='fixed',offspring_size=100,verbose=True)
~/miniconda3/lib/python3.9/site-packages/autogenes/interface.py in init(self, data, celltype_key, genes_key, use_highly_variable, **kwargs)
86 self.data = data.values
87 self.data_genes = data.columns.values
---> 88 self.main = AutoGeneS(self.data)
89 self.pre_selection = np.full((data.shape[1],),True)
90
~/miniconda3/lib/python3.9/site-packages/autogenes/core.py in init(self, data)
40 raise ValueError("Number of columns (genes) must be >= number of rows (cell types)")
41
---> 42 if not np.isfinite(self.data).all():
43 raise ValueError("Some entries in data are not scalars")
44
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Do you know how to fix it and what caused it?
Thanks!
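A possible cause, assuming the code above is what was run: `res` is created with `pd.DataFrame(columns=..., index=...)` and no data, which usually yields `object` columns, and filling rows through `.loc` does not necessarily change that. `np.isfinite` is not defined for `object` arrays, which would match the TypeError. The sketch below uses made-up gene and cell-type names and casts the frame to float before it would be passed on.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, for illustration only
genes = ["g1", "g2", "g3"]
cell_types = ["A", "B"]

# A frame built from labels alone has no numeric dtype to start with
res = pd.DataFrame(columns=genes, index=cell_types)
res.loc["A"] = [1.0, 2.0, 3.0]
res.loc["B"] = [4.0, 5.0, 6.0]
print(res.dtypes.unique())  # inspect the column dtypes before casting

# Cast explicitly so that downstream NumPy checks see a float array
centroids = res.astype(float)
all_finite = bool(np.isfinite(centroids.to_numpy()).all())
```

In the post's terms this would mean something like `ag.init(res.astype(float))`, though whether that resolves the error there is untested.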
Hi,
Thanks for the nice tool :)
I am having some issues with autogenes, where sometimes the process will not finish (even after 5 days) on the same dataset on which it previously finished in less than 1 h. I believe this might be related to the number of iterations, and that the algorithm fails to terminate for some reason. When I set the max.iter parameter to something other than -1 (the default), the issue no longer appeared.
My question now is: what value would be a reasonable choice here? I think it can actually be quite high; the point is just to have a limit so that autogenes terminates in every circumstance.
Best,
Alex
Hi developers,
Thanks for a great tool!
I was wondering if you could extract the Reference Profile prior to the actual deconvolution.
I am currently trying to make the deconvolution work, however, it fails to estimate certain cell types (probably due to an insufficient estimation of marker genes). Can I extract the matrix and change it?
Best,
Peter
Based on the example, it seems that AutoGeneS selects genes among highly variable genes.
But can I select genes from the whole transcriptome?
Hi,
Starting on a good note: very nice tool you've provided here! However, after having run through it, I'm struggling a bit with the data structure. Let's say I've run the modules within autogenes and have obtained my cell proportions. How can I export the signature matrix? Is it saved within the autogenes object, or would I have to rebuild it some other way?
Best
Mike
Hi again!
When I run ag.init() using the anndata object (as shown here: https://autogenes.readthedocs.io/en/latest/getting_started.html), I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1653 blocks = [
-> 1654 make_block(values=blocks[0], placement=slice(0, len(axes[0])))
1655 ]
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype)
3046
-> 3047 return klass(values, ndim=ndim, placement=placement)
3048
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
2594
-> 2595 super().__init__(values, ndim=ndim, placement=placement)
2596
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
124 raise ValueError(
--> 125 f"Wrong number of items passed {len(self.values)}, "
126 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 1, placement implies 11945
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-10-a6f87f4b968d> in <module>
5
6 # Initialise autogenes object
----> 7 ag.init(adata_snrna_raw, genes_key='selected', celltype_key='annotation_1')
8
9 # do not run the gene selection
~/.local/lib/python3.7/site-packages/autogenes/interface.py in init(self, data, celltype_key, genes_key, use_highly_variable, **kwargs)
68 raise ValueError(f"AnnData has no obs column '{celltype_key}'")
69
---> 70 self._adata = self.__compute_means(data,celltype_key)
71 self.data_genes = data.var_names.values
72
~/.local/lib/python3.7/site-packages/autogenes/interface.py in __compute_means(self, adata, celltype_key)
403 if celltype_key not in adata.obs:
404 raise ValueError("Key not found")
--> 405 sc_means = pd.DataFrame(data=adata.X, columns=adata.var_names)
406 sc_means['cell_types'] = pd.Series(data=adata.obs[celltype_key].values,index=sc_means.index)
407 sc_means = sc_means.groupby('cell_types').mean()
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
486 mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
487 else:
--> 488 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
489 else:
490 mgr = init_dict({}, index, columns, dtype=dtype)
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
208 block_values = [values]
209
--> 210 return create_block_manager_from_blocks(block_values, [columns, index])
211
212
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1662 blocks = [getattr(b, "values", b) for b in blocks]
1663 tot_items = sum(b.shape[0] for b in blocks)
-> 1664 construction_error(tot_items, blocks[0].shape[1:], axes, e)
1665
1666
/nfs/team283/vk7/software/miniconda3farm5/envs/cellpymc/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
1692 if block_shape[0] == 0:
1693 raise ValueError("Empty data passed with indices specified.")
-> 1694 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
1695
1696
ValueError: Shape of passed values is (19797, 1), indices imply (19797, 11945)
I use the latest AutoGeneS version from pip.
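A guess based only on the traceback: a passed shape of (19797, 1) against an implied (19797, 11945) is what pandas tends to report when `pd.DataFrame` is handed a SciPy sparse matrix instead of a dense array. If `adata_snrna_raw.X` is sparse, converting it to a dense array before `ag.init` may avoid the error. The helper `to_dense` and the toy matrix below are hypothetical, and densifying a matrix of the size in the traceback can need a lot of memory.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_dense(X):
    """Return a dense ndarray whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Toy stand-in for an expression matrix stored in sparse form
X = sparse.csr_matrix(np.array([[0.0, 1.0, 2.0],
                                [3.0, 0.0, 0.0]]))
var_names = ["g1", "g2", "g3"]

# With a dense array, the DataFrame gets one column per gene
df = pd.DataFrame(data=to_dense(X), columns=var_names)
```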
Dear Theis lab,
Thanks for this amazing tool to deconvolve bulk data. However, I was wondering (maybe I missed it in the documentation) what kind of expression data you would suggest using as input for the bulk data. We tried basemean expression from deseq, with mixed results. Also, how many genes would you suggest using when working with highly variable genes only?
Thank you so much
Hi,
Do the scRNA and bulk data passed to AutoGeneS both need to be normalized using the same approach? I can't tell based on the documentation.
Thanks so much in advance.
Hello,
Thank you for creating such a useful tool. After reading the preprint and examining the code for the deconvolve function I'm still unsure of how AutoGeneS is leveraging the user inputted cell types to produce the regression coefficients/estimated proportions.
In the case of the nnls or nusvr model it looks to me like the inputs are X (bulk sample by gene expression matrix subsetted to n genes selected as solution to genetic algorithm) to predict y (vector of those n gene names). The function returns the coefficient estimates from the fitted model in a bulk sample by cell type matrix.
How is the cell type information being inputted into the regression model?
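For orientation, here is a generic sketch of signature-based deconvolution; it is not a confirmed description of what `ag.deconvolve` does internally. In this general setup the cell-type information enters through the design matrix: each column of the signature matrix is one cell type's mean expression over the selected genes, and the response is one bulk sample's expression over the same genes. Ordinary least squares is used here as a stand-in for the nnls and nusvr models, and all numbers are made up.

```python
import numpy as np

# Signature matrix: rows = selected genes, columns = cell types (made-up values)
signature = np.array([[10.0, 0.0, 1.0],
                      [0.0, 8.0, 1.0],
                      [1.0, 1.0, 9.0],
                      [5.0, 5.0, 0.0]])

# Known mixing proportions, used only to simulate bulk samples
true_props = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.2, 0.6]])  # samples x cell types

# Each bulk sample is a weighted sum of the cell-type profiles
bulk = true_props @ signature.T  # samples x genes

# Regress every bulk sample on the signature: one coefficient per cell type
coef, *_ = np.linalg.lstsq(signature, bulk.T, rcond=None)
estimated = coef.T  # samples x cell types
```

Because the simulated bulk is noise-free, the coefficients recover the mixing proportions; real data would need a constrained or robust model such as the ones named in the question.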
Dear authors,
Thanks for your excellent work.
I have a question: how can I download the file 'GSE75748_sc_cell_type_ec.csv'?
https://github.com/theislab/AutoGeneS/blob/master/deconv_example/bulkDeconvolution_using_singleCellReferenceProfiles.ipynb
In [2]:
#read single-cell data
file = './data/GSE75748_sc_cell_type_ec.csv'
adata = sc.read(file, cache=True).transpose()
adata
Looking forward to your reply,
Siyu
Hi,
Thanks for your useful tool!
I am wondering where I can find the matrix S, the k × k diagonal matrix mentioned in the methods section of your paper.
Also, I am not sure what "average number of mRNAs in cell type l (also called cell size)" means.
Is each value of matrix S the average number of mRNAs, taken over all genes, in one cell type?
Best,
Tasha
Dear Author,
Thanks for this new API.
As you mentioned in your paper, in the paragraph on hierarchical optimization for highly correlated cell types:
"we ran AutoGeneS separated CD4+ and CD8+ T cells ......" as AutoGeneS*
I would like to run it on my data, since some cell types in my reference seem highly correlated, e.g. the memory B vs. naive B cell subtypes.
With the low-correlation Pareto-optimal solutions, I found very few markers.
My initial reference has about 100,000 cells and over 30 cell types. I regrouped some cell types to make deconvolution easier, but it doesn't work very well.
Now I want to use AutoGeneS*. Would you share your code?
Very nice feature selection method using GA.
Thanks in advance
Chuang
Hi,
Thanks for the valuable tool.
I have a single cell RNA-seq dataset that includes:
3 samples healthy == 3 batches
3 samples patients == 3 batches
Can I build a signature matrix by comparing the 3 healthy vs. the 3 patient samples?
My first issue:
The main issue here is that I have two levels of batch effect:
1- Within the 3 samples of each condition
2- Between the two conditions
My second issue:
I have bulk RNA-seq data with 4 samples, 2 healthy and 2 patients, and I want to check whether, for instance, the 2 healthy bulk samples are more similar to the healthy samples among the scRNA-seq data or not.
So this is another batch-correction issue, this time when comparing the scRNA-seq dataset to the bulk RNA-seq data.
N.B. the bulk data don't have any batches.
Many thanks in advance,
Mohamed
Hi, thanks for developing this tool.
I was wondering about my title's question.
Here is what I did to come up with this question.
I am currently using a scRNAseq data of Breast cancer cohort from a published paper to deconvolute the TCGA-BRCA bulk RNAseq.
After generating the deconvolution proportion matrices from using a general cell annotation and a detailed cell annotation (i.e. more immune cell subgroups), I compared their cancer cell proportions (annotation of cancer cells are the same) with TCGA purity.
It turns out that the proportion matrices are a little different: the cancer cell proportion from the detailed annotation is lower than that from the general annotation, and the Pearson correlation with purity is totally different.
Based on how the signature matrix is generated, I guess it is important to have a reasonably correct annotation of your scRNAseq data before using it for deconvolution. Am I right?
Thanks for any response.
Hello,
I have a very basic question!
After running the "init" and "optimize" steps, I obtain the coefficients with the deconvolve function:
ag.init(adata, celltype_key='Celltype')
ag.optimize(ngen=5000, nfeatures=400, seed=0, mode="fixed")
coef=ag.deconvolve(bulk_data, model="nnls")
coef
doesn't contain headers or row names, and I was wondering how one can be sure which cell types and which samples correspond to which columns / rows?
I assume that, for the samples, the order will not change from the bulk's row names, but what about the different cell types? How can I assign names there?
Thank you!!
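A minimal sketch for attaching labels to a coefficient matrix, under two unverified assumptions: rows follow the order of the bulk samples, and columns follow the order in which the cell-type centroids were computed. With AnnData input, the `groupby('cell_types').mean()` call visible in a traceback elsewhere in this thread suggests the order of the grouped cell-type labels, but that should be checked. The helper `label_coefficients` and all names and values below are hypothetical.

```python
import numpy as np
import pandas as pd


def label_coefficients(coef, sample_names, celltype_names):
    """Wrap a samples x cell-types array in a labelled DataFrame."""
    coef = np.asarray(coef)
    expected = (len(sample_names), len(celltype_names))
    if coef.shape != expected:
        raise ValueError(f"coef has shape {coef.shape}, expected {expected}")
    return pd.DataFrame(coef, index=sample_names, columns=celltype_names)


# Made-up coefficients and labels, for illustration only
coef = np.array([[0.7, 0.3],
                 [0.1, 0.9]])
labelled = label_coefficients(coef, ["bulk_1", "bulk_2"], ["B cell", "T cell"])
```

The shape check is there so that a mismatch between the assumed label order and the matrix fails loudly instead of silently mislabelling columns.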
Hi,
I was wondering about the properties of AutoGeneS' cell-type score. I couldn't find any information in the manuscript if the score
For more background-information, see also the introduction of our benchmark paper of "1st-gen" deconvolution methods.
Best regards,
Gregor
Many thanks for developing such a great tool.
I have one question regarding the reference matrix using scRNAseq data: is it required to use healthy tissue - derived data?
To my understanding, the purpose of the reference matrix is to provide a database of the cell types thought to be present in the tissue of interest, but I am wondering about the question above. For example, the RNAseq data I want to deconvolve is derived from chronic hepatitis B, in which I expect hepatocyte populations in different fibrotic states, and there is no way to estimate these populations using a scRNAseq dataset derived from normal human liver as the reference matrix. Is it OK to use a reference matrix from a scRNAseq dataset of livers with different degrees of fibrosis?
Many thanks,
Dien
AutoGeneS/autogenes/interface.py
Line 268 in 5627144
When using autogenes, the output of the deconvolve function can contain negative numbers. Now I found the method def normalize_proportions(data,copy):
in your example at https://github.com/theislab/AutoGeneS/blob/master/deconv_example/bulkDeconvolution_using_singleCellReferenceProfiles.ipynb, but since there is no other mention how to convert the output to cell type proportions which add up to one in the documentation or the paper, I was wondering whether that is the recommended way to do it.
Another possibility would be normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }
and then scaling the result to sum to one, which would not create as many zero-proportion cell types as the method normalize_proportions.
What would be your recommended way of converting the output of deconvolve() to cell type proportions that add up to one for each sample?
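To make the two options concrete, here is a sketch of both as described in the question; it is not the notebook's `normalize_proportions` code and not a recommendation from the authors. The function names and the coefficient values are hypothetical. Rows whose clipped values sum to zero, or whose entries are all equal, would divide by zero and are not handled.

```python
import numpy as np


def clip_and_renormalize(coef):
    """Set negative coefficients to zero, then scale each row to sum to 1."""
    clipped = np.clip(np.asarray(coef, dtype=float), 0.0, None)
    return clipped / clipped.sum(axis=1, keepdims=True)


def minmax_and_renormalize(coef):
    """Min-max scale each row to [0, 1], then scale each row to sum to 1."""
    coef = np.asarray(coef, dtype=float)
    lo = coef.min(axis=1, keepdims=True)
    hi = coef.max(axis=1, keepdims=True)
    scaled = (coef - lo) / (hi - lo)
    return scaled / scaled.sum(axis=1, keepdims=True)


# Made-up coefficients: rows = bulk samples, columns = cell types
coef = np.array([[-0.1, 0.3, 0.8],
                 [0.2, 0.5, 0.3]])

by_clipping = clip_and_renormalize(coef)
by_minmax = minmax_and_renormalize(coef)
```

One property worth noting: min-max scaling maps the smallest coefficient in every row to exactly zero, even when that coefficient was positive, so the second sample's 0.2 becomes 0 under min-max but stays 0.2 under clipping.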