scverse / anndata

Annotated data.

Home Page: http://anndata.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Topics: scanpy, data-science, transcriptomics, bioinformatics, machine-learning, scverse, anndata

anndata's Introduction



anndata - Annotated data

anndata is a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray. anndata offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface.
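
A minimal sketch of typical usage (the names and file path are illustrative):

import anndata as ad
import numpy as np
import pandas as pd

# a data matrix plus aligned per-observation and per-variable annotations
adata = ad.AnnData(
    X=np.random.rand(3, 2).astype(np.float32),
    obs=pd.DataFrame(index=['cell1', 'cell2', 'cell3']),
    var=pd.DataFrame(index=['geneA', 'geneB']),
)
adata.write('example.h5ad')  # round-trips through the HDF5-based .h5ad format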

anndata is part of the scverse project (website, governance) and is fiscally sponsored by NumFOCUS. Please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.

Citation

If you use anndata in your work, please cite the anndata pre-print as follows:

anndata: Annotated data

Isaac Virshup, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, F. Alexander Wolf

bioRxiv 2021 Dec 19. doi: 10.1101/2021.12.16.473007.

You can cite the scverse publication as follows:

The scverse project provides a computational ecosystem for single-cell omics data analysis

Isaac Virshup, Danila Bredikhin, Lukas Heumos, Giovanni Palla, Gregor Sturm, Adam Gayoso, Ilia Kats, Mikaela Koutrouli, Scverse Community, Bonnie Berger, Dana Pe’er, Aviv Regev, Sarah A. Teichmann, Francesca Finotello, F. Alexander Wolf, Nir Yosef, Oliver Stegle & Fabian J. Theis

Nat Biotechnol. 2023 Apr 10. doi: 10.1038/s41587-023-01733-8.


anndata's Issues

Data subsetting method

Is there a plan to add a data subsetting method for extracting particular cells from an AnnData object?
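
AnnData's __getitem__ already supports this kind of filtering; a minimal sketch (the cell_type column is invented for illustration):

import numpy as np
import anndata as ad

adata = ad.AnnData(np.ones((4, 3), dtype=np.float32))
adata.obs['cell_type'] = ['B', 'T', 'B', 'NK']

# a boolean mask over .obs selects observations and returns a view
b_cells = adata[adata.obs['cell_type'] == 'B']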

Sample annotations with same value but different dtypes are unloadable

I'm not sure if this belongs here or in the scanpy repo, and this is a hybrid bug and feature request. This is also somewhat related to #31, as both issues stem from the same function.

If two sample annotations have the same "value" but different dtype, a dataset saved as h5ad becomes unreadable. This stems from the way that the categories for 'object' typed annotations are defined. Minimal example:

import scanpy.api as sc

# load any dataset:
dataset = sc.read('/path/to/dataset')
test1 = dataset.copy()[:5, :5]
test2 = dataset.copy()[:5, :5]

# add annotations
test1.obs['sampleid'] = 1
test2.obs['sampleid'] = '1'

test_combined = test1.concatenate([test2])

test_combined.save('test.h5ad')
test_combined = sc.read('test.h5ad')

The last line fails for me with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-85-2cfbeaca8a55> in <module>()
----> 1 test_combined = sc.read_h5ad('test.h5ad')

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
    343         # load everything into memory
    344         d = _read_h5ad(filename=filename)
--> 345         return AnnData(d)
    346 
    347 

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    634                 obsm=obsm, varm=varm, raw=raw,
    635                 dtype=dtype, shape=shape,
--> 636                 filename=filename, filemode=filemode)
    637 
    638     def _init_as_view(self, adata_ref, oidx, vidx):

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode)
    753                 raise ValueError(
    754                     'If `X` is a dict no further arguments must be provided.')
--> 755             X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
    756 
    757         # init from AnnData

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in _from_dict(ddata)
   1872                     d_true_keys['obs'][k_stripped] = pd.Categorical.from_codes(
   1873                         codes=d_true_keys['obs'][k_stripped].values,
-> 1874                         categories=v)
   1875                 if k_stripped in d_true_keys['var']:
   1876                     d_true_keys['var'][k_stripped] = pd.Categorical.from_codes(

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/pandas/core/categorical.py in from_codes(cls, codes, categories, ordered)
    616                 "codes need to be convertible to an arrays of integers")
    617 
--> 618         categories = CategoricalDtype._validate_categories(categories)
    619 
    620         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):

/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in _validate_categories(categories, fastpath)
    325 
    326             if not categories.is_unique:
--> 327                 raise ValueError('Categorical categories must be unique')
    328 
    329         if isinstance(categories, ABCCategoricalIndex):

ValueError: Categorical categories must be unique

Looking at the attributes of test_combined, I see this:

test_combined.obs.sampleid.astype('category')
AAACCTGAGAACAACT-1-0    1
AAACCTGAGCTAGTTC-1-0    1
AAACCTGAGGGAAACA-1-0    1
AAACCTGCAATCACAC-1-0    1
AAACCTGCAATCGAAA-1-0    1
AAACCTGAGAACAACT-1-1    1
AAACCTGAGCTAGTTC-1-1    1
AAACCTGAGGGAAACA-1-1    1
AAACCTGCAATCACAC-1-1    1
AAACCTGCAATCGAAA-1-1    1
Name: sampleid, dtype: category
Categories (2, object): [1, 1]

and inspecting h5file['uns']['sampleid_categories'] yields [b'1', b'1']. Because some of the values are strings, the dtype of the column in the dataframe gets set as 'object' which causes is_string_dtype(data.obs.sampleid) to be True.

I think the logic in base.df_to_records_fixed_width should probably be changed to sanitize user-defined inputs like this, or to display a warning if mixed dtypes are detected.

I'm using anndata==0.6.4 and scanpy==1.2.1.
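
In the meantime, a possible workaround (a sketch, assuming the column is meant to hold strings) is to coerce both columns to a single dtype before concatenating, so that the saved categories are unique:

# cast to one dtype so the categories written to h5ad are unique
test1.obs['sampleid'] = test1.obs['sampleid'].astype(str)
test2.obs['sampleid'] = test2.obs['sampleid'].astype(str)
test_combined = test1.concatenate([test2])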

Counterintuitive indexing behaviour

AnnData's __getitem__ seems to exhibit cross-product behaviour (similar to Python slices or np.ix_()) instead of numpy fancy-indexing behaviour (so when it's indexed with 3 rows and 3 columns, it returns a 3×3 AnnData, not 3 scalars as fancy indexing would).

This is very useful and intuitive, in my opinion, because when users specify cell and gene indices they mean cell and gene filtering:

adata = sc.datasets.paul15_raw()

print(adata[[0, 4, 10], ['Sfpi1']])
print(adata[[0, 4, 10], ['Sfpi1', 'Gata1', 'Fli1']])

(screenshot: both calls return sliced AnnData views, 3 × 1 and 3 × 3)

However, this is the case only if the col/row index dimensions are compatible in terms of broadcasting:

print(adata[[0, 4, 10], ['Sfpi1', 'Gata1']])

(screenshot: this call raises an error)

This is a bit difficult to understand :) Either all 3 cases should raise an exception or they should all perform cross-product-like slicing. What do you think?

Conceptual indexing problem

The faster indexing solution

  1. converts boolean indices to integer indices (we shouldn’t do that; boolean indices are super fast!)
  2. creates a pandas index using the var_names as index to speed up lookups

If the var_names happen to be strings, that works (slower than necessary), but if they’re integers, this breaks. Example:

In[1]: ad = AnnData(np.array([[0,1,2],[3,4,5]]), var=dict(var_names=[10,11,12]))
In[2]: ad[:, ad.X.sum(0) > 3]
Traceback (most recent call last):
  File "<ipython-input-23-d08541977b75>", line 1, in <module>
    ad[:, ad.X.sum(0) > 3]
  File "anndata/base.py", line 1187, in __getitem__
    return self._getitem_view(index)
  File "anndata/base.py", line 1190, in _getitem_view
    oidx, vidx = self._normalize_indices(index)
  File "anndata/base.py", line 1167, in _normalize_indices
    var = _normalize_index(var, self.var_names)
  File "anndata/base.py", line 244, in _normalize_index
    positions = positions[index]
  File "pandas/core/series.py", line 809, in __getitem__
    return self._get_with(key)
  [...]
  File "pandas/core/indexing.py", line 1206, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)))
KeyError: 'None of [[1 2]] are in the [index]'

'dtype' within reading functions doesn't work

Hi,

I'm trying to read in a matrix file in the format of 'float64'. I did the following:

adata = ad.read_text('./test.txt',delimiter='\t',dtype='float64')

I have specified the dtype as 'float64', but adata.X still shows the default 'float32'. Is this a bug, or did I miss something?

I'm attaching a short script here to reproduce this issue. I will really appreciate your help. Many thanks!
Archive.zip

read_loom failed

Hi,
I use scanpy 1.3.1. I have tried the 'read_loom' function, but it produced the following error:

adata = sc.read(filename=path_to_velocyto_files + 'all_controls.loom')

--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-52-61a36c9ba297> in <module>()
----> 1 adata = sc.read(filename=path_to_velocyto_files + 'all_controls.loom')

~/anaconda3/lib/python3.6/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
     73         return _read(filename, backed=backed, sheet=sheet, ext=ext,
     74                      delimiter=delimiter, first_column_names=first_column_names,
---> 75                      backup_url=backup_url, cache=cache, **kwargs)
     76     # generate filename and read to dict
     77     filekey = filename

~/anaconda3/lib/python3.6/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
    311             adata = _read_softgz(filename)
    312         elif ext == 'loom':
--> 313             adata = read_loom(filename=filename, **kwargs)
    314         else:
    315             raise ValueError('Unkown extension {}.'.format(ext))

~/anaconda3/lib/python3.6/site-packages/anndata/readwrite/read.py in read_loom(filename, sparse, cleanup, X_name, obs_names, var_names)
    144     filename = fspath(filename)  # allow passing pathlib.Path objects
    145     from loompy import connect
--> 146     with connect(filename, 'r') as lc:
    147 
    148         if X_name not in lc.layers.keys(): X_name = ''

AttributeError: __enter__

I have found that when I read the file in the interactive console, I get the following result:

> filename = os.fspath(path_to_velocyto_files + 'all_controls.loom')
> lc = connect(filename, 'r')
> lc.layer.keys()
dict_keys(['', 'ambiguous', 'spliced', 'unspliced'])

So for my loom file, it's not 'layers' but 'layer'. Could you consider handling this case in the anndata read function?
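
A hedged sketch of how read_loom could tolerate both loompy APIs (the attribute is called layer in some loompy versions and layers in others):

from loompy import connect

lc = connect(filename, 'r')
# fall back to the older attribute name if `layers` is missing
layers = lc.layers if hasattr(lc, 'layers') else lc.layer
print(layers.keys())
lc.close()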

slicing of backed AnnData object throws error

Test case:

import scanpy.api as sc
import numpy as np
adata = sc.AnnData(X=np.random.binomial(100, .01, (100, 100)))
adata.obs_names = adata.obs_names.astype(str)

# this works fine
adata[0:2,:][:,0:2]

adata.write("tmp.h5ad")
adata_backed = sc.read("tmp.h5ad", backed="r")

# this throws error
adata_backed[0:2,:][:,0:2]

Traceback of final line:

Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1297, in __getitem__
    return self._getitem_view(index)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1301, in _getitem_view
    return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 664, in __init__
    self._init_as_view(X, oidx, vidx)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 689, in _init_as_view
    uns_new = deepcopy(self._adata_ref._uns)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "stringsource", line 5, in h5py.h5f.__pyx_unpickle_FileID
  File "h5py/_objects.pyx", line 178, in h5py._objects.ObjectID.__cinit__
TypeError: __cinit__() takes exactly 1 positional argument (0 given)

Failure reading single observation h5ads

AnnData doesn't seem to like h5ad files with a single observation:

In [54]: anndata.__version__
Out[54]: '0.6.6'

In [55]: df.shape
Out[55]: (23465, 1)

In [56]: adata = anndata.AnnData(df.values.T, {"cell_names": df.columns.values}, {"gene_names": df.index.values})

In [57]: adata.write("test.h5ad")

In [58]: bdata = anndata.read_h5ad("test.h5ad")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-222d2ae8c407> in <module>()
----> 1 bdata = anndata.read_h5ad("test.h5ad")

~/.local/lib/python3.5/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
    342         # load everything into memory
    343         d = _read_h5ad(filename=filename)
--> 344         return AnnData(d)
    345
    346

~/.local/lib/python3.5/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    639                 obsm=obsm, varm=varm, raw=raw,
    640                 dtype=dtype, shape=shape,
--> 641                 filename=filename, filemode=filemode)
    642
    643     def _init_as_view(self, adata_ref, oidx, vidx):

~/.local/lib/python3.5/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode)
    758                 raise ValueError(
    759                     'If `X` is a dict no further arguments must be provided.')
--> 760             X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
    761
    762         # init from AnnData

~/.local/lib/python3.5/site-packages/anndata/base.py in _from_dict(ddata)
   1920                     if key in d_true_keys[true_key].dtype.names:
   1921                         d_true_keys[true_key] = pd.DataFrame.from_records(
-> 1922                             d_true_keys[true_key], index=key)
   1923                         break
   1924                 d_true_keys[true_key].index = d_true_keys[true_key].index.astype('U')

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1267         else:
   1268             arrays, arr_columns = _to_arrays(data, columns,
-> 1269                                              coerce_float=coerce_float)
   1270
   1271             arr_columns = _ensure_index(arr_columns)

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
   7493     else:
   7494         # last ditch effort
-> 7495         data = lmap(tuple, data)
   7496         return _list_to_arrays(data, columns, coerce_float=coerce_float,
   7497                                dtype=dtype)

~/.local/lib/python3.5/site-packages/pandas/compat/__init__.py in lmap(*args, **kwargs)
    129
    130     def lmap(*args, **kwargs):
--> 131         return list(map(*args, **kwargs))
    132
    133     def lfilter(*args, **kwargs):

TypeError: 'numpy.int64' object is not iterable

But faking another cell works fine:

In [59]: df2 = pandas.concat([df, df], axis=1)

In [60]: df2.shape
Out[60]: (23465, 2)

In [61]: adata = anndata.AnnData(df2.values.T, {"cell_names": df2.columns.values}, {"gene_names": df2.index.values})

In [62]: adata.write("test.h5ad")

In [63]: bdata = anndata.read_h5ad("test.h5ad")

In [64]: bdata.n_obs, bdata.n_vars
Out[64]: (2, 23465)

I'm guessing this has something to do with _fix_shapes.

Reading tsv-files with only one column

Hi, I'd like to read a "barcode.tsv" file:

sc.read_csv("barcodes.tsv", delimiter="\t")

However, this file has only one column and cannot be loaded:

Traceback (most recent call last):
  File "<ipython-input-24-7055edd4368a>", line 13, in load_mtx_to_adata
    ad.obs = sc.read_csv("barcodes.tsv", delimiter="\t")
  File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 36, in read_csv
    return read_text(filename, delimiter, first_column_names, dtype)
  File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 210, in read_text
    return _read_text(f, delimiter, first_column_names, dtype)
  File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 243, in _read_text
    .format(delimiter))
ValueError: Did not find delimiter "	" in first line.
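
Until read_csv handles single-column files, a workaround sketch using pandas (assuming one barcode per line and that adata was built from the matrix file):

import pandas as pd

# read the one-column file directly and use it for the observation names
barcodes = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0]
adata.obs_names = barcodes.values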

Error slicing when there is only one observation

This may be intentional, but there seems to be an issue when the raw data only has one row. We have come across this in our unit tests when creating a small example data set with only one observation.

In [1]: import anndata

In [2]: anndata.__version__
Out[2]: '0.6.10'

In [3]: import pandas as pd

In [4]: d = [
   ...:     (1, 'A', 'a', 'Z', 'z'),
   ...:     (2, 'A', 'b', 'Z', 'z'),
   ...:     (3, 'B', 'c', 'Z', 'z'),
   ...:     (4, 'B', 'd', 'Z', 'z'),
   ...: ]

In [5]: df = pd.DataFrame(d, columns='c0 c1 c2 c3 c4'.split())
   ...: df
Out[5]:
   c0 c1 c2 c3 c4
0   1  A  a  Z  z
1   2  A  b  Z  z
2   3  B  c  Z  z
3   4  B  d  Z  z

In [6]: df = df.set_index('c1 c2 c3 c4'.split())['c0'].unstack(level=[2, 3]).T
   ...: df
Out[6]:
c1     A     B
c2     a  b  c  d
c3 c4
Z  z   1  2  3  4

In [7]: def convert_idx(x): return x.to_frame().reset_index(drop=True)

In [8]: obs = convert_idx(df.index)

In [9]: var = convert_idx(df.columns)

In [10]: a = anndata.AnnData(X=df.values, obs=obs, var=var)

In [11]: a
Out[11]:
AnnData object with n_obs × n_vars = 1 × 4
    obs: 'c3', 'c4'
    var: 'c1', 'c2'

In [12]: a.obs
Out[12]:
  c3 c4
0  Z  z

In [13]: a.var
Out[13]:
  c1 c2
0  A  a
1  A  b
2  B  c
3  B  d

In [14]: assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-7f38676dce05> in <module>()
----> 1 assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)

AssertionError: (1, 4) != (4,)

In [15]: a[0, 0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-0d768f50cb80> in <module>()
----> 1 a[0, 0]

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
   1292     def __getitem__(self, index):
   1293         """Returns a sliced view of the object."""
-> 1294         return self._getitem_view(index)
   1295
   1296     def _getitem_view(self, index):

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
   1296     def _getitem_view(self, index):
   1297         oidx, vidx = self._normalize_indices(index)
-> 1298         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1299
   1300     # this is used in the setter for uns, if a view

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    674             if not isinstance(X, AnnData):
    675                 raise ValueError('`X` has to be an AnnData object.')
--> 676             self._init_as_view(X, oidx, vidx)
    677         else:
    678             self._init_as_actual(

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_as_view(self, adata_ref, oidx, vidx)
    735         # set data
    736         if self.isbacked: self._X = None
--> 737         else: self._init_X_as_view()
    738
    739         self._layers = AnnDataLayers(self, adata_ref=adata_ref, oidx=oidx, vidx=vidx)

~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_X_as_view(self)
    750             self._X = None
    751             return
--> 752         X = self._adata_ref.X[self._oidx, self._vidx]
    753         if len(X.shape) == 2:
    754             n_obs, n_vars = X.shape

IndexError: too many indices for array

Error reading loom file with read_loom()

Hi,

I converted a Seurat object into a .loom file and tried to read it into Scanpy using the read_loom() function. I got the following error:

data = sc.read_loom(filename)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-c4e72203067d> in <module>()
----> 1 data = sc.read_loom(filename)

/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/readwrite/read.py in read_loom(filename)
    149         X.T,
    150         obs=lc.col_attrs,
--> 151         var=lc.row_attrs)
    152     lc.close()
    153     return adata

/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, single_col, filename, filemode, asview, oidx, vidx)
    753                 obsm=obsm, varm=varm, raw=raw,
    754                 dtype=dtype, single_col=single_col,
--> 755                 filename=filename, filemode=filemode)
    756 
    757     def _init_as_view(self, adata_ref, oidx, vidx):

/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, single_col, filename, filemode)
    889         # annotations
    890         self._obs = _gen_dataframe(obs, self._n_obs,
--> 891                                    ['obs_names', 'row_names', 'smp_names'])
    892         self._var = _gen_dataframe(var, self._n_vars, ['var_names', 'col_names'])
    893 

/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in _gen_dataframe(anno, length, index_names)
    228                 break
    229         else:
--> 230             _anno = pd.DataFrame(anno)
    231     return _anno
    232 

/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    402                                          dtype=values.dtype, copy=False)
    403             else:
--> 404                 raise ValueError('DataFrame constructor not properly called!')
    405 
    406         NDFrame.__init__(self, mgr, fastpath=True)

ValueError: DataFrame constructor not properly called!

I resolved this error by changing _anno = pd.DataFrame(anno) to _anno = pd.DataFrame(dict(anno)) in base.py.

This is because the loompy package extracts column and row annotations as generators rather than dictionaries, and pd.DataFrame does not take generators as input.

Hope it's fixed in the future. Thanks!

Yueqi

Test on 3.5 again

We started using syntax only available on Python 3.6, but our setup.py says we support 3.5.

anndata is no longer tested on 3.5 since a dependency (loompy) needs 3.6, but we should simply skip loompy tests on 3.5 instead.
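
A sketch of how such a skip could look with pytest (the marker name is illustrative):

import sys
import pytest

# skip loompy-dependent tests on interpreters where loompy cannot be installed
needs_loompy = pytest.mark.skipif(
    sys.version_info < (3, 6), reason='loompy requires Python 3.6+'
)

@needs_loompy
def test_read_loom_roundtrip():
    ...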

numpy/scipy version dependencies

Hi, would it be possible to change the numpy/scipy dependencies to numpy >= 1.14 and scipy >= 1.0?
anndata otherwise causes problems with other packages requiring newer versions of numpy/scipy.

Remove obsm

Hey guys,

What's the easiest way to remove a field in adata.obsm?
for example, I have

AnnData object with n_obs × n_vars = 48011 × 25583 
    obs: 'n_genes', 'n_counts', 'percent_mito', 'Sample', 'Donor', 'Tissue', 'batch', 'DCA_split', 'size_factors'
    var: 'gene_ids', 'n_counts'
    uns: 'DCA_losses'
    obsm: 'X_dca', 'X_dca_mean', 'X_dca_hidden', 'X_dca_dropout', 'X_dca_dispersion'

and I just want to remove the items in obsm to reduce memory consumption.
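
In recent anndata versions .obsm is dict-like, so entries can be dropped with del (a sketch using the keys from above):

# remove unneeded obsm entries to free memory
del adata.obsm['X_dca_mean']
del adata.obsm['X_dca_hidden']
del adata.obsm['X_dca_dropout']
del adata.obsm['X_dca_dispersion']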

Files with singleton categoricals cannot be opened

What I mean by this is that you might have a categorical, which internally in the h5ad is represented by e.g. uns/condition_categories. If there is only one condition (perhaps because it's a subset of data), this will fail the Pandas check for unique categories, because the shape for uns/condition_categories is (1, ).

This can be avoided by detecting singleton categories when writing the file with h5py and appending a dummy category, to ensure there are at least two unique values in uns/condition_categories.

Layers attribute similar to loom

@Koncopd Let's discuss the issue here.

In essence, we want to have loom's layers functionality also for AnnData in order to deal with replicated data matrices as produced by the velocyto command line tool.

The most basic thing we need is the iteration over the layers in the loom file and their corresponding initialization in the AnnData file, which would be an extension of read_loom (https://github.com/theislab/anndata/blob/86ede1effa86a5d88db18d71c21c4057887066b5/anndata/readwrite/read.py#L126-L155)

      adata.layer[key] = loomconnection.layer[key][:, :].T

Main questions are: how to elegantly combine the X and the layers group? Shall we call it .layers_X for more verbosity and stressing of the fact that we force the dimensions to be the same? How to deal with the transposition: ideally, when in backed mode, we don't want to load everything in the loom file into memory but rather convert the file into an .h5ad file.

Let's start with memory mode and some simple functionality, though...
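
A sketch of the extension being discussed, using the .layers spelling (memory mode only; the loompy attribute name varies by version):

# inside read_loom, after X has been read from the main matrix
for key in lc.layer.keys():
    if key == '':  # '' is the main matrix in loom files, already read as X
        continue
    adata.layers[key] = lc.layer[key][:, :].T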

rename_categories gives ValueError

Here is the example:

import scanpy.api as sc
sc.settings.verbosity = 0
adata = sc.datasets.blobs()
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
sc.tl.rank_genes_groups(adata, 'louvain')

adata.rename_categories('louvain', {'1': 'a'})

throws the error

ValueError                                Traceback (most recent call last)
<ipython-input-29-5a4b44b293eb> in <module>()
      6 sc.tl.rank_genes_groups(adata, 'louvain')
      7 
----> 8 adata.rename_categories('louvain', {'1': 'a'})

~/miniconda3/envs/spols180816d/lib/python3.6/site-packages/anndata/base.py in rename_categories(self, key, categories)
   1359                             if isinstance(v2, np.ndarray) and v2.dtype.names is not None:
   1360                                 if list(v2.dtype.names) == old_categories:
-> 1361                                     self.uns[k1][k2].dtype.names = categories
   1362                                 else:
   1363                                     logg.warn(

ValueError: must replace all names at once with a sequence of length 5

The problem does not appear if I skip sc.tl.rank_genes_groups(adata, 'louvain'). I am on the master branch of scanpy. Recently, sc.tl.rank_genes_groups reports p-values and so on. Could that be related to the error?

Use splats in API

This here:

https://github.com/theislab/anndata/blob/38c500dc8e9becd4c78a82897dea5febcdb9e0a4/anndata/anndata.py#L669

would e.g. be nicer as

def concatenate(self, *adatas, batch_key='batch', batch_categories=None): ...

This is because we can still call it with a list that way, but also more easily with multiple adatas. And we don’t have to say “list-or-AnnData”, but only “AnnData(s)”:

adata.concatenate(adata2)
adata.concatenate(adata2, adata3)
adata.concatenate(*some_adatas)

Also generally, when we introduce an API with more than two parameters, one of which has a default, we should do

def foo(bar, *, baz=1, boz=2): ...

or

def foo(bar, *baz, boz=1, biz=2): ...

(Every keyword argument after a splat star is keyword-only; this prevents errors.)

anndata - deleting specific row

Hi there,

Sorry for the simple question, but I just started using anndata within Scanpy and was wondering: is there a way to remove a specific row? Something like

if "foo" in adata.var_names:
    # keep everything except the entry named "foo"
    adata = adata[:, adata.var_names != "foo"]

(It is to remove some mitochondrial genes)

Thank you!

adata.write errors

Hi guys,

I'm really enjoying the efficiency and scalability of anndata. I have been using it to manage my large datasets. Excellent work!

But unfortunately I'm having bad luck writing the adata object to an .h5ad-formatted HDF5 file.

adata.write(results_file)

It takes a very long time and ends with one of the errors below (I haven't managed to store the adata object in a file):

  1. IndexError: too many indices for array
  2. ValueError: name already used as a name or title

In the adata object I'm using different annotations, including uns, obs, var, and obsm, and everything works smoothly without any errors. But each time I try to write it to a file, the aforementioned errors are thrown.

I will really appreciate your opinions. Thanks in advance!

dimensions not preserved after split and merge

Hi,

I observed the following issue:

  1. I split up my AnnData object into two parts,
  2. removed some cells from the second AnnData object,
  3. did louvain clustering on the second part, and
  4. merged this part back together with the first part.

However, after applying the concatenate function, I ended up with more features even though I definitely had the same number of genes in both parts.

cc @mbuttner

[proposal] Changing the backend to xarray

Xarray has a lot of advantages, e.g.:

  • named dimensions
  • Dask integration for multi-file datasets and chunked calculations for data not fitting into memory
  • Interoperability with numpy / pandas
  • NetCDF4 support, this would save the necessity to design custom HDF formats

The only big problem currently is the missing sparse data support, but this will change (hopefully in the near future):
pydata/xarray#1375

Can't index into anndata object with integer obs_names

Trying to index into an AnnData object which has integer obs_names throws an assertion error. I expected either that constructing an object with integer obs_names would not be allowed, or that indexing into them would work.

Here's a quick example. First I instantiate an AnnData object, give it integers for observation names, then get an error when I try to index into it:

>>> import scanpy.api as sc
>>> import pandas as pd
>>> import numpy as np
>>> adata = sc.datasets.krumsiek11()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
... storing 'cell_type' as categorical
>>> adata.obs_names = np.array(range(adata.n_obs))
>>> adata[:, ['Gata2', 'Gata1']]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1211, in __getitem__
    return self._getitem_view(index)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1214, in _getitem_view
    oidx, vidx = self._normalize_indices(index)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1190, in _normalize_indices
    obs = _normalize_index(obs, self.obs_names)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 231, in _normalize_index
    'Don’t call _normalize_index with non-categorical/string names'
AssertionError: Dont call _normalize_index with non-categorical/string names

This error in indexing can be recovered by using a pandas.RangeIndex for the observation names:

>>> adata.obs_names = pd.RangeIndex(stop=adata.n_obs)
>>> adata[:, ['Gata2', 'Gata1']]
View of AnnData object with n_obs × n_vars = 640 × 2 
    obs: 'cell_type'
    uns: 'iroot', 'highlights'

However, range indexes are frequently implicitly replaced with integer indexes:

>>> adata_norm = sc.pp.normalize_per_cell(adata, copy=True)
>>> adata_norm.obs_names
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            630, 631, 632, 633, 634, 635, 636, 637, 638, 639],
           dtype='int64', length=640)
>>> adata_norm[:, ['Gata2', 'Gata1']]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1211, in __getitem__
    return self._getitem_view(index)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1214, in _getitem_view
    oidx, vidx = self._normalize_indices(index)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1190, in _normalize_indices
    obs = _normalize_index(obs, self.obs_names)
  File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 231, in _normalize_index
    'Don’t call _normalize_index with non-categorical/string names'
AssertionError: Dont call _normalize_index with non-categorical/string names

Thanks!

Don’t set a log level on import

In 4233e07 we started calling logging.basicConfig and in 40bdcbb we changed the global log level to INFO. This means that everyone’s logging configuration is overridden (not nice) and they start seeing INFO-level noise from all modules!

Python modules should never call any code with global side effects on import. logging.basicConfig goes into __main__.

@falexwolf said that the goal was to have similar output to scanpy’s. The problem is that scanpy uses its own logging infrastructure instead of python’s. My proposal:

  1. get rid of this side effect
  2. get rid of scanpy’s logging infrastructure and use python’s (scverse/scanpy#256)

We can still use our own logging format and set scanpy and anndata to INFO level like this:

import logging
import sys

logger = logging.getLogger(__name__)
logger.propagate = False  # Don’t pass log messages on to the root logger and its handler
logger.setLevel('INFO')

handler = logging.StreamHandler(sys.stderr)  # Why did we use stdout?
handler.setFormatter(logging.Formatter('%(message)s'))
handler.setLevel('INFO')
logger.addHandler(handler)

Missing .raw attribute should raise proper exception

Hey,

adata = sc.datasets.paul15()
adata._get_obs_array('Sfpi1', use_raw=True)

raises ValueError: Did not find Sfpi1 in obs.keys or var_names. However, the correct exception should be ".raw doesn't exist" or similar.

How I ended up with this was actually sc.pl.scatter(adata, 'Sfpi1', 'Gata1'), which raises the same exception. Raising an exception explicitly about the lack of .raw would be much more informative for users.
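
A sketch of the kind of guard that would produce a clearer message (hypothetical placement inside _get_obs_array):

# hypothetical check before falling through to the generic lookup
if use_raw and self.raw is None:
    raise ValueError(
        '`use_raw=True`, but this AnnData object has no `.raw` attribute.'
    )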

I guess @Koncopd added the layer support, so he might be interested.

Views take around as much memory as object

I was playing around with some visualization on a large dataset, when I noticed some surprisingly high memory usage. I think I've narrowed it down to unexpected memory growth from taking views:

In [1]: import scanpy.api as sc

In [2]: %load_ext memory_profiler

In [3]: %memit adata = sc.read("bm.h5ad")
peak memory: 5317.54 MiB, increment: 5187.05 MiB

In [4]: %memit
peak memory: 2624.52 MiB, increment: 0.00 MiB

In [5]: %memit view = adata[:, (adata.var["n_cells_by_counts"] > 10000)]
peak memory: 5299.68 MiB, increment: 2675.16 MiB

In [6]: %memit
peak memory: 5080.07 MiB, increment: 0.00 MiB

My assumption here being: taking a view shouldn't cause noticeable growth in memory usage. I'm pretty sure it's not just how memory_profiler is counting objects, since top and ActivityMonitor pick this up as well.

Unable to create AnnData from input pandas DataFrame

When trying to create an AnnData from a pandas DataFrame I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-235904346c66> in <module>()
----> 1 adata_prova = sc.AnnData(adata_to_df(adata=adata_p))

~/.pyenv/versions/3.6.4/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    681                 layers=layers,
    682                 dtype=dtype, shape=shape,
--> 683                 filename=filename, filemode=filemode)
    684 
    685     def _init_as_view(self, adata_ref: 'AnnData', oidx: Index, vidx: Index):

~/.pyenv/versions/3.6.4/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, layers, dtype, shape, filename, filemode)
    824                 class_names = ', '.join(c.__name__ for c in StorageType.classes())
    825                 raise ValueError('`X` needs to be of one of {}, not {}.'
--> 826                                  .format(class_names, type(X)))
    827             if shape is not None:
    828                 raise ValueError('`shape` needs to be `None` is `X` is not `None`.')

ValueError: `X` needs to be of one of ndarray, MaskedArray, spmatrix, ZarrArray, not <class 'pandas.core.frame.DataFrame'>.

The line in the current version is:

https://github.com/theislab/anndata/blob/4c40622b7b5889f2893162d7c067256a5df5132e/anndata/base.py#L821

I am using the latest (from git) version of scanpy (1.3.1+68.ga045533) and the latest published version of AnnData (0.6.10).

I am creating an AnnData using:

adata = sc.AnnData(df)
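
A workaround sketch for this version is to unpack the DataFrame manually (adata_to_df is the helper from the report above):

import pandas as pd

df = adata_to_df(adata=adata_p)
adata_prova = sc.AnnData(
    X=df.values,
    obs=pd.DataFrame(index=df.index),
    var=pd.DataFrame(index=df.columns),
)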

Option to select the variable with the highest median when making the index unique

Hi all,
when loading the data, in case of duplicates I usually choose the item with the highest median (e.g. the gene with the highest median signal). With a pandas DataFrame it can be done as easily as this:

df = df.T
df['Median'] = df.median(axis=1)
df = df.sort_values(by=['Median'], ascending=False, na_position='last')
df = df.drop(columns=['Median'])
df = df.groupby(level=0).first()  # keep the first (highest-median) row per gene name
df = df.T

considering genes in columns and samples in index. I find it more useful than:

def make_index_unique(index, join='-'):

since it doesn't change the gene names. However, I don't know how to implement it using AnnData. Are you interested in integrating it into AnnData, maybe also with different options (e.g. average, etc.)?

Thanks,
Francesco
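
A possible translation to AnnData (a sketch, assuming a dense .X; it keeps the highest-median duplicate of each var_name):

import numpy as np

medians = np.median(adata.X, axis=0)
order = np.argsort(-medians)  # column positions, highest median first
# the first occurrence of each name in this order is its highest-median instance
_, first = np.unique(adata.var_names[order], return_index=True)
adata = adata[:, np.sort(order[first])]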

Trouble reading backed objects

Hey. I've run into a couple of issues with reading in backed objects with a raw representation.

The first is just the case of reading in an object with a raw attribute:

import scanpy.api as sc
import numpy as np

adata = sc.AnnData(X=np.random.binomial(100, .01, (100, 100)))
adata.raw = adata.copy()
sc.pp.log1p(adata) # Just so they are different
adata.write("./tmp.h5ad")
sc.read("tmp.h5ad", backed="r")
Traceback:
AttributeError                            Traceback (most recent call last)
<ipython-input-2-7e0cdbc773a6> in <module>()
      3 sc.pp.log1p(adata) # Just so they are different
      4 adata.write("./tmp.h5ad")
----> 5 sc.read("tmp.h5ad", backed="r")

/usr/local/lib/python3.6/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
     73         return _read(filename, backed=backed, sheet=sheet, ext=ext,
     74                      delimiter=delimiter, first_column_names=first_column_names,
---> 75                      backup_url=backup_url, cache=cache, **kwargs)
     76     # generate filename and read to dict
     77     filekey = filename

/usr/local/lib/python3.6/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
    274     if ext in {'h5', 'h5ad'}:
    275         if sheet is None:
--> 276             return read_h5ad(filename, backed=backed)
    277         else:
    278             logg.msg('reading sheet', sheet, 'from file', filename, v=4)

/usr/local/lib/python3.6/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
    407     if backed:
    408         # open in backed-mode
--> 409         return AnnData(filename=filename, filemode=backed)
    410     else:
    411         # load everything into memory

/usr/local/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    681                 layers=layers,
    682                 dtype=dtype, shape=shape,
--> 683                 filename=filename, filemode=filemode)
    684 
    685     def _init_as_view(self, adata_ref: 'AnnData', oidx: Index, vidx: Index):

/usr/local/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, layers, dtype, shape, filename, filemode)
    895                     self,
    896                     X=raw['X'],
--> 897                     var=_gen_dataframe(raw['var'], raw['X'].shape[1], ['var_names', 'col_names']),
    898                     varm=raw['varm'] if 'varm' in raw else None)
    899 

AttributeError: 'NoneType' object has no attribute 'shape'

Additionally the reader doesn't clean up after itself if it errors. In the same session:

adata.write("./tmp.h5ad")
Traceback:
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-e04b48216e1b> in <module>()
----> 1 adata.write("./tmp.h5ad")

/usr/local/lib/python3.6/site-packages/anndata/base.py in write(self, filename, compression, compression_opts, force_dense)
   1887 
   1888         _write_h5ad(filename, self, compression=compression, compression_opts=compression_opts,
-> 1889                                                              force_dense=force_dense)
   1890         if self.isbacked:
   1891             self.file.close()

/usr/local/lib/python3.6/site-packages/anndata/readwrite/write.py in _write_h5ad(filename, adata, force_dense, **kwargs)
    218         d['X'] = adata.X[:]
    219     # need to use 'a' if backed, otherwise we loose the backed objects
--> 220     with h5py.File(filename, 'a' if adata.isbacked else 'w', force_dense=force_dense) as f:
    221         for key, value in d.items():
    222             _write_key_value_to_h5(f, key, value, **kwargs)

/usr/local/lib/python3.6/site-packages/anndata/h5py/h5sparse.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, force_dense, **kwds)
    139             userblock_size=userblock_size,
    140             swmr=swmr,
--> 141             **kwds,
    142         )
    143         super().__init__(self.h5f, force_dense)

/usr/local/lib/python3.6/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
    310             with phil:
    311                 fapl = make_fapl(driver, libver, **kwds)
--> 312                 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
    313 
    314                 if swmr_support:

/usr/local/lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    146         fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
    147     elif mode == 'w':
--> 148         fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
    149     elif mode == 'a':
    150         # Open in append mode (read/write).

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.create()

OSError: Unable to create file (unable to truncate a file which is already open)

I get these errors using v0.6.10 and the current master branch.

BoundRecArray objects don't pickle well

BoundRecArray objects don't keep their attributes when pickled:

>>> from anndata import AnnData
/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> import pickle
>>> import numpy as np
>>> 
>>> adata = AnnData()
>>> adata.obsm._parent == adata
True
>>> adata2 = pickle.loads(pickle.dumps(adata))
>>> adata2.obsm._parent == adata2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/numpy/core/records.py", line 450, in __getattribute__
    res = fielddict[attr][:2]
KeyError: '_parent'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/numpy/core/records.py", line 452, in __getattribute__
    raise AttributeError("recarray has no attribute %s" % attr)
AttributeError: recarray has no attribute _parent
>>> adata2.obsm.__dict__
{}
>>> adata.obsm.__dict__
{'_parent': AnnData object with n_obs × n_vars = 0 × 0 , '_attr': 'obsm'}

Based on this stackoverflow question, I think the issue comes from subclassing a numpy object, which has custom code for pickling.

Calling AnnData(numpy.ndarray) fails

Hi,

thanks for putting this out there, python needed an annotated data format :)

I am trying to use the diffusion maps from scanpy, and it requires me to format my input array in AnnData format. Now, according to the documentation, I should be able to call AnnData with a numpy.ndarray and without any annotation. However, when I do that, even with ad = AnnData(np.ones((2, 2))), like in the documentation, I get a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-1792540ebafc> in <module>()
----> 1 ad = AnnData(X1)

~/miniconda2/envs/py35/lib/python3.5/site-packages/anndata/anndata.py in __init__(self, data, smp, var, uns, smpm, varm, dtype, single_col)
    334 
    335         # multi-dimensional array annotations
--> 336         if smpm is None: smpm = np.empty(self._n_smps, dtype=[])
    337         if varm is None: varm = np.empty(self._n_vars, dtype=[])
    338         self._smpm = BoundRecArr(smpm, self, 'smpm')

TypeError: Empty data-type

The error persists in Jupyter notebooks and in the Python console. Any ideas what may be causing this?

cheers,
Niko

networkx support

Hi guys,

Is there any way to store a networkx object in anndata?

I tried to store it in adata.uns['networkx'] and saved it as a .h5ad-formatted file adata.write(results_file).

But when I read back in the object, the networkx object can't be restored. It's transformed into an array. I wonder what should be the right way to deal with networkx or other graph object in anndata.

Any suggestions would be much appreciated. Thanks!
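
One workaround sketch: serialize the graph to something h5ad can store, e.g. a sparse adjacency matrix, and rebuild it after reading. Function names are from networkx 2.x (newer releases use to_scipy_sparse_array / from_scipy_sparse_array), and node labels become integer indices, so store them separately if needed:

import anndata
import networkx as nx

# before writing: store an adjacency matrix instead of the graph object
# (if your anndata version can't write sparse matrices in .uns, call .toarray())
adata.uns['graph_adj'] = nx.to_scipy_sparse_matrix(G)  # G: the networkx graph
adata.write(results_file)

# after reading: rebuild the graph from the stored adjacency matrix
adata = anndata.read_h5ad(results_file)
G = nx.from_scipy_sparse_matrix(adata.uns['graph_adj'])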

[bug] ValueError: could not broadcast input array from shape (1000,1) into shape (1000)

I'd like to store some design matrix inside adata.obsm.
However, there are cases where this design matrix has only one column, i.e. it has shape (adata.n_obs, 1).

In this case I get the following error message:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-86462c4012f3>", line 1, in <module>
    adata.obsm["asdf"] = np.reshape(np.arange(1000), (1000,1))
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 126, in __setitem__
    new[name] = arr
ValueError: could not broadcast input array from shape (1000,1) into shape (1000)

Example to reproduce:

adata.obsm["asdf"] = np.reshape(np.arange(adata.n_obs), (adata.n_obs, 1))
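
Until that's fixed, a single covariate can live in .obs instead (sketch):

# workaround: a one-column design matrix fits naturally in .obs
adata.obs['asdf'] = np.arange(adata.n_obs)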

[bug] Support integers as column names

Hi, I recently stumbled about the following problem:

In[98]: adata
Out[98]: 
AnnData object with n_obs × n_vars = 20728 × 32738 
    obs: 0, 'batch', 'condition', 'source'
    var: 0, 1
In[99]: adata.write(file)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-99-c21b8d69b5d6>", line 1, in <module>
    adata.write(file)
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1779, in write
    _write_h5ad(filename, self, compression=compression, compression_opts=compression_opts)
  File "/usr/lib/python3.6/site-packages/anndata/readwrite/write.py", line 94, in _write_h5ad
    d = adata._to_dict_fixed_width_arrays()
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1926, in _to_dict_fixed_width_arrays
    obs_rec, uns_obs = df_to_records_fixed_width(self._obs)
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 176, in df_to_records_fixed_width
    uns[k + '_categories'] = c.cat.categories.values
TypeError: unsupported operand type(s) for +: 'int' and 'str'

As it seems, df_to_records_fixed_width has problems when some column names are actually integers.

The following solves this problem:

adata.var.columns = adata.var.columns.astype(str)
adata.obs.columns = adata.obs.columns.astype(str)

inconsistent slicing with one element.

When slicing AnnData with a list of one element (adata[:, [0]]), I expect it to return a 2d array.
When slicing AnnData with a single element (adata[:, 0]), I expect it to return a 1d array.
AnnData always returns a 1d array.

See the following example

import numpy as np
from anndata import AnnData

a = np.ones((3, 3))
adata = AnnData(a)

Expected behaviour (like in numpy)

> a[:, [0]]
array([[1.],
       [1.],
       [1.]])
> a[:, 0]
array([1., 1., 1.])
> adata.X[:, [0]]
array([[1.],
       [1.],
       [1.]], dtype=float32)

Actual behaviour:

> adata[:, 0].X
ArrayView([1., 1., 1.], dtype=float32)
> adata[:, [0]].X
ArrayView([1., 1., 1.], dtype=float32)

This is somewhat related to #60. I still opened a separate issue, as in my case the behaviour is actually inconsistent with numpy.

AttributeError: 'AnnData' object has no attribute 'file' on repeated subsetting.

When I subset 3 times on this dataset, AnnData throws an AttributeError: 'AnnData' object has no attribute 'file'. Interestingly, the first two subsets work as expected.

Reproducible example:

Dataset

adata.zip

Code

import scanpy.api as sc
import pandas as pd
import numpy as np

adata = sc.read_h5ad("adata.h5ad")

adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['n_genes'] > 200, :]

(it doesn't matter if I subset with different values or on different columns)

Stacktrace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-7835ead5dcea> in <module>
----> 1 adata = adata[adata.obs['n_genes'] > 200, :]

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
   1292     def __getitem__(self, index):
   1293         """Returns a sliced view of the object."""
-> 1294         return self._getitem_view(index)
   1295 
   1296     def _getitem_view(self, index):

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
   1296     def _getitem_view(self, index):
   1297         oidx, vidx = self._normalize_indices(index)
-> 1298         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1299 
   1300     # this is used in the setter for uns, if a view

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    674             if not isinstance(X, AnnData):
    675                 raise ValueError('`X` has to be an AnnData object.')
--> 676             self._init_as_view(X, oidx, vidx)
    677         else:
    678             self._init_as_actual(

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _init_as_view(self, adata_ref, oidx, vidx)
    705         self._varm = ArrayView(adata_ref.varm[vidx_normalized], view_args=(self, 'varm'))
    706         # hackish solution here, no copy should be necessary
--> 707         uns_new = deepcopy(self._adata_ref._uns)
    708         # need to do the slicing before setting the updated self._n_obs, self._n_vars
    709         self._n_obs = self._adata_ref.n_obs  # use the original n_obs here

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    278     if state is not None:
    279         if deep:
--> 280             state = deepcopy(state, memo)
    281         if hasattr(y, '__setstate__'):
    282             y.__setstate__(state)

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_tuple(x, memo, deepcopy)
    218 
    219 def _deepcopy_tuple(x, memo, deepcopy=deepcopy):
--> 220     y = [deepcopy(a, memo) for a in x]
    221     # We're not going to put the tuple in the memo, but it's still important we
    222     # check for it, in case the tuple contains recursive mutable structures.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in <listcomp>(.0)
    218 
    219 def _deepcopy_tuple(x, memo, deepcopy=deepcopy):
--> 220     y = [deepcopy(a, memo) for a in x]
    221     # We're not going to put the tuple in the memo, but it's still important we
    222     # check for it, in case the tuple contains recursive mutable structures.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    278     if state is not None:
    279         if deep:
--> 280             state = deepcopy(state, memo)
    281         if hasattr(y, '__setstate__'):
    282             y.__setstate__(state)

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    278     if state is not None:
    279         if deep:
--> 280             state = deepcopy(state, memo)
    281         if hasattr(y, '__setstate__'):
    282             y.__setstate__(state)

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    305                 key = deepcopy(key, memo)
    306                 value = deepcopy(value, memo)
--> 307                 y[key] = value
    308         else:
    309             for key, value in dictiter:

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __setitem__(self, idx, value)
    455         else:
    456             adata_view, attr_name = self._view_args
--> 457             _init_actual_AnnData(adata_view)
    458             getattr(adata_view, attr_name)[idx] = value
    459 

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _init_actual_AnnData(adata_view)
    373 
    374 def _init_actual_AnnData(adata_view):
--> 375     if adata_view.isbacked:
    376         raise ValueError(
    377             'You cannot modify elements of an AnnData view, '

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in isbacked(self)
   1188     def isbacked(self):
   1189         """``True`` if object is backed on disk, ``False`` otherwise."""
-> 1190         return self.filename is not None
   1191 
   1192     @property

~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in filename(self)
   1204           want to copy the previous file, use ``copy(filename='new_filename')``.
   1205         """
-> 1206         return self.file.filename
   1207 
   1208     @filename.setter

AttributeError: 'AnnData' object has no attribute 'file'

Package versions

Running Scanpy 1.3.2 on 2018-10-29 14:22.
anndata==0.6.10 numpy==1.14.3 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.20.0 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 matplotlib==3.0.0 seaborn==0.9.0
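
Until this is fixed, a possible workaround (a sketch based on the traceback above, which points at the deepcopy of view state; not verified against this exact dataset) is to materialize each subset with .copy(), so that later subsets operate on an actual AnnData rather than a view of a view:

import scanpy.api as sc

adata = sc.read_h5ad("adata.h5ad")

# .copy() turns the view returned by subsetting into an actual AnnData,
# so the next subset does not chain a view onto a view.
adata = adata[adata.obs['n_genes'] > 200, :].copy()
adata = adata[adata.obs['n_genes'] > 200, :].copy()
adata = adata[adata.obs['n_genes'] > 200, :].copy()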

create conda recipe to upload to bioconda

Hi, scvi supports automatically loading an .h5ad file for scRNA-seq analysis. Since we include anndata as one of our dependencies and we have uploaded scvi to the bioconda channel, I'm wondering if it's okay if I upload a conda recipe for anndata to bioconda?

gzip and bzip2 compression support in read()

Hey,

Currently csv/tsv files with gzip or bzip2 compression are not supported, if I'm not mistaken. There is an issue in DCA (theislab/dca#7) about this, so I wanted to file a tracking issue here.

I was thinking about simply adding gzip.open() and bz2.open() calls based on the file extension, just like other functions in the implementation do, but there is also the option of using pandas for this, because once we add compression support, anndata's read_text() will start to converge toward pandas.read_csv().

So would it make sense to use pandas.read_csv for all text file reading functionality? It's already a dependency of anndata, so I don't see why not.
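
For concreteness, here is a minimal sketch of the extension-based dispatch (the _open_text helper is hypothetical, not an existing anndata function):

import bz2
import gzip
from pathlib import Path

def _open_text(filename):
    # Pick an opener from the file extension; fall back to plain open().
    suffix = Path(filename).suffix
    if suffix == '.gz':
        return gzip.open(filename, 'rt')
    if suffix == '.bz2':
        return bz2.open(filename, 'rt')
    return open(filename)

The pandas route is even shorter, since pandas.read_csv infers compression from the extension by default: pd.read_csv(filename, compression='infer').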

No module named pathlib on install

I believe pathlib needs to be added as a dependency.

> pip install anndata
Collecting anndata
  Using cached https://files.pythonhosted.org/packages/a2/31/abf1918b45012977f1f78de6cdd01ee6c3650acae538ff8f7b0b17c1f47f/anndata-0.6.5.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "c:\users\scott\appdata\local\temp\pip-build-lzqa4a\anndata\setup.py", line 2, in <module>
        from pathlib import Path
    ImportError: No module named pathlib

This is with Python 2.7 on Windows.
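
Since pathlib only entered the standard library in Python 3.4, an alternative to adding a dependency (a sketch with hypothetical package metadata, not anndata's actual setup.py) is to declare a Python 3 floor, so pip on 2.7 reports an unsupported Python version instead of crashing in setup.py:

from setuptools import setup

setup(
    name='example-package',  # hypothetical metadata
    version='0.1',
    # pathlib is in the stdlib from Python 3.4 on; this makes pip on
    # Python 2.7 refuse to install with a clear message instead of an
    # ImportError mid-build.
    python_requires='>=3.5',
)

Note that python_requires is only enforced by reasonably recent pip (>= 9.0) and setuptools (>= 24.2).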

Documentation request: Row and column selection and iteration examples

I've found that I can select multiple rows and then iterate over their columns and values like this:

# Trying to find how to get one gene
selected = adata[:, adata.var_names.isin(['Tcea1', 'Xkr4'])]  # works

# This version only works if X is a sparse matrix; if not, the tocoo() call fails.
#cx = adata.X.tocoo()    
#for cell, gene, value in zip(adata.obs_names[cx.row], adata.var_names[cx.col], cx.data):
#    print(cell, gene, value)

# This is for a complete matrix
for g, gene in enumerate(selected.var_names):
    for c, cell in enumerate(selected.obs_names):
        print("{0}\t{1}\t{2}".format(cell, gene, selected.X[c, g]))

But now I want to select just ONE row and iterate over its columns and values.

These do not work.

selected = adata[:, adata.var_names.isin(['Tcea1',])] # this doesn't
selected = adata[:, adata.var_names['Tcea1']]  # this doesn't
What actually fails each time, though, isn't creating selected; it's trying to print it:

Traceback (most recent call last):
  File "./h5ad_test_read.py", line 35, in <module>
    print("{0}\t{1}\t{2}".format(cell, gene, selected.X[c, g]))
IndexError: too many indices for array


I'd love to see an example of doing this correctly for a single gene.
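
One way that should work (a sketch, assuming adata is loaded as above; indexing with a one-element list keeps the result 2-D, so the values come out as an (n_obs, 1) array):

import numpy as np
from scipy import sparse

# Select a single gene; the result is still 2-D with shape (n_obs, 1).
selected = adata[:, ['Tcea1']]

X = selected.X
if sparse.issparse(X):
    X = X.toarray()  # densify; a single gene's column is small
values = np.asarray(X).reshape(-1)  # one value per cell

for cell, value in zip(selected.obs_names, values):
    print("{0}\t{1}\t{2}".format(cell, 'Tcea1', value))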

Sparse matrices with HDF5

This issue is meant to serve as a discussion page for establishing conventions for storing sparse data in HDF5 files.

The suggestion made within anndata is described here.
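
To make the discussion concrete, here is a minimal sketch of one common layout (illustrative only, not necessarily the exact convention anndata settles on): the three CSR arrays go into datasets within a group, with the matrix shape recorded as an attribute.

import h5py
import scipy.sparse as sp

def write_csr(group, matrix):
    # Store a CSR matrix as three datasets plus a shape attribute.
    matrix = sp.csr_matrix(matrix)
    group.create_dataset('data', data=matrix.data)
    group.create_dataset('indices', data=matrix.indices)
    group.create_dataset('indptr', data=matrix.indptr)
    group.attrs['shape'] = matrix.shape

def read_csr(group):
    # Rebuild the CSR matrix from the layout written above.
    return sp.csr_matrix(
        (group['data'][...], group['indices'][...], group['indptr'][...]),
        shape=tuple(group.attrs['shape']),
    )

with h5py.File('sparse_example.h5', 'w') as f:
    write_csr(f.create_group('X'), sp.random(100, 50, density=0.1, format='csr'))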
