I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the prepr

Unfortunately, your code erased everything in cols: <div class="highlight highligh

parse_input does not detect invalid metadata value types,about perslab/cellex

rrydbirk commented on August 20, 2024 1

Unfortunately, your code erased everything in cols:

>>> cols = pd.read_csv("./cells.csv").values
>>> cols.size
118345
>>> cols = [c[0] for c in cols if type(c) is tuple]
>>> cols.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'size'
>>> cols
[]

Instead, this did the trick:

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
data.columns = pd.read_csv("./cells.csv").values
data.columns = [v[0] for v in data.columns]
data.index = pd.read_csv("./genes.csv").values
data.index = [v[0] for v in data.index]

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 1 min 8 sec
Preprocessing - normalizing data ... data normalized in 2 min 24 sec
Preprocessing - running ANOVA ... excluded 4172 / 28621 genes in 2 min 44 sec

Also, I now see that I'm running on an old version of HDF5Array (1.10.1) since I'm still on R 3.5.1. In newer versions (at least 1.16.0), writeHDF5Array includes the option "with.dimnames" which should circumvent my problems.
So, for future reference, going from R to Python with large sparsematrices, this should suffice:
writeHDF5Array(matrix, "./data.h5", "data", verbose = T, with.dimnames=T)

from cellex.

tstannius commented on August 20, 2024

Hi Rasmus!

I will need a little more to go on. Would you kindly share the code (a minimal example) needed to reproduce this output?
Please also include the output of calling

data.head()
metadata.head()

And the versions of the modules you are using.

from cellex.

rrydbirk commented on August 20, 2024

Hi Tobias

Thanks for getting back to me.
This is my code. Please note, since I'm coming from R and my count matrix is too big to write to .csv, I used HDF5 instead. It's a bit dirty since I couldn't get it to transfer row and column names directly (and I'm not that familiar with Python in general):

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']))
data.columns = pd.read_csv("./cells.csv").values
data.index = pd.read_csv("./genes.csv").values
data = data.astype(dtype="int64")

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

The requested output. I omitted sensitive data:

data.head()
               (Cell1_AAACCCAAGATCCGAG-1,)  ...  (Celln_TTTGTTGTCTAACGGT-1,)
(AL627309.1,)                                  0  ...                                  0
(AC114498.1,)                                  0  ...                                  0
(AL669831.2,)                                  0  ...                                  0
(AL669831.5,)                                  1  ...                                  0
(FAM87B,)                                      0  ...                                  0

[5 rows x 118345 columns]

>>> metadata.head()
                               cell_type
cell_id
Cell1_ACTACGAGTATGCGGA-1    Type1
Cell2_AAACCCAAGTGCGCTC-1    Type1
Cell3_ACTGATGGTAACACGG-1    Type1
Cell4_CACTGAATCACAATGC-1    Type1
Cell5_CAGGGCTCACCAACAT-1    Type1

And the versions:
➜ Cellex python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux

➜ Cellex pip list

Click to show

Package Version ---------------------------------- --------- adjustText 0.7.3 alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.7 anaconda-project 0.8.3 aniso8601 8.0.0 annoy 1.16.3 asn1crypto 1.0.1 astroid 2.3.1 astropy 3.2.2 atomicwrites 1.3.0 attrs 19.2.0 Babel 2.7.0 backcall 0.1.0 backports.functools-lru-cache 1.5 backports.os 0.1.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 beautifulsoup4 4.8.0 bitarray 1.0.1 bkcharts 0.2 bleach 3.1.0 bokeh 1.3.4 boto 2.49.0 boto3 1.7.84 botocore 1.10.84 Bottleneck 1.2.1 cellex 1.1.1 CellPhoneDB 2.1.2 certifi 2019.9.11 cffi 1.12.3 chardet 3.0.4 click 6.7 cloudpickle 1.2.2 clyent 1.2.2 colorama 0.4.1 conda 4.8.2 conda-build 3.18.9 conda-package-handling 1.6.0 conda-verify 3.4.2 contextlib2 0.6.0 cryptography 2.7 cycler 0.10.0 Cython 0.29.13 cytoolz 0.10.0 dask 2.5.2 decorator 4.4.0 defusedxml 0.6.0 descartes 1.1.0 distributed 2.5.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 fbpca 1.0 filelock 3.0.12 Flask 1.0.4 Flask-RESTful 0.3.8 Flask-Testing 0.7.1 fsspec 0.5.2 future 0.17.1 geosketch 0.3 gevent 1.4.0 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.15 h5py 2.9.0 HeapDict 1.0.1 html5lib 1.0.1 idna 2.7 imageio 2.6.0 imagesize 1.1.0 importlib-metadata 0.23 ipykernel 5.1.2 ipython 7.8.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.15.1 jeepney 0.4.1 Jinja2 2.10.3 jmespath 0.9.5 joblib 0.13.2 json5 0.8.5 jsonschema 3.0.2 jupyter 1.0.0 jupyter-client 5.3.3 jupyter-console 6.0.0 jupyter-core 4.5.0 jupyterlab 1.1.4 jupyterlab-server 1.0.6 keyring 18.0.0 kiwisolver 1.1.0 lazy-object-proxy 1.4.2 libarchive-c 2.8 lief 0.9.0 llvmlite 0.29.0 locket 0.2.0 loompy 3.0.6 lxml 4.4.1 MarkupSafe 1.1.1 matplotlib 3.1.1 mccabe 0.6.1 mistune 0.8.4 mizani 0.6.0 mkl-fft 1.0.14 mkl-random 1.1.0 mkl-service 2.3.0 mock 3.0.5 more-itertools 7.2.0 mpmath 1.1.0 msgpack 0.6.1 multipledispatch 0.6.0 navigator-updater 0.2.1 nbconvert 5.6.0 nbformat 4.4.0 networkx 2.3 nltk 3.4.5 nose 1.3.7 notebook 6.0.1 numba 0.45.1 numexpr 2.7.0 numpy 1.17.2 numpy-groupies 0+unknown numpydoc 0.9.1 olefile 0.46 openpyxl 3.0.0 packaging 19.2 palettable 3.3.0 pandas 1.0.3 pandocfilters 1.4.2 parso 0.5.1 partd 1.0.0 path.py 12.0.1 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.7.0 pickleshare 0.7.5 pika 0.12.0 Pillow 6.2.0 pip 19.2.3 pkginfo 1.5.0.1 plotnine 0.6.0 pluggy 0.13.0 ply 3.11 prometheus-client 0.7.1 prompt-toolkit 2.0.10 psutil 5.6.3 ptyprocess 0.6.0 py 1.8.0 pyarrow 0.17.1 pycodestyle 2.5.0 pycosat 0.6.3 pycparser 2.19 pycrypto 2.6.1 pycurl 7.43.0.3 pyflakes 2.1.1 Pygments 2.4.2 pylint 2.4.2 pyodbc 4.0.27 pyOpenSSL 19.0.0 pyparsing 2.4.2 pyreadr 0.2.9 pyrsistent 0.15.4 pysam 0.15.4 PySocks 1.7.1 pytest 5.2.1 pytest-arraydiff 0.3 pytest-astropy 0.5.0 pytest-doctestplus 0.4.0 pytest-openfiles 0.4.0 pytest-remotedata 0.3.2 python-dateutil 2.8.0 pytz 2019.3 PyWavelets 1.0.3 PyYAML 5.1.2 pyzmq 18.1.0 QtAwesome 0.6.0 qtconsole 4.5.5 QtPy 1.9.0 requests 2.19.1 rope 0.14.0 rpy2 3.0.5 ruamel-yaml 0.15.46 s3transfer 0.1.13 scikit-image 0.15.0 scikit-learn 0.21.3 scipy 1.3.1 scrublet 0.2.1 seaborn 0.9.0 SecretStorage 3.1.1 Send2Trash 1.5.0 setuptools 41.4.0 setuptools-scm 4.1.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 six 1.12.0 snowballstemmer 2.0.0 sortedcollections 1.1.2 sortedcontainers 2.1.0 soupsieve 1.9.3 Sphinx 2.2.0 sphinxcontrib-applehelp 1.0.1 sphinxcontrib-devhelp 1.0.1 sphinxcontrib-htmlhelp 1.0.2 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.2 sphinxcontrib-serializinghtml 1.1.3 sphinxcontrib-websupport 1.1.2 spyder 3.3.6 spyder-kernels 0.5.2 SQLAlchemy 1.3.9 statsmodels 0.10.1 sympy 1.4 tables 3.5.2 tblib 1.4.0 terminado 0.8.2 testpath 0.4.2 toolz 0.10.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 tzlocal 2.1 umap-learn 0.3.10 unicodecsv 0.14.1 urllib3 1.23 velocyto 0.17.17 wcwidth 0.1.7 webencodings 0.5.1 Werkzeug 0.16.0 wheel 0.33.6 widgetsnbextension 3.5.1 wrapt 1.11.2 wurlitzer 1.0.3 xlrd 1.2.0 XlsxWriter 1.2.1 xlwt 1.3.0 zict 1.0.0 zipp 0.6.0

If you need the actual data, I can anonymize it and share it with you. Let me know if you need anything else.

from cellex.

tstannius commented on August 20, 2024

Thanks!

It makes good sense to store your data as hdf5 and we will also transition to this format in the next release :-)

It seems that the type (tuple) of the row and column names are causing the error. They should be strings instead.

Do try something along the lines of the following and let me know how it goes.

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
# I think float32 should be sufficient
cols = pd.read_csv("./cells.csv").values
cols = [c[0] for c in cols if type(c) is tuple]
# may give an error, if not all colnames were tuples and you will have to do some more work,
# but rather that than a silent error!
data.columns = cols


# same approach for idx
idx = pd.read_csv("./genes.csv").values
# do stuff ...
data.index = idx

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

from cellex.

tstannius commented on August 20, 2024

Glad to hear that!

I will keep the issue open until the input check function has been updated to handle this kind of issue.

from cellex.

parse_input does not detect invalid metadata value types about cellex HOT 5 OPEN

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent