Comments (5)

rrydbirk commented on August 20, 2024

Unfortunately, your code erased everything in cols:

>>> cols = pd.read_csv("./cells.csv").values
>>> cols.size
118345
>>> cols = [c[0] for c in cols if type(c) is tuple]
>>> cols.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'size'
>>> cols
[]
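
A likely reason the list came back empty: `.values` on a DataFrame returns a 2-D NumPy array, so each row is itself a small array rather than a tuple, and the `type(c) is tuple` filter rejects every entry. A minimal sketch, assuming a one-column CSV with a header line (the in-memory `csv_text` below is a made-up stand-in for cells.csv):

```python
import io

import pandas as pd

# Hypothetical stand-in for cells.csv: one header line, one cell name per row.
csv_text = "cell_id\nCell1_AAAC-1\nCell2_AAAG-1\n"
cols = pd.read_csv(io.StringIO(csv_text)).values

print(type(cols), cols.shape)  # a 2-D ndarray with a single column
print(type(cols[0]))           # each row is an ndarray, not a tuple

# The tuple check is never true for ndarray rows, so nothing survives.
filtered = [c[0] for c in cols if type(c) is tuple]

# Dropping the check keeps every name as a plain string.
unfiltered = [c[0] for c in cols]

print(filtered)
print(unfiltered)
```

Without the type check, the comprehension is equivalent to the `[v[0] for v in ...]` lines in the working version.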

Instead, this did the trick:

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
data.columns = pd.read_csv("./cells.csv").values
data.columns = [v[0] for v in data.columns]
data.index = pd.read_csv("./genes.csv").values
data.index = [v[0] for v in data.index]

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 1 min 8 sec
Preprocessing - normalizing data ... data normalized in 2 min 24 sec
Preprocessing - running ANOVA ... excluded 4172 / 28621 genes in 2 min 44 sec
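
The same flattening can also be done in one step by taking the first column of each CSV as a Series. A minimal sketch, assuming each CSV has a header line and a single column of names; the in-memory files and the tiny zero matrix are hypothetical stand-ins for cells.csv, genes.csv and the HDF5 data:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for cells.csv and genes.csv.
cells_csv = io.StringIO("cell_id\nCell1\nCell2\nCell3\n")
genes_csv = io.StringIO("gene\nAL627309.1\nFAM87B\n")

# Hypothetical stand-in for the matrix read from data.h5 (genes x cells).
counts = np.zeros((2, 3), dtype="float32")

data = pd.DataFrame(counts)
# .iloc[:, 0] selects the first column as a 1-D Series, so the labels
# come out as plain strings instead of one-element arrays.
data.columns = pd.read_csv(cells_csv).iloc[:, 0].astype(str).tolist()
data.index = pd.read_csv(genes_csv).iloc[:, 0].astype(str).tolist()

print(data)
```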

Also, I now see that I'm running on an old version of HDF5Array (1.10.1) since I'm still on R 3.5.1. In newer versions (at least 1.16.0), writeHDF5Array includes the option "with.dimnames" which should circumvent my problems.
So, for future reference, going from R to Python with large sparse matrices, this should suffice:
writeHDF5Array(matrix, "./data.h5", "data", verbose = T, with.dimnames=T)

from cellex.

tstannius commented on August 20, 2024

Hi Rasmus!

I will need a little more to go on. Would you kindly share the code (a minimal example) needed to reproduce this output?
Please also include the output of calling

data.head()
metadata.head()

And the versions of the modules you are using.

rrydbirk commented on August 20, 2024

Hi Tobias

Thanks for getting back to me.
This is my code. Please note, since I'm coming from R and my count matrix is too big to write to .csv, I used HDF5 instead. It's a bit dirty since I couldn't get it to transfer row and column names directly (and I'm not that familiar with Python in general):

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']))
data.columns = pd.read_csv("./cells.csv").values
data.index = pd.read_csv("./genes.csv").values
data = data.astype(dtype="int64")

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

The requested output. I omitted sensitive data:

>>> data.head()
               (Cell1_AAACCCAAGATCCGAG-1,)  ...  (Celln_TTTGTTGTCTAACGGT-1,)
(AL627309.1,)                                  0  ...                                  0
(AC114498.1,)                                  0  ...                                  0
(AL669831.2,)                                  0  ...                                  0
(AL669831.5,)                                  1  ...                                  0
(FAM87B,)                                      0  ...                                  0

[5 rows x 118345 columns]
>>> metadata.head()
                               cell_type
cell_id
Cell1_ACTACGAGTATGCGGA-1    Type1
Cell2_AAACCCAAGTGCGCTC-1    Type1
Cell3_ACTGATGGTAACACGG-1    Type1
Cell4_CACTGAATCACAATGC-1    Type1
Cell5_CAGGGCTCACCAACAT-1    Type1

And the versions:
➜ Cellex python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux

➜ Cellex pip list

Package Version ---------------------------------- --------- adjustText 0.7.3 alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.7 anaconda-project 0.8.3 aniso8601 8.0.0 annoy 1.16.3 asn1crypto 1.0.1 astroid 2.3.1 astropy 3.2.2 atomicwrites 1.3.0 attrs 19.2.0 Babel 2.7.0 backcall 0.1.0 backports.functools-lru-cache 1.5 backports.os 0.1.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 beautifulsoup4 4.8.0 bitarray 1.0.1 bkcharts 0.2 bleach 3.1.0 bokeh 1.3.4 boto 2.49.0 boto3 1.7.84 botocore 1.10.84 Bottleneck 1.2.1 cellex 1.1.1 CellPhoneDB 2.1.2 certifi 2019.9.11 cffi 1.12.3 chardet 3.0.4 click 6.7 cloudpickle 1.2.2 clyent 1.2.2 colorama 0.4.1 conda 4.8.2 conda-build 3.18.9 conda-package-handling 1.6.0 conda-verify 3.4.2 contextlib2 0.6.0 cryptography 2.7 cycler 0.10.0 Cython 0.29.13 cytoolz 0.10.0 dask 2.5.2 decorator 4.4.0 defusedxml 0.6.0 descartes 1.1.0 distributed 2.5.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 fbpca 1.0 filelock 3.0.12 Flask 1.0.4 Flask-RESTful 0.3.8 Flask-Testing 0.7.1 fsspec 0.5.2 future 0.17.1 geosketch 0.3 gevent 1.4.0 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.15 h5py 2.9.0 HeapDict 1.0.1 html5lib 1.0.1 idna 2.7 imageio 2.6.0 imagesize 1.1.0 importlib-metadata 0.23 ipykernel 5.1.2 ipython 7.8.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.15.1 jeepney 0.4.1 Jinja2 2.10.3 jmespath 0.9.5 joblib 0.13.2 json5 0.8.5 jsonschema 3.0.2 jupyter 1.0.0 jupyter-client 5.3.3 jupyter-console 6.0.0 jupyter-core 4.5.0 jupyterlab 1.1.4 jupyterlab-server 1.0.6 keyring 18.0.0 kiwisolver 1.1.0 lazy-object-proxy 1.4.2 libarchive-c 2.8 lief 0.9.0 llvmlite 0.29.0 locket 0.2.0 loompy 3.0.6 lxml 4.4.1 MarkupSafe 1.1.1 matplotlib 3.1.1 mccabe 0.6.1 mistune 0.8.4 mizani 0.6.0 mkl-fft 1.0.14 mkl-random 1.1.0 mkl-service 2.3.0 mock 3.0.5 more-itertools 7.2.0 mpmath 1.1.0 msgpack 0.6.1 multipledispatch 0.6.0 navigator-updater 0.2.1 nbconvert 5.6.0 
nbformat 4.4.0 networkx 2.3 nltk 3.4.5 nose 1.3.7 notebook 6.0.1 numba 0.45.1 numexpr 2.7.0 numpy 1.17.2 numpy-groupies 0+unknown numpydoc 0.9.1 olefile 0.46 openpyxl 3.0.0 packaging 19.2 palettable 3.3.0 pandas 1.0.3 pandocfilters 1.4.2 parso 0.5.1 partd 1.0.0 path.py 12.0.1 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.7.0 pickleshare 0.7.5 pika 0.12.0 Pillow 6.2.0 pip 19.2.3 pkginfo 1.5.0.1 plotnine 0.6.0 pluggy 0.13.0 ply 3.11 prometheus-client 0.7.1 prompt-toolkit 2.0.10 psutil 5.6.3 ptyprocess 0.6.0 py 1.8.0 pyarrow 0.17.1 pycodestyle 2.5.0 pycosat 0.6.3 pycparser 2.19 pycrypto 2.6.1 pycurl 7.43.0.3 pyflakes 2.1.1 Pygments 2.4.2 pylint 2.4.2 pyodbc 4.0.27 pyOpenSSL 19.0.0 pyparsing 2.4.2 pyreadr 0.2.9 pyrsistent 0.15.4 pysam 0.15.4 PySocks 1.7.1 pytest 5.2.1 pytest-arraydiff 0.3 pytest-astropy 0.5.0 pytest-doctestplus 0.4.0 pytest-openfiles 0.4.0 pytest-remotedata 0.3.2 python-dateutil 2.8.0 pytz 2019.3 PyWavelets 1.0.3 PyYAML 5.1.2 pyzmq 18.1.0 QtAwesome 0.6.0 qtconsole 4.5.5 QtPy 1.9.0 requests 2.19.1 rope 0.14.0 rpy2 3.0.5 ruamel-yaml 0.15.46 s3transfer 0.1.13 scikit-image 0.15.0 scikit-learn 0.21.3 scipy 1.3.1 scrublet 0.2.1 seaborn 0.9.0 SecretStorage 3.1.1 Send2Trash 1.5.0 setuptools 41.4.0 setuptools-scm 4.1.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 six 1.12.0 snowballstemmer 2.0.0 sortedcollections 1.1.2 sortedcontainers 2.1.0 soupsieve 1.9.3 Sphinx 2.2.0 sphinxcontrib-applehelp 1.0.1 sphinxcontrib-devhelp 1.0.1 sphinxcontrib-htmlhelp 1.0.2 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.2 sphinxcontrib-serializinghtml 1.1.3 sphinxcontrib-websupport 1.1.2 spyder 3.3.6 spyder-kernels 0.5.2 SQLAlchemy 1.3.9 statsmodels 0.10.1 sympy 1.4 tables 3.5.2 tblib 1.4.0 terminado 0.8.2 testpath 0.4.2 toolz 0.10.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 tzlocal 2.1 umap-learn 0.3.10 unicodecsv 0.14.1 urllib3 1.23 velocyto 0.17.17 wcwidth 0.1.7 webencodings 0.5.1 Werkzeug 0.16.0 wheel 0.33.6 widgetsnbextension 3.5.1 wrapt 1.11.2 wurlitzer 1.0.3 xlrd 
1.2.0 XlsxWriter 1.2.1 xlwt 1.3.0 zict 1.0.0 zipp 0.6.0

If you need the actual data, I can anonymize it and share it with you. Let me know if you need anything else.

tstannius commented on August 20, 2024

Thanks!

It makes good sense to store your data as HDF5, and we will also transition to this format in the next release :-)

It seems that the type (tuple) of the row and column names is causing the error. They should be strings instead.

Do try something along the lines of the following and let me know how it goes.

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
# I think float32 should be sufficient
cols = pd.read_csv("./cells.csv").values
cols = [c[0] for c in cols if type(c) is tuple]
# may give an error, if not all colnames were tuples and you will have to do some more work,
# but rather that than a silent error!
data.columns = cols


# same approach for idx
idx = pd.read_csv("./genes.csv").values
# do stuff ...
data.index = idx

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

tstannius commented on August 20, 2024

Glad to hear that!

I will keep the issue open until the input check function has been updated to handle this kind of issue.
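
One way such a check could look, as a rough sketch only: the helper name `check_labels_are_strings` is hypothetical and is not part of cellex. It raises early when any row or column label is not a plain string, which is the situation described in this thread:

```python
import pandas as pd


def check_labels_are_strings(df: pd.DataFrame) -> None:
    """Hypothetical helper (not part of cellex): reject non-string labels."""
    for axis_name, labels in (("index", df.index), ("columns", df.columns)):
        bad = [label for label in labels if not isinstance(label, str)]
        if bad:
            raise TypeError(
                f"{axis_name} has {len(bad)} non-string label(s), "
                f"e.g. {bad[0]!r}; expected plain strings"
            )


# Labels that are plain strings pass silently.
good = pd.DataFrame([[0, 1]], index=["FAM87B"], columns=["Cell1", "Cell2"])
check_labels_are_strings(good)

# One-element tuple labels, as in the data.head() output above, are rejected.
tuple_columns = pd.MultiIndex.from_tuples([("Cell1",), ("Cell2",)])
bad = pd.DataFrame([[0, 1]], index=["FAM87B"], columns=tuple_columns)
try:
    check_labels_are_strings(bad)
except TypeError as err:
    print(err)
```

Failing loudly at input time keeps the problem from surfacing later as an unrelated error in the middle of preprocessing.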
