Hello and happy new year cellxgene team. I am working on cellxgene i

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

[FEATURE REQUEST] .CXG (TileDb compliant) format support from single-cell-explorer to cellxgene Annotate about cellxgene HOT 5 OPEN

Neah-Ko commented on May 28, 2024

[FEATURE REQUEST] .CXG (TileDb compliant) format support from single-cell-explorer to cellxgene Annotate

from cellxgene.

Comments (5)

MaximilianLombardo commented on May 28, 2024 1

Hi @Neah-Ko !

Thanks for your question! Responding inline:

Can we expect this feature (support ingestion of .cxg) to be available on Annotate ?

We currently do not have plans to incorporate this feature in Annotate in the near future, that being said our roadmap has not been fully established for the latter half of 2024, so its not totally out of the realm of possibilities (if it becomes enabling or required for some other higher priority features, this would incentivize us to commit some resources towards development).

I will most likely have to work on a PoC on this feature. If you are interested, I could extend a pull request.

We really appreciate this offer and if this is a high priority/urgent feature for you, I think it makes sense for you to work on it. If you do end up implementing the feature, we'd definitely like to hear about your experience and technical discovery work on it. We can't promise that the PR would be merged into the main branch since our engineering resources are already committed to higher priority work, and we would likely not have time to review. That being said, hearing about your experience and learnings while implementing it would be invaluable for us and we definitely encourage you to report back with any information that might inform a future direction/implementation of the feature in Annotate.

We sincerely value your understanding and enthusiasm for contributing to the CELLxGENE community. Please do continue to share your ideas and feedback with us.

from cellxgene.

Neah-Ko commented on May 28, 2024

Hello @MaximilianLombardo

So I went on with frontporting the feature, with way more success than I initially expected to be honest :)

You may find the code at #2656
It comes with a Dockerfile in order to test it

docker build . -f Dockerfile_editable -t cellxgene:<tag>
docker run -it  -p 5005:5005 -v /path/to/datasets:/data cellxgene:<tag> launch /data/dataset.[h5ad | cxg/] --host 0.0.0.0 --verbose

Note: The Dockerfile is made to be rebuilt to test changes on sources. For me, installing cellxgene in editable mode didn't worked because then react doesn't find the templates anymore. I had a look on stackoverflow and it may or may not be arranged by tweaking pyproject.toml

Behavior

So far, it seems that both file formats can be read in, with no concerning error in the logs ⭐
However, I cannot guarantee yet that it is full-proof. I mostly used example-datasets/pbmc3k.h5ad and single-cell-explorer->example-datasets/pbmc3k.cxg/ for my tests.

In principle, I merged the different parts into an up to date version of CXG, so if something breaks it is probably going to be on the CXG side.

I've made some modest attempts at factoring, but it probably deserves another pass.

I am going to detail some of my choices and findings:

FBS schema

As mentioned above, I merged into Annotate. I decided to try without regenerating the schema.

To support cxg dataset, I had to propagate some num_bins argument value around server/common/fbs/matrix.py#encode_matrix_fbs() to eventually not use it.
In the single-cell-explorer repo, it contributes deciding of a "lossy" array_type when serializing, so we might be loosing something in the process.

To dig more into it, I would need access to at least the compilation chains sent to flatc

Conversion

cellxgene reads the dataset when converted with the primitives in server/converters/h5ad_data_file.py coming from v0.16.3 (pr features the hash of the commit)

However, I have to load it setting the flag
Else, I get this error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 48, in __init__
    self.extract_metadata_about_dataset()
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 204, in extract_metadata_about_dataset
    raise ValueError("Unknown source file schema version is unsupported.")
ValueError: Unknown source file schema version is unsupported.

It doesn't reads when conversion is done using cxgtool

Configuration

The configuration scheme evolved from single-cell-explorer to here. At the moment I am not fully certain that all parts are going to be consumed and not taken from the default.

I'm almost duplicating the dataset entry with a default_dataset copied from single-cell-explorer. We may rethink about naming and factoring.

CxgDataset parameters

To enable annotation support and user genesets creation I had to tweak the function _validate_and_initialize to recover them from the schema, whereas it is being set in _alias_annotation_names on anndata_adaptor side, and I am not 100% sure that the two are yielding consistent results. Especially since there's some renaming happening on the second one.

In general I could have overlooked something from the many configuration files.

Let me know what you think of it, I am going to scan and bring minor improvements where I can next week and try to merge and run both data_adaptors unit tests.

from cellxgene.

Neah-Ko commented on May 28, 2024

Hello @MaximilianLombardo ,

Update

As I mentioned above, the pull request now also features test/unit/test_cxg_dataset.py, which runs fine but fails on creating a lossy array (i.e. test_tdb_bug_lossy).

I can see that on the single-cell-explorer side, the primitives located at server/common/fbs/fbs_coders.py are doing some encoding optimization on serialization (E.g. the DenseNumericIntCoder that stores float{16,32,64} in an int16 array).

At the moment, I don't think I can properly support them in the pull request as they are calling some NetEncodings that don't belong to the FBS sources for Annotate, and in particular this Int16EncodedXFBArray to perform the mentioned optimization. (Also, no dict type, only a JSON one).

If I could have access to some documentation on how the schema is re-generated, that would be really helpful.

Suggestion

While browsing for TileDb <-> Anndata conversion tools, I stumbled onto TileDb-SOMA from a collaboration between you CZI and TileDb.

In my opinion that would be a great backend for cellxgene, benefiting from TileDb optimizations and able to output an Anndata object.

That could also contribute towards reducing cellxgene RAM footprint, which is the top challenge at the moment.

We are pretty committed on using cellxgene for visualization here, we would benefit from improvements. So let me know if I may contribute.

Best,

from cellxgene.

[FEATURE REQUEST] .CXG (TileDb compliant) format support from single-cell-explorer to cellxgene Annotate about cellxgene HOT 5 OPEN

Comments (5)

Behavior

FBS schema

Conversion

Configuration

CxgDataset parameters

Update

Suggestion

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent