Coder Social home page Coder Social logo

Comments (5)

MaximilianLombardo avatar MaximilianLombardo commented on May 28, 2024 1

Hi @Neah-Ko !

Thanks for your question! Responding inline:

Can we expect this feature (support ingestion of .cxg) to be available on Annotate ?

We currently do not have plans to incorporate this feature in Annotate in the near future, that being said our roadmap has not been fully established for the latter half of 2024, so its not totally out of the realm of possibilities (if it becomes enabling or required for some other higher priority features, this would incentivize us to commit some resources towards development).

I will most likely have to work on a PoC on this feature. If you are interested, I could extend a pull request.

We really appreciate this offer and if this is a high priority/urgent feature for you, I think it makes sense for you to work on it. If you do end up implementing the feature, we'd definitely like to hear about your experience and technical discovery work on it. We can't promise that the PR would be merged into the main branch since our engineering resources are already committed to higher priority work, and we would likely not have time to review. That being said, hearing about your experience and learnings while implementing it would be invaluable for us and we definitely encourage you to report back with any information that might inform a future direction/implementation of the feature in Annotate.

We sincerely value your understanding and enthusiasm for contributing to the CELLxGENE community. Please do continue to share your ideas and feedback with us.

from cellxgene.

Neah-Ko avatar Neah-Ko commented on May 28, 2024

Hello @MaximilianLombardo

So I went on with frontporting the feature, with way more success than I initially expected to be honest :)

You may find the code at #2656
It comes with a Dockerfile in order to test it

docker build . -f Dockerfile_editable -t cellxgene:<tag>
docker run -it  -p 5005:5005 -v /path/to/datasets:/data cellxgene:<tag> launch /data/dataset.[h5ad | cxg/] --host 0.0.0.0 --verbose

Note: The Dockerfile is made to be rebuilt to test changes on sources. For me, installing cellxgene in editable mode didn't worked because then react doesn't find the templates anymore. I had a look on stackoverflow and it may or may not be arranged by tweaking pyproject.toml

Behavior

So far, it seems that both file formats can be read in, with no concerning error in the logs ⭐
However, I cannot guarantee yet that it is full-proof. I mostly used example-datasets/pbmc3k.h5ad and single-cell-explorer->example-datasets/pbmc3k.cxg/ for my tests.

In principle, I merged the different parts into an up to date version of CXG, so if something breaks it is probably going to be on the CXG side.

I've made some modest attempts at factoring, but it probably deserves another pass.

I am going to detail some of my choices and findings:

FBS schema

As mentioned above, I merged into Annotate. I decided to try without regenerating the schema.

To support cxg dataset, I had to propagate some num_bins argument value around server/common/fbs/matrix.py#encode_matrix_fbs() to eventually not use it.
In the single-cell-explorer repo, it contributes deciding of a "lossy" array_type when serializing, so we might be loosing something in the process.

To dig more into it, I would need access to at least the compilation chains sent to flatc

Conversion

cellxgene reads the dataset when converted with the primitives in server/converters/h5ad_data_file.py coming from v0.16.3 (pr features the hash of the commit)

However, I have to load it setting the flag
Else, I get this error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 48, in __init__
    self.extract_metadata_about_dataset()
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 204, in extract_metadata_about_dataset
    raise ValueError("Unknown source file schema version is unsupported.")
ValueError: Unknown source file schema version is unsupported.

It doesn't reads when conversion is done using cxgtool

Configuration

The configuration scheme evolved from single-cell-explorer to here. At the moment I am not fully certain that all parts are going to be consumed and not taken from the default.

I'm almost duplicating the dataset entry with a default_dataset copied from single-cell-explorer. We may rethink about naming and factoring.

CxgDataset parameters

To enable annotation support and user genesets creation I had to tweak the function _validate_and_initialize to recover them from the schema, whereas it is being set in _alias_annotation_names on anndata_adaptor side, and I am not 100% sure that the two are yielding consistent results. Especially since there's some renaming happening on the second one.

In general I could have overlooked something from the many configuration files.

Let me know what you think of it, I am going to scan and bring minor improvements where I can next week and try to merge and run both data_adaptors unit tests.

from cellxgene.

Neah-Ko avatar Neah-Ko commented on May 28, 2024

Hello @MaximilianLombardo ,

Update

As I mentioned above, the pull request now also features test/unit/test_cxg_dataset.py, which runs fine but fails on creating a lossy array (i.e. test_tdb_bug_lossy).

I can see that on the single-cell-explorer side, the primitives located at server/common/fbs/fbs_coders.py are doing some encoding optimization on serialization (E.g. the DenseNumericIntCoder that stores float{16,32,64} in an int16 array).

At the moment, I don't think I can properly support them in the pull request as they are calling some NetEncodings that don't belong to the FBS sources for Annotate, and in particular this Int16EncodedXFBArray to perform the mentioned optimization. (Also, no dict type, only a JSON one).

If I could have access to some documentation on how the schema is re-generated, that would be really helpful.

Suggestion

While browsing for TileDb <-> Anndata conversion tools, I stumbled onto TileDb-SOMA from a collaboration between you CZI and TileDb.

In my opinion that would be a great backend for cellxgene, benefiting from TileDb optimizations and able to output an Anndata object.

That could also contribute towards reducing cellxgene RAM footprint, which is the top challenge at the moment.

We are pretty committed on using cellxgene for visualization here, we would benefit from improvements. So let me know if I may contribute.

Best,

from cellxgene.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.