Comments (5)
Hi @Neah-Ko !
Thanks for your question! Responding inline:
Can we expect this feature (support ingestion of .cxg) to be available on Annotate ?
We currently do not have plans to incorporate this feature in Annotate in the near future. That being said, our roadmap has not been fully established for the latter half of 2024, so it's not totally out of the realm of possibility (if it becomes enabling or required for some other higher priority features, this would incentivize us to commit some resources towards development).
I will most likely have to work on a PoC on this feature. If you are interested, I could extend a pull request.
We really appreciate this offer and if this is a high priority/urgent feature for you, I think it makes sense for you to work on it. If you do end up implementing the feature, we'd definitely like to hear about your experience and technical discovery work on it. We can't promise that the PR would be merged into the main branch since our engineering resources are already committed to higher priority work, and we would likely not have time to review. That being said, hearing about your experience and learnings while implementing it would be invaluable for us and we definitely encourage you to report back with any information that might inform a future direction/implementation of the feature in Annotate.
We sincerely value your understanding and enthusiasm for contributing to the CELLxGENE community. Please do continue to share your ideas and feedback with us.
from cellxgene.
Hello @MaximilianLombardo
So I went on with frontporting the feature, with way more success than I initially expected to be honest :)
You may find the code at #2656
It comes with a Dockerfile for testing it:
docker build . -f Dockerfile_editable -t cellxgene:<tag>
docker run -it -p 5005:5005 -v /path/to/datasets:/data cellxgene:<tag> launch /data/dataset.[h5ad | cxg/] --host 0.0.0.0 --verbose
Note: The Dockerfile is meant to be rebuilt to test changes to the sources. For me, installing cellxgene in editable mode didn't work, because React then no longer finds the templates. I had a look on Stack Overflow, and it may or may not be fixable by tweaking pyproject.toml.
Behavior
So far, it seems that both file formats can be read in, with no concerning error in the logs ⭐
However, I cannot yet guarantee that it is foolproof. I mostly used example-datasets/pbmc3k.h5ad and single-cell-explorer->example-datasets/pbmc3k.cxg/ for my tests.
In principle, I merged the different parts into an up-to-date version of CXG, so if something breaks it is probably going to be on the CXG side.
I've made some modest attempts at factoring, but it probably deserves another pass.
I am going to detail some of my choices and findings:
FBS schema
As mentioned above, I merged into Annotate. I decided to try without regenerating the schema.
To support cxg datasets, I had to propagate a num_bins argument value around server/common/fbs/matrix.py#encode_matrix_fbs(), only to not use it in the end.
In the single-cell-explorer repo, it contributes to deciding on a "lossy" array_type when serializing, so we might be losing something in the process.
To dig more into it, I would need access to at least the compilation chains sent to flatc
Conversion
cellxgene reads the dataset when it is converted with the primitives in server/converters/h5ad_data_file.py coming from v0.16.3 (the PR features the hash of the commit).
However, I have to load it setting the flag
Otherwise, I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 48, in __init__
self.extract_metadata_about_dataset()
File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 204, in extract_metadata_about_dataset
raise ValueError("Unknown source file schema version is unsupported.")
ValueError: Unknown source file schema version is unsupported.
It cannot be read when the conversion is done using cxgtool.
Configuration
The configuration scheme evolved from single-cell-explorer to here. At the moment I am not fully certain that every part is actually consumed rather than falling back to the default.
I'm almost duplicating the dataset entry with a default_dataset copied from single-cell-explorer. We may want to rethink the naming and factoring.
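One way to check which configuration parts are actually consumed is to diff the loaded configuration against the defaults. Below is a minimal sketch of that idea in plain Python. The helper name overridden_keys and the nested keys (app, about) are hypothetical and only for illustration; only the dataset and default_dataset entry names come from the discussion above, and this is not how cellxgene itself loads its configuration.

```python
def overridden_keys(defaults, loaded, prefix=""):
    """Return dotted paths whose value in `loaded` differs from `defaults`.

    Both arguments are plain nested dicts (hypothetical stand-ins for the
    default and the effective configuration).
    """
    changed = []
    for key, value in loaded.items():
        path = f"{prefix}{key}"
        if key not in defaults:
            # Key exists only in the loaded config.
            changed.append(path)
        elif isinstance(value, dict) and isinstance(defaults[key], dict):
            # Recurse into nested sections.
            changed.extend(overridden_keys(defaults[key], value, path + "."))
        elif value != defaults[key]:
            changed.append(path)
    return changed


# Hypothetical example: the two near-duplicate entries mentioned above.
defaults = {
    "dataset": {"app": {"about": None}},
    "default_dataset": {"app": {"about": None}},
}
loaded = {
    "dataset": {"app": {"about": "my dataset"}},
    "default_dataset": {"app": {"about": None}},
}

print(overridden_keys(defaults, loaded))
```

If a value set under one entry never shows up in the diff, it is a hint that the other entry, or the default, is the one being used.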
CxgDataset parameters
To enable annotation support and user geneset creation, I had to tweak the function _validate_and_initialize to recover them from the schema, whereas they are set in _alias_annotation_names on the anndata_adaptor side. I am not 100% sure that the two yield consistent results, especially since some renaming happens in the second one.
In general, I could have overlooked something in the many configuration files.
Let me know what you think of it. Next week I am going to go over the code again, make minor improvements where I can, and try to merge and run the unit tests of both data_adaptors.
Hello @MaximilianLombardo ,
Update
As I mentioned above, the pull request now also features test/unit/test_cxg_dataset.py, which runs fine but fails on creating a lossy array (i.e. test_tdb_bug_lossy).
I can see that on the single-cell-explorer side, the primitives located at server/common/fbs/fbs_coders.py perform some encoding optimization on serialization (e.g. the DenseNumericIntCoder, which stores float{16,32,64} in an int16 array).
At the moment, I don't think I can properly support them in the pull request, as they call some NetEncodings that don't belong to the FBS sources for Annotate, in particular this Int16EncodedXFBArray, to perform the mentioned optimization. (Also, there is no dict type, only a JSON one.)
If I could have access to some documentation on how the schema is re-generated, that would be really helpful.
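For readers unfamiliar with this kind of optimization, here is a minimal sketch of what "storing floats in an int16 array" can look like, assuming a simple linear min/max quantization. This is an assumption for illustration only: it is not the actual DenseNumericIntCoder or Int16EncodedXFBArray implementation, and the function names encode_int16 and decode_int16 are hypothetical.

```python
from array import array

INT16_MIN, INT16_MAX = -32768, 32767
LEVELS = INT16_MAX - INT16_MIN  # 65535 steps between the two extremes


def encode_int16(values):
    """Linearly map floats onto the int16 range (lossy).

    Returns (codes, lo, hi); lo and hi are needed to decode.
    """
    if not values:
        return array("h"), 0.0, 0.0
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: every value maps to the same code.
        return array("h", [INT16_MIN] * len(values)), lo, hi
    scale = LEVELS / span
    codes = array("h")
    for v in values:
        code = round((v - lo) * scale) + INT16_MIN
        # Clamp to guard against floating-point overshoot at the edges.
        codes.append(max(INT16_MIN, min(INT16_MAX, code)))
    return codes, lo, hi


def decode_int16(codes, lo, hi):
    """Approximate inverse of encode_int16."""
    step = (hi - lo) / LEVELS
    return [(c - INT16_MIN) * step + lo for c in codes]


values = [0.0, 0.1, 1.5, 3.14159, 10.0]
codes, lo, hi = encode_int16(values)
decoded = decode_int16(codes, lo, hi)
max_error = max(abs(a - b) for a, b in zip(values, decoded))
print(codes.itemsize, max_error)
```

Each value takes 2 bytes instead of 4 or 8, at the cost of a rounding error of at most half a quantization step, which is why such an array_type is called "lossy".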
Suggestion
While browsing for TileDb <-> Anndata conversion tools, I stumbled onto TileDb-SOMA, from a collaboration between you (CZI) and TileDb.
In my opinion that would be a great backend for cellxgene, benefiting from TileDb optimizations and able to output an Anndata object.
That could also contribute towards reducing cellxgene RAM footprint, which is the top challenge at the moment.
We are pretty committed to using cellxgene for visualization here, so we would benefit from improvements. Let me know if I may contribute.
Best,