tdrose / ionimage_embedding

DL based representation learning of MS imaging data

License: GNU General Public License v3.0

Python 98.91% Shell 1.09%
deep-learning gnn imaging mass-spectrometry metaspace pytorch representation-learning

ionimage_embedding's People

Contributors: tdrose
ionimage_embedding's Issues

Hyper-parameter tuning for full image data

For the hyperparameter tuning, the following things should be considered/optimized:

  • Each model type will be optimized independently (loss and architecture, data scenario)
  • Hyperparameters for all models
    • Learning rate
    • Batch size?
    • dims
    • clip gradients
    • pretraining with AE model
    • ColocML preprocessing
    • final activation function
    • epochs
    • Weight decay
  • Model/loss specific
    • Self contrast: initial, steps, ds_percentiles, knn
    • Coloc contrast: initial, steps
    • Regression contrast: Maybe also a step function?
  • Optimization metrics (different for random and transitivity):
    • Top accuracy will be the main metric (on latent training space)
  • Baseline models:
    • Make a function to run UMAP, mean coloc, and BioMedCLIP for multiple iterations on a bunch of test data
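The tuning loop above could be sketched as a simple random search; a minimal stand-in for the planned optuna runs, with a hypothetical search space and an objective that would return the main metric (top accuracy on the latent training space):

```python
import random

# Hypothetical search space mirroring the shared hyperparameters above.
SEARCH_SPACE = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "dims": [16, 32, 64],
    "clip_grad": [None, 1.0],
    "epochs": [20, 50, 100],
    "weight_decay": [0.0, 1e-4, 1e-3],
}

def sample_config(rng):
    # Draw one configuration uniformly from the grid.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def tune(objective, n_trials=20, seed=0):
    # Keep the configuration with the best score (higher is better).
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

Each model type (self contrast, coloc contrast, regression contrast) would get its own objective; in the real run, optuna's `suggest_*` API would replace `sample_config`.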

Dissimilar ions across datasets

Next big issue: enforce dissimilarity across datasets for features that are dissimilar within datasets:

  • Compute dataset-specific similarity
  • Take the lowest e.g. 20% (can also be a dynamic parameter like lower bound)
  • Create complete Ion x Ion matrix
  • Fill with the pairs of lowest-similarity ions (integer counts)
  • Keep pairs if they are among the lowest in at least 30% of datasets (requires additional counting of ion-pair co-detections, e.g. with another matrix that adds up all co-detected ion pairs, not only the lowest 20%)
  • Use these ion pairs as forced zeros in the loss for A
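The steps above could be sketched as follows; a minimal numpy version, assuming per-dataset coloc matrices with NaN for ion pairs that were not co-detected (function and parameter names are hypothetical):

```python
import numpy as np

def dissimilar_pairs(colocs, low_quantile=0.2, min_fraction=0.3):
    """colocs: list of (n_ions, n_ions) dataset-specific similarity
    matrices, NaN where an ion pair was not co-detected.
    Returns a boolean mask of pairs to force to zero in the loss."""
    n_ions = colocs[0].shape[0]
    low_counts = np.zeros((n_ions, n_ions))   # pair was in lowest quantile
    seen_counts = np.zeros((n_ions, n_ions))  # pair was co-detected at all
    for coloc in colocs:
        detected = ~np.isnan(coloc)
        np.fill_diagonal(detected, False)
        values = coloc[detected]
        if values.size == 0:
            continue
        threshold = np.quantile(values, low_quantile)
        filled = np.where(detected, coloc, np.inf)
        low_counts += detected & (filled <= threshold)
        seen_counts += detected
    # Keep pairs that are among the lowest in >= min_fraction of the
    # datasets where both ions were detected.
    fraction = np.divide(low_counts, seen_counts,
                         out=np.zeros_like(low_counts),
                         where=seen_counts > 0)
    return fraction >= min_fraction
```

The second count matrix is the extra bookkeeping mentioned above: it tracks all co-detections, not only the lowest 20%, so the 30% cut-off is relative to how often a pair could have been observed.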

Todo for publication

Results

  • Performance with increasing levels of transitivity (less overlap of datasets)
  • Performance for different organs
  • Run across all organs (evaluation performance per organ)
  • Investigate variance
  • Compare to LINEX?
  • Highlight one case
  • Same node similarity in loss function
  • Weight less occurring node pairs inversely in the loss function.
  • Investigate performance distribution on a node/edge/dataset level
    • easier/hard to predict metabolites/datasets (correlate to detection/co-detection levels)
    • Visualize the performance for individual metabolites (compare between UMAP/GNN/Mean, particularly to see why MSE is so robust).
  • Relative MSE with variance (idea from Jovan, think about it a bit more)
    • Compute squared residuals for every prediction (and the number of times they occur in each dataset)
    • Additionally note the molecule identities to compute the variance
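The residual bookkeeping in the last bullet could look like this; a sketch assuming each prediction carries a molecule-pair identity (names are hypothetical):

```python
import numpy as np
from collections import defaultdict

def residual_stats(pairs, predicted, true):
    """Collect squared residuals per molecule pair, plus occurrence
    counts, so a variance can be reported next to the MSE."""
    residuals = defaultdict(list)
    for (a, b), p, t in zip(pairs, predicted, true):
        key = tuple(sorted((a, b)))  # pair identity is order-independent
        residuals[key].append((p - t) ** 2)
    return {
        key: {"n": len(r), "mse": float(np.mean(r)), "var": float(np.var(r))}
        for key, r in residuals.items()
    }
```

Correlating the per-pair MSE and variance with detection/co-detection counts would then show which metabolites are easy or hard to predict.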

Writing

  • Abstract
  • Intro
  • Results
  • Methods
  • Discussion
  • Acknowledgements/Code availability/...

Todo until conference

  • 08.01. - 12.01.
    • Selection of evaluation datasets (more datasets per organ and different organs + look at annotation overlap stats)
    • Mean/median coloc UMAP (to compare against embedding of DL models, play with number of dimensions)
    • BioMedCLIP wrapper to use in the same evaluation framework
    • Check UMAP implementation (data leakage)
    • Check CRL evaluation code
    • Transitivity evaluation:
      • MSE for colocs that never occurred in the training data
      • Top-accuracy including only the pairs that did not occur in the training data
  • 15.01. - 19.01.
    • Implement vision transformer model for CRL (it can deal with different image sizes as long as they are divisible into patches; the number of classes is a parameter returning linear-layer output, so it should be relatively easy to add, similar to the ResNet architecture; by default it uses 1000 classes from a linear layer, so just add another layer to map to the desired dimension)
    • #9
    • #21
    • Check CVAE reconstruction & latent space for optimization
    • Make presentation for feedback session
  • 22.01. - 26.01.
    • Graph Dataloader (top-N most colocalized per node, coloc quantile cut-off)
    • #22
    • Play with model
  • 29.01. - 02.02.
    • Worked on GNN model performance/evaluation
  • 05.02. - 09.02.
    • Model optimizations (fine-tuning, loss functions, architecture changes)
    • Hyperparameter tuning
    • Investigate Latent space
    • In which cases is transitivity performing badly?
    • More detailed info in #22
  • 12.02. - 16.02.
    • Leave out ions training
    • Molecule embedding variance (quantify uncertainty)
    • Try euclidean distance
    • Prepare slides for Theo
    • Work on Theo feedback
  • 19.02. - 23.02.
    • Evaluation on different scenarios
    • Finalize goal for poster (What do we want to show? Which features do we focus on?)
    • Make plots for poster
      • MSE (with BioMedCLIP & CRL)
      • (Maybe ACC)
      • UMAP + cluster images
      • PCA
    • Draft poster
  • 26.02. - 01.03.
    • Will not do much because of surgery
    • Send out Poster for feedback
  • 04.03. - 08.03.
    • Incorporate Poster feedback
    • Print poster
  • 12.03. - 15.03.
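Several items above (e.g. the transitivity evaluation) restrict metrics to coloc pairs that never occurred in the training data; a minimal sketch of that masking, with hypothetical names:

```python
import numpy as np

def unseen_pair_mse(pred, true, seen_mask):
    """MSE over ion pairs that never occurred in the training data.
    pred, true: (n_ions, n_ions) coloc matrices (NaN = no ground truth);
    seen_mask: True where the pair occurred during training."""
    unseen = ~seen_mask & ~np.isnan(true)
    return float(((pred - true)[unseen] ** 2).mean())
```

The top-accuracy variant would apply the same `unseen` mask before ranking the candidate pairs.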

Next steps

  • Create evaluation datasets (selection of different organs + more datasets per organ)
  • Fix RegContrast & colocContrast issue with rotating datasets
  • Run optuna for different model scenarios
  • Mean coloc UMAP representation (to compare against the latent embedding space of the DL models; play with the number of dimensions)
  • Evaluate BioMedCLIP performance
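The mean coloc representation could be sketched as below: a NaN-aware average across per-dataset coloc matrices, with an SVD-based PCA as a stand-in for the embedding step (swap in `umap.UMAP` to get the planned UMAP variant and play with the number of dimensions):

```python
import numpy as np

def mean_coloc(colocs):
    # Element-wise mean over datasets, ignoring NaNs (ion not detected).
    return np.nanmean(np.stack(colocs), axis=0)

def pca_embed(mat, n_dims=2):
    # Simple SVD-based PCA of the aggregated coloc matrix.
    x = np.nan_to_num(mat - np.nanmean(mat, axis=0, keepdims=True))
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]
```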

Implement coloc graph GNN

Goal

Shared node embedding (ions are nodes) from coloc networks across multiple graphs (datasets).

Ideas

  • (V)GAE model: discretizing edges (colocalized/not-colocalized)
  • GNN learning latent space on a fully connected graph
    • Using coloc edge weights
    • Predict coloc from latent space directly (e.g. MSE loss function)
  • Shallow node embedding
    • Sample node sequence proportionally to coloc (probably requires softmax probabilities)
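The second idea (learning a latent space from which coloc is predicted directly under an MSE loss) can be sketched without any GNN layers as plain gradient descent on free node embeddings; a hypothetical minimal baseline, not the planned model:

```python
import numpy as np

def fit_embeddings(coloc, dim=2, lr=0.05, epochs=3000, seed=0):
    """Learn node embeddings Z such that Z @ Z.T approximates the
    (symmetric) coloc matrix under an MSE loss."""
    rng = np.random.default_rng(seed)
    n = coloc.shape[0]
    z = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(epochs):
        error = z @ z.T - coloc
        # Gradient of mean((Z Z^T - C)^2) for symmetric C.
        z -= lr * 4.0 * error @ z / (n * n)
    return z
```

The (V)GAE and shallow-embedding variants would replace the free embeddings with encoder outputs or sampled node sequences, but the reconstruction target stays the same.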

Notes

  • GAE approaches will predict different node embeddings per network. Requires aggregation such as centroiding, but can be advantageous since it gives us a variance/variability for the embedding of each node
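The aggregation described above could be as simple as centroiding the per-dataset embeddings and keeping the per-node variance as a stability estimate; a sketch, assuming ions missing from a dataset are NaN-filled:

```python
import numpy as np

def aggregate_embeddings(per_dataset):
    """per_dataset: (n_datasets, n_ions, dim) node embeddings from the
    separate graphs, NaN where an ion is absent from a dataset.
    Returns the centroid embedding and its per-ion variance."""
    arr = np.asarray(per_dataset, dtype=float)
    return np.nanmean(arr, axis=0), np.nanvar(arr, axis=0)
```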

Links

Implement W2V model

As an additional model for the package. Use previous work from Dominik for the dataloaders.

Try to incorporate intensities instead of just thresholds.
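Intensity-aware pair sampling for the W2V dataloader could look like this; a sketch where the sampling probability is proportional to per-dataset intensity rather than a binary detection threshold (names are hypothetical, and only positive pairs are drawn — a real dataloader would also sample negatives):

```python
import numpy as np

def sample_pairs(intensity, pairs_per_dataset=500, seed=0):
    """intensity: (n_datasets, n_ions) mean ion intensities, 0 when an
    ion is not detected. Returns (center, context) index pairs."""
    rng = np.random.default_rng(seed)
    pairs = []
    for row in intensity:
        prob = row / row.sum()  # assumes at least one detected ion
        centers = rng.choice(row.size, size=pairs_per_dataset, p=prob)
        contexts = rng.choice(row.size, size=pairs_per_dataset, p=prob)
        pairs.extend(zip(centers.tolist(), contexts.tolist()))
    return pairs
```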

Flexible dataloader for evaluation scenarios

For the evaluation of the project, we discussed a few scenarios from simple to more complex. Dataloaders (training, validation, and testing) for each scenario should be implemented.

  • One cohort, randomly leave out molecules
  • One cohort, leave out molecules, such that they never occur in one dataset
  • One cohort, leave out complete datasets
  • Diverse datasets, leave out random molecules
  • Diverse datasets, leave out molecules, such that they never occur in one dataset
  • Diverse datasets, leave out complete datasets
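The simplest two split types above (random molecule hold-out and whole-dataset hold-out) could share one helper; a sketch with hypothetical names:

```python
import random

def leave_out_split(datasets, mode="random", frac=0.2, seed=0):
    """datasets: dict dataset_id -> set of molecule ids.
    mode="random": hold out a random fraction of molecules everywhere.
    mode="dataset": hold out entire datasets.
    Returns (train, test) dicts in the same format."""
    rng = random.Random(seed)
    if mode == "dataset":
        ids = sorted(datasets)
        held = set(rng.sample(ids, max(1, int(len(ids) * frac))))
        train = {d: m for d, m in datasets.items() if d not in held}
        test = {d: m for d, m in datasets.items() if d in held}
        return train, test
    molecules = sorted(set.union(*datasets.values()))
    held = set(rng.sample(molecules, max(1, int(len(molecules) * frac))))
    train = {d: m - held for d, m in datasets.items()}
    test = {d: m & held for d, m in datasets.items()}
    return train, test
```

The "leave out molecules such that they never occur in one dataset" scenarios would constrain `held` per dataset instead of globally.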

Next steps

  • Run without CAE
  • Add FC layers + FC output
  • Test different FC activation functions
  • Set up optuna run
  • ColocML preprocessing
  • Tests for coloc framework and evaluations

CRL model improvements

There is room for improvement in the CRL model by adapting the model architecture and trying out different loss functions.

What should definitely be explored:

  • Double normalization in loss function (from legacy model implementation)
  • Explore loss functions
  • Use previous layers as embedding layers

Notes:

Baseline model and evaluation

  1. We need a baseline model for the evaluation (e.g. just basic cosine-based colocalization).
  2. Evaluation protocol
    • Coloc and the model output operate in very different spaces; we need a comparable metric for their results
      • Check output distributions, and ranges of outputs, ...
    • What can we use as ground truth?
      • Probably the true colocalization between molecules
      • If molecules have never been observed together, we might use prior knowledge metabolic networks
      • Simulated data?
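The basic cosine-based colocalization baseline mentioned in point 1 could be sketched as:

```python
import numpy as np

def cosine_coloc(images):
    """images: (n_ions, h, w) ion intensity images from one dataset.
    Baseline colocalization = pairwise cosine similarity of the
    flattened images."""
    x = images.reshape(len(images), -1).astype(float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    return x @ x.T
```

Aggregating these per-dataset matrices (e.g. mean coloc) then gives the cross-dataset baseline whose output range can be compared against the model's.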
