tdrose / ionimage_embedding

DL based representation learning of MS imaging data

License: GNU General Public License v3.0

Python 98.91% Shell 1.09%
deep-learning gnn imaging mass-spectrometry metaspace pytorch representation-learning

ionimage_embedding's People

Contributors: tdrose
ionimage_embedding's Issues

Hyper-parameter tuning for full image data

For the hyperparameter tuning, the following things should be considered/optimized:

  • Each model type will be optimized independently (loss and architecture, data scenario)
  • Hyperparameters for all models
    • Learning rate
    • Batch size?
    • dims
    • clip gradients
    • pretraining with AE model
    • ColocML preprocessing
    • final activation function
    • epochs
    • Weight decay
  • Model/loss specific
    • Self contrast: initial, steps, ds_percentiles, knn
    • Coloc contrast: initial, steps
    • Regression contrast: Maybe also a step function?
  • Optimization metrics (different for random and transitivity):
    • Top accuracy will be the main metric (on latent training space)
  • Baseline models:
    • Make a function to run UMAP, mean coloc, and BioMedCLIP for multiple iterations on a bunch of test data
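The tuning loop above could be sketched as a simple random search; a minimal stand-in for the planned optuna runs, with a hypothetical search space and an objective that would return the main metric (top accuracy on the latent training space):

```python
import random

# Hypothetical search space mirroring the shared hyperparameters above.
SEARCH_SPACE = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "dims": [16, 32, 64],
    "clip_grad": [None, 1.0],
    "epochs": [20, 50, 100],
    "weight_decay": [0.0, 1e-4, 1e-3],
}

def sample_config(rng):
    # Draw one configuration uniformly from the grid.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def tune(objective, n_trials=20, seed=0):
    # Keep the configuration with the best score (higher is better).
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

Each model type (self contrast, coloc contrast, regression contrast) would get its own objective; in the real run, optuna's `suggest_*` API would replace `sample_config`.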

Dissimilar ions across datasets

Next big issue: enforce dissimilarity across datasets for features that are dissimilar within datasets:

  • Compute dataset-specific similarity
  • Take the lowest e.g. 20% (can also be a dynamic parameter like lower bound)
  • Create complete Ion x Ion matrix
  • Fill with the pairs of lowest-similarity ions (integer counts)
  • Keep pairs if they are among the lowest in at least 30% of datasets (requires additional counting of ion-pair co-detections, e.g. with another matrix that adds up all co-detected ion pairs, not only the lowest 20%)
  • Use these ion pairs as forced zeros in the loss for A
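The steps above could be sketched as follows; a minimal numpy version, assuming per-dataset coloc matrices with NaN for ion pairs that were not co-detected (function and parameter names are hypothetical):

```python
import numpy as np

def dissimilar_pairs(colocs, low_quantile=0.2, min_fraction=0.3):
    """colocs: list of (n_ions, n_ions) dataset-specific similarity
    matrices, NaN where an ion pair was not co-detected.
    Returns a boolean mask of pairs to force to zero in the loss."""
    n_ions = colocs[0].shape[0]
    low_counts = np.zeros((n_ions, n_ions))   # pair was in lowest quantile
    seen_counts = np.zeros((n_ions, n_ions))  # pair was co-detected at all
    for coloc in colocs:
        detected = ~np.isnan(coloc)
        np.fill_diagonal(detected, False)
        values = coloc[detected]
        if values.size == 0:
            continue
        threshold = np.quantile(values, low_quantile)
        filled = np.where(detected, coloc, np.inf)
        low_counts += detected & (filled <= threshold)
        seen_counts += detected
    # Keep pairs that are among the lowest in >= min_fraction of the
    # datasets where both ions were detected.
    fraction = np.divide(low_counts, seen_counts,
                         out=np.zeros_like(low_counts),
                         where=seen_counts > 0)
    return fraction >= min_fraction
```

The second count matrix is the extra bookkeeping mentioned above: it tracks all co-detections, not only the lowest 20%, so the 30% cut-off is relative to how often a pair could have been observed.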

Todo for publication

Results

  • Performance with increasing levels of transitivity (less overlap of datasets)
  • Performance for different organs
  • Run across all organs (evaluation performance per organ)
  • Investigate variance
  • Compare to LINEX?
  • Highlight one case
  • Same node similarity in loss function
  • Weight less occurring node pairs inversely in the loss function.
  • Investigate performance distribution on a node/edge/dataset level
    • easier/hard to predict metabolites/datasets (correlate to detection/co-detection levels)
    • Visualize the performance for individual metabolites (compare between UMAP/GNN/Mean, particularly to see why MSE is so robust).
  • Relative MSE with variance (idea from Jovan, think about it a bit more)
    • Compute squared residuals for every prediction (and the number of times they occur in each dataset)
    • Additionally note the molecule identities to compute the variance
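The residual bookkeeping in the last bullet could look like this; a sketch assuming each prediction carries a molecule-pair identity (names are hypothetical):

```python
import numpy as np
from collections import defaultdict

def residual_stats(pairs, predicted, true):
    """Collect squared residuals per molecule pair, plus occurrence
    counts, so a variance can be reported next to the MSE."""
    residuals = defaultdict(list)
    for (a, b), p, t in zip(pairs, predicted, true):
        key = tuple(sorted((a, b)))  # pair identity is order-independent
        residuals[key].append((p - t) ** 2)
    return {
        key: {"n": len(r), "mse": float(np.mean(r)), "var": float(np.var(r))}
        for key, r in residuals.items()
    }
```

Correlating the per-pair MSE and variance with detection/co-detection counts would then show which metabolites are easy or hard to predict.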

Writing

  • Abstract
  • Intro
  • Results
  • Methods
  • Discussion
  • Acknowledgements/Code availability/...

Todo until conference

  • 08.01. - 12.01.
    • Selection of evaluation datasets (more datasets per organ and different organs + look at annotation overlap stats)
    • Mean/median coloc UMAP (to compare against embedding of DL models, play with number of dimensions)
    • BioMedCLIP wrapper to use in the same evaluation framework
    • Check UMAP implementation (data leakage)
    • Check CRL evaluation code
    • Transitivity evaluation:
      • MSE for colocs that never occurred in the training data
      • Top-accuracy including only the pairs that did not occur in the training data
  • 15.01. - 19.01.
    • Implement vision transformer model for CRL (it can deal with different image sizes as long as they are divisible into patches; the number of classes is a parameter returning linear-layer output, so it should be relatively easy to add, similar to the ResNet architecture; by default it uses 1000 classes from a linear layer, so just add another layer to map to the desired dimension)
    • #9
    • #21
    • Check CVAE reconstruction & latent space for optimization
    • Make presentation for feedback session
  • 22.01. - 26.01.
    • Graph Dataloader (top-N most colocalized per node, coloc quantile cut-off)
    • #22
    • Play with model
  • 29.01. - 02.02.
    • Worked on GNN model performance/evaluation
  • 05.02. - 09.02.
    • Model optimizations (fine-tuning, loss functions, architecture changes)
    • Hyperparameter tuning
    • Investigate Latent space
    • In which cases is transitivity performing badly?
    • More detailed info in #22
  • 12.02. - 16.02.
    • Leave out ions training
    • Molecule embedding variance (quantify uncertainty)
    • Try euclidean distance
    • Prepare slides for Theo
    • Work on Theo feedback
  • 19.02. - 23.02.
    • Evaluation on different scenarios
    • Finalize goal for poster (What do we want to show? Which features do we focus on?)
    • Make plots for poster
      • MSE (with BioMedCLIP & CRL)
      • (Maybe ACC)
      • UMAP + cluster images
      • PCA
    • Draft poster
  • 26.02. - 01.03.
    • Will not do much because of surgery
    • Send out Poster for feedback
  • 04.03. - 08.03.
    • Incorporate Poster feedback
    • Print poster
  • 12.03. - 15.03.
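Several items above (e.g. the transitivity evaluation) restrict metrics to coloc pairs that never occurred in the training data; a minimal sketch of that masking, with hypothetical names:

```python
import numpy as np

def unseen_pair_mse(pred, true, seen_mask):
    """MSE over ion pairs that never occurred in the training data.
    pred, true: (n_ions, n_ions) coloc matrices (NaN = no ground truth);
    seen_mask: True where the pair occurred during training."""
    unseen = ~seen_mask & ~np.isnan(true)
    return float(((pred - true)[unseen] ** 2).mean())
```

The top-accuracy variant would apply the same `unseen` mask before ranking the candidate pairs.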

Next steps

  • Create evaluation datasets (selection of different organs + more datasets per organ)
  • Fix RegContrast & colocContrast issue with rotating datasets
  • Run optuna for different model scenarios
  • Mean coloc UMAP representation (to compare against the latent embedding space of the DL models; play with the number of dimensions)
  • Evaluate BioMedCLIP performance
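The mean coloc representation could be sketched as below: a NaN-aware average across per-dataset coloc matrices, with an SVD-based PCA as a stand-in for the embedding step (swap in `umap.UMAP` to get the planned UMAP variant and play with the number of dimensions):

```python
import numpy as np

def mean_coloc(colocs):
    # Element-wise mean over datasets, ignoring NaNs (ion not detected).
    return np.nanmean(np.stack(colocs), axis=0)

def pca_embed(mat, n_dims=2):
    # Simple SVD-based PCA of the aggregated coloc matrix.
    x = np.nan_to_num(mat - np.nanmean(mat, axis=0, keepdims=True))
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]
```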

Implement coloc graph GNN

Goal

Shared node embedding (ions are nodes) from coloc networks across multiple graphs (datasets).

Ideas

  • (V)GAE model: discretizing edges (colocalized/not-colocalized)
  • GNN learning latent space on a fully connected graph
    • Using coloc edge weights
    • Predict coloc from latent space directly (e.g. MSE loss function)
  • Shallow node embedding
    • Sample node sequence proportionally to coloc (probably requires softmax probabilities)
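The second idea (learning a latent space from which coloc is predicted directly under an MSE loss) can be sketched without any GNN layers as plain gradient descent on free node embeddings; a hypothetical minimal baseline, not the planned model:

```python
import numpy as np

def fit_embeddings(coloc, dim=2, lr=0.05, epochs=3000, seed=0):
    """Learn node embeddings Z such that Z @ Z.T approximates the
    (symmetric) coloc matrix under an MSE loss."""
    rng = np.random.default_rng(seed)
    n = coloc.shape[0]
    z = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(epochs):
        error = z @ z.T - coloc
        # Gradient of mean((Z Z^T - C)^2) for symmetric C.
        z -= lr * 4.0 * error @ z / (n * n)
    return z
```

The (V)GAE and shallow-embedding variants would replace the free embeddings with encoder outputs or sampled node sequences, but the reconstruction target stays the same.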

Notes

  • GAE approaches will predict different node embeddings per network. Requires aggregation such as centroiding, but can be advantageous since it gives us a variance/variability for the embedding of each node
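The aggregation described above could be as simple as centroiding the per-dataset embeddings and keeping the per-node variance as a stability estimate; a sketch, assuming ions missing from a dataset are NaN-filled:

```python
import numpy as np

def aggregate_embeddings(per_dataset):
    """per_dataset: (n_datasets, n_ions, dim) node embeddings from the
    separate graphs, NaN where an ion is absent from a dataset.
    Returns the centroid embedding and its per-ion variance."""
    arr = np.asarray(per_dataset, dtype=float)
    return np.nanmean(arr, axis=0), np.nanvar(arr, axis=0)
```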

Links

Implement W2V model

As an additional model for the package. Use previous work from Dominik for the dataloaders.

Try to incorporate intensities instead of just thresholds.
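Intensity-aware pair sampling for the W2V dataloader could look like this; a sketch where the sampling probability is proportional to per-dataset intensity rather than a binary detection threshold (names are hypothetical, and only positive pairs are drawn — a real dataloader would also sample negatives):

```python
import numpy as np

def sample_pairs(intensity, pairs_per_dataset=500, seed=0):
    """intensity: (n_datasets, n_ions) mean ion intensities, 0 when an
    ion is not detected. Returns (center, context) index pairs."""
    rng = np.random.default_rng(seed)
    pairs = []
    for row in intensity:
        prob = row / row.sum()  # assumes at least one detected ion
        centers = rng.choice(row.size, size=pairs_per_dataset, p=prob)
        contexts = rng.choice(row.size, size=pairs_per_dataset, p=prob)
        pairs.extend(zip(centers.tolist(), contexts.tolist()))
    return pairs
```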

Flexible dataloader for evaluation scenarios

For the evaluation of the project, we discussed a few scenarios from simple to more complex. Dataloaders (training, validation, and testing) for each scenario should be implemented.

  • One cohort, randomly leave out molecules
  • One cohort, leave out molecules, such that they never occur in one dataset
  • One cohort, leave out complete datasets
  • Diverse datasets, leave out random molecules
  • Diverse datasets, leave out molecules, such that they never occur in one dataset
  • Diverse datasets, leave out complete datasets
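The simplest two split types above (random molecule hold-out and whole-dataset hold-out) could share one helper; a sketch with hypothetical names:

```python
import random

def leave_out_split(datasets, mode="random", frac=0.2, seed=0):
    """datasets: dict dataset_id -> set of molecule ids.
    mode="random": hold out a random fraction of molecules everywhere.
    mode="dataset": hold out entire datasets.
    Returns (train, test) dicts in the same format."""
    rng = random.Random(seed)
    if mode == "dataset":
        ids = sorted(datasets)
        held = set(rng.sample(ids, max(1, int(len(ids) * frac))))
        train = {d: m for d, m in datasets.items() if d not in held}
        test = {d: m for d, m in datasets.items() if d in held}
        return train, test
    molecules = sorted(set.union(*datasets.values()))
    held = set(rng.sample(molecules, max(1, int(len(molecules) * frac))))
    train = {d: m - held for d, m in datasets.items()}
    test = {d: m & held for d, m in datasets.items()}
    return train, test
```

The "leave out molecules such that they never occur in one dataset" scenarios would constrain `held` per dataset instead of globally.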

Next steps

  • Run without CAE
  • Add FC layers + FC output
  • Test different FC activation functions
  • Set up optuna run
  • ColocML preprocessing
  • Tests for coloc framework and evaluations

CRL model improvements

There is room for improvement in the CRL model by adapting the model architecture and trying out different loss functions.

What should definitely be explored:

  • Double normalization in loss function (from legacy model implementation)
  • Explore loss functions
  • Use previous layers as embedding layers

Notes:

Baseline model and evaluation

  1. We need a baseline model for the evaluation (e.g. just basic cosine-based colocalization).
  2. Evaluation protocol
    • Coloc and the model output operate in very different spaces; we need a comparable metric for their results
      • Check output distributions, and ranges of outputs, ...
    • What can we use as ground truth?
      • Probably the true colocalization between molecules
      • If molecules have never been observed together, we might use prior knowledge metabolic networks
      • Simulated data?
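The basic cosine-based colocalization baseline mentioned in point 1 could be sketched as:

```python
import numpy as np

def cosine_coloc(images):
    """images: (n_ions, h, w) ion intensity images from one dataset.
    Baseline colocalization = pairwise cosine similarity of the
    flattened images."""
    x = images.reshape(len(images), -1).astype(float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    return x @ x.T
```

Aggregating these per-dataset matrices (e.g. mean coloc) then gives the cross-dataset baseline whose output range can be compared against the model's.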
