
dacapo's Issues

Random seeds

  • Make sure every run is as repeatable as possible.

Problems:

  • Precaching: batch precaching introduces sources of randomness that are not easily controlled.
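A minimal seeding helper, as a sketch (names are assumptions; a real dacapo run would additionally seed torch via torch.manual_seed and make cuDNN deterministic, and precaching workers remain a separate source of randomness to control):

```python
import os
import random

import numpy as np


def seed_everything(seed):
    """Seed the common sources of randomness so a run is repeatable.

    Sketch only: torch.manual_seed(seed) and
    torch.backends.cudnn.deterministic = True would also be needed
    in a real training run; they are omitted to keep this
    dependency-free.
    """
    random.seed(seed)
    np.random.seed(seed)
    # Only affects newly started Python processes (e.g. workers).
    os.environ["PYTHONHASHSEED"] = str(seed)
```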

Partial reruns

Modifying parts of your pipeline is very common, and rerunning a full grid search can be very impractical. To support partial reruns, we should consider the following:

  • Identify automatically (based on config changes) which parts of a dacapo run need to be recomputed (only the validation results, or the full network training loop).
  • Provide ergonomic options for users to define which jobs should be rerun (e.g. a bug was fixed in a specific model; there is no need to retrain all models, just the one that was fixed).
  • Throw warnings when code has changed in a way that potentially invalidates specific results (code hashing).
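The first bullet could build on a config fingerprint; here is a sketch with hypothetical helper names:

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a JSON-serializable config dict.

    Hypothetical helper: store this alongside each run's results and
    compare it against the current config to decide which parts of a
    run must be recomputed.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def needs_rerun(stored_fingerprint, config):
    """Return True if the stored fingerprint no longer matches."""
    return stored_fingerprint != config_fingerprint(config)
```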

Raw max value < 0

The maximum raw value after augmentation lies in [-1, 0]. I think the aim was [-1, 1], but there is a mistake in the scaling.
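A hypothetical reconstruction of the suspected bug: if raw data in [0, 1] is only shifted by -1 instead of being scaled by 2 and then shifted, the result lands in [-1, 0] rather than the intended [-1, 1]:

```python
import numpy as np

raw = np.array([0.0, 0.25, 0.5, 1.0])  # raw intensities in [0, 1]

# Suspected buggy scaling: maps [0, 1] onto [-1, 0]
buggy = raw - 1.0

# Intended scaling: maps [0, 1] onto [-1, 1]
fixed = 2.0 * raw - 1.0
```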


Ergonomic improvements to plots

Currently, the plots showing loss vs. iteration number, etc., do not have axis labels, making them hard to interpret unless you already know what they show. Also, the way they are displayed forces a lot of horizontal and vertical scrolling.

Pipeline Generalization

train pipeline:

  • Replace create_pipeline_2d and create_pipeline_3d with a dimension-agnostic create_pipeline, if possible.
  • Support non-gunpowder pipelines, if possible.
  • Make the train pipeline a class (generator) with modular components (get_inputs, get_outputs, add_augments, ...) to better support customization without having to rewrite everything (use abc?).
  • Support multiple sources.

predict pipeline:
very similar to the train pipeline
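A minimal sketch of the proposed class-based train pipeline using abc; any method or behavior beyond the names listed above is an assumption:

```python
from abc import ABC, abstractmethod


class TrainPipeline(ABC):
    """Hypothetical dimension-agnostic pipeline base class.

    Subclasses override individual stages instead of rewriting the
    whole pipeline; method names mirror the proposal above.
    """

    @abstractmethod
    def get_inputs(self):
        """Return the input sources (possibly more than one)."""

    @abstractmethod
    def get_outputs(self):
        """Return the requested outputs/targets."""

    def add_augments(self, pipeline):
        """Optionally add augmentations; the default is a no-op."""
        return pipeline

    def build(self):
        """Assemble the pipeline from its modular components."""
        pipeline = (self.get_inputs(), self.get_outputs())
        return self.add_augments(pipeline)
```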

Issue: error while installing with Python 3.11

When trying to install dacapo with pip install git+https://github.com/funkelab/dacapo in a fresh conda environment with Python 3.11, I receive the following error:

INFO: pip is looking at multiple versions of dacapo-ml to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.1.14 Requires-Python >=3.7,<3.8; 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement lsds>=0.1.3 (from dacapo-ml) (from versions: none)
ERROR: No matching distribution found for lsds>=0.1.3

Installation works as expected with the same setup but Python 3.10. The OS is Ubuntu 22.04.
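Judging by the pip log above, lsds and the numpy 1.21.x pins do not publish versions for Python 3.11, so until those dependencies catch up, pinning the environment to 3.10 is a workaround:

```shell
# Workaround sketch: create the environment with Python 3.10, which
# the reporter confirmed installs cleanly.
conda create -n dacapo python=3.10 -y
conda activate dacapo
pip install git+https://github.com/funkelab/dacapo
```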

Custom Code Injection

It is unlikely that the default config files will cover all desired use cases for dacapo, so power users must be allowed to customize as much of the dacapo pipeline as they would like. However, we do want the following:

  1. Minimize repetition: avoid having to handle the master process separately from worker processes.
  2. Maximize customization: it should be easy to replace any part of the dacapo process (reading data, the training pipeline, evaluation, post-processing, writing to the db, etc.). These should all be generic enough that users can replace any part.
  3. Provide a single obvious way of overriding specific behavior.

Current state:
Custom local dacapo.PostProcessors are supported via config arguments: you provide the path to your post_processor_module, that path is added to the Python path, and then you should be able to import your custom module.
Pros:

  1. Allows you to include any custom dacapo.PostProcessor without having to write a custom run script.

Cons:

  1. Modifying the sys.path variable and depending on local files could get unwieldy, with naming conflicts when overriding many parts of the framework.
  2. Local files may not be available to workers; this might need extra work such as configuring mount directories.
  3. Sharing a project requires copying the Python environment, local code, and configurations.
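The mechanism described above can be sketched as follows (the function and all names are hypothetical, not dacapo's actual API):

```python
import importlib
import sys


def load_custom_post_processor(module_dir, module_name, class_name):
    """Load a custom post-processor class from a local directory.

    Sketch of the current config-driven mechanism: the directory
    from the config is appended to sys.path, after which the custom
    module becomes importable.
    """
    if module_dir not in sys.path:
        sys.path.append(module_dir)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```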

Proposal #1:
Functional interface, something such as:

def run_all(
    data_configs,
    model_configs,
    post_processor_configs,
    ...,
    task_configs,
    train_iteration = None, # Optional generator that yields training batches
    post_processor = None, # Optional dacapo.PostProcessor
    loss = None, # Optional torch.nn.Module
    ...,
    data = None, # Optional dacapo.Data
):
    ...

Pros:

  1. Easy to replace any part that is exposed in run_all.
  2. Could run the exact same script as master and as worker; it would then be dacapo's job to figure out whether run_all was called by the master or by a worker (probably via an environment variable).
  3. The user has full control. It only depends on local files if the user wants it to, and customization could be very minimal, i.e. a single run.py script containing custom code.

Cons:

  1. Custom code requires a custom run script. Everyone who wants to do something similar (or every similar dacapo project) would need at least a copy of the same boilerplate run.py script.
  2. Confusing function names: every worker would call run_all but then run only a single specific configuration.

Proposal #2:
Plugin system: see the Python packaging documentation on entry-point plugin systems.
Pros:

  1. Sharing your environment and configurations is enough.
  2. Easier to enforce some structure on custom code, making backwards compatibility easier going forward.
  3. No custom script required. Could make a command-line tool that handles more than just the default setup.

Cons:

  1. Higher barrier to user customization (could probably be alleviated by providing templates).

Prediction

Prediction is often expected to run over a huge dataset.
Since we know the model's input/output sizes, the processing steps, and the volume size, full-volume prediction should be supported with daisy.
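The blockwise scheme can be sketched without daisy itself; this chunking helper is illustrative only and does not use daisy's actual API:

```python
import itertools


def blockwise_chunks(shape, block_size, context):
    """Enumerate (read_block, write_block) interval pairs.

    Sketch of the blockwise scheme a tool like daisy implements:
    each output (write) block is predicted from an input (read)
    block grown by `context` on every side, clipped at the volume
    boundary. All names here are illustrative.
    """
    per_axis = []
    for size, b in zip(shape, block_size):
        per_axis.append([(s, min(s + b, size)) for s in range(0, size, b)])
    for write in itertools.product(*per_axis):
        read = tuple(
            (max(0, lo - c), min(size, hi + c))
            for (lo, hi), c, size in zip(write, context, shape)
        )
        yield read, write
```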

Usage/API

  • choose an interface (command line, functional)

Command line:
Pros:

  • No copy-paste boilerplate script needed
  • Easier for maintainers to enforce a structure

Cons:

  • higher barrier for users to customize

Data Generalization

  • Separate Data from Tasks
  • Improve Generality: Masks
  • Improve Generality: Synthetic data generators

Expand built in use cases

Use plugins or build directly into dacapo:

  • Long range affinities
  • star dist
  • noise2void
  • classification
  • local shape descriptors
  • ... many more

Large Validation volumes

Similar to #8, it could be that the validation volume is also too large to hold in memory. In this case we should also use parallel blockwise processing (daisy) to get results faster.

Continue training from best checkpoint of previous run

It would be nice if there were an easy way to continue training a model from the best checkpoint of a previous run. Or, maybe, just run more iterations after determining that a previous run was making progress but needed more iterations.
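A dependency-free sketch of selecting the resume point; the record layout is hypothetical, and actually resuming would then load that checkpoint's weights and optimizer state (e.g. with torch.load) before training further:

```python
def best_checkpoint(run_records):
    """Pick the checkpoint with the best validation score.

    `run_records` is a hypothetical list of dicts such as
    {"iteration": 5000, "score": 0.87, "path": "run/model_5000"}
    that a previous run would have written to the results db.
    """
    return max(run_records, key=lambda r: r["score"])
```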

Road map

Here are several areas for improvement that have been identified:

  • Usage/API #4
  • Custom code #1
  • Random seeds #6
  • Data generalization #2
  • Pipeline generalization #3
  • Expand built in use cases #5
  • Prediction #8
  • Support large Validation volumes #9
  • Support partial reruns #10

Things that would be nice to have:

  • CI/CD github actions
  • installation guide
  • basic usage tutorial
  • customization/plugin tutorial
  • tests
