dacapo's Issues
Random seeds
- make sure every run is as repeatable as possible
Problems:
- PreCache, and other sources of randomness that aren't easily controlled
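A first step would be a single helper that seeds every RNG in play at the start of a run; a minimal sketch using only the stdlib (the numpy/torch calls shown as comments would need to be added for real runs, they are not part of this snippet):

```python
import random

def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG; a real run would also seed every other RNG in play."""
    random.seed(seed)
    # If numpy / torch / gunpowder are in use, they need their own calls, e.g.:
    #   numpy.random.seed(seed)
    #   torch.manual_seed(seed)

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
assert first == second  # reseeding reproduces the same draws
```

Parallel prefetching (like PreCache) still reorders batches nondeterministically, which is why it is listed as a problem above.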
Partial reruns
Modifying parts of your pipeline is very common, and rerunning a full grid search can be very impractical. To support partial reruns we should think of the following:
- Identifying (automatically based on config changes) which parts of the dacapo run need to be recomputed (validation results only or the full network training loop)
- Provide ergonomic options for users to define which jobs should be rerun (i.e. a bug was fixed in a specific model, no need to retrain all models, just the one that was fixed)
- Throw warnings when code has changed in a way that may have invalidated specific results (code hashing)
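One way to detect stale results automatically is to hash each stage's config and compare it against the hash stored alongside the results. A minimal sketch, assuming configs are JSON-serializable dicts (field names hypothetical):

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable hash of a config; a changed hash flags stages that need recomputation."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()

old = config_hash({"model": "unet", "lr": 1e-4})
new = config_hash({"model": "unet", "lr": 1e-5})
# differing hashes -> rerun training; identical hashes -> results can be reused
```

The same idea extends to code hashing by feeding the relevant source text into the digest.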
Early stopping when no progress is being made?
During training, it might be nice to end training before the requested number of iterations has been completed, if training does not seem to be making progress.
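Such a check could be as simple as a patience counter on the validation score; a sketch (not dacapo API):

```python
class EarlyStopper:
    """Stop when the validation score has not improved for `patience` checks."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def should_stop(self, score: float) -> bool:
        if score > self.best:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

The training loop would call `should_stop` after each validation and break out before the requested iteration count is reached.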
Raw max value < 0
Ergonomic improvements to plots
Pipeline Generalization
train pipeline:
- replace create_pipeline_2d and create_pipeline_3d with a dimension-agnostic create_pipeline
- support non-gunpowder pipelines if possible
- make the train pipeline a class (generator) with modular components (get_inputs, get_outputs, add_augments, ...) to better support customization without having to rewrite everything (use abc?)
- Multiple sources
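The modular class idea from the bullets above could be sketched with abc roughly like this (method names are taken from the list; everything else is hypothetical, not dacapo's current classes):

```python
from abc import ABC, abstractmethod

class TrainPipeline(ABC):
    """Skeleton of a modular train pipeline; subclasses override only the
    pieces they need."""

    @abstractmethod
    def get_inputs(self):
        ...

    @abstractmethod
    def get_outputs(self):
        ...

    def add_augments(self, pipeline):
        # Default: no augmentation; subclasses append their own nodes here.
        return pipeline

    def build(self):
        # Assemble the pieces; a real version would chain gunpowder nodes.
        pipeline = (self.get_inputs(), self.get_outputs())
        return self.add_augments(pipeline)

class MinimalPipeline(TrainPipeline):
    def get_inputs(self):
        return "raw"

    def get_outputs(self):
        return "labels"
```

A user who only wants custom augmentation overrides `add_augments` and inherits the rest; the same skeleton would work for a predict pipeline.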
predict pipeline:
very similar to train pipeline
Multi-label distance estimation: all masks are one even for unannotated classes
Feature request: Way to set bsub queue name without editing source code
It would be good to have a way to set the bsub queue name without having to edit the DaCapo source code. Currently I have to edit the source to change queue='slowpoke' to queue='gpu_any' in the file dacapo/run.py, near line 264, in the implementation of the run_remote() function.
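Until the queue is configurable, one low-effort fix would be reading the queue name from an environment variable with a default. The variable name DACAPO_BSUB_QUEUE below is an assumption for illustration, not an existing dacapo feature:

```python
import os

def get_bsub_queue(default: str = "gpu_any") -> str:
    """Return the bsub queue, overridable via a (hypothetical) env variable."""
    return os.environ.get("DACAPO_BSUB_QUEUE", default)

# run_remote() could then build its submit command from the lookup:
queue = get_bsub_queue()
cmd = ["bsub", "-q", queue]  # remaining bsub arguments elided
```

Users would then run `DACAPO_BSUB_QUEUE=slowpoke dacapo ...` without touching run.py.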
Issue: error while installing with Python 3.11
When trying to install dacapo with pip install git+https://github.com/funkelab/dacapo under a fresh conda environment with Python 3.11, I receive the following error:
INFO: pip is looking at multiple versions of dacapo-ml to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.1.14 Requires-Python >=3.7,<3.8; 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement lsds>=0.1.3 (from dacapo-ml) (from versions: none)
ERROR: No matching distribution found for lsds>=0.1.3
Installation works as expected when using the same setup but with python 3.10. OS used is Ubuntu 22.04.
Custom Code Injection
It is unlikely that the default config files will be able to cover all desired use cases for dacapo, so power users will need to be able to customize as much of the dacapo pipeline as they would like. However, we do want the following:
- minimize repetition. Avoid having to handle the master process separately from worker processes.
- maximize customization: It should be easy to replace any part of the dacapo process. Reading data, training pipeline, evaluation, post_processing, writing to db, etc. These should all be generic enough that users can replace any part.
- provide a single obvious way of overriding specific behavior.
Current state:
Custom local dacapo.PostProcessors are supported via config arguments: first you give the path to your post_processor_module; this module is added to the path, and you should then be able to import your custom module.
Pros:
- Allows you to include any custom dacapo.PostProcessor without having to write a custom run script.
Cons:
- Modifying the sys.path variable and depending on local files seems like it could get unwieldy, with naming conflicts when overriding many parts of the framework.
- Local files may not be available to workers; might need extra work such as configuring mount directories etc.
- Sharing a project requires copying the python environment, local code, and configurations.
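An alternative to mutating sys.path would be loading the user's module explicitly by file path with importlib, which avoids naming conflicts while keeping the same config interface. A sketch (not the current dacapo implementation):

```python
import importlib.util
from pathlib import Path

def load_post_processor_module(path: str):
    """Load a user's post-processor module from an explicit file path,
    without permanently mutating sys.path."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

The config would still point at a file, but collisions between projects become impossible because nothing is added to the global import path.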
Proposal #1:
Functional interface, something such as:
def run_all(
    data_configs,
    model_configs,
    post_processor_configs,
    ...,
    task_configs,
    train_iteration=None,  # Optional generator that yields training batches
    post_processor=None,   # Optional dacapo.PostProcessor
    loss=None,             # Optional torch.nn.Module
    ...,
    data=None,             # Optional dacapo.Data
):
    ...
Pros:
- Easy to replace any part that is exposed in run_all.
- Could run the exact same script as master and as worker; it would then be dacapo's job to figure out whether run_all was called by the master or by a worker (probably via an environment variable).
- User has full control. Only depends on local files if the user wants it to. Customization could be very minimal, i.e. a single run.py script containing custom code.
Cons:
- Custom code requires a custom run script. Everyone who wants to do something similar (or every similar dacapo project) would need at least a copy of the same boilerplate run.py script.
- Confusing function names: every worker would call run_all, but then only run a single specific configuration.
Proposal #2:
Plugin system: documentation for python package plugin systems here.
Pros:
- Sharing your environment and configurations is enough.
- Easier to enforce some structure on custom code, making backwards compatibility easier going forward.
- No custom script required. Could make a command-line tool that can handle more than just the default setup.
Cons:
- Higher barrier to user customization. (Could probably be alleviated by providing templates.)
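With Python package entry points, sharing the environment really is enough: master and workers discover the same plugins from installed packages. A sketch of the lookup side, assuming a hypothetical entry-point group name dacapo.post_processors:

```python
from importlib.metadata import entry_points

def discover_post_processors(group: str = "dacapo.post_processors") -> dict:
    """Map plugin name -> loaded object for every installed package that
    registered itself under the (hypothetical) entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):        # Python 3.10+ API
        selected = eps.select(group=group)
    else:                             # Python 3.8/3.9 returns a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

A plugin package would advertise its class in its own packaging metadata, e.g. under `[project.entry-points."dacapo.post_processors"]` in pyproject.toml, and dacapo would pick it up with no custom run script.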
Prediction
Prediction is often expected to run over a huge dataset.
Since we know the input/output size of the model, the processing steps, and the volume size, full volume prediction should be supported with daisy.
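The per-block bookkeeping daisy does (each write block's read region grown by the model's context) can be illustrated without daisy itself; a per-axis sketch in plain Python:

```python
def blockwise_rois(volume_size, write_size, context):
    """Enumerate (read_interval, write_interval) pairs along one axis.

    context is the one-sided halo the model needs, i.e. roughly
    (input_size - output_size) // 2 for a valid-convolution network.
    """
    blocks = []
    for start in range(0, volume_size, write_size):
        write = (start, min(start + write_size, volume_size))
        # Grow the read region by the context, clipped to the volume bounds.
        read = (max(0, write[0] - context), min(volume_size, write[1] + context))
        blocks.append((read, write))
    return blocks
```

daisy then schedules these blocks across workers and handles fault tolerance; the write intervals tile the volume exactly once while the read intervals overlap by the context.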
Usage/API
- choose an interface (command line, functional)
Command line:
Pros:
- No copy-paste boilerplate script needed
- Easier for maintainers to enforce a structure
Cons:
- higher barrier for users to customize
A lot of false positives because of low background weight
I have a multi-class segmentation setup.
The weight for the target class is ~10 for foreground and ~0.5 for background.
- The weight is very low for the other classes, which could help with detecting true negatives.
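One possible mitigation would be deriving the per-class weights from label frequencies instead of fixing them by hand; a sketch of a common inverse-frequency heuristic (an illustration, not dacapo's current behavior):

```python
def inverse_frequency_weights(counts):
    """Per-class loss weights proportional to 1/frequency, normalized so
    they average to 1. Rare classes get weight > 1, common ones < 1."""
    total = sum(counts)
    raw = [total / c if c > 0 else 0.0 for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

With voxel counts of e.g. 90% background and 10% foreground, the foreground weight comes out 9x the background weight, so true negatives on the majority class stop dominating the loss.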
Data Generalization
- Separate Data from Tasks
- Improve Generality: Masks
- Improve Generality: Synthetic data generators
Expand built-in use cases
Use plugins or build directly into dacapo:
- Long range affinities
- StarDist
- Noise2Void
- classification
- local shape descriptors
- ... many more
Large Validation volumes
Similar to #8, it could be that the validation volume is also too large for storing in memory. In this case we should also use parallel blockwise processing (daisy) to get results faster.
Continue training from best checkpoint of previous run
Would be nice if there was an easy way to continue training a model from the best checkpoint of a previous run. Or, maybe, just run more iterations after determining that a previous run was making progress but needed more iterations.
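If each checkpoint's validation score is already recorded, picking the one to resume from is a one-liner; a sketch (the storage layout and paths here are hypothetical):

```python
def best_checkpoint(scores):
    """Given {checkpoint_path: validation_score}, return the path to resume from."""
    if not scores:
        raise ValueError("no checkpoints recorded")
    return max(scores, key=scores.get)

resume_from = best_checkpoint({
    "checkpoints/iter_10000": 0.71,
    "checkpoints/iter_20000": 0.83,
    "checkpoints/iter_30000": 0.79,
})
# training would then load the weights from `resume_from` and continue
```

The missing piece in dacapo would be wiring this lookup into run creation so the new run starts from those weights instead of a fresh initialization.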
Road map
Here are a couple of areas for improvement that have been identified:
- Usage/API #4
- Custom code #1
- Random seeds #6
- Data generalization #2
- Pipeline generalization #3
- Expand built in use cases #5
- Prediction #8
- Support large validation volumes #9
- Support partial reruns #10
Things that would be nice to have:
- CI/CD github actions
- installation guide
- basic usage tutorial
- customization/plugin tutorial
- tests
Semantic segmentation: touching components get the same distance target when resampling a crop