
dacapo's Issues

Random seeds

  • Make sure every run is as repeatable as possible.

Problems:

  • Precaching: batch precaching introduces sources of randomness that are not easily controlled.
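A minimal seeding helper, as a sketch (names are assumptions; a real dacapo run would additionally seed torch via torch.manual_seed and make cuDNN deterministic, and precaching workers remain a separate source of randomness to control):

```python
import os
import random

import numpy as np


def seed_everything(seed):
    """Seed the common sources of randomness so a run is repeatable.

    Sketch only: torch.manual_seed(seed) and
    torch.backends.cudnn.deterministic = True would also be needed
    in a real training run; they are omitted to keep this
    dependency-free.
    """
    random.seed(seed)
    np.random.seed(seed)
    # Only affects newly started Python processes (e.g. workers).
    os.environ["PYTHONHASHSEED"] = str(seed)
```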

Partial reruns

Modifying parts of your pipeline is very common, and rerunning a full grid search can be very impractical. To support partial reruns, we should consider the following:

  • Identify automatically (based on config changes) which parts of a dacapo run need to be recomputed (only the validation results, or the full network training loop).
  • Provide ergonomic options for users to define which jobs should be rerun (e.g. a bug was fixed in a specific model; there is no need to retrain all models, just the one that was fixed).
  • Throw warnings when code has changed in a way that potentially invalidates specific results (code hashing).
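The first bullet could build on a config fingerprint; here is a sketch with hypothetical helper names:

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a JSON-serializable config dict.

    Hypothetical helper: store this alongside each run's results and
    compare it against the current config to decide which parts of a
    run must be recomputed.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def needs_rerun(stored_fingerprint, config):
    """Return True if the stored fingerprint no longer matches."""
    return stored_fingerprint != config_fingerprint(config)
```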

Raw max value < 0

The maximum raw value after augmentation lies in [-1, 0]. I think the aim was [-1, 1], but there is a mistake in the scaling.
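A hypothetical reconstruction of the suspected bug: if raw data in [0, 1] is only shifted by -1 instead of being scaled by 2 and then shifted, the result lands in [-1, 0] rather than the intended [-1, 1]:

```python
import numpy as np

raw = np.array([0.0, 0.25, 0.5, 1.0])  # raw intensities in [0, 1]

# Suspected buggy scaling: maps [0, 1] onto [-1, 0]
buggy = raw - 1.0

# Intended scaling: maps [0, 1] onto [-1, 1]
fixed = 2.0 * raw - 1.0
```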


Ergonomic improvements to plots

Currently, the plots showing loss vs. iteration number, etc., do not have axis labels, making them hard to interpret unless you already know what they show. Also, the way they are displayed forces a lot of horizontal and vertical scrolling.

Pipeline Generalization

train pipeline:

  • Replace create_pipeline_2d and create_pipeline_3d with a dimension-agnostic create_pipeline, if possible.
  • Support non-gunpowder pipelines, if possible.
  • Make the train pipeline a class (generator) with modular components (get_inputs, get_outputs, add_augments, ...) to better support customization without having to rewrite everything (use abc?).
  • Support multiple sources.

predict pipeline:
very similar to the train pipeline
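A minimal sketch of the proposed class-based train pipeline using abc; any method or behavior beyond the names listed above is an assumption:

```python
from abc import ABC, abstractmethod


class TrainPipeline(ABC):
    """Hypothetical dimension-agnostic pipeline base class.

    Subclasses override individual stages instead of rewriting the
    whole pipeline; method names mirror the proposal above.
    """

    @abstractmethod
    def get_inputs(self):
        """Return the input sources (possibly more than one)."""

    @abstractmethod
    def get_outputs(self):
        """Return the requested outputs/targets."""

    def add_augments(self, pipeline):
        """Optionally add augmentations; the default is a no-op."""
        return pipeline

    def build(self):
        """Assemble the pipeline from its modular components."""
        pipeline = (self.get_inputs(), self.get_outputs())
        return self.add_augments(pipeline)
```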

Issue: error while installing with Python 3.11

When trying to install dacapo with pip install git+https://github.com/funkelab/dacapo in a fresh conda environment with Python 3.11, I receive the following error:

INFO: pip is looking at multiple versions of dacapo-ml to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.1.14 Requires-Python >=3.7,<3.8; 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement lsds>=0.1.3 (from dacapo-ml) (from versions: none)
ERROR: No matching distribution found for lsds>=0.1.3

Installation works as expected with the same setup but Python 3.10. The OS is Ubuntu 22.04.
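Judging by the pip log above, lsds and the numpy 1.21.x pins do not publish versions for Python 3.11, so until those dependencies catch up, pinning the environment to 3.10 is a workaround:

```shell
# Workaround sketch: create the environment with Python 3.10, which
# the reporter confirmed installs cleanly.
conda create -n dacapo python=3.10 -y
conda activate dacapo
pip install git+https://github.com/funkelab/dacapo
```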

Custom Code Injection

It is unlikely that the default config files will cover all desired use cases for dacapo, so power users must be allowed to customize as much of the dacapo pipeline as they would like. However, we do want the following:

  1. Minimize repetition: avoid having to handle the master process separately from worker processes.
  2. Maximize customization: it should be easy to replace any part of the dacapo process (reading data, the training pipeline, evaluation, post-processing, writing to the db, etc.). These should all be generic enough that users can replace any part.
  3. Provide a single obvious way of overriding specific behavior.

Current state:
Custom local dacapo.PostProcessors are supported via config arguments: you provide the path to your post_processor_module, that path is added to the Python path, and then you should be able to import your custom module.
Pros:

  1. Allows you to include any custom dacapo.PostProcessor without having to write a custom run script.

Cons:

  1. Modifying the sys.path variable and depending on local files could get unwieldy, with naming conflicts when overriding many parts of the framework.
  2. Local files may not be available to workers; this might need extra work such as configuring mount directories.
  3. Sharing a project requires copying the Python environment, local code, and configurations.
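The mechanism described above can be sketched as follows (the function and all names are hypothetical, not dacapo's actual API):

```python
import importlib
import sys


def load_custom_post_processor(module_dir, module_name, class_name):
    """Load a custom post-processor class from a local directory.

    Sketch of the current config-driven mechanism: the directory
    from the config is appended to sys.path, after which the custom
    module becomes importable.
    """
    if module_dir not in sys.path:
        sys.path.append(module_dir)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```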

Proposal #1:
Functional interface, something such as:

def run_all(
    data_configs,
    model_configs,
    post_processor_configs,
    ...,
    task_configs,
    train_iteration = None, # Optional generator that yields training batches
    post_processor = None, # Optional dacapo.PostProcessor
    loss = None, # Optional torch.nn.Module
    ...,
    data = None, # Optional dacapo.Data
):
    ...

Pros:

  1. Easy to replace any part that is exposed in run_all.
  2. Could run the exact same script as master and as worker; it would then be dacapo's job to figure out whether run_all was called by the master or by a worker (probably via an environment variable).
  3. The user has full control. It only depends on local files if the user wants it to, and customization could be very minimal, i.e. a single run.py script containing custom code.

Cons:

  1. Custom code requires a custom run script. Everyone who wants to do something similar (or every similar dacapo project) would need at least a copy of the same boilerplate run.py script.
  2. Confusing function names: every worker would call run_all but then run only a single specific configuration.

Proposal #2:
Plugin system: see the Python packaging documentation on entry-point plugin systems.
Pros:

  1. Sharing your environment and configurations is enough.
  2. Easier to enforce some structure on custom code, making backwards compatibility easier going forward.
  3. No custom script required. Could make a command-line tool that handles more than just the default setup.

Cons:

  1. Higher barrier to user customization (could probably be alleviated by providing templates).

Prediction

Prediction is often expected to run over a huge dataset.
Since we know the model's input/output sizes, the processing steps, and the volume size, full-volume prediction should be supported with daisy.
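The blockwise scheme can be sketched without daisy itself; this chunking helper is illustrative only and does not use daisy's actual API:

```python
import itertools


def blockwise_chunks(shape, block_size, context):
    """Enumerate (read_block, write_block) interval pairs.

    Sketch of the blockwise scheme a tool like daisy implements:
    each output (write) block is predicted from an input (read)
    block grown by `context` on every side, clipped at the volume
    boundary. All names here are illustrative.
    """
    per_axis = []
    for size, b in zip(shape, block_size):
        per_axis.append([(s, min(s + b, size)) for s in range(0, size, b)])
    for write in itertools.product(*per_axis):
        read = tuple(
            (max(0, lo - c), min(size, hi + c))
            for (lo, hi), c, size in zip(write, context, shape)
        )
        yield read, write
```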

Usage/API

  • choose an interface (command line, functional)

Command line:
Pros:

  • No copy-paste boilerplate script needed
  • Easier for maintainers to enforce a structure

Cons:

  • higher barrier for users to customize

Data Generalization

  • Separate Data from Tasks
  • Improve Generality: Masks
  • Improve Generality: Synthetic data generators

Expand built in use cases

Use plugins or build directly into dacapo:

  • Long range affinities
  • star dist
  • noise2void
  • classification
  • local shape descriptors
  • ... many more

Large Validation volumes

Similar to #8, it could be that the validation volume is also too large to hold in memory. In this case we should also use parallel blockwise processing (daisy) to get results faster.

Continue training from best checkpoint of previous run

It would be nice if there were an easy way to continue training a model from the best checkpoint of a previous run. Or, maybe, just run more iterations after determining that a previous run was making progress but needed more iterations.
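A dependency-free sketch of selecting the resume point; the record layout is hypothetical, and actually resuming would then load that checkpoint's weights and optimizer state (e.g. with torch.load) before training further:

```python
def best_checkpoint(run_records):
    """Pick the checkpoint with the best validation score.

    `run_records` is a hypothetical list of dicts such as
    {"iteration": 5000, "score": 0.87, "path": "run/model_5000"}
    that a previous run would have written to the results db.
    """
    return max(run_records, key=lambda r: r["score"])
```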

Road map

Here are several areas for improvement that have been identified:

  • Usage/API #4
  • Custom code #1
  • Random seeds #6
  • Data generalization #2
  • Pipeline generalization #3
  • Expand built in use cases #5
  • Prediction #8
  • Support large Validation volumes #9
  • Support partial reruns #10

Things that would be nice to have:

  • CI/CD github actions
  • installation guide
  • basic usage tutorial
  • customization/plugin tutorial
  • tests
