Coder Social home page Coder Social logo

Comments (5)

joshuacwnewton avatar joshuacwnewton commented on September 23, 2024 1

If we want something rigorous, we could have some sort of hashing of the model contents. (This has been suggested in the past in #1572 / #2489). But, there are some downsides to this?

  • Making the comparison between the "on-disk" model and the "SCT MODELS dict" URL is tricky. (From what files would the on-disk hash be computed? How would we store/fetch the "MODELS dict" hash? Would this require a network call every time we run sct_deepseg?)
    • NB: It seems like GitHub itself provides a hash, but I'm not sure it's a sustainable idea to rely on GitHub for hashes (esp. given alternative mirrors such as OS)
  • Using a hash as a version is less readable + user-verifiable in terms of JSON sidecars. There's no way to intuitively tell what version of the model you have by looking at a hash alone.

So, I think a simpler way to do this would be to use the model URL as its version indicator. (This is because, even though we do use GitHub Release tags, there can be multiple assets per tag, so I don't think we can rely on the tag name to uniquely identify the release.) This also has the benefit of being easily accessible by SCT without a network call (via the MODELS dict) and practically useful as well (as the URL points directly to the source).

A quick sketch:

  • When downloading the model using sct_deepseg, store the source URL inside a file called e.g. source.txt inside the model folder.
  • When running the model, compare the on-disk source.txt with the URL stored inside the MODELS dict.
  • If the model is out of date (e.g. SCT is updated but sct_deepseg -install-task has not been run yet), then emit a warning.
  • When running sct_deepseg -list-tasks, we can perform the same fetch + comparison to show the version stored + the latest version.

I would definitely value @mguaypaq's input here, though. 😊

from spinalcordtoolbox.

jcohenadad avatar jcohenadad commented on September 23, 2024 1

Thank you for drafting this excellent plan @joshuacwnewton ! My only comment would be:

If the model is out of date (e.g. SCT is updated but sct_deepseg -install-task has not been run yet), then emit a warning.

people tend to ignore warnings, so I would suggest to instead do a brute-force reinstallation of the model if it does not match the model that should be installed with the current version of SCT (if ppl want to run an older model they can install an older version of SCT)

from spinalcordtoolbox.

mguaypaq avatar mguaypaq commented on September 23, 2024 1

Storing some token inside a source.txt and checking it against the MODELS dict is a nice simple design. As for the warning vs forced installation, I would suggest:

  • It would be good if sct_deepseg -list-tasks showed both the expected version and the detected version.
  • But for every other action (including running the model), we should consider a mismatched version of the model to be the same as a not installed model. (Maybe this check could be incorporated into models.is_valid()?)

If we want to use the URL as the version token, there are some pros and cons:

  • (pro) It's human-readable, which is very nice.
  • (con) There can be more than one mirror listed in MODELS, so we'd need to be careful with the validation logic: any URL listed in the MODELS entry should be considered valid in source.txt.
  • (con) This difference would also show up in logs, sidecars, etc., and might lead people to wonder if they have the same model. And it's only buried in (each version of) MODELS that the correspondence between the various URLs for the same version of a model is written down.
  • (pro) This is maybe good for provenance/reproducibility, since it's arguably closer to the truth. In case one of the URLs points to the wrong file, it will be easier to track down which real version was used.
  • (neutral) In addition to multiple model URLs, we also have a mechanism for having a separate URL for each model seed. It shouldn't be hard to put a {seed_name: url} dict in source.txt, just something we need to be aware of. As a multi-line thing, this might also affect the formatting of sct_deepseg -list-tasks.

from spinalcordtoolbox.

jcohenadad avatar jcohenadad commented on September 23, 2024 1

do we really need to consider mirrors? if it simplify things, maybe we can ignore them (URL's from github assets seems to be fetchable from everywhere in the world, so far...)

from spinalcordtoolbox.

joshuacwnewton avatar joshuacwnewton commented on September 23, 2024
  • (con) There can be more than one mirror listed in MODELS, so we'd need to be careful with the validation logic: any URL listed in the MODELS entry should be considered valid in source.txt.
  • (con) This difference would also show up in logs, sidecars, etc., and might lead people to wonder if they have the same model. And it's only buried in (each version of) MODELS that the correspondence between the various URLs for the same version of a model is written down.

Since you already mention needing to display multiple model seeds, we could always store/display all of the mirrors, too?

Though, things get a little tricky when considering whether to display only the URL that was used to actually download the model, or to display all mirrors' at once. 🤔

Another quick sketch of what sources.txt (sources.json?) could look like:

// option 1: list just the URL that was actually used (useful for provenance if mirror urls somehow differ)
{
    "model_urls": {
        "mice_uqueensland_sc": "https://github.com/ivadomed/mice_uqueensland_gm/releases/download/r20200622/r20200622_mice_uqueensland_gm.zip",
    }
}

// option 2: list all sources at once (useful if we assume mirror urls are identical)
{
    "model_urls": {
        "mice_uqueensland_sc": [
            "https://github.com/ivadomed/mice_uqueensland_gm/releases/download/r20200622/r20200622_mice_uqueensland_gm.zip",
            "https://osf.io/mfxwg/download?version=6",
        ],
    }
}

// multi-model tasks (only 'seg_tumor-edema-cavity_t1-t2' and 'seg_tumor_t2')
EDIT: Metadata is per-model, so this isn't actually a case we need to consider.

// multi-seed tasks (only 'model_seg_ms_lesion_mp2rage')
{
    "model_urls": {
        "model_seg_ms_lesion_mp2rage": {
            "seed1": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed01.zip"],
            "seed2": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed02.zip"],
            "seed3": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed03.zip"],
            "seed4": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed04.zip"],
            "seed5": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed05.zip"],
        },
    },
}

Sidecar .json sample (this is a bit trickier, since the allowable fields for GeneratedBy are limited):

{
  "GeneratedBy": [
    {
      "Name": "sct_deepseg -task seg_sc_epi",
      "Version": "6.3.0",
      "CodeURL": "https://github.com/sct-pipeline/fmri-segmentation/releases/download/v0.2/model-fmri-segmentation-v0.2.zip"
    },
  ],
}

Or, alternatively:

{
  "GeneratedBy": [
    {
      "Name": "sct",
      "Version": "6.3.0",
      "CodeURL": "https://github.com/spinalcordtoolbox/spinalcordtoolbox"
    },
    {
      "Name": "sct_deepseg -task seg_sc_epi",
      "Version": "https://github.com/sct-pipeline/fmri-segmentation/releases/download/v0.2/model-fmri-segmentation-v0.2.zip",
      "CodeURL": "https://github.com/spinalcordtoolbox/spinalcordtoolbox/blob/72ff27030efda92431ee831b45f31e2858b8552d/spinalcordtoolbox/scripts/sct_deepseg.py"
    },
  ],
}

Assuming we can only store 1 URL per GeneratedBy entry, perhaps we could just store the second model for the multi-model tumor tasks (e.g. t2_tumor), and just store the first seed for multi-seed tasks? I imagine that's enough provenance, given that we're also mentioning the SCT version... 🤔

from spinalcordtoolbox.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.