Comments (5)
If we want something rigorous, we could have some sort of hashing of the model contents. (This has been suggested in the past in #1572 / #2489.) But there are some downsides to this approach:
- Making the comparison between the "on-disk" model and the "`MODELS` dict" URL is tricky. (From what files would the on-disk hash be computed? How would we store/fetch the "`MODELS` dict" hash? Would this require a network call every time we run `sct_deepseg`?)
  - NB: It seems like GitHub itself provides a hash, but I'm not sure it's a sustainable idea to rely on GitHub for hashes (esp. given alternative mirrors such as OSF).
- Using a hash as a version is less readable and user-verifiable in terms of JSON sidecars. There's no way to intuitively tell what version of the model you have by looking at a hash alone.
So, I think a simpler way to do this would be to use the model URL as its version indicator. (Even though we do use GitHub Release tags, there can be multiple assets per tag, so I don't think we can rely on the tag name to uniquely identify the release.) This also has the benefit of being easily accessible by SCT without a network call (via the `MODELS` dict), and it is practically useful as well (the URL points directly to the source).
A quick sketch:
- When downloading the model using `sct_deepseg`, store the source URL inside a file called e.g. `source.txt` inside the model folder.
- When running the model, compare the on-disk `source.txt` with the URL stored inside the `MODELS` dict.
- If the model is out of date (e.g. SCT is updated but `sct_deepseg -install-task` has not been run yet), then emit a warning.
- When running `sct_deepseg -list-tasks`, we can perform the same fetch + comparison to show the version stored + the latest version.
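The steps above could be sketched roughly as follows. This is a hypothetical illustration only: the `MODELS` layout, the folder structure, and the function names here are assumptions for the sake of the sketch, not SCT's actual API.

```python
# Hypothetical sketch of the proposed flow; the MODELS dict shape, folder
# layout, and function names are assumptions, not SCT's actual code.
from pathlib import Path

MODELS = {  # stand-in for the real MODELS dict in spinalcordtoolbox
    "mice_uqueensland_gm": {
        "url": "https://github.com/ivadomed/mice_uqueensland_gm/releases/"
               "download/r20200622/r20200622_mice_uqueensland_gm.zip",
    },
}

def install_model(name: str, models_dir: Path) -> None:
    """After downloading/unzipping, record the source URL next to the model."""
    model_dir = models_dir / name
    model_dir.mkdir(parents=True, exist_ok=True)
    # ... download + extraction would happen here ...
    (model_dir / "source.txt").write_text(MODELS[name]["url"])

def check_model_version(name: str, models_dir: Path) -> bool:
    """Return True iff the on-disk source matches the URL in the MODELS dict."""
    source_file = models_dir / name / "source.txt"
    if not source_file.exists():
        return False  # never installed (or pre-source.txt install): out of date
    return source_file.read_text().strip() == MODELS[name]["url"]
```

No network call is needed: both sides of the comparison (the on-disk `source.txt` and the `MODELS` dict shipped with SCT) are local.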
I would definitely value @mguaypaq's input here, though. 😊
from spinalcordtoolbox.
Thank you for drafting this excellent plan @joshuacwnewton! My only comment would be:
> If the model is out of date (e.g. SCT is updated but `sct_deepseg -install-task` has not been run yet), then emit a warning.
People tend to ignore warnings, so I would suggest instead doing a brute-force reinstallation of the model if it does not match the model that should be installed with the current version of SCT. (If people want to run an older model, they can install an older version of SCT.)
Storing some token inside a `source.txt` and checking it against the `MODELS` dict is a nice, simple design. As for the warning vs. forced reinstallation, I would suggest:
- It would be good if `sct_deepseg -list-tasks` showed both the expected version and the detected version.
- But for every other action (including running the model), we should consider a mismatched version of the model to be the same as a model that is not installed. (Maybe this check could be incorporated into `models.is_valid()`?)
If we want to use the URL as the version token, there are some pros and cons:
- (pro) It's human-readable, which is very nice.
- (con) There can be more than one mirror listed in `MODELS`, so we'd need to be careful with the validation logic: any URL listed in the `MODELS` entry should be considered valid in `source.txt`.
- (con) This difference would also show up in logs, sidecars, etc., and might lead people to wonder if they have the same model. And it's only buried in (each version of) `MODELS` that the correspondence between the various URLs for the same version of a model is written down.
- (pro) This is maybe good for provenance/reproducibility, since it's arguably closer to the truth. In case one of the URLs points to the wrong file, it will be easier to track down which real version was used.
- (neutral) In addition to multiple model URLs, we also have a mechanism for having a separate URL for each model seed. It shouldn't be hard to put a `{seed_name: url}` dict in `source.txt`, just something we need to be aware of. As a multi-line thing, this might also affect the formatting of `sct_deepseg -list-tasks`.
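A mirror- and seed-tolerant comparison could look something like the sketch below. The entry shapes here (a plain list of mirror URLs, or a `{seed_name: [urls]}` dict) are assumptions drawn from this discussion, not the real structure of `MODELS`:

```python
# Sketch only: the shapes of `model_entry` and `stored` are assumptions
# based on the discussion above, not SCT's actual data structures.

def known_urls(model_entry) -> set:
    """Flatten a MODELS-style entry into the set of acceptable URLs,
    whether it's a plain list of mirrors or a {seed_name: [urls]} dict."""
    if isinstance(model_entry, dict):
        return {url for urls in model_entry.values() for url in urls}
    return set(model_entry)

def source_matches(stored, model_entry) -> bool:
    """Accept source.txt if every stored URL is a known URL for this version.
    `stored` may be a single URL or a {seed_name: url} dict."""
    stored_urls = set(stored.values()) if isinstance(stored, dict) else {stored}
    return stored_urls <= known_urls(model_entry)
```

With this shape, a `source.txt` written after downloading from the OSF mirror would still validate against an entry that lists both the GitHub and OSF URLs.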
Do we really need to consider mirrors? If it simplifies things, maybe we can ignore them. (URLs from GitHub assets seem to be fetchable from everywhere in the world, so far...)
> - (con) There can be more than one mirror listed in `MODELS`, so we'd need to be careful with the validation logic: any URL listed in the `MODELS` entry should be considered valid in `source.txt`.
> - (con) This difference would also show up in logs, sidecars, etc., and might lead people to wonder if they have the same model. And it's only buried in (each version of) `MODELS` that the correspondence between the various URLs for the same version of a model is written down.
Since you already mention needing to display multiple model seeds, we could always store/display all of the mirrors, too?
Though, things get a little tricky when considering whether to display only the URL that was used to actually download the model, or to display all mirrors at once. 🤔
Another quick sketch of what `sources.txt` (`sources.json`?) could look like:
```jsonc
// option 1: list just the URL that was actually used (useful for provenance if mirror URLs somehow differ)
{
  "model_urls": {
    "mice_uqueensland_sc": "https://github.com/ivadomed/mice_uqueensland_gm/releases/download/r20200622/r20200622_mice_uqueensland_gm.zip"
  }
}

// option 2: list all sources at once (useful if we assume mirror URLs are identical)
{
  "model_urls": {
    "mice_uqueensland_sc": [
      "https://github.com/ivadomed/mice_uqueensland_gm/releases/download/r20200622/r20200622_mice_uqueensland_gm.zip",
      "https://osf.io/mfxwg/download?version=6"
    ]
  }
}

// multi-model tasks (only 'seg_tumor-edema-cavity_t1-t2' and 'seg_tumor_t2')
// EDIT: Metadata is per-model, so this isn't actually a case we need to consider.

// multi-seed tasks (only 'model_seg_ms_lesion_mp2rage')
{
  "model_urls": {
    "model_seg_ms_lesion_mp2rage": {
      "seed1": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed01.zip"],
      "seed2": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed02.zip"],
      "seed3": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed03.zip"],
      "seed4": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed04.zip"],
      "seed5": ["https://github.com/ivadomed/model_seg_ms_mp2rage/releases/download/r20230210/model_seg_lesion_mp2rage_r20230210_dil32_seed05.zip"]
    }
  }
}
```
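To make the two shapes concrete, here is a rough sketch of how `sct_deepseg -list-tasks` could render either form. The file name `sources.json` and its layout are taken from the sketches above; the function itself is purely illustrative, not existing SCT code:

```python
import json

def format_sources(sources_json: str) -> list:
    """Render each model's source URL(s) as display lines, handling both the
    flat list form (option 2) and the {seed_name: [urls]} multi-seed form."""
    data = json.loads(sources_json)
    lines = []
    for model, urls in data["model_urls"].items():
        if isinstance(urls, dict):  # multi-seed: one line per seed
            for seed, seed_urls in urls.items():
                lines.append(f"{model} ({seed}): {seed_urls[0]}")
        else:  # plain list of mirrors: show the first (primary) URL
            lines.append(f"{model}: {urls[0]}")
    return lines
```

(Strict `json.loads` would reject the `//` comments and trailing commas in the sketches above, so an actual `sources.json` would need to be plain JSON.)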
Sidecar `.json` sample (this is a bit trickier, since the allowable fields for `GeneratedBy` are limited):
```json
{
  "GeneratedBy": [
    {
      "Name": "sct_deepseg -task seg_sc_epi",
      "Version": "6.3.0",
      "CodeURL": "https://github.com/sct-pipeline/fmri-segmentation/releases/download/v0.2/model-fmri-segmentation-v0.2.zip"
    }
  ]
}
```
Or, alternatively:
```json
{
  "GeneratedBy": [
    {
      "Name": "sct",
      "Version": "6.3.0",
      "CodeURL": "https://github.com/spinalcordtoolbox/spinalcordtoolbox"
    },
    {
      "Name": "sct_deepseg -task seg_sc_epi",
      "Version": "https://github.com/sct-pipeline/fmri-segmentation/releases/download/v0.2/model-fmri-segmentation-v0.2.zip",
      "CodeURL": "https://github.com/spinalcordtoolbox/spinalcordtoolbox/blob/72ff27030efda92431ee831b45f31e2858b8552d/spinalcordtoolbox/scripts/sct_deepseg.py"
    }
  ]
}
```
Assuming we can only store 1 URL per `GeneratedBy` entry, perhaps we could just store the second model for the multi-model tumor tasks (e.g. `t2_tumor`), and just store the first seed for multi-seed tasks? I imagine that's enough provenance, given that we're also mentioning the SCT version... 🤔
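For illustration, picking a single representative URL as described could look like this. The function name and argument shapes are hypothetical; only the field names come from the first sidecar sample above:

```python
# Hypothetical helper; not SCT's actual API. Field names follow the
# GeneratedBy sidecar sample above.

def generated_by_entry(task: str, sct_version: str, model_urls) -> dict:
    """Build a sidecar GeneratedBy entry with one representative URL:
    the first seed's URL for multi-seed tasks, else the URL itself."""
    if isinstance(model_urls, dict):  # {seed_name: url}: take the first seed
        url = next(iter(model_urls.values()))
    else:
        url = model_urls
    return {
        "Name": f"sct_deepseg -task {task}",
        "Version": sct_version,
        "CodeURL": url,
    }
```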