M&Ms Challenge 2020

The CMR images have been segmented by experienced clinicians from the respective institutions, including contours for the left (LV) and right ventricle (RV) blood pools, as well as for the left ventricular myocardium (MYO). Labels are: 1 (LV), 2 (MYO) and 3 (RV)

Motivation

In the recent years, many machine/deep learning models have been proposed to accurately segment cardiac structures in magnetic resonance imaging. However, when these models are tested on unseen datasets acquired from distinct MRI scanners or clinical centres, the segmentation accuracy can be greatly reduced.

The M&Ms challenge aims to contribute to the effort of building generalisable models that can be applied consistently across clinical centres. Furthermore, M&Ms will provide a reference dataset for the community to build and assess future generalisable models in CMR segmentation.

Environment Setup

To use the code, the user needs to set te environment variable to access the data. At your ~/.bashrc add:

export MMsCardiac_DATA_PATH='/path/to/data/M&MsData/'

Also, the user needs to to pre-install a few packages:

$ pip install wheel setuptools
$ pip install -r requirements.txt
$ pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install torchcontrib~=0.0.2

Data preparation

Train csv

You can generate train csv for dataloaders using python3 preprocess/generate_train_df.py.

usage: generate_train_df.py [-h] [--meta_graphs]

M&Ms 2020 Challenge - Training info generation

optional arguments:
  -h, --help     show this help message and exit
  --meta_graphs  Generate train meta information graphs

Data Refactor

Load each volume to extract only 1 slice is time consuming. To solve this, save each slice in numpy arrays: python3 preprocess/dataloader_refactor.py

Global Training Mean and STD

You can easily get global mean and std from labeled training samples using python3 preprocess/get_mean_std.py.

Data Description

The challenge cohort is composed of 350 patients with hypertrophic and dilated cardiomyopathies as well as healthy subjects. All subjects were scanned in clinical centres in three different countries (Spain, Germany and Canada) using four different magnetic resonance scanner vendors (Siemens, General Electric, Philips and Canon).

Hospital	Num. studies	Country
Clinica Sagrada Familia	50	Spain
Hospital de la Santa Creu i Sant Pau	50	Spain
Hospital Universitari Dexeus	50	Spain
Hospital Vall d'Hebron	100	Spain
McGill University Health Centre	50	Canada
Universitätsklinikum Hamburg-Eppendorf	50	Germany

Training set (150+25 studies)

The training set will contain 150 annotated images from two different MRI vendors (75 each) and 25 unannotated images from a third vendor. The CMR images have been segmented by experienced clinicians from the respective institutions, including contours for the left (LV) and right ventricle (RV) blood pools, as well as for the left ventricular myocardium (MYO). Labels are: 1 (LV), 2 (MYO) and 3 (RV).

Testing set (200 studies)

The 200 test cases correspond to 50 new studies from each of the vendors provided in the training set and 50 additional studies from a fourth unseen vendor, that will be tested for model generalizability. 20% of these datasets will be used for validation and the rest will be reserved for testing and ranking participants.

Standard Operating Procedure (SOP) for data annotation

In order to build a useful dataset for the community we have decided to build on top of ACDC MICCAI 2017 challenge SOP and correct our contours accordingly.

In particular, clinical contours have been corrected by two in-house annotators that had to agree on the final result. These annotators followed these rules:

LV and RV cavities must be completely covered, with papillary muscles included.
No interpolation of the LV myocardium must be performed at the base.
RV must have a larger surface in end-diastole compared to end-systole and avoid the pulmonary artery.

The main difficulty and source of disagreement is the exact RV form in basal slices.

Results

Using ACDC checkpoint:

Average -> 0.7397 -> 0.9933 (background), 0.6931 (LV), 0.5624 (MYO), 0.71(RV)

Calculated using resnet34_unet_imagenet_encoder, Adam and constant learning rate. Fold metrics are calculated using mean of averaged iou and dice values. Only mnms data.

Method	Normalization	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.01	Reescale	0.7958	0.8272	0.8064	0.8107	0.8220	0.8124
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.001	Reescale	0.8163	0.8384	0.8382	0.8336	0.8498	0.8352
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.0001	Reescale	0.8066	0.8359	0.8235	0.8281	0.8310	0.8250
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.01	Standardize	0.7711	0.7745	0.7993	0.8248	0.7791	0.7897
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.001	Standardize	0.8058	0.8324	0.8322	0.8138	0.8433	0.8254
bce_dice_border_ce -> 0.4,0.4,0.1,0.3,0.6 - lr 0.0001	Standardize	0.7970	0.8382	0.8212	0.8313	0.8344	0.8244
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.01	Reescale	0.7977	0.8150	0.8053	0.8188	0.8212	0.8116
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.001	Reescale	0.8184	0.8400	0.8339	0.8408	0.8469	0.8360
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.0001	Reescale	0.8096	0.8377	0.8230	0.8286	0.8316	0.8261
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.01	Standardize	0.7842	0.8373	0.8254	0.8333	0.8318	0.8224
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.001	Standardize	0.8235	0.8556	0.7736	0.8477	0.8598	0.8320
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5 - lr 0.0001	Standardize	0.8221	0.8494	0.8349	0.8453	0.8503	0.8404
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.01	Reescale	0.7783	0.8101	0.8041	0.8021	0.8331	0.8055
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.001	Reescale	0.8162	0.8378	0.8330	0.8322	0.8456	0.8329
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.0001	Reescale	0.7971	0.8328	0.8065	0.8251	0.8291	0.8181
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.01	Standardize	0.7893	0.7775	0.7257	0.8152	0.8162	0.7847
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.001	Standardize	0.8091	0.8367	0.8204	0.8215	0.8436	0.8262
bce_dice_border_ce -> 0.3,0.4,0.2,0.05,0.65 - lr 0.0001	Standardize	0.7320	0.8234	0.7945	0.8245	0.8173	0.7983
bce_dice_ce -> 0.5,0.3,0.2,0.65 - lr 0.001	Standardize	0.7962	0.8384	0.8157	0.8053	0.8181	0.8147
bce_dice_ce -> 0.5,0.3,0.2,0.65 - lr 0.0001	Standardize	0.7915	0.8398	0.8148	0.8291	0.8244	0.8199

Principal conclusions: bce_dice_border_ce with 0.5,0.2,0.2,0.2,0.5 - lr 0.001/0.0001 - standardize.

Now, using lr 0.001, standardize and bce_dice_border_ce with 0.5,0.2,0.2,0.2,0.5, explore data augmentation. Without data augmentation score 0.8360.

Data Augmentation	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
Vertical flip	0.8004	0.8273	0.8176	0.8074	0.8386	0.8182
Horizontal flip	0.8032	0.8225	0.8226	0.8244	0.8318	0.8209
Random Crops	0.8137	0.8376	0.8208	0.8283	0.7876	0.8181
Shift	0.8117	0.8240	0.8222	0.8330	0.8307	0.8243
Downscale	0.7949	0.8192	0.8166	0.8219	0.8384	0.8181
Elastic Transform	0.7991	0.8425	0.8274	0.8213	0.8408	0.8262
Rotations	0.8158	0.8426	0.8255	0.8290	0.8524	0.8330
Grid Distortion	0.8028	0.8361	0.7864	0.8275	0.8231	0.8151
Optical Distortion	0.7705	0.8418	0.8255	0.7996	0.8354	0.8145

Competition Models

Bala 1

Using standardization, data augmentation combination old and bce_dice_border_ce with 0.5,0.2,0.2,0.2,0.5. Resnet34 Unet with lr 0.001 and adam optimizer.

Method	Fold 0	Fold 1	Fold 2	Fold 3	Mean
weakly -> labeled	0.8286	0.8596	0.8505	0.8540	0.8482
combined -> labeled	0.8271	0.8473	0.8424	0.8573	0.8435

Bala 2

Using standardization, data augmentation combination old and bce_dice_border_ce with 0.5,0.2,0.2,0.2,0.5

Method	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
Resnet34 Unet lr 0.001	0.8092	0.8257	0.8115	0.8293	0.8276	0.8207

Not Pretrained Model

Folding by patient.

Method	Normalization	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.65 - lr 0.01	Standardize	0.7873	0.8263	0.8004	0.8195	0.7616	0.7990
bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.65 - lr 0.001	Standardize	0.7741	0.7879	0.7743	0.7883	0.8071	0.7863

Update: 11/06/2020 Meeting

Changes and ideas:

Use 2 folds grouping by vendor (A vs. B), instead of n grouping by patient. Then error analysis by vendor
Since is not permited the use of pre-trained models, try smaller architectures
Create convolutional network that learns to distinguish if an image comes from vendor A or vendor B. ¿Works?
- If works then we can create a DCGAN trying to apply a initial transformation to fool the discriminator and do something like normalize the input images! Note: Do not add vendor C in CNN classification step since we will use it for validate our GAN later.
Self-Supervised Learning for unseen vendor C

Folding by Vendor Resuts

(Wrong folding, no train subpartition/patients to compare)

Normalization by reescale. Criterion bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5.

Method	DA	A -> B	B -> A	Mean
resnet18_pspnet_unet - lr 0.001	None	0.7573	0.7121	0.7346
resnet18_pspnet_unet - lr 0.0001	None	0.6838	0.5532	0.6185
resnet18_pspnet_unet - lr 0.001	Combination	0.7612	0.6793	0.7202
resnet18_pspnet_unet - lr 0.0001	Combination	0.6982	0.5580	0.6281
resnet18_unet_scratch - lr 0.001	None	0.7498	0.6835	0.7166
resnet18_unet_scratch - lr 0.0001	None	0.6779	0.4997	0.5888
resnet18_unet_scratch - lr 0.001	Combination	0.7421	0.6627	0.7023
resnet18_unet_scratch - lr 0.0001	Combination	0.7588	0.6281	0.6934
resnet34_unet_scratch - lr 0.001	None	0.7649	0.6313	0.6980
resnet34_unet_scratch - lr 0.0001	None	0.7189	0.6273	0.6731
resnet34_unet_scratch - lr 0.001	Combination	0.7673	0.6530	0.7101
resnet34_unet_scratch - lr 0.0001	Combination	0.7707	0.6128	0.6917
nano_unet - lr 0.001	None	0.5035	0.4284	0.4659
nano_unet - lr 0.0001	None	0.4432	0.2821	0.3626
nano_unet - lr 0.001	Combination	0.4871	0.4771	0.4821
nano_unet - lr 0.0001	Combination	0.4310	0.2187	0.3248

General conclusions:

Models can extract more information and thus make better predictions when training with Vendor 'A' and then testing on 'B'. GAN should approximate images to Vendor A?
lr 0.001 works better than lower ones.
Not clear difference using data augmentation and without apply it...
Intermediate models size, resnet18_pspnet_unet, performs better than bigger ones and smaller ones.

11 random patients to compare

Criterion bce_dice_border_ce -> 0.5,0.2,0.2,0.2,0.5. Using resnet18_pspnet_unet.

Normalization	Data Augmentation	Learning Rate	A -> B	B -> A	Mean
Reescale	Combination (Old)	0.001	0.7328	0.6915	0.7121
Standardize	Combination (Old)	0.001	0.7601	0.6704	0.7152
Reescale	Combination (Old)	0.005	0.6593	0.4914	0.5753
Standardize	Combination (Old)	0.005	0.7499	0.6342	0.6920
Reescale	Combination	0.001	0.7502	0.7014	0.7258
Standardize	Combination	0.001	0.7561	0.6723	0.7142
Reescale	Combination	0.005	0.7370	0.5143	0.6257
Standardize	Combination	0.005	0.7123	0.6826	0.6975
Reescale	None	0.001	0.7462	0.7283	0.7372
Standardize	None	0.001	0.7668	0.6312	0.6990
Reescale	None	0.005	0.7098	0.6280	0.6689
Standardize	None	0.005	0.7606	0.6604	0.7105

General conclusions:

When using Vendor A as training set, generalizes better to Vendor B cases.

Classification: Vendor 'A' - 'B' Discriminator

Using resnet18_pspnet_classification model. Adam with bce. 60 epochs and *0.1 steps as 25 and 50. Img size 224x224. fold_system="patient" & label_type="vendor_label". Normalization standardize. Learning rate 0.001.

Data Augmentation	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
None	0.9954	0.9726	1.0000	0.9878	0.9970	0.9906
Combination	0.9954	0.9771	0.9985	1.0000	0.9939	0.9930

Classification: Vendor 'A' - 'B' - 'C' Discriminator

Adam with bce. 80 epochs and *0.1 steps as 25 and 60. Img size 224x224. fold_system="patient" & label_type="vendor_label". Normalization standardize. Learning rate 0.001. Data Augmentation combination (old).

Model	Fold 0	Fold 1	Fold 2	Fold 3	Fold 4	Mean
resnet34_pspnet	0.9954	0.9726	1.0000	0.9878	0.9970	0.9906
resnet34_pspnet	0.9954	0.9771	0.9985	1.0000	0.9939	0.9930
resnet34_unet	0.9910	0.9871	1.0000	0.9740	0.9805	0.9865

Discriminator Entropy backwards 'A' - 'B' - 'C'

Using gradient gamma 0.99, max iterations 250, standardize normalization. Segmentator Training with 'A'. Baseline: 0.7799 IOU on B.

Out threshold	Target	More	B
0.01	A	----	0.7827
0.01	A	L1 2.0	0.7825
0.01	A	L1 5.0	0.7827
0.01	A	L1 10.0	0.7829
0.01	Equal	----	0.7713
0.01	Equal	L1 2.0	0.7723
0.01	Equal	L1 5.0	0.7725
0.01	Equal	L1 10.0	0.7744
0.001	A	----	0.7827
0.001	A	L1 2.0	0.7826
0.001	A	L1 5.0	0.7827
0.001	A	L1 10.0	0.7828
0.001	Equal	----	0.7713
0.001	Equal	L1 2.0	0.7723
0.001	Equal	L1 5.0	0.7725
0.001	Equal	L1 10.0	0.7744
0.0001	A	----	0.7827
0.0001	A	L1 2.0	0.7826
0.0001	A	L1 5.0	0.7828
0.0001	A	L1 10.0	0.7828
0.0001	Equal	----	0.7713
0.0001	Equal	L1 2.0	0.7723
0.0001	Equal	L1 5.0	0.7725
0.0001	Equal	L1 10.0	0.7744

Problem with low out thresholds... Waste all iterations and stops.

Discriminator Entropy backwards 'A' - 'B' - 'C' / With blur, unblur and gamma

Out threshold	Blur	Unblur	Gamma	Target	Iters	B
0.5	0.01	0.01	0.01	A	100	0.7770
0.5	0.0001	0.0001	0.0001	A	100	0.7786
0.5	0.000001	0.000001	0.000001	A	100	0.7779

7 July

Hausdorff loss tests

Mean average values for 5 folds. Data combination old. Lr 0.001 with resnet_unet_scratch.

Hausdorff Weight	IOU A	IOU B	DICE A	DICE B	HAUSSDORF A	HAUSSDORF B	ASSD A	ASSD B
0.0	0.7333	0.7835	0.8087	0.8561	4.4773	3.4890	1.2458	0.9624
0.05	0.7417	0.7867	0.8158	0.8589	4.0958	3.4073	1.1618	0.9646
0.1	0.7399	0.7827	0.8153	0.8550	4.1999	3.4355	1.1925	0.9735
0.2	0.7421	0.7806	0.8193	0.8522	4.2831	3.4414	1.1953	0.9831
0.3	0.7370	0.7790	0.8134	0.8534	4.3634	3.4972	1.2264	0.9886

Other

Development environment -> CUDA 10.1 and cudnn 7603. Python 3.8.2 - GCC 9.3.0
Challenge homepage here.
ACDC nomenclature: 0, 1, 2 and 3 represent voxels located in the background, in the right ventricular cavity, in the myocardium, and in the left ventricular cavity, respectively.

marioprojects / mnmscardiac Goto Github PK

mnmscardiac's Introduction