bimcv-csusp / bimcv-covid-19

Valencia Region Image Bank (BIMCV) that combines data from the PadChest dataset with future datasets based on COVID-19 pathology to provide the open scientific community with data of clinical-scientific value that helps early detection of COVID-19

License: MIT License

Jupyter Notebook 34.62% Python 0.36% HTML 65.02%
ai bimcv coronavirus-dataset covid deep-learning detection open-data padchest-dataset pneumonia rx scientific-community

bimcv-covid-19's People

Contributors

auriml, jcperez-iti-upv, joamonse, jonandergomez, josator2, kernegal, maigva, rparedespalacios, sabentura, treinel, xavierbarber


bimcv-covid-19's Issues

Dataset and evaluation protocol

I have several questions:

1. How many images are available in total?

2. If the protocol is 10-fold cross-validation, why are there only 9 files in the balanced-tsv folder? I guess the files inside this folder are the ones used with the suggested command:

python3 pneumo_cnn_classifier_training.py «FILE_TSV_BALANCED»

3. If there are enough images, I recommend using a hold-out protocol instead of cross-validation, since a hold-out is easier to manage and faster to train than cross-validation (a minimal example of what I mean is sketched below).
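
A minimal sketch of the kind of hold-out split I mean, assuming the balanced TSV files can be concatenated into a single pandas table containing the "group" column used by the data loader (the paths and column layout here are assumptions on my part):

import glob
import pandas as pd
from sklearn.model_selection import train_test_split

# Concatenate the balanced TSV files into one table (hypothetical path/layout)
frames = [pd.read_csv(f, sep="\t") for f in glob.glob("balanced-tsv/*.tsv")]
df = pd.concat(frames, ignore_index=True)

# Single stratified hold-out split instead of 10-fold cross-validation
train_df, test_df = train_test_split(df, test_size=0.2,
                                     stratify=df["group"], random_state=42)
train_df.to_csv("train_holdout.tsv", sep="\t", index=False)
test_df.to_csv("test_holdout.tsv", sep="\t", index=False)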

Data Augmentation simplified

I simplified the DA code and de-obfuscated it:

def get_sample(self, idx):
    '''Returns the sample and the label for the id passed as a parameter'''
    # Get the row from the dataframe corresponding to the index "idx"
    df_row = self.df.iloc[idx]
    image = Image.open(os.path.join(self.path_to_img, df_row["ImageID"]))
    #image.thumbnail((self.x,self.x), Image.ANTIALIAS)
    image = image.resize((self.x, self.x))
    image = np.asarray(image)
    label = dict_classes[df_row["group"]]
    # Add the channel dimension and make a writable copy
    # (np.asarray on a PIL image returns a read-only array)
    image_resampled = np.reshape(image, image.shape + (self.target_channels,))
    img2 = np.array(image_resampled)
    img2.setflags(write=1)

    # Data augmentation: every transform is applied whenever augmentation is enabled
    if self.data_augmentation:
        do_rotation = True
        do_shift = True
        do_zoom = True
        do_intense = True

        # Random transform parameters
        theta1 = float(np.around(np.random.uniform(-10.0, 10.0, size=1), 3))
        offset = list(np.random.randint(-20, 20, size=2))
        zoom = float(np.around(np.random.uniform(0.9, 1.05, size=1), 2))
        factor = float(np.around(np.random.uniform(0.8, 1.2, size=1), 2))

        if do_rotation:
            rotateit(img2, theta1)
        if do_shift:
            translateit_fast(img2, offset)
        if do_zoom:
            for channel in range(self.target_channels):
                img2[:, ..., channel] = scaleit(img2[:, ..., channel], zoom)
        if do_intense:
            img2[:, ..., 0] = intensifyit(img2[:, ..., 0], factor)
    #### DA ends

    img2 = self.norm(img2)
    # Return the resized image and the label
    return img2, label

About split dataset

Hello,
How can I split the dataset (BIMCV-COVID19+) into two sets, one with the frontal views and the other with the sagittal (lateral) ones?
The information is spread across several files and it is not clear to me how to do it. A minimal sketch of what I am after is below.
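
This is roughly what I have in mind, assuming the view position is encoded in the image filenames with a vp- tag (e.g. vp-pa / vp-ap for frontal and vp-lat for sagittal/lateral); I am not sure this assumption holds for every image, so please correct me if the naming differs:

import glob
import re

frontal, lateral = [], []
for path in glob.glob("bimcv_covid19_posi/**/*.png", recursive=True):
    match = re.search(r"vp-([a-z]+)", path)
    if match is None:
        continue  # no view-position tag in this filename
    if match.group(1) in ("pa", "ap"):
        frontal.append(path)
    elif match.group(1) == "lat":
        lateral.append(path)

print(len(frontal), "frontal images,", len(lateral), "lateral images")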

Thanks

Downscaled version of data set

The dataset is too big; I think it is an expensive ticket to start this project with.

Taking a look at the code, you propose an image working size of:

y, x, in_channel = 724, 200, 1

which perhaps is not enough, but I agree that it is OK for starting.

Therefore, could you please provide a reduced version of the images?

A quick calculation (see the sketch below): 30 K images x 724 x 200 x 1 ≈ 4.3 GB at 8-bit precision, or about 17 GB in float32.

In any case, significantly less than the 160 GB of the zip file.
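
For reference, a minimal sketch of that back-of-the-envelope calculation (the 30 K image count is the rough figure quoted above, not an exact number):

n_images, y, x, channels = 30_000, 724, 200, 1
bytes_uint8 = n_images * y * x * channels        # 1 byte per pixel
bytes_float32 = bytes_uint8 * 4                  # 4 bytes per pixel
print(f"uint8:   {bytes_uint8 / 1e9:.1f} GB")    # ~4.3 GB
print(f"float32: {bytes_float32 / 1e9:.1f} GB")  # ~17.4 GB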

Collaboration

Hi. I am a Deep Learning specialist focused on image processing.

I will try something as soon as I have some time. Is there any kind of premise or anything you would like to comment on in this regard, or should I simply create a folder and get to it? My idea is to try a couple of algorithms and show visual and metric results. I will keep asking you questions here.

If you prefer another channel, let me know. Regards!

Application for the access to PadChest

Dear Ms./Mrs. Salinas:
My lab is devoted to research on chest X-rays, and we are looking forward to getting access to your PadChest dataset, but we haven't found instructions on how to formally request access.

Slow DataGenerator

I observe that most of the time the GPU is idle because the CPU is reading images and doing DA (single-threaded!). This is not good.

I think that you could provide numpy files with the images rescaled to different sizes (256x256, 512x512) for training, dev and test, and we could load everything into memory (I think it fits on a 32 GB machine) and then use the standard Keras data augmentation, which can use all the CPU threads; a sketch is shown below. I think we could speed up training by at least 4x-8x.
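
A minimal sketch of the in-memory pipeline I have in mind, assuming a compiled Keras model named model and pre-rescaled arrays saved as .npy files (the file names are hypothetical):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# Load the pre-rescaled images and one-hot labels entirely into RAM
x_train = np.load("train_256_X.npy").astype("float32") / 255.0
y_train = np.load("train_256_Y.npy")

datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=[0.9, 1.1])

# fit_generator can run the augmentation in several worker processes,
# so the GPU is no longer starved by a single CPU thread
model.fit_generator(
    datagen.flow(x_train, y_train, batch_size=16),
    steps_per_epoch=len(x_train) // 16,
    epochs=100,
    workers=8,
    use_multiprocessing=True)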

Flat list of label + location

According to the paper, I was expecting the combined labels of finding and location to be a nested list,

along with their localizations when available: [['pulmonary fibrosis', 'loc basal bilateral'], ['chronic changes'], ['kyphosis'], ['pseudonodule', 'groundglass pattern', 'loc basal']].

but it seems to me that in bimcv_covid19_posi_head_iter1/derivatives/labels/labels_covid19_posi.tsv the list is flat, and there is no obvious way to slice it?

For example, one row of the file (index 0), shown field by field:

Unnamed: 0: 0
PatientID: sub-S03968
ReportID: ses-E08123
Report: no disponemos de estudios previos con los que comparar . infiltrados en practicamente todo el hemitorax derecho e izquierdo de predominio en campo medio y basal y mayores en el lado derecho . no se aprecia derrame pleural . no dispongo de estudios previos con los que comparar . a valorar en el contexto clinico posible covid 19
Labels: "['infiltrates' 'COVID 19']"
Localizations: "['loc left' 'loc pleural' 'loc basal' 'loc middle lung field' 'loc right' 'loc hemithorax']"
LabelsLocalizationsBySentence: "['exclude' 'infiltrates' 'loc left' 'loc right' 'loc basal' 'loc hemithorax' 'loc middle lung field' 'normal' 'loc pleural' 'exclude' 'COVID 19']"
labelCUIS: "[C0277877 C5203670]"
LocalizationsCUIS: "[C0443246 C0032225 C1282378 C0929434 C0444532 C0934569]"
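
For what it's worth, this is a minimal sketch of how I am reading those columns at the moment, assuming the values are space-separated quoted strings (they look like numpy array reprs rather than Python lists); it still does not recover the per-sentence nesting described in the paper:

import re
import pandas as pd

df = pd.read_csv(
    "bimcv_covid19_posi_head_iter1/derivatives/labels/labels_covid19_posi.tsv",
    sep="\t")

def parse_label_field(value):
    # Extract the quoted items from a field like "['infiltrates' 'COVID 19']"
    if not isinstance(value, str):
        return []
    return re.findall(r"'([^']*)'", value)

df["Labels_list"] = df["Labels"].apply(parse_label_field)
print(df["Labels_list"].head())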

Dataset "usability" for AI

I performed the following experiment:

  • Downloaded datasets [1], [2] and [3]
  • Extracted PA views for control and pneumonia patients (for [2], all "pneumonia" images were used regardless of the type, bacteria/virus; for [3], only "normal" or "lung opacity" patients were used)
  • Trained a convolutional network using oversampling to balance both labels and datasets: control and pneumonia images were sampled with 50% probability each, and each dataset was sampled with 1/3 probability (see the sketch after this list). This prevents the network from prioritizing a particular dataset or label.
  • Selected the epoch with the best "balanced" validation accuracy (the "balanced" accuracy was computed by oversampling the validation datasets following the same strategy used for the training sets)
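
A minimal sketch of that sampling scheme, with hypothetical in-memory pools (pools[d][lbl] would be a list of images from dataset d with label lbl, 0 = control, 1 = pneumonia):

import numpy as np

def sample_batch(pools, batch_size=32, rng=np.random):
    images, labels = [], []
    for _ in range(batch_size):
        d = rng.randint(3)      # each dataset with probability 1/3
        lbl = rng.randint(2)    # each label with probability 1/2
        idx = rng.randint(len(pools[d][lbl]))
        images.append(pools[d][lbl][idx])
        labels.append(lbl)
    return np.stack(images), np.array(labels)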

I achieved the following results:

Specificity:

  • Dataset [1]: 0.8746355685131195
  • Dataset [2]: 0.8632478632478633
  • Dataset [3]: 0.9661399548532731

Sensitivity:

  • Dataset [1]: 0.7647058823529411
  • Dataset [2]: 0.9794871794871794
  • Dataset [3]: 0.9581589958158996

The issue

The network seems to perform very well on dataset [3], where each image was manually reviewed by radiologists [4]. However, it performs significantly worse on dataset [1], where most labels were extracted using NLP and the images were not reviewed (which even led to the inclusion of completely white or completely black images [5]).

Do you think the quality of the images and annotations may be a limiting factor for the performance of the network?

References

[1] http://ceib.bioinfo.cipf.es/covid19/resized_padchest_neumo.tar.gz
[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
[3] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[4] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
[5] https://github.com/BIMCV-CSUSP/BIMCV-COVID-19/tree/master/padchest-covid#iti---proposal-for-datasets

Where is the segmentation?

Thanks for your great work!
I can't find the segmentation ground truth and the corresponding nii.gz files.
Thank you!

Patients' anonymity breach

Hi, we are working with the CT subset of the dataset (and the corresponding text reports) and found several cases (~50) of breached anonymity in the published collection (direct personal data, birth dates, names, etc.). What is the best way to report them?

My model

Perhaps other practitioners can get ideas from my model. This is for the 2-class problem, C vs N; however, I think it would get good results on the other problems as well.

from __future__ import print_function

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from keras.layers.normalization import BatchNormalization as BN
from keras.layers import GaussianNoise as GN
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler as LRS
from keras.preprocessing.image import ImageDataGenerator
import os
from sklearn.utils import class_weight
import sys

## LOAD DATA

y_train = np.load('tr2_Y.npy')
y_test = np.load('val2_Y.npy')

print(y_train.shape)
print(y_test.shape)

print(sum(y_train[:,0]))
print(sum(y_train[:,1]))

x_train = np.load('tr2_X.npy')
x_test = np.load('val2_X.npy')

print(x_train.shape)
print(x_test.shape)

result = np.argmax(y_train, axis=1)
print(result.shape)

# Keras expects class_weight as a dict mapping class index -> weight
class_weights = dict(enumerate(
    class_weight.compute_class_weight('balanced', np.unique(result), result)))
print(class_weights)

x_train /= 255
x_test /= 255

num_classes = 2

## DEFINE A DATA AUGMENTATION GENERATOR

datagen = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    rotation_range=20,
    zoom_range=[0.9, 1.1],
    horizontal_flip=False)

## DEFINE A BOTTLENECK BLOCK: CONV + BN + MAXPOOL

def CBGN(model, filters, size):
    model.add(Conv2D(filters, (1, 1), padding='same'))
    model.add(BN())
    model.add(Activation('relu'))

    model.add(Conv2D(filters, (size, size), padding='same'))
    model.add(BN())
    model.add(Activation('relu'))

    model.add(Conv2D(4*filters, (1, 1), padding='same'))
    model.add(BN())
    model.add(Activation('relu'))

    model.add(MaxPooling2D(pool_size=(2, 2)))

    return model

## DEFINE THE NETWORK TOPOLOGY

model = Sequential()

model.add(Conv2D(32, (7, 7), padding='same', strides=(2, 2), input_shape=x_train.shape[1:]))

model = CBGN(model, 32, 3)
model = CBGN(model, 64, 3)
model = CBGN(model, 128, 3)
model = CBGN(model, 256, 3)
model = CBGN(model, 256, 3)

model.add(GlobalMaxPooling2D())
#model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
#model.add(BN())

model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.summary()

## OPTIMIZER AND COMPILE

opt = SGD()

model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

## DEFINE A LEARNING RATE SCHEDULER

def scheduler(epoch):
    if epoch < 200:
        return 0.001
    elif epoch < 250:
        return 0.0001
    else:
        return 0.00001

set_lr = LRS(scheduler)

## TRAINING WITH DA AND LR ANNEALING

batch_size = 16  ## 2 x Titan 12Gb
epochs = 100

# First, a pre-training without data augmentation or class_weight
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs, verbose=1,
                    callbacks=[set_lr],
                    validation_data=(x_test, y_test))

# Now, a training with data augmentation and class_weight
epochs = 300
history = model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                              steps_per_epoch=len(x_train) // batch_size,
                              epochs=epochs,
                              validation_data=(x_test, y_test),
                              callbacks=[set_lr],
                              class_weight=class_weights,
                              verbose=1)

What are the COVID-19 cases in "BIMCV-COVID19-" dataset?

Hello.
I found that some cases in the "BIMCV-COVID19-" dataset are annotated as COVID-19 in "labels_SARS-cov-2_nega.tsv". What are these cases? I thought that "BIMCV-COVID19-" contains only non-COVID-19 cases. Thanks in advance for your reply.

Permission to use Dataset for study purpose

Hi Team,
I am Shubham, a student, and I am working on COVID case prediction. For that reason I want to access this dataset; how can I get the images and permission to use them?

Thank you.

Permission to use 3 images from Dataset

Hi there,

I have participated in the SIIM-FISABIO-RSNA COVID-19 Detection Kaggle competition that used BIMCV-COVID-19 data.

I did well and am planning on writing an educational blog article about my approach. I would like to show three images as examples of the data and of the model's performance on them. How can I officially obtain this permission?

Thank you,
Yousef

ValueError: assignment destination is read-only

When activating data augmentation (it is a must), I got this error:

ValueError: assignment destination is read-only

when trying to modify image_resampled.

To solve this, we have to copy it into a new numpy array and activate the write flag (np.asarray on a PIL image returns a read-only array, and np.reshape preserves that flag):

image_resampled = np.reshape(image, image.shape + (self.target_channels,))
img2 = np.array(image_resampled)
img2.setflags(write=1)

and then carry on with the operations on img2 instead of image_resampled.

In any case, the data generator is too obfuscated.
