ieee8023 / covid-chestxray-dataset

We are building an open database of COVID-19 cases with chest X-ray or CT images.

Python 16.32% Jupyter Notebook 82.77% JavaScript 0.91%

computed-tomography computer-vision covid-19 dataset deep-learning xray

covid-chestxray-dataset's Issues

COVID-19 classification DCNN training code with "explainability" functionality

In this example, we use only the X-ray (XR) samples in the dataset labeled as COVID-19. We chose XRs over CTs because the dataset contains more of them, although I agree that CTs are better for detection, as mentioned in #5.

The neural network source code is based on a post by Adrian Rosebrock on PyImageSearch.

Here, the dataset was divided into two labels: sicks and healthy. The healthy training samples were extracted from this Kaggle contest.

Then, for training, we split the images into two folders, /dataset/sicks and /dataset/healthy, located in the root folder, with the same number of images in each class (around 90).

It's a preliminary approach that may improve substantially once the dataset grows enough.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import lime
from lime import lime_image
from skimage.segmentation import mark_boundaries

plt.rcParams["figure.figsize"] = (20,10)

## global params
INIT_LR = 1e-4  # learning rate
EPOCHS = 21  # training epochs
BS = 8  # batch size


## load and prepare data
imagePaths = list(paths.list_images("dataset"))
data = []
labels = []
# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the filename
    label = imagePath.split(os.path.sep)[-2]
    # load the image, swap color channels, and resize it to be a fixed
    # 224x224 pixels while ignoring aspect ratio
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)
# convert the data and labels to NumPy arrays while scaling the pixel
# intensities to the range [0, 1]
data = np.array(data) / 255.0
labels = np.array(labels)

TEST_SET_SIZE = 0.2

lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels); print(labels)
# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data, labels,
    test_size=TEST_SET_SIZE, stratify=labels, random_state=42)
# initialize the training data augmentation object
trainAug = ImageDataGenerator(
    rotation_range=15,
    fill_mode="nearest")

## build network
baseModel = VGG16(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# construct the head of the model that will be placed on top of the
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(4, 4))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(64, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the first training process
for layer in baseModel.layers:
    layer.trainable = False

print("[INFO] compiling model...")
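# `learning_rate` replaces the deprecated `lr` alias; the `decay` argument is
# only accepted by older tf.keras optimizers, so drop it on newer releases
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)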
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])

## train
print("[INFO] training head...")
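# model.fit accepts generators directly (fit_generator is deprecated since
# TensorFlow 2.1); the validation arrays are evaluated in full each epoch
H = model.fit(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(testX, testY),
    epochs=EPOCHS)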

print("[INFO] saving COVID-19 detector model...")
model.save("covid19.model", save_format="h5")

## eval
print("[INFO] evaluating network...")
predIdxs = model.predict(testX, batch_size=BS)
predIdxs = np.argmax(predIdxs, axis=1) # argmax for the predicted probability
print(classification_report(testY.argmax(axis=1), predIdxs,
    target_names=lb.classes_))

cm = confusion_matrix(testY.argmax(axis=1), predIdxs)
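total = cm.sum()
acc = (cm[0, 0] + cm[1, 1]) / total
# LabelBinarizer sorts the class names, so index 0 is "healthy" and index 1
# is "sicks"; the positive (sick) class is therefore row/column 1
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])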
# show the confusion matrix, accuracy, sensitivity, and specificity
print(cm)
print("acc: {:.4f}".format(acc))
print("sensitivity: {:.4f}".format(sensitivity))
print("specificity: {:.4f}".format(specificity))


## explain
N = EPOCHS
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, N), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, N), H.history["val_accuracy"], label="val_acc")
plt.title("Training loss and accuracy of COVID-19 detection")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig("training_plot.png")

# create the explainer once and reuse it for every image
explainer = lime_image.LimeImageExplainer()
for ind in range(10):
    # explain the same test image whose label and prediction are printed below
    # (the original indexed testX[-ind], which is a different image for ind > 0)
    explanation = explainer.explain_instance(testX[ind], model.predict,
                                             hide_color=0, num_samples=42)
    print("> label:", testY[ind].argmax(), "- predicted:", predIdxs[ind])

    temp, mask = explanation.get_image_and_mask(
        explanation.top_labels[0], positive_only=False, num_features=1,
        hide_rest=True)
    plt.imshow(mark_boundaries(temp / 2 + 0.5, mask)+testX[ind])
    plt.show()

In the end, you will have some visualizations of how the network is "detecting" (if the evaluation metrics make sense) COVID-19 suspicious regions in the XRs.
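The sensitivity and specificity reported by a script like this depend on which row of the confusion matrix is treated as the positive class. The following is a minimal sketch, assuming the folders are named `healthy` and `sicks` as described above, so that alphabetical label ordering puts `healthy` at index 0 and `sicks` at index 1. The helper name `binary_metrics` and the counts are made up for illustration.

```python
def binary_metrics(cm, positive_index):
    """Return (accuracy, sensitivity, specificity) from a 2x2 confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j,
    which is the layout sklearn.metrics.confusion_matrix produces.
    """
    neg = 1 - positive_index
    tp = cm[positive_index][positive_index]  # sick, predicted sick
    fn = cm[positive_index][neg]             # sick, predicted healthy
    tn = cm[neg][neg]                        # healthy, predicted healthy
    fp = cm[neg][positive_index]             # healthy, predicted sick
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Made-up counts; rows and columns are ordered ["healthy", "sicks"]
cm = [[16, 2],
      [1, 17]]
acc, sens, spec = binary_metrics(cm, positive_index=1)
print(f"acc={acc:.4f} sensitivity={sens:.4f} specificity={spec:.4f}")
```

Passing `positive_index=0` instead swaps the two rates, which is what happens silently when the class names sort in the opposite order.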

Comment 1: In my experience, this Lime explanation method can be handy when classifying images and trying to understand what the network is actually "looking at" to make the decision.

Comment 2: I was wondering why the classification accuracy was so high here (and in the original PyImageSearch post). I think it is because the Kaggle dataset is so well standardized that the NN is learning to predict whether the X-ray comes from Kaggle or from this dataset instead of classifying healthy/sick. Nevertheless, I feel that the source code is still relevant, and with more XR data and better preprocessing, we will be able to fix this issue and improve the algorithm.
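One quick way to see the concern in Comment 2 is to cross-tabulate each sample's label against the collection it came from. This is only a sketch: the record layout (`source` and `label` keys) and the source names are assumptions for illustration, and the counts follow the "around 90 per class" figure mentioned above.

```python
from collections import Counter


def source_label_table(records):
    """Count samples per (source, label) pair."""
    return Counter((r["source"], r["label"]) for r in records)


def is_confounded(records):
    """True when each label comes from exactly one source and the sources
    differ, so a model could separate classes by recognising the source."""
    sources_per_label = {}
    for r in records:
        sources_per_label.setdefault(r["label"], set()).add(r["source"])
    if any(len(s) != 1 for s in sources_per_label.values()):
        return False
    single_sources = {next(iter(s)) for s in sources_per_label.values()}
    return len(single_sources) == len(sources_per_label)


# Hypothetical layout of the experiment described above
records = (
    [{"source": "covid-chestxray-dataset", "label": "sicks"}] * 90
    + [{"source": "kaggle", "label": "healthy"}] * 90
)
print(source_label_table(records))
print("label fully determined by source:", is_confounded(records))
```

If the check returns `True`, high accuracy says little about the disease itself; adding healthy and sick images from the same source is the usual way to break the tie.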

Thank you for what you're doing

Hi there, @ieee8023. I just wanted to take a second and thank you for what you're doing.

I run a popular computer vision/deep learning blog (PyImageSearch.com) and published a tutorial for CV/DL practitioners on how to use this dataset (your work and dataset curation are properly cited):

https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/

I'll be honest -- it's one of the least scientific things I've published (but that's not the point of the piece).

It's mainly just for people (including myself) to "feel" like they are helping and be able to take solace in it (and educate themselves via it). It's an opportunity for people to be inspired by it. And while it won't save any lives immediately, perhaps downstream, it will help people apply themselves and learn a new skill while they are quarantined or displaced from work, school, or the research lab.

I don't know if it's worthy of being included in your repo, but I wanted to pass it along just in case.

Thank you again, you're doing amazing work.

Open Source Helps!

Thanks for your work to help people in need! Your site has been added! I currently maintain the OpenSourceWuhan page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analysis, and medical and supply information. Please share it with anyone who might need the information in the list or who might contribute to some of those projects. You are also welcome to recommend more projects.

https://weileizeng.github.io/OpenSourceWuhan/world

Cheers!

Annotations

Hello,

First of all, thanks a lot for the effort you are putting into gathering all these X-ray and CT measurements!

I am wondering whether bounding boxes or masks for detecting problematic regions can be provided, or whether this is only available as an image classification dataset. If the latter, will there also be negative samples (X-ray images from COVID-negative patients)?

Bests

Provide detailed explanation of the columns in metadata.csv

Can you provide a detailed explanation of the columns in the metadata.csv file?
For example, what does the offset represent? Likewise for the rest of the columns.

Missing label (Labels 0=No or 1=Yes) for each sample of meta data

Hi, I really appreciate your work.
When I look at the metadata (https://github.com/ieee8023/covid-chestxray-dataset/blob/master/metadata.csv), I can't find a label for each sample. How can I find the label? Am I missing something?
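A binary 0/1 label can usually be derived from whichever column records the diagnosis. The sketch below assumes that column is called `finding` and that a `filename` column exists; both names, and the sample rows, are assumptions for illustration rather than a description of the actual file.

```python
import csv
import io

# Made-up rows standing in for metadata.csv; the column names are assumed
SAMPLE = """patientid,finding,filename
2,COVID-19,example-a.jpg
4,SARS,example-b.jpg
7,COVID-19,example-c.jpg
"""


def load_labels(csv_file, positive="COVID-19", column="finding"):
    """Map each filename to 1 if its diagnosis equals `positive`, else 0."""
    reader = csv.DictReader(csv_file)
    return {
        row["filename"]: int(row[column].strip() == positive)
        for row in reader
    }


labels = load_labels(io.StringIO(SAMPLE))
print(labels)
```

For the real file, replace `io.StringIO(SAMPLE)` with `open("metadata.csv", newline="")` and adjust the column names to match its header.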

Recommended datasets for transfer learning

Hi @ieee8023

thank you for maintaining this dataset!

I implemented a PyTorch Lightning wrapper for a DenseNet model for covid-chestxray-dataset.

It kicks off a PyTorch Lightning community project that aims to be a COVID-19 detector (for educational purposes).

Can you recommend datasets and strategies for using additional data?

I have scanned https://arxiv.org/pdf/2002.02497.pdf (I will return to it). It seems that resolving the labeling differences and other dataset preparation differences requires quite a lot of domain expertise. Any tips are appreciated.

Kind regards

Ondra

PS: I was inspired by #15
PPS: My fork was merged to the PyTorchLightning community project
PPPS: I believe that @Borda already contacted you to say that we may use Slack for longer discussions if needed. The link to the Slack can be found at PL

updates at alibaba about AI

https://asia.nikkei.com/Spotlight/Coronavirus/Alibaba-says-AI-can-identify-coronavirus-infections-with-96-accuracy

where are the computed tomography (CT) images?

I can't find the CT images.

https://radiopaedia.org/cases/covid-19-pneumonia-evolution-over-a-week-1?lang=us

Automate finding radiographs in academic papers

I have a project to automatically search for, download, and extract radiographs from papers on a given disease.

Right now everything works except filtering the radiographs from the other figures, which I am still working on. Even without this feature, though, the tool could help you to manually screen paper figures quickly.

Do you think this tool would be useful for this project?

Minimal jupyter notebook to train models

Hi,
I'm a deep learning researcher from Spain.
I have created a minimal jupyter notebook to train with the images from your repo.

https://drive.google.com/file/d/19T_qebLa1keUNpkp7FDNEBmsNRQUVtfJ/view?usp=sharing

I want to experiment with visual attention models in the coming days.
Rodrigo

https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30111-0/fulltext

5 South Korea papers still not included in metadata

Published online March 5, 2020. https://doi.org/10.3348/kjr.2020.0146
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-505.pdf
Published online March 20, 2020. https://doi.org/10.3348/kjr.2020.0195
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e45.pdf
Published online March 20, 2020. https://doi.org/10.3348/kjr.2020.0180
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e43.pdf
Published online March 13, 2020. https://doi.org/10.3348/kjr.2020.0181
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e42.pdf
Published online March 13, 2020. https://doi.org/10.3348/kjr.2020.0157
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e39.pdf
Published online February 11, 2020. https://doi.org/10.3348/kjr.2020.0078
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-365.pdf

Found using advanced search of the Korean Journal of Radiology kjronline[.]org searching the term "covid-19" filtering for years 2018 - 2020

Classification example

Hi,

I have prepared an example of X-ray classification based on this repo:
https://colab.research.google.com/drive/1KlKvHDgvi-cfrpJUIOCvczmL4Ctc1wBL

Please check it out. Are you interested in such initiatives?

Missing License

Thank you so much for curating this dataset! Hopefully we can all work together to scale up detection of coronaviruses via radiography.

Could you please add a LICENSE file for the images and annotations in this repository? We'd like to be able to use and remix them but need clarification of the terms under which they're allowed to be used and shared.

CT segmentation data

Hi, you might be interested in this 100 CT slice dataset of 60 patients that we have segmented:

http://medicalsegmentation.com/covid19/

Feel free to add to your list if you find it relevant.

https://www.sciencedirect.com/science/article/pii/S1684118220300608

https://onlinelibrary.wiley.com/doi/full/10.1111/all.14238

Minor patientID issue - PatientID 62 is used twice

We thought we would report a minor issue we found: in the metadata, patientID 62 appears twice, but the gender, age, and source of the data differ.
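Reused IDs of this kind can be found mechanically by grouping rows on the patient ID and flagging any ID whose demographic fields disagree. A minimal sketch, assuming columns named `patientid`, `sex`, and `age`; the names and the sample values are hypothetical.

```python
from collections import defaultdict


def conflicting_patients(rows, key="patientid", fields=("sex", "age")):
    """Return IDs that appear with more than one distinct combination of
    the given fields, together with the conflicting combinations."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(tuple(row[f] for f in fields))
    return {pid: sorted(vals) for pid, vals in seen.items() if len(vals) > 1}


# Made-up rows: ID 61 repeats consistently, ID 62 does not
rows = [
    {"patientid": "61", "sex": "F", "age": "50"},
    {"patientid": "62", "sex": "M", "age": "65"},
    {"patientid": "62", "sex": "F", "age": "42"},
    {"patientid": "61", "sex": "F", "age": "50"},
]
conflicts = conflicting_patients(rows)
print(conflicts)
```

A patient with several images legitimately repeats the same ID, so only IDs with disagreeing fields are reported.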

This is amazing work! Keep it up!

Sharing my data

Hi, I am doing some research on this topic, applying deep learning with CNNs to create an automated computer-vision-based scanner that detects COVID-positive and COVID-negative scans.

Here you can find my dataset. I am currently building a CT scan dataset to try to train a model for CT scans in addition to X-ray (RX) scans. https://github.com/AleGiovanardi/covidhelper/tree/master/dataset/covidct

I also have a source of new RX and CT scans directly from an Italian hospital, so I will update it periodically. You are welcome to take any data from my repo that is missing here.

You can also find code that trains a model, saves it, and lets you use it to test detection on scans; it is based on Adrian Rosebrock's tutorial on PyImageSearch. I am constantly working to improve its performance and accuracy.

Also, thanks for your great work; it inspired me a lot!

https://www.kjronline.org/DOIx.php?id=10.3348/kjr.2020.0112

https://academic.oup.com/cid/advance-article/doi/10.1093/cid/ciaa199/5766408

https://www.nature.com/articles/s41591-020-0819-2?sf231541958=1

No source image?

Any dicom file?

Italy COVID-19 chest xray's

https://www.sirm.org/category/senza-categoria/covid-19/

There are a few images being posted here; they may be worth investigating.

ids in metadata.csv do not correspond to filenames ?

How does one link the metadata file to the image files? Also, there seem to be fewer metadata entries than images.
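A mismatch like this can be listed by comparing the set of filenames named in the metadata against the files actually present in the image folder. The sketch below assumes the metadata carries a `filename` value per row that is meant to match a file name on disk; the file names used here are placeholders created in a temporary directory.

```python
import os
import tempfile


def compare_files(metadata_filenames, image_dir):
    """Return (files on disk without metadata, metadata rows without a file)."""
    listed = set(metadata_filenames)
    on_disk = {
        name for name in os.listdir(image_dir)
        if os.path.isfile(os.path.join(image_dir, name))
    }
    return sorted(on_disk - listed), sorted(listed - on_disk)


# Self-contained demonstration with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        open(os.path.join(tmp, name), "w").close()
    missing_meta, missing_image = compare_files(["a.jpg", "b.jpg", "d.jpg"], tmp)

print("images without metadata:", missing_meta)
print("metadata without image:", missing_image)
```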

Do the red arrows on some images create a danger for data leakage?

It just occurred to me that arrows only occur on images with a positive diagnosis, so this could cause data leakage.

That might not be as much of a problem if you are using these images for differential diagnosis and already know the patient has something, but it could be an issue if this dataset is being combined with healthy images to decide whether the patient is healthy or sick.
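Images carrying coloured annotations can be flagged with a crude heuristic, since a radiograph is otherwise close to grayscale. This is a rough sketch and not a validated detector: the thresholds are arbitrary assumptions, and pixels are taken as plain `(r, g, b)` tuples in the 0-255 range so that the example has no dependencies.

```python
def red_fraction(pixels, min_red=200, max_other=80):
    """Fraction of pixels that are strongly red (high R, low G and B)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    red = sum(
        1 for r, g, b in pixels
        if r >= min_red and g <= max_other and b <= max_other
    )
    return red / len(pixels)


def has_colour_annotation(pixels, threshold=0.001):
    """Flag an image when at least `threshold` of its pixels are red."""
    return red_fraction(pixels) >= threshold


# Synthetic data: a gray image, and the same image with 10 red pixels
gray = [(120, 120, 120)] * 1000
annotated = gray[:990] + [(255, 0, 0)] * 10
print(has_colour_annotation(gray), has_colour_annotation(annotated))
```

Flagged images could then be excluded, cropped, or reviewed by hand before being mixed with a healthy control set.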

This data only disease group???

Hmm... I can't see any control group data.

wrong image file name

This file below does not exist in metadata.csv

1-s2.0-S0929664620300449-gr3_lrg-e.jpg

Why use JPEG instead of DICOM?

Digital X-ray images are normally stored in DICOM format,
which has a greater bit depth than JPEG.
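The bit-depth point can be made concrete with a little arithmetic. Baseline JPEG stores 8 bits per sample, while the 12-bit source used below is only an assumed example, since the actual depth depends on the detector and how the image was exported.

```python
def gray_levels(bits):
    """Number of distinct intensities representable with `bits` bits."""
    return 2 ** bits


def quantize(value, src_bits, dst_bits):
    """Reduce an intensity from src_bits to dst_bits by dropping low bits."""
    return value >> (src_bits - dst_bits)


for bits in (8, 12, 16):
    print(f"{bits}-bit: {gray_levels(bits)} gray levels")

# Two distinct 12-bit intensities that become identical at 8 bits
a, b = 2048, 2055
print(quantize(a, 12, 8), quantize(b, 12, 8))
```

Every block of 16 neighbouring 12-bit values collapses into one 8-bit value, so subtle density differences can be lost before any model sees the image.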

https://radiopaedia.org/cases/covid-19-rapidly-progressive-acute-respiratory-distress-syndrome-ards?lang=us

https://radiopaedia.org/play/25975/entry/462501/case/75189/presentation?lang=us

Images on slides 2 and 3

What is the offset ?

Can anybody explain the purpose of the offset values? I am having trouble understanding what they are.

https://link.springer.com/article/10.1007%2Fs12630-020-01625-4

Are X-ray images really useful for diagnosing COVID-19?

"March 4, 2020 -- X-ray may not be the best imaging tool for detecting novel coronavirus disease (COVID-19). Almost three-quarters of a small cohort of South Korean patients with COVID-19 pneumonia had normal chest x-rays, missing pulmonary nodules that chest CT identified, according to a February 26 study in the Korean Journal of Radiology."

https://www.auntminnie.com/index.aspx?sec=sup&sub=xra&pag=dis&ItemID=128347

https://radiopaedia.org/cases/covid-19-pneumonia-rapidly-progressive?lang=us

https://www.kjronline.org/DOIx.php?id=10.3348/kjr.2020.0132

https://assets.radiopaedia.org/cases/covid-19-infection-exclusive-gastrointestinal-symptoms?lang=us

https://www.nejm.org/doi/full/10.1056/NEJMc2001573

https://radiopaedia.org/cases/covid-19-pneumonia-28?lang=us

https://radiopaedia.org/cases/covid-19-pneumonia-24?lang=us

https://journals.lww.com/investigativeradiology/Abstract/publishahead/Chest_CT_Findings_in_Patients_with_Corona_Virus.98835.aspx

The pictures are inside the PDF, but can be extracted. I can work on this

Double-check metadata for accuracy/completeness

I will be starting to revisit the papers and double-check the metadata for accuracy and completeness. I will keep track of my progress using this issue.

More clinical notes about other patients with SARS, ARDS, or pneumonia

Does anyone know where we could find more clinical notes about SARS, ARDS, or pneumonia patients? I want to do some research on COVID-19 classification based on clinical notes.
In the current metadata, there are 92 COVID-19 instances compared to 14 non-COVID-19 instances.
I would really appreciate it if anyone could provide more useful info.
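With 92 instances against 14, a classifier can score well by always predicting the majority class. One common mitigation, shown here purely as an illustration, is to weight each class inversely to its frequency using `total / (n_classes * class_count)`, the same formula scikit-learn applies for `class_weight="balanced"`. The class names are placeholders.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts quoted in the post above
counts = {"COVID-19": 92, "non-COVID-19": 14}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Here the minority class ends up weighted roughly 6.6 times more than the majority class (92 / 14), so each group contributes equally to the loss overall.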

Offset field

What is the meaning of the offset field?

4 images missing metadata

Thank you for your work.
Is there any labeling that says whether an image is from a patient who survived or did not survive?

patients without Y or N checks?

Hello, the survival column of the metadata.csv file has patients without a Y or N value. How should we use these samples for training and validation?
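One simple option, sketched here as an assumption rather than a recommendation from the maintainers, is to train and validate a survival model only on rows that carry an explicit Y or N, and to set the rest aside. The `survival` column name comes from the question above; the other field and the sample rows are made up.

```python
def split_by_survival(rows, column="survival"):
    """Separate rows with an explicit Y/N value from rows without one."""
    labelled, unlabelled = [], []
    for row in rows:
        value = (row.get(column) or "").strip().upper()
        if value in ("Y", "N"):
            labelled.append({**row, "survived": value == "Y"})
        else:
            unlabelled.append(row)
    return labelled, unlabelled


# Made-up rows covering the three cases: Y, N, and blank
rows = [
    {"filename": "a.jpg", "survival": "Y"},
    {"filename": "b.jpg", "survival": ""},
    {"filename": "c.jpg", "survival": "n"},
]
labelled, unlabelled = split_by_survival(rows)
print(len(labelled), "labelled;", len(unlabelled), "without a survival value")
```

The unlabelled rows remain usable for tasks that do not need the outcome, such as the COVID-19 versus other-finding classification.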

thank you

https://app.figure1.com/images/5e7c1b8d98c29ab001275405/

Separate imaging modality and view

It would be easier to select just X-rays or CT scans if there were one column for the modality and a separate column for the view. I have made these changes on my fork and can make a PR if you think this is useful. Right now I am just checking that all of my changes are correct.
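The proposed split can be sketched as a lookup from the combined value to a (modality, view) pair. Everything specific here is hypothetical: the single combined column is assumed to be called `view`, and the values `PA`, `AP`, `L`, and `CT` are illustrative guesses, not the fork's actual mapping.

```python
# Hypothetical mapping from a combined value to (modality, view)
COMBINED_TO_PAIR = {
    "PA": ("X-ray", "PA"),
    "AP": ("X-ray", "AP"),
    "L": ("X-ray", "L"),
    "CT": ("CT", ""),
}


def split_modality_view(row, column="view"):
    """Return a copy of the row with separate `modality` and `view` fields."""
    combined = row[column].strip()
    modality, view = COMBINED_TO_PAIR.get(combined, ("unknown", combined))
    return {**row, "modality": modality, "view": view}


rows = [
    {"filename": "a.jpg", "view": "PA"},
    {"filename": "b.jpg", "view": "CT"},
]
split_rows = [split_modality_view(r) for r in rows]
xrays = [r["filename"] for r in split_rows if r["modality"] == "X-ray"]
print(xrays)
```

Unrecognised values are kept as the view and marked `unknown`, so they can be reviewed by hand instead of being silently dropped.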