ieee8023 / covid-chestxray-dataset
We are building an open database of COVID-19 cases with chest X-ray or CT images.
In this example, we use ONLY the X-ray (XR) samples in the dataset labeled as COVID-19. We chose XRs over CTs because there are more of them, although I agree that CTs are better for detection, as mentioned in #5.
The neural network source code is based on a post by Adrian Rosebrock on PyImageSearch.
The dataset was divided into two labels: sicks and healthy. The healthy training samples were extracted from this Kaggle contest.
For training, we split the images into two folders, /dataset/sicks and /dataset/healthy, located in the root folder, with each class having the same number of images (around 90).
It's a preliminary approach that may improve substantially once the dataset grows enough.
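Before training, it can help to confirm that the two class folders really hold a similar number of images. The sketch below is only an illustration: the helper names, the extension list, and the 10% tolerance are assumptions, not part of the original post.

```python
from pathlib import Path

# Extensions treated as images; an assumption, adjust to match your files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return {class_folder_name: number_of_image_files} for each subfolder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


def is_balanced(counts, tolerance=0.1):
    """True if the smallest class has at least (1 - tolerance) of the largest class's images."""
    if not counts or max(counts.values()) == 0:
        return False
    return min(counts.values()) >= (1 - tolerance) * max(counts.values())
```

Calling `count_images_per_class("dataset")` on the layout described above would return one entry for `healthy` and one for `sicks`, which can then be passed to `is_balanced`.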
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import lime
from lime import lime_image
from skimage.segmentation import mark_boundaries
plt.rcParams["figure.figsize"] = (20,10)
## global params
INIT_LR = 1e-4 # learning rate
EPOCHS = 21 # training epochs
BS = 8 # batch size
## load and prepare data
imagePaths = list(paths.list_images("dataset"))
data = []
labels = []
# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the parent folder name
    label = imagePath.split(os.path.sep)[-2]
    # load the image, swap color channels, and resize it to be a fixed
    # 224x224 pixels while ignoring aspect ratio
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)
# convert the data and labels to NumPy arrays while scaling the pixel
# intensities to the range [0, 1]
data = np.array(data) / 255.0
labels = np.array(labels)
TEST_SET_SIZE = 0.2
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels); print(labels)
# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data, labels,
    test_size=TEST_SET_SIZE, stratify=labels, random_state=42)
# initialize the training data augmentation object
trainAug = ImageDataGenerator(
    rotation_range=15,
    fill_mode="nearest")
## build network
baseModel = VGG16(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# construct the head of the model that will be placed on top of the
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(4, 4))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(64, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the first training process
for layer in baseModel.layers:
    layer.trainable = False
print("[INFO] compiling model...")
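# `learning_rate` replaces the deprecated `lr` alias; note that newer Keras
# releases may reject the `decay` argument used here
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])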
## train
print("[INFO] training head...")
# Model.fit accepts generators directly; fit_generator is deprecated
H = model.fit(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(testX, testY),
    validation_steps=len(testX) // BS,
    epochs=EPOCHS)
print("[INFO] saving COVID-19 detector model...")
model.save("covid19.model", save_format="h5")
## eval
print("[INFO] evaluating network...")
predIdxs = model.predict(testX, batch_size=BS)
predIdxs = np.argmax(predIdxs, axis=1) # argmax for the predicted probability
print(classification_report(testY.argmax(axis=1), predIdxs,
    target_names=lb.classes_))
cm = confusion_matrix(testY.argmax(axis=1), predIdxs)
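total = cm.sum()
acc = (cm[0, 0] + cm[1, 1]) / total
# LabelBinarizer sorts class names, so index 0 is "healthy" and index 1 is "sicks";
# sensitivity is the recall on the sick class, specificity the recall on the healthy class
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])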
# show the confusion matrix, accuracy, sensitivity, and specificity
print(cm)
print("acc: {:.4f}".format(acc))
print("sensitivity: {:.4f}".format(sensitivity))
print("specificity: {:.4f}".format(specificity))
## explain
N = EPOCHS
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, N), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, N), H.history["val_accuracy"], label="val_acc")
plt.title("Precision of COVID-19 detection.")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig("training_plot.png")
for ind in range(10):
    explainer = lime_image.LimeImageExplainer()
    # explain the same image whose label and prediction are printed below
    # (the original indexed testX[-ind], which picks a different image for ind > 0)
    explanation = explainer.explain_instance(testX[ind], model.predict,
        hide_color=0, num_samples=42)
    print("> label:", testY[ind].argmax(), "- predicted:", predIdxs[ind])
    temp, mask = explanation.get_image_and_mask(
        explanation.top_labels[0], positive_only=False, num_features=1, hide_rest=True)
    plt.imshow(mark_boundaries(temp / 2 + 0.5, mask)+testX[ind])
    plt.show()
In the end, you will have some visualizations of how the network is "detecting" (if the evaluation metrics make sense) regions suspicious for COVID-19 in the XRs.
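One way to check that the evaluation metrics make sense is to compute sensitivity and specificity with the positive class named explicitly, rather than relying on index order. scikit-learn's LabelBinarizer sorts class names, so with the folders healthy and sicks, index 0 is healthy. The helper below is a minimal sketch; the function name and the toy labels are illustrative assumptions, not results from the dataset.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Compute (sensitivity, specificity), treating `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # sick, predicted sick
            else:
                fn += 1  # sick, predicted healthy
        else:
            if pred == positive:
                fp += 1  # healthy, predicted sick
            else:
                tn += 1  # healthy, predicted healthy
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only.
y_true = ["sicks", "sicks", "sicks", "healthy", "healthy"]
y_pred = ["sicks", "sicks", "healthy", "healthy", "sicks"]
sens, spec = sensitivity_specificity(y_true, y_pred, positive="sicks")
print("sensitivity: {:.4f}".format(sens))
print("specificity: {:.4f}".format(spec))
```

Passing the class name instead of a matrix index keeps the two metrics from being silently swapped when the folder names change.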
Comment 1: In my experience, this Lime explanation method can be handy when classifying images and trying to understand what the network is actually "looking at" to make the decision.
Comment 2: I was wondering why the classification accuracy was so high here (and in the original PyImageSearch post). I think it is because the Kaggle dataset is so well standardized that the NN is learning to predict whether the X-ray comes from Kaggle or from this dataset instead of classifying healthy/sick. Nevertheless, I feel that the source code is still relevant, and with more XR data and better preprocessing, we will be able to fix this issue and improve the algorithm.
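The concern in Comment 2 can be made concrete by tabulating labels against image sources: if every label comes from exactly one source, a classifier can score well by recognizing the source alone. The sketch below assumes you keep a per-image source tag; the function names and the tags `this_repo` and `kaggle` are hypothetical.

```python
from collections import defaultdict


def label_source_table(labels, sources):
    """Count images for every (label, source) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for label, source in zip(labels, sources):
        table[label][source] += 1
    return {label: dict(by_source) for label, by_source in table.items()}


def is_fully_confounded(table):
    """True if each label comes from exactly one source and no source is shared between labels."""
    sources_per_label = [set(by_source) for by_source in table.values()]
    if any(len(s) != 1 for s in sources_per_label):
        return False
    all_sources = [next(iter(s)) for s in sources_per_label]
    return len(set(all_sources)) == len(all_sources)


# Illustrative setup mirroring the post: sick images from one source, healthy from another.
labels = ["sicks"] * 3 + ["healthy"] * 3
sources = ["this_repo"] * 3 + ["kaggle"] * 3
table = label_source_table(labels, sources)
print(table)
print("fully confounded:", is_fully_confounded(table))
```

When the check reports full confounding, high test accuracy says little about whether the network has learned anything about the disease.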
Hi there, @ieee8023. I just wanted to take a second and thank you for what you're doing.
I run a popular computer vision/deep learning blog (PyImageSearch.com) and published a tutorial for CV/DL practitioners on how to use this dataset (your work and dataset curation are properly cited):
I'll be honest -- it's one of the least scientific things I've published (but that's not the point of the piece).
It's mainly just for people (including myself) to "feel" like they are helping and be able to take solace in it (and educate themselves via it). It's an opportunity for people to be inspired by it. And while it won't save any lives immediately, perhaps downstream it will help people apply themselves and learn a new skill while they are quarantined or displaced from work, school, or the research lab.
I don't know if it's worthy of being included in your repo, but I wanted to pass it along just in case.
Thank you again, you're doing amazing work.
Open Source Helps!
Thanks for your work to help people in need! Your site has been added! I currently maintain the OpenSourceWuhan page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analysis, medical and supply information, etc. Please share it with anyone who might need the information in the list or who might contribute to some of those projects. You are also welcome to recommend more projects.
https://weileizeng.github.io/OpenSourceWuhan/world
Cheers!
Hello,
First of all, thanks a lot for the effort you are putting into gathering all these X-ray and CT images!
I am wondering whether bounding boxes/masks for detecting the problematic regions can be provided, or is this only available as an image classification dataset? If the latter, will there also be negative samples (X-ray images from COVID-negative patients)?
Best
Can you provide a detailed explanation of the columns in the metadata.csv file?
Example: what does offset represent? Similarly for the rest of the columns.
Hi, I really appreciate your work.
When I look at the metadata (https://github.com/ieee8023/covid-chestxray-dataset/blob/master/metadata.csv), I can't find the label for each sample. How can I find the label? Am I missing something?
Hi @ieee8023
thank you for maintaining this dataset!
I implemented a PyTorch Lightning wrapper for a DenseNet model for covid-chestxray-dataset.
It is the kickoff of a PyTorch Lightning community project that aims to be a COVID-19 detector (for educational purposes).
Can you recommend datasets and strategies for using additional data?
I have skimmed https://arxiv.org/pdf/2002.02497.pdf (I will return to it). It seems that quite a lot of domain expertise is needed to resolve the labeling differences and other dataset preparation differences. Any tips appreciated.
Kind regards
Ondra
PS: I was inspired by #15
PPS: My fork was merged to the PyTorchLightning community project
PPPS: I believe that @Borda already contacted you that we may use slack for longer discussions if needed. Link to the slack can be found at PL
I can't find the CT images.
I have a project to automatically search for, download, and extract radiographs from papers on a given disease.
Right now everything works except filtering the radiographs from the other figures, which I am still working on. Even without this feature, though, the tool could help you to manually screen paper figures quickly.
Do you think this tool would be useful for this project?
Hi,
I'm a deep learning researcher from Spain.
I have created a minimal jupyter notebook to train with the images from your repo.
https://drive.google.com/file/d/19T_qebLa1keUNpkp7FDNEBmsNRQUVtfJ/view?usp=sharing
I want to experiment with visual attention models in the coming days.
Rodrigo
Published online March 5, 2020. https://doi.org/10.3348/kjr.2020.0146
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-505.pdf
Published online March 20, 2020. https://doi.org/10.3348/kjr.2020.0195
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e45.pdf
Published online March 20, 2020. https://doi.org/10.3348/kjr.2020.0180
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e43.pdf
Published online March 13, 2020. https://doi.org/10.3348/kjr.2020.0181
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e42.pdf
Published online March 13, 2020. https://doi.org/10.3348/kjr.2020.0157
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-e39.pdf
Published online February 11, 2020. https://doi.org/10.3348/kjr.2020.0078
https://kjronline.org/Synapse/Data/PDFData/0068KJR/kjr-21-365.pdf
Found using advanced search of the Korean Journal of Radiology kjronline[.]org searching the term "covid-19" filtering for years 2018 - 2020
Hi,
I have prepared an example of X-ray classification based on this repo:
https://colab.research.google.com/drive/1KlKvHDgvi-cfrpJUIOCvczmL4Ctc1wBL
Please check it out. Are you interested in such initiatives?
Thank you so much for curating this dataset! Hopefully we can all work together to scale up detection of coronaviruses via radiography.
Could you please add a LICENSE file for the images and annotations in this repository? We'd like to be able to use and remix them but need clarification of the terms under which they're allowed to be used and shared.
Hi, you might be interested in this 100 CT slice dataset of 60 patients that we have segmented:
http://medicalsegmentation.com/covid19/
Feel free to add to your list if you find it relevant.
Hi, I am doing some research on this topic, applying CNNs to create an automated computer-vision-based scanner that detects COVID-positive and COVID-negative scans.
Here you can find my dataset. I am currently building a CT scan dataset to train a model on CT scans in addition to RX scans: https://github.com/AleGiovanardi/covidhelper/tree/master/dataset/covidct
I also have a source of new RX and CT images directly from an Italian hospital, so I will update it periodically. You are welcome to take any of the data in my repo that is missing from here.
You can also find code that trains a model, saves it, and lets you use it to test detection on scans; it is based on Adrian Rosebrock's tutorial on PyImageSearch. I am constantly working to improve its performance and accuracy.
Also, thanks for your great work; this inspired me a lot!
Any dicom file?
https://www.sirm.org/category/senza-categoria/covid-19/
There are a few images being posted here, may be worth investigating
How does one link the metadata file to the image files? Also, there seem to be fewer metadata entries than images.
It just occurred to me that arrows only occur on images with a positive diagnosis, so this could cause data leakage.
That might not be as much of a problem if you are using these images for differential diagnosis and already know the patient has something, but it could be an issue if this dataset is being combined with healthy images to decide whether the patient is healthy or sick.
hmm... i can't see control group data
This file below does not exist in metadata.csv
1-s2.0-S0929664620300449-gr3_lrg-e.jpg
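Mismatches like this can be found by cross-checking the file names listed in the metadata against the image files actually present. This is a minimal sketch that assumes the metadata has a column holding the image file name; the column names `filename` and `finding`, the function name, and the toy file names are illustrative assumptions.

```python
import csv
import io


def cross_check(metadata_csv_text, image_filenames, filename_column="filename"):
    """Return (images missing from the metadata, metadata entries with no image)."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    listed = {row[filename_column] for row in reader if row.get(filename_column)}
    on_disk = set(image_filenames)
    return sorted(on_disk - listed), sorted(listed - on_disk)


# Toy data for illustration; real code would read metadata.csv and list the image folder.
metadata_text = "filename,finding\nscan_a.jpg,COVID-19\nscan_b.jpg,COVID-19\n"
images = ["scan_a.jpg", "scan_c.jpg"]
missing_from_metadata, missing_from_disk = cross_check(metadata_text, images)
print("images without metadata:", missing_from_metadata)
print("metadata without images:", missing_from_disk)
```

The same lookup also answers how to link the two: the file name in each metadata row is the key that identifies its image.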
Digital X-ray images are normally stored in DICOM format, which has a greater bit depth than JPEG.
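To illustrate what the extra bit depth means in practice, the sketch below compares the number of gray levels at different depths and shows two distinct 12-bit intensities collapsing to the same 8-bit value. The 12- and 16-bit depths are assumptions for illustration (the actual depth depends on the device), and dropping the low bits is just one simple way to reduce depth.

```python
def gray_levels(bits):
    """Number of distinct intensity values representable with the given bit depth."""
    return 2 ** bits


def quantize_to_8bit(value, source_bits):
    """Map an integer pixel value from source_bits depth down to 8 bits by dropping the low bits."""
    return value >> (source_bits - 8)


for bits in (8, 12, 16):
    print(bits, "bits:", gray_levels(bits), "gray levels")

# Two different 12-bit intensities become indistinguishable after conversion to 8 bits.
a, b = 2048, 2055
print(quantize_to_8bit(a, 12), quantize_to_8bit(b, 12))
```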
Images on slides 2 and 3
Can anybody explain the purpose of the offset values? I am having trouble understanding what they are.
"March 4, 2020 -- X-ray may not be the best imaging tool for detecting novel coronavirus disease (COVID-19). Almost three-quarters of a small cohort of South Korean patients with COVID-19 pneumonia had normal chest x-rays, missing pulmonary nodules that chest CT identified, according to a February 26 study in the Korean Journal of Radiology."
https://www.auntminnie.com/index.aspx?sec=sup&sub=xra&pag=dis&ItemID=128347
The pictures are inside the PDF, but can be extracted. I can work on this
I will be starting to revisit the papers and double-check the metadata for accuracy and completeness. I will keep track of my progress using this issue.
Does anyone know where we could find more clinical notes about SARS, ARDS, or pneumonia patients? I want to do some research on COVID-19 classification based on clinical notes.
In the current metadata, there are 92 COVID-19 instances compared to 14 non-COVID-19 instances.
I'd really appreciate it if anyone could provide more useful info.
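With 92 versus 14 instances, a classifier trained on these counts would see a strong class imbalance. One common mitigation is to weight each class inversely to its frequency, using the same formula as scikit-learn's "balanced" class weights: n_samples / (n_classes * class_count). The sketch below applies it to the counts quoted above; the function name and label names are illustrative assumptions.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts quoted in the post above.
counts = {"covid": 92, "non_covid": 14}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print("{}: {:.4f}".format(label, weight))
```

With these weights, each class contributes the same total weight to the loss (53 for each class here), so the rare class is not drowned out.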
What is the meaning of the offset field?
Thank you for your work.
Is there any labeling that says whether an image is from a patient who survived or did not survive?
Hello, in the survival column of the metadata.csv file there are patients without a Y or N value. How should we handle these samples for training and validation?
thank you
It would be easier to select just X-rays or CT scans if there was a column for the modality and a different column for the view. I have made these changes on my fork and can make a PR if you think this is useful. Right now I am just checking that all of my changes are correct.