
ieee8023 / covid-chestxray-dataset


We are building an open database of COVID-19 cases with chest X-ray or CT images.

Python 16.32% Jupyter Notebook 82.77% JavaScript 0.91%
computed-tomography computer-vision covid-19 dataset deep-learning xray

covid-chestxray-dataset's People

Contributors

andreabac3, beatrizgarcias, bganglia, generalblockchain, ieee8023, juanmed, kant, lan-dao, ncovgt2020, nullcodex, vishalshar

covid-chestxray-dataset's Issues

COVID-19 classification DCNN training code with "explainability" functionality

In this example, we use only the X-ray (XR) samples in the dataset labeled as COVID-19. We chose XRs over CTs because there are more of them, though I agree that CTs are better for detection, as mentioned in #5.

The neural network source code is based on a post by Adrian Rosebrock on PyImageSearch.

The dataset was divided into two labels: sicks and healthy. The healthy training samples were extracted from this Kaggle contest.

For training, we split the images into two folders, /dataset/sicks and /dataset/healthy, located in the root folder. Each class has the same number of images (around 90).
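Before training, it can help to confirm that the two class folders really hold the same number of images. The following is only a sketch: the set of image extensions and the helper names `count_images_per_class` and `is_balanced` are assumptions for illustration, not part of the original code.

```python
import os
from collections import Counter

# Assumption: the class folders contain only these image formats.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_images_per_class(root):
    """Return {class_folder_name: number_of_image_files} for each subfolder of root."""
    counts = Counter()
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        counts[class_name] = sum(
            1
            for name in os.listdir(class_dir)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    return dict(counts)


def is_balanced(counts, tolerance=0):
    """True if the largest and smallest class differ by at most `tolerance` images."""
    if not counts:
        return False
    return max(counts.values()) - min(counts.values()) <= tolerance
```

Calling `count_images_per_class("dataset")` with the layout described above should return one entry for `sicks` and one for `healthy`.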

It's a preliminary approach that may improve substantially once the dataset grows enough.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import lime
from lime import lime_image
from skimage.segmentation import mark_boundaries

plt.rcParams["figure.figsize"] = (20,10)

## global params
INIT_LR = 1e-4  # learning rate
EPOCHS = 21  # training epochs
BS = 8  # batch size


## load and prepare data
imagePaths = list(paths.list_images("dataset"))
data = []
labels = []
# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the filename
    label = imagePath.split(os.path.sep)[-2]
    # load the image, swap color channels, and resize it to be a fixed
    # 224x224 pixels while ignoring aspect ratio
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)
# convert the data and labels to NumPy arrays while scaling the pixel
# intensities to the range [0, 1]
data = np.array(data) / 255.0
labels = np.array(labels)

TEST_SET_SIZE = 0.2

lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels); print(labels)
# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data, labels,
    test_size=TEST_SET_SIZE, stratify=labels, random_state=42)
# initialize the training data augmentation object
trainAug = ImageDataGenerator(
    rotation_range=15,
    fill_mode="nearest")

## build network
baseModel = VGG16(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# construct the head of the model that will be placed on top of the
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(4, 4))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(64, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the first training process
for layer in baseModel.layers:
    layer.trainable = False

print("[INFO] compiling model...")
# "lr" is a deprecated alias; "learning_rate" is the current argument name
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])

## train
print("[INFO] training head...")
# model.fit accepts generators directly; fit_generator is deprecated in TensorFlow 2.1 and later
H = model.fit(
    trainAug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(testX, testY),
    epochs=EPOCHS)

print("[INFO] saving COVID-19 detector model...")
model.save("covid19.model", save_format="h5")

## eval
print("[INFO] evaluating network...")
predIdxs = model.predict(testX, batch_size=BS)
predIdxs = np.argmax(predIdxs, axis=1) # argmax for the predicted probability
print(classification_report(testY.argmax(axis=1), predIdxs,
    target_names=lb.classes_))

cm = confusion_matrix(testY.argmax(axis=1), predIdxs)
total = sum(sum(cm))
acc = (cm[0, 0] + cm[1, 1]) / total
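# LabelBinarizer sorts class names, so index 0 is "healthy" and index 1 is "sicks";
# sensitivity is therefore the recall of the sick class (row 1)
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])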
# show the confusion matrix, accuracy, sensitivity, and specificity
print(cm)
print("acc: {:.4f}".format(acc))
print("sensitivity: {:.4f}".format(sensitivity))
print("specificity: {:.4f}".format(specificity))


## explain
N = EPOCHS
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, N), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, N), H.history["val_accuracy"], label="val_acc")
plt.title("Precision of COVID-19 detection.")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig("training_plot.png")

for ind in range(10): 
    explainer = lime_image.LimeImageExplainer()
    # explain the same sample that is printed and plotted below (testX[ind], not testX[-ind])
    explanation = explainer.explain_instance(testX[ind], model.predict,
                                             hide_color=0, num_samples=42)
    print("> label:", testY[ind].argmax(), "- predicted:", predIdxs[ind])
    
    temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=False, num_features=1, hide_rest=True)
    plt.imshow(mark_boundaries(temp / 2 + 0.5, mask)+testX[ind])
    plt.show()

In the end, you will have some visualizations of how the network is "detecting" COVID-19-suspicious regions in the XRs (if the evaluation metrics make sense).

sample_detection

Comment 1: In my experience, this Lime explanation method can be handy when classifying images and trying to understand what the network is actually "looking at" to make the decision.

Comment 2: I was wondering why the classification accuracy was so high here (and in the original PyImageSearch post). I think it is because the Kaggle dataset is so well standardized that the NN is learning to predict whether the X-ray comes from Kaggle or from this dataset instead of classifying healthy/sick. Nevertheless, I feel that the source code is still relevant, and with more XR data and better preprocessing, we will be able to fix this issue and improve the algorithm.
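One rough way to probe the suspicion in Comment 2 is to relabel each image by its source (Kaggle or this dataset) and check whether a single global statistic, such as mean pixel intensity, already separates the two sources. The sketch below is an assumption-laden illustration, not part of the original code: the function name `best_threshold_accuracy` and the choice of statistic are hypothetical, and a high score only hints at a source confound rather than proving one.

```python
def best_threshold_accuracy(values, sources):
    """Best accuracy of a one-feature threshold rule predicting a binary source label.

    values:  one number per image (for example its mean pixel intensity)
    sources: one label per image naming where the image came from
    """
    labels = sorted(set(sources))
    if len(labels) != 2:
        raise ValueError("expected exactly two distinct sources")
    pairs = sorted(zip(values, sources))
    n = len(pairs)
    best = 0.0
    for split in range(n + 1):
        # A threshold cannot fall between two identical values.
        if 0 < split < n and pairs[split - 1][0] == pairs[split][0]:
            continue
        left_first = sum(1 for _, s in pairs[:split] if s == labels[0])
        right_second = sum(1 for _, s in pairs[split:] if s == labels[1])
        correct = left_first + right_second
        # Also consider the rule with the two sides swapped.
        best = max(best, correct / n, (n - correct) / n)
    return best
```

With the `data` array from the training script, the values could be computed as `[img.mean() for img in data]`. An accuracy close to 1.0 from such a trivial rule would suggest the network can tell the sources apart without looking at pathology at all.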

Thank you for what you're doing

Hi there, @ieee8023. I just wanted to take a second and thank you for what you're doing.

I run a popular computer vision/deep learning blog (PyImageSearch.com) and published a tutorial for CV/DL practitioners on how to use this dataset (your work and dataset curation are properly cited):

https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/

I'll be honest -- it's one of the least scientific things I've published (but that's not the point of the piece).

It's mainly just for people (including myself) to "feel" like they are helping and be able to take solace in it (and educate themselves via it). It's an opportunity for people to be inspired by it. And while it won't save any lives immediately, perhaps downstream, it will help people apply themselves and learn a new skill while they are quarantined or displaced from work, school, or the research lab.

I don't know if it's worthy of being included in your repo, but I wanted to pass it along just in case.

Thank you again, you're doing amazing work.

Open Source Helps!

Thanks for your work to help people in need! Your site has been added! I currently maintain the OpenSourceWuhan page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analysis, medical and supply information, etc. Please share it with anyone who might need the information in the list or who might contribute to some of those projects. You are also welcome to recommend more projects.

https://weileizeng.github.io/OpenSourceWuhan/world

Cheers!

Annotations

Hello,

First of all, thanks a lot for the effort you are putting into gathering all these X-ray and CT images!

I am wondering whether bounding boxes/masks for the detection of problematic regions can be provided, or whether this is only available as an image classification dataset. If the latter, will there also be negative samples (X-ray images from COVID-negative patients)?

Best

Recommended datasets for transfer learning

Hi @ieee8023

thank you for maintaining this dataset!

I implemented a PyTorch Lightning wrapper for a DenseNet model for covid-chestxray-dataset.

It is the kickoff of a PyTorch Lightning community project that aims to be a COVID-19 detector (for educational purposes).

Can you recommend datasets and strategies for using additional data?

I have skimmed https://arxiv.org/pdf/2002.02497.pdf (I will return to it). It seems that resolving the labeling differences and other dataset-preparation differences requires quite a lot of domain expertise. Any tips are appreciated.

Kind regards

Ondra

PS: I was inspired by #15
PPS: My fork was merged into the PyTorchLightning community project
PPPS: I believe @Borda has already contacted you to suggest that we use Slack for longer discussions if needed. A link to the Slack can be found at PL

Automate finding radiographs in academic papers

I have a project to automatically search for, download, and extract radiographs from papers on a given disease.

Right now everything works except filtering the radiographs from the other figures, which I am still working on. Even without this feature, though, the tool could help you to manually screen paper figures quickly.

Do you think this tool would be useful for this project?

5 South Korea papers still not included in metadata

Missing License

Thank you so much for curating this dataset! Hopefully we can all work together to scale up detection of coronaviruses via radiography.

Could you please add a LICENSE file for the images and annotations in this repository? We'd like to be able to use and remix them but need clarification of the terms under which they're allowed to be used and shared.

Minor patientID issue - PatientID 62 is used twice

We thought we would report a minor issue we found in the metadata: patientID 62 appears twice, but the gender, age, and data source differ between the two entries.

This is amazing work! Keep it up!
image
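A check like the one that surfaced patientID 62 could be automated. The sketch below groups metadata rows by patient ID and reports IDs whose demographic fields disagree. The column names `patientid`, `sex`, and `age` are assumptions about metadata.csv, and `conflicting_patients` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def conflicting_patients(rows, id_col="patientid", fields=("sex", "age")):
    """Return {patient_id: [distinct field tuples]} for IDs with more than one combination."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[id_col]].add(tuple(row[f] for f in fields))
    return {pid: sorted(values) for pid, values in seen.items() if len(values) > 1}


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, `conflicting_patients(load_rows("metadata.csv"))` would list every patient ID that carries more than one sex/age combination. Note that a patient imaged on several dates may legitimately change age, so each hit still needs a manual look.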

Sharing my data

Hi, I am doing some research on this topic, applying deep learning with CNNs to create an automated computer-vision-based scanner that detects COVID-positive and COVID-negative scans.

Here you can find my dataset. I am currently building a CT scan dataset to train a model on CT scans in addition to RX scans. https://github.com/AleGiovanardi/covidhelper/tree/master/dataset/covidct

I also have a source of new RX and CT scans directly from an Italian hospital, so I will update it periodically. You are welcome to take any data in my repo that is missing from here.

You can also find code that trains a model, saves it, and lets you use it to test detection on scans; it is based on Adrian Rosebrock's tutorial on PyImageSearch. I am constantly working to improve its performance and accuracy.

Also, thanks for your great work; this inspired me a lot!

Do the red arrows on some images create a danger for data leakage?

It just occurred to me that arrows only occur on images with a positive diagnosis, so this could cause data leakage.

That might not be as much of a problem if you are using these images for differential diagnosis and already know the patient has something, but it could be an issue if this dataset is being combined with healthy images to decide whether the patient is healthy or sick.
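Since radiographs are essentially grayscale, images with drawn-on red arrows could be flagged by measuring how many pixels are strongly red. This is a minimal sketch under assumptions: the thresholds `min_red` and `min_gap` are arbitrary illustrative values, and `red_fraction` is a hypothetical helper that expects an iterable of (R, G, B) tuples in the 0-255 range from whatever image loader is in use.

```python
def red_fraction(pixels, min_red=150, min_gap=60):
    """Fraction of (R, G, B) pixels that are strongly red rather than gray.

    A pixel counts as red when its red channel is bright (>= min_red) and
    exceeds both the green and blue channels by at least min_gap.
    """
    total = 0
    red = 0
    for r, g, b in pixels:
        total += 1
        if r >= min_red and r - max(g, b) >= min_gap:
            red += 1
    return red / total if total else 0.0
```

An image whose red fraction is clearly above zero is a candidate for manual review, cropping, or exclusion before it is mixed with healthy images.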

wrong image file name

The file below does not exist in metadata.csv:

1-s2.0-S0929664620300449-gr3_lrg-e.jpg
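Mismatches like this one can be found in both directions by comparing the file names on disk against the ones listed in the metadata. The sketch below assumes a `filename` column in metadata.csv, which is an assumption, and `compare_files_to_metadata` is a hypothetical helper name.

```python
def compare_files_to_metadata(image_names, metadata_rows, filename_col="filename"):
    """Return (files missing from metadata, metadata entries missing on disk)."""
    on_disk = set(image_names)
    listed = {row[filename_col] for row in metadata_rows if row.get(filename_col)}
    return sorted(on_disk - listed), sorted(listed - on_disk)
```

Here `image_names` could come from `os.listdir` on the image folder and `metadata_rows` from `csv.DictReader` on metadata.csv; both return values should be empty lists when the two are in sync.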

What is the offset?

Can anybody explain the purpose of the offset values? I am having trouble understanding what they are.

More Clinical note about other patients with SARS or ARDS or pneumonia

Does anyone know where we could find more clinical notes about SARS, ARDS, or pneumonia patients? I want to do some research on COVID-19 classification based on clinical notes.
In the current metadata, there are 92 COVID-19 instances compared to 14 non-COVID-19 instances.
I would really appreciate it if anyone could provide more useful info.
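Until more non-COVID-19 notes are available, one common way to cope with a 92 versus 14 imbalance is to weight each class inversely to its frequency during training. The sketch below uses the usual total / (number of classes × class count) formula; the helper name and the class labels are illustrative assumptions.

```python
def inverse_frequency_weights(counts):
    """Map each class label to total / (n_classes * class_count).

    Every class must have at least one sample, otherwise the division fails.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts taken from the metadata figures quoted above; the label names are illustrative.
weights = inverse_frequency_weights({"COVID-19": 92, "non-COVID-19": 14})
```

With these counts, the COVID-19 class gets a weight of about 0.58 and the non-COVID-19 class about 3.79, so both classes contribute equally to the loss in aggregate.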

4 images missing metadata

Thank you for your work.
Is there any labeling of the images that indicates whether the patient survived?

patients without Y or N checks?

Hello, the survival column of the metadata.csv file contains patients without a Y or N value. How should we handle these samples for training and validation?

thank you
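One simple option, offered as a sketch rather than as the maintainers' recommendation, is to keep only rows with an explicit Y or N for supervised training and set the rest aside. The column name `survival` comes from the question above; `split_by_survival_label` is a hypothetical helper name.

```python
def split_by_survival_label(rows, column="survival"):
    """Split metadata rows into (labeled, unlabeled) by whether survival is Y or N."""
    labeled, unlabeled = [], []
    for row in rows:
        value = (row.get(column) or "").strip().upper()
        if value in ("Y", "N"):
            labeled.append(row)
        else:
            unlabeled.append(row)
    return labeled, unlabeled
```

The labeled rows can then be split into training and validation sets, while the unlabeled rows remain available for other tasks that do not need the survival outcome.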

Separate imaging modality and view

It would be easier to select just X-rays or CT scans if there were one column for the modality and a separate column for the view. I have made these changes on my fork and can make a PR if you think this is useful. Right now I am just checking that all of my changes are correct.
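The split proposed here could look roughly like the sketch below. It assumes, hypothetically, that a single combined column holds either an X-ray view such as "PA", "AP", or "L", the bare value "CT", or a CT plane such as "Axial" or "Coronal". These values and the helper name are illustrative assumptions, not a description of the actual metadata or of the fork.

```python
# Assumption: these plane names only ever describe CT scans.
CT_VIEWS = {"Axial", "Coronal"}


def split_modality_and_view(combined):
    """Turn one combined value into a (modality, view) pair."""
    value = combined.strip()
    if value == "CT":
        return "CT", ""  # modality known, view not recorded
    if value in CT_VIEWS:
        return "CT", value
    return "X-ray", value  # everything else is treated as an X-ray view
```

With separate columns, selecting only X-rays becomes a filter on the modality column instead of a list of view names.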
