
lungcancerdetection's Introduction

LungCancerProject

Deep learning is a fast-evolving field with many implications for medical imaging.

Currently, medical images are interpreted by radiologists, physicians, and other specialists, but this interpretation can be very subjective. After years of looking at ultrasound images, my co-workers and I still get into arguments about whether we are actually seeing a tumor in a scan. Radiologists also have to look through large volumes of these images, which can cause fatigue and lead to mistakes. So there is a need to automate this process.

Machine learning algorithms such as support vector machines are often used to detect and classify tumors, but they are limited by the assumptions we make when defining features, which reduces sensitivity. Deep learning could be an ideal solution because these algorithms learn features directly from raw image data.

One challenge in implementing these algorithms is the scarcity of labeled medical image data. While this is a limitation for all applications of deep learning, it is especially so for medical image data because of patient confidentiality concerns.

In this post you will learn how to build a convolutional neural network, train it, and have it detect lung nodules. I used data from the Lung Image Database Consortium and Image Database Resource Initiative [(LIDC-IDRI) database](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI). As these images were huge (124 GB), I ended up using the reformatted version made available for the LUNA16 challenge. This dataset consists of 888 CT scans with annotations describing coordinates and ground truth labels. The first step was to create an image database for training.

Creating an image database

The images were formatted as .mhd and .raw files. The header data is contained in the .mhd files and the multidimensional image data is stored in the .raw files. I used the SimpleITK library to read the .mhd files. Each CT scan has dimensions of 512 x 512 x n, where n is the number of axial scans. There are about 200 images in each CT scan.
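If you have not used SimpleITK before, a minimal sketch of loading one scan might look like the snippet below; the file path and variable names are placeholders, not the repository's actual code.

```python
# A minimal sketch of reading one LUNA16 scan with SimpleITK.
import SimpleITK as sitk
import numpy as np

mhd_path = "subset0/scan.mhd"  # hypothetical path to one scan's .mhd header file

itk_image = sitk.ReadImage(mhd_path)        # loads the .mhd header and its .raw pixel data
scan = sitk.GetArrayFromImage(itk_image)    # numpy array shaped (n_slices, 512, 512)
origin = np.array(itk_image.GetOrigin())    # world coordinates (mm) of the first voxel
spacing = np.array(itk_image.GetSpacing())  # voxel size in mm along x, y, z

print(scan.shape, origin, spacing)
```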

There were a total of 551,065 annotations. Of all the annotations provided, 1,351 were labeled as nodules; the rest were labeled negative. So there is a big class imbalance. The easy way to deal with it is to under-sample the majority class and augment the minority class by rotating images.

We could potentially train the CNN on all the pixels, but that would increase the computational cost and training time. Instead, I decided to crop the images around the coordinates provided in the annotations. The annotations were provided in Cartesian (world) coordinates, so they had to be converted to voxel coordinates. Also, the image intensity was defined on the Hounsfield scale, so it had to be rescaled for image processing purposes.
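A hedged sketch of those two conversions is shown below; the Hounsfield clipping window (-1000 to 400 HU) is a common choice for lung CT, not necessarily the exact one used in the original scripts.

```python
import numpy as np

def world_to_voxel(world_coord, origin, spacing):
    """Convert an annotation's world (mm) coordinates to voxel indices."""
    return np.rint((np.asarray(world_coord) - origin) / spacing).astype(int)

def normalize_hu(image, min_hu=-1000.0, max_hu=400.0):
    """Clip Hounsfield units to a lung window and rescale intensities to [0, 1]."""
    image = np.clip(image, min_hu, max_hu)
    return (image - min_hu) / (max_hu - min_hu)
```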

The script below generates 50 x 50 grayscale images for training, testing, and validating a CNN.

<script src="https://gist.github.com/swethasubramanian/8483c5a21d0727e99976b0b9e2b60e68.js"></script>

While the script above under-sampled the negative class so that roughly 1 in every 6 images had a nodule, the data set was still vastly imbalanced for training. I decided to augment my training set by rotating the nodule images. The script below does just that.

<script src="https://gist.github.com/swethasubramanian/72697b5cff4c5614c06460885dc7ae23.js"></script>
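If the gist does not render, the core of the augmentation can be sketched with numpy's rot90; this is an illustrative version, not the exact script.

```python
import numpy as np

def augment_with_rotations(patch):
    """Return the original 50 x 50 patch plus its 90- and 180-degree rotations."""
    return [patch, np.rot90(patch, k=1), np.rot90(patch, k=2)]
```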

So for an original image, my script would create these two images:

[Images: original image, 90 degree rotation, 180 degree rotation]

Augmentation resulted in an 80-20 class distribution, which was not entirely ideal. But I also did not want to augment the minority class too much, because that might result in a minority class with little variation.

Building a CNN

Now we are ready to build a CNN. After dabbling a bit with TensorFlow, I decided it was way too much work for something incredibly simple, so I used TFLearn, a high-level API wrapper around TensorFlow, which made the coding a lot more palatable. The approach I used was similar to this. I used 3 convolutional layers in my architecture.

[Figure: CNN architecture]

My CNN model is defined in a class as shown in the script below.

<script src="https://gist.github.com/swethasubramanian/45be51b64d1595e78fb171c5dbb6cce6.js"></script>
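If the gist does not render, a hedged sketch of a three-convolutional-layer TFLearn network with the filter counts and sizes described later in this post (32 5x5, 64 5x5, 64 3x3) is shown below; details such as the dropout rate and learning rate are assumptions.

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

net = input_data(shape=[None, 50, 50, 1])             # 50 x 50 grayscale patches
net = conv_2d(net, 32, 5, activation='relu')          # first conv layer: 32 5x5 filters
net = max_pool_2d(net, 2)                             # downsample by 2
net = conv_2d(net, 64, 5, activation='relu')          # second conv layer: 64 5x5 filters
net = max_pool_2d(net, 2)
net = conv_2d(net, 64, 3, activation='relu')          # third conv layer: 64 3x3 filters
net = max_pool_2d(net, 2)
net = fully_connected(net, 512, activation='relu')
net = dropout(net, 0.5)                               # assumed dropout rate
net = fully_connected(net, 2, activation='softmax')   # nodule vs. non-nodule
net = regression(net, optimizer='adam',
                 loss='categorical_crossentropy', learning_rate=0.001)

model = tflearn.DNN(net, tensorboard_verbose=0)
```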

I had a total of 6878 images in my training set.

Training the model

Because the data required to train a CNN is very large, it is often desirable to train the model in batches. Loading all the training data into memory is not always possible, because you need enough memory to hold the images and the features too. I was working on a 2012 MacBook Pro, so I decided to load all the images into an HDF5 dataset using the h5py library. You can find the script I used to do that here.
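A minimal sketch of what that looks like with h5py is shown below; the dataset names and shapes are assumptions rather than the repository's exact layout.

```python
import h5py
import numpy as np

def save_dataset(filename, images, labels):
    """Store (N, 50, 50, 1) image patches and one-hot labels in one HDF5 file."""
    with h5py.File(filename, 'w') as f:
        f.create_dataset('X', data=images.astype(np.float32))
        f.create_dataset('Y', data=labels.astype(np.float32))

def open_dataset(filename):
    """Open the HDF5 file; slicing the returned datasets reads from disk lazily."""
    f = h5py.File(filename, 'r')
    return f['X'], f['Y']
```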

Once I had the training data in an HDF5 dataset, I trained the model using this script.

<script src="https://gist.github.com/swethasubramanian/dca76567afe1c175e016b2ce299cb7fb.js"></script>
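If the gist does not render, the training step looks roughly like the call below. The argument names follow TFLearn's model.fit API, but the file names, dataset keys, epoch count, and batch size here are illustrative assumptions.

```python
import h5py

# assumes `model` is the tflearn.DNN object sketched above
h5f = h5py.File('train.h5', 'r')                 # hypothetical file name
X_train_images, Y_train_labels = h5f['X'], h5f['Y']

h5f_val = h5py.File('val.h5', 'r')
X_val_images, Y_val_labels = h5f_val['X'], h5f_val['Y']

model.fit(X_train_images, Y_train_labels, n_epoch=20, shuffle=True,
          validation_set=(X_val_images, Y_val_labels), show_metric=True,
          batch_size=96, snapshot_epoch=True, run_id='nodule-classifier')

model.save('nodule-classifier.tfl')              # persist the trained weights
```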

The training took a couple of hours on my laptop. Like any engineer, I wanted to see what goes on under the hood. As the filters are of low resolution (5x5), it is more useful to visualize the feature maps they generate.

So if I pass an input image through the first convolutional layer (50 x 50 x 32), it generates a feature map that looks like this: [Image: conv_layer_0]

The max pooling layer that follows downsamples the feature map by a factor of 2. So when the downsampled feature map is passed into the second convolutional layer of 64 5x5 filters, the resulting feature map is: [Image: conv_layer_1]

The feature map generated by the third convolutional layer, containing 64 3x3 filters, is: [Image: conv_layer_2]

Testing data

I tested my CNN model on 1,623 images and had a validation accuracy of 93%. The model has a precision of 89.3%, a recall of 71.2%, and a specificity of 98.2%.
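For reference, these numbers follow directly from the confusion-matrix counts; the sketch below shows the arithmetic (the tp/fp/tn/fn values would come from the matrix and are not hard-coded here).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    precision = tp / float(tp + fp)      # fraction of predicted nodules that are real
    recall = tp / float(tp + fn)         # fraction of real nodules that were found (sensitivity)
    specificity = tn / float(tn + fp)    # fraction of non-nodules correctly rejected
    return accuracy, precision, recall, specificity
```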

Here is the confusion matrix.

[Image: confusion matrix]

I also looked deeper into the kinds of predictions the model made: [Images: false negative predictions (preds_fns), false positive predictions (preds_fps), true negative predictions (preds_tns), true positive predictions (preds_tps)]


lungcancerdetection's Issues

General Working

Hi there. I just completed my course on neural networks and wanted to do a project to get the general idea. I think that by running this project once I would get a practical idea of NN-based projects. As I am just a beginner, I do not understand how to execute the files. If you could tell me the order in which to run the files, it would be really helpful.

IndexError: list index out of range while training the model train.py

Hi,
I tried to run train.py directly from the h5 dataset. During model.fit, I get an IndexError: list index out of range. How can we solve this error? The error details are as follows:

model.fit(X_train_images, Y_train_labels, n_epoch = 1, shuffle=True,
validation_set = (X_val_images, Y_val_labels), show_metric = True,
batch_size = 5, snapshot_epoch = True, run_id = 'nodule3-classifier')
Traceback (most recent call last):

File "", line 1, in
model.fit(X_train_images, Y_train_labels, n_epoch = 1, shuffle=True, validation_set = (X_val_images, Y_val_labels), show_metric = True, batch_size = 5, snapshot_epoch = True, run_id = 'nodule3-classifier')

File "C:\Users\Mahidol\Anaconda3\envs\Thesis\lib\site-packages\tflearn\models\dnn.py", line 184, in fit
self.targets)

File "C:\Users\Mahidol\Anaconda3\envs\Thesis\lib\site-packages\tflearn\utils.py", line 283, in feed_dict_builder
feed_dict[net_inputs[i]] = x

IndexError: list index out of range

about candidates.csv file?

Are you using candidates.csv for classification purposes? What is the real difference between candidates.csv and candidates_V2.csv? I know that candidates_V2.csv contains more records.

Hdf5 files with X and Y.

Hello, first of all thank you for sharing your code, it's really well done!

It seems it's not possible to generate the hdf5 files for prediction without the "data labels" file.
Would you please provide the hdf5 files?
I'd really appreciate that!

Thank you in advance!

Great Project

It is a detailed and beautiful project, both in its narrative and its content.

Error within LungCancerDetection.ipynb

model.predict(X_test_images)

IOError Traceback (most recent call last)
in ()
----> 1 model.predict(X_test_images)

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/models/dnn.pyc in predict(self, X)
229 """
230 feed_dict = feed_dict_builder(X, None, self.inputs, None)
--> 231 return self.predictor.predict(feed_dict)
232
233 def predict_label(self, X):

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/helpers/evaluator.pyc in predict(self, feed_dict)
61 if len(dprep_dict) > 0:
62 for k in dprep_dict:
---> 63 feed_dict[k] = dprep_dict[k].apply(feed_dict[k])
64
65 # Prediction for each tensor

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/data_preprocessing.pyc in apply(self, batch)
44 batch = m(batch, *self.args[i])
45 else:
---> 46 batch = m(batch)
47 return batch
48

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/data_preprocessing.pyc in _featurewise_zero_center(self, batch)
199 def _featurewise_zero_center(self, batch):
200 for i in range(len(batch)):
--> 201 batch[i] -= self.global_mean.value
202 return batch
203

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)()

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in setitem(self, args, val)
616 mspace = h5s.create_simple(mshape_pad, (h5s.UNLIMITED,)*len(mshape_pad))
617 for fspace in selection.broadcast(mshape):
--> 618 self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
619
620 def read_direct(self, dest, source_sel=None, dest_sel=None):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)()

h5py/h5d.pyx in h5py.h5d.DatasetID.write (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/h5d.c:3678)()

h5py/_proxy.pyx in h5py._proxy.dset_rw (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_proxy.c:2022)()

h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dwrite (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_proxy.c:1732)()

IOError: Can't prepare for writing data (No write intent on file)

Any solution?

Data Augmentation

Thanks for your brilliant project. I am puzzled: there are 1690 images generated from data augmentation, but the size of 'train.h5' is 5187. I think you did not add the augmented data to the training set.

about create_images.py

Hi Swetha,

Thank you very much for sharing this code. Could you please clarify the following?

inpfile = mode + 'data'
outDir = mode + '/image_'

In the above, what is 'data' referring to?

About data labels

Thank you for sharing. I wonder, are there exact labels for your data? If yes, may I get them for learning purposes?

#about predict_model.py

@swethasubramanian Hello! Sorry to bother you! I would like to ask a question: when running predict_module.py, the following error appears: NotFoundError (see above for traceback): Key Accuracy/Mean/moving_avg_1 not found in checkpoint [[Node: save_13/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_13/Const_0_0, save_13/RestoreV2/tensor_names, save_13/RestoreV2/shape_and_slices)]] How should this be fixed?


IOError: File ../data/candidates.csv does not exist.

Hi,
I have checked out the code and executed the command "python create_images.py train" in the folder "LungCancerDetection/src/data", but it reports "IOError: File ../data/candidates.csv does not exist". Indeed, I cannot find the "candidates.csv" file in the project.
Could you please give me some suggestions? Thanks!
[Screenshot: candidates csv error]

Instructions on how to run with existing data

Installation and usage

  1. Install the requirements (file attached) using pip install -r requirements.txt
    requirements.txt

  2. From project directory
    $ cd src/models/
    $ python3 train.py

  • visualize_model.py or predict_model.py can be run once the training data is generated.

Extra steps for Python 3

  1. You may face issues with print being called without parentheses; please replace the usage of print with print().

  2. In visualize_model.py line 69,
    replace nfilt/4 with nfilt//4, as an int is expected.

Request to Author

Looking at the above steps, this guide may feel unnecessary, but I spent around 2 hours trying to get this working. A lot of time would have been saved if a requirements file had been provided.

Other notes

If anyone is willing to make this change and open a pull request, please do, and make it runnable from any directory; that will be better than cd-ing into models before execution.

Due to laziness, I'm only writing this short guide. Sorry, folks.

Fully connected weights

Hello!
First of all, thank you for uploading that wonderful project!

I'd like to know how I can see the final weights of the fully connected layer, and how many weights there are.

Thank you in advance!

list index out of range

When I run 'create_image.py test' at the cmd prompt, I get an error saying 'list index out of range'. The error refers to the following line: self.ds = sitk.ReadImage(path[0])

How to cite your work? How to execute your codes?

Hi, swethasubramanian
Thanks for sharing your code. I have some questions for you.

How should I cite your work? Which paper should be cited?
How should I execute your code? In what order should the scripts be run?
Thanks for your help.
Best regards,
Gu Yu

LIDC-IDRI database

Hello, would you be willing to send the LIDC-IDRI data set to me? I would appreciate it very much. My Internet number is 18709853489.
