
lungcancerdetection's Introduction

LungCancerProject

Deep learning is a fast-evolving field with many implications for medical imaging.

Currently, medical images are interpreted by radiologists, physicians, and other specialists, but this interpretation can be very subjective. After years of looking at ultrasound images, my co-workers and I still get into arguments about whether we are actually seeing a tumor in a scan. Radiologists also have to look through large volumes of these images, which can cause fatigue and lead to mistakes. So there is a need to automate this process.

Machine learning algorithms such as support vector machines are often used to detect and classify tumors, but they are limited by the assumptions we make when defining features, which reduces sensitivity. Deep learning could be an ideal solution because these algorithms learn features directly from raw image data.

One challenge in implementing these algorithms is the scarcity of labeled medical image data. While this is a limitation for all applications of deep learning, it is especially so for medical image data because of patient confidentiality concerns.

In this post you will learn how to build a convolutional neural network, train it, and have it detect lung nodules. I used data from the Lung Image Database Consortium and Image Database Resource Initiative [(LIDC-IDRI) database](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI). As these images were huge (124 GB), I ended up using the reformatted version made available for the LUNA16 challenge. This dataset consists of 888 CT scans with annotations describing coordinates and ground truth labels. The first step was to create an image database for training.

Creating an image database

The images were formatted as .mhd and .raw files. The header data is contained in the .mhd files and the multidimensional image data is stored in the .raw files. I used the SimpleITK library to read the .mhd files. Each CT scan has dimensions of 512 x 512 x n, where n is the number of axial scans. There are about 200 images in each CT scan.
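If you have not used SimpleITK before, a minimal sketch of loading one scan might look like the snippet below; the file path and variable names are placeholders, not the repository's actual code.

```python
# A minimal sketch of reading one LUNA16 scan with SimpleITK.
import SimpleITK as sitk
import numpy as np

mhd_path = "subset0/scan.mhd"  # hypothetical path to one scan's .mhd header file

itk_image = sitk.ReadImage(mhd_path)        # loads the .mhd header and its .raw pixel data
scan = sitk.GetArrayFromImage(itk_image)    # numpy array shaped (n_slices, 512, 512)
origin = np.array(itk_image.GetOrigin())    # world coordinates (mm) of the first voxel
spacing = np.array(itk_image.GetSpacing())  # voxel size in mm along x, y, z

print(scan.shape, origin, spacing)
```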

There were a total of 551,065 annotations. Of all the annotations provided, 1,351 were labeled as nodules; the rest were labeled negative. So there is a big class imbalance. The easy way to deal with it is to under-sample the majority class and augment the minority class by rotating images.

We could potentially train the CNN on all the pixels, but that would increase the computational cost and training time. Instead, I decided to crop the images around the coordinates provided in the annotations. The annotations were provided in Cartesian (world) coordinates, so they had to be converted to voxel coordinates. Also, the image intensity was defined on the Hounsfield scale, so it had to be rescaled for image processing purposes.
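A hedged sketch of those two conversions is shown below; the Hounsfield clipping window (-1000 to 400 HU) is a common choice for lung CT, not necessarily the exact one used in the original scripts.

```python
import numpy as np

def world_to_voxel(world_coord, origin, spacing):
    """Convert an annotation's world (mm) coordinates to voxel indices."""
    return np.rint((np.asarray(world_coord) - origin) / spacing).astype(int)

def normalize_hu(image, min_hu=-1000.0, max_hu=400.0):
    """Clip Hounsfield units to a lung window and rescale intensities to [0, 1]."""
    image = np.clip(image, min_hu, max_hu)
    return (image - min_hu) / (max_hu - min_hu)
```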

The script below generates 50 x 50 grayscale images for training, testing, and validating a CNN.

<script src="https://gist.github.com/swethasubramanian/8483c5a21d0727e99976b0b9e2b60e68.js"></script>

While the script above under-sampled the negative class so that roughly 1 in every 6 images had a nodule, the data set was still vastly imbalanced for training. I decided to augment my training set by rotating the nodule images. The script below does just that.

<script src="https://gist.github.com/swethasubramanian/72697b5cff4c5614c06460885dc7ae23.js"></script>
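If the gist does not render, the core of the augmentation can be sketched with numpy's rot90; this is an illustrative version, not the exact script.

```python
import numpy as np

def augment_with_rotations(patch):
    """Return the original 50 x 50 patch plus its 90- and 180-degree rotations."""
    return [patch, np.rot90(patch, k=1), np.rot90(patch, k=2)]
```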

So for an original image, my script would create these two images:

[Images: original image, 90 degree rotation, 180 degree rotation]

Augmentation resulted in an 80-20 class distribution, which was not entirely ideal. But I also did not want to augment the minority class too much, because that might result in a minority class with little variation.

Building a CNN

Now we are ready to build a CNN. After dabbling a bit with TensorFlow, I decided it was way too much work for something incredibly simple, so I used TFLearn, a high-level API wrapper around TensorFlow, which made the coding a lot more palatable. The approach I used was similar to this. I used 3 convolutional layers in my architecture.

[Figure: CNN architecture]

My CNN model is defined in a class as shown in the script below.

<script src="https://gist.github.com/swethasubramanian/45be51b64d1595e78fb171c5dbb6cce6.js"></script>
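If the gist does not render, a hedged sketch of a three-convolutional-layer TFLearn network with the filter counts and sizes described later in this post (32 5x5, 64 5x5, 64 3x3) is shown below; details such as the dropout rate and learning rate are assumptions.

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

net = input_data(shape=[None, 50, 50, 1])             # 50 x 50 grayscale patches
net = conv_2d(net, 32, 5, activation='relu')          # first conv layer: 32 5x5 filters
net = max_pool_2d(net, 2)                             # downsample by 2
net = conv_2d(net, 64, 5, activation='relu')          # second conv layer: 64 5x5 filters
net = max_pool_2d(net, 2)
net = conv_2d(net, 64, 3, activation='relu')          # third conv layer: 64 3x3 filters
net = max_pool_2d(net, 2)
net = fully_connected(net, 512, activation='relu')
net = dropout(net, 0.5)                               # assumed dropout rate
net = fully_connected(net, 2, activation='softmax')   # nodule vs. non-nodule
net = regression(net, optimizer='adam',
                 loss='categorical_crossentropy', learning_rate=0.001)

model = tflearn.DNN(net, tensorboard_verbose=0)
```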

I had a total of 6878 images in my training set.

Training the model

Because the data required to train a CNN is very large, it is often desirable to train the model in batches. Loading all the training data into memory is not always possible, because you need enough memory to hold the images and the features too. I was working on a 2012 MacBook Pro, so I decided to load all the images into an HDF5 dataset using the h5py library. You can find the script I used to do that here.
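A minimal sketch of what that looks like with h5py is shown below; the dataset names and shapes are assumptions rather than the repository's exact layout.

```python
import h5py
import numpy as np

def save_dataset(filename, images, labels):
    """Store (N, 50, 50, 1) image patches and one-hot labels in one HDF5 file."""
    with h5py.File(filename, 'w') as f:
        f.create_dataset('X', data=images.astype(np.float32))
        f.create_dataset('Y', data=labels.astype(np.float32))

def open_dataset(filename):
    """Open the HDF5 file; slicing the returned datasets reads from disk lazily."""
    f = h5py.File(filename, 'r')
    return f['X'], f['Y']
```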

Once I had the training data in an HDF5 dataset, I trained the model using this script.

<script src="https://gist.github.com/swethasubramanian/dca76567afe1c175e016b2ce299cb7fb.js"></script>
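If the gist does not render, the training step looks roughly like the call below. The argument names follow TFLearn's model.fit API, but the file names, dataset keys, epoch count, and batch size here are illustrative assumptions.

```python
import h5py

# assumes `model` is the tflearn.DNN object sketched above
h5f = h5py.File('train.h5', 'r')                 # hypothetical file name
X_train_images, Y_train_labels = h5f['X'], h5f['Y']

h5f_val = h5py.File('val.h5', 'r')
X_val_images, Y_val_labels = h5f_val['X'], h5f_val['Y']

model.fit(X_train_images, Y_train_labels, n_epoch=20, shuffle=True,
          validation_set=(X_val_images, Y_val_labels), show_metric=True,
          batch_size=96, snapshot_epoch=True, run_id='nodule-classifier')

model.save('nodule-classifier.tfl')              # persist the trained weights
```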

The training took a couple of hours on my laptop. Like any engineer, I wanted to see what goes on under the hood. As the filters are of low resolution (5x5), it is more useful to visualize the feature maps they generate.

So if I pass an input image through the first convolutional layer (50 x 50 x 32), it generates a feature map that looks like this: [Image: conv_layer_0]

The max pooling layer that follows downsamples the feature map by a factor of 2. So when the downsampled feature map is passed into the second convolutional layer of 64 5x5 filters, the resulting feature map is: [Image: conv_layer_1]

The feature map generated by the third convolutional layer, containing 64 3x3 filters, is: [Image: conv_layer_2]

Testing data

I tested my CNN model on 1,623 images and had a validation accuracy of 93%. The model has a precision of 89.3%, a recall of 71.2%, and a specificity of 98.2%.
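For reference, these numbers follow directly from the confusion-matrix counts; the sketch below shows the arithmetic (the tp/fp/tn/fn values would come from the matrix and are not hard-coded here).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    precision = tp / float(tp + fp)      # fraction of predicted nodules that are real
    recall = tp / float(tp + fn)         # fraction of real nodules that were found (sensitivity)
    specificity = tn / float(tn + fp)    # fraction of non-nodules correctly rejected
    return accuracy, precision, recall, specificity
```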

Here is the confusion matrix.

[Image: confusion matrix]

I also looked deeper into the kinds of predictions the model made: [Images: false negative predictions (preds_fns), false positive predictions (preds_fps), true negative predictions (preds_tns), true positive predictions (preds_tps)]


lungcancerdetection's Issues

General Working

Hi there. I just completed my course on neural networks and wanted to do a project to get the general idea. I think that by running this project once I would get a practical idea of NN-based projects. As I am just a beginner, I do not understand how to execute the files. If you could tell me the order in which to run the files, it would be really helpful.

IndexError: list index out of range while training the model train.py

Hi,
I tried to run train.py directly from the h5 dataset. During model.fit, I get an IndexError: list index out of range. How can we solve this error? The error details are as follows:

model.fit(X_train_images, Y_train_labels, n_epoch = 1, shuffle=True,
validation_set = (X_val_images, Y_val_labels), show_metric = True,
batch_size = 5, snapshot_epoch = True, run_id = 'nodule3-classifier')
Traceback (most recent call last):

File "", line 1, in
model.fit(X_train_images, Y_train_labels, n_epoch = 1, shuffle=True, validation_set = (X_val_images, Y_val_labels), show_metric = True, batch_size = 5, snapshot_epoch = True, run_id = 'nodule3-classifier')

File "C:\Users\Mahidol\Anaconda3\envs\Thesis\lib\site-packages\tflearn\models\dnn.py", line 184, in fit
self.targets)

File "C:\Users\Mahidol\Anaconda3\envs\Thesis\lib\site-packages\tflearn\utils.py", line 283, in feed_dict_builder
feed_dict[net_inputs[i]] = x

IndexError: list index out of range

about candidates.csv file?

Are you using candidates.csv for classification purposes? What is the real difference between candidates.csv and candidates_V2.csv? I know that candidates_V2.csv contains more records.

Hdf5 files with X and Y.

Hello, first of all thank you for sharing your code, it's really well done!

It seems it's not possible to generate the hdf5 files for prediction without the "data labels" file.
Would you please provide the hdf5 files?
I'd really appreciate that!

Thank you in advance!

Great Project

It is a detailed and beautiful project, both in its narrative and its content.

Error within LungCancerDetection.ipynb

model.predict(X_test_images)

IOError Traceback (most recent call last)
in ()
----> 1 model.predict(X_test_images)

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/models/dnn.pyc in predict(self, X)
229 """
230 feed_dict = feed_dict_builder(X, None, self.inputs, None)
--> 231 return self.predictor.predict(feed_dict)
232
233 def predict_label(self, X):

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/helpers/evaluator.pyc in predict(self, feed_dict)
61 if len(dprep_dict) > 0:
62 for k in dprep_dict:
---> 63 feed_dict[k] = dprep_dict[k].apply(feed_dict[k])
64
65 # Prediction for each tensor

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/data_preprocessing.pyc in apply(self, batch)
44 batch = m(batch, *self.args[i])
45 else:
---> 46 batch = m(batch)
47 return batch
48

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/tflearn/data_preprocessing.pyc in _featurewise_zero_center(self, batch)
199 def _featurewise_zero_center(self, batch):
200 for i in range(len(batch)):
--> 201 batch[i] -= self.global_mean.value
202 return batch
203

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)()

/Users/nemo/anaconda/envs/kaggleShit/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in setitem(self, args, val)
616 mspace = h5s.create_simple(mshape_pad, (h5s.UNLIMITED,)*len(mshape_pad))
617 for fspace in selection.broadcast(mshape):
--> 618 self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
619
620 def read_direct(self, dest, source_sel=None, dest_sel=None):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)()

h5py/h5d.pyx in h5py.h5d.DatasetID.write (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/h5d.c:3678)()

h5py/_proxy.pyx in h5py._proxy.dset_rw (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_proxy.c:2022)()

h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dwrite (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_proxy.c:1732)()

IOError: Can't prepare for writing data (No write intent on file)

Any solution?

Data Augmentation

Thanks for your brilliant project. I am puzzled: there are 1690 images generated from data augmentation, but the size of 'train.h5' is 5187. I think you did not add the augmented data to the training set.

about create_images.py

Hi Swetha,

Thank you very much for sharing this code. Could you please clarify the following?

inpfile = mode + 'data'
outDir = mode + '/image_'

In the above, what is 'data' referring to?

About data labels

Thank you for sharing. I wonder, are there exact labels for your data? If yes, may I get them for learning purposes?

#about predict_model.py

@swethasubramanian Hello! Sorry to bother you! I would like to ask a question: when running predict_module.py, the following error appears: NotFoundError (see above for traceback): Key Accuracy/Mean/moving_avg_1 not found in checkpoint [[Node: save_13/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_13/Const_0_0, save_13/RestoreV2/tensor_names, save_13/RestoreV2/shape_and_slices)]] How should this be fixed?


IOError: File ../data/candidates.csv does not exist.

Hi,
I have checked out the code and executed the command "python create_images.py train" in the folder "LungCancerDetection/src/data", but it reports "IOError: File ../data/candidates.csv does not exist". Indeed, I cannot find the "candidates.csv" file in the project.
Could you please give me some suggestions? Thanks!
[Screenshot: candidates csv error]

Instructions on how to run with existing data

Installation and usage

  1. Install the requirements (file attached) using pip install -r requirements.txt
    requirements.txt

  2. From project directory
    $ cd src/models/
    $ python3 train.py

  • visualize_model.py or predict_model.py can be run once the training data is generated.

Extra steps for Python 3

  1. You may face issues with print being called without parentheses; please replace the usage of print with print().

  2. In visualize_model.py line 69,
    replace nfilt/4 with nfilt//4, as an int is expected.

Request to Author

Looking at the above steps, this guide may feel unnecessary, but I spent around 2 hours trying to get this working. A lot of time would have been saved if a requirements file had been provided.

Other notes

If anyone is willing to make this change and open a pull request, please do, and make it runnable from any directory; that will be better than cd-ing into models before execution.

Due to laziness, I'm only writing this short guide. Sorry, folks.

Fully connected weights

Hello!
First of all, thank you for uploading that wonderful project!

I'd like to know how I can see the final weights of the fully connected layer, and how many weights there are.

Thank you in advance!

list index out of range

When I run 'create_image.py test' at the cmd prompt, I get an error saying 'list index out of range'. The error refers to the following line: self.ds = sitk.ReadImage(path[0])

How to cite your work? How to execute your codes?

Hi, swethasubramanian
Thanks for sharing your code. I have some questions for you.

How should I cite your work? Which paper should be cited?
How should I execute your code? In what order should the scripts be run?
Thanks for your help.
Best regards,
Gu Yu

LIDC-IDRI database

Hello, would you be willing to send the LIDC-IDRI data set to me? I would appreciate it very much. My Internet number is 18709853489.
