garethjns / kaggle-eeg

Seizure prediction from EEG data using machine learning. 3rd place solution for Kaggle/Uni Melbourne seizure prediction competition.

kaggle-eeg's Introduction

EEG Seizure Prediction

Gareth Paul Jones
3rd place Melbourne University AES/MathWorks/NIH Seizure Prediction
2016

Description

This code is designed to process the raw data from Melbourne University AES/MathWorks/NIH Seizure Prediction, train a seizureModel (train.m), then predict seizure occurrence from a new test set (predict.m).

Data

The raw data contains 16-channel intracranial EEG recordings from 3 patients. It is split into interictal (background) periods and preictal (before-seizure) periods.

Features

Various features are extracted from the raw data, including:

  • Frequency power in EEG bands
  • Summary statistics in the temporal domain
  • Correlation between channels in the frequency and temporal domains

These features are extracted with various window sizes (240, 160, and 80 s in the 3rd place submission) and are combined into a single data set before training the models. Processed features are saved to disk for faster subsequent loading.
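
As a rough illustration of windowed band-power extraction, here is a minimal sketch on a synthetic single-channel signal. The sampling rate, segment length, band edges, and all variable names are assumptions made for the example; they are not taken from featuresObject.

```matlab
% Hypothetical example on synthetic data, not the competition recordings.
fs = 400;        % assumed sampling rate (Hz)
segLen = 600;    % assumed segment length (s)
winLen = 240;    % window length (s); the README lists 240, 160 and 80
t = (0:fs*segLen-1)' / fs;
x = sin(2*pi*10*t) + 0.1*sin(2*pi*40*t);   % strong 10 Hz, weak 40 Hz component

% Assumed band edges in Hz (rows: delta, theta, alpha, beta, gamma)
bands = [0.5 4; 4 8; 8 12; 12 30; 30 70];

nWin = floor(segLen / winLen);   % whole windows only, remainder is dropped
nSamp = winLen * fs;             % samples per window
bandPower = zeros(nWin, size(bands, 1));

for w = 1:nWin
    idx = (w-1)*nSamp + (1:nSamp);
    X = fft(x(idx));
    nHalf = floor(nSamp/2);
    p = abs(X(1:nHalf+1)).^2 / nSamp;   % one-sided power spectrum (unnormalised)
    f = (0:nHalf)' * fs / nSamp;        % frequency axis (Hz)
    for b = 1:size(bands, 1)
        inBand = f >= bands(b, 1) & f < bands(b, 2);
        bandPower(w, b) = sum(p(inBand));
    end
end

% Index of the band with the most power in each window
[~, strongest] = max(bandPower, [], 2);
```

Each row of bandPower is one window and each column one band, so features from different window lengths could be computed by changing winLen and joined afterwards.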

Models

Two models are fit to the processed data:

  • An RUS Boosted tree ensemble
  • A Quadratic SVM

These models are handled by the seizureModel object and are fit to all the data, rather than individual models being trained for each subject. The predictions of each model are ensembled with a simple mean, which produces a considerably better score than either model alone.
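
The simple-mean ensemble can be sketched as below. The scores are made-up numbers for five segments and the variable names are hypothetical; the real scores come from the trained seizureModel objects.

```matlab
% Hypothetical per-segment scores from each model (made-up values)
svmScore = [0.10; 0.80; 0.40; 0.95; 0.30];
rbtScore = [0.20; 0.60; 0.50; 0.85; 0.10];

% Simple mean across the two models, one value per segment
ensembleScore = mean([svmScore, rbtScore], 2);
disp(ensembleScore');
```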

Running

Training and prediction stages can be run independently from their respective scripts, or together from testRun.m. If running from testRun.m, paths need to be set in predict.m and train.m first. Warning: testRun.m is designed to run entirely from scratch and deletes all .mat files from the working directory when it starts!

Both predict.m and train.m expect the same directory structure as provided for the competition, and training is specifically written to handle the temporal relationships in this dataset - it would need modification to work correctly with new data.

  • Extract the original Kaggle data to a folder, e.g. R:\EEG Data\Original\

  • Extract the second test set released on Kaggle into a folder named New, e.g. R:\EEG Data\New\
    [Image: folder structure]

  • Set the paths params.paths.dataDir and params.paths.or in predict.m and train.m

    • params.paths.or should be the path to the "Original" folder created above.
    • params.paths.dataDir should be the "New" folder from above. Data from the original training and test sets will be copied here to create a new training set.
  • Run train.m

    • The first function copyTestLeakToTrain.m creates a new training/test set in params.paths.dataDir. This set will be used for training and the folder structure should look like this:
      [Image: original folder structure]
  • Run predict.m

    • params.paths.dataDir should be the "New" directory, e.g. R:\EEG Data\New\

Processed features and the final submission file are saved into the working directory to save time on subsequent runs.
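
For reference, the path setup described in the steps above might look like the following in predict.m and train.m. The folder locations are the example paths from this README; the existence check is an optional addition for illustration and is not part of the original scripts.

```matlab
% Example paths from the README; change these to match your own machine
params.paths.or = 'R:\EEG Data\Original\';   % original Kaggle data
params.paths.dataDir = 'R:\EEG Data\New\';   % second test set, also receives the new training set

% Optional sanity check (an addition, not in the original scripts)
if ~exist(params.paths.or, 'dir')
    warning('Original data folder not found: %s', params.paths.or);
end
if ~exist(params.paths.dataDir, 'dir')
    warning('New data folder not found: %s', params.paths.dataDir);
end
```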

Scripts

train.m script:

  • Processes raw data
    • Creates new test set from original test and training sets
  • Extracts features and saves in featuresObject (featuresTrain)
  • Trains an SVM and an RUS boosted tree ensemble, and saves the compact versions of these.

predict.m script:

  • Loads trained models (SVM and tree ensemble saved as seizureModel objects)
  • Loads new data
    • Extracts features and saves in a featuresObject (featuresTest)
  • Predicts new data
    • Reduces epoch predictions to segment predictions
    • Ensembles SVM and tree ensemble
  • Saves to a .csv submission file as per the Kaggle specification
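
The reduction from epoch predictions to segment predictions, followed by writing a submission file, could be sketched as follows. This assumes the reduction is a per-file mean, which this README does not confirm. The file names, scores, and output file name are made up; the "File" and "Class" column headers are the ones reported for the generated .csv.

```matlab
% Hypothetical epoch-level predictions: several epochs per segment file
epochFile  = {'new_1_1.mat'; 'new_1_1.mat'; 'new_1_2.mat'; 'new_1_2.mat'; 'new_1_2.mat'};
epochScore = [0.2; 0.4; 0.9; 0.7; 0.8];

% Group epochs by file and average them (assumed reduction: mean)
[segFile, ~, segIdx] = unique(epochFile);
segScore = accumarray(segIdx(:), epochScore, [], @mean);

% Write one row per segment
outFile = 'submission_example.csv';   % hypothetical file name
fid = fopen(outFile, 'w');
fprintf(fid, 'File,Class\n');
for k = 1:numel(segFile)
    fprintf(fid, '%s,%.4f\n', segFile{k}, segScore(k));
end
fclose(fid);
```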

Classes

featuresObject

  • Handles extraction of features and combination of features generated using different window lengths.

seizureModel

  • Handles training of SVM or RBT.

cvPart

  • Used instead of MATLAB's cvpartition object to handle cross-validation. Allows grouping of subject data from consecutive time periods in the training set, preventing the data leak that otherwise leads to overoptimistic local scoring of the model's performance.
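
The idea behind grouped cross-validation can be sketched as below. This is not the cvPart implementation: it is a minimal illustration with hypothetical group labels, in which every row from the same consecutive recording period is placed in the same fold.

```matlab
% Hypothetical group labels: rows sharing an ID come from the same
% consecutive recording period and must stay together
groupID = [1 1 1 2 2 3 3 3 4 4 5 5 6 6 6]';
k = 3;   % number of folds

% Assign whole groups to folds (round-robin here; a shuffle could be used instead)
[groups, ~, gIdx] = unique(groupID);
foldOfGroup = mod((0:numel(groups)-1)', k) + 1;
fold = foldOfGroup(gIdx(:));

for f = 1:k
    testRows = (fold == f);
    trainRows = ~testRows;
    % No group appears on both sides of the split
    assert(isempty(intersect(groupID(trainRows), groupID(testRows))));
    fprintf('Fold %d: %d train rows, %d test rows\n', f, sum(trainRows), sum(testRows));
end
```

Splitting rows at random instead would let neighbouring epochs from one recording period land in both the training and test folds, which is the leak described above.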

Requirements

  • Original Kaggle data or trained models
  • MATLAB 2016b:
    • Statistics and Machine Learning Toolbox

Notes

  • If seeds are now setting correctly, should score ~0.8059 (= 2nd place)
  • Uses new version of featuresObject that holds only one dataset, rather than both train and test sets
  • All parallel processing has been removed for hold out testing
  • All figures should be suppressed in prediction stage

To do

  • Save use structure and params.divS to each seizureModel
  • Add feature descriptions

kaggle-eeg's Issues

Training and Testing

I was able to run the code, but I need to edit parts of it to improve the results.
I need to understand which files are used to produce the training AUC. For example, are the models evaluated against a hold-out set of the training files? And for testing, is the result compared to previously trained data? Can you point me to these parts of the code?

Thanks

Temporary change to data sub paths

In featuresObject methods genFileListNoConcat and genFileListSomeSingles sub path has been changed to:

sDir = [paths.dataDir, str, '_', subs{s}, '_New'];

Rather than:

sDir = [paths.dataDir, str, '_', subs{s}, '\'];

It now doesn't match the locations created by copyTestLeakToTrain(params.paths) during initial set up and will need to be changed back after hold out testing.

Data-sub path setting is set in featuresObject

The main data path is set in params.paths.dataDir in predict.m and train.m, but the sub-path finding is hard-coded in featuresObject (in both the genFileListSomeSingles and genFileListNoConcat methods) to expect the paths from the competition data, i.e.

sDir = [paths.dataDir, str, '_', subs{s}, '_New'];
Or
sDir = [paths.dataDir, str, '_', subs{s}, '\'];

This path needs to be changed manually if folder names are different or if the code is run on an OS that uses '/' as the path separator. It should be migrated to the top-level scripts.
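
One possible way to make the sub-path independent of the OS is to build it with fullfile and filesep instead of a hard-coded '\'. The sketch below is only an illustration: the value of str and the data location are assumptions, and this is not the code currently in featuresObject.

```matlab
% Hypothetical values for illustration
paths.dataDir = fullfile('EEG Data', 'New');   % assumed relative data location
str = 'train';                                 % assumed folder prefix
subs = {'1', '2', '3'};                        % subject IDs

for s = 1:numel(subs)
    % fullfile inserts the correct separator for the current OS
    sDir = fullfile(paths.dataDir, [str, '_', subs{s}]);
    sDir = [sDir, filesep];   % trailing separator, as in the original expression
    disp(sDir);
end
```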

train.m error

Hello there, I was trying to run train.m and I get this error message.

Insufficient number of outputs from right hand side of equal sign to satisfy assignment.

Error in featuresObject/checkFiles (line 378)
                    files{1,d} = fn.name;

Error in featuresObject/compileFeatures (line 74)
            obj.files =  obj.checkFiles();

I am entirely new to MATLAB so I am struggling to identify where this error is coming from. I have the original Kaggle data located in the Original folder. I wasn't sure what to put in the New folder since you mentioned a second test set, so I simply copied test sets 1, 2, and 3 from the Original folder into the New folder. Before I got the error, the code was copying the training sets from the Original folder into the New folder. I also noticed that three .mat files were created in the directory: "Singles1.mat", "Singles2.mat", and "Singles3.mat".

I am not sure if that is where the error is coming from but maybe you can shed some light on what is going on here?

Run Time

Hello

The train script has been running for more than 16 hours and feature extraction is still not finished. I want to run it using the Parallel Computing Toolbox. I have gone through the MATLAB videos and saw the presentation for the 3rd algorithm on thread division, but I am not sure how to implement this. Could you please guide me on this issue? @garethjns

Thanks a lot

Feature information not saved in seizureModels

Some useful information about features is not saved in the seizureModel object, and needs to be reset manually in the predict script.

use - Structure that lists features included in model.
params.divS - Field containing list of epoch window lengths used to generate features.

Either save these in the seizureModel object during training, or save entire features object in seizureModel (minus actual data) for reference.
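
A minimal sketch of the first option, using a plain struct as a stand-in for the seizureModel object. The field names inside use and the modelMeta variable are hypothetical; the window lengths are the ones listed in the README.

```matlab
% Hypothetical feature-selection structure (field names are made up)
use = struct('bandPower', true, 'summaryStats', true, 'channelCorr', false);
params.divS = [240, 160, 80];   % epoch window lengths (s)

% Bundle the metadata so it travels with the trained model
modelMeta = struct('use', use, 'divS', params.divS);

% Save alongside the model, then reload at prediction time
metaFile = [tempname, '.mat'];   % temporary file for this example
save(metaFile, 'modelMeta');
loaded = load(metaFile);
disp(loaded.modelMeta.divS);
delete(metaFile);
```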

Solution File

I was able to run the code and obtain the results. However, I need to make a new solution file for the data I entered. Can you guide me on which part of the code I need to work on? @garethjns

Predict.m error

Greetings. When running predict.m I get the following error:

Creating basic features
Error using featuresObject.genFileListNoConcat (line 644)
To assign to or create a variable in a table, the number of rows must match the height of the table.
Error in featuresObject/getFileLists (line 265)
obj.genFileListNoConcat({'1','2','3'}, ...
Error in featuresObject/setFileLists (line 247)
getFileLists(obj);
Error in featuresObject (line 59)
obj = setFileLists(obj, {});
Error in predict (line 77)
featuresTest = featuresObject(params, use);

I would be very grateful if you could help me.
Best regards.

Running the code

Hello

I am trying to run the code but I currently have 2 issues:

1) I only have the original Kaggle data, so I am not sure how to modify the code as I don't have the new test set.
2) I don't have the file trainedModelsCompactTest.mat, so I get an error when running the code.

Thanks

Question on interpretation of output results

Hi there, I recently finished running the code after 5 days of training and testing over the data. I consider myself a beginner in machine learning, so I have a few questions on the output of the code.

Firstly, the predict.m file generated a .csv file that contains columns for "File" and "Class". Here is a screenshot of what that output looks like.

[Screenshot: master61svmgrbtg]

I guess my question is, why are there values lower than 0 and greater than 1? I looked at the evaluation overview on the Kaggle competition page, and it seems that the values for the class variable are supposed to be probabilities (between 0 and 1).

The second question I wanted to ask was that in the README you said that "If seeds are now setting correctly, should score ~0.8059 (= 2nd place)".

When I ran train.m, I got an output like this.
[Screenshot: train_output]

So are those AUC scores the ones that you were referring to in the README? Or are the README notes referring to some other score that is generated by predict.m? When I ran your code I did not get a score from predict.m. So right now I can only assume that the score that I am supposed to look at is the AUC of the SVM and RBT, but maybe you can clarify what those things mean for me?

Redundant import methods in featuresObject

Methods genFileListSomeSingles and genFileListNoConcat in featuresObject are used in predicting (?) and training respectively, but are slightly different versions of the same original function. genFileListSomeSingles can be used for both and genFileListNoConcat removed.

Model feature names

The trained models used for the Kaggle entry were trained with an older version of the features object that combined feature names from different epoch lengths incorrectly.

This bug is left in the new version of featuresObject included here to maintain compatibility with the Kaggle model (MATLAB looks for specific table column names). New models are still trained with incorrect feature names.

Feature naming code needs correcting in featuresObject.compileFeatures().

Training two SVMs instead of SVM and RBT

train.m is currently producing models scoring ~0.65. Predicting from previously trained models still scores ~0.8.

This is because two SVMs are being trained, rather than an SVM and RBT.

Features Object_checkFiles

In the featuresObject file, after featuresObject is called, compileFeatures is called, and its first step is the method checkFiles. This creates file names with the division (for example Train_divs240.mat) and then checks whether each file is there. Of course it doesn't exist because it has not been created. However, the debugger enters the first if branch, which says the file exists, and then gives an error at the highlighted line, as there is no file to save.

[Screenshot]

Original Kaggle data

Hello, garethjns
I'm a complete beginner with the Kaggle EEG competition, and I want to run your code to learn step by step. But sadly I find the EEG data has been removed from Kaggle. So I wonder if you still have the data and could send some to me?
Thank you very much.

Private AUC is different

I produced different results running the algorithm:

For SVM 0.72396
For RBT 0.61181
For Both 0.6943

I only have your result for both from the Brain journal, and it is 0.7952. SVM is creating better results in my case, so I am wondering if you can share your input on why this happens. I only changed the input methods, but the same algorithm is carried out: training on all data and using CV on the training set. Then testing is run using the new test set. I have a few questions to help me identify the reason for this:

1. Results for training are:
Grown weak learners: 100
SVM: general model AUC: 0.82756
RBT: general model AUC: 0.75574
I didn't find your training results, so if you can share them with me it will help me identify if we are doing something different.

2. SetSafeIdx is the method you use to get only the safe files and use only them for training, right? I wonder if you edited train_and_test_data_labels_safe.csv to make this work, as the training files are in the format Pat1Train_1_0.mat, so the method won't work. I ran it and now only 39953 files are chosen out of the 5047 total training files. By my calculation 3829 files should be safe; however, the method marks as safe a file that is in datalist but not in the .csv file. Is this the way you meant the method to work? However, when I edited it, the training AUC went down to:
SVM: general model AUC: 0.79039
RBT: general model AUC: 0.73239
Is there a way to know which training files you first entered into the algorithm, and how many? Are they the 5047 files in contest_train_data_labels.csv?

3. CV is done on both RUS and SVM, right?

4. If I want to regenerate the results of single models, do I edit copyTestLeakToTrain and the featuresObject methods to run on 1 patient only, so that only data from patient 1 is trained and tested individually? This means I will run the algorithm 3 times, for training and testing each, right?
