
Analysis of Kepler Objects of Interest using Machine Learning for Exoplanet Identification.

Home Page: https://youtu.be/lPUQ2x55jE4


kepler-exoplanet-analysis

This repository contains the source code as well as the visualisations and models created as a part of the Final Project for the Data Analytics course (UE18CS312) at PES University.

The Final Report for the project can be found in the report directory.
The Video Presentation can be found in the presentation directory.

Team Members

Aditeya Baral
Ameya Rajendra Bhamare
Saarthak Agarwal

Directory Structure

kepler-exoplanet-analysis
├── data
│   ├── [CLEANED]kepler-data.csv
│   └── kepler-data.csv
├── docs
│   └── Project Guidelines and Requirements Documents
├── model
│   ├── adaboost-error.model
│   ├── adaboost.model
│   ├── nn-model-error.h5
│   ├── nn-model.h5
│   ├── random-forest-error.model
│   ├── random-forest.model
│   ├── svm-error.model
│   └── svm.model
├── notebook
│   └── Notebooks containing data preprocessing, model training, visualisation and analysis
├── plots
│   └── All possible plots based on preprocessing, visualisation and model training
├── presentation
│   └── Presentation and Video
├── report
│   ├── Final Report
│   └── Plagiarism Check
└── scripts
    ├── getPlotsMatplotLib.py
    └── getPlotsPlotly.py

How to run the code?

Advisory: Each model training notebook takes a long time to run, sometimes almost an hour. This is because every hyperparameter combination tried by GridSearch is trained and scored once per cross-validation fold.
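To give a feel for why the run time grows so quickly, here is a minimal sketch that counts the model fits a grid search with cross-validation has to perform. The hyperparameter grid and the number of folds are hypothetical values chosen for illustration; they are not the ones used in the notebooks.

```python
from itertools import product

# Hypothetical grid: these names and values are assumptions, not the notebooks' settings.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
    "criterion": ["gini", "entropy"],
}
n_folds = 5  # assumed number of cross-validation folds

# Every combination of hyperparameter values is one candidate model.
combinations = list(product(*param_grid.values()))

# Each candidate is trained and scored once per fold.
total_fits = len(combinations) * n_folds

print(f"{len(combinations)} candidates x {n_folds} folds = {total_fits} fits")
```

Adding one more value to any hyperparameter multiplies the number of candidates, which is why even a modest grid can take a long time to search.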

  1. Clone this repository.
    git clone https://github.com/aditeyabaral/kepler-exoplanet-analysis
  2. Navigate to the repo and install the required dependencies.
    cd kepler-exoplanet-analysis/
    pip3 install -r requirements.txt
  3. Open Jupyter Notebook in the notebook directory.
    cd notebook/
    jupyter notebook
  4. Run any notebook by executing all the code cells.

Exoplanet Analysis

For several decades, planet identification has been a task performed by specialized astronomers and domain experts. With the advent of computational methods and access to satellite data from space missions, this trend has changed. For instance, NASA’s Exoplanet Exploration Program has provided us with vast amounts of data on celestial objects to assist in space exploration. One such mission of interest is the Kepler mission.

Over 4300 transiting exoplanets have been identified since the mission was launched in 2009. Its focus lay on exploring planets and planetary systems. It has provided us with a catalog of discoveries that help in computing planet occurrence rates as a function of size, star type, insolation flux and orbital period.

The Kepler Mission

The Kepler Space Telescope, launched in 2009, has been one of the most successful telescopes in aiding the discovery of exoplanets. It has identified several thousand objects of interest, with over 4300 of them confirmed exoplanets. The mission was designed to survey a portion of the Milky Way galaxy, discover Earth-size and smaller planets in or near the habitable zone, and determine the fraction of the billions of stars in our galaxy that might have planetary systems of their own.

The satellite was officially retired in October 2018 because it ran out of fuel. Years later, the data that Kepler collected continues to yield new exoplanet discoveries.

Dataset

Measurements from the Kepler satellite are publicly available. These records are maintained by Caltech in the Kepler Cumulative Object of Interest (KCOI) table, which contains 50 features recorded from Kepler data.

Predictive Modelling

This study focuses on a binary classification of Objects of Interest as “FALSE POSITIVE” or “CONFIRMED” exoplanets. NASA uses the label “FALSE POSITIVE” to indicate that an object flagged by the satellite turned out not to be a planet. We do not consider the observations labelled “CANDIDATE”, since NASA has not yet given them a final label and their true class is unknown to us. For our analysis, we used four models, each with its own characteristics, to tackle the problem from different angles.
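The label selection described above can be sketched as follows. This is only an illustration: the column name koi_disposition, the name column and the inline sample rows are assumptions made up for the example, standing in for data/kepler-data.csv.

```python
import csv
import io

# Made-up sample rows standing in for data/kepler-data.csv.
# The column names (name, koi_disposition, koi_period) are assumptions.
SAMPLE_CSV = """name,koi_disposition,koi_period
object-a,CONFIRMED,2.5
object-b,FALSE POSITIVE,19.9
object-c,CANDIDATE,1.7
object-d,CONFIRMED,3.5
"""

# Binary target: 0 for a false positive, 1 for a confirmed exoplanet.
LABELS = {"FALSE POSITIVE": 0, "CONFIRMED": 1}


def load_labelled_rows(handle, label_column="koi_disposition"):
    """Return (rows, labels), skipping rows whose label is not in LABELS."""
    rows, labels = [], []
    for row in csv.DictReader(handle):
        disposition = row[label_column].strip().upper()
        if disposition not in LABELS:
            continue  # e.g. CANDIDATE rows have no final label yet
        rows.append(row)
        labels.append(LABELS[disposition])
    return rows, labels


rows, labels = load_labelled_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), labels)
```

With a real file, the same function could be called on an open file handle instead of the inline string.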

The four models used are:

  1. Support Vector Machine
  2. Random Forest
  3. AdaBoost
  4. Feed-Forward Neural Network

Evaluation of Model Performance

To counter the imbalance of the dataset, we use evaluation metrics that take the imbalance into account. These include the F1 Score, Cohen Kappa score, Balanced Accuracy Score and finally the Confusion Matrix.
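As a rough illustration of how these metrics relate to the confusion matrix, the sketch below computes each of them from the four confusion-matrix counts using their standard definitions. The toy labels are invented for the example, and the notebooks may well rely on library implementations instead.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 1:
            fp += 1
        elif truth == 1 and guess == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall; ignores true negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the recall on each class (assumes both classes are present)."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between truth and prediction, corrected for chance agreement."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Toy, imbalanced example: 4 positives and 6 negatives (invented data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("confusion (tp, fp, fn, tn):", counts)
print("F1:", round(f1_score(*counts), 4))
print("balanced accuracy:", round(balanced_accuracy(*counts), 4))
print("Cohen kappa:", round(cohen_kappa(*counts), 4))
```

Unlike plain accuracy, none of these three scores can be pushed up simply by always predicting the majority class.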

Additionally, to test our classifier on different subsets of the data, we use K-Fold cross-validation across our entire dataset, so that the reported performance does not depend on a single train/test split.

Since the dataset is imbalanced, we use both:

  • A non-stratified split
  • A stratified split

The stratified split ensures that each fold keeps roughly the same proportion of positive and negative examples as the full dataset. We measure our classifier’s performance on each fold and finally take the mean of the performance achieved.
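A stratified split can be sketched in plain Python as below. This is a simplified illustration, not the splitter used in the notebooks: the function name stratified_k_fold, the toy labels and the choice of 5 folds are all assumptions, and the per-fold "score" is just the share of positive examples so that the averaging step can be shown without training a model.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Toy imbalanced dataset: 10 positives and 30 negatives (invented data).
labels = [1] * 10 + [0] * 30

fold_scores = []
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    # Placeholder score: the share of positives in the test fold.
    # In practice this would be a metric computed on a trained model.
    positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
    fold_scores.append(positive_share)

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

A non-stratified split would shuffle all indices together before cutting them into folds, so the share of positives could vary from fold to fold.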

Model Results

We obtain two sets of results: one with the attributes corresponding to error metrics included, and one with them omitted. Overall, the models built with the error attributes tend to do better than the models built after removing them, although in the F-1 scores below this holds for the SVM and the Neural Network but not for the Random Forest. All models perform almost equally well, with F-1 scores within one percentage point of each other; we find that the AdaBoost classifier outperforms the rest.

With Error Attributes

Model            Stratified F-1 Score    Non-Stratified F-1 Score
SVM              98.28%                  98.31%
Random Forest    97.68%                  97.61%
AdaBoost         98.01%                  98.17%
Neural Network   98.16%                  98.27%

Without Error Attributes

Model            Stratified F-1 Score    Non-Stratified F-1 Score
SVM              97.72%                  97.66%
Random Forest    98.00%                  98.13%
AdaBoost         98.03%                  98.11%
Neural Network   97.78%                  97.46%
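To compare the two sets of results side by side, the short sketch below subtracts the F-1 scores obtained without the error attributes from those obtained with them. The numbers are copied from the two tables above; the variable names are only illustrative.

```python
# (stratified, non-stratified) F-1 scores in percent, copied from the tables above.
f1_with_error = {
    "SVM": (98.28, 98.31),
    "Random Forest": (97.68, 97.61),
    "AdaBoost": (98.01, 98.17),
    "Neural Network": (98.16, 98.27),
}
f1_without_error = {
    "SVM": (97.72, 97.66),
    "Random Forest": (98.00, 98.13),
    "AdaBoost": (98.03, 98.11),
    "Neural Network": (97.78, 97.46),
}

# Positive difference: the model scored higher with the error attributes.
differences = {}
for model, with_scores in f1_with_error.items():
    without_scores = f1_without_error[model]
    differences[model] = tuple(
        round(w - wo, 2) for w, wo in zip(with_scores, without_scores)
    )

for model, (stratified, non_stratified) in differences.items():
    print(f"{model}: stratified {stratified:+.2f}, non-stratified {non_stratified:+.2f}")
```

The differences are in percentage points, so a value of +0.50 means the F-1 score was half a point higher with the error attributes included.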

Observations and Conclusions

  1. We observe that there is significant overlap between the different classes of exoplanets, making it difficult for scientists to predict their habitability.

  2. We also observe that most of the exoplanet characteristics are independent of each other, with very few attributes having significant correlation.

  3. The feature importance rankings differed across the algorithms, reflecting the differences in how each model works.

  4. Additionally, we see that the machine learning algorithms favour categorical variables for classification, as these allow them to form decisions faster and reduce entropy more quickly.
