theanhle / machine-learning-workflow-with-python

This project forked from not-a-builder/ml-workflow-iris


This is a comprehensive walkthrough of ML techniques with Python: Define the Problem, Specify Inputs & Outputs, Data Collection, Exploratory Data Analysis, Data Preprocessing, Model Design, Training, Evaluation


machine-learning-workflow-with-python's Introduction

1- Introduction

This is a comprehensive guide to ML techniques with Python, which took me more than 6 months to complete.

I think it is a great opportunity for anyone who wants to learn the machine learning workflow with Python completely. I have covered most of the methods that were implemented for iris until 2018. You can start to learn and review your knowledge about ML with a simple dataset, and try to learn and memorize the workflow for your journey in the data science world.

I am open to getting your feedback for improving this.

2- Machine Learning Workflow

Field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

If you have already read some machine learning books, you have noticed that there are different ways to stream data into machine learning.

Most of these books share the following steps (checklist):

  • Define the Problem (Look at the big picture)
  • Specify Inputs & Outputs
  • Data Collection
  • Exploratory Data Analysis
  • Data Preprocessing
  • Model Design, Training, and Offline Evaluation
  • Model Deployment, Online Evaluation, and Monitoring
  • Model Maintenance, Diagnosis, and Retraining

You can see my workflow in the image below:

You should feel free to adapt this checklist to your needs.

2-1 Real world Application Vs Competitions


3- Problem Definition

I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (Problem Formalization).

Problem Definition has four steps, which are illustrated in the picture below:

3-1 Problem Feature

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That is why it is nicknamed DieTanic. It is an unforgettable disaster.

It cost about $7.5 million to build the Titanic, and it sank after the collision. The Titanic dataset is a very good dataset for beginners to start a journey in data science and participate in Kaggle competitions.

We will use the classic Titanic dataset. This dataset contains information about 11 different variables:

  1. Survival
  2. Pclass
  3. Name
  4. Sex
  5. Age
  6. SibSp
  7. Parch
  8. Ticket
  9. Fare
  10. Cabin
  11. Embarked

Note: You must answer the following question: How does your company expect to use and benefit from your model?

3-2 Aim

It is your job to predict if a passenger survived the sinking of the Titanic or not. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

3-3 Variables

  1. Age :

    1. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
  2. SibSp :

    1. The dataset defines family relations in this way...

      a. Sibling = brother, sister, stepbrother, stepsister

      b. Spouse = husband, wife (mistresses and fiancés were ignored)

  3. Parch:

    1. The dataset defines family relations in this way...

      a. Parent = mother, father

      b. Child = daughter, son, stepdaughter, stepson

      c. Some children travelled only with a nanny, therefore parch=0 for them.

  4. Pclass :

    • A proxy for socio-economic status (SES)
      • 1st = Upper
      • 2nd = Middle
      • 3rd = Lower
  5. Embarked :

    • nominal datatype
  6. Name:

    • nominal datatype. It could be used in feature engineering to derive the gender from the title
  7. Sex:

    • nominal datatype
  8. Ticket:

    • has no impact on the outcome variable. Thus, it will be excluded from analysis
  9. Cabin:

    • is a nominal datatype that can be used in feature engineering
  10. Fare:

    • Indicating the fare
  11. PassengerID:

    • has no impact on the outcome variable. Thus, it will be excluded from analysis
  12. Survival:
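
The Name and Age notes above can be turned into simple features. The following is a minimal sketch, not the notebook's own code: it assumes names follow a "Surname, Title. Given names" layout, and the helper names `extract_title` and `is_estimated_age` as well as the sample names are hypothetical.

```python
import re

# Assumption: names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the text between the comma and the first period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


def is_estimated_age(age):
    """Ages of 1 or more that end in .5 are treated as estimates (see the Age note)."""
    return age >= 1 and age % 1 == 0.5


# Hypothetical sample values, not rows from the real dataset.
print(extract_title("Doe, Mr. John"))       # Mr
print(extract_title("Doe, Mrs. Jane Ann"))  # Mrs
print(is_estimated_age(28.5))               # True
print(is_estimated_age(0.5))                # False
```

A title such as "Mr" or "Mrs" could then be mapped to a gender or used as a categorical feature in its own right.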

4- Inputs & Outputs


4-1 Inputs

What's our input for this problem:

  1. train.csv
  2. test.csv

4-2 Outputs

  1. Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
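
As a rough illustration of the accuracy score described above, here is a minimal sketch in plain Python. The function name `accuracy` and the sample labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 3 of the 4 predictions match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```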

The Outputs should have exactly 2 columns:

1. PassengerId (sorted in any order)
2. Survived (contains your binary predictions: 1 for survived, 0 for deceased)
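
The two-column output format can be sketched with the standard `csv` module. This is only an illustration: the function name `write_submission` and the PassengerId values are hypothetical, and the result is written to an in-memory buffer rather than a real file.

```python
import csv
import io


def write_submission(passenger_ids, predictions, out_file):
    """Write exactly two columns: PassengerId and Survived (0 or 1)."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("need one prediction per PassengerId")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, prediction in zip(passenger_ids, predictions):
        if prediction not in (0, 1):
            raise ValueError("Survived must be 0 or 1")
        writer.writerow([passenger_id, prediction])


# Hypothetical ids and predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buffer)
print(buffer.getvalue())
```

To produce a real file, pass an object opened with `open("submission.csv", "w", newline="")` instead of the buffer.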

5- Loading Packages

In this kernel we are using the following packages:

6- Exploratory Data Analysis (EDA)

In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

  • Which variables suggest interesting relationships?
  • Which observations are unusual?
  • Analysis of the features!

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

  • 6-1 Data Collection
  • 6-2 Visualization
  • 6-3 Data Preprocessing
  • 6-4 Data Cleaning

Note: You can change the order of the above steps.

6-1 Data Collection

Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypothesis and evaluate outcomes of the particular collection.[techopedia]
I start data collection by loading the training and testing datasets into pandas DataFrames.
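
Loading a CSV into a pandas DataFrame might look like the sketch below. The rows are hypothetical placeholders in an assumed column layout so that the example runs without the Kaggle files; with the real data you would call `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` instead.

```python
import io

import pandas as pd

# Hypothetical rows in an assumed column layout, not the real Kaggle data.
TRAIN_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T1,7.25,,S
2,1,1,"Doe, Mrs. Jane",female,38,1,0,T2,71.28,C85,C
3,1,3,"Roe, Miss. Ann",female,,0,0,T3,7.92,,S
"""

# With the downloaded files this would be pd.read_csv("train.csv").
train = pd.read_csv(io.StringIO(TRAIN_CSV))

print(train.shape)           # (3, 12)
print(train.isnull().sum())  # count of missing values per column
```

Checking the shape and the missing-value counts right after loading is a quick way to see which columns will need cleaning later.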

6-2 Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]

In this section I show you 11 plots with matplotlib and seaborn, which are listed in the picture below:
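
As one small illustration of this kind of plot, the sketch below draws a bar chart with matplotlib. It is not one of the notebook's own 11 plots: the `(Sex, Survived)` pairs are hypothetical stand-ins for two columns of train.csv, and the output file name is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical (Sex, Survived) pairs, not real passenger records.
records = [("male", 0), ("female", 1), ("female", 1), ("male", 0), ("male", 1)]

# Count how many passengers fall into each (Sex, Survived) group.
counts = {}
for sex, survived in records:
    key = (sex, survived)
    counts[key] = counts.get(key, 0) + 1

labels = sorted(counts)
heights = [counts[key] for key in labels]

fig, ax = plt.subplots()
ax.bar([f"{sex}, survived={flag}" for sex, flag in labels], heights)
ax.set_xlabel("Sex and Survived value")
ax.set_ylabel("Number of passengers")
ax.set_title("Passenger counts by Sex and Survived (hypothetical data)")
fig.savefig("survival_by_sex.png")
```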

Help

I hope you have enjoyed reading my python notebook.

If you have any problem running the notebook, please open an issue here on GitHub.

Most of my notebooks need a dataset as input.

To use the correct data, please download the dataset from the Kaggle site and put it in your notebook folder.

Mj Bhamnai


Have Fun!


Please Fork the Repository to continue...

machine-learning-workflow-with-python's People

Contributors

mbahmani avatar
