email-classifier's Introduction

EMAIL-CLASSIFICATION:

This is a multi-class text classification project, implemented in Python, in which various classifiers are trained and tested. Emails are classified by their content into three categories: Normal, Spam and Fraud.

Several classifiers are trained on features extracted with a TF-IDF vectorizer: Support Vector Machine (SVM), K-Nearest Neighbour, Multinomial Naïve Bayes, Decision Tree, Logistic Regression, SVM with Stochastic Gradient Descent (SGD-SVM) and Logistic Regression with Stochastic Gradient Descent (SGD-LR). Ensemble classifiers are trained in the same manner: Random Forest (RF), AdaBoost, Bagging (BGC), Extra Trees and Vote over various classifier combinations. The effect of stemming on model performance is also observed. Additionally, the classifiers are trained on features extracted using Count Vectorizer.

Finally, all the models are evaluated on standard evaluation metrics: Accuracy, Precision, Recall, F-score and Confusion Matrix. Vote on SVM, BGC and RF outperforms all other models, followed by SGD-SVM, trained on TF-IDF features without stemming.
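
A vote over SVM, BGC and RF on TF-IDF features could be set up roughly as in the sketch below. This is a minimal illustration, not the project's actual code: the toy corpus is invented, and the linear kernel, hard voting and the estimator settings are assumptions, since the tuned parameters are not listed here.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus, invented for illustration only (not taken from the real dataset).
texts = [
    "meeting agenda for monday attached",
    "please review the quarterly report",
    "lunch tomorrow with the project team",
    "win a free prize click here now",
    "cheap pills limited time offer",
    "free offer click now to claim prize",
    "urgent transfer of funds from late minister",
    "confidential business proposal send bank details",
    "beneficiary of inheritance funds send bank details",
]
labels = ["Normal"] * 3 + ["Spam"] * 3 + ["Fraud"] * 3

# Hard (majority) voting over the three classifiers; settings are assumptions.
vote = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("bgc", BaggingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)

# TF-IDF features without stemming, followed by the voting ensemble.
model = make_pipeline(TfidfVectorizer(), vote)
model.fit(texts, labels)

predictions = model.predict(
    ["claim your free prize now", "send bank details for funds transfer"]
)
print(predictions)
```

Wrapping the vectorizer and the ensemble in one pipeline keeps the same TF-IDF vocabulary for training and prediction.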

PREREQUISITES:

  • Python 3.x.
  • Libraries:
    • pandas
    • scikit-learn (sklearn)
    • NLTK
    • NumPy
    • Matplotlib
    • string (standard library)
    • re (standard library)
    • random (standard library)

CODE BRIEF:

The code was written in Python 3.5 and executed in Spyder, which is part of Anaconda3. There are two Python files, ‘Extract_email.py’ and ‘Email_Classification.py’, which handle data extraction, and text processing and classification, respectively.

I. Extract_email.py: This file handles data extraction. 1000 fraud emails are extracted from the ‘fradulent_emails.txt’ file, which contains 4075 emails, and 1000 emails each for the Spam and Normal categories are extracted from the ‘emails.csv’ file, which contains 5730 emails, a mix of Spam and Normal. Finally, all the extracted emails are concatenated into one csv file. This csv file is the final dataset: 3000 emails, with 1000 emails per category.

NOTE: For the proper execution of the code, update the paths for:

  • the final csv file to be created (‘final_dataset.csv’)
  • fradulent_emails.txt file (file containing fraud emails)
  • emails.csv file (file containing spam and normal emails)
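
The balanced sampling and concatenation step described above can be sketched as follows. This is only an illustration under stated assumptions: the column names `text`, `spam` and `label`, the helper `build_balanced_dataset` and the random sampling are hypothetical, and small in-memory stand-ins replace the real files so the sketch runs on its own. Parsing ‘fradulent_emails.txt’ is not shown.

```python
import pandas as pd


def build_balanced_dataset(fraud_texts, spam_normal_df, per_class=1000, seed=0):
    """Sample `per_class` emails per category and concatenate them.

    Assumes `spam_normal_df` has a `text` column and a `spam` column
    (1 = Spam, 0 = Normal); these column names are hypothetical.
    """
    fraud = pd.DataFrame({"text": fraud_texts, "label": "Fraud"})
    is_spam = spam_normal_df["spam"] == 1
    spam = spam_normal_df.loc[is_spam, ["text"]].assign(label="Spam")
    normal = spam_normal_df.loc[~is_spam, ["text"]].assign(label="Normal")
    parts = [
        part.sample(n=per_class, random_state=seed)
        for part in (fraud, spam, normal)
    ]
    return pd.concat(parts, ignore_index=True)


# Tiny in-memory stand-ins for the real input files.
fraud_texts = [f"fraud email {i}" for i in range(8)]
spam_normal_df = pd.DataFrame(
    {"text": [f"email {i}" for i in range(12)], "spam": [1, 0] * 6}
)

dataset = build_balanced_dataset(fraud_texts, spam_normal_df, per_class=5)
print(dataset["label"].value_counts())

# With the real data, the result would then be written to the final csv file:
# dataset.to_csv("final_dataset.csv", index=False)
```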

II. Email_Classification.py: This file covers the complete process of email processing and classification:
  1. Data Preprocessing: Functions are created for removing punctuation and stopwords, and another for stemming the content. Two vectorizers are used to extract the relevant features: TF-IDF Vectorizer and Count Vectorizer. First, the entire classification process is performed on the TF-IDF features; then it is repeated on the Count Vectorizer features and the results are observed. The features are split into train and test sets in the ratio 7:3.

  2. Text Classification: Various classifiers are trained on the features extracted above and their performance is observed. Before training, parameter tuning is performed to identify the optimum parameters for each classifier.

  3. Evaluation Metrics: The classifiers are evaluated on the basis of:

  • Accuracy
  • Confusion Matrix
  • Precision, Recall and F-Score
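
The preprocessing, feature extraction and 7:3 split from step 1 might look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the `preprocess` helper and the toy emails are hypothetical, scikit-learn's built-in English stopword list stands in for whichever stopword list the project uses, and stemming is left as an optional callable (for example the `stem` method of an NLTK stemmer) rather than hard-wired.

```python
import string

from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfVectorizer,
)
from sklearn.model_selection import train_test_split

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stem=None):
    """Lowercase, strip punctuation, drop stopwords, optionally stem tokens."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


# Toy emails, invented for illustration only.
emails = [
    "Meeting agenda for Monday is attached.",
    "Please review the quarterly report!",
    "Lunch tomorrow with the project team?",
    "Win a FREE prize, click here now!!!",
    "Cheap pills: limited time offer.",
    "Free offer! Click now to claim your prize.",
    "Claim your discount voucher today.",
    "URGENT: transfer of funds from late minister.",
    "Confidential business proposal, send bank details.",
    "Beneficiary of inheritance funds; send bank details.",
]
labels = ["Normal"] * 3 + ["Spam"] * 4 + ["Fraud"] * 3

cleaned = [preprocess(email) for email in emails]

# Two alternative feature sets over the same cleaned text.
tfidf_features = TfidfVectorizer().fit_transform(cleaned)
count_features = CountVectorizer().fit_transform(cleaned)

# 7:3 train/test split of the TF-IDF features.
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_features, labels, test_size=0.3, random_state=0
)
print(X_train.shape[0], X_test.shape[0])
```

The same split call can be repeated with `count_features` to compare the two vectorizers on identical rows.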

NOTE: To execute the code, set the variable ‘input_dataset’ to the path of the final_dataset.csv file created in the previous step.
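
Parameter tuning followed by evaluation (steps 2 and 3 above) can be sketched for one classifier, SGD-SVM. This is a hedged example rather than the project's procedure: grid search with 3-fold cross-validation, the `alpha` grid, the stratified split and the generated toy corpus are all assumptions made so the sketch runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Generated toy corpus, invented for illustration only.
normal = [f"meeting schedule and project notes for week {i}" for i in range(10)]
spam = [f"win a free prize now with discount offer {i}" for i in range(10)]
fraud = [f"urgent transfer of inheritance funds to bank account {i}" for i in range(10)]
texts = normal + spam + fraud
labels = ["Normal"] * 10 + ["Spam"] * 10 + ["Fraud"] * 10

features = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

# SGDClassifier with hinge loss is a linear SVM trained by SGD (SGD-SVM).
# The parameter grid is a hypothetical example.
search = GridSearchCV(
    SGDClassifier(loss="hinge", random_state=0),
    param_grid={"alpha": [1e-4, 1e-3]},
    cv=3,
)
search.fit(X_train, y_train)
y_pred = search.predict(X_test)

class_names = ["Normal", "Spam", "Fraud"]
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=class_names)

print("Best parameters:", search.best_params_)
print("Accuracy:", accuracy)
print(cm)  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, labels=class_names, zero_division=0))
```

`classification_report` prints per-class Precision, Recall and F-score, so together with the accuracy and the confusion matrix it covers the metrics listed above.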
