
anjanatiha / toxic-comment-classification


Classify comments into the categories "Toxic", "Severe Toxic", "Obscene", "Threat", "Insult", "Identity Hate", "Any of the Above", and "None of the Above".

License: MIT License

Jupyter Notebook 100.00%
machine-learning machine-learning-algorithms classification multi-label-classification grid-search text-analysis preprocessing feature-extraction feature-engineering bagofwords bag-of-words tf-idf text-mining


Toxic Comment Classification

Technology: Python, Machine Learning

Duration: Aug - Sep 2018

Description

  1. Classified around 130,000 text comments (34 MB in total) into the categories "Toxic", "Severe Toxic", "Obscene", "Threat", "Insult", "Identity Hate", "Any of the Above", and "None of the Above".
  2. Used 17 features from the AAAI 2018 paper "Anatomy of Online Hate: Developing a Taxonomy and Machine Learning Models for Identifying and Classifying Hate in Online News Media" by Salminen, Almerekhi, et al.
  3. Built machine learning training pipelines that read the data file, create training and testing datasets, preprocess the text, extract features, and train and evaluate multiple models using grid search.
  4. Generated an aggregated report and visualizations of the performance of the different machine learning models.
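
The two aggregate categories, "Any of the Above" and "None of the Above", can be derived from the six base labels. The following is a minimal sketch of one way to do that, not the notebook's actual code: the column names in `LABELS` and the helper `add_aggregate_labels` are hypothetical, and it assumes each base label is stored as 0 or 1.

```python
# Hypothetical column names for the six base categories (an assumption,
# not taken from the repository).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def add_aggregate_labels(row):
    """Return a copy of `row` with the two derived labels added.

    `row` maps each name in LABELS to 0 or 1.
    """
    any_flag = int(any(row[name] for name in LABELS))
    out = dict(row)
    out["any_of_the_above"] = any_flag        # 1 if at least one base label is set
    out["none_of_the_above"] = 1 - any_flag   # complement of the flag above
    return out


example = {"toxic": 1, "severe_toxic": 0, "obscene": 1,
           "threat": 0, "insult": 0, "identity_hate": 0}
print(add_aggregate_labels(example))
```

Because the two derived labels are complements, a model only needs to predict one of them; the other follows directly.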

Procedure:

  1. Build machine learning training pipelines that read the data file, create training and testing datasets, preprocess the text (cleaning, tokenization, and counts of single-character tokens, URLs, modal words, and tokens with non-alphabetic middle characters), extract features, and train and evaluate multiple models using grid search.
  2. In the preprocessing unit, replace non-standard input features with a default value.
  3. Build a feature class for the following features:
    • Count of exclamations, periods, question marks, punctuation, special characters, repeated punctuation, and quotes in each comment.
    • Count of single-character tokens in each comment.
    • Count of URLs in each comment.
    • Length of the comment (in characters and in tokens).
    • Total number of capital letters in the tokens.
    • Total number of emoticons in each comment.
    • Total number of modal words in each comment. The modal words used are: can, could, may, might, must, will, would, and should.
    • Total number of tokens with non-alphabetic characters in the middle.
  4. Build a model training pipeline for both classification and regression.
  5. Generate an aggregated report on the performance of the various models.
  6. Visualize the performance of the different models.
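
The hand-crafted features listed in step 3 can be sketched with the Python standard library alone. This is an illustrative approximation, not the notebook's feature class: the function name `extract_features`, the regular expressions, the `SPECIAL` character set, whitespace tokenization, and the choice to strip URLs before counting punctuation are all assumptions. Non-string input is replaced with an empty string, following the default-value idea in step 2.

```python
import re
import string

# Modal words as listed in the procedure above.
MODALS = {"can", "could", "may", "might", "must", "will", "would", "should"}

# All patterns and sets below are assumptions chosen for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"[:;=][\-']?[)(DPp\]\[/\\|]")
REPEATED_PUNCT_RE = re.compile(r"([!?.,;:])\1+")
MID_NON_ALPHA_RE = re.compile(r"[A-Za-z]+[^A-Za-z\s]+[A-Za-z]+")
SPECIAL = set("@#$%^&*~+=<>|\\/_")


def extract_features(comment):
    """Return a dict of count features for one comment."""
    if not isinstance(comment, str):
        comment = ""  # default value for non-standard input
    n_urls = len(URL_RE.findall(comment))
    text = URL_RE.sub(" ", comment)  # count punctuation outside URLs only
    tokens = text.split()            # simple whitespace tokenization
    return {
        "n_exclamations": text.count("!"),
        "n_periods": text.count("."),
        "n_question_marks": text.count("?"),
        "n_punctuation": sum(c in string.punctuation for c in text),
        "n_special_chars": sum(c in SPECIAL for c in text),
        "n_repeated_punct": sum(1 for _ in REPEATED_PUNCT_RE.finditer(text)),
        "n_quotes": text.count('"') + text.count("'"),
        "n_single_char_tokens": sum(len(t) == 1 for t in tokens),
        "n_urls": n_urls,
        "n_chars": len(comment),
        "n_tokens": len(tokens),
        "n_capitals": sum(c.isupper() for t in tokens for c in t),
        "n_emoticons": len(EMOTICON_RE.findall(text)),
        "n_modals": sum(t.strip(string.punctuation).lower() in MODALS
                        for t in tokens),
        "n_mid_non_alpha": sum(MID_NON_ALPHA_RE.fullmatch(t) is not None
                               for t in tokens),
    }


features = extract_features(
    "You MUST stop!!! Visit http://example.com now :) a b2b deal?"
)
for name, value in features.items():
    print(name, value)
```

The resulting dictionaries can be stacked into a numeric matrix and combined with bag-of-words or TF-IDF vectors before model training.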

Tool Requirements: Anaconda, Python

Current Version: v1.0.0.0

Last Update: 09.21.2018
