
This project forked from aniass/product-categorization-nlp


Product Categorization

Multi-Class Text Classification of products based on their description

General info

The goal of the project is to categorize products based on their descriptions using Machine Learning and Deep Learning (MLP, CNN, Distilbert) algorithms. Additionally, we created Doc2vec and Word2vec models, performed Topic Modeling (LDA analysis) and EDA (data exploration, aggregation and cleaning).

Dataset

The dataset comes from http://makeup-api.herokuapp.com/ and was obtained via its API. The data collection is described in my previous project, Extracting Data using API.

The dataset contains real descriptions of makeup products, where each description is labeled with a specific product category.

Motivation

The aim of the project is multi-class text classification of makeup products based on their descriptions. Given a text input, we predict its category; there are five categories corresponding to different types of makeup products. In our analysis we used different methods of text representation (such as BoW + TF-IDF, Doc2vec, Distilbert embeddings), feature extraction (Word2vec, Doc2vec) and various Machine Learning/Deep Learning algorithms to get more accurate predictions and chose the most accurate one for our problem.

Project contains:

  • Multi-class text classification with ML algorithms - Text_analysis.ipynb
  • Text classification with Distilbert model - Bert_products.ipynb
  • Text classification with MLP and Convolutional Neural Network (CNN) models - Text_nn.ipynb
  • Text classification with Doc2vec model - Doc2vec.ipynb
  • Word2vec model - Word2vec.ipynb
  • LDA - Topic modeling - LDA_Topic_modeling.ipynb
  • EDA analysis - Products_analysis.ipynb
  • Python script to train ML models - text_model.py
  • Python script to train ML models with SMOTE method - text_model_smote.py
  • Python script to clean text data - clean_data.py
  • Python script to generate predictions from a trained model - predictions.py
  • data, models - data and models used in the project.
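As a rough illustration of the cleaning step, a minimal description cleaner might look like the sketch below. The function name and the exact cleaning rules are assumptions for illustration, not the actual contents of clean_data.py:

```python
import re
import string

# Hypothetical sketch of a description-cleaning step; the exact rules
# used in the project's clean_data.py may differ.
def clean_description(text: str) -> str:
    text = text.lower()                          # normalize case
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags left over from scraping
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    text = re.sub(r"\d+", " ", text)             # remove digits
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

print(clean_description("A <b>Matte</b> Lipstick, shade #12!"))
# -> "a matte lipstick shade"
```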

Summary

To solve the problem of categorizing products based on their descriptions, we applied multi-class text classification. We started with data analysis and pre-processing of our dataset. Then we used combinations of text representations such as BoW + TF-IDF and Doc2vec. We experimented with several Machine Learning algorithms: Logistic Regression, Linear SVM, Multinomial Naive Bayes, Random Forest, Gradient Boosting, and Neural Networks: MLP and Convolutional Neural Network (CNN), using different combinations of text representations and embeddings. Additionally, we applied transfer learning with a pretrained Distilbert model from the Hugging Face Transformers library.

From our experiments we can see that the tested models achieve high overall accuracy and similar results on our problem. The SVM (BoW + TF-IDF) model gives the best validation accuracy, equal to 96%. Logistic Regression performed very well with both BoW + TF-IDF and Doc2vec, achieving accuracy similar to MLP. CNN with word embeddings also has a result (93%) comparable to MLP. Transfer learning with the Distilbert model likewise gave results similar to the previous models, with a test set accuracy of 93%. This shows that the larger models did not give better results on our problem than simple Machine Learning models such as SVM.
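The best-performing setup (BoW + TF-IDF features feeding a linear SVM) can be sketched with scikit-learn. The toy descriptions and two categories below are invented for illustration and are not the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the real product descriptions and categories.
texts = [
    "matte long lasting lipstick with rich color",
    "creamy hydrating lipstick for dry lips",
    "volumizing waterproof mascara for long lashes",
    "lengthening black mascara with curved brush",
]
labels = ["lipstick", "lipstick", "mascara", "mascara"]

# Bag-of-words + TF-IDF weighting into a linear SVM, mirroring the
# SVC (BoW + TF-IDF) configuration reported above.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["waterproof mascara for volume"])[0])  # -> mascara
```

On the real data the vectorizer and SVM hyperparameters would be tuned (e.g. via cross-validation) rather than left at defaults.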

| Model               | Embeddings           | Accuracy |
|---------------------|----------------------|----------|
| SVC                 | BoW + TF-IDF         | 0.96     |
| MLP                 | Word embeddings      | 0.93     |
| CNN                 | Word embeddings      | 0.93     |
| Distilbert          | Distilbert tokenizer | 0.93     |
| Gradient Boosting   | BoW + TF-IDF         | 0.93     |
| Random Forest       | BoW + TF-IDF         | 0.92     |
| SVM                 | Doc2vec (DBOW)       | 0.92     |
| Logistic Regression | BoW + TF-IDF         | 0.91     |
| Logistic Regression | Doc2vec (DM)         | 0.90     |
| Naive Bayes         | BoW + TF-IDF         | 0.88     |

Technologies

The project is created with:

  • Python 3.6/3.8
  • libraries: NLTK, gensim, Keras, TensorFlow, Hugging Face transformers, scikit-learn, pandas, numpy, seaborn, pyLDAvis.

Running the project:

To run this project use Jupyter Notebook or Google Colab.

You can run the scripts in the terminal:

python clean_data.py
python text_model.py
python text_model_smote.py
