
This project forked from aniass/product-categorization-nlp


Product Categorization

Multi-Class Text Classification of products based on their description

General info

The goal of the project is to categorize products based on their descriptions using Machine Learning and Deep Learning (MLP, CNN, Distilbert) algorithms. Additionally, we created Doc2vec and Word2vec models, performed Topic Modeling (LDA analysis) and EDA (data exploration, aggregation and cleaning).

Dataset

The dataset comes from http://makeup-api.herokuapp.com/ and was obtained via its API. The data collection is described in my previous project, Extracting Data using API.

The dataset contains real descriptions of makeup products, where each description is labeled with a specific product category.

Motivation

The aim of the project is multi-class text classification of makeup products based on their descriptions. Given a text input, we predict its category; there are five categories corresponding to different types of makeup products. In our analysis we used different methods of text representation (such as BoW + TF-IDF, Doc2vec, Distilbert embeddings), feature extraction (Word2vec, Doc2vec) and various Machine Learning/Deep Learning algorithms to get more accurate predictions and chose the most accurate one for our problem.

Project contains:

  • Multi-class text classification with ML algorithms - Text_analysis.ipynb
  • Text classification with Distilbert model - Bert_products.ipynb
  • Text classification with MLP and Convolutional Neural Network (CNN) models - Text_nn.ipynb
  • Text classification with Doc2vec model - Doc2vec.ipynb
  • Word2vec model - Word2vec.ipynb
  • LDA - Topic modeling - LDA_Topic_modeling.ipynb
  • EDA analysis - Products_analysis.ipynb
  • Python script to train ML models - text_model.py
  • Python script to train ML models with SMOTE method - text_model_smote.py
  • Python script to clean text data - clean_data.py
  • Python script to generate predictions from a trained model - predictions.py
  • data, models - data and models used in the project.
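As a rough illustration of the cleaning step, a minimal description cleaner might look like the sketch below. The function name and the exact cleaning rules are assumptions for illustration, not the actual contents of clean_data.py:

```python
import re
import string

# Hypothetical sketch of a description-cleaning step; the exact rules
# used in the project's clean_data.py may differ.
def clean_description(text: str) -> str:
    text = text.lower()                          # normalize case
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags left over from scraping
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    text = re.sub(r"\d+", " ", text)             # remove digits
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

print(clean_description("A <b>Matte</b> Lipstick, shade #12!"))
# -> "a matte lipstick shade"
```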

Summary

To solve the problem of categorizing products based on their descriptions, we applied multi-class text classification. We started with data analysis and pre-processing of our dataset. Then we used combinations of text representations such as BoW + TF-IDF and Doc2vec. We experimented with several Machine Learning algorithms: Logistic Regression, Linear SVM, Multinomial Naive Bayes, Random Forest, Gradient Boosting, and Neural Networks: MLP and Convolutional Neural Network (CNN), using different combinations of text representations and embeddings. Additionally, we applied transfer learning with a pretrained Distilbert model from the Hugging Face Transformers library.

From our experiments we can see that the tested models achieve high overall accuracy and similar results on our problem. The SVM (BoW + TF-IDF) model gives the best validation accuracy, equal to 96%. Logistic Regression performed very well with both BoW + TF-IDF and Doc2vec, achieving accuracy similar to MLP. CNN with word embeddings also has a result (93%) comparable to MLP. Transfer learning with the Distilbert model likewise gave results similar to the previous models, with a test set accuracy of 93%. This shows that the larger models did not give better results on our problem than simple Machine Learning models such as SVM.
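The best-performing setup (BoW + TF-IDF features feeding a linear SVM) can be sketched with scikit-learn. The toy descriptions and two categories below are invented for illustration and are not the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the real product descriptions and categories.
texts = [
    "matte long lasting lipstick with rich color",
    "creamy hydrating lipstick for dry lips",
    "volumizing waterproof mascara for long lashes",
    "lengthening black mascara with curved brush",
]
labels = ["lipstick", "lipstick", "mascara", "mascara"]

# Bag-of-words + TF-IDF weighting into a linear SVM, mirroring the
# SVC (BoW + TF-IDF) configuration reported above.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["waterproof mascara for volume"])[0])  # -> mascara
```

On the real data the vectorizer and SVM hyperparameters would be tuned (e.g. via cross-validation) rather than left at defaults.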

| Model               | Embeddings           | Accuracy |
|---------------------|----------------------|----------|
| SVC                 | BoW + TF-IDF         | 0.96     |
| MLP                 | Word embeddings      | 0.93     |
| CNN                 | Word embeddings      | 0.93     |
| Distilbert          | Distilbert tokenizer | 0.93     |
| Gradient Boosting   | BoW + TF-IDF         | 0.93     |
| Random Forest       | BoW + TF-IDF         | 0.92     |
| SVM                 | Doc2vec (DBOW)       | 0.92     |
| Logistic Regression | BoW + TF-IDF         | 0.91     |
| Logistic Regression | Doc2vec (DM)         | 0.90     |
| Naive Bayes         | BoW + TF-IDF         | 0.88     |

Technologies

The project is created with:

  • Python 3.6/3.8
  • libraries: NLTK, gensim, Keras, TensorFlow, Hugging Face transformers, scikit-learn, pandas, numpy, seaborn, pyLDAvis.

Running the project:

To run this project use Jupyter Notebook or Google Colab.

You can run the scripts in the terminal:

python clean_data.py
python text_model.py
python text_model_smote.py
