The following is an open project done under the VLG club of IITR, centred around classifying text by recognising whether or not it was generated by a Large Language Model.
Given dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

Augmented dataset by MIT: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text
- train.csv - contains an id column, a text column, and a generated column holding the classification labels
- the augmented dataset - contains a text column for classification and a label column for the prediction results
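A minimal sketch of loading and inspecting the two datasets with pandas; the file paths and names below are assumptions based on the usual Kaggle input layout, not necessarily the exact ones used in the notebooks.

```python
import pandas as pd

# Paths/file names are placeholders based on the standard Kaggle input layout;
# adjust them to match the datasets attached to your notebook.
train = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
augmented = pd.read_csv("/kaggle/input/augmented-data-for-llm-detect-ai-generated-text/final_train.csv")

print(train.columns.tolist())             # includes id, text and the generated label
print(train["generated"].value_counts())  # 0 = human-written, 1 = LLM-generated
```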
The setup for each model used is given separately below:
![Screenshot 2024-01-15 at 10 00 56 PM](https://private-user-images.githubusercontent.com/129365476/296847010-17bc2e7f-6a64-48fd-9220-3383c62e517b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3MDEwLTE3YmMyZTdmLTZhNjQtNDhmZC05MjIwLTMzODNjNjJlNTE3Yi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jMmNiZmQxYmY4MjkxMGZjNWFkYjRkM2VkNmQyZDNhMWZhZDYyYTc5ZmViYWM5NDZlNmM2ODczYTg1YzQ1YWI5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.3A_37nsUwOyXskjRycldyqR20oeLUU_UsLdTuj_5Xsw)
In this setup I also downloaded the output files after the first run with Internet access turned on and uploaded them into the input directory, so as to avoid having to repeat an Internet-on first run every time I reopened the notebook.
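A minimal sketch of this save-then-reload pattern, assuming a Hugging Face transformers model; the model name and the private-dataset path are placeholders, not the exact ones from the notebooks.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# First run (Internet on): download the pretrained weights and save them locally.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.save_pretrained("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")

# Subsequent runs (Internet off): after downloading /kaggle/working/model and
# re-uploading it as a private Kaggle dataset, load from the input directory.
model = AutoModelForSequenceClassification.from_pretrained("/kaggle/input/my-saved-model")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/my-saved-model")
```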
![Screenshot 2024-01-15 at 10 10 20 PM](https://private-user-images.githubusercontent.com/129365476/296847568-be025669-4a2c-48a2-bf24-ed7c42a69f80.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3NTY4LWJlMDI1NjY5LTRhMmMtNDhhMi1iZjI0LWVkN2M0MmE2OWY4MC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jNGViZGY2MjNiNDQxZjc5YmNlMmQyZTVjMDk4OWEwNTdjY2VkMGRmODk3ZDA2Y2YwMGZiODk0MTE0NzUzMDk3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.f0ysr2P4pwE41sO_HAlZv4QfabRPeM9kadI7_TcYAWg)
RoBERTa performed worse despite being a more advanced model, due to overfitting and a lack of optimization in the tokenizing setup. I implemented the suitable tokenizing changes later, but some changes were left unaccommodated as time ran out.
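As an illustration of the kind of tokenizing change meant here, a sketch using the standard Hugging Face RoBERTa tokenizer; the max_length value of 256 and the input texts are illustrative, not the values actually used.

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
texts = ["an example essay...", "another example essay..."]  # placeholder inputs

# Truncate/pad to a task-appropriate length instead of RoBERTa's 512-token
# maximum, cutting memory use and training time per batch.
encoded = tokenizer(
    texts,
    truncation=True,
    padding="max_length",
    max_length=256,        # illustrative; see the max_len heuristic further below
    return_tensors="pt",
)
```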
![Screenshot 2024-01-15 at 10 04 34 PM](https://private-user-images.githubusercontent.com/129365476/296847546-5576d19f-0469-4718-bbe4-329841930082.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3NTQ2LTU1NzZkMTlmLTA0NjktNDcxOC1iYmU0LTMyOTg0MTkzMDA4Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00OTg3ZjMzN2EwOTIyZDkwZDgxMDg3YTRkNTM3ZThhMzgwNjY2ZTRjZDY0YjMyZWEwNDlkNDllNjI2YjdiMWJkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.sn5vvJkhFxlZOMzPZz8ofG0VwLar2Z-k7SwKgqUleZQ)
BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.
I did manage to save the tokenized values, but they were later deemed unnecessary due to the further optimizations I made to BERT's tokenization process.
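A sketch of how tokenized values can be cached between runs, assuming PyTorch tensors produced by a Hugging Face tokenizer (as in the `encoded` example above); the file names and paths are placeholders.

```python
import torch

# After tokenizing once, persist the encodings so later runs can skip the step.
torch.save(encoded, "/kaggle/working/train_encodings.pt")

# On a later run (after re-uploading the file into the input directory):
encoded = torch.load("/kaggle/input/my-cached-encodings/train_encodings.pt")  # placeholder path
```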
The work on the project covered the following steps:

- Gathering larger datasets
- Tokenizing the text for each model
- Optimizing the models for reduced complexity
- Random weight sampling to address overfitting (see the first sketch after this list)
- Gradient clipping (also shown in the first sketch below)
- Optimizing tokenization lengths for BERT and RoBERTa by implementing a function to decide the max_len value (see the second sketch below)
- Tracking progress via progress lines inside the code
- Saving the model weights and tokenized values to reduce the time spent on tokenizing in subsequent runs
- Saving the model first into the Kaggle working directory, then downloading it and uploading it into the input directory to enable offline model running
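A sketch of the random weight sampling and gradient clipping steps, assuming a PyTorch setup with a Hugging Face classification model; the helper names (`make_balanced_loader`, `train_step`) and the hyperparameter values are illustrative, not the project's exact code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_balanced_loader(input_ids, attention_mask, labels, batch_size=16):
    """Oversample the rarer class so each batch sees a balanced label mix."""
    class_counts = torch.bincount(labels)                # e.g. [n_human, n_generated]
    sample_weights = 1.0 / class_counts[labels].float()  # rarer class drawn more often
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(TensorDataset(input_ids, attention_mask, labels),
                      batch_size=batch_size, sampler=sampler)

def train_step(model, optimizer, batch):
    input_ids, attention_mask, labels = batch
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: cap the global gradient norm so a single noisy batch
    # cannot blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```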
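And a sketch of the kind of max_len-deciding function meant above; the 95th-percentile heuristic is an assumption, not necessarily the rule used in the project.

```python
import numpy as np

def choose_max_len(texts, tokenizer, percentile=95, cap=512):
    """Pick a max_length that covers most essays without padding everything to 512."""
    lengths = [len(tokenizer.encode(t)) for t in texts]  # token count per essay
    return int(min(np.percentile(lengths, percentile), cap))
```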