LLM Text Classification

This is an open project done under the VLG club of IITR, centred around classifying text by recognising whether or not it has been generated by a Large Language Model.


Dataset links:

  • Given Dataset - https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data
  • Augmented Dataset by MIT - https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text

About the data

Given Dataset:

  • train.csv - columns for id, the text to classify, and a generated column holding the classification labels

Augmented Dataset:

  • contains a text column for classification and a label column for the prediction targets
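
For illustration, a minimal sketch of inspecting both files with pandas (the augmented file name and exact column names below are assumptions; check the Kaggle pages for the real ones):

```python
# Minimal sketch - file names and column names are assumptions.
import pandas as pd

given = pd.read_csv("train.csv")                 # given dataset: id, text, generated
augmented = pd.read_csv("augmented_data.csv")    # augmented dataset: text, label (assumed name)

print(given.head())
print(augmented["label"].value_counts())
```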

Setup:

  • The setup for each model used is given separately below:

    DistilBERT (public score 0.803):

(Screenshot of the DistilBERT notebook setup)

In this setup I also downloaded the output files after the first internet-on run and uploaded them into the input directory, to avoid having to repeat an internet-on run every time I reopened the notebook.
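
For reference, a minimal sketch of that save-then-reload-offline flow (the directory names below are assumptions, not the actual dataset paths used in the notebook):

```python
# Minimal sketch - paths are assumptions. After the first internet-on run the model and
# tokenizer are saved, downloaded, and re-uploaded as a Kaggle input dataset; later runs
# can then load them entirely offline.
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

WORK_DIR = "/kaggle/working/distilbert-detector"   # written during the online run
INPUT_DIR = "/kaggle/input/distilbert-detector"    # re-uploaded copy for offline runs

# First (online) run: after fine-tuning, persist the weights and tokenizer
# model.save_pretrained(WORK_DIR)
# tokenizer.save_pretrained(WORK_DIR)

# Subsequent (offline) runs: load from the uploaded input dataset, no downloads needed
tokenizer = DistilBertTokenizerFast.from_pretrained(INPUT_DIR)
model = DistilBertForSequenceClassification.from_pretrained(INPUT_DIR)
```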

RoBERTa (public score 0.672):

(Screenshot of the RoBERTa notebook setup)

RoBERTa performed worse despite being a more advanced model, due to overfitting and a lack of optimisation in the tokenizing setup. I implemented some of the needed changes later, but others were left to be accommodated and time didn't allow for them.

BERT (public score N/A):

(Screenshot of the BERT notebook setup)

BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.

I did manage to save the tokenized values, but they were later deemed unnecessary due to further optimisations I made in BERT's tokenisation process.
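
A minimal sketch of caching the tokenized values so that later runs can skip the slow encoding pass (the cache filename and max_length below are assumptions):

```python
# Minimal sketch - cache filename and max_length are assumptions.
import os
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
CACHE = "bert_encodings.pt"

def get_encodings(texts, max_length=256):
    if os.path.exists(CACHE):
        return torch.load(CACHE)               # reuse encodings from an earlier run
    enc = tokenizer(texts, truncation=True, padding="max_length",
                    max_length=max_length, return_tensors="pt")
    torch.save(enc, CACHE)                     # cache for subsequent runs
    return enc
```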

Steps Taken:

  • Gathering larger datasets
  • Tokenizing the text for each model
  • Optimizing the models for reduced complexity
  • Random weight sampling to address overfitting (see the sketch after this list)
  • Gradient clipping (see the sketch after this list)
  • Optimizing tokenisation lengths for BERT and RoBERTa by implementing a function to decide the max_len value (also sketched below)
  • Tracking progress with progress lines inside the code
  • Saving the model weights and tokenized values to reduce the time spent on tokenisation in subsequent runs
  • Saving the model first into the Kaggle working directory, then downloading and re-uploading it into the input directory to enable offline model runs
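
A minimal sketch of the sampling and clipping steps, assuming the sampling step refers to PyTorch's WeightedRandomSampler (class-weighted sampling over an imbalanced label set) inside a standard training loop; train_labels, train_dataset, model, and optimizer are assumed names, not the notebook's actual variables:

```python
# Minimal sketch - train_labels, train_dataset, model and optimizer are assumed to exist.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = np.asarray(train_labels)                       # 0 = human, 1 = LLM-generated (assumed)
class_counts = np.bincount(labels)
weights = torch.tensor(1.0 / class_counts[labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

model.train()
for batch in loader:
    loss = model(**batch).loss                          # assumes batches include labels
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
```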
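
And a sketch of a function to decide max_len from the data rather than hard-coding it (the 95th-percentile heuristic and the 512 cap are assumptions about how such a function might work, not the exact rule used in the notebooks):

```python
# Minimal sketch - percentile heuristic and cap are assumptions.
import numpy as np

def choose_max_len(texts, tokenizer, percentile=95, cap=512):
    """Pick a max_len that covers most texts without padding everything to the model limit."""
    lengths = [len(tokenizer.encode(t, truncation=False)) for t in texts]
    return int(min(np.percentile(lengths, percentile), cap))

# Example: max_len = choose_max_len(train_df["text"].tolist(), tokenizer)
```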
