The following is an open project done under the VLG club of IITR, centred around classifying text by recognising whether or not it was generated by a Large Language Model.
Given dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

Augmented dataset by MIT: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text
- train.csv - contains an id column, a text column, and a generated column holding the classification labels
- the augmented dataset - contains a text column for classification and a label column for the prediction results
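A minimal sketch of loading and inspecting the two datasets with pandas; the file paths and names below are assumptions based on the usual Kaggle input layout, not necessarily the exact ones used in the notebooks.

```python
import pandas as pd

# Paths/file names are placeholders based on the standard Kaggle input layout;
# adjust them to match the datasets attached to your notebook.
train = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
augmented = pd.read_csv("/kaggle/input/augmented-data-for-llm-detect-ai-generated-text/final_train.csv")

print(train.columns.tolist())             # includes id, text and the generated label
print(train["generated"].value_counts())  # 0 = human-written, 1 = LLM-generated
```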
The setup for each model used is given separately below:
![Screenshot 2024-01-15 at 10 00 56 PM](https://private-user-images.githubusercontent.com/129365476/296847010-17bc2e7f-6a64-48fd-9220-3383c62e517b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3MDEwLTE3YmMyZTdmLTZhNjQtNDhmZC05MjIwLTMzODNjNjJlNTE3Yi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jMmNiZmQxYmY4MjkxMGZjNWFkYjRkM2VkNmQyZDNhMWZhZDYyYTc5ZmViYWM5NDZlNmM2ODczYTg1YzQ1YWI5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.3A_37nsUwOyXskjRycldyqR20oeLUU_UsLdTuj_5Xsw)
In this setup I also downloaded the output files after the first run with Internet access turned on and uploaded them into the input directory, so as to avoid having to repeat an Internet-on first run every time I reopened the notebook.
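A minimal sketch of this save-then-reload pattern, assuming a Hugging Face transformers model; the model name and the private-dataset path are placeholders, not the exact ones from the notebooks.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# First run (Internet on): download the pretrained weights and save them locally.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.save_pretrained("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")

# Subsequent runs (Internet off): after downloading /kaggle/working/model and
# re-uploading it as a private Kaggle dataset, load from the input directory.
model = AutoModelForSequenceClassification.from_pretrained("/kaggle/input/my-saved-model")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/my-saved-model")
```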
![Screenshot 2024-01-15 at 10 10 20 PM](https://private-user-images.githubusercontent.com/129365476/296847568-be025669-4a2c-48a2-bf24-ed7c42a69f80.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3NTY4LWJlMDI1NjY5LTRhMmMtNDhhMi1iZjI0LWVkN2M0MmE2OWY4MC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jNGViZGY2MjNiNDQxZjc5YmNlMmQyZTVjMDk4OWEwNTdjY2VkMGRmODk3ZDA2Y2YwMGZiODk0MTE0NzUzMDk3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.f0ysr2P4pwE41sO_HAlZv4QfabRPeM9kadI7_TcYAWg)
RoBERTa performed worse despite being a more advanced model, due to overfitting and a lack of optimization in the tokenizing setup. I implemented the suitable tokenizing changes later, but some changes were left unaccommodated as time ran out.
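As an illustration of the kind of tokenizing change meant here, a sketch using the standard Hugging Face RoBERTa tokenizer; the max_length value of 256 and the input texts are illustrative, not the values actually used.

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
texts = ["an example essay...", "another example essay..."]  # placeholder inputs

# Truncate/pad to a task-appropriate length instead of RoBERTa's 512-token
# maximum, cutting memory use and training time per batch.
encoded = tokenizer(
    texts,
    truncation=True,
    padding="max_length",
    max_length=256,        # illustrative; see the max_len heuristic further below
    return_tensors="pt",
)
```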
![Screenshot 2024-01-15 at 10 04 34 PM](https://private-user-images.githubusercontent.com/129365476/296847546-5576d19f-0469-4718-bbe4-329841930082.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyMDA0MjksIm5iZiI6MTcxOTIwMDEyOSwicGF0aCI6Ii8xMjkzNjU0NzYvMjk2ODQ3NTQ2LTU1NzZkMTlmLTA0NjktNDcxOC1iYmU0LTMyOTg0MTkzMDA4Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyNFQwMzM1MjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00OTg3ZjMzN2EwOTIyZDkwZDgxMDg3YTRkNTM3ZThhMzgwNjY2ZTRjZDY0YjMyZWEwNDlkNDllNjI2YjdiMWJkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.sn5vvJkhFxlZOMzPZz8ofG0VwLar2Z-k7SwKgqUleZQ)
BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.
I did manage to save the tokenized values, but they were later deemed unnecessary due to the further optimizations I made to BERT's tokenization process.
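A sketch of how tokenized values can be cached between runs, assuming PyTorch tensors produced by a Hugging Face tokenizer (as in the `encoded` example above); the file names and paths are placeholders.

```python
import torch

# After tokenizing once, persist the encodings so later runs can skip the step.
torch.save(encoded, "/kaggle/working/train_encodings.pt")

# On a later run (after re-uploading the file into the input directory):
encoded = torch.load("/kaggle/input/my-cached-encodings/train_encodings.pt")  # placeholder path
```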
The work on the project covered the following steps:

- Gathering larger datasets
- Tokenizing the text for each model
- Optimizing the models for reduced complexity
- Random weight sampling to address overfitting (see the first sketch after this list)
- Gradient clipping (also shown in the first sketch below)
- Optimizing tokenization lengths for BERT and RoBERTa by implementing a function to decide the max_len value (see the second sketch below)
- Tracking progress via progress lines inside the code
- Saving the model weights and tokenized values to reduce the time spent on tokenizing in subsequent runs
- Saving the model first into the Kaggle working directory, then downloading it and uploading it into the input directory to enable offline model running
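A sketch of the random weight sampling and gradient clipping steps, assuming a PyTorch setup with a Hugging Face classification model; the helper names (`make_balanced_loader`, `train_step`) and the hyperparameter values are illustrative, not the project's exact code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_balanced_loader(input_ids, attention_mask, labels, batch_size=16):
    """Oversample the rarer class so each batch sees a balanced label mix."""
    class_counts = torch.bincount(labels)                # e.g. [n_human, n_generated]
    sample_weights = 1.0 / class_counts[labels].float()  # rarer class drawn more often
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(TensorDataset(input_ids, attention_mask, labels),
                      batch_size=batch_size, sampler=sampler)

def train_step(model, optimizer, batch):
    input_ids, attention_mask, labels = batch
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: cap the global gradient norm so a single noisy batch
    # cannot blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```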
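And a sketch of the kind of max_len-deciding function meant above; the 95th-percentile heuristic is an assumption, not necessarily the rule used in the project.

```python
import numpy as np

def choose_max_len(texts, tokenizer, percentile=95, cap=512):
    """Pick a max_length that covers most essays without padding everything to 512."""
    lengths = [len(tokenizer.encode(t)) for t in texts]  # token count per essay
    return int(min(np.percentile(lengths, percentile), cap))
```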