kritika97gaikwad / ai-generated-text-detection

Jupyter Notebook 100.00%
automl data-science ensemble-machine-learning ensemble-model ice-plot machine-learning numpy pandas permutation-importance python scikit-learn

AI Generated Text Detection

AI-generated text is increasingly common across industries, with applications in content generation, personalized marketing, virtual assistants, and creative writing. These advances also bring challenges, such as distinguishing machine-written documents from human-written ones, that must be addressed to ensure responsible and ethical use.

Project Overview

This project develops a machine learning model with scikit-learn, using ensemble techniques, PCA, correlation analysis, and extensive feature engineering. The goal is to classify documents as either human-generated (0) or AI-generated (1) based on document embeddings, word count, and punctuation.

Requirements

  • Python 3.8 or higher
  • Jupyter Notebook or Google Colab

Usage

Exploratory Data Analysis (EDA)

In the EDA phase, we analyze the dataset using the following visualizations and statistics:

  • Distribution of the target variable (ind): Understand the imbalance in the dataset.
  • Distribution of word counts: Analyze the length of the documents.
  • Frequency of punctuation marks: Examine the usage of punctuation in the documents.
  • Correlation heatmap of document embeddings: Identify relationships between different embedding dimensions.
  • PCA and t-SNE visualizations of document embeddings: Reduce dimensions to visualize the embeddings in 2D space.
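The first three statistics above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the toy corpus, the variable names, and the whitespace tokenization are assumptions made for the example; only the target name `ind` (0 for human, 1 for AI) comes from this README.

```python
import string
from collections import Counter

# Hypothetical toy corpus of (text, ind) pairs; ind is 0 for human, 1 for AI.
docs = [
    ("Hello there! How are you today?", 0),
    ("The results, as expected, were consistent.", 1),
    ("Nope. Not at all...", 0),
]

# Distribution of the target variable, which exposes any class imbalance.
class_counts = Counter(label for _, label in docs)

# Word count per document (simple whitespace tokenization, an assumption).
word_counts = [len(text.split()) for text, _ in docs]

# Frequency of each punctuation mark across the whole corpus.
punct_counts = Counter(
    ch for text, _ in docs for ch in text if ch in string.punctuation
)

print("class counts:", dict(class_counts))
print("word counts:", word_counts)
print("most common punctuation:", punct_counts.most_common(3))
```

On a real dataset the same counts would typically be plotted as bar charts or histograms rather than printed.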

Data Preparation

During data preparation:

  • Feature Engineering: Create additional features such as average word length and number of unique words.
  • Train-Test Split: Split the data into training and testing sets (90/10 split) with a fixed random seed for reproducibility.
  • Class Imbalance Handling: Use techniques like SMOTE to balance the classes in the training set.
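The feature engineering and split steps might look roughly like the sketch below. The helper `engineer_features`, the toy documents, the seed value 42, and the use of `stratify` are all assumptions for illustration; the README only states that average word length and unique-word counts are added and that the split is 90/10 with a fixed seed.

```python
import string

from sklearn.model_selection import train_test_split


def engineer_features(text):
    """Return [word_count, avg_word_length, unique_words, punctuation_count].

    Hypothetical helper: the exact feature definitions are assumptions.
    """
    words = text.split()
    # Strip surrounding punctuation and lowercase before measuring words.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    cleaned = [w for w in cleaned if w]
    avg_len = sum(len(w) for w in cleaned) / len(cleaned) if cleaned else 0.0
    punct = sum(ch in string.punctuation for ch in text)
    return [len(words), avg_len, len(set(cleaned)), punct]


# Hypothetical imbalanced toy data: 20 documents, 4 of them labeled AI (1).
texts = [f"Sample document number {i}, written for testing." for i in range(20)]
labels = [1 if i % 5 == 0 else 0 for i in range(20)]

X = [engineer_features(t) for t in texts]

# 90/10 split with a fixed seed; stratify keeps class ratios similar (assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.10, stratify=labels, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```

Any oversampling step such as SMOTE should be fitted on `X_train` and `y_train` only, after the split, so that synthetic samples never leak information from the test set.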

Model Training and Evaluation

We train the following models:

  • Logistic Regression
  • Random Forest
  • AdaBoost
  • SVC
  • Gradient Boosting
  • AutoML (TPOT)
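As a rough sketch of how the scikit-learn models in this list can be trained side by side, the example below fits each one on synthetic data from `make_classification`. The data, the hyperparameters, and the seed are assumptions rather than the notebook's settings, and the AutoML/TPOT step is left out because it needs a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the document features (an assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# Illustrative hyperparameters, not the values used in the notebook.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Keeping the models in one dictionary makes it easy to reuse the same loop for the evaluation steps.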

For evaluation, we:

  • Generate learning curves for accuracy and loss.
  • Create confusion matrices.
  • Produce classification reports.
  • Calculate F1 scores, precision, and recall.
  • Compute permutation importance.
  • Create partial dependence plots.
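Several of these evaluation steps map directly onto scikit-learn functions. The sketch below covers the confusion matrix, classification report, F1 score, and permutation importance for a single model; the synthetic data, the choice of Random Forest, and the class display names are assumptions, and learning curves and partial dependence plots are omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the document features (an assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_test, y_pred, target_names=["human (0)", "AI (1)"]))

# F1 for the positive class (AI-generated = 1).
f1 = f1_score(y_test, y_pred)

# Permutation importance: drop in score when one feature is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]
print("features ranked by importance:", ranking.tolist())
```

Computing permutation importance on the held-out set, as here, measures how much each feature matters for generalization rather than for fitting the training data.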

Results

The results section in AI_Generated_Text_Detection_Project.ipynb provides a detailed analysis of model performance, highlighting the strengths and weaknesses of each model.

Contributing

Contributions are welcome! If you have any improvements or bug fixes, please open an issue or submit a pull request.
