This project demonstrates how to build a simple spam email detection system using scikit-learn, a popular machine learning library in Python.
The project uses a Bag-of-Words model and the Naive Bayes classifier to classify emails as spam or not spam. It includes the following components:
- Loading the dataset from a CSV file.
- Preprocessing the data and splitting it into training and testing sets.
- Vectorizing the emails using the Bag-of-Words representation.
- Training a Naive Bayes classifier on the training data.
- Evaluating the model's accuracy on the testing data.
- Making predictions on new emails.
- Python 3.x
- scikit-learn
- pandas
- Ensure you have Python installed on your system.
- Install the required libraries using pip:
- Download the
emails.csv
file or prepare your own dataset in a similar format. - Run the provided Python script
spam_detection.py
. - The script will train the model, evaluate its accuracy, and make predictions on new emails.
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
- The code provided here is a basic example. For better accuracy, you may consider using more advanced techniques, such as feature engineering, hyperparameter tuning, or using more sophisticated classifiers.
- Ensure that your dataset is well-balanced and representative to build a robust spam detection model.
- Experiment with different vectorization techniques and classifiers to find the best combination for your specific use case.