- Data Loading and Preprocessing:
  - The dataset, containing email messages labeled as 'ham' (not spam) or 'spam', is loaded from a TSV file.
  - Labels are converted into binary values (`0` for 'ham' and `1` for 'spam') to facilitate training.
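The loading and label-mapping steps above might look like the following minimal sketch. The column names `label` and `message`, the headerless two-column layout, and the in-memory sample rows are assumptions made for illustration; the real TSV file may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the TSV file: label and message separated by a tab,
# with no header row (an assumption about the file layout).
tsv_text = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tWINNER! Claim your free prize now\n"
    "ham\tI will call you later tonight\n"
)

# With a real file, the first argument would be the path to the TSV file.
df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    header=None,
    names=["label", "message"],  # assumed column names
)

# Convert the string labels to binary values: 0 for 'ham', 1 for 'spam'.
df["label"] = df["label"].map({"ham": 0, "spam": 1})

print(df)
```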
- Feature Extraction:
  - `CountVectorizer` is employed to transform the text messages into a matrix of token counts. This matrix serves as the feature set (`X`) for the machine learning model.
  - The vectorizer is first fitted on the training data (`X_train`) and then used to transform both training and test datasets.
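A small sketch of the fit-then-transform pattern described above, using toy messages in place of the real dataset. The sample texts and the name `X_test_transformed` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the real training and test messages (illustrative only).
X_train = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "I will call you later tonight",
    "Free entry: win a cash prize, call now",
]
X_test = [
    "Claim your cash prize today",
    "See you at lunch",
]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training messages only, then encode them.
X_train_transformed = vectorizer.fit_transform(X_train)

# Reuse the same vocabulary for the test messages; words not seen during
# fitting are ignored rather than added as new columns.
X_test_transformed = vectorizer.transform(X_test)

print(X_train_transformed.shape, X_test_transformed.shape)
```

Fitting on the training data alone keeps the test set from influencing the vocabulary, which is why `transform` (not `fit_transform`) is called on `X_test`.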
- Model Training:
  - A Multinomial Naive Bayes model is trained on the transformed training data (`X_train_transformed`, `y_train`). This model is particularly effective for text classification problems, where the features are represented by word counts.
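The training step could be sketched roughly as follows. The toy messages and labels are assumptions for illustration, and the model is created with scikit-learn's default settings, which may differ from what the project uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only): 0 = ham, 1 = spam.
X_train = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "I will call you later tonight",
    "Free entry: win a cash prize, call now",
]
y_train = [0, 1, 0, 1]

# Turn the messages into token-count features.
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

# Fit Multinomial Naive Bayes on the word counts and their labels.
model = MultinomialNB()
model.fit(X_train_transformed, y_train)

print(model.classes_)
```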
- Prediction Function:
  - The `predict_spam_or_ham` function takes a user input string, transforms it using the previously fitted vectorizer, and predicts whether the message is 'Spam' or 'Not Spam (ham)' using the trained Naive Bayes model.
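One way `predict_spam_or_ham` could be written is sketched below. The function body is an assumption based on the description above, not the project's actual code, and the toy training data exists only so the example runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only): 0 = ham, 1 = spam.
X_train = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "I will call you later tonight",
    "Free entry: win a cash prize, call now",
]
y_train = [0, 1, 0, 1]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)


def predict_spam_or_ham(user_input: str) -> str:
    """Classify one message with the already fitted vectorizer and model."""
    # transform() expects an iterable of documents, so wrap the string in a list.
    transformed = vectorizer.transform([user_input])
    prediction = model.predict(transformed)[0]
    return "Spam" if prediction == 1 else "Not Spam (ham)"


print(predict_spam_or_ham("Claim your free cash prize"))
print(predict_spam_or_ham("Are we meeting for lunch?"))
```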
- Streamlit Application:
  - The app's UI allows users to input a message and classify it as 'Spam' or 'Not Spam (ham)'.
  - The `st.text_area()` component captures user input, and the `st.button()` triggers the classification. The result is displayed using `st.write()`.
- Environment Setup:
  - Ensure Python 3.7+ is installed.
  - Install required libraries: `pip install streamlit pandas scikit-learn`
- Running the Application:
  - Execute the Streamlit app with the following command: `streamlit run app.py`