
nlp-redit-analysis's Introduction

Title: NLP, Web APIs & Classification - Reddit Dating/Relationship Advice Post Classification

Author: Sudeep Choudhary

Introduction:

This project classifies Reddit posts from the "dating_advice" and "relationship_advice" subreddits using Natural Language Processing (NLP) techniques, so that the most suitable advertisements can be matched to each subreddit. It combines web scraping with machine learning algorithms to achieve this goal.

Problem Statement

As a data scientist at Reddit, the objective is to categorize posts effectively to serve targeted advertising on relevant subreddits. This project focuses on the "dating_advice" and "relationship_advice" communities, aiming to:

  • Identify key terms and phrases that hold predictive power in distinguishing between the two categories.
  • Develop a classification model (Logistic Regression and Bayes models are explored) to achieve accurate post classification.

Data Collection

  • Data source: Reddit subreddits: https://www.reddit.com/r/relationship_advice/ and https://www.reddit.com/r/dating_advice/
  • Scraping method: The requests library was used to retrieve post content. Around 2000 unique posts were collected (approximately 1000 from each subreddit), using the "Hot" and "New" filters to diversify the data.
  • Preprocessing: Duplicate posts were removed with the pandas drop_duplicates method, keyed on the id column. A 1-second delay was added between requests to avoid overloading Reddit's servers.
  • Saved data: The scraped content is stored as CSV files in the dataset folder of this repository.
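The collection steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes Reddit's public JSON listing endpoint, and the function names, the User-Agent string, and the page size are all hypothetical. The fetch function is injectable so the paging and de-duplication logic can be exercised without network access.

```python
import time
from typing import Optional

import pandas as pd

# Assumption: the public JSON listing endpoint. The README does not state
# which URL the original scraper requested.
LISTING_URL = "https://www.reddit.com/r/{subreddit}/{listing}.json"


def fetch_page(subreddit: str, listing: str, after: Optional[str]) -> dict:
    """Download one listing page as parsed JSON (performs a network request)."""
    import requests  # imported lazily so the helpers below also work offline

    response = requests.get(
        LISTING_URL.format(subreddit=subreddit, listing=listing),
        params={"after": after, "limit": 100},
        headers={"User-Agent": "reddit-post-classifier-demo/0.1"},  # hypothetical
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def collect_posts(subreddit, listing, pages, fetch=fetch_page, delay=1.0):
    """Walk up to `pages` listing pages, pausing `delay` seconds between requests."""
    rows, after = [], None
    for page_number in range(pages):
        if page_number > 0:
            time.sleep(delay)  # be polite to Reddit's servers
        data = fetch(subreddit, listing, after)["data"]
        for child in data["children"]:
            post = child["data"]
            rows.append({
                "id": post["id"],
                "subreddit": post["subreddit"],
                "title": post["title"],
                "selftext": post.get("selftext"),
            })
        after = data.get("after")  # token for the next page, None when exhausted
        if after is None:
            break
    return pd.DataFrame(rows, columns=["id", "subreddit", "title", "selftext"])


def combine_listings(frames):
    """Concatenate listing frames and drop posts that appear more than once."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset="id").reset_index(drop=True)
```

A post can show up under both "Hot" and "New", which is why the frames are merged before calling drop_duplicates on the id column. The result could then be written out with DataFrame.to_csv.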

Data Cleaning and Exploratory Data Analysis (EDA)

  • Null entries: Rows with missing values in the selftext column were removed as they lacked valuable information.
  • Feature creation: The title and selftext columns were combined into a single all_text column for analysis.
  • Dummy variables: The subreddit column was converted into dummy variables, where dating_advice is represented by 0 and relationship_advice by 1.
  • Distribution analysis: The frequency distribution of word counts in titles and full text was examined for both subreddits to identify potential differences.
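A minimal pandas sketch of the cleaning steps above, assuming the scraped data is loaded into a DataFrame with title, selftext, and subreddit columns. The function name and the two word-count column names are hypothetical.

```python
import pandas as pd


def clean_posts(posts: pd.DataFrame) -> pd.DataFrame:
    """Drop empty posts, build all_text, encode the target, and count words."""
    # Rows without body text carry little information for the classifier.
    cleaned = posts.dropna(subset=["selftext"]).copy()

    # Combine title and body into one text feature.
    cleaned["all_text"] = cleaned["title"] + " " + cleaned["selftext"]

    # Encode the target: dating_advice -> 0, relationship_advice -> 1.
    cleaned["subreddit"] = cleaned["subreddit"].map(
        {"dating_advice": 0, "relationship_advice": 1}
    )

    # Word counts used to compare length distributions (hypothetical column names).
    cleaned["title_word_count"] = cleaned["title"].str.split().str.len()
    cleaned["all_text_word_count"] = cleaned["all_text"].str.split().str.len()
    return cleaned.reset_index(drop=True)
```

The distributions can then be compared per subreddit, for example with cleaned.groupby("subreddit")["all_text_word_count"].describe().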

Preprocessing and Modeling

  • Baseline accuracy: The baseline accuracy, achieved by always predicting the majority class ("relationship_advice"), was found to be 63.3%.
  • Modeling approach: Logistic Regression and Bayes models were considered as potential classification algorithms.
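The baseline is simply the share of the majority class in the target column. A small sketch, assuming the labels are held in a pandas Series; the function name is hypothetical and the labels in the usage lines are toy values, not the project's data.

```python
import pandas as pd


def baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Toy labels only: 19 posts of class 1 and 11 posts of class 0.
toy_labels = pd.Series([1] * 19 + [0] * 11)
toy_baseline = baseline_accuracy(toy_labels)  # 19 / 30
```

Any model worth keeping should beat this figure on held-out data.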

Hyperparameter Tuning

  • Initial attempts:
    • Logistic regression with default CountVectorizer parameters yielded poor results.
    • Using CountVectorizer with a stemmer for preprocessing yielded low accuracy and high variance.
    • Employing a lemmatizer instead slightly improved the test score (to 74%).
  • Optimized model:
    • Increasing the dataset size and stratifying the train/test split significantly improved accuracy (CountVectorizer test score: 81%, cross-validation score: 80%).
    • Additional improvements were achieved by fine-tuning CountVectorizer parameters:
      • min_df: 3 (minimum document frequency)
      • ngram_range: (1, 2) (considering single words and bigrams)
    • Grid Search with Pipeline was used to identify optimal hyperparameters for Logistic Regression and CountVectorizer. The best parameters were:
      • cvec__max_df: 0.95 (maximum document frequency)
      • cvec__max_features: 4000 (maximum number of features)
      • cvec__min_df: 3 (minimum document frequency)
      • cvec__ngram_range: (1, 2) (considering single words and bigrams)
    • Results:
      • Cross-validation score: 76%
      • Test score: 78%
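The grid search described above can be sketched with scikit-learn as follows. The "cvec" step name matches the parameter names reported above; the "lr" step name, the function name, the default split size, the random seed, and the alternative grid values next to the reported best ones are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# The reported best values (0.95, 4000, 3, (1, 2)) plus hypothetical alternatives.
PARAM_GRID = {
    "cvec__max_df": [0.9, 0.95],
    "cvec__max_features": [2000, 4000],
    "cvec__min_df": [2, 3],
    "cvec__ngram_range": [(1, 1), (1, 2)],
}


def tune_count_vectorizer_model(texts, labels, param_grid=PARAM_GRID, cv=5,
                                random_state=42):
    """Stratified split, then grid search over a CountVectorizer + LogisticRegression pipeline."""
    # stratify keeps the class proportions equal in the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, stratify=labels, random_state=random_state
    )
    pipe = Pipeline([
        ("cvec", CountVectorizer()),
        ("lr", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(pipe, param_grid, cv=cv)
    search.fit(X_train, y_train)
    # best_score_ is the mean cross-validation accuracy; the second value
    # returned here is the accuracy on the held-out test set.
    return search, search.score(X_test, y_test)
```

After fitting, search.best_params_ holds the selected combination and search.best_score_ the cross-validation score, which can be compared with the test score to check for overfitting.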

TF-IDF Vectorization

  • Experimentation with TF-IDF vectorization was conducted:
    • TfidfVectorizer with custom stop words tailored to exclude irrelevant or overused words further reduced overfitting.
    • Final parameters:
      • Stop words: A customized list that filters out common words such as "relationship" and "girlfriend".
      • ngram_range: (1, 2) (considering single words and bigrams)
      • max_df: 0.9 (maximum document frequency)
      • min_df: 2 (minimum document frequency)
      • max_features: 5000 (maximum number of features)
