Coder Social home page Coder Social logo

finding-your-reddit-community's Introduction

Using Reddit's API for predicting Posts

Problem Statement: Is it possible to create a classifier to predict which subreddit a given post came from?

In this project I have collected data uding Reddit's API and got the data for two subreddits "Running", and "Lose it". The goal for this project is to using NLP and Machine Leaning models, build a binary predictor to determine which subreddit a given post is coming from.

Data Collection: Reddit API

Data Cleaning and EDA

I first built API Function to pull posts as many as I need for the data collection. I chosen different subreddits , but since I needed more data entries, I used the subreddits with much more post enteries. Each subreddit has about 10000 enteries and after cleaning up and dropping the duplicates, I ended up having 19611 rows for the merged dataset. I located the text from each posts and created a new column to identify each post as whether coming from "Running" or "Lose it" subreddits.

Preprocessing

1- Removed non-letters,

2- Converted to lower case and splitted the words,

3- Used stop words,

4- Used Stemming and Tokenizing.

Before starting the preprocessing NLP, I tried to plot the most common words in each data set and the plots represented a few nonsense vocabularies such as "19F", "made", "SW1", and I decided to implement stemming, tokanizing and using stopwords to get the related words for each subreddits.

Modeling

1- Instantiated and fitted CountVectorizer and TF-IDF Vectorizer

2- Instantiated and fitted Pipeline and GridSearch

3- Used Logistic Regression, Naive Bayes Classifier and Random Forest Model

After the process of traning, testing and splitting, I used TFID Vectorizer and Count vectorizer tools to get the words frequencies. Then I created and compared different models such as Naive Bayes Clasifier, Logistic Regression and Random Forest Model and I got the highest score on the Count Vectorizer+ Logistic Regression Model for 99% on training dataset and %92 on testing dataset.

Conclusion

If you are an internet savvy person and follow different social news platform, you must have heard of Reddit ! Reddit allows its users to discuss and vote on content (such as links, text posts, and images) that other users have submitted. There are about 138,000 active subreddits among a total of 1.2 million subreddits. But, how is it possible to find your own community with the topics you have interest in?!

The challenge for this project is to help you identify the community which is right for you!

I know I know... You are thinking now “I like food and drinks”, “I love art and paintings” or maybe “I love Running and Losing weight” but don’t have the time to search in all these subreddits that Reddit offers and just want to find your community soon! You are already overwhelmed with life and prefer not googling to find the topics and community you like.

Well, that is why I am here! I try to help you out but how? My model here goes over all the subreddits that you have some interest in and BAM! It gives you the most key words of each subreddits and you can easily decide which community is for you! Why am I think my model is good? Another BIG BAM! Since my model has an accuracy of %99 to identify the certain posts and vocabularies from any subreddits.

finding-your-reddit-community's People

Contributors

minoot7 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.