Reddit Social Popularity Graph

- Reddit social mapping for community detection of marketable customer segments -

Reddit Social Popularity Graph (RSPG) is a Python module for customer segmentation and popularity analysis of subreddit communities.

The project was started in June 2015 by Matt Strautmann as the capstone project for the Galvanize Data Science Immersive Program (galvanize.com). See the Slidedeck.pdf file for a presentation of key points and results of the project.

##Project Overview

The goal of the project was to model the social relationships of online communities in order to detect communities within these networks. These communities would form frameworks for extracting the topics of interest of each member, their relative popularity in the community, and their sentiment toward the community's common topics of interest.

    To accomplish this goal, I needed a community on Reddit with a large population that frequently
    commented and participated in multiple conversations. I arrived at this requirement after
    trying to extract signal from common topics in subreddits on the front page of Reddit, the
    so-called "default" subreddits, which proved difficult. I believe this is because the default
    subreddits are seen by everyone who comes to the website, not just parties interested in a
    specific topic. To overcome this problem, I set my criterion to be a subreddit that was in the
    top 50 on Reddit and had a niche topic. For this project, that was the subreddit "Games." The
    code works on any social media source with posts and comment trees.

    With my social relationship graph and the strong opinions of Reddit users, I was able to see
    the customer segmentation of Reddit users' opinions of games, releases/announcements, and game
    consoles. This segmentation gave me a community-level view of the entire subreddit user base,
    with rich metadata about the popularity and sentiments of each community and its members.

##In-depth Process

####Scraping Reddit

A relational database must come before any graph building. Using PRAW, a Python wrapper for the Reddit API, I wrote a scraper class I called "Rabid Reddit" to grab the submission posts, then flatten the comment tree and scrape the metadata and text from each comment. After data cleaning to regularize the data formats and remove deleted comments, the scraped posts and comments were stored in a PostgreSQL database using the psycopg2 module.
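
The flatten, clean, and store steps can be sketched without any network access. This is a minimal illustration, not the project's scraper: the nested-dict comment shape, the function names, and the table layout are all assumptions, and the standard-library `sqlite3` module stands in for PostgreSQL/psycopg2 so the sketch runs anywhere.

```python
import sqlite3

DELETED = "[deleted]"  # assumed placeholder left behind by deleted comments


def flatten_comment_tree(comments, parent_id=None):
    """Yield each comment in a nested tree as a flat (id, parent, author, score, body) row."""
    for comment in comments:
        yield (comment["id"], parent_id, comment["author"],
               comment["score"], comment["body"].strip())
        # Recurse into the replies so the whole tree ends up in one flat sequence.
        for row in flatten_comment_tree(comment.get("replies", []), comment["id"]):
            yield row


def store_comments(conn, rows):
    """Insert rows, skipping deleted comments; return the number of rows stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, parent_id TEXT, author TEXT, score INTEGER, body TEXT)"
    )
    kept = [r for r in rows if r[2] != DELETED and r[4] != DELETED]
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)", kept)
    conn.commit()
    return len(kept)


# Hypothetical sample data shaped like one post's comment tree.
sample_tree = [
    {"id": "c1", "author": "alice", "score": 12, "body": "Loved the new release. ",
     "replies": [
         {"id": "c2", "author": "[deleted]", "score": 1, "body": "[deleted]", "replies": []},
         {"id": "c3", "author": "bob", "score": -3, "body": "Too buggy at launch.",
          "replies": []},
     ]},
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
stored = store_comments(conn, flatten_comment_tree(sample_tree))
```

Keeping `parent_id` in each flat row preserves the reply structure, so the tree can be rebuilt from the table later if needed.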

    Now that I had popularity scores and text bodies for each comment, I adjusted the popularity
    scores by calculating the sentiment of each comment: I vectorized the text with term
    frequency-inverse document frequency (TF-IDF) and classified each comment as positive or
    negative using a Naive Bayes model. This pos/neg polarity was then applied to the popularity
    score to adjust for the edge case where the sentiment of the comment was opposite to the
    sentiment of the original post, i.e., a negative comment about a positive post would have a
    negative popularity.
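
One way to read the adjustment step is as a sign flip whenever the comment and its post disagree. The sketch below assumes the polarities (+1 or -1) have already been predicted by a classifier such as the Naive Bayes model; the function name and the exact rule are assumptions for illustration, not the project's confirmed formula.

```python
def adjust_popularity(score, comment_polarity, post_polarity):
    """Flip the sign of a comment's score when its sentiment opposes the post's.

    Polarities are +1 (positive) or -1 (negative). The sign-flip rule is an
    assumed interpretation of the adjustment, not a confirmed formula.
    """
    if comment_polarity not in (1, -1) or post_polarity not in (1, -1):
        raise ValueError("polarities must be +1 or -1")
    return score if comment_polarity == post_polarity else -score


# Hypothetical (karma, comment polarity, post polarity) triples.
comments = [(10, 1, 1), (10, -1, 1), (7, -1, -1)]
adjusted = [adjust_popularity(*c) for c in comments]
```

Under this rule, an upvoted negative comment on a positive post counts against the post's topic instead of for it.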

####Building a Social Graph

A graph is the framework to store information about the Reddit users: their actions and relationships. I used the igraph module for this project. The nodes (or vertices) of my graph were the unique users scraped from r/Games. The edges represent the relationships between users: the graph is formed by adding edges between users who talked about the same topics, which in my project were games. Now that I had the framework of who talked to whom, I wanted to detect popular members of communities and see what topics interested them. As part of my graph master class, I wrote an add-edges function that adds metadata to each edge. I added the popularity of the comments the two users had in common (calculated from the Reddit karma score, the net upvote/downvote score given to the comment by other users) as well as all the topics the two users talked about. This allowed me to query the graph at a community level and extract the topics talked about in the community.
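
The edge-building idea can be illustrated with plain dictionaries. This is a dependency-free sketch, not the project's igraph code: `build_edges`, the `(user, topic, karma)` input shape, and the choice to sum both users' karma on each shared topic are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(comments):
    """Return {(user_a, user_b): {"popularity": int, "topics": set}} for users sharing topics.

    `comments` is an iterable of (user, topic, karma) tuples.
    """
    # topic -> user -> summed karma of that user's comments on the topic
    karma_by_topic = defaultdict(lambda: defaultdict(int))
    for user, topic, karma in comments:
        karma_by_topic[topic][user] += karma

    edges = {}
    for topic, user_karma in karma_by_topic.items():
        # Every pair of users who discussed the same topic gets an edge.
        for a, b in combinations(sorted(user_karma), 2):
            edge = edges.setdefault((a, b), {"popularity": 0, "topics": set()})
            # Assumed definition: add both users' karma on the shared topic.
            edge["popularity"] += user_karma[a] + user_karma[b]
            edge["topics"].add(topic)
    return edges


# Hypothetical sample comments.
comments = [
    ("alice", "Game A", 12),
    ("bob", "Game A", 5),
    ("alice", "Game B", 3),
    ("bob", "Game B", -1),
    ("carol", "Game B", 8),
]
edges = build_edges(comments)
```

Sorting each pair gives every undirected edge a single canonical key, so metadata from several shared topics accumulates on the same edge.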

####Community Detection

A graph is not very useful without communities. Using the maximum modularity algorithm, I divided the graph into communities, with betweenness centrality and maximum modularity as my measurement. This separates communities by maximizing the number of edges within each community and minimizing the number of edges connecting different communities.
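
To make the "many edges inside, few edges between" criterion concrete, here is a small sketch of the standard modularity score for an undirected, unweighted graph. It only scores a given partition; it does not search for the best one, and it is not the project's igraph code. The function name and the toy graph are assumptions.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs; `membership` maps node -> community label.
    Q = sum over communities of internal_edges / m - (degree_sum / (2 * m)) ** 2
    """
    m = float(len(edges))
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # community -> edges with both ends inside it
    degree_sum = defaultdict(int)  # community -> total degree of its nodes
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}

split_q = modularity(edges, split)    # one community per triangle
lumped_q = modularity(edges, lumped)  # everything in one community
```

Splitting the toy graph at the bridge scores higher than lumping all six nodes together, which is the behavior a modularity-maximizing method relies on.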

    I then wrote a custom distance scoring function for the K-Nearest Neighbors function to
    find similar communities based on the Doc2Vec similarity of the topics discussed in
    each community.
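
A minimal sketch of a nearest-neighbor lookup over community topic vectors follows. It assumes each community has already been reduced to one vector (for example by Doc2Vec) and uses cosine distance as the custom distance; the function names and the tiny vectors are hypothetical, and the project's actual scoring function may differ.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_communities(target, vectors, k=2):
    """Return the k community ids whose vectors are closest to the target's vector."""
    distances = [(cosine_distance(vectors[target], vec), cid)
                 for cid, vec in vectors.items() if cid != target]
    return [cid for _, cid in sorted(distances)[:k]]


# Hypothetical 3-dimensional topic vectors; real document vectors would be much longer.
community_vectors = {
    "c0": [1.0, 0.0, 0.0],
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.0, 1.0, 0.0],
    "c3": [0.5, 0.5, 0.0],
}
neighbors = nearest_communities("c0", community_vectors, k=2)
```

Cosine distance ignores vector length, so communities are compared by the mix of topics they discuss and not by how much they post.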

####Customer Segmentation

The complete graph now shows me which games and topics are of interest to each community. It can also reveal the dynamics of each group, such as its most influential member and which users had positive or negative perspectives on the topics of interest to the community.
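
A community-level query of this kind might look like the sketch below. The input dictionaries, the function name, and the definition of "most influential" as the member with the highest summed popularity are all assumptions made for illustration.

```python
from collections import defaultdict


def summarize_communities(membership, popularity, polarity):
    """Return each community's most influential member and its positive/negative users.

    membership: user -> community id
    popularity: user -> summed (sentiment-adjusted) popularity
    polarity:   user -> +1 or -1 overall stance toward the community's topics
    """
    members = defaultdict(list)
    for user, community in membership.items():
        members[community].append(user)

    summary = {}
    for community, users in members.items():
        summary[community] = {
            # Assumed definition: the most influential member has the highest popularity.
            "most_influential": max(users, key=lambda u: popularity[u]),
            "positive": sorted(u for u in users if polarity[u] > 0),
            "negative": sorted(u for u in users if polarity[u] < 0),
        }
    return summary


# Hypothetical sample data.
membership = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}
popularity = {"alice": 30, "bob": -4, "carol": 11, "dave": 15}
polarity = {"alice": 1, "bob": -1, "carol": 1, "dave": 1}
summary = summarize_communities(membership, popularity, polarity)
```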

##Important links:

PRAW documentation: https://praw.readthedocs.org/en/v3.0.0/
Gensim documentation: http://gensim.readthedocs.org/en/latest/

##Dependencies

RSPG is tested to work under Python 2.7.

The following dependencies are required, listed with the versions tested for the project:

gensim (0.11.1.post1)
gnumpy (0.2)
graphviz (0.4.4)
pygraphviz (1.2)
matplotlib (1.4.3)
nltk (3.0.3)
numpy (1.9.2)
pandas (0.16.1)
praw (3.0.0)
python-igraph (0.7.1.post6)
scikit-learn (0.16.1)
scipy (0.15.1)
psycopg2 (2.6)
