Tagger News

I continued this project to pursue an idea I had had more than a year earlier. Lacking the ML and text analysis skills, I had left it for another day. After finding out about the Tagger News project (from HN, funnily enough), I felt I could pursue the idea.

The Idea

Wouldn't it be great to know more about the person you are chatting with on HN, like their background or motivation?

Also: Is this person a sockpuppet? A shill? Or an apologist?

One way to do that is to look at their comment history and the articles they post or like commenting on.

And this was a good project to practice my Python...

Setup

Set up the environment
git clone https://github.com/dchecks/taggernews.git
cd taggernews
mkvirtualenv taggernews
ln -s ~/.virtualenvs/taggernews/bin/activate ./activate
echo 'export SECRET_KEY="PICK A SECRET KEY"' >> /location/of/your/.virtualenvs/taggernews/bin/activate
echo export DEBUG=True >> /location/of/your/.virtualenvs/taggernews/bin/activate

For some reason the admin page throws a 500 error if you don't enable debug.
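The two exports above only help if the settings module reads them. A minimal sketch of how that could look, assuming the values arrive as environment variables; the helper name `read_settings` is hypothetical and not taken from the project's settings.py:

```python
import os


def read_settings(env=os.environ):
    """Return (secret_key, debug) from an environment-style mapping."""
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        raise RuntimeError("SECRET_KEY is not set; export it before starting the server")
    # Only the literal string "True" enables debug, matching `export DEBUG=True`.
    debug = env.get("DEBUG", "False") == "True"
    return secret_key, debug
```

In a settings file this might be used as `SECRET_KEY, DEBUG = read_settings()`. Failing early on a missing key tends to be easier to diagnose than an error at request time.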

Dependencies
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser

If the requirements install fails due to psycopg2 (on CentOS 7), try:

sudo yum install python-devel postgresql-devel

Credit: https://stackoverflow.com/questions/5420789/how-to-install-psycopg2-with-pip-on-python

MySQL

The server setup now uses MySQL. A simple schema creation script can be found in the 'creates' file in the root directory. If you want, you can still use SQLite; just change the connection settings in the settings.py file.

Either way, settings.py will need to be updated with your server address, password etc. under

DATABASES = {...
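As a rough illustration of what that block might contain, here is a sketch that builds a Django-style DATABASES dict for either backend. The `DB_*` environment variable names, the default database name and the `build_databases` helper are assumptions made for this example; only the two `ENGINE` strings are standard Django backend paths.

```python
import os


def build_databases(env=os.environ):
    """Build a Django-style DATABASES dict from environment variables."""
    if env.get("DB_ENGINE", "mysql") == "sqlite":
        return {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                "NAME": env.get("DB_NAME", "db.sqlite3"),
            }
        }
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": env.get("DB_NAME", "taggernews"),  # assumed database name
            "USER": env.get("DB_USER", ""),
            "PASSWORD": env.get("DB_PASSWORD", ""),
            "HOST": env.get("DB_HOST", "localhost"),
            "PORT": env.get("DB_PORT", "3306"),
        }
    }


DATABASES = build_databases()
```

Reading the password from the environment keeps it out of version control, which may or may not match how the project's own settings.py is written.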

Start the server
python manage.py runserver
open http://localhost:8000

You should now see a tagger news clone start up with no data.

For the admin interface:

open http://localhost:8000/admin

Data Gathering and Analysis

Create a model

Now create an LDA (Latent Dirichlet Allocation) model. In analyze_hn, run:

python tagger/model_topics.py

This will provide you with a dictionary and gensim files.
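For readers new to topic modelling: a dictionary in this sense maps each distinct token to an integer id, and each document is then reduced to (id, count) pairs before the LDA model sees it. The sketch below illustrates that idea in plain Python. It is not gensim's implementation and not the project's code; both function names are made up for the example.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct token an integer id, in first-seen order."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())
```

Tokens that were not seen when the dictionary was built are dropped here, which is why the predictor has to be pointed at the same dictionary that the model was trained with.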

Prediction

Plug these into the predictor by editing its dictionary and lda fields, then run:

python tagger/predict_topics.py

The predictor will need to load the training data on the first run. This takes about 15k requests, so it will take a long time. The bulk of this time is spent retrieving the article text for every entry in supervised_topics.csv.

This is slow due to a Python bug on OS X that prevents multiprocess parsing of the supervised_topics list, which is needed for initial model building. If you're running it on a proper *nix, try setting

THREAD_COUNT = (some number appropriate for your cores)

in refresh_top_articles.py (just don't abuse the HN API too much).
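To show what a worker-count setting like this typically controls, here is a small sketch that fans a list of URLs out over a pool. It uses a thread pool from the standard library rather than the multiprocess approach mentioned above, and `fetch_all` and `fetch_one` are hypothetical names, not functions from refresh_top_articles.py.

```python
from concurrent.futures import ThreadPoolExecutor

THREAD_COUNT = 4  # assumption: tune to your machine and stay polite to the API


def fetch_all(urls, fetch_one, workers=THREAD_COUNT):
    """Apply fetch_one to every URL using a pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Because the work is mostly waiting on the network, a modest number of threads is usually enough; raising it mainly increases the load on the remote API.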

Once parsing of the training data is done, it will create two prediction models under ml_models/predictions. These will be used when tagging. The tagger is smart enough to pick the latest model in the folder.
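One plausible way to pick the latest model is by file modification time. The sketch below assumes that interpretation; the project may instead rely on names or timestamps embedded in the file names, and `latest_model` is a hypothetical helper.

```python
from pathlib import Path


def latest_model(folder, pattern="*"):
    """Return the most recently modified file in folder, or None if empty."""
    candidates = [p for p in Path(folder).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, `latest_model("ml_models/predictions")` would return the newest file in that folder, or None before the first training run has finished.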

Tagging

Using the training set, you can now put the machine to work. If you run it at this stage, the tagger will only re-tag the articles from supervised_topics.csv.

python manage.py tag_articles

Or, to run indefinitely (sleeping while it's out of work):

python tagger/management/commands/tag_articles.py
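The "run indefinitely, sleep while out of work" behaviour is a common polling pattern. A minimal sketch of it follows, assuming two caller-supplied callables; `run_forever`, `fetch_batch` and `tag` are illustrative names and do not come from tag_articles.py.

```python
import time


def run_forever(fetch_batch, tag, idle_seconds=60, max_idle_cycles=None):
    """Tag batches as they appear; sleep when fetch_batch returns nothing.

    max_idle_cycles exists only so the loop can be stopped in a demo or test.
    """
    idle = 0
    while max_idle_cycles is None or idle < max_idle_cycles:
        batch = fetch_batch()
        if batch:
            idle = 0
            for article in batch:
                tag(article)
        else:
            idle += 1
            time.sleep(idle_seconds)
```

With `max_idle_cycles=None` the loop never exits on its own, so it would be stopped with Ctrl-C or by the process manager.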

Importing

To import data, run:

python manage.py refresh_top_articles

This will start the parser, getting the top articles and saving them to the database. You can then run the tagging step again to tag these articles. Once this is complete, you should be able to see a reasonably accurate front page of HN with tags.

User Tagging

To tag all the users (and sleep when done, as above):

python tagger/tag_user.py

To tag a list of users, provide the usernames as command-line arguments.

This is a reasonably intensive thing to do. Each comment that a user makes is currently tied back to the original article it was posted on (via a recursive call chain to the API); the article is then parsed and put in the list to be tagged.
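The recursive walk from a comment back to its article can be sketched as follows. It relies on the public HN API, where each item is served as JSON and comments carry a `parent` id; the function names are hypothetical and the project's own code may be structured differently.

```python
import json
from urllib.request import urlopen

# Public HN API item endpoint.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one item (story or comment) from the HN API as a dict."""
    with urlopen(ITEM_URL.format(item_id)) as response:
        return json.load(response)


def find_root_story(item_id, fetch=fetch_item):
    """Follow `parent` links upward until an item with no parent is reached."""
    item = fetch(item_id)
    parent_id = item.get("parent") if item else None
    if parent_id is None:
        return item
    return find_root_story(parent_id, fetch)
```

Each hop costs one HTTP request, so a deeply nested comment needs several round trips before the article is even known. That is a large part of why tagging a user is slow.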

The first time a user is queried for their tags, it won't work: tagging is too slow because of the amount of web fetching that needs to happen.

The front page submitters will be pre-fetched when refresh_top_articles is run.

In action

To view the tags, hover over a username on the website. If the user has been parsed you will see their favourite topics. Otherwise check back a few minutes later after the tagging is complete.

Contributors

danrobinson, dchecks, dodger487, mahidan, ngould
