
reddit_supplement_text_classification's Introduction

Data Science I Final Project, Fall 2017, Sage Hahn

In this project I investigated different ways that large unlabeled datasets can be leveraged for common binary text classification tasks. In particular, I wanted to improve upon 'naive' strategies like keyword searches, as well as upon established supervised learning approaches. Most of my efforts are 'semi-supervised learning' techniques, and therefore require varying levels of user input. I present an array of techniques, both successful and unsuccessful, and back up each attempt with extensive experimentation. I also provide insights as to why certain methods succeed over others.
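For readers unfamiliar with the 'naive' keyword baseline mentioned above, here is a minimal sketch of what such a classifier can look like. The function name, keywords and example posts are all hypothetical and are not taken from the notebooks; it only illustrates the idea of labeling a post positive when it contains any keyword.

```python
import re


def keyword_classify(text, keywords):
    """Label a post positive if any keyword occurs as a whole word.

    Hypothetical baseline for illustration only, not the project's code.
    Matching is case-insensitive and respects word boundaries.
    """
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up example posts and keywords.
posts = ["I love my new Puppy!", "hotdogs are great", "Stock prices fell today"]
labels = [keyword_classify(p, ["puppy", "dog"]) for p in posts]
```

A baseline like this needs no labeled data, but it misses posts that discuss the topic without using any listed keyword, which is one reason to look beyond it.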

report_SAHAHN.pdf - Contains the final report in pdf form, where the bulk of the analysis takes place.

slides_SAHAHN.pdf - Contains the presentation slides for the report.

createRedditTextBlocks.ipynb - Jupyter Notebook where I read in the raw data and create 'redditTextBlocks', which are pickled arrays of 10,000,000 posts each.

exploringMethods.ipynb - This Jupyter Notebook is where I do the bulk of my research and exploratory work. A number of plots are also generated in this file.

initialExplore.ipynb - Jupyter Notebook containing a brief initial exploration of the data and of the larger question.

preprocessing.ipynb - In this Notebook I apply the global preprocessing steps to all of the data, and save the output as a pickled array for future use.

testingDifferentPreProc.ipynb - In this Notebook I explore the different preprocessing steps I might take, and explain the reasoning behind the choices I make.

word2VecApproach.ipynb - In this Notebook I explore a technique using word embeddings, word2vec, and a convolutional neural net implemented in Keras.

ideas.txt - A text file containing dated entries of some of my thoughts and ideas while completing this project. This document is fairly informal.

html/ - This folder contains saved HTML versions of all of the above notebooks in case they are needed.

data/ - THESE FILES HAVE BEEN REMOVED DUE TO THEIR > 100 MB SIZE. What data/ did contain was:

The 'Corpus' files: redditTextBlock1.pkl, redditTextBlock2.pkl, courseraBlogs.txt,
courseraNews.txt and courseraTwitter.txt, and a version of each of these renamed with
Proc.pkl added to the end, e.g. redditTextBlock2Proc.pkl or courseraTwitterProc.pkl,

as well as glove.6B/, which contained the GloVe word embeddings for 50, 100, 200 and 300 dimensions.
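Since the data/ files are no longer in the repository, the following is a rough sketch of how files with the naming scheme above could be regenerated. It is not the code from createRedditTextBlocks.ipynb: the helper names write_blocks and processed_name are hypothetical, and it assumes each block is stored as a plain Python list of post strings.

```python
import os
import pickle
import tempfile
from itertools import islice


def processed_name(filename):
    """Map a corpus file name to its 'Proc.pkl' counterpart.

    e.g. courseraTwitter.txt -> courseraTwitterProc.pkl
    """
    stem, _ext = os.path.splitext(filename)
    return stem + "Proc.pkl"


def write_blocks(posts, block_size, out_dir, prefix="redditTextBlock"):
    """Split an iterable of posts into pickled blocks of block_size posts.

    Files are numbered from 1 (redditTextBlock1.pkl, redditTextBlock2.pkl, ...).
    The final block may hold fewer than block_size posts.
    """
    iterator = iter(posts)
    paths = []
    index = 1
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            break
        path = os.path.join(out_dir, f"{prefix}{index}.pkl")
        with open(path, "wb") as handle:
            pickle.dump(block, handle)
        paths.append(path)
        index += 1
    return paths


if __name__ == "__main__":
    # Tiny demonstration with made-up posts; the project used 10,000,000 per block.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_posts = (f"post {i}" for i in range(5))
        for block_path in write_blocks(demo_posts, block_size=2, out_dir=tmp_dir):
            print(os.path.basename(block_path))
```

Reading posts through an iterator keeps only one block in memory at a time, which matters at 10,000,000 posts per block.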
