
reimagined-pancake

Note: this project is under active development. Contributors (especially to documentation) are welcome.

Author: Chris Gibb
Twitter scraper and natural language processing platform.

Command-line interface (CLI) based platform for high-volume Twitter analytics.

Software Requirements:

  • Linux-based Operating System (OS)
  • NodeJS
  • Java (optional; only needed for the built-in text analytics)
  • Familiarity with the CLI

Minimum Recommended Hardware Requirements

  • Intel Celeron dual-core @ 1.4 GHz
  • 2GB RAM
  • 500GB - 1TB disk space (depending on length and intensity of use)

Building From Source

From the directory where the source was cloned into:

Install (local) dependencies

bash install.bash

Assumes that javac, g++ and node are installed globally.

Build everything

bash build.bash

Usage

Authorization

The platform requires Twitter developer credentials in order to make requests to the Twitter API. It uses Twitter's application-only authorization; see https://dev.twitter.com/oauth/application-only for more details.

Once you have acquired credentials from Twitter, a file named keys.json must be created in the dep directory (or wherever you have placed the built application). It should look like this:

{
    "consumerKey" : "",
    "consumerSecret" : "",
    "requestToken" : "",
    "requestTokenSecret" : "",
    "accessToken" : "",
    "accessTokenSecret" : ""
}

Acquiring Tweets

To run a round of mining:

node tweetScheduler --dataDir=data --threads=1 --iterations=1

See tweetScheduler documentation for more information.

Learning About Tweets

By default, the platform comes with a wrapper over the Stanford Natural Language Processing Group's Named Entity Recognizer (SNER) in the form of nerLearner.jar; see http://nlp.stanford.edu/software/CRF-NER.shtml for more information. It builds a database of words recognized by SNER and allows other utilities to apply them to tweets in order to filter out words of interest. Sensible defaults for running nerLearner.jar are provided in runNerLearner.sh, which takes a single argument: a path to a file listing the tweet bins to operate on.

To learn about words of interest (people, places, organizations, mentions, and hashtags) in tweets collected between 12:00 AM and 11:59 PM on January 1, 2017, first generate a bin listing for that range. Assuming tweets have been saved into a directory named data:

./genListing data 2017 Jan 01 > Jan012017Listing
sh runNerLearner.sh Jan012017Listing
rm Jan012017Listing

By default, runNerLearner.sh writes its results to classifiers/learned. Note that if this output directory does not exist, it silently produces no output, so create it first.

All Together Now

node tweetScheduler --dataDir=data --threads=1 --iterations=1
sh runNerLearner.sh modBinsData
rm modBinsData

This runs a round of mining and then learns only from the new tweets acquired in that round.

Tagging Tweets

To visualize and analyze tweets, they need to be tagged with the words of interest generated by nerLearner.
Always generate a fresh listing and run tagging before generating visualization files, so that the freshest data is available (assuming you are acquiring tweets continuously).

./tagApplicator Jan2017Listing classifiers/learned

This assumes that words of interest are being written to classifiers/learned (the default).
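To make the tagging step concrete: conceptually, each tweet is matched against the learned words of interest. The sketch below illustrates only the idea — tagApplicator does the real work, and the data shapes here (a plain set of words, lower-case substring matching) are hypothetical:

```javascript
// Illustrative only: tagApplicator is the platform's real tagging tool.
// This shows the underlying idea of marking tweets that contain any
// learned word of interest. Data shapes here are hypothetical.
const wordsOfInterest = new Set(["toronto", "nasa"]); // e.g. loaded from classifiers/learned

function tagTweet(text) {
  // Collect every word of interest that appears in the tweet text.
  const tags = [...wordsOfInterest].filter(w => text.toLowerCase().includes(w));
  return { text, tags };
}

console.log(tagTweet("NASA launched from Toronto today"));
```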

Generating Files for Visualization

Extract words of interest and sentiment information

node extract --listing=Jan2017Listing > Jan2017Dump

Average out word occurrences and average sentiment

node average --file=Jan2017Dump --date=Jan2017 > Jan2017.json

Jan2017.json can now be visualized and queried.
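For intuition, the averaging step can be pictured as grouping dump records by word and averaging their counts and sentiment. This sketch is not the platform's actual `node average` logic, and the record shape is hypothetical:

```javascript
// Illustrative only: `node average` implements the real logic. This shows
// the idea of averaging per-word occurrence counts and sentiment scores
// across dump records; the record shape is a hypothetical example.
const dumps = [
  { word: "nasa", occurrences: 10, sentiment: 0.6 },
  { word: "nasa", occurrences: 20, sentiment: 0.2 },
];

function average(records) {
  const byWord = {};
  for (const r of records) {
    const e = byWord[r.word] ?? (byWord[r.word] = { n: 0, occurrences: 0, sentiment: 0 });
    e.n += 1;
    e.occurrences += r.occurrences;
    e.sentiment += r.sentiment;
  }
  // Divide the accumulated totals by the number of records per word.
  return Object.fromEntries(Object.entries(byWord).map(([w, e]) =>
    [w, { occurrences: e.occurrences / e.n, sentiment: e.sentiment / e.n }]));
}

console.log(average(dumps)); // nasa averages to 15 occurrences, 0.4 sentiment
```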

