sotera / watchman

Watchman: An open-source social-media event-detection system

License: GNU General Public License v2.0

JavaScript 81.15% HTML 2.42% Java 0.52% Batchfile 0.03% Shell 1.22% Python 12.20% CSS 0.27% Jupyter Notebook 2.19%
event-detection social-media media clustering community-detection tf-idf word2vec named-entity-recognition docker service

watchman's Introduction

Watchman

What is it?

A core set of utilities frequently used in large data processing / ML projects, exposed as REST endpoints. Want to extract text from HTML? We've got it. Caption a set of images scraped from the web? This is your place. Extract entities with MITIE or Stanford NER? Yes, please.

Dependencies

  1. Node 6
  2. Strongloop 2
  3. Bower
  4. Docker 1.12
  5. Python 2.7 + 3.5

Dev bootstrap

# get working copy of .env file from a friend
npm i -g strongloop bower
npm i
# only if models change...
lb-ng server/server.js client/js/lb-services.js

Install with Docker Compose

docker rm $(docker ps -a -q) # optional, remove all un'composed' containers
sudo service docker restart # optional, but should speed things up
cp .env.template .env # add browser API keys, etc.
git clone https://github.com/Sotera/watchman.git app; cd app # optional if in dev env
cp slc-conf.template.json slc-conf.json
sudo script/docker/install-compose.sh
script/deploy/compose up deploy [branch] # branch optional, default: master
# script/deploy/compose up deploy local # deploy local branch, not remote
script/deploy/compose scale image-fetcher=3

# hint: add /docker-compose.override.yml to override services.

Misc

# build mitie-server image
git clone lukewendling/mitie-server
docker build --no-cache --force-rm -t lukewendling/mitie-server .

docker run -d -p 8888:8888 --name mitie lukewendling/mitie-server
./server/workers/start-extractor.js # start workers
# run a worker standalone
WORKER_SCRIPT=./workers/job-queue npm run dev

Tests

Services

conda env create -f services/environment.yml
source activate watchman
python services/run_tests.py

PySpark Docker container (local or standalone cluster mode)

# watchman services must be running
./script/docker/start-pyspark.sh

watchman's People

Contributors

ctwardy, drjagartner, ferventinteractive, justinlueders, lukewendling

watchman's Issues

Move event-finder, job-scheduler env vars to runtime settings

Problem: for env vars that we regularly update for diff envs, consider moving the vars to runtime settings.

Recommended protocol: env var -> settings value -> some default val (Date.now)

TODO:

  • handle conn failures when reading from db. perhaps cache previous values in memory?
  • create a consistent naming scheme for settings keys (ex. job_monitors:start_time)
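
A minimal sketch of the lookup order above, with an in-memory fallback for db connection failures. The class name, the `fetch_setting` callable and the `ConnectionError` it raises are assumptions for illustration, not existing Watchman code:

```python
import os


class SettingsResolver:
    """Resolve a value: env var first, then the settings store, then a default."""

    def __init__(self, fetch_setting, environ=None):
        # fetch_setting(key) is a hypothetical callable that reads one value
        # from the db and raises ConnectionError when the db is unreachable.
        self._fetch = fetch_setting
        self._environ = os.environ if environ is None else environ
        self._cache = {}  # last value successfully read from the db, per key

    def get(self, env_name, settings_key, default_factory):
        # 1. an env var always wins
        if env_name in self._environ:
            return self._environ[env_name]
        # 2. otherwise ask the settings store and remember what it returned
        try:
            value = self._fetch(settings_key)
            if value is not None:
                self._cache[settings_key] = value
                return value
        except ConnectionError:
            # db unreachable: reuse the previous value if we have one
            if settings_key in self._cache:
                return self._cache[settings_key]
        # 3. nothing found anywhere: compute a default (e.g. the current time)
        return default_factory()
```

A call might look like `resolver.get("JOB_START_TIME", "job_monitors:start_time", lambda: int(time.time() * 1000))`, where the env var name is made up and the key follows the naming scheme suggested above.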

what else?

@justinlueders

Improved sentiment clusters

The current method of scoring sentiment by the sum of the whole paragraph merits more scrutiny. My hope is to create better sentiment clusters by breaking sentences out into multiple terms, with each node being a term and its synonyms. We will also look to create likelihood scores based on the same technique employed by comedian.

Feature: prune db after X hours

am i crazy or could we simply use mongo 3's TTL index... for all collections, keyed on the 'created' field, which we currently mix into all models. (the Timestamp mixin)

set all ttl indexes for ~48 hours so we'd be assured of a complete 1 day of prior data, with enough buffer, so we'd know all related records exist for previous day, even when events are created nearly a full day after smposts ... and we don't have to do anything else in code. just let mongo auto remove data, which it does in a background process with this feature.

https://docs.mongodb.com/v3.2/core/index-ttl/
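
As a rough sketch, this builds the MongoDB `createIndexes` command document for such a TTL index. The helper name and the index name are assumptions. The 'created' field must hold a BSON date for the TTL monitor to act on it:

```python
def ttl_index_command(collection, field="created", hours=48):
    """Build a MongoDB `createIndexes` command document for a TTL index."""
    return {
        "createIndexes": collection,  # the command name must be the first key
        "indexes": [
            {
                "key": {field: 1},
                "name": "{}_ttl".format(field),  # hypothetical index name
                # mongo removes a doc once `field` is older than this many seconds
                "expireAfterSeconds": int(hours * 3600),
            }
        ],
    }
```

With pymongo, the result could presumably be passed to `db.command(...)` once per collection. The collection names are left to the caller here.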

@justinlueders @drJAGartner

Pull event finder into the app

right now we have to set up and run the event finder script to do the last part....i want to move it into the code to be run as part of the system.

Link Watchman events to news events (as provided by QCR)

One of our collaborators has an array of scrapers creating geo events based on stories by major news outlets. This information is available in a mongo db on the QCR system, so we would like to see whether we can link these stories to our events, either by matching headline keywords to event keywords or by direct comparison of URLs.

We can discuss this more on Monday.

Better Loopy scrolling: scroll by _id

replace 'time slice' (get_next_block) implementation with a more general-purpose, scrolling feature based on _id field sorting. this should make get_next_page calls in all services safe from mongo skip() problems.
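
A sketch of the idea, assuming a hypothetical `fetch_page(last_id, limit)` callable that wraps the actual query (in mongo terms, something like a `{"_id": {"$gt": last_id}}` filter sorted by `_id` with a limit):

```python
def scroll_by_id(fetch_page, page_size=100):
    """Yield every document, one page at a time, ordered by _id.

    fetch_page(last_id, limit) is a hypothetical callable returning up to
    `limit` documents with _id greater than last_id (or from the start when
    last_id is None), sorted by _id ascending.
    """
    last_id = None
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            return
        for doc in page:
            yield doc
        # resume after the last _id seen instead of using skip()
        last_id = page[-1]["_id"]
```

Because each page starts from the last `_id` seen rather than from an offset, documents inserted or removed mid-scroll do not shift the page boundaries.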

Job Monitor delayed start

We need a system that will poll the job monitor collection watching for jobs that have sufficient data to be processed.
steps would be on a timer:

  • Is a job monitor running?
    • Yes: return
    • No: Get the earliest, non complete, non timed out job in the collection
      • do we have enough SMP to process this job?
        • Yes: Have we gotten a SMP in the last n minutes?
          • Yes: return
          • No: Run the job!
        • No: Have we passed our timeout for this window?
          • Yes: time the job out and ignore it.
          • No: return
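
The decision tree above can be sketched as one pure function called on each timer tick. All parameter names are made up for illustration, and the "no pending job" branch is an assumption the tree does not spell out:

```python
def next_action(monitor_running, job, post_count, min_posts,
                minutes_since_last_post, quiet_minutes, window_timed_out):
    """Return 'wait', 'run' or 'timeout' for one tick of the polling timer."""
    if monitor_running:
        return "wait"        # a job monitor is already running
    if job is None:
        return "wait"        # assumption: nothing pending in the collection
    if post_count >= min_posts:
        if minutes_since_last_post < quiet_minutes:
            return "wait"    # posts are still arriving for this window
        return "run"         # enough data and ingestion has gone quiet
    if window_timed_out:
        return "timeout"     # too little data and the window has expired
    return "wait"
```

Keeping the decision free of db access should make it easy to unit test. The caller would look up the earliest incomplete job and the post counts, then act on the returned value.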

Datetime Utils

I find myself constantly rewriting scripts to convert from string time to datetime to ms timestamp, etc. I am adding a package in utils to do all this for us for common types.
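
A minimal sketch of what such helpers might look like (Python 3 only). The function names and the default input format are assumptions. Everything is treated as UTC, and the conversions use integer arithmetic to avoid float rounding:

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed input format; adjust per data source
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def str_to_datetime(text, fmt=ISO_FORMAT):
    """Parse a UTC time string into a timezone-aware datetime."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def datetime_to_ms(dt):
    """Convert an aware datetime to a millisecond epoch timestamp."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def ms_to_datetime(ms):
    """Convert a millisecond epoch timestamp to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def str_to_ms(text, fmt=ISO_FORMAT):
    """Parse a UTC time string straight to a millisecond epoch timestamp."""
    return datetime_to_ms(str_to_datetime(text, fmt))
```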

Experiment: Do docker-compose deployments complete faster?

Problem

Deployments are detrimentally slow on Openstack, often taking over an hour. Can docker-compose help, since it is less liberal in wiping containers by default? This also gives us the opportunity to reconsider SLC-based deployment configs or even upgrade to next-gen SLC tools.

Create "Topic" service

When we don't eliminate high-volume persistent hashtags, the system actually serves as a good tool for daily topic modeling, which is how our events are perceived and used at this time. As we move to a more granular event model, it would be good to reproduce this behavior as a once-a-day topic modeling service.

Add JobSet entity for better JobMonitor mgmt

Purpose

Let's revisit job running and scheduling for a more robust solution to processing historical data from QCR. We previously identified the need for a parent entity of JobMonitors, and the new entity is potentially a good candidate to drive the improved job scheduler.

Process Flow

max_retries = job interval * multiplier / loop interval

job set exists?
  N
    create
    (a) posts count > MIN_POSTS?
      N
        retry count == max_retries
          N
            increment retry counter
            exit
          Y
            set state = 'skip'
            exit
      Y
        set state = 'running'
        create monitors
  Y
    job set running?
      N
        goto (a)    
      Y
        exit
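
A sketch of the flow above as a single function run on each scheduler loop. The dict-based job set and every name here are assumptions for illustration:

```python
def max_retries(job_interval, multiplier, loop_interval):
    """max_retries = job interval * multiplier / loop interval (whole loops)."""
    return (job_interval * multiplier) // loop_interval


def tick(job_set, post_count, min_posts, retry_limit):
    """Advance a job set by one scheduler loop and return it."""
    if job_set is None:
        job_set = {"state": "new", "retries": 0}   # "create"
    elif job_set["state"] == "running":
        return job_set                              # already running: exit
    # (a) enough posts to start?
    if post_count > min_posts:
        job_set["state"] = "running"
        # creating the JobMonitors would happen here
    elif job_set["retries"] == retry_limit:
        job_set["state"] = "skip"
    else:
        job_set["retries"] += 1
    return job_set
```

As in the flow, a job set that is not running always re-enters step (a), so a skipped set stays skipped unless enough posts show up later.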

Non ascii loopy queries

To perform hashtag clustering, we need to query the post clusters for the terms we are seeing. The current branch #54 will not create non-English hashtag clusters as is.

Research: how should queue worker gracefully (re)start?

Problem: after deployment, or service (re)start, services immediately begin to pull existing items from redis lists.

brainstorm: what to do on container start/restart, on deploy, and on slc service start/restart?

what happens to python services that are working on related jobs when keys are deleted by the queue worker?

cc @justinlueders

Hashtag Bayesian likelihood filter

Presently, our signal is drowned out by the persistent noise of Twitter. To ensure that we emphasise events, we will use a Bayesian likelihood filter on hashtag terms. This should have two effects:

1.) Normalize the importance of hashtags: We will model the number of hashtag terms with a Poisson distribution, and use a cumulative probability to characterize the importance of this number of terms. This puts all hashtag cluster weights on a 0 - 1 scale.

2.) Terminate events: Events are merged across time windows. Since hashtags persist across most time windows, events rarely terminate, and we create one large, highly connected graph. Killing large, persistent nodes will make events more localized in time.
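
A rough sketch of the normalization in 1.), using only the standard library. Which cumulative probability to use is not stated above. This version assumes the weight is P(X < count) under the hashtag's usual rate, so counts far above the baseline approach 1:

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), computed by summing the pmf."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)   # pmf at 0
    total = term
    for i in range(1, k + 1):
        term *= lam / i     # pmf(i) = pmf(i - 1) * lam / i
        total += term
    return min(total, 1.0)


def hashtag_weight(count, expected_count):
    """Score an observed hashtag count against its usual rate, on a 0-1 scale."""
    # assumption: weight = P(X < count) under the baseline rate
    return poisson_cdf(count - 1, expected_count)
```

A persistent hashtag that shows up at its usual rate lands in the middle of the scale, which is what would let a threshold prune it. Note that `math.exp(-lam)` underflows for very large rates, so a library implementation may be a better fit in production.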

Pointwise mutual information

As an alternative to the current method of non-hashtag sentiment clustering, we can try computing pointwise mutual information (PMI) scores on word bigrams made of non-stopwords. As with hashtag clustering, we can then assess how likely such a high PMI score is, and from there decide whether to include the bigram in our graph.
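
For reference, a small sketch of bigram PMI, log2(p(x, y) / (p(x) p(y))), with probabilities estimated from raw counts. Whitespace tokenization and dropping stopwords before pairing are simplifying assumptions, not the project's actual preprocessing:

```python
import math
from collections import Counter


def bigram_pmi(sentences, stopwords=frozenset()):
    """Return {(w1, w2): PMI in bits} for adjacent non-stopword pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if w not in stopwords]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

PMI is known to inflate scores for rare pairs, so a minimum count per bigram would probably be needed before thresholding.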

fix mkdirp bugs

python services create dirs in /downloads with perms 777
but /downloads/threats was created by node process as 755

move all mkdirp code into node boot script
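
Until that move happens, a helper along these lines could keep the python side consistent. The function name and the 0o775 mode are assumptions, and `os.chmod` will still fail if another user owns the directory:

```python
import os


def ensure_dir(path, mode=0o775):
    """Create `path` (and parents) if needed and force its permission bits."""
    os.makedirs(path, exist_ok=True)
    # makedirs applies the process umask, so set the mode explicitly;
    # only the leaf directory is changed here, not its parents
    os.chmod(path, mode)
    return path
```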

Research redis conn failures in python client

There are currently no actual failures...this is just to note that we should check to see what happens in the event of a failure. This has been a test of the emergency redis failure system...This is only a test.

Improve PostsClusters UI

in particular, clicking individual items to reveal tweet info is not helpful when running full system tests. much more helpful to see more than 1 at a time.

for images, maybe a thumbnail of the original, and then click to open full sized.

for text, hashtags, maybe click to open original tweet.

Load test data with pyspark

Use Spark 2.0 docker containers (1 master/ 1 slave) to encapsulate a simple py script that loads data directly into mongo (loopy not a good candidate) for ingesting fairly large datasets (> 10M posts).

Item in list query

Currently, there is no way to perform a 'where' query that finds an item in a list. This specifically comes into play when attempting to find socialMediaPosts that have a specific hashtag term. I'm not sure how painful this would be to implement, but it would be a nice feature to have.

Cleanup: events UI

Tasks

  • show spinner when loading events
  • make the spinner more noticeable
  • remove 'aggevent' references
  • hide map if no points to plot
  • project fields in smposts queries to greatly reduce bits over wire
  • misc code cleanup

Add monitoring plots

We need to be able to understand problems more quickly. I would like it if we add a set of simple plots to a 'Social Media Diagnostics' page, which can start with the following plots:

  • Number of tweets ingested in a fixed period (~30 m) vs time
  • Average time interval between the tweet creation time and receiving the tweet
  • Number of posts clusters made in a given time interval
  • Number of events made per day

We may expand this selection, but I think that this subset will give us the ability to pretty quickly diagnose problems like the ones we're seeing now.
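
The first two plots boil down to simple aggregations. A sketch, assuming each post carries hypothetical `created_ms` and `received_ms` fields (millisecond epoch timestamps):

```python
from collections import Counter

BUCKET_MS = 30 * 60 * 1000  # 30-minute buckets


def ingest_counts(posts, bucket_ms=BUCKET_MS):
    """Count posts per time bucket, keyed by bucket start (ms since epoch)."""
    counts = Counter()
    for post in posts:
        counts[post["received_ms"] // bucket_ms * bucket_ms] += 1
    return dict(sorted(counts.items()))


def mean_ingest_lag_ms(posts):
    """Average delay between tweet creation and receipt, in milliseconds."""
    if not posts:
        return 0.0
    return sum(p["received_ms"] - p["created_ms"] for p in posts) / len(posts)
```

The same bucketing would work for posts clusters and events by swapping the timestamp field and the bucket size.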

Create image fetcher service to preprocess posts

Currently, the caffe service handles grokking the 'primary image url' from all posts, and it also downloads the image, if found. This subprocess is really slow! Let's move image handling to another service that can run before featurizing step, to improve scaling opportunities and speed things up for the featurizer service.
