Watchman: An open-source social-media event-detection system

License: GNU General Public License v2.0

JavaScript 81.15% HTML 2.42% Java 0.52% Batchfile 0.03% Shell 1.22% Python 12.20% CSS 0.27% Jupyter Notebook 2.19%

event-detection social-media media clustering community-detection tf-idf word2vec named-entity-recognition docker service

watchman's Introduction

Watchman

What is it?

A core set of utilities frequently used in large data processing / ML projects, exposed as REST endpoints. Want to extract text from HTML?... we've got it. Caption a set of images scraped from the web?... this is your place. Extract entities with MITIE or Stanford NER. Yes please.

Dependencies

Node 6
Strongloop 2
Bower
Docker 1.12
Python 2.7 + 3.5

Dev bootstrap

# get working copy of .env file from a friend
npm i -g strongloop bower
npm i
# only if models change...
lb-ng server/server.js client/js/lb-services.js

Install with Docker Compose

docker rm $(docker ps -a -q) # optional, remove all un'composed' containers
sudo service docker restart # optional, but should speed things up
cp .env.template .env # add browser API keys, etc.
git clone https://github.com/Sotera/watchman.git app; cd app # optional if in dev env
cp slc-conf.template.json slc-conf.json
sudo script/docker/install-compose.sh
script/deploy/compose up deploy [branch] # branch optional, default: master
# script/deploy/compose up deploy local # deploy local branch, not remote
script/deploy/compose scale image-fetcher=3

# hint: add /docker-compose.override.yml to override services.

Misc

# build mitie-server image
git clone lukewendling/mitie-server
docker build --no-cache --force-rm -t lukewendling/mitie-server .

docker run -d -p 8888:8888 --name mitie lukewendling/mitie-server
./server/workers/start-extractor.js # start workers

# run a worker standalone
WORKER_SCRIPT=./workers/job-queue npm run dev

Tests

Services

conda env create -f services/environment.yml
source activate watchman
python services/run_tests.py

PySpark Docker container (local or standalone cluster mode)

# watchman services must be running
./script/docker/start-pyspark.sh

watchman's Issues

Implement Redis message queue

Replace pubsub with simple message queue, to support multiple workers.

Mostly like this: https://danielkokott.wordpress.com/2015/02/14/redis-reliable-queue-pattern/

Requirements

no race conditions between workers
failed jobs should log same as current system
monitoring redis keys?
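
A rough sketch of the reliable queue idea from the linked post, using an in-memory stand-in rather than a real Redis client. The class and method names are made up for illustration. In Redis, the claim step would be a single atomic RPOPLPUSH from the pending list to a processing list, which is what keeps two workers from taking the same job, and the ack step would be an LREM on the processing list.

```python
from collections import deque


class InMemoryReliableQueue:
    """In-memory stand-in for the pattern (hypothetical class, not app code)."""

    def __init__(self):
        self.pending = deque()   # Redis: the main job list
        self.processing = []     # Redis: the per-worker processing list

    def push(self, job):
        # Redis: LPUSH onto the pending list.
        self.pending.appendleft(job)

    def claim(self):
        # Redis: RPOPLPUSH pending -> processing (atomic on the server).
        if not self.pending:
            return None
        job = self.pending.pop()
        self.processing.append(job)
        return job

    def ack(self, job):
        # Redis: LREM on the processing list once the job has finished.
        self.processing.remove(job)

    def requeue_unfinished(self):
        # After a crash or restart, move claimed-but-unfinished jobs back
        # so that the oldest one is claimed first.
        while self.processing:
            self.pending.append(self.processing.pop())
```

A failed job stays in the processing list until it is acknowledged or requeued, so it can still be logged or inspected there.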

Improve fault tolerance of redis integration

Currently, when redis stops (mostly in dev), the node processes die b/c there is no restart strategy in our redis config, both in kue, and in job monitors.

Move event-finder, job-scheduler env vars to runtime settings

Problem: for env vars that we regularly update for diff envs, consider moving the vars to runtime settings.

Recommended protocol: env var -> settings value -> some default val (Date.now)
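
A minimal sketch of that lookup order, assuming the runtime settings are available as a plain dict. The function name, the environment variable name and the dict are hypothetical, not part of the app.

```python
import os
import time


def resolve_setting(env_name, settings, settings_key, default_factory):
    """Return the first available value: env var, then settings, then default."""
    value = os.getenv(env_name)
    if value is not None:
        return value
    if settings_key in settings:
        return settings[settings_key]
    return default_factory()


# Hypothetical names, for illustration only.
runtime_settings = {"job_monitors:start_time": "1480000000000"}
start_time = resolve_setting(
    "JOB_MONITOR_START_TIME",
    runtime_settings,
    "job_monitors:start_time",
    lambda: int(time.time() * 1000),  # rough equivalent of Date.now()
)
```

Passing the default as a callable means a time-based default such as Date.now is only computed when it is actually needed.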

TODO:

handle conn failures when reading from db. perhaps cache previous values in memory?
create a consistent naming scheme for settings keys (ex. job_monitors:start_time)

what else?

Improved sentiment clusters

The current method of scoring sentiment by the sum of the whole paragraph merits more scrutiny. My hope is to create better sentiment clusters by breaking out sentences into multiple terms, with each node being a term and its synonyms. We will also look to create likelihood scores based on the same technique employed by comedian.

Feature: prune db after X hours

am i crazy or could we simply use mongo 3's TTL index... for all collections, keyed on the 'created' field, which we currently mix into all models. (the Timestamp mixin)

set all ttl indexes for ~48 hours so we'd be assured of a complete 1 day of prior data, with enough buffer, so we'd know all related records exist for previous day, even when events are created nearly a full day after smposts ... and we don't have to do anything else in code. just let mongo auto remove data, which it does in a background process with this feature.

https://docs.mongodb.com/v3.2/core/index-ttl/

@justinlueders @drJAGartner

Rebuild all containers and push new versions for new redis queues

Remove html tags when extracting with mitie

remove html tags from text. also rm duplicate extracted entities in parsedEvents.

Run job monitors in a worker process

Detach long-running jobs from web processes.

Pull event finder into the app

right now we have to set up and run the event finder script to do the last part....i want to move it into the code to be run as part of the system.

Link Watchman events to news events (as provided by QCR)

One of our collaborators has an array of scrapers creating geo events based on stories by major news outlets. This information is available in a mongo db on the QCR system, and as such, we would like to see if we are able to link these stories to our events by either linking headline keywords to event keywords, or by direct comparison of URLs.

We can discuss this more on Monday.

Better Loopy scrolling: scroll by _id

replace 'time slice' (get_next_block) implementation with a more general-purpose, scrolling feature based on _id field sorting. this should make get_next_page calls in all services safe from mongo skip() problems.
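
A sketch of what _id-based scrolling could look like, independent of Loopy's real interface. The `fetch_page` callable and the in-memory data are assumptions made for the example. Each page asks for records with an _id greater than the last one seen, so no skip() offset is ever needed.

```python
def scroll_by_id(fetch_page, page_size=100):
    """Yield every record, paging by _id instead of skip().

    fetch_page(after_id, limit) is a hypothetical callable that returns up to
    `limit` records with _id greater than after_id, sorted ascending by _id
    (after_id is None for the first page).
    """
    last_id = None
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            return
        for record in page:
            yield record
        last_id = page[-1]["_id"]


# In-memory stand-in for a collection, for illustration only.
records = [{"_id": i} for i in range(1, 8)]


def fake_fetch(after_id, limit):
    rows = [r for r in records if after_id is None or r["_id"] > after_id]
    return rows[:limit]


all_rows = list(scroll_by_id(fake_fetch, page_size=3))
```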

Remove tweet 'instagram' field on scraped tweets

Reminder: we have a dependency on 'instagram.url'. remove this field? @justinlueders

Job Monitor delayed start

We need a system that polls the job monitor collection, watching for jobs that have enough data to be processed. On a timer, the steps would be:

Is a job monitor running?
- Yes: return
- No: Get the earliest, non complete, non timed out job in the collection
  - do we have enough SMP to process this job?
    - Yes: Have we gotten a SMP in the last n minutes?
      - Yes: return
      - No: Run the job!
    - No: Have we passed our timeout for this window?
      - Yes: time the job out and ignore it.
      - No: return
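
The decision tree above can be sketched as one pure function that the timer calls on each tick. The job field names, the thresholds and the return values are hypothetical, chosen only to make the branches explicit.

```python
def next_action(monitor_running, job, now, min_posts, quiet_secs):
    """Decide what the polling timer should do on this tick.

    job is a dict with hypothetical keys post_count, last_post_at and
    timeout_at (times in seconds since the epoch), or None when there is
    no incomplete, non-timed-out job. Returns "wait", "run" or "timeout".
    """
    if monitor_running or job is None:
        return "wait"
    if job["post_count"] >= min_posts:
        # Enough posts: run only once no post has arrived for quiet_secs.
        recently_active = now - job["last_post_at"] < quiet_secs
        return "wait" if recently_active else "run"
    # Not enough posts: give up once the window's timeout has passed.
    return "timeout" if now >= job["timeout_at"] else "wait"
```

Keeping the decision free of database calls makes each branch easy to test. The caller would look up the earliest eligible job and apply the returned action.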

Datetime Utils

I find myself constantly rewriting scripts to convert from string time to datetime to ms timestamp, etc. I am adding a package in utils to do all this for us for common types.
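
A sketch of the kind of helpers such a package might contain, assuming Python 3, UTC timestamps and an ISO-like string format. The function names and the format constant are illustrative, not the actual utils API.

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed input format
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def str_to_datetime(text, fmt=ISO_FORMAT):
    """Parse a time string and treat it as UTC."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def datetime_to_ms(dt):
    """Convert a timezone-aware datetime to a millisecond timestamp."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def ms_to_datetime(ms):
    """Convert a millisecond timestamp to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)
```

Using timedelta arithmetic instead of multiplying a float timestamp by 1000 avoids rounding errors on millisecond values.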

Cleanup: create custom Loopy exception class

see code review here: #91

Experiment: Do docker-compose deployments complete faster?

Problem

Deployments are detrimentally slow on Openstack, often taking over an hour. Can docker-compose help, since by default it is less liberal in wiping containers? This also gives us the opportunity to reconsider SLC-based deployment configs, or even upgrade to next-gen SLC tools.

Append quoted tweet text to retweet

We need to account for quoted tweets by augmenting the current tweet with the hashtags and text of the quoted tweet.
https://support.twitter.com/articles/20169873

Convert image-fetcher service to async module

Current python implementation is very slow, in all its I/O blocking glory.

Let's convert to Node and use async streams!

Move mongo to dedicated server in production

Create "Topic" service

When we don't eliminate high-volume persistent hashtags, the system actually serves as a good tool for daily topic modeling, which is how our events are perceived and used at this time. As we move to a more granular event model, it would be good if we could reproduce this as a once-a-day topic modeling service.

Remove Aggregate Clustering

Threat detection service

for Chris' work on threat detection service

upgrade node services to v6.9

Test

run full pipeline -> create events
deploy
update package.json, dotfiles, readme

Add JobSet entity for better JobMonitor mgmt

Purpose

Let's revisit job running and scheduling for a more robust solution to processing historical data from QCR. We previously identified the need for a parent entity of JobMonitors, and the new entity is potentially a good candidate to drive the improved job scheduler.

Process Flow

max_retries = job interval * multiplier / loop interval

job set exists?
  N
    create
    (a) posts count > MIN_POSTS?
      N
        retry count == max_retries
          N
            increment retry counter
            exit
          Y
            set state = 'skip'
            exit
      Y
        set state = 'running'
        create monitors
  Y
    job set running?
      N
        goto (a)    
      Y
        exit
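
The flow above, sketched as one function per loop pass. The job set is modelled as a plain dict and "create monitors" is reduced to a flag, both of which are assumptions for illustration. Integer division in max_retries is also an assumption, since the formula does not say how to round.

```python
def max_retries(job_interval, multiplier, loop_interval):
    """max_retries = job interval * multiplier / loop interval (floored)."""
    return job_interval * multiplier // loop_interval


def tick(job_set, posts_count, min_posts, retries_allowed):
    """Run one pass of the scheduling loop and return the job set."""
    if job_set is None:
        # job set exists? N -> create, then fall through to step (a)
        job_set = {"state": "new", "retries": 0, "monitors_created": False}
    elif job_set["state"] == "running":
        # job set running? Y -> exit
        return job_set
    # step (a): posts count > MIN_POSTS?
    if posts_count > min_posts:
        job_set["state"] = "running"
        job_set["monitors_created"] = True  # stand-in for creating monitors
    elif job_set["retries"] == retries_allowed:
        job_set["state"] = "skip"
    else:
        job_set["retries"] += 1
    return job_set
```

For example, a job interval of 3600 seconds with a multiplier of 2 and a 60 second loop gives max_retries(3600, 2, 60) == 120.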

Non ascii loopy queries

To perform hashtag clustering, we need to query the post clusters for the terms we are seeing. As it stands, the current branch (#54) will not create non-English hashtag clusters.

let Loopy handle socialmediaposts deletions

remove requests module wherever possible
dry up 'destroy' calls by using loopy object

Research: how should queue worker gracefully (re)start?

Problem: after deployment, or service (re)start, services immediately begin to pull existing items from redis lists.

brainstorm: what to do on container start/restart, on deploy, and on slc service start/restart?

what happens to python services that are working on related jobs when keys are deleted by the queue worker?

cc @justinlueders

Add social media endpoint for QCR integration

To capture tweets, fb posts, etc. posted by QCR service.

Bug: Runtime error in caffe featurizer with non-gpu dockerfile

build error says can't build in gpu mode, which i guess is determined by linux (ubuntu) host?

Hashtag Bayesian likelihood filter

Presently, our signal is drowned out by the persistent noise of twitter. To ensure that we emphasise events, we will use a Bayesian likelihood filter on hashtag terms. This should have two effects:

1.) Normalize the importance of hashtags: We will model the number of hashtag terms with a Poisson distribution, and use a cumulative probability to characterize the importance of this number of terms. This puts all hashtag cluster weights on a 0 - 1 scale.

2.) Terminate events: Events are merged across time windows. Since hashtags persist across most time windows, events rarely terminate, and we create one large, highly connected graph. Killing large, persistent nodes will make events more localized in time.
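
A minimal sketch of the normalization step. It assumes the weight is the Poisson probability of seeing fewer terms than were observed, and that the rate is an expected count per window, for example a historical mean. Both choices are assumptions, since the issue does not pin them down, and the function names are illustrative.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def hashtag_weight(observed_count, expected_count):
    """Weight in [0, 1]: probability of seeing fewer than observed_count."""
    if observed_count <= 0:
        return 0.0
    return poisson_cdf(observed_count - 1, expected_count)
```

Under this assumption, a hashtag that appears at about its usual rate gets a middling weight, while one far above its usual rate gets a weight close to 1.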

Cleanup: better error handling

Stop word filter word cloud

Need to get rid of unhelpful terms

Pointwise mutual information

As an alternative to the current method of non-hashtag sentiment clustering, we can try computing pointwise mutual information (PMI) scores on word bigrams made of non-stopwords. As with hashtag clustering, we can then assess the likelihood of a bigram producing such a high PMI score, and from there choose whether to include it in our graph.
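
A small sketch of PMI on bigrams, using PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated from raw counts. Dropping stopwords before pairing adjacent words is one possible reading of the idea, and the function name and input shape are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(tokenized_posts, stopwords=frozenset()):
    """Return {(w1, w2): PMI} for adjacent pairs of non-stopwords."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_posts:
        # Stopwords are dropped first, so pairs can span a removed word.
        words = [t for t in tokens if t not in stopwords]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

PMI tends to favor rare pairs, so a minimum count threshold is usually worth adding before the scores feed into a graph.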

fix mkdirp bugs

python services create dirs in /downloads with perms 777
but /downloads/threats was created by node process as 755

move all mkdirp code into node boot script

Research redis conn failures in python client

There are currently no actual failures...this is just to note that we should check to see what happens in the event of a failure. This has been a test of the emergency redis failure system...This is only a test.

Improve PostsClusters UI

in particular, clicking individual items to reveal tweet info is not helpful when running full system tests. much more helpful to see more than 1 at a time.

for images, maybe a thumbnail of the original, and then click to open full sized.

for text, hashtags, maybe click to open original tweet.

Fix event-finder bugs

move event-finder to own container

Load test data with pyspark

Use Spark 2.0 docker containers (1 master/ 1 slave) to encapsulate a simple py script that loads data directly into mongo (loopy not a good candidate) for ingesting fairly large datasets (> 10M posts).

Item in list query

Currently, there is no way to perform a 'where' query that finds an item in a list. This specifically comes into play when attempting to find socialMediaPosts that have a specific hashtag term. I'm not sure how painful this would be to implement, but it would be a nice feature to have.

Comedian job params do not match submitted job

Looks like query_url/result_url params were modified for comedian:16. revert those to match job sent by node process.

Add Loopback 'inq' query to Loopy

example
filter[where][name][inq]=foo&filter[where][name][inq]=bar
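
A sketch of how a client could build that query string with the standard library. The helper name is made up and this is not Loopy's actual interface.

```python
from urllib.parse import urlencode


def inq_query(field, values):
    """Build a LoopBack-style 'inq' filter query string for one field."""
    key = "filter[where][{}][inq]".format(field)
    # Repeat the same key once per value; keep the brackets unescaped.
    return urlencode([(key, v) for v in values], safe="[]")


query = inq_query("name", ["foo", "bar"])
```

Values are percent-encoded by urlencode, so terms with spaces or non-ASCII characters are escaped rather than sent raw.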

Cleanup: events UI

Tasks

show spinner when loading events
make the spinner more noticeable
remove 'aggevent' references
hide map if no points to plot
project fields in smposts queries to greatly reduce bits over wire
misc code cleanup

Agg-clustering: needs rebuild?

is the current version (5) on docker hub in working order?

Add monitoring plots

We need to be able to understand problems more quickly. I would like it if we add a set of simple plots to a 'Social Media Diagnostics' page, which can start with the following plots:

Number of tweets ingested in a fixed period (~30 m) vs time
Average time interval between the tweet creation time and receiving the tweet
Number of posts clusters made in a given time interval
Number of events made per day

We may expand this selection, but I think that this subset will give us the ability to pretty quickly diagnose problems like the ones we're seeing now.

Include twitter images in featurizer service

We need more images so let's include twitter images in addition to instagram.

Add indexes for bayesian changes in #54

We need to add indexes for new queries added.

replace os.environ[ with getenv

I added some os.environ code in when it would be much better to use getenv
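
A short illustration of the difference, using a hypothetical variable name that is assumed to be unset. Indexing os.environ raises KeyError for a missing variable, while os.getenv returns None or a supplied default.

```python
import os

name = "WATCHMAN_EXAMPLE_UNSET"  # hypothetical variable, assumed not to be set
os.environ.pop(name, None)

try:
    os.environ[name]          # raises KeyError when the variable is missing
    lookup_failed = False
except KeyError:
    lookup_failed = True

value = os.getenv(name, "fallback")  # returns the default instead of raising
```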

Create image fetcher service to preprocess posts

Currently, the caffe service handles grokking the 'primary image url' from all posts, and it also downloads the image, if found. This subprocess is really slow! Let's move image handling to another service that can run before featurizing step, to improve scaling opportunities and speed things up for the featurizer service.

Aggregate-clustering: stop ongoing events after x hours

Problem: Mongo's 16MB doc limit is regularly exceeded for text agg clusters.

We need to copy ongoing event into new doc.

Feature sim clustering: Unused data_type job param in Loopy query?

See https://github.com/Sotera/watchman/blob/master/services/feature-similarity-clustering/main.py#L38

Sanity check: should we be using job['data_type'] in the Loopy query params?

I noticed during a system run that the clustering step was taking a really long time. It is scrolling thru more records than I'd expect (more than the featurizer step) b/c it doesn't filter by data_type.

What am i missing? @drJAGartner @justinlueders
