
thedataridealongs / projectdomino

Scaling COVID public behavior change and anti-misinformation

License: Apache License 2.0

Python 95.87% Dockerfile 0.28% Shell 3.28% Makefile 0.56%
graphs4good covid graph misinformation behavior-change

projectdomino's People

Contributors

007vasy, alx, andreadowning, bechbd, lmeyerov, webcoderz, ziyaowei

projectdomino's Issues

Make limit optional on get_from_neo

Add the ability to make the limit optional on get_from_neo. Leave the default limit in there and maybe add another parameter such as unlimited=true.
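
A minimal sketch of how the optional limit might look. The real signature of get_from_neo is not shown in this issue, so `build_query`, `DEFAULT_LIMIT` and its value are hypothetical names used only for illustration; the `unlimited` flag follows the suggestion above.

```python
from typing import Optional

DEFAULT_LIMIT = 1000  # assumed default; the actual value is not stated in the issue


def build_query(cypher: str,
                limit: Optional[int] = DEFAULT_LIMIT,
                unlimited: bool = False) -> str:
    """Append a LIMIT clause unless the caller explicitly asks for all rows."""
    cypher = cypher.rstrip().rstrip(";")
    if unlimited or limit is None:
        return cypher
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return f"{cypher} LIMIT {int(limit)}"
```

Keeping the default in place means existing callers are unaffected; only callers that pass unlimited=True get the full result set.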

Dynamic Neo4j large ~parquet export

Tracks current effort to get Neo4j to export ~100M node/edge parquet/arrow graphs in decent time for use by analytics stacks

This is for fast on-the-fly mode: dynamic cypher query -> parquet/arrow

Ingest drug synonyms

-- Load drugbank.ca
-- Identify drug names and synonyms
-- Pair against clinical trial drug names
-- Write to neo4j
-- Run continuously via prefect.io
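
The pairing step above could be sketched roughly as follows. This assumes the drugbank.ca data has already been loaded into a plain dict of canonical name to synonyms; the function names and the exact-match-after-normalization rule are illustrative assumptions, not the project's actual matching logic.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_synonym_index(drugs: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every normalized name or synonym to its canonical drug name."""
    index = {}
    for canonical, synonyms in drugs.items():
        for name in [canonical, *synonyms]:
            index[normalize(name)] = canonical
    return index


def pair_trial_drugs(trial_drug_names: Iterable[str],
                     index: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Return (trial_name, canonical_name_or_None) pairs."""
    return [(name, index.get(normalize(name))) for name in trial_drug_names]
```

Unmatched trial names come back paired with None, so they can be reviewed by hand or passed to a fuzzier matcher later.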

Initial BTC sample

Get initial set of ~100 btc/eth address as a CSV:

address | type | tweet id

Package up & send to an intel org for scoring
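
One way to produce that CSV, as a rough sketch. The regular expressions below only match the general shape of BTC and ETH addresses and do no checksum validation, so false positives are possible. The tweet dict layout (`id`, `text`) and the function names are assumptions.

```python
import csv
import io
import re

# Pattern-level matches only; no checksum validation is attempted.
PATTERNS = {
    "btc": re.compile(
        r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{11,71})\b"
    ),
    "eth": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
}


def extract_addresses(tweets):
    """Yield (address, type, tweet_id) for each candidate address in each tweet."""
    for tweet in tweets:
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(tweet["text"]):
                yield match.group(0), kind, tweet["id"]


def to_csv(rows):
    """Serialize the rows with the header requested in the issue."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["address", "type", "tweet_id"])
    writer.writerows(rows)
    return buf.getvalue()
```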

Twitter Response Bot

Tracks effort for direct intervention on Twitter:

-- Daily reports
-- Alerts to in-community activity to subscribers
-- Posting misinfo reports to detected misinfo conversations

Associated tasks:

-- Get org API key
-- Prototype above capabilities
-- Website for use with those

redo neo4j bindings

  • Switch from py2neo to official neo4j python bindings
  • Expose structured parameter & timeout controls
  • ... and underlying connection
  • Helper methods for getting results as a dataframe: table view + (nodes_df, edges_df) view

Add a method to pull nodes in batches

A method to pull batches of Tweets based on a set of filtering criteria (e.g. hydrated='FULL' and btc_extraction_status=null)

This can only pull a single label node as it is meant to pull large batches of Tweets
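
A possible shape for such a method, sketched with keyset pagination on the node id. The `run_query` callable stands in for whatever actually executes Cypher and is hypothetical, as are the function names and the assumption that ids are sortable integers. Note that in Cypher a null check is written `IS NULL` rather than `= null`.

```python
from typing import Callable, Dict, Iterator, List


def batch_query(label: str, where: str, batch_size: int) -> str:
    """Build a keyset-paginated Cypher query for a single node label."""
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    return (
        f"MATCH (n:{label}) WHERE {where} AND n.id > $last_id "
        f"RETURN n ORDER BY n.id LIMIT {int(batch_size)}"
    )


def pull_batches(run_query: Callable[[str, Dict], List[Dict]],
                 label: str,
                 where: str,
                 batch_size: int = 1000,
                 start_after: int = 0) -> Iterator[List[Dict]]:
    """Yield lists of rows until the query returns nothing more."""
    query = batch_query(label, where, batch_size)
    last_id = start_after
    while True:
        rows = run_query(query, {"last_id": last_id})
        if not rows:
            return
        yield rows
        # Resume after the last id seen, so nodes added mid-scan do not shift pages.
        last_id = rows[-1]["n"]["id"]
```

A call might look like pull_batches(run_query, "Tweet", "n.hydrated = 'FULL' AND n.btc_extraction_status IS NULL"). Paging by id rather than SKIP keeps each batch cheap on large labels, provided n.id is indexed.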

Trigram and topic trending

As part of general discovery, such as looking at a campaign within clinical disinformation, need ability to quickly look at trends within different ranges (day, week, keyword filter, ...).

-- Trigram/hashtag trending: Sortable list of hot 2-3 word combos w/ sparklines of activity over timeranges

-- Topic trending: Same thing
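
As a starting point, the n-gram counting could look something like this sketch. It assumes tweets arrive as (day, text) pairs and uses a simple regex tokenizer; the per-day counts it returns are the raw series a sparkline would be drawn from. All names here are illustrative.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[#\w']+")


def ngrams(text, n=3):
    """Return the word n-grams of a text, lowercased."""
    tokens = TOKEN.findall(text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trending(tweets, n=3, top=10):
    """tweets: iterable of (day, text).

    Returns [(ngram, total, {day: count})], highest total first.
    """
    per_day = defaultdict(Counter)
    totals = Counter()
    for day, text in tweets:
        for gram in ngrams(text, n):
            per_day[gram][day] += 1
            totals[gram] += 1
    return [(gram, total, dict(per_day[gram]))
            for gram, total in totals.most_common(top)]
```

Filtering by keyword or time range can be done on the input pairs before calling trending, and n=2 gives the 2-word variant.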

Daily neo4j large parquet export

Daily batch exports, potentially by Type (node, relationship), for scheduled bulk analytics and easier analyst efforts

This is part of feeding + automating the graphsage stuff for the first look at the clinical disinformation.

Streamlit for dashboarding

Prove out Streamlit for dashboarding:

-- plotting cypher queries
-- deploy via docker (or within jupyter)

Ex: Feed ingest monitoring

message queue

Add a message queue with buffering and restart capabilities, such as managed kafka on azure

Right now, if we submit a large batch of IDs to twitter (ex: covid 50m), or if our neo4j goes down for 12hr maintenance, we risk data loss and manual retry efforts. A queue like kafka would make this simpler.

Neo4j topic field for BERT scores

We want to record BERT model scores as properties on all our main entity types: URL, Tweet, Account. (Any others wrt Twitter?). Initially this is just one general topic classifier; later we may have multiple models (=scores) per entity for other aspects. It is currently unclear when the classifiers will be run/re-run.

For V1, the thinking is:

-- A manual job that trains the model, saves to disk (or uses pretrained)
-- A manual job out-of-band scores all Tweets, but not accounts/urls
-- ... It writes back into neo4j. It does not update record_modified_at, because base data is not modified?

For this ticket:

-- We need field(s) in Tweet, Account, and URL entities for the topic vector
-- The vector is 1024 floats
-- We'd benefit from an example of writing that datatype into neo4j & reading back -- is this 1024 diff fields, is it one binary buffer, ...?

Unclear for this ticket is data representation and indexing. Writing is generally indexed by entity id, which is fine. Ideally we can do similarity-based search, like "show all Tweets where their vector is within 2 units of this search vector". If neo4j can do natively, great, otherwise maybe we ultimately use FAISS as a service or out of band to generate and feed back as relations. For now, we can probably get away with doing a full DB scan on-the-fly where we page through all (tweet_id, topic vector)s.
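
The full-scan fallback described above could be sketched as below. It assumes each vector is stored as a single list-of-floats property (Neo4j properties can hold a list of floats) and that rows are paged in as (tweet_id, vector) pairs. The function name and the Euclidean distance metric are assumptions, since the issue does not say which distance "2 units" refers to.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def within_radius(rows: Iterable[Tuple[str, Sequence[float]]],
                  query: Sequence[float],
                  radius: float = 2.0) -> List[Tuple[str, float]]:
    """Scan (tweet_id, vector) rows and keep those within `radius` of `query`.

    Returns (tweet_id, distance) pairs, nearest first.
    """
    hits = []
    for tweet_id, vector in rows:
        if len(vector) != len(query):
            raise ValueError(f"dimension mismatch for {tweet_id}")
        dist = math.dist(vector, query)  # Euclidean distance, Python 3.8+
        if dist <= radius:
            hits.append((tweet_id, dist))
    return sorted(hits, key=lambda hit: hit[1])
```

This is linear in the number of tweets per search, so it only suits occasional queries; an index such as FAISS would replace the loop once search volume grows.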

cc @bechbd

URL extraction

-- Extract URLs from tweets
-- Write fragments to neo4j
-- Resolve redirects & write fragments to neo4j
-- Figure out as part of prefect.io flow
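
The extraction step might start from something like the sketch below. It reads "fragments" as the URL's components (scheme, host, path, query), which is an interpretation rather than something the issue states. Redirect resolution and the neo4j write are left out, and the regex is deliberately simple.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_url_fragments(text):
    """Find http(s) URLs in tweet text and split each into its components."""
    fragments = []
    for match in URL_RE.finditer(text):
        # Drop punctuation that commonly trails a URL in running text.
        url = match.group(0).rstrip(".,;:!?)")
        parts = urlsplit(url)
        fragments.append({
            "url": url,
            "scheme": parts.scheme,
            "host": parts.hostname,
            "path": parts.path,
            "query": parts.query,
        })
    return fragments
```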

Media: Logo! Homepage!

Logo:

  • Domino effect
  • Maybe: flatten the curve?
  • A few file sizes & formats etc. for use in social media, homepages, etc.

Homepage:

  • Something simple to maintain like github page or wix
  • Basic areas similar to current github README

Intervention: Scam detection

For basic scams (phishing, blockchain):

  • Ingest threat feeds: notebook one-offs and prefect.io/neo4j automations
  • Correlate against COVID firehose
  • Map / investigate top campaigns
  • Start on response workflows

Ingest clinical trials

-- Load WHO + clinicaltrials.gov
-- write into Neo4j as model: registration date, trial start date, trial end date, drug, tags (COVID?), database, ...

Add indexes to improve searches on hashtags

Add indices to allow for searching on hashtags such as:

MATCH (t:Tweet)<-[e:TWEETED]-(a:Account)
WHERE SIZE(t.hashtags) > 0 AND t.hashtags =~ "stayathome|quarantinelife"
RETURN
a.name as name,
a.id as account_id,
t.hashtags as tags,
t.created_at.month as month,
t.created_at.day as day,
t.id as tweet_id
LIMIT 100

move from twarc to twint

Feed the beast!

Priority:
-- search
-- status ID hydration
-- user timeline
-- user profile info

pytest for fh transforms

For some sample tweets, test especially:

  • tweets -> pandas
  • pandas -> arrow
  • pandas -> neo4j
  • arrow -> parquet

Less clear: diff search jobs

Start proving out cugraph flow

Start proving out:

  • Get subgraph from neo4j (in our model)
  • Convert to cugraph, potentially multiple views for diff tasks (discourse, follower, ...)
  • Run through cugraph algorithms: community detection, pagerank, centrality, ...
  • Feed back into neo4j
  • Do as a prefectio job

See also Issue on bulk neo4j export (100M node/edge)

GKE for Prefect's Dask Executor and general Dask

Please reference notebook here https://sandbox.projectdomino.org/notebook/notebooks/notebooks/twitter/twint/Cody-twint-prefect-defcon-Copy1.ipynb#dask for example of a workload to be executed with Dask.

Goal

Run large jobs using GCP resources optimally in a platform-agnostic way.

Current use-cases

Dask Executor for Prefect configured with secure access.

Dask can be used by data scientists for datasets larger than is feasible to analyze on a single machine. It presents a dask.dataframe.DataFrame object that can act as a drop-in replacement for many uses of the familiar pandas.DataFrame (https://docs.dask.org/en/latest/dataframe.html).

Acceptance

Helm chart that deploys https://docs.dask.org/en/latest/setup/kubernetes-helm.html

Chart can be deployed using GitOps with a derivative of https://github.com/WyriHaximus/github-action-helm3

@lmeyerov @webcoderz @bechbd

Metrics stack

docker-compose
Prometheus
Neo4j -> Prometheus
Prefect -> slack fails
Docs ??

Intervention: Drug misinformation tracking

  • Ingest clinical trial data (clinicaltrials.gov + WHO)
  • Detect & track positive discussions on Twitter
  • Detect & track untrialed cure discussions on Twitter
  • Rate misinfo
  • Autoresponse

Get enriched URLs straight from Twitter

Twitter's API supports returning enriched URLs, such as following redirects and getting some page metadata: https://developer.twitter.com/en/docs/tweets/enrichments/overview/expanded-and-enhanced-urls

-- Our use of Twarc should try to include these and push as part of URLs to neo4j (tweet -> URL + ResolvedUrl -> Metadata)

-- If possible, in Twint too

-- Our own URL enrichments should only run if twitter doesn't already give us (and to augment what's left, e.g., post-redirect)

Prefect tutorial

Example flows for:

-- One-off tasks (CPU/GPU)
-- Backfill
-- Scheduled+Streaming

Account geo label inference

Label each twitter account with label + confidence score for primary:

-- long/lat <-- just start here...
-- country
-- if US: state, zipcode, city/county (?)

Helpful info:
-- 1% of tweets have geo data
-- sample use case: "who was early/late to adopting Masks4All? who is currently resisting / not activating?"

Unclear: People move and have separate home/work... so how does recency play into it? Not really an issue during quarantine tho...
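
A deliberately simple baseline for the "label + confidence" idea, shown for the country level rather than long/lat. It is a majority vote over an account's geo-tagged tweets with the vote share as the confidence. The function name, the `min_tweets` threshold and the use of country codes are all assumptions for illustration, and recency is ignored.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def infer_country(tweet_countries: Iterable[Optional[str]],
                  min_tweets: int = 3) -> Tuple[Optional[str], float]:
    """Return (country, confidence) from the countries of geo-tagged tweets.

    Confidence is the share of geo-tagged tweets that agree with the winner.
    Accounts with too few geo-tagged tweets get (None, 0.0).
    """
    observed = [country for country in tweet_countries if country]
    if len(observed) < min_tweets:
        return None, 0.0
    country, count = Counter(observed).most_common(1)[0]
    return country, count / len(observed)
```

Since only about 1% of tweets carry geo data, many accounts will fall below the threshold and need other signals, such as profile location text.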
