thedataridealongs / projectdomino
Scaling COVID public behavior change and anti-misinformation
License: Apache License 2.0
Add the ability to make the limit optional on get_from_neo. Leave the default limit in there and maybe add another parameter such as unlimited=true.
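A minimal sketch of one way this could look (get_from_neo's internals and the driver variable are assumptions, not the current implementation):

def get_from_neo(query, limit=1000, unlimited=False):
    # Append a LIMIT clause unless the caller explicitly opts out
    if not unlimited:
        query = f"{query} LIMIT {int(limit)}"
    with driver.session() as session:  # driver: an existing neo4j.Driver
        return session.run(query).data()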
-- Get (large) extract from parquet or neo4j
-- Train classifier
-- Push labels back to neo4j
-- Save classifier somewhere
Ex: Improve upon Botometer
Tracks current effort to get Neo4j to export ~100M node/edge parquet/arrow graphs in decent time for use by analytics stacks
This is for fast on-the-fly mode: dynamic cypher query -> parquet/arrow
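A hedged sketch of the on-the-fly path, assuming the official neo4j Python driver plus pandas/pyarrow (connection details and the example query are placeholders):

import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def cypher_to_parquet(query, path):
    # Materialize the query result; pandas/pyarrow handle the Arrow + Parquet side
    with driver.session() as session:
        df = pd.DataFrame([dict(r) for r in session.run(query)])
    df.to_parquet(path)

cypher_to_parquet("MATCH (t:Tweet) RETURN t.id AS id, t.text AS text", "tweets.parquet")

At ~100M nodes/edges this would need batching or streaming rather than one materialized dataframe.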
-- Load drugbank.ca
-- Identify drug names and synonyms
-- Pair against clinical trial drug names
-- Write to neo4j
-- Run continuously via prefect.io
Get initial set of ~100 btc/eth addresses as a CSV:
address | type | tweet id
Package up & send to an intel org for scoring
Tracks effort for direct intervention on Twitter:
-- Daily reports
-- Alerts to in-community activity to subscribers
-- Posting misinfo reports to detected misinfo covs
Associated tasks:
-- Get org API key
-- Prototype above capabilities
-- Website for use with those
Add a prefect workflow to handle running the Drug Synonyms Workflow on a scheduled basis
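A hedged sketch using the 0.x-era Prefect API (the task body and interval are assumptions):

from datetime import timedelta
from prefect import Flow, task
from prefect.schedules import IntervalSchedule

@task
def run_drug_synonyms():
    ...  # existing Drug Synonyms Workflow steps go here

schedule = IntervalSchedule(interval=timedelta(days=1))
with Flow("drug-synonyms", schedule=schedule) as flow:
    run_drug_synonyms()

flow.run()  # with a schedule attached, this waits and fires on each interval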
A method to pull batches of Tweets based on a set of filtering criteria (e.g. hydrated='FULL' and btc_extraction_status=null)
This can only pull a single node label, as it is meant to pull large batches of Tweets
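A hedged sketch of the batch pull (property names come from the issue text; the batch size is illustrative):

BATCH_QUERY = """
MATCH (t:Tweet)
WHERE t.hydrated = 'FULL' AND t.btc_extraction_status IS NULL
RETURN t
LIMIT $batch_size
"""

def pull_tweet_batch(session, batch_size=10000):
    return [record["t"] for record in session.run(BATCH_QUERY, batch_size=batch_size)]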
https://developer.twitter.com/en/developer-terms/agreement-and-policy
https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases
https://developer.twitter.com/en/developer-terms/display-requirements
https://help.twitter.com/en/rules-and-policies/twitter-automation
-- Classify if ~cure related
-- Classify if a registered trial
-- Risk score
-- Figure out automation flow wrt neo4j/prefect and data pipelines
As part of general discovery, such as looking at a campaign within clinical disinformation, need ability to quickly look at trends within different ranges (day, week, keyword filter, ...).
-- Trigram/hashtag trending: Sortable list of hot 2-3 word combos w/ sparklines of activity over timeranges
-- Topic trending: same sortable list + sparklines, but for topics
Daily batch exports, potentially by Type (node, relationship), for scheduled bulk analytics and easier analyst efforts
This is part of feeding + automating the graphsage stuff for the first look at the clinical disinformation.
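If APOC is available, a hedged sketch of one such per-type export (the relationship query and file name are illustrative, not our actual job):

with driver.session() as session:
    session.run("""
        CALL apoc.export.csv.query(
          'MATCH (a:Account)-[e:TWEETED]->(t:Tweet) RETURN a.id AS src, t.id AS dst',
          'tweeted_daily.csv', {})
    """)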
Prove out Streamlit for dashboarding:
-- plotting cypher queries
-- deploy via docker (or within jupyter)
Ex: Feed ingest monitoring
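A hedged Streamlit sketch of that monitoring page (connection details are placeholders, and it assumes t.created_at is a Neo4j temporal property):

import pandas as pd
import streamlit as st
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

st.title("Feed ingest monitoring")
with driver.session() as session:
    df = pd.DataFrame([dict(r) for r in session.run(
        "MATCH (t:Tweet) RETURN date(t.created_at) AS day, count(*) AS tweets ORDER BY day")])
st.line_chart(df.set_index("day"))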
The save_enrichment_df_to_graph method only allows one column to be added because it is missing a comma in between the set statements.
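The fix is to join the generated SET fragments with commas; an illustrative sketch (the method's internals and variable names are assumptions):

# Comma-join the per-column fragments so more than one column survives
set_clause = ", ".join(f"n.{col} = row.{col}" for col in df.columns if col != "id")
query = f"UNWIND $rows AS row MATCH (n:Tweet {{id: row.id}}) SET {set_clause}"
session.run(query, rows=df.to_dict("records"))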
Add a message queue with buffering and restart capabilities, such as managed Kafka on Azure
Right now, if we submit a large batch of IDs to Twitter (ex: covid 50m), or if our neo4j goes down for 12hr maintenance, we risk data loss and manual retry efforts. A queue like Kafka would make this simpler.
Add a python helper to retrieve the hydrated status of the Account node
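A hedged sketch of the helper (the label and property names are taken from the issue text, not verified against the schema):

def get_hydrated_status(session, account_id):
    record = session.run(
        "MATCH (a:Account {id: $id}) RETURN a.hydrated AS hydrated",
        id=account_id).single()
    return record["hydrated"] if record else None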
-- Run topic analyzers & community detectors
-- For known campaigns, explore
-- Push annotations back to neo4j (or at least to a CSV...)
-- Build intuition for misinformation hotspots
-- Identify top feeds
-- Load into neo4j
We want to record BERT model scores as properties on all our main entity types: URL, Tweet, Account. (Any others wrt Twitter?). Initially this is just one general topic classifier; later we may have multiple models (=scores) per entity for other aspects. It is currently unclear when the classifiers will be run/re-run.
For V1, the thinking is:
-- A manual job that trains the model, saves to disk (or uses pretrained)
-- A manual job out-of-band scores all Tweets, but not accounts/urls
-- ... It writes back into neo4j. It does not update record_modified_at, because the base data is not modified?
For this ticket:
-- We need field(s) in Tweet, Account, and URL entities for the topic vector
-- The vector is 1024 floats
-- We'd benefit from an example of writing that datatype into neo4j & reading back -- is this 1024 diff fields, is it one binary buffer, ...?
Unclear for this ticket is data representation and indexing. Writing is generally indexed by entity id, which is fine. Ideally we can do similarity-based search, like "show all Tweets where their vector is within 2 units of this search vector". If neo4j can do this natively, great; otherwise maybe we ultimately use FAISS as a service or out of band to generate and feed back as relations. For now, we can probably get away with doing a full DB scan on-the-fly where we page through all (tweet_id, topic_vector) pairs.
cc @bechbd
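As a starting point, a hedged sketch of the one-list-property representation and the on-the-fly full scan described above (property names are illustrative):

import numpy as np

def write_topic_vector(session, tweet_id, vec):
    # Neo4j stores homogeneous lists of floats natively, so one property suffices
    session.run("MATCH (t:Tweet {id: $id}) SET t.topic_vector = $vec",
                id=tweet_id, vec=[float(x) for x in vec])

def scan_similar(session, query_vec, radius=2.0):
    # Brute-force full scan: fine for V1; swap in FAISS or similar later
    q = np.asarray(query_vec)
    hits = []
    result = session.run("MATCH (t:Tweet) WHERE t.topic_vector IS NOT NULL "
                         "RETURN t.id AS id, t.topic_vector AS vec")
    for r in result:
        if np.linalg.norm(np.asarray(r["vec"]) - q) <= radius:
            hits.append(r["id"])
    return hits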
-- Extract URLs from tweets
-- Write fragments to neo4j
-- Resolve redirects & write fragments to neo4j
-- Figure out as part of prefect.io flow
For basic scams (phishing, blockchain):
-- Load WHO + clinicaltrials.gov
-- Write into Neo4j as model: registration date, trial start date, trial end date, drug, tags (COVID?), database, ...
Add indices to allow for searching on hashtags such as:
MATCH (t:Tweet)<-[e:TWEETED]-(a:Account)
WHERE SIZE(t.hashtags) > 0 AND ANY(tag IN t.hashtags WHERE tag =~ "stayathome|quarantinelife")
RETURN
a.name as name,
a.id as account_id,
t.hashtags as tags,
t.created_at.month as month,
t.created_at.day as day,
t.id as tweet_id
LIMIT 100
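One hedged option (Neo4j 3.5+) is a full-text index over the hashtags property; whether it fits list-valued properties here should be verified, and the index name is illustrative:

session.run(
    "CALL db.index.fulltext.createNodeIndex('tweetHashtags', ['Tweet'], ['hashtags'])")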
FAISS requires an OS dependency, libomp-dev (on Ubuntu), or can be installed via conda: conda install faiss-gpu cudatoolkit=10.0 -c pytorch  # for CUDA 10
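Minimal usage sketch once installed (the random vectors stand in for real topic vectors):

import faiss
import numpy as np

vecs = np.random.rand(10000, 1024).astype("float32")  # stand-in for topic vectors
index = faiss.IndexFlatL2(1024)   # exact L2 search, no training required
index.add(vecs)
distances, ids = index.search(vecs[:5], 10)  # 10 nearest neighbours for 5 queries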
Feed the beast!
Priority:
-- search
-- status ID hydration
-- user timeline
-- user profile info
Add a method which will allow for adding/updating edges from a dataframe passed in
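A hedged sketch of the dataframe -> edges helper using UNWIND + MERGE (column, label, and relationship names are illustrative):

EDGE_QUERY = """
UNWIND $rows AS row
MATCH (a:Account {id: row.source}), (t:Tweet {id: row.target})
MERGE (a)-[e:TWEETED]->(t)
SET e += row.props
"""

def save_edges_df_to_graph(session, df):
    rows = [{"source": r.source, "target": r.target, "props": {}}
            for r in df.itertuples()]
    session.run(EDGE_QUERY, rows=rows)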
For some sample tweets, test especially:
Less clear: different search jobs
Start proving out:
See also Issue on bulk neo4j export (100M node/edge)
-- extract
-- on pipeline
-- and into neo4j
Please reference notebook here https://sandbox.projectdomino.org/notebook/notebooks/notebooks/twitter/twint/Cody-twint-prefect-defcon-Copy1.ipynb#dask for example of a workload to be executed with Dask.
Run large jobs using GCP resources optimally in a platform-agnostic way.
Dask Executor for Prefect configured with secure access.
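A minimal hedged sketch of wiring a flow to a remote Dask scheduler (0.x-era Prefect; the scheduler address and security config are placeholders):

from prefect.engine.executors import DaskExecutor  # prefect.executors in newer releases

executor = DaskExecutor(address="tcp://dask-scheduler:8786")
flow.run(executor=executor)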
Dask can be used by data scientists for datasets larger than is feasible to analyze on a single machine. It presents a dask.dataframe.DataFrame object that can act as a drop-in replacement for the familiar pandas.DataFrame (https://docs.dask.org/en/latest/dataframe.html).
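For example, the drop-in pattern (file and column names are illustrative):

import dask.dataframe as dd

df = dd.read_parquet("tweets.parquet")             # same call shape as pandas
daily = df.groupby("day")["id"].count().compute()  # lazy until .compute()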
Helm chart that deploys https://docs.dask.org/en/latest/setup/kubernetes-helm.html
The chart can be deployed using GitOps with a derivative of https://github.com/WyriHaximus/github-action-helm3
We have the initial nb's, but need to get into prefect + neo4j. This is part of feeding + automating the graphsage stuff for the first look at the clinical disinformation.
docker-compose
Prometheus
Neo4j -> Prometheus
Prefect -> Slack alerts on failures
Docs?
Initially just on COVID/CoronaVirus (continuing historical)
As a Prefect.io job
Twitter's API supports returning enriched URLs, such as following redirects and getting some page metadata: https://developer.twitter.com/en/docs/tweets/enrichments/overview/expanded-and-enhanced-urls
-- Our use of Twarc should try to include these and push as part of URLs to neo4j (tweet -> URL + ResolvedUrl -> Metadata)
-- If possible, in Twint too
-- Our own URL enrichments should only run if Twitter doesn't already give us these (and to augment what's left, e.g., post-redirect)
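A hedged Twarc sketch of pulling Twitter's URL expansions out of hydrated tweets (credential variables are placeholders, and the 'unwound' enrichment may require elevated API access):

from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.hydrate(tweet_ids):
    for url in tweet.get("entities", {}).get("urls", []):
        expanded = url.get("expanded_url")
        unwound = url.get("unwound", {}).get("url")  # post-redirect URL, if provided
        # push (tweet -> URL + ResolvedUrl -> Metadata) into neo4j here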
-- Run topic analyzers and community detectors
-- For known campaigns, explore
-- Map out buckets to orient the intervention
-- Identify
-- Test against known
-- Prove out NB
Add a method to allow for easily adding enrichment attributes to Neo4j nodes
Example flows for:
-- One-off tasks (CPU/GPU)
-- Backfill
-- Scheduled+Streaming
Update twint helpers to use multiprocessing (ex: prefect executor pool)
Once Twint merges twintproject/twint#914, switch back to mainline
Label each twitter account with label + confidence score for primary:
-- long/lat <-- just start here...
-- country
-- if US: state, zipcode, city/county (?)
Helpful info:
-- 1% of tweets have geo data
-- sample use case: "who was early/late to adopting Masks4All? who is currently resisting / not activating?"
Unclear: People move and have separate home/work... so how does recency play into it? Not really an issue during quarantine tho...