thedataridealongs / projectdomino
Scaling COVID public behavior change and anti-misinformation
License: Apache License 2.0
Add the ability to make the limit optional on get_from_neo. Leave the default limit in there and maybe add another parameter such as unlimited=true.
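A minimal sketch of one way this could look (get_from_neo's internals and the driver variable are assumptions, not the current implementation):

def get_from_neo(query, limit=1000, unlimited=False):
    # Append a LIMIT clause unless the caller explicitly opts out
    if not unlimited:
        query = f"{query} LIMIT {int(limit)}"
    with driver.session() as session:  # driver: an existing neo4j.Driver
        return session.run(query).data()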
-- Get (large) extract from parquet or neo4j
-- Train classifier
-- Push labels back to neo4j
-- Save classifier somewhere
Ex: Improve upon Botometer
Tracks current effort to get Neo4j to export ~100M node/edge parquet/arrow graphs in decent time for use by analytics stacks
This is for fast on-the-fly mode: dynamic cypher query -> parquet/arrow
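A hedged sketch of the on-the-fly path, assuming the official neo4j Python driver plus pandas/pyarrow (connection details and the example query are placeholders):

import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def cypher_to_parquet(query, path):
    # Materialize the query result; pandas/pyarrow handle the Arrow + Parquet side
    with driver.session() as session:
        df = pd.DataFrame([dict(r) for r in session.run(query)])
    df.to_parquet(path)

cypher_to_parquet("MATCH (t:Tweet) RETURN t.id AS id, t.text AS text", "tweets.parquet")

At ~100M nodes/edges this would need batching or streaming rather than one materialized dataframe.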
-- Load drugbank.ca
-- Identify drug names and synonyms
-- Pair against clinical trial drug names
-- Write to neo4j
-- Run continuously via prefect.io
Get initial set of ~100 btc/eth addresses as a CSV:
address | type | tweet id
Package up & send to an intel org for scoring
Tracks effort for direct intervention on Twitter:
-- Daily reports
-- Alerts to in-community activity to subscribers
-- Posting misinfo reports to detected misinfo covs
Associated tasks:
-- Get org API key
-- Prototype above capabilities
-- Website for use with those
Add a prefect workflow to handle running the Drug Synonyms Workflow on a scheduled basis
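A hedged sketch using the 0.x-era Prefect API (the task body and interval are assumptions):

from datetime import timedelta
from prefect import Flow, task
from prefect.schedules import IntervalSchedule

@task
def run_drug_synonyms():
    ...  # existing Drug Synonyms Workflow steps go here

schedule = IntervalSchedule(interval=timedelta(days=1))
with Flow("drug-synonyms", schedule=schedule) as flow:
    run_drug_synonyms()

flow.run()  # with a schedule attached, this waits and fires on each interval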
A method to pull batches of Tweets based on a set of filtering criteria (e.g. hydrated='FULL' and btc_extraction_status=null)
This can only pull a single node label, as it is meant to pull large batches of Tweets
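A hedged sketch of the batch pull (property names come from the issue text; the batch size is illustrative):

BATCH_QUERY = """
MATCH (t:Tweet)
WHERE t.hydrated = 'FULL' AND t.btc_extraction_status IS NULL
RETURN t
LIMIT $batch_size
"""

def pull_tweet_batch(session, batch_size=10000):
    return [record["t"] for record in session.run(BATCH_QUERY, batch_size=batch_size)]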
https://developer.twitter.com/en/developer-terms/agreement-and-policy
https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases
https://developer.twitter.com/en/developer-terms/display-requirements
https://help.twitter.com/en/rules-and-policies/twitter-automation
-- Classify if ~cure related
-- Classify if a registered trial
-- Risk score
-- Figure out automation flow wrt neo4j/prefect and data pipelines
As part of general discovery, such as looking at a campaign within clinical disinformation, need ability to quickly look at trends within different ranges (day, week, keyword filter, ...).
-- Trigram/hashtag trending: Sortable list of hot 2-3 word combos w/ sparklines of activity over timeranges
-- Topic trending: same sortable list + sparklines, but for topics
Daily batch exports, potentially by Type (node, relationship), for scheduled bulk analytics and easier analyst efforts
This is part of feeding + automating the graphsage stuff for the first look at the clinical disinformation.
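If APOC is available, a hedged sketch of one such per-type export (the relationship query and file name are illustrative, not our actual job):

with driver.session() as session:
    session.run("""
        CALL apoc.export.csv.query(
          'MATCH (a:Account)-[e:TWEETED]->(t:Tweet) RETURN a.id AS src, t.id AS dst',
          'tweeted_daily.csv', {})
    """)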
Prove out Streamlit for dashboarding:
-- plotting cypher queries
-- deploy via docker (or within jupyter)
Ex: Feed ingest monitoring
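A hedged Streamlit sketch of that monitoring page (connection details are placeholders, and it assumes t.created_at is a Neo4j temporal property):

import pandas as pd
import streamlit as st
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

st.title("Feed ingest monitoring")
with driver.session() as session:
    df = pd.DataFrame([dict(r) for r in session.run(
        "MATCH (t:Tweet) RETURN date(t.created_at) AS day, count(*) AS tweets ORDER BY day")])
st.line_chart(df.set_index("day"))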
The save_enrichment_df_to_graph method only allows one column to be added because it is missing a comma in between the set statements.
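The fix is to join the generated SET fragments with commas; an illustrative sketch (the method's internals and variable names are assumptions):

# Comma-join the per-column fragments so more than one column survives
set_clause = ", ".join(f"n.{col} = row.{col}" for col in df.columns if col != "id")
query = f"UNWIND $rows AS row MATCH (n:Tweet {{id: row.id}}) SET {set_clause}"
session.run(query, rows=df.to_dict("records"))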
Add a message queue with buffering and restart capabilities, such as managed Kafka on Azure
Right now, if we submit a large batch of IDs to Twitter (ex: covid 50m), or if our neo4j goes down for 12hr maintenance, we risk data loss and manual retry efforts. A queue like Kafka would make this simpler.
Add a python helper to retrieve the hydrated status of the Account node
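A hedged sketch of the helper (the label and property names are taken from the issue text, not verified against the schema):

def get_hydrated_status(session, account_id):
    record = session.run(
        "MATCH (a:Account {id: $id}) RETURN a.hydrated AS hydrated",
        id=account_id).single()
    return record["hydrated"] if record else None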
-- Run topic analyzers & community detectors
-- For known campaigns, explore
-- Push annotations back to neo4j (or at least to a CSV...)
-- Build intuition for misinformation hotspots
-- Identify top feeds
-- Load into neo4j
We want to record BERT model scores as properties on all our main entity types: URL, Tweet, Account. (Any others wrt Twitter?). Initially this is just one general topic classifier; later we may have multiple models (=scores) per entity for other aspects. It is currently unclear when the classifiers will be run/re-run.
For V1, the thinking is:
-- A manual job that trains the model, saves to disk (or uses pretrained)
-- A manual job out-of-band scores all Tweets, but not accounts/urls
-- ... It writes back into neo4j. It does not update record_modified_at, because the base data is not modified?
For this ticket:
-- We need field(s) in Tweet, Account, and URL entities for the topic vector
-- The vector is 1024 floats
-- We'd benefit from an example of writing that datatype into neo4j & reading back -- is this 1024 diff fields, is it one binary buffer, ...?
Unclear for this ticket is data representation and indexing. Writing is generally indexed by entity id, which is fine. Ideally we can do similarity-based search, like "show all Tweets where their vector is within 2 units of this search vector". If neo4j can do this natively, great; otherwise maybe we ultimately use FAISS as a service or out of band to generate and feed back as relations. For now, we can probably get away with doing a full DB scan on-the-fly where we page through all (tweet_id, topic_vector) pairs.
cc @bechbd
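As a starting point, a hedged sketch of the one-list-property representation and the on-the-fly full scan described above (property names are illustrative):

import numpy as np

def write_topic_vector(session, tweet_id, vec):
    # Neo4j stores homogeneous lists of floats natively, so one property suffices
    session.run("MATCH (t:Tweet {id: $id}) SET t.topic_vector = $vec",
                id=tweet_id, vec=[float(x) for x in vec])

def scan_similar(session, query_vec, radius=2.0):
    # Brute-force full scan: fine for V1; swap in FAISS or similar later
    q = np.asarray(query_vec)
    hits = []
    result = session.run("MATCH (t:Tweet) WHERE t.topic_vector IS NOT NULL "
                         "RETURN t.id AS id, t.topic_vector AS vec")
    for r in result:
        if np.linalg.norm(np.asarray(r["vec"]) - q) <= radius:
            hits.append(r["id"])
    return hits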
-- Extract URLs from tweets
-- Write fragments to neo4j
-- Resolve redirects & write fragments to neo4j
-- Figure out as part of prefect.io flow
For basic scams (phishing, blockchain):
-- Load WHO + clinicaltrials.gov
-- Write into Neo4j as model: registration date, trial start date, trial end date, drug, tags (COVID?), database, ...
Add indices to allow for searching on hashtags such as:
MATCH (t:Tweet)<-[e:TWEETED]-(a:Account)
WHERE SIZE(t.hashtags) > 0 AND ANY(tag IN t.hashtags WHERE tag =~ "stayathome|quarantinelife")
RETURN
a.name as name,
a.id as account_id,
t.hashtags as tags,
t.created_at.month as month,
t.created_at.day as day,
t.id as tweet_id
LIMIT 100
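One hedged option (Neo4j 3.5+) is a full-text index over the hashtags property; whether it fits list-valued properties here should be verified, and the index name is illustrative:

session.run(
    "CALL db.index.fulltext.createNodeIndex('tweetHashtags', ['Tweet'], ['hashtags'])")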
FAISS requires an OS dependency, libomp-dev (on Ubuntu), or can be installed via conda: conda install faiss-gpu cudatoolkit=10.0 -c pytorch  # for CUDA 10
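Minimal usage sketch once installed (the random vectors stand in for real topic vectors):

import faiss
import numpy as np

vecs = np.random.rand(10000, 1024).astype("float32")  # stand-in for topic vectors
index = faiss.IndexFlatL2(1024)   # exact L2 search, no training required
index.add(vecs)
distances, ids = index.search(vecs[:5], 10)  # 10 nearest neighbours for 5 queries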
Feed the beast!
Priority:
-- search
-- status ID hydration
-- user timeline
-- user profile info
Add a method which will allow for adding/updating edges from a dataframe passed in
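A hedged sketch of the dataframe -> edges helper using UNWIND + MERGE (column, label, and relationship names are illustrative):

EDGE_QUERY = """
UNWIND $rows AS row
MATCH (a:Account {id: row.source}), (t:Tweet {id: row.target})
MERGE (a)-[e:TWEETED]->(t)
SET e += row.props
"""

def save_edges_df_to_graph(session, df):
    rows = [{"source": r.source, "target": r.target, "props": {}}
            for r in df.itertuples()]
    session.run(EDGE_QUERY, rows=rows)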
For some sample tweets, test especially:
Less clear: different search jobs
Start proving out:
See also Issue on bulk neo4j export (100M node/edge)
-- extract
-- on pipeline
-- and into neo4j
Please reference notebook here https://sandbox.projectdomino.org/notebook/notebooks/notebooks/twitter/twint/Cody-twint-prefect-defcon-Copy1.ipynb#dask for example of a workload to be executed with Dask.
Run large jobs using GCP resources optimally in a platform-agnostic way.
Dask Executor for Prefect configured with secure access.
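A minimal hedged sketch of wiring a flow to a remote Dask scheduler (0.x-era Prefect; the scheduler address and security config are placeholders):

from prefect.engine.executors import DaskExecutor  # prefect.executors in newer releases

executor = DaskExecutor(address="tcp://dask-scheduler:8786")
flow.run(executor=executor)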
Dask can be used by data scientists for datasets larger than is feasible to analyze on a single machine. It presents a dask.dataframe.DataFrame object that can act as a drop-in replacement for the familiar pandas.DataFrame (https://docs.dask.org/en/latest/dataframe.html).
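For example, the drop-in pattern (file and column names are illustrative):

import dask.dataframe as dd

df = dd.read_parquet("tweets.parquet")             # same call shape as pandas
daily = df.groupby("day")["id"].count().compute()  # lazy until .compute()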
Helm chart that deploys https://docs.dask.org/en/latest/setup/kubernetes-helm.html
The chart can be deployed using GitOps with a derivative of https://github.com/WyriHaximus/github-action-helm3
We have the initial nb's, but need to get into prefect + neo4j. This is part of feeding + automating the graphsage stuff for the first look at the clinical disinformation.
docker-compose
Prometheus
Neo4j -> Prometheus
Prefect -> Slack alerts on failures
Docs?
Initially just on COVID/CoronaVirus (continuing historical)
As a Prefect.io job
Twitter's API supports returning enriched URLs, such as following redirects and getting some page metadata: https://developer.twitter.com/en/docs/tweets/enrichments/overview/expanded-and-enhanced-urls
-- Our use of Twarc should try to include these and push as part of URLs to neo4j (tweet -> URL + ResolvedUrl -> Metadata)
-- If possible, in Twint too
-- Our own URL enrichments should only run if Twitter doesn't already give us these (and to augment what's left, e.g., post-redirect)
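A hedged Twarc sketch of pulling Twitter's URL expansions out of hydrated tweets (credential variables are placeholders, and the 'unwound' enrichment may require elevated API access):

from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.hydrate(tweet_ids):
    for url in tweet.get("entities", {}).get("urls", []):
        expanded = url.get("expanded_url")
        unwound = url.get("unwound", {}).get("url")  # post-redirect URL, if provided
        # push (tweet -> URL + ResolvedUrl -> Metadata) into neo4j here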
-- Run topic analyzers and community detectors
-- For known campaigns, explore
-- Map out buckets to orient the intervention
-- Identify
-- Test against known
-- Prove out NB
Add a method to allow for easily adding enrichment attributes to Neo4j nodes
Example flows for:
-- One-off tasks (CPU/GPU)
-- Backfill
-- Scheduled+Streaming
Update twint helpers to use multiprocessing (ex: prefect executor pool)
Once Twint merges twintproject/twint#914, switch back to mainline
Label each twitter account with label + confidence score for primary:
-- long/lat <-- just start here...
-- country
-- if US: state, zipcode, city/county (?)
Helpful info:
-- 1% of tweets have geo data
-- sample use case: "who was early/late to adopting Masks4All? who is currently resisting / not activating?"
Unclear: People move and have separate home/work... so how does recency play into it? Not really an issue during quarantine tho...