wotd's Introduction

Analysis of WOTD data from Tweets marked #WordOfTheDay from my Twitter account

Twitter search: #WordOfTheDay from:permutans

Preprocessing

For detailed preprocessing instructions see PREPROCESSING.md. In summary, the following produces wotd_tweet.json from tweet.js (a file provided by Twitter):

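# Skip the JavaScript assignment prefix: print tweet.js from 3 bytes past the first "=" on its first line
function js2json () { tail -c +$(( $(head -1 "$1" | grep -b -o "=" | head -1 | cut -d: -f1) + 3 )) "$1"; }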

jq '[.[] | select(.tweet .entities .hashtags[] .text | test("wordoftheday"; "i"))]' <<<$(js2json tweet.js) > wotd_tweet.json
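
As a cross-check on the shell pipeline above, the same two steps can be sketched in Python. This is a minimal sketch rather than the repository's code: the names js_to_json and select_wotd are hypothetical, and the sample string assumes tweet.js opens with a JavaScript assignment such as `window.YTD.tweet.part0 = `. Unlike the jq select, which emits a tweet once per matching hashtag, this version keeps each matching tweet once.

```python
import json

def js_to_json(text):
    """Drop everything up to and including the first '=' (the assumed JS assignment prefix)."""
    _, _, payload = text.partition("=")
    return payload.strip()

def select_wotd(tweets, tag="wordoftheday"):
    """Keep tweets with at least one hashtag whose text contains `tag`, ignoring case."""
    kept = []
    for item in tweets:
        hashtags = item.get("tweet", {}).get("entities", {}).get("hashtags", [])
        if any(tag in h.get("text", "").lower() for h in hashtags):
            kept.append(item)
    return kept

# Hypothetical two-tweet sample standing in for tweet.js
sample = (
    'window.YTD.tweet.part0 = [ '
    '{"tweet": {"entities": {"hashtags": [{"text": "WordOfTheDay"}]}}}, '
    '{"tweet": {"entities": {"hashtags": []}}} ]'
)
tweets = json.loads(js_to_json(sample))
wotd = select_wotd(tweets)
print(len(wotd), "of", len(tweets), "tweets kept")
```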

This file was big (188 tweets took up 24,000 lines), so the next step reduced its size by removing surplus info.

The step detailed in DEJUNK.md and performed in dejunk.sh reduced wotd_tweet.json from 24,000 lines to 22,000 lines. Even so, it was difficult to inspect the JSON tree's full contents without enumerating all paths, given how variable the tweet object structure can be.

A 'path' here is a rooted path to any key, which I call a 'unique key path' or UKP.

Following this initial 'dejunk' step, I wanted a full account of what was in this JSON, so I enumerated all the paths. There were 106 of them; the initial 'dejunking' only removed 7, and the rest would need more complex jq commands (due to nested keys inside iterators).
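
The enumeration of unique key paths can be sketched as a recursive walk. This is an illustration under assumptions, not the method used in AUDIT.md: the function names are hypothetical, and paths are written in a jq-like form with [] standing for array iteration.

```python
def key_paths(node, prefix=""):
    """Yield a rooted path for every key, writing array iteration as []."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}"
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # All items of an array share one path, whatever their index
        path = f"{prefix}[]" if prefix else ".[]"
        for item in node:
            yield from key_paths(item, path)

def unique_key_paths(doc):
    """Collapse repeated paths across tweets into a sorted list of unique ones."""
    return sorted(set(key_paths(doc)))

# Hypothetical two-tweet document with differing structure
doc = [
    {"tweet": {"id": "1", "entities": {"hashtags": [{"text": "a"}, {"text": "b"}]}}},
    {"tweet": {"id": "2", "lang": "en"}},
]
for path in unique_key_paths(doc):
    print(path)
```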

I turned these 106 paths into a 'checklist' so that they could be deleted from the JSON programmatically rather than by manually writing a del command. A handwritten command would get big, take a long time to write, and become hard to maintain or debug if I wanted to change it later or reuse it on another file (e.g. the much larger JSON file of all tweets, rather than this WOTD subset).
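
To illustrate what deleting from a list of paths (rather than a handwritten del command) might look like, here is a minimal Python sketch. It is not the repository's approach, which builds a jq query; parse_path and delete_path are hypothetical names, and key names containing dots or brackets are assumed not to occur.

```python
import re

def parse_path(path):
    """Split '.[].tweet.media[].url' into steps: '[]' for iteration, else a key name."""
    return re.findall(r"\[\]|[^.\[\]\s]+", path)

def delete_path(node, steps):
    """Delete the key addressed by `steps` in place, skipping objects that lack it."""
    if not steps:
        return
    head, rest = steps[0], steps[1:]
    if head == "[]":
        if isinstance(node, list):
            for item in node:
                delete_path(item, rest)
    elif isinstance(node, dict) and head in node:
        if rest:
            delete_path(node[head], rest)
        else:
            del node[head]

# Hypothetical document and deletion list
doc = [
    {"tweet": {"id": "1", "source": "web",
               "entities": {"media": [{"url": "u", "sizes": {}}]}}},
    {"tweet": {"id": "2"}},
]
to_delete = [".[].tweet.source", ".[].tweet.entities.media[].sizes"]
for path in to_delete:
    delete_path(doc, parse_path(path))
print(doc)
```

Because missing keys are skipped silently, the same list can be applied to tweets whose structure varies.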

For many of these, the paths are 'inexact' i.e. 'globbed', as they're not present in all Tweet objects. The inexact paths 'expand' to a much larger number of paths (they are 'one-to-many').
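
The one-to-many expansion can be made concrete with a short sketch that lists every indexed path a globbed path matches in a given document. The names parse_path and expand_glob are hypothetical, and the indexed output format (jq-style, such as [0]) is an assumption made for illustration.

```python
import re

def parse_path(path):
    """Split a globbed path into steps: '[]' for iteration, otherwise a key name."""
    return re.findall(r"\[\]|[^.\[\]\s]+", path)

def expand_glob(node, steps, prefix=""):
    """Yield each concrete path (with list indices) that the globbed path matches."""
    if not steps:
        yield prefix or "."
        return
    head, rest = steps[0], steps[1:]
    if head == "[]":
        if isinstance(node, list):
            for i, item in enumerate(node):
                yield from expand_glob(item, rest, f"{prefix or '.'}[{i}]")
    elif isinstance(node, dict) and head in node:
        yield from expand_glob(node[head], rest, f"{prefix}.{head}")

# Hypothetical document: the second tweet has no entities at all
doc = [
    {"tweet": {"entities": {"hashtags": [{"text": "a"}, {"text": "b"}]}}},
    {"tweet": {"id": "2"}},
    {"tweet": {"entities": {"hashtags": [{"text": "c"}]}}},
]
concrete = list(expand_glob(doc, parse_path(".[].tweet.entities.hashtags[].text")))
for path in concrete:
    print(path)
```

Here one globbed path expands to three concrete paths, and the tweet without the key contributes none.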

This basic auditing step allowed me to confirm that information such as geolocation was not in the JSON, so I could be confident that uploading the file to GitHub didn't overlook any security concerns (and I can remain confident of this in future by inspecting this 'UKP manifest').

The step detailed in AUDIT.md and [to be] performed in audit.sh reduced wotd_tweet_dejunked.json from 22,000 lines to [TBC] lines in the file wotd_tweet_reconciled.json.

I wasn't expecting to, but the auditing step led me to write a trie, and then to walk this trie recursively, removing any repeated parts so that the paths are displayed only in terms of what changed line by line (similar to when you write " " " " to indicate a repeated part of a line, 'as above'). This doubles as a suitable method of writing the jq query to delete those keys. This is explained in more detail in the markdown document.
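
The idea of a trie walked recursively with repeated prefixes blanked out can be sketched as follows. This is not trie_walk.py: build_trie and masked_lines are hypothetical names, the input is assumed to be space-separated steps, and the sketch assumes only leaf paths are listed (no path is a prefix of another).

```python
def build_trie(paths):
    """Nest each space-separated step under its parent; leaves are empty dicts."""
    trie = {}
    for path in paths:
        node = trie
        for step in path.split():
            node = node.setdefault(step, {})
    return trie

def masked_lines(node, lead=""):
    """Yield one line per leaf, replacing any prefix shown on an earlier line with spaces."""
    for i, (step, child) in enumerate(node.items()):
        # Only the first child continues the visible prefix; later siblings are padded
        start = lead if i == 0 else " " * len(lead)
        text = f"{start}{step}"
        if child:
            yield from masked_lines(child, text + " ")
        else:
            yield text

base = ".[] .tweet .extended_entities .media[] .video_info"
paths = [
    f"{base} .aspect_ratio",
    f"{base} .duration_millis",
    f"{base} .variants[] .bitrate",
    f"{base} .variants[] .content_type",
    f"{base} .variants[] .url",
]
lines = list(masked_lines(build_trie(paths)))
print("\n".join(lines))
```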

wotd's Issues

Parent of a null child node in trie is masked when parent line not present

While establishing input validation for

grep -v "\- \[x\]" ukp_manifest.md | grep "\[\].*\[\]" | cut -d\` -f2 | tail -6 | sed '/variants$/d' | python trie_walk.py -

⇣

.[] .tweet .extended_entities .media[] .video_info .aspect_ratio
                                                   .duration_millis
                                                               .bitrate
                                                               .content_type
                                                               .url

I noticed there's a bug here. The output should be:

.[] .tweet .extended_entities .media[] .video_info .aspect_ratio
                                                   .duration_millis
                                                   .variants[] .bitrate
                                                               .content_type
                                                               .url

This expected output differs from the invalid masked trie output (with the first line ending with .video_info) only in whether the .variants[] parent is shown, and that is the difference between valid and invalid output.

As far as I can tell so far, valid input will have no 'null children', as a list of filter paths cannot contain both a node and any of its descendants (by the definition of mutually exclusive descendants and ancestors).
In other words, detecting paths which have these 'null children' will be sufficient to validate the masked trie list as input to build_del_call.py.
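
One reading of this check is that a filter list is invalid whenever it contains both a path and one of its ancestors. The sketch below tests for that under stated assumptions: split_steps and ancestor_conflicts are hypothetical names, this is not the validation in the repository, and an iterator suffix such as .variants[] is treated as the key .variants followed by a separate [] step so that .variants counts as its ancestor.

```python
import re

def split_steps(path):
    """Tokenise a path into '.key' and '[]' steps, ignoring whitespace between them."""
    return tuple(re.findall(r"\[\]|\.[^.\[\]\s]+", path))

def ancestor_conflicts(paths):
    """Return (ancestor, descendant) pairs that appear together in the same list."""
    listed = {split_steps(p): p for p in paths}
    conflicts = []
    for key, path in listed.items():
        # Check every proper prefix of this path against the listed paths
        for cut in range(1, len(key)):
            ancestor = listed.get(key[:cut])
            if ancestor is not None:
                conflicts.append((ancestor, path))
    return conflicts

base = ".[] .tweet .extended_entities .media[] .video_info"
paths = [
    f"{base} .variants",
    f"{base} .variants[] .bitrate",
    f"{base} .aspect_ratio",
]
conflicts = ancestor_conflicts(paths)
for ancestor, descendant in conflicts:
    print(f"{ancestor!r} already covers {descendant!r}")
```

An empty result would mean the list passes this particular check.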

I think this check should go in a separate file: trie_walk.py has hitherto been able to assume that parent nodes could be present alongside descendants, but when its output is used as a filter this assumption becomes a prohibition.

This is called iterator masking in section 4 of AUDIT.md.

lmmx / wotd
