
wotd's Introduction

Analysis of WOTD data from Tweets marked #WordOfTheDay from my Twitter account

Preprocessing

Detailed preprocessing instructions are in PREPROCESSING.md. In summary, the following commands produce wotd_tweet.json from tweet.js (a file provided by Twitter):

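# Skip the JavaScript assignment at the start of the file (everything up to and including "= ") so that only JSON remains
function js2json () { tail -c +$(echo 3 + $(grep -b -o "=" <<<"$(head -1 "$1")" | head -1 | cut -d: -f1) | bc) "$1"; }

# Keep only the tweets with a hashtag matching "wordoftheday" (case-insensitive)
js2json tweet.js | jq '[.[] | select(.tweet .entities .hashtags[] .text | test("wordoftheday"; "i"))]' > wotd_tweet.json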
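For readers less familiar with shell and jq, the same two steps can be sketched in Python. This is a rough equivalent rather than the method used in the repo: the function names `js_to_json` and `is_wotd` are hypothetical, and the sample string (including its `window.YTD.tweet.part0` variable name) is made up to resemble the shape the commands above assume for tweet.js.

```python
import json


def js_to_json(text):
    """Drop the JavaScript assignment (everything up to the first '=') and return the JSON payload."""
    _, _, payload = text.partition("=")
    return payload.strip()


def is_wotd(entry):
    """True if any hashtag on the tweet contains 'wordoftheday', ignoring case."""
    hashtags = entry["tweet"]["entities"]["hashtags"]
    return any("wordoftheday" in tag["text"].lower() for tag in hashtags)


# Made-up sample standing in for the contents of tweet.js (an assumption, not real data)
sample = (
    'window.YTD.tweet.part0 = ['
    '{"tweet": {"entities": {"hashtags": [{"text": "WordOfTheDay"}]}}}, '
    '{"tweet": {"entities": {"hashtags": []}}}]'
)

tweets = json.loads(js_to_json(sample))
wotd = [entry for entry in tweets if is_wotd(entry)]
print(json.dumps(wotd, indent=2))
```

Unlike the jq `select`, which emits a tweet once per matching hashtag, `any` keeps each matching tweet exactly once.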

This file was big (188 tweets took up 24,000 lines), so the next step reduced its size by removing surplus info.

The step detailed in DEJUNK.md and performed in dejunk.sh reduced wotd_tweet.json from 24,000 lines to 22,000 lines. However, because the structure of tweet objects is so variable, it was difficult to inspect the full contents of the JSON tree without enumerating all of its paths.

  • A 'path' here is a rooted path to any key, which I call a "unique key path" or UKP

Following this initial 'dejunk' step, I wanted a full account of what was in this JSON, so I enumerated all the paths (106 of them; the initial 'dejunking' only removed 7). Removing the rest would need more complex jq commands (due to nested keys inside iterators).
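As an illustration of what enumerating the paths involves, here is a minimal Python sketch that walks a parsed JSON value and collects every rooted key path, written in the same spaced, jq-like notation used elsewhere in this repo. The function name `key_paths` and the sample data are hypothetical; the actual enumeration may well have been done differently (e.g. in jq).

```python
def key_paths(node, prefix=""):
    """Yield every rooted key path in a parsed JSON value, e.g. '.[] .tweet .id'."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix} .{key}" if prefix else f".{key}"
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # An array contributes an iterator step '[]' rather than a numbered index
        path = f"{prefix}[]" if prefix else ".[]"
        for item in node:
            yield from key_paths(item, path)


# Made-up sample: the second tweet lacks any hashtags, so it contributes fewer paths
data = [
    {"tweet": {"id": "1", "entities": {"hashtags": [{"text": "a"}]}}},
    {"tweet": {"id": "2", "entities": {"hashtags": []}}},
]

unique_paths = sorted(set(key_paths(data)))
for path in unique_paths:
    print(path)
```

Collecting the paths into a set is what makes them "unique": the same key path appears once however many tweets contain it.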

I turned these 106 paths into a 'checklist' for deleting keys from the JSON programmatically rather than by manually writing a del command. A handwritten command would get big, take a long time to write, and become hard to maintain or debug if I wanted to change it later or reuse it on another file (e.g. the much larger JSON file of all tweets, rather than this WOTD subset).

For many of these, the paths are 'inexact' i.e. 'globbed', as they're not present in all Tweet objects. The inexact paths 'expand' to a much larger number of paths (they are 'one-to-many').
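The 'one-to-many' behaviour can be sketched as follows: a single globbed path is applied to every place it matches, and tweets that lack the key are skipped. This is a minimal Python illustration only, assuming keys contain no spaces or brackets; `parse_path` and `delete_path` are hypothetical names and are not taken from the repo's scripts.

```python
import re


def parse_path(path):
    """Split '.[] .tweet .media[] .url' into ['[]', 'tweet', 'media', '[]', 'url']."""
    steps = []
    for token in path.split():
        key, brackets = re.fullmatch(r"\.([^\[\]]*)((?:\[\])*)", token).groups()
        if key:
            steps.append(key)
        steps.extend(["[]"] * (len(brackets) // 2))
    return steps


def delete_path(node, steps):
    """Delete the key addressed by steps wherever it exists; absent keys are skipped."""
    if not steps:
        return  # a path ending in an iterator names no key, so nothing is deleted
    head, *rest = steps
    if head == "[]":
        if isinstance(node, list):
            for item in node:
                delete_path(item, rest)
    elif isinstance(node, dict) and head in node:
        if rest:
            delete_path(node[head], rest)
        else:
            del node[head]


# Made-up sample: only one media item has a 'url', and one tweet has no media at all
data = [
    {"tweet": {"id": "1", "media": [{"url": "u", "sizes": 1}, {"sizes": 2}]}},
    {"tweet": {"id": "2"}},
]

delete_path(data, parse_path(".[] .tweet .media[] .url"))
print(data)
```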

This basic auditing step allowed me to confirm that information such as geolocation was not in the JSON, so I could be confident that uploading the file to GitHub did not overlook any security concerns (and I can remain confident of this in future by inspecting this 'UKP manifest').

The step detailed in AUDIT.md and [to be] performed in audit.sh reduced wotd_tweet_dejunked.json from 22,000 lines to [TBC] lines in the file wotd_tweet_reconciled.json.

I wasn't expecting to, but the auditing step led me to write a trie, and then to recursively walk this trie to remove any repetitive parts to display the paths only in terms of what changed line by line (similar to when you write " " " " to indicate a repeated part of a line, 'as above'). This doubles as a suitable method of writing the jq query to delete those keys. This is explained in more detail in the markdown document.
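The idea of masking repeated parts can be sketched in a few lines of Python. This is not the code in trie_walk.py: `build_trie` and `walk` are hypothetical names, and the sketch deliberately does not record where a path ends, so a path that is also a prefix of a longer path would not get a line of its own.

```python
def build_trie(paths):
    """Build nested dicts keyed by path step (dicts preserve insertion order)."""
    trie = {}
    for path in paths:
        node = trie
        for step in path.split():
            node = node.setdefault(step, {})
    return trie


def walk(node, prefix=""):
    """Return one line per leaf, replacing any prefix already shown with spaces."""
    lines = []
    for i, (step, children) in enumerate(node.items()):
        # Only the first child repeats the parent's text; later siblings are indented
        lead = prefix if i == 0 else " " * len(prefix)
        text = lead + step
        if children:
            lines.extend(walk(children, text + " "))
        else:
            lines.append(text)
    return lines


base = ".[] .tweet .extended_entities .media[] .video_info"
paths = [
    f"{base} .aspect_ratio",
    f"{base} .duration_millis",
    f"{base} .variants[] .bitrate",
    f"{base} .variants[] .content_type",
    f"{base} .variants[] .url",
]

lines = walk(build_trie(paths))
print("\n".join(lines))
```

Each line shows only the steps that differ from the line above it, with the shared prefix replaced by spaces of the same width.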

wotd's People

Contributors

lmmx


wotd's Issues

Parent of a null child node in trie is masked when parent line not present

While establishing input validation for

grep -v "\- \[x\]" ukp_manifest.md | grep "\[\].*\[\]" | cut -d\` -f2 | tail -6 | sed '/variants$/d' | python trie_walk.py -

⇣

.[] .tweet .extended_entities .media[] .video_info .aspect_ratio
                                                   .duration_millis
                                                               .bitrate
                                                               .content_type
                                                               .url

I noticed there's a bug here. The output should be:

.[] .tweet .extended_entities .media[] .video_info .aspect_ratio
                                                   .duration_millis
                                                   .variants[] .bitrate
                                                               .content_type
                                                               .url

The expected output above is valid, whereas the masked trie output (with the first line ending with .video_info) is not, and input validation needs to tell the two apart.

  • As far as I can tell so far, valid input will have no 'null children', as any ancestor in a list of filter paths must (by the definition of mutually exclusive descendants and ancestors) exclude its descendants from that list.
  • In other words, detecting paths which have these 'null children' will be sufficient to validate the masked trie list as input to build_del_call.py.
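One way such a check might look is sketched below in Python. It rests on an assumption about the terminology: that a 'null child' arises when a listed path is also a proper ancestor of another listed path. The names `steps_of` and `ancestor_conflicts` are hypothetical, and the iterator `[]` is treated as its own step so that `.variants` counts as an ancestor of `.variants[] .bitrate`.

```python
import re


def steps_of(path):
    """Split a path into steps, with each iterator '[]' as a separate step."""
    return tuple(re.findall(r"\.[^\s\[\]]*|\[\]", path))


def ancestor_conflicts(paths):
    """Return (ancestor, descendant) pairs where both paths appear in the list."""
    by_steps = {steps_of(path): path for path in paths}
    conflicts = []
    for steps, path in by_steps.items():
        # Check every proper prefix of this path against the listed paths
        for n in range(1, len(steps)):
            ancestor = by_steps.get(steps[:n])
            if ancestor is not None:
                conflicts.append((ancestor, path))
    return conflicts


base = ".[] .tweet .extended_entities .media[] .video_info"
candidate = [
    f"{base} .aspect_ratio",
    f"{base} .variants",
    f"{base} .variants[] .bitrate",
]

conflicts = ancestor_conflicts(candidate)
print(conflicts)
```

Under that assumption, an empty result would mean the list is safe to pass on as a filter, while any pair returned identifies the offending ancestor and descendant.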

I think this should go in a separate file: trie_walk.py has hitherto been able to assume that parent nodes could be present alongside descendants, but when its output is used as a filter this assumption becomes a prohibition.

This is called iterator masking in section 4 of AUDIT.md.
