jsonlined

I work a lot with jsonlines files (.jsonl), in which each line represents a json-formatted 'dictionary' mapping keys to values.

For instance, I may have a large file with tweets, each tweet a dictionary storing number of likes, tweet text, time of tweeting, etc.

Sometimes I want to extract a value from each json line (say, the tweet's text under the key text), do something with it (like count the number of words, split it into sentences, or categorize its sentiment), and store the result under a new key in the original json dictionary.
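Done by hand in plain Python, that workflow looks roughly like the sketch below. It is only an illustration of the task, not the module's actual implementation; the function name `add_word_counts` and the sample record are made up for this example.

```python
import json


def add_word_counts(lines, key="text", new_key="nwords"):
    """Yield each json line again, with the word count of `key` stored under `new_key`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        # "Do something" with the extracted value: here, count whitespace-separated words.
        record[new_key] = len(record[key].split())
        yield json.dumps(record)


if __name__ == "__main__":
    # Hypothetical sample input; in practice the lines would come from a .jsonl file.
    sample = ['{"id": "12qw3", "text": "The quick brown fox. So anyway."}']
    for out in add_word_counts(sample):
        print(out)
```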

I wrote this small module to make this easier. Perhaps you will find it useful.

It uses Python's subprocess module, which is probably not ideal compared to relying on bash itself. I'm not an expert in bash/pipes/streams/subprocess/json/operating systems. No guarantees about it working correctly anywhere other than my own computer.

Install

pip install git+https://github.com/mwestera/jsonlined

This will make two commands available in your shell:

  • jsonlined: for cases where each line is processed by a separate instance of a program.
  • jsonpiped: for cases where lines are fed one by one into a single running instance of a program.

The latter is especially recommended for programs with substantial setup/teardown, since that cost is then paid once rather than once per line.

Examples

Suppose we have a .jsonl file with social media posts like this:

{"type": "submission", "id": "12qw3", "text": "The quick brown fox. So anyway.", "score": 0.5}
{"type": "reply", "id": "34ad5", "text": "Vintage pamphlets are fun. Buy them!", "score": 0.86}
{"type": "submission", "id": "654as", "text": "Ignorance of the law. What about it?", "score": 1.0}

We can extract the values under text, pass them into another command, like Unix's own wc -w for counting words, and store the result in a new key nwords, keeping the original text:

$ cat tests/test.jsonl | jsonlined [wc -w] text nwords --keep

Hypothetical example, assuming one has sentencize.py for splitting a text into sentences:

Get a bunch of jsonlines, extract the values under 'text', split each text into sentences, and output a new json line per sentence, each with an 'id' field derived from the original 'id' field:

$ cat tests/test.jsonl | jsonlined [python sentencize.py] text sentence --id id 

Or maybe we want it only for the lines where the "type" key has the value "submission":

$ cat tests/test.jsonl | jsonlined [python sentencize.py] text,type=submission sentence --id id 
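For illustration, a minimal sentencize.py could look like the sketch below. It is a hypothetical stand-in, not part of this package: it assumes the extracted text arrives on stdin (as it would for wc -w) and that one sentence per output line is what is wanted. The regex splitting is deliberately naive.

```python
import re
import sys


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    # Each instance reads its whole input, then prints one sentence per line.
    for sentence in split_sentences(sys.stdin.read()):
        print(sentence)
```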

You can also filter on the output of the subprocess, for instance to get all texts with 10 words:

$ cat tests/test.jsonl | jsonlined [wc -w]=10 text

Another example, for computing text embeddings (assuming we have the script embed.py, which operates on lines of stdin):

$ cat tests/test.jsonl | jsonpiped [python embed.py] text embedding

This time, jsonpiped is used (instead of jsonlined), because embed.py requires considerable setup (loading a model). A prerequisite is that it operates line-wise (not waiting for EOF like wc).
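As a rough sketch of what "line-wise" means for a script like embed.py: do the expensive setup once, then answer each input line as soon as it arrives and flush the output. Everything below is hypothetical; `load_model` returns a toy function (text length and word count) where a real embedding model would go, and it is an assumption that each text reaches the script as one plain line on stdin.

```python
import json
import sys


def load_model():
    """Placeholder for expensive setup that should happen once per process."""
    def embed(text):
        # Toy 'embedding': character count and word count. A real model goes here.
        return [float(len(text)), float(len(text.split()))]
    return embed


def embed_lines(lines, embed):
    """Yield one JSON-encoded vector per input line, as soon as that line is read."""
    for line in lines:
        yield json.dumps(embed(line.rstrip("\n")))


if __name__ == "__main__":
    embed = load_model()  # paid once, not once per line
    for result in embed_lines(sys.stdin, embed):
        # Flush after every line so results are not held back in a buffer.
        print(result, flush=True)
```

Iterating over sys.stdin yields lines as they come in, so the script never waits for EOF before producing output.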

If the subprocess outputs valid json, it is interpreted as such; otherwise the output is stored as a literal string.
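The "json if possible, string otherwise" rule can be sketched as follows. This is a guess at the behaviour described, not the module's actual code; the name `interpret_output` and the whitespace stripping are assumptions.

```python
import json


def interpret_output(raw):
    """Return the parsed json value if `raw` is valid json, else the stripped string."""
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


if __name__ == "__main__":
    print(repr(interpret_output("6\n")))         # a number, as wc -w would print
    print(repr(interpret_output("So anyway.")))  # not json, kept as a string
```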

In case the subprocess can output multiple new lines per original input line, either use jsonlined, or -- for jsonpiped -- set --onetomany and make sure the subprocess outputs double newlines between inputs.
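A line-wise script that produces several outputs per input might mark the boundaries like the sketch below: after the outputs for one input line, it prints an empty line, which shows up as a double newline in the stream. This is a hypothetical example; whether the separator is also expected after the very last input is an assumption here.

```python
import re
import sys


def sentences_with_separator(lines):
    """For each input line, yield its sentences, then an empty string as separator."""
    for line in lines:
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                yield part
        yield ""  # printed as an empty line, i.e. a double newline


if __name__ == "__main__":
    for out in sentences_with_separator(sys.stdin):
        print(out, flush=True)
```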

Related

More or less the same can be achieved, with a bit of bash scripting, by using the much more sophisticated, faster, more general-purpose JSON Stream Editor JJ.
