Coder Social home page Coder Social logo

cmungall / json-flattener Goto Github PK

View Code? Open in Web Editor NEW
8.0 2.0 3.0 101 KB

Python library for denormalizing nested dicts or json objects to tables and back

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.07% Shell 0.11% Python 53.56% Jupyter Notebook 46.27%
json yaml denormalization dataframes linkml pandas

json-flattener's Introduction

json-flattener

Python library for denormalizing/flattening lists of complex objects to tables/data frames, with roundtripping

Notebook Example

EXAMPLE.ipynb

Description

Given YAML/JSON/JSON-Lines such as:

- id: S001
  name: Lord of the Rings
  genres:
    - fantasy
  creator:
    name: JRR Tolkein
    from_country: England
  books:
    - id: S001.1
      name: Fellowship of the Ring
      price: 5.99
      summary: Hobbits
    - id: S001.2
      name: The Two Towers
      price: 5.99
      summary: More hobbits
    - id: S001.3
      name: Return of the King
      price: 6.99
      summary: Yet more hobbits
- id: S002
  name: The Culture Series
  genres:
    - scifi
  creator:
    name: Ian M Banks
    from_country: Scotland
  books:
    - id: S002.1
      name: Consider Phlebas
      price: 5.99
    - id: S002.2
      name: Player of Games
      price: 5.99

Denormalize using jfl command:

jfl flatten -C creator=flat -C books=multivalued -i examples/books1.yaml -o examples/books1-flattened.tsv
id name genres creator_name creator_from_country books_name books_summary books_price books_id creator_genres
S001 Lord of the Rings [fantasy] JRR Tolkein England [Fellowship of the Ring|The Two Towers|Return of the King] [Hobbits|More hobbits|Yet more hobbits] [5.99|5.99|6.99] [S001.1|S001.2|S001.3]
S002 The Culture Series [scifi] Ian M Banks Scotland [Consider Phlebas|Player of Games] [5.99|5.99] [S002.1|S002.2]

To convert back to JSON/YAML we must first cache the generated mappings when we do the flatten with -O:

jfl flatten -C creator=flat -C books=multivalued -i examples/books1.yaml -O examples/conf.yaml -o examples/books1-flattened.tsv

Then pass this as an argument

jfl unflatten -C creator=flat -C books=multivalued -i examples/books1.tsv -c examples/conf.yaml -o examples/books1.yaml

This library also allows complex fields to be directly serialized as json or yaml (the default is to append _json to the key). For example:

jfl flatten -C creator=json -C books=json -i examples/books1.yaml -o examples/books1-jsonified.tsv
id name genres creator_json books_json
S001 Lord of the Rings [fantasy] {"name": "JRR Tolkein", "from_country": "England"} [{"id": "S001.1", "name": "Fellowship of the Ring", "summary": "Hobbits", "price": 5.99}, {"id": "S001.2", "name": "The Two Towers", "summary": "More hobbits", "price": 5.99}, {"id": "S001.3", "name": "Return of the King", "summary": "Yet more hobbits", "price": 6.99}]
S002 The Culture Series [scifi] {"name": "Ian M Banks", "from_country": "Scotland"} [{"id": "S002.1", "name": "Consider Phlebas", "price": 5.99}, {"id": "S002.2", "name": "Player of Games", "price": 5.99}]
S003 Book of the New Sun [scifi, fantasy] {"name": "Gene Wolfe", "genres": ["scifi", "fantasy"], "from_country": "USA"} [{"id": "S003.1", "name": "Shadow of the Torturer"}, {"id": "S003.2", "name": "Claw of the Conciliator", "price": 6.99}]
S004 Example with single book {"name": "Ms Writer", "genres": ["romance"], "from_country": "USA"} [{"id": "S004.1", "name": "Blah"}]
S005 Example with no books {"name": "Mr Unproductive", "genres": ["romance", "scifi", "fantasy"], "from_country": "USA"}

See

<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRyM06peU9BkrZbXJazuMlajw5s4Vbj5f0t0TE4hj_X9Ex_EASLSUZuaWUxYIhWbOC6CtPRtxrTGWQD/embed?start=false&loop=false&delayms=60000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

The primary use case is to go from a rich normalized data model (as python objects, JSON, or YAML) to a flatter representation that is amenable to processing with:

  • Solr/Lucene
  • Pandas/R Dataframes
  • Excel/Google sheets
  • Unix cut/grep/cat/etc
  • Simple denormalized SQL database representations

The target denormalized format is a list of rows / a data matrix, where each cell is either an atom or a list of atoms.

Usage from Python

dict = {
            "id": "A1",
            "subject": {"id": "G1", "name": "gene1", "category": "gene"},
            "object": {"id": "T1", "name": "term1", "category": "term"},
            "publications": ["PMID1", "PMID2"],
            "closure": [
                {"id": "X1", "name": "x1"},
                {"id": "X2", "name": "x2"},
                {"id": "X3", "name": "x3"},
            ],
        }
kconfig = {
            "subject": KeyConfig(delete=True, serializers="yaml"),
            "object": KeyConfig(delete=True, flatten=True),
            "closure": KeyConfig(delete=True, is_list=True, flatten=True),
        }
config = GlobalConfig(key_configs=kconfig)
flattened_objs = flatten(objs, config)

Method

  • Each top level key becomes a column
  • if the key value is a dict/object, then flatten
    • by default a '_' is used to separate the parent key from the inner key
    • e.g. the composition of creator and from_country becomes creator_from_country
    • currently one level of flattening is supported
  • if the key value is a list of atomic entities, then leave as is
  • if the key value is a list of dicts/objects, then flatten each key of this inner dict into a list
    • e.g. if books is a list of book objects, and name is a key on book, then books_name is a list of names of each book
    • order is significant - the first element of books_name is matched to the first element of books_price, etc
  • Allow any key to be serialized as yaml/json/pickle if configured

Comparison

Pandas json_normalize

Java json-flattener

https://github.com/wnameless/json-flattener

Python

csvjson

https://csvjson.com/json2csv

json-flattener's People

Contributors

cmungall avatar dalito avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

json-flattener's Issues

fix gh-actions

false positive fails: #2

needs pytest in dependencies

maybe just move whole project over to poetry?

Poetry version number is in `pyproject.toml` is out of date

The version number in pyproject.toml says 0.1.0 but the latest release is tagged at 0.2.0. Additionally, this latest release is unavailable on PyPI. The PyPI release is currently at 0.1.9 and there are some confusing inconsistencies in using this library as a CLI tool between the two versions.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.