This project forked from learn-co-students/dsc-exploring-and-transforming-json-schemas-dc-ds-060319

Exploring and Transforming JSON Schemas

Introduction

In this lesson, you'll formalize your knowledge of how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little documentation.

Objectives

You will be able to:

  • Explore unknown JSON schemas
  • Access and manipulate data inside a JSON file
  • Convert JSON to alternative data formats

Loading the JSON file

As before, you'll begin by importing the json package, opening a file with Python's built-in open() function, and then loading that data in.

import json

# Open the file with a context manager so it is closed automatically
with open('output.json') as f:
    data = json.load(f)
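
As a minimal sketch of what the loaded object can contain, the snippet below parses a small inline JSON string and prints the Python type of each value. The string and its keys are invented for illustration and are not taken from output.json; json.loads is the string counterpart of json.load.

```python
import json

# Hypothetical sample document, invented for illustration only.
raw = (
    '{"name": "demo", "count": 2, "ratio": 0.5, '
    '"tags": ["a", "b"], "active": true, "parent": null}'
)

# json.loads parses a string; json.load reads from an open file object.
sample = json.loads(raw)

# JSON objects become dicts, arrays become lists, and null becomes None.
for key, value in sample.items():
    print(key, type(value).__name__)
```

Checking types this way is the same step the rest of this lesson applies to the real file, one level at a time.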

Exploring JSON Schemas

Recall that JSON files have a nested structure. The most granular level of raw data will be individual numbers (float/int) and strings. These in turn will be stored in the equivalent of Python lists and dictionaries. Because lists and dictionaries can be nested inside one another, you'll start exploring by checking the type of the root object, and then map out the hierarchy of the JSON file from there.

type(data)
dict

As you can see, in this case, the first level of the hierarchy is a dictionary. Let's explore what keys are within this:

data.keys()
dict_keys(['albums'])

In this case, there is only a single key, 'albums', so you'll continue on down the pathway exploring and mapping out the hierarchy. Once again, start by checking the type of this nested data structure.

type(data['albums'])
dict

Another dictionary! So far, you have a dictionary within a dictionary. Once again, investigate what's within this dictionary. (JSON calls the equivalent of a Python dictionary an object.)

data['albums'].keys()
dict_keys(['href', 'items', 'limit', 'next', 'offset', 'previous', 'total'])

At this point, the hierarchy looks something like this:

data (dict)
  albums (dict)
    href, items, limit, next, offset, previous, total

If you were to continue checking each of these data types individually, you would have a lot to go through. To simplify this, you can use a for loop:

for key in data['albums'].keys():
    print(key, type(data['albums'][key]))
href <class 'str'>
items <class 'list'>
limit <class 'int'>
next <class 'str'>
offset <class 'int'>
previous <class 'NoneType'>
total <class 'int'>

Adding these types to the diagram gives something like this:

data (dict)
  albums (dict)
    href (str)
    items (list)
    limit (int)
    next (str)
    offset (int)
    previous (NoneType)
    total (int)

Normally, you may not draw out the full diagram as done here, but it's a useful picture to have in mind, and for complex schemas it can be worth mapping out. At this point, you also probably have a good idea of the general structure of the JSON file. However, there is still the list of items, which you could investigate further:

type(data['albums']['items'])
list
len(data['albums']['items'])
2
type(data['albums']['items'][0])
dict
data['albums']['items'][0].keys()
dict_keys(['album_type', 'artists', 'available_markets', 'external_urls', 'href', 'id', 'images', 'name', 'type', 'uri'])
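
The manual steps above (check the type, list the keys, descend one level) can also be automated. The following is a minimal sketch, not part of the original lesson: describe_schema is a hypothetical helper, the sample data is invented, and the helper assumes every list is homogeneous, so it only inspects the first element.

```python
def describe_schema(obj, indent=0, max_depth=3):
    """Return lines describing the nested types of a parsed JSON value."""
    lines = []
    if indent >= max_depth:
        return lines  # stop descending once the depth limit is reached
    pad = "  " * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(describe_schema(value, indent + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Assumption: all elements share a structure, so only look at the first.
        lines.append(f"{pad}[0]: {type(obj[0]).__name__} (list of {len(obj)})")
        lines.extend(describe_schema(obj[0], indent + 1, max_depth))
    return lines


# Invented sample loosely shaped like the file explored above.
sample = {
    "albums": {
        "href": "x",
        "items": [{"name": "A", "artists": [{"name": "B"}]}],
        "limit": 2,
        "previous": None,
    }
}

lines = describe_schema(sample)
print("\n".join(lines))
```

The max_depth argument keeps the printout manageable on deeply nested files; raise it to see more levels.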

Converting JSON to Alternative Data Formats

As you can see, the nested structure continues on: our list of items is only 2 long, but each item is a dictionary with a large number of key-value pairs. To add context, this is actually the data that you're probably after from this file: it's the data that provides details about which albums were recently released. The entirety of the JSON file itself is an example response from the Spotify API (more on that soon). So while the larger JSON provides us with many details about the response itself, our primary interest may simply be the list of dictionaries within data -> albums -> items. Preview this and see if you can transform it into our usual Pandas DataFrame.

import pandas as pd

On a first attempt, you might be tempted to pass the whole object to Pandas. Try to think about what you would like the resulting DataFrame to look like based on the schema we are mapping out. What would the column names be? What would the rows represent?

df = pd.DataFrame(data['albums']['items'])
df.head()
  | album_type | artists | available_markets | external_urls | href | id | images | name | type | uri
0 | single | [{'external_urls': {'spotify': 'https://open.s... | [AD, AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, C... | {'spotify': 'https://open.spotify.com/album/5Z... | https://api.spotify.com/v1/albums/5ZX4m5aVSmWQ... | 5ZX4m5aVSmWQ5iHAPQpT71 | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Runnin' | album | spotify:album:5ZX4m5aVSmWQ5iHAPQpT71
1 | single | [{'external_urls': {'spotify': 'https://open.s... | [AD, AR, AT, AU, BE, BG, BO, BR, CH, CL, CO, C... | {'spotify': 'https://open.spotify.com/album/0g... | https://api.spotify.com/v1/albums/0geTzdk2Inlq... | 0geTzdk2InlqIoB16fW9Nd | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Sneakin’ | album | spotify:album:0geTzdk2InlqIoB16fW9Nd

Not bad, although you can see that some of our cells still have nested data within them. The artists column in particular might be nice to break apart. You could do this from the original JSON, but at this point, let's work with our DataFrame. Preview an entry.

df.artists.iloc[0]
[{'external_urls': {'spotify': 'https://open.spotify.com/artist/2RdwBSPQiwcmiDo9kixcl8'},
  'href': 'https://api.spotify.com/v1/artists/2RdwBSPQiwcmiDo9kixcl8',
  'id': '2RdwBSPQiwcmiDo9kixcl8',
  'name': 'Pharrell Williams',
  'type': 'artist',
  'uri': 'spotify:artist:2RdwBSPQiwcmiDo9kixcl8'}]

As you can see, you have a list of dictionaries, in this case with only one entry as there's only one artist. You can imagine wanting to transform this into artist1, artist2, ... columns. This will be a great exercise in the upcoming lab to practice your Pandas skills and lambda functions!
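
To make the artist1, artist2, ... idea concrete, here is one possible approach, offered as a sketch rather than the lab's intended solution. The rows are invented and only mimic the shape of data['albums']['items']; df_demo, artist_cols, and result are illustrative names.

```python
import pandas as pd

# Hypothetical rows shaped like data['albums']['items']; values are invented.
items = [
    {"name": "Album A", "artists": [{"name": "Artist One"}]},
    {"name": "Album B", "artists": [{"name": "Artist Two"}, {"name": "Artist Three"}]},
]
df_demo = pd.DataFrame(items)

# Pull the 'name' out of each artist dictionary in the list.
artist_names = df_demo["artists"].apply(lambda artists: [a["name"] for a in artists])

# Expand each list into its own columns; shorter lists are padded with missing values.
artist_cols = pd.DataFrame(artist_names.tolist(), index=df_demo.index)
artist_cols.columns = [f"artist{i + 1}" for i in range(artist_cols.shape[1])]

# Replace the nested column with the flat artist columns.
result = df_demo.drop(columns="artists").join(artist_cols)
print(result)
```

Rows with fewer artists than the widest row end up with missing values in the trailing columns, which is usually easier to filter on than a nested list.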

Summary

JSON files often have a deeply nested structure that can require initial investigation into the schema hierarchy in order to become familiar with how the data is stored. Once done, it is important to identify what data you are looking to extract and then develop a strategy to transform it into your standard workflow (which generally will be dependent on Pandas DataFrames or NumPy arrays).
