Coder Social home page Coder Social logo

dsc-2-14-05-known-json-schemas-al's Introduction

Working with Known JSON Schemas

Introduction

We've started taking a look at JSON files and we'll continue to explore how to navigate and traverse these files. One common use case of json files will be when we are connecting to various websites through their established APIs to retrieve data from them. With these, we are typically given a schema for how the data is structured and then will use this knowledge to retrieve pertinant information. In this lecture, we'll take a look at the response from the NY Times API.

Objectives

You will be able to:

  • Read JSON Documentation Schemas and translate into code
  • Extract data from known json schemas
  • Write data to predefined JSON schemas

Reading a JSON Schema

Here's the JSON schema provided for a section of the NY Times API:

or a more detailed view (truncated):

You can see this yourself here: https://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json

You can see that the master structure is a dictionary and has a key named 'response'. This is also a dictionary and has two keys: 'data' and 'meta'. As you continue to examine the schema hierarchy, you'll notice the vast majority in this case are dictionaries.

Loading the Data File

As we saw before, let's start by importing this data from file. We open to file and load its contents.

import json
f = open('ny_times_response.json', 'r')
data = json.load(f)
print(type(data))
print(data.keys())
<class 'dict'>
dict_keys(['status', 'copyright', 'response'])

You should see that there are two additional keys 'status' and 'copyright' which were not shown in the schema documentation.

Loading Specific Data

Looking at the schema, we might be interested in retrieving a specific piece of data, such as the articles' headlines. We see that this is a key under 'docs', which is under 'response'. This gives us roughly: data['response']['docs']['headline']. While this is close to the code we'll use to extract headlines, something is a bit off. Notice that if you look closely at the schema outline, that the 'docs' subheading is actually a list. Each item within this list should be a dictionary with the keys shown above, but that is an important distinction. Breaking it into two steps we have:

docs = data['response']['docs']
print(type(docs), len(docs))
<class 'list'> 9
for doc in docs:
    print(doc['headline'])
{'main': "HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.", 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'MUSIC OF THE WEEK', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Anacomp Inc. reports earnings for Qtr to March 31', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Brooklyn Routs Yeshiva', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Albuquerque Program Gives Drinkers a Lift', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Front Page 7 -- No Title', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}

Or if we want to just print the main headlines themselves:

for doc in docs:
    print(doc['headline']['main'])
    print('\n')
HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.


GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3


Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955


MUSIC OF THE WEEK


Anacomp Inc. reports earnings for Qtr to March 31


Brooklyn Routs Yeshiva


Albuquerque Program Gives Drinkers a Lift


Front Page 7 -- No Title


UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.

Transforming JSON to Alternative Formats

We also have previously started to take a look at how to transform JSON to DataFrames. Investigating our schema, a good option for this could again be the 'docs' subheading. While this still has nested data itself, I recommend loading the entire section as a dataframe and then using functions to break apart nested data from there.

import pandas as pd
df = pd.DataFrame(data['response']['docs'])
df.head(3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
_id abstract blog byline document_type headline keywords multimedia news_desk print_page pub_date score snippet source type_of_material web_url word_count
0 4fc04eb745c1498b0d23da00 Spent $22,200 {} NaN article {'main': 'HIGGINS, SPENT $22,189.53.; Governor... [{'name': 'persons', 'value': 'HIGGINS, LT. GO... [] NaN 2 1904-11-17T00:00:00Z 1 Spent $22,200 The New York Times Article https://query.nytimes.com/gst/abstract.html?re... 213
1 4fc21ebf45c1498b0d612b22 NaN {} NaN article {'main': 'GARDEN BOUTS CANCELED; Mauriello Say... [] [] NaN 15 1944-10-23T00:00:00Z 1 The New York Times Article https://query.nytimes.com/gst/abstract.html?re... 149
2 4fc3b41d45c1498b0d7fd41e NaN {} {'original': 'By JOHN G. FORREST', 'person': [... article {'main': 'Stock Drop Is Biggest in 2 Months--M... [] [] NaN F1 1955-05-15T00:00:00Z 1 Stock prices last week, on the lightest volume... The New York Times Article https://query.nytimes.com/gst/abstract.html?re... 823

Breaking out nested data

Now that we have the data loaded, let's clean it up by breaking out some of the nested data. For example, you should notice that the headline entries are actualy dictionaries. We could transform these into singular data columns with something like this:

keys = df.headline.iloc[0].keys() #Get dictionary keys
#Keep track of columns we make for subsequent preview
new_cols = []
#Create a new feature for each of these keys
for key in keys:
    new_col = 'headline_{}'.format(key) #Create new column name
    df[new_col] = df.headline.map(lambda x: x[key]) #Create a new column
    new_cols.append(new_col)
df[new_cols].head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
headline_main headline_kicker headline_content_kicker headline_print_headline headline_name headline_seo headline_sub
0 HIGGINS, SPENT $22,189.53.; Governor-Elect's E... None None None None None None
1 GARDEN BOUTS CANCELED; Mauriello Says He Could... 1 None None None None None
2 Stock Drop Is Biggest in 2 Months--Margin Rise... 1 None None None None None
3 MUSIC OF THE WEEK None None None None None None
4 Anacomp Inc. reports earnings for Qtr to March 31 None None None None None None

Wahoo! This is a good general strategy for transforming nested JSON: create a DataFrame and then break out nested features into their own column features.

Outputing to JSON

Finally, let's look at how we can write data back to JSON. Like loading, we first open a file (this time with write permission) and use the json package to transfer data to that file container.

with open('output.json', 'w') as f:
    json.dump(data, f)

Summary

There you have it! In this, we took another look at JSON, taking a look at an example schema diagram and retrieving information. We also looked at a general procedure for transforming nested data to Pandas DataFrames (create a DataFrame, and then break apart nested data using lambda functions to create additional columns). We also took a brief look at saving data to json files.

dsc-2-14-05-known-json-schemas-al's People

Contributors

loredirick avatar mathymitchell avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.