Working with Known JSON Schemas
Introduction
We've started taking a look at JSON files and we'll continue to explore how to navigate and traverse these files. One common use case of json files will be when we are connecting to various websites through their established APIs to retrieve data from them. With these, we are typically given a schema for how the data is structured and then will use this knowledge to retrieve pertinant information. In this lecture, we'll take a look at the response from the NY Times API.
Objectives
You will be able to:
- Read JSON Documentation Schemas and translate into code
- Extract data from known json schemas
- Write data to predefined JSON schemas
Reading a JSON Schema
Here's the JSON schema provided for a section of the NY Times API:
or a more detailed view (truncated):
You can see this yourself here: https://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json
You can see that the master structure is a dictionary and has a key named 'response'. This is also a dictionary and has two keys: 'data' and 'meta'. As you continue to examine the schema hierarchy, you'll notice the vast majority in this case are dictionaries.
Loading the Data File
As we saw before, let's start by importing this data from file. We open to file and load its contents.
import json
f = open('ny_times_response.json', 'r')
data = json.load(f)
print(type(data))
print(data.keys())
<class 'dict'>
dict_keys(['status', 'copyright', 'response'])
You should see that there are two additional keys 'status' and 'copyright' which were not shown in the schema documentation.
Loading Specific Data
Looking at the schema, we might be interested in retrieving a specific piece of data, such as the articles' headlines. We see that this is a key under 'docs', which is under 'response'. This gives us roughly: data['response']['docs']['headline']. While this is close to the code we'll use to extract headlines, something is a bit off. Notice that if you look closely at the schema outline, that the 'docs' subheading is actually a list. Each item within this list should be a dictionary with the keys shown above, but that is an important distinction. Breaking it into two steps we have:
docs = data['response']['docs']
print(type(docs), len(docs))
<class 'list'> 9
for doc in docs:
print(doc['headline'])
{'main': "HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.", 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'MUSIC OF THE WEEK', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Anacomp Inc. reports earnings for Qtr to March 31', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Brooklyn Routs Yeshiva', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Albuquerque Program Gives Drinkers a Lift', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Front Page 7 -- No Title', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
Or if we want to just print the main headlines themselves:
for doc in docs:
print(doc['headline']['main'])
print('\n')
HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.
GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3
Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955
MUSIC OF THE WEEK
Anacomp Inc. reports earnings for Qtr to March 31
Brooklyn Routs Yeshiva
Albuquerque Program Gives Drinkers a Lift
Front Page 7 -- No Title
UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.
Transforming JSON to Alternative Formats
We also have previously started to take a look at how to transform JSON to DataFrames. Investigating our schema, a good option for this could again be the 'docs' subheading. While this still has nested data itself, I recommend loading the entire section as a dataframe and then using functions to break apart nested data from there.
import pandas as pd
df = pd.DataFrame(data['response']['docs'])
df.head(3)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
_id | abstract | blog | byline | document_type | headline | keywords | multimedia | news_desk | print_page | pub_date | score | snippet | source | type_of_material | web_url | word_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc04eb745c1498b0d23da00 | Spent $22,200 | {} | NaN | article | {'main': 'HIGGINS, SPENT $22,189.53.; Governor... | [{'name': 'persons', 'value': 'HIGGINS, LT. GO... | [] | NaN | 2 | 1904-11-17T00:00:00Z | 1 | Spent $22,200 | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 213 |
1 | 4fc21ebf45c1498b0d612b22 | NaN | {} | NaN | article | {'main': 'GARDEN BOUTS CANCELED; Mauriello Say... | [] | [] | NaN | 15 | 1944-10-23T00:00:00Z | 1 | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 149 | |
2 | 4fc3b41d45c1498b0d7fd41e | NaN | {} | {'original': 'By JOHN G. FORREST', 'person': [... | article | {'main': 'Stock Drop Is Biggest in 2 Months--M... | [] | [] | NaN | F1 | 1955-05-15T00:00:00Z | 1 | Stock prices last week, on the lightest volume... | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 823 |
Breaking out nested data
Now that we have the data loaded, let's clean it up by breaking out some of the nested data. For example, you should notice that the headline entries are actualy dictionaries. We could transform these into singular data columns with something like this:
keys = df.headline.iloc[0].keys() #Get dictionary keys
#Keep track of columns we make for subsequent preview
new_cols = []
#Create a new feature for each of these keys
for key in keys:
new_col = 'headline_{}'.format(key) #Create new column name
df[new_col] = df.headline.map(lambda x: x[key]) #Create a new column
new_cols.append(new_col)
df[new_cols].head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
headline_main | headline_kicker | headline_content_kicker | headline_print_headline | headline_name | headline_seo | headline_sub | |
---|---|---|---|---|---|---|---|
0 | HIGGINS, SPENT $22,189.53.; Governor-Elect's E... | None | None | None | None | None | None |
1 | GARDEN BOUTS CANCELED; Mauriello Says He Could... | 1 | None | None | None | None | None |
2 | Stock Drop Is Biggest in 2 Months--Margin Rise... | 1 | None | None | None | None | None |
3 | MUSIC OF THE WEEK | None | None | None | None | None | None |
4 | Anacomp Inc. reports earnings for Qtr to March 31 | None | None | None | None | None | None |
Wahoo! This is a good general strategy for transforming nested JSON: create a DataFrame and then break out nested features into their own column features.
Outputing to JSON
Finally, let's look at how we can write data back to JSON. Like loading, we first open a file (this time with write permission) and use the json package to transfer data to that file container.
with open('output.json', 'w') as f:
json.dump(data, f)
Summary
There you have it! In this, we took another look at JSON, taking a look at an example schema diagram and retrieving information. We also looked at a general procedure for transforming nested data to Pandas DataFrames (create a DataFrame, and then break apart nested data using lambda functions to create additional columns). We also took a brief look at saving data to json files.