
Advanced Business Analytics – Final Project Description

Team Members

  • Ankita Guha
  • Kara Marsh

Project Title

Analyzing EPA (Environmental Protection Agency) data to determine patterns in pollution

Type of Final Project

Project Type I - Analysis of EPA data

Executive Summary of the Proposed Project

We have analyzed a large data set from the EPA. This data set contains information on the location, time, type, substance, and results of individual sample testing. We cleaned the data and performed exploratory data analysis in R to find general trends within subsets of the data. Based on this exploratory analysis, we selected subsets of the data on which to perform linear and non-linear analysis, looking for models and relationships that may be predictive of future samples. Because the data set covers sampling from 1987 to 2017, we tested our models using data held out within the set.
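A minimal sketch of the kind of train/test split and linear fit described above, assuming a cleaned data frame named epa_data with the columns listed under Data Needs and Sources; the object name and model formula are illustrative, not the project's exact models:

```r
set.seed(42)

# Split the 1987-2017 records into training and test sets.
train_idx <- sample(nrow(epa_data), size = floor(0.7 * nrow(epa_data)))
train <- epa_data[train_idx, ]
test  <- epa_data[-train_idx, ]

# Illustrative linear model of the annual mean against year and location.
fit  <- lm(arithmetic_mean ~ year + state_name, data = train)
pred <- predict(fit, newdata = test)

# Compare predictions to the held-out values (RMSE).
sqrt(mean((pred - test$arithmetic_mean)^2))
```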

Data Needs and Sources

Important Variables Used & Brief Description

  • latitude: The monitoring site’s angular distance north of the equator, measured in decimal degrees.
  • longitude: The monitoring site’s angular distance east of the prime meridian, measured in decimal degrees.
  • parameter_name: The air constituent measured, which may be a pollutant or a non-pollutant.
  • metric_used: The total time for which the parameter_name was measured.
  • method_name: A short description of the processes, equipment, and protocols used in gathering and measuring the sample.
  • year: The year the annual summary data represents.
  • arithmetic_mean: The average (arithmetic mean) value for the year.
  • arithmetic_standard_dev: The standard deviation about the mean of the values for the year.
  • address: The approximate street address of the monitoring site.
  • state_name: The name of the state where the monitoring site is located.
  • county_name: The name of the county where the monitoring site is located.
  • norm_mean: Calculated column of normalized arithmetic_mean values (see the sketch after this list).
  • first_max_value: The highest value for the year.
  • observation_count: The number of observations (samples) taken during the year.
  • observation_percent: The percent representing the number of observations taken with respect to the number scheduled to be taken during the year. This is only calculated for monitors where measurements are required (e.g., only certain parameters).
  • valid_day_count: The number of days during the year when the daily monitoring criteria were met.
  • required_day_count: The number of days during the year on which the monitor was scheduled to take samples, if measurements are required.
  • date_of_last_change: The date on which any numeric values in this record were last updated.
  • ninety_five_percentile: The value from this monitor at or below which 95 percent of the measured values for the year fall.
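A minimal sketch of how a normalized mean column such as norm_mean might be computed; the exact scaling used in the project may differ, and this shows min-max normalization within each parameter as one possibility (the data frame name epa_data is illustrative):

```r
library(dplyr)

# Hypothetical min-max normalization of the annual mean within each parameter.
epa_data <- epa_data %>%
  group_by(parameter_name) %>%
  mutate(
    norm_mean = (arithmetic_mean - min(arithmetic_mean, na.rm = TRUE)) /
      (max(arithmetic_mean, na.rm = TRUE) - min(arithmetic_mean, na.rm = TRUE))
  ) %>%
  ungroup()
```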

Challenges Encountered

  1. Large data set: importing the tables as data frames was slow.
  2. Real-life, messy data: we had to decide how to treat NA values and whether they could be ignored, and which type of result information to consider in our analysis.
  3. The data was not as continuous as it originally appeared to be. We had to consider the best way to filter the data and make executive decisions about which data sets could be combined without reducing the integrity of the results.
  4. Records with a blank method_name have a great deal of influence over the dataset.
  5. Ozone, the chief pollutant, was found to be captured mostly under the blank method_name, which we initially thought we could ignore.
  6. Normalizing one subset of the data frame not only improved the model fit but also produced an exact match between the train and test predictions for some of the non-linear regression models.
  7. While fitting the logistic regression model, we found that ozone, identified as the main pollutant in the earlier linear regression models, was captured mainly under the blank method_name, i.e., an unknown test, in the original Kaggle dataset. With the blank method_name excluded, ozone fell outside the scope of our data. We had a tough time figuring out how to replace the blank string values with NA and then replace those NAs with 'Missing Data From Kaggle', because in the initial data cleaning and preparation stage these blank values were not captured as NA, and we were on the verge of losing a significant chunk of pollutant data captured by this unnamed test (see the sketch after this list).
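A minimal sketch of the blank-string handling described in item 7, assuming the cleaned data frame is called epa_data (the object name is illustrative):

```r
library(dplyr)

epa_data <- epa_data %>%
  mutate(
    # Treat empty or whitespace-only method_name strings as NA ...
    method_name = na_if(trimws(method_name), ""),
    # ... then label the NAs explicitly so these rows are not dropped later.
    method_name = ifelse(is.na(method_name), "Missing Data From Kaggle", method_name)
  )
```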

Personal Learning Objectives

  1. What did we learn?
  • How to handle millions of rows of data
  • Data cleaning and data manipulation for ease of analysis
  • Treat blank values as character strings when they do not show up as NA
  • The plyr library should be loaded before dplyr, otherwise several dplyr functions are masked
  2. What else did we learn?
  • A seemingly perfect model fit was mainly due to using predictors that effectively contain parts of the response variable.
  • How to plot maps from latitude and longitude data
  • In ggplot2, each color palette has a limit on the number of levels it can display. To get every level represented in the palette, we had to use a function that first counts the levels and then assigns colors to them accordingly (see the first sketch after this list).
  • How to use for loops and if-else statements inside an RMD file.
  3. What third thing did we learn?
  • To make the map more or less detailed, one can increase or decrease the zoom argument to achieve the desired view. We chose a zoom level that provides a suitable aerial view of all four states (or counties) we are examining (see the second sketch after this list).
  • Another interesting point we learned while using ggmap() is that the query used to fetch map data may stop working once a quota for the Google API is reached. We came across an error like: geocode failed with status OVER_QUERY_LIMIT, location = "michigan", which means the code had been run many times and the IP address had hit its limit for fetching API data from Google. Source: https://stackoverflow.com/questions/tagged/google-geocoding-api?page=4&sort=unanswered
  • Got comfortable with GitHub.
  • Got first-hand exposure to Slack for collaborating more flexibly with team members.
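A minimal sketch of the palette-counting trick mentioned above, assuming a discrete grouping column such as parameter_name (the data frame name epa_data and the plotted columns are illustrative):

```r
library(ggplot2)
library(RColorBrewer)

# Count the distinct levels first, then build a palette of exactly that size
# by interpolating a Brewer palette (Set1 only has 9 colors on its own).
n_levels  <- length(unique(epa_data$parameter_name))
my_colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_levels)

ggplot(epa_data, aes(x = year, y = arithmetic_mean, color = parameter_name)) +
  geom_point() +
  scale_color_manual(values = my_colors)
```

And a sketch of the zoom argument in ggmap; note that recent ggmap versions require a Google Maps API key registered via register_google():

```r
library(ggmap)

# register_google(key = "YOUR_API_KEY")  # required by newer ggmap versions

# Smaller zoom values show a wider area; larger values show more detail.
mi_map <- get_map(location = "michigan", zoom = 6, maptype = "terrain")
ggmap(mi_map)
```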

Explanation of the Project stages and files

Necessary packages to install (an installation sketch follows the list):

  • boot
  • coefplot
  • dplyr
  • e1071
  • ggmap
  • ggplot2
  • gpclib
  • mapdata
  • maps
  • maptools
  • plyr
  • RColorBrewer
  • reshape2
  • scales
  • sp
  • stringr
  • VIM
  • viridis
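A minimal sketch for installing the packages listed above; note that the availability of some older packages (e.g., gpclib) on CRAN may vary:

```r
pkgs <- c(
  "boot", "coefplot", "dplyr", "e1071", "ggmap", "ggplot2", "gpclib",
  "mapdata", "maps", "maptools", "plyr", "RColorBrewer", "reshape2",
  "scales", "sp", "stringr", "VIM", "viridis"
)

# Install only the packages that are not already present.
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```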

Data Cleaning

Associated RMD files:

  • Data Cleaning and Data Preparation Phase 1
  • Data Cleaning and Data Preparation Phase 2

Exploratory Data Analysis

Associated RMD files:

  • EDA1
  • EDA2
  • TriCountyEDA

Data Modeling

Associated RMD files:

  • NonLinearModelsFor1987_2017_4States
  • PredictiveRegressionModelingForMI&US

Direction to be Followed For Running the RMD Files

Due to the enormous data size, it is necessary to download the data and run the files in the order explicitly given in the steps below:

  • Step 1: Download the data file from Kaggle (https://www.kaggle.com/epa/air-quality/data)
  • Step 2: Run the RMD file named Data Cleaning and Data Preparation Phase 1
  • Step 3: Run the RMD file named Data Cleaning and Data Preparation Phase 2
  • Step 4: Run the RMD files EDA1, EDA2, and TriCountyEDA, in that order
  • Step 5: Run NonLinearModelsFor1987_2017_4States
  • Step 6: Run PredictiveRegressionModelingForMI&US

Once the original data file is downloaded from Kaggle, the individual CSV data files are created as the RMD files above are run step by step. These CSV files are then used in the subsequent analyses carried out in the later RMD files.
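The RMD files can be knit from RStudio in the order above, or rendered from the console; a minimal sketch, assuming rmarkdown is installed and the file names below match the actual .Rmd files in the repository (they are illustrative here):

```r
library(rmarkdown)

# Render the files in the order described in the steps above.
files <- c(
  "Data Cleaning and Data Preparation Phase 1.Rmd",
  "Data Cleaning and Data Preparation Phase 2.Rmd",
  "EDA1.Rmd", "EDA2.Rmd", "TriCountyEDA.Rmd",
  "NonLinearModelsFor1987_2017_4States.Rmd",
  "PredictiveRegressionModelingForMI&US.Rmd"
)
for (f in files) render(f)
```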

OR

Alternatively, you can obtain the data files from the link here:

Findings & Conclusion From Our Analysis

  • Ozone is one of the top pollutant contributors in the US.
  • Some of the most polluted states are Alaska, Alabama, and California.
  • USA Today also appears to corroborate these two findings: "California has eight of 10 most polluted U.S. cities" (https://usat.ly/2H7ihty)

Future Project Scope

There is considerable future work that could be performed on this dataset, including:

  • Forecasting the pollutant data across the timeline.
  • Performing time-series prediction of the pollutant data in this dataset (a sketch follows this list).
  • Taking into account outside influences, such as the number of cars and industrial belts, that could have affected the measured values.
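As an illustration of the forecasting idea only, a minimal sketch using base R's arima() on hypothetical yearly ozone means; the data frame name epa_data, the column names, and the ARIMA order are assumptions, not results from this project:

```r
# Hypothetical yearly mean ozone series from 1987 to 2017.
ozone  <- subset(epa_data, parameter_name == "Ozone")
yearly <- aggregate(arithmetic_mean ~ year, data = ozone, FUN = mean)

# Build an annual time series and fit a simple ARIMA model.
ozone_ts <- ts(yearly$arithmetic_mean, start = min(yearly$year), frequency = 1)
fit <- arima(ozone_ts, order = c(1, 1, 0))

# Forecast the next five years.
predict(fit, n.ahead = 5)
```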
