
Advanced Business Analytics – Final Project Description

Team Members

  • Ankita Guha
  • Kara Marsh

Project Title

Analyzing EPA (Environmental Protection Agency) data to determine patterns in pollution

Type of Final Project

Project Type I - Analysis of EPA data

Executive Summary of the Proposed Project

We have analyzed a large data set from the EPA. This data set contains information on the location, time, type, substance, and results of individual sample testing. We cleaned the data and performed exploratory data analysis in R to find general trends within subsets of the data. Based on this exploratory analysis, we selected subsets of the data on which to perform linear and non-linear analysis, looking for models and relationships that may be predictive of future samples. Because the data set covers sampling from 1987 to 2017, we tested our models using data held out within the set.
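A minimal sketch of the kind of train/test split and linear fit described above, assuming a cleaned data frame named epa_data with the columns listed under Data Needs and Sources; the object name and model formula are illustrative, not the project's exact models:

```r
set.seed(42)

# Split the 1987-2017 records into training and test sets.
train_idx <- sample(nrow(epa_data), size = floor(0.7 * nrow(epa_data)))
train <- epa_data[train_idx, ]
test  <- epa_data[-train_idx, ]

# Illustrative linear model of the annual mean against year and location.
fit  <- lm(arithmetic_mean ~ year + state_name, data = train)
pred <- predict(fit, newdata = test)

# Compare predictions to the held-out values (RMSE).
sqrt(mean((pred - test$arithmetic_mean)^2))
```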

Data Needs and Sources

Important Variables Used & Brief Description

  • latitude: The monitoring site’s angular distance north of the equator, measured in decimal degrees.
  • longitude: The monitoring site’s angular distance east of the prime meridian, measured in decimal degrees.
  • parameter_name: The air constituent measured, which may be a pollutant or a non-pollutant.
  • metric_used: The total time for which the parameter_name was measured.
  • method_name: A short description of the processes, equipment, and protocols used in gathering and measuring the sample.
  • year: The year the annual summary data represents.
  • arithmetic_mean: The average (arithmetic mean) value for the year.
  • arithmetic_standard_dev: The standard deviation about the mean of the values for the year.
  • address: The approximate street address of the monitoring site.
  • state_name: The name of the state where the monitoring site is located.
  • county_name: The name of the county where the monitoring site is located.
  • norm_mean: Calculated column of normalized arithmetic_mean values (see the sketch after this list).
  • first_max_value: The highest value for the year.
  • observation_count: The number of observations (samples) taken during the year.
  • observation_percent: The percent representing the number of observations taken with respect to the number scheduled to be taken during the year. This is only calculated for monitors where measurements are required (e.g., only certain parameters).
  • valid_day_count: The number of days during the year when the daily monitoring criteria were met.
  • required_day_count: The number of days during the year on which the monitor was scheduled to take samples, if measurements are required.
  • date_of_last_change: The date on which any numeric values in this record were last updated.
  • ninety_five_percentile: The value from this monitor at or below which 95 percent of the measured values for the year fall.
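A minimal sketch of how a normalized mean column such as norm_mean might be computed; the exact scaling used in the project may differ, and this shows min-max normalization within each parameter as one possibility (the data frame name epa_data is illustrative):

```r
library(dplyr)

# Hypothetical min-max normalization of the annual mean within each parameter.
epa_data <- epa_data %>%
  group_by(parameter_name) %>%
  mutate(
    norm_mean = (arithmetic_mean - min(arithmetic_mean, na.rm = TRUE)) /
      (max(arithmetic_mean, na.rm = TRUE) - min(arithmetic_mean, na.rm = TRUE))
  ) %>%
  ungroup()
```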

Challenges Encountered

  1. Large data set: importing the tables as data frames was slow.
  2. Real-life, messy data: we had to decide how to treat NA values and whether they could be ignored, and which type of result information to consider in our analysis.
  3. The data was not as continuous as it originally appeared to be. We had to consider the best way to filter the data and make executive decisions about which data sets could be combined without reducing the integrity of the results.
  4. Records with a blank method_name have a great deal of influence over the dataset.
  5. Ozone, the chief pollutant, was found to be captured mostly under the blank method_name, which we initially thought we could ignore.
  6. Normalizing one subset of the data frame not only improved the model fit but also produced an exact match between the train and test predictions for some of the non-linear regression models.
  7. While fitting the logistic regression model, we found that ozone, identified as the main pollutant in the earlier linear regression models, was captured mainly under the blank method_name, i.e., an unknown test, in the original Kaggle dataset. With the blank method_name excluded, ozone fell outside the scope of our data. We had a tough time figuring out how to replace the blank string values with NA and then replace those NAs with 'Missing Data From Kaggle', because in the initial data cleaning and preparation stage these blank values were not captured as NA, and we were on the verge of losing a significant chunk of pollutant data captured by this unnamed test (see the sketch after this list).
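A minimal sketch of the blank-string handling described in item 7, assuming the cleaned data frame is called epa_data (the object name is illustrative):

```r
library(dplyr)

epa_data <- epa_data %>%
  mutate(
    # Treat empty or whitespace-only method_name strings as NA ...
    method_name = na_if(trimws(method_name), ""),
    # ... then label the NAs explicitly so these rows are not dropped later.
    method_name = ifelse(is.na(method_name), "Missing Data From Kaggle", method_name)
  )
```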

Personal Learning Objectives

  1. What did we learn?
  • How to handle millions of rows of data
  • Data cleaning and data manipulation for ease of analysis
  • Treat blank values as character strings when they do not show up as NA
  • The plyr library should be loaded before dplyr, otherwise several dplyr functions are masked
  2. What else did we learn?
  • A seemingly perfect model fit was mainly due to using predictors that effectively contain parts of the response variable.
  • How to plot maps from latitude and longitude data
  • In ggplot2, each color palette has a limit on the number of levels it can display. To get every level represented in the palette, we had to use a function that first counts the levels and then assigns colors to them accordingly (see the first sketch after this list).
  • How to use for loops and if-else statements inside an RMD file.
  3. What third thing did we learn?
  • To make the map more or less detailed, one can increase or decrease the zoom argument to achieve the desired view. We chose a zoom level that provides a suitable aerial view of all four states (or counties) we are examining (see the second sketch after this list).
  • Another interesting point we learned while using ggmap() is that the query used to fetch map data may stop working once a quota for the Google API is reached. We came across an error like: geocode failed with status OVER_QUERY_LIMIT, location = "michigan", which means the code had been run many times and the IP address had hit its limit for fetching API data from Google. Source: https://stackoverflow.com/questions/tagged/google-geocoding-api?page=4&sort=unanswered
  • Got comfortable with GitHub.
  • Got first-hand exposure to Slack for collaborating more flexibly with team members.
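A minimal sketch of the palette-counting trick mentioned above, assuming a discrete grouping column such as parameter_name (the data frame name epa_data and the plotted columns are illustrative):

```r
library(ggplot2)
library(RColorBrewer)

# Count the distinct levels first, then build a palette of exactly that size
# by interpolating a Brewer palette (Set1 only has 9 colors on its own).
n_levels  <- length(unique(epa_data$parameter_name))
my_colors <- colorRampPalette(brewer.pal(9, "Set1"))(n_levels)

ggplot(epa_data, aes(x = year, y = arithmetic_mean, color = parameter_name)) +
  geom_point() +
  scale_color_manual(values = my_colors)
```

And a sketch of the zoom argument in ggmap; note that recent ggmap versions require a Google Maps API key registered via register_google():

```r
library(ggmap)

# register_google(key = "YOUR_API_KEY")  # required by newer ggmap versions

# Smaller zoom values show a wider area; larger values show more detail.
mi_map <- get_map(location = "michigan", zoom = 6, maptype = "terrain")
ggmap(mi_map)
```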

Explanation of the Project stages and files

Necessary packages to install (an installation sketch follows the list):

  • boot
  • coefplot
  • dplyr
  • e1071
  • ggmap
  • ggplot2
  • gpclib
  • mapdata
  • maps
  • maptools
  • plyr
  • RColorBrewer
  • reshape2
  • scales
  • sp
  • stringr
  • VIM
  • viridis
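A minimal sketch for installing the packages listed above; note that the availability of some older packages (e.g., gpclib) on CRAN may vary:

```r
pkgs <- c(
  "boot", "coefplot", "dplyr", "e1071", "ggmap", "ggplot2", "gpclib",
  "mapdata", "maps", "maptools", "plyr", "RColorBrewer", "reshape2",
  "scales", "sp", "stringr", "VIM", "viridis"
)

# Install only the packages that are not already present.
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```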

Data Cleaning

Associated RMD files:

  • Data Cleaning and Data Preparation Phase 1
  • Data Cleaning and Data Preparation Phase 2

Exploratory Data Analysis

Associated RMD files:

  • EDA1
  • EDA2
  • TriCountyEDA

Data Modeling

Associated RMD files:

  • NonLinearModelsFor1987_2017_4States
  • PredictiveRegressionModelingForMI&US

Direction to be Followed For Running the RMD Files

Due to the enormous data size, it is necessary to download the data and run the files in the order explicitly given in the steps below:

  • Step 1: Download the data file from Kaggle (https://www.kaggle.com/epa/air-quality/data)
  • Step 2: Run the RMD file named Data Cleaning and Data Preparation Phase 1
  • Step 3: Run the RMD file named Data Cleaning and Data Preparation Phase 2
  • Step 4: Run the RMD files EDA1, EDA2, and TriCountyEDA, in that order
  • Step 5: Run NonLinearModelsFor1987_2017_4States
  • Step 6: Run PredictiveRegressionModelingForMI&US

Once the original data file is downloaded from Kaggle, the individual CSV data files are created as the RMD files above are run step by step. These CSV files are then used in the subsequent analyses carried out in the later RMD files.
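The RMD files can be knit from RStudio in the order above, or rendered from the console; a minimal sketch, assuming rmarkdown is installed and the file names below match the actual .Rmd files in the repository (they are illustrative here):

```r
library(rmarkdown)

# Render the files in the order described in the steps above.
files <- c(
  "Data Cleaning and Data Preparation Phase 1.Rmd",
  "Data Cleaning and Data Preparation Phase 2.Rmd",
  "EDA1.Rmd", "EDA2.Rmd", "TriCountyEDA.Rmd",
  "NonLinearModelsFor1987_2017_4States.Rmd",
  "PredictiveRegressionModelingForMI&US.Rmd"
)
for (f in files) render(f)
```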

OR

Alternatively, you can obtain the data files from the link here:

Findings & Conclusion From Our Analysis

  • Ozone is one of the top pollutant contributors in the US.
  • Some of the most polluted states are Alaska, Alabama, and California.
  • USA Today also appears to corroborate these two findings: "California has eight of 10 most polluted U.S. cities" (https://usat.ly/2H7ihty)

Future Project Scope

There is considerable future work that could be performed on this dataset, including:

  • Forecasting the pollutant data across the timeline.
  • Performing time-series prediction of the pollutant data in this dataset (a sketch follows this list).
  • Taking into account outside influences, such as the number of cars and industrial belts, that could have affected the measured values.
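As an illustration of the forecasting idea only, a minimal sketch using base R's arima() on hypothetical yearly ozone means; the data frame name epa_data, the column names, and the ARIMA order are assumptions, not results from this project:

```r
# Hypothetical yearly mean ozone series from 1987 to 2017.
ozone  <- subset(epa_data, parameter_name == "Ozone")
yearly <- aggregate(arithmetic_mean ~ year, data = ozone, FUN = mean)

# Build an annual time series and fit a simple ARIMA model.
ozone_ts <- ts(yearly$arithmetic_mean, start = min(yearly$year), frequency = 1)
fit <- arima(ozone_ts, order = c(1, 1, 0))

# Forecast the next five years.
predict(fit, n.ahead = 5)
```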
