Coder Social home page Coder Social logo

nycplanning / ceqr-app-data-archive Goto Github PK

View Code? Open in Web Editor NEW
1.0 8.0 1.0 215 KB

(DEPRECATED)data pipelines for CEQR app, managed by data engineering

Home Page: https://github.com/NYCPlanning/ceqr-app-data

Python 84.40% Shell 4.43% TSQL 5.38% CSS 1.41% HTML 4.38%
data-pipelines data-engineering etl-pipeline

ceqr-app-data-archive's Introduction

CEQR App Data User Guides CEQR Runner

This repository includes ETL pipelines for all the datasets fed into CEQR (City Environmental Quality Review) app. It is managed by NYC Planning's data engineering team.

Building Instructions

  1. Set environmental variables: RECIPE_ENGINE, CEQR_DATA, BUILD_ENGINE,and EDM_DATA under the /ceqr directory. See .env.example.
  2. Run python3 -m venv base to set up the virtual environment.
  3. Run source base/bin/activate to activate the virtual environment. To deactivate once finished, type deactivate.
  4. Run pip3 install -e . to install packages required accross multiple data schema.
  5. Run ceqr run <schema_name> at root directory. For example, ceqr run ceqr_school_buildings, which allows you to build ceqr_school_buildings from scratch

Repo directory structure

  1. The ETL pipelines have been stored under the /ceqr/recipes directory as individual folders named by the datasets.
  2. Each dataset subfolder contains the following items.
    • build.py A python script that will transform and integrate source datas into a target table
    • config.json A configuration file specifying the input table names, output table name and DDL (output table schemas).
      • It is noted that the DDLs of "nysdec_state_facility_permits", "nysdec_title_v_facility_permits" and "sca_capacity_projects" actually reflect the schemas for their geo_rejects tables
    • README.md The metadata about the ETL pipeline
    • requirements The required dependencies need to install to run the python script
    • runner.sh A shell script, by executing which, you can build a dataset from scratch. or you can execute ceqr run <schema_name> at root directory
├── ceqr
│   ├── recipes
│   │   ├── <schema_name_1>
│   │   │   ├── build.py
│   │   │   ├── config.json
│   │   │   ├── README.md
│   │   │   ├── requirements.txt
│   │   │   └── runner.sh
│   │   ├── <schema_name_2>
│   │   │   ├── build.py
│   │   │   ├── config.json
│   │   │   ├── README.md
│   │   │   ├── requirements.txt
│   │   │   └── runner.sh
...

How to build a new dataset

  1. Create a new folder named by the dataset and put it under the /ceqr/recipes directory
  2. Create config.json, README.md, build.py, requirements.txt and runner.sh as described in the Repo directory structure within this new folder
    • For the output table schema, besides the requirements specified by the data users, it also need to follow the CEQR data schema standards.
  3. Follow the Building Instructions to test the ETL pipeline
  4. Output table can be found in EDM_DATA under a schema named by the dataset.

ceqr-app-data-archive's People

Contributors

baolingz avatar mgraber avatar sptkl avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

sptkl

ceqr-app-data-archive's Issues

doe_bluebook and doe_lcgms merge

https://gitlab.com/nycplanning/ceqr-app-data/issues/5
Currently, we provide schools data in two schemas on CEQR Data: doe_lcgms and sca_bluebook. This does not make intuitive sense on the app side, as these are just sources for the ultimate need of the app: a row representing one school.

This new schema would be named something like ceqr_school_buildings and look like:

CREATE TABLE ceqr_school_buildings."2017" (
    district integer,
    subdistrict integer,
    borocode integer,
    bldg_name character varying,
    excluded boolean,
    bldg_id character varying,
    org_id character varying,
    org_level character varying,
    name character varying,
    address character varying,
    ps_capacity integer,
    ps_enroll integer,
    is_capacity integer,
    is_enroll integer,
    hs_capacity integer,
    hs_enroll integer,
    source text, -- either 'bluebook' or 'lcgms'
    geom geometry(Point,4326)
);

This table would replace bluebook and lcgms tables in the app moving forward. The yearly data generation task would be to combine bluebook and lcgms to provide the most accurate list of schools open in the current school year, with their enrollments and capacities.

  • load both datasets into RECIPE_ENGINE
  • change column names for sensitive info
  • create build.py to merge the two

CEQR Air quality

file_name description
dot_traffic_cameras.zip DOT traffic cameras
nysdot_functional_class.zip NYS DOT Functional Class
nysdot_aadt.zip NYS Traffic Volume Data
nysdot_traffic_counts.zip (2015, 2016, 2017) General Highway Data
nysdec_air_monitoring_stations.zip Air monitoring stations
facilities_garages.zip Garages
tunnel_ventilation_tower_map.geojson Ventilation Outlets
area_of_concern.geojson Areas of Concern
atypical_roadways.geojson Atypical Roadway
dep_cats_permits.zip Manufacturing or processing facilities, boilers
nysdec_facility_permits.zip Title V State Permits and Facility State Permits

Apply Standard data schema on Air Datasets

Standard data schema: https://docs.google.com/spreadsheets/d/1Z41fgiU_mi1KltlS783kUZpPcC8Sn7hUxTqhaNxQFg8/edit?pli=1#gid=0

  • dep_cats_permits: update borough values to proper case, add geo_ to all fields return by geosupport, geometry-->geom, expirationdate::date, issuedate::date, add geo_address, (geo_latitude,geo_longitude)::double precision
  • nysdec_facility_permits: add geo_ to all fields return by geosupport, geometry-->geom, , issue_date::date, expire_date::date
  • dot_traffic_cameras: geometry-->geom
  • atypical_roadways: l_zip-->left_zipcode, r_zip-->right_zipcode
  • facilities_garages: add geo_ to all fields return by geosupport, geometry-->geom
  • nysdec_air_monitoring_stations: geometry-->geom, (Add geo_ in front of longitude, latitude if they are from geosupport), map county-->borough (remove outsiteNYC by setting region='2'), borocode::integer, (geo_latitude,geo_longitude, geo_x_coord, geo_y_coord)::double precision
  • nysdot_functional_class: geometry-->geom
  • nysdot_aadt: geometry-->geom
  • nysdot_traffic_counts: should we rename county_code to borocode and remove records outside NYC?

CEQR Schools: Capacity projects - take centroid of line string geometries

If we're going to use the Segment function of GeoSupport, take the centroid of the linestring that's generated so that we consistently have only points in the output dataset

If there's a more efficient method than just taking the centroid of the linestring feel free to implement it, so long as the outcome is the same

CEQR Schools: School buildings lists of schools to exclude

Confirmed lists

Specialized high schools

  • The Bronx High School of Science (X445)
  • The Brooklyn Latin School (K449)
  • Brooklyn Technical High School (K430)
  • High School for Mathematics, Science and Engineering at City College of New York ((M692)
  • High School of American Studies at Lehman College (X696)
  • Queens High School for the Sciences at York College (Q687)
  • Staten Island Technical High School (R605)
  • Stuyvesant High School (M475)

Citywide G&T schools below:

    1. N.E.S.T. (M539)
    1. The Anderson School (M334)
    1. Tag Young Scholars (M012)
    1. Brooklyn School of Inquiry (K686)
    1. The 30th Avenue School (Q300)

From here: https://insideschools.org/

CEQR: Schools SCA housing pipeline

Work with Capital Planning and HED to create data processing pipeline to produce the SCA housing pipeline that is used in CEQR

Output tables:

  • sca_housing_pipeline_by_boro
  • sca_housing_pipeline_by_sd

CEQR Air Quality: Ventilation Outlets

DOT data for ventilation outlets

  • Identify source data
    This refers to the tunnel ventilation towers for the 4 major vehicle tunnels in NYC (Midtown, Lincoln, Holland, Hugh L. Carey Tunnel aka Battery)
  • Create a spatial dataset of the vents on either side of the tunnel

ceqr_school_buildings clean up

The ETL pipeline is almost ready, except:

  • We still need the following four fields for doe_lcgms: org_level, ps_enrollment, ms_enrollment, hs_enrollment
  • The 2019 table has doubled its number of records compared to the 2018 and 2017 tables. Based on my observation, this issue might get solved after applying the following filter.
SELECT * FROM ceqr_school_buildings."2019"
WHERE source = 'bluebook'
AND (org_level = 'PK'
OR org_level = 'PS'
OR org_level = 'PSIS'
OR org_level = 'IS'
OR org_level = 'ISHS'
OR org_level = 'HS'
);

CEQR Air Permits

  • Geocode the three open datasets (dec_title_v_facility_permits, dec_state_facility_permits and dep_cats_permits)
  • merge dec_title_v_facility_permits and dec_state_facility_permits
  • deduplicate between the three datasets on facility_name and facility_location

Package and ship updated air permits data

For the latest ceqr updates, we added intersection geocoding logic which improves the geocoding result moderatively. Three datasets are affected by this update, and some new fields are added into their shapefiles and geo_rejects.csv:

  • dep_cats_permits:
    • address - address after programmetic cleaning,
    • streetname_1 - parsed cross street part 1,
    • streetname_2 - parsed cross street part 2,
    • geo_x_coord - x coordinate returned by Geosupport,
    • geo_y_coord - ycoordinate returned by Geosupport,
    • geo_function - the Geosupport function being applied
  • nysdec_state_facility_permits:
    • streetname_1 - parsed cross street part 1,
    • streetname_2 - parsed cross street part 2,
    • borough - backfilled borough via zipcode_borough look up table,
    • geo_x_coord - x coordinate returned by Geosupport,
    • geo_y_coord - ycoordinate returned by Geosupport,
    • geo_function - the Geosupport function being applied
  • nysdec_title_v_facility_permits:
    • streetname_1 - parsed cross street part 1,
    • streetname_2 - parsed cross street part 2,
    • borough - backfilled borough via zipcode_borough look up table,
    • geo_x_coord - x coordinate returned by Geosupport,
    • geo_y_coord - ycoordinate returned by Geosupport,
    • geo_function - the Geosupport function being applied

@aferrar the latest dataset can be found here, and please feel free to hide any above fields on your side and we plan to ship them to EARD by today. Let me know if you have any questions. Thanks!

Create sca_e_projections by boro

Aggregate HS data to boro level

  1. Create a new folder named sca_e_projections_by_boro under the ceqr_app_date/ceqr/recipes directory
  2. Create config.json, README.md, build.py, and requirements.txt
  3. use sca_e_projections.2019 as the input data
  4. target schema @bfreeds please confirm the schema for this CEQR table
CREATE TABLE sca_e_projections_by_boro."2019" (
    school_year integer,
    borocode integer,
    hs integer
);
  1. build its ETL pipeline using sca_e_projections as an example
    • convert the input table from a wide table to a long table using a pandas syntax called melt
    • calculate the hs projections by summing up projected = [9,10,11,12], don't use projected = '9-12 Total' from sca_e_projections.2019 which has incorrect values
    • write a python function or dictionary to map district to its matchingborocode
    • aggregate the data to boro level
  2. export the output to EDM_DATA
  3. push your code to another branch

CEQR Air: corrections to make after QAQC

dep_cats_permits

  • make values in geo_bbl NULL and not 0 ( or can we not do this because the field is a bigint?)

atypical_roadways

  • don't include records where FeatureTyp = 1

nysdot_functional_class

  • There is a segment showing up in Albany. Remove the segment from the dataset. It has urban_area_code 63217 and borough 'Queens,' which is probably causing the problem

Screen Shot 2019-10-29 at 12 36 48 PM

QA/QC reports

  • Generate report that reports the geocoding / mapping success rate for each CEQR dataset
  • Output the records that are part of the target subset that did not geocode for each dataset, including grc code

Purge the DEP CATS permtis

Purge the dep_cats_permits using the following logic

All Cancelled permits
All G-prefix permits
All C-prefix + Registration (including Registration, Boiler Registration II, and Registration Inspection)
All CA-Prefix + Work Permit + Expired permits

Mismatch in the number of DEP CATS permits

Please investigate the number of DEP CATS permits comparing to what EARD has purged with the following categories:

  • All Cancelled permits
  • All G-prefix permits
  • All C-prefix + Registration (including Registration, Boiler Registration II, and Registration Inspection)
  • All CA-Prefix + Work Permit + Expired permits

CATS Permits raw 010720.zip

Schools: Capacity projects update

Generate February version and deliver update to Labs

  • Replace capacity projects in process with the updated table from CP
  • Add TCU as a new input
  • Continue using the November tables for capacity projects from the 2015-2019 capital plan

CEQR Schools: sca_e_projections

  • Exclude PK (PS should only include K-5)
  • Use IS sub district percentages to calculate IS e_projections
  • Correct years (include 2018 in final output)
  • Reach a consensus on rounding

CEQR Air Quality: Atypical Roadway

What is an Atypical Roadway? A roadway that is either elevated or depressed compared to the surrounding environment. Not on the same level. If site is 200 feet from an Atypical Roadway then this needs to be considered in the analysis.

  • Determine best source for extracting Atypical Roadways (LION, Planimetrics, or...?)
  • Publish subset of data
  • Create line dataset (convert to roadbed?)

CEQR Air Quality: DOT data sources

Create data layer for each source. We'll need to investigate to see what data are available on Open Data or other sources.

CEQR Air Quality: Garages

Garages

  • Sanitation
  • School Buses
  • MTA bus garages
  • Uses Transportation trigger (> 85 spaces ?)

Tie to improvement of FacDB

CEQR source data - schools

@pichot @mshapiro1
Could you share with us the raw inputs for the following three datasets which are in the ceqr_data?

  • doe_lcgms
  • sca_bluebook
  • doe_school_subdistricts
  • sca_enrollment_pct_by_sd
  • sca_enrollment_projections_by_sd

Ideally, we should be able to load all these datasets from a csv or url into our recipe_data

Thank you!

Improve Geocoding / address cleanup for DEP air quality permits

  • 1. Where GRC code or GRC code 2 is 42 or 41, output subset of records and send to GRU, because addresses may be missing in PAD
  • 2. Add keywords such as Cross and Crs to geocoding intersection function
  • 3. Update JFK international airport so that it gets a hit from Geosupport
  • 4. Explore on-to-from formats and build out function in Geocoding suite
  • 5. Where reason code is EE data engineering will fix addresses if it's clear what it should be

If any component of the address is missing (house number (i.e. 13B), street name, borough) we cannot fix the address
Do not fix addresses where the error code indicates there is an extra hyphen (GRC 13 3)
Most likely cannot fix addresses where GRC = 13

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.