
satcherinstitute / health-equity-tracker


Health Equity Tracker is a free-to-use data visualization platform that is enabling new insights into the impact of COVID-19 and other social and political determinants of health on historically underrepresented groups in the United States.

Home Page: https://healthequitytracker.org/

License: MIT License

Languages: Dockerfile 0.20%, Python 35.69%, HCL 1.08%, Shell 1.02%, Makefile 0.03%, HTML 0.12%, TypeScript 61.39%, JavaScript 0.27%, CSS 0.21%
Topics: covid-19, covid19-tracker, covid19-data, covid-data, health-data, racial-disparities, equity

health-equity-tracker's People

Contributors

adams314, alinix1, benhammondmusic, colespen, dependabot[bot], ebonyrespress, efregoso, eriwarr, jdemlow, jenniebrown, joshzarrabi, juneezee, kccrtv, kkatzen, mayaspivak, sethvg, slagathorr, vanshkumar


health-equity-tracker's Issues

Cloud run development issues

A couple of things I noticed while getting started with Cloud Run for the ingestion pipeline:

  • logging.info() doesn't show up in the Cloud Run logs, but logging.warning() does.
  • Failures trigger seemingly infinite retries that are very spammy. Also, when I ran the household income scheduler, it succeeded but still caused lots of retries that the other data sources didn't. I wonder if this is because it took too long to ack (around 40 seconds, much longer than the other ones I tested).

Not sure, but one or both of these may be addressed if we use Airflow.
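
For the logging point, this is probably just Python's default root-logger level (WARNING) rather than anything Cloud Run-specific, so logging.info() is dropped before it ever reaches the logs. A minimal sketch of a likely fix, assuming the google-cloud-logging client library is installed:

  import logging

  import google.cloud.logging

  # Attach the Cloud Logging handler and lower the threshold so that
  # INFO-level records are forwarded with the proper severity.
  client = google.cloud.logging.Client()
  client.setup_logging(log_level=logging.INFO)

  logging.info("This should now show up in the Cloud Run logs.")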

Frontend error handling and logging

Audit frontend code for error paths and ensure errors are A) caught, B) handled gracefully by the UI, C) reported to the server, and D) tracked somewhere on the server along with server errors.
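
For (C) and (D), a rough sketch of the server side, assuming a Flask-style data server (the endpoint name and payload fields are hypothetical):

  import logging

  from flask import Flask, request

  app = Flask(__name__)

  # Hypothetical endpoint that records frontend errors alongside
  # server errors, so both end up in the same logs.
  @app.route("/api/client-error", methods=["POST"])
  def log_client_error():
      report = request.get_json(force=True) or {}
      logging.error("Client error: %s (url=%s)",
                    report.get("message"), report.get("url"))
      return "", 204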

Figure out strategy for error logging/monitoring/alerting

Very open-ended task, but includes:

  • How are errors monitored/alerted on?
  • Standardize severities, eg what constitutes failing a task vs just logging a warning
  • Once we do this, audit code and make sure errors are properly handled (I'm guessing for early development we're going to be lax about this)

Airflow might help with this, though the web app will also need to do similar things.
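
One possible severity convention, sketched with a hypothetical helper: recoverable problems get a warning, everything else fails the task so retries and alerting kick in.

  import logging

  # Hypothetical helper encoding one possible severity policy.
  def handle_ingestion_error(err: Exception, recoverable: bool = False) -> None:
      if recoverable:
          # Degraded but acceptable: record it and keep going.
          logging.warning("Recoverable ingestion problem: %s", err)
      else:
          # Anything else should fail the task and surface in alerting.
          raise err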

Create Airflow local development infrastructure

To create an environment for running our ingestion pipeline locally on a developer's machine, we need the following:

  • a docker-compose YAML file to configure a multi-container environment
  • a Makefile for simplified setup and teardown

Clean up duplicated data source code

Because we initially had no shared-code setup between the different Cloud Run services, some code is duplicated across data sources. E.g. column ids are duplicated between census.py and census_to_bq.py, state names between primary_care_access.py and primary_care_access_to_bq.py, etc. Moving source metadata to a common location may also help with serving it from the data server.
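
A minimal sketch of the shared-module idea (module, constant, and entry names are hypothetical):

  # python/ingestion/source_metadata.py (hypothetical shared module)
  # Column ids and state names live in one place and are imported by
  # both halves of each data source (census.py, census_to_bq.py, etc).

  CENSUS_COLUMN_IDS = {
      # Illustrative entry only; the real ids live in census.py today.
      "total_population": "B01003_001E",
  }

  STATE_NAMES = ["Alabama", "Alaska"]  # ...and so on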

Split up python/ingestion package into different packages

This would be good both for code organization and for reducing cloud run image sizes. For example, we could have packages for

  • data-source logic, eg schemas, metadata, API calls that can be shared across services
  • cloud-run message-handling utilities
  • data-processing with pandas
  • downloading to gcs
  • uploading to BQ

Figure out python build setup

Goals of the build setup:

  • [P1] easily share code across Cloud Run services
  • [P1] clear, reliable, reproducible, and up-to-date dependency management (eg avoid breaking when a dependency is updated, keep deps up to date, and make it difficult to deploy code that's missing a dependency or using an incompatible one)
  • [P2] avoid pulling in unnecessary dependencies (I think this is less relevant unless somehow our codebase/dependency tree becomes very large, but it could help with reliability)
  • [P2] standardize dependency management for local setup and testing without polluting prod environments

Current setup at the time of writing:

  • All shared Python code lives in the /python directory at the root of the repository, which includes a setup.py file declaring /python as a package. Shared code is placed in sub-packages with an __init__.py file.
  • Each Cloud Run service has a Dockerfile that copies the /python directory into the container image and installs it with pip install.
  • Each Cloud Run service has its own requirements.in file that is used to generate a requirements.txt file. It is up to each service to ensure that if any shared code in /python requires dependency X, the service's requirements.in specifies dependency X and pip-compile is rerun to regenerate the correct requirements.txt.

Pros: easy to share code, and requirements are version-pinned (this helps standardize environments and avoid weird breakages).
Cons: all code from /python is installed in all Cloud Run services, and deps required by /python/some_package must be added to the requirements.in file of every service that uses some_package.

I think ideally we should move to a model where:

  • each sub-package of /python specifies its own dependencies either via a requirements.in file or setup.py file
  • each Cloud Run service specifies a requirements.in file that references the requirements.in files for the shared packages being used.
  • we have some form of automation that ensures pinned versions are updated regularly (no idea how to do this, maybe something like GitHub Dependabot?)
  • some presubmit check that ensures that when new dependencies are added to a package, all services that rely on it have their requirements.txt files updated

This model may or may not require pip-compile-multi; it needs some investigation.
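
A sketch of the first bullet above, using setuptools (see the open question below) with hypothetical package names:

  # python/datasources/setup.py (hypothetical sub-package)
  # Each shared package declares its own dependencies, so services
  # only pull in what they actually use.
  from setuptools import find_packages, setup

  setup(
      name="ingestion-datasources",
      packages=find_packages(),
      install_requires=[
          "pandas",
          "requests",
      ],
  )

For the second bullet, pip-tools already supports nested includes, so a service's requirements.in can reference a shared package's requirements with a line like -r ../../python/datasources/requirements.in (path hypothetical).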

One way to solve requirements.txt files going stale is to run pip-compile from within each service's Dockerfile. That would keep things up to date, but it's probably a bad idea: it makes dependency versions opaque, and every deploy would silently update all versions, which could lead to reliability issues.

Another open question: should we use setuptools instead of distutils.core.setup? (distutils is deprecated, so setuptools is probably the answer.)

Verify dev vs prod dependency setup

Verify that when installing dev dependencies, create-react-app doesn't include them in production builds. It appears that it doesn't (see PR discussion), but it would be good to double-check.

We might also consider having a separate devDependencies section and installing dev dependencies with the --save-dev option. Not sure how create-react-app handles dev dependencies without this explicit split, but the explicit split seems like a good structure IMO.

Add "--disallow-untyped-defs" option to mypy

This will enforce that all functions have type annotations. As a prerequisite, someone will need to go through the codebase and add types to everything. We should do this sooner rather than later, so adding types is less burdensome and we get the type-checking benefits before the codebase becomes large.
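
For reference, with this flag mypy reports "Function is missing a type annotation" for the first definition below and accepts the second:

  def count_rows(table):
      return len(table)

  def count_rows_typed(table: list) -> int:
      return len(table)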

Add data serving service to terraform config

May need to create a new config if the one from the prototype repo is not yet copied over.

  • This should include the identity to run the service as and a custom role for the identity.
  • Need to figure out auth for invoking this service (not necessarily in scope for this issue)

Future improvements for ACS data ingestion

  • Figure out how to handle updating the URL. It's currently hard-coded to 2018, the most recent 5-year ACS dataset (see the sketch after this list).
  • We should register a developer key, which grants a much higher quota. This may not matter since we'd likely stay under the quota anyway, but it doesn't hurt to register one.
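
For the first point, one option is to template the year; the URL pattern below is an assumption, not necessarily what's in the code today:

  # Sketch: parameterize the ACS year instead of hard-coding 2018.
  ACS_BASE_URL = "https://api.census.gov/data/{year}/acs/acs5"

  def acs_url(year: int = 2018) -> str:
      return ACS_BASE_URL.format(year=year)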

Done:

  • Confirm which census dataset to use for population breakdowns: currently using the 5-year ACS dataset but there's also a 1-year and 3-year dataset and I think you can also get population breakdowns from the actual 10-year census.
  • Consider migrating to another source. BigQuery public datasets include census data, which is convenient since it's already in BigQuery, though the documentation on the schemas is pretty poor from what I've seen. Another option is DataCommons, which has a pretty powerful/expressive Python API.

Design for handling data sources changing and avoiding duplicate/redundant data

Currently the code ingests the same data source even if it hasn't changed, which creates duplicate data in BigQuery. This isn't a big deal for now because we record an ingestion_time timestamp, so it's easy to query only the rows where ingestion_time = MAX(ingestion_time) and skip duplicates/old data. The upside is that it keeps ingestion very simple. The downside is that it complicates the data and incurs unnecessary storage and query costs (time and money).

We should figure out whether to change this strategy. Some things that would need to be thought about if we do:

  • how do we detect if the data has changed to prevent ingesting duplicates?
  • how do we preserve historical data?

I think avoiding ingesting redundant data in the first place could get complicated as it may vary by data source. One idea I had is to keep the ingestion as is, and then periodically materialize new tables with just the latest data. Then the old data can be cleaned up with a periodic task or automatically with a TTL.

Eg:
ingest data to GCS bucket => upload from bucket to BigQuery => BigQuery_historical_table => BigQuery_current_data
BigQuery_historical_table would have duplicate/redundant data and periodically get cleaned up, but BigQuery_current_data would always be the most recent.
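
A sketch of the materialization step, assuming the google-cloud-bigquery client (dataset and table names are placeholders):

  from google.cloud import bigquery

  client = bigquery.Client()

  # Rebuild the "current" table from only the latest ingestion run.
  query = """
      CREATE OR REPLACE TABLE `my_dataset.BigQuery_current_data` AS
      SELECT *
      FROM `my_dataset.BigQuery_historical_table`
      WHERE ingestion_time = (
          SELECT MAX(ingestion_time)
          FROM `my_dataset.BigQuery_historical_table`
      )
  """
  client.query(query).result()  # block until the job finishes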

Testing coverage

We likely won't have great coverage early on due to prototyping and quick iteration; we should go back and make sure we have coverage for the key pieces of code.

It might also be worth having some more integration-style tests that exercise end-to-end flows.

Clean up error handling in ingestion pipeline

Right now we have lots of try/except blocks that just catch an error and log it. For errors that should fail the whole process (which is most of them), we should let the error propagate up and get logged in one appropriate place. Most errors carry enough context that an additional message isn't necessary, but when it is, that can be handled by re-raising with exception chaining:

except SomeError as e:
    raise RuntimeError("More specific message") from e

Determine supported browser versions and whether we need to set up additional polyfills for older browsers

create-react-app documentation is a bit confusing on this point. It says

The browserslist configuration controls the outputted JavaScript so that the emitted code will be compatible with the browsers specified

but it also says

Note that this does not include polyfills automatically for you. You will still need to polyfill language features (see above) as needed based on the browsers you are supporting.

... which sounds contradictory at first, but probably isn't: browserslist controls which syntax gets transpiled, while missing runtime features (e.g. Promise or fetch) still need polyfills. One thing is clear: IE 11 and below require polyfills if we want to support them, which may come with the tradeoff of a bigger bundle and slower code.

Also, the default config in package.json is

"production": [
    ">0.2%",
    "not dead",
    "not op_mini all"
],

The first line means browsers with more than 0.2% global usage. As far as I can tell, browserslist treats comma-separated queries as a union, with the "not" entries subtracted afterwards: "not dead" excludes browsers without official support or security updates for 24 months, and "not op_mini all" excludes Opera Mini on all platforms.

Action items here:

  1. Investigate how create-react-app builds work to clarify what browsers it supports out of the box without additional polyfills
  2. Determine what browsers we're currently supporting with the default config, and whether we want to change the config to support older browsers (this likely requires product input).
  3. Determine whether we need any polyfills based on 1 and 2.

This is not time-sensitive since it doesn't affect development. However, we should avoid using experimental/non-standard features in the meantime.

Migrate download button to do server-side downloads

I looked into this a bit; downloading files in the browser is a bit of a mess, with incomplete browser support for the different mechanisms. This needs more investigation, but I think there are roughly two approaches:

  • Server-side - Add a Content-Disposition: attachment; filename="<name.csv>" header to the response, then just link to it from the client. Depending on content type, we may also need the "download" attribute on the link to ensure the file downloads rather than opening in a new tab (note that the attribute isn't supported in IE).
  • Client-side - fetch the table data, programmatically add a download link to the DOM pointing at the in-memory data, and trigger a click. This is a bit hacky but has the advantage of using the same API as fetching the data directly.

The server-side approach seems better in general. I think this will require updating the API so the client can specify whether the request is for download, and the server then provides the appropriate response headers.
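
A sketch of the server side, assuming a Flask-style handler (the ?download=true parameter and function name are hypothetical):

  from flask import Response, request

  def serve_table_csv(csv_bytes: bytes, filename: str) -> Response:
      headers = {}
      # Only attach the download header when the client asks for it.
      if request.args.get("download") == "true":
          headers["Content-Disposition"] = f'attachment; filename="{filename}"'
      return Response(csv_bytes, mimetype="text/csv", headers=headers)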

Before handoff, figure out netlify setup

Figure out who has access, whether we can control access with a Google group, whether we need a paid plan, whether there's anything we need to document about the account/site setup, etc.

Replace the pub sub logic that connects services

Currently, at the end of the util::ingest_to_gcs function we publish a notification to start the next step of the data ingestion pipeline.

We should replace this mechanism by creating a PythonOperator that will send a payload to the run_gcs_to_bq service.
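
A minimal sketch of that operator, using the Airflow 2-style import (the DAG name, service URL, and payload shape are assumptions):

  from datetime import datetime

  import requests
  from airflow import DAG
  from airflow.operators.python import PythonOperator

  def trigger_gcs_to_bq():
      # Hypothetical payload and service URL.
      payload = {"id": "HOUSEHOLD_INCOME", "gcs_bucket": "my-bucket"}
      response = requests.post("https://run-gcs-to-bq.example.com",
                               json=payload)
      response.raise_for_status()

  with DAG("data_ingestion",
           start_date=datetime(2021, 1, 1),
           schedule_interval=None) as dag:
      gcs_to_bq = PythonOperator(
          task_id="gcs_to_bq",
          python_callable=trigger_gcs_to_bq,
      )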
