
satcherinstitute / health-equity-tracker


Health Equity Tracker is a free-to-use data visualization platform that is enabling new insights into the impact of COVID-19 and other social and political determinants of health on historically underrepresented groups in the United States.

Home Page: https://healthequitytracker.org/

License: MIT License

Languages: Dockerfile 0.20%, Python 35.69%, HCL 1.08%, Shell 1.02%, Makefile 0.03%, HTML 0.12%, TypeScript 61.39%, JavaScript 0.27%, CSS 0.21%
Topics: covid-19, covid19-tracker, covid19-data, covid-data, health-data, racial-disparities, equity

health-equity-tracker's People

Contributors

adams314, alinix1, benhammondmusic, colespen, dependabot[bot], ebonyrespress, efregoso, eriwarr, jdemlow, jenniebrown, joshzarrabi, juneezee, kccrtv, kkatzen, mayaspivak, sethvg, slagathorr, vanshkumar


health-equity-tracker's Issues

Cloud run development issues

A couple of things I noticed while getting started with Cloud Run for the ingestion pipeline:

  • logging.info() doesn't show up in the Cloud Run logs, but logging.warning() does.
  • Failures trigger seemingly infinite retries that are very spammy. Also, when I ran the household income scheduler, it succeeded but still caused lots of retries that the other data sources didn't. I wonder if this is because it took too long to ack (around 40 seconds, much longer than the other ones I tested).

Not sure, but one or both of these may be addressed if we use Airflow.
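
For the logging point, this is probably just Python's default root-logger level (WARNING) rather than anything Cloud Run-specific, so logging.info() is dropped before it ever reaches the logs. A minimal sketch of a likely fix, assuming the google-cloud-logging client library is installed:

  import logging

  import google.cloud.logging

  # Attach the Cloud Logging handler and lower the threshold so that
  # INFO-level records are forwarded with the proper severity.
  client = google.cloud.logging.Client()
  client.setup_logging(log_level=logging.INFO)

  logging.info("This should now show up in the Cloud Run logs.")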

Frontend error handling and logging

Audit frontend code for error paths and ensure errors are A) caught, B) handled gracefully by the UI, C) reported to the server, and D) tracked somewhere on the server along with server errors.
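
For (C) and (D), a rough sketch of the server side, assuming a Flask-style data server (the endpoint name and payload fields are hypothetical):

  import logging

  from flask import Flask, request

  app = Flask(__name__)

  # Hypothetical endpoint that records frontend errors alongside
  # server errors, so both end up in the same logs.
  @app.route("/api/client-error", methods=["POST"])
  def log_client_error():
      report = request.get_json(force=True) or {}
      logging.error("Client error: %s (url=%s)",
                    report.get("message"), report.get("url"))
      return "", 204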

Figure out strategy for error logging/monitoring/alerting

Very open-ended task, but includes:

  • How are errors monitored/alerted on?
  • Standardize severities, eg what constitutes failing a task vs just logging a warning
  • Once we do this, audit code and make sure errors are properly handled (I'm guessing for early development we're going to be lax about this)

Airflow might help with this, though the web app will also need to do similar things.
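
One possible severity convention, sketched with a hypothetical helper: recoverable problems get a warning, everything else fails the task so retries and alerting kick in.

  import logging

  # Hypothetical helper encoding one possible severity policy.
  def handle_ingestion_error(err: Exception, recoverable: bool = False) -> None:
      if recoverable:
          # Degraded but acceptable: record it and keep going.
          logging.warning("Recoverable ingestion problem: %s", err)
      else:
          # Anything else should fail the task and surface in alerting.
          raise err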

Create Airflow local development infrastructure

To create an environment for running our ingestion pipeline locally on a developer's machine, we need the following:

  • a docker-compose YAML file to configure a multi-container environment
  • a Makefile for simplified setup and teardown

Clean up duplicated data source code

Because we initially had no shared-code setup between the different Cloud Run services, some code is duplicated across data sources. E.g. column ids are duplicated between census.py and census_to_bq.py, state names between primary_care_access.py and primary_care_access_to_bq.py, etc. Moving source metadata to a common location may also help with serving it from the data server.
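
A minimal sketch of the shared-module idea (module, constant, and entry names are hypothetical):

  # python/ingestion/source_metadata.py (hypothetical shared module)
  # Column ids and state names live in one place and are imported by
  # both halves of each data source (census.py, census_to_bq.py, etc).

  CENSUS_COLUMN_IDS = {
      # Illustrative entry only; the real ids live in census.py today.
      "total_population": "B01003_001E",
  }

  STATE_NAMES = ["Alabama", "Alaska"]  # ...and so on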

Split up python/ingestion package into different packages

This would be good both for code organization and for reducing cloud run image sizes. For example, we could have packages for

  • data-source logic, eg schemas, metadata, API calls that can be shared across services
  • cloud-run message-handling utilities
  • data-processing with pandas
  • downloading to gcs
  • uploading to BQ

Figure out python build setup

Goals of the build setup:

  • [P1] easily share code across Cloud Run services
  • [P1] clear, reliable, reproducible, and up-to-date dependency management (eg avoid breaking when a dependency is updated, keep deps up to date, and make it difficult to deploy code that's missing a dependency or using an incompatible one)
  • [P2] avoid pulling in unnecessary dependencies (I think this is less relevant unless somehow our codebase/dependency tree becomes very large, but it could help with reliability)
  • [P2] standardize dependency management for local setup and testing without polluting prod environments

Current setup at the time of writing:

  • All shared Python code lives in the /python directory at the root of the repository, which includes a setup.py file declaring /python as a package. Shared code is placed in sub-packages with an __init__.py file.
  • Each Cloud Run service has a Dockerfile that copies the /python directory into the container image and installs it with pip install.
  • Each Cloud Run service has its own requirements.in file that is used to generate a requirements.txt file. It is up to each service to ensure that if any shared code in /python requires dependency X, the service's requirements.in specifies dependency X and pip-compile is rerun to regenerate the correct requirements.txt.

Pros: easy to share code, and requirements are version-pinned (this helps standardize environments and avoid weird breakages).
Cons: all code from /python is installed in all Cloud Run services, and deps required by /python/some_package must be added to the requirements.in file of every service that uses some_package.

I think ideally we should move to a model where:

  • each sub-package of /python specifies its own dependencies either via a requirements.in file or setup.py file
  • each Cloud Run service specifies a requirements.in file that references the requirements.in files for the shared packages being used.
  • we have some form of automation that ensures pinned versions are updated regularly (no idea how to do this, maybe something like GitHub Dependabot?)
  • some presubmit check that ensures that when new dependencies are added to a package, all services that rely on it have their requirements.txt files updated

This model may or may not require pip-compile-multi; it needs some investigation.
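
A sketch of the first bullet above, using setuptools (see the open question below) with hypothetical package names:

  # python/datasources/setup.py (hypothetical sub-package)
  # Each shared package declares its own dependencies, so services
  # only pull in what they actually use.
  from setuptools import find_packages, setup

  setup(
      name="ingestion-datasources",
      packages=find_packages(),
      install_requires=[
          "pandas",
          "requests",
      ],
  )

For the second bullet, pip-tools already supports nested includes, so a service's requirements.in can reference a shared package's requirements with a line like -r ../../python/datasources/requirements.in (path hypothetical).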

One way to solve requirements.txt files going stale is to run pip-compile from within each service's Dockerfile. That would keep things up to date, but it's probably a bad idea: it makes dependency versions opaque, and every deploy would silently update all versions, which could lead to reliability issues.

Another open question: should we use setuptools instead of distutils.core.setup? (distutils is deprecated, so setuptools is probably the answer.)

Verify dev vs prod dependency setup

Verify that when installing dev dependencies, create-react-app doesn't include them in production builds. It appears that it doesn't (see PR discussion), but it would be good to double-check.

We might also consider having a separate devDependencies section and installing dev dependencies with the --save-dev option. Not sure how create-react-app handles dev dependencies without this explicit split, but the explicit split seems like a good structure IMO.

Add "--disallow-untyped-defs" option to mypy

This will enforce that all functions have type annotations. As a prerequisite, someone will need to go through the codebase and add types to everything. We should do this sooner rather than later, so adding types is less burdensome and we get the type-checking benefits before the codebase becomes large.
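
For reference, with this flag mypy reports "Function is missing a type annotation" for the first definition below and accepts the second:

  def count_rows(table):
      return len(table)

  def count_rows_typed(table: list) -> int:
      return len(table)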

Add data serving service to terraform config

May need to create a new config if the one from the prototype repo is not yet copied over.

  • This should include the identity to run the service as and a custom role for the identity.
  • Need to figure out auth for invoking this service (not necessarily in scope for this issue)

Future improvements for ACS data ingestion

  • Figure out how to handle updating the URL. It's currently hard-coded to 2018, the most recent 5-year ACS dataset (see the sketch after this list).
  • We should register a developer key, which grants a much higher quota. This may not matter since we'd likely stay under the quota anyway, but it doesn't hurt to register one.
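
For the first point, one option is to template the year; the URL pattern below is an assumption, not necessarily what's in the code today:

  # Sketch: parameterize the ACS year instead of hard-coding 2018.
  ACS_BASE_URL = "https://api.census.gov/data/{year}/acs/acs5"

  def acs_url(year: int = 2018) -> str:
      return ACS_BASE_URL.format(year=year)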

Done:

  • Confirm which census dataset to use for population breakdowns: currently using the 5-year ACS dataset but there's also a 1-year and 3-year dataset and I think you can also get population breakdowns from the actual 10-year census.
  • Consider migrating to another source. BigQuery public datasets include census data, which is convenient since it's already in BigQuery, though the documentation on the schemas is pretty poor from what I've seen. Another option is DataCommons, which has a pretty powerful/expressive Python API.

Design for handling data sources changing and avoiding duplicate/redundant data

Currently the code ingests the same data source even if it hasn't changed, which creates duplicate data in BigQuery. This isn't a big deal for now because we record an ingestion_time timestamp, so it's easy to query only the rows where ingestion_time = MAX(ingestion_time) and skip duplicates/old data. The upside is that it keeps ingestion very simple. The downside is that it complicates the data and incurs unnecessary storage and query costs (time and money).

We should figure out whether to change this strategy. Some things that would need to be thought about if we do:

  • how do we detect if the data has changed to prevent ingesting duplicates?
  • how do we preserve historical data?

I think avoiding ingesting redundant data in the first place could get complicated as it may vary by data source. One idea I had is to keep the ingestion as is, and then periodically materialize new tables with just the latest data. Then the old data can be cleaned up with a periodic task or automatically with a TTL.

Eg:
ingest data to GCS bucket => upload from bucket to BigQuery => BigQuery_historical_table => BigQuery_current_data
BigQuery_historical_table would have duplicate/redundant data and periodically get cleaned up, but BigQuery_current_data would always be the most recent.
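
A sketch of the materialization step, assuming the google-cloud-bigquery client (dataset and table names are placeholders):

  from google.cloud import bigquery

  client = bigquery.Client()

  # Rebuild the "current" table from only the latest ingestion run.
  query = """
      CREATE OR REPLACE TABLE `my_dataset.BigQuery_current_data` AS
      SELECT *
      FROM `my_dataset.BigQuery_historical_table`
      WHERE ingestion_time = (
          SELECT MAX(ingestion_time)
          FROM `my_dataset.BigQuery_historical_table`
      )
  """
  client.query(query).result()  # block until the job finishes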

Testing coverage

We likely won't have great coverage early on due to prototyping and quick iteration; we should go back and make sure we have coverage for the key pieces of code.

It might also be worth having some more integration-style tests that exercise end-to-end flows.

Clean up error handling in ingestion pipeline

Right now we have lots of try/except blocks that just catch an error and log it. For errors that should fail the whole process (which is most of them), we should let the error propagate up and get logged in one appropriate place. Most errors carry enough context that an additional message isn't necessary, but when it is, that can be handled by re-raising with exception chaining:

except SomeError as e:
    raise RuntimeError("More specific message") from e

Determine supported browser versions and whether we need to set up additional polyfills for older browsers

create-react-app documentation is a bit confusing on this point. It says

The browserslist configuration controls the outputted JavaScript so that the emitted code will be compatible with the browsers specified

but it also says

Note that this does not include polyfills automatically for you. You will still need to polyfill language features (see above) as needed based on the browsers you are supporting.

... which sounds contradictory at first, but probably isn't: browserslist controls which syntax gets transpiled, while missing runtime features (e.g. Promise or fetch) still need polyfills. One thing is clear: IE 11 and below require polyfills if we want to support them, which may come with the tradeoff of a bigger bundle and slower code.

Also, the default config in package.json is

"production": [
    ">0.2%",
    "not dead",
    "not op_mini all"
],

The first line means browsers with more than 0.2% global usage. As far as I can tell, browserslist treats comma-separated queries as a union, with the "not" entries subtracted afterwards: "not dead" excludes browsers without official support or security updates for 24 months, and "not op_mini all" excludes Opera Mini on all platforms.

Action items here:

  1. Investigate how create-react-app builds work to clarify what browsers it supports out of the box without additional polyfills
  2. Determine what browsers we're currently supporting with the default config, and whether we want to change the config to support older browsers (this likely requires product input).
  3. Determine whether we need any polyfills based on 1 and 2.

This is not time-sensitive since it doesn't affect development. However, we should avoid using experimental/non-standard features in the meantime.

Migrate download button to do server-side downloads

I looked into this a bit; downloading files in the browser is a bit of a mess, with incomplete browser support for the different mechanisms. This needs more investigation, but I think there are roughly two approaches:

  • Server-side - Add a Content-Disposition: attachment; filename="<name.csv>" header to the response, then just link to it from the client. Depending on content type, we may also need the "download" attribute on the link to ensure the file downloads rather than opening in a new tab (note that the attribute isn't supported in IE).
  • Client-side - fetch the table data, programmatically add a download link to the DOM pointing at the in-memory data, and trigger a click. This is a bit hacky but has the advantage of using the same API as fetching the data directly.

The server-side approach seems better in general. I think this will require updating the API so the client can specify whether the request is for download, and the server then provides the appropriate response headers.
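
A sketch of the server side, assuming a Flask-style handler (the ?download=true parameter and function name are hypothetical):

  from flask import Response, request

  def serve_table_csv(csv_bytes: bytes, filename: str) -> Response:
      headers = {}
      # Only attach the download header when the client asks for it.
      if request.args.get("download") == "true":
          headers["Content-Disposition"] = f'attachment; filename="{filename}"'
      return Response(csv_bytes, mimetype="text/csv", headers=headers)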

Before handoff, figure out netlify setup

Figure out who has access, whether we can control access with a Google group, whether we need a paid plan, whether there's anything we need to document about the account/site setup, etc.

Replace the pub sub logic that connects services

Currently, at the end of the util::ingest_to_gcs function we publish a notification to start the next step of the data ingestion pipeline.

We should replace this mechanism by creating a PythonOperator that will send a payload to the run_gcs_to_bq service.
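
A minimal sketch of that operator, using the Airflow 2-style import (the DAG name, service URL, and payload shape are assumptions):

  from datetime import datetime

  import requests
  from airflow import DAG
  from airflow.operators.python import PythonOperator

  def trigger_gcs_to_bq():
      # Hypothetical payload and service URL.
      payload = {"id": "HOUSEHOLD_INCOME", "gcs_bucket": "my-bucket"}
      response = requests.post("https://run-gcs-to-bq.example.com",
                               json=payload)
      response.raise_for_status()

  with DAG("data_ingestion",
           start_date=datetime(2021, 1, 1),
           schedule_interval=None) as dag:
      gcs_to_bq = PythonOperator(
          task_id="gcs_to_bq",
          python_callable=trigger_gcs_to_bq,
      )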
