police-data-accessibility-project / data-sources-mirror

A home for data initially entered into public Airtable forms

Python 100.00%

data-sources-mirror's Introduction

Purpose

You should probably just look here: https://github.com/Police-Data-Accessibility-Project

Data source name heading (data_source_name)
Agency name (agency_name)
Municipality info (municipality, state_iso)
"Record Type" heading as well as the actual record type below (record_type)
"Coverage" heading as well as the date range showing (coverage_start) and (coverage_end)
"Formats Available" heading as well as the list of (record_formats) parsed into separate labels
Visit Source URL button with an href (source_url)
Source Details button which expands with details from (description)

Create Quick Search Flask Endpoint

Create initial Flask app with connectors to Supabase
Build a V1 quick search endpoint, accepting two parameters: search and county
County should be an exact match on the county_name column in agencies table for now
Search should be a partial match on the name column in data_sources table (where ilike '%{}%')
Endpoint should return a json object of the form:
{
"num_records": LEN(RECORDS),
"data": {
"data_source_name": NAME (data_sources),
"agency_name": NAME (agencies),
"municipality": MUNICIPALITY (agencies),
"state": STATE_ISO (agencies),
"description": DESCRIPTION (data_sources),
"record_type": RECORD_TYPE (data_sources),
"source_url": SOURCE_URL (data_sources),
"record_format": RECORD_FORMAT (data_sources),
"coverage_start": COVERAGE_START (data_sources),
"coverage_end": COVERAGE_END (data_sources),
"agency_supplied": AGENCY_SUPPLIED (data_sources)
}
}
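
A minimal sketch of the query and response shaping described above. It uses the standard library's sqlite3 as a stand-in for the Supabase PostgreSQL database, so wiring it into a Flask route and a real connection is left out. The join on agency_described_linked_uid and airtable_uid is taken from the "Set up Supabase" issue; returning "data" as a list of objects and the helper name quick_search are assumptions. Parameter binding replaces the '%{}%' string formatting to avoid SQL injection; in PostgreSQL the LIKE below would be ILIKE.

```python
import sqlite3

# Response keys, in the same order as the SELECT columns below.
RESPONSE_KEYS = [
    "data_source_name", "agency_name", "municipality", "state",
    "description", "record_type", "source_url", "record_format",
    "coverage_start", "coverage_end", "agency_supplied",
]

# SQLite's LIKE is case-insensitive for ASCII; use ILIKE in PostgreSQL.
QUICK_SEARCH_SQL = """
    SELECT ds.name, a.name, a.municipality, a.state_iso,
           ds.description, ds.record_type, ds.source_url, ds.record_format,
           ds.coverage_start, ds.coverage_end, ds.agency_supplied
    FROM data_sources AS ds
    JOIN agencies AS a ON ds.agency_described_linked_uid = a.airtable_uid
    WHERE ds.name LIKE ? AND a.county_name = ?
"""


def quick_search(conn, search, county):
    """Partial match on data_sources.name, exact match on agencies.county_name."""
    rows = conn.execute(QUICK_SEARCH_SQL, (f"%{search}%", county)).fetchall()
    data = [dict(zip(RESPONSE_KEYS, row)) for row in rows]
    return {"num_records": len(data), "data": data}
```

Inside a Flask view, the returned dictionary could be passed to jsonify after reading search and county from the request arguments.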

Surveys record type

added Surveys as a record_type, so we need to make sure we're expecting it when we mirror. Do we need an issue for data-sources-app as well? @mbodeantor

docs updated: https://docs.pdap.io/activities/data-dictionaries/record-types-taxonomy

Query new tables with API

The new versions of the Supabase tables are ready to be incorporated into the API:

Point all queries to the capitalized versions of the tables
Verify the API is still returning as expected
Delete old versions of the tables
Rename new tables to lower case
Utilize new link table to optimize API queries through joins

Quick Search Card

Create a simple search card that has:

PDAP logo at the top
A short description of PDAP, what the user can expect to get from the search
Text search field
Text county field
Search button that submits the contents of the two fields to the quick search API endpoint
Footer links to various portions of the PDAP site (see wireframe)

Host API on Digital Ocean

I'm getting this error when I try to deploy the API on Digital Ocean:
ImportError: cannot import name 'Mapping' from 'collections'

Verify flask app code on main branch will deploy locally in a clean virtual environment (install packages only from requirements.txt)
When that runs without errors, debug any errors when trying to deploy on Digital Ocean: https://cloud.digitalocean.com/apps/f7334ee3-2cd0-483f-812d-7e3aed217ff8/settings/data-sources-app?i=feca0b
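
For context, this ImportError typically appears on Python 3.10 or newer, where the abstract base class aliases were removed from collections and are only available from collections.abc. The import usually lives in an outdated dependency, so upgrading the pinned package in requirements.txt or pinning the runtime's Python version is the likely fix; the issue does not say which package is at fault. The small check below (the function name is hypothetical) shows the supported import and reports whether the running interpreter still has the legacy alias.

```python
import sys
from collections.abc import Mapping  # supported location of the ABC


def legacy_import_status():
    """Report whether the old `collections.Mapping` alias still exists."""
    try:
        from collections import Mapping as _legacy  # noqa: F401
    except ImportError:
        return "removed"
    return "available"


if __name__ == "__main__":
    # Print the interpreter version next to the alias status.
    print(sys.version_info[:2], legacy_import_status())
```

Running this both locally and in the deployment environment is a quick way to confirm whether the two use different Python versions.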

Set up Supabase

Use Supabase to create a PostgreSQL database with roles for all devs
Use the public Airtable database mirror as a guide for the schema, feel free to infer data types.
Many of the fields in data_sources are lookups on the agencies or counties tables.
data_sources.agency_described_linked_uid foreign key → agencies.airtable_uid
agencies.county_fips foreign key → counties.fips
Ideally, this should not break if new Airtable columns are added, or column names change.
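
A rough sketch of the two foreign keys above, using sqlite3 as a stand-in for Supabase's PostgreSQL. Only the key columns come from this issue; the other columns, the TEXT types, and the helper names are assumptions. The keep_known_columns helper illustrates one way to avoid breaking when Airtable gains new columns: ignore fields the schema does not know about.

```python
import sqlite3

SCHEMA = """
CREATE TABLE counties (fips TEXT PRIMARY KEY, name TEXT);
CREATE TABLE agencies (
    airtable_uid TEXT PRIMARY KEY,
    name TEXT,
    county_fips TEXT REFERENCES counties (fips)
);
CREATE TABLE data_sources (
    airtable_uid TEXT PRIMARY KEY,
    name TEXT,
    agency_described_linked_uid TEXT REFERENCES agencies (airtable_uid)
);
"""


def create_schema(conn):
    """Create the three tables with foreign-key enforcement switched on."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only; PostgreSQL always enforces
    conn.executescript(SCHEMA)


def keep_known_columns(record, known_columns):
    """Drop unexpected Airtable fields so new columns do not break inserts."""
    return {key: value for key, value in record.items() if key in known_columns}
```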

remove volunteers table

I'm moving the volunteers table to a separate Airtable database. We should stop mirroring it to DigitalOcean.

documentation for our mirror configuration

It would be great to have a blurb in the README about how this is implemented—I can see the droplet, but nothing about how it's configured. For example, when we make changes to the code in this repo, does it sync? How often does the droplet run?

I think this stuff can just go in the README—this repo is public but none of this info is really privileged.

record type tags + associated schema changes

These changes are now live in Airtable and in the docs, so they should be made in the mirror.

added readme_url property. In Airtable this is a "url" type property, just like source_url, but in the db it can just be a string.
added tags property. In Airtable this is a "multiple select" type property, just like agency_aggregation, but in the db it can be an array.
added tags_other string property as a utility so people can submit additional tags in the form. This is a workaround for the fact that Airtable does not let people add these naturally as part of their submission. These will be dealt with on intake, and are not considered a permanent / public part of the db.
changed Traffic Stops record_type to Stops
added Vehicle Pursuits record_type
added Field Contacts record_type
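
One way the mirror could apply these changes to each record before writing it, sketched under assumptions: the function name, the rename table, and the idea of normalizing in Python rather than in SQL are all illustrative, not the repository's actual code.

```python
# Old record_type values mapped to their new names.
RECORD_TYPE_RENAMES = {"Traffic Stops": "Stops"}


def normalize_record(fields):
    """Return a copy of an Airtable record's fields matching the new schema."""
    row = dict(fields)

    # Apply renamed record types; unknown or new types pass through unchanged.
    record_type = row.get("record_type")
    row["record_type"] = RECORD_TYPE_RENAMES.get(record_type, record_type)

    # "tags" is a multiple select in Airtable; store it as a list (array).
    tags = row.get("tags") or []
    row["tags"] = [tags] if isinstance(tags, str) else list(tags)

    # readme_url and tags_other are plain strings in the database.
    for key in ("readme_url", "tags_other"):
        value = row.get(key)
        row[key] = None if value is None else str(value)
    return row
```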

Update Mirror Code for Schema Refresh

Context

Add and remove columns from source_fieldnames_full function according to the schema refresh outlined in this issue: https://github.com/orgs/Police-Data-Accessibility-Project/projects/21/views/1?pane=issue&itemId=40187829

Requirements

Maintain same functionality with minor changes in the Supabase schema reflecting planned Airtable updates

Tests

Ensure the GitHub Actions workflow runs successfully

Docs

Will be addressed in this issue: https://github.com/orgs/Police-Data-Accessibility-Project/projects/21/views/1?pane=issue&itemId=40198475

Schedule Airtable Mirror

Reuse the Airtable mirror code (https://github.com/Police-Data-Accessibility-Project/data-sources-mirror), improving as needed, and include connections to write to Supabase
- one improvement should be changing how we access the API, because they're deprecating API keys
pyAirtable is useful
The mirror should be regularly synchronized (daily, to start) with the original database to ensure data consistency.
Use GitHub Actions to schedule this
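
A sketch of the read side of the mirror, assuming pyAirtable 2.x and an Airtable personal access token in place of the deprecated API key. The function names are hypothetical, and using the Airtable record id as airtable_uid when the field is absent is an assumption about how the mirror identifies rows.

```python
def fetch_table(token, base_id, table_name):
    """Fetch every record from one Airtable table using a personal access token."""
    from pyairtable import Api  # third-party: pip install pyairtable

    return Api(token).table(base_id, table_name).all()


def flatten_records(records):
    """Turn Airtable's {"id", "createdTime", "fields"} records into flat rows."""
    rows = []
    for record in records:
        row = dict(record.get("fields", {}))
        # Assumption: fall back to the Airtable record id as the unique id.
        row.setdefault("airtable_uid", record["id"])
        rows.append(row)
    return rows
```

A daily run could then come from a GitHub Actions workflow with a schedule trigger (a cron expression) that calls a script built on these two functions, with the token stored as a repository secret.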

detail_level column in data_sources.csv does not match Airtable

On Airtable, the detail_level has values like individual records, aggregated records, etc.
In the GitHub mirror, the values are federal, state, county, etc.

Notify Discord on GitHub Actions pipeline failure

Allow the wider team to be aware of corrective action needed to get a pipeline running. Looks like this will be helpful in this effort: https://github.com/marketplace/actions/discord-workflow-status-notifier
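
As an alternative to the marketplace action, a failure step could post to a Discord webhook directly. The sketch below uses only the standard library; the function names and message wording are assumptions, and the webhook URL would have to come from a repository secret.

```python
import json
import urllib.request


def build_failure_message(workflow, run_url):
    """Build the JSON body Discord webhooks expect: a "content" string."""
    return {"content": f"Workflow '{workflow}' failed: {run_url}"}


def notify_discord(webhook_url, payload):
    """POST the payload to a Discord webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "User-Agent": "pdap-mirror-notifier",  # hypothetical client name
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```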

Identifying Associated Agency

As part of the Data Source Identification Pipeline, we want to attach to each discovered data source the agency it is associated with. To aid in this process, we should use the Sitemap Scraper tool that was developed: https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/tree/c7ecd224c2aab66666a96832bea6bc11b69de623/utils/sitemap_scraper

Query the Supabase Agency table for all agencies with a URL and their name
Modify Sitemap Scraper so the output retains the root URL and agency name
Run the agency URLs through the sitemap scraper
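
To illustrate the "retain the root URL and agency name" step, here is a small sketch that does not touch the Sitemap Scraper itself. Both function names and the output field names are hypothetical; the scraper's real output format may differ.

```python
from urllib.parse import urlsplit


def root_url(url):
    """Reduce any agency URL to scheme plus host, e.g. https://example.gov."""
    # Assumption: URLs stored without a scheme should be treated as https.
    parts = urlsplit(url if "://" in url else f"https://{url}")
    return f"{parts.scheme}://{parts.netloc}"


def tag_sitemap_urls(agency_name, agency_url, sitemap_urls):
    """Attach the agency name and root URL to every URL found in its sitemap."""
    root = root_url(agency_url)
    return [
        {"agency_name": agency_name, "root_url": root, "url": url}
        for url in sitemap_urls
    ]
```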

API Auth

Set up a users table in Supabase to support API auth
Implement authentication using your favorite framework; flask_jwt seems like the most straightforward option if you don't have another preference (https://blog.teclado.com/api-key-authentication-with-flask/)
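
Whichever framework is chosen, the users table should not hold API keys in plain text. The sketch below is not flask_jwt; it is a standard-library illustration of one common approach, where only a hash of each key is stored and comparisons run in constant time. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import hmac
import secrets


def generate_api_key():
    """Create a random 64-character hex key to hand to the user once."""
    return secrets.token_hex(32)


def hash_api_key(api_key):
    """Hash the key; only this value would be stored in the users table."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def is_valid_key(presented_key, stored_hash):
    """Compare in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```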

Add Budgets & Finances record_type

There's now a record_type called "Budgets & Finances" in Airtable.

Search Results Page

Once a user has submitted their search query, they should see the results as multiple search result cards corresponding to each data source returned from the quick search endpoint. Includes:

"Search Results" heading
Text reflecting the search term submitted and the total number of results returned
A button for the user to request additional data sources (links here)

Periodic URL archives with Save Page Now

The goal

Periodically use Internet Archive's "Save Page Now" capturing service to preserve copies of Data Sources in our database. (What is a data source?)
- Log 404s, timeouts, and other errors in a file. Store the airtable_uid, source_url, and an error message as part of the log, so we can easily update the database if we can't fix the URL.
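
A minimal sketch of the capture-and-log loop, calling the public Save Page Now endpoint directly instead of going through a wrapper library. The function names, the CSV log format, and the 60 second timeout are assumptions; a real run would also need rate limiting.

```python
import csv
import urllib.error
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def archive_url(source_url, timeout=60):
    """Ask Save Page Now to capture source_url; return an error message or None."""
    try:
        with urllib.request.urlopen(SAVE_ENDPOINT + source_url, timeout=timeout):
            return None
    except urllib.error.HTTPError as error:
        # HTTPError must be caught first because it is also an OSError.
        return f"HTTP {error.code}"
    except OSError as error:
        # Covers URLError, connection failures, and timeouts.
        return f"{type(error).__name__}: {error}"


def log_error(log_file, airtable_uid, source_url, message):
    """Append one row with the fields needed to fix the record later."""
    csv.writer(log_file).writerow([airtable_uid, source_url, message])
```

For each data source, the caller would run archive_url and pass any returned message to log_error along with the record's airtable_uid and source_url.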

Why?

Retention policies can be unforgiving, and important records are lost to time every day. It's bad news when a URL in our data sources database stops working. The best time to plant a tree is 20 years ago…the second best time is now. Same here.

But why though?

It is an incentive to give us data sources. Currently, we ask people to do it just to do us a favor or because they believe in the cause. Instead, we can say “If you know about an internet data source, submit it to us and tell us how often it’s refreshed. We will create archives automatically and link to them.” Instead of just passively storing them, we’re doing our part to preserve data. Fun!

Seriously, explain why this is important

It makes our data sources database more and more useful over time. That URL we saved 404s now? No worries, we archived it. Instead of becoming a useless row we need to delete, it points to a place where information was published and can still be accessed.

Suggested approach

Make a new repository for this work (to keep this repo's purpose and function simple and clear)
- that's here: https://github.com/Police-Data-Accessibility-Project/automatic-archives
Use this Save Page Now Python wrapper
Police-Data-Accessibility-Project/automatic-archives#7
