data-sources-mirror's People

Contributors

dependabot[bot], drowninginflowers, e-linear, josh-chamberlain, mbodeantor, thejqs

data-sources-mirror's Issues

Search Result Card

When the results of quick search are returned to the user, they will be made up of multiple search result cards. Each should contain:

  • Data source name heading (data_source_name)
  • Agency name (agency_name)
  • Municipality info (municipality, state_iso)
  • "Record Type" heading as well as the actual record type below (record_type)
  • "Coverage" heading as well as the date range showing (coverage_start) and (coverage_end)
  • "Formats Available" heading as well as the list of (record_formats) parsed into separate labels
  • Visit Source URL button with an href (source_url)
  • Source Details button which expands with details from (description)

Create Quick Search Flask Endpoint

  • Create initial Flask app with connectors to Supabase
  • Build a V1 quick search endpoint, accepting two parameters: search and county
  • County should be an exact match on the county_name column in agencies table for now
  • Search should be a partial match on the name column in data_sources table (where ilike '%{}%')
  • Endpoint should return a json object of the form:
    {
      "num_records": LEN(RECORDS),
      "data": {
        "data_source_name": NAME (data_sources),
        "agency_name": NAME (agencies),
        "municipality": MUNICIPALITY (agencies),
        "state": STATE_ISO (agencies),
        "description": DESCRIPTION (data_sources),
        "record_type": RECORD_TYPE (data_sources),
        "source_url": SOURCE_URL (data_sources),
        "record_format": RECORD_FORMAT (data_sources),
        "coverage_start": COVERAGE_START (data_sources),
        "coverage_end": COVERAGE_END (data_sources),
        "agency_supplied": AGENCY_SUPPLIED (data_sources)
      }
    }
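
A minimal sketch of the query and response shaping described above, kept independent of Flask and Supabase so it can be read on its own. Several things here are assumptions rather than part of the issue: the `%s` placeholder style (as used by psycopg-style drivers), the join on `agency_described_linked_uid` (borrowed from the "Set up Supabase" issue), returning `data` as a list with one object per matching record, and the helper names `build_quick_search_params` and `format_response`.

```python
from typing import Any

# Response keys, in the same order as the SELECT list below.
RESPONSE_KEYS = [
    "data_source_name", "agency_name", "municipality", "state",
    "description", "record_type", "source_url", "record_format",
    "coverage_start", "coverage_end", "agency_supplied",
]

# Assumed join column and placeholder style; adjust to the real schema/driver.
QUICK_SEARCH_SQL = """
SELECT
    data_sources.name AS data_source_name,
    agencies.name AS agency_name,
    agencies.municipality,
    agencies.state_iso AS state,
    data_sources.description,
    data_sources.record_type,
    data_sources.source_url,
    data_sources.record_format,
    data_sources.coverage_start,
    data_sources.coverage_end,
    data_sources.agency_supplied
FROM data_sources
JOIN agencies
    ON data_sources.agency_described_linked_uid = agencies.airtable_uid
WHERE data_sources.name ILIKE %s
  AND agencies.county_name = %s
"""


def build_quick_search_params(search: str, county: str) -> tuple[str, str]:
    """Return bind parameters: a partial-match pattern and an exact county."""
    # The wildcard pattern is passed as a parameter, not formatted into the SQL.
    return (f"%{search}%", county)


def format_response(rows: list[tuple[Any, ...]]) -> dict[str, Any]:
    """Shape raw result rows into the num_records / data structure."""
    data = [dict(zip(RESPONSE_KEYS, row)) for row in rows]
    return {"num_records": len(data), "data": data}
```

Passing the search pattern as a bind parameter instead of formatting it into the SQL string keeps user input out of the query text.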

Query new tables with API

The new versions of the Supabase tables are ready to be incorporated into the API:

  • Point all queries to the capitalized versions of the tables
  • Verify the API is still returning as expected
  • Delete old versions of the tables
  • Rename new tables to lower case
  • Utilize new link table to optimize API queries through joins

Quick Search Card

Create a simple search card that has:

  • PDAP logo at the top
  • A short description of PDAP, what the user can expect to get from the search
  • Text search field
  • Text county field
  • Search button that submits the contents of the two fields to the quick search API endpoint
  • Footer links to various portions of the PDAP site (see wireframe)

Set up Supabase

  • Use Supabase to create a PostgreSQL database with roles for all devs
  • Use the public Airtable database mirror as a guide for the schema; feel free to infer data types.
  • Many of the fields in data_sources are lookups on the agencies or counties tables.
  • data_sources.agency_described_linked_uid foreign key → agencies.airtable_uid
  • agencies.county_fips foreign key → counties.fips
  • Ideally, this should not break if new Airtable columns are added, or column names change.
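
The two foreign keys listed above can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database from Python's standard library as a stand-in for the Supabase PostgreSQL instance, and every column other than the key columns named in the issue (and every `TEXT` type) is an assumption.

```python
import sqlite3

# Only the key columns come from the issue; the rest are placeholders.
SCHEMA = """
CREATE TABLE counties (
    fips TEXT PRIMARY KEY,
    name TEXT
);
CREATE TABLE agencies (
    airtable_uid TEXT PRIMARY KEY,
    name TEXT,
    county_fips TEXT REFERENCES counties (fips)
);
CREATE TABLE data_sources (
    airtable_uid TEXT PRIMARY KEY,
    name TEXT,
    agency_described_linked_uid TEXT REFERENCES agencies (airtable_uid)
);
"""


def create_schema() -> sqlite3.Connection:
    """Create the three tables in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_schema()
    conn.execute("INSERT INTO counties VALUES ('42003', 'Allegheny')")
    conn.execute("INSERT INTO agencies VALUES ('rec_agency', 'Example PD', '42003')")
    conn.execute("INSERT INTO data_sources VALUES ('rec_source', 'Arrest Logs', 'rec_agency')")
    try:
        # References an agency that does not exist, so the insert is rejected.
        conn.execute("INSERT INTO data_sources VALUES ('rec_bad', 'Orphan', 'missing')")
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
```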

remove volunteers table

I'm moving the volunteers table to a separate Airtable database. We should stop mirroring it to DigitalOcean.

documentation for our mirror configuration

It would be great to have a blurb in the README about how this is implemented—I can see the droplet, but nothing about how it's configured. For example, when we make changes to the code in this repo, does it sync? How often does the droplet run?

I think this stuff can just go in the README—this repo is public but none of this info is really privileged.

record type tags + associated schema changes

These changes are now live in Airtable and in the docs, so they should be made in the mirror.

  • added readme_url property. In Airtable this is a "url" type property, just like source_url, but in the db it can just be a string.

  • added tags property. In Airtable this is a "multiple select" type property, just like agency_aggregation, but in the db it can be an array.

  • added tags_other string property as a utility so people can submit additional tags in the form. This is a workaround for the fact that Airtable does not let people add these naturally as part of their submission. These will be dealt with on intake, and are not considered a permanent / public part of the db.

  • changed Traffic Stops record_type to Stops

  • added Vehicle Pursuits record_type

  • added Field Contacts record_type
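
One way the mirror might apply these changes to a record before writing it to the database is sketched below. The function name `normalize_fields` and the assumption that Airtable fields arrive as a plain dictionary with `record_type` as a single string are illustrative, not taken from the mirror code.

```python
from typing import Any

# Record type rename described in this issue.
RECORD_TYPE_RENAMES = {"Traffic Stops": "Stops"}


def normalize_fields(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an Airtable record's fields with the new schema applied."""
    row = dict(fields)
    record_type = row.get("record_type")
    # Unlisted record types (including the new ones) pass through unchanged.
    row["record_type"] = RECORD_TYPE_RENAMES.get(record_type, record_type)
    # readme_url is stored as a plain string; missing values become None.
    row["readme_url"] = row.get("readme_url")
    # tags is a multiple select in Airtable; store it as an array (list).
    row["tags"] = list(row.get("tags") or [])
    row["tags_other"] = row.get("tags_other")
    return row
```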

Update Mirror Code for Schema Refresh

Context

Add and remove columns from source_fieldnames_full function according to the schema refresh outlined in this issue: https://github.com/orgs/Police-Data-Accessibility-Project/projects/21/views/1?pane=issue&itemId=40187829

Requirements

  • Maintain same functionality with minor changes in the Supabase schema reflecting planned Airtable updates

Tests

  • Ensure GitHub Actions workflow runs successfully

Docs

Identifying Associated Agency

As part of the Data Source Identification Pipeline, we want to attach to each discovered data source the agency it is associated with. To aid in this process, we should use the Sitemap Scraper tool that was developed: https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/tree/c7ecd224c2aab66666a96832bea6bc11b69de623/utils/sitemap_scraper

  • Query the Supabase Agency table for all agencies with a URL and their name
  • Modify Sitemap Scraper so the output retains the root URL and agency name
  • Run the agency URLs through the sitemap scraper
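
A rough sketch of how scraped URLs could keep their root URL and agency name is shown below. It does not use the Sitemap Scraper itself; the `homepage_url` key, the input shapes, and both function names are hypothetical.

```python
from urllib.parse import urlsplit


def root_url(url: str) -> str:
    """Reduce a URL to scheme and host, e.g. https://example.gov."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def attach_agencies(agencies: list[dict], discovered_urls: list[str]) -> list[dict]:
    """Label each discovered URL with the agency whose root URL it shares."""
    # 'homepage_url' is a placeholder for whatever the agency URL column is called.
    by_root = {root_url(a["homepage_url"]): a["name"] for a in agencies}
    results = []
    for url in discovered_urls:
        root = root_url(url)
        results.append(
            {"url": url, "root_url": root, "agency_name": by_root.get(root)}
        )
    return results
```

URLs whose root does not match any agency get `agency_name` set to `None`, so they can be reviewed separately.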

Search Results Page

Once a user has submitted their search query, they should see the results as multiple search result cards corresponding to each data source returned from the quick search endpoint. Includes:

  • "Search Results" heading
  • Text reflecting the search term submitted and the total number of results returned
  • A button for the user to request additional data sources (links here)

Periodic URL archives with Save Page Now

The goal

  • Periodically use Internet Archive's "Save Page Now" capturing service to preserve copies of Data Sources in our database. (What is a data source?)
    • Log 404s, timeouts, and other errors in a file. Store the airtable_uid, source_url, and an error message as part of the log, so we can easily update the database if we can't fix the URL.
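
The capture-and-log loop could look roughly like the sketch below. It assumes the public `https://web.archive.org/save/<url>` form of Save Page Now, takes `(airtable_uid, source_url)` pairs as input, and writes the log as CSV. The function names and the CSV layout are illustrative choices, and scheduling, rate limiting, and authentication are left out.

```python
import csv
import urllib.request
from typing import Callable, Iterable, TextIO

# Assumed Save Page Now entry point; the target URL is appended to it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def request_capture(source_url: str, timeout: float = 60.0) -> None:
    """Ask Save Page Now to capture one URL; raises on HTTP or network errors."""
    with urllib.request.urlopen(SAVE_ENDPOINT + source_url, timeout=timeout):
        pass


def archive_sources(
    sources: Iterable[tuple[str, str]],
    log_file: TextIO,
    capture: Callable[[str], None] = request_capture,
) -> int:
    """Try to capture each source; log failures and return how many failed."""
    writer = csv.writer(log_file)
    writer.writerow(["airtable_uid", "source_url", "error"])
    failures = 0
    for airtable_uid, source_url in sources:
        try:
            capture(source_url)
        except OSError as exc:
            # OSError covers urllib's HTTPError and URLError as well as timeouts.
            writer.writerow([airtable_uid, source_url, str(exc)])
            failures += 1
    return failures
```

Passing `capture` as a parameter lets the loop be exercised with a stub, without sending requests to the Internet Archive.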

Why?

Retention policies can be unforgiving, and important records are lost to time every day. It's bad news when a URL in our data sources database stops working. The best time to plant a tree is 20 years ago…the second best time is now. Same here.

But why though?

It is an incentive to give us data sources. Currently, we ask people to do it just to do us a favor or because they believe in the cause. Instead, we can say “If you know about an internet data source, submit it to us and tell us how often it’s refreshed. We will create archives automatically and link to them.” Instead of just passively storing them, we’re doing our part to preserve data. Fun!

Seriously, explain why this is important

It makes our data sources database more and more useful over time. That URL we saved 404s now? No worries, we archived it. Instead of becoming a useless row we need to delete, it points to a place where information was published and can still be accessed.

Suggested approach
