police-data-accessibility-project / scrapers

Code relating to scraping public police data.

Home Page: https://pdap.io

License: GNU General Public License v3.0

Languages: Python 97.18%, JavaScript 2.21%, Makefile 0.06%, Dockerfile 0.55%

Topics: scraping

scrapers's Introduction

Welcome!

This is the GitHub home for web scraping at the Police Data Accessibility Project.

(What do we mean by web scraping?)

How PDAP works

This repo is part of a toolkit for people all over the country to learn about our police systems. Check out our software development roadmap and high-level technical diagram to learn more about our ecosystem.

How to run a scraper

Right now, this requires some Python knowledge and patience. We're in the early stages: there's no automated scraper farm or fancy GUI yet. Scrapers can be run locally as needed.

  1. Install Python. Prefer a differently opinionated guide? Perhaps this is more your speed.
  2. Clone this repo.
  3. Find the scraper you wish to run. These are sorted geographically, so start by looking in /scrapers_library/....
  4. Follow the instructions in the scraper's README to get going. (If it's broken or simply out of date, please open an issue in this repo or submit a PR.)

Sharing back to the PDAP community

If you do something cool or interesting or fun with your shiny new data, share that in our Discord. Want to kick around an idea or share something that doesn't work as expected? Discord's a great place for that, too.

How to contribute

To write a scraper, start with CONTRIBUTING.md. Be sure to check out the /utils folder!

For everything else, start with docs.pdap.io.

Resources

Here are some potentially useful tools. If you want to make additions or updates, you can edit the docs in GitHub!

scrapers's People

Contributors

ayyubibrahimi, captainstabs, constantinek, csa-goose, dependabot[bot], dongately, douglaskrouth, dtoboggan, ellygaytor, ericturner3, evanhahn, evildrpurple, jlintag, josh-chamberlain, ktynski, mbodeantor, mcoberley, mcpf15, mcsaucy, michaeldepace, mitchyme, nathanmentley, nfmcclure, nfrostdev, not-new, omnituensaeternum, oscarvanl, rainmana, richardji7, thejqs

scrapers's Issues

Extraction Intake

A process which, when run, submits a scraper’s Extraction and metadata to our database.

For now, we're going to use CKAN instead of making our own API from scratch.

Key user story

As a data scraping volunteer, I should be able to run a Scraper from the Scrapers repo and submit the Extraction to PDAP.

Details

We need a place to put Extractions and their Metadata. Once the Extraction is dropped, we should link to its path in the data_intake database.

The simplest, most modern solution is probably an API endpoint.

What's in an Extraction?

The goal: a bright line between the source material and the scraped result, captured at the same point in time, with the source code included. We can publish these on the website as case studies without fear of legal trouble.

  • an extraction of "raw files", i.e. no OCR or translation
  • a metadata.json file
  • the scraper.py code itself (nice to have)
    • this could point at GitHub
    • we don't technically need this as long as we have time-stamped version history in GitHub, though that is tougher to untangle and troubleshoot, and less standalone

Visual aid

https://pdap.invisionapp.com/freehand/Data-intake-flow-Q01qjpCvN

To do

Centralize the fields.txt

Let's look at the scrapers we have and the fields they scrape, to see what we can learn from them.

Each scraper has a fields.txt.
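A rough sketch of how we might merge them (the scrapers_library root and the output file name are placeholders):

import pathlib

# Walk the repo, read every scraper's fields.txt, and merge the field names
# into one de-duplicated, sorted list we can compare across scrapers.
def collect_fields(repo_root="scrapers_library"):
    fields_by_scraper = {}
    for fields_file in pathlib.Path(repo_root).rglob("fields.txt"):
        lines = fields_file.read_text().splitlines()
        fields_by_scraper[str(fields_file.parent)] = [
            line.strip() for line in lines if line.strip()
        ]
    return fields_by_scraper

if __name__ == "__main__":
    collected = collect_fields()
    all_fields = sorted({f for fields in collected.values() for f in fields})
    pathlib.Path("all_fields.txt").write_text("\n".join(all_fields) + "\n")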

Add scrapers style requirements to readme / templates

The task:

  • Represent these requirements in the scrapers readme or template as appropriate
  • Represent them by creating an example scraper that meets the criteria

Good scrapers:

  • Scraper must be able to pick up where it left off, i.e., grab only the differences since the last run rather than doing a complete grab each time.
  • Scraper saves files to our Hadoop store.
  • Scraper saves metadata to our database (Dolt or PostgreSQL).
  • Scraper produces a SHA256 and an MD5 hash for every file it generates and records them in the database (see the sketch below).
    A separate script can be used for this; the workflow would be something like scraper > extractor > saver.
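A minimal sketch of that hashing step (where and how the digests get written to the database is left out):

import hashlib

# Compute SHA256 and MD5 digests for a scraped file so they can be recorded
# alongside its database entry; reads in chunks so large files are fine.
def hash_file(path, chunk_size=65536):
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}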

Questions:

  • Where would they save the keys?
    Keys or developer API tokens, similar to those used by GitHub or other cloud services, can be stored in the individual scraper's config file.

  • Does the script have to generate its own key?
    We generate them on the server and assign them to scrapers.

  • Do all the scrapers just use a common key that is located on the scraping server?
    Each scraper will have its own.

Scraper testing pages

If we could have a subpage to test the scrapers on, that'd be great: basically two separate pages, each serving a PDF with the same name but different data.

Open Data Network data source scraper

The task

This is a list of potential data sources. (here it is in our data sources db)

Write a scraper which can collect information about these Data Sources and put them in a CSV, ready for upload to our Data Sources database.

We'll need a unique ID of some kind to check for duplicates when we run this again; maybe source_url?

Resources

  • Use the Data Sources data dictionary to see which properties we might like to know about each of these.
    • most important are submitted_name, record_type, agency_described, source_url
    • there may be others which are easy to grab and super helpful, like data_portal_type and readme_url
  • Use the Data Sources database for examples
  • This doesn't need to be automated; we can run it every once in a while.
  • This doesn't need to write to our database; uploading a CSV is pretty easy
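A rough sketch of the dedup-and-CSV part (the fetch from the Open Data Network itself is omitted since it depends on their API; the column names follow the data dictionary fields above, and the file name is a placeholder):

import csv
from pathlib import Path

# Properties from the Data Sources data dictionary we want for each source.
FIELDS = ["submitted_name", "record_type", "agency_described", "source_url"]

def append_sources(rows, out_path="data_sources.csv"):
    """Append newly scraped sources to the CSV, deduplicating on source_url."""
    out = Path(out_path)
    write_header = not out.exists() or out.stat().st_size == 0
    seen = set()
    if not write_header:
        with out.open(newline="") as f:
            seen = {row["source_url"] for row in csv.DictReader(f)}
    new_rows = [r for r in rows if r["source_url"] not in seen]
    with out.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)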

Looking to contribute

Hi, I'd like to contribute. What's the most valuable thing I could be doing? I've written a lot of scrapers in my day but I could also try submitting FOIA requests for additional data. Just looking for some initial direction.

GUI crashes when creating a pdf v3

Error:

\ScraperSetup.py", line 773, in create_button_pressed
    for i in range(len(lines_to_change)):
TypeError: object of type 'NoneType' has no len()
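Without having dug into ScraperSetup.py, the traceback suggests lines_to_change can come back as None; a guard along these lines (names taken from the traceback, everything else hypothetical) would at least stop the crash while the root cause is tracked down:

# Sketch of a defensive fix inside create_button_pressed: treat a missing
# lines_to_change as "nothing to change" instead of calling len() on None.
def create_button_pressed_safe(lines_to_change):
    if lines_to_change is None:
        lines_to_change = []
    for i in range(len(lines_to_change)):
        ...  # existing per-line update logic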

Reorganize all the existing scrapers

We want to keep the $STATE/$COUNTY/$RECORD_TYPE layout that was proposed, but we also need a top-level directory name so we don't have one directory per state at the top level of the repo. Maybe sources?

Austin, TX Scrapers

Hi all 👋
I'd like to add my logic, dependencies file, and docs for Austin, Texas citation and arrest data scrapers.
I wasn't sure whether you want me to push to a separate branch (for which I think I need access) or directly to master 😬

Let me know if you have any questions :)

Create Extraction Metadata when scraped data is submitted

Related to #80, #173

Tasks

  • the URL of the archive should be included in the Extraction's metadata
  • metadata should be generated in the sample format below

General purpose

This is a Python module called something like extraction_metadata.py in /common which generates metadata on the fly by using the DoltHub API to get the most up-to-date information about the scraper at the time it's run.

Pinging the DoltHub API

Because scrapers and datasets change constantly, this should be done on the fly.

The python3 snippet below gets all the agencies. We should still write a more targeted query that just substitutes in the dataset ID.

import requests

# Query the public DoltHub SQL API for every row in the `agencies` table.
url = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master?q=SELECT%20*%20FROM%20%60agencies%60"
response = requests.get(url)
data = response.json()
print(data)

Sample metadata

{
    "agency": {
        "agency_id": "73e93439e6bf4ffc8b3f931a86fa3ad0",
        "agency_name": "Clanton Police Department",
        "agency_coords": {"lat": "32.83853", "lng": "-86.62936"},
        "agency_type": 4,
        "city": "Clanton",
        "state": "AL",
        "zip": "35045",
        "county_fips": "01021"
    },
    "dataset": {
        "dataset_id": "5740697099a311ebab258c8590d4a7fc",
        "url": "https://cityprotect.com/agency/540048e6ee664a6f88ae0ceb93717e50",
        "full_data_location": "data/cityprotect",
        "source_type": 3,
        "data_type": 10,
        "format_type": 2
    },
    "extraction": {
        "extraction_start": DATETIME,
        "extraction_finish": DATETIME,
        "dataset_archive": URL
    }
}
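A rough sketch of what the extraction_metadata.py module could look like (the DoltHub table and column names, and the shape of the API response, are assumptions to verify against the real schema):

import json
from datetime import datetime, timezone

import requests

DOLTHUB_API = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master"

def fetch_one(query):
    """Run a read-only SQL query against the DoltHub API and return the first row."""
    response = requests.get(DOLTHUB_API, params={"q": query})
    response.raise_for_status()
    rows = response.json().get("rows", [])
    return rows[0] if rows else {}

def build_metadata(dataset_id, started, finished, archive_url):
    # Table/column names below are guesses based on the sample metadata above.
    dataset = fetch_one(f"SELECT * FROM `datasets` WHERE id = '{dataset_id}'")
    agency = fetch_one(
        f"SELECT * FROM `agencies` WHERE id = '{dataset.get('agency_id', '')}'"
    )
    return {
        "agency": agency,
        "dataset": dataset,
        "extraction": {
            "extraction_start": started.isoformat(),
            "extraction_finish": finished.isoformat(),
            "dataset_archive": archive_url,
        },
    }

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(json.dumps(
        build_metadata("5740697099a311ebab258c8590d4a7fc", now, now,
                       "https://web.archive.org/..."),
        indent=2))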

Collected Fields readme

Regarding PR #58

I personally found having all of the Fields listed to be beneficial to my work writing scrapers. If you do not wish to add them back to the template readme, please add them to some documentation elsewhere.

It may also be beneficial to tell scraper authors to mention what the field is called within their data, similar to what I have done with the Pomona readme, or to comment if they are unsure of the meaning, like I did with Butte's readme.

If you all want to keep it as is, so be it, but I won't be using that format.

Other potential changes:

  • Add a section for scraper authors to add fields that are not listed, so that they are known and can be added to the Dolt database (if enough departments use them, and if the data is acceptable, of course).
  • Further explanation of what is expected under How to locate the data source. In all of my previous scrapers I have just guessed at what is meant by it, and they currently do not follow any format.
  • Either make Time period of data its own category, or combine it with Data Refresh Rate.

Potential fields:

  • BookingNum
  • BookingDate
  • WarrantNum
  • BailAmount
  • SearchIncident

Sorry if this came across harshly; I'm just upset that it took me this long to figure out how to word it ;)

SB1421 Use of Force scrapers for CA agencies

Context

A researcher/journalist made this data request:

We are looking to scrape records posted by California agencies under SB 1421 and SB 16, including pdfs, audio, video and other files

To do

  • Check this table for a data source without a scraper_url
  • Write a scraper
    • This is a nice example
    • Running the scraper locally should cause any files at the page to be downloaded to the local directory.
  • Include a README, and see other guidelines in our contributing guide.
  • Comment on this issue if you're working on it or have a submission, and we'll add a scraper_url to each source as we complete scrapers
  • Bonus: once we have several scrapers, a very simple utility to run multiple agencies at once would be cool.

Calls for service scraper experiment

Latest

We're likely going to do an experiment with GitHub Actions scraping into a GitHub repo. I'm still waiting to check in with the original data requestors to see how they plan to use it.

Open questions:

  • where does the data go?
  • who does maintenance?
  • how big is the data?

Background:

https://discord.com/channels/828274060034965575/1034159909635358782

What to scrape:

Scraper hosting options

  • github actions
  • digitalocean
  • aws lambda / ec2
  • jacob to present guidance on this

Data storage options

Create Archive snapshot of dataset url when Scrapers are run

This should ping the Internet Archive with a request to archive the site at the time the scraper runs.

From Archive-It:

While we would love to have y’all as an Archive-It partner, I think this specific request may be better suited for our Wayback Machine’s "Save Page Now" (SPN) functionality. I’ve found a few resources on SPN API integrations that might fit your needs. Here is the standard API info page: https://archive.org/help/wayback_api.php. Here is our developer wiki: https://archive.readme.io/docs/overview. I also found this resource for a python wrapper for SPN: https://github.com/palewire/savepagenow.

Please let me know if you find something here that works for you so I can share it with the team and anyone else who may have a similar request in the future! If not, I can reach out to some of my colleagues in our patron services division to see if they have other suggestions, or simply connect you with someone.
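A minimal sketch using the palewire/savepagenow wrapper linked above (pip install savepagenow); whether reusing a very recent snapshot is acceptable is still an open question:

import savepagenow

# Ask the Wayback Machine to snapshot the dataset URL so the Extraction's
# metadata can point at an archived copy of the source at scrape time.
def archive_dataset_url(dataset_url):
    # capture_or_cache returns (archive_url, freshly_captured); if the page was
    # archived very recently, the existing snapshot is reused.
    archive_url, freshly_captured = savepagenow.capture_or_cache(dataset_url)
    return archive_url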

Bay County Docker container is broken after changes to folder structure

The changes in #24, which created a standardised captcha solver interface and moved the existing captcha solver into a 'common' directory at the project root, broke the Bay County Docker container.

Upon starting the container it says:

root@81778276cb3c:/scraper# python3 Scraper.py
Traceback (most recent call last):
  File "Scraper.py", line 14, in <module>
    from common.captcha.benchmark.BenchmarkAdditionSolver import CaptchaSolver
ModuleNotFoundError: No module named 'common'

I think this is because the container sets its app root to the Bay County/Scraper folder, which would explain why the error occurs at the import.

The container needs to be updated to work with the current project structure.
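As a stopgap while the Dockerfile gets fixed, and assuming the common/ directory is actually copied into the image, putting the repo root on sys.path at the top of Scraper.py would let the import resolve:

import sys
from pathlib import Path

# Assumes the Bay County scraper lives one directory below the repo root;
# adjust parents[...] if the layout inside the container differs.
REPO_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(REPO_ROOT))

from common.captcha.benchmark.BenchmarkAdditionSolver import CaptchaSolver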

Allow on-demand Scraper usage

The end goal: a magic button for any Scraper that says something like Run Scraper Locally. When this button is clicked, the user needs to do as little as possible for the scraper to run and give them an Extraction. This lets a user both donate compute time to PDAP and run scrapers for their own benefit.

If we have a Scraper written for a Data Source, and we've created an Archive of the Data Source, we should allow people to run that Scraper locally on demand. They will use their own compute power.

Can we write a package or plugin that lets anyone run our scrapers in-browser?

This would be achieved by adding things to the existing PDAP-app repo and probably deploying it to app.pdap.io or a local version.

This may be some kind of Dockerfile.

The package should include all necessary dependencies.

It could include a local version of data sources search

Users should be able to "Run Scraper Locally" on any Dataset they find that has a Scraper.

The Extractions should be saved locally.

CKAN submission module for scrapers

Task:

  • Make a python module to be called at the end of a scraper.py file that takes the output of a scraper and submits it to our CKAN instance. This can be a pretty informal experiment.
  • we shouldn't submit the Extraction without a successful archive (see #180)
  • must also submit metadata.py (#154)

CKAN demo environment:

https://demo.dev.datopian.com/organization/pdap-io

CKAN API info:

https://github.com/ckan/ckanapi
https://github.com/ckan/ckanapi#ckanapi-python-module
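A rough sketch of the module using ckanapi (the organization name comes from the demo URL above; the API key handling, dataset naming, and metadata mapping are placeholders for the experiment):

import json
from ckanapi import RemoteCKAN

CKAN_URL = "https://demo.dev.datopian.com"
API_KEY = "..."  # issued per volunteer/scraper; never commit a real key

def submit_extraction(package_name, files, metadata):
    """Create a CKAN dataset for this Extraction and upload each file to it."""
    ckan = RemoteCKAN(CKAN_URL, apikey=API_KEY)
    package = ckan.action.package_create(
        name=package_name,
        owner_org="pdap-io",
        notes=json.dumps(metadata),
    )
    for path in files:
        with open(path, "rb") as f:
            ckan.action.resource_create(
                package_id=package["id"], name=path, upload=f
            )
    return package["id"]

Per the task list above, a scraper should only call this after the archive step has succeeded.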

Index of scrapers

The idea

This is a top-level file in the repo (a markdown file or CSV is probably most readable) which we update automatically with GitHub Actions. It's to help people answer the question, "what scrapers are in this repo, anyway?" without looking through all the folders.

Eventually, we may outgrow this single-file directory or need fancier tools. For now, this should be fine.

What's in the index

  • A row for each scraper
    We could populate this from Data Sources which have a scraper_url

These properties

  • scraper_url
  • agency_described
  • jurisdiction (state, county, municipality)
  • record_type

We can link to a more detailed public Airtable / DB view for people who want to do a more specific search.

Consider: one group for "in this repo" and another group for "not in this repo"
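A rough sketch of the generator a GitHub Action could run (assumes a CSV export of Data Sources with the columns listed above; file names are placeholders):

import csv
from pathlib import Path

COLUMNS = ["scraper_url", "agency_described", "jurisdiction", "record_type"]

def build_index(sources_csv="data_sources.csv", out_path="SCRAPERS_INDEX.md"):
    """Rebuild the index file from Data Sources rows that have a scraper_url."""
    with open(sources_csv, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r.get("scraper_url")]
    lines = ["| " + " | ".join(COLUMNS) + " |",
             "| " + " | ".join("---" for _ in COLUMNS) + " |"]
    for row in sorted(rows, key=lambda r: r.get("agency_described", "")):
        lines.append("| " + " | ".join(row.get(c, "") for c in COLUMNS) + " |")
    Path(out_path).write_text("\n".join(lines) + "\n")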

Develop a more scalable way to parse fields from tables for Benchmark portals

Tightly scoping this to the Benchmark scraper (Python) for now, but we could probably apply similar logic elsewhere.

Presently, parsing fields pertaining to charges from tables is lengthy and somewhat brittle (for example, not all portals expose the same fields). The content resembles the charges table shown in the original screenshot (image omitted).

One can see what the raw data looks like by navigating to case 20000001CFMA at https://court.baycoclerk.com/BenchmarkWeb2/Home.aspx/Search.

The feature request is to develop code that parses that table and exposes it as a list of dicts (where keys are derived from the thead values). For the table in that screenshot, it would result in:

[
  {
    "count": "1",
    "description": "DRIVING WHILE LICENSE SUSPENDED OR REVOKED (32234 2a)",
    "level": "M",
    "degree": "S",
    "plea": "",
    "disposition": "TRANSFERRED TO ANOTHER COURT",
    "disposition date": "02/17/2020"
  }
]

From there, we can more easily translate this to our data model.
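A sketch of that parsing step with BeautifulSoup (the exact markup of Benchmark's charges table is assumed, so the selectors will need checking against the real page):

from bs4 import BeautifulSoup

def parse_charges_table(table_html):
    """Return the charges table as a list of dicts keyed by the thead values."""
    soup = BeautifulSoup(table_html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True).lower()
               for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return rows

Missing columns on a given portal would then just show up as absent keys rather than breaking positional indexing.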

Extraction for Pittsburgh datasets

Request

We have a user in Pittsburgh looking for data on K-9 use and training. We should extract all the Pittsburgh police datasets and comb through them for potential K-9 applicable data.

Datasets

https://apps.pittsburghpa.gov/redtail/images/12507_Public_Datasets__Dashboards__and_Annual_Report_Links.pdf

https://pittsburghpa.gov/mayor/ctfpr

What is required?

  • Identify applicable datasets from the above links.
  • Add these datasets to the Datasets Repo (completed here).
  • Submit one or more Scrapers to the Scrapers repo, and get them approved.
  • Run the Scrapers locally and link to the Extracted data in this issue.
  • Submit the Extractions and metadata to the CKAN API (#173)

What's in an Extraction?

Details here

(Don't Fear the) Repo overhaul

Scrapers repo

These serve as examples of different ways to access data. They're also individually useful.

Problems with the current repo

  • too intimidating
  • too complex
    • we're trying to systematize everything; scrapers should be standalone
  • people aren't sure how to contribute
    • people don't know what the code is for
    • there's no good example of how the code can be used

How people use the repo

  • find a way to help
  • run scrapers locally
    • find them geographically
    • sorted by language
  • people can run utilities for writing scrapers
    • common files
  • serve as a library of other scrapers

To do

Readme changes

  • #212
  • add a visual representation of how this relates to our other work (josh)
    • scrapers outside this repo
    • investigative process
    • data sources
    • work being done with data

Issue adjustment

Structure changes

  • explain the repo's new structure in the README
  • reorganize according to the tree below
    • one top-level directory for all utilities and common scripts
    • organized geographically—the primary way for finding a scraper will not be through these directories, but by searching for Data Sources with a scraper_url present (so we can use record_type, location, any other Data Source property) or via a scraper index file: #196
    • one major reason for this is that the "common" scripts are all over the place, so for a new user it's incredibly difficult to figure out how they relate.
Paths like these illustrate the problem:

setup_gui/Base_Scripts/Scrapers/crimegraphics/crimegraphics_bulletin.py
common/base_scrapers/crimegraphics/crimegraphics_bulletin.py
Base_Scripts/Scrapers/crimegraphics/crimegraphics_bulletin.py
CODE_OF_CONDUCT.md
CONTRIBUTING.md
LICENSE.md
README.md
requirements.txt
examples_templates/
   -- scraper_template/
     -- README.md
     -- scraper.py
   -- scraper_example_1/
      -- README.md
      -- scraper.py
   -- etc
scrapers/
    -- data_portals/
        -- cityprotect/
        -- crimegraphics/
             -- README.md
             -- crimegraphics.py
    -- federal/
    -- AR/
    -- CA/
    -- FL/
        -- scraper/
        -- county/
            -- scraper/
                -- scraper.py
                -- README.md
            -- municipality/
                -- scraper/
                    -- scraper.py
                    -- README.md
    -- etc
utils/
  -- meta/
    -- all_fields_extractor/
    -- etc
  -- setup_gui/
  -- etc

Related work

#196

Make a place (Hadoop) to store non-csv data

We need to keep our own audit trail. We'd probably just want the ETL library to drop the file on the file system while we store a record in the database about which scraper it came from and when it ran (and who ran it?).

Add City, Zip, FIPS, Lat & Lng to Agencies table

Goal

We will have to manually go through and add the relevant information to each agency. Once this is all done, it will give front-end devs a way to accurately place a pin for every agency in the US!

This issue can be closed once all agencies have the appropriate data—until then, we will need a lot of contributions.

Process:

  • Reference our dolt docs for information / context
  • Fork and clone the datasets repo
  • Navigate into the cloned directory and start a dolt sql-server (docs here)
  • Use a GUI tool like DBeaver or TablePlus for ease of data entry and connect to your localhost server
  • Pick a state; I started with Alaska (AK).
  • Copy the name of the agency and google it to get the address of the main office/district.
  • Copy the address into Google Maps, then copy the city and 5-digit ZIP code into the table.
  • Right-click the agency on the map to display the coordinates and copy those into lat and lng (Google copies both, fyi).
  • Use the coordinates to get the right county_fips (a sketch of this lookup follows the list).
  • Repeat for the rest of the agencies in that state, then commit and create a PR in DoltHub!
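For the county_fips step, a small helper against the FCC Block API could save some clicking (the endpoint and response shape should be double-checked; treat this as a sketch):

import requests

def county_fips(lat, lng):
    """Look up the 5-digit county FIPS code for a lat/lng pair."""
    resp = requests.get(
        "https://geo.fcc.gov/api/census/block/find",
        params={"latitude": lat, "longitude": lng, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["County"]["FIPS"]

# e.g. county_fips(32.83853, -86.62936) should return "01021" (Clanton, AL)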

Update 5/22

There is now a tool here (thank you @ncpierson) which scrapes Geonames for this info.
