police-data-accessibility-project / scrapers

Code relating to scraping public police data.

Home Page: https://pdap.io

License: GNU General Public License v3.0

Languages: Python 97.18%, JavaScript 2.21%, Makefile 0.06%, Dockerfile 0.55%

Topics: scraping

scrapers's Introduction

Welcome!

This is the GitHub home for web scraping at the Police Data Accessibility Project.

(What do we mean by web scraping?)

How PDAP works

This repo is part of a toolkit for people all over the country to learn about our police systems. Check out our software development roadmap and high-level technical diagram to learn more about our ecosystem.

How to run a scraper

Right now, this requires some Python knowledge and patience. We're in the early stages: there's no automated scraper farm or fancy GUI yet. Scrapers can be run locally as needed.

  1. Install Python. Prefer a differently opinionated guide? Perhaps this is more your speed.
  2. Clone this repo.
  3. Find the scraper you wish to run. These are sorted geographically, so start by looking in /scrapers_library/....
  4. Follow the instructions in the scraper's README to get going. (If it's broken or simply out of date, please open an issue in this repo or submit a PR.)

Sharing back to the PDAP community

If you do something cool or interesting or fun with your shiny new data, share that in our Discord. Want to kick around an idea or share something that doesn't work as expected? Discord's a great place for that, too.

How to contribute

To write a scraper, start with CONTRIBUTING.md. Be sure to check out the /utils folder!

For everything else, start with docs.pdap.io.

Resources

Here are some potentially useful tools. If you want to make additions or updates, you can edit the docs in GitHub!

scrapers's People

Contributors

ayyubibrahimi, captainstabs, constantinek, csa-goose, dependabot[bot], dongately, douglaskrouth, dtoboggan, ellygaytor, ericturner3, evanhahn, evildrpurple, jlintag, josh-chamberlain, ktynski, mbodeantor, mcoberley, mcpf15, mcsaucy, michaeldepace, mitchyme, nathanmentley, nfmcclure, nfrostdev, not-new, omnituensaeternum, oscarvanl, rainmana, richardji7, thejqs

scrapers's Issues

Extraction Intake

A process which, when run, submits a scraper’s Extraction and metadata to our database.

For now, we're going to use CKAN instead of making our own API from scratch.

Key user story

As a data scraping volunteer, I should be able to run a Scraper from the Scrapers repo and submit the Extraction to PDAP.

Details

We need a place to put Extractions and their Metadata. Once the Extraction is dropped, we should link to its path in the data_intake database.

The simplest, most modern solution is probably an API endpoint.

What's in an Extraction?

The goal: a bright line between the source material and the scraped result, captured at the same point in time, with the source code included. We can publish these on the website as case studies without fear of legal trouble.

  • an extraction of "raw files", i.e. no OCR or translation
  • a metadata.json file
  • the scraper.py code itself (nice to have)
    • this could point at GitHub
    • we don't technically need this as long as we have time-stamped version history in GitHub, though that is tougher to untangle and troubleshoot, and less standalone

Visual aid

https://pdap.invisionapp.com/freehand/Data-intake-flow-Q01qjpCvN

To do

Centralize the fields.txt

Let's look at the scrapers we have and the fields they scrape, to see what we can learn from them.

Each scraper has a fields.txt.
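A rough sketch of how we might merge them (the scrapers_library root and the output file name are placeholders):

import pathlib

# Walk the repo, read every scraper's fields.txt, and merge the field names
# into one de-duplicated, sorted list we can compare across scrapers.
def collect_fields(repo_root="scrapers_library"):
    fields_by_scraper = {}
    for fields_file in pathlib.Path(repo_root).rglob("fields.txt"):
        lines = fields_file.read_text().splitlines()
        fields_by_scraper[str(fields_file.parent)] = [
            line.strip() for line in lines if line.strip()
        ]
    return fields_by_scraper

if __name__ == "__main__":
    collected = collect_fields()
    all_fields = sorted({f for fields in collected.values() for f in fields})
    pathlib.Path("all_fields.txt").write_text("\n".join(all_fields) + "\n")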

Add scrapers style requirements to readme / templates

The task:

  • Represent these requirements in the scrapers readme or template as appropriate
  • Represent them by creating an example scraper that meets the criteria

Good scrapers:

  • Scraper must be able to pick up where it left off, i.e., grab only the differences since the last run rather than doing a complete grab each time.
  • Scraper saves files to our Hadoop store.
  • Scraper saves metadata to our database (Dolt or PostgreSQL).
  • Scraper produces a SHA256 and an MD5 hash for every file it generates and records them in the database (see the sketch below).
    A separate script can be used for this; the workflow would be something like scraper > extractor > saver.
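A minimal sketch of that hashing step (where and how the digests get written to the database is left out):

import hashlib

# Compute SHA256 and MD5 digests for a scraped file so they can be recorded
# alongside its database entry; reads in chunks so large files are fine.
def hash_file(path, chunk_size=65536):
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}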

Questions:

  • Where would they save the keys?
    Keys or developer API tokens, similar to those used by GitHub or other cloud services, can be stored in the individual scraper's config file.

  • Does the script have to generate its own key?
    We generate them on the server and assign them to scrapers.

  • Do all the scrapers just use a common key that is located on the scraping server?
    Each scraper will have its own.

Scraper testing pages

If we could have a subpage to test the scrapers on, that'd be great: basically two separate pages, each serving a PDF with the same name but different data.

Open Data Network data source scraper

The task

This is a list of potential data sources. (here it is in our data sources db)

Write a scraper which can collect information about these Data Sources and put them in a CSV, ready for upload to our Data Sources database.

We'll need a unique ID of some kind to check for duplicates when we run this again; maybe source_url?

Resources

  • Use the Data Sources data dictionary to see which properties we might like to know about each of these.
    • most important are submitted_name, record_type, agency_described, source_url
    • there may be others which are easy to grab and super helpful, like data_portal_type and readme_url
  • Use the Data Sources database for examples
  • This doesn't need to be automated; we can run it every once in a while.
  • This doesn't need to write to our database; uploading a CSV is pretty easy
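A rough sketch of the dedup-and-CSV part (the fetch from the Open Data Network itself is omitted since it depends on their API; the column names follow the data dictionary fields above, and the file name is a placeholder):

import csv
from pathlib import Path

# Properties from the Data Sources data dictionary we want for each source.
FIELDS = ["submitted_name", "record_type", "agency_described", "source_url"]

def append_sources(rows, out_path="data_sources.csv"):
    """Append newly scraped sources to the CSV, deduplicating on source_url."""
    out = Path(out_path)
    write_header = not out.exists() or out.stat().st_size == 0
    seen = set()
    if not write_header:
        with out.open(newline="") as f:
            seen = {row["source_url"] for row in csv.DictReader(f)}
    new_rows = [r for r in rows if r["source_url"] not in seen]
    with out.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)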

Looking to contribute

Hi, I'd like to contribute. What's the most valuable thing I could be doing? I've written a lot of scrapers in my day but I could also try submitting FOIA requests for additional data. Just looking for some initial direction.

GUI crashes when creating a pdf v3

Error:

\ScraperSetup.py", line 773, in create_button_pressed
    for i in range(len(lines_to_change)):
TypeError: object of type 'NoneType' has no len()
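Without having dug into ScraperSetup.py, the traceback suggests lines_to_change can come back as None; a guard along these lines (names taken from the traceback, everything else hypothetical) would at least stop the crash while the root cause is tracked down:

# Sketch of a defensive fix inside create_button_pressed: treat a missing
# lines_to_change as "nothing to change" instead of calling len() on None.
def create_button_pressed_safe(lines_to_change):
    if lines_to_change is None:
        lines_to_change = []
    for i in range(len(lines_to_change)):
        ...  # existing per-line update logic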

Reorganize all the existing scrapers

We want to keep the $STATE/$COUNTY/$RECORD_TYPE layout that was proposed, but we also need a top-level directory name so we don't have one directory per state at the top level of the repo. Maybe sources?

Austin, TX Scrapers

Hi all 👋
I'd like to add my logic, dependencies file, and docs for Austin, Texas citation and arrest data scrapers.
I wasn't sure whether you want me to push to a separate branch (for which I think I need access) or directly to master 😬

Let me know if you have any questions :)

Create Extraction Metadata when scraped data is submitted

Related to #80, #173

Tasks

  • the URL of the archive should be included in the Extraction's metadata
  • metadata should be generated in the sample format below

General purpose

This is a Python module called something like extraction_metadata.py in /common which generates metadata on the fly by using the DoltHub API to get the most up-to-date information about the scraper at the time it's run.

Pinging the DoltHub API

Because scrapers and datasets change constantly, this should be done on the fly.

The python3 snippet below gets all the agencies. We should still write a more targeted query that just substitutes in the dataset ID.

import requests

# Query the public DoltHub SQL API for every row in the `agencies` table.
url = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master?q=SELECT%20*%20FROM%20%60agencies%60"
response = requests.get(url)
data = response.json()
print(data)

Sample metadata

{
    "agency": {
        "agency_id": "73e93439e6bf4ffc8b3f931a86fa3ad0",
        "agency_name": "Clanton Police Department",
        "agency_coords": {"lat": "32.83853", "lng": "-86.62936"},
        "agency_type": 4,
        "city": "Clanton",
        "state": "AL",
        "zip": "35045",
        "county_fips": "01021"
    },
    "dataset": {
        "dataset_id": "5740697099a311ebab258c8590d4a7fc",
        "url": "https://cityprotect.com/agency/540048e6ee664a6f88ae0ceb93717e50",
        "full_data_location": "data/cityprotect",
        "source_type": 3,
        "data_type": 10,
        "format_type": 2
    },
    "extraction": {
        "extraction_start": DATETIME,
        "extraction_finish": DATETIME,
        "dataset_archive": URL
    }
}
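A rough sketch of what the extraction_metadata.py module could look like (the DoltHub table and column names, and the shape of the API response, are assumptions to verify against the real schema):

import json
from datetime import datetime, timezone

import requests

DOLTHUB_API = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master"

def fetch_one(query):
    """Run a read-only SQL query against the DoltHub API and return the first row."""
    response = requests.get(DOLTHUB_API, params={"q": query})
    response.raise_for_status()
    rows = response.json().get("rows", [])
    return rows[0] if rows else {}

def build_metadata(dataset_id, started, finished, archive_url):
    # Table/column names below are guesses based on the sample metadata above.
    dataset = fetch_one(f"SELECT * FROM `datasets` WHERE id = '{dataset_id}'")
    agency = fetch_one(
        f"SELECT * FROM `agencies` WHERE id = '{dataset.get('agency_id', '')}'"
    )
    return {
        "agency": agency,
        "dataset": dataset,
        "extraction": {
            "extraction_start": started.isoformat(),
            "extraction_finish": finished.isoformat(),
            "dataset_archive": archive_url,
        },
    }

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(json.dumps(
        build_metadata("5740697099a311ebab258c8590d4a7fc", now, now,
                       "https://web.archive.org/..."),
        indent=2))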

Collected Fields readme

Regarding PR #58

I personally found having all of the Fields listed to be beneficial to my work writing scrapers. If you do not wish to add them back to the template readme, please add them to some documentation elsewhere.

It may also be beneficial to tell scraper authors to mention what the field is called within their data, similar to what I have done with the Pomona readme, or to comment if they are unsure of the meaning, like I did with Butte's readme.

If you all want to keep it as is, so be it, but I won't be using that format.

Other potential changes:

  • Add a section for scraper authors to add fields that are not listed, so that they are known and can be added to the Dolt database (if enough departments use them, and if the data is acceptable, of course).
  • Further explanation of what is expected under How to locate the data source. In all of my previous scrapers I have just guessed at what is meant by it, and they currently do not follow any format.
  • Either make Time period of data its own category, or combine it with Data Refresh Rate.

Potential fields:

  • BookingNum
  • BookingDate
  • WarrantNum
  • BailAmount
  • SearchIncident

Sorry if this came across harshly; I'm just upset that it took me this long to figure out how to word it ;)

SB1421 Use of Force scrapers for CA agencies

Context

A researcher/journalist made this data request:

We are looking to scrape records posted by California agencies under SB 1421 and SB 16, including pdfs, audio, video and other files

To do

  • Check this table for a data source without a scraper_url
  • Write a scraper
    • This is a nice example
    • Running the scraper locally should cause any files at the page to be downloaded to the local directory.
  • Include a README, and see other guidelines in our contributing guide.
  • Comment on this issue if you're working on it or have a submission, and we'll add a scraper_url to each source as we complete scrapers
  • Bonus: once we have several scrapers, a very simple utility to run multiple agencies at once would be cool.

Calls for service scraper experiment

Latest

We're likely going to do an experiment with GitHub Actions scraping into a GitHub repo. I'm still waiting to check in with the original data requestors to see how they plan to use it.

Open questions:

  • where does the data go?
  • who does maintenance?
  • how big is the data?

Background:

https://discord.com/channels/828274060034965575/1034159909635358782

What to scrape:

Scraper hosting options

  • github actions
  • digitalocean
  • aws lambda / ec2
  • jacob to present guidance on this

Data storage options

Create Archive snapshot of dataset url when Scrapers are run

This should ping the Internet Archive with a request to archive the site at the time the scraper runs.

From Archive-It:

While we would love to have y’all as an Archive-It partner, I think this specific request may be better suited for our Wayback Machine’s "Save Page Now" (SPN) functionality. I’ve found a few resources on SPN API integrations that might fit your needs. Here is the standard API info page: https://archive.org/help/wayback_api.php. Here is our developer wiki: https://archive.readme.io/docs/overview. I also found this resource for a python wrapper for SPN: https://github.com/palewire/savepagenow.

Please let me know if you find something here that works for you so I can share it with the team and anyone else who may have a similar request in the future! If not, I can reach out to some of my colleagues in our patron services division to see if they have other suggestions, or simply connect you with someone.
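A minimal sketch using the palewire/savepagenow wrapper linked above (pip install savepagenow); whether reusing a very recent snapshot is acceptable is still an open question:

import savepagenow

# Ask the Wayback Machine to snapshot the dataset URL so the Extraction's
# metadata can point at an archived copy of the source at scrape time.
def archive_dataset_url(dataset_url):
    # capture_or_cache returns (archive_url, freshly_captured); if the page was
    # archived very recently, the existing snapshot is reused.
    archive_url, freshly_captured = savepagenow.capture_or_cache(dataset_url)
    return archive_url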

Bay County Docker container is broken after changes to folder structure

The changes in #24, which created a standardised captcha solver interface and moved the existing captcha solver into a 'common' directory at the project root, broke the Bay County Docker container.

Upon starting the container it says:

root@81778276cb3c:/scraper# python3 Scraper.py
Traceback (most recent call last):
  File "Scraper.py", line 14, in <module>
    from common.captcha.benchmark.BenchmarkAdditionSolver import CaptchaSolver
ModuleNotFoundError: No module named 'common'

I think this is because the container sets its app root to the Bay County/Scraper folder, which would explain why the error occurs at the import.

The container needs to be updated to work with the current project structure.
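As a stopgap while the Dockerfile gets fixed, and assuming the common/ directory is actually copied into the image, putting the repo root on sys.path at the top of Scraper.py would let the import resolve:

import sys
from pathlib import Path

# Assumes the Bay County scraper lives one directory below the repo root;
# adjust parents[...] if the layout inside the container differs.
REPO_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(REPO_ROOT))

from common.captcha.benchmark.BenchmarkAdditionSolver import CaptchaSolver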

Allow on-demand Scraper usage

The end goal: a magic button for any Scraper that says something like Run Scraper Locally. When this button is clicked, the user needs to do as little as possible for the scraper to run and give them an Extraction. This lets a user both donate compute time to PDAP and run scrapers for their own benefit.

If we have a Scraper written for a Data Source, and we've created an Archive of the Data Source, we should allow people to run that Scraper locally on demand. They will use their own compute power.

Can we write a package or plugin that lets anyone run our scrapers in-browser?

This would be achieved by adding things to the existing PDAP-app repo and probably deploying it to app.pdap.io or a local version.

This may be some kind of Dockerfile.

The package should include all necessary dependencies.

It could include a local version of data sources search

Users should be able to "Run Scraper Locally" on any Dataset they find that has a Scraper.

The Extractions should be saved locally.

CKAN submission module for scrapers

Task:

  • Make a python module to be called at the end of a scraper.py file that takes the output of a scraper and submits it to our CKAN instance. This can be a pretty informal experiment.
  • we shouldn't submit the Extraction without a successful archive (see #180)
  • must also submit metadata.py (#154)

CKAN demo environment:

https://demo.dev.datopian.com/organization/pdap-io

CKAN API info:

https://github.com/ckan/ckanapi
https://github.com/ckan/ckanapi#ckanapi-python-module
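A rough sketch of the module using ckanapi (the organization name comes from the demo URL above; the API key handling, dataset naming, and metadata mapping are placeholders for the experiment):

import json
from ckanapi import RemoteCKAN

CKAN_URL = "https://demo.dev.datopian.com"
API_KEY = "..."  # issued per volunteer/scraper; never commit a real key

def submit_extraction(package_name, files, metadata):
    """Create a CKAN dataset for this Extraction and upload each file to it."""
    ckan = RemoteCKAN(CKAN_URL, apikey=API_KEY)
    package = ckan.action.package_create(
        name=package_name,
        owner_org="pdap-io",
        notes=json.dumps(metadata),
    )
    for path in files:
        with open(path, "rb") as f:
            ckan.action.resource_create(
                package_id=package["id"], name=path, upload=f
            )
    return package["id"]

Per the task list above, a scraper should only call this after the archive step has succeeded.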

Index of scrapers

The idea

This is a top-level file in the repo (a markdown file or CSV is probably most readable) which we update automatically with GitHub Actions. It's to help people answer the question, "what scrapers are in this repo, anyway?" without looking through all the folders.

Eventually, we may outgrow this single-file directory or need fancier tools. For now, this should be fine.

What's in the index

  • A row for each scraper
    We could populate this from Data Sources which have a scraper_url

These properties

  • scraper_url
  • agency_described
  • jurisdiction (state, county, municipality)
  • record_type

We can link to a more detailed public Airtable / DB view for people who want to do a more specific search.

Consider: one group for "in this repo" and another group for "not in this repo"
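A rough sketch of the generator a GitHub Action could run (assumes a CSV export of Data Sources with the columns listed above; file names are placeholders):

import csv
from pathlib import Path

COLUMNS = ["scraper_url", "agency_described", "jurisdiction", "record_type"]

def build_index(sources_csv="data_sources.csv", out_path="SCRAPERS_INDEX.md"):
    """Rebuild the index file from Data Sources rows that have a scraper_url."""
    with open(sources_csv, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r.get("scraper_url")]
    lines = ["| " + " | ".join(COLUMNS) + " |",
             "| " + " | ".join("---" for _ in COLUMNS) + " |"]
    for row in sorted(rows, key=lambda r: r.get("agency_described", "")):
        lines.append("| " + " | ".join(row.get(c, "") for c in COLUMNS) + " |")
    Path(out_path).write_text("\n".join(lines) + "\n")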

Develop a more scalable way to parse fields from tables for Benchmark portals

Tightly scoping this to the Benchmark scraper (Python) for now, but we could probably apply similar logic elsewhere.

Presently, parsing fields pertaining to charges from tables is lengthy and somewhat brittle (for example, not all portals expose the same fields). The content resembles the charges table shown in the original screenshot (image omitted).

One can see what the raw data looks like by navigating to case 20000001CFMA at https://court.baycoclerk.com/BenchmarkWeb2/Home.aspx/Search.

The feature request is to develop code that parses that table and exposes it as a list of dicts (where keys are derived from the thead values). For the table in that screenshot, it would result in:

[
  {
    "count": "1",
    "description": "DRIVING WHILE LICENSE SUSPENDED OR REVOKED (32234 2a)",
    "level": "M",
    "degree": "S",
    "plea": "",
    "disposition": "TRANSFERRED TO ANOTHER COURT",
    "disposition date": "02/17/2020"
  }
]

From there, we can more easily translate this to our data model.
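A sketch of that parsing step with BeautifulSoup (the exact markup of Benchmark's charges table is assumed, so the selectors will need checking against the real page):

from bs4 import BeautifulSoup

def parse_charges_table(table_html):
    """Return the charges table as a list of dicts keyed by the thead values."""
    soup = BeautifulSoup(table_html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True).lower()
               for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return rows

Missing columns on a given portal would then just show up as absent keys rather than breaking positional indexing.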

Extraction for Pittsburgh datasets

Request

We have a user in Pittsburgh looking for data on K-9 use and training. We should extract all the Pittsburgh police datasets and comb through them for potential K-9 applicable data.

Datasets

https://apps.pittsburghpa.gov/redtail/images/12507_Public_Datasets__Dashboards__and_Annual_Report_Links.pdf

https://pittsburghpa.gov/mayor/ctfpr

What is required?

  • Identify applicable datasets from the above links.
  • Add these datasets to the Datasets Repo (completed here).
  • Submit one or more Scrapers to the Scrapers repo, and get them approved.
  • Run the Scrapers locally and link to the Extracted data in this issue.
  • Submit the Extractions and metadata to the CKAN API (#173)

What's in an Extraction?

Details here

(Don't Fear the) Repo overhaul

Scrapers repo

These serve as examples of different ways to access data. They're also individually useful.

Problems with the current repo

  • too intimidating
  • too complex
    • we're trying to systematize everything; scrapers should be standalone
  • people aren't sure how to contribute
    • people don't know what the code is for
    • there's no good example of how the code can be used

How people use the repo

  • find a way to help
  • run scrapers locally
    • find them geographically
    • sorted by language
  • people can run utilities for writing scrapers
    • common files
  • serve as a library of other scrapers

To do

Readme changes

  • #212
  • add a visual representation of how this relates to our other work (josh)
    • scrapers outside this repo
    • investigative process
    • data sources
    • work being done with data

Issue adjustment

Structure changes

  • explain the repo's new structure in the README
  • reorganize according to the tree below
    • one top-level directory for all utilities and common scripts
    • organized geographically—the primary way for finding a scraper will not be through these directories, but by searching for Data Sources with a scraper_url present (so we can use record_type, location, any other Data Source property) or via a scraper index file: #196
    • one major reason for this is that the "common" scripts are all over the place, so for a new user it's incredibly difficult to figure out how they relate.
Paths like these illustrate the problem:

setup_gui/Base_Scripts/Scrapers/crimegraphics/crimegraphics_bulletin.py
common/base_scrapers/crimegraphics/crimegraphics_bulletin.py
Base_Scripts/Scrapers/crimegraphics/crimegraphics_bulletin.py
CODE_OF_CONDUCT.md
CONTRIBUTING.md
LICENSE.md
README.md
requirements.txt
examples_templates/
   -- scraper_template/
     -- README.md
     -- scraper.py
   -- scraper_example_1/
      -- README.md
      -- scraper.py
   -- etc
scrapers/
    -- data_portals/
        -- cityprotect/
        -- crimegraphics/
             -- README.md
             -- crimegraphics.py
    -- federal/
    -- AR/
    -- CA/
    -- FL/
        -- scraper/
        -- county/
            -- scraper/
                -- scraper.py
                -- README.md
            -- municipality/
                -- scraper/
                    -- scraper.py
                    -- README.md
    -- etc
utils/
  -- meta/
    -- all_fields_extractor/
    -- etc
  -- setup_gui/
  -- etc

Related work

#196

Make a place (Hadoop) to store non-csv data

We need to keep our own audit trail. We'd probably just want the ETL library to drop the file on the file system while we store a record in the database about which scraper it came from and when it ran (and who ran it?).

Add City, Zip, FIPS, Lat & Lng to Agencies table

Goal

We will have to manually go through and add the relevant information to each agency. Once this is all done, it will give front-end devs a way to accurately place a pin for every agency in the US!

This issue can be closed once all agencies have the appropriate data—until then, we will need a lot of contributions.

Process:

  • Reference our dolt docs for information / context
  • Fork and clone the datasets repo
  • Navigate into the cloned directory and start a dolt sql-server (docs here)
  • Use a GUI tool like DBeaver or TablePlus for ease of data entry and connect to your localhost server
  • Pick a state; I started with Alaska (AK).
  • Copy the name of the agency and google it to get the address of the main office/district.
  • Copy the address into Google Maps, then copy the city and 5-digit ZIP code into the table.
  • Right-click the agency on the map to display the coordinates and copy those into lat and lng (Google copies both, fyi).
  • Use the coordinates to get the right county_fips (a sketch of this lookup follows the list).
  • Repeat for the rest of the agencies in that state, then commit and create a PR in DoltHub!
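For the county_fips step, a small helper against the FCC Block API could save some clicking (the endpoint and response shape should be double-checked; treat this as a sketch):

import requests

def county_fips(lat, lng):
    """Look up the 5-digit county FIPS code for a lat/lng pair."""
    resp = requests.get(
        "https://geo.fcc.gov/api/census/block/find",
        params={"latitude": lat, "longitude": lng, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["County"]["FIPS"]

# e.g. county_fips(32.83853, -86.62936) should return "01021" (Clanton, AL)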

Update 5/22

There is now a tool here (thank you @ncpierson) which scrapes Geonames for this info.
