dismantl / caseharvester

AWS-based application for scraping the Maryland Judiciary Case Search

License: GNU General Public License v3.0

Languages: Python 97.70%, Go 1.54%, Makefile 0.71%, Mako 0.04%, Dockerfile 0.02%

Topics: scraping

caseharvester's People

Contributors

dismantl, varungujarathi9


caseharvester's Issues

Redact defendant PII

Defendant personally-identifiable information (PII) fields (name, address, DOB) should be pulled out into a separate table and referenced in the various case tables using the ID key from the PII table. This will need to be done for each type of case (DSCR, DSK8, DSCIVIL, CC, ODYTRAF, etc), though the PII can all be consolidated into a single table. Then a new read-only database user should be created that can read from all tables EXCEPT the PII table, which we can give to partners that don't have a compelling need to see the PII.
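
A minimal sketch of what the schema change might look like with SQLAlchemy, which fits the project's Python stack; the table and column names here are illustrative assumptions, not the actual Case Harvester schema:

```python
# Illustrative sketch only: table/column names are assumptions, not the
# actual Case Harvester schema.
from sqlalchemy import Column, Date, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DefendantPII(Base):
    """Single consolidated PII table shared by every case type."""
    __tablename__ = 'defendant_pii'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    address = Column(String)
    date_of_birth = Column(Date)

class DSCRDefendant(Base):
    """Example per-case-type defendant row; PII lives behind the FK."""
    __tablename__ = 'dscr_defendants'
    id = Column(Integer, primary_key=True)
    case_number = Column(String, index=True)
    pii_id = Column(Integer, ForeignKey('defendant_pii.id'))
```

The restricted role would then be granted SELECT on every table except defendant_pii.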

Evaluate scaling based on MJCS load testing

Since both the spider and scraper components make requests to the MJCS server, they are limited in their concurrency so as not to overload MJCS and make it unavailable to others. By default, both components are limited to 10 concurrent connections to MJCS. These defaults are arbitrary safe values; it would be better to determine safe concurrency levels based on load testing MJCS.

For each search of MJCS, the server reports the number of seconds it took to process the request:
[Screenshot: MJCS search response showing the reported processing time in seconds]

If we periodically executed the same search on MJCS, tracked how many seconds it took, and graphed that against Case Harvester's concurrency as we scaled it up and down, we could see the point at which added concurrency actually starts to affect MJCS server performance.
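
A rough sketch of such a probe, assuming the server-reported time can be pattern-matched out of the response; the endpoint, search parameters, and regex are placeholders, and get_concurrency stands in for however the current concurrency is read:

```python
import csv
import re
import time

import requests

SEARCH_URL = 'https://example.invalid/mjcs/search'  # placeholder for the real MJCS endpoint
PROBE_PARAMS = {'lastName': 'SMITH'}                # fixed benchmark search

def probe_once(session: requests.Session) -> float:
    """Run the benchmark search and return the server-reported seconds.

    The regex is a stand-in for however the timing figure actually
    appears in the MJCS response.
    """
    resp = session.post(SEARCH_URL, data=PROBE_PARAMS, timeout=60)
    match = re.search(r'([\d.]+)\s+seconds', resp.text)
    return float(match.group(1)) if match else float('nan')

def run_probe(interval_secs: int, get_concurrency, outfile='mjcs_probe.csv'):
    """Periodically log (timestamp, current concurrency, server seconds)
    so the two series can be graphed against each other."""
    session = requests.Session()
    with open(outfile, 'a', newline='') as f:
        writer = csv.writer(f)
        while True:
            writer.writerow([time.time(), get_concurrency(), probe_once(session)])
            f.flush()
            time.sleep(interval_secs)
```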

Report performance statistics to AWS CloudWatch

The spider and scraper components currently record several metrics that can be used to evaluate performance. These metrics should be sent to CloudWatch so they can be graphed and monitored in a dashboard (see the reporting sketch after the lists below).

For each spider search item, the following metrics are recorded:

  • Number of returned results
  • Timestamp when search item was queried from MJCS
  • Number of seconds for the query to return results

For each run of the spider, the following metrics are recorded:

  • Start and end date
  • Search criteria
  • Duration of run
  • Number of queue items still active
  • Number of queue items finished
  • Number of new cases added to database
  • Number of queries submitted to MJCS

For each scrape, the following metrics are recorded:

  • Timestamp of scrape
  • Number of seconds to complete scrape
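
A minimal sketch of how a scrape's metrics could be pushed with boto3's put_metric_data; the namespace, metric name, and dimension are assumptions, not settled choices:

```python
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def report_scrape(duration_seconds: float) -> None:
    """Push the per-scrape timestamp and duration to CloudWatch.

    Namespace/metric/dimension names here are illustrative only.
    """
    cloudwatch.put_metric_data(
        Namespace='CaseHarvester',
        MetricData=[{
            'MetricName': 'ScrapeDuration',
            'Dimensions': [{'Name': 'Component', 'Value': 'scraper'}],
            'Timestamp': datetime.now(timezone.utc),
            'Value': duration_seconds,
            'Unit': 'Seconds',
        }],
    )
```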

Create glossary of field names and their meanings

To help users of the data understand what it means, especially the legal jargon, it would be helpful to have a glossary of all the field names from the various case types along with their descriptions. For version control purposes I think this should live in a JSON file. Alternatively, or in addition, the glossary could be added to the database itself using COMMENT ON COLUMN.

I don't have the legal background for this, so help would be very much appreciated.
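
For example, a glossary.json keyed by table and column could be mirrored into the database with COMMENT ON COLUMN; this sketch assumes psycopg2 and an invented file layout:

```python
import json

import psycopg2

# Hypothetical glossary.json layout:
# {"dscr_charges": {"probable_cause": "Whether the court found ..."}}

def apply_glossary(conn, path='glossary.json'):
    """Mirror the version-controlled glossary into Postgres column comments."""
    with open(path) as f:
        glossary = json.load(f)
    with conn.cursor() as cur:
        for table, columns in glossary.items():
            for column, description in columns.items():
                # Table/column names come from our own versioned file,
                # so plain interpolation is acceptable here.
                cur.execute(
                    f'COMMENT ON COLUMN {table}.{column} IS %s',
                    (description,),
                )
    conn.commit()
```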

Add unit tests

Case Harvester currently doesn't have any tests. It would be great to add some basic unit tests using the Pytest framework. Once some tests are in place, they can be run with make test.
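
A seed test might look something like this; the parse helper and its import path are hypothetical stand-ins, just to show the shape of a first test:

```python
# tests/test_parser.py -- illustrative only; parse_case_number and its
# import path are hypothetical stand-ins for real Case Harvester helpers.
import pytest

from mjcs.parser import parse_case_number

def test_parse_case_number_strips_whitespace():
    assert parse_case_number(' 0B02294344 ') == '0B02294344'

def test_parse_case_number_rejects_empty_input():
    with pytest.raises(ValueError):
        parse_case_number('')
```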

Implement parsers for all case formats

The following checklist indicates the MJCS case formats we currently have parsers for:

  • PGV - Prince George's County Circuit Court Civil Cases
  • MCCR - Montgomery County Criminal Cases
  • ODYTRAF - MDEC Traffic Cases
  • ODYCIVIL - MDEC Civil Cases
  • ODYCRIM - MDEC Criminal Cases
  • DSK8 - Baltimore City Criminal Cases
  • DSCP - District Court Civil Citations
  • DSCIVIL - District Court Civil Cases
  • MCCI - Montgomery County Civil Cases
  • CC - Circuit Court Civil Cases
  • DSTRAF - District Court Traffic Cases
  • K - Circuit Court Criminal Cases
  • PG - Prince George's County Circuit Court Criminal Cases
  • DSCR - District Court Criminal Cases
  • DV - Domestic Violence Cases
  • ODYCVCIT - MDEC Civil Citations
  • ODYCOA - Court of Appeals
  • ODYCOSA - Court of Special Appeals

Reduce spider pressure

I'm having good results with a see-saw approach that alternates between first and last name, and with assuming each case has a defendant/respondent. It seems to reduce clutter in the results a lot. Any interest in a PR?

Create spider Lambda function

Currently the spider component is only run from the command line. We need a Lambda function for the spider that can be triggered:

  1. By a CloudWatch rule on a schedule (e.g. daily or weekly), cron-style.
  2. Manually, by sending an SNS message.

Each method should include parameters that set the search criteria (e.g. a specific county and time range). For example, the weekly run could search for cases over the last month, while the daily run could only search for cases within the last week.
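
A sketch of a handler covering both triggers; spider_run, its import path, and the criteria keys are assumptions about the spider's interface:

```python
import json

from mjcs.spider import spider_run  # hypothetical entry point

def lambda_handler(event, context):
    """Spider Lambda entry point (sketch).

    SNS deliveries arrive under event['Records']; a scheduled CloudWatch
    rule can pass its search criteria directly as constant JSON input.
    """
    if 'Records' in event:
        criteria = json.loads(event['Records'][0]['Sns']['Message'])
    else:
        criteria = event
    spider_run(
        county=criteria.get('county'),
        start_date=criteria.get('start_date'),
        end_date=criteria.get('end_date'),
    )
```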

Add spider capability to search based on case number patterns

Right now the Case Harvester spider finds new cases by iteratively searching last names, but we haven't yet used case number patterns to fill in the gaps: cases the spider's search algorithm has missed. For example, if Baltimore County district criminal cases always use case numbers in the format BC-123-XXXXX, we could exhaustively search for all case numbers in that range.

I think the best way to do this would be to create a JSON file that maps the various courts and case types to their case number patterns/ranges, which Case Harvester could read in and use to look for new cases.
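
One possible shape for that file, with a small generator that walks every number each pattern covers; the template and range syntax are proposals, not settled:

```python
import json

# Proposed case_number_patterns.json layout (illustrative values):
# {
#   "baltimore_county_district_criminal": {
#     "template": "BC-123-{:05d}",
#     "range": [0, 99999]
#   }
# }

def candidate_case_numbers(path='case_number_patterns.json'):
    """Yield (court, case_number) for every number each pattern covers."""
    with open(path) as f:
        patterns = json.load(f)
    for court, spec in patterns.items():
        lo, hi = spec['range']
        for n in range(lo, hi + 1):
            yield court, spec['template'].format(n)
```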

Infrastructure enhancements

  • Add an automated off-site backup of the database, along with a backup schedule.
  • Add an RDS read-only replica, which will be used for outside partners and researchers to access the database.
  • Move the Lambda functions inside the database's VPC, for better performance. This will require a NAT gateway, which incurs additional costs.

Update documentation

The README documentation needs to be updated to reflect the following changes:

  • The spider component is now run in a Docker container using AWS Elastic Container Service (ECS). The description and diagrams in the README should be updated to reflect this.
  • Both the spider and scraper components are run automatically at multiple intervals using cron-like scheduled tasks.

Monitor MJCS notice page and report on scheduled downtimes

The [MJCS notice page](https://mdcourts.gov/casesearch2/notice) reports scheduled downtimes. It would be great to have a periodic script (like a Lambda function on a schedule) that monitors the notice page for changes and emails them to someone. This would let us make sure Case Harvester isn't running while MJCS is unavailable.
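
A sketch of such a monitor, assuming SES for email and leaving aside where the previous hash is persisted (S3 or SSM would both work); the email addresses are placeholders:

```python
import hashlib

import boto3
import requests

NOTICE_URL = 'https://mdcourts.gov/casesearch2/notice'

def check_notice_page(previous_hash: str) -> str:
    """Fetch the notice page; email the new contents if it changed.

    Returns the current hash so the caller can persist it for next run.
    """
    body = requests.get(NOTICE_URL, timeout=30).text
    current_hash = hashlib.sha256(body.encode()).hexdigest()
    if current_hash != previous_hash:
        boto3.client('ses').send_email(
            Source='alerts@example.org',                       # placeholder
            Destination={'ToAddresses': ['ops@example.org']},  # placeholder
            Message={
                'Subject': {'Data': 'MJCS notice page changed'},
                'Body': {'Text': {'Data': body}},
            },
        )
    return current_hash
```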

Add read-only Postgres user

This should be automatically created during database initialization, and credentials should be pulled from secrets.json, similar to the master and regular users.
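
A sketch of that initialization step, assuming psycopg2 and a 'readonly' key in secrets.json (the key name is a guess):

```python
import json

import psycopg2

def create_readonly_user(conn, secrets_path='secrets.json'):
    """Create the read-only role during database initialization."""
    with open(secrets_path) as f:
        creds = json.load(f)['readonly']  # assumed key name
    with conn.cursor() as cur:
        cur.execute(
            f"CREATE USER {creds['username']} WITH PASSWORD %s",
            (creds['password'],),
        )
        cur.execute(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA public TO {creds['username']}"
        )
    conn.commit()
```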

Create website

We need a basic web presence describing the Case Harvester project, including mission, available data, how to contribute, how to report issues or give feedback, and info for researchers and journalists who want to access our database.

Down the line we would like to add sample analyses and visualizations, and also custom search functionality.
