dismantl / caseharvester

AWS-based application for scraping the Maryland Judiciary Case Search

License: GNU General Public License v3.0

Languages: Python 97.70%, Go 1.54%, Makefile 0.71%, Mako 0.04%, Dockerfile 0.02%

Topics: scraping

caseharvester's People

Contributors

dismantl, varungujarathi9


caseharvester's Issues

Redact defendant PII

Defendant personally-identifiable information (PII) fields (name, address, DOB) should be pulled out into a separate table and referenced in the various case tables using the ID key from the PII table. This will need to be done for each type of case (DSCR, DSK8, DSCIVIL, CC, ODYTRAF, etc), though the PII can all be consolidated into a single table. Then a new read-only database user should be created that can read from all tables EXCEPT the PII table, which we can give to partners that don't have a compelling need to see the PII.
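
A minimal sketch of what the schema change might look like with SQLAlchemy, which fits the project's Python stack; the table and column names here are illustrative assumptions, not the actual Case Harvester schema:

```python
# Illustrative sketch only: table/column names are assumptions, not the
# actual Case Harvester schema.
from sqlalchemy import Column, Date, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DefendantPII(Base):
    """Single consolidated PII table shared by every case type."""
    __tablename__ = 'defendant_pii'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    address = Column(String)
    date_of_birth = Column(Date)

class DSCRDefendant(Base):
    """Example per-case-type defendant row; PII lives behind the FK."""
    __tablename__ = 'dscr_defendants'
    id = Column(Integer, primary_key=True)
    case_number = Column(String, index=True)
    pii_id = Column(Integer, ForeignKey('defendant_pii.id'))
```

The restricted role would then be granted SELECT on every table except defendant_pii.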

Evaluate scaling based on MJCS load testing

Since both the spider and scraper components make requests to the MJCS server, they are limited in their concurrency so as not to overload MJCS and make it unavailable to others. By default, both components are limited to 10 concurrent connections to MJCS. These defaults are arbitrary safe values; it would be better to determine safe concurrency levels based on load testing MJCS.

For each search of MJCS, the server reports the number of seconds it took to process the request:
[Screenshot: MJCS search response showing the reported processing time in seconds]

If we periodically executed the same search on MJCS, tracked how many seconds it took, and graphed that against Case Harvester's concurrency as we scaled it up and down, we could see the point at which added concurrency actually starts to affect MJCS server performance.
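
A rough sketch of such a probe, assuming the server-reported time can be pattern-matched out of the response; the endpoint, search parameters, and regex are placeholders, and get_concurrency stands in for however the current concurrency is read:

```python
import csv
import re
import time

import requests

SEARCH_URL = 'https://example.invalid/mjcs/search'  # placeholder for the real MJCS endpoint
PROBE_PARAMS = {'lastName': 'SMITH'}                # fixed benchmark search

def probe_once(session: requests.Session) -> float:
    """Run the benchmark search and return the server-reported seconds.

    The regex is a stand-in for however the timing figure actually
    appears in the MJCS response.
    """
    resp = session.post(SEARCH_URL, data=PROBE_PARAMS, timeout=60)
    match = re.search(r'([\d.]+)\s+seconds', resp.text)
    return float(match.group(1)) if match else float('nan')

def run_probe(interval_secs: int, get_concurrency, outfile='mjcs_probe.csv'):
    """Periodically log (timestamp, current concurrency, server seconds)
    so the two series can be graphed against each other."""
    session = requests.Session()
    with open(outfile, 'a', newline='') as f:
        writer = csv.writer(f)
        while True:
            writer.writerow([time.time(), get_concurrency(), probe_once(session)])
            f.flush()
            time.sleep(interval_secs)
```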

Report performance statistics to AWS CloudWatch

The spider and scraper components currently record several metrics that can be used to evaluate performance. These metrics should be sent to CloudWatch so they can be graphed and monitored in a dashboard (see the reporting sketch after the lists below).

For each spider search item, the following metrics are recorded:

  • Number of returned results
  • Timestamp when search item was queried from MJCS
  • Number of seconds for the query to return results

For each run of the spider, the following metrics are recorded:

  • Start and end date
  • Search criteria
  • Duration of run
  • Number of queue items still active
  • Number of queue items finished
  • Number of new cases added to database
  • Number of queries submitted to MJCS

For each scrape, the following metrics are recorded:

  • Timestamp of scrape
  • Number of seconds to complete scrape
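
A minimal sketch of how a scrape's metrics could be pushed with boto3's put_metric_data; the namespace, metric name, and dimension are assumptions, not settled choices:

```python
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def report_scrape(duration_seconds: float) -> None:
    """Push the per-scrape timestamp and duration to CloudWatch.

    Namespace/metric/dimension names here are illustrative only.
    """
    cloudwatch.put_metric_data(
        Namespace='CaseHarvester',
        MetricData=[{
            'MetricName': 'ScrapeDuration',
            'Dimensions': [{'Name': 'Component', 'Value': 'scraper'}],
            'Timestamp': datetime.now(timezone.utc),
            'Value': duration_seconds,
            'Unit': 'Seconds',
        }],
    )
```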

Create glossary of field names and their meanings

To help users of the data understand what it means, especially the legal jargon, it would be helpful to have a glossary of all the field names from the various case types along with their descriptions. For version control purposes I think this should live in a JSON file. Alternatively, or in addition, the glossary could be added to the database itself using COMMENT ON COLUMN.

I don't have the legal background for this, so help would be very much appreciated.
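
For example, a glossary.json keyed by table and column could be mirrored into the database with COMMENT ON COLUMN; this sketch assumes psycopg2 and an invented file layout:

```python
import json

import psycopg2

# Hypothetical glossary.json layout:
# {"dscr_charges": {"probable_cause": "Whether the court found ..."}}

def apply_glossary(conn, path='glossary.json'):
    """Mirror the version-controlled glossary into Postgres column comments."""
    with open(path) as f:
        glossary = json.load(f)
    with conn.cursor() as cur:
        for table, columns in glossary.items():
            for column, description in columns.items():
                # Table/column names come from our own versioned file,
                # so plain interpolation is acceptable here.
                cur.execute(
                    f'COMMENT ON COLUMN {table}.{column} IS %s',
                    (description,),
                )
    conn.commit()
```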

Add unit tests

Case Harvester currently doesn't have any tests. It would be great to add some basic unit tests using the Pytest framework. Once some tests are in place, they can be run with make test.
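
A seed test might look something like this; the parse helper and its import path are hypothetical stand-ins, just to show the shape of a first test:

```python
# tests/test_parser.py -- illustrative only; parse_case_number and its
# import path are hypothetical stand-ins for real Case Harvester helpers.
import pytest

from mjcs.parser import parse_case_number

def test_parse_case_number_strips_whitespace():
    assert parse_case_number(' 0B02294344 ') == '0B02294344'

def test_parse_case_number_rejects_empty_input():
    with pytest.raises(ValueError):
        parse_case_number('')
```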

Implement parsers for all case formats

The following checklist indicates the MJCS case formats we currently have parsers for:

  • PGV - Prince George's County Circuit Court Civil Cases
  • MCCR - Montgomery County Criminal Cases
  • ODYTRAF - MDEC Traffic Cases
  • ODYCIVIL - MDEC Civil Cases
  • ODYCRIM - MDEC Criminal Cases
  • DSK8 - Baltimore City Criminal Cases
  • DSCP - District Court Civil Citations
  • DSCIVIL - District Court Civil Cases
  • MCCI - Montgomery County Civil Cases
  • CC - Circuit Court Civil Cases
  • DSTRAF - District Court Traffic Cases
  • K - Circuit Court Criminal Cases
  • PG - Prince George's County Circuit Court Criminal Cases
  • DSCR - District Court Criminal Cases
  • DV - Domestic Violence Cases
  • ODYCVCIT - MDEC Civil Citations
  • ODYCOA - Court of Appeals
  • ODYCOSA - Court of Special Appeals

Reduce spider pressure

I'm having good results with a see-saw approach that alternates between first and last name, and with assuming each case has a defendant/respondent. It seems to reduce clutter in the results a lot. Any interest in a PR?

Create spider Lambda function

Currently the spider component is only run from the command line. We need a Lambda function for the spider that can be triggered:

  1. By a CloudWatch rule on a schedule (e.g. daily or weekly), cron-style.
  2. Manually, by sending an SNS message.

Each method should include parameters that set the search criteria (e.g. a specific county and time range). For example, the weekly run could search for cases over the last month, while the daily run could only search for cases within the last week.
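
A sketch of a handler covering both triggers; spider_run, its import path, and the criteria keys are assumptions about the spider's interface:

```python
import json

from mjcs.spider import spider_run  # hypothetical entry point

def lambda_handler(event, context):
    """Spider Lambda entry point (sketch).

    SNS deliveries arrive under event['Records']; a scheduled CloudWatch
    rule can pass its search criteria directly as constant JSON input.
    """
    if 'Records' in event:
        criteria = json.loads(event['Records'][0]['Sns']['Message'])
    else:
        criteria = event
    spider_run(
        county=criteria.get('county'),
        start_date=criteria.get('start_date'),
        end_date=criteria.get('end_date'),
    )
```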

Add spider capability to search based on case number patterns

Right now the Case Harvester spider finds new cases by iteratively searching last names, but we haven't yet used case number patterns to fill in the gaps: cases the spider's search algorithm has missed. For example, if Baltimore County district criminal cases always use case numbers in the format BC-123-XXXXX, we could exhaustively search for all case numbers in that range.

I think the best way to do this would be to create a JSON file that maps the various courts and case types to their case number patterns/ranges, which Case Harvester could read in and use to look for new cases.
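
One possible shape for that file, with a small generator that walks every number each pattern covers; the template and range syntax are proposals, not settled:

```python
import json

# Proposed case_number_patterns.json layout (illustrative values):
# {
#   "baltimore_county_district_criminal": {
#     "template": "BC-123-{:05d}",
#     "range": [0, 99999]
#   }
# }

def candidate_case_numbers(path='case_number_patterns.json'):
    """Yield (court, case_number) for every number each pattern covers."""
    with open(path) as f:
        patterns = json.load(f)
    for court, spec in patterns.items():
        lo, hi = spec['range']
        for n in range(lo, hi + 1):
            yield court, spec['template'].format(n)
```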

Infrastructure enhancements

  • Add an automated off-site backup of the database, along with a backup schedule.
  • Add an RDS read-only replica, which will be used for outside partners and researchers to access the database.
  • Move the Lambda functions inside the database's VPC, for better performance. This will require a NAT gateway, which incurs additional costs.

Update documentation

The README documentation needs to be updated to reflect the following changes:

  • The spider component is now run in a Docker container using AWS Elastic Container Service (ECS). The description and diagrams in the README should be updated to reflect this.
  • Both the spider and scraper components are run automatically at multiple intervals using cron-like scheduled tasks.

Monitor MJCS notice page and report on scheduled downtimes

The [MJCS notice page](https://mdcourts.gov/casesearch2/notice) reports scheduled downtimes. It would be great to have a periodic script (like a Lambda function on a schedule) that monitors the notice page for changes and emails them to someone. This would let us make sure Case Harvester isn't running while MJCS is unavailable.
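
A sketch of such a monitor, assuming SES for email and leaving aside where the previous hash is persisted (S3 or SSM would both work); the email addresses are placeholders:

```python
import hashlib

import boto3
import requests

NOTICE_URL = 'https://mdcourts.gov/casesearch2/notice'

def check_notice_page(previous_hash: str) -> str:
    """Fetch the notice page; email the new contents if it changed.

    Returns the current hash so the caller can persist it for next run.
    """
    body = requests.get(NOTICE_URL, timeout=30).text
    current_hash = hashlib.sha256(body.encode()).hexdigest()
    if current_hash != previous_hash:
        boto3.client('ses').send_email(
            Source='alerts@example.org',                       # placeholder
            Destination={'ToAddresses': ['ops@example.org']},  # placeholder
            Message={
                'Subject': {'Data': 'MJCS notice page changed'},
                'Body': {'Text': {'Data': body}},
            },
        )
    return current_hash
```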

Add read-only Postgres user

This should be automatically created during database initialization, and credentials should be pulled from secrets.json, similar to the master and regular users.
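
A sketch of that initialization step, assuming psycopg2 and a 'readonly' key in secrets.json (the key name is a guess):

```python
import json

import psycopg2

def create_readonly_user(conn, secrets_path='secrets.json'):
    """Create the read-only role during database initialization."""
    with open(secrets_path) as f:
        creds = json.load(f)['readonly']  # assumed key name
    with conn.cursor() as cur:
        cur.execute(
            f"CREATE USER {creds['username']} WITH PASSWORD %s",
            (creds['password'],),
        )
        cur.execute(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA public TO {creds['username']}"
        )
    conn.commit()
```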

Create website

We need a basic web presence describing the Case Harvester project, including mission, available data, how to contribute, how to report issues or give feedback, and info for researchers and journalists who want to access our database.

Down the line we would like to add sample analyses and visualizations, and also custom search functionality.
