
deardurham / ciprs-reader


Python library for reading CIPRS PDFs

License: BSD 3-Clause "New" or "Revised" License


ciprs-reader's Introduction

CIPRS Reader


Setup and Run:

Add the PDF file to parse to the ignore/ folder, then run:

docker build -t ciprs-reader .
docker run --rm -v /$(pwd):/usr/src/app ciprs-reader python ciprs-reader.py ignore/cypress-example.pdf

Example output:

[
    {
        "General": {
            "County": "DURHAM",
            "File No": "00GR000000"
        },
        "Case Information": {
            "Case Status": "DISPOSED",
            "Offense Date": "2018-01-01T20:00:00"
        },
        "Defendant": {
            "Date of Birth/Estimated Age": "1990-01-01",
            "Name": "DOE,JON,BOJACK",
            "Race": "WHITE",
            "Sex": "MALE"
        },
        "District Court Offense Information": [
            {
                "Records": [
                    {
                        "Action": "CHARGED",
                        "Description": "SPEEDING(70 mph in a 50 mph zone)",
                        "Severity": "TRAFFIC",
                        "Law": "20-141(J1)"
                    }
                ],
                "Disposed On": "2010-01-01",
                "Disposition Method": "DISMISSAL WITHOUT LEAVE BY DA"
            }
        ],
        "Superior Court Offense Information": []
    }
]

Local Setup

Pre-requisites:

Mac

brew install --cask pdftotext

Ubuntu

sudo apt-get install -y poppler-utils
wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz \
    && tar -xvf xpdf-tools-linux-4.04.tar.gz \
    && cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin/pdftotext-4

Setup:

pip install -r requirements.txt
pip install -e .

Read CIPRS PDF:

python ciprs_reader.py ./cypress-example.pdf

Run Jupyter:

jupyter-lab

Run tests:

pytest --pylint

Code for Durham

ciprs-reader's People

Contributors

brandon-mork, copelco, dependabot[bot], dsummersl, georgehelman, himmallright, jtf621, ljmerza, myerscody, nthall, rebecca-draben, robert-w-gries, sherinv


ciprs-reader's Issues

Parsers do not catch numbered offense rows

The offense row parser is designed to capture data from offense rows that look like this:

ACTION DESCRIPTION SEVERITY LAW CODE

However, on some records they look like this:

# ACTION DESCRIPTION SEVERITY LAW

Here, # is either a zero-padded number or blank.
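A minimal sketch of a row parser that tolerates the optional leading number. It assumes the columns in the extracted text are separated by runs of two or more spaces and that a row has four data columns (matching the Action, Description, Severity and Law keys in the example output). The function name `parse_offense_row` and the "Number" key are hypothetical and are not taken from the library.

```python
import re


def parse_offense_row(line):
    """Split an offense row into columns, tolerating a leading row number.

    Assumption: columns are separated by two or more spaces.
    Returns None when the row does not have the expected four columns.
    """
    fields = re.split(r"\s{2,}", line.strip())
    number = None
    # A leading all-digit field is treated as the (zero-padded) row number.
    if fields and re.fullmatch(r"\d+", fields[0]):
        number = fields.pop(0)
    if len(fields) != 4:
        return None
    action, description, severity, law = fields
    return {
        "Number": number,
        "Action": action,
        "Description": description,
        "Severity": severity,
        "Law": law,
    }
```

A row with a blank number column simply starts with whitespace, which `strip()` removes, so both layouts go through the same code path.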

Offense date parsing fails with 00:00 time

Works:

Matched: {'value': '08/04/2004 01:00 AM'} in Offense Date/Time: 08/04/2004 01:00 AM                         • Date: 04/05/2005

Doesn't work:

No match: Offense Date/Time: 03/04/2005 00:00                            • Date: 02/16/2006
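One possible fix is to try a 12-hour format first and fall back to a 24-hour format without an AM/PM suffix. The sketch below uses only the standard library; the names `OFFENSE_DATE_FORMATS` and `parse_offense_datetime` are hypothetical, and the two formats are inferred from the two log lines above.

```python
from datetime import datetime

# Candidate formats, tried in order: 12-hour with AM/PM, then 24-hour.
OFFENSE_DATE_FORMATS = ("%m/%d/%Y %I:%M %p", "%m/%d/%Y %H:%M")


def parse_offense_datetime(value):
    """Return a datetime for the given string, or None if no format fits."""
    for fmt in OFFENSE_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
    return None
```

The value "03/04/2005 00:00" is rejected by the first format (it has no AM/PM marker, and 00 is not a valid 12-hour value) and accepted by the second.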

Extract record type

A machine-readable record type will be useful for dear-petition.

AC:

  • Add Record Type: Summary|Detail to JSON export

County names with special characters don't match

The CaseDetails parser fails to parse county names with special characters like:

No match: Case Summary for Court Case: GUILFORD-GR 15CR000000

The petition tool then groups all unmatched county names under UNKNOWN, which is a bug.

AC:

  • Update CaseDetails to match non A-Z characters. Perhaps just match anything that's not a space? E.g. (?P<county>.+)
  • Add test with special characters
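A sketch of a looser pattern, assuming the header ends with the file number as its last whitespace-delimited token. The county group is lazy so that it can contain hyphens (and spaces, for multi-word counties) while still leaving the final token for the file number. The names `CASE_SUMMARY` and `parse_case_header` are hypothetical, and the multi-word county in the usage example is illustrative rather than taken from a real record.

```python
import re

# County: anything up to the final token; file number: the final token.
CASE_SUMMARY = re.compile(
    r"Case Summary for Court Case:\s*(?P<county>.+?)\s+(?P<file_no>\S+)\s*$"
)


def parse_case_header(line):
    """Return county and file number from a case header line, or None."""
    match = CASE_SUMMARY.search(line)
    if match is None:
        return None
    return {"County": match.group("county"), "File No": match.group("file_no")}
```

Usage:

```python
parse_case_header("Case Summary for Court Case: GUILFORD-GR 15CR000000")
parse_case_header("Case Summary for Court Case: NEW HANOVER 15CR000000")
```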

Extract “additional offenses exist” language

Summary records can contain the following language:

Additional offenses exist. To see a complete breakdown, view full case detail.

If it exists, this should be noted in the JSON so dear-petition can pick it up and display a message to the user:

{
    "Additional offenses exist": "true"
}
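Detecting the phrase could be as simple as a substring check on the extracted text. The sketch below assumes the text may be wrapped across lines, so it collapses whitespace first; `has_additional_offenses` is a hypothetical helper name.

```python
ADDITIONAL_OFFENSES_TEXT = "Additional offenses exist"


def has_additional_offenses(report_text):
    """Return True if the summary mentions that additional offenses exist."""
    # Collapse runs of whitespace so a line break inside the phrase
    # does not hide it.
    normalized = " ".join(report_text.split())
    return ADDITIONAL_OFFENSES_TEXT in normalized
```

The function returns a Python boolean; whether the JSON export stores it as a boolean or as the string "true" shown above is left to the caller.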

Parsers do not catch multiple offense records.

Some CIPRS records look like this:

Offense Record 1 of 2 (Line Number 1)
...
Offense Record 2 of 2 (Line Number 2)
...

We are only capturing the charges from the first offense record and not the second.
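One way to pick up every record is to locate each "Offense Record N of M" header and slice the text between consecutive headers. This is a sketch under the assumption that the header always has the exact shape shown above; `OFFENSE_RECORD_HEADER` and `split_offense_records` are hypothetical names.

```python
import re

OFFENSE_RECORD_HEADER = re.compile(
    r"Offense Record (\d+) of (\d+) \(Line Number (\d+)\)"
)


def split_offense_records(text):
    """Return one dict per offense record header found in the text."""
    matches = list(OFFENSE_RECORD_HEADER.finditer(text))
    records = []
    for i, match in enumerate(matches):
        # The body runs up to the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append(
            {
                "index": int(match.group(1)),
                "total": int(match.group(2)),
                "line_number": int(match.group(3)),
                "body": text[match.end():end].strip(),
            }
        )
    return records
```

Each body can then be passed to the existing offense parsers separately.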

Update date parsers to return MM/DD/YYYY format

Right now the date parsers return datetime strings, which get converted to MM-DD-YYYY format for display. This results in a lot of duplicated code, because the value always needs to end up in MM/DD/YYYY (%m/%d/%Y) format. We can DRY up our code by making the parsers return that format in the first place.
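The conversion being duplicated looks roughly like the following, assuming the parsers currently emit ISO 8601 strings such as those in the example output. `to_display_date` is a hypothetical helper name.

```python
from datetime import datetime


def to_display_date(iso_value):
    """Convert an ISO 8601 date or datetime string to MM/DD/YYYY."""
    return datetime.fromisoformat(iso_value).strftime("%m/%d/%Y")
```

Returning this format directly from the parser would remove the need for callers to repeat the conversion.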

Jurisdiction on infraction offenses should be District

The CIPRS reader only recognizes CR (DISTRICT COURT) and CRS (SUPERIOR COURT) values in the CIPRS record file numbers, causing the Summary Document to put the infraction offenses in a NOT AVAILABLE jurisdiction. The infraction offenses should be put in the DISTRICT jurisdiction.
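A sketch of the lookup with an extra entry for infractions. The CR and CRS codes come from the description above; the "IF" code for infraction file numbers is an assumption and should be checked against real records. The names `JURISDICTION_BY_CASE_TYPE` and `jurisdiction_for` are hypothetical.

```python
import re

# Case-type letters in the file number mapped to a jurisdiction.
# "IF" (infraction) is an assumed code, not confirmed by the issue text.
JURISDICTION_BY_CASE_TYPE = {
    "CR": "DISTRICT",
    "CRS": "SUPERIOR",
    "IF": "DISTRICT",
}


def jurisdiction_for(file_no):
    """Return the jurisdiction for a file number such as 15CR000000."""
    match = re.match(r"\d+([A-Z]+)\d+", file_no)
    if match is None:
        return "NOT AVAILABLE"
    return JURISDICTION_BY_CASE_TYPE.get(match.group(1), "NOT AVAILABLE")
```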


For testing/demo: Can use CIPRS record of person with initials JR or MJ

Parse multiple CIPRS records from combined PDF

Attorneys manually combine CIPRS record PDFs to ease organization of their documents, so a single PDF may contain many CIPRS records.

AC

  • Split extracted PDF text on common delimiter. Potential options:
    • Maybe “Case Summary for Court”
    • Footer record number or page count
  • Run entity extraction separately on each record
  • Always return a list of CIPRS records as JSON, even though the majority will be single records
    • A continuation issue will need to be opened in dear-petition to always handle lists
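The first delimiter option above can be sketched as follows, assuming every record in the extracted text begins with the phrase "Case Summary for Court" and that the phrase does not repeat within a record. `RECORD_DELIMITER` and `split_records` are hypothetical names.

```python
import re

RECORD_DELIMITER = "Case Summary for Court"


def split_records(text):
    """Split combined PDF text into one chunk per CIPRS record.

    Always returns a list, even for a single record. Any text before
    the first delimiter is dropped.
    """
    starts = [m.start() for m in re.finditer(re.escape(RECORD_DELIMITER), text)]
    if not starts:
        # No delimiter found: treat the whole text as one record.
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Entity extraction can then run on each returned chunk independently.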
