ccao-data / ptaxsim
R package for estimating Cook County property tax bills

Home Page: https://ccao-data.github.io/ptaxsim/

License: GNU Affero General Public License v3.0


ptaxsim's People

Contributors: damonamajor, dfsnow, erhla, jeancochrane, yuxinz0106

ptaxsim's Issues

Fix RPM distribution amounts

PINs within the special RPM TIF district have slightly incorrect total tax amounts, especially compared to non-TIF PINs. The RPM TIF has the following disbursement:

  1. CPS gets its normal revenue as if the RPM TIF doesn't exist
  2. 80% of the remaining TIF funds go to the RPM project
  3. 20% of the remaining TIF funds are redistributed back to the jurisdiction except CPS

The first two steps are correct, but calculating the correct proportional distribution for step 3 is challenging. We are currently using the formula: agency_tax_rate / sum(agency_tax_rate), with CPS excluded from the sum.
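The step-3 redistribution can be sketched in a few lines of R. All numbers below are invented for illustration; the real rates come from the agency data in the PTAXSIM database:

```r
library(dplyr)

# Hypothetical tax rates for the non-CPS agencies in a tax code
agency <- tibble::tibble(
  agency_name     = c("CITY OF CHICAGO", "COUNTY OF COOK", "CHICAGO PARK DISTRICT"),
  agency_tax_rate = c(0.01998, 0.00454, 0.00329)
)

# 20% of the remaining TIF increment is redistributed proportionally
# to each non-CPS agency's share of the summed non-CPS tax rates
remaining_tif <- 1000
agency <- agency %>%
  mutate(
    prop    = agency_tax_rate / sum(agency_tax_rate),
    tax_amt = 0.20 * remaining_tif * prop
  )

sum(agency$tax_amt)  # the shares exhaust the full 20%, i.e. 200
```

Because the proportions sum to 1, the redistributed amounts always exhaust exactly 20% of the increment; the open question is whether this matches the Clerk's actual allocation.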

The following script yields the example data below:

library(dplyr)
library(ptaxsim)

left_join(
  tax_bill(2019, "14081020190000", simplify = FALSE),
  ptaxsim::sample_tax_bills_detail %>%
    filter(pin == "14081020190000") %>%
    select(pin, tax, agency),
  by = c("pin", "agency")
) %>%
  select(
    agency_name, tax_amt_calced = tax_amt_final, tax_amt_real = tax
  )
# A tibble: 10 x 3
   agency_name                                tax_amt_calced tax_amt_real
   <chr>                                               <dbl>        <dbl>
 1 COUNTY OF COOK                                      606.         624. 
 2 FOREST PRESERVE DISTRICT OF COOK COUNTY              78.8         78.4
 3 CITY OF CHICAGO                                    2141.        2131. 
 4 CITY OF CHICAGO LIBRARY FUND                        162.         161. 
 5 CITY OF CHICAGO SCHOOL BLDG & IMP FUND              226.         225. 
 6 CHICAGO COMMUNITY COLLEGE DISTRICT 508              199.         198. 
 7 BOARD OF EDUCATION                                 4561.        4560. 
 8 CHICAGO PARK DISTRICT                               435.         433. 
 9 CHICAGO PARK DIST. AQUARIUM & MUSEUM BONDS            0            0  
10 METRO WATER RECLAMATION DIST OF GR CHGO             520.         517. 

Add data about assessment levels

Need to add data frame of assessment levels for different classes and years. Structure would be something like:

year  class  level
2016  205    .1
2016  599    .25

This would enable users to back out the market value from the assessed values saved within PTAXSIM.
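Given such a table, backing out market value is a single join and divide. A sketch with made-up levels and assessed values:

```r
library(dplyr)

# Hypothetical assessment levels by class and year
levels <- tibble::tibble(
  year  = c(2016, 2016),
  class = c("205", "599"),
  level = c(0.10, 0.25)
)

# Hypothetical PIN-level assessed values
pins <- tibble::tibble(
  pin   = c("01234567890000", "09876543210000"),
  year  = c(2016, 2016),
  class = c("205", "599"),
  av    = c(20000, 50000)
)

# Market value = assessed value / assessment level
pins <- pins %>%
  left_join(levels, by = c("year", "class")) %>%
  mutate(market_value = av / level)
```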

Add exception handling for SSAs

Special Service Areas (SSAs) add a fixed-rate taxing district to an area in order to fund additional services. Unlike other taxing districts, the tax rate of an SSA does not vary. In the data, the tax rate is kept fixed by varying the levy of the SSA as the tax base grows or shrinks.

For historical bills, PTAXSIM works fine as is. However, when calculating counterfactuals, SSAs can become a problem since any change in the base must be reflected in the levy in order to keep the SSA rate constant (and the counterfactual accurate). As such, we should add a warning or a helper function specifically to deal with SSAs.
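The needed adjustment is simple: hold the SSA rate fixed and recompute the levy from the counterfactual base. A sketch with invented numbers:

```r
# Hypothetical SSA with a fixed 1.5% rate
ssa_rate    <- 0.015
base_actual <- 200e6  # actual EAV of the SSA area
levy_actual <- ssa_rate * base_actual

# Counterfactual: the base shrinks by 10%, so the levy must
# shrink with it to keep the SSA rate constant
base_cf <- base_actual * 0.9
levy_cf <- ssa_rate * base_cf

levy_cf / base_cf  # equals ssa_rate: the rate is unchanged
```

A helper in PTAXSIM could apply exactly this recalculation whenever a counterfactual changes the base of a tax code containing an SSA.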

Fix miscoded agency types

The Town of Cicero (agency ID = 020060000) is miscoded as a township when it should be a municipality. We should fix this type and any others like it, and adjust the regex used to classify subtypes.

Add tax share vignette

Original issue:

Please add the ability to easily calculate each property's tax price (See this paper for an explanation of tax price). This could be achieved by adding a new tax_price() function to the package, or adding tax price or agency tax base to the tax_bill() function output.

Not sure that a separate argument/function just for tax share is warranted, but I think it would be great to add a vignette using it as a technique to decompose a tax bill (or area).
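The core arithmetic a tax share vignette would demonstrate is small: a PIN's share of an agency's levy is its share of the agency's base. A sketch with invented numbers:

```r
# Hypothetical inputs for a single PIN and agency
pin_eav    <- 50000  # PIN equalized assessed value
agency_eav <- 80e9   # agency's total tax base
levy       <- 3.2e9  # agency's levy

# Tax share: the PIN's fraction of the agency's base
tax_share <- pin_eav / agency_eav
tax_amt   <- tax_share * levy

# Equivalent to rate * EAV, since rate = levy / base
all.equal(tax_amt, (levy / agency_eav) * pin_eav)
```

Since `tax_bill()` already returns the agency-level pieces, the vignette could compute shares directly from its output rather than requiring a new function.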

Error message on unsupported years in ptaxsim::tax_bill()

I was using ptaxsim, and I tried to recover a tax bill for the year 2022:

raceway <- c("02252020080000", "02243030070000", "02251000050000", "02262010100000")
test <- tax_bill(year_vec = 2022, pin_vec = raceway)
Error in check_pin_dt_str(pin_dt) : 
  pin_dt must be in the same format as the PIN data returned by lookup_pin(). Ensure there is 1 row per PIN per year and all column names and types are the same

I think that this year just isn't available in ptaxsim. Might want to include an error message to that effect.
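A friendlier guard could check the requested years against the database's coverage before the format check fires. A sketch; the function name, bounds, and message wording are all hypothetical:

```r
# Hypothetical guard: error clearly when a year is outside DB coverage
check_year_range <- function(year_vec, min_year = 2006, max_year = 2021) {
  bad <- year_vec[year_vec < min_year | year_vec > max_year]
  if (length(bad) > 0) {
    stop(
      "Tax year(s) ", paste(bad, collapse = ", "),
      " not found in the PTAXSIM database. ",
      "Available years: ", min_year, "-", max_year
    )
  }
  invisible(year_vec)
}
```

With this in place, `check_year_range(2022)` would fail with a message naming the missing year instead of the opaque `pin_dt` format error.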

Incentive property classes & their assessment rates

Is there a data source that has the assessment rate used for incentive properties in each tax year?

Incentive properties' assessment rates change over time, so backing out their market value is not easily or accurately done. The assessed value and EAV are included in PTAXSIM, but without the assessment rate for these property classes in a given year, can the market value be calculated?

A CSV file of PIN-level data for properties (PINs or parcels, either would be great) with property classes 600 and higher, along with the tax year and assessment rate, would be appreciated!

Double check 2022 exemptions data

There seems to be some potential wonkiness in the 2022 exemptions data. Most PINs correctly aggregate to their final 2022 tax bill; however, some of the rarer exemptions seem to be missing or unrecorded in the Clerk's data. Here's a quick test PIN - 10252080490000 - which receives a veteran's exemption on its bill but not in the Clerk's data.

Add class checks and other data validation prior to final db export

Now that we've set up dbt tests in the data-architecture repo, we should set up something similar for PTAXSIM.

All scripts in data-raw/ actually populate an S3 bucket that feeds the tax database in Athena. It should be relatively simple to add dbt tests on top of the tax data that must pass prior to PTAXSIM export. Some ideas for tests:

  • Class code of each PIN is contained in our class dictionary
  • PIN-level AVs match iasWorld records
  • PIN-level class codes match iasWorld
  • Tax code of each PIN matches iasWorld
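Until the dbt setup exists, the same checks can run in R during the export scripts. A minimal sketch of the first bullet, with hypothetical data frames standing in for the PIN extract and class dictionary:

```r
library(dplyr)

# Hypothetical PIN extract and class dictionary
pin <- tibble::tibble(
  pin   = c("14081020190000", "10252080490000"),
  class = c("299", "203")
)
class_dict <- tibble::tibble(class = c("203", "299"))

# Test: every PIN's class code appears in the class dictionary.
# anti_join() keeps rows in `pin` with no match in `class_dict`
bad_classes <- pin %>% anti_join(class_dict, by = "class")
stopifnot(nrow(bad_classes) == 0)
```

The iasWorld comparisons would follow the same pattern: anti-join the export against the reference table and fail the build on any unmatched rows.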

Parameterize existing vignettes

The current PTAXSIM vignettes are, for the most part, specific to a place and year. It would be useful to parameterize a few of them such that they could work with any municipality/tax code and year combination.
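R Markdown's built-in parameterization fits this well: declare defaults in the document's YAML header, then reference them in chunks. A sketch; the title and parameter names are hypothetical:

```yaml
---
title: "Decomposing a tax bill"
output: rmarkdown::html_vignette
params:
  muni: "Chicago"
  year: 2021
---
```

Chunks would then use `params$muni` and `params$year`, and `rmarkdown::render()` can override the defaults via its `params` argument to regenerate the document for any combination.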

Using PTAXSIM to simulate residential exemption policies

I want to test the impact of different hypothetical values of a Homeowners' Exemption in Cook County on 1) progressivity of the tax structure and 2) the cost of modeling errors to low-income residents. I want to answer the question "What would happen in Cook County if Illinois increased the EAV amount of the Homeowners' Exemption by 100%? 200%? 300%? etc."

A naïve approach would be to use parcels' estimated market values from Cook County's open data and apply a fixed tax rate while varying the value of the Homeowners' Exemption. This approach would not account for the impact of the Homeowners' Exemption on the overall tax base, and therefore on the tax rate applied in each iteration.
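The base feedback the naïve approach misses can be sketched in a few lines. All numbers here are invented; PTAXSIM performs this recalculation with the real levies and bases:

```r
# Hypothetical tax code: fixed levy, base net of exemptions
levy       <- 100e6
gross_eav  <- 2.5e9
n_exempt   <- 10000  # PINs receiving the exemption
exe_amount <- 10000  # current per-PIN exemption EAV

# Rate responds to the exemption because the levy is fixed
rate <- function(exe) levy / (gross_eav - n_exempt * exe)

# Doubling the exemption shrinks the base, so the rate rises;
# a fixed-rate simulation would miss this shift
rate(exe_amount)      # baseline rate
rate(2 * exe_amount)  # higher rate under a 100% increase
```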

From reading the PTAXSIM documentation, it is not clear to me whether PTAXSIM can easily simulate the impact of policy changes on the entire tax base.

I have to complete this by Thursday of next week. Is there a way to easily use PTAXSIM to simulate this scenario?

New property tax classes in 2021?

The database includes a few parcels in tax year 2021 that have classes that are not in the assessor class code documentation.

Specifically:

  • pin 16151020170000 has a class of 192 (though in the parcel search it shows 190)
  • and pin 32091000160000 has a class of 826 (and still has a class of 826 in the parcel search)

Can you please clarify whether these classes are accurate in the database and, if so, provide more information about the classes (i.e. name and assessment ratio)?

My query was "select pin, class, tax_code_num, av_clerk from pin where year = 2021"

Thanks!

Investigate border area tax rates

Certain municipalities and taxing bodies can span both Cook and a bordering county. In such cases, the tax rate calculation in PTAXSIM may be wrong, since the denominator in the $levy / base$ calculation will be too low (it comes from cty_cook_eav, which covers only the Cook County portion of the base).

Need to look into this more and possibly see how the Clerk does it. Best way to start is to pull some bills from a border area.

See code here for reference: https://github.com/AleaWM/CookCounty-PropertyTaxes/blob/6a3e4f737e9453d74cdaaa71dd02a936da6bda63/Residential_Burden_Shift.Rmd#L301
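The suspected error reduces to the denominator of the rate calculation. A sketch with invented numbers:

```r
# Hypothetical levy and EAV for a district spanning two counties
levy      <- 50e6
eav_cook  <- 1.2e9  # portion of the base inside Cook County
eav_other <- 0.3e9  # portion in the bordering county

rate_wrong <- levy / eav_cook                # Cook-only denominator: too high
rate_full  <- levy / (eav_cook + eav_other)  # full base: the correct rate

rate_wrong > rate_full  # the Cook-only rate overstates the bill
```

Comparing real bills from a border area against PTAXSIM output should reveal whether the Clerk uses the full cross-county base.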

Possible Equalization Factor Vignette

The goal of this issue is to create a vignette that entirely replicates the IDOR calculations of the tentative and final equalization factors for tax year 2022. It remains to be seen whether this is entirely possible. The vignette would then show the various impacts of differences between the tentative and final equalization factors.

  • Identify the IDOR sales utilized in the creation of forms PTAX-215/PTAX-236
  • Identify a way to store the sales used in the studies and the actual results from IDOR for the 2022 tentative and final stages
  • Identify other data sources required (Farm B assessed values?)

2022 016 Cook FINAL(25E)CCAO PTAX Forms 002 Forms 210-215-236 B-R Action Not Equal-1.pdf
2022 016 Cook Tentative CCAO PTAX Forms 002 Forms 210-215-236 DH.pdf

Collect additional property tax related data using LLMs

Goal

Collect additional property tax related data from documents, using LLMs for parsing.

Overview

There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including TIF ordinances, TIF redevelopment plans, municipal/district budgets, SSA info, etc. If this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first ever collection of such data.

The problem is that this data is messy. There is no standard format for something like a TIF ordinance, so each document will have a completely different format and language depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text - usually without any OCR applied - spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.

Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We may be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.

⚠️ This is an experimental/moonshot task. We don't know for sure that LLMs will work here or that there's enough structured information to be useful. However, if it does work, then it will produce the first digitized collection of such data and a nice proof-of-concept that we can potentially use elsewhere in the office.

Getting Started

The first thing we need to do is take inventory, first of data, then of LLMs. I would make spreadsheets tracking each of the relevant datapoints.

Data

We need to take stock of what data actually exists that is:

  • Available - Easy(ish) to collect
  • Valuable - Actually needed inside PTAXSIM and useful for analysis
  • Parsable - Possible to be read and contextualized by an LLM

I recommend we start with the following datasets:

TIF information
  • Includes ordinance, redevelopment plans, and annual reports
  • This data might be a good place to start because it's limited in scope. There are a finite number of TIFs in Cook County, and most of their text should be available online
  • Possible datapoints to collect include:
    • Establishment information (who proposed, what criteria were used, what was the projected revenue, initial PINs included, what projects were originally planned)
    • Expenditures and porting information (where is money going, to what other TIFs)
Taxing district budgets
  • Includes topline expenditures by taxing agency + contextual notes on that agency's main functions
  • This might be easier to collect for the county since most bodies publish a public budget, but condensing it down into a single SQL table will be challenging
  • This will be harder to collect in the long run since there are far more taxing agencies than TIFs, budgets would need to be collected for each fiscal year, and different agencies may have different fiscal years

LLMs

The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities/whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:

  • Can ingest a large document and provide a summary or key bits of information, ideally in a machine-readable output format
  • Can ideally ingest a raw image PDF, rather than one that's been OCR'd. If necessary, we could set up a separate OCR pipeline
  • Are specifically trained on legal or government documents
  • Are low-cost or free. We can run locally or on EC2 if needed

Tasks

Before proceeding to coding, the following tasks should be complete:

  • Topline inventory of data to be collected. Can be a markdown list or Excel sheet attached to this issue. Should include:
    • Which documents need to be collected (TIF ordinance, muni budgets, etc.)
    • Where those documents will be collected from
    • What (estimated) percentage of those documents are immediately available (i.e., without a records request)
    • What datapoints can be collected from those documents (e.g., establishment criteria, projected revenue)
  • Inventory of LLMs and their capabilities. I would create a table/matrix with each LLM as a column and each capability or attribute as a row. Attach to this issue
  • Specific inventory of data to be collected. Once the topline inventory is done, make a list of all the documents we need to collect, their source, whether they've been fetched, whether they've been OCR'd, whether they've been LLM-parsed, etc.

Outline

Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram + using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.

Processing

Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets. We can create a new S3 bucket/dir you can use to store each stage. This will be the stage actually using LLMs. We can scope it out further as we get closer to this stage.

  • Share a broad overview of data processing architecture with @dfsnow
  • Any scripts used for processing must live in data-raw/, though we may not want the raw data itself there

Package updates

Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:

  • Update data-raw/create_db.sql to add new table definitions for your finished data
  • Add a new script (or scripts) to data-raw/ that pulls the processed data from S3 and loads it into the SQLite DB (via data-raw/create_db.R)
  • Update the database diagrams in the README to include your new tables
  • (Reach goal) Add a new R function (or arguments to an existing function) to return your data
  • (Reach goal) Add a short document to vignettes/ describing what your data is and how to use it

Additional Requirements

  • Any updates to the package must come via a pull request. You should work on a separate branch and notify @dfsnow when ready to merge
  • Don't commit large data objects to the repository, particularly large Git LFS objects
  • This data must be accurate. At some point down the line we will need to discuss a review process for this data

Add Tax Year 2022 data

We need to update the raw data sets used to generate the PTAXSIM database as tax year 2022 data becomes available from the Clerk, BoR, IDOR, and Treasurer. Once all TY2022 data is available, we will update both the R package and database. The following data sets need to be updated for 2022:

  • Tax code agency rates - Clerk
  • Agency tax rates - Clerk
  • CPI - IDOR
  • Equalization factors - IDOR
  • TIF data - Clerk
  • PIN-level AV and exemptions - CCAO/Clerk (CLERKVALUES internal DB table)
  • PIN-level total tax bills - CCAO/Treasurer (TAXBILLAMOUNTS internal DB table)
  • Parcel geometries - Cook County GIS

Additionally, we should add 2022 unit/integration tests and update PTAXSIM vignettes:

  • Update vignettes with 2022 data (where necessary)
  • Update unit tests with 2022 data
  • Update README with new package/DB version
  • Bump DB and package version
  • Pull at least 10 real second-installment 2022 tax bills to use for integration test

Fix expanded mermaid diagram

The mermaid diagram in the README is very helpful for explaining the DB structure and previewing which columns are in which tables. It used to work beautifully, but now it breaks when trying to expand it in the README. Let's fix it such that an expanded version is made available via the README (e.g., by creating a PDF image of the full expanded diagram and linking to it).
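One option is to pre-render the full diagram with mermaid-cli and link the resulting file from the README instead of relying on GitHub's inline renderer. A sketch; the file names are hypothetical:

```shell
# Render the expanded diagram to standalone PDF/SVG artifacts
npx -y @mermaid-js/mermaid-cli -i db-diagram.mmd -o db-diagram.pdf
npx -y @mermaid-js/mermaid-cli -i db-diagram.mmd -o db-diagram.svg
```

This could run in CI so the linked artifact stays in sync with the diagram source.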

Retroactive AV updates

My understanding is that the PINs below may be getting a retroactive AV increase (to new_av) for tax year 2021. Luckily, they are all in a TIF, so there is no impact on tax rates. Similar issues would occur for retroactive reductions made by PTAB.

updates <- tibble(
  pin    = c("17084150010000", "17084070160000", "17084070170000"),
  new_av = c(41422313, 16158090, 4780),
  old_av = c(109231, 62244, 4792)
)
