jsfenfen / 990-xml-reader

IRSx: Turn the IRS' versioned XML 990 nonprofit annual tax returns into standardized python objects, json, or human readable text with original line number and description.

License: MIT License

Python 100.00%
irs-form990-data nonprofit

990-xml-reader's People

Contributors

jsfenfen, kuirolo, myersjustinc

990-xml-reader's Issues

Recanonicalize based on 2016v3.0

2016v3.0 introduces lotsa new variables; eventually they should get a 2016 series canonical variable assigned, but in translating to 2015 schema they are just ignored.

/IRS990ScheduleA/AgriculturalNameAndAddressGrp/CityNm
/IRS990ScheduleA/AgriculturalNameAndAddressGrp/CollegeUniversityName/BusinessNameLine1Txt
/IRS990ScheduleA/AgriculturalNameAndAddressGrp/CollegeUniversityName/BusinessNameLine2Txt
/IRS990ScheduleA/AgriculturalNameAndAddressGrp/CountryCd
/IRS990ScheduleA/AgriculturalNameAndAddressGrp/StateAbbreviationCd
/IRS990ScheduleA/AgriculturalResearchOrgInd

/IRS990ScheduleA/DistributionAllocationsGrp/ExcessDistributionCyovYr3Amt
/IRS990ScheduleA/DistributionAllocationsGrp/ExcessFromYear4Amt

/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/EngageDeferDenyRqrPaymentInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/FAPTranslatedInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/LookBackMedicaidMedcrPrvtInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/LookBackMedicareInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/LookBackMedicarePrivateInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/MadeEffortOrallyNotifyInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/MadePresumptiveEligDetermInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/NotifiedFAPCopyBillDisplayInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/PermitDeferDenyRqrPaymentInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/PriorCHNAImpactInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/ProcessedFAPApplicationInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/ProspectiveMedicareMedicaidInd
/IRS990ScheduleH/HospitalFcltyPoliciesPrctcGrp/ProvidedWrittenNoticeInd

/ReturnHeader/FilingSecurityInformation/AtSubmissionCreationDeviceId
/ReturnHeader/FilingSecurityInformation/AtSubmissionFilingDeviceId
/ReturnHeader/FilingSecurityInformation/FederalOriginalSubmissionId
/ReturnHeader/FilingSecurityInformation/FederalOriginalSubmissionIdDt
/ReturnHeader/FilingSecurityInformation/FilingLicenseTypeCd
/ReturnHeader/FilingSecurityInformation/IPAddress/IPv4AddressTxt
/ReturnHeader/FilingSecurityInformation/IPAddress/IPv6AddressTxt
/ReturnHeader/FilingSecurityInformation/IPDt
/ReturnHeader/FilingSecurityInformation/IPTimezoneCd
/ReturnHeader/FilingSecurityInformation/IPTm

TY 2018 schema adjustments made

TY 2018 v3.1 schemas are now available. They can be processed after v0.2.4 though we haven't adjusted the schemas yet. Copies of these schemas are now available so the adjustments should be made.

Map variables across forms

Currently, these tools aim to map related variables across schema versions of a form. It would be great to have a meaningful way to map related variables across forms.

For example, total contributions is in both the 990 and 990 EZ (totcntrbs, totcntrbgfts); it would be great to have a mapping somewhere that combines these into a single variable.

Could be done as a different model or other means. Happy to think through if this makes sense to do and, if it does, how to do it.
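
One possible shape for such a mapping is a small crosswalk from form-specific variable names to a shared name. The sketch below is only an illustration: the shared name `total_contributions` and the helper `harmonize` are hypothetical, and only the two variable names come from the example above.

```python
# Hypothetical crosswalk: form-specific variable name -> shared cross-form name.
# Only totcntrbs and totcntrbgfts come from the issue text; the shared name is made up.
CROSS_FORM_MAP = {
    "totcntrbs": "total_contributions",
    "totcntrbgfts": "total_contributions",
}


def harmonize(record):
    """Return a copy of record with mapped variables renamed to their shared name.

    Variables without a crosswalk entry keep their original name.
    """
    return {CROSS_FORM_MAP.get(name, name): value for name, value in record.items()}


# Two toy records that use different names for the same concept.
rows = [{"id": "a", "totcntrbs": 1000}, {"id": "b", "totcntrbgfts": 250}]
harmonized = [harmonize(row) for row in rows]
```

Keeping the crosswalk as plain data (rather than code) would let it be reviewed and extended the same way as the existing per-version concordance.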

Runtime errors in run_filing are print statements

DESCRIPTION OF THE ISSUE

In run_filing of xml_runner.py, the final else clause is entered when the filing version is not listed in ALLOWED_VERSIONSTRINGS or CSV_ALLOWED_VERSIONSTRINGS. Its print call is not conditioned on the verbose parameter, so the warning is printed whenever the OID is passed to run_filing and the Filing object is instantiated. Even though the version isn't allowed, the Filing object is still instantiated and returned by the function, but without the result attribute being set. Only then can the object be tested for its version.

The same is true in run_sked, where the final else statement contains the same code.

IMPACT

Print statements containing error information should be conditioned on the verbose flag, for debugging purposes only, so they can be suppressed in production code. Since the problem isn't raised as an exception, it can't be handled as an error when trying to access the Filing object with get_result and must instead be handled as a conditional on the Filing object's version_string. The possible workarounds include testing the first 4 characters of the version_string as an integer greater than 2012 or testing that Filing.result isn't None.
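
The two workarounds could look roughly like this. It is a sketch, not part of irsx: `version_year_ok` and `has_result` are hypothetical helpers, and the only assumptions about Filing are the `version_string` and `result` attributes named above.

```python
from types import SimpleNamespace


def version_year_ok(version_string, min_year=2012):
    """True if the 4-digit year prefix of e.g. '2016v3.0' is greater than min_year."""
    try:
        return int(version_string[:4]) > min_year
    except (TypeError, ValueError):
        return False


def has_result(filing):
    """True if the filing-like object carries a non-None result attribute."""
    return getattr(filing, "result", None) is not None


# Stand-ins for Filing objects, used only to exercise the helpers.
unsupported = SimpleNamespace(version_string="2012v2.0", result=None)
supported = SimpleNamespace(version_string="2016v3.0", result={"placeholder": True})
```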

SUGGESTED CHANGES

  1. Suppress the print statement if verbose = False in both run_filing and run_sked.
  2. Raise an exception when the filing version isn't supported for the operation: either in get_result, if there's a compelling reason to instantiate the Filing object despite a version mismatch, or in run_filing and run_sked, if the Filing object shouldn't be instantiated when the filing version isn't in ALLOWED_VERSIONSTRINGS or CSV_ALLOWED_VERSIONSTRINGS.
  3. Make the code more DRY by combining the functionality of run_filing and run_sked so these functions don't repeat the same code.
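
A minimal sketch of how the three suggestions could fit together is shown below. Everything in it is hypothetical: `UnsupportedVersionError` and `check_version` are illustrative names, and the allowed-version list is a placeholder rather than the real contents of irsx's settings.

```python
class UnsupportedVersionError(Exception):
    """Hypothetical exception for filings whose schema version can't be processed."""


# Placeholder value; the real lists live in irsx's settings.
ALLOWED_VERSIONSTRINGS = ["2016v3.0"]


def check_version(version_string, allowed=ALLOWED_VERSIONSTRINGS, verbose=False):
    """Shared check that both run_filing and run_sked could call."""
    if version_string in allowed:
        return
    message = f"Filing version {version_string} isn't supported for this operation"
    if verbose:
        # Only print when the caller asked for diagnostics.
        print(message)
    raise UnsupportedVersionError(message)


# The caller decides how to handle an unsupported filing.
try:
    check_version("2012v2.0")
except UnsupportedVersionError as err:
    handled = str(err)
```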

JUSTIFICATION

Better error handling for this situation will help make this code work better without having to dive into the code internals to understand what versions are allowed and which aren't. Raising the exception allows the user to put this into a try/except and figure out how they want to handle the program flow without having print statements that can't be suppressed. Consolidating the run_filing and run_sked functions will help improve the maintainability of this code going forward. Being able to process the schedules individually is a great feature but this could break if updates to one function aren't replicated in the other.

roadmap

There should be a roadmap which lays out plans for variable standardization (outsourced to open data coalition folks, but urging folks to contribute there).

Support for new data 2021, 2022

First of all, thanks a lot for irsx; it is indeed very helpful.

Is there a tentative timeline for supporting the 2021 and 2022 data, which has now been released?
(Ref link: https://apps.irs.gov/pub/epostcard/990/xml/2022/2022_TEOS_XML_01A.zip)

It currently gives these errors:
Filing version 2021v4.2 isn't supported for this operation
Filing version 2021v4.1 isn't supported for this operation
Filing version 2021v4.0 isn't supported for this operation
Filing version 2021v4.2 isn't supported for this operation

Could you look into the issue and provide an update?

Thanks,
Vishal

cookbook / recipes

Instead of endless technical documentation, it's probably more useful to have a cookbook for doing common tasks. This might not be part of the repo, which is weird and confusing enough, but should go somewhere that's visible. Maybe they should be jupyter-style notebooks?

Local local_settings.py being posted to Pip

Fresh installs/upgrade of IRSX from Pip throw an OS permissions error because local_settings.py is included as part of the repo w/ IRS_READER_ROOT set to "/Users/jfenton/...". You may want to exclude that file from the repo.

Also, I'd love to contribute to this repo where I can (vs making you deal with my issues) but I think the version of the repo on Git is slightly older(?) than the version on Pip, so I don't want to complicate things by making merge requests on an older repo.

Support for schema 2020v4.2?

The latest XML filings released by the IRS include 990s with schema version 2020v4.2. Are there plans to add support for this version?

Btw, thank you for this great project! It's been extremely helpful.

Organization501cInd

Report:

"The Organization501cInd element includes an ATTRIBUTE called organization501cTypeTxt that specifies the type of 501(c). Best I can tell, this data is not captured. Is capturing this value even possible?"

make result part of a returned Filing object

So that Filing methods like list_schedules are accessible in the result, this probably means changing filing's get_schedule to get_raw_schedule and making get_schedule return the parsed schedule.

Schema 2016v3.0

I processed several thousand returns today, so I put the reader through its paces, and it worked reliably (yay!). However, I've got a list of about 11,000 object IDs that return None. They all appear to be under schema 2016v3.0, but I didn't check them all. I think this is a schema not yet included in the concordance file(?), but I wanted to submit the issue anyway to check with you.

capturing organization501cTypeTxt

The text entered to define what type of 501(c) organization the filer is, e.g. (c)(8) in the example below, ends up getting captured in a nonstandard way, as an XML attribute, that isn't getting added to irsx's data:

<Organization501cInd organization501cTypeTxt="8">X</Organization501cInd>
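
For reference, the attribute value is readable with a standard XML parser. The sketch below uses Python's standard-library ElementTree on the snippet above, with any XML namespace left out for brevity; it does not show how irsx would store the value.

```python
import xml.etree.ElementTree as ET

snippet = '<Organization501cInd organization501cTypeTxt="8">X</Organization501cInd>'
element = ET.fromstring(snippet)

checkbox_value = element.text                      # the element text, "X"
type_txt = element.get("organization501cTypeTxt")  # the attribute value, "8"
```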

Fix new xpaths from 2020 forwards

The settings.py file has been updated to allow newer versions, but the metadata has not yet been updated to capture new xpaths added since 2020. Hope to do so after generating a report of what's missing.

Download all 990's from one year

How would I download all the 990 data from one year? Is that possible?

The examples are very specific, but what I'd like to do is pull all the data and then throw out the parts I don't need.

2020v1.2

should be listed in settings.py

xmltodict error on windows / anaconda

xmltodict on windows is choking on the xml formatting the IRS uses on the xml files. This has been reproduced on windows / anaconda, unclear how many versions are affected.

import xmltodict
 
filepath = r"c:\.....    anaconda3\lib\site-packages\irsx\XML\201533089349301428_public.xml"
fh = open(filepath, 'r')
raw_file = fh.read()
raw_irs_dict = xmltodict.parse(raw_file)


Traceback (most recent call last):
  File "xmltodict_test.py", line 6, in <module>
    raw_irs_dict = xmltodict.parse(raw_file)
  File "C:\Users\eharv\Anaconda3\lib\site-packages\xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

This isn't an issue on Linux / Mac; the first char is id'ed as \ufeff (the Unicode byte order mark). On an anaconda terminal, this char is represented as:  . Presumably this is some default codec / encoding type issue.
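
If the leading character is a UTF-8 byte order mark, one possible workaround (an assumption, not verified against the affected Windows setup) is to decode the file explicitly instead of relying on the platform default encoding. The sketch below writes a small BOM-prefixed stand-in file and reads it back with Python's `utf-8-sig` codec, which drops the mark. The stand-in XML is made up, and the standard-library parser is used in place of xmltodict only to keep the example dependency-free.

```python
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in content; a real filing is far larger.
sample = '<Return returnVersion="2016v3.0"><ReturnHeader/></Return>'

# Write the stand-in with a leading UTF-8 byte order mark.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + sample.encode("utf-8"))
    filepath = tmp.name

try:
    # "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
    with open(filepath, "r", encoding="utf-8-sig") as fh:
        raw_file = fh.read()
finally:
    os.remove(filepath)

root = ET.fromstring(raw_file)
```

The string read this way starts directly with the XML content, so it could be handed to `xmltodict.parse` in the same way as in the failing script.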

Filing version 2017v2.2 incompatible

Looks like IRS may have released a new version of the xml files.
IRSx doesn't seem to like them much.

$ /home/ec2-user/.local/bin/irsx --format=csv 201841359349103204 >> test.csv
$ cat test.csv
Filing version 2017v2.2 isn't supported for this operation
object_id,form,line_number,description,value,variable_name,xpath,in_group,group_name,group_index
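
The warning lands in test.csv because it is written to standard output along with the CSV rows. One way a tool can keep redirected output clean is to send diagnostics to standard error instead. The sketch below illustrates the idea with a hypothetical `warn_unsupported` helper; it is not irsx's actual code.

```python
import io
import sys


def warn_unsupported(version_string, stream=None):
    """Write the unsupported-version warning to stderr (or to a given stream)."""
    target = sys.stderr if stream is None else stream
    print(f"Filing version {version_string} isn't supported for this operation", file=target)


# Capture the warning in a buffer to show it is kept apart from stdout.
captured = io.StringIO()
warn_unsupported("2017v2.2", stream=captured)
```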

"Filing version 2012v2.0 isn't supported for this operation"

Hello!
I'm trying to access information on records from 2013, but this warning keeps popping up. I can get the list of schedules this record has but can't get the information on said schedules. Is there a way to remedy this?
I can't seem to find anything in the documentation about this.

Thank you!

Architectural overview

Would be useful to have a single architectural overview; there are now at least 3 repos and it's getting confusing for developers to quickly grasp how everything works together.

Better error message for missing / mangled xml

A lot of my 2016v3.0 returns process, but on calling xml_runner.run_filing(201711459349300346) as well as irsx 201711459349300346, object ID 201711459349300346 is throwing:

File "/usr/local/bin/irsx", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/irsx/irsx_cli.py", line 107, in main
    run_main(args_read)
  File "/usr/local/lib/python3.6/site-packages/irsx/irsx_cli.py", line 93, in run_main
    verbose=args_read.verbose
  File "/usr/local/lib/python3.6/site-packages/irsx/xmlrunner.py", line 101, in run_filing
    this_filing.process(verbose=verbose)
  File "/usr/local/lib/python3.6/site-packages/irsx/filing.py", line 168, in process
    self._set_version()
  File "/usr/local/lib/python3.6/site-packages/irsx/filing.py", line 71, in _set_version
    self.version_string = self.raw_irs_dict['Return']['@returnVersion']
KeyError: 'Return'
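
The traceback shows `_set_version` indexing `raw_irs_dict['Return']` directly, so a file without a `Return` root surfaces as a bare KeyError. A clearer failure could be produced by catching the lookup error and re-raising with context. In the sketch below, `MalformedFilingError` and `get_return_version` are hypothetical names, not part of irsx.

```python
class MalformedFilingError(Exception):
    """Hypothetical error for filings whose XML is missing or not a tax return."""


def get_return_version(raw_irs_dict, object_id):
    """Return the @returnVersion attribute, or raise a descriptive error."""
    try:
        return raw_irs_dict["Return"]["@returnVersion"]
    except (KeyError, TypeError) as err:
        raise MalformedFilingError(
            f"Filing {object_id} has no Return/@returnVersion; "
            "the XML file may be missing, truncated, or not a 990 return"
        ) from err


# A well-formed dict yields the version string.
version = get_return_version({"Return": {"@returnVersion": "2016v3.0"}}, "201711459349300346")

# A dict without a Return root yields a readable error instead of KeyError.
try:
    get_return_version({"html": {}}, "201711459349300346")
except MalformedFilingError as err:
    message = str(err)
```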
