imsweb / algorithms

Java implementation of cancer-related algorithms (NHIA, NAPIIA, Survival Time, etc...)

License: Other

C++ 5.27% SAS 10.71% Java 84.02%

cancer napiia nhia seer survival

algorithms's Issues

Wrong note in surgery table for Breast

There is a note that is different from the SEER website information for Breast (2018) - see https://seer.cancer.gov/manuals/2018/AppendixC/Surgery_Codes_Breast_2018.pdf

Please fix that specific case.

Then review the Breast for all previous years.

And finally, try to do a full review of all the 2018 sites. You should be able to use the lab class that shows the content of our tables, and compare that to the current information on the SEER website. We need to make sure there are no other differences.

NAPIIA algorithm returns empty string instead of null

If the NAPIIA algorithm can't compute a result, it returns an empty string in its result object. I don't see any reason for that; "empty" values should be null. That's the case everywhere except in this algorithm. That doesn't make much sense to me and I would like to change that.

Add support for ICD-O-3 to ICD-O-2 conversion

We have this conversion in an old C++ program (I will provide it). We need to translate the program into Java logic and add that to our IcdUtils class.

Errors in site-specific surgery tables

We found an error in the table selection for the 2007 data; it means the wrong table can sometimes be returned for specific site/hist combinations.

We are going to fix that error and re-check all the data again.

Record Number Recode output by the Survival Time Algorithm

record.number.recode.error.algorithms.xlsx
I attached 3 patients where the record number recode comes out wrong. I found more cases than these, but the only ones I saw were when the last two records were after the study cutoff.

Col C is what we want the output to be. Col D is the current output. Cols E-I are the algorithm's inputs, J-P are the rest of the outputs (these are correct).

Let me know if you need more information.

Updates for newly occurring CS codes

Historic Stage needs to be updated annually to account for any new CS codes that had not previously been accounted for.

List of input fields for county at dx analysis

The list of input fields for the county of dx analysis algorithm in Algorithms.java includes a few extra fields that are not actually used by the algorithm.

These include:
State at dx geocode 70/80/90
State at dx geocode 2000
State at dx geocode 2010
State at dx geocode 2020

There are setters and getters for these in CountyAtDxAnalysisInputDto.java, and presumably that's why they were included among the input fields, but the getters are never actually used. It might be best to remove them to avoid confusion. The algorithm never uses state at dx geocode. The SAS code also never uses it.

Incidentally, the county at dx geocode 2020 and census tract certainty 2020 fields are also never used, but I can see them being used eventually... so I'll leave it up to you to decide if you want to keep them, comment them out, or whatever.

No rush on this whatsoever, btw.

Change survival record number index to be 1-based

It's currently 0-based but it will become an official NAACCR variable in 2018, and it will be defined as being 1-based. So we should fix it now...

Some of the algorithm names are really long

Here are some examples:

Main Classification from the International Classification of Childhood Cancer ICD-O-3/WHO 2008
SEER Behavior Recode for Analysis - 1973-2004 SEER Research Data (November 2006 submission) and Later Releases
NAACCR Asian/Pacific Islander Identification Algorithm

That last one is long, but it's kind of official and it also makes sense the way it is. And I would like to keep that one as-is but that would be the maximum length allowed.

I would like to change the first one to

International Classification of Childhood Cancer

And the second one to

SEER Behavior Recode for Analysis

I would also like to add a new attribute to the Algorithm class: a URL to more information about the algorithm. All SEER algorithms have documentation available on the SEER website; unfortunately the NAACCR ones don't, so that will be an optional attribute.

Cleanup RuralUrbanCsvData class

That class uses complicated logic to load the data; I think it can be greatly simplified.

As an example, the CensusTractPovertyIndicatorCsvData was also very complex and I simplified it. I think we should do the same for this one. In particular, I would like to stop using the modulo operator to code several values into a single field. That's just not worth it and it makes the logic very complex/convoluted.

Algorithms.java - override getUnknownValues() in createIarc() function

Can an override for getUnknownValues() be added to the createIarc() method in Algorithms.java?

The IARC functions return an unknown value (literal string "null").

Therefore, I would like "null" to be included in the list when calling getUnknownValues().... either that, or change the IARC algorithms to return null instead of "null".

Move AlgorithmsUtils to internal package

The few utility methods included in AlgorithmUtils were not meant to be exposed outside of the library; they are really "internal" utility methods. For now it doesn't cause any issues, but the implementation and behavior of those methods could be changed in the future and that could break other projects using those methods.

The methods will be moved to a new utility class in the "internal" package, and the ones in AlgorithmUtils will be deprecated.

Note that if one day this library fully (and properly) supports Java modules, the classes in the "internal" package won't be exposed outside of the library anymore...

Deprecate methods that use layout properties

Most algorithms have methods that either take input DTO objects, or the actual values (like site/hist for example).

For convenience, almost all of them support methods that take a map of properties, and they define the keys they expect to find. Those keys are based on the layout framework, and some of them are not going to be valid when NAACCR XML becomes the only standard and replaces the flat file (when that happens the layout flat file will need to be changed so their field names align with the NAACCR XML IDs).

In preparation for that transition, I am going to deprecate all those methods that use maps of properties. Applications should use the input DTO objects; it's a cleaner approach anyway.

Support year-based surgery codes

The current surgery code module takes its data from the SEER website but only supports "the latest data".

We need to add support for multiple SEER manuals (so we keep historic data) and the methods that request the data need to take a year parameter (I guess the current year can be assumed if not provided).
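A minimal sketch of what a year-aware lookup could look like. The class and method names are hypothetical, the data is reduced to a plain string, and falling back to the most recent manual at or before the requested year is an assumption, not something stated in this issue.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical holder for surgery data keyed by the year of the SEER manual it came from.
final class YearBasedSurgeryData {

    private final TreeMap<Integer, String> dataByManualYear = new TreeMap<>();

    void addManual(int manualYear, String data) {
        dataByManualYear.put(manualYear, data);
    }

    // No year provided: assume the current year.
    String getData() {
        return getData(LocalDate.now().getYear());
    }

    // Assumption: use the latest manual published at or before the requested year.
    String getData(int year) {
        Map.Entry<Integer, String> entry = dataByManualYear.floorEntry(year);
        return entry == null ? null : entry.getValue();
    }
}
```

With manuals registered for 2010 and 2018, a request for 2012 would return the 2010 data, and a request for a year before the first manual would return null.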

Synchronization issue in all CSV providers

The providers initialize lookups from CSV files but those methods are not properly synchronized. As a result, it is possible for one thread to think the lookups have been initialized when only a few have been, and the returned results are then incorrect.
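One common way to avoid exposing a half-built lookup is to fill a local map completely and only then publish it through a volatile field, inside a synchronized block. The sketch below is illustrative only: `CsvLookupProvider` and `loadAll()` are hypothetical names, and the CSV parsing is replaced by two hard-coded rows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical provider; the real providers read their data from CSV resources.
final class CsvLookupProvider {

    // volatile so a fully built map is visible to every thread once it is assigned
    private volatile Map<String, String> lookup;

    String getValue(String key) {
        Map<String, String> local = lookup;
        if (local == null) {
            synchronized (this) {
                local = lookup;
                if (local == null) {
                    // build everything in a local variable first...
                    local = loadAll();
                    // ...and publish only the completed map
                    lookup = local;
                }
            }
        }
        return local.get(key);
    }

    // Stand-in for parsing the CSV file (assumption: two hard-coded rows).
    private Map<String, String> loadAll() {
        Map<String, String> result = new HashMap<>();
        result.put("tract-1", "1");
        result.put("tract-2", "2");
        return result;
    }
}
```

Because other threads can only ever see either `null` or the finished map, no caller observes a lookup that has been partially filled.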

Fix raw types warnings for AlgorithmParameter class

That class got parameterized, but none of the code that uses it properly uses the class parameters, resulting in a lot of warnings in the code (and in the compilation, I am sure). Need to look into this.

Add ICD-9CM/10CM to ICD-O-3 conversions

In SEER*DMS we handle the following in MorphologyUtils:

ICD-9CM to ICD-O-3
ICD-10CM to ICD-O-3
ICD-10 to ICD-O-3

These are accomplished through spreadsheets that are supplied by NCI. I think that logic should be added here.

NHAPIIA and CensusTractPoverty Updates

Census updated their list of Hispanic surnames so we will need to update the resource file so that it matches the new list.

The Census Tract Poverty Code gets updated every year. We will need to update it to include 2015 data.

Update Poverty Linkage For 2018

The ACS poverty linkage algorithm needs to be updated on an annual basis.

Cases diagnosed 2014 and later should use the ACS 5-year data from 2012-2016.

I already pulled the data from the Fact Finder website.

Provide more information in AlgorithmField

I am considering switching File*Pro to use the dynamic algorithms mechanism, but I need a bit more information from the fields:

Name (for standard field, it should be their NAACCR name)
Short Name
NAACCR XML Level (most are Tumor but a few are Patient fields)

I don't think any of this will affect current users of that mechanism (it's just extra information that they won't care about).

Add support to IARC Multiple Primary Algorithm

Unknown country code in NHIA algorithm

In the line below, I'm not sure which country "SWK" is. I didn't find it in the ISO-3166 table or SEER's country codes.

algorithms/src/main/java/com/imsweb/algorithms/nhia/NhiaUtils.java

Line 81 in b02faa9

    
           "ITA", "SMR", "VAT", "ROU", "XSL", "POL", "CSK", "CZE", "SWK", "YUG", "BIH", "HRV", "MKD", "MNE", "SRB", "SVN", "BGR", "RUS", "XUM", "UKR", "MDA", "BLR", "EST", "LVA", "LTU", "GRC", "HUN",

Changes to PRCDA/UIHO

There are some last minute changes to PRCDA/UIHO that need to be implemented. Need to rename a few fields to "17" for 2017 instead of "16" for 2016. Need to add logic for states where every county isn't PRCDA.

Record Number Recode in Survival is missing its leading zero

There is an issue with item #1775 (record number recode), which is part of the survival algorithm.

This data item should be padded on the left with a zero.

This is per the NAACCR specifications, which can be found here:

http://datadictionary.naaccr.org/default.aspx?c=10#1775

Can the leading zero be added to the output for this field?
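As an illustration of the requested padding, `String.format` with a `%02d` pattern produces a zero-padded value. The helper name is hypothetical and the two-character width is an assumption based on the padding described above.

```java
// Hypothetical helper; not part of the library.
final class RecordNumberFormatter {

    // Formats a record number on two characters, zero-padded on the left (assumed width: 2).
    static String format(Integer recordNumber) {
        if (recordNumber == null)
            return null;
        return String.format("%02d", recordNumber);
    }
}
```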

IarcUtils null pointer exception

Line 215 of IarcUtils (shown below), which is part of the InternalRecDto constructor, can throw a null pointer exception.
_seqNum = record.getSequenceNumber();

This happens when a patient has more than 1 tumor and one of the tumors is missing the sequence number. The problem is that the sequence number in the IarcMpInputRecordDto object is an Object of type Integer and it's being assigned to the sequence number of the InternalRecDto which is a primitive int.
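The failure mode can be reproduced in isolation: assigning a null `Integer` to an `int` triggers auto-unboxing, which throws. The class below is a hypothetical stand-in for the constructor logic, not the actual IarcUtils code.

```java
// Hypothetical stand-in for the assignment described above.
final class SequenceNumberExample {

    // Mirrors the failing pattern: auto-unboxing a null Integer throws NullPointerException.
    static int unsafeCopy(Integer sequenceNumber) {
        return sequenceNumber; // implicitly calls sequenceNumber.intValue()
    }

    // Null-safe variant: keep the boxed type so a missing value stays representable.
    static Integer safeCopy(Integer sequenceNumber) {
        return sequenceNumber;
    }
}
```

Keeping the internal field as an `Integer` (or checking for null before the assignment) would avoid the exception; which of the two fits best depends on how the sequence number is used later.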

Use different codes for unknown flavors in RUCA/URIC algorithm

The returned codes are two digits to support unknown.

Regular codes (which are one digit) are left-zero padded (01, 02, etc...); the regular unknown code is then 09. And then the extra unknown flavors (which are not standard) are 96, 97, etc...

The algorithm will be changed to return 1 digit instead of two; it will return 1, 2, etc... for the regular codes, and A, B, etc... for the unknown.

The following values will be used for the unknown:

A = State, county, or tract are invalid
B = State and tract are valid, but county was not reported
C = State + county + tract combination was not found
D = State, county, or tract are blank or unknown
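The letters above could be modeled as an enum so that each unknown flavor has a single definition. This is only a sketch: the enum, class, and method names are hypothetical, and only the letters and their meanings come from the list above.

```java
// Hypothetical names; only the letters and their meanings come from the list above.
enum UnknownReason {
    INVALID("A"),                // state, county, or tract are invalid
    COUNTY_NOT_REPORTED("B"),    // state and tract are valid, but county was not reported
    COMBINATION_NOT_FOUND("C"),  // state + county + tract combination was not found
    BLANK_OR_UNKNOWN("D");       // state, county, or tract are blank or unknown

    private final String code;

    UnknownReason(String code) {
        this.code = code;
    }

    String getCode() {
        return code;
    }
}

final class UrbanRuralCode {

    // Returns the one-character regular code when a value was found, otherwise the unknown letter.
    static String toOutput(Integer regularCode, UnknownReason reason) {
        if (regularCode != null)
            return String.valueOf(regularCode);
        return reason.getCode();
    }
}
```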

Add support for ICD-O-3 to ICD-9/10/CM conversions

Those conversions can be found in the spreadsheet on the SEER website:

Website: https://seer.cancer.gov/tools/conversion/
Document on that page: https://seer.cancer.gov/tools/conversion/ICD03toICD9CM-ICD10-ICD10CM.xls

We will need to write some code to convert from the spreadsheet to whatever internal format we decide to use; this will probably include some manual steps (at the very least, save each tab as a separate CSV file); those should be described somewhere so we can easily re-migrate the data if needed.

The big challenge here is the amount of data. We will need to find some pretty fancy way to not use too much memory with all this. Even with good optimizations in place, we should consider implementing the "data provider" pattern so applications like SEER*DMS can put all that data in its database and not use tons of memory...
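A rough sketch of the "data provider" idea: the conversion logic asks an interface for data, and an application can replace the default in-memory implementation with one backed by its own database. All names here are hypothetical, and the single row is a placeholder rather than real conversion data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: the library asks the provider for data instead of loading it itself.
interface ConversionDataProvider {
    // Returns the converted code, or null if there is no mapping.
    String lookup(String sourceCode);
}

// Default provider that keeps everything in memory (placeholder row, not real conversion data).
final class InMemoryConversionDataProvider implements ConversionDataProvider {

    private final Map<String, String> data = new HashMap<>();

    InMemoryConversionDataProvider() {
        data.put("SOURCE-1", "TARGET-1");
    }

    @Override
    public String lookup(String sourceCode) {
        return data.get(sourceCode);
    }
}

final class ConversionUtils {

    private static ConversionDataProvider provider = new InMemoryConversionDataProvider();

    // An application can swap in a provider backed by its own database.
    static synchronized void setProvider(ConversionDataProvider newProvider) {
        provider = newProvider;
    }

    static synchronized String convert(String sourceCode) {
        return provider.lookup(sourceCode);
    }
}
```

With this shape, the memory cost of the default provider is only paid by applications that do not register their own.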

Incorrect input value being set for countyAtDxAnalysis in Algorithms.java

On line 1454 the following:

inputDto.setCountyAtDxGeocode2000((String)inputTumor.get(FIELD_COUNTY_AT_DX_GEOCODE_2010));

Should be

inputDto.setCountyAtDxGeocode2010((String)inputTumor.get(FIELD_COUNTY_AT_DX_GEOCODE_2010));

Upgrade and secure XStream

XStream is the library we use to read XML; they recently added a "security" environment to fix some vulnerabilities.

Need to upgrade and define the proper security environment.

Replace Joda date library by new Java 8 date framework

The new framework is very similar to Joda, just better. There is no reason to keep Joda around anymore.
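For a typical date calculation the migration is close to mechanical. The sketch below uses only `java.time`; the Joda-Time equivalents are shown in comments for comparison, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical example of a Joda-Time calculation rewritten with java.time.
final class DateMigrationExample {

    // Joda-Time: new LocalDate(year, month, day)
    static LocalDate create(int year, int month, int day) {
        return LocalDate.of(year, month, day);
    }

    // Joda-Time: Months.monthsBetween(start, end).getMonths()
    static long completeMonthsBetween(LocalDate start, LocalDate end) {
        return ChronoUnit.MONTHS.between(start, end);
    }
}
```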

Fix esophagus site-specific surgery codes

For "Esophagus" table, histology exclusion "9733" should be "9732" and "9764" should be "9762".

I believe this only affects the latest XML data file (2018).

This was already fixed on the SEER website: https://seer.cancer.gov/manuals/2018/AppendixC/Surgery_Codes_Esophagus_2018.pdf

Site-specific Surgery Codes 2003-2007

Year-based surgery codes were previously added for years 2010-2018. This issue will add years 2003, 2004, and 2007, allowing us to support surgery codes for any year from 2003 to present.

Expose some methods from NhiaUtils

We need to expose isHeavilyHispanic() and isRarelyHispanic() so other processes can re-use the data contained in the library...

NHIA patient values different across tumors

Hi Fabian,

Recinda requested that NAACCR*Prep and the algorithm address the issue of the NHIA value being different across tumors. This occurs because of the option selected for the algorithm, based on % population county at diagnosis codes, and the patient moving from one county to another (with a different pop threshold). It causes the records to fail the inter-record edits program. It is very rare, but I have been addressing it for the last couple of years to make the values consistent across the records. Below is how I was told to address it (details can be found in the NAACCR Squish project, issues 65034 and 64839). Here is an inter-record error report from CO as an example. co.2018cfd.inter-record_detailed-rpt.pdf

ICCC Major Category

I would like to request that the calculation for the ICCC Major Category be added to the library. The ICCC Major Category collapses the 3-digit ICCC code to 2 digits, each 2 digit code represents a different major site category.

I've attached some code, it may or may not compile but it should be more than sufficient to get folks started.

iccc-major-category.txt

Vital status 0 should be used instead of 4 for dead.

All algorithms should be checked. CauseSpecific is one of them.

calculateSurvivalTime() can return all space values

The function SurvivalTimeUtils.calculateSurvivalTime() can return all spaces for values. For certain cases you can get this back:
SurvivalTimeOutputRecordDto.getSurvivalTimeDxYear() = “ “
SurvivalTimeOutputRecordDto.getSurvivalTimeDxMonth() = “ “
SurvivalTimeOutputRecordDto.getSurvivalTimeDxDay() = “ “

If these values are placed in a fixed-width data file, they will be considered empty/non-existent/null. But for XML, we don't want to add fields which are spaces. If a value is empty or undefined, the value should be NULL.
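The requested behavior amounts to normalizing blank strings to null before they are placed in the output object. A minimal sketch, with a hypothetical helper name:

```java
// Hypothetical helper; not part of the library.
final class BlankToNull {

    // Returns null for null, empty, or all-space values; otherwise returns the value unchanged.
    static String blankToNull(String value) {
        if (value == null || value.trim().isEmpty())
            return null;
        return value;
    }
}
```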

County at Diagnosis for Analysis

I would like to request that the calculation for county at diagnosis (analysis) be added to the library. I've attached source code from NAACCR*Prep. It may need to be updated for 2019.

CountyAtDiagnosisAnalysis.txt

Some fields are not read correctly from XML

Found in SEER*DMS when using the site-specific surgery lookup.

I already fixed this on the SNAPSHOT (1.4.3) by switching XStream from the Stax driver to the Xpp driver; it looks like the Stax driver has issues with CDATA sections. Doing so, I am pretty sure I broke the indentation in the writing process of XML, but the surgery tables are only read, never written, so that's OK for now. Although I would like to fix that in the official release, if possible.

I still need to review my change, add a changelog entry and make an official release. I also need to verify other places to switch to that other driver as well.

New ACS module exploded the size of the library

This library was already pretty big because of all the resources it contains.

A few months ago, we added a new "acslinkage" module and added two new CSV resource files. I didn't notice at the time how big this is. They added more than 20MB of compressed data (so they added 20MB to the size of the JAR file). That's really a lot.

We found this out because we noticed that our SEER*DMS WAR files got bigger. This also affected other software like the SEER Data Viewer and the SEER Abstracting tool.

I am not sure what the answer is. Right now only one software uses the acslinkage. We might need to consider moving the resources to that project (but still keep the algorithm in this library; hopefully the algorithm was designed with a data provider pattern).

Cause specific algorithm change

Set a value of 8 (unknown/missing COD – 7777 or 7797) for the fields SEER Cause-specific death classification and SEER Other cause of death classification.

Survival Time Algorithm Bug

See the attached for a case that is being calculated incorrectly. I have no idea why or why we haven't seen more like it, it doesn't seem THAT unusual.

I included all algorithm inputs for all fields, the algorithm outputs for the problematic fields and what the correct values should be. I'm using version 2.1.

We can talk in person if that is helpful. The general idea is that the seq 60 case should be sorted first since it's clearly before the seq 1 case; both dx dates are known. Then, since we know the seq 2 case should come after the seq 1 case (due to sequence), its missing dates should be imputed halfway between the diagnosis date of seq 1 and the date of last contact since it's the same year and there are no other records. Does that make sense?

If the date and order get set properly, it will end up changing the flags and the survival months. Let me know if you have any questions.
Bad.Survival.Case.xlsx

Change implementation of census tract poverty indicator data provider

To optimize memory usage, the data provider uses a mechanism of multiplying the result for each year category, and dividing it during the actual calculation. That mechanism is difficult to understand when reading the code, and more importantly, it doesn't scale. We just added new year categories and now we have to multiply the value by a million. That just doesn't work well. At all.
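To make the scaling problem concrete, here is a sketch of that kind of packing, assuming one single-digit code per year category stored in successive decimal digits (the exact encoding used by the library is not shown in this issue, so this is an assumption). The last method shows a simpler alternative that stores one character per category.

```java
// Hypothetical illustration of packing several codes into one number.
final class YearCategoryPacking {

    // Packs one single-digit code per year category into a single number:
    // category 0 uses the ones digit, category 1 the tens digit, and so on.
    static long pack(int[] codes) {
        long packed = 0;
        long multiplier = 1;
        for (int code : codes) {
            packed += code * multiplier;
            multiplier *= 10; // a seventh category already needs a multiplier of 1,000,000
        }
        return packed;
    }

    // Recovers the code of one category by dividing, then taking the remainder.
    static int unpack(long packed, int categoryIndex) {
        long divisor = 1;
        for (int i = 0; i < categoryIndex; i++)
            divisor *= 10;
        return (int)((packed / divisor) % 10);
    }

    // Simpler alternative: one character per year category, no arithmetic needed.
    static char lookup(String codes, int categoryIndex) {
        return codes.charAt(categoryIndex);
    }
}
```

Every new year category adds another factor of ten to the packed form, whereas the string form only grows by one character and stays readable.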

Implementation of UIHO facility and PRCDA county

As we discussed, these algorithms will need to be added to the library. Just making an issue for this for now.

Remove multiple-primaries module

That module has been moved into its own "MPH" project: https://github.com/imsweb/mph

Add values to ICD10-to-ICDo3 conversion

These values should be added to the .csv:
C489 -> C488|8000|3|9
C769 -> C488|8000|3|9
C799 -> C809|8000|3|9
C849 -> C449|9702|3|9
C909 -> C421|9732|3|9

Update Survival to return the Vital Status Recode

Vital Status Recode is a new data item added in NAACCR 18.

The Survival SAS program has not been updated to return that variable yet, but when it is, we will need to update the corresponding Java program.

The SAS code has not been updated, but needs to be. I apologize – I forgot this was getting added (it came up a while ago). It is a simple calculation. For what it is used for, it is just the vital status at the study cutoff date. We have always calculated it as VSR = VS. If date of last contact > study cutoff, then VSR = Alive. With this logic, if the patient is born on 11/1/2017, and study cutoff is 12/31/2016, they will have VSR = alive (even though they are not alive yet). But the fields are used for survival and a patient born (or diagnosed) after study cutoff is excluded. I think alive makes the most sense here.
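The rule described above is small enough to sketch directly. The class and method names are hypothetical, and using "1" as the code for alive is an assumption that should be checked against the NAACCR definition of the field.

```java
import java.time.LocalDate;

// Hypothetical sketch of the Vital Status Recode rule: VSR = VS, except that a patient
// whose date of last contact is after the study cutoff is considered alive at cutoff.
final class VitalStatusRecode {

    static final String ALIVE = "1"; // assumption: vital status code for alive

    static String compute(String vitalStatus, LocalDate dateOfLastContact, LocalDate studyCutoff) {
        if (dateOfLastContact != null && dateOfLastContact.isAfter(studyCutoff))
            return ALIVE;
        return vitalStatus;
    }
}
```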

Census Tract Poverty Indicator data for 2012-2016

In the process of updating the SAS program for census tract poverty I discovered that the 2012-2016 data was not being coded correctly. This would affect cases diagnosed in 2014.

The resource file for the 2012-2016 data needs to be updated. Many of the tracts that were coded as 1 should have been coded as something else.

I have the file ready. I just need to create a branch and commit.

Add support for registering and running algorithms

This library contains about 10 modules and most of them are "algorithms" (in the sense that they take some fields as input and return computed fields).

For those modules, we would like to add the ability for the library to auto-register them so an application can provide standard NAACCR data items and ask the library to compute a given set of algorithms.

The idea is that the calling application won't be aware of the algorithms themselves (it won't call algorithm-specific static methods); instead, the calling application will ask the library for all algorithms, then a user would select which ones should run, and finally the application would call the library with the list of algorithms to execute. The advantage of this approach is that new algorithms won't require any (coding) work in the calling application.
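A rough sketch of how such a registry could be shaped, assuming inputs and outputs are exchanged as maps of field name to value. `SimpleAlgorithm` and `AlgorithmRegistry` are hypothetical names and do not describe the library's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract: an algorithm has an ID and computes output fields from input fields.
interface SimpleAlgorithm {
    String getId();

    Map<String, String> execute(Map<String, String> input);
}

// Hypothetical registry: the application only deals with IDs, never with specific algorithms.
final class AlgorithmRegistry {

    private final Map<String, SimpleAlgorithm> algorithms = new LinkedHashMap<>();

    void register(SimpleAlgorithm algorithm) {
        algorithms.put(algorithm.getId(), algorithm);
    }

    // Lets the application list what is available so a user can pick what to run.
    List<String> getAlgorithmIds() {
        return new ArrayList<>(algorithms.keySet());
    }

    // Runs the requested algorithms in order and merges their output fields.
    Map<String, String> run(List<String> ids, Map<String, String> input) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String id : ids) {
            SimpleAlgorithm algorithm = algorithms.get(id);
            if (algorithm == null)
                throw new IllegalArgumentException("Unknown algorithm: " + id);
            output.putAll(algorithm.execute(input));
        }
        return output;
    }
}
```

Adding a new algorithm then only requires one more `register` call inside the library; the calling application keeps using `getAlgorithmIds()` and `run()` unchanged.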

Move NAACCR*Prep calculations into algorithms

The following NAACCR*Prep algorithms need to be added to the algorithms library or addressed through some other means....

County at DX analysis
ICCC Site Recode - Major Category
ACS ABSM Linkage Fields (2007-2011, 2011-2015)
Year of DX + 1

We also need to discuss the possibility of creating a census tract recode field to do the ACS linkage at IMS. This field would be generated at runtime.

So just as an example of how we do recoded IDs in Match*Pro...

We take the original ID and generate N random numbers (where N = the length of the ID). The set of random numbers is only created once and the shift is applied to every ID.

For each character in the ID we shift it by the N[i] value.

For example: ID # 001, random numbers 22, 2, 13, recoded ID = M2D.
For example: ID # 282, random numbers 22, 2, 13, recoded ID = OAF.

This is kind of like a Caesar Shift Cipher but instead of shifting each character by the same amount we are randomly shifting each character by a different amount.

When the registries create the extracts with the recoded census tracts they would need to send us the random numbers used by the ciphers through some other means (encrypted email, etc.) so we could decrypt them.
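The per-character shift described above could be sketched as follows. The 36-character alphabet (digits followed by uppercase letters) with wraparound is an assumption; it is consistent with the 282 example, but the actual Match*Pro alphabet is not given here. The class and method names are hypothetical.

```java
// Hypothetical sketch of the per-position shift; not the Match*Pro implementation.
final class IdRecoder {

    // Assumption: digits followed by uppercase letters, wrapping around at the end.
    private static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Shifts the character at position i forward by shifts[i].
    static String encode(String id, int[] shifts) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < id.length(); i++)
            result.append(shift(id.charAt(i), shifts[i]));
        return result.toString();
    }

    // Reverses encode() when given the same shifts.
    static String decode(String recoded, int[] shifts) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < recoded.length(); i++)
            result.append(shift(recoded.charAt(i), -shifts[i]));
        return result.toString();
    }

    private static char shift(char c, int amount) {
        int index = ALPHABET.indexOf(c);
        if (index < 0)
            throw new IllegalArgumentException("Unsupported character: " + c);
        int size = ALPHABET.length();
        // double modulo keeps the index non-negative when shifting backwards
        return ALPHABET.charAt(((index + amount) % size + size) % size);
    }
}
```

For instance, `IdRecoder.encode("282", new int[] {22, 2, 13})` yields `OAF`, and `decode` with the same shifts recovers `282`, which is why the shifts have to be shared with whoever needs to reverse the recode.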