imsweb / algorithms
Java implementation of cancer-related algorithms (NHIA, NAPIIA, Survival Time, etc...)
License: Other
There is a note that is different from the SEER website information for Breast (2018) - see https://seer.cancer.gov/manuals/2018/AppendixC/Surgery_Codes_Breast_2018.pdf
Please fix that specific case.
Then review the Breast for all previous years.
And finally, try to do a full review of all the 2018 sites. You should be able to use the lab class that shows the content of our tables, and compare that to the current information on the SEER website. We need to make sure there are no other differences.
If the NAPIIA algorithm can't compute a result, it returns an empty string in its result object. I don't see any reason for that; "empty" values should be null. That's the case everywhere except in this algorithm. That doesn't make much sense to me and I would like to change that.
We have this conversion in an old C++ program (I will provide it). We need to translate the program into Java logic and add that to our IcdUtils class.
We found an error in the 2007 data in the table selection, that means the wrong table can sometimes be returned for specific site/hist combinations.
We are going to fix that error and re-check all the data again.
record.number.recode.error.algorithms.xlsx
I attached 3 patients where the record number recode comes out wrong. I found more cases than these, but the only ones I saw were when the last two records were after the study cutoff.
Col C is what we want the output to be. Col D is the current output. Cols E-I are the algorithms inputs, J-P are the rest of the outputs (these are correct).
Let me know if you need more information.
Historic Stage needs to be updated annually to account for any new CS codes that had not previously been accounted for.
The list of input fields for the county of dx analysis algorithm in Algorithms.java includes a few extra fields that are not actually used by the algorithm.
These include:
State at dx geocode 70/80/90
State at dx geocode 2000
State at dx geocode 2010
State at dx geocode 2020
There are setters and getters for these in CountyAtDxAnalysisInputDto.java, and presumably that's why they were included among the input fields, but the getters are never actually used. It might be best to remove them to avoid confusion. The algorithm never uses state at dx geocode. The SAS code also never uses it.
Incidentally, the county at dx geocode 2020 and census tract certainty 2020 fields are also never used, but I can see them being used eventually... so I'll leave it up to you to decide if you want to keep them, comment them out, or whatever.
No rush on this whatsoever, btw.
It's currently 0-based but it will become an official NAACCR variable in 2018, and it will be defined as being 1-based. So we should fix it now...
Here are some examples:
That last one is long, but it's kind of official and it also makes sense the way it is. And I would like to keep that one as-is but that would be the maximum length allowed.
I would like to change the first one to
And the second one to
I would also like to add a new attribute to the Algorithm class: a URL to more information about the algorithm. All SEER algorithms have documentation available on the SEER website; unfortunately the NAACCR ones don't, so that will be an optional attribute.
That class uses complicated logic to load the data; I think it can be greatly simplified.
As an example, the CensusTractPovertyIndicatorCsvData was also very complex and I simplified it. I think we should do the same for this one. In particular, I would like to stop using the modulo operator to code several values into a single field. That's just not worth it, and it makes the logic very complex/convoluted.
Can an override for getUnknownValues() be added to the createIarc() method in Algorithms.java?
The IARC functions return an unknown value (literal string "null").
Therefore, I would like "null" to be included in the list when calling getUnknownValues().... either that, or change the IARC algorithms to return null instead of "null".
The few utility methods included in AlgorithmUtils were not meant to be exposed outside of the library; they are really "internal" utility methods. For now it doesn't cause any issues, but the implementation and behavior of those methods could be changed in the future and that could break other projects using those methods.
The methods will be moved to a new utility class in the "internal" package, and the ones in AlgorithmUtils will be deprecated.
Note that if one day this library fully (and properly) supports Java modules, the classes in the "internal" package won't be exposed outside of the library anymore...
Most algorithms have methods that either take input DTO objects, or the actual values (like site/hist for example).
For convenience, almost all of them support methods that take a map of properties, and they define the keys they expect to find. Those keys are based on the layout framework, and some of them are not going to be valid when NAACCR XML becomes the only standard and replaces the flat file (when that happens the layout flat file will need to be changed so its field names align with the NAACCR XML IDs).
In preparation for that transition, I am going to deprecate all those methods that use maps of properties. Applications should use the input DTO objects; it's a cleaner approach anyway.
The current surgery code module takes its data from the SEER website but only supports "the latest data".
We need to add support for multiple SEER manuals (so we keep historic data) and the methods that request the data need to take a year parameter (I guess the current year can be assumed if not provided).
The providers initialize lookups from CSV files but those methods are not properly synchronized. As a result, it is possible for one thread to think the lookups have been initialized when only a few have been, and the returned results are then incorrect.
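One way to avoid the partially-initialized state described above is to build every lookup into a local map and publish it in a single assignment. This is only a minimal sketch: `LazyLookupProvider` and its loader are hypothetical names, not classes from this library.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical provider; class, field and method names are illustrative only.
class LazyLookupProvider {

    // Stand-in for the code that reads the CSV resources
    private final Supplier<Map<String, String>> loader;

    // volatile so a fully built map is safely visible to other threads
    private volatile Map<String, String> lookups;

    LazyLookupProvider(Supplier<Map<String, String>> loader) {
        this.loader = loader;
    }

    String getValue(String key) {
        Map<String, String> local = lookups;
        if (local == null) {
            synchronized (this) {
                local = lookups;
                if (local == null) {
                    // Load everything first, then publish once; readers never see a half-filled map
                    local = Collections.unmodifiableMap(new HashMap<>(loader.get()));
                    lookups = local;
                }
            }
        }
        return local.get(key);
    }
}
```

With this shape, a thread either sees no lookups at all (and waits for the load) or sees all of them.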
That class got parameterized, but none of the code that uses it properly uses the class parameters, resulting in a lot of warnings in the code (and in the compilation, I am sure). Need to look into this.
In SEER*DMS we handle the following in MorphologyUtils:
These are accomplished through spreadsheets that are supplied by NCI. I think that logic should be added here.
Census updated their list of Hispanic surnames so we will need to update the resource file so that it matches the new list.
The Census Tract Poverty Code gets updated every year. We will need to update it to include 2015 data.
The ACS poverty linkage algorithm needs to be updated on an annual basis.
Cases diagnosed 2014 and later should use the ACS 5-year data from 2012-2016.
I already pulled the data from the Fact Finder website.
I am considering switching File*Pro to use the dynamic algorithms mechanism, but I need a bit more information from the fields:
I don't think any of this will affect current users of that mechanism (it's just extra information that they won't care about).
In the line below, I'm not sure which country "SWK" is. I didn't find it in the ISO-3166 table or SEER's country codes.
There are some last minute changes to PRCDA/UIHO that need to be implemented. Need to rename a few fields to "17" for 2017 instead of "16" for 2016. Need to add logic for states where every county isn't PRCDA.
There is an issue with item #1775 (record number recode), which is part of the survival algorithm.
This data item should be padded on the left with a zero.
This is per the NAACCR specifications, which can be found here:
http://datadictionary.naaccr.org/default.aspx?c=10#1775
Can the leading zero be added to the output for this field?
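For illustration, the padding could look like the sketch below. `RecordNumberFormatter` is a hypothetical helper, and the two-character width is an assumption that should be checked against the NAACCR dictionary entry linked above.

```java
// Hypothetical helper; not part of the survival algorithm's actual code.
class RecordNumberFormatter {

    // Assumes a two-character field; returns null when there is no value to format
    static String format(Integer recordNumber) {
        if (recordNumber == null)
            return null;
        return String.format("%02d", recordNumber);
    }
}
```

Under those assumptions, a record number of 1 would be written as 01, while 12 would stay 12.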
Line 215 of IarcUtils (shown below), which is part of the InternalRecDto constructor, can throw a null pointer exception.
_seqNum = record.getSequenceNumber();
This happens when a patient has more than 1 tumor and one of the tumors is missing the sequence number. The problem is that the sequence number in the IarcMpInputRecordDto object is an Object of type Integer and it's being assigned to the sequence number of the InternalRecDto which is a primitive int.
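The failure comes from auto-unboxing a null Integer into an int. One possible fix, sketched here with simplified stand-ins for the two DTOs (the nested classes below are not the real `IarcMpInputRecordDto` and `InternalRecDto`), is to keep the boxed type on the internal record so a missing sequence number stays representable:

```java
class SequenceNumberExample {

    // Stand-in for the input DTO; only the relevant getter is modeled
    static class InputRecord {
        private final Integer sequenceNumber;

        InputRecord(Integer sequenceNumber) {
            this.sequenceNumber = sequenceNumber;
        }

        Integer getSequenceNumber() {
            return sequenceNumber;
        }
    }

    // Stand-in for the internal DTO; the field is Integer instead of int
    static class InternalRecord {
        private final Integer seqNum;

        InternalRecord(InputRecord record) {
            // No unboxing happens here, so a null sequence number no longer throws
            seqNum = record.getSequenceNumber();
        }

        Integer getSeqNum() {
            return seqNum;
        }

        boolean hasSeqNum() {
            return seqNum != null;
        }
    }
}
```

Code that compares or sorts on the sequence number would then need to decide explicitly how to treat a missing value.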
The returned codes are two digits to support unknown.
Regular codes (which are one digit) are left-zero padded (01, 02, etc...); the regular unknown code is then 09. And then the extra unknown flavors (which are not standard) are 96, 97, etc...
The algorithm will be changed to return 1 digit instead of two; it will return 1, 2, etc... for the regular codes, and A, B, etc... for the unknown.
The following values will be used for the unknown:
A = State, county, or tract are invalid
B = State and tract are valid, but county was not reported
C = State + county + tract combination was not found
D = State, county, or tract are blank or unknown
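The four unknown flavors listed above could be modeled as an enum so callers don't have to compare raw letters. This is just a sketch; `UnknownReason` is a hypothetical name, and only the codes and labels come from the list above.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical enum; the algorithm itself may expose these codes differently.
enum UnknownReason {
    INVALID("A", "State, county, or tract are invalid"),
    COUNTY_NOT_REPORTED("B", "State and tract are valid, but county was not reported"),
    NOT_FOUND("C", "State + county + tract combination was not found"),
    BLANK_OR_UNKNOWN("D", "State, county, or tract are blank or unknown");

    private final String code;
    private final String label;

    UnknownReason(String code, String label) {
        this.code = code;
        this.label = label;
    }

    String getCode() {
        return code;
    }

    String getLabel() {
        return label;
    }

    // Returns empty for regular codes (1, 2, ...) and for null
    static Optional<UnknownReason> fromCode(String code) {
        return Arrays.stream(values()).filter(r -> r.code.equals(code)).findFirst();
    }
}
```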
Those conversions can be found in the spreadsheet on the SEER website:
Website: https://seer.cancer.gov/tools/conversion/
Document on that page: https://seer.cancer.gov/tools/conversion/ICD03toICD9CM-ICD10-ICD10CM.xls
We will need to write some code to convert from the spreadsheet to whatever internal format we decide to use; this will probably include some manual steps (at the very least, save each tab as a separate CSV file); those should be described somewhere so we can easily re-migrate the data if needed.
The big challenge here is the amount of data. We will need to find some pretty fancy way to not use too much memory with all this. Even with good optimizations in place, we should consider implementing the "data provider" pattern so applications like SEER*DMS can put all that data in its database and not use tons of memory...
On line 1454 the following:
inputDto.setCountyAtDxGeocode2000((String)inputTumor.get(FIELD_COUNTY_AT_DX_GEOCODE_2010));
Should be
inputDto.setCountyAtDxGeocode2010((String)inputTumor.get(FIELD_COUNTY_AT_DX_GEOCODE_2010));
XStream is the library we use to read XML; they recently added a "security" environment to fix some vulnerabilities.
Need to upgrade and define the proper security environment.
The new framework is very similar to Joda, just better. There is no reason to keep Joda around anymore.
For the "Esophagus" table, histology exclusion "9733" should be "9732" and "9764" should be "9762".
I believe this only affects the latest XML data file (2018).
This was already fixed on the SEER website: https://seer.cancer.gov/manuals/2018/AppendixC/Surgery_Codes_Esophagus_2018.pdf
Year-based surgery codes were previously added for years 2010-2018. This issue will add years 2003, 2004, and 2007, allowing us to support surgery codes for any year from 2003 to present.
We need to expose isHeavilyHispanic() and isRarelyHispanic() so other processes can re-use the data contained in the library...
Hi Fabian,
Recinda requested that NAACCR*Prep and the algorithm address the issue of the NHIA value being different across tumors. This occurs because of the option selected for the algorithm (based on % population for the county at diagnosis codes) combined with the patient moving from one county to another (with a different pop threshold). It causes the records to fail the inter-record edits program. It is very rare, but I have been addressing it for the last couple of years to make the values consistent across the records. Below is what I was told about how to address it (details can be found in the NAACCR Squish project, issues 65034 and 64839). Here is an inter-record error report from CO as an example. co.2018cfd.inter-record_detailed-rpt.pdf
I would like to request that the calculation for the ICCC Major Category be added to the library. The ICCC Major Category collapses the 3-digit ICCC code to 2 digits; each 2-digit code represents a different major site category.
I've attached some code, it may or may not compile but it should be more than sufficient to get folks started.
All algorithms should be checked. CauseSpecific is one of them.
The function SurvivalTimeUtils.calculateSurvivalTime() can return all spaces for values. For certain cases you can get this back:
SurvivalTimeOutputRecordDto.getSurvivalTimeDxYear() = “ “
SurvivalTimeOutputRecordDto.getSurvivalTimeDxMonth() = “ “
SurvivalTimeOutputRecordDto.getSurvivalTimeDxDay() = “ “
If these values are placed in a fixed-width data file, they will be considered empty/non-existent/null. But for XML, we don't want to add fields which are spaces. If a value is empty or undefined, the value should be NULL.
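A small normalization step applied to each output value would cover this. The sketch below is an assumption about how it could be done; `BlankToNull` is a hypothetical helper, not an existing class in the library.

```java
// Hypothetical helper; converts empty or all-space values to null.
class BlankToNull {

    static String normalize(String value) {
        if (value == null || value.trim().isEmpty())
            return null;
        return value;
    }
}
```

Fixed-width writers can still pad a null with spaces on output, while XML writers can simply skip the field.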
I would like to request that the calculation for county at diagnosis (analysis) be added to the library. I've attached source code from NAACCR*Prep. It may need to be updated for 2019.
Found in SEER*DMS when using the site-specific surgery lookup.
I already fixed this on the SNAPSHOT (1.4.3) by switching XStream from the Stax driver to the Xpp driver; it looks like the Stax driver has issues with CDATA sections. Doing so, I am pretty sure I broke the indentation in the writing process of XML, but the surgery tables are only read, never written, so that's OK for now. I would still like to fix that in the official release, if possible.
I still need to review my change, add a changelog entry and make an official release. I also need to verify other places to switch to that other driver as well.
This library was already pretty big because of all the resources it contains.
A few months ago, we added a new "acslinkage" module and added two new CSV resource files. I didn't notice at the time how big this is. They added more than 20MB of compressed data (so they added 20MB to the size of the JAR file). That's really a lot.
We found this out because we noticed that our SEER*DMS WAR files got bigger. This also affected other software like the SEER Data Viewer and the SEER Abstracting tool.
I am not sure what the answer is. Right now only one application uses the acslinkage. We might need to consider moving the resources to that project (but still keep the algorithm in this library; hopefully the algorithm was designed with a data provider pattern).
Set a value of 8 (unknown/missing COD – 7777 or 7797) for the fields SEER Cause-specific death classification and SEER Other cause of death classification.
See the attached for a case that is being calculated incorrectly. I have no idea why or why we haven't seen more like it, it doesn't seem THAT unusual.
I included all algorithm inputs for all fields, the algorithm outputs for the problematic fields and what the correct values should be. I'm using version 2.1.
We can talk in person if that is helpful. The general idea is that the seq 60 case should be sorted first since it's clearly before the seq 1 case; both dx dates are known. Then, since we know the seq 2 case should come after the seq 1 case (due to sequence), its missing dates should be imputed halfway between the diagnosis date of seq 1 and the date of last contact, since it's the same year and there are no other records. Does that make sense?
If the date and order get set properly, it will end up changing the flags and the survival months. Let me know if you have any questions.
Bad.Survival.Case.xlsx
To optimize memory usage, the data provider uses a mechanism of multiplying the result for each year category, and dividing it during the actual calculation. That mechanism is difficult to understand when reading the code, and more importantly, it doesn't scale. We just added new year categories and now we have to multiply the value by a million. That just doesn't work well. At all.
As we discussed, these algorithms will need to be added to the library. Just making an issue for this for now.
That module has been moved into its own "MPH" project: https://github.com/imsweb/mph
These values should be added to the .csv:
C489 -> C488|8000|3|9
C769 -> C488|8000|3|9
C799 -> C809|8000|3|9
C849 -> C449|9702|3|9
C909 -> C421|9732|3|9
Vital Status Recode is a new data item added in NAACCR 18.
The Survival SAS program has not been updated to return that variable yet, but when it is, we will need to update the corresponding Java program.
The SAS code has not been updated, but needs to be. I apologize – I forgot this was getting added (it came up a while ago). It is a simple calculation. For what it is used for, it is just the vital status at the study cutoff date. We have always calculated it as VSR = VS. If date of last contact > study cutoff, then VSR = Alive. With this logic, if the patient is born on 11/1/2017, and study cutoff is 12/31/2016, they will have VSR = alive (even though they are not alive yet). But the fields are used for survival, and a patient born (or diagnosed) after study cutoff is excluded. I think alive makes the most sense here.
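The rule described above (VSR = VS, except that a date of last contact after the study cutoff forces Alive) can be sketched as follows. `VitalStatusRecode` is a hypothetical class, and using "1" as the alive code is an assumption about the vital status coding; full dates are also assumed, which the real inputs may not always provide.

```java
import java.time.LocalDate;

// Hypothetical sketch of the rule; not the library's implementation.
class VitalStatusRecode {

    // Assumption: "1" is the vital status code for alive
    static final String ALIVE = "1";

    static String compute(String vitalStatus, LocalDate dateOfLastContact, LocalDate studyCutoff) {
        // Last contact after the cutoff means the patient was alive at the cutoff
        if (dateOfLastContact != null && studyCutoff != null && dateOfLastContact.isAfter(studyCutoff))
            return ALIVE;
        // Otherwise the recode is simply the reported vital status
        return vitalStatus;
    }
}
```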
In the process of updating the SAS program for census tract poverty I discovered that the 2012-2016 data was not being coded correctly. This would affect cases diagnosed in 2014.
The resource file for the 2012-2016 data needs to be updated. Many of the tracts that were coded as 1 should have been coded as something else.
I have the file ready. I just need to create a branch and commit.
This library contains about 10 modules and most of them are "algorithms" (in the sense that they take some fields as input and return computed fields).
For those modules, we would like to add the ability for the library to auto-register them so an application can provide standard NAACCR data items and ask the library to compute a given set of algorithms.
The idea is that the calling application won't be aware of the algorithms themselves (it won't call algorithm-specific static methods); instead the calling application will ask the library for all algorithms, then a user would select which ones should run, and finally the application would call the library with the list of algorithms to execute. The advantage of this approach is that new algorithms won't require any (coding) work in the calling application.
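To make the flow above concrete, here is a rough sketch of what such a mechanism could look like. `DemoAlgorithm` and `DemoAlgorithmRegistry` are hypothetical names, and passing fields as a map is only a simplification for the example; the real design may use dedicated input/output objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract each registered algorithm would implement
interface DemoAlgorithm {

    String getId();

    String getName();

    Map<String, Object> execute(Map<String, Object> input);
}

// Hypothetical registry; the application only talks to this class
class DemoAlgorithmRegistry {

    private final Map<String, DemoAlgorithm> algorithms = new LinkedHashMap<>();

    void register(DemoAlgorithm algorithm) {
        algorithms.put(algorithm.getId(), algorithm);
    }

    // Step 1: the application asks for everything that is available
    List<DemoAlgorithm> getAlgorithms() {
        return new ArrayList<>(algorithms.values());
    }

    // Step 2: the application passes back the IDs the user selected
    Map<String, Object> execute(List<String> ids, Map<String, Object> input) {
        Map<String, Object> output = new LinkedHashMap<>();
        for (String id : ids) {
            DemoAlgorithm algorithm = algorithms.get(id);
            if (algorithm == null)
                throw new IllegalArgumentException("Unknown algorithm: " + id);
            output.putAll(algorithm.execute(input));
        }
        return output;
    }
}
```

A new algorithm then only needs to be registered inside the library; the calling application picks it up through `getAlgorithms()` without code changes.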
The following NAACCR*Prep algorithms need to be added to the algorithms library or addressed through some other means....
We also need to discuss the possibility of creating a census tract recode field to do the ACS linkage at IMS. This field would be generated at runtime.
So just as an example of how we do recoded IDs in Match*Pro...
We take the original ID and generate N random numbers (where N = the length of the ID). The set of random numbers is only created once and the same shifts are applied to every ID.
For each character in the ID we shift it by the N[i] value.
For example: ID # 001, random numbers 22, 2, 13, recoded ID = M2E.
For example: ID # 282, random numbers 22, 2, 13, recoded ID = OAF.
This is kind of like a Caesar Shift Cipher but instead of shifting each character by the same amount we are randomly shifting each character by a different amount.
When the registries create the extracts with the recoded census tracts they would need to send us the random numbers used by the ciphers through some other means (encrypted email, etc.) so we could decrypt them.
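The per-position shift can be sketched as below. The alphabet (digits followed by uppercase letters, wrapping around at the end) is an assumption inferred from the 282 -> OAF example; Match*Pro may use a different character set. `IdShiftCipher` is a hypothetical name.

```java
// Hypothetical sketch of the per-position shift; not the Match*Pro implementation.
class IdShiftCipher {

    // Assumed alphabet: 0-9 then A-Z (36 symbols)
    private static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String recode(String id, int[] shifts) {
        return shift(id, shifts, 1);
    }

    static String decode(String recodedId, int[] shifts) {
        return shift(recodedId, shifts, -1);
    }

    private static String shift(String value, int[] shifts, int direction) {
        if (value.length() > shifts.length)
            throw new IllegalArgumentException("Need one shift per character");
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            int index = ALPHABET.indexOf(value.charAt(i));
            if (index < 0)
                throw new IllegalArgumentException("Unsupported character: " + value.charAt(i));
            // floorMod keeps the index positive when shifting backwards
            result.append(ALPHABET.charAt(Math.floorMod(index + direction * shifts[i], ALPHABET.length())));
        }
        return result.toString();
    }
}
```

With shifts 22, 2, 13, `recode("282", ...)` gives OAF, and `decode("OAF", ...)` with the same shifts gives back 282, which is why the registries would need to send the shift values separately.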