jackwasey / icd

Fast ICD-10 and ICD-9 comorbidities, decoding and validation in R. NB use main instead of master for default branch.

Home Page: https://jackwasey.github.io/icd/

License: GNU General Public License v3.0

R 82.09% Shell 1.87% C 0.05% C++ 12.16% M4 0.89% Python 0.49% Dockerfile 0.03% Makefile 0.47% SAS 1.95%
comorbidities icd icd-10 icd-9 comorbidity cran icd-codes charlson-comorbidity-index charlson

icd's Issues

convert between wide and long ICD data

A good way of working with ICD-9 codes is in long format, so that each patient can have unlimited codes. Most EHR systems limit to 15 or 30 because they pre-allocate database table columns. This package should provide functions to convert between the two structures.

This is more complicated than just using reshape or reshape2...
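As a rough illustration of the basic reshaping step only (not the package's implementation), the sketch below stacks the code columns of a wide data frame in base R. The helper name `wide_to_long` and the column names are assumptions made for the example.

```r
# Hypothetical helper: reshape wide ICD data (one row per visit) to long
# format (one row per visit/code pair), dropping empty cells.
wide_to_long <- function(x, visit_col = "visitId") {
  code_cols <- setdiff(names(x), visit_col)
  # Stack the code columns; as.character() guards against factor columns.
  codes <- unlist(lapply(x[code_cols], as.character), use.names = FALSE)
  visits <- rep(as.character(x[[visit_col]]), times = length(code_cols))
  keep <- !is.na(codes) & nzchar(codes)
  long <- data.frame(visitId = visits[keep], icd9 = codes[keep],
                     stringsAsFactors = FALSE)
  # Order by visit so each visit's codes are adjacent.
  long[order(long$visitId), , drop = FALSE]
}

wide <- data.frame(
  visitId = c("PT123", "PT789"),
  ICD_01 = c("4411", "V1001"),
  ICD_02 = c("V1001", NA),
  ICD_03 = c("E8012", ""),
  stringsAsFactors = FALSE
)
long <- wide_to_long(wide)
```

Missing and empty cells are dropped, so each visit keeps only the codes it actually has.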

icd9Charlson with return.df=TRUE yields error in data.frame(): arguments imply differing number of rows

Using RStudio Version 0.98.1091, R version 3.1.2 (2014-10-31) on Windows 7 x64

Tried every combination of character vs. factor for the input data.frame columns, with stringsAsFactors both TRUE and FALSE, but always get the same result:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 197, 11

In this case, 197 is my total number of ICD-9 codes and 11 is the number of distinct visits.

Easy to work around by using return.df=FALSE and merging manually.

Thanks for this very useful package

vet code for magrittr friendliness

Most of the functions already take the data as the first input. There is at least one exception (icd9PartsToShort / icd9PartsToDecimal). Initial thought is to drop support for providing two vectors, major and minor, and just accept a data.frame.

audit co-morbidity mapping changes from year to year

Given annual updates to both the ICD-9 codes and the AHRQ co-morbidity mapping, the package should be able both to give co-morbidities as they would have been defined at a given date (see #6) and to show the differences.

Does a researcher want to apply the contemporary mapping of a set of ICD-9 codes, or a single mapping to all codes? It probably makes little difference, but it should be fairly easy to apply arbitrary co-morbidity mappings to a set of ICD-9 codings, and compare the co-morbidity flags.

calculate comorbidities directly from wide format data frame

Although it limits number of comorbidities per visit/patient, many people have 'wide' format ICD9 data, e.g.
visitId ICD_01 ICD_02 ICD_03 ...
PT123 4411 V1001 E8012 ...
PT789
...

Now that the allocation of comorbidities itself is so fast, the slowest step is setting up the vector of vectors of int values containing the icd9 codes. This is primarily because we need to search the list of visit IDs as we progress to check for duplicates, although there is an optimization for the case where the visitId is the same as the previous one. The initial 'wide' structure could be processed more quickly, and wouldn't (necessarily) require checking for duplicated visit IDs. We would still pay the price of converting the codes to factors, if they are not already. Factors make a lot of sense for these codes, since there are many duplicates.

The factor levels for all the columns wouldn't necessarily be the same; in fact the ICD_29 column, for example, is likely to be relatively unpopulated, and probably has far fewer levels. The factor levels would have to be made consistent across the ICD columns (and consistent with the mapping, but I do this anyway, reducing the mappings to only those codes I am actually going to assign).

The current way of doing this would be to call icd9WideToLong then icd9Comorbid. It would be better to have icd9ComorbidFromWide which would take a data frame and some information about how the columns are named. People are much less likely to have this in matrix format, so don't cover this case.
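The factor-level point above could look something like the following sketch, which gives every ICD column the same levels, limited to codes that appear in the mapping. The helper `unify_code_levels` and the toy mapping are assumptions for illustration, not part of the package.

```r
# Hypothetical helper: give every ICD column the same factor levels,
# restricted to codes that appear in the comorbidity mapping.
unify_code_levels <- function(x, code_cols, mapping) {
  map_codes <- unique(unlist(mapping, use.names = FALSE))
  seen <- unique(unlist(lapply(x[code_cols], as.character), use.names = FALSE))
  lv <- sort(intersect(seen, map_codes))
  # Codes outside the shared levels become NA: they cannot match any comorbidity.
  x[code_cols] <- lapply(x[code_cols],
                         function(col) factor(as.character(col), levels = lv))
  x
}

toy_map <- list(CHF = c("4280", "4281"), DM = c("25000"))  # illustrative only
wide <- data.frame(visitId = c("a", "b"),
                   ICD_01 = c("4280", "25000"),
                   ICD_02 = c("V1001", NA),
                   stringsAsFactors = FALSE)
out <- unify_code_levels(wide, c("ICD_01", "ICD_02"), toy_map)
```

With shared levels, the integer codes underlying each column are directly comparable across columns.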

allow integer values for 'short' form codes

Floating point values lead to incorrect 'long' or decimal format codes, but 'short' form codes are not ambiguous, since we know that up to the first three characters are always the major part. They are not in numerical sequence, but can be uniquely represented by integers.

We should therefore allow character and integer 'short' form codes, but disallow non-character decimal codes.

For output, it would be good to try to return the same type as given as input, but won't guarantee this for now. Natural sorting of integer and character short codes is different, so if there is a problem here, will revert to character.

succinctly explaining a set of ICD-9 codes

icd9CondenseToExplainShort has two problems:

  1. works in principle, but doesn't account for issue #2
  2. doesn't account for the fact that not all possible child nodes exist. The fix will involve only finding codable children using the icd-9 to text look-up.

validate using age

neonatal, preterm, and infant only codes could be validated against age. If age only available in years, some validation could still be done.

geriatric only codes could also be checked, although less clear what age to cut-off. Validation could at least warn, not fail, for apparent errors.
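A minimal sketch of a warn-only age check follows. It assumes zero-padded 'short' codes, uses the perinatal chapter (majors 760-779) as the example group, and picks an arbitrary one-year cut-off; the function name is hypothetical.

```r
# Hypothetical check: flag perinatal-chapter codes (ICD-9 majors 760-779)
# recorded for patients older than an assumed cut-off. Warns, does not fail.
flag_perinatal_age <- function(icd9_short, age_years, max_age = 1) {
  # V and E codes give NA here and are never flagged.
  major <- suppressWarnings(as.integer(substr(icd9_short, 1, 3)))
  perinatal <- !is.na(major) & major >= 760L & major <= 779L
  suspicious <- perinatal & !is.na(age_years) & age_years > max_age
  if (any(suspicious)) {
    warning(sum(suspicious), " perinatal code(s) recorded for age > ", max_age)
  }
  suspicious
}

flags <- suppressWarnings(
  flag_perinatal_age(c("7651", "4280", "V3000", "77981"), c(45, 45, 0, 0))
)
```

Because age in whole years is enough for this check, it still works when finer-grained age is unavailable.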

enable direct use of wide format Present-on-Arrival

Wide format and long format ICD codes are handled fine, but currently filtering by POA is only done with long data. The user can convert to long format to do this, but eventually it would be good to provide a wide format POA matrix or data frame alongside a wide format ICD code matrix or data frame. The expectation would be that visitIds and position of POA flag against a particular code matched exactly, and this would normally be the case if the data originated in a database table with, e.g. 30 fields for ICD code, and another 30 alongside for POA status.
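The matched-position idea could be sketched as below: two matrices of identical shape and row order, where a code is kept only if its POA cell carries the wanted flag. The function name and the use of "Y" as the present-on-arrival flag are assumptions.

```r
# Sketch: blank out codes that were not present on arrival, using two
# matrices with identical shape and row order. "Y" as the POA flag is assumed.
filter_wide_poa <- function(codes, poa, keep = "Y") {
  stopifnot(identical(dim(codes), dim(poa)),
            identical(rownames(codes), rownames(poa)))
  codes[is.na(poa) | !(poa %in% keep)] <- NA_character_
  codes
}

codes <- matrix(c("4280", "25000", "5849", NA), nrow = 2,
                dimnames = list(c("v1", "v2"), c("ICD_01", "ICD_02")))
poa <- matrix(c("Y", "N", "N", NA), nrow = 2,
              dimnames = list(c("v1", "v2"), c("POA_01", "POA_02")))
poa_only <- filter_wide_poa(codes, poa)
```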

flesh out tests

There are already a lot of tests, but coverage could be expanded. My general approach is to over-test, even if (I think) I'm exercising an already tested code-path. This enables the code path underneath to change and the test to become more relevant, and sometimes my assumption about the code path is not correct, and the test is already effective.

  1. More tests for multiple inputs to many functions. Mostly I've hammered out the single-value inputs.
  2. ensure consistency in whether to accept zero length input, numeric vs character input, NA input.

undo workaround for Rcpp bug

The major/minor function argument limitation is now gone in version 0.11.4 of Rcpp, so the workaround can be removed provided we depend on that version of Rcpp.

incorporate annual changes to the ICD-9 specifications

This is a thorny issue. There have been small updates to ICD-9 each year, until recently.

Furthermore, AHRQ have updated their version of Elixhauser ICD-9 to co-morbidities annually.

A thorough implementation would optionally accept year or date with every ICD-9 code, and treat it appropriately.

explain single decimal categories

Currently, the 'major' 3 digit category, chapters and sub-chapters can be given for a code, but single decimal minor categories which are not themselves billable are currently not labelled. E.g. 003.2

filter whole data frames for icd9 validity or existence

This would be more magrittr-friendly, and would make it easy to pull out rows with valid or invalid icd9 codes. It complements similar functions for simple vectors of icd9 codes.

# get rows with invalid icd9 codes
myPatients %>% icd9FilterInvalid()
# same again, but convert to vector and find distinct invalid codes
myPatients %>% icd9FilterInvalid() %>% extract2("icd9") %>% unique
# show top few rows with valid codes, with named icd9 field:
myPatients %>% icd9FilterValid(icd9Field = "i9code", isShort = TRUE) %>% head
# get valid rows and show human-readable names of the codes:
myPatients %>% icd9FilterValid() %>% extract2("icd9") %>% icd9Explain()

Validity vs existence in the master list of codes (which may be the wrong year, or incomplete..)

myPatients %>% icd9FilterExists()
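To make the proposed behaviour concrete, here is a rough base R sketch of a data frame filter. The names `is_valid_short` and `filter_valid` are hypothetical, and the regular expression is a deliberately simplified stand-in for real validation of 'short' codes.

```r
# Simplified validity test for 'short' form codes (an assumption, not the
# package's rules): 3-5 digits, V + 2-4 digits, or E + 3-4 digits.
is_valid_short <- function(x) {
  grepl("^([0-9]{3,5}|V[0-9]{2,4}|E[0-9]{3,4})$",
        toupper(trimws(as.character(x))))
}

# Hypothetical filter: data frame first, so it can sit in a pipeline.
filter_valid <- function(x, icd9Field = "icd9", invert = FALSE) {
  ok <- is_valid_short(x[[icd9Field]])
  x[if (invert) !ok else ok, , drop = FALSE]
}

myPatients <- data.frame(visitId = c("a", "b", "c", "d"),
                         icd9 = c("4280", "V1001", "E8012", "42.8x"),
                         stringsAsFactors = FALSE)
valid <- filter_valid(myPatients)
invalid <- filter_valid(myPatients, invert = TRUE)
```

Taking the data frame as the first argument is what lets such a function be chained with `%>%`.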

source data for ICD-9 code to human-readable does not contain high-level descriptions

The data from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html is limited to the most detailed codes for each condition, and does not include the higher level classification. E.g. 053 herpes is not included in these data, but all the specific types are: 0530 0531 0531[0-4] 0532 0537 0538 and 0539.

The canonical ICD-9 description with the high-level codes included seems to be in rich text format at http://www.cdc.gov/nchs/icd/icd9cm.htm . Resolution of this issue will entail parsing out the 'major' part ICD-9 code level, e.g. 053 herpes. In addition, there are even higher level groupings, e.g. INTESTINAL INFECTIOUS DISEASES (001-009). These should also be extracted and be available in the ICD-9 code to text mapping.

allow "X" to terminate a code <5 characters long

This is technically allowed, although I've not seen it, yet. I'm currently just using shorter codes to represent this. E.g. "100" could also be "100XX"

In contrast, decimal format code "10.0" would be 0100 in short form, or 0100X with the terminating X notation.

Add tests.
It may be better to work the X processing into the existing functions, rather than having additional internal conversion functions. My initial thought is that just dropping the X gives a valid code with identical meaning, and that would be easy to code.
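The "just drop the X" idea is roughly a one-liner. This sketch assumes 'short' form input; the function name is hypothetical.

```r
# Sketch: treat trailing "X" padding as equivalent to the shorter code.
drop_trailing_x <- function(icd9_short) {
  sub("[Xx]+$", "", icd9_short)
}

stripped <- drop_trailing_x(c("100XX", "0100X", "0100", "V10X"))
```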

allow use of alternative canonical lists of ICD codes

This would include ICD-10, and any of the numerous national variations of ICD-9, ICD-10 etc, and indeed any other coding system. Currently, the use of ICD-9-CM (which has been unchanged for a few years), is hard-coded.

calculate sums of comorbidities

This metric is sometimes used, with well-known limitations (e.g. very sick people don't get minor diseases coded).

Will be very easy to count across the co-morbidity matrix to return these values.
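Assuming the co-morbidity result is a logical matrix with one row per visit and one column per co-morbidity (the layout and names here are illustrative), the count is a row sum:

```r
# Toy comorbidity matrix: rows are visits, columns are comorbidities.
comorbid <- matrix(c(TRUE, FALSE, TRUE,
                     FALSE, FALSE, FALSE),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("v1", "v2"), c("CHF", "DM", "Renal")))

# Number of comorbidities flagged per visit.
comorbid_counts <- rowSums(comorbid)
```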

use ordered factor for ICD-9-CM codes

Could this enable comparisons and ranges using standard R without having a load of custom code to maintain? I could then just define the master order, and let R do the work.
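A small sketch of what this might look like: once a master order is defined, comparison and sorting on an ordered factor follow that order without custom code. The seven-code master list is only an illustrative fragment.

```r
# Illustrative master order; a real one would list every code in sequence.
master <- c("003", "0030", "0031", "00320", "00321", "0039", "004")
codes <- factor(c("0039", "0030", "00321"), levels = master, ordered = TRUE)

# Sorting and comparison follow the master order.
sorted <- sort(codes)
in_range <- codes >= "0031" & codes <= "0039"

# Expanding a range is then a slice of the master order.
range_codes <- master[match("0031", master):match("0039", master)]
```

One caveat of this approach is that codes absent from the master list become NA when converted to the factor.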

Fix R CMD check problems

When I run R CMD check on icd9 (as part of the release process for devtools), I see:

checking R code for possible problems ... NOTE
icd9Benchmark: no visible global function definition for
  ‘microbenchmark’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbid’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesAbbrev’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNames’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesHtnAbbrev’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesHtn’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbid’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesAbbrev’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNames’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtnAbbrev’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtn’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘quanDeyoComorbid’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘charlsonComorbidNamesAbbrev’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘charlsonComorbidNames’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbid’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesAbbrev’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNames’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtnAbbrev’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtn’
icd9CondenseToExplain: no visible binding for global variable
  ‘icd9Hierarchy’
icd9CondenseToExplain: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9Explain.character: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9Explain.character: no visible binding for global variable
  ‘icd9Hierarchy’
icd9GetChapters: no visible binding for global variable ‘icd9Chapters’
icd9GetChapters: no visible binding for global variable
  ‘icd9ChaptersSub’
icd9GetChapters: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9RealShort: no visible binding for global variable ‘icd9Hierarchy’
parseAhrqSas: no visible binding for global variable
  ‘ahrqComorbidNamesHtnAbbrev’
parseElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtnAbbrev’
parseQuanDeyoSas: no visible binding for global variable
  ‘charlsonComorbidNamesAbbrev’
parseQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtnAbbrev’

This usually means that you've forgotten to import a package or two
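One common way to quiet these NOTEs, sketched here rather than taken from the package, is to declare the lazy-loaded data objects with `utils::globalVariables()` in a package source file and to namespace-qualify the optional dependency. The file name and the exact list of names are assumptions; the names shown are a few taken from the check output above.

```r
# In a package source file, e.g. R/globals.R (file name is an assumption).
# Declares package data objects so R CMD check does not flag them as
# undefined global variables. Extend the vector with the remaining names.
if (getRversion() >= "2.15.1") {
  utils::globalVariables(c(
    "ahrqComorbid", "elixhauserComorbid", "quanDeyoComorbid",
    "icd9Hierarchy", "icd9Chapters", "icd9ChaptersSub", "icd9ChaptersMajor"
  ))
}

# For microbenchmark, list it under Suggests in DESCRIPTION and call it as
# microbenchmark::microbenchmark(...) after checking it is installed:
have_microbenchmark <- requireNamespace("microbenchmark", quietly = TRUE)
```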

deal with factors consistently and correctly

Currently this is an under-tested area. Many key functions will probably fail because they expect character vectors. It would be reasonable to always convert to strings, although it would be nice to give back a factor if the user gave one in the first place. This would be difficult in the C++ code, which is typed for string vectors throughout. A painful workaround could be R wrappers to convert to and from factors.

Simplest initial approach is simply to convert all the factors to character vectors.
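That simplest approach might be sketched as follows; the helper name is hypothetical.

```r
# Hypothetical helper: convert every factor column of a data frame to
# character, leaving other columns untouched.
factors_to_character <- function(x) {
  is_fac <- vapply(x, is.factor, logical(1))
  if (any(is_fac)) {
    x[is_fac] <- lapply(x[is_fac], as.character)
  }
  x
}

df <- data.frame(visitId = factor(c("a", "b")),
                 icd9 = factor(c("4280", "25000")),
                 n = 1:2)
out <- factors_to_character(df)
```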

allow implicit visitId with ragged lists of icd9 codes

As pointed out by @gforge, the current code relies on a visitId per row, and one row per ICD-9 code. This is the primary structure of the data I have been using.

An alternative layout is one row per visit (with or without an ID field), with multiple ICD-9 codes listed across the columns. This would be presented as a list of lists, or as a data frame with blank or missing values where a visit has fewer than the maximum number of ICD-9 codes. The data I am using caps at 30 codes per visit.

I've already written the code for this in C++, but it needs testing.
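For illustration only, the same conversion in plain R (not the C++ code mentioned above) could look like this, with an implicit visit ID taken from list position when no names are given. The function name is an assumption.

```r
# Sketch: turn a ragged list of codes into long format. Unnamed lists get
# an implicit visit ID from their position.
ragged_to_long <- function(x) {
  ids <- if (is.null(names(x))) as.character(seq_along(x)) else names(x)
  data.frame(visitId = rep(ids, times = lengths(x)),
             icd9 = as.character(unlist(x, use.names = FALSE)),
             stringsAsFactors = FALSE)
}

ragged <- list(v1 = c("4280", "25000"), v2 = character(0), v3 = "V1001")
long <- ragged_to_long(ragged)
implicit <- ragged_to_long(list("4280", c("25000", "V1001")))
```

Visits with no codes simply contribute no rows.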

calculate Charlson score

This will be very straightforward once I have eliminated the double counting of the three mild and severe disease pairs.
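A sketch of the scoring step, assuming a logical comorbidity matrix whose column names match the hypothetical abbreviations below. The weights are the original Charlson weights, and the three pairs handled are mild/severe liver disease, diabetes with/without complications, and cancer/metastatic cancer.

```r
# Original Charlson weights; the column abbreviations are assumptions.
charlson_weights <- c(MI = 1, CHF = 1, PVD = 1, Stroke = 1, Dementia = 1,
                      Pulmonary = 1, Rheumatic = 1, PUD = 1, LiverMild = 1,
                      DM = 1, DMcx = 2, Paralysis = 2, Renal = 2, Cancer = 2,
                      LiverSevere = 3, Mets = 6, HIV = 6)

charlson_score <- function(comorbid) {
  m <- comorbid[, names(charlson_weights), drop = FALSE]
  # Where the severe form is present, ignore the mild form of the same disease.
  m[, "LiverMild"] <- m[, "LiverMild"] & !m[, "LiverSevere"]
  m[, "DM"] <- m[, "DM"] & !m[, "DMcx"]
  m[, "Cancer"] <- m[, "Cancer"] & !m[, "Mets"]
  # Convert logical flags to 0/1 and take the weighted sum per visit.
  drop((m * 1) %*% charlson_weights)
}

comorbid <- matrix(FALSE, nrow = 2, ncol = length(charlson_weights),
                   dimnames = list(c("v1", "v2"), names(charlson_weights)))
comorbid["v1", c("DM", "DMcx", "CHF")] <- TRUE
comorbid["v2", c("Cancer", "Mets")] <- TRUE
scores <- charlson_score(comorbid)
```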

other disease classifications

AHRQ not only provides the Elixhauser-based co-morbidity mapping, but also finer grained disease groups.
https://www.hcup-us.ahrq.gov/toolssoftware/ccs/AppendixASingleDX.txt

"The Clinical Classifications Software (CCS) for ICD-9-CM is a diagnosis and procedure categorization scheme that can be employed in many types of projects analyzing data on diagnoses and procedures. CCS is based on the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM), a uniform and standardized coding system. The ICD-9-CM's multitude of codes - over 14,000 diagnosis codes and 3,900 procedure codes - are collapsed into a smaller number of clinically meaningful categories that are sometimes more useful for presenting descriptive statistics than are individual ICD-9-CM codes."

The following link has a zip with the best formatted computer-readable csv files:
http://www.hcup-us.ahrq.gov/toolssoftware/ccs/Single_Level_CCS_2015.zip

Of note, code 238 includes about 100 codes which should be purely in-hospital, not POA.

E-code ranges

Because of their different rules, and no imminent use-cases, parsing ranges of E codes has not been implemented.

This should not be a performance problem, because ranges should only be expanded once, even if processing a large data set. Use of memoise may be helpful if many ranges need to be processed for some reason.

diff co-morbidities

There are several ICD-9 to co-morbidity mappings, mostly based on either Elixhauser or Charlson. E.g. AHRQ, Deyo, Quan.

The user may have their own mapping structures.

The package should provide a function to find differences between Elixhauser-based or Charlson-based co-morbidities. There are some overlapping groups between Elixhauser and Charlson which could also be compared.
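If mappings are held as named lists of code vectors (an assumption about the data structure), a difference function could be sketched with set operations. The function name and the toy mappings are illustrative.

```r
# Sketch: for each comorbidity present in both mappings, list the codes
# found in only one of them.
diff_mappings <- function(x, y) {
  shared <- intersect(names(x), names(y))
  out <- lapply(shared, function(nm) {
    list(only_x = setdiff(x[[nm]], y[[nm]]),
         only_y = setdiff(y[[nm]], x[[nm]]))
  })
  names(out) <- shared
  out
}

map_a <- list(CHF = c("4280", "4281", "39891"), DM = c("25000"))  # toy data
map_b <- list(CHF = c("4280", "4281", "4289"), Renal = c("585"))  # toy data
d <- diff_mappings(map_a, map_b)
```

Groups that exist in only one mapping are skipped here; a fuller version might report those separately.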

implement S3 classes to encapsulate data.frame of codes, matrices of comorbidities

As S3 classes are so lightweight, we could easily add an attribute to label data as being in short or decimal format. Another label might be the comorbidity mapping.

This would simplify code (drop all the isShort function arguments and drop many of the trivial triple functions used to dispatch on short vs long type.) It would also simplify the processing chain commands:

eg:

myPatients %>% icd9MarkShort() %>% icd9ShortToDecimal() %>% 
  icd9ComorbiditiesAhrq() %>% icd9CharlsonScore()
  • Don't then need to remember and reassert whether the data are short or decimal, so icd9ComorbiditiesAhrq determines this without needing an argument.
  • Charlson score should fail because the function will recognize that the AHRQ comorbidities cannot be used, since the comorbidities matrix/data.frame will have an AHRQ attribute.

Other useful attributes might be the field names for the visitid poa, etc.
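A minimal sketch of the idea, with all names hypothetical: a class marks the format, and a generic dispatches on it in place of an isShort argument. Taking the first three characters as the major part is a simplification that ignores E codes.

```r
# Hypothetical constructors: tag a character vector as short or decimal form.
as_icd9_short <- function(x) structure(as.character(x), class = "icd9short")
as_icd9_decimal <- function(x) structure(as.character(x), class = "icd9decimal")

# A generic can then dispatch on the format instead of taking isShort.
icd9_major <- function(x) UseMethod("icd9_major")
icd9_major.icd9short <- function(x) substr(unclass(x), 1, 3)
icd9_major.icd9decimal <- function(x) sub("\\..*$", "", unclass(x))
icd9_major.default <- function(x) {
  stop("mark the codes as short or decimal first")
}

from_short <- icd9_major(as_icd9_short(c("4280", "V1001")))
from_decimal <- icd9_major(as_icd9_decimal(c("428.0", "V10.01")))
```

Unmarked input fails loudly in the default method, which mirrors the intent that downstream functions can refuse data carrying the wrong label.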

validate using gender

prostate, gynecologic, obstetric codes can be checked.
will have to assume gender is biologic.

E codes < 800

There are now E codes < 800:
E001-E030 Activity

Currently validation explicitly requires E codes >800.

create ICD-9 code ranges, optionally including parents which have broader scope.

include parent codes even if not all child codes are to be included in the range requested.

E.g. "V10.09" to "V10.11" would not normally include "V10.1"

This may be useful, since people use ICD-9 ranges in many publications, and seem to have slightly different meanings, including the possibility described above. The current implementation is the most conservative in not including codes which could imply additional codes than those specified in a range.
