jackwasey / icd

Fast ICD-10 and ICD-9 comorbidities, decoding and validation in R. NB use main instead of master for default branch.

Home Page: https://jackwasey.github.io/icd/

License: GNU General Public License v3.0

R 82.09% Shell 1.87% C 0.05% C++ 12.16% M4 0.89% Python 0.49% Dockerfile 0.03% Makefile 0.47% SAS 1.95%

comorbidities icd icd-10 icd-9 comorbidity cran icd-codes charlson-comorbidity-index charlson

icd's Introduction

icd

Fast comorbidities from ICD-9 and ICD-10 codes, decoding, manipulation and validation

Introduction

Calculate comorbidities, medical risk scores, and work very quickly and precisely with ICD-9 and ICD-10 codes. This package enables a workflow from raw tables of ICD codes in hospital databases to comorbidities. ICD-9 and ICD-10 comorbidity mappings from Quan (Deyo and Elixhauser versions), Elixhauser and AHRQ are included. Common ambiguities and code formats are handled. Comorbidity computation includes Hierarchical Condition Codes, and an implementation of AHRQ Clinical Classifications. Risk scores include those of Charlson and van Walraven. US Clinical Modification, World Health Organization, Belgian and French ICD-10 codes are supported, most of which are downloaded on demand.

icd is used by many researchers around the world who work in public health, epidemiology, clinical research, nutrition, journalism, health administration, insurance, and more. I’m grateful for contact from people in these fields for their feedback and code contributions, and I’m pleased to say that icd has been used in works like the Pulitzer finalist work on maternal death by ProPublica.

Features

find comorbidities of patients based on ICD-9 or ICD-10 codes, e.g. Cancer, Heart Disease
- several standard mappings of ICD codes to comorbidities are included (Quan, Deyo, Elixhauser, AHRQ, PCCC)
- very fast assignment of ICD codes to comorbidities (using novel matrix multiplication algorithm and C++ internally – see ‘efficiency’ vignette for details)
use your existing wide or long data format: icd can guess which columns contain ICD-9 or ICD-10 codes
explain and summarize groups of ICD codes in natural language, using ICD editions from the WHO, USA, France and Belgium. Many different annual editions of these data are available, and these may be downloaded automatically when used, or in bulk with download_all_icd_data().
Charlson and Van Walraven score calculations
Hierarchical Condition Codes (HCC) from CMS
Clinical Classifications Software (CCS) comorbidities from AHRQ
Pediatric Complex Chronic Condition comorbidities
AHRQ ICD-10 procedure code classification
correct conversion between different representations of ICD codes, with and without decimal points, leading and trailing characters (this is not trivial for ICD-9-CM). ICD-9 to ICD-10 cross-walk is not yet implemented
comprehensive test suite to increase confidence in accurate processing of ICD codes

Examples

See also the vignettes and examples embedded in the help for each function for more. Here’s a taste:

# install.packages("icd")
library(icd)

# Typical diagnostic code data, with many-to-many relationship
patient_data
#>   visit_id  icd9
#> 1     1000 40201
#> 2     1000  2258
#> 3     1000  7208
#> 4     1000 25001
#> 5     1001 34400
#> 6     1001  4011
#> 7     1002  4011
#> 8     1000  <NA>

# get comorbidities using Quan's application of Deyo's Charlson comorbidity groups
comorbid_charlson(patient_data)
#>         MI   CHF   PVD Stroke Dementia Pulmonary Rheumatic   PUD LiverMild
#> 1000 FALSE  TRUE FALSE  FALSE    FALSE     FALSE     FALSE FALSE     FALSE
#> 1001 FALSE FALSE FALSE  FALSE    FALSE     FALSE     FALSE FALSE     FALSE
#> 1002 FALSE FALSE FALSE  FALSE    FALSE     FALSE     FALSE FALSE     FALSE
#>         DM  DMcx Paralysis Renal Cancer LiverSevere  Mets   HIV
#> 1000  TRUE FALSE     FALSE FALSE  FALSE       FALSE FALSE FALSE
#> 1001 FALSE FALSE      TRUE FALSE  FALSE       FALSE FALSE FALSE
#> 1002 FALSE FALSE     FALSE FALSE  FALSE       FALSE FALSE FALSE

# or go straight to the Charlson scores:
charlson(patient_data)
#> 1000 1001 1002 
#>    2    2    0

# plot summary of Uranium Cancer Registry sample data using AHRQ comorbidities
plot_comorbid(uranium_pathology)

Comorbidities example: make “Table 1” summary data

A common requirement for medical research involving patients is determining new or existing comorbidities. This is often reported in Table 1 of research papers to demonstrate the similarity or differences of groups of patients. This package is focussed on fast and accurate generation of this comorbidity information from raw lists of ICD-9 and ICD-10 codes.

Here we are using the US National Hospital Discharge Survey 2010 data from the nhds package. For the sake of example, let us compare emergency to other admissions. A real table would have more patient features; this primarily demonstrates how to get ICD codes into your Table 1.

NHDS 2010 comorbidities to demonstrate Table One creation. Presented as counts (percentage prevalence in group).

nhds <- nhds::nhds2010
# get the comorbidities using the Quan-Deyo version of the Charlson categories
cmb <- icd::comorbid_quan_deyo(nhds, abbrev_names = FALSE)
nhds <- cbind(nhds, cmb, stringsAsFactors = FALSE)
Y <- nhds$adm_type == "emergency"
tab_dat <- vapply(
  unname(unlist(icd_names_charlson)),
  function(x) {
    c(sprintf("%i (%.2f%%)", 
              sum(nhds[Y, x]), 
              100 * mean(nhds[Y, x])),
      sprintf("%i (%.2f%%)",
              sum(nhds[!Y, x]),
              100 * mean(nhds[!Y, x])))
  },
  character(2)
)
knitr::kable(t(tab_dat), col.names = c("Emergency", "Not emergency"))

Comorbidity	Emergency	Not emergency
Myocardial Infarction	2707 (3.69%)	1077 (1.38%)
Congestive Heart Failure	12339 (16.84%)	5628 (7.19%)
Periphral Vascular Disease	3798 (5.18%)	3042 (3.89%)
Cerebrovascular Disease	5329 (7.27%)	2748 (3.51%)
Dementia	2175 (2.97%)	728 (0.93%)
Chronic Pulmonary Disease	11989 (16.36%)	6762 (8.64%)
Connective Tissue Disease-Rheumatic Disease	1527 (2.08%)	1131 (1.44%)
Peptic Ulcer Disease	1044 (1.42%)	473 (0.60%)
Mild Liver Disease	2030 (2.77%)	1011 (1.29%)
Diabetes without complications	14399 (19.65%)	9125 (11.66%)
Diabetes with complications	2719 (3.71%)	1449 (1.85%)
Paraplegia and Hemiplegia	1386 (1.89%)	852 (1.09%)
Renal Disease	9322 (12.72%)	4604 (5.88%)
Cancer	2724 (3.72%)	3496 (4.47%)
Moderate or Severe Liver Disease	893 (1.22%)	352 (0.45%)
Metastatic Carcinoma	2100 (2.87%)	1663 (2.12%)
HIV/AIDS	0 (0.00%)	0 (0.00%)

How to get help

Look at the help files for details and examples of almost every function in this package. There are several vignettes showing the main features (See list with vignette(package = "icd")):

Introduction vignette("introduction", package = "icd")
Charlson scores vignette("charlson-scores", package = "icd")
Examples using ICD-10 codes vignette("ICD-10", package = "icd")
CMS Hierarchical Condition Codes (HCC) vignette("CMS-HCC", package = "icd")
Pediatric Complex Chronic Conditions (PCCC) vignette("PCCC", package = "icd")
Working with ICD code ranges vignette("ranges", package = "icd")
Comparing comorbidity maps vignette("compare-maps", package = "icd")
Paper detailing efficient matrix method of comorbidities vignette("efficiency", package = "icd")

Many users have emailed me directly for help, and I’ll do what I can, but it is often better to examine or add to the list of issues so we can help each other. Advanced users may look at the source code, particularly the extensive test suite which exercises all the key functions.

?comorbid
?comorbid_hcc
?explain_code
?is_valid

ICD-9 codes

ICD-9 codes are still in heavy use around the world, particularly in the USA where the ICD-9-CM (Clinical Modification) was in widespread use until the end of 2015. ICD-10 has been used worldwide for reporting cause of death for more than a decade, and ICD-11 is due to be released in 2019. ICD-10-CM is now the primary coding scheme for US hospital admission and discharge diagnoses used for regulatory purposes and billing. A vast amount of electronic patient data is recorded with ICD-9 codes of some kind: this package enables their use in R alongside ICD-10.

ICD-9 codes are not numbers, and great care is needed when matching individual codes and ranges of codes. It is easy to make mistakes, hence the need for this package. ICD-9 codes can be presented in short 5 character format, or decimal format, with a decimal place separating the code into two groups. There are also codes beginning with V and E which have different validation rules. Zeroes after a decimal place are meaningful, so numeric ICD-9 codes cannot be used in most cases. In addition, most clinical databases contain invalid codes, and even decimal and non-decimal format codes in different places. This package primarily deals with ICD-9-CM (Clinical Modification) codes, but should be applicable or easily extendable to the original WHO ICD-9 system.
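To make the point about numeric codes concrete, here is a minimal base R sketch. It does not use any function from this package; the four example codes are chosen only to illustrate trailing zeroes and a V code.

```r
# Decimal-format ICD-9 codes stored as character strings.
codes <- c("250.0", "250.00", "003.20", "V10.09")

# Coercing to numeric discards leading and trailing zeroes, so distinct
# codes collapse to the same value, and the V code becomes NA.
as_number <- suppressWarnings(as.numeric(codes))
print(as_number)

# Converting back to character does not recover the original codes.
round_trip <- as.character(as_number)
print(data.frame(original = codes, round_trip = round_trip))

# Count the distinct values before and after coercion.
n_codes <- length(unique(codes))
n_numbers <- length(unique(as_number))
print(c(character = n_codes, numeric = n_numbers))
```

Keeping codes as character vectors (or factors) from the moment they are read avoids this loss of information.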

ICD-10 codes

ICD-10 has a somewhat simpler format, with consistent use of a letter, then two alphanumeric characters. However, especially for ICD-10-CM, there are a multitude of qualifiers, e.g. specifying recurrence, laterality, which vastly increase the number of possible codes. This package recognizes validity of codes by syntax alone, or whether the codes appear in a canonical list. There is not yet the capability of converting between ICD-9 and ICD-10, but comorbidities can be generated from older ICD-9 codes and newer ICD-10 codes in parallel, and the comorbidities can then be compared.


icd's Issues

source data for ICD-9 code to human-readable does not contain high-level descriptions

The data from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html is limited to the most detailed codes for each condition, and does not include the higher level classification. E.g. 053 herpes is not included in these data, but all the specific types are: 0530 0531 0531[0-4] 0532 0537 0538 and 0539.

The canonical ICD-9 description with the high-level codes included seems to be in rich text format at http://www.cdc.gov/nchs/icd/icd9cm.htm . Resolution of this issue will entail parsing out the 'major' part ICD-9 code level, e.g. 053 herpes. In addition, there are even higher level groupings, e.g. INTESTINAL INFECTIOUS DISEASES (001-009). These should also be extracted and be available in the ICD-9 code to text mapping.

E-code ranges

Because of their different rules, and no imminent use-cases, parsing ranges of E codes has not been implemented.

This should not be a performance problem, because ranges should only be expanded once, even if processing a large data set. Use of memoise may be helpful if many ranges need to be processed for some reason.

AHRQ SAS code parses HIV/AIDS code slightly incorrectly.

"042 "-"0449 " = "AIDS" /* HIV and AIDS */

This omits "044", but should not because the specified range would include all children of 044 also.

use public Vermont data for testing

http://healthvermont.gov/research/hospital-utilization/RECENT_PU_FILES.aspx

allow use of alternative canonical lists of ICD codes

This would include ICD-10, and any of the numerous national variations of ICD-9, ICD-10 etc., and indeed any other coding system. Currently, the use of ICD-9-CM (which has been unchanged for a few years) is hard-coded.

calculate Charlson score

This will be very straightforward once I have eliminated the double counting of the three mild and severe disease pairs.

allow integer values for 'short' form codes

Floating point values lead to incorrect 'long' or decimal format codes, but 'short' form codes are not ambiguous, since we know that up to the first three characters are always the major part. They are not in numerical sequence, but can be uniquely represented by integers.

We should therefore allow character and integer 'short' form codes, but disallow non-character decimal codes.

For output, it would be good to try to return the same type as given as input, but won't guarantee this for now. Natural sorting of integer and character short codes is different, so if there is a problem here, will revert to character.

icd9CharlsonComorbid miscalculates Charlson scores

An inadvertent matrix multiplication results in inaccurate Charlson scores

flesh out tests

There are already a lot of tests, but coverage could be expanded. My general approach is to over-test, even if (I think) I'm exercising an already tested code-path. This enables the code path underneath to change and the test to become more relevant, and sometimes my assumption about the code path is not correct, and the test is already effective.

More tests for multiple inputs to many functions. Mostly I've hammered out the single-value inputs.
ensure consistency in whether to accept zero length input, numeric vs character input, NA input.

incorporate annual changes to the ICD-9 specifications

This is a thorny issue. There have been small updates to ICD-9 each year, until recently.

Furthermore, AHRQ have updated their version of Elixhauser ICD-9 to co-morbidities annually.

A thorough implementation would optionally accept year or date with every ICD-9 code, and treat it appropriately.

include icd-9 cm procedure codes

This would be pretty straightforward. Always four digits. Mapping to text available in http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html

manually defined long-form mapping gives different results to short-form

parse SAS file for Dr. Quan's update of Elixhauser

Received files and permission to redistribute from Dr. Hude Quan.

validate using age

neonatal, preterm, and infant only codes could be validated against age. If age only available in years, some validation could still be done.

geriatric only codes could also be checked, although less clear what age to cut-off. Validation could at least warn, not fail, for apparent errors.

implement S3 classes to encapsulate data.frame of codes, matrices of comorbidities

as S3 classes are so lightweight, could easily add an attribute to label data as being short or decimal format. Another label might be the comorbidity mapping.

This would simplify code (drop all the isShort function arguments and drop many of the trivial triple functions used to dispatch on short vs long type.) It would also simplify the processing chain commands:

eg:

myPatients %>% icd9MarkShort() %>% icd9ShortToDecimal() %>% 
  icd9ComorbiditiesAhrq() %>% icd9CharlsonScore()

We then don't need to remember and reassert whether the data are short or decimal, so icd9ComorbiditiesAhrq determines whether the data are short or decimal without needing an argument.
Charlson score should fail because the function will recognize that the AHRQ comorbidities cannot be used, since the comorbidities matrix/data.frame will have an AHRQ attribute.

Other useful attributes might be the field names for the visitId, POA, etc.
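The attribute part of this idea could look roughly like the following base R sketch. The helper names `label_mapping` and `require_mapping` are hypothetical and are not functions in this package; the two-column matrix is toy data.

```r
# Hypothetical helper: record which comorbidity mapping produced a matrix.
label_mapping <- function(x, mapping) {
  attr(x, "mapping") <- mapping
  x
}

# Hypothetical helper: stop early if the matrix came from the wrong mapping.
require_mapping <- function(x, expected) {
  actual <- attr(x, "mapping")
  if (is.null(actual) || !identical(actual, expected)) {
    stop("expected comorbidities from the '", expected, "' mapping")
  }
  invisible(x)
}

# Toy AHRQ-style comorbidity matrix for a single visit.
ahrq <- label_mapping(
  matrix(c(TRUE, FALSE), nrow = 1,
         dimnames = list("1000", c("CHF", "Valvular"))),
  "ahrq"
)

# A Charlson-only step would reject it without the caller passing any flag.
result <- tryCatch(
  {
    require_mapping(ahrq, "charlson")
    "accepted"
  },
  error = function(e) "rejected"
)
print(result)
```

A full S3 design would add a class and methods on top of this; the sketch only shows that the label can travel with the data.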

Code outside of code block in vignette

In the vignette at line 121 (https://github.com/jackwasey/icd9/blob/master/vignettes/icd9.Rnw#L121) there is code outside of any knitr code block. It looks like leftover code from the rangeanomaly block rather than the noexplain block, but I am not sure (which is why I'm not attaching a proposed fix).

create ICD-9 code ranges, optionally including parents which have broader scope.

include parent codes even if not all child codes are to be included in the range requested.

E.g. "V10.09" to "V10.11" would not normally include "V10.1"

This may be useful, since people use ICD-9 ranges in many publications, and seem to have slightly different meanings, including the possibility described above. The current implementation is the most conservative in not including codes which could imply additional codes than those specified in a range.

enable direct use of wide format Present-on-Arrival

Wide format and long format ICD codes are handled fine, but currently filtering by POA is only done with long data. The user can convert to long format to do this, but eventually it would be good to provide a wide format POA matrix or data frame alongside a wide format ICD code matrix or data frame. The expectation would be that visitIds and position of POA flag against a particular code matched exactly, and this would normally be the case if the data originated in a database table with, e.g. 30 fields for ICD code, and another 30 alongside for POA status.

explanation different via icd9GetReal

quanDeyoComorbid[["Dementia"]] %>% icd9GetReal %>% icd9Explain
differs from
quanDeyoComorbid[["Dementia"]] %>% icd9Explain

include SAS source code from Dr. Hude Quan

Permission kindly given for redistribution.

use ordered factor for ICD-9-CM codes

Could this enable comparisons and ranges using standard R without having a load of custom code to maintain? I could then just define the master order, and let R do the work.
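As a rough illustration of the question, base R ordered factors do support comparison and sorting once a master order is defined. The `master_order` below is a hypothetical, heavily truncated list made up for this sketch; a real one would need every code in its canonical position.

```r
# Hypothetical master order for a handful of decimal-format codes.
master_order <- c("V10.09", "V10.1", "V10.11", "V10.12", "V10.2")

# Codes become an ordered factor with the master order as its levels.
codes <- factor(c("V10.11", "V10.09", "V10.2", "V10.1"),
                levels = master_order, ordered = TRUE)

# Sorting follows the master order, not alphabetical or numeric order.
sorted <- sort(codes)
print(as.character(sorted))

# Range membership uses the ordinary comparison operators.
in_range <- codes >= "V10.09" & codes <= "V10.11"
print(in_range)
```

Whether a parent such as "V10.1" belongs inside the range is then decided entirely by where it sits in the master order.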

convert between wide and long ICD data

A good way of working with ICD-9 codes is in long format, so that each patient can have unlimited codes. Most EHR systems limit to 15 or 30 because they pre-allocate database table columns. This package should provide functions to convert between the two structures.

This is more complicated than just using reshape or reshape2...
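For orientation, a naive base R version of the wide-to-long step is sketched below. The function name `wide_to_long_sketch`, the column names, and the data are assumptions for illustration; it ignores the harder cases (factors, POA flags, code ordering guarantees) that make the real problem more complicated.

```r
# Toy wide data: one row per visit, codes spread across columns.
wide <- data.frame(
  visit_id = c("PT123", "PT789"),
  icd_01 = c("4411", "25000"),
  icd_02 = c("V1001", NA),
  icd_03 = c("E8012", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the code columns and drop empty cells.
wide_to_long_sketch <- function(x, id_col = "visit_id") {
  code_cols <- setdiff(names(x), id_col)
  long <- data.frame(
    visit_id = rep(x[[id_col]], times = length(code_cols)),
    icd9 = unlist(x[code_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  # Remove padding: NA or empty strings where a visit had fewer codes.
  long <- long[!is.na(long$icd9) & nzchar(long$icd9), ]
  long <- long[order(long$visit_id), ]
  rownames(long) <- NULL
  long
}

long <- wide_to_long_sketch(wide)
print(long)
```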

deal with factors consistently and correctly

Currently this is an under-tested area. Many key functions will probably fail because they expect character vectors. It would be reasonable to always convert to strings, although it would be nice to give back a factor if the user gave one in the first place. This would be difficult in the C++ code, which is typed for string vectors throughout. A painful workaround could be R wrappers to convert to and from.

Simplest initial approach is simply to convert all the factors to character vectors.

filter whole data frames for icd9 validity or existence

in order to be more magrittr friendly, and to help easily pull out rows with valid or invalid icd9 codes. Complements similar functions for simple vectors of icd9 codes.

# get rows with invalid icd9 codes
myPatients %>% icd9FilterInvalid()
# same again, but convert to vector and find distinct invalid codes
myPatients %>% icd9FilterInvalid() %>% extract2("icd9") %>% unique
# show top few rows with valid codes, with named icd9 field:
myPatients %>% icd9FilterValid(icd9Field = "i9code", isShort = TRUE) %>% head
# get top few valid rows and show human readable names of the codes:
myPatients %>% icd9FilterValid() %>% extract2("icd9") %>% icd9Explain()

Validity vs existence in the master list of codes (which may be the wrong year, or incomplete..)

myPatients %>% icd9FilterExists()

visualize comorbidities in image map

image(matrix) does most of the work.

This, combined with some kind of sorting, e.g. by most frequent on the left, could be helpful.
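A minimal sketch of the sorting-plus-image idea, using only base R graphics. The matrix is toy data and `plot_comorbidity_map` is a hypothetical name, not a function in this package.

```r
# Toy logical comorbidity matrix: visits in rows, comorbidities in columns.
cmb <- matrix(
  c(FALSE, TRUE, TRUE,
    FALSE, TRUE, TRUE,
    TRUE,  TRUE, FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1000", "1001", "1002"), c("MI", "CHF", "DM"))
)

# Order the columns so the most frequent comorbidity comes first (left).
prevalence <- colSums(cmb)
sorted <- cmb[, order(prevalence, decreasing = TRUE), drop = FALSE]
print(colnames(sorted))

# Hypothetical plotting helper: image() puts rows of z on the x axis,
# so transpose to get comorbidities along x and visits along y.
plot_comorbidity_map <- function(m) {
  z <- t(m) + 0
  image(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z,
        axes = FALSE, xlab = "Comorbidity", ylab = "Visit")
  axis(1, at = seq_len(nrow(z)), labels = rownames(z))
  axis(2, at = seq_len(ncol(z)), labels = colnames(z))
}

# Only draw when a person is looking at the session.
if (interactive()) plot_comorbidity_map(sorted)
```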

Fix R CMD check problems

When I run R CMD check on icd9 (as part of the release process for devtools), I see:

checking R code for possible problems ... NOTE
icd9Benchmark: no visible global function definition for
  ‘microbenchmark’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbid’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesAbbrev’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNames’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesHtnAbbrev’
icd9ComorbiditiesAhrq: no visible binding for global variable
  ‘ahrqComorbidNamesHtn’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbid’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesAbbrev’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNames’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtnAbbrev’
icd9ComorbiditiesElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtn’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘quanDeyoComorbid’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘charlsonComorbidNamesAbbrev’
icd9ComorbiditiesQuanDeyo: no visible binding for global variable
  ‘charlsonComorbidNames’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbid’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesAbbrev’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNames’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtnAbbrev’
icd9ComorbiditiesQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtn’
icd9CondenseToExplain: no visible binding for global variable
  ‘icd9Hierarchy’
icd9CondenseToExplain: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9Explain.character: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9Explain.character: no visible binding for global variable
  ‘icd9Hierarchy’
icd9GetChapters: no visible binding for global variable ‘icd9Chapters’
icd9GetChapters: no visible binding for global variable
  ‘icd9ChaptersSub’
icd9GetChapters: no visible binding for global variable
  ‘icd9ChaptersMajor’
icd9RealShort: no visible binding for global variable ‘icd9Hierarchy’
parseAhrqSas: no visible binding for global variable
  ‘ahrqComorbidNamesHtnAbbrev’
parseElixhauser: no visible binding for global variable
  ‘elixhauserComorbidNamesHtnAbbrev’
parseQuanDeyoSas: no visible binding for global variable
  ‘charlsonComorbidNamesAbbrev’
parseQuanElixhauser: no visible binding for global variable
  ‘quanElixhauserComorbidNamesHtnAbbrev’

This usually means that you've forgotten to import a package or two

icd9Charlson with return.df=TRUE yields error in data.frame(): arguments imply differing number of rows

Using RStudio Version 0.98.1091, R version 3.1.2 (2014-10-31) on Windows 7 x64

Tried every combination of Character v. Factor for input data.frame columns and stringsAsFactors being true or false, but always get the same result
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 197, 11

In this case, 197 is my total number of ICD-9 codes and 11 is the number of distinct visits.

Easy to workaround by using return.df=FALSE and merging manually.

Thanks for this very useful package

undo workaround for Rcpp bug

The major/minor function argument limitation is now gone in version 0.11.4 of Rcpp, so the workaround can be removed, provided we depend on that version of Rcpp.

prefer setequal to all(x %in% y) in tests

easily test sets in both directions, instead of what the tests often contain, which only checks one direction.
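The difference is easy to see with two small vectors (toy values, base R only): a one-directional `%in%` check passes even when the result contains an unexpected extra code.

```r
expected <- c("0530", "0531", "0532")
actual <- c("0530", "0531", "0532", "0539")  # contains one extra code

# One direction only: every expected code is present, so this passes.
one_way <- all(expected %in% actual)

# Both directions: the extra code makes the sets unequal.
both_ways <- setequal(expected, actual)

print(c(one_way = one_way, both_ways = both_ways))
```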

vignette should cover all public exported functions and data

icd9ShortToMajor gives incorrect E major parts

entered in error

allow implicit visitId with ragged lists of icd9 codes

As pointed out by @gforge, the current code relies on a visitId per row, and one row per ICD-9 code. This is the primary structure of the data I have been using.

An alternative layout is one row per visit, (with or without ID field), and then multiple ICD-9 codes listed across the columns. This would be presented as a list of lists, or data frame with missing blank values when there were fewer than the maximum number of ICD-9 codes per patients. The data I am using caps at 30 codes per visit.

I've already written the code for this in C++, but it needs testing.

vet code for magrittr friendliness

Most of the functions already take the data as the first input. There is at least one exception (icd9PartsToShort / icd9PartsToDecimal). Initial thought is to drop support for providing two vectors, major and minor, and just accept a data.frame.

clean up CRAN documentation

Heading and TOC didn't appear in CRAN Rmd vignette.
Old vignette from Rnw did appear on CRAN.

Handle combinations of mild and severe illness when using a co-morbidity mapping

For my own use cases, I am primarily interested in using all the groupings defined in the mappings, however, for calculation of Charlson score, and counting number of co-morbidities, a few operations need to be performed, such as removing Mild Liver Failure if Severe Liver Failure is present.
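One way this adjustment might look on a logical comorbidity matrix is sketched below in base R. The function name `apply_hierarchy_sketch` is hypothetical, the column names follow the abbreviated Charlson names shown in the README output, and the choice of the three mild/severe pairs is an assumption made for this example.

```r
# Hypothetical helper: clear the milder flag when the severe one is set.
apply_hierarchy_sketch <- function(cmb) {
  # Assumed pairs, written as c(mild, severe).
  pairs <- list(
    c("LiverMild", "LiverSevere"),
    c("DM", "DMcx"),
    c("Cancer", "Mets")
  )
  for (p in pairs) {
    has_severe <- cmb[, p[2]]
    cmb[has_severe, p[1]] <- FALSE
  }
  cmb
}

# Toy data: visit 1000 has both liver flags, visit 1001 has both cancer flags.
cmb <- matrix(
  FALSE, nrow = 2, ncol = 6,
  dimnames = list(c("1000", "1001"),
                  c("LiverMild", "LiverSevere", "DM", "DMcx", "Cancer", "Mets"))
)
cmb["1000", c("LiverMild", "LiverSevere", "DM")] <- TRUE
cmb["1001", c("Cancer", "Mets")] <- TRUE

adjusted <- apply_hierarchy_sketch(cmb)
print(adjusted)
```

The unadjusted matrix is left untouched, so callers who want every grouping can still use it.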

other disease classifications

AHRQ not only provides the Elixhauser-based co-morbidity mapping, but also finer grained disease groups.
https://www.hcup-us.ahrq.gov/toolssoftware/ccs/AppendixASingleDX.txt

"The Clinical Classifications Software (CCS) for ICD-9-CM is a diagnosis and procedure categorization scheme that can be employed in many types of projects analyzing data on diagnoses and procedures. CCS is based on the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM), a uniform and standardized coding system. The ICD-9-CM's multitude of codes - over 14,000 diagnosis codes and 3,900 procedure codes - are collapsed into a smaller number of clinically meaningful categories that are sometimes more useful for presenting descriptive statistics than are individual ICD-9-CM codes."

The following link has a zip with the best formatted computer-readable csv files:
http://www.hcup-us.ahrq.gov/toolssoftware/ccs/Single_Level_CCS_2015.zip

Of note, code 238 includes about 100 codes which should be purely in-hospital, not POA.

possible additional internal validation for sets of codes

There is an implication that some codes are mutually exclusive. E.g.
http://www.icd9data.com/2012/Volume1/390-459/430-438/436/436.htm

E codes < 800

There are now E codes < 800:
E001-E030 Activity

Currently validation explicitly requires E codes >800.

validate using gender

prostate, gynecologic, obstetric codes can be checked.
will have to assume gender is biologic.

audit co-morbidity mapping changes from year to year

Given annual updates in both ICD-9 codes and the AHRQ co-morbidity mapping, the package should be able both to give co-morbidities as they would have been defined at a given date (see #6) and to show the differences.

Does a researcher want to apply the contemporary mapping of a set of ICD-9 codes, or a single mapping to all codes? It probably makes little difference, but it should be fairly easy to apply arbitrary co-morbidity mappings to a set of ICD-9 codings, and compare the co-morbidity flags.

Include MS-DRG so morbidities can be distinguished from co-morbidities

This data is publicly available, and compact. Main use would be to allow correct application of co-morbidity mappings, since e.g. Elixhauser specifies specific DRG exclusions, otherwise the patient has a morbidity, not a co-morbidity.

Multiple versions, now migrating to ICD-10-CM:
http://www.cms.hhs.gov/AcuteInpatientPPS/downloads/FY_2010_FR_Table_5.zip

manually enter ICD-9 to co-morbidities for mappings derived from SAS code, and ensure they are the same

Although it is pleasing to have a direct conversion to R code from the SAS code which an author used to generate their data, I'm currently only testing a subset of the converted items. It wouldn't be too time consuming to manually enter the whole mapping as a double check.

calculate sums of comorbidities

This metric is sometimes used, with well known limitations (e.g. very sick people don't get minor diseases coded).

Will be very easy to count across the co-morbidity matrix to return these values.
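With a logical comorbidity matrix (visits in rows, comorbidities in columns), the count is a single base R call. The matrix below is toy data.

```r
# Toy logical comorbidity matrix.
cmb <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, FALSE, TRUE,
    FALSE, FALSE, FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1000", "1001", "1002"), c("CHF", "DM", "Renal"))
)

# TRUE counts as 1, so row sums give the number of comorbidities per visit.
counts <- rowSums(cmb)
print(counts)
```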

guess short vs decimal form of ICD-9 code

Accept a single value or list, and determine whether it is in decimal or short form. If the forms are mixed, or unclear from a (possibly random) sample, then warn or raise an error.
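A simple heuristic along these lines is sketched below in base R. The function name `guess_format_sketch` is hypothetical, and the rule it applies is an assumption: a decimal point is evidence of decimal form, while a code with no point that is longer than its major part (three characters, or four for E codes) is evidence of short form.

```r
# Hypothetical helper: guess whether ICD-9 codes are short or decimal form.
guess_format_sketch <- function(codes) {
  codes <- trimws(as.character(codes))
  codes <- codes[!is.na(codes) & nzchar(codes)]
  has_point <- grepl(".", codes, fixed = TRUE)
  # Assumed major lengths: 4 characters for E codes, 3 otherwise.
  major_len <- ifelse(grepl("^[Ee]", codes), 4L, 3L)
  looks_short <- !has_point & nchar(codes) > major_len
  if (any(has_point) && any(looks_short)) {
    warning("mixture of decimal and short form codes")
    return("mixed")
  }
  if (any(has_point)) return("decimal")
  if (any(looks_short)) return("short")
  # Only bare major parts were seen, which are valid in either form.
  "unclear"
}

print(guess_format_sketch(c("250.00", "401.1")))
print(guess_format_sketch(c("25000", "4011")))
print(guess_format_sketch(c("250", "401")))
print(suppressWarnings(guess_format_sketch(c("250.00", "4011"))))
```

Sampling a large input before calling such a function would keep the check cheap, at the cost of possibly missing a rare mixed entry.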

diff co-morbidities

There are several ICD-9 to co-morbidity mappings, mostly based on either Elixhauser or Charlson. E.g. AHRQ, Deyo, Quan.

The user may have their own mapping structures.

The package should provide a function to find differences between Elixhauser-based or Charlson-based co-morbidities. There are some overlapping groups between Elixhauser and Charlson which could also be compared.
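If mappings are held as named lists of character vectors, a comparison could be sketched in base R as follows. The function name `diff_maps_sketch` is hypothetical and the two maps are toy data, not the real published mappings.

```r
# Hypothetical helper: for each comorbidity present in both maps, list the
# codes found in only one of them.
diff_maps_sketch <- function(map_a, map_b) {
  shared <- intersect(names(map_a), names(map_b))
  out <- lapply(shared, function(nm) {
    list(
      only_a = setdiff(map_a[[nm]], map_b[[nm]]),
      only_b = setdiff(map_b[[nm]], map_a[[nm]])
    )
  })
  names(out) <- shared
  out
}

# Toy maps: short-form codes grouped by comorbidity name.
map_a <- list(CHF = c("4280", "4281", "40201"), Renal = c("585", "586"))
map_b <- list(CHF = c("4280", "4281", "4289"), Renal = c("585", "586"),
              HIV = c("042"))

d <- diff_maps_sketch(map_a, map_b)
str(d)
```

Comorbidities that appear in only one map (HIV here) are skipped by this sketch; reporting them separately would be a natural extension.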

succinctly explaining a set of ICD-9 codes

icd9CondenseToExplainShort has two problems:

works in principle, but doesn't account for issue #2
doesn't account for the fact that not all possible child nodes exist. The fix will involve only finding codable children using the icd-9 to text look-up.

allow "X" to terminate a code <5 characters long

This is technically allowed, although I've not seen it, yet. I'm currently just using shorter codes to represent this. E.g. "100" could also be "100XX"

In contrast, decimal format code "10.0" would be 0100 in short form, or 0100X with the terminating X notation.

Add tests.
May be better to just work the X processing into the existing functions, rather than having additional internal conversion functions? My initial thought is that just dropping the X will give a valid code with identical meaning, and that would be easy to code.
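Dropping the terminating X characters from short-form ICD-9 codes can be sketched in one line of base R. The function name `strip_x_sketch` is hypothetical, and this is only intended for ICD-9, where X has no other meaning at the end of a code.

```r
# Hypothetical helper: remove one or more trailing X characters.
strip_x_sketch <- function(codes) {
  sub("[Xx]+$", "", codes)
}

stripped <- strip_x_sketch(c("100XX", "0100X", "4011"))
print(stripped)
```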

CMS HCC risk adjustment models

http://www.cms.gov/Medicare/Health-Plans/MedicareAdvtgSpecRateStats/Risk-Adjustors.html

assigns ICD-9 codes to HCC codes, but also needs age and gender inputs.
many:many mapping, but should still be able to map from set of ICD-9 codes for an individual to a set of HCC codes (which could be considered comorbidities).

explain single decimal categories

Currently, the 'major' 3 digit category, chapters and sub-chapters can be given for a code, but single decimal minor categories which are not themselves billable are currently not labelled. E.g. 003.2

calculate comorbidities directly from wide format data frame

Although it limits number of comorbidities per visit/patient, many people have 'wide' format ICD9 data, e.g.
visitId ICD_01 ICD_02 ICD_03 ...
PT123 4411 V1001 E8012 ...
PT789
...

Now that the allocation of comorbidities itself is so fast, the slowest step is setting up the vector of vectors of int values containing the icd9 codes, primarily because we need to search the list of visitIds as we progress to check for duplicates, although there is an optimization for the case of the visitId being the same as the previous one. The initial 'wide' structure could be coded more quickly to do this, and wouldn't (necessarily) require checking for duplicated visit IDs. We would still pay the price of converting these to factors, if they are not already. Using factors for these codes makes a lot of sense, since there are many duplicates.

The factor levels for all the columns wouldn't necessarily be the same; in fact the ICD_29 column, for example, is likely to be relatively unpopulated, and probably has far fewer levels. The factor levels would have to be made consistent across the ICD columns (and consistent with the mapping, but I do this anyway, reducing the mappings to only those codes I am actually going to assign).

The current way of doing this would be to call icd9WideToLong then icd9Comorbid. It would be better to have icd9ComorbidFromWide which would take a data frame and some information about how the columns are named. People are much less likely to have this in matrix format, so don't cover this case.