Wide format and long format ICD codes are handled fine, but currently filtering by POA

enable direct use of wide format Present-on-Arrival about icd HOT 7 CLOSED

jackwasey commented on May 26, 2024

enable direct use of wide format Present-on-Arrival

from icd.

Comments (7)

anobel commented on May 26, 2024

Hi Jack;
I just worked through this issue while handling POA diagnoses. I've pasted my code below in case anyone who comes across it finds it useful.

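library(tidyr)  # gather(), extract()
library(dplyr)  # %>%, arrange(), left_join(), filter(), select()

# Vectors naming the diagnosis and POA columns
diags <- c("diag_p", paste0("odiag", 1:24))
odiags <- paste0("odiag", 1:24)
opoas <- paste0("opoa", 1:24)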

# Calculate the total number of listed ICD9 diagnoses per patient
pt$totaldiags <- apply(pt[,diags], 1, function(x) sum(!is.na(x)))

#####
# Assign Elixhauser Comorbidity 

# Subset the "Other Diagnoses" (everything except the principal diagnosis), and their corresponding POA fields
elix <- pt[,c(odiags, opoas)]

# Convert factors to characters, combine with visitIds 
elix <- as.data.frame(lapply(elix, as.character), stringsAsFactors = FALSE)
elix <- cbind(visitId = pt$visitId, elix)

# Need to drop all "Other Diagnoses" that were NOT Present on Admission
# Convert from wide to long format, identify and drop all diagnoses that were not present on admission
# Then add principal diagnosis and calculate Elixhauser before merging with main data

# Convert wide to long: name the key column "var" directly and keep visitId as the identifier
elix <- gather(elix, var, value, -visitId)

# Split the odiag1-24 and opoa1-24 columns into two so that I can identify the number associated with opoa==no
elix <- elix %>%
  extract(var, c('diag', 'number'), 
          '([a-z]+)([0-9]+)') %>%
  arrange(visitId, number)

# Made a DF of just the visitId and numbers associated with diagnoses NOT Present on Admission
temp <- elix[elix$value=="No",c("visitId","number")]

# drop NAs
temp <- temp[!is.na(temp$number),]
# Assign a flag for drops
temp$drop <- TRUE

# Merge working list of ICD9s with temp DF to identify rows to drop, drop them, and simplify DF, prep for icd9ComorbidElix()
elix <- elix %>%
  left_join(temp) %>%
  filter(is.na(drop)) %>%
  filter(diag=="odiag") %>%
  select(visitId, icd9=value) %>%
  filter(!is.na(icd9))

# diag_p: bring in primary diagnoses
diag_p <- pt[,c("visitId", "diag_p")]
colnames(diag_p) <- c("visitId", "icd9")
elix <- rbind(elix, diag_p)

# based on ICD9s for each patient/admission, excluding ICDs NOT Present on Admission,
# make matrix of all 30 Elixhauser categories (T/F)
elix <- as.data.frame(icd9ComorbidElix(elix, visitId="visitId", icd9Field="icd9"))

# add visitId as index and drop rownames
elix$visitId <- rownames(elix)
row.names(elix) <- NULL

# Sum the total number of positive Elixhauser categories per patient, add at end of DF
elix <- cbind(elix, elixsum = rowSums(elix[-length(elix)]))

# Merge with main data
pt <- left_join(pt, elix)


jackwasey commented on May 26, 2024

Thanks so much for your contribution, @anobel. Looks like you're using sqldf and tidyr? I've scanned the code, but will need a bit more time to understand it. My thinking was that, if the diagnoses and POA data are in wide format, they will likely all be in the same row, each row representing a single hospital admission. Logic (e.g. POA == "N") could then be applied to the POA matrix, and the resulting logical matrix used to mask the corresponding entries in or out of the diagnoses matrix.
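
A minimal base R sketch of the masking idea described above. The toy data frame, its column names, and the POA value "No" are assumptions borrowed from the code earlier in this thread, and `mask_poa` is a hypothetical helper, not an icd function.

```r
# Toy wide-format data: one row per admission (hypothetical values)
pt_toy <- data.frame(
  visitId = c("v1", "v2"),
  odiag1  = c("4280", "25000"),
  odiag2  = c("5849", NA),
  opoa1   = c("Yes", "No"),
  opoa2   = c("No", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blank out diagnoses flagged as not present on admission
mask_poa <- function(df, diag_cols, poa_cols, not_present = "No") {
  diag_mat <- as.matrix(df[, diag_cols, drop = FALSE])
  poa_mat  <- as.matrix(df[, poa_cols, drop = FALSE])
  # TRUE where the POA flag marks the diagnosis as not present on admission;
  # a missing POA flag leaves the diagnosis in place
  drop_mask <- !is.na(poa_mat) & poa_mat == not_present
  diag_mat[drop_mask] <- NA
  df[, diag_cols] <- diag_mat
  df
}

masked <- mask_poa(pt_toy, c("odiag1", "odiag2"), c("opoa1", "opoa2"))
```

The data stay in wide format throughout, so no reshaping is needed before the masked diagnoses are passed on to a wide-to-long conversion.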

I see your goal is to sum the total number of positive Elixhauser categories. I think this could be achieved more simply following the example of the Charlson and Van Walraven scores, but counting 1 for everything, instead of weighting.
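
A small sketch of the unweighted count suggested above, assuming the comorbidity result is a logical matrix with one row per visit and one column per category. The three category names and the values are made up for illustration.

```r
# Toy logical comorbidity matrix: rows are visits, columns are categories
comorbid <- matrix(
  c(TRUE,  FALSE, TRUE,
    FALSE, FALSE, FALSE),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("v1", "v2"), c("CHF", "Renal", "DM"))
)

# Unweighted score: every positive category counts as 1
elix_count <- unname(rowSums(comorbid))

# Keep the individual logical fields and append the count
result <- data.frame(visitId = rownames(comorbid), comorbid,
                     elixsum = elix_count, row.names = NULL)
```

This keeps every logical field alongside the sum, so nothing is lost relative to a weighted score.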

I like your use of (I think) tidyr for wide to long conversion. I wrote icd9WideToLong before tidyr existed, but found at the time that alternatives like dplyr were a bit cumbersome, and, as I know the structure of the input data, it was quicker to write (and faster to run) ICD-specific functions. The other thing is that the future ICD-10 code will optionally label the data as being ICD-9, ICD-10, ICD-10-CM, etc., and by using my own wide to long conversion, I can preserve this metadata.


anobel commented on May 26, 2024

Hi @jackwasey. I just used tidyr. The POA and DIAG fields are all on the same row, one row per hospitalization, so using a matrix could work. I was interested in summing the Elixhauser categories but also keeping all 30 logical fields.

I have run into some efficiency issues. I tested the code in my comment above on a sample of 1,000 rows, but the full data set I'm working with has 13 million rows, and performance made this approach impossible there: it was taking about 6-8 seconds per 1k rows, and 13 million hospitalizations x 50 Diag/POA fields led to ~630 million rows during reshaping.

I posted to StackOverflow and got some good feedback:
http://stackoverflow.com/questions/34230184/tidyr-wide-to-long-repeated-measures-and-efficiency


jackwasey commented on May 26, 2024

This is something I would like to optimize, and a general-purpose data manipulation tool will never be as good at it as custom code. I think it is probably a common data layout. Maybe tidyr or similar will end up being fast enough for your use cases. data.table seems to be the fastest general tool, but has a bizarre syntax. Did you try that?
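
For reference, a hedged sketch of what the data.table route might look like, using `melt()` with `patterns()` to reshape both column families in one pass. The toy data, the two fields per family (instead of 24), and the POA value "No" are assumptions for illustration; this was not benchmarked against the full data set discussed here.

```r
library(data.table)

# Toy wide-format data (hypothetical values)
dt <- data.table(
  visitId = c("v1", "v2"),
  odiag1 = c("4280", "25000"), odiag2 = c("5849", NA),
  opoa1  = c("Yes", "No"),     opoa2  = c("No", NA)
)

# Melt both column families at once; odiagN is paired with opoaN by position
long <- melt(
  dt,
  id.vars = "visitId",
  measure.vars = patterns("^odiag", "^opoa"),
  value.name = c("icd9", "poa"),
  variable.name = "number"
)

# Keep listed diagnoses that are not flagged "No"
kept <- long[!is.na(icd9) & (is.na(poa) | poa != "No"), .(visitId, icd9)]
```

Because each diagnosis lands on the same row as its POA flag, no regular expression or join is needed to match them up afterwards.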


anobel commented on May 26, 2024

I tried all the solutions people posted, and it turns out the step giving me the most trouble with speed was the regular expression identifying columns. Instead, I made a data frame of column names/numbers (as they were predictable) and merged it back with the core data. On my system this process took a few minutes.

I think it could be generalized in a function by taking diagnosis and poa field names, along with a number representing the number of fields as arguments.

library(tidyr)  # gather()
library(dplyr)  # %>%, left_join(), select(), rename(), filter()

# create vectors listing just the fields with diagnosis codes and POA flags
diags <- c("diag_p", paste0("odiag", 1:24))
odiags <- paste0("odiag", 1:24)
opoas <- paste0("opoa", 1:24)

# Calculate the total number of listed ICD9 diagnoses per patient
pt$totaldx <- apply(pt[,diags], 1, function(x) sum(!is.na(x)))

# Subset the "Other Diagnoses" (everything except the principal diagnosis), and their corresponding POA fields
elix <- pt[,c(odiags, opoas)]

# Convert factors to characters, combine with visitIds 
elix <- as.data.frame(lapply(elix, as.character), stringsAsFactors = FALSE)
elix <- cbind(visitId = pt$visitId, elix)

# Need to drop all "Other Diagnoses" that were NOT Present on Admission
# Convert from wide to long format, identify and drop all diagnoses that were not present on admission
# Then add principal diagnosis and calculate Elixhauser before merging with main data

# Convert wide to long: name the key column "var" directly and keep visitId as the identifier
elix <- gather(elix, var, value, -visitId, na.rm = TRUE)
elix$value <- factor(elix$value)

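# Build a lookup table mapping each column name (odiag1-24, opoa1-24) to its field type and number,
# so that the number associated with opoa==no can be identified without a regular expression
colsplit <- rbind(
  data.frame(var = paste0("odiag", 1:24), field = "odiag", number = 1:24, stringsAsFactors = FALSE),
  data.frame(var = paste0("opoa", 1:24), field = "opoa", number = 1:24, stringsAsFactors = FALSE)
)

# Join elix data with the lookup table, then replace the full column name with the field type
elix <- elix %>%
  left_join(colsplit, by = "var") %>%
  select(-var) %>%
  rename(var = field)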

rm(colsplit)

# Made a DF of just the visitId and numbers associated with diagnoses NOT Present on Admission
temp <- elix[elix$value=="No",c("visitId","number")]

# drop NAs
temp <- temp[!is.na(temp$number),]
# Assign a flag for drops
temp$drop <- TRUE

# Merge working list of ICD9s with temp DF to identify rows to drop, drop them, and simplify DF, prep for icd9ComorbidElix()
elix <- elix %>%
  left_join(temp) %>%
  filter(is.na(drop)) %>%
  filter(var=="odiag") %>%
  select(visitId, icd9=value) %>%
  filter(!is.na(icd9))

rm(temp)

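# diag_p: bring in primary diagnoses
# Note: load() restores the saved pt and overwrites the in-memory copy,
# so columns added above (e.g. totaldx) survive only if they were saved too
load(file="rao_workingdata/pt.rda")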
diag_p <- pt[,c("visitId", "diag_p")]
colnames(diag_p) <- c("visitId", "icd9")
elix <- rbind(elix, diag_p)
rm(diag_p)

# based on ICD9s for each patient/admission, excluding ICDs NOT Present on Admission,
# make matrix of all 30 Elixhauser categories (T/F)

elix <- as.data.frame(icd9ComorbidElix(elix, visitId="visitId", icd9Field="icd9"))

# add visitId as index and drop rownames
elix$visitId <- rownames(elix)
row.names(elix) <- NULL

# Sum the total number of positive Elixhauser categories per patient, add at end of DF
elix <- cbind(elix, elixsum = rowSums(elix[-length(elix)]))

# Merge with main data
pt <- left_join(pt, elix)

# Clean Up Environment
rm(elix, diags, odiags, opoas)
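
The generalization suggested at the top of this comment could be sketched roughly as follows, in base R. `poa_wide_to_long` and all of its argument names are hypothetical, not part of icd, and the toy data and the POA value "No" are assumptions carried over from the code above.

```r
# Hypothetical helper: reshape numbered diagnosis/POA columns to long format
# and drop diagnoses flagged as not present on admission.
poa_wide_to_long <- function(df, diag_prefix, poa_prefix, n,
                             visit_col = "visitId", not_present = "No") {
  pieces <- lapply(seq_len(n), function(i) {
    # Column names are built from prefix + index, so no regex is needed
    data.frame(
      visitId = df[[visit_col]],
      icd9 = as.character(df[[paste0(diag_prefix, i)]]),
      poa = as.character(df[[paste0(poa_prefix, i)]]),
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  # Keep listed diagnoses unless their POA flag equals not_present
  keep <- !is.na(long$icd9) & (is.na(long$poa) | long$poa != not_present)
  long[keep, c("visitId", "icd9")]
}

# Toy data with two diagnosis/POA pairs per admission (hypothetical values)
pt_toy <- data.frame(
  visitId = c("v1", "v2"),
  odiag1 = c("4280", "25000"), odiag2 = c("5849", NA),
  opoa1 = c("Yes", "No"), opoa2 = c("No", NA),
  stringsAsFactors = FALSE
)
long_toy <- poa_wide_to_long(pt_toy, "odiag", "opoa", n = 2)
```

The principal diagnosis has no POA field in this layout, so it would still be appended separately, as in the code above.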


jackwasey commented on May 26, 2024

This would be a nice thing to put in a vignette... Would you consider doing that? It would need you to generate some sample data, possibly based on the Vermont or uranium data I include in the package. I'll take the liberty of assigning this issue to you!


jackwasey commented on May 26, 2024

I'm going to put this one to rest: I still like the idea, but as we can see, it is possible to use existing R tools to reshape data, and so I think this is out of scope. Trying to keep an already fairly big package more tightly focused. Happy to re-open if someone wants to look at using wide_to_long as a template for a Present-On-Arrival version.

