Hello!
I am trying to reproduce some of the results from your paper. In particular, I would like to produce a plot like the one below, to find out which combinations of modalities justify a multimodal approach compared to a vision-only one.
![haim_roc](https://user-images.githubusercontent.com/98698362/230433663-516c04c4-fb6c-44d0-b7a9-0a3acd69029f.png)
For example, for fracture, the smallest dataset, I was able to get a 5-fold cross-validation test average macro AUROC of about 0.78 for the unimodal model (fusing per-image and multi-image dense visual embeddings). But when I add new (and less informative) modalities to it, the results stay almost the same (sometimes a bit better, sometimes a bit worse), perhaps because XGBoost handles the curse of dimensionality well. Since the number of combinations of input modalities is high (1,023), I only tested a subset, but I could not get close to 0.84 in average macro AUROC.
Could you please share supporting information about the plot above, such as which combination of modalities is considered typical?

I also have a question about the number of experiments performed in the article.
I understand how you got 1023 as the number of possible models for the pathology diagnosis tasks: 1023 = (number of models with 1 modality) + (number of models with 2 modalities) + (number of models with 3 modalities) + (number of models with 4 modalities).
Here, the number of models with 1 modality is calculated from the number of combinations of the corresponding sources:
Tabular: 1
Time series: C(3, 1) + C(3, 2) + C(3,3) = 3 + 3 + 1 = 7
Notes (excluding radiology): C(2,1) + C(2,2) = 2 + 1 = 3
Visual: C(4,1) + C(4,2) + C(4,3) + C(4,4) = 4 + 6 + 4 + 1 = 15
Total: 26
And so on, up to 4 modalities. I also get a total of 1023 experiments.
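For reference, here is how I reproduced the 1023 count (this is my own reconstruction of the arithmetic, not code from the paper; the source counts per modality are the ones listed above):

```python
from math import prod

# Number of sources per modality, as in my breakdown above
# (tabular: 1, time series: 3, notes excluding radiology: 2, visual: 4).
sources = {"tabular": 1, "time_series": 3, "notes": 2, "visual": 4}

# Any non-empty subset of a modality's sources is one configuration
# of that modality: 2^n - 1 such subsets.
configs = {m: 2**n - 1 for m, n in sources.items()}
# configs == {'tabular': 1, 'time_series': 7, 'notes': 3, 'visual': 15}

# Total models: each modality is either absent (+1) or in one of its
# configurations; subtract 1 for the case where all modalities are absent.
total = prod(c + 1 for c in configs.values()) - 1
# total == 1023
```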
However, I don't get the same number of experiments for the 48-hour length-of-stay and mortality prediction tasks, where the difference is that radiology notes are included.
Could you please explain how you arrived at 2047 (2046)?
Thank you!