wisdomikezogwo / quilt1m

[NeurIPS 2023 Oral] Quilt-1M: One Million Image-Text Pairs for Histopathology.

Home Page: https://quilt1m.github.io/

License: MIT License

Python 100.00%
clip-model histopathology medical-dataset multimodal-datasets vlm

quilt1m's Introduction

Quilt

Quilt-1M: One Million Image-Text Pairs for Histopathology [NeurIPS 2023] (Oral)

teaser

Abstract

Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has slowed similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 802,148 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, making it the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across 13 diverse patch-level datasets of 8 different sub-pathologies, as well as on cross-modal retrieval tasks.

News

  • 2023-03-03 Updated repository with links to models and data.
  • 2023-06-13 Initial code/data release.
  • 2023-06-25 Added model evaluation tips and some new data links.
  • 2023-08-15 Added restricted access to complete dataset.
  • 2023-09-21 QUILT-1M is accepted to NeurIPS 2023 [Oral] 🔥.
  • 2023-10-26 Corrected the sub-pathology column in the quilt_1M_lookup CSV file and added two new columns: 'single_wsi' (1 if the video covers only one WSI, 0 if more than one) and 'not_histology' (manual checks of videos, flagged 1 for content such as image projections, drawings, or material that is strictly not histology due to misclassified videos).
  • 2023-10-27 Updated the arXiv paper.
  • 2024-01-17 Released the Quilt-LLaVA paper and website: a Large Language and Vision Assistant for pathology, trained with spatially localized instruction-tuning data generated from educational YouTube videos, outperforming SOTA on various tasks. Models and data to be released soon.

Data (QUILT-1M) Restricted Access

Two versions of the data can be accessed after agreeing to terms that protect against further distribution of the dataset and commit users to its specified research use.

  • (Rescaled) On Zenodo you can access the dataset with all images resized to 512×512 px (36 GB).
  • (Full) To access the dataset with full-sized images via Google Drive, please request time-limited access through this Google form (110 GB).

Requirements

conda create --name quilt python=3.9 && conda activate quilt

Then install the requirements.
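For example, assuming the repository provides a requirements.txt at its root (an assumption, not confirmed above), the install step would look like:

pip install -r requirements.txt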

Data Reconstruction

To collect Quilt, follow these data steps.

Eval

To evaluate QuiltNet, follow these steps.

Pretrained Model

We provide checkpoints for all fine-tuned QuiltNet models.
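As a quick-start sketch (not an official snippet from this repository), the released checkpoints can typically be loaded through open_clip from the Hugging Face Hub; the B-32 identifier below is the one referenced elsewhere on this page:

import torch
import open_clip

# Load QuiltNet (ViT-B/32) and its tokenizer from the Hugging Face Hub.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:wisdomik/QuiltNet-B-32')
tokenizer = open_clip.get_tokenizer('hf-hub:wisdomik/QuiltNet-B-32')
model.eval()

# Encode a text prompt; image features are obtained analogously from images
# transformed with `preprocess`.
with torch.no_grad():
    text = tokenizer(['a histopathology slide showing adipose tissue'])
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # image_features = model.encode_image(preprocess(pil_image).unsqueeze(0))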

Testing

Visualization of inputs and output:

Citing Quilt-1M

@misc{ikezogwo2023quilt1m,
      title={Quilt-1M: One Million Image-Text Pairs for Histopathology}, 
      author={Wisdom Oluchi Ikezogwo and Mehmet Saygin Seyfioglu and Fatemeh Ghezloo and Dylan Stefan Chan Geva and Fatwir Sheikh Mohammed and Pavan Kumar Anand and Ranjay Krishna and Linda Shapiro},
      year={2023},
      eprint={2306.11207},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

This code borrows heavily from open-clip and the timm library. We also thank the contributors of merlot.

Maintenance

Please open a GitHub issue for any help. If you have any questions regarding the technical details, feel free to contact us.

License

The code and the pretrained models in this repository are released under the MIT license, as specified in the LICENSE file.


quilt1m's Issues

Downstream tasks setting

First, thanks for your impressive work for the medical VLP community!

From your paper, there are many downstream tasks in the benchmark used to evaluate the VLP model. Could you provide the pipeline or scripts to prepare the downstream datasets and run the evaluation?

Best Regards

Reproducing zero-shot classification results

Hi, thank you very much for this great work on image-text contrastive training for histopathology and also publishing a valuable dataset.

I used the provided pre-trained QuiltNet along with the given tokenizer to reproduce the zero-shot classification results on the NCT-CRC-HE-100K dataset, using the following commands:

import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:wisdomik/QuiltNet-B-32')
tokenizer = open_clip.get_tokenizer('hf-hub:wisdomik/QuiltNet-B-32')

I also used the class names and templates given in the paper, as follows:



nct_classnames = ["Adipose", "Debris", "Lymphocytes", "Mucus", "Smooth muscle", "Normal colon mucosa", "Cancer-associated stroma", "Colorectal adenocarcinoma epithelium"]


nct_template = [
    lambda c: f'a histopathology slide showing {c}.',
    lambda c: f'histopathology image of {c}.',
    lambda c: f'pathology tissue showing {c}.',
    lambda c: f'presence of {c} tissue on image.',
]
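For reference, here is roughly how I build the zero-shot classifier from these class names and templates, following the standard open_clip recipe (a sketch of my setup, not the authors' exact evaluation code):

import torch

model.eval()
with torch.no_grad():
    # One text embedding per class, averaged over the four prompt templates above.
    zeroshot_weights = []
    for classname in nct_classnames:
        texts = tokenizer([template(classname) for template in nct_template])
        class_embeddings = model.encode_text(texts)
        class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
        class_embedding = class_embeddings.mean(dim=0)
        class_embedding /= class_embedding.norm()
        zeroshot_weights.append(class_embedding)
    zeroshot_weights = torch.stack(zeroshot_weights, dim=1)  # (embed_dim, num_classes)

# For a batch of images transformed with preprocess_val:
# image_features = model.encode_image(images)
# image_features /= image_features.norm(dim=-1, keepdim=True)
# logits = 100.0 * image_features @ zeroshot_weights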

However, I get a top-1 accuracy lower than what is reported in the paper (59.56%):

zero shot metrics {'nct-zeroshot-val-top1': 0.28518236912136324, 'nct-zeroshot-val-top5': 0.7248697363418835}

I also tried training my own QuiltNet using the open_clip codebase, and the results were:

zero shot metrics {'nct-zeroshot-val-top1': 0.30728805599660086, 'nct-zeroshot-val-top5': 0.6808149026097458}

Could you kindly help me understand why I am unable to reproduce the reported numbers? I would like to understand what I might be doing wrong.

Thank you.

Noisy text

Hello,
In your data.csv file, the noisy text and corrected text are pretty much identical.
I wonder if, by any chance, the wrong data was put in the noisy text column.
Thank you in advance.

Error in data_utils.py

Hello!

After downloading the videos I get an error running

python -m main --base_dir ${BASE_DIR}

The stack trace is as follows:

Traceback (most recent call last):
  File "/home/groups/jamesz/fede/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/groups/jamesz/fede/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/oak/stanford/groups/jamesz/pathtweets/quilt/quilt1m/data/main.py", line 166, in <module>
    main(args, data_df, recon_df, device, histo_models_dict, video_paths_dict)
  File "/oak/stanford/groups/jamesz/pathtweets/quilt/quilt1m/data/main.py", line 68, in main
    rep_chunk_im_temp = save_frame_chunks_recon(video_path, stable_times, chunk_id,fps, height, width)
  File "/oak/stanford/groups/jamesz/pathtweets/quilt/quilt1m/data/data_utils.py", line 108, in save_frame_chunks_recon
    clip_start_time, clip_end_time = start_end_time
TypeError: cannot unpack non-iterable int object

Here are some additional variables that may help in understanding what's happening:

>>> stable_se_times 
(2, 17) 
>>> start_end_time
2

Basically, the assignment from start_end_time raises an error at the line

clip_start_time, clip_end_time = start_end_time

Any clue on where this might come from?

Thanks!

Video and Frames Download

Hi again :)

do you have any code you can share for downloading the videos?

Thank you so much! I appreciate your help on this!

Error on loading QuiltNet-B-16

Hi,
The error occurs when running the command below:

from transformers import CLIPModel
model = CLIPModel.from_pretrained("wisdomik/QuiltNet-B-16", use_auth_token=None)

The error msg is:

RuntimeError: Error(s) in loading state_dict for CLIPModel:
	size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]).
	size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([197, 768]) from checkpoint, the shape in current model is torch.Size([50, 768]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

Is there an issue with that, or is something wrong with my dev environment?
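For comparison, here is a sketch of the open_clip loading path shown elsewhere on this page for the B-32 model; I am assuming (not certain) that the B-16 checkpoint is meant to be loaded the same way via its hf-hub id rather than through transformers' CLIPModel:

import open_clip

# Assumed hub id, by analogy with the B-32 example elsewhere on this page;
# adjust if the repository documents a different one.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:wisdomik/QuiltNet-B-16')
tokenizer = open_clip.get_tokenizer('hf-hub:wisdomik/QuiltNet-B-16')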

Training Text

Amazing work! However, I cannot find the code for training the CLIP model. I have a question: how should the text be used for training CLIP? I cannot confirm this from the quilt_1M_lookup.csv file. Thank you~

Access to the data

Hello,

Firstly, thank you for this. Amazing work!

I'm a PhD student and I have applied for access to your dataset. I haven't received a reply yet; could you please grant me access?

[email protected]

Best regards,
Markus Ekvall

Clarification on stable_times assignment in data/main.py

Hello, thanks for sharing your work.

I am currently working with your project and I have a question regarding a specific line of code in the data/main.py file. In the file, I noticed the following line: stable_times = list_idle_im_se_t_tup[chunk_id][chunk_id][0]. I wanted to confirm whether this line should actually be: stable_times = list_idle_im_se_t_tup[chunk_id][chunk_id]. Could you provide some clarification on this?

Additionally, I'm curious about how the list_idle_im_se_t_tup variable is generated and how it ensures that its length matches the length of chunks. Could you please point me to that section of the code or provide some insights on how this synchronization is achieved?

I appreciate your time and assistance. Thank you in advance for your help!

Best regards

Superfluous images?

Dear authors,

thanks so much for providing this resource! It seems to me that the following 4 files have no metadata (in quilt_1M_lookup.csv). Is this possible?

_b_M_sOb4ZI_image_0760643c-923b-4f1e-a5e4-8b2f9b3f2849.jpg
uytytgxGP2Y_image_1c51efef-1301-4f83-ad35-bbf92fb6f90a.jpg
7M7Ol5StU7U_image_b61a7317-b9b7-4d66-9158-828ba75bfb27.jpg
7M7Ol5StU7U_image_84954e04-5f71-46cd-aa20-8595596e4649.jpg

If the error is on my side I apologize, but when using the data my dataloader complained that there were files without metadata, so I thought I'd give you this feedback.

Best,
Marc

Load ViT-B/32

Hi, thank you for sharing your work.

Can you provide me some more details to load ViT-B/32 model?

Missing Images

Hi,

Thank you for creating this wonderful repository.

I've received access to the dataset through Zenodo and downloaded all the files. There seem to be missing images: out of the 10 packed .zip files there are only ~650K images (out of 1M).

Is this an issue or am I missing something?

Thank you again

Downloaded Dataset Size

Hi!

  • What is the expected size of the dataset once downloaded, and after processing by calling main.py?
  • Also, approximately how long would each step of the process take?

Providing dataset

First of all, thank you for providing good paper and dataset.

I am wondering whether your team plans to provide the Quilt-1M dataset including the additional data from Twitter, PMC, etc.

Thank you!

Which preprocess should I use for linear probing?

Hi, thank you for your work.

I am adapting your model to my dataset:

  1. I use preprocess_train for linear probing (only the vision encoder) and preprocess_val for testing:

_, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:wisdomik/QuiltNet-B-32')

Is this correct?

  2. Should I skip the projection layer of the vision model (the one that maps features from 768 to 512) and replace it with a 768 -> num_class layer? (A rough sketch of my current setup follows.)
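For context, this is roughly the pipeline I have in mind (a minimal sketch using the projected encode_image features; this is my assumption, not necessarily the protocol used in the paper):

import torch
import open_clip
from sklearn.linear_model import LogisticRegression

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:wisdomik/QuiltNet-B-32')
model.eval()

@torch.no_grad()
def extract_features(loader):
    # Images in the loader are transformed with preprocess_val (no augmentation).
    feats, labels = [], []
    for images, targets in loader:
        f = model.encode_image(images)  # projected features (512-d for ViT-B/32)
        f /= f.norm(dim=-1, keepdim=True)
        feats.append(f.cpu())
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# train_loader / test_loader are placeholders for my own DataLoaders.
# train_feats, train_labels = extract_features(train_loader)
# clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
# test_feats, test_labels = extract_features(test_loader)
# print('linear probe accuracy:', clf.score(test_feats, test_labels))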

Issues with CSV files

Hi,
I am trying to recreate the QUILT dataset. I have a doubt regarding some of the columns in the CSV files that you have shared in the repo. Can you please highlight how you obtained the "stable_times" column in quilt_recon.csv?

Also, were the images in the "image_path" column of quilt_data.csv extracted using the Static Video Chunk Detection Algorithm? Can you please elaborate on the generation of the quilt_data.csv file?

Thank you

How to fine-tune QuiltNet-B-32 model

Hello, thanks for sharing your work.

I want to fine-tune the QuiltNet-B-32 model to suit my downstream tasks. Can you provide a fine-tuning script? Or give an example of using QuiltNet-B-32?

Image-to-text generation

Can you please guide me on how I can use Quilt-1M for image-to-text generation, i.e., I input an image and it generates a text description? Do I need to use LLaVA- or BLIP-like models, where I load the Quilt-1M weights and use them for text description generation? The API mentioned on Hugging Face is only for zero-shot classification, and I could not find the text retrieval code in the GitHub repo. Moreover, I also tried BLIP, but ran into compatibility issues. Thanks.

Visualizations of the results

Hello, thank you for your great work!

I see that this repository includes comparisons in terms of visualization. Could you provide a visualization tutorial or script for this? Thank you!

Missing Imports and Code Errors

Hello!

Thanks for the great resource!

I have been trying to run the data reconstruction but stumbled upon a couple of different errors (some are missing imports, e.g., nn from torch; one was an unclosed parenthesis). There are also a couple of requirements missing from the requirements file (e.g., scikit-image).

Would you mind taking a look? I have solved some of these and am happy to send a PR if needed, but maybe you have an updated version of the code that runs out of the box.

Please do not use BiomedCLIP for the ARCH dataset

Dear Author,

The ARCH dataset is divided into two subsets: the books_set and the pubmed_set.

I have noticed that the pubmed_set appears to overlap with BiomedCLIP's training data, which is sourced from PubMed Central.

In your paper, you combined these two datasets for cross-modality retrieval. However, I decided to separate them and compare their performance individually.

The retrieval performance on the pubmed_set was as follows:
{15.7; 79.8; 94.4; 16.7; 78.9; 93.7}

Meanwhile, the retrieval performance on the books_set was:
{7.3; 49.2; 74.2; 8.2; 49.7; 73.2}

In contrast, the performance of QUILT-GPT/77 showed different results:

The retrieval performance on the pubmed_set was:
{1.8; 23.6; 46.0; 1.6; 23.4; 45.7}

The retrieval performance on the books_set was:
{1.8; 27.7; 52.8; 1.5; 23.4; 46.4}

From these results, it's clear that there isn't as significant a domain gap between the two datasets as there is with BiomedCLIP.
