medical-graph-rag's Introduction

Medical-Graph-RAG

We build a Graph RAG System specifically for the medical domain.

Check our paper here: https://arxiv.org/html/2408.04187v1

Quick Start (Baseline: a simple Graph RAG pipeline on medical data)

  1. conda env create -f medgraphrag.yml

  2. export OPENAI_API_KEY=<your OPENAI_API_KEY>

  3. python run.py -simple True (this uses ./dataset_ex/report_0.txt as the RAG document and "What is the main symptom of the patient?" as the prompt; change the prompt in run.py as you like.)

Build from scratch (Complete Graph RAG flow in the paper)

About the dataset

Paper Datasets

Top-level private data (user-provided): we used the MIMIC-IV dataset as the private data.

Medium-level books and papers: we used MedC-K as the medium-level data. The paper corpus is sourced from S2ORC; only papers with PubMed IDs are treated as medical-related and used during pretraining. The book list is provided in this repo as MedicalBook.xlsx; due to licensing, we cannot release the raw content, so to reproduce our results please purchase and process the books yourself.

Bottom-level dictionary data: we used the Unified Medical Language System (UMLS) as the bottom-level data. To access it, you'll need to create an account and apply for a license. It is free and approval is typically fast.

In the code, we use the 'trinity' argument to enable the hierarchical graph-linking function. If it is set to True, you must also provide a 'gid' (graph ID) to specify which graphs the top level should link to. UMLS is already largely structured as a graph, so minimal effort is required to construct it. MedC-K, however, must be converted into graph data. There are several ways to do this: you can follow the approach we used to process the top level in this repo (open-source LLMs are recommended to keep costs down), or you can opt for non-learning-based graph construction algorithms (faster and cheaper, but generally noisier).
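For instance, a minimal non-learning-based construction is a term co-occurrence graph: link two concepts whenever they appear in the same paragraph. The sketch below uses networkx and a toy term list standing in for a real vocabulary such as UMLS concepts; it is an illustration, not the pipeline used in this repo.

    import itertools
    import networkx as nx

    # Toy vocabulary standing in for a real medical term list (e.g. UMLS concepts).
    TERMS = {"hypertension", "diabetes", "metformin", "chest pain", "aspirin"}

    def cooccurrence_graph(paragraphs):
        """Link two terms whenever they co-occur in the same paragraph."""
        g = nx.Graph()
        for para in paragraphs:
            text = para.lower()
            found = sorted({t for t in TERMS if t in text})
            for a, b in itertools.combinations(found, 2):
                weight = g.get_edge_data(a, b, default={}).get("weight", 0)
                g.add_edge(a, b, weight=weight + 1)
        return g

    paragraphs = open("report.txt").read().split("\n\n")  # any plain-text report
    print(cooccurrence_graph(paragraphs).edges(data=True))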

Example Datasets

Recognizing that accessing and processing all of the data mentioned above may be challenging, we are working to provide simpler example datasets that demonstrate the functionality. Currently, we use mimic_ex here as the top-level data; it is a smaller, processed dataset derived from MIMIC. For the medium-level and bottom-level data, we are still identifying suitable alternatives to simplify the implementation; any recommendations are welcome.

1. Prepare the environment, Neo4j and LLM

  1. conda env create -f medgraphrag.yml

  2. prepare Neo4j and an LLM (we use ChatGPT here as an example); you need to export:

export OPENAI_API_KEY=<your OPENAI_API_KEY>

export NEO4J_URL=<your NEO4J_URL>

export NEO4J_USERNAME=<your NEO4J_USERNAME>

export NEO4J_PASSWORD=<your NEO4J_PASSWORD>
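To sanity-check these exports, a minimal connectivity test (a sketch assuming the official neo4j and openai Python packages are installed) could look like this:

    import os
    from neo4j import GraphDatabase
    from openai import OpenAI

    # Verify the Neo4j credentials exported above are picked up correctly.
    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    driver.verify_connectivity()
    print("Neo4j connection OK")

    # The OpenAI client reads OPENAI_API_KEY from the environment.
    client = OpenAI()
    print("OpenAI key OK, models available:", len(client.models.list().data))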

2. Construct the graph (use "mimic_ex" dataset as an example)

  1. Download mimic_ex here and put it under your data path, e.g. ./dataset/mimic_ex

  2. python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex (wherever you put the dataset) -grained_chunk True -ingraphmerge True -construct_graph True
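Conceptually, construction boils down to chunking the reports, extracting entities and relations, and writing them into Neo4j. A generic sketch of the write step (not the repository's exact code; the node label, relation type, and gid property are illustrative assumptions) could be:

    import os
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )

    def add_triple(head, relation, tail, gid="mimic_ex"):
        # MERGE keeps entities unique per graph; 'gid' tags which graph a node belongs to.
        query = (
            "MERGE (h:Entity {name: $head, gid: $gid}) "
            "MERGE (t:Entity {name: $tail, gid: $gid}) "
            "MERGE (h)-[:REL {type: $relation}]->(t)"
        )
        with driver.session() as session:
            session.run(query, head=head, tail=tail, relation=relation, gid=gid)

    add_triple("chest pain", "symptom_of", "myocardial infarction")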

3. Model Inference

  1. Put your prompt in ./prompt.txt

  2. python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex (wherever you put the dataset) -inference True
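For intuition, inference roughly amounts to retrieving a relevant subgraph and letting the LLM answer from it. The sketch below is a simplified illustration of that flow; the retrieval query, node labels, and model name are assumptions, not the repository's actual code.

    import os
    from neo4j import GraphDatabase
    from openai import OpenAI

    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    client = OpenAI()

    prompt = open("./prompt.txt").read().strip()

    # Naive retrieval: pull triples whose head entity is mentioned in the prompt.
    with driver.session() as session:
        records = session.run(
            "MATCH (h:Entity)-[r:REL]->(t:Entity) "
            "WHERE toLower($prompt) CONTAINS toLower(h.name) "
            "RETURN h.name AS head, r.type AS rel, t.name AS tail LIMIT 20",
            prompt=prompt,
        )
        context = "\n".join(f"{rec['head']} -{rec['rel']}-> {rec['tail']}" for rec in records)

    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model; an assumption, not the repo's setting
        messages=[
            {"role": "system", "content": "Answer using only the provided graph context."},
            {"role": "user", "content": f"Graph context:\n{context}\n\nQuestion: {prompt}"},
        ],
    )
    print(answer.choices[0].message.content)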

Acknowledgement

We build on CAMEL, an awesome framework for constructing multi-agent pipelines.

Cite

@article{wu2024medical,
  title={Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation},
  author={Wu, Junde and Zhu, Jiayuan and Qi, Yunli},
  journal={arXiv preprint arXiv:2408.04187},
  year={2024}
}


medical-graph-rag's Issues

Full dataset

@WuJunde

Thanks for sharing this great research.

You're currently providing an example dataset for testing; do you have any plans to provide the full dataset?

Chunking by using Sliding Window

Thank you so much for the wonderful pre-print and sharing the source code!

In the pre-print, a sliding window method is described as being used in the chunking:

To reduce noise generated by sequential processing, we implement a sliding window technique, managing five paragraphs at a time. We continuously adjust the window by removing the first paragraph and adding the next, maintaining focus on topic consistency.

In data_chunk.py, I observed a sequential process of

  1. split the text by \n\n
  2. extract propositions from each paragraph
  3. use add_propositions of AgenticChunker to do the chunking.

And add_propositions was sequentially adding propositions.

In the add_proposition of AgenticChunker, I observed that a proposition was added based on:

  1. Whether it's the first one
  2. Whether there are any relevant chunks

And in _find_relevant_chunk, it looks like all of the existing chunks are considered when finding the most relevant one.
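For clarity, what I expected from the pre-print description was something roughly like the sketch below (my own illustration of a five-paragraph window, not code from this repo):

    def sliding_windows(paragraphs, window_size=5):
        """Yield overlapping windows: drop the first paragraph, add the next."""
        for start in range(max(len(paragraphs) - window_size + 1, 1)):
            yield paragraphs[start:start + window_size]

    text = open("report.txt").read()  # any raw report
    paragraphs = text.split("\n\n")
    for window in sliding_windows(paragraphs):
        # In my understanding, propositions would be extracted from the newest
        # paragraph while the rest of the window provides topical context.
        print(window[-1][:60])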

I would really appreciate it if you could point me to the part that uses the sliding window!

Thank you so much!

Source code

Hello, I'm working on integrating knowledge graphs with large language models. I read your paper and found it very insightful. Would you be willing to share your source code?

Make the graph database a configuration option

Our customers use TigerGraph, not Neo4j, because TigerGraph is a distributed graph database and can support queries across multiple servers. We want Med-Graph-RAG to work on existing healthcare graphs that already have billions of vertices and edges.

I want to work with this team to refactor the code so that the back-end database can be customized using some configuration file.
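For example, the pipeline could code against a small backend interface selected from a config file, roughly like this sketch (the module and class names are hypothetical):

    from abc import ABC, abstractmethod

    class GraphBackend(ABC):
        """Minimal interface the RAG pipeline would depend on."""

        @abstractmethod
        def add_triple(self, head: str, relation: str, tail: str) -> None: ...

        @abstractmethod
        def neighbors(self, entity: str, limit: int = 20) -> list[tuple[str, str, str]]: ...

    def load_backend(config: dict) -> GraphBackend:
        # config comes from e.g. a YAML file: {"backend": "neo4j"} or {"backend": "tigergraph"}
        if config["backend"] == "neo4j":
            from backends.neo4j_backend import Neo4jBackend  # hypothetical module
            return Neo4jBackend(config)
        if config["backend"] == "tigergraph":
            from backends.tigergraph_backend import TigerGraphBackend  # hypothetical module
            return TigerGraphBackend(config)
        raise ValueError(f"unknown backend: {config['backend']}")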

Please let me know your thoughts on this. I may not understand the code well enough to judge the difficulty of this task.

Adding support for Ollama/local LLMs

Hey guys,

Thanks for making your work public. I'm wondering whether you have explored, or plan to explore, LLMs other than GPT-4 for your evaluations. For instance, you've used Llama 3 in your benchmarks; would you use something like Llama 3.1 via Ollama, or even Claude/DeepSeek? I'm wondering if you'll support those APIs in your code soon.
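For what it's worth, since Ollama exposes an OpenAI-compatible endpoint, pointing the existing client at a local model might already be close to a drop-in; a rough sketch (assuming Ollama is running locally and a Llama 3.1 model has been pulled):

    from openai import OpenAI

    # Ollama serves an OpenAI-compatible API at localhost:11434/v1; the api_key
    # just needs to be a non-empty string.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "What is the main symptom of the patient?"}],
    )
    print(response.choices[0].message.content)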

Adapting prompts for medical domain

Hi,

Thank you for open-sourcing this project. I've noticed that the prompts in nano-graphrag/nano_graphrag/prompt.py seem to closely follow the default prompts from Microsoft's official implementation, which include entity types like person, technology, mission, organization, and location.

I'm curious if you've experimented with adjusting these prompts to better extract medical-specific entities. If so, do you have any insights or findings from your experiments that you can share?
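For example, I imagine swapping the default entity types for medical ones, roughly along these lines (the variable and function names are my guess, not the actual ones in prompt.py):

    # Hypothetical override of the entity types fed into the extraction prompt.
    DEFAULT_ENTITY_TYPES = ["person", "technology", "mission", "organization", "location"]

    MEDICAL_ENTITY_TYPES = [
        "disease", "symptom", "medication", "procedure",
        "anatomy", "lab_test", "microorganism",
    ]

    def build_extraction_instruction(entity_types):
        return (
            "Extract all entities of the following types from the text: "
            + ", ".join(entity_types)
        )

    print(build_extraction_instruction(MEDICAL_ENTITY_TYPES))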

Thank you!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.