medical-graph-rag's Introduction

Medical-Graph-RAG

We build a Graph RAG System specifically for the medical domain.

Check our paper here: https://arxiv.org/html/2408.04187v1

Quick Start (Baseline: a simple Graph RAG pipeline on medical data)

  1. conda env create -f medgraphrag.yml

  2. export OPENAI_API_KEY=<your OPENAI_API_KEY>

  3. python run.py -simple True (this uses ./dataset_ex/report_0.txt as the RAG document and "What is the main symptom of the patient?" as the prompt; change the prompt in run.py as you like.)

Build from scratch (Complete Graph RAG flow in the paper)

About the dataset

Paper Datasets

Top-level private data (user-provided): we used the MIMIC-IV dataset as the private data.

Medium-level books and papers: we used MedC-K as the medium-level data. The paper corpus is sourced from S2ORC; only papers with PubMed IDs are treated as medical-related and used during pretraining. The book list is provided in this repo as MedicalBook.xlsx; due to licensing, we cannot release the raw content, so to reproduce our results please purchase and process the books yourself.

Bottom-level dictionary data: we used the Unified Medical Language System (UMLS) as the bottom-level data. To access it, you'll need to create an account and apply for a license. It is free and approval is typically fast.

In the code, we use the 'trinity' argument to enable the hierarchical graph-linking function. If it is set to True, you must also provide a 'gid' (graph ID) to specify which graphs the top level should link to. UMLS is already largely structured as a graph, so minimal effort is required to construct it. MedC-K, however, must be converted into graph data. There are several ways to do this: you can follow the approach we used to process the top level in this repo (open-source LLMs are recommended to keep costs down), or you can opt for non-learning-based graph construction algorithms (faster and cheaper, but generally noisier).
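For instance, a minimal non-learning-based construction is a term co-occurrence graph: link two concepts whenever they appear in the same paragraph. The sketch below uses networkx and a toy term list standing in for a real vocabulary such as UMLS concepts; it is an illustration, not the pipeline used in this repo.

    import itertools
    import networkx as nx

    # Toy vocabulary standing in for a real medical term list (e.g. UMLS concepts).
    TERMS = {"hypertension", "diabetes", "metformin", "chest pain", "aspirin"}

    def cooccurrence_graph(paragraphs):
        """Link two terms whenever they co-occur in the same paragraph."""
        g = nx.Graph()
        for para in paragraphs:
            text = para.lower()
            found = sorted({t for t in TERMS if t in text})
            for a, b in itertools.combinations(found, 2):
                weight = g.get_edge_data(a, b, default={}).get("weight", 0)
                g.add_edge(a, b, weight=weight + 1)
        return g

    paragraphs = open("report.txt").read().split("\n\n")  # any plain-text report
    print(cooccurrence_graph(paragraphs).edges(data=True))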

Example Datasets

Recognizing that accessing and processing all of the data mentioned above may be challenging, we are working to provide simpler example datasets that demonstrate the functionality. Currently, we use mimic_ex here as the top-level data; it is a smaller, processed dataset derived from MIMIC. For the medium-level and bottom-level data, we are still identifying suitable alternatives to simplify the implementation; any recommendations are welcome.

1. Prepare the environment, Neo4j and LLM

  1. conda env create -f medgraphrag.yml

  2. prepare Neo4j and an LLM (we use ChatGPT here as an example); you need to export:

export OPENAI_API_KEY=<your OPENAI_API_KEY>

export NEO4J_URL=<your NEO4J_URL>

export NEO4J_USERNAME=<your NEO4J_USERNAME>

export NEO4J_PASSWORD=<your NEO4J_PASSWORD>
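To sanity-check these exports, a minimal connectivity test (a sketch assuming the official neo4j and openai Python packages are installed) could look like this:

    import os
    from neo4j import GraphDatabase
    from openai import OpenAI

    # Verify the Neo4j credentials exported above are picked up correctly.
    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    driver.verify_connectivity()
    print("Neo4j connection OK")

    # The OpenAI client reads OPENAI_API_KEY from the environment.
    client = OpenAI()
    print("OpenAI key OK, models available:", len(client.models.list().data))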

2. Construct the graph (use "mimic_ex" dataset as an example)

  1. Download mimic_ex here and put it under your data path, e.g. ./dataset/mimic_ex

  2. python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex (wherever you put the dataset) -grained_chunk True -ingraphmerge True -construct_graph True
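Conceptually, construction boils down to chunking the reports, extracting entities and relations, and writing them into Neo4j. A generic sketch of the write step (not the repository's exact code; the node label, relation type, and gid property are illustrative assumptions) could be:

    import os
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )

    def add_triple(head, relation, tail, gid="mimic_ex"):
        # MERGE keeps entities unique per graph; 'gid' tags which graph a node belongs to.
        query = (
            "MERGE (h:Entity {name: $head, gid: $gid}) "
            "MERGE (t:Entity {name: $tail, gid: $gid}) "
            "MERGE (h)-[:REL {type: $relation}]->(t)"
        )
        with driver.session() as session:
            session.run(query, head=head, tail=tail, relation=relation, gid=gid)

    add_triple("chest pain", "symptom_of", "myocardial infarction")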

3. Model Inference

  1. Put your prompt in ./prompt.txt

  2. python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex (wherever you put the dataset) -inference True
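For intuition, inference roughly amounts to retrieving a relevant subgraph and letting the LLM answer from it. The sketch below is a simplified illustration of that flow; the retrieval query, node labels, and model name are assumptions, not the repository's actual code.

    import os
    from neo4j import GraphDatabase
    from openai import OpenAI

    driver = GraphDatabase.driver(
        os.environ["NEO4J_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    client = OpenAI()

    prompt = open("./prompt.txt").read().strip()

    # Naive retrieval: pull triples whose head entity is mentioned in the prompt.
    with driver.session() as session:
        records = session.run(
            "MATCH (h:Entity)-[r:REL]->(t:Entity) "
            "WHERE toLower($prompt) CONTAINS toLower(h.name) "
            "RETURN h.name AS head, r.type AS rel, t.name AS tail LIMIT 20",
            prompt=prompt,
        )
        context = "\n".join(f"{rec['head']} -{rec['rel']}-> {rec['tail']}" for rec in records)

    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model; an assumption, not the repo's setting
        messages=[
            {"role": "system", "content": "Answer using only the provided graph context."},
            {"role": "user", "content": f"Graph context:\n{context}\n\nQuestion: {prompt}"},
        ],
    )
    print(answer.choices[0].message.content)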

Acknowledgement

We build on CAMEL, an awesome framework for constructing multi-agent pipelines.

Cite

@article{wu2024medical,
  title={Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation},
  author={Wu, Junde and Zhu, Jiayuan and Qi, Yunli},
  journal={arXiv preprint arXiv:2408.04187},
  year={2024}
}


medical-graph-rag's Issues

Full dataset

@WuJunde

Thanks for sharing this great research.

You're currently providing an example dataset for testing; do you have any plans to provide the full dataset?

Chunking by using Sliding Window

Thank you so much for the wonderful pre-print and sharing the source code!

In the pre-print, a sliding window method is described as being used in the chunking:

To reduce noise generated by sequential processing, we implement a sliding window technique, managing five paragraphs at a time. We continuously adjust the window by removing the first paragraph and adding the next, maintaining focus on topic consistency.

In data_chunk.py, I observed a sequential process of

  1. split the text by \n\n
  2. extract propositions from each paragraph
  3. use add_propositions of AgenticChunker to do the chunking.

And add_propositions was sequentially adding propositions.

In the add_proposition of AgenticChunker, I observed that a proposition was added based on:

  1. Whether it's the first one
  2. Whether there are any relevant chunks

And in _find_relevant_chunk, it looks like all of the existing chunks are considered when finding the most relevant one.
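For clarity, what I expected from the pre-print description was something roughly like the sketch below (my own illustration of a five-paragraph window, not code from this repo):

    def sliding_windows(paragraphs, window_size=5):
        """Yield overlapping windows: drop the first paragraph, add the next."""
        for start in range(max(len(paragraphs) - window_size + 1, 1)):
            yield paragraphs[start:start + window_size]

    text = open("report.txt").read()  # any raw report
    paragraphs = text.split("\n\n")
    for window in sliding_windows(paragraphs):
        # In my understanding, propositions would be extracted from the newest
        # paragraph while the rest of the window provides topical context.
        print(window[-1][:60])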

I would really appreciate it if you could point me to the part that uses the sliding window!

Thank you so much!

Source code

Hello, I'm working on integrating knowledge graphs with large language models. I read your paper and found it very insightful. Would you be willing to share your source code?

Make the graph database a configuration option

Our customers use TigerGraph, not Neo4j, because TigerGraph is a distributed graph database and can support queries across multiple servers. We want Med-Graph-RAG to work on existing healthcare graphs that already have billions of vertices and edges.

I want to work with this team to refactor the code so that the back-end database can be customized using some configuration file.
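For example, the pipeline could code against a small backend interface selected from a config file, roughly like this sketch (the module and class names are hypothetical):

    from abc import ABC, abstractmethod

    class GraphBackend(ABC):
        """Minimal interface the RAG pipeline would depend on."""

        @abstractmethod
        def add_triple(self, head: str, relation: str, tail: str) -> None: ...

        @abstractmethod
        def neighbors(self, entity: str, limit: int = 20) -> list[tuple[str, str, str]]: ...

    def load_backend(config: dict) -> GraphBackend:
        # config comes from e.g. a YAML file: {"backend": "neo4j"} or {"backend": "tigergraph"}
        if config["backend"] == "neo4j":
            from backends.neo4j_backend import Neo4jBackend  # hypothetical module
            return Neo4jBackend(config)
        if config["backend"] == "tigergraph":
            from backends.tigergraph_backend import TigerGraphBackend  # hypothetical module
            return TigerGraphBackend(config)
        raise ValueError(f"unknown backend: {config['backend']}")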

Please let me know your thoughts on this. I may not understand the code well enough to judge the difficulty of this task.

Adding support for Ollama/local LLMs

Hey guys,

Thanks for making your work public. I'm wondering whether you have explored, or plan to explore, LLMs other than GPT-4 for your evaluations. For instance, you've used Llama 3 in your benchmarks; would you use something like Llama 3.1 via Ollama, or even Claude/DeepSeek? I'm wondering if you'll support those APIs in your code soon.
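For what it's worth, since Ollama exposes an OpenAI-compatible endpoint, pointing the existing client at a local model might already be close to a drop-in; a rough sketch (assuming Ollama is running locally and a Llama 3.1 model has been pulled):

    from openai import OpenAI

    # Ollama serves an OpenAI-compatible API at localhost:11434/v1; the api_key
    # just needs to be a non-empty string.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "What is the main symptom of the patient?"}],
    )
    print(response.choices[0].message.content)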

Adapting prompts for medical domain

Hi,

Thank you for open-sourcing this project. I've noticed that the prompts in nano-graphrag/nano_graphrag/prompt.py seem to closely follow the default prompts from Microsoft's official implementation, which include entity types like person, technology, mission, organization, and location.

I'm curious if you've experimented with adjusting these prompts to better extract medical-specific entities. If so, do you have any insights or findings from your experiments that you can share?
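For example, I imagine swapping the default entity types for medical ones, roughly along these lines (the variable and function names are my guess, not the actual ones in prompt.py):

    # Hypothetical override of the entity types fed into the extraction prompt.
    DEFAULT_ENTITY_TYPES = ["person", "technology", "mission", "organization", "location"]

    MEDICAL_ENTITY_TYPES = [
        "disease", "symptom", "medication", "procedure",
        "anatomy", "lab_test", "microorganism",
    ]

    def build_extraction_instruction(entity_types):
        return (
            "Extract all entities of the following types from the text: "
            + ", ".join(entity_types)
        )

    print(build_extraction_instruction(MEDICAL_ENTITY_TYPES))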

Thank you!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.