Coder Social home page Coder Social logo

bhargav794 / visual-chatgpt Goto Github PK

View Code? Open in Web Editor NEW

This project forked from chenfei-wu/taskmatrix

0.0 0.0 0.0 33.46 MB

Official repo for the paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

License: MIT License

Python 100.00%

visual-chatgpt's Introduction

Visual ChatGPT

Visual ChatGPT connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.

See our paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Open in Spaces Open in Colab

Updates:

  • Now Visual ChatGPT supports GroundingDINO and segment-anything! Thanks @jordddan for his efforts. For the image editing case, GroundingDINO is first used to locate bounding boxes guided by given text, then segment-anything is used to generate the related mask, and finally stable diffusion inpainting is used to edit image based on the mask.

  • Now Visual ChatGPT can support Chinese! Thanks to @Wang-Xiaodong1899 for his efforts.

  • We propose the template idea in Visual ChatGPT!

    • A template is a pre-defined execution flow that assists ChatGPT in assembling complex tasks involving multiple foundation models.
    • A template contains the experiential solution to complex tasks as determined by humans.
    • A template can invoke multiple foundation models or even establish a new ChatGPT session
    • To define a template, simply adding a class with attributes template_model = True
  • Thanks to @ShengmingYin and @thebestannie for providing a template example in InfinityOutPainting class (see the following gif)

    • Firstly, run python visual_chatgpt.py --load "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"
    • Secondly, say extend the image to 2048x1024 to Visual ChatGPT!
    • By simply creating an InfinityOutPainting template, Visual ChatGPT can seamlessly extend images to any size through collaboration with existing ImageCaptioning, ImageEditing, and VisualQuestionAnswering foundation models, without the need for additional training.
  • Visual ChatGPT needs the effort of the community! We crave your contribution to add new and interesting features!

Insight & Goal:

On the one hand, ChatGPT (or LLMs) serves as a general interface that provides a broad and diverse understanding of a wide range of topics. On the other hand, Foundation Models serve as domain experts by providing deep knowledge in specific domains. By leveraging both general and deep knowledge, we aim at building an AI that is capable of handling various tasks.

Demo

System Architecture

Logo

Quick Start

# clone the repo
git clone https://github.com/microsoft/visual-chatgpt.git

# Go to directory
cd visual-chatgpt

# create a new environment
conda create -n visgpt python=3.8

# activate the new environment
conda activate visgpt

#  prepare the basic environments
pip install -r requirements.txt
pip install  git+https://github.com/IDEA-Research/GroundingDINO.git
pip install  git+https://github.com/facebookresearch/segment-anything.git

# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}

# Start Visual ChatGPT !
# You can specify the GPU/CPU assignment by "--load", the parameter indicates which 
# Visual Foundation Model to use and where it will be loaded to
# The model and device are separated by underline '_', the different models are separated by comma ','
# The available Visual Foundation Models can be found in the following table
# For example, if you want to load ImageCaptioning to cpu and Text2Image to cuda:0
# You can use: "ImageCaptioning_cpu,Text2Image_cuda:0"

# Advice for CPU Users
python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu

# Advice for 1 Tesla T4 15GB  (Google Colab)                       
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,Text2Image_cuda:0"
                                
# Advice for 4 Tesla V100 32GB                            
python visual_chatgpt.py --load "Text2Box_cuda:0,Segmenting_cuda:0,
    Inpainting_cuda:0,ImageCaptioning_cuda:0,
    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
    SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3"

GPU memory usage

Here we list the GPU memory usage of each visual foundation model, you can specify which one you like:

Foundation Model GPU Memory (MB)
ImageEditing 3981
InstructPix2Pix 2827
Text2Image 3385
ImageCaptioning 1209
Image2Canny 0
CannyText2Image 3531
Image2Line 0
LineText2Image 3529
Image2Hed 0
HedText2Image 3529
Image2Scribble 0
ScribbleText2Image 3531
Image2Pose 0
PoseText2Image 3529
Image2Seg 919
SegText2Image 3529
Image2Depth 0
DepthText2Image 3531
Image2Normal 0
NormalText2Image 3529
VisualQuestionAnswering 1495

Acknowledgement

We appreciate the open source of the following projects:

Hugging FaceLangChainStable DiffusionControlNetInstructPix2PixCLIPSegBLIP

Contact Information

For help or issues using the Visual ChatGPT, please submit a GitHub issue.

For other communications, please contact Chenfei WU ([email protected]) or Nan DUAN ([email protected]).

visual-chatgpt's People

Contributors

chenfei-wu avatar nanduan avatar jordddan avatar binayakjha avatar shengming-yin avatar dawnmsg avatar weizhenq avatar zetangforward avatar bojieli avatar magnetic2014 avatar rupeshs avatar heysujal avatar wang-xiaodong1899 avatar ezioruan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.