Topic: vision-language-model (Goto Github)
Something interesting about vision-language-model
vision-language-model,[ ICLR 2024 ] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
Organization: alaalab
Home Page: https://openreview.net/forum?id=Nu9mOSq7eH
vision-language-model,A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Organization: alibabaresearch
vision-language-model,Docker image for LLaVA: Large Language and Vision Assistant
User: ashleykleynhans
vision-language-model,From scratch implementation of a vision language model in pure PyTorch
User: avisoori1x
vision-language-model,The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
Organization: baai-agents
Home Page: https://baai-agents.github.io/Cradle/
vision-language-model,Exploring prompt tuning with pseudolabels for multiple modalities, learning settings, and training strategies.
Organization: batsresearch
Home Page: https://openreview.net/pdf?id=2b9aY2NgXE
vision-language-model,Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
User: chs20
vision-language-model,Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs
Organization: cocacola-lab
vision-language-model,DeepSeek-VL: Towards Real-World Vision-Language Understanding
Organization: deepseek-ai
Home Page: https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
vision-language-model,Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Organization: dvlab-research
vision-language-model,ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
Organization: explainableml
vision-language-model,Transferable Decoding with Visual Entities for Zero-Shot Image Captioning, ICCV 2023
User: feielysia
vision-language-model,Grounded Multimodal Large Language Model with Localized Visual Tokenization
Organization: foundationvision
Home Page: https://groma-mllm.github.io/
vision-language-model,Famous Vision Language Models and Their Architectures
User: gokayfem
vision-language-model,[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
User: haotian-liu
Home Page: https://llava.hliu.cc
vision-language-model,Multi-Aspect Vision Language Pretraining - CVPR2024
User: hieuphan33
Home Page: https://arxiv.org/abs/2403.07636
vision-language-model,VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
User: huangwl18
Home Page: https://voxposer.github.io/
vision-language-model,InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
Organization: internlm
vision-language-model,[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving
User: irohxu
vision-language-model,Collection of AWESOME vision-language models for vision tasks
User: jingyi0000
vision-language-model,Evaluating text-to-image/video/3D models with VQAScore
User: linzhiqiu
Home Page: https://linzhiqiu.github.io/papers/vqascore/
vision-language-model,Overview of Japanese LLMs (日本語LLMまとめ)
Organization: llm-jp
Home Page: https://llm-jp.github.io/awesome-japanese-llm
vision-language-model,Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'.
Organization: mala-lab
vision-language-model,[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Organization: mbzuai-oryx
Home Page: https://grounding-anything.com
vision-language-model,The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Organization: nvlabs
Home Page: https://shikun.io/projects/prismer
vision-language-model,Embodied Understanding of Driving Scenarios
Organization: opendrivelab
vision-language-model,[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source model approaching GPT-4V performance.
Organization: opengvlab
Home Page: https://arxiv.org/abs/2404.16821
vision-language-model,Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Organization: opengvlab
vision-language-model,A curated list of awesome knowledge-driven autonomous driving (continually updated)
Organization: pjlab-adg
vision-language-model,[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Organization: pku-yuangroup
Home Page: https://arxiv.org/abs/2311.08046
vision-language-model,The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
Organization: qwenlm
vision-language-model,HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
User: richard-peng-xia
Home Page: https://arxiv.org/abs/2311.14064
vision-language-model,LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition
User: richard-peng-xia
Home Page: https://arxiv.org/abs/2305.04536
vision-language-model,Code for RoboFlamingo
User: roboflamingo
Home Page: https://roboflamingo.github.io
vision-language-model,Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA or CogVLM. 🔥
Organization: roboflow
Home Page: https://maestro.roboflow.com
vision-language-model,[CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
User: ruili3
Home Page: https://ruili3.github.io/kyn
vision-language-model,Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
User: sshh12
vision-language-model,🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox
User: sun-hailong
vision-language-model,[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
User: sunzey
Home Page: https://aleafy.github.io/alpha-clip
vision-language-model,Recognize Any Regions
User: surrey-uplab
Home Page: https://arxiv.org/abs/2311.01373
vision-language-model,🧘🏻♂️ KarmaVLM (相生): A family of highly efficient and powerful vision-language models.
User: thomas-yanxin
vision-language-model,Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
Organization: ucsc-vlaa
Home Page: https://arxiv.org/abs/2311.16101
vision-language-model,Reading list for Multimodal Large Language Models
User: vincentlux
vision-language-model,Codes for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA, VL-Vicuna.
User: vpgtrans
Home Page: https://vpgtrans.github.io/
vision-language-model,[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
User: wusize
Home Page: https://arxiv.org/abs/2310.01403
vision-language-model,[IEEE TIP 2023] Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern Hopfield Networks
User: yonghaoxu
vision-language-model,[NeurIPS-2023] Annual Conference on Neural Information Processing Systems
User: yunqing-me
Home Page: https://arxiv.org/pdf/2305.16934.pdf
vision-language-model,A curated list of prompt learning methods for vision-language models.
User: zhengli97
vision-language-model,[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
User: zhengli97
Home Page: https://zhengli97.github.io/PromptKD/
vision-language-model,[CVPR2023] Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
User: zwx8981