Papers and code for large vision-language models.

This repo mainly focuses on tasks for large vision-language models. If you want to recommend papers, please open a pull request or email me at [email protected].

If you are interested in related tasks, you can reach me via Discord (yangcao#9724) or WeChat (85298328912).
- [3D-LLM] 3D-LLM: Injecting the 3D World into Large Language Models, NeurIPS2023. [Code]
- [LL3DA] LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning, CVPR2024. [Code]
- [GPT4Point] GPT4Point: A Unified Framework for Point-Language Understanding and Generation, CVPR2024. [Code]
- [Uni3D] Uni3D: Exploring Unified 3D Representation at Scale, ICLR2024. [Code]
- [LLaMA-VID] LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models, Arxiv2023. [Code]
- [Mini-Gemini] Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models, Arxiv2024. [Code]
- [Prompt Highlighter] Prompt Highlighter: Interactive Control for Multi-Modal LLMs, CVPR2024. [Code]