
opengvlab / all-seeing

[ICLR 2024] This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World"

Home Page: https://huggingface.co/spaces/OpenGVLab/all-seeing

Languages: Dockerfile 0.24%, Shell 9.26%, Python 87.57%, HTML 1.15%, JavaScript 1.50%, CSS 0.27%
Topics: all-seeing, dataset, region-text

all-seeing's People

Contributors

li-qingyun, weiyun1025, whai362, yuwenxiong


all-seeing's Issues

About 100k of the answers in the SAM data are "What is the difference between a man and a woman?"

I guess there was some kind of a bug? See, for example:

{'id': -1,
'image': 'sam/sa_000044/sa_500317.jpg',
'height': 1500,
'width': 2666,
'conversations': [{'from': 'human',
'value': '\nWhat is the primary function of these missiles? Please answer the question according to this region: [[529, 530, 590, 622]].'},
{'from': 'gpt',
'value': '"What is the difference between a man and a woman?"'},
{'from': 'human',
'value': 'What is the primary function of these missiles? Please answer the question according to this region: [[606, 597, 750, 737]].'},
{'from': 'gpt',
'value': '"What is the difference between a man and a woman?"'}]}

How to use this model?

Hello, thank you very much for your excellent work. I would like to use your model for some image captioning tasks. Could you please provide some usage instructions? Thank you!
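Until official instructions appear, here is a minimal captioning sketch. It assumes the repo's llava fork keeps upstream LLaVA's module layout (load_pretrained_model does appear in llava/model/builder.py per the traceback in another issue below); exact function names and generation arguments may differ in this fork.

    import torch
    from PIL import Image
    from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
    from llava.model.builder import load_pretrained_model

    model_path = "OpenGVLab/ASMv2"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, get_model_name_from_path(model_path))

    image = Image.open("example.jpg").convert("RGB")  # your own image
    image_tensor = process_images([image], image_processor, model.config).to(
        model.device, dtype=torch.float16)

    # Build a vicuna_v1-style prompt that asks for a caption.
    conv = conv_templates["vicuna_v1"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe the image in detail.")
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_image_token(
        conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor,
                                    do_sample=False, max_new_tokens=256)
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())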

Regarding generation of Relation Conversation.

To generate the dataset with predicates, which are represented as predicate tags, the predicates need to be parsed out of long sentences (i.e., captions). How did you parse these? Were they annotated manually, or extracted with a rule-based parser?
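For reference, this is roughly what a rule-based extraction could look like; it is an illustration using spaCy's dependency parse, not necessarily the pipeline used by the authors:

    # Illustrative only: pull (subject, predicate, object) triples from a caption
    # using dependency labels. Real pipelines usually need many more rules.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_triples(caption):
        doc = nlp(caption)
        triples = []
        for token in doc:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
        return triples

    print(extract_triples("A man is riding a horse next to a dog."))
    # e.g. [('man', 'ride', 'horse')]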

ZeroDivisionError and ModelProto error

I ran sh scripts_asmv2/eval/psg_eval.sh OpenGVLab/ASMv2.
I get 2 errors: RuntimeError: Internal: could not parse ModelProto from OpenGVLab/ASMv2/tokenizer.model and ZeroDivisionError: division by zero
I am not using Docker.

(/s/red/a/nobackup/vision/anju/allseeing/cvenv) carnap:/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2$ sh scripts_asmv2/eval/psg_eval.sh OpenGVLab/ASMv2
Traceback (most recent call last):
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/eval/model_vqa_loader.py", line 143, in
eval_model(args)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/eval/model_vqa_loader.py", line 79, in eval_model
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/model/builder.py", line 105, in load_pretrained_model
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 702, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1841, in from_pretrained
return cls._from_pretrained(
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2004, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama.py", line 144, in init
self.sp_model.Load(vocab_file)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/sentencepiece/init.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/sentencepiece/init.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from OpenGVLab/ASMv2/tokenizer.model
/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/torch/__init__.py:747: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:431.)
_C._set_default_tensor_type(t)
Traceback (most recent call last):
File "llava/eval/eval_psg.py", line 252, in
eval_psg(
File "llava/eval/eval_psg.py", line 226, in eval_psg
print(f'Recall: {sum(recall) / len(recall) * 100:.2f}')
ZeroDivisionError: division by zero
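One common cause of "could not parse ModelProto" (an assumption for this report, not confirmed) is that OpenGVLab/ASMv2/tokenizer.model is a Git LFS pointer file or a truncated download rather than the actual SentencePiece binary. A quick check:

    from pathlib import Path

    path = Path("OpenGVLab/ASMv2/tokenizer.model")
    data = path.read_bytes()
    print("size:", path.stat().st_size)  # the real tokenizer is several hundred KB
    if data.startswith(b"version https://git-lfs"):
        print("This is an LFS pointer; re-download the weights, e.g. with `git lfs pull`.")

If the tokenizer never loads, the eval run produces no predictions, which would also explain the downstream ZeroDivisionError when the recall list is empty.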

Question about the training stages

Hello! Thank you for your work! I noticed that the data used in the two training stages overlaps heavily. Why not drop the first stage and train the second stage directly? Is it so that the model better fits the data for general understanding ability?

Issue on bounding box coordinates

Hi there,

Looking at the annotations, I found that some bbox coordinate values exceed the image size (usually 640*480), e.g.:
'\nWhat are the two people[[200, 251, 447, 963], [529, 246, 744, 984]] doing in the image?\nAnswer the question with scene graph.'

I am wondering whether any extra operation needs to be applied (e.g., normalization).

Cheers!
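Those values look normalized rather than raw pixels. Assuming a 0-1000 normalization (an assumption based on the numbers above, not something confirmed here), they could be mapped back to pixel space like this:

    def denormalize_box(box, width, height, scale=1000):
        """Map a [x1, y1, x2, y2] box from 0-`scale` coordinates to pixels."""
        x1, y1, x2, y2 = box
        return [x1 / scale * width, y1 / scale * height,
                x2 / scale * width, y2 / scale * height]

    print(denormalize_box([200, 251, 447, 963], width=640, height=480))
    # -> [128.0, 120.48, 286.08, 462.24]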

About data preparation

Hi, I have a few uncertainties about preparing the stage-2 dataset for AS-V2 and would appreciate some clarification:

  1. The instructions say ScienceQA needs to be downloaded, but I don't see a corresponding folder in the data directory structure. Is it actually not needed, or which folder should it be placed under?
  2. Two links are provided for ShareGPT4V-100K, but the directory structure seems to use only three items from the first link (web-xxx and wiki). Do the remaining files need to be downloaded?
  3. Does the directory structure of the `sam` folder mean that training only uses sa000000-sa000063, with everything else going under the images folder?
  4. There are also many JSON files on Hugging Face, such as as_mix_4m.json and rec_detailed_description_42k.json. Which directory should these be placed in?

Thanks!

Coordinate issue in AS-V2

In the AS-V2 annotations, could the X1 and X2 of the normalized coordinates interleaved in the text be incorrect?

[Question] Can all-seeing-v2 handle images where the target does not exist for the grounding task?

Referring to grounding_eval.jsonl, I use the following prompt template:

Please provide the bounding box coordinate of the region this sentence describes: {target}

And based on all-seeing-v2/llava/eval/model_vqa_loader.py, I made some modifications to test my dataset:

    parser.add_argument("--model-path", type=str, default="OpenGVLab/ASMv2")
    parser.add_argument("--conv-mode", type=str, default="vicuna_v1")
    parser.add_argument("--temperature", type=float, default=0)

I want to ground the person holding a hose. Some images in my dataset contain a person holding a hose, while others do not, but the model cannot tell them apart: regardless of whether the person in the image is holding a hose, it marks them with a box. Is there any way to make the model mark only the person holding the hose, and output nothing when the target does not exist in the image?

Also, can the model mark multiple targets at once when several objects in the image match the query? For example, when there are multiple people in the image and my target is "person", can the model output multiple boxes?

Thank you very much for your answers.
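On the multiple-boxes question: the annotations shown elsewhere in this repo interleave several boxes in one answer (e.g. [[200, 251, 447, 963], [529, 246, 744, 984]]), so a small helper like the following (illustrative, not part of the repo) can collect every box the model emits, however many there are:

    import re

    BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

    def extract_boxes(answer):
        """Return all [x1, y1, x2, y2] boxes found in a generated answer."""
        return [[int(v) for v in m] for m in BOX_PATTERN.findall(answer)]

    print(extract_boxes("The two people[[200, 251, 447, 963], [529, 246, 744, 984]] are talking."))
    # -> [[200, 251, 447, 963], [529, 246, 744, 984]]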

Special tokens

Nice work! Why didn't all-seeing-v2 add <ref> etc. to the special tokens?
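For context, the question is about registering such markers with the tokenizer; here is a hedged sketch of how that would be done with Hugging Face transformers (not the authors' rationale, and the exact marker names below are assumptions):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/ASMv2", use_fast=False)
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<ref>", "</ref>", "<box>", "</box>"]})
    print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
    # The model's input embeddings would then need to grow to match:
    # model.resize_token_embeddings(len(tokenizer))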

llava module not found

I run into a missing module (llava) when I run your provided scripts, since the llava module is not installed. I tried running pip install -e . in the /llava directory, but it doesn't have a setup.py or pyproject.toml file. I installed the original llava repo and successfully fine-tuned using your provided scripts; however, it looks like your llava code is modified from the original repo.

How do you recommend installing the llava module? Add all-seeing/llava to the Python path, or install the original llava repo?
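One workaround (an assumption about the intended usage, not documented in the repo) is to make the modified package importable by putting the all-seeing-v2 directory on the Python path before running the scripts:

    import sys

    # Hypothetical local checkout path; adjust to where you cloned the repo.
    sys.path.insert(0, "/path/to/all-seeing/all-seeing-v2")

    from llava.model.builder import load_pretrained_model  # should now resolve

Equivalently, the same directory can be exported via the PYTHONPATH environment variable before invoking the provided shell scripts.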

Stage-2 fine-tuning error

I followed the documentation to prepare stage-2 fine-tuning and ran sh scripts_asmv2/stage2-finetune.sh directly, which raised the following error:

ValueError: Looks like distributed multinode run but MASTER_ADDR env not set, please try exporting rank 0's hostname as MASTER_ADDR

I then changed the command to torchrun --master_port=xxxxx, but got a CUDA out-of-memory error (even though I had already set the batch size to 1). The environment is an A100 with DeepSpeed ZeRO-2. What could be going on?
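For the MASTER_ADDR part, a possible (unconfirmed) workaround for single-node runs is to provide a local rendezvous address in the environment before launching the script, e.g.:

    import os
    import subprocess

    # Hypothetical single-node launch: supply a local master address/port so the
    # launcher does not treat this as a multi-node run. The port is a placeholder.
    env = dict(os.environ, MASTER_ADDR="127.0.0.1", MASTER_PORT="29500")
    subprocess.run(["sh", "scripts_asmv2/stage2-finetune.sh"], env=env, check=True)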
