
opengvlab / all-seeing

[ICLR 2024] This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World"

Home Page: https://huggingface.co/spaces/OpenGVLab/all-seeing

Languages: Dockerfile 0.24%, Shell 9.26%, Python 87.57%, HTML 1.15%, JavaScript 1.50%, CSS 0.27%
Topics: all-seeing, dataset, region-text

all-seeing's People

Contributors

li-qingyun, weiyun1025, whai362, yuwenxiong


all-seeing's Issues

About 100k of the answers in the SAM data are "What is the difference between a man and a woman?"

I guess there was some kind of a bug? See, for example:

{'id': -1,
'image': 'sam/sa_000044/sa_500317.jpg',
'height': 1500,
'width': 2666,
'conversations': [{'from': 'human',
'value': '\nWhat is the primary function of these missiles? Please answer the question according to this region: [[529, 530, 590, 622]].'},
{'from': 'gpt',
'value': '"What is the difference between a man and a woman?"'},
{'from': 'human',
'value': 'What is the primary function of these missiles? Please answer the question according to this region: [[606, 597, 750, 737]].'},
{'from': 'gpt',
'value': '"What is the difference between a man and a woman?"'}]}

How to use this model?

Hello, thank you very much for your excellent work. I would like to use your model for some image captioning tasks. Could you please provide some usage instructions? Thank you!
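Until official instructions appear, here is a minimal captioning sketch. It assumes the repo's llava fork keeps upstream LLaVA's module layout (load_pretrained_model does appear in llava/model/builder.py per the traceback in another issue below); exact function names and generation arguments may differ in this fork.

    import torch
    from PIL import Image
    from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
    from llava.model.builder import load_pretrained_model

    model_path = "OpenGVLab/ASMv2"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, get_model_name_from_path(model_path))

    image = Image.open("example.jpg").convert("RGB")  # your own image
    image_tensor = process_images([image], image_processor, model.config).to(
        model.device, dtype=torch.float16)

    # Build a vicuna_v1-style prompt that asks for a caption.
    conv = conv_templates["vicuna_v1"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe the image in detail.")
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_image_token(
        conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor,
                                    do_sample=False, max_new_tokens=256)
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())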

Regarding generation of Relation Conversation.

To generate the dataset with predicates, which are represented as predicate tags, the predicates need to be parsed out of long sentences (i.e., captions). How did you parse these? Were they annotated manually, or extracted with a rule-based parser?
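For reference, this is roughly what a rule-based extraction could look like; it is an illustration using spaCy's dependency parse, not necessarily the pipeline used by the authors:

    # Illustrative only: pull (subject, predicate, object) triples from a caption
    # using dependency labels. Real pipelines usually need many more rules.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_triples(caption):
        doc = nlp(caption)
        triples = []
        for token in doc:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
        return triples

    print(extract_triples("A man is riding a horse next to a dog."))
    # e.g. [('man', 'ride', 'horse')]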

ZeroDivisionError and ModelProto error

I ran sh scripts_asmv2/eval/psg_eval.sh OpenGVLab/ASMv2.
I get 2 errors: RuntimeError: Internal: could not parse ModelProto from OpenGVLab/ASMv2/tokenizer.model and ZeroDivisionError: division by zero
I am not using Docker.

(/s/red/a/nobackup/vision/anju/allseeing/cvenv) carnap:/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2$ sh scripts_asmv2/eval/psg_eval.sh OpenGVLab/ASMv2
Traceback (most recent call last):
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/eval/model_vqa_loader.py", line 143, in
eval_model(args)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/eval/model_vqa_loader.py", line 79, in eval_model
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
File "/s/red/a/nobackup/vision/anju/allseeing/all-seeing-main/all-seeing-v2/llava/model/builder.py", line 105, in load_pretrained_model
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 702, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1841, in from_pretrained
return cls._from_pretrained(
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2004, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama.py", line 144, in init
self.sp_model.Load(vocab_file)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/sentencepiece/init.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/sentencepiece/init.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from OpenGVLab/ASMv2/tokenizer.model
/s/red/a/nobackup/vision/anju/allseeing/cvenv/lib/python3.8/site-packages/torch/__init__.py:747: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:431.)
_C._set_default_tensor_type(t)
Traceback (most recent call last):
File "llava/eval/eval_psg.py", line 252, in
eval_psg(
File "llava/eval/eval_psg.py", line 226, in eval_psg
print(f'Recall: {sum(recall) / len(recall) * 100:.2f}')
ZeroDivisionError: division by zero
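One common cause of "could not parse ModelProto" (an assumption for this report, not confirmed) is that OpenGVLab/ASMv2/tokenizer.model is a Git LFS pointer file or a truncated download rather than the actual SentencePiece binary. A quick check:

    from pathlib import Path

    path = Path("OpenGVLab/ASMv2/tokenizer.model")
    data = path.read_bytes()
    print("size:", path.stat().st_size)  # the real tokenizer is several hundred KB
    if data.startswith(b"version https://git-lfs"):
        print("This is an LFS pointer; re-download the weights, e.g. with `git lfs pull`.")

If the tokenizer never loads, the eval run produces no predictions, which would also explain the downstream ZeroDivisionError when the recall list is empty.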

Question about the training stages

Hello! Thank you for your work! I noticed that the data used in the two training stages overlaps heavily. Why not drop the first stage and train the second stage directly? Is it so that the model better fits the data for general understanding ability?

Issue on bounding box coordinates

Hi there,

Looking at the annotations, I found that some bbox coordinate values exceed the image size (usually 640*480), e.g.:
'\nWhat are the two people[[200, 251, 447, 963], [529, 246, 744, 984]] doing in the image?\nAnswer the question with scene graph.'

I am wondering whether any extra operation needs to be applied (e.g., normalization).

Cheers!
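Those values look normalized rather than raw pixels. Assuming a 0-1000 normalization (an assumption based on the numbers above, not something confirmed here), they could be mapped back to pixel space like this:

    def denormalize_box(box, width, height, scale=1000):
        """Map a [x1, y1, x2, y2] box from 0-`scale` coordinates to pixels."""
        x1, y1, x2, y2 = box
        return [x1 / scale * width, y1 / scale * height,
                x2 / scale * width, y2 / scale * height]

    print(denormalize_box([200, 251, 447, 963], width=640, height=480))
    # -> [128.0, 120.48, 286.08, 462.24]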

About data preparation

Hi, I have a few uncertainties about preparing the stage-2 dataset for AS-V2 and would appreciate some clarification:

  1. The instructions say ScienceQA needs to be downloaded, but I don't see a corresponding folder in the data directory structure. Is it actually not needed, or which folder should it be placed under?
  2. Two links are provided for ShareGPT4V-100K, but the directory structure seems to use only three items from the first link (web-xxx and wiki). Do the remaining files need to be downloaded?
  3. Does the directory structure of the `sam` folder mean that training only uses sa000000-sa000063, with everything else going under the images folder?
  4. There are also many JSON files on Hugging Face, such as as_mix_4m.json and rec_detailed_description_42k.json. Which directory should these be placed in?

Thanks!

Coordinate issue in AS-V2

In the AS-V2 annotations, could the X1 and X2 of the normalized coordinates interleaved in the text be incorrect?

[Question] Can all-seeing-v2 handle images where the target does not exist for the grounding task?

Referring to grounding_eval.jsonl, I use the following prompt template:

Please provide the bounding box coordinate of the region this sentence describes: {target}

And based on all-seeing-v2/llava/eval/model_vqa_loader.py, I made some modifications to test my dataset:

    parser.add_argument("--model-path", type=str, default="OpenGVLab/ASMv2")
    parser.add_argument("--conv-mode", type=str, default="vicuna_v1")
    parser.add_argument("--temperature", type=float, default=0)

I want to ground the person holding a hose. Some images in my dataset contain a person holding a hose, while others do not, but the model cannot tell them apart: regardless of whether the person in the image is holding a hose, it marks them with a box. Is there any way to make the model mark only the person holding the hose, and output nothing when the target does not exist in the image?

Also, can the model mark multiple targets at once when several objects in the image match the query? For example, when there are multiple people in the image and my target is "person", can the model output multiple boxes?

Thank you very much for your answers.
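On the multiple-boxes question: the annotations shown elsewhere in this repo interleave several boxes in one answer (e.g. [[200, 251, 447, 963], [529, 246, 744, 984]]), so a small helper like the following (illustrative, not part of the repo) can collect every box the model emits, however many there are:

    import re

    BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

    def extract_boxes(answer):
        """Return all [x1, y1, x2, y2] boxes found in a generated answer."""
        return [[int(v) for v in m] for m in BOX_PATTERN.findall(answer)]

    print(extract_boxes("The two people[[200, 251, 447, 963], [529, 246, 744, 984]] are talking."))
    # -> [[200, 251, 447, 963], [529, 246, 744, 984]]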

Special tokens

Nice work! Why didn't all-seeing-v2 add <ref> etc. to the special tokens?
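For context, the question is about registering such markers with the tokenizer; here is a hedged sketch of how that would be done with Hugging Face transformers (not the authors' rationale, and the exact marker names below are assumptions):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/ASMv2", use_fast=False)
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<ref>", "</ref>", "<box>", "</box>"]})
    print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
    # The model's input embeddings would then need to grow to match:
    # model.resize_token_embeddings(len(tokenizer))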

llava module not found

I run into a missing module (llava) when I run your provided scripts, since the llava module is not installed. I tried running pip install -e . in the /llava directory, but it doesn't have a setup.py or pyproject.toml file. I installed the original llava repo and successfully fine-tuned using your provided scripts; however, it looks like your llava code is modified from the original repo.

How do you recommend installing the llava module? Add all-seeing/llava to the Python path, or install the original llava repo?
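One workaround (an assumption about the intended usage, not documented in the repo) is to make the modified package importable by putting the all-seeing-v2 directory on the Python path before running the scripts:

    import sys

    # Hypothetical local checkout path; adjust to where you cloned the repo.
    sys.path.insert(0, "/path/to/all-seeing/all-seeing-v2")

    from llava.model.builder import load_pretrained_model  # should now resolve

Equivalently, the same directory can be exported via the PYTHONPATH environment variable before invoking the provided shell scripts.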

Stage-2 fine-tuning error

I followed the documentation to prepare stage-2 fine-tuning and ran sh scripts_asmv2/stage2-finetune.sh directly, which raised the following error:

ValueError: Looks like distributed multinode run but MASTER_ADDR env not set, please try exporting rank 0's hostname as MASTER_ADDR

I then changed the command to torchrun --master_port=xxxxx, but got a CUDA out-of-memory error (even though I had already set the batch size to 1). The environment is an A100 with DeepSpeed ZeRO-2. What could be going on?
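For the MASTER_ADDR part, a possible (unconfirmed) workaround for single-node runs is to provide a local rendezvous address in the environment before launching the script, e.g.:

    import os
    import subprocess

    # Hypothetical single-node launch: supply a local master address/port so the
    # launcher does not treat this as a multi-node run. The port is a placeholder.
    env = dict(os.environ, MASTER_ADDR="127.0.0.1", MASTER_PORT="29500")
    subprocess.run(["sh", "scripts_asmv2/stage2-finetune.sh"], env=env, check=True)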
