[Feature Support] For Multi-Batch Data Inference Support

luodian commented on August 15, 2024

[Feature Support] For Multi-Batch Data Inference Support

from otter.

Comments (10)

ZhangYuanhan-AI commented on August 15, 2024 3

Yes.

The data can be built like this:

[B,N,T,C,H,W], with B=1, N=1 and T=2 in this scenario.

We also have a sub-task named spot_the_difference, which asks questions about the difference between a pair of images.

May we know what task you want Otter to do? We can also include instruction data for the task you want!

GuanlinLee commented on August 15, 2024 1

After some tests, the maximum batch size is 20 on 4 RTX 3090 GPUs, but the total inference time increases as well. Specifically, generating descriptions for one image on 4 RTX 3090 GPUs takes 8~9 seconds. Generating descriptions for 20 images at one time on the same GPUs takes about 155 seconds, i.e., 7.75 s/image.

The pipeline can be further improved.
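
As a quick check of the arithmetic above, the following sketch recomputes the per-image cost from the timings reported in this comment. The function name `seconds_per_image` is made up for illustration; only the numbers come from the thread.

```python
def seconds_per_image(total_seconds: float, batch_size: int) -> float:
    """Average wall-clock cost per image for one batched generation call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return total_seconds / batch_size


# Timings reported above for 4x RTX 3090.
batched = seconds_per_image(155.0, 20)  # 20 images in one call
single_low, single_high = 8.0, 9.0      # one image per call: 8~9 s

print(f"batched: {batched:.2f} s/image")
# Speed-up relative to running the images one at a time.
print(f"speed-up: {single_low / batched:.2f}x to {single_high / batched:.2f}x")
```

On these numbers, batching 20 images saves only roughly 3% to 14% per image compared with single-image calls, which is why the pipeline still looks improvable.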

ZhangYuanhan-AI commented on August 15, 2024 1

> vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(1).unsqueeze(0)
> model.text_tokenizer.padding_side = "left"
> lang_x = model.text_tokenizer(
>     [
>         "<image><image>User: What is the difference between of these two images? GPT:<answer>"
>     ],
>     return_tensors="pt",
> )
>
> May I ask if it is like this? After I ran it, Otter did not give the expected response, only replied with a single word "lighting."

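# pixel_values should come back as [2, C, H, W]; two unsqueeze(0) calls give
# [B=1, N=1, T=2, C, H, W], i.e. the two images are treated as two frames.
vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(0).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
# Note the single <image> token here, versus two in the attempt above.
lang_x = model.text_tokenizer(
    [
        "<image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)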

Like this.

However, the current Otter does not support the spot_the_difference task; we will upload such a model soon.
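
To see why the order of the unsqueeze calls matters, here is a small sketch that tracks only the shapes. `unsqueeze_shape` is a hypothetical pure-Python helper that mimics what torch.Tensor.unsqueeze does to a shape for non-negative dims. The concrete C, H, W values are placeholders, and the sketch assumes the image processor returns pixel_values of shape [num_images, C, H, W].

```python
from typing import Tuple

Shape = Tuple[int, ...]


def unsqueeze_shape(shape: Shape, dim: int) -> Shape:
    """Return the shape after inserting a size-1 axis at position `dim`."""
    if not 0 <= dim <= len(shape):
        raise IndexError(f"dim {dim} out of range for shape {shape}")
    return shape[:dim] + (1,) + shape[dim:]


# Placeholder sizes: two images, each C x H x W (assumed, not from the thread).
C, H, W = 3, 224, 224
pixel_values: Shape = (2, C, H, W)

# First snippet: .unsqueeze(1).unsqueeze(0) puts the two images on the N axis.
as_in_context = unsqueeze_shape(unsqueeze_shape(pixel_values, 1), 0)

# Second snippet: .unsqueeze(0).unsqueeze(0) puts the two images on the T axis.
as_frames = unsqueeze_shape(unsqueeze_shape(pixel_values, 0), 0)

for name, shape in [("unsqueeze(1).unsqueeze(0)", as_in_context),
                    ("unsqueeze(0).unsqueeze(0)", as_frames)]:
    b, n, t = shape[:3]
    print(f"{name}: shape={shape} -> B={b}, N={n}, T={t}")
```

Under these assumptions the first call order yields B=1, N=2, T=1, while the second yields B=1, N=1, T=2, which is the frame-style layout suggested for spot_the_difference.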

Luodian commented on August 15, 2024 1

Yes, you should listen to @ZhangYuanhan-AI Yuanhan's suggestion 😉

Luodian commented on August 15, 2024

As far as I remember, generating one image's description takes 3-4 s. It's quite weird that multi-image inference is that slow. My guess is that it's caused by Hugging Face's device_map mechanism: the model is sharded across different devices, and the data tensors are copied from device to device during inference. Maybe copying a tensor of 20 images costs more time than copying just 1 image? That's my initial guess.

We are also working on improving training & inference efficiency. We now support xformers for the Otter model. You can check the latest update on the main branch.

GuanlinLee commented on August 15, 2024

Yes, with 256 tokens it costs about 4 seconds. With 512 tokens, the time cost doubles. I forgot to point that out.
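
Taking the two data points above at face value (about 4 s for 256 tokens, twice that for 512), generation time looks roughly linear in the token count. Here is a back-of-the-envelope estimate under that assumption. `estimate_seconds` is a made-up helper, and the linear scaling is an extrapolation from these two points only, not something measured in this thread.

```python
def estimate_seconds(num_tokens: int,
                     ref_tokens: int = 256,
                     ref_seconds: float = 4.0) -> float:
    """Linearly scale a reference timing to a different token budget."""
    if num_tokens < 0 or ref_tokens <= 0:
        raise ValueError("num_tokens must be >= 0 and ref_tokens must be > 0")
    return ref_seconds * num_tokens / ref_tokens


# Reference point from the comment above: 256 tokens take about 4 s.
for tokens in (128, 256, 512):
    print(f"{tokens} tokens: ~{estimate_seconds(tokens):.1f} s")
```

If the earlier 8~9 s measurement used 512 tokens, this would be consistent with the 3-4 s per image quoted for the shorter setting.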

Enderfga commented on August 15, 2024

@Luodian @GuanlinLee May I know whether Otter supports simultaneous input of multiple images, for example inputting two images at once and asking questions about that pair, as distinct from multi-batch inference?

Enderfga commented on August 15, 2024

> Yes.
>
> The data can be built like this:
>
> [B,N,T,C,H,W], with B=1, N=2 and T=1 in this scenario.
>
> We also have a sub-task named spot_the_difference, which asks questions about the difference between a pair of images.
>
> May we know what task you want Otter to do? We can also include instruction data for the task you want!

May I ask whether there is any demo code available for this sub-task in this repository? I couldn't find any in pipeline/demo.

Luodian commented on August 15, 2024

We don't actually have a demo for spot the diff, but our image demo here supports multiple-image input.

Sorry, just a small revision: the data format should be [B,N,T,C,H,W], with B=1, N=1 and T=2 in this spot_the_difference scenario.

In the spot_the_difference scenario, we can treat the two images as two frames of a video and ask "what is the difference between of these two images."

Feel free to ask more about it. Otter's unique advantage is that it supports both in-context inputs and video inputs.

Enderfga commented on August 15, 2024

vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(1).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image><image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

May I ask if it is like this? After I ran it, Otter did not give the expected response; it only replied with a single word, "lighting."
