Comments (10)

ZhangYuanhan-AI commented on August 15, 2024

Yes.

The data can be built like this:

[B, N, T, C, H, W], with B=1, N=1, and T=2 in this scenario.

We also have a sub-task named spot_the_difference, which asks questions about the differences between a pair of images.

May we know what task you want Otter to do? We can also include instruction data for the task you want!

from otter.

GuanlinLee commented on August 15, 2024

After some tests, the maximum batch size is 20 on 4 RTX 3090 GPUs, but the total inference time increases as well. Specifically, generating a description for one image on 4 RTX 3090 GPUs takes 8-9 seconds. Generating descriptions for 20 images in one batch takes about 155 seconds, i.e., 7.75 s/image.
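To put those numbers side by side, here is a small arithmetic sketch. It uses only the figures reported in this comment; the variable names are made up for illustration.

```python
# Figures quoted in the comment above (4x RTX 3090).
single_image_seconds = (8.0, 9.0)  # reported range for a single image
batch_size = 20
batch_total_seconds = 155.0

# Per-image latency when all 20 images go through in one batch.
per_image_batched = batch_total_seconds / batch_size

# Speedup of batching relative to running the images one at a time.
speedup_low = single_image_seconds[0] / per_image_batched
speedup_high = single_image_seconds[1] / per_image_batched

print(f"batched: {per_image_batched:.2f} s/image")
print(f"speedup vs. single-image runs: {speedup_low:.2f}x to {speedup_high:.2f}x")
```

So batching 20 images gives only a few percent to roughly 16% better per-image latency than single-image runs, which is why the pipeline looks like it has room to improve.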

The pipeline can be further improved.

ZhangYuanhan-AI commented on August 15, 2024
vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(1).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image><image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

May I ask if it should look like this? After I ran it, Otter did not give the expected response; it only replied with a single word, "lighting."

vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(0).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

Like this.

However, the current Otter model does not support the spot_the_difference task; we will upload such a model soon.
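The two snippets above differ only in the order of the `unsqueeze` calls, which changes where the two images land in the [B, N, T, C, H, W] layout. The sketch below traces the shapes in plain Python. `unsqueeze_shape` is a hypothetical helper that mimics the shape effect of `torch.Tensor.unsqueeze`, and the 3x224x224 image size is an assumption, not something stated in this thread.

```python
def unsqueeze_shape(shape, dim):
    """Return the shape after inserting a size-1 axis at `dim`.

    Mimics the shape effect of torch.Tensor.unsqueeze for non-negative dim.
    """
    if not 0 <= dim <= len(shape):
        raise ValueError(f"dim {dim} out of range for shape {shape}")
    return shape[:dim] + (1,) + shape[dim:]


# Assumed shape of "pixel_values" for two RGB images (size is illustrative).
pixel_values = (2, 3, 224, 224)

# First snippet: .unsqueeze(1).unsqueeze(0)
two_separate_images = unsqueeze_shape(unsqueeze_shape(pixel_values, 1), 0)

# Second snippet: .unsqueeze(0).unsqueeze(0)
two_frames_one_item = unsqueeze_shape(unsqueeze_shape(pixel_values, 0), 0)

print(two_separate_images)  # (1, 2, 1, 3, 224, 224) -> B=1, N=2, T=1
print(two_frames_one_item)  # (1, 1, 2, 3, 224, 224) -> B=1, N=1, T=2
```

The first layout pairs with two `<image>` tokens in the prompt, the second with a single `<image>` token covering both frames.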

Luodian commented on August 15, 2024

Yes, you should follow @ZhangYuanhan-AI's suggestion 😉

Luodian commented on August 15, 2024

As far as I remember, generating a description takes 3-4 s/image. It's quite odd that multi-image inference is that slow. My guess is that it comes from Hugging Face's device_map mechanism: the model is sharded across devices, and the data tensors are copied from device to device during inference. Maybe copying the tensors for 20 images costs more time than for just 1 image? That's my initial guess.

We are also working on improving training and inference efficiency. We now support xformers for the Otter model. You can check the latest update on the main branch.

GuanlinLee commented on August 15, 2024

Yes, with 256 tokens it takes about 4 seconds. With 512 tokens, the time cost doubles. I forgot to point that out.
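As a rough back-of-the-envelope sketch, the two data points above fit a linear relation between token count and latency. `estimate_seconds` is a hypothetical helper, and extrapolating beyond 256 and 512 tokens is an assumption that has not been measured here.

```python
def estimate_seconds(num_tokens, ref_tokens=256, ref_seconds=4.0):
    """Linear estimate anchored on the 256-token / ~4 s data point above."""
    return ref_seconds * num_tokens / ref_tokens


print(estimate_seconds(256))  # 4.0
print(estimate_seconds(512))  # 8.0
```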

Enderfga commented on August 15, 2024

@Luodian @GuanlinLee Apart from Multi-Batch Data Inference Support, may I know whether Otter supports multiple images as simultaneous input, for example inputting two images at once and asking questions about the pair?

Enderfga commented on August 15, 2024

Yes.

The data can be built like this:

[B, N, T, C, H, W], with B=1, N=2, and T=1 in this scenario.

We also have a sub-task named spot_the_difference, which asks questions about the differences between a pair of images.

May we know what task you want Otter to do? We can also include instruction data for the task you want!

May I ask whether any demo code is available for this sub-task in this repository? I couldn't find any in pipeline/demo.

Luodian commented on August 15, 2024

We don't have a demo for spot_the_difference yet, but our image demo here supports multiple-image input.

Sorry, just a small revision: the data format should be [B, N, T, C, H, W], with B=1, N=1, and T=2 in this spot_the_difference scenario.

In the spot_the_difference scenario, we treat the two images as two frames of a video and ask, "What is the difference between these two images?"
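A minimal sketch of how a prompt could be kept consistent with that layout. It assumes one `<image>` token per entry along the N axis, which is inferred from the snippets in this thread and not confirmed as Otter's rule. `build_prompt` and `check_vision_shape` are hypothetical helpers, not part of the Otter codebase.

```python
def build_prompt(question, num_image_slots=1):
    """Build a prompt with one <image> token per N slot (assumed convention)."""
    if num_image_slots < 1:
        raise ValueError("need at least one image slot")
    return "<image>" * num_image_slots + f"User: {question} GPT:<answer>"


def check_vision_shape(shape, num_image_slots):
    """Check a [B, N, T, C, H, W] shape against the number of <image> tokens."""
    if len(shape) != 6:
        raise ValueError(f"expected 6 dims [B, N, T, C, H, W], got {len(shape)}")
    if shape[1] != num_image_slots:
        raise ValueError(
            f"N={shape[1]} but the prompt has {num_image_slots} <image> token(s)"
        )
    return True


# Two images treated as two frames of one video item: N=1, T=2.
prompt = build_prompt("What is the difference between these two images?")
check_vision_shape((1, 1, 2, 3, 224, 224), num_image_slots=1)
print(prompt)
```

The 3x224x224 image size in the usage lines is only a placeholder.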

Feel free to ask more about it. Otter's unique advantage is that it supports both in-context inputs and video inputs.

Enderfga commented on August 15, 2024
vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(1).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image><image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

May I ask if it should look like this? After I ran it, Otter did not give the expected response; it only replied with a single word, "lighting."
