Comments (5)
Hi, this issue actually contains two parts:
- A bug caused by using low_bit fp16 in ipex_llm_worker.
- Feature request: support ipex_llm_worker with speculative decoding.
The first part has been fixed by PR #10907.
The second part will be supported by @hzjane.
from bigdl.
Just to provide a bit more information, @gc-fu - here the worker passes torch_dtype as "auto", but the fp16 example with self-speculative decoding shows that torch_dtype should be set to torch.float16. There are also other parameters in that example that aren't provided when launching via ipex_llm_worker, specifically "speculative" and "optimize_model". This is why I marked it as a feature request rather than a bug; I assumed this mode simply isn't supported yet for the ipex_llm_worker module (it would be nice if it were, though).
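For reference, a minimal sketch of how that fp16 self-speculative example loads the model, as opposed to what the worker currently passes; the model_path value is a placeholder and the keyword arguments follow the public ipex-llm GPU speculative-decoding examples, so treat this as an assumption rather than the worker's actual behavior:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# Placeholder path; substitute a real checkpoint directory or HF repo id.
model_path = "meta-llama/Llama-2-7b-chat-hf"

# The speculative example sets these explicitly, whereas the worker
# currently only exposes a low-bit setting and leaves torch_dtype as "auto".
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",     # fp16 weights rather than int4
    torch_dtype=torch.float16,  # example uses float16, not "auto"
    optimize_model=True,        # enable ipex-llm optimizations
    speculative=True,           # turn on self-speculative decoding
    trust_remote_code=True,
    use_cache=True,
)
model = model.to("xpu")         # the fp16 speculative path targets Intel GPUs
```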
from bigdl.
@brosenfi
Self-speculative decoding with the FastChat worker will be supported in this PR.
However, the speculative example currently only supports running on Intel Max GPUs due to memory usage limitations. You can try it on a Max GPU or CPU later.
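A rough sketch of the corresponding generation step on a Max GPU, continuing from the loading snippet in the earlier comment (the `model` and `model_path` names are assumed from that sketch; a CPU run would simply keep the tensors on the host instead of moving them to "xpu"):

```python
import torch
from transformers import AutoTokenizer

# Continues from the earlier sketch: `model` and `model_path` are assumed to exist.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "What is speculative decoding?"
with torch.inference_mode():
    # On an Intel Max GPU the tensors live on "xpu"; drop .to("xpu") for a CPU run.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```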
from bigdl.
Thank you @gc-fu
from bigdl.
Hi, I am working on reproducing this issue.
from bigdl.
Related Issues (20)
- phi-3-mini support HOT 1
- IndexError: list index out of range when ipex_fp16_gpu test_api is used in all-in-one HOT 2
- Fastchat serving embeddings? HOT 4
- unable to run inference in linux environment HOT 9
- Performance drop for neural-chat 7b with new repo of ipex-llm(2.5.0b20240425) vllm serving. HOT 18
- 2nd latency of llama3-8B-Instruct with int4 & all-in-one tool issue HOT 1
- Unable to invoke the torch installed via the setup tutorial. HOT 2
- can not find gpu with linux system HOT 4
- MTL 165H ubuntu22.04 can't benchmark qwen/Qwen-7B-Chat HOT 1
- Docker image (intelanalytics/ipex-llm-xpu): Documentation stated I would need to disable iGPU to use A770. When will you fix this issue since disabling iGPU is problematic? HOT 6
- IPEX-LLM on Intel Max Series 1100 for inference libintel-ext-pt-gpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev HOT 7
- Speech T5 on XPU on Intel Arc GPU 770 taking 8 seconds and for CPU it takes 3 seconds ?? HOT 2
- Phi-3 model performance on MeteorLake GPU HOT 2
- Main Memory continued decline with ipex-llm for local LLM inference on Intel Arc GPU. HOT 1
- stable version release requirement for arc GPU
- Crash when using llama.dll HOT 1
- ipex-llm Llama.cpp port inside ipex-llm Docker containers getting SIGBUS HOT 4
- Not able to profile LLAMA2 on iGFX (windows) HOT 1
- failed to run piqa test with sym_int4 precison by harness HOT 6
- [bug] LLAMA3-8B output error HOT 5