Comments (7)
Have you tried adding
"attention_bias": false
to the config.json?
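For reference, the relevant fragment of config.json would look like this (a minimal illustration; a real config contains many more fields, and only the attention_bias line is the addition):

```json
{
  "model_type": "phi3",
  "attention_bias": false
}
```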
I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).
from text-generation-inference.
I encounter this as well. I believe it arises from the recent addition of Granite support after Phi-3 support in TGI 2.0.3. See here.

@OjoDojoJo What's your full command line? I'm running this command on an AWS g6.48xlarge:
docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 2g \
-v /models/:/models/ ghcr.io/huggingface/text-generation-inference:2.0.3 \
--model-id /models/microsoft/Phi-3-medium-128k-instruct/ \
--hostname 0.0.0.0 --trust-remote-code --num-shard 8 \
--max-input-length=9000 --max-total-tokens=9500 \
--max-batch-prefill-tokens=9000
And I'm getting this error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank1]: sys.exit(app())
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
[rank1]: server.serve(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
[rank1]: asyncio.run(
[rank1]: File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank1]: return loop.run_until_complete(main)
[rank1]: File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank1]: return future.result()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
[rank1]: model = get_model(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
[rank1]: return FlashLlama(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
[rank1]: model = FlashLlamaForCausalLM(prefix, config, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
[rank1]: self.model = FlashLlamaModel(prefix, config, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[rank1]: [
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
[rank1]: FlashLlamaLayer(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
[rank1]: self.self_attn = FlashLlamaAttention(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
[rank1]: self.query_key_value = load_attention(config, prefix, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 45, in load_attention
[rank1]: return TensorParallelColumnLinear.load_multi(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
[rank1]: weight = weights.get_multi_weights_col(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
[rank1]: w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
[rank1]: w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
[rank1]: filename, tensor_name = self.get_filename(tensor_name)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
[rank1]: raise RuntimeError(f"weight {tensor_name} does not exist")
[rank1]: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
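For context, this RuntimeError typically means the loader is looking for separate q_proj/k_proj/v_proj tensors while the Phi-3 checkpoint ships a single fused qkv_proj. A rough sketch of what splitting such a fused weight looks like (the row layout, q first, then k, then v, is an assumption about the checkpoint format, and split_fused_qkv is a hypothetical helper, not a TGI API):

```python
import numpy as np

def split_fused_qkv(qkv, num_heads, num_kv_heads, head_dim):
    """Split a fused qkv_proj weight matrix into separate q/k/v matrices.

    Assumes the fused matrix stacks rows as [q; k; v], i.e.
    num_heads*head_dim query rows, then num_kv_heads*head_dim key rows,
    then num_kv_heads*head_dim value rows.
    """
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    q = qkv[:q_rows]
    k = qkv[q_rows:q_rows + kv_rows]
    v = qkv[q_rows + kv_rows:]
    return q, k, v

# Toy example: hidden_size=8, 2 query heads, 2 kv heads, head_dim=4,
# so the fused matrix has (2 + 2*2) * 4 = 24 rows.
qkv = np.arange(24 * 8, dtype=np.float32).reshape(24, 8)
q, k, v = split_fused_qkv(qkv, num_heads=2, num_kv_heads=2, head_dim=4)
print(q.shape, k.shape, v.shape)  # (8, 8) (8, 8) (8, 8)
```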
Have you tried adding
"attention_bias": false
to the config.json?
I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).
Can confirm that this works. There's currently an open PR on HF to fix the issue. In the meantime, you can run the model by directly specifying the revision. Here's my full command:
docker run --gpus all --shm-size 2g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:2.0 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--revision refs/pr/68 \
--trust-remote-code \
--hostname 0.0.0.0
I'm still getting the same issue as @amihalik, even with the attention bias fixed:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
Not sure what causes it; I'm using almost exactly the same docker command.
Still fails for me with TGI 2.0, --trust-remote-code, and "attention_bias": false:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
Same for us. The log tells me:
The argument 'trust_remote_code' is to be used with Auto classes. It has no effect here and is ignored.