
Comments (7)

OjoDojoJo commented on June 25, 2024

Have you tried adding

"attention_bias": false

to the config.json?

I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).
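In case it helps, here is a minimal sketch of that edit done from the shell, assuming jq is installed and using a hypothetical /models/... path (adjust to wherever your volume actually holds the model):

cd /models/microsoft/Phi-3-medium-128k-instruct
# add "attention_bias": false to the model config; write to a temp file, then replace the original
jq '. + {"attention_bias": false}' config.json > config.json.tmp && mv config.json.tmp config.json

Editing config.json by hand works just as well; the only requirement is that the config TGI loads contains the key.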


ulrichkr commented on June 25, 2024

I encounter this as well. I believe it arises from the recent addition of Granite support after Phi-3 support in TGI 2.0.3. See here.


amihalik commented on June 25, 2024

@OjoDojoJo What's your full command line? I'm running this command on an AWS g6.48xlarge:

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 2g \
    -v /models/:/models/ ghcr.io/huggingface/text-generation-inference:2.0.3 \
    --model-id /models/microsoft/Phi-3-medium-128k-instruct/ \
    --hostname 0.0.0.0 --trust-remote-code --num-shard 8 \
    --max-input-length=9000 --max-total-tokens=9500 \
    --max-batch-prefill-tokens=9000

And I'm getting this error:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank1]:     sys.exit(app())
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
[rank1]:     server.serve(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
[rank1]:     asyncio.run(
[rank1]:   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank1]:     return loop.run_until_complete(main)
[rank1]:   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank1]:     return future.result()
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
[rank1]:     model = get_model(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
[rank1]:     return FlashLlama(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
[rank1]:     model = FlashLlamaForCausalLM(prefix, config, weights)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
[rank1]:     self.model = FlashLlamaModel(prefix, config, weights)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[rank1]:     [
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
[rank1]:     FlashLlamaLayer(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
[rank1]:     self.self_attn = FlashLlamaAttention(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
[rank1]:     self.query_key_value = load_attention(config, prefix, weights)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 45, in load_attention
[rank1]:     return TensorParallelColumnLinear.load_multi(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
[rank1]:     weight = weights.get_multi_weights_col(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
[rank1]:     filename, tensor_name = self.get_filename(tensor_name)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
[rank1]:     raise RuntimeError(f"weight {tensor_name} does not exist")
[rank1]: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist


dcbark01 commented on June 25, 2024

> Have you tried adding
>
> "attention_bias": false
>
> to the config.json?
>
> I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).

Can confirm that this works. There's currently an open PR on HF to fix the issue. In the meantime, you can run the model by directly specifying the revision. Here's my full command:

docker run --gpus all --shm-size 2g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:2.0 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--revision refs/pr/68 \
--trust-remote-code \
-p 8080 \
--hostname 0.0.0.0
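Once it's up, a quick smoke test against TGI's /generate endpoint (assuming the launcher ends up listening on the container port mapped by -p 8080:80 above):

curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'

A JSON body with generated_text means the weights loaded; the q_proj.weight RuntimeError above happens at startup, before the server answers anything.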


stefanobranco commented on June 25, 2024

I'm still getting the same issue as @amihalik, even with the attention bias fixed:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

Not sure what causes it; I'm using pretty much the exact same docker commands.
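One thing worth checking here is whether the config TGI actually reads contains the flag. A quick sketch with hypothetical paths (adjust to your volume layout):

# model stored directly in a mounted volume
grep attention_bias /models/microsoft/Phi-3-medium-128k-instruct/config.json

# model downloaded by TGI itself into the mounted Hugging Face cache
grep -r attention_bias data/models--microsoft--Phi-3-mini-128k-instruct/snapshots/*/config.json

If neither prints "attention_bias": false, the server is still reading an unpatched config.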


xfalcox commented on June 25, 2024

Still fails for me with TGI 2.0, --trust-remote-code, and attention_bias set to false:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist


pranavthombare commented on June 25, 2024

It's the same for us. It tells me:
The argument 'trust_remote_code' is to be used with Auto classes. It has no effect here and is ignored.

