Comments (20)
Is there a mismatch between the tokenizer version used to train the weights and the version used for loading? I am not sure if this is a problem with my weights.
I get this:

Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-04-21T06:06:48.861440Z INFO text_generation_router: router/src/main.rs:471: Serving revision 561487d18c41c76bcb5fc6cfb73a324982f04f47 of model meta-llama/Meta-Llama-3-8B
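A quick way to check whether the tokenizer files on disk actually define that reserved token (a minimal sketch, assuming the transformers library and access to the gated repo; the model ID matches the log above):

from transformers import AutoTokenizer

# Load the same tokenizer TGI serves; a stale cache or an older
# tokenizer.json can leave the reserved special tokens undefined.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# The warning means this lookup returns None instead of 128255.
print(tok.convert_tokens_to_ids("<|reserved_special_token_250|>"))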
I tried:
prompt = """<|begin_of_text|> <|start_header_id|>system<|end_header_id|> You are a helpful assistant, providing informative and friendly answers to the user. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hello! Can you tell me how tall the Eiffel Tower is? <|eot_id|> <|start_header_id|>assistant<|end_header_id|> The Eiffel Tower is 324 meters tall and is an iconic landmark of Paris. It was built in 1889 and was once the tallest man-made structure in the world. Now, it is one of the most popular tourist attractions in France. The tower is named after its designer, Gustave Eiffel. It was originally constructed for the 1889 Paris World's Fair, showcasing the architectural capabilities of the late 19th century. <|eot_id|> <|start_header_id|>user<|end_header_id|> How many visitors does the Eiffel Tower typically receive in a day? Do I need to book tickets in advance? <|eot_id|> <|start_header_id|>assistant<|end_header_id|>"""
The response is:
{'generated_text': "\nThe Eiffel Tower receives around 7 million visitors annually. While you don't need to book tickets in advance, I recommend booking them online to avoid long lines and to guarantee your spot. You can find more information about visiting the Eiffel Tower and booking tickets here: https://www.eiffeltower.paris/en/. If you have any other questions, feel free to ask!\nspNetesModuleGeneratedNetTitle: spnet\nuser pip install spnet\nूडुங투"}
Which is a bit weird.
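One possible source of the trailing garbage: the hand-assembled prompt inserts spaces between the special tokens, which the reference template does not. A minimal sketch (assuming the transformers library; the Instruct repo is used here because it ships the chat template) that builds the same prompt from a message list instead:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant, providing informative and friendly answers to the user."},
    {"role": "user", "content": "Hello! Can you tell me how tall the Eiffel Tower is?"},
]

# add_generation_prompt appends the final assistant header,
# so the model continues as the assistant.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)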
Add the stop parameter; it works for me:

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"],
    },
}
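For completeness, a minimal sketch of sending that payload to a local TGI instance (the host and port are assumptions):

import requests

# 'data' and 'prompt' are as defined above; /generate returns one JSON response.
resp = requests.post("http://localhost:8080/generate", json=data)
print(resp.json()["generated_text"])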
Yes, llama3 has 2 eos tokens: eot_id as the turn token, and a "real" eos_token (not sure when it is used).
Currently the config defines <eos_token> as the eos token, which is what you're seeing here.
This is what was intended by the Meta team when we received it; we're looking to update the config for those instruct models.
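For anyone generating with plain transformers rather than TGI, both terminators can be passed explicitly. A sketch (the token names come from the Llama 3 tokenizer; the final generate call is illustrative only):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 has two stop tokens: the configured eos_token and the turn delimiter.
terminators = [
    tok.eos_token_id,                         # the "real" eos_token
    tok.convert_tokens_to_ids("<|eot_id|>"),  # end-of-turn token
]

# then e.g.: model.generate(input_ids, eos_token_id=terminators, ...)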
Yes it is. And hf-chat sends that stop token currently.
Same here. It just keeps generating until it gets to its max-gen-limit.
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
Thank you so much, @hooman-bayer! I'm using the v2.0.1 docker image and I was struggling with the model (70b-instruct), as it kept generating nonsense when presence_penalty and frequency_penalty were set to 0 (and it also looked like the stop tokens were not recognized either). As soon as I set these parameters to null in the request body, it started working as expected! The model now delivers outputs that are exactly in line with what I see on Hugging Face's chat. I do wonder, though, why it helped. Is it because this forces the inference pipeline to skip the logits penalty modifications completely?
Anyway, thanks again for the great insight!
Just tested llama3-8b on 2.0.2; it looks like this issue has been fixed.
#1808
I am facing the same issue as @shuaills
It does not work with TGI v1.4 or v2.0.1 either.
+1
Does huggingface still use this image to serve their production models? Is it used by the llama3-70b chat that is currently deployed on https://huggingface.co/chat/ ?
@Narsil, what version of TGI do you recommend running Llama-3 models on? We noticed 2.0.1 seems to be a bit slow; maybe you recommend an earlier version? We did not investigate much, though.
I got this working by building the image from this blog post:
https://lavaraja-padala.medium.com/deploy-google-gemma-2b-and-gemma-7b-models-on-aws-sagemaker-f441914ccc6f
TGI 2.1.1 seems to be what is being used internally. Oddly, HF endpoints seem to be using 2.0.1.
Slow? What do you mean? What hardware, what TP? What is slow in this case?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
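In raw-HTTP terms, the working request body looks roughly like this (a sketch; the host and sampling values are placeholders, and requests serializes None to JSON null):

import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.9,
    # Leave the penalties as null; defaulting them to 0 is what broke
    # stop-token handling here.
    "presence_penalty": None,
    "frequency_penalty": None,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])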
The frequency penalty is being fixed soon: #1765.
For the stop token, yes, it's an unfortunate setup; we're working on changing the default in many places (basically there are 2 stop tokens...).
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
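That switch matters because v1/chat/completions applies the model's chat template server-side, while generate treats the input as a raw string. A sketch of both calls against a local instance (the host is an assumption):

import requests

base = "http://localhost:8080"  # local TGI instance (assumption)

# /generate: the Llama 3 prompt string and stop sequences must be supplied by hand.
raw_prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
              "Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
r1 = requests.post(f"{base}/generate", json={
    "inputs": raw_prompt,
    "parameters": {"max_new_tokens": 256, "stop": ["<|eot_id|>"]},
})
print(r1.json()["generated_text"])

# /v1/chat/completions: TGI applies the chat template itself.
r2 = requests.post(f"{base}/v1/chat/completions", json={
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
})
print(r2.json()["choices"][0]["message"]["content"])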
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
Would you be able to post your settings and example call? I am unable to get llama3 to stop no matter what I try.
Add the stop parameter; it works for me:

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"],
    },
}
Great find, thanks for sharing this. This works for me when I include it in the extra_body dictionary when using the OpenAI chat completions API with a text-generation-inference endpoint.
I am hoping that Hugging Face could update their documentation, though; some documents seem to be out of date or out of sync with the OpenAPI spec. This parameter is documented in the OpenAPI spec here: https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/generate but it was tough to find before I came across this solution. The documentation that appears much more frequently when searching for solutions to this problem is https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task, which does not contain all of the parameters listed in the OpenAPI spec.
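For reference, a sketch of that extra_body usage with the openai client pointed at a TGI endpoint (the base URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

resp = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=256,
    # extra_body merges additional fields into the JSON request body,
    # so TGI receives 'stop' even though it isn't an OpenAI-native argument here.
    extra_body={"stop": ["<|eot_id|>", "<|start_header_id|>"]},
)
print(resp.choices[0].message.content)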