Comments (20)

shuaills commented on May 27, 2024

Is there a mismatch between the tokenizer version used to train the weights and the version used for loading?
I am not sure whether this is a problem with my weights.
I get this:
Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-04-21T06:06:48.861440Z INFO text_generation_router: router/src/main.rs:471: Serving revision 561487d18c41c76bcb5fc6cfb73a324982f04f47 of model meta-llama/Meta-Llama-3-8B

I tried this prompt:
prompt = """<|begin_of_text|> <|start_header_id|>system<|end_header_id|> You are a helpful assistant, providing informative and friendly answers to the user. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hello! Can you tell me how tall the Eiffel Tower is? <|eot_id|> <|start_header_id|>assistant<|end_header_id|> The Eiffel Tower is 324 meters tall and is an iconic landmark of Paris. It was built in 1889 and was once the tallest man-made structure in the world. Now, it is one of the most popular tourist attractions in France. The tower is named after its designer, Gustave Eiffel. It was originally constructed for the 1889 Paris World's Fair, showcasing the architectural capabilities of the late 19th century. <|eot_id|> <|start_header_id|>user<|end_header_id|> How many visitors does the Eiffel Tower typically receive in a day? Do I need to book tickets in advance? <|eot_id|> <|start_header_id|>assistant<|end_header_id|>"""
The response is:
{'generated_text': "\nThe Eiffel Tower receives around 7 million visitors annually. While you don't need to book tickets in advance, I recommend booking them online to avoid long lines and to guarantee your spot. You can find more information about visiting the Eiffel Tower and booking tickets here: https://www.eiffeltower.paris/en/. If you have any other questions, feel free to ask!\nspNetesModuleGeneratedNetTitle: spnet\nuser pip install spnet\nूडुங투"}
Which is a bit weird.
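
One thing worth checking is the hand-written prompt: the extra spaces around the special tokens are not part of the official Llama 3 chat format. As a sketch, you can let the tokenizer build the prompt from its chat template instead (the model id and messages here are just examples):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How tall is the Eiffel Tower?"},
]

# add_generation_prompt=True appends the final assistant header so the
# model continues the conversation as the assistant.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)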

waderwu commented on May 27, 2024

Adding the stop parameter works for me:

data = {
    'inputs': prompt,
    'parameters' : {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
    }
}
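
For completeness, a minimal sketch of sending this payload to a TGI generate endpoint with requests (the URL is an assumption about your local deployment):

import requests

# Reusing the `data` payload above; adjust host/port to your deployment.
response = requests.post("http://localhost:8080/generate", json=data)
print(response.json()["generated_text"])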

Narsil commented on May 27, 2024

Yes, Llama 3 has two EOS tokens: <|eot_id|> as the end-of-turn token, and the "real" eos_token (not sure when that one is used).

Currently the config defines <eos_token> as the EOS token, which is what you're seeing here.

This is how the Meta team intended it when we received the model; we're looking to update the config for those instruct models.
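
For reference, a quick way to inspect both tokens with transformers (a sketch; it requires access to the gated meta-llama repo, and the exact values depend on the config revision you have):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(tokenizer.eos_token)                            # the configured eos_token
print(tokenizer.convert_tokens_to_ids("<|eot_id|>"))  # the end-of-turn token id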

Narsil commented on May 27, 2024

Yes it is. And hf-chat sends that stop token currently.

oroojlooy commented on May 27, 2024

Same here. It just keeps generating until it hits its max generation limit.

waderwu commented on May 27, 2024

> Yes it is. And hf-chat sends that stop token currently.

Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.

Could you please tell me the deployment command for hf-chat?

Vitaliy-Firebird commented on May 27, 2024

> Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
>
> Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
>
> If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.

Thank you so much, @hooman-bayer! I'm using the v2.0.1 docker image, and I was struggling with the model (70b-instruct) as it kept generating nonsense when presence_penalty and frequency_penalty were set to 0 (and it also looked like the stop tokens were not recognized). As soon as I set these parameters to null in the request body, it started working as expected! The model now delivers outputs exactly in line with what I see on Hugging Face's chat. I do wonder, though, why that helped. Is it because it forces the inference pipeline to skip the logits penalty modifications completely?

Anyway, thanks again for the great insight!

jtsai-quid commented on May 27, 2024

Just tested llama3-8b on 2.0.2; it looks like this issue has been fixed.
#1808

n-imas commented on May 27, 2024

I am facing the same issue as @shuaills

axenov commented on May 27, 2024

It does not work with TGI v1.4 or v2.0.1 either.

arunchandra23 commented on May 27, 2024

+1

sa- commented on May 27, 2024

Does huggingface still use this image to serve their production models? Is it used by the llama3-70b chat that is currently deployed on https://huggingface.co/chat/ ?

hooman-bayer commented on May 27, 2024

@Narsil, what version of TGI do you recommend for running Llama-3 models? We noticed 2.0.1 seems to be a bit slow; maybe you recommend an earlier version? We did not investigate much, though.

huwprosser commented on May 27, 2024

I got this working by building the image from this blog post:
https://lavaraja-padala.medium.com/deploy-google-gemma-2b-and-gemma-7b-models-on-aws-sagemaker-f441914ccc6f

TGI 2.1.1 seems to be what is being used internally. Weirdly, HF endpoints seem to be using 2.0.1.

Narsil commented on May 27, 2024

Slow? What do you mean? What hardware, what TP? What is slow in this case?

hooman-bayer commented on May 27, 2024

Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.

Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.

If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
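A minimal sketch of a request following that advice (the endpoint URL is an assumption about a local TGI 2.0.1 deployment); sending null instead of 0 plausibly lets the server skip the penalty logits processors entirely:

import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "top_p": 0.95,
    # None serializes to JSON null; do NOT default these to 0 here.
    "presence_penalty": None,
    "frequency_penalty": None,
}
response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])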

Narsil commented on May 27, 2024

The frequency penalty issue is being fixed soon: #1765.

For the stop token, yes, it's an unfortunate setup; we're working on changing the default in many places (basically there are two stop tokens...).

waderwu commented on May 27, 2024

> Yes it is. And hf-chat sends that stop token currently.
>
> Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
>
> Could you please tell me the deployment command for hf-chat?

Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
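
For anyone hitting the same thing: the chat route builds the prompt from the model's chat template server-side, which avoids hand-rolled special tokens. A sketch with huggingface_hub (the base URL is an assumed local deployment):

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# chat_completion hits /v1/chat/completions, which formats the prompt
# with the model's chat template for you.
output = client.chat_completion(
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=256,
)
print(output.choices[0].message.content)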

mjsteele12 commented on May 27, 2024

> Yes it is. And hf-chat sends that stop token currently.
>
> Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
>
> Could you please tell me the deployment command for hf-chat?
>
> Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.

Would you be able to post your settings and example call? I am unable to get llama3 to stop no matter what I try.

jatkinson-CRL commented on May 27, 2024

> Adding the stop parameter works for me:
>
> data = {
>     'inputs': prompt,
>     'parameters' : {
>         'max_new_tokens': 1024,
>         'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
>     }
> }

Great find, thanks for sharing this. It works for me when I include it in the extra_body dictionary when using the OpenAI chat completions API with a text-generation-inference endpoint.

I hope Hugging Face updates the documentation, though; some pages seem out of date or out of sync with the OpenAPI spec. The stop parameter is documented in the OpenAPI spec here: https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/generate, but it was tough to find before I came across this solution. The page that shows up much more often when searching for solutions to this problem is https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task, which does not list all of the parameters in the OpenAPI spec.
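
A sketch of that extra_body approach with the openai client (base_url and model name are assumptions about a local TGI deployment):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; the name here is effectively ignored
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=256,
    # extra_body forwards fields beyond the OpenAI schema to the server.
    extra_body={
        "stop": ["<|start_header_id|>", "<|end_header_id|>",
                 "<|eot_id|>", "<|reserved_special_token"]
    },
)
print(response.choices[0].message.content)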
