Comments (20)
Is there a mismatch between the tokenizer version used to train the weights and the version used for loading? I am not sure if this is a problem with my weights.
I get this:

Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-04-21T06:06:48.861440Z INFO text_generation_router: router/src/main.rs:471: Serving revision 561487d18c41c76bcb5fc6cfb73a324982f04f47 of model meta-llama/Meta-Llama-3-8B
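A quick way to check whether the tokenizer files on disk actually define that reserved token (a minimal sketch, assuming the transformers library and access to the gated repo; the model ID matches the log above):

from transformers import AutoTokenizer

# Load the same tokenizer TGI serves; a stale cache or an older
# tokenizer.json can leave the reserved special tokens undefined.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# The warning means this lookup returns None instead of 128255.
print(tok.convert_tokens_to_ids("<|reserved_special_token_250|>"))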
I tried:
prompt = """<|begin_of_text|> <|start_header_id|>system<|end_header_id|> You are a helpful assistant, providing informative and friendly answers to the user. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hello! Can you tell me how tall the Eiffel Tower is? <|eot_id|> <|start_header_id|>assistant<|end_header_id|> The Eiffel Tower is 324 meters tall and is an iconic landmark of Paris. It was built in 1889 and was once the tallest man-made structure in the world. Now, it is one of the most popular tourist attractions in France. The tower is named after its designer, Gustave Eiffel. It was originally constructed for the 1889 Paris World's Fair, showcasing the architectural capabilities of the late 19th century. <|eot_id|> <|start_header_id|>user<|end_header_id|> How many visitors does the Eiffel Tower typically receive in a day? Do I need to book tickets in advance? <|eot_id|> <|start_header_id|>assistant<|end_header_id|>"""
The response is:
{'generated_text': "\nThe Eiffel Tower receives around 7 million visitors annually. While you don't need to book tickets in advance, I recommend booking them online to avoid long lines and to guarantee your spot. You can find more information about visiting the Eiffel Tower and booking tickets here: https://www.eiffeltower.paris/en/. If you have any other questions, feel free to ask!\nspNetesModuleGeneratedNetTitle: spnet\nuser pip install spnet\nूडुங투"}
Which is a bit weird.
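One possible source of the trailing garbage: the hand-assembled prompt inserts spaces between the special tokens, which the reference template does not. A minimal sketch (assuming the transformers library; the Instruct repo is used here because it ships the chat template) that builds the same prompt from a message list instead:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant, providing informative and friendly answers to the user."},
    {"role": "user", "content": "Hello! Can you tell me how tall the Eiffel Tower is?"},
]

# add_generation_prompt appends the final assistant header,
# so the model continues as the assistant.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)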
Add the stop parameter; it works for me:

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"],
    },
}
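For completeness, a minimal sketch of sending that payload to a local TGI instance (the host and port are assumptions):

import requests

# 'data' and 'prompt' are as defined above; /generate returns one JSON response.
resp = requests.post("http://localhost:8080/generate", json=data)
print(resp.json()["generated_text"])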
Yes, llama3 has 2 eos tokens: eot_id as the turn token, and a "real" eos_token (not sure when it is used).
Currently the config defines <eos_token> as the eos token, which is what you're seeing here.
This is what was intended by the Meta team when we received it; we're looking to update the config for those instruct models.
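For anyone generating with plain transformers rather than TGI, both terminators can be passed explicitly. A sketch (the token names come from the Llama 3 tokenizer; the final generate call is illustrative only):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 has two stop tokens: the configured eos_token and the turn delimiter.
terminators = [
    tok.eos_token_id,                         # the "real" eos_token
    tok.convert_tokens_to_ids("<|eot_id|>"),  # end-of-turn token
]

# then e.g.: model.generate(input_ids, eos_token_id=terminators, ...)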
Yes it is. And hf-chat sends that stop token currently.
Same here. It just keeps generating until it gets to its max-gen-limit.
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
Thank you so much, @hooman-bayer! I'm using the v2.0.1 docker image and I was struggling with the model (70b-instruct), as it kept generating nonsense when presence_penalty and frequency_penalty were set to 0 (and it also looked like the stop tokens were not recognized either). As soon as I set these parameters to null in the request body, it started working as expected! The model now delivers outputs that are exactly in line with what I see on Hugging Face's chat. I do wonder, though, why it helped. Is it because this forces the inference pipeline to skip the logits penalty modifications completely?
Anyway, thanks again for the great insight!
Just tested llama3-8b on 2.0.2; it looks like this issue has been fixed.
#1808
I am facing the same issue as @shuaills
It does not work with TGI v1.4 or v2.0.1 either.
+1
Does huggingface still use this image to serve their production models? Is it used by the llama3-70b chat that is currently deployed on https://huggingface.co/chat/ ?
@Narsil, what version of TGI do you recommend running Llama-3 models on? We noticed 2.0.1 seems to be a bit slow; maybe you recommend an earlier version? We did not investigate much, though.
I got this working by building the image from this blog post:
https://lavaraja-padala.medium.com/deploy-google-gemma-2b-and-gemma-7b-models-on-aws-sagemaker-f441914ccc6f
TGI 2.1.1 seems to be what is being used internally. Oddly, HF endpoints seem to be using 2.0.1.
Slow? What do you mean? What hardware, what TP? What is slow in this case?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
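In raw-HTTP terms, the working request body looks roughly like this (a sketch; the host and sampling values are placeholders, and requests serializes None to JSON null):

import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.9,
    # Leave the penalties as null; defaulting them to 0 is what broke
    # stop-token handling here.
    "presence_penalty": None,
    "frequency_penalty": None,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])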
The frequency penalty is being fixed soon: #1765.
For the stop token, yes, it's an unfortunate setup; we're working on changing the default in many places (basically there are 2 stop tokens...).
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
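That switch matters because v1/chat/completions applies the model's chat template server-side, while generate treats the input as a raw string. A sketch of both calls against a local instance (the host is an assumption):

import requests

base = "http://localhost:8080"  # local TGI instance (assumption)

# /generate: the Llama 3 prompt string and stop sequences must be supplied by hand.
raw_prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
              "Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
r1 = requests.post(f"{base}/generate", json={
    "inputs": raw_prompt,
    "parameters": {"max_new_tokens": 256, "stop": ["<|eot_id|>"]},
})
print(r1.json()["generated_text"])

# /v1/chat/completions: TGI applies the chat template itself.
r2 = requests.post(f"{base}/v1/chat/completions", json={
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
})
print(r2.json()["choices"][0]["message"]["content"])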
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
Would you be able to post your settings and example call? I am unable to get llama3 to stop no matter what I try.
Add the stop parameter; it works for me:

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"],
    },
}
Great find, thanks for sharing this. This works for me when I include it in the extra_body dictionary when using the OpenAI chat completions API with a text-generation-inference endpoint.
I am hoping that Hugging Face could update their documentation, though; some documents seem to be out of date or out of sync with the OpenAPI spec. This parameter is documented in the OpenAPI spec here: https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/generate but it was tough to find before I came across this solution. The documentation that appears much more frequently when searching for solutions to this problem is https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task, which does not contain all of the parameters listed in the OpenAPI spec.
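For reference, a sketch of that extra_body usage with the openai client pointed at a TGI endpoint (the base URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

resp = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=256,
    # extra_body merges additional fields into the JSON request body,
    # so TGI receives 'stop' even though it isn't an OpenAI-native argument here.
    extra_body={"stop": ["<|eot_id|>", "<|start_header_id|>"]},
)
print(resp.choices[0].message.content)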