modeltc / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

License: Apache License 2.0

Languages: Python 99.20%, Shell 0.57%, Dockerfile 0.23%
Topics: deep-learning, gpt, llama, llm, model-serving, nlp, openai-triton
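
For context on how the server discussed in the issues below is typically exercised, here is a minimal client sketch against the /generate endpoint. It assumes a server already launched locally; the launch command and payload shape are taken from the issues on this page, and the prompt text is only an example.

# Minimal sketch of querying a running LightLLM server on localhost:8080.
# Assumes the server was launched as in the issues below, e.g.:
#   python -m lightllm.server.api_server --model_dir /path/to/model --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120000
import json
import requests

url = "http://localhost:8080/generate"
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Please tell me about frogs",
    "parameters": {"do_sample": False, "ignore_eos": False, "max_new_tokens": 128},
}
print(requests.post(url, headers=headers, data=json.dumps(data)).json())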

lightllm's People

Contributors

andy-yang-1, bingo787, chielonewctle, flyinglandlord, fuheaven, hamelsmu, helloyongyang, hiworldwzj, huochaitiantang, jingofxin, llehtahw, senbeiasano, shanshanpt, shihaobai, singularity-s0, tmsagarofficial, tracin, uranusseven, wandy666, wusiyu, wxd000000, xfplus, xhplus, yunfeng-scale, zeyugao


lightllm's Issues

run error with baichuan-7B

my env:
OS=centos 7
GPU=1 * 3090 24G
python=3.10.12
baichuan-7B=https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main
torch=1.13.0+cu117
cuda=cuda-11.8
triton=2.0.0.dev20221202
The server starts, but calling it with curl produces an error:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /data/wangjie/code/github/lightllm/lightllm/server/router/manager.py:83> exception=TypeError("mm(): argument 'mat2' (position 2) must be Tensor, not NoneType")>
Traceback (most recent call last):
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 86, in loop_for_fwd
    await self._step()
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 105, in _step
    await self._prefill_batch(self.running_batch)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 138, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 195, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/data/wangjie/code/github/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/data/wangjie/tools/miniconda3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/model.py", line 113, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/model.py", line 151, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/data/wangjie/code/github/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 66, in _context_flash_attention
    q = torch.mm(input1.view(-1, self.embed_dim_), layer_weight.q_weight_)
TypeError: mm(): argument 'mat2' (position 2) must be Tensor, not NoneType

Looking forward to your reply. Thanks.

Run error

triton/compiler/compiler.py", line 18, in
from ..runtime.autotuner import OutOfResources
ImportError: cannot import name 'OutOfResources' from partially initialized module 'triton.runtime.autotuner' (most likely due to a circular import) (

Triton Error [CUDA]: invalid argument

Issue description:

Got a CUDA error when sending a request to the server.

Steps to reproduce:

python -m lightllm.server.api_server --model_dir ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120

And sending a request using:

import json
import requests

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': PROMPT,                      # PROMPT, args and max_new_tokens come from the surrounding script
    'parameters': {
        'do_sample': args.do_sample,
        'ignore_eos': False,
        'max_new_tokens': max_new_tokens,
    },
}
generated_text = requests.post(url, headers=headers, data=json.dumps(data)).json()

Error logging:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/jina/jemfu/lightllm/lightllm/server/router/manager.py:88> exception=RuntimeError('Triton Error [CUDA]: invalid argument')>
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/home/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 201, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 77, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 128, in forward
    logits = self.model.forward(**kwargs)
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 116, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 154, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home//lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 76, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jemfu/lightllm/lightllm/models/llama/triton_kernel/context_flashattention_nopad.py", line 224, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Environment:


  • Using container

  • OS: Ubuntu

  • GPU info:

    • NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6
    • RTX TITAN
  • Python: 3.10

  • LightLLM: I used git clone and pip install -e .

  • openai-triton: 2.0.0.dev20221202

Stream output

Hi,

This project is awesome! May I ask when lightllm will support stream output? Thanks
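
For reference, once streaming is exposed a client could consume it roughly as sketched below; the endpoint name and the line-per-chunk framing are assumptions, not the project's actual API.

# Hypothetical streaming client sketch. lightllm did not provide a streaming
# endpoint at the time of this issue; /generate_stream and the chunk framing
# below are assumptions for illustration only.
import json
import requests

payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}
with requests.post(
    "http://localhost:8080/generate_stream",        # assumed endpoint name
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))              # assumed: one text chunk per line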

[feature request] add prompt styles support

There are many different styles of prompts for different LLMs, such as the openai/llama2 chat style (especially with a SYSTEM role prompt), the pure text style, ziya, etc.
From the parameters of api_server.py, it seems only the pure text style is supported.
It would be a good feature to support these styles; a client-side sketch follows below.
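
A rough client-side workaround is to apply the template before calling the plain-text /generate endpoint. The [INST]/<<SYS>> wrapper below is the standard llama-2 chat format (it also appears in a request elsewhere on this page); the helper function is hypothetical and not something api_server.py provides.

# Client-side prompt templating sketch (llama-2 chat style). The helper below is
# hypothetical; api_server.py itself only accepts plain text in "inputs".
def build_llama2_prompt(system: str, user: str) -> str:
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]\n"

prompt = build_llama2_prompt("You are a helpful AI assistant.", "Please tell me about frogs")
# Send `prompt` as the "inputs" field of a /generate request.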

lightllm vs vllm

Hi,

I am benchmarking lightllm against vllm, and it seems that vllm achieves better 'token/ms' results for llama 30b.

Here are the parameters for lightllm and vllm server:
[screenshots of the lightllm and vllm server launch parameters]

Error when calling the server

/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
collect2: error: ld returned 1 exit status
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/house365ai/xxm/lightllm/lightllm/server/router/manager.py:84> exception=CalledProcessError(1, ['/usr/bin/gcc', '/tmp/tmpchhdqwt0/main.c', '-O3', '-I/usr/local/cuda-11.8/include', '-I/home/house365ai/.conda/envs/lightllm/include/python3.10', '-I/tmp/tmpchhdqwt0', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpchhdqwt0/_rms_norm_fwd_fused.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib32'])>
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/compiler.py", line 1239, in compile
so = _build(fn.name, src_path, tmpdir)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/compiler.py", line 1169, in _build
ret = subprocess.check_call(cc_cmd)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpchhdqwt0/main.c', '-O3', '-I/usr/local/cuda-11.8/include', '-I/home/house365ai/.conda/envs/lightllm/include/python3.10', '-I/tmp/tmpchhdqwt0', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpchhdqwt0/_rms_norm_fwd_fused.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib32']' returned non-zero exit status 1.

RuntimeError("Distributed package doesn't have NCCL " "built in")

I follow the instruction:

  1. setup env:
    docker build -t image_name .
    docker run -it --gpus all -p 8080:80 -v your_local_path:/data/ image_name /bin/bash

  2. run model:
    python -m lightllm.server.api_server --model_dir /path/llama-7B --tp 1 --max_total_token_num 120000

then I hit the following exception:

Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:28765 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:28765 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:28765 (errno: 97 - Address family not supported by protocol).
Process Process-1:
Traceback (most recent call last):
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 243, in start_router_process
asyncio.run(router.wait_to_model_ready())
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 57, in wait_to_model_ready
await asyncio.gather(*init_model_ret)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/model_infer/model_rpc.py", line 185, in init_model
ans : rpyc.AsyncResult = self._init_model(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/model_infer/model_rpc.py", line 36, in exposed_init_model
dist.init_process_group('nccl', init_method=f'tcp://127.0.0.1:{setting["nccl_port"]}', rank=rank_id, world_size=world_size)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 998, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 246, in start_router_process
err_str = '\n'.join(traceback.format_exception(e))
TypeError: format_exception() missing 2 required positional arguments: 'value' and 'tb'
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
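
The failure originates in dist.init_process_group('nccl', ...) inside exposed_init_model, so a PyTorch build without NCCL support (for example a CPU-only wheel) reproduces it. A quick sanity check, independent of lightllm:

# Sanity-check sketch: confirm the installed PyTorch build ships NCCL before the
# router process calls dist.init_process_group('nccl', ...).
import torch
import torch.distributed as dist

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())   # False matches this error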

Llama2 and llama-30b do not work

  • Only the llama-7b model runs; llama2-13b, llama2-70b and llama-30b all fail.
  • Using an A100 GPU; I followed the instructions about the Triton dependencies.
Task exception was never retrieved
future: <Task finished name='Task-12' coro=<RouterManager.loop_for_fwd() done, defined at /home/  /lightllm/lightllm/server/router/manager.py:84> exception='cef3986efedb4d1a966c56365091ddde'

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/  /lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
    batch: InferBatch = self.cache.pop(batch_id)
KeyError: 'cef3986efedb4d1a966c56365091ddde'
>
Traceback (most recent call last):
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
    await self._step()
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 106, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 185, in prefill_batch
    return ans.value
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/async_.py", line 108, in value
    raise self._obj
_get_exception_class.<locals>.Derived: 'cef3986efedb4d1a966c56365091ddde'

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/  /lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
    batch: InferBatch = self.cache.pop(batch_id)
KeyError: 'cef3986efedb4d1a966c56365091ddde'

My setup: A800 80G * 8

How can this problem be solved?

self.value_buffer = [torch.empty((size, head_num, head_dim), dtype=dtype, device="cuda") for _ in range(layer_num)]

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 79.35 GiB total capacity; 77.83 GiB already allocated; 711.19 MiB free; 77.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
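
The value_buffer in the line above is sized by the server's token-capacity setting (presumably --max_total_token_num), so this OOM usually means that setting is too large for whatever memory is left after the weights are loaded. A back-of-envelope sketch; all shapes below are assumptions, including the matching key buffer that doubles the total:

# Rough KV-cache sizing sketch, following the (size, head_num, head_dim) fp16
# allocation above; doubled for an assumed matching key buffer.
def kv_cache_gib(max_total_token_num, head_num, head_dim, layer_num, bytes_per_elem=2):
    per_layer = max_total_token_num * head_num * head_dim * bytes_per_elem
    return 2 * layer_num * per_layer / 1024 ** 3

# Hypothetical llama-2-70b-like shapes (8 KV heads per GPU after tensor parallelism):
print(kv_cache_gib(120000, head_num=8, head_dim=128, layer_num=80))   # ~36.6 GiB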

[BUG] llama2-70B: the service fails to start after loading completes

Machine: A100 80G * 3
Launch command: python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 2 --max_total_token_num 120000 --max_req_input_len 3000 --max_req_total_len 4096
CUDA_V 11.8
python 3.9

Model weight loading hangs: two cards are in use, each holding roughly 30+ GB of memory, and then it just stays stuck. The service never comes up.

The log is as follows:
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.

RuntimeError: Triton Error [CUDA]: invalid argument

I encountered an error while running lightllm, and I need assistance in resolving it. Below is the traceback of the error:

After launching the server and running curl:

Task exception was never retrieved
future: <Task finished name='Task-8' coro=<RouterManager.loop_for_fwd() done, defined at /home/ec2-user/long/lightllm/lightllm/server/router/manager.py:83> exception=Triton Error [CUDA]: invalid argument

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 109, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 147, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 69, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/triton_kernel/context_flashattention_nopad.py", line 234, in context_attention_fwd
    _fwd_kernel[grid](
  File "/opt/conda/envs/light/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
>
Traceback (most recent call last):
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 86, in loop_for_fwd
    await self._step()
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 105, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 138, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 197, in prefill_batch
    return await ans
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 159, in _func
    return ans.value
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/async_.py", line 108, in value
    raise self._obj
_get_exception_class.<locals>.Derived: Triton Error [CUDA]: invalid argument

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 109, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 147, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 69, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/triton_kernel/context_flashattention_nopad.py", line 234, in context_attention_fwd
    _fwd_kernel[grid](
  File "/opt/conda/envs/light/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

pip show triton:

Name: triton
Version: 2.0.0.dev20221202
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /opt/conda/envs/light/lib/python3.9/site-packages
Requires: cmake, filelock, torch
Required-by: lightllm

commit id: head

benchmark stuck

Hi,

I tried benchmark_serving.py to check the throughput of lightllm, but the benchmark process seems to get stuck after the server prints "freed all gpu mem"; after that, no further HTTP POST lines are printed except the last one.

Any idea?

current batch size: 1 token used ratio: 0.31983333333333336
freed all gpu mem size 6000
INFO:     127.0.0.1:34050 - "POST /generate HTTP/1.1" 200 OK
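
As a cross-check independent of benchmark_serving.py, a small timing loop against /generate can show whether the server is still answering at all. This is only a sketch; it approximates generated tokens with max_new_tokens and assumes the server from this issue is on localhost:8080.

# Minimal sequential throughput probe against /generate (not the project's benchmark script).
import json
import time
import requests

url = "http://localhost:8080/generate"
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 64, "ignore_eos": True}}

n_requests, new_tokens = 8, 64
start = time.time()
for _ in range(n_requests):
    requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(payload))
elapsed = time.time() - start
print(f"{n_requests * new_tokens / elapsed:.1f} tokens/s over {n_requests} sequential requests")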

I tried test_llama.py, but it fails... help... T^T

Process Process-8:
Process Process-7:
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/lcx/lightllm/test/model/model_infer.py", line 51, in tppart_model_infer
logics = model_part.forward(batch_size,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
Process Process-2:
Process Process-5:
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/lcx/lightllm/test/model/model_infer.py", line 51, in tppart_model_infer
logics = model_part.forward(batch_size,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
Process Process-1:
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Input parameter question

[screenshot of the llama2-chinese prompt format]
I see that llama2-chinese uses this style of input prompt; how can I achieve the same effect with lightllm?

llama-7B hangs at startup

As titled: I pulled the latest code (2023/8/9 14:00) and ran:
python -m lightllm.server.api_server --model_dir ./llm/llama-7b/ --tp 1 --max_total_token_num 6000
My environment:
OS: centos 7
GPU=1 * 3090 24G (GPU memory usage does change)
python=3.10.12
llama-7b-hf=https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
torch=1.13.0+cu117 (with 2.0.0 it fails at runtime: OSError: dlopen: cannot load any more object with static TLS)
cuda=/usr/local/cuda-11.6/ (11.7 does not work either)

Looking forward to a reply. Thanks.

run error

  1. docker build -t lightllm_v1 . --network host --build-arg http_proxy=http://127.0.0.1:7890 --build-arg https_proxy=http://127.0.0.1:7890
  2. docker run -itd --gpus all --network=host -v /data/public_file:/data/ --name lightllm_test lightllm_v1
  3. get into container and execute
    python -m lightllm.server.api_server --model_dir /data/LLM_model/Llama2-Chinese-7b-Chat/ --tp 1 --max_total_token_num 120000
root@dell:/usr/src# python -m lightllm.server.api_server --model_dir /data/LLM_model/Llama2-Chinese-7b-Chat/ --tp 1 --max_total_token_num 120000
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 243, in start_router_process
    asyncio.run(router.wait_to_model_ready())
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 57, in wait_to_model_ready
    await asyncio.gather(*init_model_ret)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/model_infer/model_rpc.py", line 179, in init_model
    ans : rpyc.AsyncResult = self._init_model(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/model_infer/model_rpc.py", line 34, in exposed_init_model
    dist.init_process_group('nccl', init_method=f'tcp://127.0.0.1:{setting["nccl_port"]}', rank=rank_id, world_size=world_size)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 246, in start_router_process
    router.clean_up()
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 223, in clean_up
    model_rpc.rpc_server_process.kill()
AttributeError: 'NoneType' object has no attribute 'kill'
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
router init state: Distributed package doesn't have NCCL built in detoken init state: init ok

generate text garbled

When using the llama 2 model to predict "who are you", a bunch of incomprehensible things are generated. The same goes for asking other questions. Does the author know what the problem is?
{u'generated_text': [u"\u868a\n\u987c\u70f9\u4f18\u4f18\u891b\u4f18\u891b\u891b\u4f18\u891b\u4f18\u891b\u4f18\u573b\u4f18\u891b\u891b\u891b\u891b\u891b\u891b\u7d2c\u4f18\u573b\u4f18\u85e6\u891b\u7d2c\u6960\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u8be0\u8be0\u8be0\u85e6\n\n\u89c9\u62b5\u891b\u752b\u85e6\u85e6\n\u79a4\u6043\u752b\n\u62b5\u85e6\n\u62b5\u85e6\u85e6\u85e6\n\u73b7\u62b5\u85e6\n\u73b7\u62b5\u85e6\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\u989d\u989d\n'\n'\n'\n'\n'\n\u62b5\u62b5\u85e6\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\u989d\u989d\n'\n'\n\u62b5\u989d\n'\n'\n\u62b5\n'\n'\n\u62b5\u989d\u989d\u62b5\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n\u62b5\n'\n'\n'\n\u62b5\n'\n'\n\u62b5\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n\n\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n''\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n''\n'\n'\n'\n'\n'\n'''\n'\n'\n'''\n'\n'\n'\n'\n'\n''''\n'\n'\n''\n'\n''''\n'\n'\n''\n'\n'\n''\n'\n'\n'\n'\n'\n'''''\n'\n'''\n"]}

LLaMA model support question

Hello, the documentation lists support for Facebook's LLaMA models, but trying other llama models from Hugging Face gives errors.
In theory, LLaMA model architectures are all quite similar. If I want to run inference with models such as huggyllama/llama-7b or openlm-research/open_llama_7b, how much work would that be? If it is not much, what changes are needed?
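
One quick way to gauge the effort is to compare the checkpoint's config.json against a known-working llama checkpoint; a sketch follows, where the local paths are placeholders and only standard Hugging Face config fields are read:

# Sketch: compare architecture fields of a candidate llama checkpoint with a supported one.
import json
from pathlib import Path

def describe(model_dir):
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return {k: cfg.get(k) for k in ("architectures", "num_hidden_layers",
                                    "num_attention_heads", "hidden_size", "vocab_size")}

print(describe("/path/to/llama-7b-hf"))        # known-working reference (placeholder path)
print(describe("/path/to/open_llama_7b"))      # candidate model (placeholder path)
# If both report LlamaForCausalLM with compatible shapes, the existing llama
# loader should need little or no change.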

[BUG] AttributeError: 'NoneType' object has no attribute 'to_ir'

Issue description:

I receive AttributeError: 'NoneType' object has no attribute 'to_ir'

Steps to reproduce:

  1. Launch lightllm with the following arguments on 4x A100 80gb
python -m lightllm.server.api_server --model_dir /workspace/models/Llama-2-70b-chat-hf     \
                                     --host 0.0.0.0                 \
                                     --port 8080                    \
                                     --tp 4                         \
                                     --max_total_token_num 120000
  2. Send the following request:
curl --location 'http://localhost:8080/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs":"[INST] <<SYS>>\nYou are a helpful AI assistant.\n<</SYS>>\n\nPlease tell me about frogs [/INST]\n",
    "parameters":{
        "max_new_tokens":1000, 
        "frequency_penalty":1
    }
}'

Expected behavior:

Expected a 200 code and a response. Instead, received an error and timeout.

Error logging:

(lightllm) azureuser@trainer1:/workspace/lightllm$ ./run.sh 
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO:     Started server process [32679]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)



Task exception was never retrieved
future: <Task finished name='Task-8' coro=<RouterManager.loop_for_fwd() done, defined at /workspace/lightllm/lightllm/server/router/manager.py:88> exception=at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX, the context length to compute
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-bde58180cc67fc4675629069557a5d0a-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 92, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 143, in forward
    logits = self.model.forward(**kwargs)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 81, in _context_attention
    q = self._get_qkv(input1, cache_k, cache_v, infer_state, layer_weight)
  File "/workspace/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 43, in _get_qkv
    rotary_emb_fwd(q.view(-1, self.tp_q_head_num_, self.head_dim_), infer_state.position_cos, infer_state.position_sin)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
    _rotary_kernel[grid](
  File "<string>", line 41, in _rotary_kernel
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1620, in compile
    next_module = compile(module)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1549, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX, the context length to compute
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^
>
Traceback (most recent call last):
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 218, in prefill_batch
    return await ans
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 180, in _func
    return ans.value
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/async_.py", line 108, in value
    raise self._obj
rpyc.core.vinegar/triton.compiler._get_exception_class.<locals>.Derived: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX 代表要计算的上下文长度
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-bde58180cc67fc4675629069557a5d0a-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 92, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 143, in forward
    logits = self.model.forward(**kwargs)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 81, in _context_attention
    q = self._get_qkv(input1, cache_k, cache_v, infer_state, layer_weight)
  File "/workspace/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 43, in _get_qkv
    rotary_emb_fwd(q.view(-1, self.tp_q_head_num_, self.head_dim_), infer_state.position_cos, infer_state.position_sin)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
    _rotary_kernel[grid](
  File "<string>", line 41, in _rotary_kernel
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1620, in compile
    next_module = compile(module)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1549, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX 代表要计算的上下文长度
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

Environment:

Not using container. Using clean conda environment.

$ uname -a
Linux trainer1 5.15.0-1042-azure #49~20.04.1-Ubuntu SMP Wed Jul 12 12:44:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Sat Aug 19 19:27:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   29C    P0    52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   30C    P0    52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000003:00:00.0 Off |                    0 |
| N/A   30C    P0    53W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000004:00:00.0 Off |                    0 |
| N/A   31C    P0    54W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ python --version
Python 3.10.12
$ git log -n 1
commit 21007c4b8ca556e0f54f6851a3322a5464d3857f (HEAD -> main, origin/main, origin/HEAD)
Author: hiworldwzj <[email protected]>
Date:   Fri Aug 18 18:23:13 2023 +0800

    Update README.md to Add support For Baichuan13B (#87)
$ pip show triton
Name: triton
Version: 2.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages
Requires: cmake, filelock, lit, torch
Required-by: lightllm
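
A hedged note on triage, not a confirmed fix: the AttributeError: 'NoneType' object has no attribute 'to_ir' raised while lowering the bare return at the end of _rotary_kernel usually points to a mismatch between the installed Triton release and the version lightllm's kernels were written against. A minimal sanity check, independent of lightllm, is to compile and launch a trivial kernel with the Triton that is currently on the path:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(X, Y, N, BLOCK: tl.constexpr):
    # each program instance copies one BLOCK of elements
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(X + offs, mask=mask, other=0.0)
    tl.store(Y + offs, x, mask=mask)

x = torch.randn(1024, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
_copy_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
print("triton", triton.__version__, "copy ok:", torch.allclose(x, y))

If this trivial kernel also fails to compile, the Triton/CUDA toolchain itself is broken; if it succeeds, reinstalling the exact Triton version pinned in lightllm's requirements.txt would be my next step.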

Cannot use NsightSystems to trace gpu usage

Hi,

I tried using nsys profile to trace GPU usage in detail, but it fails to capture any GPU activity.
In other GPU environments the profiler works fine, and I can also use nsys with vLLM.

So I am wondering why nsys cannot trace lightllm.

Thx
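
One possible explanation, offered as an assumption about how the server is structured rather than a verified diagnosis: the actual GPU work runs in worker processes that the router spawns and talks to over rpyc, so attaching nsys only to the parent api_server process captures no CUDA activity. Asking nsys to follow child processes may help; a sketch (flag names as in recent Nsight Systems releases, paths are placeholders):

nsys profile \
    --trace=cuda,nvtx,osrt \
    --trace-fork-before-exec=true \
    -o lightllm_trace \
    python -m lightllm.server.api_server --model_dir /path/to/model --tp 1 --max_total_token_num 4096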

OOM when prompt length exceeds 1020.

Hi,

We deployed LLaMA 30B with lightllm and found that an OOM error occurs once the prompt length exceeds 1020 tokens.

Environment:
1xA100 80G
Driver Version: 460.106.00
Cuda: 11.7
LLaMA 30B
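
Not a confirmed root cause, but one thing worth checking: lightllm pre-allocates its KV cache according to --max_total_token_num, and the prefill of a long prompt needs additional temporary buffers on top of that, so a cache size tuned to fill the card can push a long-prompt prefill over the limit. A hedged starting point (the number below is illustrative, not tuned for 30B):

python -m lightllm.server.api_server \
    --model_dir /path/to/llama-30b \
    --tp 1 \
    --max_total_token_num 10000

If the OOM disappears with a smaller cache, the value can then be raised back up step by step.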

Benchmark cannot run to completion

When stress-testing again, I found that the latest code fails when running the following command from the README:
(screenshot of the command)
GPU: A100
Model: llama-7b-hf
Symptom: the server keeps logging incoming HTTP requests, but the log lines showing batch token and token ratio stop appearing and GPU utilization is 0, which would suggest inference has already finished? Notably, the batch token log lines did appear at the beginning, and the token ratio was very high.
(screenshot of the server logs)

After waiting a long time, the benchmark results finally came back, but they are far worse than expected, and for a long stretch afterwards GPU utilization stayed at 0! I suspect the server has finished inference but the engine did not return the results correctly.
(screenshot of the benchmark results)

Looking forward to your reply.
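
To separate "the engine is stuck" from "the results are not being returned", a simple check is to send one small request by hand while the benchmark appears idle. The /generate endpoint and the inputs/parameters JSON schema below follow the project README; adjust them if your build differs:

curl http://localhost:8000/generate \
     -X POST \
     -H 'Content-Type: application/json' \
     -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 16}}'

If this single request also hangs, the engine is stuck; if it answers promptly, the problem is more likely in how the finished benchmark requests are flushed back to the clients.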

RuntimeError: CUDA: Error- invalid source

GPU info: NVIDIA A800
model: Llama-2-7b-hf

root@23-0-0-175:/code# python -m lightllm.server.api_server --model_dir /code/Llama-2-7b-hf --tp 1 --max_total_token_num 4096
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [35]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py:84> exception=RuntimeError('CUDA: Error- invalid source')>
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/model.py", line 105, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/model.py", line 143, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 57, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1268, in compile
return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1301, in init
mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- invalid source
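
A hedged guess rather than a confirmed fix: "CUDA: Error- invalid source" while loading a compiled Triton cubin usually means the cached binary was built for a different GPU architecture or CUDA toolkit than the one the driver now sees, for example after switching images or machines. Clearing Triton's kernel cache forces a recompile on the current setup (the path below is Triton's default cache directory; adjust if TRITON_CACHE_DIR is set):

rm -rf ~/.triton/cache
# restart the server so the kernels are rebuilt against the current GPU/driver
python -m lightllm.server.api_server --model_dir /code/Llama-2-7b-hf --tp 1 --max_total_token_num 4096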

triton kernel compile error

Hi,

When I try to serve llama 7B/13B with lightllm, I hit a Triton compile error.
I used the Dockerfile in the repo to build the test image and tested on an A100-40G.

Am I missing anything needed to make the test work?

The error log is as follows:

future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/lightllm/lightllm/server/router/manager.py:84> exception=CompilationError('at 38:4:\ndef _rotary_kernel(\n    Q, Cos, Sin,\n    stride_qbs, stride_qh, stride_qd,\n    stride_cosbs, stride_cosd,\n    stride_sinbs, stride_sind,\n    max_total_len,\n    H,  # N_CTX 代表要计算的上下文长度\n    BLOCK_HEAD: tl.constexpr,\n    BLOCK_SEQ: tl.constexpr,\n    BLOCK_DMODEL: tl.constexpr,\n):\n    cur_head_index = tl.program_id(0)\n    cur_seq_index = tl.program_id(1)\n\n    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)\n    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)\n\n    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)\n    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)\n\n    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd\n    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd\n\n    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd\n\n    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n\n    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n\n    out0 = q0 * cos - q1 * sin\n    out1 = q0 * sin + q1 * cos\n\n    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n\n    return\n    ^')>
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-2a292e5784d51bd8ac8bf0d3423dfbd4-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/opt/conda/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/opt/conda/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/opt/conda/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/opt/conda/lib/python3.9/site-packages/triton/language/core.py", line 301, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/opt/conda/lib/python3.9/site-packages/triton/language/core.py", line 301, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'
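
Since this is the same bare-return lowering failure reported above for Baichuan-7B, my (equally hedged) suggestion is the same: make sure the Triton inside the container is exactly the version pinned by lightllm, not whatever a later pip install of torch or triton pulled in. For example:

cd /opt/lightllm                                  # path taken from the traceback; adjust to your checkout
pip install -r requirements.txt --force-reinstall --no-deps
pip show triton                                   # confirm the pinned version is the one that gets imported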

llama2-70B: server fails to start after the weights finish loading

Machine: 4x A100 80G
Launch command: python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 4 --tokenizer_mode auto --max_total_token_num 512
CUDA 11.7
python 3.8

Loading the model weights gets stuck: each card shows roughly 30+ GB of memory in use, and then the process just hangs. The server never comes up.
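
One pattern that can produce exactly this symptom, offered as an assumption rather than a diagnosis for this particular machine: with --tp 4 the four worker processes initialize NCCL for tensor parallelism right after loading their weight shards, and a hang at that point is often NCCL waiting on peer-to-peer setup. Turning on NCCL logging, and if necessary disabling P2P, is a cheap way to confirm or rule that out:

NCCL_DEBUG=INFO \
NCCL_P2P_DISABLE=1 \
python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 4 --tokenizer_mode auto --max_total_token_num 512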

Error when running Ziya, help needed

/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/house365ai/xxm/lightllm/lightllm/server/router/manager.py:84> exception=CompilationError('at 38:4:\ndef _rotary_kernel(\n Q, Cos, Sin,\n stride_qbs, stride_qh, stride_qd,\n stride_cosbs, stride_cosd,\n stride_sinbs, stride_sind,\n max_total_len,\n H, # N_CTX 代表要计算的上下文长度\n BLOCK_HEAD: tl.constexpr,\n BLOCK_SEQ: tl.constexpr,\n BLOCK_DMODEL: tl.constexpr,\n):\n cur_head_index = tl.program_id(0)\n cur_seq_index = tl.program_id(1)\n\n cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)\n cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)\n\n dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)\n dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)\n\n off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd\n off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd\n\n off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd\n\n q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n\n cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n\n out0 = q0 * cos - q1 * sin\n out1 = q0 * sin + q1 * cos\n\n tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n\n return\n ^')>
Traceback (most recent call last):
File "", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-d000bd1a52e8da5725b7d0d3a84e9be4-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 937, in build_triton_ir
generator.visit(fn.parse())
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
return super().visit(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 407, in visit
return visitor(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 183, in visit_Module
ast.NodeVisitor.generic_visit(self, node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 415, in generic_visit
self.visit(item)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
return super().visit(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 407, in visit
return visitor(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
fn.reset_type(self.prototype.to_ir(self.builder))
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/language/core.py", line 298, in to_ir
ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/language/core.py", line 298, in
ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 52, in _context_flash_attention
rotary_emb_fwd(q.view(calcu_shape1), infer_state.position_cos, infer_state.position_sin)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
_rotary_kernel[grid](
File "", line 41, in _rotary_kernel
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 1621, in compile
next_module = compile(module)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 1550, in
lambda src: ast_to_ttir(src, signature, configs[0], constants)),
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 962, in ast_to_ttir
mod, _ = build_triton_ir(fn, signature, specialization, constants)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 942, in build_triton_ir
raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
Q, Cos, Sin,
stride_qbs, stride_qh, stride_qd,
stride_cosbs, stride_cosd,
stride_sinbs, stride_sind,
max_total_len,
H, # N_CTX 代表要计算的上下文长度
BLOCK_HEAD: tl.constexpr,
BLOCK_SEQ: tl.constexpr,
BLOCK_DMODEL: tl.constexpr,
):
cur_head_index = tl.program_id(0)
cur_seq_index = tl.program_id(1)

cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

out0 = q0 * cos - q1 * sin
out1 = q0 * sin + q1 * cos

tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

return

Is the stream output the same as OpenAI's?

Can I use the code below to read the stream output?

import openai
if __name__ == "__main__":
    openai.api_base = "http://localhost:8080/v1"
    openai.api_key = "none"
    for chunk in openai.ChatCompletion.create(
        model="llama",
        messages=[
            {"role": "user", "content": "give me three healthy methods"}
        ],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)
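
As far as I know the api_server does not expose an OpenAI-compatible /v1 chat endpoint out of the box, so the snippet above would need a separate adapter in front of lightllm. A hedged alternative is to stream from the server's own HTTP API directly. The sketch below assumes a streaming variant of /generate (named /generate_stream in recent lightllm versions) that emits one chunk per generated token; the path and payload schema are assumptions, adjust them to your build:

import requests

payload = {
    "inputs": "give me three healthy methods",
    "parameters": {"max_new_tokens": 128},
}
# stream=True keeps the HTTP connection open and yields data as it arrives
with requests.post("http://localhost:8000/generate_stream", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            # each non-empty line is assumed to carry the newly generated text fragment
            print(line.decode("utf-8"), flush=True)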

Encounter error when serving with vicuna-13b-v1.3.

Use docker environment:
docker build -t image_name .
sudo docker run -it --runtime=nvidia --name=test --net=host --gpus all --privileged --shm-size 20G --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE llm bash

GPU V100
CUDA Version: 11.8
Python 3.9.16
pip uninstall torch
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118

server:
python -m lightllm.server.api_server --model_dir /data/workspace/vicuna-13b-v1.3 --tp 2 --max_total_token_num 121060 --tokenizer_mode auto

client:
python ./test/benchmark_serving.py --tokenizer /data/workspace/vicuna-13b-v1.3 --dataset /data/workspace/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100

Error message:
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [628]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /data/workspace/lightllm/lightllm/server/router/manager.py:84> exception='97859f0c0d6242588bb78c8e4a29aed0'

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/data/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
batch: InferBatch = self.cache.pop(batch_id)
KeyError: '97859f0c0d6242588bb78c8e4a29aed0'

Traceback (most recent call last):
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 139, in prefill_batch
ans = await asyncio.gather(*rets)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 185, in prefill_batch
return ans.value
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async
.py", line 108, in value
raise self._obj
_get_exception_class..Derived: '97859f0c0d6242588bb78c8e4a29aed0'

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/data/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
batch: InferBatch = self.cache.pop(batch_id)
KeyError: '97859f0c0d6242588bb78c8e4a29aed0'

[QUESTION] Is it expected to see an exception when running `benchmark_serving.py`

Issue description:

I was running benchmark_serving.py against llama2-7b:

$ python benchmark_serving.py --tokenizer /path/to/Llama-2-7b-chat-hf --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json

Outputs:

Namespace(dataset='/path/to/ShareGPT_V3_unfiltered_cleaned_split.json', tokenizer='/path/to/Llama-2-7b-chat-hf', request_rate=inf, num_prompts=1000, seed=0)
read data set finish
total tokens: 494250
Traceback (most recent call last):
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 581, in write_bytes
    await self.body.write(writer)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/payload.py", line 247, in write
    await writer.write(self._value)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/http_writer.py", line 115, in write
    self._write(chunk)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/http_writer.py", line 75, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 236, in <module>
    main(args)
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 198, in main
    asyncio.run(benchmark(input_requests, args.request_rate))
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 187, in benchmark
    await asyncio.gather(*tasks)
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 162, in send_request
    async with session.post(url, headers=headers, json=data) as response:
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno None] Can not write request body for http://localhost:8000/generate

The API server worked fine.
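
One thing worth ruling out first, as an assumption rather than a verified cause: with request_rate=inf the script fires all 1000 requests at once, and any request the server cannot finish within aiohttp's default 5-minute total timeout has its transport closed on the client side, which then surfaces as "Cannot write to closing transport". A minimal sketch of disabling that timeout (the helper name and payload below are illustrative; in benchmark_serving.py the same ClientTimeout would be passed to the session it already creates):

import asyncio
import aiohttp

async def send_request(url: str, data: dict) -> dict:
    # total=None disables aiohttp's default 300s limit so long-queued requests are not dropped client-side
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(url, json=data) as response:
            return await response.json()

if __name__ == "__main__":
    body = {"inputs": "hello", "parameters": {"max_new_tokens": 8}}
    print(asyncio.run(send_request("http://localhost:8000/generate", body)))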

Please provide a clear and concise description of your issue.

Steps to reproduce:

Please list the steps to reproduce the issue, such as:

  1. command 0
  2. command 2
  3. command 3
  4. See error

Expected behavior:

Please describe what you expected to happen.

Error logging:

If applicable, please copy and paste the error message or stack trace here. Use code blocks for better readability.

Environment:

Please provide information about your environment, such as:

  • Using container

  • OS: (Ubuntu 14.04, CentOS7)

  • GPU info:

    • nvidia-smi (e.g. NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0)
    • Graphics cards: (e.g. 4090x8)
  • Python: (e.g. CPython3.9)

    • currently, only python>=3.9 is supported
  • LightLLm: (git commit-hash)

    • for container: docker run --entrypoint cat --rm ghcr.io/modeltc/lightllm:main /lightllm/.git/refs/heads/main
  • openai-triton: pip show triton

Additional context:

Please add any other context or screenshots about the issue here.

Language:

Please use English as much as possible for better communication.

[BUG]No module named 'lightllm.models.chatglm2.triton_kernel

I am running the Baichuan 13B model; the command and the error are below. Could you please take a look at how to resolve this?

python -m lightllm.server.api_server --model_dir output_dir_0728/global_step_e3_60344 \
--host 0.0.0.0 \
--port 3302 \
--tp 1 \
--trust_remote_code \
--max_total_token_num 1500

(screenshot of the ModuleNotFoundError)
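
A hedged guess at the cause: lightllm.models.chatglm2.triton_kernel is a subpackage that was added to the repo fairly recently, and an older copy of lightllm already installed into site-packages (for example via python setup.py install before that directory existed) will shadow the fresh checkout. Reinstalling the current checkout in editable mode usually clears this kind of ModuleNotFoundError:

cd /path/to/lightllm          # the checkout you actually run from
pip uninstall -y lightllm     # drop any stale copy in site-packages
pip install -e .              # editable install so new subpackages are picked up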
