modeltc / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

License: Apache License 2.0

Languages: Python 99.20%, Shell 0.57%, Dockerfile 0.23%
Topics: deep-learning, gpt, llama, llm, model-serving, nlp, openai-triton
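
For context on how the server discussed in the issues below is typically exercised, here is a minimal client sketch against the /generate endpoint. It assumes a server already launched locally; the launch command and payload shape are taken from the issues on this page, and the prompt text is only an example.

# Minimal sketch of querying a running LightLLM server on localhost:8080.
# Assumes the server was launched as in the issues below, e.g.:
#   python -m lightllm.server.api_server --model_dir /path/to/model --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120000
import json
import requests

url = "http://localhost:8080/generate"
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Please tell me about frogs",
    "parameters": {"do_sample": False, "ignore_eos": False, "max_new_tokens": 128},
}
print(requests.post(url, headers=headers, data=json.dumps(data)).json())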

lightllm's People

Contributors

andy-yang-1, bingo787, chielonewctle, flyinglandlord, fuheaven, hamelsmu, helloyongyang, hiworldwzj, huochaitiantang, jingofxin, llehtahw, senbeiasano, shanshanpt, shihaobai, singularity-s0, tmsagarofficial, tracin, uranusseven, wandy666, wusiyu, wxd000000, xfplus, xhplus, yunfeng-scale, zeyugao


lightllm's Issues

run error with baichuan-7B

my env:
OS=centos 7
GPU=1 * 3090 24G
python=3.10.12
baichuan-7B=https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main
torch=1.13.0+cu117
cuda=cuda-11.8
triton=2.0.0.dev20221202
The server starts, but calling it with curl produces an error:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /data/wangjie/code/github/lightllm/lightllm/server/router/manager.py:83> exception=TypeError("mm(): argument 'mat2' (position 2) must be Tensor, not NoneType")>
Traceback (most recent call last):
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 86, in loop_for_fwd
    await self._step()
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 105, in _step
    await self._prefill_batch(self.running_batch)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/manager.py", line 138, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 195, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/data/wangjie/code/github/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/data/wangjie/code/github/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/data/wangjie/tools/miniconda3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/model.py", line 113, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/model.py", line 151, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/data/wangjie/code/github/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/data/wangjie/code/github/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 66, in _context_flash_attention
    q = torch.mm(input1.view(-1, self.embed_dim_), layer_weight.q_weight_)
TypeError: mm(): argument 'mat2' (position 2) must be Tensor, not NoneType

Looking forward to your reply. Thanks.

Run error

triton/compiler/compiler.py", line 18, in
from ..runtime.autotuner import OutOfResources
ImportError: cannot import name 'OutOfResources' from partially initialized module 'triton.runtime.autotuner' (most likely due to a circular import) (

Triton Error [CUDA]: invalid argument

Issue description:

Got a CUDA error when sending a request to the server.

Steps to reproduce:

python -m lightllm.server.api_server --model_dir ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120

And sending a request using:

import json
import requests

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': PROMPT,                      # PROMPT, args and max_new_tokens come from the surrounding script
    'parameters': {
        'do_sample': args.do_sample,
        'ignore_eos': False,
        'max_new_tokens': max_new_tokens,
    },
}
generated_text = requests.post(url, headers=headers, data=json.dumps(data)).json()

Error logging:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/jina/jemfu/lightllm/lightllm/server/router/manager.py:88> exception=RuntimeError('Triton Error [CUDA]: invalid argument')>
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/home/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 201, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 77, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 128, in forward
    logits = self.model.forward(**kwargs)
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 116, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 154, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home//lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 76, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jemfu/lightllm/lightllm/models/llama/triton_kernel/context_flashattention_nopad.py", line 224, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Environment:


  • Using container

  • OS: Ubuntu

  • GPU info:

    • NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6
    • RTX TITAN
  • Python: 3.10

  • LightLLM: I used git clone and pip install -e .

  • openai-triton: 2.0.0.dev20221202

Stream output

Hi,

This project is awesome! May I ask when lightllm will support stream output? Thanks
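
For reference, once streaming is exposed a client could consume it roughly as sketched below; the endpoint name and the line-per-chunk framing are assumptions, not the project's actual API.

# Hypothetical streaming client sketch. lightllm did not provide a streaming
# endpoint at the time of this issue; /generate_stream and the chunk framing
# below are assumptions for illustration only.
import json
import requests

payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}
with requests.post(
    "http://localhost:8080/generate_stream",        # assumed endpoint name
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))              # assumed: one text chunk per line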

[feature request] add prompt styles support

There are many different styles of prompts for different LLMs, such as the openai/llama2 chat style (especially with a SYSTEM role prompt), the pure text style, ziya, etc.
From the parameters of api_server.py, it seems only the pure text style is supported.
It would be a good feature to support these styles; a client-side sketch follows below.
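
A rough client-side workaround is to apply the template before calling the plain-text /generate endpoint. The [INST]/<<SYS>> wrapper below is the standard llama-2 chat format (it also appears in a request elsewhere on this page); the helper function is hypothetical and not something api_server.py provides.

# Client-side prompt templating sketch (llama-2 chat style). The helper below is
# hypothetical; api_server.py itself only accepts plain text in "inputs".
def build_llama2_prompt(system: str, user: str) -> str:
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]\n"

prompt = build_llama2_prompt("You are a helpful AI assistant.", "Please tell me about frogs")
# Send `prompt` as the "inputs" field of a /generate request.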

lightllm vs vllm

Hi,

I am benchmarking lightllm against vllm, and it seems that vllm achieves better 'token/ms' results for llama 30b.

Here are the parameters for lightllm and vllm server:
[screenshots of the lightllm and vllm server launch parameters]

Error when calling the server

/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
collect2: error: ld returned 1 exit status
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/house365ai/xxm/lightllm/lightllm/server/router/manager.py:84> exception=CalledProcessError(1, ['/usr/bin/gcc', '/tmp/tmpchhdqwt0/main.c', '-O3', '-I/usr/local/cuda-11.8/include', '-I/home/house365ai/.conda/envs/lightllm/include/python3.10', '-I/tmp/tmpchhdqwt0', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpchhdqwt0/_rms_norm_fwd_fused.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib32'])>
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/compiler.py", line 1239, in compile
so = _build(fn.name, src_path, tmpdir)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/site-packages/triton-2.0.0.dev20221202-py3.10-linux-x86_64.egg/triton/compiler.py", line 1169, in _build
ret = subprocess.check_call(cc_cmd)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpchhdqwt0/main.c', '-O3', '-I/usr/local/cuda-11.8/include', '-I/home/house365ai/.conda/envs/lightllm/include/python3.10', '-I/tmp/tmpchhdqwt0', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpchhdqwt0/_rms_norm_fwd_fused.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib32']' returned non-zero exit status 1.

RuntimeError("Distributed package doesn't have NCCL " "built in")

I follow the instruction:

  1. setup env:
    docker build -t image_name .
    docker run -it --gpus all -p 8080:80 -v your_local_path:/data/ image_name /bin/bash

  2. run model:
    python -m lightllm.server.api_server --model_dir /path/llama-7B --tp 1 --max_total_token_num 120000

then I hit the following exception:

Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:28765 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:28765 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:28765 (errno: 97 - Address family not supported by protocol).
Process Process-1:
Traceback (most recent call last):
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 243, in start_router_process
asyncio.run(router.wait_to_model_ready())
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 57, in wait_to_model_ready
await asyncio.gather(*init_model_ret)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/model_infer/model_rpc.py", line 185, in init_model
ans : rpyc.AsyncResult = self._init_model(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/model_infer/model_rpc.py", line 36, in exposed_init_model
dist.init_process_group('nccl', init_method=f'tcp://127.0.0.1:{setting["nccl_port"]}', rank=rank_id, world_size=world_size)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 998, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/storage03/users/kongxiangxing/projects/llm/llm_deploy/light_llm/lightllm/server/router/manager.py", line 246, in start_router_process
err_str = '\n'.join(traceback.format_exception(e))
TypeError: format_exception() missing 2 required positional arguments: 'value' and 'tb'
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
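
The failure originates in dist.init_process_group('nccl', ...) inside exposed_init_model, so a PyTorch build without NCCL support (for example a CPU-only wheel) reproduces it. A quick sanity check, independent of lightllm:

# Sanity-check sketch: confirm the installed PyTorch build ships NCCL before the
# router process calls dist.init_process_group('nccl', ...).
import torch
import torch.distributed as dist

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())   # False matches this error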

Llama2 and llama-30b do not work

  • Only the llama-7b model runs; llama2-13b, llama2-70b and llama-30b all fail.
  • Using an A100 GPU; I followed the instructions about the Triton dependencies.
Task exception was never retrieved
future: <Task finished name='Task-12' coro=<RouterManager.loop_for_fwd() done, defined at /home/  /lightllm/lightllm/server/router/manager.py:84> exception='cef3986efedb4d1a966c56365091ddde'

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/  /lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
    batch: InferBatch = self.cache.pop(batch_id)
KeyError: 'cef3986efedb4d1a966c56365091ddde'
>
Traceback (most recent call last):
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
    await self._step()
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 106, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/  /lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 185, in prefill_batch
    return ans.value
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/async_.py", line 108, in value
    raise self._obj
_get_exception_class.<locals>.Derived: 'cef3986efedb4d1a966c56365091ddde'

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/lightllm-infer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/  /lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/  /lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
    batch: InferBatch = self.cache.pop(batch_id)
KeyError: 'cef3986efedb4d1a966c56365091ddde'

My setup: A800 80G * 8

How can this problem be solved?

self.value_buffer = [torch.empty((size, head_num, head_dim), dtype=dtype, device="cuda") for _ in range(layer_num)]

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 79.35 GiB total capacity; 77.83 GiB already allocated; 711.19 MiB free; 77.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
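
The value_buffer in the line above is sized by the server's token-capacity setting (presumably --max_total_token_num), so this OOM usually means that setting is too large for whatever memory is left after the weights are loaded. A back-of-envelope sketch; all shapes below are assumptions, including the matching key buffer that doubles the total:

# Rough KV-cache sizing sketch, following the (size, head_num, head_dim) fp16
# allocation above; doubled for an assumed matching key buffer.
def kv_cache_gib(max_total_token_num, head_num, head_dim, layer_num, bytes_per_elem=2):
    per_layer = max_total_token_num * head_num * head_dim * bytes_per_elem
    return 2 * layer_num * per_layer / 1024 ** 3

# Hypothetical llama-2-70b-like shapes (8 KV heads per GPU after tensor parallelism):
print(kv_cache_gib(120000, head_num=8, head_dim=128, layer_num=80))   # ~36.6 GiB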

[BUG] llama2-70B: the service fails to start after loading completes

Machine: A100 80G * 3
Launch command: python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 2 --max_total_token_num 120000 --max_req_input_len 3000 --max_req_total_len 4096
CUDA_V 11.8
python 3.9

Model weight loading hangs: two cards are in use, each holding roughly 30+ GB of memory, and then it just stays stuck. The service never comes up.

The log is as follows:
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.

RuntimeError: Triton Error [CUDA]: invalid argument

I encountered an error while running lightllm, and I need assistance in resolving it. Below is the traceback of the error:

After launching the server and running curl:

Task exception was never retrieved
future: <Task finished name='Task-8' coro=<RouterManager.loop_for_fwd() done, defined at /home/ec2-user/long/lightllm/lightllm/server/router/manager.py:83> exception=Triton Error [CUDA]: invalid argument

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 109, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 147, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 69, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/triton_kernel/context_flashattention_nopad.py", line 234, in context_attention_fwd
    _fwd_kernel[grid](
  File "/opt/conda/envs/light/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
>
Traceback (most recent call last):
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 86, in loop_for_fwd
    await self._step()
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 105, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/manager.py", line 138, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 197, in prefill_batch
    return await ans
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 159, in _func
    return ans.value
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/async_.py", line 108, in value
    raise self._obj
_get_exception_class.<locals>.Derived: Triton Error [CUDA]: invalid argument

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 71, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/ec2-user/long/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 122, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 109, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/model.py", line 147, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/ec2-user/long/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 69, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/opt/conda/envs/light/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/long/lightllm/lightllm/models/llama2/triton_kernel/context_flashattention_nopad.py", line 234, in context_attention_fwd
    _fwd_kernel[grid](
  File "/opt/conda/envs/light/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

pip show triton:

Name: triton
Version: 2.0.0.dev20221202
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /opt/conda/envs/light/lib/python3.9/site-packages
Requires: cmake, filelock, torch
Required-by: lightllm

commit id: head

benchmark stuck

Hi,

I tried benchmark_serving.py to check the throughput of lightllm, but the benchmark process seems to get stuck after the server prints "freed all gpu mem"; after that, no further HTTP POST lines are printed except the last one.

Any idea?

current batch size: 1 token used ratio: 0.31983333333333336
freed all gpu mem size 6000
INFO:     127.0.0.1:34050 - "POST /generate HTTP/1.1" 200 OK
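
As a cross-check independent of benchmark_serving.py, a small timing loop against /generate can show whether the server is still answering at all. This is only a sketch; it approximates generated tokens with max_new_tokens and assumes the server from this issue is on localhost:8080.

# Minimal sequential throughput probe against /generate (not the project's benchmark script).
import json
import time
import requests

url = "http://localhost:8080/generate"
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 64, "ignore_eos": True}}

n_requests, new_tokens = 8, 64
start = time.time()
for _ in range(n_requests):
    requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(payload))
elapsed = time.time() - start
print(f"{n_requests * new_tokens / elapsed:.1f} tokens/s over {n_requests} sequential requests")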

I tried test_llama.py, but it fails... help... T^T

Process Process-8:
Process Process-7:
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/lcx/lightllm/test/model/model_infer.py", line 51, in tppart_model_infer
logics = model_part.forward(batch_size,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
Process Process-2:
Process Process-5:
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/stan/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/lcx/lightllm/test/model/model_infer.py", line 51, in tppart_model_infer
logics = model_part.forward(batch_size,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 49, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
Process Process-1:
File "/opt/conda/envs/stan/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/opt/conda/envs/stan/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-09caff3db89e80ddf0eb4f72675bc8f9-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Input parameter question

[screenshot of the llama2-chinese prompt format]
I see that llama2-chinese uses this style of input prompt; how can I achieve the same effect with lightllm?

llama-7B hangs at startup

As titled: I pulled the latest code (2023/8/9 14:00) and ran:
python -m lightllm.server.api_server --model_dir ./llm/llama-7b/ --tp 1 --max_total_token_num 6000
My environment:
OS: centos 7
GPU=1 * 3090 24G (GPU memory usage does change)
python=3.10.12
llama-7b-hf=https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
torch=1.13.0+cu117 (with 2.0.0 it fails at runtime: OSError: dlopen: cannot load any more object with static TLS)
cuda=/usr/local/cuda-11.6/ (11.7 does not work either)

Looking forward to a reply. Thanks.

run error

  1. docker build -t lightllm_v1 . --network host --build-arg http_proxy=http://127.0.0.1:7890 --build-arg https_proxy=http://127.0.0.1:7890
  2. docker run -itd --gpus all --network=host -v /data/public_file:/data/ --name lightllm_test lightllm_v1
  3. get into container and execute
    python -m lightllm.server.api_server --model_dir /data/LLM_model/Llama2-Chinese-7b-Chat/ --tp 1 --max_total_token_num 120000
root@dell:/usr/src# python -m lightllm.server.api_server --model_dir /data/LLM_model/Llama2-Chinese-7b-Chat/ --tp 1 --max_total_token_num 120000
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 243, in start_router_process
    asyncio.run(router.wait_to_model_ready())
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 57, in wait_to_model_ready
    await asyncio.gather(*init_model_ret)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/model_infer/model_rpc.py", line 179, in init_model
    ans : rpyc.AsyncResult = self._init_model(rank_id, world_size, weight_dir, max_total_token_num, load_way, mode)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/model_infer/model_rpc.py", line 34, in exposed_init_model
    dist.init_process_group('nccl', init_method=f'tcp://127.0.0.1:{setting["nccl_port"]}', rank=rank_id, world_size=world_size)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 246, in start_router_process
    router.clean_up()
  File "/opt/conda/lib/python3.9/site-packages/lightllm/server/router/manager.py", line 223, in clean_up
    model_rpc.rpc_server_process.kill()
AttributeError: 'NoneType' object has no attribute 'kill'
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
router init state: Distributed package doesn't have NCCL built in detoken init state: init ok

generate text garbled

When using the llama 2 model to predict "who are you", a bunch of incomprehensible things are generated. The same goes for asking other questions. Does the author know what the problem is?
{u'generated_text': [u"\u868a\n\u987c\u70f9\u4f18\u4f18\u891b\u4f18\u891b\u891b\u4f18\u891b\u4f18\u891b\u4f18\u573b\u4f18\u891b\u891b\u891b\u891b\u891b\u891b\u7d2c\u4f18\u573b\u4f18\u85e6\u891b\u7d2c\u6960\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u85e6\u8be0\u8be0\u8be0\u85e6\n\n\u89c9\u62b5\u891b\u752b\u85e6\u85e6\n\u79a4\u6043\u752b\n\u62b5\u85e6\n\u62b5\u85e6\u85e6\u85e6\n\u73b7\u62b5\u85e6\n\u73b7\u62b5\u85e6\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n\u73b7\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\u989d\u989d\n'\n'\n'\n'\n'\n\u62b5\u62b5\u85e6\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\u989d\u989d\n'\n'\n\u62b5\u989d\n'\n'\n\u62b5\n'\n'\n\u62b5\u989d\u989d\u62b5\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n\u62b5\n'\n'\n'\n\u62b5\n'\n'\n\u62b5\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n\n\n'\n'\n'\n'\n'\n'\n\u62b5\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n''\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n'\n''\n'\n'\n'\n'\n'\n'''\n'\n'\n'''\n'\n'\n'\n'\n'\n''''\n'\n'\n''\n'\n''''\n'\n'\n''\n'\n'\n''\n'\n'\n'\n'\n'\n'''''\n'\n'''\n"]}

LLaMA model support question

Hello, the documentation lists support for Facebook's LLaMA models, but trying other llama models from Hugging Face gives errors.
In theory, LLaMA model architectures are all quite similar. If I want to run inference with models such as huggyllama/llama-7b or openlm-research/open_llama_7b, how much work would that be? If it is not much, what changes are needed?
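
One quick way to gauge the effort is to compare the checkpoint's config.json against a known-working llama checkpoint; a sketch follows, where the local paths are placeholders and only standard Hugging Face config fields are read:

# Sketch: compare architecture fields of a candidate llama checkpoint with a supported one.
import json
from pathlib import Path

def describe(model_dir):
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return {k: cfg.get(k) for k in ("architectures", "num_hidden_layers",
                                    "num_attention_heads", "hidden_size", "vocab_size")}

print(describe("/path/to/llama-7b-hf"))        # known-working reference (placeholder path)
print(describe("/path/to/open_llama_7b"))      # candidate model (placeholder path)
# If both report LlamaForCausalLM with compatible shapes, the existing llama
# loader should need little or no change.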

[BUG] AttributeError: 'NoneType' object has no attribute 'to_ir'

Issue description:

I receive AttributeError: 'NoneType' object has no attribute 'to_ir'

Steps to reproduce:

  1. Launch lightllm with the following arguments on 4x A100 80gb
python -m lightllm.server.api_server --model_dir /workspace/models/Llama-2-70b-chat-hf     \
                                     --host 0.0.0.0                 \
                                     --port 8080                    \
                                     --tp 4                         \
                                     --max_total_token_num 120000
  2. Send the following request:
curl --location 'http://localhost:8080/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs":"[INST] <<SYS>>\nYou are a helpful AI assistant.\n<</SYS>>\n\nPlease tell me about frogs [/INST]\n",
    "parameters":{
        "max_new_tokens":1000, 
        "frequency_penalty":1
    }
}'

Expected behavior:

Expected a 200 code and a response. Instead, received an error and timeout.

Error logging:

(lightllm) azureuser@trainer1:/workspace/lightllm$ ./run.sh 
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:695] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO:     Started server process [32679]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)



Task exception was never retrieved
future: <Task finished name='Task-8' coro=<RouterManager.loop_for_fwd() done, defined at /workspace/lightllm/lightllm/server/router/manager.py:88> exception=at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX, the context length to compute
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-bde58180cc67fc4675629069557a5d0a-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 92, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 143, in forward
    logits = self.model.forward(**kwargs)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 81, in _context_attention
    q = self._get_qkv(input1, cache_k, cache_v, infer_state, layer_weight)
  File "/workspace/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 43, in _get_qkv
    rotary_emb_fwd(q.view(-1, self.tp_q_head_num_, self.head_dim_), infer_state.position_cos, infer_state.position_sin)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
    _rotary_kernel[grid](
  File "<string>", line 41, in _rotary_kernel
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1620, in compile
    next_module = compile(module)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1549, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX, the context length to compute
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^
>
Traceback (most recent call last):
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/workspace/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 218, in prefill_batch
    return await ans
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 180, in _func
    return ans.value
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/async_.py", line 108, in value
    raise self._obj
rpyc.core.vinegar/triton.compiler._get_exception_class.<locals>.Derived: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX 代表要计算的上下文长度
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-bde58180cc67fc4675629069557a5d0a-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/language/core.py", line 298, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/rpyc-5.3.1-py3.10.egg/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 92, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 143, in forward
    logits = self.model.forward(**kwargs)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/workspace/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/workspace/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 81, in _context_attention
    q = self._get_qkv(input1, cache_k, cache_v, infer_state, layer_weight)
  File "/workspace/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 43, in _get_qkv
    rotary_emb_fwd(q.view(-1, self.tp_q_head_num_, self.head_dim_), infer_state.position_cos, infer_state.position_sin)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
    _rotary_kernel[grid](
  File "<string>", line 41, in _rotary_kernel
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1620, in compile
    next_module = compile(module)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 1549, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
    Q, Cos, Sin,
    stride_qbs, stride_qh, stride_qd,
    stride_cosbs, stride_cosd,
    stride_sinbs, stride_sind,
    max_total_len,
    H,  # N_CTX 代表要计算的上下文长度
    BLOCK_HEAD: tl.constexpr,
    BLOCK_SEQ: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
):
    cur_head_index = tl.program_id(0)
    cur_seq_index = tl.program_id(1)

    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

    out0 = q0 * cos - q1 * sin
    out1 = q0 * sin + q1 * cos

    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

    return
    ^

Environment:

Not using container. Using clean conda environment.

$ uname -a
Linux trainer1 5.15.0-1042-azure #49~20.04.1-Ubuntu SMP Wed Jul 12 12:44:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Sat Aug 19 19:27:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   29C    P0    52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   30C    P0    52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000003:00:00.0 Off |                    0 |
| N/A   30C    P0    53W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000004:00:00.0 Off |                    0 |
| N/A   31C    P0    54W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ python --version
Python 3.10.12
$ git log -n 1
commit 21007c4b8ca556e0f54f6851a3322a5464d3857f (HEAD -> main, origin/main, origin/HEAD)
Author: hiworldwzj <[email protected]>
Date:   Fri Aug 18 18:23:13 2023 +0800

    Update README.md to Add support For Baichuan13B (#87)
$ pip show triton
Name: triton
Version: 2.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /workspace/miniconda3/envs/lightllm/lib/python3.10/site-packages
Requires: cmake, filelock, lit, torch
Required-by: lightllm
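
A hedged note on triage, not a confirmed fix: the AttributeError: 'NoneType' object has no attribute 'to_ir' raised while lowering the bare return at the end of _rotary_kernel usually points to a mismatch between the installed Triton release and the version lightllm's kernels were written against. A minimal sanity check, independent of lightllm, is to compile and launch a trivial kernel with the Triton that is currently on the path:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(X, Y, N, BLOCK: tl.constexpr):
    # each program instance copies one BLOCK of elements
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(X + offs, mask=mask, other=0.0)
    tl.store(Y + offs, x, mask=mask)

x = torch.randn(1024, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
_copy_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
print("triton", triton.__version__, "copy ok:", torch.allclose(x, y))

If this trivial kernel also fails to compile, the Triton/CUDA toolchain itself is broken; if it succeeds, reinstalling the exact Triton version pinned in lightllm's requirements.txt would be my next step.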

Cannot use NsightSystems to trace gpu usage

Hi,

I tried using nsys profile to trace GPU usage in detail, but it fails to capture any GPU activity.
In other GPU environments the profiler works fine, and I can also use nsys with vLLM.

So I am wondering why nsys cannot trace lightllm.

Thx
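
One possible explanation, offered as an assumption about how the server is structured rather than a verified diagnosis: the actual GPU work runs in worker processes that the router spawns and talks to over rpyc, so attaching nsys only to the parent api_server process captures no CUDA activity. Asking nsys to follow child processes may help; a sketch (flag names as in recent Nsight Systems releases, paths are placeholders):

nsys profile \
    --trace=cuda,nvtx,osrt \
    --trace-fork-before-exec=true \
    -o lightllm_trace \
    python -m lightllm.server.api_server --model_dir /path/to/model --tp 1 --max_total_token_num 4096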

OOM when prompt length exceeds 1020.

Hi,

We deployed LLaMA 30B with lightllm and found that an OOM error occurs once the prompt length exceeds 1020 tokens.

Environment:
1xA100 80G
Driver Version: 460.106.00
Cuda: 11.7
LLaMA 30B
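
Not a confirmed root cause, but one thing worth checking: lightllm pre-allocates its KV cache according to --max_total_token_num, and the prefill of a long prompt needs additional temporary buffers on top of that, so a cache size tuned to fill the card can push a long-prompt prefill over the limit. A hedged starting point (the number below is illustrative, not tuned for 30B):

python -m lightllm.server.api_server \
    --model_dir /path/to/llama-30b \
    --tp 1 \
    --max_total_token_num 10000

If the OOM disappears with a smaller cache, the value can then be raised back up step by step.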

Benchmark cannot run to completion

When stress-testing again, I found that the latest code fails when running the following command from the README:
(screenshot of the command)
GPU: A100
Model: llama-7b-hf
Symptom: the server keeps logging incoming HTTP requests, but the log lines showing batch token and token ratio stop appearing and GPU utilization is 0, which would suggest inference has already finished? Notably, the batch token log lines did appear at the beginning, and the token ratio was very high.
(screenshot of the server logs)

After waiting a long time, the benchmark results finally came back, but they are far worse than expected, and for a long stretch afterwards GPU utilization stayed at 0! I suspect the server has finished inference but the engine did not return the results correctly.
(screenshot of the benchmark results)

Looking forward to your reply.
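
To separate "the engine is stuck" from "the results are not being returned", a simple check is to send one small request by hand while the benchmark appears idle. The /generate endpoint and the inputs/parameters JSON schema below follow the project README; adjust them if your build differs:

curl http://localhost:8000/generate \
     -X POST \
     -H 'Content-Type: application/json' \
     -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 16}}'

If this single request also hangs, the engine is stuck; if it answers promptly, the problem is more likely in how the finished benchmark requests are flushed back to the clients.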

RuntimeError: CUDA: Error- invalid source

GPU info: NVIDIA A800
model: Llama-2-7b-hf

root@23-0-0-175:/code# python -m lightllm.server.api_server --model_dir /code/Llama-2-7b-hf --tp 1 --max_total_token_num 4096
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [35]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py:84> exception=RuntimeError('CUDA: Error- invalid source')>
Traceback (most recent call last):
File "", line 21, in _rms_norm_fwd_fused
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'fp32'), (16384,), (True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/model.py", line 105, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/model.py", line 143, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 111, in context_forward
self._context_flash_attention(input_embdings,
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama2/layer_infer/transformer_layer_inference.py", line 57, in context_flash_attention
input1 = rmsnorm_forward(input_embding, weight=layer_weight.input_layernorm, eps=self.layer_norm_eps
)
File "/opt/conda/lib/python3.10/site-packages/lightllm-1.0.0-py3.10.egg/lightllm/models/llama/triton_kernel/rmsnorm.py", line 59, in rmsnorm_forward
_rms_norm_fwd_fused[(M,)](x_arg, y, weight,
File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "", line 41, in _rms_norm_fwd_fused
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1268, in compile
return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1301, in init
mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- invalid source
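
A hedged guess rather than a confirmed fix: "CUDA: Error- invalid source" while loading a compiled Triton cubin usually means the cached binary was built for a different GPU architecture or CUDA toolkit than the one the driver now sees, for example after switching images or machines. Clearing Triton's kernel cache forces a recompile on the current setup (the path below is Triton's default cache directory; adjust if TRITON_CACHE_DIR is set):

rm -rf ~/.triton/cache
# restart the server so the kernels are rebuilt against the current GPU/driver
python -m lightllm.server.api_server --model_dir /code/Llama-2-7b-hf --tp 1 --max_total_token_num 4096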

triton kernel compile error

Hi,

When I try to serve llama 7B/13B with lightllm, I hit a Triton compile error.
I used the Dockerfile in the repo to build the test image and tested on an A100-40G.

Am I missing anything needed to make the test work?

The error log is as follows:

future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/lightllm/lightllm/server/router/manager.py:84> exception=CompilationError('at 38:4:\ndef _rotary_kernel(\n    Q, Cos, Sin,\n    stride_qbs, stride_qh, stride_qd,\n    stride_cosbs, stride_cosd,\n    stride_sinbs, stride_sind,\n    max_total_len,\n    H,  # N_CTX 代表要计算的上下文长度\n    BLOCK_HEAD: tl.constexpr,\n    BLOCK_SEQ: tl.constexpr,\n    BLOCK_DMODEL: tl.constexpr,\n):\n    cur_head_index = tl.program_id(0)\n    cur_seq_index = tl.program_id(1)\n\n    cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)\n    cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)\n\n    dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)\n    dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)\n\n    off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd\n    off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd\n\n    off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd\n\n    q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n    q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n\n    cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n    sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n\n    out0 = q0 * cos - q1 * sin\n    out1 = q0 * sin + q1 * cos\n\n    tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n    tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n\n    return\n    ^')>
Traceback (most recent call last):
  File "<string>", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-2a292e5784d51bd8ac8bf0d3423dfbd4-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/opt/conda/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/opt/conda/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/opt/conda/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/opt/conda/lib/python3.9/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
    fn.reset_type(self.prototype.to_ir(self.builder))
  File "/opt/conda/lib/python3.9/site-packages/triton/language/core.py", line 301, in to_ir
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
  File "/opt/conda/lib/python3.9/site-packages/triton/language/core.py", line 301, in <listcomp>
    ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'
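
Since this is the same bare-return lowering failure reported above for Baichuan-7B, my (equally hedged) suggestion is the same: make sure the Triton inside the container is exactly the version pinned by lightllm, not whatever a later pip install of torch or triton pulled in. For example:

cd /opt/lightllm                                  # path taken from the traceback; adjust to your checkout
pip install -r requirements.txt --force-reinstall --no-deps
pip show triton                                   # confirm the pinned version is the one that gets imported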

llama2-70B: server fails to start after the weights finish loading

Machine: 4x A100 80G
Launch command: python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 4 --tokenizer_mode auto --max_total_token_num 512
CUDA 11.7
python 3.8

Loading the model weights gets stuck: each card shows roughly 30+ GB of memory in use, and then the process just hangs. The server never comes up.
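
One pattern that can produce exactly this symptom, offered as an assumption rather than a diagnosis for this particular machine: with --tp 4 the four worker processes initialize NCCL for tensor parallelism right after loading their weight shards, and a hang at that point is often NCCL waiting on peer-to-peer setup. Turning on NCCL logging, and if necessary disabling P2P, is a cheap way to confirm or rule that out:

NCCL_DEBUG=INFO \
NCCL_P2P_DISABLE=1 \
python -m lightllm.server.api_server --model_dir /path_to/Llama-2-70b-hf --tp 4 --tokenizer_mode auto --max_total_token_num 512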

Error when running Ziya, help needed

/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/house365ai/xxm/lightllm/lightllm/server/router/manager.py:84> exception=CompilationError('at 38:4:\ndef _rotary_kernel(\n Q, Cos, Sin,\n stride_qbs, stride_qh, stride_qd,\n stride_cosbs, stride_cosd,\n stride_sinbs, stride_sind,\n max_total_len,\n H, # N_CTX 代表要计算的上下文长度\n BLOCK_HEAD: tl.constexpr,\n BLOCK_SEQ: tl.constexpr,\n BLOCK_DMODEL: tl.constexpr,\n):\n cur_head_index = tl.program_id(0)\n cur_seq_index = tl.program_id(1)\n\n cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)\n cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)\n\n dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)\n dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)\n\n off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd\n off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd\n\n off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd\n\n q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)\n\n cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)\n\n out0 = q0 * cos - q1 * sin\n out1 = q0 * sin + q1 * cos\n\n tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))\n\n return\n ^')>
Traceback (most recent call last):
File "", line 21, in _rotary_kernel
KeyError: ('2-.-0-.-0-d000bd1a52e8da5725b7d0d3a84e9be4-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (4, 32, 128), (True, True, True, (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 937, in build_triton_ir
generator.visit(fn.parse())
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
return super().visit(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 407, in visit
return visitor(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 183, in visit_Module
ast.NodeVisitor.generic_visit(self, node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 415, in generic_visit
self.visit(item)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 855, in visit
return super().visit(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/ast.py", line 407, in visit
return visitor(node)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 263, in visit_FunctionDef
fn.reset_type(self.prototype.to_ir(self.builder))
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/language/core.py", line 298, in to_ir
ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/language/core.py", line 298, in
ret_types = [ret_type.to_ir(builder) for ret_type in self.ret_types]
AttributeError: 'NoneType' object has no attribute 'to_ir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/manager.py", line 139, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 182, in prefill_batch
ans = self._prefill_batch(batch_id)
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/home/house365ai/xxm/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 118, in forward
logits = self.model.forward(**kwargs)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 103, in forward
predict_logics = self._context_forward(input_ids, infer_state)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/model.py", line 141, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 103, in context_forward
self._context_flash_attention(input_embdings,
File "/home/house365ai/xxm/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 52, in _context_flash_attention
rotary_emb_fwd(q.view(calcu_shape1), infer_state.position_cos, infer_state.position_sin)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/house365ai/xxm/lightllm/lightllm/models/llama/triton_kernel/rotary_emb.py", line 62, in rotary_emb_fwd
_rotary_kernel[grid](
File "", line 41, in _rotary_kernel
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 1621, in compile
next_module = compile(module)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 1550, in
lambda src: ast_to_ttir(src, signature, configs[0], constants)),
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 962, in ast_to_ttir
mod, _ = build_triton_ir(fn, signature, specialization, constants)
File "/home/house365ai/.conda/envs/lightllm/lib/python3.9/site-packages/triton/compiler.py", line 942, in build_triton_ir
raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 38:4:
def _rotary_kernel(
Q, Cos, Sin,
stride_qbs, stride_qh, stride_qd,
stride_cosbs, stride_cosd,
stride_sinbs, stride_sind,
max_total_len,
H, # N_CTX 代表要计算的上下文长度
BLOCK_HEAD: tl.constexpr,
BLOCK_SEQ: tl.constexpr,
BLOCK_DMODEL: tl.constexpr,
):
cur_head_index = tl.program_id(0)
cur_seq_index = tl.program_id(1)

cur_head_range = cur_head_index * BLOCK_HEAD + tl.arange(0, BLOCK_HEAD)
cur_seq_range = cur_seq_index * BLOCK_SEQ + tl.arange(0, BLOCK_SEQ)

dim_range0 = tl.arange(0, BLOCK_DMODEL // 2)
dim_range1 = tl.arange(BLOCK_DMODEL // 2, BLOCK_DMODEL)

off_q0 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range0[None, None, :] * stride_qd
off_q1 = cur_seq_range[:, None, None] * stride_qbs + cur_head_range[None, :, None] * stride_qh + dim_range1[None, None, :] * stride_qd

off_dimcos_sin = cur_seq_range[:, None, None] * stride_cosbs + dim_range0[None, None, :] * stride_cosd

q0 = tl.load(Q + off_q0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)
q1 = tl.load(Q + off_q1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H), other=0.0)

cos = tl.load(Cos + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)
sin = tl.load(Sin + off_dimcos_sin, mask=cur_seq_range[:, None, None] < max_total_len, other=0.0)

out0 = q0 * cos - q1 * sin
out1 = q0 * sin + q1 * cos

tl.store(Q + off_q0, out0, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))
tl.store(Q + off_q1, out1, mask=(cur_seq_range[:, None, None] < max_total_len) & (cur_head_range[None, :, None] < H))

return

Is the stream output the same as OpenAI's?

Can I use the code below to read the stream output?

import openai
if __name__ == "__main__":
    openai.api_base = "http://localhost:8080/v1"
    openai.api_key = "none"
    for chunk in openai.ChatCompletion.create(
        model="llama",
        messages=[
            {"role": "user", "content": "give me three healthy methods"}
        ],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)
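
As far as I know the api_server does not expose an OpenAI-compatible /v1 chat endpoint out of the box, so the snippet above would need a separate adapter in front of lightllm. A hedged alternative is to stream from the server's own HTTP API directly. The sketch below assumes a streaming variant of /generate (named /generate_stream in recent lightllm versions) that emits one chunk per generated token; the path and payload schema are assumptions, adjust them to your build:

import requests

payload = {
    "inputs": "give me three healthy methods",
    "parameters": {"max_new_tokens": 128},
}
# stream=True keeps the HTTP connection open and yields data as it arrives
with requests.post("http://localhost:8000/generate_stream", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            # each non-empty line is assumed to carry the newly generated text fragment
            print(line.decode("utf-8"), flush=True)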

Encounter error when serving with vicuna-13b-v1.3.

Use docker environment:
docker build -t image_name .
sudo docker run -it --runtime=nvidia --name=test --net=host --gpus all --privileged --shm-size 20G --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE llm bash

GPU V100
CUDA Version: 11.8
Python 3.9.16
pip uninstall torch
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118

server:
python -m lightllm.server.api_server --model_dir /data/workspace/vicuna-13b-v1.3 --tp 2 --max_total_token_num 121060 --tokenizer_mode auto

client:
python ./test/benchmark_serving.py --tokenizer /data/workspace/vicuna-13b-v1.3 --dataset /data/workspace/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100

Error message:
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [628]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /data/workspace/lightllm/lightllm/server/router/manager.py:84> exception='97859f0c0d6242588bb78c8e4a29aed0'

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/data/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
batch: InferBatch = self.cache.pop(batch_id)
KeyError: '97859f0c0d6242588bb78c8e4a29aed0'

Traceback (most recent call last):
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 87, in loop_for_fwd
await self._step()
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 106, in _step
await self._prefill_batch(self.running_batch)
File "/data/workspace/lightllm/lightllm/server/router/manager.py", line 139, in prefill_batch
ans = await asyncio.gather(*rets)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 185, in prefill_batch
return ans.value
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async
.py", line 108, in value
raise self._obj
_get_exception_class..Derived: '97859f0c0d6242588bb78c8e4a29aed0'

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/data/workspace/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 67, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/data/workspace/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 104, in forward
batch: InferBatch = self.cache.pop(batch_id)
KeyError: '97859f0c0d6242588bb78c8e4a29aed0'

[QUESTION] Is it expected to see an exception when running `benchmark_serving.py`

Issue description:

I was running benchmark_serving.py against llama2-7b:

$ python benchmark_serving.py --tokenizer /path/to/Llama-2-7b-chat-hf --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json

Outputs:

Namespace(dataset='/path/to/ShareGPT_V3_unfiltered_cleaned_split.json', tokenizer='/path/to/Llama-2-7b-chat-hf', request_rate=inf, num_prompts=1000, seed=0)
read data set finish
total tokens: 494250
Traceback (most recent call last):
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 581, in write_bytes
    await self.body.write(writer)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/payload.py", line 247, in write
    await writer.write(self._value)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/http_writer.py", line 115, in write
    self._write(chunk)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/http_writer.py", line 75, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 236, in <module>
    main(args)
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 198, in main
    asyncio.run(benchmark(input_requests, args.request_rate))
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 187, in benchmark
    await asyncio.gather(*tasks)
  File "/mnt/data/data_jon/repos/lightllm/test/benchmark_serving.py", line 162, in send_request
    async with session.post(url, headers=headers, json=data) as response:
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/jon/data/miniconda3/envs/lightllm/lib/python3.9/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno None] Can not write request body for http://localhost:8000/generate

The API server worked fine.
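
One thing worth ruling out first, as an assumption rather than a verified cause: with request_rate=inf the script fires all 1000 requests at once, and any request the server cannot finish within aiohttp's default 5-minute total timeout has its transport closed on the client side, which then surfaces as "Cannot write to closing transport". A minimal sketch of disabling that timeout (the helper name and payload below are illustrative; in benchmark_serving.py the same ClientTimeout would be passed to the session it already creates):

import asyncio
import aiohttp

async def send_request(url: str, data: dict) -> dict:
    # total=None disables aiohttp's default 300s limit so long-queued requests are not dropped client-side
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(url, json=data) as response:
            return await response.json()

if __name__ == "__main__":
    body = {"inputs": "hello", "parameters": {"max_new_tokens": 8}}
    print(asyncio.run(send_request("http://localhost:8000/generate", body)))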

Please provide a clear and concise description of your issue.

Steps to reproduce:

Please list the steps to reproduce the issue, such as:

  1. command 0
  2. command 2
  3. command 3
  4. See error

Expected behavior:

Please describe what you expected to happen.

Error logging:

If applicable, please copy and paste the error message or stack trace here. Use code blocks for better readability.

Environment:

Please provide information about your environment, such as:

  • Using container

  • OS: (Ubuntu 14.04, CentOS7)

  • GPU info:

    • nvidia-smi (e.g. NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0)
    • Graphics cards: (e.g. 4090x8)
  • Python: (e.g. CPython3.9)

    • currently, only python>=3.9 is supported
  • LightLLm: (git commit-hash)

    • for container: docker run --entrypoint cat --rm ghcr.io/modeltc/lightllm:main /lightllm/.git/refs/heads/main
  • openai-triton: pip show triton

Additional context:

Please add any other context or screenshots about the issue here.

Language:

Please use English as much as possible for better communication.

[BUG]No module named 'lightllm.models.chatglm2.triton_kernel

I am running the Baichuan 13B model; the command and the error are below. Could you please take a look at how to resolve this?

python -m lightllm.server.api_server --model_dir output_dir_0728/global_step_e3_60344 \
--host 0.0.0.0 \
--port 3302 \
--tp 1 \
--trust_remote_code \
--max_total_token_num 1500

(screenshot of the ModuleNotFoundError)
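
A hedged guess at the cause: lightllm.models.chatglm2.triton_kernel is a subpackage that was added to the repo fairly recently, and an older copy of lightllm already installed into site-packages (for example via python setup.py install before that directory existed) will shadow the fresh checkout. Reinstalling the current checkout in editable mode usually clears this kind of ModuleNotFoundError:

cd /path/to/lightllm          # the checkout you actually run from
pip uninstall -y lightllm     # drop any stale copy in site-packages
pip install -e .              # editable install so new subpackages are picked up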
