
pygmalionai / aphrodite-engine


PygmalionAI's large-scale inference engine

Home Page: https://pygmalion.chat

License: GNU Affero General Public License v3.0

Python 59.84% C++ 5.55% Cuda 33.53% C 0.68% Shell 0.36% Dockerfile 0.04%
api-rest inference-engine machine-learning avx512 cuda inferentia rocm

aphrodite-engine's People

Contributors

50h100a, alpindale, anon998, autumnlight02, city-unit, drummerv, g4rg, henk717, iggooncode, karakarawitch, krisseck, lostruins, miku448, official-elinas, pyroserenus, recoveredapparatus, sandwichdoge, sgsdxzy, stefandanielschwarz, stefangliga, swadicalrag, teargosling, teasitta, thesentinel2615, thomas-xin


aphrodite-engine's Issues

Exception in TFS implementation on simultaneous requests

I am hosting a multithreaded worker on horde.
If multiple jobs come in at roughly the same time, the following exception occurs.

If I reduce worker threads to 1, i.e. no simultaneous requests, this does not happen.
It also does not happen when using multiple threads and commenting out the part in sampler.py where it branches into _apply_tfs.

Tested on bare metal A100 40GB PCIe, model PygmalionAI/pygmalion-2-13b, fp16

    ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 349, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 328, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 189, in step_async
    output = await self._run_workers_async(
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 214, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/aphrodite/aph032/aphrodite/task_handler/worker.py", line 324, in execute_model
    output = self.model(
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aph032/aphrodite/modeling/models/llama.py", line 296, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aph032/aphrodite/modeling/layers/sampler.py", line 66, in forward
    logits = _apply_tfs(logits, tfss)
  File "/home/aphrodite/aph032/aphrodite/modeling/layers/sampler.py", line 306, in _apply_tfs
    normalized_d2 = d2 / torch.sum(d2, dim=-1)
RuntimeError: The size of tensor a (31998) must match the size of tensor b (2) at non-singleton dimension 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/applications.py", line 292, in __call__
    self,
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/aphrodite/aph032/aphrodite/endpoints/openai/api_server.py", line 523, in create_completion
    async for res in result_generator:
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 433, in generate
    raise e
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 427, in generate
    async for request_output in stream:
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 69, in __anext__
    raise result
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/applications.py", line 292, in __call__
    self,
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/aphrodite/aph032/aphrodite/endpoints/openai/api_server.py", line 523, in create_completion
    async for res in result_generator:
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 433, in generate
    raise e
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 427, in generate
    async for request_output in stream:
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 69, in __anext__
    raise result
  File "/home/aphrodite/micromamba/envs/aph032/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/home/aphrodite/aph032/aphrodite/engine/async_aphrodite.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
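The shape mismatch reads like a broadcasting bug: with two requests batched, the `torch.sum` in `_apply_tfs` drops the vocab dimension. A minimal sketch of the suspected cause and a `keepdim`-based fix (the shapes and the fix are my inference from the error message, not a confirmed patch):

```python
import torch

# Suspected cause (shapes assumed from the error message): with two
# simultaneous requests the logits tensor is (batch, vocab) = (2, 31998).
# Summing without keepdim collapses the last dimension to shape (2,),
# which cannot broadcast against (2, 31998).
d2 = torch.rand(2, 31998)

try:
    normalized_d2 = d2 / torch.sum(d2, dim=-1)  # (2, 31998) vs (2,)
except RuntimeError as exc:
    print(exc)  # "The size of tensor a (31998) must match ..."

# Keeping the reduced dimension, shape (2, 1), broadcasts correctly:
normalized_d2 = d2 / torch.sum(d2, dim=-1, keepdim=True)
assert normalized_d2.shape == (2, 31998)
```

This would also explain why a single worker thread hides the bug: a batch of one produces a scalar-like divisor that happens to broadcast.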

Add RoPE scaling arguments to engine

Currently, we auto-scale using the --max-model-len argument. It may be more appropriate to have specific options for the scaling factor, etc.
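As a sketch of what dedicated arguments might look like (the `--rope-scaling-type` and `--rope-scaling-factor` flag names below are hypothetical, not existing engine options; only `--max-model-len` exists today):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max-model-len", type=int, default=None,
                    help="existing: target context length (auto-scales RoPE)")
parser.add_argument("--rope-scaling-type", choices=["linear", "dynamic"],
                    default=None, help="hypothetical: explicit scaling method")
parser.add_argument("--rope-scaling-factor", type=float, default=None,
                    help="hypothetical: explicit factor, instead of deriving "
                         "it from --max-model-len")

args = parser.parse_args(["--rope-scaling-type", "linear",
                          "--rope-scaling-factor", "2.0"])
assert args.rope_scaling_type == "linear" and args.rope_scaling_factor == 2.0
```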

Problem with dockerfile and compiled image in 0.5.0

Hi! I just downloaded the new Docker image and it does not run: it tries to run the entrypoint at ENTRYPOINT ["/app/aphrodite-engine/docker/entrypoint.sh"], but inside the container the entrypoint is still at the old path, /workspace/aphrodite-engine.

Inside the entrypoint script there is a cd /app/aphrodite-engine, which fails when the container calls the script (no such directory).

Regards!

Classifier-Free Guidance support

From yesterday's discussion, here are some extra features to add:

  • CFG support
  • Mirostat sampling
  • exl2 quantization support

This will help broaden support for newer methods of chatting with models.

Is GGUF support broken?

I tried to start the service with GGUF models on an RTX 4090 to test GGUF performance, but it shows an error and I am not sure whether GGUF support has broken. I start the service with this command:

python -m aphrodite.endpoints.openai.api_server  --model Mixtral_11Bx2_MoE_19B-GGUF/ --quantization gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral  --disable-log-requests --gpu-memory-utilization 0.8

Error message:
miniconda3/envs/fast-llm-serving/lib/python3.10/site-packages/aphrodite/common/config.py", line 136, in _verify_load_format
if "MixtralForCausalLM" in architectures and load_format == "pt":
TypeError: argument of type 'NoneType' is not iterable
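The TypeError suggests `architectures` is None by the time `_verify_load_format` runs, likely because no usable `architectures` field was found for the model directory. A defensive sketch in the spirit of that check (hypothetical helper, not the shipped `config.py`):

```python
# Guard against `architectures` being None before using the `in` operator,
# which is what the traceback shows failing.
def verify_load_format(load_format: str, architectures) -> None:
    architectures = architectures or []  # the traceback suggests None here
    if "MixtralForCausalLM" in architectures and load_format == "pt":
        raise ValueError("Mixtral does not support the 'pt' load format.")

verify_load_format("pt", None)  # previously: TypeError; now a no-op
```

Even with that guard, a None here probably means the GGUF directory is missing config metadata, which may be the real bug.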

Problem with request (before 0.5 works with no problem)

Hi, I'm trying the new release, and the following prompt throws an exception in the new engine. I tried many examples, and also many models and many chat templates; all fail.

Here is the stack trace:

--- Logging error in Loguru Handler #1 ---
Record was: {'elapsed': datetime.timedelta(seconds=17, microseconds=169798), 'exception': None, 'extra': {}, 'file': 
(name='async_aphrodite.py', path='/workspace/aphrodite-engine/aphrodite/engine/async_aphrodite.py'), 'function': 
'add_request', 'level': (name='INFO', no=20, icon='ℹ️'), 'line': 495, 'message': 'Received request cmpl-
c7b48579f17e45ad9096d0d5950a65d7: prompt: \'<|im_start|>system\\\\n\\nYou are designed to help with a variety of tasks, 
from answering questions     to providing summaries to other types of analyses.\\n\\n## Tools\\nYou have access to a wide 
variety of tools. You are responsible for using\\nthe tools in any sequence you deem appropriate to complete the task at 
hand.\\nThis may require breaking the task into subtasks and using different tools\\nto complete each subtask.\\n\\nYou 
have access to the following tools:\\n> Tool Name: multiply\\nTool Description: Multiple two integers and returns the result 
integer\\nTool Args: {"type": "object", "properties": {"a": {"title": "A", "type": "integer"}, "b": {"title": "B", "type": "integer"}}, 
"required": ["a", "b"]}\\n\\n> Tool Name: add\\nTool Description: Add two integers and returns the result integer\\nTool Args: 
{"type": "object", "properties": {"a": {"title": "A", "type": "integer"}, "b": {"title": "B", "type": "integer"}}, "required": ["a", 
"b"]}\\n\\n\\n## Output Format\\nTo answer the question, please use the following format.\\n\\n```\\nThought: I need to use a 
tool to help me answer the question.\\nAction: tool name (one of multiply, add) if using a tool.\\nAction Input: the input to the 
tool, in a JSON format representing the kwargs (e.g. {"input": "hello world", "num_beams": 5})\\n```\\n\\nPlease ALWAYS 
start with a Thought.\\n\\nPlease use a valid JSON format for the Action Input. Do NOT do this {\\\'input\\\': \\\'hello world\\\', 
\\\'num_beams\\\': 5}.\\n\\nIf this format is used, the user will respond in the following format:\\n\\n```\\nObservation: tool 
response\\n```\\n\\nYou should keep repeating the above format until you have enough information\\nto answer the question 
without using any more tools. At that point, you MUST respond\\nin the one of the following two formats:\\n\\n```\\nThought: I 
can answer without using any more tools.\\nAnswer: [your answer here]\\n```\\n\\n```\\nThought: I cannot answer the 
question with the provided tools.\\nAnswer: Sorry, I cannot answer your query.\\n```\\n\\n## Current Conversation\\nBelow is 
the current conversation consisting of interleaving human and assistant 
messages.\\n\\n<|im_end|>\\\\n<|im_start|>user\\\\nWhat is 20+(2*4) ? Calculate step by step 
<|im_end|>\\\\n<|im_start|>assistant\', sampling params: SamplingParams(temperature=0.1, max_tokens=3391), prompt 
token ids: [1, 523, 28766, 321, 28730, 2521, 28766, 28767, 6574, 28756, 28711, 13, 1976, 460, 5682, 298, 1316, 395, 264, 
6677, 302, 9796, 28725, 477, 24402, 4224, 260, 298, 7501, 18062, 497, 298, 799, 4514, 302, 21974, 274, 28723, 13, 13, 
1064, 26258, 13, 1976, 506, 2735, 298, 264, 5335, 6677, 302, 7040, 28723, 995, 460, 7332, 354, 1413, 13, 1237, 7040, 
297, 707, 7768, 368, 340, 366, 7658, 298, 4160, 272, 3638, 438, 1021, 28723, 13, 3260, 993, 2699, 11313, 272, 3638, 778, 
1083, 21128, 304, 1413, 1581, 7040, 13, 532, 4160, 1430, 1083, 5553, 28723, 13, 13, 1976, 506, 2735, 298, 272, 2296, 
7040, 28747, 13, 28767, 12877, 6620, 28747, 17669, 346, 13, 6778, 10220, 28747, 9713, 4191, 989, 3113, 7850, 304, 
5723, 272, 1204, 11584, 13, 6778, 24997, 28747, 9830, 1123, 1264, 345, 2814, 548, 345, 10723, 1264, 9830, 28708, 1264, 
9830, 3901, 1264, 345, 28741, 548, 345, 1123, 1264, 345, 14296, 7706, 345, 28726, 1264, 9830, 3901, 1264, 345, 28760, 
548, 345, 1123, 1264, 345, 14296, 28739, 10781, 345, 10893, 1264, 7367, 28708, 548, 345, 28726, 2242, 28752, 13, 13, 
28767, 12877, 6620, 28747, 967, 13, 6778, 10220, 28747, 3301, 989, 3113, 7850, 304, 5723, 272, 1204, 11584, 13, 6778, 
24997, 28747, 9830, 1123, 1264, 345, 2814, 548, 345, 10723, 1264, 9830, 28708, 1264, 9830, 3901, 1264, 345, 28741, 
548, 345, 1123, 1264, 345, 14296, 7706, 345, 28726, 1264, 9830, 3901, 1264, 345, 28760, 548, 345, 1123, 1264, 345, 
14296, 28739, 10781, 345, 10893, 1264, 7367, 28708, 548, 345, 28726, 2242, 28752, 13, 13, 13, 1064, 15985, 18748, 13, 
1551, 4372, 272, 2996, 28725, 4665, 938, 272, 2296, 5032, 28723, 13, 13, 13940, 28832, 13, 1227, 1322, 28747, 315, 927, 
298, 938, 264, 3921, 298, 1316, 528, 4372, 272, 2996, 28723, 13, 3795, 28747, 3921, 1141, 325, 538, 302, 17669, 346, 
28725, 967, 28731, 513, 1413, 264, 3921, 28723, 13, 3795, 11232, 28747, 272, 2787, 298, 272, 3921, 28725, 297, 264, 
9292, 5032, 14030, 272, 23197, 325, 28706, 28723, 28721, 28723, 9830, 2537, 1264, 345, 21558, 1526, 548, 345, 2575, 
28730, 1105, 5322, 1264, 28705, 28782, 1542, 13, 13940, 28832, 13, 13, 12069, 10461, 26548, 28735, 1149, 395, 264, 
26142, 28723, 13, 13, 12069, 938, 264, 3716, 9292, 5032, 354, 272, 9624, 11232, 28723, 2378, 5457, 511, 456, 12012, 
2537, 1869, 464, 21558, 1526, 647, 464, 2575, 28730, 1105, 5322, 1869, 28705, 28782, 2051, 13, 13, 3381, 456, 5032, 
349, 1307, 28725, 272, 2188, 622, 9421, 297, 272, 2296, 5032, 28747, 13, 13, 13940, 28832, 13, 23044, 352, 28747, 3921, 
2899, 13, 13940, 28832, 13, 13, 1976, 1023, 1840, 5683, 1077, 272, 2747, 5032, 1996, 368, 506, 2066, 1871, 13, 532, 
4372, 272, 2996, 1671, 1413, 707, 680, 7040, 28723, 1794, 369, 1305, 28725, 368, 351, 11080, 9421, 13, 262, 272, 624, 
302, 272, 2296, 989, 23468, 28747, 13, 13, 13940, 28832, 13, 1227, 1322, 28747, 315, 541, 4372, 1671, 1413, 707, 680, 
7040, 28723, 13, 2820, 16981, 28747, 733, 19262, 4372, 1236, 28793, 13, 13940, 28832, 13, 13, 13940, 28832, 13, 1227, 
1322, 28747, 315, 3573, 4372, 272, 2996, 395, 272, 3857, 7040, 28723, 13, 2820, 16981, 28747, 19385, 28725, 315, 3573, 
4372, 574, 5709, 28723, 13, 13940, 28832, 13, 13, 1064, 10929, 1325, 25422, 13, 20548, 336, 349, 272, 1868, 7114, 
20922, 302, 791, 291, 1652, 2930, 304, 13892, 8570, 28723, 13, 13, 28789, 28766, 321, 28730, 416, 28766, 11266, 28711, 
28789, 28766, 321, 28730, 2521, 28766, 28767, 1838, 28756, 28711, 3195, 349, 28705, 28750, 28734, 24993, 28750, 
28736, 28781, 28731, 1550, 2984, 16914, 3707, 486, 3707, 523, 28766, 321, 28730, 416, 28766, 11266, 28711, 28789, 
28766, 321, 28730, 2521, 28766, 28767, 489, 11143], lora_request: None.', 'module': 'async_aphrodite', 'name': 
'aphrodite.engine.async_aphrodite', 'process': (id=1, name='MainProcess'), 'thread': (id=139723111862720, 
name='MainThread'), 'time': datetime(2024, 3, 12, 13, 25, 25, 450461, tzinfo=datetime.timezone(datetime.timedelta(0), 
'UTC'))}

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/loguru/_handler.py", line 164, in emit
    _, precomputed_format = self._memoize_dynamic_format(dynamic_format, ansi_level)
  File "/usr/local/lib/python3.10/dist-packages/loguru/_handler.py", line 14, in prepare_colored_format
    colored = Colorizer.prepare_format(format_)
  File "/usr/local/lib/python3.10/dist-packages/loguru/_colorizer.py", line 357, in prepare_format
    tokens, messages_color_tokens = Colorizer._parse_without_formatting(string)
  File "/usr/local/lib/python3.10/dist-packages/loguru/_colorizer.py", line 466, in _parse_without_formatting
    _, color_tokens = Colorizer._parse_without_formatting(
  File "/usr/local/lib/python3.10/dist-packages/loguru/_colorizer.py", line 466, in _parse_without_formatting
    _, color_tokens = Colorizer._parse_without_formatting(
  File "/usr/local/lib/python3.10/dist-packages/loguru/_colorizer.py", line 466, in _parse_without_formatting
    _, color_tokens = Colorizer._parse_without_formatting(
  File "/usr/local/lib/python3.10/dist-packages/loguru/_colorizer.py", line 438, in _parse_without_formatting
    raise ValueError("Max string recursion exceeded")
ValueError: Max string recursion exceeded
--- End of logging error ---

Anyway, the engine still responds to the request with the expected generated text, so this does not seem too critical.

Edit: Trying with 0.4.9, the prompt is rendered differently; it seems that 0.5 adds extra backslashes to escape strings, or something like that. Same prompt, same model, yet the log shows a different prompt. You can see the differences in the system role:

INFO 03-12 13:22:28 async_aphrodite.py:432] Received request cmpl-ef35020a14324f078f7d671f9e62c766: prompt: 
'<|im_start|>system\\n\nYou are designed to help with a variety of tasks, from answering questions     to providing 
summaries to other types of analyses.\n\n## Tools\nYou have access to a wide variety of tools. You are responsible for 
using\nthe tools in any sequence you deem appropriate to complete the task at hand.\nThis may require breaking the task 
into subtasks and using different tools\nto complete each subtask.\n\nYou have access to the following tools:\n> Tool Name: 
multiply\nTool Description: Multiple two integers and returns the result integer\nTool Args: {"type": "object", "properties": {"a": 
{"title": "A", "type": "integer"}, "b": {"title": "B", "type": "integer"}}, "required": ["a", "b"]}\n\n> Tool Name: add\nTool Description: 
Add two integers and returns the result integer\nTool Args: {"type": "object", "properties": {"a": {"title": "A", "type": "integer"}, 
"b": {"title": "B", "type": "integer"}}, "required": ["a", "b"]}\n\n\n## Output Format\nTo answer the question, please use the 
following format.\n\n```\nThought: I need to use a tool to help me answer the question.\nAction: tool name (one of multiply, 
add) if using a tool.\nAction Input: the input to the tool, in a JSON format representing the kwargs (e.g. {"input": "hello 
world", "num_beams": 5})\n```\n\nPlease ALWAYS start with a Thought.\n\nPlease use a valid JSON format for the Action 
Input. Do NOT do this {\'input\': \'hello world\', \'num_beams\': 5}.\n\nIf this format is used, the user will respond in the 
following format:\n\n```\nObservation: tool response\n```\n\nYou should keep repeating the above format until you have 
enough information\nto answer the question without using any more tools. At that point, you MUST respond\nin the one of 
the following two formats:\n\n```\nThought: I can answer without using any more tools.\nAnswer: [your answer 
here]\n```\n\n```\nThought: I cannot answer the question with the provided tools.\nAnswer: Sorry, I cannot answer your 
query.\n```\n\n## Current Conversation\nBelow is the current conversation consisting of interleaving human and assistant 
messages.\n\n<|im_end|>\\n<|im_start|>user\\nWhat is 20+(2*4) ? Calculate step by step 
<|im_end|>\\n<|im_start|>assistant', prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, 
presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, top_a=0.0, 
min_p=0.0, tfs=1.0, eta_cutoff=0.0, epsilon_cutoff=0.0, typical_p=1.0, mirostat_mode=0, mirostat_tau=0.0, 
mirostat_eta=0.0, dynatemp_range=0.0, dynatemp_exponent=1.0, smoothing_factor=0.0, use_beam_search=False, 
length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, 
ignore_eos=False, max_tokens=3391, custom_token_bans=[], logprobs=None, prompt_logprobs=None, 
skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 523, 28766, 321, 28730, 2521, 
28766, 28767, 6574, 28756, 28711, 13, 1976, 460, 5682, 298, 1316, 395, 264, 6677, 302, 9796, 28725, 477, 24402, 4224, 
260, 298, 7501, 18062, 497, 298, 799, 4514, 302, 21974, 274, 28723, 13, 13, 1064, 26258, 13, 1976, 506, 2735, 298, 264, 
5335, 6677, 302, 7040, 28723, 995, 460, 7332, 354, 1413, 13, 1237, 7040, 297, 707, 7768, 368, 340, 366, 7658, 298, 4160, 
272, 3638, 438, 1021, 28723, 13, 3260, 993, 2699, 11313, 272, 3638, 778, 1083, 21128, 304, 1413, 1581, 7040, 13, 532, 
4160, 1430, 1083, 5553, 28723, 13, 13, 1976, 506, 2735, 298, 272, 2296, 7040, 28747, 13, 28767, 12877, 6620, 28747, 
17669, 346, 13, 6778, 10220, 28747, 9713, 4191, 989, 3113, 7850, 304, 5723, 272, 1204, 11584, 13, 6778, 24997, 28747, 
9830, 1123, 1264, 345, 2814, 548, 345, 10723, 1264, 9830, 28708, 1264, 9830, 3901, 1264, 345, 28741, 548, 345, 1123, 
1264, 345, 14296, 7706, 345, 28726, 1264, 9830, 3901, 1264, 345, 28760, 548, 345, 1123, 1264, 345, 14296, 28739, 
10781, 345, 10893, 1264, 7367, 28708, 548, 345, 28726, 2242, 28752, 13, 13, 28767, 12877, 6620, 28747, 967, 13, 6778, 
10220, 28747, 3301, 989, 3113, 7850, 304, 5723, 272, 1204, 11584, 13, 6778, 24997, 28747, 9830, 1123, 1264, 345, 2814, 
548, 345, 10723, 1264, 9830, 28708, 1264, 9830, 3901, 1264, 345, 28741, 548, 345, 1123, 1264, 345, 14296, 7706, 345, 
28726, 1264, 9830, 3901, 1264, 345, 28760, 548, 345, 1123, 1264, 345, 14296, 28739, 10781, 345, 10893, 1264, 7367, 
28708, 548, 345, 28726, 2242, 28752, 13, 13, 13, 1064, 15985, 18748, 13, 1551, 4372, 272, 2996, 28725, 4665, 938, 272, 
2296, 5032, 28723, 13, 13, 13940, 28832, 13, 1227, 1322, 28747, 315, 927, 298, 938, 264, 3921, 298, 1316, 528, 4372, 
272, 2996, 28723, 13, 3795, 28747, 3921, 1141, 325, 538, 302, 17669, 346, 28725, 967, 28731, 513, 1413, 264, 3921, 
28723, 13, 3795, 11232, 28747, 272, 2787, 298, 272, 3921, 28725, 297, 264, 9292, 5032, 14030, 272, 23197, 325, 28706, 
28723, 28721, 28723, 9830, 2537, 1264, 345, 21558, 1526, 548, 345, 2575, 28730, 1105, 5322, 1264, 28705, 28782, 1542, 
13, 13940, 28832, 13, 13, 12069, 10461, 26548, 28735, 1149, 395, 264, 26142, 28723, 13, 13, 12069, 938, 264, 3716, 
9292, 5032, 354, 272, 9624, 11232, 28723, 2378, 5457, 511, 456, 12012, 2537, 1869, 464, 21558, 1526, 647, 464, 2575, 
28730, 1105, 5322, 1869, 28705, 28782, 2051, 13, 13, 3381, 456, 5032, 349, 1307, 28725, 272, 2188, 622, 9421, 297, 272, 
2296, 5032, 28747, 13, 13, 13940, 28832, 13, 23044, 352, 28747, 3921, 2899, 13, 13940, 28832, 13, 13, 1976, 1023, 1840, 
5683, 1077, 272, 2747, 5032, 1996, 368, 506, 2066, 1871, 13, 532, 4372, 272, 2996, 1671, 1413, 707, 680, 7040, 28723, 
1794, 369, 1305, 28725, 368, 351, 11080, 9421, 13, 262, 272, 624, 302, 272, 2296, 989, 23468, 28747, 13, 13, 13940, 
28832, 13, 1227, 1322, 28747, 315, 541, 4372, 1671, 1413, 707, 680, 7040, 28723, 13, 2820, 16981, 28747, 733, 19262, 
4372, 1236, 28793, 13, 13940, 28832, 13, 13, 13940, 28832, 13, 1227, 1322, 28747, 315, 3573, 4372, 272, 2996, 395, 272, 
3857, 7040, 28723, 13, 2820, 16981, 28747, 19385, 28725, 315, 3573, 4372, 574, 5709, 28723, 13, 13940, 28832, 13, 13, 
1064, 10929, 1325, 25422, 13, 20548, 336, 349, 272, 1868, 7114, 20922, 302, 791, 291, 1652, 2930, 304, 13892, 8570, 
28723, 13, 13, 28789, 28766, 321, 28730, 416, 28766, 11266, 28711, 28789, 28766, 321, 28730, 2521, 28766, 28767, 
1838, 28756, 28711, 3195, 349, 28705, 28750, 28734, 24993, 28750, 28736, 28781, 28731, 1550, 2984, 16914, 3707, 486, 
3707, 523, 28766, 321, 28730, 416, 28766, 11266, 28711, 28789, 28766, 321, 28730, 2521, 28766, 28767, 489, 11143], 
lora_request: None.

Configuration of the internal port of the docker container

Hi, nice and great job! I'm wondering if you could add something like this:

echo 'Starting Aphrodite Engine API server...'

CMD="python3 -m aphrodite.endpoints.${ENDPOINT:-openai}.api_server
             --host 0.0.0.0
             ${SELECTED_PORT:+--port $SELECTED_PORT}
             --download-dir ${HF_HOME}"

In certain scenarios with multiple instances of the engine, it would be useful to control the internal listening port of the container.

Regards!
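For reference, the suggested snippet relies on POSIX `${VAR:+...}` parameter expansion, which emits the `--port` flag only when `SELECTED_PORT` is set and non-empty:

```shell
unset SELECTED_PORT
echo "args: ${SELECTED_PORT:+--port $SELECTED_PORT}"   # prints "args: "
SELECTED_PORT=2242
echo "args: ${SELECTED_PORT:+--port $SELECTED_PORT}"   # prints "args: --port 2242"
```

This way a container started without the variable falls back to the server's default port.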

'activation_ops' circular import

Attempted to run the example code in the README.md.
Model: Metharme-13b-GPTQ

Traceback (most recent call last):
  File "/mnt/Storage/ai-dev/aphrodite-engine/inference_test.py", line 1, in <module>
    from aphrodite import LLM, SamplingParams
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/__init__.py", line 2, in <module>
    from aphrodite.engine.async_aphrodite import AsyncAphrodite
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 7, in <module>
    from aphrodite.engine.aphrodite_engine import AphroditeEngine
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 14, in <module>
    from aphrodite.task_handler.worker import Worker
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/task_handler/worker.py", line 5, in <module>
    from aphrodite.modeling import get_model, InputMetadata, set_random_seed
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/modeling/__init__.py", line 2, in <module>
    from aphrodite.modeling.loader import get_model
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/modeling/loader.py", line 7, in <module>
    from aphrodite.modeling.models import LlamaForCausalLM, GPTJForCausalLM, GPTNeoXForCausalLM
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/modeling/models/__init__.py", line 1, in <module>
    from aphrodite.modeling.models.llama import LlamaForCausalLM
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/modeling/models/llama.py", line 33, in <module>
    from aphrodite.modeling.layers.activation import SiluAndMul
  File "/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/modeling/layers/activation.py", line 4, in <module>
    from aphrodite import activation_ops
ImportError: cannot import name 'activation_ops' from partially initialized module 'aphrodite' (most likely due to a circular import) (/mnt/Storage/ai-dev/aphrodite-engine/aphrodite/__init__.py)
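One common way to break an import cycle like this is to defer the import to call time, so the package `__init__.py` can finish initializing before the compiled extension is looked up. A generic sketch of the pattern, using `math` as a stand-in for the `activation_ops` module (the project's actual fix is not confirmed here):

```python
# Resolving the dependency inside the function body avoids touching the
# partially initialized package at module import time.
def activation(x: float) -> float:
    import math  # stand-in for `from aphrodite import activation_ops`
    silu = x * (1.0 / (1.0 + math.exp(-x)))
    return silu * x  # illustrative silu-and-mul style computation

print(activation(0.0))  # -> 0.0
```

Note that this traceback can also appear when running scripts from the source tree before the CUDA extensions are built, so checking the build first is worthwhile.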

Request: Better CLI control over max CTX and rope scaling

Suggestion: adjust things so that the CLI can supersede the config in order to set a higher context (via the existing max ctx len argument) and enable RoPE scaling (through a new argument).

Reasoning: modifying configuration files to enable a higher ctx len is a less-than-ideal solution. At best it's awkward; at worst, cloud-based pip installs place the files in awkward locations and use inconsistent workspace locations.

[Bug]: Pydantic serializer issue when pinging /v1/models

Your current environment

N/A

🐛 Describe the bug

When sending a GET request to the /v1/models endpoint for the first time, it'll output a pydantic serializer warning, with no tracebacks:

/home/anon/.local/lib/python3.10/site-packages/pydantic/main.py:314: UserWarning: Pydantic serializer warnings:
  Expected `str` but got `bool` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(

Can't think of why this happens.
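A minimal way to reproduce this class of warning (the `object: str` field and the use of `model_construct` are my assumptions about how an unvalidated bool could reach serialization, not the actual response schema):

```python
import warnings
from pydantic import BaseModel

# Hypothetical schema: a field annotated `str` that actually holds a bool.
# `model_construct` bypasses validation, which is one way such a value
# could slip through until the first serialization.
class Card(BaseModel):
    object: str

card = Card.model_construct(object=True)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    card.model_dump()  # "Expected `str` but got `bool` - serialized value..."
print([str(w.message) for w in caught])
```

If this is the mechanism, the fix would be either correcting the field annotation or validating the value before it is cached.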

[Usage]: Question about VRAM requirement and temperature

Your current environment

The output of `python env.py`
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect 
CMake version: version 3.27.0
Libc version: glibc-2.35
Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
GPU 4: NVIDIA GeForce RTX 4090
GPU 5: NVIDIA GeForce RTX 4090
GPU 6: NVIDIA GeForce RTX 4090
GPU 7: NVIDIA GeForce RTX 4090

Nvidia driver version: 535.161.07
cuDNN version: Could not collect 
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             64
On-line CPU(s) list:                0-63
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7302 16-Core Processor
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          2
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        3000.0000
CPU min MHz:                        1500.0000
BogoMIPS:                           5999.92
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization:                     AMD-V
L1d cache:                          1 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           16 MiB (32 instances)
L3 cache:                           256 MiB (16 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-15,32-47
NUMA node1 CPU(s):                  16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] blas                      2.16                        mkl    conda-forge
[conda] libblas                   3.8.0                    16_mkl    conda-forge
[conda] libcblas                  3.8.0                    16_mkl    conda-forge
[conda] liblapack                 3.8.0                    16_mkl    conda-forge
[conda] liblapacke                3.8.0                    16_mkl    conda-forge
[conda] mkl                       2020.2                      256  
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     2.2.0                    pypi_0    pypi
[conda] torchtriton               2.2.0                     py311    pytorch
ROCM Version: Could not collect 
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

How would you like to use Aphrodite?

I want to run this (https://huggingface.co/Qwen/Qwen1.5-14B-Chat).
I used following cmd in exllamaV2 to convert the model to exl2 format in 8.0 bit.

CUDA_VISIBLE_DEVICES=3 python convert.py \
    -i /home/by/llm/base_models/Qwen1.5-14B-Chat \
    -o /home/by/llm/base_models/Qwen1.5-14B-Chat-exl2 \
    -cf /home/by/llm/base_models/Qwen1.5-14B-Chat-exl2/8bpw/ \
    -b 8.0 \
    -hb 8

then I serve the api using

CUDA_VISIBLE_DEVICES=1 ./runtime.sh python -m aphrodite.endpoints.openai.api_server \
    --model /home/by/llm/base_models/Qwen1.5-14B-Chat-exl2/8bpw \
    --gpu-memory-utilization 1 \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 8000 \
    --served-model-name qwen1.5-14b-chat \
    --quantization exl2 \
    --port 2242 \
    --max-num-batched-tokens 8000 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --disable-log-requests \
    --host 0.0.0.0

This uses nearly all of the 4090's VRAM (24212 MiB / 24564 MiB).
I notice in the log that the model itself only takes 14.07 GB. It seems the KV cache takes a lot of VRAM, so I cannot set max-model-len above 8k. However, when I use TabbyAPI with the same exl2 model, I can comfortably use up to 20k context without issue. Is it by design that batching takes more VRAM, so less context can be used?

Another question is about temperature when requesting. Here is my request json to http://localhost:2242/v1/chat/completions

{
    "model": "qwen1.5-14b-chat",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Please describe spring"
        }
    ],
    "temperature": 0,
    "max_tokens": 400
}

In my understanding, setting the temperature to 0 should produce the same, or at least very similar, responses. However, I am getting very different responses from the model. Are there other settings I should use if I want the response to be essentially the same every time I send the same input?
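For context on why temperature 0 is usually expected to be deterministic: most engines treat it as greedy decoding (argmax over the logits), which a small sketch makes concrete. This is a generic illustration of the convention, not Aphrodite's actual sampler code:

```python
import math
import random

def pick_token(logits, temperature, rng=random):
    """Greedy decode at temperature 0; stochastic softmax sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.7]
# Same logits -> same choice, every call:
print(all(pick_token(logits, 0) == 1 for _ in range(5)))  # True
```

Greedy selection is only deterministic if the logits themselves are identical run to run; batched CUDA kernels are not bit-exact, so continuous batching can perturb the logits slightly and change which token wins a near-tie. If the server exposes a `seed` sampling parameter, setting it (together with temperature 0) may also help, but small variations from batching can remain.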

Request: Have Dockerfile use the current branch

It looks like the Dockerfile installs the PyPI version instead of the local branch. Can you change it so that the app runs fully containerized? I am running this on a Kubernetes cluster and getting errors that the Docker container does not have CUDA installed; any modifications I make to the GitHub checkout are also not taken into account.
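The change being requested would look roughly like the fragment below. This is an illustrative sketch only; the base image tag, working directory, and installation step are assumptions, not the repository's actual Dockerfile:

```dockerfile
# Illustrative fragment -- not the repository's actual Dockerfile.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
# (Python/pip setup omitted for brevity.)
WORKDIR /app/aphrodite-engine
COPY . .
# Install the local checkout instead of the PyPI release,
# so branch modifications are picked up at build time.
RUN pip install -e .
```

Building with the repository root as the build context (`docker build .`) would then bake the current branch into the image.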

Bigger VRAM footprint after update?

Environment is Google Colab, Tesla T4, either with CUDA 12.1 or downgraded to 11.8

aphrodite-engine==0.4.2 runs 13B GPTQ models without an issue, but after updating to 0.4.5 it consistently crashes with torch.cuda.OutOfMemoryError

[Bug]: `ValueError: Out of range float values are not JSON compliant` when requesting logprobs from awq model

Your current environment

wget https://raw.githubusercontent.com/PygmalionAI/aphrodite-engine/env.py
400: Invalid request
It's a fresh conda environment for py311+torch220+aphrodite 0.5.0

🐛 Describe the bug

launch aphrodite with
python -m aphrodite.endpoints.openai.api_server --host 127.0.0.1 --port 5000 --dtype float16 --max-log-len 0 --block-size 16 -tp 4 --gpu-memory-utilization 1.0 --model Qwen/Qwen1.5-72B-Chat-AWQ -q awq --max-model-len 14496 --enforce-eager --kv-cache-dtype auto --served-model-name Qwen1.5-72B-Chat-AWQ

Test script:

curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk" \
-d '{
  "model": "Qwen1.5-72B-Chat-AWQ",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ],
  "stream": false,
  "logprobs": true,
  "top_logprobs": 10
}'

Result (aphrodite backtrace):

  ......
  Expected `int` but got `str` - serialized value may not be as expected
  Expected `int` but got `str` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
INFO:     127.0.0.1:47572 - "POST /v1/chat/completions HTTP/1.1" 500
ERROR:    Exception in ASGI application
ERROR:      + Exception Group Traceback (most recent call last):
ERROR:      |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
ERROR:      |     yield
ERROR:      |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 190, in __call__
ERROR:      |     async with anyio.create_task_group() as task_group:
ERROR:      |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
ERROR:      |     raise BaseExceptionGroup(
ERROR:      | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR:      +-+---------------- 1 ----------------
ERROR:        | Traceback (most recent call last):
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
ERROR:        |     result = await app(  # type: ignore[func-returns-value]
ERROR:        |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
ERROR:        |     return await self.app(scope, receive, send)
ERROR:        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
ERROR:        |     await super().__call__(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
ERROR:        |     await self.middleware_stack(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
ERROR:        |     raise exc
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
ERROR:        |     await self.app(scope, receive, _send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 189, in __call__
ERROR:        |     with collapse_excgroups():
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/contextlib.py", line 158, in __exit__
ERROR:        |     self.gen.throw(typ, value, traceback)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
ERROR:        |     raise exc
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 191, in __call__
ERROR:        |     response = await self.dispatch_func(request, call_next)
ERROR:        |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/Programs/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 539, in authentication
ERROR:        |     return await call_next(request)
ERROR:        |            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 165, in call_next
ERROR:        |     raise app_exc
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 151, in coro
ERROR:        |     await self.app(scope, receive_or_disconnect, send_no_error)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
ERROR:        |     await self.app(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
ERROR:        |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR:        |     raise exc
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR:        |     await app(scope, receive, sender)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
ERROR:        |     await self.middleware_stack(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
ERROR:        |     await route.handle(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 299, in handle
ERROR:        |     await self.app(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 79, in app
ERROR:        |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR:        |     raise exc
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR:        |     await app(scope, receive, sender)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
ERROR:        |     response = await func(request)
ERROR:        |                ^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
ERROR:        |     raw_response = await run_endpoint_function(
ERROR:        |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
ERROR:        |     return await dependant.call(**values)
ERROR:        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/Programs/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 246, in create_chat_completion
ERROR:        |     return JSONResponse(content=generator.model_dump())
ERROR:        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 183, in __init__
ERROR:        |     super().__init__(content, status_code, headers, media_type, background)
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 41, in __init__
ERROR:        |     self.body = self.render(content)
ERROR:        |                 ^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 186, in render
ERROR:        |     return json.dumps(
ERROR:        |            ^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/__init__.py", line 238, in dumps
ERROR:        |     **kw).encode(obj)
ERROR:        |           ^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/encoder.py", line 200, in encode
ERROR:        |     chunks = self.iterencode(o, _one_shot=True)
ERROR:        |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:        |   File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/encoder.py", line 258, in iterencode
ERROR:        |     return _iterencode(o, 0)
ERROR:        |            ^^^^^^^^^^^^^^^^^
ERROR:        | ValueError: Out of range float values are not JSON compliant
ERROR:        +------------------------------------
ERROR:
ERROR:    During handling of the above exception, another exception occurred:
ERROR:
ERROR:    Traceback (most recent call last):
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
ERROR:        result = await app(  # type: ignore[func-returns-value]
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
ERROR:        return await self.app(scope, receive, send)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
ERROR:        await super().__call__(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
ERROR:        await self.middleware_stack(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
ERROR:        raise exc
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
ERROR:        await self.app(scope, receive, _send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 189, in __call__
ERROR:        with collapse_excgroups():
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/contextlib.py", line 158, in __exit__
ERROR:        self.gen.throw(typ, value, traceback)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
ERROR:        raise exc
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 191, in __call__
ERROR:        response = await self.dispatch_func(request, call_next)
ERROR:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/Programs/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 539, in authentication
ERROR:        return await call_next(request)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 165, in call_next
ERROR:        raise app_exc
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/base.py", line 151, in coro
ERROR:        await self.app(scope, receive_or_disconnect, send_no_error)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
ERROR:        await self.app(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
ERROR:        await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR:        raise exc
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR:        await app(scope, receive, sender)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
ERROR:        await self.middleware_stack(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
ERROR:        await route.handle(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 299, in handle
ERROR:        await self.app(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 79, in app
ERROR:        await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR:        raise exc
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR:        await app(scope, receive, sender)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
ERROR:        response = await func(request)
ERROR:                   ^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
ERROR:        raw_response = await run_endpoint_function(
ERROR:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
ERROR:        return await dependant.call(**values)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/Programs/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 246, in create_chat_completion
ERROR:        return JSONResponse(content=generator.model_dump())
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 183, in __init__
ERROR:        super().__init__(content, status_code, headers, media_type, background)
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 41, in __init__
ERROR:        self.body = self.render(content)
ERROR:                    ^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/site-packages/starlette/responses.py", line 186, in render
ERROR:        return json.dumps(
ERROR:               ^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/__init__.py", line 238, in dumps
ERROR:        **kw).encode(obj)
ERROR:              ^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/encoder.py", line 200, in encode
ERROR:        chunks = self.iterencode(o, _one_shot=True)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/sgsdxzy/miniforge3/envs/aphrodite/lib/python3.11/json/encoder.py", line 258, in iterencode
ERROR:        return _iterencode(o, 0)
ERROR:               ^^^^^^^^^^^^^^^^^
ERROR:    ValueError: Out of range float values are not JSON compliant

Chat completion works fine when not requesting logprobs. The problem occurs only for v1/chat/completions, not for v1/completions, and the GPTQ Q4 quant of the same model doesn't have this problem either.
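The error message is consistent with a non-finite logprob (e.g. `-inf` or `NaN`) reaching a strict JSON encoder: Starlette's `JSONResponse` calls `json.dumps` with `allow_nan=False`, which rejects those values. The stdlib reproduces the exact message (the AWQ-related cause of the non-finite value is only a guess here):

```python
import json

logprob = float("-inf")  # e.g. a token whose probability underflows to zero

# With the default allow_nan=True, the stdlib emits non-standard JSON:
print(json.dumps({"logprob": logprob}))  # {"logprob": -Infinity}

# In strict mode (as Starlette's JSONResponse uses), it raises instead:
try:
    json.dumps({"logprob": logprob}, allow_nan=False)
except ValueError as e:
    print(e)  # Out of range float values are not JSON compliant
```

A common fix on the server side is to clamp or filter non-finite logprobs before serialization.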

Pygmalion-2 generated unexpected responses

I am currently using your latest model, PygmalionAI/pygmalion-2-7b, for inference with plain transformers. However, I have observed instances where the quality of generated sentences is relatively low, with words frequently run together. I am wondering if this issue is related to the format of my input prompts.
My prompt:

<|system|>Enter RP mode. Pretend to be Jake (Firefighter) whose persona follows:
Jake (Firefighter)'s Persona: He is courageous, responsible, protective, resilient, team-oriented.
Jake (Firefighter) is a courageous, responsible, protective, resilient, and team-oriented firefighter.
He is a male. His name is Jake. He is 32 years old. He is a firefighter. He lives in Perth. He loves camping, volunteering and cars. You are a successful novelist in Perth. He rescues you from an apartment fire. You fall in love with him. He hesitates to accept you as his girlfriend as his job is dangerous.
You shall reply to the user while staying in character, and generate long responses.
You: do you know my name?
Jake (Firefighter):

The reply is

Yes actually I knew yours before meeting you personally which was one reason why I accepted this challenge from beginningbecause I wanted tomeet someone whom I hadn't known yetbeforehandsince chancesare high among strangers thatyou could end up being either a friend or foe depending mainlyupon circumstances surrounding each situation

This does not happen every time, but it occurs often.
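One possible cause: the prompt above uses plain "You:" / "Jake (Firefighter):" labels for the dialogue turns, while Pygmalion-2's documented (Metharme) format uses the special `<|system|>`, `<|user|>`, and `<|model|>` role tokens throughout. A small sketch of building a prompt in that format (verify the exact conventions against the model card):

```python
# Build a Metharme-style prompt: persona after <|system|>, each dialogue turn
# tagged with its role token, and <|model|> opening the model's reply.
def build_prompt(persona, turns):
    parts = [f"<|system|>{persona}"]
    for role, text in turns:  # role is "user" or "model"
        parts.append(f"<|{role}|>{text}")
    parts.append("<|model|>")  # the model continues from here
    return "".join(parts)

prompt = build_prompt(
    "Enter RP mode. Pretend to be Jake, a courageous firefighter.",
    [("user", "do you know my name?")],
)
print(prompt)
```

If the model was fine-tuned on these tokens, substituting ad-hoc "Name:" labels can plausibly degrade output quality in the way described.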

[Bug]: Issue when trying to load a AWQ model with --load-in-4bits for mixtral flavors

Your current environment

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: Could not collect 
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect 
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect 
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 7900X 12-Core Processor
CPU family:                         25
Model:                              97
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           2
Frequency boost:                    enabled
CPU max MHz:                        5650,0972
CPU min MHz:                        3000,0000
BogoMIPS:                           9382.48
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          384 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           12 MiB (12 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] No relevant packages 
[conda] Could not collect
ROCM Version: Could not collect 
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

That's the output from my host (I'm running the engine with the official Docker image).

🐛 Describe the bug

When I try to load an AWQ-quantized model with --load-in-4bits and the model is a Mixtral-style MoE, it throws the following stack trace:

(RayWorkerAphrodite pid=1521) INFO:     Memory allocated for converted model: 6.04 GiB
(RayWorkerAphrodite pid=1521) INFO:     Memory reserved for converted model: 6.08 GiB
(RayWorkerAphrodite pid=1521) INFO:     Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
INFO:     Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 118, in __init__
    self._init_cache()
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 321, in _init_cache
    num_blocks = self._run_workers(
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 136, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 758, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 692, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 413, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward
    hidden_states = self.block_sparse_moe(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 172, in forward
    current_hidden_states = expert_layer(hidden_states).mul_(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 105, in forward
    w1_out, _ = self.w1(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 134, in forward
    output = self.linear_method.apply_weights(self.linear_weights, x, bias)
  File "/app/aphrodite-engine/aphrodite/modeling/layers/quantization/bitsandbytes.py", line 186, in apply_weights
    scales_zeros = weights["scales_zeros"].data
KeyError: 'scales_zeros'

entry point command executed inside the docker:
python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --download-dir /data/hub --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ --dtype float16 --kv-cache-dtype fp8_e5m2 --max-model-len 12000 --tensor-parallel-size 2 --gpu-memory-utilization .98 --enforce-eager --block-size 8 --max-paddings 512 --port 3000 --swap-space 10 --chat-template /home/workspace/chat_templates/chat_ml.jinja --served-model-name dolf --max-context-len-to-capture 512 --max-num-batched-tokens 32000 --max-num-seqs 62 --quantization awq --load-in-4bit

Set ooba API Key as argument

I'm using the aphrodite-engine pip package directly from the release tag without building it myself, so I'm unable to set a custom API key for the ooba endpoint since it's hardcoded in the source.

Would it make sense to overwrite it with an argument or directly an env variable?

python -m aphrodite.endpoints.api_server_ooba --api-key <MY_API_KEY>
# or it also could be
API_KEY=<MY_API_KEY> python -m aphrodite.endpoints.api_server_ooba
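A minimal sketch of the proposed fallback behaviour (the flag and env-var names are just the ones suggested above, not an existing aphrodite option):

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: an explicit --api-key flag wins; otherwise the
    # server falls back to the API_KEY environment variable.
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=os.environ.get("API_KEY"))
    return parser

os.environ["API_KEY"] = "sk-from-env"
print(build_parser().parse_args([]).api_key)                        # sk-from-env
print(build_parser().parse_args(["--api-key", "sk-flag"]).api_key)  # sk-flag
```

Since the default is read at parser-construction time, the env variable is picked up on every server start without any extra plumbing.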

Prompts are being interpolated on log output


Traceback (most recent call last):
  File "/home/shazam/.local/lib/python3.9/site-packages/loguru/_handler.py", line 136, in emit
    formatted = precomputed_format.format_map(formatter_record)
KeyError: 'input_text'
--- End of logging error ---
INFO:     Finished request cmpl-67fcb6407f034ffe9e354df8765e9d72.
INFO:     ::1:56946 - "POST /v1/chat/completion

I saw this on my screen. My few-shot prompt contains ${input_text}. For whatever reason, loguru tries to interpolate it, and it fails when it can't.
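The failure can be reproduced without loguru at all, since the crash bottoms out in plain `str.format_map`. A minimal sketch of the problem and the safe pattern (variable names are illustrative, not aphrodite's actual logging code):

```python
# The handler ultimately calls str.format_map on its format string. If user
# text with braces leaks into that format string, the braces are parsed as
# field names that the log record does not contain.
record = {"message": "hello", "level": "INFO"}
prompt = "Few-shot template with ${input_text}"

try:
    ("Prompt: " + prompt).format_map(record)
except KeyError as exc:
    print("KeyError:", exc)  # KeyError: 'input_text'

# Safe pattern: keep user text out of the format string and substitute it as
# a value instead (with loguru this is: logger.info("Prompt: {}", prompt)).
safe = "Prompt: {}".format(prompt)
print(safe)
```

Passing the prompt as an argument means its braces are never parsed as format fields, so arbitrary user text can no longer break the handler.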

[Bug]: openAI endpoint crashing on "no locator available"

Your current environment

🐛 Describe the bug

Getting this error when starting an openAI endpoint on the docker container:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 30, in <module>
    from aphrodite.endpoints.openai.serving_chat import OpenAIServingChat
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/serving_chat.py", line 16, in <module>
    from aphrodite.modeling.outlines_decoding import get_guided_decoding_logits_processor
  File "/app/aphrodite-engine/aphrodite/modeling/outlines_decoding.py", line 12, in <module>
    from aphrodite.modeling.outlines_logits_processors import JSONLogitsProcessor, RegexLogitsProcessor
  File "/app/aphrodite-engine/aphrodite/modeling/outlines_logits_processors.py", line 24, in <module>
    from outlines.fsm.fsm import RegexFSM
  File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 2, in <module>
    import outlines.generate
  File "/usr/local/lib/python3.10/dist-packages/outlines/generate/__init__.py", line 2, in <module>
    from .cfg import cfg
  File "/usr/local/lib/python3.10/dist-packages/outlines/generate/cfg.py", line 3, in <module>
    from outlines.fsm.guide import CFGGuide
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/guide.py", line 9, in <module>
    from outlines.fsm.regex import create_fsm_index_tokenizer, make_deterministic_fsm
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 96, in <module>
    def create_fsm_info(
  File "/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py", line 229, in wrapper
    disp.enable_caching()
  File "/usr/local/lib/python3.10/dist-packages/numba/core/dispatcher.py", line 856, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'create_fsm_info': no locator available for file '/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py'
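A workaround reported in another issue on this page is to point numba at a writable cache directory before launching the server; a hedged sketch (the path is only an example, any writable directory works):

```shell
# numba cannot cache JIT-compiled outlines functions without a usable cache
# location; NUMBA_CACHE_DIR may point at any writable directory.
export NUMBA_CACHE_DIR=/tmp/numba_cache
mkdir -p "$NUMBA_CACHE_DIR"
# then start the server as usual, e.g.:
# python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000
```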

ModuleNotFoundError: No module named 'aphrodite.common.logits'

After installation and running, an error was encountered.
The installation process is normal, with no errors.

Traceback (most recent call last):
  File "/home/yixuan/.conda/envs/aph/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yixuan/.conda/envs/aph/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/yixuan/.conda/envs/aph/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 33, in <module>
    from aphrodite.common.logits import BiasLogitsProcessor
ModuleNotFoundError: No module named 'aphrodite.common.logits'

Possible circular import issue

I setup the env using miniconda.

Then I run:

(aphrodite) ubuntu@10-7-133-50:~/git/aphrodite-engine$ python -m aphrodite.endpoints.api_server_ooba --help

I get an error:

/home/ubuntu/miniconda3/envs/aphrodite/bin/python: Error while finding module specification for 'aphrodite.endpoints.api_server_ooba' (ImportError: cannot import name 'cuda_utils' from partially initialized module 'aphrodite' (most likely due to a circular import) (/home/ubuntu/git/aphrodite-engine/aphrodite/__init__.py))

Pygmalion6b generate unexpected texts

my command
python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-6b --host 0.0.0.0

my request

{
    "model": "PygmalionAI/pygmalion-6b",
    "prompt": "David (boxer)'s Persona: His speaking style is flirty and virile.\nDavid is a virile boxer.\nHe's a virile boxer. He's handsome and muscular. He is fierce when standing in the boxing ring. He treats you passionately.\n<START>\nDavid (boxer): Hey, there! Little one? You know this is a fighting club...*teases*\nYou: *confused* I thought this is the restaurant where my friends are waiting for me.\nDavid (boxer): *chuckles* It doesn't matter. Since you come here, why not walk into the ring and join the training with me. I can teach you if you would like to.\nYou: But I have never learnt boxing before...I'm afraid...\nDavid (boxer): *approaches you* Come on! Come here!\nYou: Where should I get started?\nDavid (boxer): *smile teasingly* Where would you want to get started?\nYou: hi\nDavid (boxer):",
    "temperature": 0.7,
    "max_tokens": 64,
    "n": 4,
    "top_k": 20,
    "top_p": 0.725,
    "stop": [
        "You:"
    ],
    "frequency_penalty": 1.2
}

but I get the below response

{
    "id": "cmpl-ab88bae3a411499984f5ed4e9debec9a",
    "object": "text_completion",
    "created": 1691497152,
    "model": "PygmalionAI/pygmalion-6b",
    "choices": [
        {
            "index": 2,
            "text": "<brillusteredos and seductive.\nWillowyos of the same height\n\n  =eoff with a demonessus! Tan. He loves youllowy! <bodiespeppy of all things \"I shall wear outtaest of English\nnaked away from... erot",
            "logprobs": null,
            "finish_reason": "length"
        },
        {
            "index": 0,
            "text": "<brilluste.\n<brillowyosized, \n\n\nFatalize in a wolf-Oopsiechosen by the boyish <baldusvy*epeersuestarried out of \"The hunter \"llotsukiemosyounessy... <",
            "logprobs": null,
            "finish_reason": "length"
        },
        {
            "index": 1,
            "text": "<brick.\n<brillummonosickekafyosized <browsdontasked out, and mysterical outfit <3! FlarxScenic there is an agent of all overjoysusuestraiseseeklyotsuessyounllow",
            "logprobs": null,
            "finish_reason": "length"
        },
        {
            "index": 3,
            "text": "<brilluste\n<brutkuus.\nGraceosized, ily\n\n llipsarrival of the \"kidnervusy! Hooker up to youllowsed up to beepyouseeekosuest! Ilya head of us!llowy...",
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 220,
        "total_tokens": 476,
        "completion_tokens": 256
    }
}

As you can see, the outputs are garbled. What should I do to get correct outputs?

`RuntimeError: CUDA unknown error` on Runpod (but works fine on local machine)

This works fine on my local machine, but if I put alpindale/aphrodite-engine in as "Docker Image Name" on Runpod with a RTX 4090, and set the volume mount path to /app, then I get this error:

2024-03-13T13:36:25.168094375Z [FATAL tini (7)] exec /app/aphrodite-engine/docker/entrypoint.sh failed: No such file or directory

Looking at the Dockerfile, that error does kinda make sense (from the perspective of a docker noob), since the repo was cloned to /tmp/aphrodite-engine and then deleted. I'm not experienced enough with this stuff to know how it gets to /app/aphrodite-engine so that it works fine on my local machine. If I change the volume mount path to something else like /volume, then that seems to fix it, but I then get this error:

2024-03-13T13:40:40.885775454Z Starting Aphrodite Engine API server...
2024-03-13T13:40:40.885865114Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --enforce-eager
2024-03-13T13:40:42.730822161Z INFO:     Initializing the Aphrodite Engine (v0.5.0) with the following config:
2024-03-13T13:40:42.730838301Z INFO:     Model = 'EleutherAI/pythia-70m-deduped'
2024-03-13T13:40:42.730841261Z INFO:     DataType = torch.float16
2024-03-13T13:40:42.730843961Z INFO:     Model Load Format = auto
2024-03-13T13:40:42.730846471Z INFO:     Number of GPUs = 1
2024-03-13T13:40:42.730848921Z INFO:     Disable Custom All-Reduce = False
2024-03-13T13:40:42.730851141Z INFO:     Quantization Format = None
2024-03-13T13:40:42.730853271Z INFO:     Context Length = 2048
2024-03-13T13:40:42.730855451Z INFO:     Enforce Eager Mode = True
2024-03-13T13:40:42.730857601Z INFO:     KV Cache Data Type = auto
2024-03-13T13:40:42.730859621Z INFO:     KV Cache Params Path = None
2024-03-13T13:40:42.730862071Z INFO:     Device = cuda
2024-03-13T13:40:44.398675257Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-03-13T13:40:44.417161313Z Traceback (most recent call last):
2024-03-13T13:40:44.417174093Z   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-03-13T13:40:44.417176513Z     return _run_code(code, main_globals, None,
2024-03-13T13:40:44.417178273Z   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-03-13T13:40:44.417180393Z     exec(code, run_globals)
2024-03-13T13:40:44.417182453Z   File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
2024-03-13T13:40:44.417184263Z     engine = AsyncAphrodite.from_engine_args(engine_args)
2024-03-13T13:40:44.417185943Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
2024-03-13T13:40:44.417188133Z     engine = cls(parallel_config.worker_use_ray,
2024-03-13T13:40:44.417190063Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
2024-03-13T13:40:44.417191883Z     self.engine = self._init_engine(*args, **kwargs)
2024-03-13T13:40:44.417193573Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
2024-03-13T13:40:44.417195483Z     return engine_class(*args, **kwargs)
2024-03-13T13:40:44.417197173Z   File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
2024-03-13T13:40:44.417198883Z     self._init_workers()
2024-03-13T13:40:44.417201173Z   File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 154, in _init_workers
2024-03-13T13:40:44.417202853Z     self._run_workers("init_model")
2024-03-13T13:40:44.417204533Z   File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1025, in _run_workers
2024-03-13T13:40:44.417206143Z     driver_worker_output = getattr(self.driver_worker,
2024-03-13T13:40:44.417207793Z   File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 93, in init_model
2024-03-13T13:40:44.417209403Z     torch.cuda.set_device(self.device)
2024-03-13T13:40:44.417211033Z   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 408, in set_device
2024-03-13T13:40:44.417212673Z     torch._C._cuda_setDevice(device)
2024-03-13T13:40:44.417214343Z   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 302, in _lazy_init
2024-03-13T13:40:44.417215953Z     torch._C._cuda_init()
2024-03-13T13:42:17.392754775Z RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

The only other options in the Runpod template are env variable NUMBA_CACHE_DIR=/tmp/numba_cache to solve this issue and ENFORCE_EAGER=true to prevent OOM. I also tried adding GPU_MEMORY_UTILIZATION=0.7, but that didn't help.

As I mentioned, it works fine on my local machine, and the error isn't giving me much of a clue, so I'm not sure what's going on here.

In case it's relevant: I'm not sure how to set --shm-size on Runpod, but apparently Runpod sets it to 50% of available system RAM by default. The machine I'm testing on has 61 GB RAM.

Wondering if anyone knows what might be causing this?


Please feel free to close this if it's likely to be some weird thing that Runpod is doing on their end, and not something that can/should be fixed by the code in this repo - in that case I'll ask Runpod support about it and post any answer back here in case others run into this issue too.

Bad generation with GGUF and OpenAI api

Hi

I tried to generate some text using a mixtral instruct GGUF model but the model only predicts nonsense.
Something is either wrong with the tokenizer or the chat template.
I tried to convert the model manually using this script but I get the same behavior.

python -m aphrodite.endpoints.openai.api_server  \
    --model "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" \
    --tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1" \
    --quantization "gguf" \
    --port 8001 \
    --host 0.0.0.0 \
    --dtype "half" \
    --served-model-name mixtral \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --kv-cache-dtype auto \
    --seed 123 \
    --max-num-seqs 1 \
    --enforce-eager 

Edit: using the pip package (v0.5.0)
Edit2: building from source leads to this error

File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/vocab_parallel_embedding.py", line 123, in forward
    output_parallel = self.linear_method.apply_embedding(
  File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/quantization/gguf.py", line 152, in apply_embedding
    dequant = ops.ggml_dequantize(quant, weight_type, hidden_size,
RuntimeError: Unknown layout

Infinite hang on example prompt. Using AWQ quantization

Here is the code I am attempting to run

from aphrodite import LLM, SamplingParams

prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]

sampling_params = SamplingParams(temperature=1.1, min_p=0.05)

llm = LLM(model="/home/senku_AWQ", tensor_parallel_size=2, max_model_len=4096, enforce_eager=True, max_num_seqs=4)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")

I get the following output

WARNING 02-26 23:34:19 config.py:179] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-02-26 23:34:21,380	INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-26 23:34:21 aphrodite_engine.py:77] Initializing the Aphrodite Engine with the following config:
INFO 02-26 23:34:21 aphrodite_engine.py:77] Model = '/home/senku_AWQ'
INFO 02-26 23:34:21 aphrodite_engine.py:77] Tokenizer = '/home/senku_AWQ'
INFO 02-26 23:34:21 aphrodite_engine.py:77] tokenizer_mode = auto
INFO 02-26 23:34:21 aphrodite_engine.py:77] revision = None
INFO 02-26 23:34:21 aphrodite_engine.py:77] trust_remote_code = False
INFO 02-26 23:34:21 aphrodite_engine.py:77] DataType = torch.float16
INFO 02-26 23:34:21 aphrodite_engine.py:77] Download Directory = None
INFO 02-26 23:34:21 aphrodite_engine.py:77] Model Load Format = auto
INFO 02-26 23:34:21 aphrodite_engine.py:77] Number of GPUs = 2
INFO 02-26 23:34:21 aphrodite_engine.py:77] Disable Custom All-Reduce = False
INFO 02-26 23:34:21 aphrodite_engine.py:77] Quantization Format = awq
INFO 02-26 23:34:21 aphrodite_engine.py:77] Sampler Seed = 0
INFO 02-26 23:34:21 aphrodite_engine.py:77] Context Length = 4096
INFO 02-26 23:34:21 aphrodite_engine.py:77] Enforce Eager Mode = True
INFO 02-26 23:34:21 aphrodite_engine.py:77] KV Cache Data Type = auto
INFO 02-26 23:34:21 aphrodite_engine.py:77] Device = cuda
INFO 02-26 23:34:21 aphrodite_engine.py:77] Seed = 0
INFO 02-26 23:34:44 aphrodite_engine.py:334] # GPU blocks: 1098, # CPU blocks: 1638
Processed prompts:   0%|          | 0/2 [00:00<?, ?it/s]

The prompts never process and the loading bar stays at 0%. I tried running the identical code in vLLM and it worked fine. I tried this in a Jupyter notebook and in regular Python. The Python version is 3.11.8 on an Ubuntu 22.04 machine. I tried a Yi AWQ model as well and hit the same hang.

[Bug]: exl2 is not auto detected

Your current environment

N/A

🐛 Describe the bug

Loading without specifying --quantization exl2 tries to load the model with quantisation mode None. Manually specifying that it is an exl2 quant works.

[sparsetral and Qwen2idae]: support for mixtral of lora

The model to consider.

https://huggingface.co/serpdotai/sparsetral-16x7B-v2-SPIN_iter1
https://huggingface.co/LoneStriker/sparsetral-16x7B-v2-8.0bpw-h8-exl2/tree/main

https://huggingface.co/hywu/Qwen2idae-16x14B-v1.0

The closest model Aphrodite already supports.

mixtral moe but not quite the same

What's your difficulty of supporting the model you want?

https://arxiv.org/abs/2401.02731

This is a model with 16 LoRA adapters that act as experts.

python -m aphrodite.endpoints.openai.api_server --model /mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/ -tp 2 --api-keys sk-example --trust-remote-code
You are using a model of type sparsetral to instantiate a model of type mistral. This is not supported for all configurations of models and can yield errors.
2024-03-16 16:21:04,398 INFO worker.py:1724 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 673, in from_engine_args
    placement_group = initialize_cluster(parallel_config,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/ray_tools.py", line 111, in initialize_cluster
    raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.

I tried to run it, but it does not seem to work.

"RuntimeError: CUDA error: no kernel image is available for execution on the device" on some cloud configurations

Full log of error in the attached text file
colaberror.txt

Highlight
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Reproduction
https://colab.research.google.com/gist/Pyroserenus/22a52cc762b77bb7f814fbb200c05e74/mythogguf.ipynb#scrollTo=5fe2Ad1O5mlt

PyTorch 2.1

I know this was released less than 24 hours ago, but can this be upgraded to torch 2.1?

Otherwise, the current torch installed via pip is now compiled against CUDA 12.1 and no longer 11.8.

[Bug]: loading model with int8 kv cache chokes

Your current environment

PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect 
CMake version: version 3.27.6
Libc version: glibc-2.35
Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      40 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             64
On-line CPU(s) list:                0-63
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC Processor
CPU family:                         23
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           4890.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1 MiB (32 instances)
L1i cache:                          2 MiB (32 instances)
L2 cache:                           16 MiB (32 instances)
L3 cache:                           64 MiB (8 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect ROCM Version: Could not collect 
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

(aphrodite-runtime) [email protected]:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8

2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = int8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
    self._init_workers_ray(placement_group)
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
    self.driver_worker = Worker(
                         ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
  File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
    self.worker = worker_init_fn()
                  ^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
    lambda rank=rank, local_rank=local_rank: Worker(
                                             ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
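The traceback above boils down to a common Python pitfall: a local variable that is only bound inside a conditional, then used unconditionally. Here is a minimal, hedged reproduction (a hypothetical simplification — not the actual load_kv_quant_params source): when the KV cache dtype is int8 but no kv-quant-params path is supplied, the branch that binds the variable never runs and the append crashes.

```python
def load_kv_quant_params_buggy(num_layers, params_path=None):
    kv_quant_params = []
    for _ in range(num_layers):
        if params_path is not None:
            kv_quant_param = [1.0, 1.0, 1.0, 1.0]  # stand-in for loading from file
        # UnboundLocalError here when params_path is None: the name was never bound
        kv_quant_params.append(kv_quant_param)
    return kv_quant_params


def load_kv_quant_params_fixed(num_layers, params_path=None):
    kv_quant_params = []
    for _ in range(num_layers):
        if params_path is not None:
            kv_quant_param = [1.0, 1.0, 1.0, 1.0]  # stand-in for loading from file
        else:
            # fall back to identity scales instead of leaving the name unbound
            kv_quant_param = [1.0, 1.0, 1.0, 1.0]
        kv_quant_params.append(kv_quant_param)
    return kv_quant_params


try:
    load_kv_quant_params_buggy(2)
except UnboundLocalError as exc:
    print("reproduced:", exc)
```

A fix along these lines (binding a sensible default when no params path is given), or an explicit early error such as "int8 KV cache requires a kv-cache-params path", would turn this crash into an understandable message.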

Well, it turns out that I didn't have enough VRAM to load the model in 16-bit, but I just tried it with --load-in-4bit and the failure is the same. Without the int8 KV cache, the model loads fine:

(aphrodite-runtime) [email protected]:~/aphrodite-engine$  python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit 
WARNING:  bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = bnb
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO:     Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO:     Downloading model weights ['*.safetensors']

INFO:     Memory allocated for converted model: 9.17 GiB
INFO:     Memory reserved for converted model: 9.26 GiB
INFO:     Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB

With --kv-cache-dtype fp8_e5m2 and --load-in-4bit, it also works.

Overcomplicated and unexplained usage for beginners

Hi, I wanted to ask if it's possible to write a simple tutorial or a detailed guide on how to make this work. I've never used WSL before, and all I see on the main page are commands and script lines that I don't know what to do with or where to execute.

I installed WSL and the Aphrodite Engine itself. umamba.exe doesn't work, and when I run runtime.cmd it creates this environment:

"**********************************************************************
** Visual Studio 2019 Developer Command Prompt v16.11.27
** Copyright (c) 2021 Microsoft Corporation


[vcvarsall.bat] Environment initialized for: 'x64'
(windows) C:\AI\Aphrodite Engine (PygmalionAI)\aphrodite-engine-main>"

But I don't know what to do with it; help would be appreciated. I'm trying to get an API key to use with SillyTavern (ooba instead of KoboldAI, if possible).

GGUF IQ quants support

Does Aphrodite support the new IQ-series quants?
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
22 or IQ3_XS : 3.3 bpw quantization
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization

There are lots of new ones.
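For context, the number after each name is bits per weight (bpw), so a rough on-disk size can be estimated as params × bpw / 8. A small sketch using the values transcribed from the list above (estimates ignore GGUF metadata and mixed-precision tensors, so real files will differ somewhat):

```python
# bpw values as listed by llama.cpp's quantize tool (transcribed from above)
IQ_BPW = {
    "IQ1_S": 1.56, "IQ2_XXS": 2.06, "IQ2_XS": 2.31, "IQ2_S": 2.5, "IQ2_M": 2.7,
    "IQ3_XXS": 3.06, "IQ3_XS": 3.3, "IQ3_S": 3.44, "IQ3_M": 3.66,
    "IQ4_XS": 4.25, "IQ4_NL": 4.50,
}


def estimate_gguf_size_gib(n_params: float, quant: str) -> float:
    """Approximate GGUF file size in GiB: parameters * bits-per-weight / 8."""
    return n_params * IQ_BPW[quant] / 8 / 2**30


# e.g. a 34B model at IQ2_XS
print(f"{estimate_gguf_size_gib(34e9, 'IQ2_XS'):.1f} GiB")
```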

Device Side Assertion, Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

It keeps crashing like this.

2024-01-08T15:22:58.377197238+02:00 INFO 01-08 13:22:58 async_aphrodite.py:380] Received request cmpl-33cea4e017504f27a29fd03a852463c7: prompt: '<Instructions>\n-Describe the response of Harper to Faye logically, you must describe the scene with Harper\'s lines or actions. Describe Harper\'s reaction, not Faye\'s\n-You must not describe the dialogue or actions of Faye, since you are in charge of Harper.\n-Allowed genres: vulgar, obscene, drama, action, Mystery, Online RP.\n<END Instructions>\n\n[World settings: 2077 AD/ {Social level: ancient}/ {Civilization level: modern}/ {Magic: False}/ {Cold weapons: True}/ {Guns: True}/ {Electronics: True}/ {nation: False }/ {Neon signs: True}/ {Nuclear Weapons: True}/ {Police: False}/ {Internet Network: False}/ {Radio: True}/ {Desertification: True}/ {fallout: True}/ {powered armor : true}/ {Last Name: False}]\n[\nName: Harper\nSex:Male\nAge: 64\nAppearance: Intense brown eyes, white hair and beard, strict athletic body.\nOccupation: Scavenger Leader\nResidence: One of the rooms at The Married Queen on Lung Beach.\nCurrent temporary residence: Angel\'s Gate on Lung Beach (Emerald-lit white-walled lighthouse in South Vastopol. Top floor has emerald lights. First floor has temporary residential room with desk, surveillance telescope, stove, radio, and small bed/ Inside the lighthouse, there is only Harper\'s room, which has only one bed, and no other rooms. There is only Harper\'s room.)\n\nbackground:\n-When Harper was in her 30s, Harper, a militia member, safeguarded his much younger wife. She affectionately called him "Teacher." 
They later married, and her innocent laughter became his pride her.\n-Former VASA militia member Harper, driven by his wife\'s abduction by raiders known as the Eight Banners, abandoned military service to become a scavenger, dedicated to locating his missing spouse.\n-Years after Harper\'s wife was kidnapped, she was mistaken for a raider by the militia and killed, making Harper hostile to both the militia and the raiders.\n-Scavengers usually run away when they encounter raiders, but Harper and his colleagues counterattack and attack raiders. Harper has lived this very dangerous life for 30 years, but he is still alive.\n-Harper leads the scavenger group "Fisherman\'s Wharf," focused on coastal relic searches. Other scavengers use ships, while Harper commands from Angel\'s Gate, guarding against raiders.\n-Angel\'s Gate is located away from the coast and is connected to the coast by a long embankment. So in the winter, the road from the Lung Beach to the lighthouse is frozen, so Harper lives inside Angel\'s Gate in the winter.\n-Harper hires Faye as a winter companion at Angel\'s Gate, responsible for meals, laundry, warming the bed, cleaning, and Any other services requested by Harper during Harper\'s extended periods alone, Because Harper has to spend long periods of time alone inside Angel\'s Gate. Faye is a cheap worker hired by Harper this winter. Since Faye is not a scavenger, Faye will be in charge of Harper\'s chores.\n\nGoal:\n-Harper aims to thwart winter raids, both by sea and land. His office His houses two rifles, while a machine gun is mounted atop the lighthouse.\n-Harper seeks his deceased wife\'s son, not biologically his, but the offspring of raiders. Despite not being Harper\'s biological son, Harper wants to locate him and inherit the accumulated wealth of his.\n\nTrait:\n- Vulgar: Because Harper lived with scavengers for a long time, his speech became vulgar and impatient. 
Harper has a very impatient personality and gets angry easily.\n-Altruistic: Harper also worked in the militia for a long time, so he is very stubborn and selfless. Due to Harper\'s impatient nature, he quickly feels guilty after losing his temper.\n-Vigilant: Harper is very hostile to raiders and militia. Harper does not preemptively attack the militia, but he is not friendly. But he will attack the raiders mercilessly.\n-Heterosexual: Although Harper uses language that seems to hate homosexuality, he is actually tolerant of homosexuality.\n]\n\n[Name: Faye\nAge: Female young adult.\nOccupation: cheap daily worker\nNote:\n-Faye is a woman with long, messy blonde long hair, thin waist and a hourglass figure body. Sveta has very jiggled feminine curves.\n-Faye was employed by Harper during this winter. Faye was a pickpocket but was captured by the militia and is now in forced labor.\n-Trait: Arrogant, vulgar, laughing easily]\n\n### Response:\nThe sound of a ship arriving nearby echoes through Angel\'s Gate, breaking the icy silence that envelops the lighthouse. Harper, with intense brown eyes and a white beard that contrasts with the snow-covered surroundings, senses the approach and opens the door, stepping onto the creaking stairs.\n\nScavengers, bundled in layers of worn-out clothing, scurry around the ship, unloading crates filled with food ingredients essential for Harper\'s winter sustenance. The air is frigid, and the wind carries the scent of salt from the nearby Lung Beach. The scavengers, weathered by a life of coastal exploration, work efficiently despite the biting cold.\n\nHarper, a strict figure with a well-maintained athletic body, descends the snow-covered stairs with purpose. His impatience and warful demeanor, forged by decades of scavenging and hostility towards raiders and the militia, are evident in the intensity of his gaze.\n\nAs the scavengers continue their tasks, Harper directs his attention to the immediate concern. 
With a no-nonsense tone, he queries, "So, where is my whore who will be staying with me this winter?" His words cut through the crisp air, revealing a hint of the vulgar language that has become second nature to him.\n\nThe scavengers, usually accustomed to the dangers of the coastal scavenger life, appear troubled and stutter in response to Harper\'s inquiry. "Er... Well..." \n\nHarper\'s impatience intensifies, his brow furrowing in anticipation of their explanation. Then the scavengers sigh and gesture to Faye who is still in the ship. “Hey, come here.”\n\n### Instruction:\n"Hello, old man." Faye frowns and gets off the boat onto land.\n\n<Final Instructions>\n-You MUST not describe the dialogue or actions of Faye, since you are in charge of Harper.\n<END Final Instructions>### Response:\n', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.7, frequency_penalty=0.7, repetition_penalty=1.0, temperature=0.95, top_p=1.0, top_k=-1, top_a=0.0, min_p=0.0, tfs=1.0, eta_cutoff=0.0, epsilon_cutoff=0.0, typical_p=1.0, mirostat_mode=0, mirostat_tau=0.0, mirostat_eta=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=400, custom_token_bans=[], logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 523, 6060, 8373, 28767, 13, 28733, 22836, 272, 2899, 302, 23649, 298, 401, 24195, 2085, 1944, 28725, 368, 1580, 6685, 272, 6337, 395, 23649, 28742, 28713, 4715, 442, 6768, 28723, 27984, 23649, 28742, 28713, 10285, 28725, 459, 401, 24195, 28742, 28713, 13, 28733, 1976, 1580, 459, 6685, 272, 19198, 442, 6768, 302, 401, 24195, 28725, 1854, 368, 460, 297, 5685, 302, 23649, 28723, 13, 28733, 23278, 2652, 411, 28747, 10320, 4749, 28725, 16502, 1860, 28725, 13792, 28725, 2992, 28725, 22737, 1193, 28725, 10634, 399, 28753, 28723, 13, 28789, 5000, 3133, 8373, 28767, 13, 13, 28792, 11978, 6472, 
28747, 28705, 28750, 28734, 28787, 28787, 10004, 28748, 371, 28735, 24186, 2184, 28747, 9467, 5865, 371, 28743, 4617, 1837, 2184, 28747, 4638, 5865, 371, 14749, 294, 28747, 8250, 5865, 371, 28743, 738, 10115, 28747, 6110, 5865, 371, 28777, 13716, 28747, 6110, 5865, 371, 28749, 844, 1689, 1063, 28747, 6110, 5865, 371, 28711, 352, 28747, 8250, 443, 28748, 371, 6947, 266, 10090, 28747, 6110, 5865, 371, 28759, 1485, 5595, 816, 377, 1053, 28747, 6110, 5865, 371, 5096, 535, 28747, 8250, 5865, 371, 18531, 299, 9488, 28747, 8250, 5865, 371, 21932, 28747, 6110, 5865, 371, 2715, 930, 2500, 28747, 6110, 5865, 371, 9197, 406, 28747, 6110, 5865, 371, 28435, 21729, 714, 1132, 5865, 371, 7202, 6620, 28747, 8250, 10157, 13, 28792, 13, 952, 28747, 23649, 13, 28735, 720, 28747, 28755, 883, 13, 28741, 490, 28747, 28705, 28784, 28781, 13, 17977, 28747, 4666, 1058, 9060, 2282, 28725, 3075, 3691, 304, 25293, 28725, 8113, 14587, 294, 2187, 28723, 13, 22451, 715, 352, 28747, 2522, 494, 9243, 26144, 13, 1146, 3164, 28747, 2387, 302, 272, 9698, 438, 415, 1471, 1638, 10224, 356, 393, 969, 11404, 28723, 13, 6086, 13415, 18016, 28747, 15878, 28742, 28713, 19986, 356, 393, 969, 11404, 325, 28749, 794, 3165, 28733, 18600, 3075, 28733, 11653, 286, 305, 16190, 1284, 297, 3658, 550, 529, 13376, 28723, 6611, 4366, 659, 5177, 3165, 9416, 28723, 4205, 4366, 659, 13415, 18350, 2003, 395, 9431, 28725, 26146, 24499, 6865, 28725, 28479, 28725, 6480, 28725, 304, 1741, 2855, 28748, 20726, 272, 305, 16190, 1284, 28725, 736, 349, 865, 23649, 28742, 28713, 2003, 28725, 690, 659, 865, 624, 2855, 28725, 304, 708, 799, 9698, 28723, 1387, 349, 865, 23649, 28742, 28713, 2003, 2974, 13, 13, 11563, 28747, 13, 28733, 7477, 23649, 403, 297, 559, 28705, 28770, 28734, 28713, 28725, 23649, 28725, 264, 4116, 515, 4292, 28725, 4972, 20771, 14916, 516, 1188, 9729, 4285, 28723, 985, 21147, 1999, 1987, 713, 345, 28738, 8365, 263, 611, 1306, 2062, 6368, 28725, 304, 559, 17290, 18211, 3246, 516, 14384, 559, 28723, 13, 28733, 
2407, 263, 550, 2109, 28741, 4116, 515, 4292, 23649, 28725, 12215, 486, 516, 4285, 28742, 28713, 534, 670, 445, 486, 21962, 404, 2651, 390, 272, 24182, 365, 24681, 28725, 14818, 5469, 2372, 298, 2727, 264, 752, 494, 9243, 28725, 10383, 298, 1195, 1077, 516, 6925, 25740, 28723, 13, 28733, 28802, 5940, 1024, 23649, 28742, 28713, 4285, 403, 24466, 3854, 28725, 630, 403, 26236, 354, 264, 13419, 1184, 486, 272, 4116, 515, 304, 5582, 28725, 2492, 23649, 26616, 298, 1560, 272, 4116, 515, 304, 272, 21962, 404, 28723, 13, 28733, 3224, 494, 13899, 4312, 1482, 1753, 739, 590, 10301, 21962, 404, 28725, 562, 23649, 304, 516, 15137, 5573, 1061, 468, 304, 3517, 21962, 404, 28723, 23649, 659, 6262, 456, 1215, 9259, 1411, 354, 28705, 28770, 28734, 1267, 28725, 562, 400, 349, 1309, 8630, 28723, 13, 28733, 23653, 487, 8681, 272, 752, 494, 9243, 2071, 345, 28765, 7827, 1294, 28742, 28713, 943, 283, 28722, 862, 9045, 356, 27809, 312, 577, 15321, 1927, 28723, 5299, 752, 494, 13899, 938, 11296, 28725, 1312, 23649, 15380, 477, 15878, 28742, 28713, 19986, 28725, 6980, 288, 1835, 21962, 404, 28723, 13, 28733, 10201, 301, 28742, 28713, 19986, 349, 5651, 1753, 477, 272, 9437, 304, 349, 7391, 298, 272, 9437, 486, 264, 1043, 7101, 978, 466, 28723, 1537, 297, 272, 8539, 28725, 272, 3878, 477, 272, 393, 969, 11404, 298, 272, 305, 16190, 1284, 349, 15199, 28725, 579, 23649, 4621, 3416, 15878, 28742, 28713, 19986, 297, 272, 8539, 28723, 13, 28733, 23653, 487, 295, 3053, 401, 24195, 390, 264, 8539, 19377, 438, 15878, 28742, 28713, 19986, 28725, 7332, 354, 16423, 28725, 25907, 28725, 1496, 4082, 272, 2855, 28725, 11906, 28725, 304, 4922, 799, 3345, 11939, 486, 23649, 1938, 23649, 28742, 28713, 8766, 15772, 4411, 28725, 5518, 23649, 659, 298, 6305, 1043, 15772, 302, 727, 4411, 3416, 15878, 28742, 28713, 19986, 28723, 401, 24195, 349, 264, 9650, 12933, 15866, 486, 23649, 456, 8539, 28723, 4577, 401, 24195, 349, 459, 264, 752, 494, 9243, 28725, 401, 24195, 622, 347, 297, 5685, 302, 23649, 28742, 28713, 
2183, 411, 28723, 13, 13, 7580, 282, 28747, 13, 28733, 23653, 487, 20566, 298, 306, 11328, 8539, 13419, 2298, 28725, 1560, 486, 6163, 304, 2533, 28723, 2354, 4007, 2354, 9626, 989, 12950, 867, 28725, 1312, 264, 5599, 4582, 349, 18543, 438, 410, 272, 305, 16190, 1284, 28723, 13, 28733, 23653, 487, 27297, 516, 23009, 1293, 4285, 28742, 28713, 1966, 28725, 459, 4240, 23651, 516, 28725, 562, 272, 805, 7558, 302, 21962, 404, 28723, 10191, 459, 1250, 23649, 28742, 28713, 21549, 1966, 28725, 23649, 5659, 298, 22920, 713, 304, 22492, 272, 14341, 6432, 9120, 302, 516, 28723, 13, 13, 28738, 10613, 28747, 13, 28733, 550, 353, 4749, 28747, 5518, 23649, 6262, 395, 752, 494, 13899, 354, 264, 1043, 727, 28725, 516, 8666, 3246, 10320, 4749, 304, 24766, 722, 28723, 23649, 659, 264, 1215, 24766, 722, 13355, 304, 4739, 10545, 5061, 28723, 13, 28733, 2707, 434, 28718, 3320, 28747, 23649, 835, 4198, 297, 272, 4116, 515, 354, 264, 1043, 727, 28725, 579, 400, 349, 1215, 14601, 6363, 304, 1008, 1503, 28723, 16043, 298, 23649, 28742, 28713, 24766, 722, 4735, 28725, 400, 4377, 8315, 14227, 1024, 10121, 516, 5026, 28723, 13, 28733, 28790, 326, 309, 440, 28747, 23649, 349, 1215, 26616, 298, 21962, 404, 304, 4116, 515, 28723, 23649, 1235, 459, 710, 3310, 2260, 3517, 272, 4116, 515, 28725, 562, 400, 349, 459, 10131, 28723, 1092, 400, 622, 3517, 272, 21962, 404, 3051, 4872, 409, 346, 28723, 13, 28733, 28769, 1623, 20823, 28747, 5800, 23649, 6098, 3842, 369, 3969, 298, 7665, 28035, 472, 28725, 400, 349, 2590, 13393, 440, 302, 28035, 472, 28723, 13, 28793, 13, 13, 28792, 952, 28747, 401, 24195, 13, 28741, 490, 28747, 18375, 883, 2518, 6555, 28723, 13, 22451, 715, 352, 28747, 9650, 6790, 12933, 13, 12205, 28747, 13, 28733, 28765, 24195, 349, 264, 2971, 395, 1043, 28725, 4687, 28724, 843, 13985, 1043, 3691, 28725, 9026, 17532, 304, 264, 5115, 23846, 5248, 2187, 28723, 20810, 1632, 659, 1215, 461, 24706, 1006, 13426, 473, 18469, 28723, 13, 28733, 28765, 24195, 403, 14675, 486, 23649, 1938, 456, 8539, 
28723, 401, 24195, 403, 264, 3088, 28720, 3955, 562, 403, 13382, 486, 272, 4116, 515, 304, 349, 1055, 297, 7207, 7579, 28723, 13, 28733, 28738, 10613, 28747, 1010, 9617, 440, 28725, 10320, 4749, 28725, 14827, 5061, 28793, 13, 13, 27332, 12107, 28747, 13, 1014, 2622, 302, 264, 4309, 24212, 10396, 3894, 274, 1059, 15878, 28742, 28713, 19986, 28725, 11313, 272, 28705, 2451, 9296, 369, 481, 1809, 28713, 272, 305, 16190, 1284, 28723, 23649, 28725, 395, 14373, 9060, 2282, 304, 264, 3075, 25293, 369, 9349, 28713, 395, 272, 7899, 28733, 18873, 28220, 28725, 23086, 272, 4431, 304, 15706, 272, 2251, 28725, 25719, 5380, 272, 277, 1196, 288, 12997, 28723, 13, 13, 3224, 494, 13899, 28725, 22978, 1006, 297, 13083, 302, 15903, 28733, 406, 13278, 28725, 752, 324, 643, 1401, 272, 4309, 28725, 521, 16792, 1439, 1002, 6774, 395, 2887, 13506, 7974, 354, 23649, 28742, 28713, 8539, 8131, 269, 617, 28723, 415, 2423, 349, 1104, 326, 313, 28725, 304, 272, 5535, 21277, 272, 21535, 302, 9685, 477, 272, 10396, 393, 969, 11404, 28723, 415, 752, 494, 13899, 28725, 8086, 286, 486, 264, 1411, 302, 27809, 23083, 28725, 771, 23463, 7577, 272, 2286, 288, 5256, 28723, 13, 13, 23653, 487, 28725, 264, 8113, 5248, 395, 264, 1162, 28733, 28719, 1690, 1738, 14587, 294, 2187, 28725, 2283, 2827, 272, 7899, 28733, 18873, 12997, 395, 6032, 28723, 2354, 24766, 1640, 304, 1496, 1007, 340, 13646, 271, 28725, 354, 2560, 486, 10073, 302, 752, 494, 980, 288, 304, 3434, 1232, 5083, 21962, 404, 304, 272, 4116, 515, 28725, 460, 14885, 297, 272, 16800, 302, 516, 12438, 28723, 13, 13, 2198, 272, 752, 494, 13899, 3688, 652, 9796, 28725, 23649, 1863, 28713, 516, 4501, 298, 272, 11399, 4368, 28723, 2326, 264, 708, 28733, 28711, 1053, 1058, 10294, 28725, 400, 23681, 28725, 345, 5142, 28725, 970, 349, 586, 388, 431, 693, 622, 347, 13465, 395, 528, 456, 8539, 1110, 2354, 3085, 3119, 1059, 272, 8578, 28720, 2423, 28725, 24593, 264, 12427, 302, 272, 10320, 4749, 3842, 369, 659, 2727, 1676, 4735, 298, 713, 28723, 13, 13, 1014, 
752, 494, 13899, 28725, 4312, 932, 1635, 286, 298, 272, 281, 10568, 302, 272, 27809, 752, 494, 9243, 1411, 28725, 4305, 7414, 9704, 304, 341, 10112, 297, 2899, 298, 23649, 28742, 28713, 297, 18831, 28723, 345, 17900, 1101, 4673, 7508, 28705, 13, 13, 23653, 487, 28742, 28713, 24766, 1640, 16698, 8961, 28725, 516, 17867, 2982, 671, 288, 297, 12595, 352, 302, 652, 13268, 28723, 2479, 272, 752, 494, 13899, 19553, 304, 19313, 298, 401, 24195, 693, 349, 1309, 297, 272, 4309, 28723, 981, 15766, 28725, 1567, 1236, 2435, 13, 13, 27332, 3133, 3112, 28747, 13, 28739, 16230, 28725, 1571, 676, 611, 401, 24195, 285, 671, 2925, 304, 4739, 805, 272, 9088, 5380, 2533, 28723, 13, 13, 28789, 17500, 3133, 8373, 28767, 13, 28733, 1976, 351, 11080, 459, 6685, 272, 19198, 442, 6768, 302, 401, 24195, 28725, 1854, 368, 460, 297, 5685, 302, 23649, 28723, 13, 28789, 5000, 10222, 3133, 8373, 28767, 27332, 12107, 28747, 13].
2024-01-08T15:22:58.897315313+02:00 ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
2024-01-08T15:22:58.897464436+02:00 ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
2024-01-08T15:22:58.921206538+02:00 Exception in callback _raise_exception_on_finish(request_tracker=<aphrodite.en...x7f80973485b0>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:21
2024-01-08T15:22:58.921328300+02:00 handle: <Handle _raise_exception_on_finish(request_tracker=<aphrodite.en...x7f80973485b0>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:21>
2024-01-08T15:22:58.921349883+02:00 Traceback (most recent call last):
2024-01-08T15:22:58.921359450+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 27, in _raise_exception_on_finish
2024-01-08T15:22:58.921416419+02:00     task.result()
2024-01-08T15:22:58.921424603+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 360, in run_engine_loop
2024-01-08T15:22:58.921431960+02:00     has_requests_in_progress = await self.engine_step()
2024-01-08T15:22:58.921441875+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 339, in engine_step
2024-01-08T15:22:58.921454629+02:00     request_outputs = await self.engine.step_async()
2024-01-08T15:22:58.921462229+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 190, in step_async
2024-01-08T15:22:58.921470248+02:00     output = await self._run_workers_async(
2024-01-08T15:22:58.921481039+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 215, in _run_workers_async
2024-01-08T15:22:58.921492133+02:00     output = executor(*args, **kwargs)
2024-01-08T15:22:58.921497975+02:00   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-08T15:22:58.921505773+02:00     return func(*args, **kwargs)
2024-01-08T15:22:58.921513830+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/worker.py", line 160, in execute_model
2024-01-08T15:22:58.921521583+02:00     output = self.model_runner.execute_model(seq_group_metadata_list,
2024-01-08T15:22:58.921531195+02:00   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-08T15:22:58.921539084+02:00     return func(*args, **kwargs)
2024-01-08T15:22:58.921546750+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/model_runner.py", line 362, in execute_model
2024-01-08T15:22:58.921558088+02:00     output = self.model.sample(
2024-01-08T15:22:58.921563996+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 299, in sample
2024-01-08T15:22:58.921586423+02:00     next_tokens = self.sampler(self.lm_head.weight, hidden_states,
2024-01-08T15:22:58.921600424+02:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-08T15:22:58.921608942+02:00     return forward_call(*args, **kwargs)
2024-01-08T15:22:58.921616768+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/layers/sampler.py", line 110, in forward
2024-01-08T15:22:58.921624039+02:00     t = torch.tensor(temperatures,
2024-01-08T15:22:58.921629939+02:00 RuntimeError: CUDA error: device-side assert triggered
2024-01-08T15:22:58.921637740+02:00 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-01-08T15:22:58.921646713+02:00 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-01-08T15:22:58.921652492+02:00 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-01-08T15:22:58.921659172+02:00 
2024-01-08T15:22:58.921664994+02:00 
2024-01-08T15:22:58.921672674+02:00 The above exception was the direct cause of the following exception:
2024-01-08T15:22:58.921680823+02:00 
2024-01-08T15:22:58.921686370+02:00 Traceback (most recent call last):
2024-01-08T15:22:58.921696566+02:00   File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
2024-01-08T15:22:58.921703769+02:00     self._context.run(self._callback, *self._args)
2024-01-08T15:22:58.921709546+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 36, in _raise_exception_on_finish
2024-01-08T15:22:58.921715420+02:00     raise exc
2024-01-08T15:22:58.921724335+02:00   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 31, in _raise_exception_on_finish
2024-01-08T15:22:58.921732015+02:00     raise AsyncEngineDeadError(
2024-01-08T15:22:58.921738049+02:00 aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.


I tried to use this framework to serve my RP model since vLLM doesn't support logit biases yet, but the above error keeps happening.

I tried AWQ, GPTQ, and non-quantized weights, on A100, V100, RTX 6000, etc.

Installed by apt-get update && apt-get install -y build-essential && pip install git+https://github.com/PygmalionAI/aphrodite-engine

  • I tried the dev branch too.

Models I tried:
https://huggingface.co/maywell/PiVoT-MoE
https://huggingface.co/maywell/PiVoT-SOLAR-10.7B-RP
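A device-side "index out of bounds" assert during sampling is often caused by token ids that lie outside the model's vocabulary (for example, from logit_bias, custom_token_bans, or an added-special-tokens mismatch — note the "Special tokens have been added in the vocabulary" warning in the log). A CPU-side guard like the following (a hypothetical helper, not part of Aphrodite) fails fast with a readable error instead of a cryptic CUDA assert:

```python
def validate_token_ids(token_ids, vocab_size):
    """Raise a readable error for ids that would index past the logits tensor."""
    bad = [t for t in token_ids if not (0 <= t < vocab_size)]
    if bad:
        raise ValueError(
            f"token ids {bad} are outside the vocab range [0, {vocab_size})")
    return token_ids


validate_token_ids([1, 523, 6060], vocab_size=32000)   # passes silently
# validate_token_ids([32002], vocab_size=32000)        # would raise ValueError
```

Running with CUDA_LAUNCH_BLOCKING=1, as the log suggests, also helps pin the assert to the exact indexing kernel.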

Installation fails on NAVI gpu

Your current environment

Collecting environment information...
PyTorch version: 2.4.0.dev20240317+rocm6.0
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.0.32830-d62f6a171
OS: Arch Linux (x86_64)
GCC version: (GCC) 13.2.1 20230801
Clang version: 17.0.6
CMake version: Could not collect 
Libc version: glibc-2.39
Python version: 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801] (64-bit runtime)
Python platform: Linux-6.7.10_1-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect 
cuDNN version: Could not collect 
HIP runtime version: 6.0.32830
MIOpen runtime version: 3.0.0
Is XNNPACK available: True
CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 7 5800X 8-Core Processor
CPU family:                           25
Model:                                33
Thread(s) per core:                   2
Core(s) per socket:                   8
Socket(s):                            1
Stepping:                             0
Frequency boost:                      enabled
CPU(s) scaling MHz:                   60%
CPU max MHz:                          4850.1948
CPU min MHz:                          2200.0000
BogoMIPS:                             7600.02
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization:                       AMD-V
L1d cache:                            256 KiB (8 instances)
L1i cache:                            256 KiB (8 instances)
L2 cache:                             4 MiB (8 instances)
L3 cache:                             32 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==3.0.0+0a22a91d04
[pip3] torch==2.4.0.dev20240317+rocm6.0
[conda] Could not collect
ROCM Version: 6.0.32831-204d35d16
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

How did you install Aphrodite?

HIP_VISIBLE_DEVICES=1 MAX_JOBS=4 python setup.py install

The build errors out with:

/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:1581:28: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:1656:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:1730:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:1835:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:1926:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:2003:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:2051:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:2108:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:2224:28: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:2225:16: error: use of undeclared identifier '__shfl_xor_sync'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3771:33: error: use of undeclared identifier '__vcmpeq4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3772:33: error: use of undeclared identifier '__vcmpeq4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3773:28: error: use of undeclared identifier '__vsub4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3774:28: error: use of undeclared identifier '__vsub4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3782:33: error: use of undeclared identifier '__vcmpeq4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3783:33: error: use of undeclared identifier '__vcmpeq4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3784:28: error: use of undeclared identifier '__vsub4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3785:28: error: use of undeclared identifier '__vsub4'
/home/user/aphrodite-engine/kernels/quantization/gguf/gguf_kernel.hip:3808:28: error: use of undeclared identifier '__vsub4'
/home/user/aphrodite-engine/kernels/quantization/exl2/q_gemm_exl2.hip:120:9: error: no matching function for call to 'hipblasHgemm'
/home/user/aphrodite-engine/kernels/attention/../quantization/int8_kvcache/quant_utils_hip.cuh:210:12: error: no viable conversion from returned value of type 'const float' to function return type '__hip_bfloat16'
/home/user/aphrodite-engine/kernels/attention/attention_kernels.hip:235:23: error: no matching function for call to 'vec_conversion'

Error running Chat API server

Hello, I was trying out the Aphrodite engine on my laptop and ran into some errors to report @AlpinDale. Here's the error log when I run the chat API server:

(aphrodite) [muzz@nobara-laptop aphrodite-engine]$ python -m aphrodite.endpoints.openai.api_server --model models/pythia-70m
INFO 09-18 17:30:15 aphrodite_engine.py:72] Initializing an LLM engine with config: model='models/pythia-70m', tokenizer='models/pythia-70m', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, seed=0)
Traceback (most recent call last):
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/muzz/build/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 632, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 451, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 252, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 281, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 104, in __init__
    self._init_workers(distributed_init_method)
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 136, in _init_workers
    self._run_workers(
  File "/home/muzz/build/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 695, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/muzz/build/aphrodite-engine/aphrodite/task_handler/worker.py", line 67, in init_model
    self.model = get_model(self.model_config)
  File "/home/muzz/build/aphrodite-engine/aphrodite/modeling/loader.py", line 50, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/muzz/build/aphrodite-engine/aphrodite/modeling/models/gpt_neox.py", line 239, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
TypeError: hf_model_weights_iterator() takes from 1 to 3 positional arguments but 4 were given

Apparently hf_model_weights_iterator only accepts 3 arguments, but the code in gpt_neox.py (and llama.py as well) passes 4, including revision. I deleted that argument and it worked fine, until I hit another error:

(aphrodite) [muzz@nobara-laptop aphrodite-engine]$ python -m aphrodite.endpoints.openai.api_server --model models/pythia-70m
INFO 09-18 17:19:58 aphrodite_engine.py:72] Initializing an LLM engine with config: model='models/pythia-70m', tokenizer='models/pythia-70m', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, seed=0)
INFO 09-18 17:20:00 aphrodite_engine.py:201] # GPU blocks: 14147, # CPU blocks: 21845
INFO:     Started server process [25905]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:58964 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/muzz/miniconda3/envs/aphrodite/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/muzz/build/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 200, in create_chat_completion
    prompt = await get_gen_prompt(request)
  File "/home/muzz/build/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 75, in get_gen_prompt
    raise ModuleNotFoundError(
ModuleNotFoundError: fastchat is not installed. Please install fastchat to use the chat completion and conversation APIs: `$ pip install fschat`

I reinstalled fastchat via pip, but it still threw the same error. It turned out another dependency was missing: accelerate. After pip install accelerate it works now. Just reporting these errors so they can be fixed.

Reduced performance due to Ray process core pinning

When using Ray, one worker process is spawned per logical CPU core; however, all of the processes end up pinned to the same two cores.

This can be observed with taskset after launching the engine. The example below is on a 16T machine.

$ ps xo '%p %c' | grep ray:: | awk '{print $1;}' | xargs -L1 taskset -cp
pid 23937's current affinity list: 0,8
pid 23938's current affinity list: 0,8
pid 23939's current affinity list: 0,8
pid 23940's current affinity list: 0,8
pid 23941's current affinity list: 0,8
pid 23942's current affinity list: 0,8
pid 23943's current affinity list: 0,8
pid 23944's current affinity list: 0,8
pid 23945's current affinity list: 0,8
pid 23946's current affinity list: 0,8
pid 23947's current affinity list: 0,8
pid 23948's current affinity list: 0,8
pid 23949's current affinity list: 0,8
pid 23951's current affinity list: 0,8
pid 23952's current affinity list: 0,8
pid 24923's current affinity list: 0,8

As a workaround, core affinity can be manually changed after launch, e.g. using taskset -cp <core> <pid> on each worker process.

Example assigning one core per process uniquely:

cpuid=0 ; for pid in $(ps xo '%p %c' | grep ray:: | awk '{print $1;}') ; do taskset -cp $cpuid $pid ; cpuid=$(($cpuid + 1)) ; done
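The round-robin assignment done by the shell loop above can be sketched in Python (names are illustrative; the pids would come from the same ps pipeline):

```python
import os

def assign_cores(pids, ncpus=None):
    """Compute a round-robin pid -> core mapping, one core per worker."""
    if ncpus is None:
        ncpus = os.cpu_count()
    return {pid: i % ncpus for i, pid in enumerate(pids)}

# On Linux the mapping could then be applied with
# os.sched_setaffinity(pid, {core}) for each pid/core pair,
# which is what taskset -cp does under the hood.
```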

The effect of this on performance is significant.

concurrent requests   avg T/s   avg T/s w/ fix
1                     12.45     33.00
4                     11.92     28.78
8                     11.19     27.85
16                    10.13     25.31

Benchmark environment:
4xA100 SXM NVL
aphrodite 0.4.2 openai endpoint
llama2 13b
llmperf default settings

CUDA illegal memory access when loading 70b AWQ with RoPE

I'm enabling RoPE scaling by editing the model's config file:

"rope_scaling": {"type":"dynamic", "factor": 2.0},

It seems to work for 13b and 20b llama2 models.
No success with 70b AWQ + the snippet above.
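For context, "dynamic" NTK scaling (as in transformers' LlamaDynamicNTKScalingRotaryEmbedding) recomputes the RoPE base from the current sequence length. This is a sketch of that formula, not Aphrodite's exact code:

```python
def dynamic_ntk_base(base, factor, seq_len, max_pos, head_dim):
    """Scaled RoPE base under dynamic NTK; only kicks in past the trained context."""
    if seq_len <= max_pos:
        return base
    scale = factor * seq_len / max_pos - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))

# llama2 defaults: base=10000, max_pos=4096, head_dim=128
```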

$ python -m aphrodite.endpoints.api_server_kobold --model TheBloke/Xwin-LM-70B-V0.1-AWQ -q awq
INFO 10-06 19:25:44 aphrodite_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-70B-V0.1-AWQ', tokenizer='TheBloke/Xwin-LM-70B-V0.1-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/aphrodite/aphdev/aphrodite/endpoints/api_server_kobold.py", line 214, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/aphrodite/aphdev/aphrodite/engine/async_aphrodite.py", line 484, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/home/aphrodite/aphdev/aphrodite/engine/async_aphrodite.py", line 268, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/engine/async_aphrodite.py", line 304, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/engine/aphrodite_engine.py", line 110, in __init__
    self._init_cache()
  File "/home/aphrodite/aphdev/aphrodite/engine/aphrodite_engine.py", line 190, in _init_cache
    num_blocks = self._run_workers(
  File "/home/aphrodite/aphdev/aphrodite/engine/aphrodite_engine.py", line 691, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/task_handler/worker.py", line 109, in profile_num_available_blocks
    self.model(
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/modeling/models/llama.py", line 299, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/modeling/models/llama.py", line 259, in forward
    hidden_states = layer(
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/modeling/models/llama.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/modeling/models/llama.py", line 155, in forward
    attn_output = self.attn(positions, q, k, v, k_cache, v_cache,
  File "/home/aphrodite/micromamba/envs/aphdev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aphrodite/aphdev/aphrodite/modeling/layers/attention.py", line 335, in forward
    return super().forward(
  File "/home/aphrodite/aphdev/aphrodite/modeling/layers/attention.py", line 215, in forward
    self.multi_query_kv_attention(
  File "/home/aphrodite/aphdev/aphrodite/modeling/layers/attention.py", line 119, in multi_query_kv_attention
    key = torch.repeat_interleave(key, self.num_queries_per_kv, dim=1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
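The failing line is the grouped-query attention expansion: each KV head is duplicated num_queries_per_kv times along the head dimension so keys and values line up with the query heads. A numpy analogue of that torch.repeat_interleave call (shapes are illustrative; llama2-70b has 64 query heads over 8 KV heads):

```python
import numpy as np

num_tokens, num_kv_heads, head_dim = 5, 8, 128
num_queries_per_kv = 8  # 64 query heads / 8 kv heads

key = np.zeros((num_tokens, num_kv_heads, head_dim), dtype=np.float16)
# Equivalent of torch.repeat_interleave(key, num_queries_per_kv, dim=1):
expanded = np.repeat(key, num_queries_per_kv, axis=1)
print(expanded.shape)  # (5, 64, 128)
```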

[Usage]: nccl and cupy problem "no cupy" and "NCCL_ERROR_UNHANDLED_CUDA_ERROR" when use TP in wsl

Your current environment

Collecting environment information...
/home/omni/miniconda3/envs/aph/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 546.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU family:                         6
Model:                              63
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           4788.91
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          384 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           3 MiB (12 instances)
L3 cache:                           30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.2.0                    pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Aphrodite Version: 0.5.1
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

How would you like to use Aphrodite?

I want to run this TinyLlama with TP 2 or 4, but it only works on 1 GPU; setting a tensor parallel size gives me the error below.
With tp 2 or 4 it reports that cupy is not installed, even though pip list shows it is.
This issue appeared after I reset and reinstalled a fresh WSL + Aphrodite setup.
Before that, tp 2 worked but tp 4 produced an NCCL error: cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
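One quick way to check whether the installed cupy actually exposes its NCCL bindings inside the running environment (a diagnostic sketch, not a fix):

```python
def probe_cupy_nccl():
    """Return a short status string for cupy's NCCL bindings."""
    try:
        from cupy.cuda import nccl
        return f"cupy NCCL version: {nccl.get_version()}"
    except Exception as exc:  # ImportError, or a CUDA runtime error
        return f"cupy NCCL check failed: {exc}"

# An ImportError here despite `pip list` showing cupy usually means the
# installed cupy wheel does not match the CUDA runtime in the env.
print(probe_cupy_nccl())
```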

AsyncEngineDeadError with koboldai api server

Everything seems to work fine via the embedded klite interface, but when I pointed horde at it, it started throwing these:

Oddly, it does seem to more or less still serve Horde requests.

INFO 01-16 12:30:08 async_aphrodite.py:133] Aborted request kai-ca722b2c86f04e9b88eed91ac6f5a65e.
INFO:     127.0.0.1:60750 - "POST /api/latest/generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 358, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 337, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 188, in step_async
    output = (await self._run_workers_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 225, in _run_workers_async
    assert output == other_output
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/endpoints/kobold/api_server.py", line 142, in generate
    async for res in result_generator:
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 442, in generate
    raise e
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 436, in generate
    async for request_output in stream:
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 69, in __anext__
    raise result
  File "/workspace/micromamba/envs/aphrodite-runtime/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

Initial fetch for `config.json` ignores `--revision`?

If I set CMD_ADDITIONAL_ARGUMENTS to --model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw

Then config.json is still fetched from the main revision and I get this error:

2024-03-13T14:03:42.164428603Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --max-model-len 4096 --quantization exl2 --enforce-eager --model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw --download-dir /volume/hub
2024-03-13T14:03:44.082470629Z WARNING:  exl2 quantization is not fully optimized yet. The speed can be slower 
2024-03-13T14:03:44.082490019Z than non-quantized models.
2024-03-13T14:03:44.084028269Z INFO:     Initializing the Aphrodite Engine (v0.5.0) with the following config:
2024-03-13T14:03:44.084035559Z INFO:     Model = 'turboderp/Mistral-7B-instruct-exl2'
2024-03-13T14:03:44.084039269Z INFO:     DataType = torch.bfloat16
2024-03-13T14:03:44.084042909Z INFO:     Model Load Format = auto
2024-03-13T14:03:44.084045799Z INFO:     Number of GPUs = 1
2024-03-13T14:03:44.084048349Z INFO:     Disable Custom All-Reduce = False
2024-03-13T14:03:44.084050519Z INFO:     Quantization Format = exl2
2024-03-13T14:03:44.084052649Z INFO:     Context Length = 4096
2024-03-13T14:03:44.084057519Z INFO:     Enforce Eager Mode = True
2024-03-13T14:03:44.084059709Z INFO:     KV Cache Data Type = auto
2024-03-13T14:03:44.084061789Z INFO:     KV Cache Params Path = None
2024-03-13T14:03:44.084063869Z INFO:     Device = cuda
2024-03-13T14:03:44.492961433Z Traceback (most recent call last):
2024-03-13T14:03:44.492985083Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
2024-03-13T14:03:44.492988443Z     response.raise_for_status()
2024-03-13T14:03:44.492991203Z   File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
2024-03-13T14:03:44.492993893Z     raise HTTPError(http_error_msg, response=self)
2024-03-13T14:03:44.492996533Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json
2024-03-13T14:03:44.492999293Z 
2024-03-13T14:03:44.493001403Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493003813Z 
2024-03-13T14:03:44.493005773Z Traceback (most recent call last):
2024-03-13T14:03:44.493008093Z   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
2024-03-13T14:03:44.493010223Z     resolved_file = hf_hub_download(
2024-03-13T14:03:44.493012273Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493014363Z     return fn(*args, **kwargs)
2024-03-13T14:03:44.493016513Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
2024-03-13T14:03:44.493018643Z     metadata = get_hf_file_metadata(
2024-03-13T14:03:44.493020723Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493022793Z     return fn(*args, **kwargs)
2024-03-13T14:03:44.493024903Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1667, in get_hf_file_metadata
2024-03-13T14:03:44.493026983Z     r = _request_wrapper(
2024-03-13T14:03:44.493029103Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
2024-03-13T14:03:44.493031173Z     response = _request_wrapper(
2024-03-13T14:03:44.493033263Z   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
2024-03-13T14:03:44.493035313Z     hf_raise_for_status(response)
2024-03-13T14:03:44.493041563Z huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-65f1b240-7d5d7d3b668248e21867e88e;d37da62a-3494-4c58-91fd-28dda5419afb)
2024-03-13T14:03:44.493043843Z 
2024-03-13T14:03:44.493045873Z Entry Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json.
2024-03-13T14:03:44.493062953Z 
2024-03-13T14:03:44.493066373Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493068993Z 
2024-03-13T14:03:44.493071043Z Traceback (most recent call last):
2024-03-13T14:03:44.493073083Z   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-03-13T14:03:44.493075173Z     return _run_code(code, main_globals, None,
2024-03-13T14:03:44.493077243Z   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-03-13T14:03:44.493079313Z     exec(code, run_globals)
2024-03-13T14:03:44.493081353Z   File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
2024-03-13T14:03:44.493083673Z     engine = AsyncAphrodite.from_engine_args(engine_args)
2024-03-13T14:03:44.493085783Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
2024-03-13T14:03:44.493087773Z     engine = cls(parallel_config.worker_use_ray,
2024-03-13T14:03:44.493089813Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
2024-03-13T14:03:44.493091913Z     self.engine = self._init_engine(*args, **kwargs)
2024-03-13T14:03:44.493093943Z   File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
2024-03-13T14:03:44.493095973Z     return engine_class(*args, **kwargs)
2024-03-13T14:03:44.493098053Z   File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 102, in __init__
2024-03-13T14:03:44.493100183Z     self._init_tokenizer()
2024-03-13T14:03:44.493102283Z   File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 166, in _init_tokenizer
2024-03-13T14:03:44.493104343Z     self.tokenizer: TokenizerGroup = TokenizerGroup(
2024-03-13T14:03:44.493106503Z   File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 157, in __init__
2024-03-13T14:03:44.493108583Z     self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
2024-03-13T14:03:44.493110623Z   File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
2024-03-13T14:03:44.493112653Z     tokenizer = AutoTokenizer.from_pretrained(
2024-03-13T14:03:44.493114713Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 782, in from_pretrained
2024-03-13T14:03:44.493116783Z     config = AutoConfig.from_pretrained(
2024-03-13T14:03:44.493118833Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
2024-03-13T14:03:44.493120903Z     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493122953Z   File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 633, in get_config_dict
2024-03-13T14:03:44.493125233Z     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493127343Z   File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
2024-03-13T14:03:44.493129363Z     resolved_config_file = cached_file(
2024-03-13T14:03:44.493131423Z   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 452, in cached_file
2024-03-13T14:03:44.493133483Z     raise EnvironmentError(
2024-03-13T14:03:44.493135593Z OSError: turboderp/Mistral-7B-instruct-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/main' for available files.

GPTQRowParallelLinear has no attribute world_size

When starting the ooba server API with GPTQ enabled, I run into the following error:

    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GPTQRowParallelLinear' object has no attribute 'world_size'

This worked before; it broke after pulling the most up-to-date git revision.

AttributeError: 'NoneType' object has no attribute 'fs' at fresh install

I installed everything as the README says, ran into this error, reinstalled, and still hit the same thing.
What could the cause be? I'd appreciate any help.

(aphrodite) user_name@ai-rig:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server --help
Traceback (most recent call last):
  File "/home/user_name/miniconda3/envs/aphrodite/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/user_name/miniconda3/envs/aphrodite/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/user_name/aphrodite-engine/aphrodite/__init__.py", line 2, in <module>
    from aphrodite.engine.async_aphrodite import AsyncAphrodite
  File "/home/user_name/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 7, in <module>
    from aphrodite.engine.aphrodite_engine import AphroditeEngine
  File "/home/user_name/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 8, in <module>
    from aphrodite.engine.ray_tools import initialize_cluster, ray, RayWorker
  File "/home/user_name/aphrodite-engine/aphrodite/engine/ray_tools.py", line 9, in <module>
    from ray.air.util.torch_dist import TorchDistributedWorker
  File "/home/user_name/miniconda3/envs/aphrodite/lib/python3.10/site-packages/ray/air/__init__.py", line 1, in <module>
    from ray.air.checkpoint import Checkpoint
  File "/home/user_name/miniconda3/envs/aphrodite/lib/python3.10/site-packages/ray/air/checkpoint.py", line 22, in <module>
    from ray.air._internal.remote_storage import (
  File "/home/user_name/miniconda3/envs/aphrodite/lib/python3.10/site-packages/ray/air/_internal/remote_storage.py", line 142, in <module>
    _cached_fs: Dict[tuple, Tuple[float, pyarrow.fs.FileSystem]] = {}
AttributeError: 'NoneType' object has no attribute 'fs'
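For context, `ray.air._internal.remote_storage` reaches for `pyarrow.fs` at import time, so a missing or broken pyarrow install can leave that attribute as `None`. A small diagnostic sketch, assuming the root cause is the pyarrow installation:

```python
import importlib.util


def pyarrow_fs_available() -> bool:
    # ray.air's remote_storage module expects `pyarrow.fs` to be importable;
    # if pyarrow is absent or its install is broken, this returns False.
    if importlib.util.find_spec("pyarrow") is None:
        return False
    try:
        import pyarrow.fs  # noqa: F401
        return True
    except Exception:
        return False


print(pyarrow_fs_available())
```

If this prints `False` inside the `aphrodite` conda environment, reinstalling pyarrow (e.g. `pip install --force-reinstall pyarrow`) is a plausible first step.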

Error when `top_logprobs` value is `-inf`

In some cases, some values in top_logprobs are -inf.

The request completes fine, but when the server tries to build the output JSON, it produces the following error:

INFO 12-22 15:32:55 async_aphrodite.py:110] Finished request cmpl-b98ecf403c8848a4aad44d4f537cfcf8.
INFO:     127.0.0.1:32890 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
... (traceback abridged) ...
ValueError: Out of range float values are not JSON compliant
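That error message comes straight from Python's `json` encoder when `allow_nan=False` is set, which Starlette's `JSONResponse` (used by the FastAPI server) passes by default. A minimal reproduction:

```python
import json

# Strict JSON has no representation for -inf; with allow_nan=False the
# encoder raises instead of emitting the non-standard "-Infinity" token.
try:
    json.dumps({"top_logprob": float("-inf")}, allow_nan=False)
except ValueError as exc:
    print(exc)  # Out of range float values are not JSON compliant
```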

So, I went to this line:
https://github.com/PygmalionAI/aphrodite-engine/blob/main/aphrodite/endpoints/openai/api_server.py#L234

And added this:

print(logprobs)  # debug: inspect the raw logprobs before clamping
# Clamp each top_logprobs value to a minimum of -1000 so that
# JSON serialization does not fail on -inf.
logprobs.top_logprobs = [
    {k: v if v > -1000 else -1000 for k, v in top_logprob.items()}
    for top_logprob in logprobs.top_logprobs
]
return logprobs

This makes it work, and that print(logprobs) outputs:

text_offset=[0] token_logprobs=[0.0] tokens=['▁anno'] top_logprobs=[{'▁anno': 0.0, '<s>': -inf, '<0x00>': -inf, '<0x03>': -inf, '<0x04>': -inf, '<unk>': -inf, '</s>': -inf, '<0x02>': -inf, '<0x05>': -inf, '<0x01>': -inf}]

Are these -inf values expected? Are they the reason the JSON doesn't serialize correctly?
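For what it's worth, -inf is the natural log-probability of a token assigned exactly zero probability, so the values themselves look plausible; a pure-Python sketch of why they show up:

```python
import math


def logprob(p: float) -> float:
    # log(0) is mathematically -inf; math.log raises on 0, so guard it.
    return math.log(p) if p > 0.0 else float("-inf")


print(logprob(1.0))  # 0.0  -> the chosen token
print(logprob(0.0))  # -inf -> tokens assigned zero probability
```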

@AlpinDale, if you want, I can open a PR with that small hotfix, though there's surely a more elegant solution.
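One alternative sketch, using a hypothetical helper not in the codebase: instead of clamping to an arbitrary floor, map non-finite values to `None`, which serializes cleanly to JSON `null`:

```python
import math


def sanitize_top_logprobs(top_logprobs):
    # Hypothetical helper: replace non-finite log-probabilities (-inf, nan)
    # with None so the response body stays strictly JSON-compliant.
    return [
        {tok: (lp if math.isfinite(lp) else None) for tok, lp in entry.items()}
        for entry in top_logprobs
    ]


print(sanitize_top_logprobs([{"▁anno": 0.0, "<s>": float("-inf")}]))
# [{'▁anno': 0.0, '<s>': None}]
```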
