openvinotoolkit / openvino.genai
Run Generative AI models using native OpenVINO C++ API
License: Apache License 2.0
New versions of dependencies are out and the LLM samples no longer work end to end.
This task regards enabling tests for gemma-7b-it. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for dolly-v2-12b. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for baichuan2-7b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for llama-2-7b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
Requesting help to understand how TTFT is calculated
No response
This task regards enabling tests for mini-cpm-2b-dpo. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
Currently, dependencies in the requirements files are not pinned, so new dependency versions break our pipelines, as is happening on 2023.3 right now.
Let's pin the dependencies and use Dependabot to track new versions and propose updates as PRs. Such PRs can be tested for compatibility with our CI.
This task regards enabling tests for phi-2. You can find more details under openvino_notebooks LLM question answering README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
auto-gptq can't be installed due to a build error:
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/private/var/folders/gx/znq60x355475d1njb709q0jh0000gn/T/pip-install-2tmqwfxi/auto-gptq_3c9ac5ea4a8f4b049c9ff044cc750c38/setup.py", line 58, in <module>
CUDA_VERSION = "".join(os.environ.get("CUDA_VERSION", default_cuda_version).split("."))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
The issue is that the CPU-only version of PyTorch is installed, so default_cuda_version = torch.version.cuda is None. This seems like an upstream issue, but we may need a workaround here.
This task regards enabling tests for mistral-7b. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
I have converted the llama-7b-chat model to int4 using the following commands:
python convert.py --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat --precision FP16 --compress_weights INT4_SYM INT4_ASYM 4BIT_DEFAULT
python convert.py --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat --precision FP32 --compress_weights 4BIT_DEFAULT
I'm running benchmarking with the int4 converted models. I tried the following variations and, as you can see, all the responses contain German words.
Tried with a different prompt - it gives a partial answer in German.
Using the following prompt generates the complete response in German.
Am I missing something here? Please provide some guidance.
This task regards enabling tests for qwen1.5-7b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
python convert.py --model_id /home/llm/disk/llm/meta-llama/Llama-2-7b-hf --output_dir /home/llm/disk/llm/meta-llama/Llama-2-7b-hf-openvino --precision FP32
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
/home/llm/miniconda3/envs/openvino/lib/python3.9/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/llm/miniconda3/envs/openvino/lib/python3.9/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[ INFO ] openvino runtime version: 2023.3.0-13775-ceeafaf64f3-releases/2023/3
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.50s/it]
Using the export variant default. Available variants are:
- default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
- use_cache -> True
/home/llm/miniconda3/envs/openvino/lib/python3.9/site-packages/transformers/modeling_utils.py:4193: FutureWarning: _is_quantized_training_enabled
is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable
instead
warnings.warn(
The cos_cached attribute will be removed in 4.40. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead.
The sin_cached attribute will be removed in 4.40. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead.
/home/llm/miniconda3/envs/openvino/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py:1057: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_length > self.causal_mask.shape[-1]:
Export model to OpenVINO directly failed with:
Check 'is_conversion_successful' failed at src/frontends/pytorch/src/frontend.cpp:141:
FrontEnd API failed with OpConversionFailure:
Model wasn't fully converted. Failed operations detailed log:
-- aten::mul with a message:
Exception happened during conversion of operation __module.model/aten::mul with schema aten::mul.Tensor(Tensor self, Tensor other) -> Tensor
Check 'args_et.is_dynamic() || args_et != element::boolean' failed at src/core/src/op/util/binary_elementwise_arithmetic.cpp:25:
While validating node 'opset1::Multiply Multiply_218 (__module.model/aten::eq/Equal[0]:boolean[...], __module.model/aten::eq/Equal[0]:boolean[?,1,1,?]) -> (dynamic[...])' with friendly_name 'Multiply_218':
Arguments cannot have boolean element type (argument element type: boolean).
Summary:
-- Conversion is failed for: aten::mul
.
Model will be exported to ONNX
[ WARNING ] Making stateful models is not supported when exporting to ONNX as an intermediate step. A stateless model will be exported instead. It may result in sub-optimal inference performance. Provide a model that can be converted to OpenVINO without fallback to ONNX conversion path.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
- use_cache -> True
This task regards enabling tests for youri-7b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
I am running Qwen-7B on SPR.
I found that there is no significant perf improvement between FP32 and the compressed FP32-INT4_ASYM model.
FP32 benchmarking cmd:
python benchmark.py -m /root/.cache/huggingface/hub/Qwen-7B-Chat-ov/pytorch/dldt/FP32 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. " -n 5 -ic 32 -bs 1 --num_beams 1 -d CPU --torch_compile_backend openvino
Latency: 129.71 ms/token
INT4 benchmarking cmd:
python benchmark.py -m /root/.cache/huggingface/hub/Qwen-7B-Chat-ov/pytorch/dldt/compressed_weights/OV_FP32-INT4_ASYM -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. " -n 5 -ic 32 -bs 1 --num_beams 1 -d CPU --torch_compile_backend openvino
Latency: 121.91 ms/token
Did I miss something important in the benchmarking cmd, or is it an issue?
I've run the benchmark to test Stable Diffusion v1.5, but the quality of the generated image is low. I suspect something is computed incorrectly in the process.
I compared it with optimum-intel, using the same prompt, steps (20), and resolution (512x512), and got a very different quality image. Please have a look.
This is an effort to increase Large Language Models tests coverage in OpenVINO GenAI.
Working on this task will let you familiarize yourself with:
If you would like to add a new model which there's not a task for, please let us know! We would love to get outside ideas.
Example commit: bf4c200#diff-2c8a6fc2893aa2e1103985c1ee763cc325de6042ea66a11ae30428d77e73e416
The OpenVINO Tokenizers extension has a new home: https://github.com/openvinotoolkit/openvino_tokenizers
Or maybe it's better to remove the submodule and use release packages via CMake FetchContent?
Dear,
I've run llm_bench\python to test Stable Diffusion v1.5 and found some regression between different OV packages, as below:
2023.3->8.73s
2024.0->10.65s
2024.1->9.18s
The parameters are 20 steps, 512x512; the others are the same as in the prompt/stable-diffusion.jsonl file.
The command is:
python benchmark.py -d GPU --model "C:\AIGC\openvino\models\stable-diffusion-optimum-sdv1_5" --prompt_file "prompts/stable-diffusion.jsonl" -n 1
The platform is an MTL U9 185 iGPU with 32GB.
Thanks a lot,
I am trying to run text generation using text_generation/causal_lm/cpp/greedy_causal_lm.cpp without any modifications. I followed the build instructions in the README and ran this command, after which I was shown the following error.
$ ./build/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16 "Why is the Sun yellow?"
Exception from src/inference/src/core.cpp:85:
Exception from src/frontends/ir/src/ir_deserializer.cpp:438:
Invalid IR! ScatterNDUpdate_15 name is not unique!
I have limited knowledge of the toolkit's internal processes, and would like to get some indication of where this issue might be arising from (and how it could be resolved).
gcc and g++ versions are 11.4.0. The Python environment has the following packages among others:
torch==2.3.0
openvino==2024.1.0
openvino-tokenizers==2024.1.0
transformers==4.37.2
optimum==1.19.1
The setup has aarch64 and x86 processors and a shared file system. The tokenizer conversion using openvino-tokenizers (command below) runs successfully every time on the aarch64 machine but often results in a seg fault on the x86 machine. I could not find any pattern in the occurrence of seg faults.
convert_tokenizer ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --output ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --with-detokenizer --trust-remote-code
On the aarch64 machine, trying to build greedy_causal_lm.cpp along with other files using cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/ && cmake --build ./build/ -j results in the following error. This never occurs on the x86 machine.
CMake Error at /home/nishant/workspace/llm/openvino.genai/thirdparty/openvino_tokenizers/CMakeLists.txt:15 (find_package):
By not providing "FindOpenVINO.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "OpenVINO",
but CMake did not find one.
Could not find a package configuration file provided by "OpenVINO" with any
of the following names:
OpenVINOConfig.cmake
openvino-config.cmake
Add the installation prefix of "OpenVINO" to CMAKE_PREFIX_PATH or set
"OpenVINO_DIR" to a directory containing one of the above files. If
"OpenVINO" provides a separate development package or SDK, be sure it has
been installed.
-- Configuring incomplete, errors occurred!
I've had to switch between the two machines to execute specific commands that run on those machines successfully. This could have resulted in some issues as well.
Benchmark cmd:
numactl -C 0-55 -m 0 python benchmark.py -m /root/.cache/huggingface/hub/flan-t5-xl-ov/pytorch/dldt/FP16 -p "It is done..." -n 3 -bs 1
-d CPU --torch_compile_backend openvino -ic 128 --num_beams 1 -lc bfloat16_config.json 2>&1 | tee -a ./logs/0.log
This task regards enabling tests for mpt-7b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for Phi-1_5. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for tiny-llama-1b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
Step 1 of the SD1.5 setup is failing on Windows. Is there a specific channel that should be added?
(openvino_sd_cpp) C:\Users\local_user>conda install openvino eigen c-compiler cxx-compiler make
Collecting package metadata (current_repodata.json): done
Solving environment: unsuccessful initial attempt using frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: unsuccessful initial attempt using frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- c-compiler
- openvino
- make
- cxx-compiler
Current channels:
- https://repo.anaconda.com/pkgs/main/win-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/win-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://repo.anaconda.com/pkgs/msys2/win-64
- https://repo.anaconda.com/pkgs/msys2/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
System details:
Intel Core Ultra 155H
32GB LPDDR5
conda v23.7.4
I downloaded and compiled on Windows. I probably made a stupid mistake, and that's why it won't work, but I'm trying to get this project done so I can join the Intel Partner Alliance. I can't get this project to run because of a weird error preventing the DLL from loading, even though it exists at the exact location - I Ctrl+clicked it and it opened, so it's there. It just won't load. Can you tell if there's anything in particular that would cause the DLL to be unable to load?
This is using OpenVINO 2024.0.0 with the model "acen20/Mistral-7B-Instruct-v0.2-openvino-int4".
Error output:
(base) PS C:\Users\hdtru\prg\openvino.genai\text_generation\causal_lm\cpp> .\build\Release\beam_search_causal_lm.exe " C:\Users\hdtru\.cache\huggingface\hub\models--acen20--Mistral-7B-Instruct-v0.2-openvino-int4\snapshots\0a94c646b59e31dc1c52024c1469a5087edf704c\openvino_model.xml" "What is your name?"
Exception from src\inference\src\cpp\core.cpp:163:
Cannot add extension. Cannot find entry point to the extension library. This error happened: Cannot load library 'C:\Users\hdtru\prg\openvino.genai\text_generation\causal_lm\cpp\build\openvino_tokenizers\src\Release\openvino_tokenizers.dll': 127 from cwd: C:\Users\hdtru\prg\openvino.genai\text_generation\causal_lm\cpp
Fix it so the DLL loads correctly... not sure what else to say!
No response
none
not sure
No response
Running convert.py for microsoft/trocr-base-printed gives the below error:
ValueError: Unrecognized configuration class <class
'transformers.models.vision_encoder_decoder.configuration_vision_encoder_decoder.VisionEncoderDecoderConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
As you can see in the error message, it says Model type should be one of ... and "TrOCRConfig" is part of the list. So "trocr-base-printed" should be supported, but the conversion fails.
N/A
No response
N/A
No response
End Of Sequence (EOS) tokens are an essential part of LLM training and inference. You can find more details in this comment.
Thanks to a PR adding End Of Sequence tokens to Runtime Info, openvino_tokenizers now puts the EOS token value into the rt_info section of the OpenVINO Intermediate Representation (the .xml file, to be specific) when converting a tokenizer to OpenVINO.
Since EOS has been enabled in OpenVINO, it now needs to be enabled in the GenAI text_generation module. beam_search_causal_lm.cpp and greedy_causal_lm.cpp from https://github.com/openvinotoolkit/openvino.genai/tree/master/text_generation/causal_lm/cpp should read the EOS token instead of keeping a hardcoded value with the comment // There's no way to extract special token values from the detokenizer for now.
The value should be extracted using ov::Model::get_rt_info() and used; remove the comments about the absence of a way to extract that value.
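A minimal sketch of what reading that value could look like, assuming the converted tokenizer IR stores the id under an rt_info key named eos_token_id and is saved as openvino_tokenizer.xml (both the key and file name are assumptions and should be checked against what openvino_tokenizers actually writes):

```cpp
#include <openvino/openvino.hpp>

#include <cstdint>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    ov::Core core;
    // Path is illustrative: the directory produced by convert_tokenizer with --with-detokenizer.
    std::shared_ptr<ov::Model> tokenizer =
        core.read_model(std::string{argv[1]} + "/openvino_tokenizer.xml");

    // rt_info is an ov::AnyMap attached to the model; openvino_tokenizers is
    // expected to have written the EOS token id into it during conversion.
    ov::AnyMap& rt_info = tokenizer->get_rt_info();

    int64_t eos_token_id = 2;  // fallback if the key is absent (e.g. a model-specific hardcoded value)
    auto it = rt_info.find("eos_token_id");  // assumed key name
    if (it != rt_info.end()) {
        eos_token_id = it->second.as<int64_t>();
    }
    std::cout << "EOS token id: " << eos_token_id << '\n';
    return 0;
}
```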
This task regards enabling tests for red-pajama-3b-chat. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
As Title
I want to capture profiling information in my C++ application using get_runtime_model(). I attempted to patch the sample with something like:
ov::CompiledModel compiledModel = core.compile_model(
    std::string{argv[1]} + "/openvino_model.xml", "CPU",
    ov::device::properties("CPU", ov::enable_profiling(true)));
...
std::string FLAGS_exec_graph_path = "greedy_causal_lm.exec_graph.xml";
try {
    ov::serialize(compiledModel.get_runtime_model(), FLAGS_exec_graph_path);
    std::cerr << "Executable graph is stored to " << FLAGS_exec_graph_path << std::endl;
} catch (const std::exception& ex) {
    std::cerr << "Can't get executable graph: " << ex.what() << std::endl;
}
but when I execute the sample I hit an error with the profiling code:
./build/greedy_causal_lm llama-2-7b-chat.f16.int4/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT "Why is the Sun yellow?"
Can't get executable graph: Exception from src/inference/src/cpp/compiled_model.cpp:35:
Exception from src/plugins/intel_cpu/src/node.cpp:499:
Node Broadcast_143553 contains less child edges than 1
The sample works fine without the profiling snippet.
Can an option be added to the sample C++ applications to correctly dump runtime profiling information?
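As a possible interim workaround (not a fix for the get_runtime_model() exception above), per-node timings can also be read from the infer request after a run, provided the model was compiled with ov::enable_profiling(true); a minimal sketch:

```cpp
#include <openvino/openvino.hpp>

#include <iostream>

// Print per-node execution statistics collected for the last inference.
// Assumes `request` comes from a model compiled with ov::enable_profiling(true)
// and that at least one inference has already been run.
void dump_profiling(ov::InferRequest& request) {
    for (const ov::ProfilingInfo& info : request.get_profiling_info()) {
        if (info.status == ov::ProfilingInfo::Status::NOT_RUN) {
            continue;  // skip nodes that were never executed
        }
        std::cout << info.node_name << " (" << info.node_type << ", " << info.exec_type << "): "
                  << info.real_time.count() << " us\n";
    }
}
```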
(openvino.genai) E:\projects\openvino.genai\text_generation\causal_lm\cpp>python ......\llm_bench\python\convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir .\TinyLlama-1.1B-Chat-v1.0\ --precision FP16 --stateful
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6'
Traceback (most recent call last):
File "D:\anaconda\envs\openvino.genai\lib\site-packages\transformers\utils\import_utils.py", line 1364, in get_module
return importlib.import_module("." + module_name, self.name)
File "D:\anaconda\envs\openvino.genai\lib\importlib_init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in call_with_frames_removed
File "D:\anaconda\envs\openvino.genai\lib\site-packages\optimum\exporters\onnx_main.py", line 33, in
from .convert import export_models, validate_models_outputs
File "D:\anaconda\envs\openvino.genai\lib\site-packages\optimum\exporters\onnx\convert.py", line 49, in
from transformers.pytorch_utils import is_torch_less_than_1_11
ImportError: cannot import name 'is_torch_less_than_1_11' from 'transformers.pytorch_utils' (D:\anaconda\envs\openvino.genai\lib\site-packages\transformers\pytorch_utils.py)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\projects\openvino.genai\llm_bench\python\convert.py", line 27, in
from optimum.exporters.openvino import export_models
File "D:\anaconda\envs\openvino.genai\lib\site-packages\optimum\exporters\openvino_init_.py", line 1, in
from .main import main_export
File "D:\anaconda\envs\openvino.genai\lib\site-packages\optimum\exporters\openvino_main_.py", line 24, in
from optimum.exporters.onnx import main as optimum_main
File "", line 1075, in _handle_fromlist
File "D:\anaconda\envs\openvino.genai\lib\site-packages\transformers\utils\import_utils.py", line 1352, in getattr
value = self._get_module(name)
File "D:\anaconda\envs\openvino.genai\lib\site-packages\transformers\utils\import_utils.py", line 1366, in _get_module
raise RuntimeError(
RuntimeError: Failed to import optimum.exporters.onnx.main because of the following error (look up to see its traceback):
cannot import name 'is_torch_less_than_1_11' from 'transformers.pytorch_utils' (D:\anaconda\envs\openvino.genai\lib\site-packages\transformers\pytorch_utils.py)
Request:
Maybe pin the versions of all Python packages instead of using ">=" or leaving versions unspecified for model conversion.
This task regards enabling tests for chatglm3-6b. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for red-pajama-3b-instruct. You can find more details under openvino_notebooks LLM question answering README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
This task regards enabling tests for notus-7b-v1. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
Benchmark cmd (-ic is 128):
numactl -C 0-55 -m 0 python benchmark.py -m /root/.cache/huggingface/hub/chatglm3-6b-ov/pytorch/dldt/FP16 -p "It is done, and submitted..." -n 2 -bs 1 -d CPU --torch_compile_backend openvino -ic 128 --num_beams 1 -lc bfloat16_config.json 2>&1 | tee -a ./logs/0.log
BTW, ChatGLM2's output size is right.
@Wovchena This is related to #406 but goes deeper, hence I decided to make this a new issue.
During greedy_causal_lm inference on arm:
1. Large matmuls (e.g. 1x2x4096:4096x4096 in query/key/value projection) fall back to a slow reference matmul implementation (ref_any_bf16).
2. Small matmuls (e.g. 1x32x2x2:1x32x2x128 in dot-product attention) use a faster gemm_acl_f16 kernel, which indicates that ACL kernels through oneDNN are available but not being used for point (1).
3. On x86, most matmuls use the brgemm_avx512_bf16 kernel, resulting in much faster inference.
Question: Is there a heuristic in OpenVINO that causes large matmuls on arm to fall back to the reference implementation?
Following #327, I added code in greedy_causal_lm.cpp to save profiling information after inference. The table below shows the kernels used on x86 vs arm and their runtime (in microseconds) from the first decoder block of meta-llama/Llama-2-7b-hf (HuggingFace, compressed to INT8_ASYM).
Operation | Shape | x86 kernel | x86 time (mcs) | arm kernel | arm time (mcs) |
---|---|---|---|---|---|
q_proj | 1x2x4096:4096x4096 | brgemm_avx512_bf16 | 584 | ref_any_f16 | 44714 |
k_proj | 1x2x4096:4096x4096 | brgemm_avx512_bf16 | 699 | ref_any_f16 | 39713 |
v_proj | 1x2x4096:4096x4096 | brgemm_avx512_bf16 | 5775 | ref_any_f16 | 37816 |
o_proj | 1x2x4096:4096x4096 | brgemm_avx512_bf16 | 833 | ref_any_f16 | 36469 |
mlp.gate_proj | 1x2x4096:4096x11008 | brgemm_avx512_bf16 | 1023 | ref_any_f16 | 97537 |
mlp.up_proj | 1x2x4096:4096x11008 | brgemm_avx512_bf16 | 965 | ref_any_f16 | 101677 |
mlp.down_proj | 1x2x11008:11008x4096 | brgemm_avx512_bf16 | 953 | ref_any_f16 | 98749 |
classifier (logits) | 1x2x4096:4096x32000 | brgemm_avx512_bf16 | 1813 | ref_any_f16 | 284708 |
For small matmuls however, I noticed gemm_acl_f16 being called on arm.
Operation | Shape | arm kernel | arm time (mcs) |
---|---|---|---|
A = q_proj * k_proj ^ T | 1x32x2x128:1x32x2x128 | gemm_acl_f16 | 226 |
A * v_proj | 1x32x2x2:1x32x2x128 | gemm_acl_f16 | 69 |
This indicates that ACL kernels are available during inference but are not being used for large matmuls specifically. To check that this divergence is not coming from oneDNN, I compiled oneDNN with ACL using the following version combinations and benchmarked the above matmul sizes with benchdnn. Results are shown below; they indicate that gemm:acl is used on arm for all of the above sizes when the two compile correctly.
oneDNN source | oneDNN version | ACL version | Kernel used on arm | Comments |
---|---|---|---|---|
oneapi-src/oneDNN | 3.3.3 | 24.02.1 | Build fails; error from ACL | References to ACL v24.02.1 were found in openvino docs. |
oneapi-src/oneDNN | 3.3.3 | 23.11 | Build fails; error from ACL | N/A |
oneapi-src/oneDNN | 3.3.3 | 23.08 | gemm:acl | Expected kernel is used. |
openvinotoolkit/oneDNN | 3.3.3 | 24.02.1 | gemm:jit_f32 | ACL compilation flag ignored by cmake |
openvinotoolkit/oneDNN | 3.3.3 | 23.11 | gemm:jit_f32 | ACL compilation flag ignored by cmake |
openvinotoolkit/oneDNN | 3.3.3 | 23.08 | gemm:jit_f32 | ACL compilation flag ignored by cmake |
I think ACL kernels should ideally be used for any matmul on arm since they are available and being used for some operations. Since oneDNN is not the one causing a reference kernel to be used, I am led to believe that a heuristic within OpenVINO is causing this fallback. I'm hoping to know whether that is indeed the case, and (if yes) where this heuristic has been implemented.
Convert cmd:
python3 convert.py --model_id tiiuae/falcon-40b --output_dir /root/.cache/huggingface/hub/tiiuae/falcon-40b-ov --stateful --precision FP16
Benchmarking cmd:
numactl -C 0-55 -m 0 python benchmark.py -m /root/.cache/huggingface/hub/falcon-40b-ov/pytorch/dldt/FP16 -p "It is done" -n 3 -bs 1 -d CPU --torch_compile_backend openvino -ic 128 --num_beams 1 -lc bfloat16_config.json 2>&1 | tee -a ./logs/0.log
Speculative decoding is a very popular method to speed up text generation significantly; moreover, it is already being adopted by industry.
Could you add such an example for text generation? For example, it could be Llama2-7B + TinyLlama.
This task regards enabling tests for gemma-2b-it. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
This task regards enabling tests for dolly-v2-3b. You can find more details under openvino_notebooks LLM question answering README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response
Conversion configs have been merged into optimum, so we can use optimum-cli directly.
This task regards enabling tests for zephyr-7b-beta. You can find more details under openvino_notebooks LLM chatbot README.md.
Please ask general questions in the main issue at #259
Described in the main Discussion issue at: #259
No response