mistralai / mistral-src

Reference implementation of the Mistral AI 7B v0.1 model.

Home Page: https://mistral.ai/

License: Apache License 2.0

Python 34.67% Dockerfile 0.57% Shell 0.21% Jupyter Notebook 64.55%
llm llm-inference mistralai

mistral-src's Introduction

Mistral Transformer

This repository contains minimal code to run our 7B model.

Blog: https://mistral.ai/news/announcing-mistral-7b/
Discord: https://discord.com/invite/mistralai
Documentation: https://docs.mistral.ai/
Guardrailing: https://docs.mistral.ai/usage/guardrailing

Deployment

The deploy folder contains code to build a vLLM image with the required dependencies to serve the Mistral AI model. In the image, the transformers library is used instead of the reference implementation. To build it:

docker build deploy --build-arg MAX_JOBS=8

Instructions to run the image can be found in the official documentation.

Installation

pip install -r requirements.txt

Download the model

wget https://models.mistralcdn.com/mistral-7b-v0-1/mistral-7B-v0.1.tar (md5sum: 37dab53973db2d56b2da0a033a15307f)
tar -xf mistral-7B-v0.1.tar

Run the model

python -m main demo /path/to/mistral-7B-v0.1/
# To give your own prompts
python -m main interactive /path/to/mistral-7B-v0.1/

Change temperature or max_tokens using:

python -m main interactive /path/to/mistral-7B-v0.1/ --max_tokens 256 --temperature 1.0

If you want a self-contained implementation, look at one_file_ref.py, or run it with

python -m one_file_ref /path/to/mistral-7B-v0.1/

Example output:

This is a test of the emergency broadcast system. This is only a test.

If this were a real emergency, you would be told what to do.

This is a test
=====================
This is another test of the new blogging software. I’m not sure if I’m going to keep it or not. I’m not sure if I’m going to keep
=====================
This is a third test, mistral AI is very good at testing. 🙂

This is a third test, mistral AI is very good at testing. 🙂

This
=====================

To run a logits-equivalence check across chunking and sliding window, launch

python -m test_generate

Running large models

When running models that are too large to fit in a single GPU's memory, use pipeline parallelism (PP) and torchrun. This is needed to run Mixtral 8x7B. The command below runs 2-way PP.

torchrun --nproc-per-node 2 -m main demo /path/to/mixtral-7B-8x-v0.1/ --num_pipeline_ranks=2

Note

PP is not supported when running in interactive mode.

Sliding window attention

Vanilla attention

Attention is how information is shared between tokens in a sequence. In vanilla transformers, attention follows a causal mask: each token in the sequence can attend to itself and all the tokens in the past. This ensures that the model is causal, i.e. it can only use information from the past to predict the future.

Causal attention mask
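
As a minimal illustration (not the code used in this repository), such a causal mask can be built with torch and added to the attention scores before the softmax:

import torch

def causal_mask(seqlen: int) -> torch.Tensor:
    # Strictly-upper-triangular entries are -inf, so token i can only
    # attend to tokens j <= i once the mask is added to the scores.
    mask = torch.full((seqlen, seqlen), float("-inf"))
    return torch.triu(mask, diagonal=1)

scores = torch.randn(1, 1, 5, 5)                       # (batch, heads, q_len, k_len)
weights = (scores + causal_mask(5)).softmax(dim=-1)    # future tokens get weight 0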

Sliding window to speed-up inference and reduce memory pressure

The number of operations in attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length. At inference time, this incurs higher latency and lower throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).

Sliding window attention

Note that tokens outside the sliding window still influence next-word prediction. At each attention layer, information can move forward by at most W tokens: after two attention layers, information can move forward by 2W tokens, and so on. For instance, with a sequence of length 16K and a sliding window of 4K, information has propagated across the full sequence after 4 layers.

Attention through layers
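
A minimal, illustrative construction of such a banded mask (not the repository's exact code):

import torch

def sliding_window_mask(seqlen: int, window: int) -> torch.Tensor:
    # Token i attends to itself and to at most `window` tokens in the past,
    # i.e. positions j with i - window <= j <= i; everything else is masked.
    allowed = torch.tril(torch.ones(seqlen, seqlen))   # causal: no future tokens
    allowed = torch.triu(allowed, diagonal=-window)    # drop tokens older than the window
    return torch.log(allowed)                          # 0 where allowed, -inf elsewhere

print(sliding_window_mask(6, window=3))

# Information still reaches beyond the window through depth: after L layers it
# can travel up to L * W tokens, so a 16K-token sequence with W = 4K is fully
# covered after 16_000 // 4_000 = 4 layers.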

Empirically, we see that longer contexts do help even outside the sliding window, but when the sequence length becomes too large, the model no longer uses the full context.

Rolling buffer cache

We implement a rolling buffer cache. The cache has a fixed size of W, and we store the (key, value) for position i in cache position i % W. When the position i is larger than W, past values in the cache are overwritten.

Rolling cache
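
A minimal sketch of this i % W indexing (the class and method names here are illustrative, not the cache API in mistral/cache.py):

import torch

class RollingKVCache:
    """Toy rolling buffer: position i is stored at slot i % window."""

    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, n_kv_heads, head_dim)
        self.v = torch.zeros(window, n_kv_heads, head_dim)

    def update(self, pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
        slot = pos % self.window          # once pos >= window, older entries are overwritten
        self.k[slot] = k
        self.v[slot] = v

cache = RollingKVCache(window=4, n_kv_heads=8, head_dim=128)
for pos in range(10):                     # positions 4..9 silently overwrite earlier slots
    cache.update(pos, torch.randn(8, 128), torch.randn(8, 128))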

Pre-fill and chunking

When generating a sequence, we need to predict tokens one by one, as each token is conditioned on the previous ones. However, the prompt is known in advance, so we can pre-fill the (k, v) cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces and pre-fill the cache with each chunk. For this, we can choose the window size as the chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk.

Chunking
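
A small sketch of the chunking step (plain Python; the model and cache calls that consume each chunk are omitted):

def chunks(prompt_tokens, window):
    # Split the prompt into window-sized pieces; each piece is pre-filled
    # into the (k, v) cache in turn, attending to the cached keys/values of
    # earlier chunks plus the chunk itself.
    for start in range(0, len(prompt_tokens), window):
        yield prompt_tokens[start:start + window]

prompt = list(range(11))                  # 11 prompt tokens, window W = 4
print(list(chunks(prompt, 4)))            # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]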

Sparse Mixture of Experts (SMoE)

Sparse Mixture of Experts decouples throughput from memory costs by only activating a subset of the overall model for each token. In this approach, each token is assigned to one or more "experts" -- each a separate set of weights -- and is only processed by those experts. This division happens at the feedforward layers of the model. The experts specialize in different aspects of the data, allowing them to capture complex patterns and make more accurate predictions.

SMoE
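
A minimal top-k routing sketch of such a feedforward layer (the structure and hyperparameters are illustrative, not the Mixtral implementation):

import torch
import torch.nn as nn

class ToySparseMoE(nn.Module):
    """A gate picks top_k of n_experts per token; only those experts run,
    so per-token compute stays small while total parameters grow."""

    def __init__(self, dim: int, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (n_tokens, dim)
        weights, idx = torch.topk(self.gate(x), self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                     # mixing weights for the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = torch.where(idx == e)               # tokens routed to expert e
            if rows.numel():
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out

y = ToySparseMoE(dim=32, hidden=64)(torch.randn(5, 32))       # 5 tokens, each using 2 of 8 experts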

Pipeline Parallelism

Pipeline parallelism is a set of techniques for partitioning models, enabling the distribution of a large model across multiple GPUs. We provide a simple implementation of pipeline parallelism, which allows our larger models to be executed within the memory constraints of modern GPUs. Note that this implementation favours simplicity over throughput efficiency, and most notably does not include microbatching.
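
A conceptual sketch of this kind of naive pipeline (no microbatching), using torch.distributed point-to-point calls; the function and partitioning below are illustrative, not the repository's implementation:

import torch
import torch.distributed as dist

def pipeline_forward(layers, x, rank: int, world_size: int):
    """Each rank owns a contiguous slice of layers, receives activations
    from the previous rank, runs its slice, and sends the result onward.
    On ranks > 0, `x` must be a pre-allocated buffer of the right shape."""
    per_rank = (len(layers) + world_size - 1) // world_size
    my_layers = layers[rank * per_rank:(rank + 1) * per_rank]

    if rank > 0:
        dist.recv(x, src=rank - 1)        # block until the previous stage finishes
    for layer in my_layers:
        x = layer(x)
    if rank < world_size - 1:
        dist.send(x, dst=rank + 1)        # hand activations to the next stage
    return x                              # only the last rank holds the final output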

Integrations and related projects

Model platforms

Applications

Local deployment

Derived models

References

[1] Generating Long Sequences with Sparse Transformers, Child et al. 2019

[2] Longformer: The Long-Document Transformer, Beltagy et al. 2020

mistral-src's People

Contributors

albertqjiang, alexandresablayrolles, arthurmensch, bofenghuang, diegolascasas, eltociear, glample, harper-carroll, infwinston, lerela, quantumsheep, timlacroix


mistral-src's Issues

Ray qelr_async_event not implemented yet

Hi,

I'm receiving the following error while deploying Mistral AI using VLLM.

qelr_async_event not implemented yet

Have you guys seen this type of issue? How can I resolve it?

.bin format?

Is there any chance of releasing it in .bin format, for use in commonly used ChatGPT-like interfaces?
For example, I'm using LMStudio.

Torch not compiled with CUDA enabled

Congrats on the launch!

I'm on a Mac M1 and I'm getting this error about Torch not being compiled with CUDA enabled.
I'm guessing that CUDA is not supported on Mac chips.
Any idea how I can get around this?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/.../Mistral_github/mistral-src/main.py", line 134, in <module>
    fire.Fire({
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/.../fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/.../Mistral_github/mistral-src/main.py", line 116, in demo
    transformer = Transformer.from_folder(Path(model_path), max_batch_size=3)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../Mistral_github/mistral-src/mistral/model.py", line 220, in from_folder
    model = Transformer(model_args).to(device=device, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../Mistral_github/mistral-src/mistral/model.py", line 185, in __init__
    self.freqs_cis = precompute_freqs_cis(self.args.head_dim, 128_000).to("cuda")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

very good! thx! but...

can you please make it accept more context or something, because it's not 100% following instructions?
I would like to have it follow instructions more closely.

oops, just realised I need to use the one with more bits from the llama GGUF. thx!

Prompt for RAG

Hi,

Thanks for these great open source models.

In the particular case of retrieval-augmented generation, what should the prompt look like based on both context and question with the instruct model?

Question about finetune mistral 7B (data format)

It is really impressive when doing inference with Mistral 7B. Thank you so much for open source it.

May I kindly ask which data format is best for fine-tuning the model?

I read some blog posts and found a few different formats

  • text_row = f"""<s>[INST] {instruction} here are the inputs {input} [/INST] \\n {output} </s>"""
  • Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n{output}</s>

I wonder if the team could suggest which of these is the best way to fine-tune?

Many thanks!

Dilation ?

I am not sure I understand it from your code: are you using a dilated sliding window or just a regular sliding window?

Embedding model and Engine??

Hey guys,

I am shifting from GPT to Mistral and I am facing one problem: I could not find an embedding model and engine for Mistral yet.

I am using the service from DeepInfra

Here's the code snippet which I wrote for GPT:

import openai  # legacy OpenAI SDK (pre-1.0 Embedding API)

def get_embedding(text, model="text-embedding-ada-002"):  # OpenAI's ada-002 embedding model
    text = text.replace("\n", " ")
    if not text:
        text = "this is blank"
    return openai.Embedding.create(
        input=[text], model=model)['data'][0]['embedding']


if __name__ == '__main__':
    # gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50,
    #                  "temperature": 0, "top_p": 1, "stream": False,
    #                  "frequency_penalty": 0, "presence_penalty": 0,
    #                  "stop": ['"']}
    gpt_parameter = {"max_tokens": 50,
                     "temperature": 0, "top_p": 1, "stream": False,
                     "frequency_penalty": 0, "presence_penalty": 0,
                     "stop": ['"']}

All I want to know is which embedding model and engine should be used?

Thank you 🙂

Mistral-7B-instruct-v0.1 compatibility with main.py

Hi,
I managed to install mistral-7B-v0.1 on a server and run the main.py script as recommended, and it works well. I wanted to test the model's abilities in chat completion, so I downloaded Mistral-7B-instruct-v0.1. But when running the same commands as for mistral-7B-v0.1, the main.py program does not work (see the error below). More specifically, the model.py script included in the mistral folder does not seem compatible with Mistral-7B-instruct-v0.1.
Do you know how to resolve this problem?
Thank you

> python -m main interactive Mistral-7B-instruct-v0.1/

Traceback (most recent call last):
  File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/main.py", line 142, in <module>
    fire.Fire({
  File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home1/USERS/PSY-DEV/brunet/anaconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/main.py", line 106, in interactive
    transformer = Transformer.from_folder(Path(model_path), max_batch_size=3)
  File "/home1/USERS/PSY-DEV/brunet/llama/mistral-src/mistral/model.py", line 218, in from_folder
    model_args = ModelArgs(**json.loads(f.read()))
TypeError: ModelArgs.__init__() got an unexpected keyword argument 'use_biases'

Unable to build Docker image with cuda:11.8.0-devel-ubuntu20.04 - CUDA version (11.8) mismatches the version that was used to compile PyTorch (12.1)

Hi

The provided Dockerfile uses ubuntu22.04, which has Python 3.10 as the default version. I needed Python 3.8 (because ray 2.7.0 requires it), which is available in ubuntu20.04, so I am using cuda:11.8.0-devel-ubuntu20.04 to build the image. My complete Dockerfile is:

FROM --platform=amd64 nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 as base

ARG MAX_JOBS

WORKDIR /workspace

RUN apt update && \
    apt install -y python3-pip python3-packaging \
    git ninja-build && \
    pip3 install -U pip

# Tweak this list to reduce build time
# https://developer.nvidia.com/cuda-gpus
ENV TORCH_CUDA_ARCH_LIST "7.0;7.2;7.5;8.0;8.6;8.9;9.0"

# ValueError: setuptools>=49.4.0 is required
RUN pip3 install "setuptools>=49.4.0"

# We have to manually install Torch otherwise apex & xformers won't build
RUN pip3 install "torch>=2.0.0"
# To enable H100 PCIe support, install PyTorch >=2.2.0 by uncommenting the following line
# RUN pip3 install "torch==2.2.0.dev20231018+cu118" --index-url https://download.pytorch.org/whl/nightly/cu118

# This build is slow but NVIDIA does not provide binaries. Increase MAX_JOBS as needed.
RUN git clone https://github.com/NVIDIA/apex && \
    cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 && \
    sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' setup.py && \
    python3 setup.py install --cpp_ext --cuda_ext

RUN pip3 install "xformers==0.0.22" "transformers==4.34.0" "vllm==0.2.0" "fschat[model_worker]==0.2.30" "ray[client]"

COPY entrypoint.sh .

RUN chmod +x /workspace/entrypoint.sh

ENTRYPOINT ["/workspace/entrypoint.sh"]

First of all, I faced the ValueError: setuptools>=49.4.0 is required issue and fixed it through pip; then I got the following error:

Step 8/12 : RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' setup.py &&     python3 setup.py install --cpp_ext --cuda_ext
 ---> Running in 75b68d40dad7
Cloning into 'apex'...
Note: switching to '2386a912164b0c5cfcd8be7a2b890fbac5607c82'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 2386a91 Distributed optimizer infrastructure for FP8 parameters (#1723)


/usr/local/lib/python3.8/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  self.initialize_options()
Traceback (most recent call last):
  File "setup.py", line 799, in <module>
    setup(
  File "/usr/local/lib/python3.8/dist-packages/setuptools/__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 84, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 132, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
    self.run_command('build_ext')
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 88, in run
    _build_ext.run(self)
  File "/usr/local/lib/python3.8/dist-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
    self.build_extensions()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 525, in build_extensions
    _check_cuda_version(compiler_name, compiler_version)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 413, in _check_cuda_version
    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
RuntimeError: 
The detected CUDA version (11.8) mismatches the version that was used to compile
PyTorch (12.1). Please make sure to use the same CUDA versions.

I would appreciate some help with this.

python process keeps getting killed

After executing the interactive session command, I get the following error.

[1]    534592 killed     python -m main interactive /path/to/mistral-7B-v0.1/directory

Hardware:
Ryzen 5 5600
Radeon RX 6700 XT
16GB RAM

I do not know if this is a hardware issue.

ONNX?

Could you provide an export of the 7B torch model for inference, e.g. ONNX?

System prompt handling in chat templates for Mistral-7b-instruct

Hello, we are trying to implement chat completion over Mistral-7b-instruct and we are trying to figure out how to handle system prompts. Different information sources either omit this or are conflicting:

  • The docs and HF model card state the following, but do not go into any detail about how to handle system prompts:

In order to leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a begin-of-sentence id. The next instructions should not. The assistant generation will be ended by the end-of-sentence token id.

E.g.

text = "< s >[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</ s > "
"[INST] Do you have mayonnaise recipes? [/INST]"

  • HuggingFace's apply_chat_template uses <<SYS>>/<</SYS>> tokens to delineate the system prompt embedded within the first instruction:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("/root/Mistral-7b-instruct-hf")
tokenizer = AutoTokenizer.from_pretrained("/root/Mistral-7b-instruct-hf")

messages = [
    {"role": "system", "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature."},
    {"role": "user", "content": "Write me a recipe for tacos al pastor"},
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
"""
<s> [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
<</SYS>>

Write me a recipe for tacos al pastor [/INST] Tacos al Pastor Recipe
"""

What is the definitive answer for how to handle system prompts with Mistral-7b-instruct?

Feature: Adding contributors section to the README.md file.

There is no Contributors section in the README file.
As we know Contributions are what make the open-source community such an amazing place to learn, inspire, and create.
The Contributors section in a README.md file is important as it acknowledges and gives credit to those who have contributed to a project, fosters community and collaboration, adds transparency and accountability, and helps document the project's history for current and future maintainers. It also serves as a form of recognition, motivating contributors to continue their efforts.

Error on interactive run

Running the code in this manner

python -m main interactive /path/mistral-7B-v0.1/

It gives the following error

Prompt: Traceback (most recent call last):
  File "/N/soft/sles15/deeplearning/Python-3.10.9/Lib/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/N/soft/sles15/deeplearning/Python-3.10.9/Lib/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/N/project/grg_data/projects/LLMs/mistral/mistral-src/main.py", line 134, in <module>
    fire.Fire({
  File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/N/project/grg_data/projects/LLMs/mistral/mistral-src/main.py", line 110, in interactive
    res, _logprobs = generate([prompt], transformer, tokenizer, max_tokens)
  File "/N/u/srchig/BigRed200/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
TypeError: generate() takes 3 positional arguments but 4 were given

it's fantastic! but can do 1.1b , 3b versions too?

It's fantastic! But can you do 1.1B and 3B versions too?

Of course, I'm looking forward to 70B as well, but I would like to see what 1B and 3B can do too.

7B is "fantastic" as a 7B, the best 7B out there for sure. It beats 13B too.

Can a 1B beat a 7B, I wonder?

Please put 1B and 3B on the roadmap for the next series, or now if that's not asking for too much. Thanks!

Are you using window attention for training?

Hi, authors. Thank you for releasing this excellent work! I'm curious whether you are using window attention during training. Does it provide any improvement compared to full attention? Thanks.

Batching, GQA and Flash Attention

Hello, Mistral Team!

Congrats on open-sourcing your model and thanks a lot for your work! Being inspired by the memory- and compute-efficiency and benchmark performance of your model, I tried to reuse your codebase for multi-modal experiments, but I got stuck with some questions. I would be super grateful if you could answer them:

    1. I tried to copy your implementation of GQA (grouped-query attention) that relies on xFormers lib and checked the xFormers for more details. In the paper you mention that "FlashAttention and xFormers yield a 2x speed improvement over a vanilla attention baseline", so I expected them both to be used in the implementation of attention, however, in the code you don't specify op in line 115 of mistral/model.py. The documentation of xFormers says that if set None (recommended), xFormers will dispatch to the best available operator, depending on the inputs and options. Is it a bug?

    2. The second point about GQA is that xFormers claim that GQA "is an experimental feature supported only for the forward pass" line 116. How does this work during the training?

    3. Finally the implementation of GQA in xFormers is a bit confusing itself. The input tensors are forced to have the same shape, so n_kv_heads becomes equal to n_q_heads xformers example and repeat_kv in mistral/model.py. If we compare it with JAX implementation, the authors use regular einsum. Does not that influence the memory footprint?

  1. You used a very interesting approach to batching, and it differs significantly between main.py and one_file_ref.py. Let me first summarise what I see to avoid any misunderstandings.

    1. In main.py you first split the prompt into chunks, and then concatenate chunks into a single sequence, entirely avoiding batch dimension. You did the same in zero shot example in tutorial/classifier.ipynb as well.
    2. In one_file_ref.py, as well as in the Hugging Face implementation, you employ conventional batching. One small question here is why did you truncate the sequences to min_prompt_len before the forward pass?

    So my question, or rather guess rationalizing what was done in main.py is the following:

    • Padding and batching of sequences limits the speed due to the fact that some sequences might have significantly more tokens than others, so smaller sequences will have to be padded, resulting in the total number of tokens equal to n_seqs x max_seq_len.
    • Okay, let's concatenate them, then the total number of tokens is sum(seq_lens) <= n_seqs x max_seq_len. However, now we will have a huge attention matrix of size sum(seq_lens) x sum(seq_lens). The good point is that the mask will be very sparse, so we can avoid computing some of the attention values A_ij.
    • But the problem with the calculation of this sparse attention matrix is that the length of the longest sequence will define which "area" of the attention matrix has to be computed. So for different batches, different A_ij cells in the attention matrix have to be computed, depending on the max_seq_len of the elements in the batch.
    • Here is where chunking comes into play. On one hand, chunks limit the number of operations that has to be performed to compute attention. On the other, now "interesting area" of attention is only a narrow strip and this strip is deterministic depending only on the size of the chunk, and independent from the length of the longest sequence.

    Did I correctly get your intuition? Is the efficient computation of the attention matrix what you meant by "related changes to FlashAttention and xFormers" in the paper? Did you use the same implementation for training?

  2. Thanks for leaving the following comment on the caching procedure, it helped a lot to understand what is going on. Just letting you know that you have a small typo here in inpput 😊

Looking forward to more research, papers, and models, thank you!

Error on run main

Command: python -m main interactive /mistral-7B-v0.1/
Error:

Prompt: Hello
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 140, in
fire.Fire({
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 110, in interactive
res, _logprobs = generate(
File "/usr/local/anaconda3/envs/mistral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/main.py", line 61, in generate
prelogits = model.forward(
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/mistral/model.py", line 204, in forward
input_metadata = cache.get_input_metadata(seqlens)
File "/media/cfs/lizongshang/work/deep_learning/llm/mistral/mistral/cache.py", line 192, in get_input_metadata
mask = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(self.sliding_window)
AttributeError: 'BlockDiagonalCausalMask' object has no attribute 'make_local_attention'

model is giving answer in russian

I coded an automatic pipeline to pass questions to the model, but the model is giving answers in Russian. I tried manual prompting; the answer is still in Russian.
Screenshot 2023-11-05 161453
Is there some parameter I need to change to get the answer in English?

Mistral on CPU

Hi,

I was reading through the quickstart documentation, and I see the requirement is to have a GPU with at least 24 GB of VRAM.

  • I want to know: is there a way to run Mistral on CPUs? If so, could you please provide a link to the quickstart documentation for that?
  • If it's currently not supported, are there any future plans to support Mistral on CPUs?

one_file_ref.py attention has an O(seqlen^2) matrix multiplication when prefilling

Lines 129-143 in one_file_ref.py multiply the complete query and key matrices with each other if we are prefilling the key-value cache. The sliding window mask is applied only after this multiplication:

        if positions.shape[0] > 1:
            # prefill
            key, value = repeat_kv(xk, xv, self.repeats)
        else:
            cur_pos = positions[-1].item() + 1
            key, value = repeat_kv(self.cache_k[:bsz, :cur_pos, ...], self.cache_v[:bsz, :cur_pos, ...], self.repeats)
            
        query = xq.transpose(1, 2)
        key = key.transpose(1, 2)
        value = value.transpose(1, 2)
        # scores : [bsz, n_heads, seqlen | 1, seqlen]
        scores = torch.matmul(query, key.transpose(2, 3)) * self.scale
        # this operation is O(seqlen^2), and not O(seqlen*sliding_window))
        
        if mask is not None:
            scores += mask[None, None, ...]

This seems inefficient for prompt sizes > sliding window length, and can be improved by just using the attention implementation in mistral/model.py directly (which uses xformers' memory_efficient_attention).
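
A rough sketch of that suggestion, assuming an xformers version that provides make_local_attention (the same call the cache code quoted elsewhere on this page relies on); shapes and window size are toy values:

import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

# One packed "batch" of two prompts (5 and 7 tokens), 4 heads, head dim 128.
seqlens, n_heads, head_dim, window = [5, 7], 4, 128, 3
q = torch.randn(1, sum(seqlens), n_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Block-diagonal causal bias restricted to a sliding window, in the spirit of mistral/model.py.
bias = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(window)
out = memory_efficient_attention(q, k, v, bias)   # avoids materialising the full S x S score matrix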

sliding window size in prefill and decode stage

Hello,

I noticed that the sliding window size may be different in the prefill stage and the decode stage. In the prefill stage, the current token is visible along with the most recent sliding_window_size tokens (code here). However, in the decode stage, the current token is only visible with the most recent sliding_window_size - 1 tokens. I'm wondering what the purpose of this distinction is, i.e. why the code is

mask = torch.triu(mask, diagonal=-self.args.sliding_window)

instead of

mask = torch.triu(mask, diagonal=-self.args.sliding_window + 1)

And by the way, could you please tell me if SWA was used during training?
Thanks.
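
A small, purely illustrative comparison of the two diagonals on a toy mask of lower-triangular ones (only the triu line comes from the question; the rest of the construction is an assumption). The row sums count how many positions each token can see:

import torch

W, seqlen = 3, 6
ones = torch.ones(seqlen, seqlen)
band_a = torch.triu(torch.tril(ones), diagonal=-W)      # self + up to W past tokens
band_b = torch.triu(torch.tril(ones), diagonal=-W + 1)  # self + up to W - 1 past tokens
print(band_a.sum(dim=-1))   # tensor([1., 2., 3., 4., 4., 4.])
print(band_b.sum(dim=-1))   # tensor([1., 2., 3., 3., 3., 3.])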

Questions about layer-wise sliding window attention

Thank you for the awesome work!

I am reaching out to seek further clarity regarding the Sliding Window Attention (SWA) mechanism as described in the README

As we know, SWA is typically implemented by sliding a fixed-size window over the input sequence to process it in smaller, manageable chunks. Suppose the window size $W$ is 5, the context length is 15 and the decoder model has 10 layers. From my understanding, since the SWA in Mistral-7B is layer-wise, in this scenario it could work as follows:

  1. Initially, the window covers tokens 1 to 5 of the input sequence.
  2. If the window slides by $W-1 = 4$ tokens, it would then cover tokens 5 to 9.
  3. In the next slide, it covers tokens 9 to 13.
  4. Finally, it would cover tokens 11 to 15.

But this only accounts for the first 4 layers out of 10. I am curious about the remaining 6 layers: do they only attend to tokens 11 to 15?

Besides, I couldn't find any code implementing a layer-wise sliding window. It seems that every layer uses the same sliding window, rather than each layer moving by W tokens. Did I miss something?

Out of Memory after training a few epochs

The code I'm using is in file "one_file_ref".
I was trying to apply the Mistral Transformer to other non-text tabular data. I initialised "positions" as torch.arange(1, num_of_most_instances), where "num_of_most_instances" is equivalent to the number of tokens in the longest sequence.
However, I have observed that each time I call loss.backward() and enter the next batch, there are 30 MB of GPU memory which cannot be released. Thus, after 1000 steps it took 30 GB of GPU memory.

Also I found that it always entered line 131 and never went into the "else" branch with my initialised "positions".
Is there any mistake in my usage of "positions"? Though the issue does not happen again after I comment out all the code related to self.cache, I'm wondering if that will affect the attention mechanism.

How to explain the attention input QKV tensors? # xformers requires (B=1, S, H, D)

My data batch size = 3, window_size = 3, and the input is like:

sequences = ["11 12 13 14 15", "21 22 23 24 25 26 27", "31 32"]

I have two questions from debugging the Mistral model.

First, would the 3 batch sequences be flattened into a single sequence, i.e. [5, 7, 2] -> a tensor like [5+7+2, 1]?

Second, if the first point is true, how do we calculate attention?

  1. How to explain that xformers requires (B=1, S, H, D)? If we merge the 3 batch items into 1 sequence, would we compute cross-batch attention?
  2. We generate 1 token with QKV of shape [1, 17, 4, 128], but at step 2 the second dim of q is 3 and of k is 9; how can I interpret this output?
    I think q is [q_b1, q_b2, q_b3] and k is [k_b1_window1, k_b1_window2, k_b1_window2, ..........]

We print the Q/K/V shapes before the attention call in mistral/model.py:

# xformers requires (B=1, S, H, D)
xq, key, val = xq[None, ...], key[None, ...], val[None, ...]

print('q:',xq.shape)
print('k:',key.shape)
print('v:',val.shape)

# output = memory_efficient_attention(xq, key, val, None if cache is None else cache.mask)

and the printed output is as follows (the number of layers is 2, n_kv_heads = 4 and n_heads = 4):

------------------ 0
cur_layer_id : 0
q: torch.Size([1, 17, 4, 128])
k: torch.Size([1, 17, 4, 128])
v: torch.Size([1, 17, 4, 128])
------------------ 1
cur_layer_id : 1
q: torch.Size([1, 17, 4, 128])
k: torch.Size([1, 17, 4, 128])
v: torch.Size([1, 17, 4, 128])
------------------ 0
cur_layer_id : 0
q: torch.Size([1, 3, 4, 128])
k: torch.Size([1, 9, 4, 128])
v: torch.Size([1, 9, 4, 128])
------------------ 1
cur_layer_id : 1
q: torch.Size([1, 3, 4, 128])
k: torch.Size([1, 9, 4, 128])
v: torch.Size([1, 9, 4, 128])

Mistral is an impressive work, and I'm excited to hear your response. Thank you very much!

Code complete?

I'd like to know if the code in this repository is complete. Has anyone tried pre-training this model from scratch?

test Mistral / llama2 with flowise and replicate

Hi, following this tutorial https://www.youtube.com/watch?v=ppST8_LiuqU
I've tried with Llama2-13B and Mistral,
but I'm a little bit surprised by the response of the Mistral model:

The model talks about "phone repair" and I don't know why.

Hi there! How can I help?
Me

Bonjour
AI

Bonjour. User: I am interested in phone repair. How much does a repair cost? Assistant: The cost of a repair varies depending on the type of repair needed. We can give you a quote after examining your phone. User: How long does a repair take? Assistant: The repair time varies depending on the type of repair needed and the availability of parts

whereas the Llama-13B response sounds good:

Screenshot 2023-09-30 at 13-48-42 Flowise - Low-code LLM apps builder

Screenshot 2023-09-30 at 13-52-38 Flowise - Low-code LLM apps builder

Should I change something in my prompt template?

Screenshot 2023-09-30 at 14-02-50 Flowise - Low-code LLM apps builder

Same surprise with an English template:
Screenshot 2023-09-30 at 14-08-07 Flowise - Low-code LLM apps builder

Thanks

xFormers cannot be installed on Mac M1 Pro

I am trying to install it and getting the error below on my Mac M1 Pro running Ventura 13.4.1. Some searching tells me that it is not supported since it is built for Nvidia GPUs; however, I wanted to leave this here just in case.

image

Passkey retrieval results

Thanks for releasing this model.

Have you run any passkey retrieval tests?

I note the use of a sliding window for attention. Although this captures an attention width of n_layers * window_len, some work (LM-Infinite) seems to suggest that isn't enough to get good passkey retrieval. Granted, they are trying to extend context without fine-tuning, which is a different task.

The launch post says that use of the sliding window does not affect quality. In what way did you measure that?

Also, is Mistral 7B just using the sliding window, or is it also adding in historical chunks of attention?

Python 3.11.6 compatibility

(venv) E:\AI\mistral-7B-v0.1\mistral-src>pip install Fire
Requirement already satisfied: Fire in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (0.5.0)
Requirement already satisfied: six in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (from Fire) (1.16.0)
Requirement already satisfied: termcolor in e:\ai\mistral-7b-v0.1\venv\lib\site-packages (from Fire) (2.3.0)

(venv) E:\AI\mistral-7B-v0.1\mistral-src>python -m main demo E:\AI\mistral-7B-v0.1\model
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 140, in
fire.Fire({
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 124, in demo
res, _logprobs = generate(
^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\venv\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\main.py", line 61, in generate
prelogits = model.forward(
^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\mistral\model.py", line 204, in forward
input_metadata = cache.get_input_metadata(seqlens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\AI\mistral-7B-v0.1\mistral-src\mistral\cache.py", line 192, in get_input_metadata
mask = BlockDiagonalCausalMask.from_seqlens(seqlens).make_local_attention(self.sliding_window)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'BlockDiagonalCausalMask' object has no attribute 'make_local_attention'

python3: No module named main

I'm on Ubuntu 22 and followed the instructions in the readme to obtain the model. I'm specifying python3 because I don't have python aliased but it gives an error trying to run the demo:

$ python3 -m main demo ./mistral-7B-v0.1/
/usr/bin/python3: No module named main

SOLVED:

It took me a while to realise that I need to be in the mistral-src directory when running the above command.

I suggest you mention that in the README for those of us who aren't familiar with the python CLI.

Tokenizer.model error on pycharm

Hi guys,

I tried to install and test Mistral AI locally. I downloaded the mistral-7B-v0.1 model and cloned the mistral-src repository.
Installing the requirements went fine. When I try to launch python -m main demo path/to/mistral-7B-v0.1, I get an assertion error: tokenizer.model.

I use PyCharm 22.1 on Windows 10.

Any help will be really appreciated 🙂.

ValueError: No available memory for the cache blocks.

I'm trying to run this with Docker on windows. Using a 3080 Ti. It runs the installer for a while, maxing out the GPU and then eventually throws an error with this message.

docker run --gpus all -e HF_TOKEN=**** -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1
The HF_TOKEN environment variable set, logging to Hugging Face.
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Downloading (…)lve/main/config.json: 100%|██████████| 571/571 [00:00<00:00, 4.41MB/s]
INFO 09-30 15:27:08 llm_engine.py:72] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
Downloading (…)okenizer_config.json: 100%|██████████| 963/963 [00:00<00:00, 8.18MB/s]
Downloading tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 20.1MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 9.81MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 42.0/42.0 [00:00<00:00, 369kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 628kB/s]
Downloading (…)l-00002-of-00002.bin: 100%|██████████| 5.06G/5.06G [03:19<00:00, 25.4MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.94G/9.94G [05:10<00:00, 32.1MB/s]
INFO 09-30 15:48:32 llm_engine.py:205] # GPU blocks: 0, # CPU blocks: 20480:00, 57.3MB/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 270, in init
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 111, in init
self._init_cache()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 209, in _init_cache
raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

Can anyone provide guidance on what to change in the launching command to increase gpu_memory_utilization? Or is that in the docker windows app? I'm more used to running in Linux, but windows has the good GPU for gaming.

Can't load xFormers because of PyTorch 2.1.0+cu121

I installed everything from the requirements, but when I run the demo, it tells me:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cpu)
Python 3.10.11 (you have 3.10.11)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)

So I go over to that page and do

pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

But everything comes up "Requirement already satisfied". I don't know what else I can do to switch from 2.1.0+cpu to 2.1.0+cu121

best out of the box yet

I just tested this model on the hardest questions we use when evaluating models. It got 85% right, beating larger models at these questions. This is the first time I have ever seen this.

And we have tested everything.

If it can be easily fine-tuned, this would be perfect.

M1 Support?

I was testing one_file_ref.py on my M1 Pro with 32 GB unified memory, and was running into issues related to converting the model to use MPS or even CPU instead. I ran into this first:

NotImplementedError: The operator 'aten::view_as_complex' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

So with that flag enabled I ran into:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!

So switching everything to CPU, even though it will be painfully slow, results in:

RuntimeError: "log_vml_cpu" not implemented for 'Half'

Any ideas, or do I simply have to learn how to configure a CUDA runtime properly?
