randaller / llama-chat
Chat with Meta's LLaMA models at home made easy
License: GNU General Public License v3.0
I'm training the model with hf-training-example.py. I have 24 GB of GPU memory and I'm still getting the CUDA out-of-memory error. Please help.
I have tried decreasing num_train_epochs to 0.1 and it still doesn't work.
As the model was trained on "scientific-looking" data and wiki articles, we need to be "more scientific" when prompting.
Model: 30B, prompt:
Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\\begin{code}\n
generation:
Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\begin{code}
random.randint(-128, 512)
\end{code}
Answer: You can use `random.sample()`:
\begin{code}
>>> import random
>>> random.sample(range(-128, 513), 256)
[-49, 181, 121, 71, 119, 487, 201, 141,
I stopped the generation; I didn't want to wait for the full list of 256 integers.
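For reference, here is a minimal, runnable sketch of what the prompt asks for (my own example, not the model's output; note that random.sample draws distinct values, while randint allows repeats):

# Sketch of the requested snippet, not the model's actual completion.
import random

# Generate 256 random integers in the inclusive range [-128, 512]; repeats allowed.
values = [random.randint(-128, 512) for _ in range(256)]
print(values)

# random.sample(range(-128, 513), 256) also works, but it returns
# 256 *distinct* integers (sampling without replacement).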
Hi, I am ordering some RAM to work with LLaMA when I take a break in a few weeks. The README for this repo says "64 or better 128 Gb of RAM (192 or 256 would be perfect)". Is this alongside a CUDA card? I have a 3090. I can order up to 192 GB of RAM, if it makes a big difference. Will it?
Thanks!
I got this exception after typing anything at the prompt. Does anyone know what it means and how I can fix it?
Traceback (most recent call last):
File "/home/.../scripts/llama-chat/example-chat.py", line 118, in
fire.Fire(main)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/.../scripts/llama-chat/example-chat.py", line 111, in main
results = generator.generate(
File "/home/.../scripts/llama-chat/llama/generation.py", line 60, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/.../.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 264, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 189, in forward
h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
File "/home/.../scripts/llama-chat/llama/model.py", line 111, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: at::cuda::blas::gemm: not implemented for N3c108BFloat16E
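In case it helps others hitting the same traceback: this error typically means a bfloat16 matrix multiply reached a cuBLAS path that has no BFloat16 kernel on that GPU/PyTorch build. A possible workaround, purely my assumption and not the repo's official fix, is to force float16 instead of bfloat16 on CUDA:

# Hypothetical workaround sketch: use float16 rather than bfloat16 on CUDA.
import torch

# Set the default CUDA tensor type before the model is constructed/loaded...
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# ...or cast an already-built model explicitly (assumes a `model` object exists):
# model = model.half().cuda()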
Only User: outputs
Hi, I ran chat_example.py after merging the weights and got the following error when loading the model:
Loading checkpoint
Loading tokenizer
Loading model
Traceback (most recent call last):
File "/data/yao/apps/llama/chat.py", line 118, in <module>
fire.Fire(main)
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/yao/apps/llama/chat.py", line 93, in main
generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)
File "/data/yao/apps/llama/chat.py", line 68, in load
model = Transformer(model_args)
File "/data/yao/apps/llama/llama/model.py", line 205, in __init__
self.tok_embeddings = ParallelEmbedding(
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 186, in __init__
world_size = get_model_parallel_world_size()
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 152, in get_model_parallel_world_size
return torch.distributed.get_world_size(group=get_model_parallel_group())
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 128, in get_model_parallel_group
assert _MODEL_PARALLEL_GROUP is not None, "model parallel group is not initialized"
AssertionError: model parallel group is not initialized
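For reference, Meta's reference llama example initializes the distributed and model-parallel groups before constructing the Transformer; a minimal sketch of that setup (assuming a torchrun-style launch, and not necessarily how this repo's chat.py is meant to be run) looks like this:

# Sketch based on Meta's reference example.py, not this repo's chat.py.
import os
import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Initialize the default process group first, then the model-parallel group.
torch.distributed.init_process_group("nccl")
initialize_model_parallel(world_size)
torch.cuda.set_device(local_rank)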
Please help!
Here is a sample:
User: which medicines can be used to treat hypertension?
AI: There are several medicines that can be used to treat hypertension. Some of the most common ones are:
The generation stops at "are:". Is there any solution for this?
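One thing worth checking, as an assumption on my part based on Meta's reference generation code: the number of newly generated tokens is capped by max_gen_len, so a small value can cut a reply off mid-sentence. Raising it in the generate call may help:

# Hypothetical call; parameter names follow Meta's reference generation.py.
results = generator.generate(
    prompts,
    max_gen_len=512,   # raise this if replies are cut off
    temperature=0.8,
    top_p=0.95,
)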
Hi, I am a beginner programmer. How can I generate the chat? I mean actually being able to talk to the AI; is there any way?
matteo@llama:~/llama-chat$ python3 merge-weights.py --input_dir ../LLaMA --model_size 13B
Traceback (most recent call last):
File "merge-weights.py", line 168, in <module>
main()
File "merge-weights.py", line 163, in main
model_size=args.model_size,
File "merge-weights.py", line 95, in write_model
f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'
To suppress the annoying progress bars, we found a way:
In /llama-chat/llama/model.py, line 261, replace
for layer in tqdm(self.layers, desc="flayers", leave=True):
with
for layer in self.layers:
In /llama-chat/llama/generation.py, line 60, replace
for cur_pos in trange(start_pos, total_len, desc="forward"):
with
for cur_pos in range(start_pos, total_len):
It seems that the GPTQ 4-bit model is already supported in this project.
https://github.com/qwopqwop200/GPTQ-for-LLaMa
When giving a prompt to the models, a progress bar is shown. In which file can we turn that off?
I have an RTX 4090 with 24 GB VRAM + 64 GB RAM. Is example-chat.py ready to work with this setup? Thanks!
Here's LLaMA 7B running on my pc:
https://asciinema.org/a/3WHhYURC5il3TKGHzNFRGc7VZ
Starting at 1:13, every word comes out with annoying progress bars; is that normal?
Thanks.
I have checked hf-training-example.py; by default it trains the model on the CPU. I have two GPUs, but if I enable the GPU in that code, I get a CUDA out-of-memory error. How can I limit GPU memory usage, just like the inference example you provided for CUDA?
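Not an official answer, but a common way to keep Hugging Face Trainer runs within GPU memory is to restrict the visible devices and shrink the per-device batch while using gradient accumulation and fp16. A sketch, with parameter values that are guesses rather than the repo's settings:

# Sketch only: restrict training to one GPU and reduce memory pressure.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,      # smallest possible batch per step
    gradient_accumulation_steps=8,      # keep a usable effective batch size
    fp16=True,                          # half-precision activations/gradients
    num_train_epochs=1,
)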
Hi,
For a more realistic scenario: if I want to feed all of the Bible text into LLaMA, how can I achieve that?
Example of bible data:
https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/kjv.txt
Thanks.
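One simple way to prepare such a corpus for a fine-tuning script, sketched under my own assumptions about the input format (hf-training-example.py may expect something different), is to download the file and split it into fixed-size chunks of text:

# Sketch: chunk kjv.txt into training samples of roughly equal size.
from urllib.request import urlopen

URL = "https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/kjv.txt"
text = urlopen(URL).read().decode("utf-8")

CHUNK_CHARS = 2000  # arbitrary chunk size; tune it to the tokenizer's context length
chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

with open("bible_train.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(chunks))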
It would be nice to have some way to input contextual data within the web interface, perhaps by uploading documents or JSON files, or by providing lists of web links.
I'm not sure what's possible with the Python library, so please ignore this if it's a ridiculous ask.
PS F:\downloads\llama-chat-main\llama-chat-main> python3.7 merge-weights.py --input_dir F:\Downloads\LLaMA --model_size 13B
Traceback (most recent call last):
File "merge-weights.py", line 168, in
main()
File "merge-weights.py", line 163, in main
model_size=args.model_size,
File "merge-weights.py", line 95, in write_model
f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'
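For what it's worth, the in-place dict merge operator |= used by merge-weights.py requires Python 3.9 or newer, so it fails on Python 3.7/3.8. Either run the script with Python 3.9+, or, as a sketch of a local workaround (the names below are illustrative, not the script's actual variables), replace the operator with dict.update:

# Python 3.9+ only: the in-place dict merge operator.
state_dict = {"a": 1}
extra = {"b": 2}
state_dict |= extra          # raises TypeError on Python < 3.9

# Equivalent that also works on Python 3.7/3.8:
state_dict.update(extra)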
Is this memory printout normal? I hoped at least the 13B model would work.
How can I edit the last layer of this model?
I'm testing 65B. One A100 is too slow; I want to use two or four.
I've managed to complete all the steps but the last, and when I run
'python example-chat.py ./model ./tokenizer/tokenizer.model'
I wait a few minutes then get a lot of error lines like:
size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
size mismatch for layers.39.ffn_norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for output.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
I'm following the get started guide for the structured data example:
https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html
The output is only given in array format: [('Tokyo',)]. I would expect a final answer, such as "Tokyo has the highest population..."
Furthermore, it looks like other simple queries do not return enough information, such as "Does Berlin have a higher population than Tokyo?". If I enter the same query into LangChain directly (without indexing), it gives me a more complete answer. How can I get a more complete answer with LlamaIndex, similar to what I get with LangChain?
LlamaIndex code:
index = GPTSQLStructStoreIndex(
[],
sql_database=sql_database,
table_name="city_stats",
)
response = index.query("Does Berlin have a higher population than Tokyo?", mode="default")
print(response)
LlamaIndex output:
INFO:root:> [query] Total LLM token usage: 180 tokens
[query] Total LLM token usage: 180 tokens
INFO:root:> [query] Total embedding token usage: 0 tokens
[query] Total embedding token usage: 0 tokens
[('No',)]
LangChain:
Code:
db_chain = SQLDatabaseChain(llm=llm, database=sql_database, verbose=True)
db_chain.run("Does Berlin have a higher population than Tokyo?")
Output:
Entering new SQLDatabaseChain chain...
Does Berlin have a higher population than Tokyo?
SQLQuery: SELECT city_name, population FROM city_stats WHERE city_name IN ('Berlin', 'Tokyo') ORDER BY population DESC LIMIT 5;
SQLResult: [('Tokyo', 13929286), ('Berlin', 600000)]
Answer: No, Tokyo has a higher population than Berlin.
Finished chain.
Dear llama-chat developer,
Greetings! I am vansinhu, a community developer and volunteer at InternLM. Your work has been immensely beneficial to me, and I believe it can be effectively utilized in InternLM as well. You are welcome to join our Discord: https://discord.gg/gF9ezcmtM3 . I hope to get in touch with you.
Best regards,
vansinhu
User: tell me about london city
It takes about 8 minutes and the reply is:
flayers: 100%|███████████████████████████| 60/60 [08:36<00:00, 8.62s/it]
------------------------------███████████| 60/60 [08:36<00:00, 4.63s/it]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city
Then it immediately kicks off another generation; the reply is:
------------------------------
flayers: 100%|███████████████████████████| 60/60 [07:36<00:00, 7.61s/it]
------------------------------███████████| 60/60 [07:36<00:00, 1.62it/s]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city
AI
And again it just repeats.
Hi,
Would anyone be able to tell me whether it is possible to have this kind of model perform a task like calling an API, and how one would do it?
I have a question, but first, thank you for sharing this amazing project; it's great for starters.
I have been using the chat bot, but when I run the code with CUDA on the 7B model it is super slow, I mean really, really bad.
But when I use the CPU it works way better, almost in real time. The auto-mapping feature is also very, very slow.
PC:
Intel i7 6th gen, 8 cores
Memory: 32 GB
Video card: Nvidia 2070, 8 GB
Could someone explain to me why the GPU performs worse than the CPU and memory?
Is it better to just get more memory to be able to run the 13B, or should I get one of those Nvidia cards like a Tesla with 24 GB?
My testing tells me it's better to get more memory instead of a very expensive GPU.
Thanks
Hi randaller,
I tried to use the CUDA tensor type by uncommenting the line "torch.set_default_tensor_type(torch.cuda.HalfTensor)" in the example, but got an error like this:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
My question is: how can I use the CUDA tensor type within your example?
Thanks in advance.
Kyle
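Not the repo's official answer, but that error usually means the model weights ended up on CUDA while the token tensor built by the generator stayed on the CPU. A generic sketch of the fix is to put both on the same device before the forward pass (the names below are illustrative, not this repo's actual code):

# Illustrative sketch: keep inputs and weights on the same device.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokens = torch.zeros((1, 16), dtype=torch.long)  # placeholder prompt tokens
tokens = tokens.to(device)                       # move the inputs to the GPU

# model = model.half().to(device)                # weights on the same device
# logits = model(tokens, 0)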