randaller / llama-chat
Chat with Meta's LLaMA models at home made easy
License: GNU General Public License v3.0
I'm training the model with hf-training-example.py. I have 24 GB of GPU memory and I'm still getting the CUDA out-of-memory error. Please help.
I have tried decreasing num_train_epochs to 0.1 and it still doesn't work.
As the model was trained on "scientific-looking" data and wiki articles, we need to be "more scientific" when prompting.
Model: 30B, prompt:
Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\\begin{code}\n
generation:
Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\begin{code}
random.randint(-128, 512)
\end{code}
Answer: You can use `random.sample()`:
\begin{code}
>>> import random
>>> random.sample(range(-128, 513), 256)
[-49, 181, 121, 71, 119, 487, 201, 141,
I stopped the generation; I didn't want to wait for the full list of 256 integers.
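For reference, here is a minimal, runnable sketch of what the prompt asks for (my own example, not the model's output; note that random.sample draws distinct values, while randint allows repeats):

# Sketch of the requested snippet, not the model's actual completion.
import random

# Generate 256 random integers in the inclusive range [-128, 512]; repeats allowed.
values = [random.randint(-128, 512) for _ in range(256)]
print(values)

# random.sample(range(-128, 513), 256) also works, but it returns
# 256 *distinct* integers (sampling without replacement).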
Hi, I am ordering some RAM to work with LLaMA when I take a break in a few weeks. The README for this repo says "64 or better 128 Gb of RAM (192 or 256 would be perfect)". Is this alongside a CUDA card? I have a 3090. I can order up to 192 GB of RAM, if it makes a big difference. Will it?
Thanks!
I got this exception after typing anything at the prompt. Does anyone know what it means and how I can fix it?
Traceback (most recent call last):
File "/home/.../scripts/llama-chat/example-chat.py", line 118, in
fire.Fire(main)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/.../scripts/llama-chat/example-chat.py", line 111, in main
results = generator.generate(
File "/home/.../scripts/llama-chat/llama/generation.py", line 60, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/.../.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 264, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 189, in forward
h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
File "/home/.../scripts/llama-chat/llama/model.py", line 111, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: at::cuda::blas::gemm: not implemented for N3c108BFloat16E
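In case it helps others hitting the same traceback: this error typically means a bfloat16 matrix multiply reached a cuBLAS path that has no BFloat16 kernel on that GPU/PyTorch build. A possible workaround, purely my assumption and not the repo's official fix, is to force float16 instead of bfloat16 on CUDA:

# Hypothetical workaround sketch: use float16 rather than bfloat16 on CUDA.
import torch

# Set the default CUDA tensor type before the model is constructed/loaded...
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# ...or cast an already-built model explicitly (assumes a `model` object exists):
# model = model.half().cuda()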
Only User: outputs
Hi, I ran chat_example.py after merging the weights and got the following error when loading the model:
Loading checkpoint
Loading tokenizer
Loading model
Traceback (most recent call last):
File "/data/yao/apps/llama/chat.py", line 118, in <module>
fire.Fire(main)
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/yao/apps/llama/chat.py", line 93, in main
generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)
File "/data/yao/apps/llama/chat.py", line 68, in load
model = Transformer(model_args)
File "/data/yao/apps/llama/llama/model.py", line 205, in __init__
self.tok_embeddings = ParallelEmbedding(
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 186, in __init__
world_size = get_model_parallel_world_size()
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 152, in get_model_parallel_world_size
return torch.distributed.get_world_size(group=get_model_parallel_group())
File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 128, in get_model_parallel_group
assert _MODEL_PARALLEL_GROUP is not None, "model parallel group is not initialized"
AssertionError: model parallel group is not initialized
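For reference, Meta's reference llama example initializes the distributed and model-parallel groups before constructing the Transformer; a minimal sketch of that setup (assuming a torchrun-style launch, and not necessarily how this repo's chat.py is meant to be run) looks like this:

# Sketch based on Meta's reference example.py, not this repo's chat.py.
import os
import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Initialize the default process group first, then the model-parallel group.
torch.distributed.init_process_group("nccl")
initialize_model_parallel(world_size)
torch.cuda.set_device(local_rank)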
Please help!
Here is a sample:
User: which medicines can be used to treat hypertension?
AI: There are several medicines that can be used to treat hypertension. Some of the most common ones are:
The generation stops at "are:". Is there any solution for this?
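One thing worth checking, as an assumption on my part based on Meta's reference generation code: the number of newly generated tokens is capped by max_gen_len, so a small value can cut a reply off mid-sentence. Raising it in the generate call may help:

# Hypothetical call; parameter names follow Meta's reference generation.py.
results = generator.generate(
    prompts,
    max_gen_len=512,   # raise this if replies are cut off
    temperature=0.8,
    top_p=0.95,
)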
Hi, I am a beginner programmer. How can I generate the chat? I mean actually being able to talk to the AI; is there any way?
matteo@llama:~/llama-chat$ python3 merge-weights.py --input_dir ../LLaMA --model_size 13B
Traceback (most recent call last):
File "merge-weights.py", line 168, in <module>
main()
File "merge-weights.py", line 163, in main
model_size=args.model_size,
File "merge-weights.py", line 95, in write_model
f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'
To suppress the annoying progress bars, we found a way:
In /llama-chat/llama/model.py, line 261, replace
for layer in tqdm(self.layers, desc="flayers", leave=True):
with
for layer in self.layers:
In /llama-chat/llama/generation.py, line 60, replace
for cur_pos in trange(start_pos, total_len, desc="forward"):
with
for cur_pos in range(start_pos, total_len):
It seems that the GPTQ 4-bit model is already supported in this project.
https://github.com/qwopqwop200/GPTQ-for-LLaMa
When giving a prompt to the models, a progress bar is shown. In which file can we turn that off?
I have an RTX 4090 with 24 GB VRAM + 64 GB RAM. Is example-chat.py ready to work with this setup? Thanks!
Here's LLaMA 7B running on my pc:
https://asciinema.org/a/3WHhYURC5il3TKGHzNFRGc7VZ
Starting at 1:13, every word comes out with annoying progress bars; is that normal?
Thanks.
I have checked hf-training-example.py; by default it trains the model on the CPU. I have two GPUs, but if I enable the GPU in that code, I get a CUDA out-of-memory error. How can I limit GPU memory usage, just like the inference example you provided for CUDA?
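Not an official answer, but a common way to keep Hugging Face Trainer runs within GPU memory is to restrict the visible devices and shrink the per-device batch while using gradient accumulation and fp16. A sketch, with parameter values that are guesses rather than the repo's settings:

# Sketch only: restrict training to one GPU and reduce memory pressure.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,      # smallest possible batch per step
    gradient_accumulation_steps=8,      # keep a usable effective batch size
    fp16=True,                          # half-precision activations/gradients
    num_train_epochs=1,
)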
Hi,
For a more realistic scenario: if I want to feed all of the Bible text into LLaMA, how can I achieve that?
Example of bible data:
https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/kjv.txt
Thanks.
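One simple way to prepare such a corpus for a fine-tuning script, sketched under my own assumptions about the input format (hf-training-example.py may expect something different), is to download the file and split it into fixed-size chunks of text:

# Sketch: chunk kjv.txt into training samples of roughly equal size.
from urllib.request import urlopen

URL = "https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/kjv.txt"
text = urlopen(URL).read().decode("utf-8")

CHUNK_CHARS = 2000  # arbitrary chunk size; tune it to the tokenizer's context length
chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

with open("bible_train.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(chunks))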
It would be nice to have some way to input contextual data within the web interface, perhaps by uploading documents or JSON files, or by providing lists of web links.
I'm not sure what's possible with the Python library, so please ignore this if it's a ridiculous ask.
PS F:\downloads\llama-chat-main\llama-chat-main> python3.7 merge-weights.py --input_dir F:\Downloads\LLaMA --model_size 13B
Traceback (most recent call last):
File "merge-weights.py", line 168, in
main()
File "merge-weights.py", line 163, in main
model_size=args.model_size,
File "merge-weights.py", line 95, in write_model
f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'
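For what it's worth, the in-place dict merge operator |= used by merge-weights.py requires Python 3.9 or newer, so it fails on Python 3.7/3.8. Either run the script with Python 3.9+, or, as a sketch of a local workaround (the names below are illustrative, not the script's actual variables), replace the operator with dict.update:

# Python 3.9+ only: the in-place dict merge operator.
state_dict = {"a": 1}
extra = {"b": 2}
state_dict |= extra          # raises TypeError on Python < 3.9

# Equivalent that also works on Python 3.7/3.8:
state_dict.update(extra)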
Is this memory printout normal? I hoped at least the 13B model would work.
How can I edit the last layer of this model?
I'm testing 65B. One A100 is too slow; I want to use two or four.
I've managed to complete all the steps but the last, and when I run
'python example-chat.py ./model ./tokenizer/tokenizer.model'
I wait a few minutes then get a lot of error lines like:
size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
size mismatch for layers.39.ffn_norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for output.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
I'm following the get started guide for the structured data example:
https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html
The output is only given in array format: [('Tokyo',)]. I would expect a final answer, such as "Tokyo has the highest population..."
Furthermore, it looks like other simple queries do not return enough information, such as "Does Berlin have a higher population than Tokyo?". If I enter the same query into LangChain directly (without indexing), it gives me a more complete answer. How can I get a more complete answer with LlamaIndex, similar to what I get with LangChain?
LlamaIndex code:
index = GPTSQLStructStoreIndex(
[],
sql_database=sql_database,
table_name="city_stats",
)
response = index.query("Does Berlin have a higher population than Tokyo?", mode="default")
print(response)
LlamaIndex output:
INFO:root:> [query] Total LLM token usage: 180 tokens
[query] Total LLM token usage: 180 tokens
INFO:root:> [query] Total embedding token usage: 0 tokens
[query] Total embedding token usage: 0 tokens
[('No',)]
LangChain:
Code:
db_chain = SQLDatabaseChain(llm=llm, database=sql_database, verbose=True)
db_chain.run("Does Berlin have a higher population than Tokyo?")
Output:
Entering new SQLDatabaseChain chain...
Does Berlin have a higher population than Tokyo?
SQLQuery: SELECT city_name, population FROM city_stats WHERE city_name IN ('Berlin', 'Tokyo') ORDER BY population DESC LIMIT 5;
SQLResult: [('Tokyo', 13929286), ('Berlin', 600000)]
Answer: No, Tokyo has a higher population than Berlin.
Finished chain.
Dear llama-chat developer,
Greetings! I am vansinhu, a community developer and volunteer at InternLM. Your work has been immensely beneficial to me, and I believe it can be effectively utilized in InternLM as well. You are welcome to join our Discord: https://discord.gg/gF9ezcmtM3 . I hope to get in touch with you.
Best regards,
vansinhu
User: tell me about london city
It takes about 8 minutes and the reply is:
flayers: 100%|███████████████████████████| 60/60 [08:36<00:00, 8.62s/it]
------------------------------███████████| 60/60 [08:36<00:00, 4.63s/it]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city
Then it immediately kicks off another generation; the reply is:
------------------------------
flayers: 100%|███████████████████████████| 60/60 [07:36<00:00, 7.61s/it]
------------------------------███████████| 60/60 [07:36<00:00, 1.62it/s]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city
AI
And again it just repeats.
Hi,
Would anyone be able to tell me whether it is possible to have this kind of model perform a task like calling an API, and how one would do it?
I have a question, but first, thank you for sharing this amazing project; it's great for starters.
I have been using the chat bot, but when I run the code with CUDA on the 7B model it is super slow, I mean really, really bad.
But when I use the CPU it works way better, almost in real time. The auto-mapping feature is also very, very slow.
PC:
Intel i7 6th gen, 8 cores
Memory: 32 GB
Video card: Nvidia 2070, 8 GB
Could someone explain to me why the GPU performs worse than the CPU and memory?
Is it better to just get more memory to be able to run the 13B, or should I get one of those Nvidia cards like a Tesla with 24 GB?
My testing tells me it's better to get more memory instead of a very expensive GPU.
Thanks
Hi randaller,
I tried to use the CUDA tensor type by uncommenting the line "torch.set_default_tensor_type(torch.cuda.HalfTensor)" in the example, but got an error like this:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
My question is: how can I use the CUDA tensor type within your example?
Thanks in advance.
Kyle
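Not the repo's official answer, but that error usually means the model weights ended up on CUDA while the token tensor built by the generator stayed on the CPU. A generic sketch of the fix is to put both on the same device before the forward pass (the names below are illustrative, not this repo's actual code):

# Illustrative sketch: keep inputs and weights on the same device.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokens = torch.zeros((1, 16), dtype=torch.long)  # placeholder prompt tokens
tokens = tokens.to(device)                       # move the inputs to the GPU

# model = model.half().to(device)                # weights on the same device
# logits = model(tokens, 0)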