
llama-chat's People

Contributors

mucahitbz, randaller


llama-chat's Issues

CUDA Error on Training

I'm training the model with hf-training-example.py. I have 24 GB of GPU memory and I'm still getting the CUDA out-of-memory error. Please help.

I have tried decreasing num_train_epochs to 0.1, but it still does not work.
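A common first step with the Hugging Face Trainer is to shrink per-step memory rather than the epoch count (num_train_epochs only shortens training, it does not lower peak memory). The exact arguments in hf-training-example.py may differ; this is only a sketch of the relevant TrainingArguments knobs:

from transformers import TrainingArguments

# Hypothetical settings; adapt them to whatever hf-training-example.py already passes.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,   # smallest batch per optimizer step
    gradient_accumulation_steps=8,   # keeps the effective batch size without the memory cost
    gradient_checkpointing=True,     # recompute activations instead of storing them all
    fp16=True,                       # half precision roughly halves activation/optimizer memory
    num_train_epochs=1,
)

If the model itself does not fit in 24 GB even at batch size 1, parameter-efficient fine-tuning (e.g. LoRA) or offloading is usually needed.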

Share your best prompts and generations (and model name) here.

As the model was trained on "scientific-looking" data and wiki text, we need to be "more scientific" when prompting.

Model: 30B, prompt:

Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\\begin{code}\n

generation:

Write the Python code with detailed comments to generate 256 random integers in the range from -128 to 512, inclusive.
\begin{code}

random.randint(-128, 512)
\end{code}

Answer: You can use `random.sample()`:

\begin{code}
>>> import random
>>> random.sample(range(-128, 513), 256)
[-49, 181, 121, 71, 119, 487, 201, 141,

I stopped the generation; I did not want to wait for the full list of 256 integers.
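For reference, the approach the model started to give can be completed into a short runnable snippet (my completion, not part of the model output; note that random.sample draws distinct values, which the prompt did not actually require):

import random

# 256 random integers in the range -128..512 inclusive, without repeats.
numbers = random.sample(range(-128, 513), 256)
# If repeats are acceptable, use: [random.randint(-128, 512) for _ in range(256)]
print(numbers)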

Clarify requirements

Hi, I am ordering some RAM to work with LLAMA when I take a break in a few weeks. The README for this repo says "64 or better 128 Gb of RAM (192 or 256 would be perfect)". Is this alongside a CUDA card? I have a 3090. I can order up to 192GB of RAM, if it makes a big difference. Will it?

Thanks!

Exception RuntimeError: at::cuda::blas::gemm: not implemented

I got this exception after typing anything at the prompt. Does anyone know what it means and how I can fix it?

Traceback (most recent call last):
File "/home/.../scripts/llama-chat/example-chat.py", line 118, in
fire.Fire(main)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/.../.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/.../scripts/llama-chat/example-chat.py", line 111, in main
results = generator.generate(
File "/home/.../scripts/llama-chat/llama/generation.py", line 60, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/.../.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 264, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../scripts/llama-chat/llama/model.py", line 189, in forward
h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
File "/home/.../scripts/llama-chat/llama/model.py", line 111, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.../.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: at::cuda::blas::gemm: not implemented for N3c108BFloat16E
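This error usually means the matrices reaching cuBLAS are in bfloat16 (the N3c108BFloat16E in the message is the mangled BFloat16 type), which the installed PyTorch/GPU combination cannot multiply. A possible workaround, offered as an assumption rather than a confirmed fix for this repo, is to fall back to fp16 (or fp32) before the model is built:

import torch

# Assumption: the GPU/PyTorch build lacks bfloat16 GEMM support, so use half precision instead.
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# Or, for an already-loaded model object:
# model = model.half()    # fp16
# model = model.float()   # fp32, slower but supported everywhere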

"model parallel group is not initialized" when loading model

Hi, I ran chat_example.py after merging the weights, and then got the following error when loading the model:

Loading checkpoint
Loading tokenizer
Loading model
Traceback (most recent call last):
  File "/data/yao/apps/llama/chat.py", line 118, in <module>
    fire.Fire(main)
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/yao/apps/llama/chat.py", line 93, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)
  File "/data/yao/apps/llama/chat.py", line 68, in load
    model = Transformer(model_args)
  File "/data/yao/apps/llama/llama/model.py", line 205, in __init__
    self.tok_embeddings = ParallelEmbedding(
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 186, in __init__
    world_size = get_model_parallel_world_size()
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 152, in get_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_model_parallel_group())
  File "/data/yao/anaconda3/envs/chatgpt/lib/python3.10/site-packages/fairscale/nn/model_parallel/initialize.py", line 128, in get_model_parallel_group
    assert _MODEL_PARALLEL_GROUP is not None, "model parallel group is not initialized"
AssertionError: model parallel group is not initialized
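The ParallelEmbedding layer from fairscale expects the model-parallel process group to exist before the Transformer is constructed. Meta's original example does this in a small setup function; a sketch of that pattern (assuming the script is launched with torchrun, which provides the RANK/WORLD_SIZE/LOCAL_RANK environment variables) looks like:

import os
import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

def setup_model_parallel():
    # torchrun sets these; the defaults cover a single-process run.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    torch.distributed.init_process_group("nccl")   # use "gloo" for CPU-only runs
    initialize_model_parallel(world_size)          # one model-parallel group of this size
    torch.cuda.set_device(local_rank)
    return local_rank, world_size

Call this (or whatever equivalent the repo's example scripts use) before Transformer(model_args) is constructed.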

Request Help!

The generation stops at the ":" character.

Here is a sample:
User: which medicines can be used to treat hypertension?
AI: There are several medicines that can be used to treat hypertension. Some of the most common ones are:

The generation stops at "are:". Is there any solution for this?

Hi, I need help

Hi, I am a beginner programmer. What do I need to do to run the chat, i.e. to actually talk to the AI? Is there a way?
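In short: once the weights are merged into ./model and the tokenizer is in place, the chat is started from the command line exactly as quoted in the "size mismatch" issue further down (paths depend on where you put the files):

python example-chat.py ./model ./tokenizer/tokenizer.model

The script then prints a dialog where you type after "User:" and the model answers as "AI:".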

Chats are repeating

Hi, I successfully ran the code for the LLaMA 30B model. However, I noticed that the chat/interaction gets repeated after each question from the user.
Please check the screenshot.
How can I fix this issue?


merge-weights.py fails merging the 13B model

matteo@llama:~/llama-chat$ python3 merge-weights.py --input_dir ../LLaMA --model_size 13B
Traceback (most recent call last):
  File "merge-weights.py", line 168, in <module>
    main()
  File "merge-weights.py", line 163, in main
    model_size=args.model_size,
  File "merge-weights.py", line 95, in write_model
    f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'

To suppress the annoying progress bar, we found a way

To suppress the annoying progress bar, we found a way:
In /llama-chat/llama/model.py, line 261, replace
for layer in tqdm(self.layers, desc="flayers", leave=True):
with
for layer in self.layers:
In /llama-chat/llama/generation.py, line 60, replace
for cur_pos in trange(start_pos, total_len, desc="forward"):
with
for cur_pos in range(start_pos, total_len):

Train model using GPU

I have checked hf-training-example.py; by default it trains the model on the CPU. I have two GPUs, but if I enable the GPU in that code, I get the CUDA out-of-memory error. How can I limit GPU memory usage, just like in the inference example you provided for CUDA?
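One possible approach, assuming the script loads the model through transformers' from_pretrained, is to cap each GPU with accelerate's max_memory map so the remainder spills into CPU RAM (the class name, path, and limits below are illustrative, not taken from the repo):

from transformers import AutoModelForCausalLM

# Hypothetical load call; adjust the path and limits to your converted weights and hardware.
model = AutoModelForCausalLM.from_pretrained(
    "./model_hf",
    device_map="auto",                                     # let accelerate place layers
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},   # per-device caps, overflow to CPU
)

Layers kept on the CPU make training much slower, so the memory-saving TrainingArguments from the first issue above (batch size 1, gradient accumulation, fp16, gradient checkpointing) are usually worth trying first.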

why does this happen?

PS F:\downloads\llama-chat-main\llama-chat-main> python3.7 merge-weights.py --input_dir F:\Downloads\LLaMA --model_size 13B
Traceback (most recent call last):
File "merge-weights.py", line 168, in
main()
File "merge-weights.py", line 163, in main
model_size=args.model_size,
File "merge-weights.py", line 95, in write_model
f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
TypeError: unsupported operand type(s) for |=: 'dict' and 'dict'
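Both reports show the same root cause: the dictionary merge operator |= (PEP 584) used at merge-weights.py line 95 only exists in Python 3.9+, and this report explicitly runs python3.7. Either upgrade the interpreter to 3.9 or newer, or replace the operator with dict.update; a hedged sketch (state_dict is a placeholder name for whatever the script actually merges into):

# Python < 3.9 has no dict |= operator; dict.update() is the in-place equivalent.
# state_dict |= {f"layers.{layer_i}.ffn_norm.weight": ...}   becomes:
state_dict.update({
    f"layers.{layer_i}.ffn_norm.weight": loaded[0][f"layers.{layer_i}.ffn_norm.weight"],
})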

Error on run: size mismatch for ...

I've managed to complete all the steps but the last, and when I run
'python example-chat.py ./model ./tokenizer/tokenizer.model'

I wait a few minutes then get a lot of error lines like:

size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
size mismatch for layers.39.ffn_norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for norm.weight: copying a param with shape torch.Size([6656]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for output.weight: copying a param with shape torch.Size([32000, 6656]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
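The numbers themselves point at the cause: 6656 is the hidden dimension of the 30B LLaMA checkpoint, while 5120 is the 13B dimension, so the merged weights and the model configuration describe different model sizes (for example, 30B weights merged with --model_size 13B, or a stale params.json left in ./model). A quick check, assuming params.json sits next to the merged weights in ./model:

import json

# Inspect the config that the chat script builds the Transformer from.
with open("./model/params.json") as f:   # path assumed from the command above
    params = json.load(f)
print(params["dim"])   # 4096 = 7B, 5120 = 13B, 6656 = 30B, 8192 = 65B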

Incomplete answer with GPTSQLStructStoreIndex compared to LangChain

I'm following the getting-started guide for the structured data example:
https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html

The output is only given in array format: [('Tokyo',)]. I would expect a final answer, such as "Tokyo has the highest population...".

Furthermore, it looks like other simple queries do not return enough information, such as "Does Berlin have a higher population than Tokyo?". If I enter the same query in LangChain directly (without indexing), it does give me a more complete answer. How can I get a more complete answer with LlamaIndex, similar to what I get with LangChain?

LlamaIndex code:

index = GPTSQLStructStoreIndex(
    [],
    sql_database=sql_database,
    table_name="city_stats",
)
response = index.query("Does Berlin have a higher population than Tokyo?", mode="default")
print(response)

LlamaIndex output:

INFO:root:> [query] Total LLM token usage: 180 tokens

[query] Total LLM token usage: 180 tokens
INFO:root:> [query] Total embedding token usage: 0 tokens
[query] Total embedding token usage: 0 tokens
[('No',)]


LangChain:

Code:
db_chain = SQLDatabaseChain(llm=llm, database=sql_database, verbose=True)
db_chain.run("Does Berlin have a higher population than Tokyo?")

Output:

Entering new SQLDatabaseChain chain...
Does Berlin have a higher population than Tokyo?
SQLQuery: SELECT city_name, population FROM city_stats WHERE city_name IN ('Berlin', 'Tokyo') ORDER BY population DESC LIMIT 5;
SQLResult: [('Tokyo', 13929286), ('Berlin', 600000)]
Answer: No, Tokyo has a higher population than Berlin.
Finished chain.

test

Hello, when I ask a question, the model always seems to need to load again before the AI gives each word. Is that normal?
(screenshot attached)

[Feature Request] Support InternLM

Dear llama-chat developer,

Greetings! I am vansinhu, a community developer and volunteer at InternLM. Your work has been immensely beneficial to me, and I believe it can be effectively utilized in InternLM as well. You are welcome to join our Discord: https://discord.gg/gF9ezcmtM3 . I hope to get in touch with you.

Best regards,
vansinhu

Goes nowhere

User: tell me about london city

It takes about 8 minutes and the reply is:

flayers: 100%|███████████████████████████| 60/60 [08:36<00:00,  8.62s/it]
------------------------------███████████| 60/60 [08:36<00:00,  4.63s/it]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city

Then it immediately kicks off another generation; the reply is:

------------------------------
flayers: 100%|███████████████████████████| 60/60 [07:36<00:00,  7.61s/it]
------------------------------███████████| 60/60 [07:36<00:00,  1.62it/s]
A dialog, where User interacts with AI. AI is helpful, kind, obedient, honest, and knows its own limits.
User: Hello, AI.
AI: Hello! How can I assist you today?
User: tell me about london city
AI

And then it just repeats again.

Question - perform tasks

Hi,

Would anyone be able to tell me whether it is possible to have this kind of model perform a task like calling an API, and how one would do it?

GPU vs CPU which one is best?

I have a question, but first, thank you for sharing this amazing project; it's great for starters.
I have been using the chat bot, but when I run the code with CUDA on the 7B model it is super slow, I mean really, really bad.
When I use the CPU instead, it works much better, almost in real time. The automapping feature is also very, very slow.
PC:
Intel i7 6th gen, 8 cores
Memory: 32 GB
GPU: Nvidia 2070, 8 GB

Could someone explain to me why the GPU performs worse than the CPU and RAM?
Is it better to just get more memory so I can run the 13B model, or should I get one of those Nvidia cards like a Tesla with 24 GB?
My testing tells me it's better to get more memory instead of a very expensive GPU.

Thanks

How can I use a CUDA tensor?

Hi randaller,
I tried to use the CUDA tensor type by uncommenting the line "torch.set_default_tensor_type(torch.cuda.HalfTensor)" in the example, but got an error like this:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)

My question is: how can I use the CUDA tensor type within your example?
Thanks in advance.
Kyle
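That error means some tensors (typically the prompt/token buffers built in llama/generation.py) are still on the CPU while the default tensor type, and therefore the model, moved to CUDA. A hedged workaround, not a confirmed patch for this repo, is to create or move those inputs onto the same device:

import torch

torch.set_default_tensor_type(torch.cuda.HalfTensor)   # the uncommented line from the example

# Assumption: the token buffer is currently created on the CPU; the shape and pad id here
# are illustrative only.
tokens = torch.full((1, 512), 0, dtype=torch.long, device="cuda")
# or, for an existing CPU tensor:
# tokens = tokens.cuda()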
