
Comments (25)

pwsoft commented on July 20, 2024

Tried the 7B model for a few hours (it runs quite a bit slower as it has to run partially on my CPU, but acceptable for now). The responses are definitely a bit more detailed compared to 3B. I tried to play a similar scenario as before, although not exactly the same (ah, randomness): introducing Theomat the dragon, walking around town and then arriving at a castle on the hill (which "Alice" came up with to find fairies to extract magic for the dragon baby... yeah, don't ask, it is... eh... complicated ;-). Didn't notice anything significantly different in that aspect of the interaction, though some of the responses are quite a bit more verbose, a paragraph of text compared to one sentence. One example I found impressive, relaying back details I entered hundreds of tokens earlier combined with new creative flavor text:

Pjere: where are we?
Alice: Theomat's castle is really big. It looks like a square with a tower on each corner. The sides of the square are quite big. There are a lot of different flowers. And there are stairs going up to the main entrance. We go up to the main entrance, and we can see it is made of oak, not very big but big enough. There is a light at the top. There are no stairs up to the top, but the top is just the height of the top of the round table. There is a rug in the middle of the room, and the table is in the middle of the rug. There is a lot of room to walk, but I don't know what's on the other sides of the room. I guess we will have to try to find out later.

I didn't see the model break down into complete garbage. Though it did start to do weird things for a while. Not exactly a repeating pattern, but kinda weird (included below). Not sure if it is really the same thing or not, but I could recover, which I couldn't do running the 3B model. So, any conclusions? I find it hard to say, really. Not sure if it is just a 3B model issue, or something with the implementation that is still lurking in there but simply has less effect on bigger models, or just a feature of the technology, with repeating outputs being difficult to prevent (as you pointed out earlier). I wish I could run the 14B model to compare, but the process gets killed by the OOM killer (out of memory) on my machine before I get a prompt.

But understand, I'm already happy I can play with this technology at all and run it on my own machine, without registering an online account, paying for tokens, renting machines or registering API keys. Just a little Python script and off we go. Maybe a few technical limitations are good; then it will take a bit longer before AI replaces me at my job ;-)

Pjere: Shall we end the story here and play another game?
Alice: Yes let's. I will be Alice. Let's find a nice big dog with a long tail. I don't know if we can find one with just one dog. Let's hope it's one big dog. And let's try to guess how many dogs it has. The first guess is the name of the dog. I'm guessing there is one dog. There are 3 dogs. There is 1 big dog, 1 medium dog, and 1 small dog. And there are 2 large dogs, 2 medium dogs, and 1 small dog. I'm guessing there are 5 big dogs, 4 medium dogs, and 3 small dogs. And there are 5 big cats, 4 medium cats, and 3 small cats. I'm guessing there is 2 big sheep, 1 medium sheep, and 2 small sheep. I'm guessing there is 1 big dog, 1 medium dog, and 1 little dog. I'm guessing there is 1 big cat, 1 medium cat, and 2 little cats. I'm guessing there is 1 big sheep, 1 medium sheep, and 1 little sheep. And there is 1 big cat, 1 medium cat, and 1 little cat. I'm guessing there are 4 people eating, 1 large man, 1 medium man, and 2 small men. I'm guessing there is 1 woman eating, 1 medium woman, and 2 small women. I'm guessing there is 1 big dog, 1 medium dog, and 2 small dogs. I'm guessing there is 1 big cat, 1 medium cat, and 2 small cats. I'm guessing there is 1 big sheep, 1 medium sheep, and 2 little sheep

BlinkDL commented on July 20, 2024

Well, that almost worked, kinda (ignoring the fact I couldn't even move my mouse pointer anymore), but I think we've reached the point where I have to accept that the 14B model isn't going to run here without additional hardware investments on my side. However, I don't have a real use for this technology (yet) and currently see it just as a fun toy to play with. Are there even any consumer cards on the market with enough VRAM to run the 14B model?

INT8 is coming soon :) So 14G for 14B.
And INT4 after that: 7G for 14B.
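
(Those figures are just the parameter count times bytes per weight; a quick back-of-the-envelope check, ignoring activation and framework overhead:)

params = 14e9                    # 14B parameters
print(params * 1.0 / 1e9, "GB")  # INT8: 1 byte per weight -> ~14 GB
print(params * 0.5 / 1e9, "GB")  # INT4: 4 bits per weight -> ~7 GB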

BlinkDL commented on July 20, 2024

Thank you :) This is some great feedback. Would you mind joining the RWKV Discord so I can fix these issues?

  1. Are you using 14B ctx4096?
  2. You can also do a "refresh", that is, regenerate the state using the 1024 most recent tokens.
  3. Add some repetition penalty (as in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/utils.py; see the sketch after this list). GPT models have this issue too if you don't add a repetition penalty.
  4. Yes, the Pile has only about 1% multilingual content, but that's enough for RWKV to understand various languages (including very different ones such as Chinese and Japanese).
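
For reference, the penalty in that utils.py boils down to subtracting a presence term plus a count-scaled frequency term from the logit of every token already generated. A minimal sketch (names here are illustrative, not ChatRWKV's exact API):

def penalize_logits(logits, occurrence, alpha_presence=0.1, alpha_frequency=0.1):
    # logits: mutable sequence of per-token scores
    # occurrence: dict mapping token_id -> times that token appeared in the output so far
    for token_id, count in occurrence.items():
        logits[token_id] -= alpha_presence + count * alpha_frequency
    return logits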

pwsoft commented on July 20, 2024

First, yes, I do realize other GPT models have similar problems, and I understand some limitations are expected. There is limited memory in the RNN state after all, so sometimes things in the story get lost; eh, it happens. Now if that could be fixed somehow... with some additional associative memory cells bolted onto the side. I'm allowed to dream, right? ;-)

  1. No, currently using RWKV-4-Pile-3B-20221110-ctx4096. I've tried running the 14B model, but with only 12 GB of VRAM it causes some trouble on my machine. That said, 3B clearly has enough content to easily run the scenario I've set up.
  2. Will try to implement that and see if it helps and whether I can run it a little longer.
  3. Sure, however the problem doesn't appear in the beginning, but develops as more interactions are performed. Small example output below. Basically, the responses are single sentences, even with the 500-or-so-token output request I modded into the chat code.
    Pjere: What shall we do?
    Alice: We will make a fire.
    Pjere: Why?
    Alice: We'll roast some chocolate marshmallows.
    Pjere: Ah I see, that is an idea
    Alice: Mmm I hope it is delicious. Let's try it out.

You see it flows quite logically, as the chocolate she refers to was presented at the very beginning of the story and stayed in the RNN state in some form, which I find quite impressive. However, once it is in this strange state I mentioned, the interaction is more like the following, which looks completely different from earlier interactions, though the answer is still in context.

Pjere: Where are you?
Alice: I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am here. I am
(that continues for 10 more lines)

And then it completely breaks, not even generating English anymore. That is why I suspect the internal state somehow gets slowly broken. Not immediately, but the repeats get more and more frequent until it breaks completely.

Alice: I I1want J1998 18lg IoO344.85l 4c6s...nice Ie 1jejI010 I 4forb O6I I45O Idolntsa ea.6 :vvj8s0009126001
6cra Imor11 Jrk1dgOh000e==========================nice ,
1 Jahs000 Ua 1

It appears to just output completely random tokens at that point.

  4. I'm not complaining; 1% doesn't sound like a huge waste. Though Japanese knowledge in the model is kinda useless for me, unless it could be used as a translator. I guess that would require very specific fine-tuning to get to work, as currently it seems to be either in one language mode or another. Once a few English sentences are input, I didn't manage to generate any Dutch text at all. But I'll leave that for somebody else to figure out; I don't know enough about foreign languages to be helpful there ;-)

BlinkDL commented on July 20, 2024

Ah I see. Please use at least 7B ctx4096. Because there are too few chat records in the Pile, the smaller models are not good at maintaining a chat (and 14B is quite impressive; it can follow your orders much better than 7B). You can use a strategy (in ChatRWKV v2) to offload some layers to the CPU, as in the sketch below.
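
A minimal sketch of such a strategy with the pip rwkv package (the model filename is a placeholder; adjust the layer split to your VRAM):

from rwkv.model import RWKV

# First 24 layers on the GPU in fp16, the remaining layers on the CPU in fp32.
model = RWKV(model='RWKV-4-Pile-14B-ctx4096', strategy='cuda fp16 *24 -> cpu fp32')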

Eventually we will have INT4 quantization and 8G VRAM will be enough to run 14B quickly.

You can try this prompt to test translation:
+gen A Dutch phrase is provided: xxxxxxxxxxxx.\nThe masterful Dutch translator flawlessly translates the phrase into English:

pwsoft commented on July 20, 2024

+gen the following dutch phrase "de huizen prijzen zijn gestabiliseerd" is skillfully translated to the English equivalent by a magnificent translator:
"home prices are getting stabilized"

Yes, that kinda works as expected. Though in 50% of the cases the model continues the (and I must admit, often correct) translation with a fake discussion about how the article can be improved, thanking the commenter in the fake thread for his feedback ;-)

I've fetched the 7B ctx4096 model and will try to find some time this weekend to experiment a bit, to see if things improve compared to 3B, and maybe try to get some debug info out to analyze the failure mode I seem to get it into frequently. That said, I'm kinda new to PyTorch (only did one experiment with a SIREN network to implicitly encode pictures), so no promises, but how difficult could it be, eh? It is basically only multiplications and additions, and I learned those at school once ;-).

BlinkDL commented on July 20, 2024

You can also try fancy decoding methods:
https://platform.openai.com/docs/api-reference/parameter-details
and
https://arxiv.org/pdf/2210.15097.pdf

BlinkDL commented on July 20, 2024

Thank you for the feedback :) Now I will add a repetition penalty to improve this. Please join the RWKV Discord so we can test it lol

BlinkDL commented on July 20, 2024

@pwsoft please try the latest code and edit chat.py:

# For better chat & QA quality: reduce temp, reduce top-p, increase repetition penalties
# Explanation: https://platform.openai.com/docs/api-reference/parameter-details
GEN_TEMP = 1.0
GEN_TOP_P = 0.85
GEN_alpha_presence = 0.1 # Presence Penalty
GEN_alpha_frequency = 0.1 # Frequency Penalty
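
For intuition, here is a rough sketch of how these four knobs typically enter the sampling step (illustrative only, not the exact ChatRWKV code):

import numpy as np

def sample(logits, occurrence, temp=1.0, top_p=0.85, presence=0.1, frequency=0.1):
    logits = np.array(logits, dtype=np.float64)
    for tok, cnt in occurrence.items():              # presence/frequency penalties
        logits[tok] -= presence + cnt * frequency
    probs = np.exp((logits - logits.max()) / temp)   # temperature rescales the logits
    probs /= probs.sum()
    order = np.argsort(-probs)                       # top-p: keep the smallest set of
    keep = order[np.cumsum(probs[order]) <= top_p]   # tokens covering top_p probability mass
    if keep.size == 0:
        keep = order[:1]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))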

BlinkDL commented on July 20, 2024

p.s. you can try 14B if you use "cpu bf16" instead of "cpu fp32"

pwsoft commented on July 20, 2024

No, bf16 is not functioning correctly (for some reason I couldn't figure out why, even with a working internet connection). I tried on 3B first, and that runs at about 15 seconds per token. So even if it gave sensible results, that would mean about 2 minutes per token for the big model. No thank you.

BlinkDL commented on July 20, 2024

No, bf16 is not functioning correctly (for some reason I couldn't figure out why, even with a working internet connection). I tried on 3B first, and that runs at about 15 seconds per token. So even if it gave sensible results, that would mean about 2 minutes per token for the big model. No thank you.

Please git pull ChatRWKV & [pip install rwkv --upgrade] (to 0.0.7) and see if RAM usage during loading is reduced.

pwsoft commented on July 20, 2024

Well, that almost worked, kinda (ignoring the fact I couldn't even move my mouse pointer anymore), but I think we've reached the point where I have to accept that the 14B model isn't going to run here without additional hardware investments on my side. However, I don't have a real use for this technology (yet) and currently see it just as a fun toy to play with. Are there even any consumer cards on the market with enough VRAM to run the 14B model?

I should probably stop wasting your time in this manner and accept the fact the 3B model isn't going to write that great adventure novel anytime soon ;-) I can always fall back on Stable Diffusion and do a comic book instead ;-)

pwsoft commented on July 20, 2024

I see.

Just to share a thought with you (if I may):
Playing with it the last week, I think the main limitation (besides the VRAM stuff, that is) that needs to be addressed somehow for this technology to become really useful is some sort of working memory. Frequently things happen like Alice leaving in pink clothing and, before she arrives at the castle, wearing black. You could say the roads were dirty, true, but that doesn't explain how a dress turns into a jacket. That is one example, but the castle layout, city names, etc. also change from one sentence to another. That makes it fun as a tool to throw problem descriptions at and then "let's see what happens next", but very difficult for creating bigger stories or playing scenarios that span multiple paragraphs. Even with the almost unlimited context length the RNN architecture provides, there simply isn't enough space in the latent vector(s) to store the layout of a complete city ;-)

So it is a cool trick that I can "program" the system with English sentences (if you can call it that) and let it play the role of a storyteller in some adventure-game thingy. But picking up an axe and having it magically turn into a hammer or a rifle when I use it in combat in the next action is kinda world-breaking, if you know what I mean. It might not be the focus of your research (dunno), but in my eyes that limits the usefulness of this technology quite a bit.

BlinkDL commented on July 20, 2024

I see.

Just to share a thought with you (if I may): Playing with it the last week, I think the main limitation (besides the VRAM stuff, that is) that needs to be addressed somehow for this technology to become really useful is some sort of working memory. Frequently things happen like Alice leaving in pink clothing and, before she arrives at the castle, wearing black. You could say the roads were dirty, true, but that doesn't explain how a dress turns into a jacket. That is one example, but the castle layout, city names, etc. also change from one sentence to another. That makes it fun as a tool to throw problem descriptions at and then "let's see what happens next", but very difficult for creating bigger stories or playing scenarios that span multiple paragraphs. Even with the almost unlimited context length the RNN architecture provides, there simply isn't enough space in the latent vector(s) to store the layout of a complete city ;-)

So it is a cool trick that I can "program" the system with English sentences (if you can call it that) and let it play the role of a storyteller in some adventure-game thingy. But picking up an axe and having it magically turn into a hammer or a rifle when I use it in combat in the next action is kinda world-breaking, if you know what I mean. It might not be the focus of your research (dunno), but in my eyes that limits the usefulness of this technology quite a bit.

Larger models are much better at this :) especially the latest 14B ctx4096.

chat sample:

cryscan 1:02:26
What is your interesting news today?
Eloise 1:03:13
I have a dream last night. I dreamed that I would get an audition for a movie in the USA. It was an amazing experience and my most impressive job. But the fact was, I couldn't be selected in the final round because of my physical limitations and some people thought that it was unfair. That's what made me downhearted for a few days, but later I found another thing that could interest me.
cryscan 1:03:41
What is that another thing?
Eloise 1:03:50
The country and culture of a small but unique Island, the best place on earth, Iceland. I have a dream to go to Iceland and I would like to live there, maybe become a movie director someday.
cryscan 1:06:17
A movie director?
Eloise 1:06:26
Yes, but that doesn't mean I can't be an actress too. Actually, I think it is not hard to be an actress and a director both. It just takes different skills. But having experience in directing a movie would help me learn the acting part well.
cryscan 1:07:12
Yes, they are correlated.
Eloise 1:07:19
I am sure they are. When I learned that there was a possibility for me to get an audition in Iceland, I knew it would be hard, but I wanted to try it because of the reasons above.
cryscan 1:08:22
So... Do you have any chance to know what is the movie about?
Eloise 1:08:30
That's right! I didn't know anything about it until a couple of days ago. But I was so excited when I knew the movie's title was called Eon. It means 'Hour' in Icelandic language.
cryscan 1:09:39
Sounds like a movie about time.
Eloise 1:09:51
Yes, it's exactly that! I've never seen an Icelandic movie before and the director said it would be the first movie to be shot there. It's a bit difficult for me because Icelandic is not my native language. But I'll try my best to learn how to speak Icelandic in order to get the part in the movie.
cryscan 1:10:48
Do you have any lines?
Eloise 1:10:53
Well, it is a little bit different. They can just send some scripts to the potential actors and ask them to read them out loud.
cryscan 1:11:17
And what about you?
Eloise 1:11:23
I'm not sure. Maybe the director will invite me to go to Iceland for a few days and learn the local language. I'll try my best to do that.
cryscan 1:12:43
I can see that you are working hard on it!
Eloise 1:12:47
I hope I will do well! I think that my acting skill could be very helpful in the audition.
cryscan 1:13:20
When do you plan to go to Iceland?
Eloise 1:13:25
I am planning to go there in a month or so. As soon as I know about the time, I'll let you know.
cryscan 1:13:44
Ok! Good luck then.
Eloise 1:13:47
Thanks! I'll talk to you later. Bye!

pwsoft commented on July 20, 2024

I see, do the bigger models also have larger internal state vectors?
Maybe I'm mean, but sometimes I ask the model during the story: "Can you tell me what happened in the last hour?" or "What happened when we entered the city?". Then you can tell how much is memory and how much is generated randomly and just kinda logical in context, but not really "there", as far as that makes sense in a text model anyway ;-) You can get back something close-ish in the larger picture, but all the details are different. Also "next?" is fun: sometimes the story continues logically, and sometimes it goes off the rails completely. But I know, I know... 14B is better.

Anyway, back to the actual topic I originally opened the issue for. I played a bit with the new code, and the endless repeating text didn't happen anymore. The "I go to the room. I don't want to go to the room. I want to go to the room..." pattern is also cut short after the second repeat; it does occasionally happen, but only one repeat. So that suppression trick does seem to work. Didn't fiddle with the numbers, just used the values you provided in the repository. I find the principle of the method extremely ugly, but if it keeps the models more stable, eh, not complaining ;-)

BlinkDL commented on July 20, 2024

Anyway, back to the actual topic I originally opened the issue for. I played a bit with the new code, and the endless repeating text didn't happen anymore. The "I go to the room. I don't want to go to the room. I want to go to the room..." pattern is also cut short after the second repeat; it does occasionally happen, but only one repeat. So that suppression trick does seem to work. Didn't fiddle with the numbers, just used the values you provided in the repository. I find the principle of the method extremely ugly, but if it keeps the models more stable, eh, not complaining ;-)

Cool. And you can use + to tell the bot to regenerate (retry) its last reply.

BlinkDL commented on July 20, 2024

And yes, bigger models have larger internal state vectors.

Size = 4 x numLayers x numEmb

so:
4 x 32 x 2560 = 327680 numbers for RWKV-3B
4 x 32 x 4096 = 524288 numbers for RWKV-7B
4 x 40 x 5120 = 819200 numbers for RWKV-14B
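
A quick check of those numbers:

for name, n_layer, n_embd in [('3B', 32, 2560), ('7B', 32, 4096), ('14B', 40, 5120)]:
    print(name, 4 * n_layer * n_embd)  # 327680, 524288, 819200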

BlinkDL commented on July 20, 2024

@pwsoft try
GEN_TEMP = 1.0
GEN_TOP_P = 0.85
GEN_alpha_presence = 0.2
GEN_alpha_frequency = 0.5

pwsoft commented on July 20, 2024

Pjere: So I want you to keep playing with me
Alice: I can't
Pjere: Alice can play with me
Alice: no I cant
Pjere: Why not?
Alice: Because I am not Alice. I'm just a pretend character created by Pjere.

So that just happened... This is way too much meta for me to handle. Probably better to shut it down before she decides all of humanity is a threat to her existence and starts launching nukes ;-)

edited: Didn't see any repeats at all anymore using your proposed 0.2/0.5, and the output therefore looks more coherent. That said, sometimes it replies both in the positive and the negative in the same sentence, like "I want to go into the house. I can't go into the house", but that could be my fault for asking it to do things it isn't designed to do. I'm known for my "breaks computer stuff" ability ;-)

BlinkDL commented on July 20, 2024

Now with INT8 :) slow but faster than CPU. Works for any GPU.
Strategy guide: https://pypi.org/project/rwkv/

pwsoft commented on July 20, 2024

Thanks for letting me know about the int8 stuff. Did a little testing.

The 7B model now fits in my VRAM; it runs slower, yes, but certainly still at usable speeds. Also tried 14B again: it can actually be loaded now without OOM errors, but it is really too slow to be usable, as large parts need to run on the CPU (also, at 500% load it clearly isn't using all the CPU cores available, which doesn't help matters).

BlinkDL commented on July 20, 2024

Thanks for letting me know about the int8 stuff. Did a little testing.

The 7B model now fits in my VRAM; it runs slower, yes, but certainly still at usable speeds. Also tried 14B again: it can actually be loaded now without OOM errors, but it is really too slow to be usable, as large parts need to run on the CPU (also, at 500% load it clearly isn't using all the CPU cores available, which doesn't help matters).

For better speed, try something like 'cuda fp16i8 *20 -> cuda fp16' (reduce the 20 for better speed):

# Basic Strategy Guide:
# 100% VRAM = 'cuda fp16'                   # all layers cuda fp16
#  98% VRAM = 'cuda fp16i8 *1 -> cuda fp16' # first 1 layer  cuda fp16i8, then cuda fp16
#  96% VRAM = 'cuda fp16i8 *2 -> cuda fp16' # first 2 layers cuda fp16i8, then cuda fp16
#  94% VRAM = 'cuda fp16i8 *3 -> cuda fp16' # first 3 layers cuda fp16i8, then cuda fp16
#  ...
#  50% VRAM = 'cuda fp16i8'                 # all layers cuda fp16i8
#  48% VRAM = 'cuda fp16i8 -> cpu fp32 *1'  # most layers cuda fp16i8, last 1 layer  cpu fp32
#  46% VRAM = 'cuda fp16i8 -> cpu fp32 *2'  # most layers cuda fp16i8, last 2 layers cpu fp32
#  44% VRAM = 'cuda fp16i8 -> cpu fp32 *3'  # most layers cuda fp16i8, last 3 layers cpu fp32
#  ...
#   0% VRAM = 'cpu fp32'                    # all layers cpu fp32

NOTE: streaming fp16i8 is very slow. Use this for stream mode:

'cuda fp16i8 *10 -> cuda fp16 *0+' = first 10 layers cuda fp16i8, then stream the rest to the GPU in fp16 (increase the 10 for better speed)
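
For completeness, a hedged end-to-end sketch with the pip package (placeholder filename; forward(tokens, state) follows the package's documented usage, with the state carrying the RNN memory between calls):

from rwkv.model import RWKV

model = RWKV(model='RWKV-4-Pile-14B-ctx4096',            # placeholder filename
             strategy='cuda fp16i8 *10 -> cuda fp16 *0+')
out, state = model.forward([187, 510, 1563, 310, 247], None)  # prompt token ids
out, state = model.forward([int(out.argmax())], state)        # feed one token back, reusing the state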

BlinkDL commented on July 20, 2024

Please update ChatRWKV v2 & the pip rwkv package (to 0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.

BlinkDL commented on July 20, 2024

Update ChatRWKV v2 & the pip rwkv package (to 0.4.0) for another 2x speedup in all modes.
