Comments (34)
Does it run on AMD? I can try AMD RX 480 😎
from stablelm.
RTX 3070 (8GB VRAM) Tuned-3B (fp16) ✅
RTX 3070 (8GB VRAM) Tuned-3B (fp32) 🚫
RTX 3070 (8GB VRAM) Tuned-7B (fp16) 🚫
RTX 3070 (8GB VRAM) Tuned-7B (fp32) 🚫
from stablelm.
For the sake of convenience (2x less download size/RAM/VRAM), I've uploaded 16-bit versions of tuned models to HF Hub:
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit
from stablelm.
@cduk it's pretty much straightforward:
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList

# Load the original checkpoint and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
# Cast the weights to 16-bit (and move them to the GPU).
model.half().cuda()
# Write the 16-bit model and the tokenizer to a local directory.
model.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')
tokenizer.save_pretrained('vvsotnikov/stablelm-tuned-alpha-3b-16bit')
It will save the model and the tokenizer locally, then you will have to upload them to Hub. Good luck!
from stablelm.
So I came up with the following, to use 8 bit quantization @cduk:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, StoppingCriteria, StoppingCriteriaList
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-tuned-alpha-7b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    # load_in_8bit is already set in quantization_config, so it is not repeated
    # here; newer transformers releases may reject the duplicate argument.
    quantization_config=quantization_config,
    device_map={'': 0},  # place the whole model on GPU 0
)
Loading peaks at around 13GB of RAM. I record the VRAM before running a prompt, then run the default script prompt and record the time and VRAM. After that I keep rerunning the script prompt and record the run times. These results are on a T4. I will play with llm_int8_threshold to see if other savings are possible. Again, no direct SSD storage = slow load times.
Running with 10.0 now, will update this table with results as they become available.
This table is for the Nvidia T4 16GB (15.3 GiB avail.) card.
Threshold | Initial VRAM | VRAM after first prompt | Loading time | First prompt | Subsequent prompts |
---|---|---|---|---|---|
4 | 9619 MiB | 10041 MiB | 10-13m | 2m21s | 2-5s |
6 | 9618 MiB | 10031 MiB | 10-13m | 43.8s | 3-6s |
10 | 9619 MiB | 10029 MiB | 10-13m | 2m39s | 3-6s |
The threshold has a negligible effect on VRAM. However, with a threshold of 4, prompts seem to run a bit faster (?)
from stablelm.
Able to run the tuned-alpha-3b on a 4070 Ti (12GB)
from stablelm.
3B f16 runs on a 2080 Ti, though you might need a lot of RAM to convert f32 to f16; the peak is around 24G.
Calculation of the lower bound of VRAM in GiB (weights only):
# 3B f16
>>> (3_638_525_952 * 2) / 1024 / 1024 / 1024
6.77728271484375
# 3B f32
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024
13.5545654296875
# 7B f16
>>> (7_869_358_080 * 2) / 1024 / 1024 / 1024
14.657821655273438
# 7B f32
>>> (7_869_358_080 * 4) / 1024 / 1024 / 1024
29.315643310546875
from stablelm.
Tesla P40 (24GB) - works
from stablelm.
Got the 7B running fine on my 4090.
from stablelm.
Not a gaming PC, but I just tried the Colab notebook with 83.5GB of System RAM and A100 with 40GB.
It's insanely fast to initialize and the prompts on the tuned-alpha-7B model took around 2 seconds to complete.
from stablelm.
I was able to get 3B parameter to work on CPU with 16GB of ram.
from stablelm.
Got 7B models working on my Tesla M40 w/ 24GB ram
from stablelm.
Nvidia T4 (16 GB) runs out of memory when trying to load the fp16 7B model. The 3B model runs smoothly in fp16.
from stablelm.
I had to disable torch.backends.cudnn and convert to float.
check out my repo https://github.com/astrobleem/Simple-StableLM-Chat
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.float().to(device)
torch.backends.cudnn.enabled = False
from stablelm.
RTX 3060 (12GB VRAM) Tuned-3B (fp16) OK
from stablelm.
7900XTX 24GB is OK with tuned-7B, based on docker-based ROCm5.5-rc5 and PyTorch2.0
from stablelm.
Best bet is 4-bit quantization. 7B will likely run in about 6 GB of VRAM at that level, as that's roughly the requirement for 7B with LLaMA.
from stablelm.
Keeping an eye on this issue.
from stablelm.
For the sake of convenience (2x less download size/RAM/VRAM), I've uploaded 16-bit versions of tuned models to HF Hub: https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit
Would you mind showing how you made the conversion? I'm new to this and would like to do the same for the base 7B model. Thanks.
from stablelm.
I only have 40GB of RAM. So the default code did not work for me for 7B.
By changing the first lines to this, RAM is limited to 17GB and the model loads in 9:50 min.
tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-base-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "StabilityAI/stablelm-base-alpha-7b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = model.to("cuda")
You need to pip install accelerate as well. Make those changes to avoid loading a 32-bit version of the model (34 GB) and then the weights separately (another 34 GB). You still download 2x the size, unlike with @vvsotnikov's 16-bit uploads, but my VM has gigabit so I don't mind.
However, it then crashes with the following error, because I have a T4, which only has 15.3GB :/
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 0; 14.62 GiB total capacity; 14.33 GiB already allocated; 185.38 MiB free; 14.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I'm confident that with some messing around it will fit on the T4. It's so close!
For now I will try running with device_map="auto" and report back.
https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/flan-ul2#running-on-low-resource-devices
from stablelm.
@antheas So close! Have you considered quantizing to 8-bit and seeing how well that works? I wonder whether 8bit 7B would out-perform fp16 3B. Both seem like they would fit within 8GB RAM on consumer GPUs.
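As a rough check on whether those two options fit in 8GB, here is a minimal sketch that estimates the size of the weights alone, using the parameter counts quoted elsewhere in this thread. The function name `weights_gib` is made up for illustration, and the estimate deliberately ignores activations, the KV cache, and CUDA context overhead.

```python
GIB = 1024 ** 3

# Parameter counts as quoted elsewhere in this thread.
PARAMS = {
    "3B": 3_638_525_952,
    "7B": 7_869_358_080,
}


def weights_gib(n_params: int, bits_per_param: int) -> float:
    """Return the size of the weights alone, in GiB (no runtime overhead)."""
    return n_params * bits_per_param / 8 / GIB


# Compare fp16 3B against 8-bit and 4-bit 7B.
for name, bits in [("3B", 16), ("7B", 8), ("7B", 4)]:
    print(f"{name} @ {bits}-bit: {weights_gib(PARAMS[name], bits):.2f} GiB")
```

On paper both fp16 3B and 8-bit 7B come in under 8 GiB, but this is only a lower bound: the 8-bit 7B run reported in this thread measured 9619 MiB (about 9.4 GiB) of VRAM on a T4 before the first prompt.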
from stablelm.
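With device_map="auto" I get the following map. The last 4 modules (two transformer layers, the final layer norm, and the output embedding) don't fit on the GPU and are placed on the CPU.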
model.hf_device_map = {'gpt_neox.embed_in': 0,
'gpt_neox.layers.0': 0,
'gpt_neox.layers.1': 0,
'gpt_neox.layers.2': 0,
'gpt_neox.layers.3': 0,
'gpt_neox.layers.4': 0,
'gpt_neox.layers.5': 0,
'gpt_neox.layers.6': 0,
'gpt_neox.layers.7': 0,
'gpt_neox.layers.8': 0,
'gpt_neox.layers.9': 0,
'gpt_neox.layers.10': 0,
'gpt_neox.layers.11': 0,
'gpt_neox.layers.12': 0,
'gpt_neox.layers.13': 0,
'gpt_neox.layers.14': 'cpu',
'gpt_neox.layers.15': 'cpu',
'gpt_neox.final_layer_norm': 'cpu',
'embed_out': 'cpu'}
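To see at a glance how much of a model was offloaded, a map like the one above can be summarized with a few lines of plain Python. This is only a sketch: the dictionary below is rebuilt by hand from the map posted above, and `summarize` is a hypothetical helper, not part of transformers or accelerate. In a real session you would pass `model.hf_device_map` instead.

```python
from collections import Counter

# Hand-built copy of the map posted above: layers 0-13 on GPU 0, the rest on CPU.
hf_device_map = {
    "gpt_neox.embed_in": 0,
    **{f"gpt_neox.layers.{i}": 0 for i in range(14)},
    "gpt_neox.layers.14": "cpu",
    "gpt_neox.layers.15": "cpu",
    "gpt_neox.final_layer_norm": "cpu",
    "embed_out": "cpu",
}


def summarize(device_map):
    """Count modules per device and list the modules placed on the CPU."""
    counts = Counter(device_map.values())
    offloaded = [name for name, dev in device_map.items() if dev == "cpu"]
    return counts, offloaded


counts, offloaded = summarize(hf_device_map)
print(dict(counts))
print(offloaded)
```

Every forward pass has to go through the CPU-resident modules, which is consistent with the slow inference times reported for this setup.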
Default inference takes 2m6s first time, 20s second time. Tad too slow for me. Example reply for untuned model:
What's your mood today?
What did you do yesterday? What's your dream today?
I dreamt I was taking a walk with my family in a quiet neighborhood. When bedtime came, I'd say I was unhappily married and had no kids. It somehow seemed the perfect dream, the ideal marriage. We walked to my
Loading took 18m, but I don't have access to direct SSD storage, so your mileage may vary.
With device_map="auto", each partition is loaded and transferred to the GPU sequentially, so RAM use is around 10GB.
@cduk will try now.
from stablelm.
However, with 4 prompts run a bit faster (?)
Isn't this expected? The lower the threshold, the more weights are converted to int8 (hence less compute to do).
from stablelm.
However, with 4 prompts run a bit faster (?)
Isn't this expected? The lower the threshold, the more weights are converted to int8 (hence less compute to do).
The way I read it, it's the opposite. According to its description, values follow a normal distribution, with most falling within [-3.5, 3.5]. llm_int8_threshold is the bound below which a value is converted to int8. This is because higher values are associated with outliers and can destabilize the model.
Might be wrong though.
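A toy illustration of that reading, in plain Python: values whose magnitude is below the threshold go down the int8 path, the rest are kept in higher precision. This is a simplification for intuition only. The sample values and the `split_outliers` helper are made up, and the actual bitsandbytes implementation works on hidden-state features rather than on individual scalars like this.

```python
def split_outliers(values, threshold):
    """Partition values by magnitude: below threshold -> int8, otherwise -> fp16."""
    int8_path = [v for v in values if abs(v) < threshold]
    fp16_path = [v for v in values if abs(v) >= threshold]
    return int8_path, fp16_path


# Hypothetical values: mostly small, plus a few large outliers.
sample = [0.3, -1.2, 2.8, -3.4, 4.5, -7.0, 12.0]

for threshold in (4.0, 6.0, 10.0):
    int8_path, fp16_path = split_outliers(sample, threshold)
    print(f"threshold={threshold}: {len(int8_path)} int8, {len(fp16_path)} fp16")
```

Under this reading, raising the threshold sends more values down the int8 path, not fewer, which is the opposite of the assumption in the quoted question.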
Built myself a little chatbot with ipywidgets. I'm playing a bit with the model now, it's quite fun.
By adding streamer=TextStreamer(tokenizer=tokenizer, skip_prompt=True) to model.generate(), responses are streamed.
from stablelm.
Ah, yes, sure, my mistake. Quite weird then :)
from stablelm.
I was able to get 3B parameter to work on CPU with 16GB of ram.
Did you use any tricks such as the dtype or similar?
from stablelm.
I am using Radeon 6900xt (16GB VRAM) and quick start code on README works well! (Using stabilityai/stablelm-tuned-alpha-7b)
I used rocm/pytorch docker with rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview version.
https://hub.docker.com/r/rocm/pytorch
EDIT: I tested it a little more and it seems that 16GB of memory is not enough.
When I set max_new_tokens to 1024, I got an OutOfMemory error.
It seems difficult to use the 7B model with 16GB of VRAM.
from stablelm.
Works on a P6000 (24GB), with up to about 3000 context before it OOMs.
from stablelm.
I can confirm Tuned-7B works on my A6000 Ada / 48Gb GPU :)
from stablelm.
Have any of you run into this error when you have the model running? I've attempted the method where the model is quantized to an 8-bit version, but it seems to cause this problem with the probability tensor/tokens.
For those of you who are using the 8-bit version of StableLM how did you get the ChatBot up and running?
from stablelm.
RTX 3080 Ti (12 GB VRAM) Tuned-3B ✅
RTX 3080 Ti (12 GB VRAM) Tuned-7B 🚫 (CUDA OOM)
from stablelm.
stablelm-tuned-alpha-3b (fp16) works on a Tesla K80. I load it on GPU2 because it runs cooler.
from stablelm.
Have any of you run into this error when you have the model running? I've attempted the method where the model is quantized to an 8-bit version, but it seems to cause this problem with the probability tensor/tokens.
For those of you who are using the 8-bit version of StableLM how did you get the ChatBot up and running?
I had the same error. Looking around, I found that the generate() function accepts a parameter called remove_invalid_values; if you set it to True it should work :) Here are the parameters I used:
tokens = model.generate(
    **inputs, max_new_tokens=64,
    temperature=0.7, do_sample=True,
    stopping_criteria=StoppingCriteriaList([StopOnTokens()]),
    remove_invalid_values=True
)
P.S.:
RTX 3080 (12GB VRAM) Tuned-7B (fp16) OK
from stablelm.