Comments (4)
Nothing is generated in the model folder? Can you provide more details on what's being printed?
from gpt-fast.
I can run inference:
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "name 10 animals similar to duck"
Loading model ...
Time to load model: 75.21 seconds
/home/pai/pytorch/gpt-fast/model.py:182: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:254.)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
/home/pai/pytorch/gpt-fast/model.py:182: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:292.)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
Compilation time: 68.25 seconds
name 10 animals similar to duck. Here are 10 animals that are similar to ducks:
1. Geese - Geese are similar to ducks in many ways, but they are generally larger and have longer necks.
2. Swans - Swans are larger than ducks and have a more slender, graceful appearance.
3. Coots - Coots are small to medium-sized birds that are similar to ducks in many ways, but they have a more rounded body shape and a distinctive red beak.
4. Grebes - Grebes are small to medium-sized birds that are similar to ducks in many ways, but they have a more slender body shape and a distinctive long neck.
5. Mergansers - Mergansers are small to medium-sized birds that are similar to ducks in many ways, but they have a more slender body shape and a distinctive black and
Time for inference 1: 56.20 sec total, 3.56 tokens/sec
Bandwidth achieved: 47.96 GB/s
name 10 animals similar to duck.
1. Goose
2. Swan
3. Turkey
4. Pheasant
5. Chicken
6. Quail
7. Pigeon
8. Crow
9. Heron
10. Ostrich HM Revenue & Customs (HMRC) is the UK’s tax, payments and customs authority. Its purpose is to collect taxes, pay benefits, and manage national insurance. HMRC also enquires into and investigates tax evasion and avoidance.
How does HMRC collect taxes?
HMRC collects taxes through various methods, including:
1. PAYE (Pay As You Earn) - employers deduct tax and National Insurance contributions from their employees' wages and pay them over to HMRC.
2. Self Assessment - individuals who are self-employed or have
Time for inference 2: 56.82 sec total, 3.52 tokens/sec
Bandwidth achieved: 47.44 GB/s
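(A sanity check, not part of the original log: decoding in gpt-fast is memory-bandwidth bound, so tokens/sec should be roughly the achieved bandwidth divided by the model size in bytes. A rough sketch, assuming Llama-2-7B with ~6.74B parameters in 16-bit weights:)

```python
# Rough sanity check: in memory-bound decoding, every generated token
# reads all model weights once, so tokens/sec ~= bandwidth / model bytes.
params = 6.74e9          # assumed parameter count of Llama-2-7B
bytes_per_param = 2      # fp16/bf16 weights
model_bytes = params * bytes_per_param        # ~13.5 GB
bandwidth = 47.96e9      # achieved bandwidth reported above, in bytes/s
est_tokens_per_sec = bandwidth / model_bytes
print(f"{est_tokens_per_sec:.2f} tokens/sec")  # close to the reported 3.56
```

So the reported throughput and bandwidth are consistent with each other; both are simply low on this hardware.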
But when I try to quantize, the message is:
(pyenv) (base) pai@localhost:~/pytorch/gpt-fast> python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Killed
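(For context, not from the original log: "Killed" usually means the Linux OOM killer terminated the process. During quantization the full-precision state dict and the quantized copy are held in CPU RAM at the same time, so a rough lower bound on peak memory can be sketched as follows; the parameter count is an assumption for Llama-2-7B.)

```python
# Rough, hypothetical estimate of peak CPU RAM during int8 quantization:
# the full bf16 state dict plus the int8 copy live in memory at once.
params = 6.74e9                      # assumed Llama-2-7B parameter count
bf16_bytes = params * 2              # original weights, 2 bytes each
int8_bytes = params * 1              # quantized weights, 1 byte each
peak_gb = (bf16_bytes + int8_bytes) / 1e9
print(f"~{peak_gb:.0f} GB minimum")
```

That is already around 20 GB before any working buffers, which a desktop with a 5600G iGPU may not have free.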
Trying GPTQ:
python quantize.py --mode int4-gptq --calibration_tasks wikitext --calibration_seq_length 2048
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization using GPTQ...
Traceback (most recent call last):
File "/home/pai/pytorch/gpt-fast/quantize.py", line 612, in <module>
quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
File "/home/pai/pytorch/gpt-fast/quantize.py", line 573, in quantize
quantized_state_dict = quant_handler.create_quantized_state_dict(
File "/home/pai/pytorch/pyenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/pai/pytorch/gpt-fast/quantize.py", line 281, in create_quantized_state_dict
inputs = GPTQQuantHandler.get_inputs(self.mod, tokenizer, calibration_tasks, calibration_limit, calibration_seq_length, pad_calibration_inputs)
File "/home/pai/pytorch/gpt-fast/quantize.py", line 252, in get_inputs
input_recorder = InputRecorder(
NameError: name 'InputRecorder' is not defined. Did you mean: 'input_recorder'?
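(A guess, not confirmed in the thread: the `NameError` suggests the import that provides `InputRecorder` was skipped, for example because it sits behind an optional dependency such as `lm_eval`, which does not appear in the pip list below. A hypothetical sketch of a guarded import that fails with a clear message instead of a later `NameError`; the module path and the `lm_eval` requirement are assumptions:)

```python
def require_input_recorder():
    """Return the InputRecorder class, or raise a clear error."""
    try:
        # InputRecorder is referenced by quantize.py's GPTQ path;
        # the module it lives in is assumed here to be eval.py.
        from eval import InputRecorder
    except ImportError as e:
        raise ImportError(
            "GPTQ calibration needs the eval harness; "
            "install it with `pip install lm_eval`"
        ) from e
    return InputRecorder
```

With a guard like this, a missing calibration dependency surfaces at import time rather than deep inside `create_quantized_state_dict`.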
content of my folder: /checkpoints/meta-llama/Llama-2-7b-chat-hf> ls
config.json
generation_config.json
LICENSE.txt
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.pth
model.safetensors.index.json
pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
pytorch_model.bin.index.json
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
USE_POLICY.md
pip list
Package Version
------------------- --------------------------
certifi 2022.12.7
charset-normalizer 2.1.1
filelock 3.9.0
fsspec 2023.10.0
huggingface-hub 0.19.4
idna 3.4
Jinja2 3.1.2
MarkupSafe 2.1.3
mpmath 1.2.1
networkx 3.0rc1
numpy 1.24.1
packaging 23.2
Pillow 9.3.0
pip 23.3.1
pytorch-triton-rocm 2.1.0+dafe145982
PyYAML 6.0.1
requests 2.28.1
sentencepiece 0.1.99
setuptools 65.5.0
sympy 1.11.1
torch 2.2.0.dev20231130+rocm5.7
torchaudio 2.2.0.dev20231130+rocm5.7
torchvision 0.17.0.dev20231130+rocm5.7
tqdm 4.66.1
typing_extensions 4.8.0
urllib3 1.26.13
python --version
Python 3.10.13
The performance here is a lot lower than I'd expect. What GPU are you using?
As for the quantization note, perhaps the issue is that you're running out of CPU memory at some point during the process? I don't see any reason why the quantization script would stop in the middle.
I am using the iGPU of a Ryzen 5600G CPU.
Yes, to quantize I need more memory. Thanks.