
Comments (32)

sneccc commented on August 30, 2024

I also noticed an increase in inference time: VRAM usage climbs very close to 23.6GB and it may be offloading to RAM while running a LoRA, even at fp8.

comfyanonymous commented on August 30, 2024

Can you check if this is fixed on the latest?

BigBanje commented on August 30, 2024

Same thing here. I reinstalled the standalone from the readme and reinstalled PyTorch; it still eats all my VRAM and crashes ComfyUI after a couple of generations every time.

RayHell commented on August 30, 2024

Same here. One generation is slow, the next is faster, then slower and even slower.

comfyanonymous commented on August 30, 2024

Reverted the change because it was causing too many issues. If you encounter the lora issue and need to use --disable-cuda-malloc to fix it let me know what your system specs are.

Bortus-AI commented on August 30, 2024

I have noticed the same thing.

WingeD123 commented on August 30, 2024

Same. I got an out-of-VRAM error using a 100MB LoRA.

jslegers commented on August 30, 2024

I noticed that UNETLoader.load_unet takes a lot more memory since the most recent changes when loading a FLUX transformer unet with weight_dtype fp8_e4m3fn.

Before the changes I could stay under 12GB of total VRAM usage when loading an fp8_e4m3fn version of flux1-schnell after first loading the t5xxl text encoder (given a minor tweak to unet_offload_device - see #4319).

After the changes, I run into the 16GB memory limit when the FLUX transformer unet is loaded.

See also #4341, #4318 & #4338

comfyanonymous commented on August 30, 2024

Can you try running it with: --disable-cuda-malloc to see if it improves things?
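
A rough sketch of what the flag toggles, assuming ComfyUI selects the CUDA allocator through PyTorch's standard PYTORCH_CUDA_ALLOC_CONF environment variable; the function below is illustrative Python, not the actual ComfyUI code.

import os

def configure_allocator(disable_cuda_malloc: bool) -> None:
    # With CUDA malloc enabled, PyTorch is asked to use the asynchronous
    # cudaMallocAsync backend; --disable-cuda-malloc skips this and leaves PyTorch
    # on its default caching allocator. This has to run before CUDA is initialized.
    if not disable_cuda_malloc:
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

configure_allocator(disable_cuda_malloc=True)  # roughly what the flag does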

WingeD123 commented on August 30, 2024

Can you try running it with: --disable-cuda-malloc to see if it improves things?

It works fine with --disable-cuda-malloc.
4060 Ti 16GB, torch 2.3.1, CUDA 12.1

WingeD123 commented on August 30, 2024

The error without --disable-cuda-malloc:

Error occurred when executing KSampler:

Allocation on device

File "D:\sd-ComfyUI\ComfyUI\execution.py", line 152, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\execution.py", line 82, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\execution.py", line 75, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\nodes.py", line 1382, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\nodes.py", line 1352, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\sample_error_enhancer.py", line 22, in informative_sample
raise e
File "D:\sd-ComfyUI\ComfyUI\custom_nodes\ComfyUI-Impact-Pack\modules\impact\sample_error_enhancer.py", line 9, in informative_sample
return original_sample(*args, **kwargs) # This code helps interpret error messages that occur within exceptions but does not have any impact on other operations.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\custom_nodes\ComfyUI-AnimateDiff-Evolved\animatediff\sampling.py", line 279, in motion_sample
return orig_comfy_sample(model, noise, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\sample.py", line 43, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\samplers.py", line 829, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\samplers.py", line 729, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\samplers.py", line 706, in sample
self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\sampler_helpers.py", line 66, in prepare_sampling
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required)
File "D:\sd-ComfyUI\ComfyUI\comfy\model_management.py", line 527, in load_models_gpu
cur_loaded_model = loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\model_management.py", line 325, in model_load
raise e
File "D:\sd-ComfyUI\ComfyUI\comfy\model_management.py", line 321, in model_load
self.real_model = self.model.patch_model(device_to=patch_model_to, patch_weights=load_weights)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\model_patcher.py", line 349, in patch_model
self.patch_weight_to_device(key, device_to)
File "D:\sd-ComfyUI\ComfyUI\comfy\model_patcher.py", line 327, in patch_weight_to_device
temp_weight = comfy.model_management.cast_to_device(weight, device_to, torch.float32, copy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sd-ComfyUI\ComfyUI\comfy\model_management.py", line 840, in cast_to_device
return tensor.to(device, dtype, copy=copy, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
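
A rough sketch of why this traceback ends in an out-of-memory error: the last frame is cast_to_device(weight, device_to, torch.float32, copy=True), i.e. each weight being patched gets a temporary float32 working copy on the GPU while the LoRA delta is merged. Standalone illustration (not ComfyUI code; the layer shape is made up):

import torch

def patch_overhead_mib(shape, stored_dtype=torch.float8_e4m3fn, work_dtype=torch.float32):
    # Size of one weight as stored vs. the transient float32 copy allocated while patching.
    numel = 1
    for dim in shape:
        numel *= dim
    stored = numel * torch.empty(0, dtype=stored_dtype).element_size()
    working = numel * torch.empty(0, dtype=work_dtype).element_size()
    return stored / 2**20, working / 2**20

stored, working = patch_overhead_mib((12288, 3072))  # an illustrative large linear layer
print(f"resident: {stored:.0f} MiB, extra while patching: {working:.0f} MiB")
# resident: 36 MiB, extra while patching: 144 MiB -- a 4x transient copy per fp8 weight,
# enough to tip a card that is already near its VRAM limit into "Allocation on device".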

comfyanonymous commented on August 30, 2024

Update and let me know if it's fixed.

WingeD123 commented on August 30, 2024

Update and let me know if it's fixed.

Updated, and it runs fine without adding '--cuda-malloc'.
After adding '--cuda-malloc', I got OOM again.

Squishums commented on August 30, 2024

After updating, and without specifying any cuda-malloc related args, vram usage and inference speed are back to normal. Restarted and ran several times without any issues. It was consistently failing before the update.

Thanks, comfy.

Danamir commented on August 30, 2024

On my system with a 3070 Ti (8GB VRAM) and 32GB RAM, I have the inverse problem. The default CUDA malloc was providing relatively good performance with Flux, without a noticeable slowdown when loading a LoRA. The new default degrades performance roughly 5x.

The --cuda-malloc option works correctly to restore the previous behavior.

cmcjas commented on August 30, 2024

Seconding Danamir. I'm running on a laptop with a 4060 (8GB VRAM) and 16GB RAM. Before the update, Flux NF4 was using around 7.4GB of VRAM and 30GB of RAM when generating a high-res image. But after the update, VRAM usage went up to 8.3GB, exceeding the available dedicated VRAM; as a result, render time went from 3.20 minutes to a whopping 24 minutes! Luckily, --cuda-malloc fixed the issue and now it behaves as it did before the update.

f-rank commented on August 30, 2024

After this change went through, some images generated with a LoRA come out blurry for some reason.
It seems random. I saw someone else with the same problem in a Reddit thread.

Danamir commented on August 30, 2024

Reverted the change because it was causing too many issues. If you encounter the lora issue and need to use --disable-cuda-malloc to fix it let me know what your system specs are.

Yeah, I think that was the safest thing to do. Thanks for your quick response!

Abocg commented on August 30, 2024

I have an RTX 2060 6GB. The original Flux Dev took 5 minutes per generation and Flux NF4 took one hour. I updated everything and tried with and without --disable-cuda-malloc, and the NF4 version still takes forever. Is it possible that the RTX 2000 series is not supported?

Mithadon commented on August 30, 2024

In my case, --disable-cuda-malloc combined with --lowvram solves the issue at fp8, but not at fp16. LoRAs made with ai-toolkit still load very, very slowly and with extra VRAM usage.
4090 24GB, 96GB RAM

hablaba commented on August 30, 2024

Not specifically Lora related, but since updating to use Flux I’ve noticed some of my old SDXL workflows that were right on the edge of my machines memory limits now OOM. I pulled the latest revert and tried the cuda malloc flags and it didn’t help. I reverted to an older commit from before Flux just to be safe and things work again as normal.

Peeking at the commit history I found this: b8ffb29

Wondering if that could be the cause as it appears to increase the amount of memory required by adding a larger buffer (100 to 300)?

ltdrdata commented on August 30, 2024

Not specifically Lora related, but since updating to use Flux I’ve noticed some of my old SDXL workflows that were right on the edge of my machines memory limits now OOM. I pulled the latest revert and tried the cuda malloc flags and it didn’t help. I reverted to an older commit from before Flux just to be safe and things work again as normal.

Peeking at the commit history I found this: b8ffb29

Wondering if that could be the cause as it appears to increase the amount of memory required by adding a larger buffer (100 to 300)?

Can you modify only that specific value in the latest version and test it for us?
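
A minimal sketch of that test, assuming b8ffb29 simply raised a reserved-VRAM buffer in comfy/model_management.py from 100 * 1024 * 1024 to 300 * 1024 * 1024 bytes (the constant name below is a placeholder; check the commit diff for the real identifier):

# hypothetical name for the value the commit touched, in comfy/model_management.py
RESERVED_BUFFER_BYTES = 300 * 1024 * 1024   # value after b8ffb29 (300 MiB)
# RESERVED_BUFFER_BYTES = 100 * 1024 * 1024 # pre-b8ffb29 value (100 MiB); switch to this,
# restart ComfyUI, and re-run the same workflow to compare peak VRAM usage and s/it.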

imHugoLeandro commented on August 30, 2024

In my case, --disable-cuda-malloc combined with --lowvram solves the issue at fp8

It solved the issue for me too. Using Flux at fp8 I previously couldn't load some anime/fae LoRAs; now with these 2 arguments I can load multiple LoRAs I couldn't before. I'm on a 4080, 64GB RAM.

Squishums commented on August 30, 2024

If you encounter the lora issue and need to use --disable-cuda-malloc to fix it let me know what your system specs are.

4080. 32GB system RAM. Win 10. Loading Flux Dev in fp8 with an fp8 text encoder.

Driver version didn't make a difference; I had problems on both year-old drivers and the latest Game Ready drivers.
Torch version didn't make a difference; I had problems on both torch 2.1(?) and 2.3.1.

BigBanje commented on August 30, 2024

I have a 3090 with 16GB of system RAM. I run Flux Dev at fp16, with one LoRA at a time, using the normal VRAM mode.

I still have an issue after the recent commits; however, the --disable-cuda-malloc flag seems to fix it.

When I run with that flag, VRAM usage goes back down once a generation finishes. For me it idles around 14GB.
When I don't use that flag, VRAM usage idles at 17GB, and any adjustments cause subsequent runs to max out usage and slow to speeds like 27s/it.

Stoobs commented on August 30, 2024

I'm running an RTX 3080 10GB, 64GB DDR5, a Zen 4 7950X, and ComfyUI portable, and I noticed this behaviour after updating through the Manager add-on in the last couple of days.

I went from ~2s/it up to 20+ s/it for an identical workflow.

I reinstalled (I'd kept the archive - commit hash b334605) and everything went back to normal with that version.

bghira commented on August 30, 2024

I wonder if mimalloc would have any place here.

We use it on other tools/use cases with memory-intensive workloads to override the malloc call on Linux with one from Microsoft's mimalloc, which has better (read: more efficient) support for huge pages and large TLB mappings, among other things.

Another one would be jemalloc, which seems to offer some benefits for different operations with dense compute calls, e.g. using SwiGLU.

Here is an example:

enable_mimalloc() {

	! [ -z "${MIMALLOC_DISABLE}" ] && echo "mimalloc disabled." && return
	LIBMIMALLOC_PATH='/usr/lib64/libmimalloc.so'
	if ! [ -f "${LIBMIMALLOC_PATH}" ]; then
		echo "mimalloc doesn't exist. You might really want to install this."
	else
		echo "Enabled mimalloc."
		export MIMALLOC_ALLOW_LARGE_OS_PAGES=1
		export MIMALLOC_RESERVE_HUGE_OS_PAGES=0 # Use n 1GiB pages
		export MALLOC_ARENA_MAX=1 # Tell Glibc to only allocate memory in a single "arena".
		export MIMALLOC_PAGE_RESET=0 # Signal when pages are empty
		export MIMALLOC_EAGER_COMMIT_DELAY=4 # The first 4MiB of allocated memory won't be hugepages
		export MIMALLOC_SHOW_STATS=0 # Display mimalloc stats
		export LD_PRELOAD="${LD_PRELOAD} ${LIBMIMALLOC_PATH}"
		return
	fi
	LIBHUGETLBFS_PATH="/usr/lib64/libhugetlbfs.so"
	if [ -f "${LIBHUGETLBFS_PATH}" ]; then
		export LD_PRELOAD="${LD_PRELOAD} ${LIBHUGETLBFS_PATH}"
		export HUGETLB_MORECORE=thp
		export HUGETLB_RESTRICT_EXE=python3.11
		echo "Enabled libhugetlbfs parameters for easy huge page support."
	else
		echo "You do not even have libhugetlbfs installed. There is very little we can do for your performance here."
	fi
}

configure_mempool() {
	export HUGEADM_PATH

	export HUGEADM_CURRENTSIZE

	# Current pool size (allocated hugepages)
	HUGEADM_CURRENTSIZE=$(hugeadm --pool-list | grep "${HUGEADM_POOLSIZE}" | awk '{ print $3; }')
	# Maximum pool size (how many hugepages)
	HUGEADM_MAXIMUMSIZE=$(hugeadm --pool-list | grep "${HUGEADM_POOLSIZE}" | awk '{ print $4; }')
	HUGEADM_PATH=$(which hugeadm)
	if [ -z "${HUGEADM_PATH}" ]; then
		echo 'hugeadm is not installed. Was unable to configure the system hugepages pool size.'
	fi
	export HUGEADM_FREE
	export TARGET_HUGEPAGESZ=0 # By default, we'll assume we need to allocate zero pages.
	HUGEADM_FREE=$(expr "${HUGEADM_MAXIMUMSIZE}" - "${HUGEADM_CURRENTSIZE}")
	if [ "${HUGEADM_FREE}" -lt "${HUGEADM_PAGESZ}" ]; then
		# We don't have enough free hugepages. Let's go for gold and increase it by the current desired amount.
		TARGET_HUGEPAGESZ=$(expr "${HUGEADM_PAGESZ}" - "${HUGEADM_FREE}")
		sudo "${HUGEADM_PATH}" --hard --pool-pages-max "2MB:${TARGET_HUGEPAGESZ}" || echo "Could not configure hugepages pool size via hugeadm."
		echo "Added ${TARGET_HUGEPAGESZ} to system hugepages memory pool."
	else
		echo "We have enough free pages (${HUGEADM_FREE} / ${HUGEADM_MAXIMUMSIZE}). Continuing."
	fi
}

restore_mempool() {
	if [ "${TARGET_HUGEPAGESZ}" -gt 0 ]; then
		echo "Being a good citizen and restoring memory pool size back to ${HUGEADM_MAXIMUMSIZE}."
		sudo "${HUGEADM_PATH}" --hard --pool-pages-max "2MB:${HUGEADM_MAXIMUMSIZE}" || echo "Could not configure hugepages pool size via hugeadm."
	else
		TOTAL_MEM_WASTED=$(expr "${HUGEADM_MAXIMUMSIZE}" \* 2)
		echo "There were no extra hugepages allocated at startup, so there is nothing to clean up now. You could free ${TOTAL_MEM_WASTED}M for other applications by reducing the maximum pool size to zero by default."
	fi
}

### How to load / use it
configure_mempool
enable_mimalloc

## call comfyui here
. ./start_ui.sh # or the correct start command

# Unconfigure hugepages if we've altered the system environment.
restore_mempool

You'll need libhugetlbfs installed, and mimalloc from https://github.com/microsoft/mimalloc.

It gives me a 6-40% speedup on various operations, but nothing consistent across the board. The total speedup for a generation was 13%.

brandostrong commented on August 30, 2024

Not specifically Lora related, but since updating to use Flux I’ve noticed some of my old SDXL workflows that were right on the edge of my machines memory limits now OOM. I pulled the latest revert and tried the cuda malloc flags and it didn’t help. I reverted to an older commit from before Flux just to be safe and things work again as normal.
Peeking at the commit history I found this: b8ffb29
Wondering if that could be the cause as it appears to increase the amount of memory required by adding a larger buffer (100 to 300)?

Can you modify only that specific value in the latest version and test it for us?

I tried both; both OOMed on my 4090 (I can run it within memory once or twice on a fresh restart of Windows). The older commit was, if anything, slower. Both took up about 23.4GB of VRAM and 1.5GB of shared memory. Tested once.

(20 steps, default settings, lora loaded)

Current: 300 * 1024 * 1024

loaded partially 21661.2 21637.845825195312 11
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [05:16<00:00, 15.81s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 407.55 seconds

Old: 100 * 1024 * 1024

loaded partially 21661.2 21637.845825195312 11
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [06:00<00:00, 18.02s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 439.91 seconds

I would like to mention that sampling with both kohya's and ostrisai's LoRA trainers uses quite a bit less VRAM and never OOMs, so this model should fit eventually :)

Sampling at 20 steps takes about 25-35 seconds using ostrisai's.

buggsi commented on August 30, 2024

Can you try running it with: --disable-cuda-malloc to see if it improves things?

What does that do? Will I need it on my RTX 3070 8GB to avoid potential problems?

nailz420 commented on August 30, 2024

Running Forge with FLUX 1 Dev NF4 without --cuda-malloc (the console suggests enabling it, meaning it's not enabled) still causes an OOM (16GB VRAM). I assume Forge uses the same components as Comfy for FLUX, making this relevant.

Begin to load 1 model
[Unload] Trying to free 9411.13 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 9773.37 MB ...
[Memory Management] Current Free GPU Memory: 9773.37 MB
[Memory Management] Required Model Memory: 6246.84 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 2502.53 MB
Patching LoRAs: 43%|███████████████████████████▎ | 130/304 [00:06<00:08, 19.72it/s]ERROR lora diffusion_model.double_blocks.13.img_mod.lin.weight CUDA out of memory. Tried to allocate 216.00 MiB. GPU
Patching LoRAs: 44%|████████████████████████████▍ | 135/304 [00:06<00:12, 13.30it/s]ERROR lora diffusion_model.double_blocks.13.txt_mod.lin.weight CUDA out of memory. Tried to allocate 216.00 MiB. GPU
ERROR lora diffusion_model.double_blocks.13.txt_attn.qkv.weight CUDA out of memory. Tried to allocate 108.00 MiB. GPU
Patching LoRAs: 45%|████████████████████████████▊ | 137/304 [00:06<00:11, 14.01it/s]ERROR lora diffusion_model.double_blocks.13.txt_mlp.0.weight CUDA out of memory. Tried to allocate 144.00 MiB. GPU
Patching LoRAs: 46%|█████████████████████████████▎ | 139/304 [00:06<00:08, 19.95it/s]
Traceback (most recent call last):
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules_forge\main_thread.py", line 30, in work
self.result = self.func(*self.args, **self.kwargs)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\txt2img.py", line 110, in txt2img_function
processed = processing.process_images(p)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 809, in process_images
res = process_images_inner(p)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 952, in process_images_inner
samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 1323, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\sd_samplers_kdiffusion.py", line 194, in sample
sampling_prepare(self.model_wrap.inner_model.forge_objects.unet, x=x)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\sampling\sampling_function.py", line 356, in sampling_prepare
memory_management.load_models_gpu(
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 575, in load_models_gpu
loaded_model.model_load(model_gpu_memory_when_using_cpu_swap)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 384, in model_load
raise e
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 380, in model_load
self.real_model = self.model.forge_patch_model(patch_model_to)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\patcher\base.py", line 228, in forge_patch_model
self.lora_loader.refresh(target_device=target_device, offload_device=self.offload_device)
File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\patcher\lora.py", line 352, in refresh
weight = weight.to(dtype=torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB. GPU
CUDA out of memory. Tried to allocate 144.00 MiB. GPU

Mithadon commented on August 30, 2024

Can you check if this is fixed on the latest?

Yes! It works as it should now. I'm swapping LoRAs while using fp16 and it's going quickly. Very happy about this; thank you very much for figuring it out.

ssube commented on August 30, 2024

This is still broken for me on the latest commit, currently bb222ceddb232aafafa99cd4dec38b3719c29d7d, with torch 2.3.0 and CUDA 12.4. Tested with and without --disable-cuda-malloc.

Depending on the sampler, I get one of these errors:

Euler:

Loading 1 new model
loaded partially 22473.3650390625 22448.039184570312 0
  0%|                                                                                                                                                                                                | 0/45 [00:00<?, ?it/s]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU 
  2%|████                                                                                                                                                                                    | 1/45 [00:01<01:03,  1.44s/it]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU 
  4%|████████▏                                                                                                                                                                               | 2/45 [00:02<01:01,  1.42s/it]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU 
  7%|████████████▎                                                                                                                                                                           | 3/45 [00:04<00:59,  1.42s/it]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU 
  9%|████████████████▎                                                                                                                                                                       | 4/45 [00:05<00:58,  1.42s/it]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU 
 11%|████████████████████▍     

The size of the allocation varies, but the block number seems to be consistent:

Loading 1 new model                                                                                                                                                                                                         
loaded partially 22471.3650390625 22448.039184570312 0                                                                                                                                                                      
  0%|                                                                                                                                                                                                | 0/45 [00:00<?, ?it/s]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 90.00 MiB. GPU                                                                                                             
  2%|████                                                                                                                                                                                    | 1/45 [00:01<01:05,  1.50s/it]
ERROR lora diffusion_model.single_blocks.37.linear2.weight CUDA out of memory. Tried to allocate 180.00 MiB. GPU                                                                                                            
  4%|████████▏   

UniPC:

Loading 1 new model
loaded partially 22505.3650390625 22484.027465820312 2
  4%|████████▏                                                                                                                                                                               | 2/45 [00:02<01:01,  1.43s/it]
!!! Exception during processing !!! cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to 
try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library
Traceback (most recent call last):
  File "/home/ssube/ComfyUI/execution.py", line 316, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/home/ssube/ComfyUI/execution.py", line 191, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/home/ssube/ComfyUI/execution.py", line 168, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/ssube/ComfyUI/execution.py", line 157, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "/home/ssube/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
  File "/home/ssube/ComfyUI/comfy/samplers.py", line 716, in sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/ssube/ComfyUI/comfy/samplers.py", line 695, in inner_sample
    samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
  File "/home/ssube/ComfyUI/comfy/samplers.py", line 600, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
  File "/home/ssube/ComfyUI/comfy/extra_samplers/uni_pc.py", line 870, in sample_unipc
    x = uni_pc.sample(noise, timesteps=timesteps, skip_type="time_uniform", method="multistep", order=order, lower_order_final=True, callback=callback, disable_pbar=disable)
  File "/home/ssube/ComfyUI/comfy/extra_samplers/uni_pc.py", line 724, in sample
    x, model_x = self.multistep_uni_pc_update(x, model_prev_list, t_prev_list, vec_t, init_order, use_corrector=True)

Euler runs to completion; UniPC consistently fails on the second step. The LoRA style does apply, and the output image matches one produced on a higher-memory card.

I get 100% VRAM usage even with a 48GB card, but that run does not log a "loaded partially" message and works with any sampler.
