Hello, While I attempt to apply torchcompile option for training CLI


Error when using torchcompile option for CLIP training, about mlfoundations/open_clip

rwightman commented on August 24, 2024

@EIFY Did you try forcing non-reentrant checkpointing? We could look at changing the default if that works.

from open_clip.

rwightman commented on August 24, 2024

@kkjh0723 I think it might break with gradient checkpointing. I'm not sure there is a workaround; possibly using non-reentrant mode?


EIFY commented on August 24, 2024

I got the same error when running with both --grad-checkpointing and --torchcompile. Since PyTorch 2.1.0, however, --torchcompile works with --accum-freq > 1, which is the next best option.
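For anyone scripting around this, here is a minimal sketch of a guard that refuses the failing flag combination and switches to gradient accumulation instead. The helper name `resolve_memory_flags` and the `fallback_accum_freq` parameter are hypothetical, not part of open_clip; only the three flag names come from this thread.

```python
import argparse
import warnings


def resolve_memory_flags(args: argparse.Namespace, fallback_accum_freq: int = 2) -> argparse.Namespace:
    """Hypothetical helper: avoid combining --torchcompile with --grad-checkpointing."""
    if args.torchcompile and args.grad_checkpointing:
        warnings.warn(
            "--grad-checkpointing together with --torchcompile hit a TorchDynamo "
            "error in this thread; switching to --accum-freq instead."
        )
        args.grad_checkpointing = False
        # Keep a user-provided accumulation frequency if it is already larger.
        args.accum_freq = max(args.accum_freq, fallback_accum_freq)
    return args


# Stand-in parser that only declares the three flags discussed here.
parser = argparse.ArgumentParser()
parser.add_argument("--torchcompile", action="store_true")
parser.add_argument("--grad-checkpointing", action="store_true")
parser.add_argument("--accum-freq", type=int, default=1)

args = resolve_memory_flags(parser.parse_args(["--torchcompile", "--grad-checkpointing"]))
print(args.grad_checkpointing, args.accum_freq)  # prints: False 2
```

Whether accumulation is an acceptable substitute depends on the batch size and memory budget of the run, so treat the fallback value as a placeholder.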


EIFY commented on August 24, 2024

@rwightman No, I haven't tried that.

In that regard, the good news is that

open_clip/src/open_clip/transformer.py

Lines 320 to 322 in 91923df

    
    if self.grad_checkpointing and not torch.jit.is_scripting():
        # TODO: handle kwargs https://github.com/pytorch/pytorch/issues/79887#issuecomment-1161758372
        x = checkpoint(r, x, None, None, attn_mask)

pytorch/pytorch#79887 is now fixed and we should be able to do e.g.

if self.grad_checkpointing and not torch.jit.is_scripting(): 
    x = checkpoint(r, x, None, None, attn_mask, use_reentrant=False)
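A self-contained sketch of the same idea on a toy model, in case it helps to test the call in isolation. `TinyBlock` and `TinyTransformer` are made-up stand-ins, not open_clip classes. With `use_reentrant=False`, `torch.utils.checkpoint.checkpoint` also accepts keyword arguments, so the positional `None, None` padding is not needed here. This only shows the checkpoint call itself; it says nothing about whether torch.compile accepts it.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class TinyBlock(nn.Module):
    """Stand-in for a residual block (hypothetical, not open_clip's class)."""

    def __init__(self, width: int = 16):
        super().__init__()
        self.fc = nn.Linear(width, width)

    def forward(self, x, attn_mask=None):
        out = torch.relu(self.fc(x))
        if attn_mask is not None:
            out = out * attn_mask
        return x + out


class TinyTransformer(nn.Module):
    def __init__(self, width: int = 16, layers: int = 2):
        super().__init__()
        self.resblocks = nn.ModuleList([TinyBlock(width) for _ in range(layers)])
        self.grad_checkpointing = False

    def forward(self, x, attn_mask=None):
        for r in self.resblocks:
            if self.grad_checkpointing and not torch.jit.is_scripting():
                # Non-reentrant checkpointing forwards keyword arguments to r.
                x = checkpoint(r, x, attn_mask=attn_mask, use_reentrant=False)
            else:
                x = r(x, attn_mask=attn_mask)
        return x


model = TinyTransformer()
model.grad_checkpointing = True
x = torch.randn(4, 16, requires_grad=True)
model(x).sum().backward()  # activations are recomputed during backward
print(x.grad.shape)
```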

The bad news is that, other than that, grad_checkpointing is either delegated to the vision/text trunks without argument support:

open_clip/src/open_clip/model.py

Lines 260 to 263 in 91923df

    
    @torch.jit.ignore
    def set_grad_checkpointing(self, enable=True):
        self.visual.set_grad_checkpointing(enable)
        self.transformer.grad_checkpointing = enable

or not supported at all:

open_clip/src/open_clip/modified_resnet.py

Lines 161 to 164 in 91923df

    
    @torch.jit.ignore
    def set_grad_checkpointing(self, enable=True):
        # FIXME support for non-transformer
        pass

So fairly involved changes would be necessary. I will try doing the easy part and see if it at least gets past that when I get a chance.
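As a rough illustration of what threading the option through the delegation could look like, here is a sketch on toy modules. `Trunk` and `TwoTowerModel` are hypothetical stand-ins, and the extra `use_reentrant` parameter on `set_grad_checkpointing` is an assumption for illustration, not an existing open_clip signature.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Trunk(nn.Module):
    """Stand-in for a vision or text trunk (hypothetical)."""

    def __init__(self, width: int = 8, layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(layers)])
        self.grad_checkpointing = False
        self.use_reentrant = False

    @torch.jit.ignore
    def set_grad_checkpointing(self, enable=True, use_reentrant=False):
        # Store the mode next to the on/off switch so forward() can use it.
        self.grad_checkpointing = enable
        self.use_reentrant = use_reentrant

    def forward(self, x):
        for layer in self.layers:
            if self.grad_checkpointing and not torch.jit.is_scripting():
                x = checkpoint(layer, x, use_reentrant=self.use_reentrant)
            else:
                x = layer(x)
        return x


class TwoTowerModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = Trunk()
        self.text = Trunk()

    @torch.jit.ignore
    def set_grad_checkpointing(self, enable=True, use_reentrant=False):
        # Forward the extra argument instead of dropping it at the top level.
        self.visual.set_grad_checkpointing(enable, use_reentrant=use_reentrant)
        self.text.set_grad_checkpointing(enable, use_reentrant=use_reentrant)


model = TwoTowerModel()
model.set_grad_checkpointing(True, use_reentrant=False)
```

Each real trunk would need the same treatment, which is where the work adds up.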


EIFY commented on August 24, 2024

@rwightman OK so it turned out that use_reentrant=False doesn't help. It still breaks at the same point:

[2023-11-08 12:56:29,383] [0/0] torch._utils_internal: [INFO] CompilationMetrics(frame_key='1', co_name='forward', co_filename='/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py', co_firstlineno=256, cache_size=0, guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, fail_reason="'NNModuleVariable' object has no attribute 'get_name'")
Traceback (most recent call last):
(...)
torch._dynamo.exc.InternalTorchDynamoError: 'NNModuleVariable' object has no attribute 'get_name'

from user code:
   File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py", line 274, in forward
    image_features = dim_scale_img * self.encode_image(image, normalize=self.normalize) if image is not None else None
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py", line 239, in encode_image
    features = self.visual(image)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/transformer.py", line 486, in forward
    x = self.transformer(x)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/transformer.py", line 319, in forward
    x = checkpoint(r, x, None, None, attn_mask, use_reentrant=False)


