
Comments (9)

sniklaus commented on June 30, 2024

Oh wow, what a nightmare. I am happy you found the culprit though. I am hence closing this issue for now, since I am under the impression that it wasn't an issue with the code in the end. Please feel free to reopen the issue in case I am mistaken. And thanks for keeping us updated throughout all of this!


sniklaus commented on June 30, 2024

And sorry for forgetting to answer your questions.

Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput or is it specified somewhere else?

It means that kernel_Correlation_updateOutput is launched with 32 threads per block (one warp's worth), and that value is originally defined here: https://github.com/lmb-freiburg/flownet2/blob/b92e198b56b0e52e1ba0a5a98dc0e39fa5ae70cc/src/caffe/layers/correlation_layer.cu#L17
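
In case it helps, here is a minimal toy sketch of how the block argument maps to threadIdx.x (a throwaway kernel, not the actual correlation kernel):

import cupy

# block=(32, 1, 1) sets blockDim.x = 32, so threadIdx.x runs from 0 to 31 in every block,
# exactly as in the kernel_Correlation_updateOutput launch.
kernelShowThread = cupy.RawKernel(r'''
extern "C" __global__ void show_thread(int* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
}
''', 'show_thread')

out = cupy.zeros(64, dtype=cupy.int32)
kernelShowThread((2, 1, 1), (32, 1, 1), (out,))  # grid of 2 blocks, 32 threads per block
print(out)  # 0..31, repeated once per block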


Etienne66 commented on June 30, 2024

I can't help but wonder if the 196 channels in self.netSix are part of the cause of my memory error, since the correlation in kernel_Correlation_updateOutput steps through the channels 32 at a time. It should stop at 160, but it is stopping at 192 instead, which would indicate that it is treating the input as if it had 224 channels while reducing it to 81. I can't imagine what data it is using in those extra channels, but it makes me wonder if that is why the original paper indicated they were having a lot of edge problems in their model. I'm just guessing though. I know even less about C at this point than Python, but I am learning.
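
To make the arithmetic I mean concrete, here is a toy sketch of the channel offsets a step of 32 produces (plain Python, not the actual kernel):

channels = 196
step = 32

offsets = list(range(0, channels, step))
print(offsets)  # [0, 32, 64, 96, 128, 160, 192]
# With 192 channels the last group would start at 160; with 196 it starts at 192,
# and a full group of 32 starting there would span offsets 192 through 223,
# which is where the figure of 224 channels above comes from.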

I started training from scratch with a corrected self.netSix, but I'm training to deblur rather than training the flow directly. I should probably go back and train the pytorch-pwc model by itself instead, but I'm curious whether this will work.


shengjie-lin commented on June 30, 2024

For me, I always get this error when I am using any GPU other than gpu:0. I tried my best to make sure everything is on the same GPU device, but the error won't go away, so I ended up mapping whichever device is available to gpu:0 when running Docker.
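
In case it is useful to anyone else, the same remapping can be done outside Docker with CUDA_VISIBLE_DEVICES (a sketch assuming PyTorch; the device index is just an example):

import os

# Expose only the physical GPU you want; inside the process it then shows up as cuda:0.
# This has to be set before anything initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

print(torch.cuda.device_count())    # 1
print(torch.cuda.current_device())  # 0 -> the remapped device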


Etienne66 commented on June 30, 2024

I only have one GPU, so it wasn't because of that. It is because of a memory leak caused by having the channel count set to 196 instead of 192 while stepping through 32 channels at a time in the CuPy module I mentioned above. I have not had this error anymore. Instead of training from scratch, I am just modifying the model after loading the pretrained weights. I'm still running tests, but I think it is going to outperform the original deblur model I'm working on. I haven't had a single error since I changed this code. If I put it back to 196 and put back the multiply by 20, I will get the error sometime during an epoch. You know CuPy better than I do; take a close look at it. Isn't it reading memory locations that don't have data defined?


Etienne66 commented on June 30, 2024

I think I understand the CuPy code a little better now, and there is no memory leak in regards to having 196 channels. The issue seems to have been purely having part of the model in the return statement. I moved the netRefiner layer out of the return statement, like the following, and the memory errors ceased. My only guess is that PyTorch does not handle layers in the return statement exactly like it does in the rest of the forward block.

@sniklaus, I do have one question about the CuPy code though, and I haven't found an answer on the internet. Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput, or is it specified somewhere else? I assume that threadIdx.x only goes from 0 to 31, depending on which thread is running.

	def forward(self, tenOne, tenTwo):
		tenOne = self.netExtractor(tenOne)
		tenTwo = self.netExtractor(tenTwo)

		objEstimate = self.netSix(tenOne[-1], tenTwo[-1], None)
		objEstimate = self.netFiv(tenOne[-2], tenTwo[-2], objEstimate)
		objEstimate = self.netFou(tenOne[-3], tenTwo[-3], objEstimate)
		objEstimate = self.netThr(tenOne[-4], tenTwo[-4], objEstimate)
		objEstimate = self.netTwo(tenOne[-5], tenTwo[-5], objEstimate)
		objEstimate['tenFeat'] = self.netRefiner(objEstimate['tenFeat'])

		return (objEstimate['tenFlow'] + objEstimate['tenFeat']) * 20.0
	# end
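
For comparison, if I remember the upstream run.py correctly, the original forward applied netRefiner inline in the return statement, roughly like this:

		return (objEstimate['tenFlow'] + self.netRefiner(objEstimate['tenFeat'])) * 20.0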


Etienne66 commented on June 30, 2024

I'm starting to think I have a hardware issue. It was working fine for 135 epochs and had gone through millions of iterations, and now I can't make any more progress because of this exact same error. No code changes.

I probably need a new video card with some water cooling. I was running at 80°C for many months and I bet that has degraded the GPU some. I can't come up with another explanation for why it would be working so well and then stop working at all.


Etienne66 commented on June 30, 2024

I think I have found a resolution. I am using MSI Afterburner to under-clock my NVIDIA GeForce RTX 2060 Super, which has a boost clock of 1680 MHz. I reduced the core clock by 100 MHz. I also set the GPU temperature limit to 76°C, and since the power limit is linked, it was reduced to 88%. It has been running for a few hours without any more of this error.


tsogkas commented on June 30, 2024

First of all, thank you again, @sniklaus, for making this code available. I have a workstation with two Nvidia 1080 Ti cards, and I still get the same error whenever I have this code running on one GPU and try to run a separate experiment on the second GPU. As a side note, I've never encountered this issue when running any other code on both GPUs simultaneously. I think my issue is related to what @StArchon94 posted earlier:

For me, I always get this error when I am using any GPU other than gpu:0. I tried my best to make sure everything is on the same GPU device, but the error won't go away, so I ended up mapping whichever device is available to gpu:0 when running Docker.

This is a very weird issue and it can be quite problematic, since I cannot debug while, say, training a model, which can take a couple of days. @StArchon94, did you end up finding a solution after all?

