
Comments (2)

ajbrock commented on June 16, 2024

So, first off, best practice when training with long epochs is to test ALL of your code before launching an hours-long first epoch. I do this by setting the training loop to run over only the first example or two, then checkpointing and outputting any statistics (for example, if I recall correctly, some of my checkpoint scripts don't play well with Python 3).
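For concreteness, here's a minimal smoke-test sketch of that workflow. The toy model, update rule, and file name are my own stand-ins, not anything from this repo; the point is just to exercise the full train -> log -> checkpoint path on a couple of samples before committing to a multi-hour epoch:

```python
# Hypothetical smoke test: run train/log/checkpoint end-to-end on two samples
# so serialization and logging bugs surface in seconds, not hours.
import pickle
import numpy as np

def train_step(weights, x, y, lr=0.01):
    """Toy linear-regression step standing in for your real update."""
    pred = x @ weights
    grad = x.T @ (pred - y) / len(x)
    weights -= lr * grad                       # in-place parameter update
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 1))
for i in range(2):                             # loop over just a sample or two
    x, y = rng.normal(size=(4, 8)), rng.normal(size=(4, 1))
    print("step", i, "loss:", train_step(weights, x, y))  # exercise logging
with open("smoke_test.ckpt", "wb") as f:       # exercise checkpointing too
    pickle.dump(weights, f)
```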

My modelnet40 training times were significantly lower (a couple of hours per epoch? It's been over a year since I ran them, so I don't recall exactly, and the logs are buried on an external hard drive somewhere), running on a single Maxwell Titan X. If you're working with DICOM data and you have ~256x256x256 volumes, then you should absolutely expect your training time to scale with the dimensionality of your data (48 hours for a single epoch actually sounds mercifully fast). Remember that I run all of my data at 32x32x32; at 256x256x256 a single sample already costs (256/32)^3 = 8^3 = 512 TIMES as much memory and computation, assuming you're using a similar style of network.

Running with a batch size of 1 is also going to slow you down significantly and may mess up batchnorm (in the experiments I run with batch size 1 on a localization/segmentation net I actually don't run into batchnorm issues, but I've heard that some people have issues with BS < 16). If you can run with a batch size closer to 10, you should definitely see significant speedups.
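As a toy illustration of why tiny batches can hurt batchnorm (my own example, not from this repo): the normalization statistics are computed per batch, so at BS=1 the per-batch mean is just a single noisy activation, while at BS=16 it is much more stable:

```python
# Per-batch means of the same fake activations, grouped at BS=1 vs BS=16.
# The spread shrinks roughly like 1/sqrt(batch_size).
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(loc=2.0, scale=1.0, size=(160,))   # fake layer activations
for bs in (1, 16):
    batch_means = acts.reshape(-1, bs).mean(axis=1)  # one mean per "batch"
    print(f"BS={bs:2d}: std of per-batch means = {batch_means.std():.3f}")
```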

Given, however, that memory constraints are likely preventing you from loading a larger batch onto your relatively small card, you might want to consider resampling your data to a more amenable size. If you're trying to look for fine-grained things (say, nodules in a CT scan, which span only a few voxels), then you might consider selecting random crops from your volumes, so that you keep the native resolution without having to process the entire spatial extent.
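A minimal sketch of that random-cropping idea, assuming your volumes are numpy arrays (the crop size and function name here are hypothetical, not from this repo):

```python
# Random 3D crop: keep native resolution, but train on sub-volumes so
# fine structures like nodules aren't destroyed by downsampling.
import numpy as np

def random_crop_3d(volume, crop=(64, 64, 64), rng=None):
    """Return a random sub-volume of shape `crop` from a 3D `volume`."""
    rng = rng or np.random.default_rng()
    starts = [rng.integers(0, dim - c + 1) for dim, c in zip(volume.shape, crop)]
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop))
    return volume[slices]

vol = np.zeros((256, 256, 256), dtype=np.float32)  # e.g. one CT volume
patch = random_crop_3d(vol)
print(patch.shape)  # (64, 64, 64)
```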

As to the number of epochs necessary, it's impossible to say with certainty since it's problem-specific, and it depends on things like the size of your net (number of parameters, depth, optimization difficulty) and the size and complexity of your dataset. It's also more a matter of the number of iterations than the number of epochs: 5 epochs on a 100,000-sample dataset is many more gradient descent steps than 50 epochs on a 1,000-sample dataset! A good rule of thumb is to start with a small number of epochs (say, 30, annealing the learning rate at epochs 15 and 25) and see how that does. Make sure to checkpoint before you anneal your learning rates!
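In sketch form, that rule of thumb looks something like this (the helper functions are stand-ins for your real training loop, and the 10x decay factor is my assumption, not a prescription):

```python
# 30 epochs, anneal the LR at epochs 15 and 25, checkpoint before each anneal
# so you can roll back if the anneal point turns out to be badly chosen.
def train_one_epoch(lr):
    print(f"training with lr={lr:g}")      # stand-in for the real epoch loop

def save_checkpoint(tag):
    print(f"checkpointed: {tag}")          # stand-in for real serialization

lr, anneal_at = 1e-3, {15, 25}
for epoch in range(30):
    if epoch in anneal_at:
        save_checkpoint(f"pre_anneal_epoch{epoch}")  # checkpoint FIRST
        lr *= 0.1                          # assumed 10x decay per anneal
    train_one_epoch(lr)
```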

You could also run the short training a couple of times and do something like cyclic learning rates across experiments (starting from the same net, re-running the experiment, kicking the learning rate back up to its starting value and then annealing it again), and then use a Snapshot Ensemble over the different checkpoints to eke out a few more points of performance. I don't think you'll need the full 250 epochs, but in general, if you can bake your net for closer to that many iterations, you'll tend to be better off. In my experience 300 tends to be "certain, but probably overkill," and if I can spare the time to find some good annealing points I can do it in closer to 100-150 epochs.
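A sketch of what that cyclic schedule could look like, assuming the cosine annealing typically used with Snapshot Ensembles (Huang et al.); the step counts and file names are illustrative only:

```python
# Cosine-cyclic LR: anneal to a minimum, snapshot the weights there,
# then kick the LR back up and repeat. Averaging the snapshots' predictions
# at test time gives the ensemble.
import math

def cyclic_lr(step, steps_per_cycle, lr_max=1e-3, lr_min=1e-5):
    """Cosine-anneal from lr_max to lr_min within each cycle."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

snapshots, steps_per_cycle = [], 1000
for step in range(3 * steps_per_cycle):        # three cycles, three snapshots
    lr = cyclic_lr(step, steps_per_cycle)
    # ... one training step at this lr ...
    if (step + 1) % steps_per_cycle == 0:      # end of cycle = LR minimum
        snapshots.append(f"snapshot_{step + 1}.ckpt")
print(snapshots)
```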


EJShim commented on June 16, 2024

I think I should try a batch size > 16 and see the difference.

I'll also start with cropped and resampled DICOM data.

Thank you as always for your replies.

