Add interrupt handlers · keras · CLOSED · 17 comments

keras-team commented on May 4, 2024

Add interrupt handlers

Comments (17)

halflings commented on May 4, 2024

Any chance this can be reopened? It's fair that programs not ending on SIGINT are annoying, but somebody who adds this callback explicitly would be well aware of that fact, and would know to send SIGINT twice to make the running command stop immediately.

When iterating on a model running in the cloud, I need to run some clean-up code once training ends, and I often realize too late that I chose too many epochs. This would help with those cases.

asafh commented on May 4, 2024

I've made a small Keras callback that listens for a signal (SIGINT by default, using the signal module) and stops training once the current epoch is complete. Would it make sense to open a pull request adding it to callbacks.py?
I think it's a decent solution: it lets you easily interrupt a long training session and still have your subsequent code execute (e.g. save the model, then exit).

Related: would it be safe to call model.save mid-training?
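
The callback idea described above can be sketched without Keras at all; this is a minimal stdlib-only version (the class and method names are illustrative, not the eventual implementation):

```python
import signal

class GracefulStop(object):
    """Hypothetical sketch: latch a SIGINT and let the training loop
    finish the current epoch before stopping."""
    def __init__(self, sig=signal.SIGINT):
        self.requested = False
        self._previous = signal.signal(sig, self._handler)

    def _handler(self, signum, frame):
        # First Ctrl-C: remember the request; the loop checks it at epoch end.
        self.requested = True

    def restore(self, sig=signal.SIGINT):
        # Put the original handler back once training is done.
        signal.signal(sig, self._previous)

# Usage inside a training loop (the epoch count is illustrative):
stopper = GracefulStop()
for epoch in range(100):
    # ... run one epoch ...
    if stopper.requested:
        break
stopper.restore()
```

A second SIGINT while the handler is latched would need the "exit immediately" behaviour discussed above; that part is omitted here.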

fchollet commented on May 4, 2024

I think this could be interesting to have! One issue would be that model saving with pickle takes time and can be impractical for large models. So maybe a more urgent feature would be a saving function that stores weight matrices to hdf5, alongside a serialized version of the model structure, for ultrafast model saving & loading.
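
A sketch of what such fast weight saving could look like with h5py; the function names and file layout here are illustrative, not the eventual Keras format:

```python
import h5py
import numpy as np

def save_weights(path, weights):
    # One dataset per weight array, all written in one pass.
    with h5py.File(path, 'w') as f:
        for i, w in enumerate(weights):
            f.create_dataset('param_%d' % i, data=w)

def load_weights(path):
    with h5py.File(path, 'r') as f:
        return [f['param_%d' % i][()] for i in range(len(f))]

weights = [np.random.rand(10, 5), np.random.rand(5)]
save_weights('/tmp/fast_weights.h5', weights)
restored = load_weights('/tmp/fast_weights.h5')
```

The model structure (sans weights) would be stored separately, which is what the next comments discuss.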

Other potential improvements to the Sequential model would be a better logger (currently logging is handled with if statements in the code, which is not the cleanest thing) and visualization features (plot your loss/accuracy/validation metrics over time, your decaying learning rate, your gradient magnitudes, visualize learned features, and more...).

jfsantos commented on May 4, 2024

Yes, pickling models really takes forever! Since we added HDF5 support by adding h5py as a requirement, I think using an HDF5 file would be the way to go. I am not sure how to serialize the model structure without the actual parameters; do you have any suggestions?

Regarding the logger, the best would be to have a list of extensions which execute some task as soon as a condition is reached (e.g., # of epochs/iterations). Then you could add extensions for plotting, logging, or even stopping the training algorithm in some cases (as in early stopping, for example).

fchollet commented on May 4, 2024

To load a model, we would need:

  • the parameters passed to .compile()
  • the list of layers along with their current parameters
  • the weight matrices of each layer.

Recovering the weights is easy:

weights = []
for l in model.layers:
    weights.append(l.get_weights())

The main issue is serializing the list of layers and their parameters (sans weights). Maybe we could delete the weights then pickle the layers. Or maybe there is a better way. Any ideas?
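
The "delete the weights, then pickle the layers" idea can be sketched with toy stand-in classes (the Dense class below is illustrative, not the real Keras layer):

```python
import copy
import pickle

class Dense(object):
    """Toy stand-in for a layer: configuration plus weights."""
    def __init__(self, input_dim, output_dim):
        self.input_dim, self.output_dim = input_dim, output_dim
        self.weights = [[0.0] * output_dim for _ in range(input_dim)]
    def get_weights(self):
        return self.weights
    def set_weights(self, w):
        self.weights = w

layers = [Dense(4, 3), Dense(3, 2)]

# 1. Pull the weights out...
weights = [l.get_weights() for l in layers]

# 2. ...strip them from a copy of the layers, and pickle only the structure.
skeleton = copy.deepcopy(layers)
for l in skeleton:
    l.weights = None
structure = pickle.dumps(skeleton)

# 3. Reload: unpickle the structure, then restore the weights.
restored = pickle.loads(structure)
for l, w in zip(restored, weights):
    l.set_weights(w)
```

Copying before stripping avoids mutating the live model, which is one way around the "delete the weights" concern.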

jfsantos commented on May 4, 2024

That does make sense. I'll open another issue to discuss the model serialization, so we don't mix subjects :)

jfsantos commented on May 4, 2024

Now that we have model serialization, we can discuss this. I think it would be useful to have not only interrupt handlers, but also functions that run at given times (e.g., after each epoch or iteration). That would enable us to make model snapshots, report test/validation errors, etc. For the snapshots, it would also be interesting to be able to retrieve solver parameters/state after each iteration/epoch, as then we could interrupt and restart training at any time.

Any suggestions on how we can implement this?

fchollet commented on May 4, 2024

Having a configurable/scriptable callback that runs after each epoch and at interrupt signal would be neat. You could use it to:

  • backup the model
  • recover info such as training/validation loss/accuracy, current learning rate, current mean gradient amplitude. This info can then be displayed in a webapp (on localhost or remote), that you could use as an experiment monitoring station. Total visualization is critical to doing good research.

A basic initial version would be to dump everything to a folder (one per experiment) at each epoch. The exact config (what to dump, where) would be passed to the .fit() method as a dictionary.
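
A minimal sketch of that dump-to-folder idea, assuming a config dictionary of the kind described (the keys and folder layout are made up):

```python
import json
import os

def dump_epoch(config, epoch, metrics):
    """Write the configured metrics for one epoch into the experiment's folder."""
    folder = os.path.join(config['root'], config['experiment'])
    os.makedirs(folder, exist_ok=True)
    # Only dump the fields the config asks for.
    payload = {k: metrics[k] for k in config['fields'] if k in metrics}
    with open(os.path.join(folder, 'epoch_%05d.json' % epoch), 'w') as f:
        json.dump(payload, f)

# This dict plays the role of the config that would be passed to .fit().
config = {'root': '/tmp/experiments', 'experiment': 'mnist_mlp',
          'fields': ['loss', 'val_loss']}
dump_epoch(config, 0, {'loss': 0.91, 'val_loss': 1.02, 'lr': 0.01})
```

A monitoring webapp could then simply read the newest file in each experiment folder.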

jfsantos commented on May 4, 2024

Yes, I was thinking of both plotting and saving model/optimizer snapshots. To make it more flexible, we could even pass a list of callback functions/objects to fit instead of fixing the behaviour inside the model object. I'm just not sure what values would be passed to the callback function, since depending on its functionality it will need access to the model instance, the optimizer state, and things calculated during the current epoch (accuracy, validation/test error, etc.).

I am not sure how we can keep the optimizer state (when it is meaningful, e.g. when using momentum). Any ideas?

fchollet commented on May 4, 2024

In the case of SGD and Adam, the optimizer class attribute iterations gives you indirect access to the current learning rate. That's not really optimal though.

I imagine a solution would be to identify for each optimizer which quantities are meaningful to monitor, make these available as class attributes (updated at each iteration) and expose a method get_state() that lists the attributes and their values.
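
The get_state() idea might look like this; the SGD class below is a toy stand-in, not the real Keras optimizer:

```python
class SGD(object):
    """Toy sketch: monitored quantities are plain attributes, updated
    each iteration, and get_state() lists them with their values."""
    def __init__(self, lr=0.01, decay=0.0):
        self.lr = lr
        self.decay = decay
        self.iterations = 0
        self._monitored = ('lr', 'decay', 'iterations')

    def update(self):
        # One training step: count it and decay the effective learning rate.
        self.iterations += 1
        self.lr *= 1.0 / (1.0 + self.decay * self.iterations)

    def get_state(self):
        return {name: getattr(self, name) for name in self._monitored}

opt = SGD(lr=0.1, decay=0.0)
opt.update()
state = opt.get_state()
```

A monitor could call get_state() at each epoch and dump the dictionary alongside the training metrics.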

How would you envision the callback feature?

jfsantos commented on May 4, 2024

Yes, I was doing some tests with SGD with momentum, updating iterations after loading weights, but the problem is that you don't have the current gradients, so you can't compute velocities for the first iteration. This is an issue in many other optimizers. We would have to be able to restart an optimizer by passing starting values for the updates, but I'm not sure what the best way to implement this in Theano is.

For the callbacks, I was thinking of having a list of objects from a Monitor subclass. Each object would have a run method that is called at a given frequency (a number of iterations or epochs). Inside this method, we could store current test/validation metrics, plot stuff, take snapshots, or anything else. To test whether this is a good idea, we could try to implement Progbar like this.
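
A minimal sketch of that Monitor idea (the class names and the maybe_run wiring are illustrative):

```python
class Monitor(object):
    """Hypothetical base class: run() fires every `freq` epochs."""
    def __init__(self, freq=1):
        self.freq = freq

    def maybe_run(self, epoch, logs):
        if epoch % self.freq == 0:
            self.run(epoch, logs)

    def run(self, epoch, logs):
        raise NotImplementedError

class LossLogger(Monitor):
    """Stores the loss whenever it fires; plotting or snapshotting would go here."""
    def __init__(self, freq=1):
        super(LossLogger, self).__init__(freq)
        self.history = []

    def run(self, epoch, logs):
        self.history.append((epoch, logs['loss']))

# Wiring inside a training loop (the losses are made up):
monitors = [LossLogger(freq=2)]
for epoch, loss in enumerate([1.0, 0.8, 0.6, 0.5]):
    for m in monitors:
        m.maybe_run(epoch, {'loss': loss})
```

With freq=2, the logger above fires on epochs 0 and 2 only.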

However, if you think this is too complicated, let's just add the snapshotting stuff (I already implemented this in my personal branch, will make a PR soon) and something to store the per-iteration/per-epoch metrics so we can generate plots (either offline or on the fly, with Bokeh for example).

fchollet commented on May 4, 2024

One clean solution to "save" optimizers would be to modify them heavily to store all moving parts as class attributes, and to allow state snapshotting (as a dictionary of these attributes) and instantiation from a previous state. We would lose some agility by doing so, but anything else would be pretty hacky and possibly unsafe.

For performance reasons it would definitely be preferable to incorporate the monitoring/data dumping into the Progbar (abstracted as a Monitor class), rather than having to pass updates both to the progbar for logging and to the monitor for data dumping... in fact we can get there with only a slight modification of the existing progbar.

As we add more features, the challenge will be the interface: we want to design a way to configure the monitor that is sufficiently powerful, but that isn't heavy or complicated. Large configuration dictionaries have a tendency to quickly get out of hand.

fchollet commented on May 4, 2024

> One clean solution to "save" optimizers would be to modify them heavily to store all moving parts as class attributes, and to allow state snapshotting (as a dictionary of these attributes) and instantiation from a previous state. We would lose some agility by doing so, but anything else would be pretty hacky and possibly unsafe.

Just wondering: at this point, would we have any better solution for monitoring the state of an optimizer? And wouldn't this kill performance, given Theano's memory management model?

jfsantos commented on May 4, 2024

I agree that keeping the moving parts of the optimizers as class attributes would probably kill performance. In Blocks, they simply pickle the optimizer (actually, the whole main loop) but state that this is neither reliable nor portable (e.g. files pickled in Python 2 do not load in Python 3, and if libraries change, the pickled objects will stop working). I think pickling is only a solution as an emergency measure (say you're training on a shared machine and someone shuts it down, or you have limited walltime and underestimated the training time).

I don't have any better idea, so we could do exactly as they do in Blocks: pickle the optimizer as this "emergency escape pod" solution, and rely on saving only the model as something that can be trusted. For most practical cases of stopping and restarting training (for example, using a dataset to pre-train the model and then another dataset to fine-tune it), it's probably OK to restart the optimizer.
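
The "escape pod" approach is just pickling; a sketch, where a plain dict stands in for whatever object holds the optimizer's state:

```python
import pickle

def emergency_snapshot(path, optimizer_state):
    """'Escape pod' save, as in Blocks: fast, but fragile across
    Python and library versions, so don't rely on it long-term."""
    with open(path, 'wb') as f:
        pickle.dump(optimizer_state, f)

def resume(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

# A dict stands in for the optimizer object here.
emergency_snapshot('/tmp/opt_escape.pkl', {'lr': 0.01, 'iterations': 120})
state = resume('/tmp/opt_escape.pkl')
```

The trustworthy path remains saving only the model, as proposed above.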

yogeshg commented on May 4, 2024

I would love to use and contribute to this feature, could you point me to your callback, @asafh ?

asafh commented on May 4, 2024

I made a mistake with the commit on my branch, and a couple more while trying to fix it. The PR, though, doesn't include any of those extra commits and is contained in a single commit.

@yogeshg you can either wait for the PR to be accepted (assuming it will be) or just take the relevant changes from the callbacks.py file in my commit #5679
If you want, you can use this class outside keras (in your own project); just fully qualify the Callback being extended (keras.callbacks.Callback).

swilson314 commented on May 4, 2024

This was an issue for me too. Additionally, I found that importing TensorFlow stops my SIGINT handler from working: https://stackoverflow.com/questions/52798454/import-of-tensorflow-stops-sigint-handler-from-working

I modified the code slightly so the existence of a file can signal an early stop:

import os
import signal
import sys
import time

from keras.callbacks import Callback


class SignalStopping(Callback):
    '''Stop training when an interrupt signal (or other) is received.

    # Arguments
        sig: the signal to listen to. Defaults to signal.SIGINT.
        doubleSignalExits: receiving the signal twice exits the Python
            process instead of waiting for the current epoch to finish.
        verbose: verbosity mode.
        stop_file: if set, the existence of this file (e.g.
            ./path/_StopTraining.txt) also signals an early stop.
        stop_file_delta: minimum number of seconds between checks of
            stop_file, since hitting the file system every epoch is slow.
    '''
    def __init__(self, sig=signal.SIGINT, doubleSignalExits=False, verbose=0,
                 stop_file=None, stop_file_delta=10):
        super(SignalStopping, self).__init__()
        self.signal_received = False
        self.verbose = verbose
        self.doubleSignalExits = doubleSignalExits
        # SBW 2018.10.15 Since ctrl-c trapping isn't working, watch for the
        # existence of a file, e.g. ./path/_StopTraining.txt.
        self.stop_file = stop_file
        self.stop_file_time = time.time()
        self.stop_file_delta = stop_file_delta
        self.stopped_epoch = 0

        def signal_handler(signum, frame):
            if self.signal_received and self.doubleSignalExits:
                if self.verbose > 0:
                    print('')  # new line so we don't print on the status bar
                    print('Received signal to stop %s twice. Exiting..' % signum)
                sys.exit(signum)
            self.signal_received = True
            if self.verbose > 0:
                print('')  # new line so we don't print on the status bar
                print('Received signal to stop: %s' % signum)

        signal.signal(sig, signal_handler)

    def on_epoch_end(self, epoch, logs=None):
        if self.stop_file is not None:
            # Checking the file system is slow inside the training loop,
            # so don't check every epoch.
            delta = time.time() - self.stop_file_time
            if delta > self.stop_file_delta:
                self.stop_file_time += delta
                if os.path.isfile(self.stop_file):
                    self.signal_received = True
        if self.signal_received:
            self.stopped_epoch = epoch
            self.model.stop_training = True

    def on_train_end(self, logs=None):
        if self.stopped_epoch > 0 and self.verbose > 0:
            print('Epoch %05d: stopping due to signal' % self.stopped_epoch)

