yoyodyne's Issues

Multi-layer LSTMs broken

Enabling either --encoder_layers 2 or --decoder_layers 2 will cause runtime crashes during training. All of the following seem to have issues: LSTM, attentive LSTM, pointer-generator, transducer.

Expected hidden[0] size (2, 64, 100), got [1, 64, 100]
RuntimeError: Expected hidden[0] size (1, 40, 100), got [2, 40, 100]

etc. I have tagged this a release blocker.
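For reference, here is a minimal sketch (not yoyodyne's own code) of the shape contract behind these errors: torch.nn.LSTM expects its initial hidden and cell states to have a leading dimension of num_layers * num_directions, so a single-layer initial state has to be expanded (or otherwise re-initialized) before it can be fed to a multi-layer decoder.

import torch
import torch.nn as nn

batch_size, hidden_size, decoder_layers = 64, 100, 2
decoder = nn.LSTM(hidden_size, hidden_size, num_layers=decoder_layers, batch_first=True)
# Suppose the encoder hands us one summary vector per sequence.
summary = torch.zeros(batch_size, hidden_size)
# Shape (1, batch, hidden) triggers "Expected hidden[0] size (2, 64, 100)";
# we need one state per decoder layer instead.
h0 = summary.unsqueeze(0).repeat(decoder_layers, 1, 1).contiguous()
c0 = torch.zeros_like(h0)
output, (hn, cn) = decoder(torch.zeros(batch_size, 1, hidden_size), (h0, c0))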

Unseen symbols during inference

[copied from CUNY-CL/abstractness/issues/134]

Currently, the presence of a symbol not seen during training causes a crash during transducer inference, and I am not sure what happens in the same situation in other architectures. I can imagine two solutions:

  1. One could provide an "unk"-ing script which takes train, dev, and test, and then "unk"s (i.e., replaces with the reserved <UNK> symbol) any symbols present in dev or test but not in train, and also any symbol below a certain threshold (in case one wants to learn embeddings for this unk character).
  2. yoyodyne-predict could insert <UNK> programmatically where needed during inference.

I think I prefer (2): it's one less step for the user, and the unk-embedding idea is a bit exotic for our small-vocabulary setting.
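A minimal sketch of option (2); symbol2id and unk_idx are hypothetical stand-ins for whatever the index actually exposes:

def encode(symbols, symbol2id, unk_idx):
    # Maps symbols to indices, substituting the reserved <UNK> index for
    # anything not seen during training.
    return [symbol2id.get(symbol, unk_idx) for symbol in symbols]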

This is a blocker for a post-beta release candidate.

Decoding twice in validation step

See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/base.py#L219

Previously, I did this so I could first predict with greedy search to get an accuracy, and then predict with teacher forcing to compute a loss that is comparable to how the train loss is computed. Currently, we predict twice with whatever is in the batch from the validation dataloader. IIRC this should have the gold targets in it.

This means that we are making identical validation predictions twice, neither of which is greedy -- they both have access to the gold history and use teacher forcing. I think we probably want to compute validation accuracy without the gold history, for which we would need to pass the batch through with no targets on it.

Then, we can either keep predicting a 2nd time to compute a loss with teacher forcing, or just compute the loss directly from the greedy predictions (and thus only decode once during validation).
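A rough sketch of the proposal (method and attribute names here are assumptions, not the actual yoyodyne API): decode greedily for accuracy, then once more with teacher forcing for a loss that is comparable to the training loss.

def validation_step(self, batch, batch_idx):
    # Greedy decoding: hide the gold targets from the decoder.
    greedy_preds = self(batch.source, target=None)
    val_accuracy = self.accuracy(greedy_preds, batch.target)
    # Teacher-forced pass just for the loss; this second decode could be
    # dropped if we decide to compute the loss from the greedy predictions.
    forced_preds = self(batch.source, target=batch.target)
    val_loss = self.loss_func(forced_preds, batch.target)
    self.log("val_accuracy", val_accuracy)
    self.log("val_loss", val_loss)
    return val_loss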

@kylebgorman thoughts on this?

Informative model logging

Right now we log both the "model" (decoder and default encoder) and the "module" (encoder) lookups here and here respectively. This gives us the following printout, for a pointer-generator with a transformer encoder layer (cf. Singer & Kann 2020):

Model: PointerGeneratorLSTMEncoderDecoder
Model: TransformerEncoder

This is not as informative as it could be. I submit it would be better as something like:

Model: transformer encoder, pointer-generator LSTM decoder

I think this could be done by adding properties to each "model" and "module" class and then concatenating them together like:

util.log_info(f"Model: {encoder.name} encoder, {model.name} decoder")

Thoughts? If this proposal appeals it should be easy for me to implement.
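A sketch of what those properties might look like (class bodies elided; these stand in for the real model and module classes):

class TransformerEncoder:

    @property
    def name(self) -> str:
        return "transformer"

class PointerGeneratorLSTMEncoderDecoder:

    @property
    def name(self) -> str:
        return "pointer-generator LSTM"

The logging call shown above would then just concatenate the two.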

Input/feature symbol overlap

[copied from CUNY-CL/abstractness/issues/127]

If there is any overlap between input symbols and feature symbols, the two are treated the same in several models. The simplest solution would be to use some kind of special string munging to prevent this overlap, perhaps by wrapping the features like f"[{feature}]".
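A minimal sketch of the munging idea, assuming nothing about the index beyond plain strings:

def munge_feature(feature: str) -> str:
    # Wraps a feature symbol so it can never collide with an input symbol.
    return f"[{feature}]"

assert munge_feature("a") != "a"  # the feature "a" no longer clashes with the input "a"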

Alternatively, this might motivate using separate encoders for feature-full LSTM and transformer models.

Issue #7 is spiritually related.

This is a blocker for a post-beta release candidate.

Restarting training

[copied from CUNY-CL/abstractness/issues/102]

The PTL training loop attempts to gracefully shut down in response to a Ctrl+C keyboard interrupt. However, simply re-running the same training command restarts training from scratch, with a new version number. This is in contrast to, e.g., FairSeq, which picks up from the last checkpoint.

One suspects this is unpredictable behavior; restarts ought to be programmatically supported.

This is a blocker for a post-beta release candidate.

Consistent batch size

I have a need (discussed off-thread) to make it so that every batch is the same size. This seems to me to require the following:

  1. datasets know the length of the longest source, features (if present), and target (@property is appropriate here)
  2. there is a flag (call it --pad_to_max or something) which, when enabled, causes source, features, and target to be padded to the appropriate lengths from (1); it just needs to be passed as pad_len to the batches.PaddedTensor constructor to achieve this

I will put this aside until #40 is completed, however, since this may interact in weird ways with that.
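A sketch of (1) and (2) under the naming in the list above (the dataset class shown is a stand-in, not the real one):

class Dataset:
    def __init__(self, sources):
        self.sources = sources

    @property
    def max_source_length(self) -> int:
        # (1): the length of the longest source; features and target would
        # get analogous properties.
        return max(len(source) for source in self.sources)

# (2), roughly: pad_len = dataset.max_source_length if args.pad_to_max else None,
# which is then passed through to the batches.PaddedTensor constructor.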

Validation Accuracy Aggregation

Currently, our validation_step method on the BaseEncoderDecoder computes a per-batch accuracy and aggregates these at the end of each epoch. Because of this, we get a macro-averaged accuracy that depends on the batch size.

I noticed that something must be wrong when, using evaluation sets of size 1000, I got validation accuracies with many decimal places (like 0.9247395...). I think we probably want to accumulate raw counts of correct/incorrect dev samples per batch, and then aggregate those into an accuracy at the end of each epoch.

The impact of this should be small, but still, I believe we are getting accuracies that deviate slightly from the expected micro-averaged accuracy.
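A sketch of that fix (names are illustrative rather than the actual methods): accumulate raw counts per batch and divide once per epoch, giving a proper micro-average that is independent of batch size.

def validation_step(self, batch, batch_idx):
    preds = self(batch)
    correct = int((preds == batch.target).all(dim=1).sum())
    return {"correct": correct, "total": len(batch.target)}

def validation_epoch_end(self, outputs):
    correct = sum(output["correct"] for output in outputs)
    total = sum(output["total"] for output in outputs)
    self.log("val_accuracy", correct / total)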

Sweeping

W&B provides a simple interface for running hyperparameter sweeps. To support this, we need to refactor the train and predict functions to read hyperparameter settings from elsewhere, and we probably also want an alternative command-line launcher to yoyodyne-train.

This is related to #5, as the refactoring there should make this much easier.

Feature indexing

IndexFeatures has an incorrect alignment for feature_idx. See

return len(self.source_map) - self.features_idx

Easy fix, but it impacts #47. Do we want to maintain a separate features_map to reduce the chance of these bugs, or keep it merged with source_map?

PyTorchification of transducer

[copied from CUNY-CL/abstractness/issues/123]

There are a lot of pure Python loops in the transducer implementation, and many can be replaced with PyTorch functions.

Documentation for schedulers

We have three schedulers, each with their own arguments. They probably merit a paragraph or two in the README describing how they're used. I am assigning this to myself.

BERT Pretraining

Do we have any interest in adding a masking function to Yoyodyne to allow BERT-style training with the available models? This could feasibly improve inflection/g2p performance by allowing pretraining. It would also allow use of the library for LM-esque training.

I will remove the suggestion if it goes against the underlying purpose of the library. Just thought I'd ask, since it's related to a potential side project.

Student forcing options/roll-out

After #71, we now can control, for a given training batch, whether teacher or student forcing is used. Some recent work suggests that for sequence-to-sequence models there is an advantage to training with student forcing. Some other work recommends gradually rolling out student forcing during training. I propose that we:

  • experiment with a flag that simply enables student forcing during training and see if things still converge
  • also experiment with a linear, batchwise rollout of student forcing; that is:
    • for each batch, we draw a random sample such that with probability p we use teacher forcing and with probability 1 - p we use student forcing
    • we initialize with p = 1 and after the warmup phase, linearly decrement p so that p = 0 for the last batch

Note that the stochastic option (the second one) is somewhat different from what Bengio et al. do: they do this at the token level. However, this seems harder and slower to implement, so I am suggesting something simpler to start out with.

Both of these can be thought of as hyperparameter-free (beyond the boolean decision of whether or not to use student forcing during training at all). If either works we can incorporate it into the master branch.
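A minimal sketch of the batchwise linear rollout (all names here are hypothetical): draw once per batch whether to teacher-force, with p decaying linearly to 0 after the warmup phase.

import random

def teacher_forcing_p(step: int, warmup_steps: int, total_steps: int) -> float:
    # p = 1 during warmup, then decays linearly to 0 by the last batch.
    if step < warmup_steps:
        return 1.0
    return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))

def use_teacher_forcing(step: int, warmup_steps: int, total_steps: int) -> bool:
    return random.random() < teacher_forcing_p(step, warmup_steps, total_steps)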

Pointer-generator transformers

[copied from CUNY-CL/abstractness/issues/12]

A new architecture: a variant of the pointer-generator using a transformer encoder.

Previously we talked about treating the choice of encoder as its own parameter, independent of other architectural choices, and this might be a good place to implement that.

Transducer transformer

[copied from CUNY-CL/abstractness/issues/120]

A new architecture: a variant of the transducer using a transformer encoder.

Previously we talked about treating the choice of encoder as its own parameter, independent of other architectural choices, and this might be a good place to implement that.

Beam search

Beam search is generally not supported by our models, though the flag exists. It appears to be supported in lstm; it is unclear to me whether it's supported by feature_invariant_transformer, transformer and transducer (if any of these silently ignore the flag, which is what I think happens, this is a serious but easy-to-fix bug), and it's explicitly not implemented in pointer_generator_lstm.

Unnecessary pass over evaluation dataset

[copied from CUNY-CL/abstractness/issues/50]

Currently when the evaluation dataset is instantiated, it makes a pass through the data to compute indices. However, these indices are already known from the training dataset. It should be possible to forestall this unnecessary pass over the evaluation dataset.

Transducer GPU support

Transducer training on GPU raises an error because it encounters a mixed CPU/GPU operation. Sample trace:

Epoch 0:   0%|                                                   | 0/294 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/kbg/.miniconda3/bin/yoyodyne-train", line 8, in <module>
    sys.exit(main())
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/train.py", line 448, in main
    trainer.fit(model, train_loader, eval_loader)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/optim/adadelta.py", line 87, in step
    loss = closure()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
    closure_result = closure()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/base.py", line 115, in training_step
    preds = self(batch)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 86, in forward
    prediction, loss = self.decode(
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 174, in decode
    last_action = self.decode_action_step(
  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 260, in decode_action_step
    end_of_input = (input_length - alignment) <= 1  # 1 -> Last char.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Epoch 0:   0%|          | 0/294 [00:00<?, ?it/s]

This is a blocker for a post-beta release candidate.

hparams

When testing, I noticed two things:

  1. hparams.yaml is continuously rebuilt while training -- maybe this means we reinstantiate the model (to load from a checkpoint or something) several times? This could be OK, but it struck me as weird.
  2. All samples are now in the hparams file, so it is large and gets bigger as the dataset gets bigger. This is via the training dataset property on the model, I believe. I think in the past I added something to ignore certain properties in hparams.yaml, but I don't remember what I did off the top of my head.

I need to look into both at some point. Putting this issue here as a placeholder for now.
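For (2), one possibly relevant mechanism (an assumption about what was done before, not a confirmed fix): Lightning's save_hyperparameters accepts an ignore list, which keeps bulky attributes like the training dataset out of hparams.yaml.

import pytorch_lightning as pl

class MyModel(pl.LightningModule):  # stand-in for the real model class
    def __init__(self, train_set, hidden_size=128):
        super().__init__()
        self.train_set = train_set
        # hidden_size lands in hparams.yaml; train_set does not.
        self.save_hyperparameters(ignore=["train_set"])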

Default Scripts

More a reminder than a fix: we should keep up-to-date scripts for the model architectures for quick testing and runs.

Benchmarking

We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks for this.

Here is a list of shared tasks (and related papers) from which we can pull data:

The benchmark itself consists of two tables.

  • A "KPI" table, per dataset/language. E.g., "SIGMORPHON 2021 g2p Bulgarian".
  • A "study" table, per dataset/language/architecture. E.g., "Transducer ensemble on SIGMORPHON 2021 g2p Bulgarian".

A single script should compute all KPI statistics and dump them out as a TSV. This table should include:

  • training set size
  • dev set size
  • test set size
  • average input string length
  • average output string length
  • whether it has features

While one could imagine a single script which performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense for there to be multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can be used to aggregate the non-ragged portions of all the JSON study reports into a single TSV. This table should include:

  • dataset
  • language
  • model type
  • GPU models (e.g., torch.cuda.get_device_name(number))
  • wall clock time during training
  • wall clock time during inference
  • development accuracy of best model
  • test accuracy of best model
  • hyperparameters for best model (this is the ragged part)
  • model size, in KB
  • model size, in # of trainable parameters

Then, a separate script is used to aggregate the non-ragged portions of the extant study observations.

Studies should include:

  • the worst of 5 randomly initialized models
  • the best of 5 randomly initialized models
  • the median of 5 randomly initialized models
  • a voting ensemble of 5 randomly initialized models
  • possibly: heterogeneous ensembles of different architectures

Putting this all together should make it easy for us to win relevant shared tasks. ;)

This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.

Feature encoding enhancements

[This continues the discussion in #12.]

Both the transducer and the pointer-generator treat features in architecture-specific ways; issue #12 deals with their ideal treatment in the transducer, since the Makarov/Clematide mechanism of treating them as one-hot embeddings appears to be ineffectual. In contrast, they are just concatenated by the LSTM and transformer models. My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated. To be more explicit, imagine the source tensor is of size batch_size x hidden_size x source_length and the feature tensor is of size batch_size x hidden_size x feature_length. Then I propose that we concatenate these to form a tensor of size batch_size x hidden_size x (source_length + feature_length), and attention operates over that larger tensor.
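A minimal sketch of that concatenation, using the batch_size x hidden_size x length layout described above:

import torch

batch_size, hidden_size, source_length, feature_length = 8, 64, 10, 4
source_encoded = torch.randn(batch_size, hidden_size, source_length)
features_encoded = torch.randn(batch_size, hidden_size, feature_length)
# Concatenate along the length dimension; attention then operates over this.
combined = torch.cat([source_encoded, features_encoded], dim=2)
assert combined.shape == (batch_size, hidden_size, source_length + feature_length)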

As a result, we can use the features column to do multi-source translation in general. Furthermore, the dataset getter is no longer conditioned on architecture: it just has feature and no-feature variants, which makes things a bit simpler.

(Note that this will not work for the inattentive LSTM; something else will have to be done or we can just dump it.)

A distant enhancement is that it would be possible, in theory, to have different encoders for source and feature (LSTM vs. GRU vs. transformer); an even more distant enhancement would be to allow these to have different dimensionalities and use linear projection to map the feature encoding back onto the source encoding. I do not actually think we should do either of these, but it's a thing we could do...

Transducer feature embedding

[copied from CUNY-CL/abstractness/issues/96]

Transducers make it somewhat difficult to condition generation on features. Makarov & Clematide use a one-hot embedding of features, concatenated itemwise to the input string embeddings, but @bonham79's pilot results suggest this is ineffectual. We can imagine at least a couple other ways to do it:

  1. Features could be embedded, combined via mean, and then used to initialize the decoder state.
  2. Features could be embedded, combined via mean, and then concatenated itemwise to the input string embeddings.
  3. Features could be encoded and this encoding used to initialize the decoder state.
  4. Features could be encoded, and then concatenated itemwise to the input string embeddings.
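As an illustration, a sketch of option (2) with made-up shapes and names (this is not the transducer's actual code):

import torch
import torch.nn as nn

num_features, feat_dim = 30, 32
feature_embedding = nn.Embedding(num_features, feat_dim)

def add_feature_context(source_embedded, feature_ids):
    # source_embedded: batch x source_length x emb_dim
    # feature_ids: batch x num_active_features
    feat = feature_embedding(feature_ids).mean(dim=1)  # batch x feat_dim
    feat = feat.unsqueeze(1).expand(-1, source_embedded.size(1), -1)
    # batch x source_length x (emb_dim + feat_dim)
    return torch.cat([source_embedded, feat], dim=2)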

Documentation on GPU/CUDA use

[copied from CUNY-CL/abstractness/issues/22]

The documentation should probably explain --gpu/--no-gpu and also link to some information about getting CUDA working.

Documentation for accumulating multiple batches

PTL allows us to simulate batch sizes larger than what will fit on your accelerator by accumulating gradients from multiple ("mini") batches. For instance, to simulate an effective batch size of 8192 with smaller batches of 1024, one could do --batch_size 1024 --accumulate_grad_batches 8. Alternatively, if the batch size is not known at runtime (e.g., because one is doing hyperparameter optimization), one can do --max_batch_size 1024 and anything larger than that will be simulated like so.

We should document this briefly.

Tied vocabulary flag: no-op?

Correct me if I'm wrong, but: the --tied_vocabulary flag has no effect except in the construction of the index (and this has no downstream impact).

It isn't "tied" in the stronger sense that source and target symbols share an embedding, so that a source "a" and a target "a" receive the same representation. Is that what was intended?

If this is correct, I propose we either:

  • remove the flag, so as not to confuse people
  • or, cause it to have some effect.

Torch and Lightning 2.0.0

The library does not work with Lightning >= 2.0.0 (and one suspects that Torch 2.0.0 itself is also an issue). The first issue I encounter when running yoyodyne-train with no arguments is related to a change in how Lightning command-line arguments are handled---I suspect there are at least a few more.

So that the library is not broken at head---which I consider unacceptable---I have pinned as follows:

pytorch-lightning>=1.7.0,<2.0.0
torch>=1.11.0,<2.0.0

What we need to do is migrate to 2.0.0, fixing the Lightning (and, if any, Torch) bugs until things work, and then re-pin these two dependencies to >=2.0.0. I have initially assigned this to myself, but I would welcome help.

Switch to CrossEntropyLoss?

Right now we apply LogSoftmax and then NLLLoss, with a complicated custom version of the latter when applying label smoothing.

However, CrossEntropyLoss supposedly merges these two steps, and it also supports built-in label smoothing. Moving to it, I suspect, would give us a small speedup. My first attempt to do this was not successful, however: loss plateaued quickly at zero accuracy.

Note that the transducer also has special-casing here.
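For reference, a sketch of what the switch might look like (the ignore_index for padding is an assumption): CrossEntropyLoss consumes raw logits, so the LogSoftmax call goes away, and label smoothing becomes a constructor argument. One common pitfall when making this switch is feeding it log-probabilities where it expects raw logits.

import torch.nn as nn

# Before (roughly): loss = NLLLoss()(self.log_softmax(logits), target)
loss_func = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)
# After: loss = loss_func(logits, target)  # raw logits, no LogSoftmax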

@Adamits for discussion.

Updates and Documentation for WandB sweeps

Based on other work we're doing, we should add some documentation and make necessary tweaks for running a W&B sweep with this codebase.

  • Add documentation and examples of running WandB sweeps with Yoyodyne.
  • Make updates to codebase so PTL and WandB play nice wrt logging hyperparameters, etc.
  • Update PTL to log max validation accuracy.

train() method in train.py has incorrect args

Little bug we missed in, I assume, #110.

We now pass the datamodule when calling the train method, but it still expects a train and dev loader.

Without checking the docs, I suspect this works because PTL checks the type of train_loader, which is actually a DataModule with both the train and dev loaders, and uses it. Then, dev_loader in our method gets a string or None intended for the train_from arg, and since I have not tested this with train_from, everything was working as expected.

Easy fix that I can make today.

Number of layers have to match

I believe both transformers and LSTM layers require that the number of encoder and decoder layers match. However, they're specified as separate arguments and flags. This makes it hard to do hyperparameter search including them; if one does something like

parameters:
  encoder_layers:
    values: [1, 2]
  decoder_layers:
    values: [1, 2]

in a hyperparameter sweep, half the draws will fail at initialization because the two values will differ.

I propose that we just standardize this as --layers. Thoughts?
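A sketch of the proposed flag (argument names are illustrative): one value populates both layer counts.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--layers",
    type=int,
    default=1,
    help="Number of layers, shared by the encoder and decoder",
)
args = parser.parse_args()
encoder_layers = decoder_layers = args.layers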

Outputting to the current directory

In yoyodyne-predict, the following will fail: --output path_to_current_directory_file.txt. Do you know why? It's because it tries to create the output directory before writing. The simple solution, which I will now implement, is to not call os.makedirs when writing to the PWD. (We can just test for an empty string.)
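A minimal sketch of that fix (the function name is hypothetical): only create the directory when the output path actually names one, since os.path.dirname returns an empty string for a bare filename in the PWD.

import os

def make_output_dir(output: str) -> None:
    dirname = os.path.dirname(output)
    if dirname:  # empty string means the PWD; nothing to create.
        os.makedirs(dirname, exist_ok=True)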

Reported by @liliest

Defaults module

[Spun-off from discussion in #51.]

Create a defaults.py that defines defaults as constants. These are then used in train.py and predict.py, and in some cases, in the architecture modules themselves.

Avoid --attention flag

One extra complexity for model class lookup is that we treat attention (but only for LSTMs) as a separate flag rather than as a separate architecture. This isn't mirrored in the backend: they are separate model classes.

I propose to call the vanilla LSTM --arch lstm, the attentive LSTM --arch attentive_lstm. Documentation and defaults will also need to be updated, as will tests. Self-assigning.

Index written too late

Currently one needs the index and a checkpoint to predict. If you Ctrl+C during training, graceful shutdown writes out the last checkpoint, but the index is not written until training is complete. You face the same problem if you want to do predictions with an intermediate model while training is still going. The only workaround is to train for zero epochs, just to get an index file.

The solution it seems is to write out the index earlier. This could either be done before training kicks off, or more fancifully, with some kind of callback.
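A sketch of the callback variant (the index's write method and the class name are assumptions): hooking on_fit_start writes the index exactly once, before any training happens.

import pytorch_lightning as pl

class WriteIndexCallback(pl.Callback):
    def __init__(self, index, path):
        self.index = index
        self.path = path

    def on_fit_start(self, trainer, pl_module):
        self.index.write(self.path)  # assumes the index object can serialize itself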

Reported by @liliest.

Pointer-generator does not work on CPU

With --accelerator cpu (or not specifying since that's the default):

  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/pointer_generator.py", line 142, in decode_step
    attention_probs.scatter_add_(
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype

One of the argument tensors, I suppose, is on CPU and the other one is...IDK where. @Adamits, at your leisure: any insights here?

Linear decay scheduler

We only have one decaying scheduler right now: the goofy warmup + inverse square root decay strategy from the original transformer paper. I propose that we also add a linear decay scheduler to schedulers.py.

@Adamits I think this is in your wheelhouse so I have assigned it to you; let me know if that's a problem.

Sequence length errors

This scary error sometimes greets you if you are using the transformer:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [398,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

I believe this is what happens when you have inputs that exceed --max_sequence_length (increasing it to a huge number makes it go away). What should we do about this? A few possibilities:

  1. Cowardly refuse to pad a source sequence batch longer than --max_sequence_length by throwing an exception. This would have the interesting effect that it would no longer apply just to the transformer (which actually might be desirable---certainly it's clearer).
  2. Document in the README that this scary error means you need to increase --max_sequence_length.

I think I prefer (1). What do you think?
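A rough sketch of option (1) (attribute and method names are assumptions): check the batch's source length before padding and refuse loudly instead of letting the CUDA assert fire.

def check_length(self, source_length: int) -> None:
    if source_length > self.max_sequence_length:
        raise ValueError(
            f"Source length {source_length} exceeds --max_sequence_length "
            f"({self.max_sequence_length}); increase the flag or shorten the input"
        )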

Relatedly, it occurs that the following might be clearer:

  • --max_sequence_length > --max_source_length; help docs should also mention that this is for transformers only, unless we adopt (1) in which case it becomes uniform; we should also move this into base.py
  • --max_decode_length > --max_target_length

Feature-invariant LSTM

Here's the pitch: use the feature-invariance notion on an LSTM with features. @bonham79 thinks this is an easy fix.

One question is whether we want to add a new --arch flag or just add --feature_invariant.

Add option to concatenate features for pointer generator

Though our pointer-generator implementation(s) take care to encode features separately, so that they are not used in the attention distribution for the pointer probabilities, I think it is worth making it easy to just treat features as additional input symbols along with the lemmas.

That is, to concatenate the features with the input just like we do for the 'vanilla' seq2seq models. This is just for comparison, as sometimes these models learn such things on their own without much intervention.

Use specialized prediction?

Currently, our prediction code does some relatively hairy stuff that is also architecture sensitive. Lightning permits one to define a predict_step method. If the user does not specify this, prediction is just the same as calling forward.

It seems to me from this that we could likely move some of that code (the squeeze, transpose, and max) into predict_step for the appropriate modules, making it so that there's less Torchyness, and so we don't need to pass around or switch on the arch string.

ByT5 encoder

As a postlude to #72, I propose we make it possible to use ByT5 as the source (e.g., --source_encoder_arch byt5_base) and/or feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.

This should become much easier to do upon completion of #72---we'd just implement a new encoder ByT5 in yoyodyne/models/modules/byt5.py. In the constructor, you'd use the transformers.T5Encoder.from_pretrained class method to instantiate an encoder; there are four sizes (small, base, large, xl) and we could just add all four.

I don't think I'd go about adding access to just any HuggingFace encoder though, as their tokenizers will be incompatible. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.

The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add byte as a special-case separator option.

Here are some early notes on how to do this.
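A rough sketch of what the new module's constructor might do (the surrounding class is hypothetical; in recent versions of the transformers library the encoder-only model is exposed as T5EncoderModel, and the checkpoints are named google/byt5-small, google/byt5-base, etc.):

from transformers import T5EncoderModel

class ByT5Encoder:  # stand-in for the eventual yoyodyne module
    def __init__(self, size: str = "small"):
        self.model = T5EncoderModel.from_pretrained(f"google/byt5-{size}")

    def __call__(self, input_ids, attention_mask=None):
        # Returns the contextual byte representations for downstream decoding.
        return self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state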

Silencing irrelevant warnings

[copied from CUNY-CL/abstractness/issues/101]

I see the following (mostly spurious) warnings on various runs:

  • PossibleUserWarning about number of data loaders
  • UserWarning: deprecation of nn.functional.sigmoid
  • UserWarning: dropout expects num_layers > 1
  • UserWarning: trying to infer the batch_size from an ambiguous collection
  • PossibleUserWarning: max_epochs was not set.

Users tend to treat warnings as errors, so we may want to suppress these.
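A possible starting point (the message patterns below are approximations of the warning texts above, so they would need to be checked): suppress the specific messages rather than whole warning categories, so that genuinely new warnings still surface.

import warnings

warnings.filterwarnings("ignore", message=".*does not have many workers.*")
warnings.filterwarnings("ignore", message=".*Trying to infer the `batch_size`.*")
warnings.filterwarnings("ignore", message=".*`max_epochs` was not set.*")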

Multiprocessing compatibility

get_loss_function in base uses nested function definitions. These are not compatible with multiprocessing, so this needs to be adjusted. Adding an issue here as a mental note.
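A sketch of the adjustment (names here are illustrative, not the actual yoyodyne code): closures defined inside get_loss_function cannot be pickled for multiprocessing, but a module-level function wrapped with functools.partial can be.

import functools

import torch.nn.functional as F

def _smoothed_nll_loss(log_probs, target, smoothing):
    # Plain NLL mixed with a uniform term; a stand-in for the real smoothing logic.
    nll = F.nll_loss(log_probs, target)
    uniform = -log_probs.mean(dim=-1).mean()
    return (1.0 - smoothing) * nll + smoothing * uniform

def get_loss_function(smoothing: float = 0.0):
    return functools.partial(_smoothed_nll_loss, smoothing=smoothing)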

W&B config warnings when training a sweep agent

Currently, when training a sweep agent, we need to start the agent within a hyperparameter config, which automatically logs all of the hyperparameters to W&B. We also need to initialize the pytorch-lightning WandbLogger, which under the hood attempts to log the config again. See here for details: wandb/wandb#2641.

Ideally, we would solve this by not updating twice, however, this may be out of our control as we rely on wandb and pytorch-lightning for those behaviors.

A work-around is to suppress the W&B warning message. So far, warnings.filterwarnings does not work -- but we should investigate this more.

Pointer-generator behavior with features

Reporting some funky behavior with the pointer-generator with features:

  1. With --arch pointer_generator_lstm and features enabled (i.e., --features_col 3 or something other than the default 0), I get the following report:
Model: pointer-generator
Encoder: LSTM
Decoder: attentive LSTM
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                   | Type                  | Params
-----------------------------------------------------------------
0 | loss_func              | NLLLoss               | 0     
1 | dropout_layer          | Dropout               | 0     
2 | source_encoder         | LSTMEncoder           | 333 K 
3 | decoder                | LSTMAttentiveDecoder  | 283 K 
4 | classifier             | Linear                | 12.0 K
5 | log_softmax            | LogSoftmax            | 0     
6 | generation_probability | GenerationProbability | 601   
-----------------------------------------------------------------

Since it doesn't list a feature encoder in either place, this makes me think it has ignored features even though I have explicitly requested them.

  2. With --arch pointer_generator_lstm --source_encoder lstm --features_encoder lstm (which should be the same thing) we get a crash:
  File "/home/kbg/.miniconda3/lib/python3.10/site-packages/yoyodyne/models/pointer_generator.py", line 370, in forward
    return predictions
UnboundLocalError: local variable 'predictions' referenced before assignment

The reason for this should be clear from the code: the branch that begins at line 345 doesn't define predictions.

  3. --arch pointer_generator_lstm --source_encoder lstm --features_encoder linear works, though after hill climbing for a while both losses go nan (e.g., on our Polish data).

  4. Same story as (3) with --arch pointer_generator_lstm --source_encoder transformer --features_encoder linear.

I am assigning this to @bonham79; I think the fix will be quite small.

Testing

[copied from CUNY-CL/abstractness/issues/87]

We should add integration tests (I hesitate to call these unit tests), simply limiting ourselves to the model sizes and data quantities we can run on CircleCI's free tier. We get 6,000 compute-minutes per month; all of this is pretty generous, except that I am unclear whether we can use their GPU images or are stuck on CPU (ideally we'd parameterize tests on both). I think it ought to be possible to do actual training of the major models using, say, 1,000 examples. Tests could include g2p (for feature-less) and inflection (for feature-full) from SIGMORPHON.

The current training and prediction functions are structured to read and write directly to the file system. They should be modularized to take ordinary arguments and return the results:

  • for training, a function could simply return the best model (or its path) with metadata (wall clock time, training accuracy, development accuracy), and then the command-line enabled version of that loop could invoke this
  • for prediction, a function could simply return the accuracy.

These functions can then be called by the existing (null return type) training and prediction functions, the ones parameterized with click flags.

This will also support two other projects (issues coming soon):

  • benchmarking
  • W&B-enabled hyperparameter sweeping

This is a blocker for a post-beta release candidate.

Python interface for loading

Previously, I recall that we had methods like get_trainer and get_model to go along with train.get_trainer_from_argparse_args and train.get_model_from_argparse_args.

These seem to have disappeared. What is the rationale behind that? Was this a mistake?
