cuny-cl / yoyodyne
Small-vocabulary sequence-to-sequence generation with optional feature conditioning
License: Apache License 2.0
Enabling either --encoder_layers 2 or --decoder_layers 2 will cause runtime crashes during training. All of the following seem to have issues: LSTM, attentive LSTM, pointer-generator, transducer.
Expected hidden[0] size (2, 64, 100), got [1, 64, 100]
RuntimeError: Expected hidden[0] size (1, 40, 100), got [2, 40, 100]
etc. I have tagged this a release blocker.
[copied from CUNY-CL/abstractness/issues/134]
Currently, the presence of a symbol not seen during training causes a crash during transducer inference, and I am not sure what happens in the same situation in other architectures. I can imagine two solutions:
(1) A preprocessing step could replace (with an <UNK> symbol) any symbols present in dev or test but not in train, and also any symbol below a certain threshold (in case one wants to learn embeddings for this unk character).
(2) yoyodyne-predict could insert <UNK> programmatically where needed during inference.
I think I prefer (2): it's one less step for the user, and the unk embedding thing is a bit exotic for our small-vocabulary setting.
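For concreteness, here is a rough sketch of what inserting <UNK> at inference time could look like. The `UNK` constant, the `build_vocabulary` and `encode` helpers, and the plain dictionary vocabulary are all assumptions made for illustration; they are not the library's actual index API.

```python
from typing import Dict, List

UNK = "<UNK>"  # Hypothetical spelling of the unknown-symbol token.


def build_vocabulary(train_strings: List[str]) -> Dict[str, int]:
    """Maps UNK and every symbol seen in training to an integer index."""
    symbols = sorted({symbol for string in train_strings for symbol in string})
    return {symbol: i for i, symbol in enumerate([UNK] + symbols)}


def encode(string: str, vocabulary: Dict[str, int]) -> List[int]:
    """Encodes a string, substituting UNK for symbols unseen in training."""
    unk_index = vocabulary[UNK]
    return [vocabulary.get(symbol, unk_index) for symbol in string]


vocabulary = build_vocabulary(["abc", "cab"])
# "z" never occurred in training, so it maps to the UNK index instead of
# raising a KeyError at inference time.
encoded = encode("abz", vocabulary)
print(encoded)
```

Note that with this approach the UNK embedding is never updated during training unless some training symbols are also mapped to it, which is the trade-off mentioned under (1).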
This is a blocker for a post-beta release candidate.
See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/base.py#L219
Previously, I did this so I could first predict with greedy search to get an accuracy, and then predict with teacher forcing to compute a loss that is comparable to how the train loss is computed. Currently, we predict twice with whatever is in the batch from the validation dataloader. IIRC this should have the gold targets in it.
This means that we are making identical validation predictions twice, neither of which is greedy -- they both have access to the gold history and use teacher forcing. I think we probably want to compute validation accuracy without the gold history, for which we would need to pass the batch through with no targets on it.
Then, we can either keep predicting a 2nd time to compute a loss with teacher forcing, or just compute the loss directly from the greedy predictions (and thus only decode once during validation).
@kylebgorman thoughts on this?
Right now we log both the "model" (decoder and default encoder) and the "module" (encoder) lookups here and here respectively. This gives us the following printout, for a pointer-generator with a transformer encoder layer (cf. Singer & Kann 2020):
Model: PointerGeneratorLSTMEncoderDecoder
Model: TransformerEncoder
This is not as informative as it could be. I submit it would be better as something like:
Model: transformer encoder, pointer-generator LSTM decoder
I think this could be done by adding properties to each "model" and "module" class and then concatenating them together like:
util.log_info(f"Model: {encoder.name} encoder, {model.name} decoder")
Thoughts? If this proposal appeals it should be easy for me to implement.
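A minimal sketch of the idea, using stand-in classes rather than the real ones. The class bodies, the `describe` helper, and the exact strings returned by each `name` property are assumptions for illustration only.

```python
class TransformerEncoder:
    """Stand-in for an encoder module; only the name property matters here."""

    @property
    def name(self) -> str:
        return "transformer"


class PointerGeneratorLSTMEncoderDecoder:
    """Stand-in for a model class."""

    @property
    def name(self) -> str:
        return "pointer-generator LSTM"


def describe(encoder, model) -> str:
    """Builds the log line from the two name properties."""
    return f"Model: {encoder.name} encoder, {model.name} decoder"


message = describe(TransformerEncoder(), PointerGeneratorLSTMEncoderDecoder())
print(message)
```

Keeping the human-readable name on the class itself means the logging code does not need to know about any particular architecture.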
[copied from CUNY-CL/abstractness/issues/127]
If there is any overlap between input symbols and feature symbols, the two are treated the same in several models. The simplest solution would be to use some kind of special string munging to prevent this overlap, perhaps by wrapping the features like f"[{feature}]".
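The munging could be as small as the following sketch. The `wrap_features` helper is hypothetical, and the sketch assumes that no source symbol already has the form "[...]".

```python
from typing import List


def wrap_features(features: List[str]) -> List[str]:
    """Wraps each feature in square brackets to keep it apart from source symbols."""
    return [f"[{feature}]" for feature in features]


source_symbols = ["V", "a", "l"]
feature_symbols = wrap_features(["V", "PST"])
# The source symbol "V" and the feature "[V]" are now distinct vocabulary items.
shared = set(source_symbols) & set(feature_symbols)
print(feature_symbols, shared)
```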
Alternatively, this might motivate using separate encoders for feature-full LSTM and transformer models.
Issue #7 is spiritually related.
This is a blocker for a post-beta release candidate.
[copied from CUNY-CL/abstractness/issues/102]
The PTL training loop attempts to gracefully shut down in response to a Ctrl+C keyboard interrupt. However, simply re-running the same training command restarts training from scratch, with a new version number. This is in contrast to, e.g., FairSeq, which picks up from the last checkpoint.
One suspects this is unpredictable behavior; restarts ought to be programmatically supported.
This is a blocker for a post-beta release candidate.
I have a need (discussed off-thread) to make it so that every batch is the same size. This seems to me to require the following:
(1) [...] @property is appropriate here)
(2) [...] --pad_to_max or something) which when enabled causes source, features, and target (respectively) to be padded to the appropriate length from (1) respectively; it just needs to be passed as pad_len to the batches.PaddedTensor constructor to achieve this
I will put this aside until #40 is completed, however, since this may interact in weird ways with that.
Currently, our validation_step method on the BaseEncoderDecoder computes a per-batch accuracy and aggregates them at the end of each epoch. Because of this, we get a macro-average accuracy that will depend on the batch size.
I noticed something must be strange when using evaluation sets of size 1000 and getting validation accuracies to many decimal places (like 0.9247395...). I think we probably want to accumulate raw counts of correct/incorrect dev samples per batch, and then aggregate those into an accuracy at the end of each epoch.
The impact of this should be small, but still, I believe we are getting slightly incorrect accuracies according to the expected micro accuracy.
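To make the difference concrete, here is a small worked example with made-up counts (three full batches and one short final batch). The function names are illustrative, not taken from the codebase.

```python
from typing import List, Tuple

# Each tuple is (number correct, batch size) for one validation batch.
batches: List[Tuple[int, int]] = [(30, 32), (28, 32), (31, 32), (2, 4)]


def macro_accuracy(batches: List[Tuple[int, int]]) -> float:
    """Averages per-batch accuracies; every batch counts equally."""
    return sum(correct / size for correct, size in batches) / len(batches)


def micro_accuracy(batches: List[Tuple[int, int]]) -> float:
    """Pools raw counts; every example counts equally."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


macro = macro_accuracy(batches)
micro = micro_accuracy(batches)
print(f"macro: {macro:.4f}, micro: {micro:.4f}")
```

Here the short final batch drags the macro average down to about 0.82, while the pooled accuracy is 91/100 = 0.91. The two only agree when all batches are the same size.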
W&B provides a simple interface for running hyperparameter sweeps. To support this, we need to refactor the train and predict functions to read hyperparameter settings from elsewhere, and we probably also want an alternative command-line launcher to yoyodyne-train.
This is related to #5, as the refactoring there should make this much easier.
[copied from CUNY-CL/abstractness/issues/123]
There are a lot of pure Python loops in the transducer implementation and many can be replaced with PyTorch functions.
We have three schedulers, each with their own arguments. They probably merit a paragraph or two in the README describing how they're used. I am assigning this to myself.
Do we have any interest in adding in a masking function to Yoyodyne to allow BERT style training with the available models? This could feasibly improve inflection/g2p performance by allowing pretraining. Also allows use of the library for LM-esque training.
Will remove the suggestion if it goes against the underlying purpose of the library. Just thought I'd ask since it's related to a potential side project.
After #71, we now can control, for a given training batch, whether teacher or student forcing is used. Some recent work suggests that for sequence-to-sequence models there is an advantage to training with student forcing. Some other work recommends gradually rolling out student forcing during training. I propose that we:
Note that the stochastic option (the second one) is somewhat different from what Bengio et al. do: they do this at the token level. However, this seems harder and slower to implement, so I am suggesting something simpler to start out with.
Both of these can be thought of as hyperparameter-free (beyond the boolean decision of whether or not to use student forcing during training at all). If either works, we can incorporate it into the master branch.
[copied from CUNY-CL/abstractness/issues/12]
A new architecture: a variant of the pointer-generator using a transformer encoder.
Previously we talked about treating the choice of encoder as its own parameter, independent of other architectural choices, and this might be a good place to implement that.
[copied from CUNY-CL/abstractness/issues/120]
A new architecture: a variant of the transducer using a transformer encoder.
Previously we talked about treating the choice of encoder as its own parameter, independent of other architectural choices, and this might be a good place to implement that.
Beam search is generally not supported by our models, though the flag exists. It appears to be supported in lstm; it is unclear to me whether it's supported by feature_invariant_transformer, transformer and transducer (if any of these silently ignore the flag, which is what I think happens, this is a serious but easy-to-fix bug), and it's explicitly not implemented in pointer_generator_lstm.
[copied from CUNY-CL/abstractness/issues/50]
Currently when the evaluation dataset is instantiated, it makes a pass through the data to compute indices. However, these indices are already known from the training dataset. It should be possible to forestall this unnecessary pass over the evaluation dataset.
Transducer training on GPU raises an error because it encounters a mixed CPU/GPU operation. Sample trace:
Epoch 0: 0%| | 0/294 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/kbg/.miniconda3/bin/yoyodyne-train", line 8, in <module>
sys.exit(main())
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/train.py", line 448, in main
trainer.fit(model, train_loader, eval_loader)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
result = self._run_optimization(
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/optim/adadelta.py", line 87, in step
loss = closure()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
closure_result = closure()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
step_output = self._step_fn()
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
return self.model.training_step(*args, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/base.py", line 115, in training_step
preds = self(batch)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 86, in forward
prediction, loss = self.decode(
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 174, in decode
last_action = self.decode_action_step(
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/transducer.py", line 260, in decode_action_step
end_of_input = (input_length - alignment) <= 1 # 1 -> Last char.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Epoch 0: 0%| | 0/294 [00:00<?, ?it/s]
This is a blocker for a post-beta release candidate.
When testing, I noticed two things:
[...] hparams.yaml but don't remember what I did off the top of my head.
I need to look into both at some point. Putting this issue here as a placeholder for now.
More a reminder than a fix: we should keep up to date scripts for model architectures for quick testing and runs.
We should add a benchmarking suite. I have reserved a separate repo, CUNY-CL/yoyodyne-benchmarks for this.
Here is a list of shared tasks (and related papers) from which we can pull data:
The benchmark itself is a collection of two tables.
A single script should compute all KPI statistics and dump it out as a TSV. This table should include:
While one could imagine a single script which performs all studies, this is probably not wise. Rather, these should be grouped into separate scripts based on their functionality (though it may make sense for there to be multiple studies per study script; e.g., we could have one script per dataset/language pair). The results can be dumped out in some structured format (JSON), and a separate script can be used to aggregate the non-ragged portions of all the JSON study reports into a single TSV. This table should include:
[...] torch.cuda.get_device_name(number))
Then, a separate script is used to aggregate the non-ragged portions of the extant study observations.
Studies should include:
Putting this all together should make it easy for us to win relevant shared tasks. ;)
This is related to #5, as the refactoring there should make this much easier. This is also related to #15; we may want to use the sweeping interface for the benchmarks.
[This continues the discussion in #12.]
Both the transducer and the pointer-generator treat features in architecture-specific ways; issue #12 deals with their ideal treatment in the transducer, since the Makarov/Clematide mechanism of treating them as one-hot embeddings appears to be ineffectual. In contrast, they are just concatenated by the LSTM and transformer models. My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated. To be more explicit, imagine the source tensor is of size batch_size x hidden_size x source_length and the feature tensor is of size batch_size x hidden_size x feature_length. Then I propose that we concatenate these to form a tensor of size batch_size x hidden_size x (source_length + feature_length), and attention operates over that larger tensor.
As a result, we can use the features column to do multi-source translation in general. Furthermore, the dataset getter is no longer conditioned on architecture: it just has features and no-feature variants, which makes things a bit simpler.
(Note that this will not work for the inattentive LSTM; something else will have to be done or we can just dump it.)
A distant enhancement is that it would be possible, in theory, to have different encoders for source and feature (LSTM vs. GRU vs. transformer); an even more distant enhancement would be to allow these to have different dimensionalities and use linear projection to map the feature encoding back onto the source encoding. I do not actually think we should do either of these, but it's a thing we could do...
[copied from CUNY-CL/abstractness/issues/96]
Transducers make it somewhat difficult to condition generation on features. Makarov & Clematide use a one-hot embedding of features, concatenated itemwise to the input string embeddings, but @bonham79's pilot results suggest this is ineffectual. We can imagine at least a couple other ways to do it:
[copied from CUNY-CL/abstractness/issues/22]
The documentation should probably explain --gpu/--no-gpu and also link to some information about getting CUDA working.
PTL allows us to simulate batch sizes larger than what will fit in your accelerator's memory by accumulating gradients from multiple ("mini")batches. For instance, to simulate an effective batch size of 8192 with smaller batches of 1024, one could do --batch_size 1024 --accumulate_grad_batches 8. Alternatively, if batch size is not known at runtime (e.g., because one is doing hyperparameter optimization), one can do --max_batch_size 1024 and anything larger than that will be simulated like so.
We should document this briefly.
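The arithmetic behind this could be documented with a sketch like the one below. The `split_batch_size` helper is hypothetical: it illustrates one way a requested batch size might be split under a cap, and is not necessarily how --max_batch_size is implemented.

```python
from typing import Tuple


def split_batch_size(batch_size: int, max_batch_size: int) -> Tuple[int, int]:
    """Returns (per-step batch size, number of accumulation steps).

    Picks the fewest accumulation steps such that the per-step batch fits
    under max_batch_size and the product equals the requested batch size.
    """
    if batch_size <= max_batch_size:
        return batch_size, 1
    steps = -(-batch_size // max_batch_size)  # Ceiling division.
    while batch_size % steps:
        steps += 1
    return batch_size // steps, steps


# An effective batch of 8192 under a cap of 1024: 8 steps of 1024.
print(split_batch_size(8192, 1024))
# A batch size that is not a multiple of the cap: 3 steps of 1000.
print(split_batch_size(3000, 1024))
```

Gradients are summed over the accumulation steps before the optimizer update, so the effective batch size is the product of the two returned values.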
Correct me if I'm wrong, but: the --tied_vocabulary flag has no effect except in the construction of the index (and this has no downstream impact).
It isn't "tied" in the stronger sense that source and target symbols share an embedding, so that a source "a" and a target "a" receive the same representation. Is that what was intended?
If this is correct, I propose we either:
[copied from CUNY-CL/abstractness/issues/52]
The transducer uses random choice for tutoring. Alternatively, Boltzmann exploration could be used:
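As a generic illustration of Boltzmann (softmax) exploration, and not of how the transducer's tutoring is actually structured: each candidate action is sampled with probability proportional to exp(score / temperature). The function names, and the idea that per-action scores are available as a list of floats, are assumptions.

```python
import math
import random
from typing import List, Optional


def boltzmann_probabilities(scores: List[float], temperature: float) -> List[float]:
    """Softmax over scores / temperature, shifted by the max for stability."""
    scaled = [score / temperature for score in scores]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def boltzmann_choice(
    scores: List[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Samples an action index according to the Boltzmann distribution."""
    rng = rng or random.Random()
    probabilities = boltzmann_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probabilities, k=1)[0]


scores = [1.0, 0.0, -1.0]
# A high temperature approaches uniform random choice; a low temperature
# approaches always picking the best-scoring action.
hot = boltzmann_probabilities(scores, temperature=100.0)
cold = boltzmann_probabilities(scores, temperature=0.01)
action = boltzmann_choice(scores, temperature=1.0, rng=random.Random(0))
```

The temperature would be an extra hyperparameter; uniform random choice is recovered in the limit of large temperature.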
The library does not work with Lightning >= 2.0.0 (and one suspects that Torch itself is also an issue). The first issue I encounter when running yoyodyne-train with no arguments is related to a change in how Lightning command-line arguments are handled---I suspect there are at least a few more.
So that the library is not broken at head---which I consider unacceptable---I have pinned as follows:
pytorch-lightning>=1.7.0,<2.0.0
torch>=1.11.0,<2.0.0
What we need to do is just to migrate to 2.0.0, by fixing Lightning (and Torch, if any) bugs until things work, and then re-pin these two dependencies >=2.0.0. I have initially assigned myself, but I would welcome help.
Right now we apply LogSoftmax and then use NLLLoss, with a complicated custom version thereof when applying label smoothing.
However, CrossEntropyLoss supposedly merges these two steps, and it also supports built-in label smoothing. Moving to it, I suspect, would give us a small speedup. My first attempt to do this was not successful, however: loss plateaued quickly at zero accuracy.
Note that the transducer also has special-casing here.
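For reference, the equivalence can be checked numerically without Torch. The sketch below uses plain-Python stand-ins (these are not the torch.nn classes) for a single example, and the label-smoothing term follows the usual uniform-smoothing definition, which I believe is what the built-in option computes.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def nll_loss(log_probs: List[float], target: int) -> float:
    """Negative log-likelihood; expects log-probabilities, not logits."""
    return -log_probs[target]


def cross_entropy(
    logits: List[float], target: int, label_smoothing: float = 0.0
) -> float:
    """Cross-entropy from raw logits with optional uniform label smoothing."""
    log_probs = log_softmax(logits)
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - label_smoothing) * -log_probs[target] + label_smoothing * uniform


logits = [2.0, 0.5, -1.0]
two_step = nll_loss(log_softmax(logits), target=0)
one_step = cross_entropy(logits, target=0)
smoothed = cross_entropy(logits, target=0, label_smoothing=0.1)
print(two_step, one_step, smoothed)
```

The practical consequence is that the merged loss expects raw logits, so the model's LogSoftmax layer and the loss have to be changed together.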
@Adamits for discussion.
Based on other work we're doing, we should add some documentation and make necessary tweaks for running a W&B sweep with this codebase.
Little bug we missed in, I assume, #110.
We now pass the datamodule when calling the train method, but it still expects a train and dev loader.
Without checking the docs, I suspect this works because PTL checks the type of train_loader, which is actually a DataModule with both the train and dev loaders, and uses it. Then, dev_loader in our method gets a string or None intended for the train_from arg, and since I have not tested this with train_from, everything was working as expected.
Easy fix that I can make today.
I believe both transformers and LSTM layers require that the number of encoder and decoder layers match. However, they're specified as separate arguments and flags. This makes it hard to do hyperparameter search including them; if one does something like
parameters:
  encoder_layers:
    values: [1, 2]
  decoder_layers:
    values: [1, 2]
in a hyperparameter sweep, half the draws will fail at initialization because you'll draw a different number of the two.
I propose that we just standardize this as --layers. Thoughts?
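As a stopgap that works on the sweep side, one could sweep over a single value and copy it into both settings before launching training. The `layers` key and the `expand_layers` helper below are hypothetical, and the dictionary just mirrors the sweep configuration format shown above.

```python
from typing import Any, Dict

# Sweep parameters with a single layer count instead of two independent ones.
sweep_parameters: Dict[str, Any] = {
    "layers": {"values": [1, 2]},
}


def expand_layers(config: Dict[str, Any]) -> Dict[str, Any]:
    """Copies a drawn `layers` value into both encoder and decoder settings."""
    expanded = {key: value for key, value in config.items() if key != "layers"}
    expanded["encoder_layers"] = config["layers"]
    expanded["decoder_layers"] = config["layers"]
    return expanded


draws = [
    expand_layers({"layers": n}) for n in sweep_parameters["layers"]["values"]
]
print(draws)
```

Every draw now has matching layer counts, so none of them fail at initialization for this reason.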
In yoyodyne-predict, the following will fail: --output path_to_current_directory_file.txt. Do you know why? It's because it tries to create the directory before writing. The simple solution, which I will now implement, will not call os.makedirs if writing to the PWD. (We can just test for an empty string.)
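The empty-string test could look roughly like this. The `prepare_output_path` helper is a hypothetical name, and the sketch assumes directory creation is done with os.makedirs.

```python
import os
import tempfile


def prepare_output_path(path: str) -> None:
    """Creates the parent directory of `path`, if it has one."""
    dirname = os.path.dirname(path)
    # os.path.dirname returns "" for a bare filename, and os.makedirs("")
    # raises FileNotFoundError, so skip directory creation in that case.
    if dirname:
        os.makedirs(dirname, exist_ok=True)


# A bare filename in the current directory is now a no-op.
prepare_output_path("predictions.txt")

# A nested path still gets its parent directories created.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "results", "predictions.txt")
    prepare_output_path(nested)
    created = os.path.isdir(os.path.dirname(nested))
```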
Reported by @liliest
[Spun-off from discussion in #51.]
Create a defaults.py that defines defaults as constants. These are then used in train.py and predict.py, and in some cases, in the architecture modules themselves.
One extra complexity for model class lookup is that we treat attention (but only for LSTMs) as a separate flag rather than as a separate architecture. This isn't mirrored in the backend: they are separate model classes.
I propose to call the vanilla LSTM --arch lstm, the attentive LSTM --arch attentive_lstm. Documentation and defaults will also need to be updated, as will tests. Self-assigning.
Currently one needs the index and a checkpoint to train. If you Ctrl+C during training, the graceful shutdown writes out the last checkpoint, but the index is not written until training is complete. You face the same problem if you want to do predictions with an intermediate model while training is still going. The only workaround is to train zero epochs, just to get an index file.
The solution, it seems, is to write out the index earlier. This could either be done before training kicks off, or more fancifully, with some kind of callback.
Reported by @liliest.
With --accelerator cpu (or not specifying since that's the default):
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/pointer_generator.py", line 142, in decode_step
attention_probs.scatter_add_(
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype
One of the argument tensors, I suppose, is on CPU and the other one is...IDK where. @Adamits, at your leisure: any insights here?
We only have one decaying scheduler right now: the goofy warmup+inverse square root decay strategy from the original transformer paper. I propose that we also add a linear decay to schedulers.py.
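A linear decay can be expressed as a multiplier on the base learning rate, as in the sketch below. The function name and its parameters are assumptions. In practice something like this could be wrapped in torch.optim.lr_scheduler.LambdaLR, and PyTorch also ships a LinearLR scheduler that may already cover the simple case.

```python
def linear_decay(
    step: int,
    total_steps: int,
    start_factor: float = 1.0,
    end_factor: float = 0.0,
) -> float:
    """Returns the learning rate multiplier at a given step.

    The multiplier moves linearly from start_factor at step 0 to end_factor
    at total_steps, and stays at end_factor afterwards.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return start_factor + (end_factor - start_factor) * progress


factors = [linear_decay(step, total_steps=10) for step in (0, 5, 10, 20)]
print(factors)
```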
@Adamits I think this is in your wheelhouse so I have assigned it to you; let me know if that's a problem.
This scary error sometimes greets you if you are using the transformer:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [398,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
I believe this is what happens when you have inputs that exceed --max_sequence_length (increasing it to a huge number makes it go away). What should we do about this? A few possibilities:
(1) Enforce --max_sequence_length by throwing an exception. This would have the interesting effect that it would apply not just to transformer anymore (which actually might be desirable---certainly it's clearer).
(2) [...] --max_sequence_length.
I think I prefer (1). What do you think?
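Throwing an exception for over-long inputs could be sketched as follows. The exception class and the `check_length` helper are hypothetical names, and where exactly the check would live (dataset, collator, or model) is left open.

```python
from typing import Sequence


class SequenceLengthError(ValueError):
    """Hypothetical error for inputs longer than the configured maximum."""


def check_length(symbols: Sequence[str], max_sequence_length: int) -> None:
    """Raises a readable error instead of a CUDA index-out-of-bounds assertion."""
    if len(symbols) > max_sequence_length:
        raise SequenceLengthError(
            f"Sequence length {len(symbols)} exceeds "
            f"--max_sequence_length ({max_sequence_length})"
        )


check_length(list("abc"), max_sequence_length=3)  # Fine: at the limit.
try:
    check_length(list("abcd"), max_sequence_length=3)
except SequenceLengthError as error:
    message = str(error)
    print(message)
```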
Relatedly, it occurs to me that the following might be clearer:
--max_sequence_length > --max_source_length; help docs should also mention that this is for transformers only, unless we adopt (1) in which case it becomes uniform; we should also move this into base.py
--max_decode_length > --max_target_length
Here's the pitch: use the feature-invariance notion on an LSTM with features. @bonham79 thinks this is an easy fix.
One question is whether we want to add a new --arch flag or just add --feature_invariant.
I was just thinking, though our pointer-generator implementation(s) take care to encode features separately so that they are not used in the attention distribution for the pointer probabilities, I think it is worth making it easy to just consider features as other input symbols along with the lemmas.
That is, to concatenate the features with the input just like we do for the 'vanilla' seq2seq models. This is just for comparison as sometimes these models learn things on their own without much intervention.
Currently, our prediction code does some relatively hairy stuff that is also architecture sensitive. Lightning permits one to define a predict_step method. If the user does not specify this, prediction is just the same as calling forward.
It seems to me from this that we could likely move some of that code (the squeeze, transpose, and max) into predict_step for the appropriate modules, making it so that there's less Torchyness, and so we don't need to pass around or switch on the arch string.
As a postlude to #72, I propose we make it possible to use ByT5 as the source (e.g., --source_encoder_arch byt5_base) and/or feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.
This should become much easier to do upon completion of #72---we'd just implement a new encoder ByT5 in yoyodyne/models/modules/byt5.py. In the constructor, you'd use the transformers.T5EncoderModel.from_pretrained class method to instantiate an encoder; there are four sizes (small, base, large, xl) and we could just add all four.
I don't think I'd go about adding access to just any HuggingFace encoder though, as their tokenizers will be incompatible. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.
The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add byte as a special-case separator option.
Here are some early notes on how to do this.
[copied from CUNY-CL/abstractness/issues/101]
I see the following (mostly spurious) warnings on various runs:
nn.functional.sigmoid
batch_size from an ambiguous collection
max_epochs was not set.
Users tend to treat warnings as errors, so we may want to suppress these.
get_loss_function in base uses subdefinitions for functions. This is not compatible with multiprocessing so needs to be adjusted. Adding as issue here for mental note.
Currently, when training a sweep agent, we need to start the agent within a hyperparameter config, which automatically logs all of the hyperparameters to W&B. We also need to initialize the pytorch-lightning WandbLogger, which under the hood attempts to log the config again. See here for details: wandb/wandb#2641.
Ideally, we would solve this by not updating twice; however, this may be out of our control as we rely on wandb and pytorch-lightning for those behaviors.
A work-around is to suppress the W&B warning message. So far, warnings.filterwarnings does not work -- but we should investigate this more.
Reporting some funky behavior with the pointer-generator with features:
(1) With --arch pointer_generator_lstm and features enabled (i.e., --features_col 3 or something other than the default 0), I get the following report:
Model: pointer-generator
Encoder: LSTM
Decoder: attentive LSTM
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name                   | Type                  | Params
-----------------------------------------------------------------
0 | loss_func              | NLLLoss               | 0
1 | dropout_layer          | Dropout               | 0
2 | source_encoder         | LSTMEncoder           | 333 K
3 | decoder                | LSTMAttentiveDecoder  | 283 K
4 | classifier             | Linear                | 12.0 K
5 | log_softmax            | LogSoftmax            | 0
6 | generation_probability | GenerationProbability | 601
-----------------------------------------------------------------
Since it doesn't list a feature encoder in either place, this makes me think it has ignored features even though I have explicitly requested them.
(2) With --arch pointer_generator_lstm --source_encoder lstm --features_encoder lstm (which should be the same thing) we get a crash:
File "/home/kbg/.miniconda3/lib/python3.10/site-packages/yoyodyne/models/pointer_generator.py", line 370, in forward
return predictions
UnboundLocalError: local variable 'predictions' referenced before assignment
The reason for this should be clear from the code: the branch that begins at line 345 doesn't define predictions.
(3) --arch pointer_generator_lstm --source_encoder lstm --features_encoder linear works, though after hill climbing for a while both losses go nan (e.g., on our Polish data).
(4) Same story as (3) with --arch pointer_generator_lstm --source_encoder transformer --features_encoder linear.
I am assigning this to @bonham79; I think the fix will be quite small.
[copied from CUNY-CL/abstractness/issues/87]
We should add integration tests (I hesitate to call these unit tests), simply limiting ourselves to the model sizes and data quantities we can run on CircleCI's free tier. We get 6,000 compute-minutes per month...all of this is pretty generous except that I am unclear whether we can use their GPU images or are stuck on CPU (ideally we'd parameterize tests on both). I think it ought to be possible to do actual training of the major models using, say, 1,000 examples. Unit tests could include g2p (for feature-less) and inflection (for feature-full) from SIGMORPHON.
The current training and prediction functions are structured to read and write directly to the file system. They should be modularized to take ordinary arguments and return the results:
These functions can then be called by the existing (null return type) training and prediction functions, the ones parameterized with click flags.
This will also support two other projects (issues coming soon):
This is a blocker for a post-beta release candidate.
Previously, I recall that we had methods like get_trainer and get_model to go along with train.get_trainer_from_argparse_args and train.get_model_from_argparse_args.
These seem to have disappeared. What is the rationale behind that? Was this a mistake?