When I use StochasticDurationPredictor as shown below in place of the normal duration predictor and pitch predictor, the duration loss and pitch loss are very large (roughly 4000–5000).
Even allowing for the fact that attn_hard_dur is still being trained (by the alignment encoder) and is not yet stable, the loss seems far too large.
What could be the problem?
```python
self.duration_predictor = StochasticDurationPredictor(model_config)

# output: [batch_size, hidden_dim, text_seq_len] — text encoder output
output = output + speaker_embedding
sdp_mask = torch.unsqueeze(
    sequence_mask(text_seq_lens, output.shape[-1]), 1
).to(output.dtype)
# With reverse=False the SDP runs in training mode and returns a
# negative log-likelihood term, not a predicted duration.
duration_prediction = self.duration_predictor(
    x=output,
    x_mask=sdp_mask,
    w=attn_hard_dur.unsqueeze(1),
    reverse=False,
)
duration_loss = torch.sum(duration_prediction.float())
```
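For comparison, the original VITS implementation divides the SDP's summed NLL by `torch.sum(x_mask)` (the number of valid text positions) before it enters the total loss, so the logged value stays small regardless of batch size and sequence length. A minimal sketch of that normalization with dummy numbers (not my actual model outputs):

```python
import torch

# Dummy per-utterance summed NLL values from an SDP-style predictor.
batch_size, text_seq_len = 16, 120
raw_nll = torch.full((batch_size,), 300.0)

# Mask of valid text positions, shape [batch, 1, text_seq_len];
# all positions are valid in this toy example.
sdp_mask = torch.ones(batch_size, 1, text_seq_len)

# Raw sum grows with batch size and sequence length ...
unnormalized = torch.sum(raw_nll.float())           # 4800.0
# ... while dividing by the mask sum (as VITS does) keeps it small.
duration_loss = unnormalized / torch.sum(sdp_mask)  # 4800 / 1920 = 2.5
print(duration_loss.item())
```

Without that division, summing over every token in the batch alone can push the logged loss into the thousands, which may explain the magnitudes I am seeing.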
```
[Train step : 100] total_loss 10906.626953, mel_loss 0.977667, d_loss 5461.666504, p_loss 5441.733887, ctc_loss 2.249039, bin_loss 1.860192,
[Train step : 200] total_loss 11053.518555, mel_loss 0.775560, d_loss 5531.041016, p_loss 5519.537598, ctc_loss 2.165245, bin_loss 1.741701,
[Train step : 300] total_loss 10651.076172, mel_loss 0.666218, d_loss 5329.517578, p_loss 5318.736328, ctc_loss 2.155767, bin_loss 1.736991,
[Train step : 400] total_loss 10753.271484, mel_loss 0.627934, d_loss 5380.976074, p_loss 5369.540039, ctc_loss 2.126560, bin_loss 1.712614,
[Train step : 500] total_loss 11444.052734, mel_loss 0.591087, d_loss 5737.601562, p_loss 5703.707520, ctc_loss 2.152741, bin_loss 1.709005,
[Train step : 600] total_loss 11534.011719, mel_loss 0.564516, d_loss 5772.588867, p_loss 5758.684082, ctc_loss 2.173516, bin_loss 1.651289,
[Train step : 700] total_loss 12391.351562, mel_loss 0.555285, d_loss 6199.072754, p_loss 6189.496582, ctc_loss 2.226302, bin_loss 1.634849,
[Train step : 800] total_loss 10847.539062, mel_loss 0.542545, d_loss 5435.289062, p_loss 5409.686523, ctc_loss 2.021489, bin_loss 1.557770,
[Train step : 900] total_loss 10487.540039, mel_loss 0.532718, d_loss 5260.755371, p_loss 5224.231445, ctc_loss 2.020667, bin_loss 1.525733,
[Train step : 1000] total_loss 9263.677734, mel_loss 0.536135, d_loss 4639.945312, p_loss 4621.299805, ctc_loss 1.896345, bin_loss 1.430400,
[Train step : 1100] total_loss 10892.701172, mel_loss 0.537346, d_loss 5445.554688, p_loss 5444.647461, ctc_loss 1.962303, bin_loss 1.467209,
[Train step : 1200] total_loss 9963.730469, mel_loss 0.528609, d_loss 4972.430664, p_loss 4988.891113, ctc_loss 1.879873, bin_loss 1.405760,
[Train step : 1300] total_loss 9535.383789, mel_loss 0.527506, d_loss 4766.958496, p_loss 4766.083008, ctc_loss 1.815230, bin_loss 1.356215,
[Train step : 1400] total_loss 10367.413086, mel_loss 0.529463, d_loss 5190.863281, p_loss 5174.230469, ctc_loss 1.789981, bin_loss 1.329592,
[Train step : 1500] total_loss 10163.126953, mel_loss 0.525895, d_loss 5072.743164, p_loss 5088.081543, ctc_loss 1.776225, bin_loss 1.312975,
[Train step : 1600] total_loss 10285.883789, mel_loss 0.518091, d_loss 5121.499023, p_loss 5162.050781, ctc_loss 1.815271, bin_loss 1.384229,
[Train step : 1700] total_loss 10007.465820, mel_loss 0.515487, d_loss 4998.065918, p_loss 5007.191406, ctc_loss 1.692332, bin_loss 1.380546,
[Train step : 1800] total_loss 10438.118164, mel_loss 0.506113, d_loss 5225.224609, p_loss 5210.665527, ctc_loss 1.721988, bin_loss 1.504314,
[Train step : 1900] total_loss 9777.532227, mel_loss 0.515738, d_loss 4897.006836, p_loss 4878.411133, ctc_loss 1.598633, bin_loss 1.363741,
[Train step : 2000] total_loss 10859.443359, mel_loss 0.488703, d_loss 5405.339844, p_loss 5451.948242, ctc_loss 1.666379, bin_loss 1.473202,
[Train step : 2100] total_loss 10407.706055, mel_loss 0.495060, d_loss 5194.208008, p_loss 5211.336914, ctc_loss 1.665549, bin_loss 1.437147,
[Train step : 2200] total_loss 10450.838867, mel_loss 0.491688, d_loss 5227.919922, p_loss 5220.825684, ctc_loss 1.601205, bin_loss 1.470080,
[Train step : 2300] total_loss 9259.790039, mel_loss 0.489379, d_loss 4645.092773, p_loss 4612.664551, ctc_loss 1.544137, bin_loss 1.450988,
```