polyglot: Large Language Models of Well-balanced Competence in Multi-languages
role: distributed training of LMs with Megatron-LM, plus data crawling, preprocessing, and model evaluation. Published the 1.3B, 3.8B, 5.8B, and 12.8B polyglot-ko models.
Thanks for your code; it helps a lot. When I tried to reproduce your results on the Amazon review datasets, I found that the BERT-AAD accuracies are worse than the no-adapt results. Have you encountered the same issue?

```
Epoch [79/80] Step [195/200]: acc=0.5000 g_loss=0.6922 d_loss=0.6932 kd_loss=0.0000
```

I also noticed that when the adaptation training converges, only the kd_loss descends to 0, while g_loss and d_loss do not descend at all. Is this normal, or is this where the problem lies? Could you also please release the hyperparameters for BERT-AAD?
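For context, here is a minimal sketch of how I currently understand the three losses, in case I have misread the training loop. It assumes an ADDA-style adversarial step plus temperature-scaled distillation, and all names (`adapt_step`, `discriminator`, `src_feat`, `tgt_feat`, `T`) are my own placeholders rather than identifiers from your repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def adapt_step(src_feat, tgt_feat, src_logits, tgt_logits, discriminator, T=20.0):
    """One adaptation step: discriminator loss, generator loss, KD loss."""
    # Discriminator: separate source features (label 1) from target (label 0).
    feats = torch.cat([src_feat.detach(), tgt_feat.detach()], dim=0)
    labels = torch.cat([
        torch.ones(src_feat.size(0), device=feats.device),
        torch.zeros(tgt_feat.size(0), device=feats.device),
    ])
    d_loss = bce(discriminator(feats).squeeze(-1), labels)

    # Generator (target encoder): try to fool the discriminator by
    # scoring target features with flipped (source) labels.
    d_out = discriminator(tgt_feat).squeeze(-1)
    g_loss = bce(d_out, torch.ones_like(d_out))

    # Distillation: pull the target model's predictions toward the source
    # model's temperature-scaled soft labels.
    kd_loss = F.kl_div(
        F.log_softmax(tgt_logits / T, dim=-1),
        F.softmax(src_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return d_loss, g_loss, kd_loss
```

If this matches your setup, then a discriminator that outputs p = 0.5 everywhere gives d_loss = g_loss = ln(2) ≈ 0.693 with acc = 0.5, which is exactly the plateau in my log; I cannot tell whether that means the adversarial game reached its intended equilibrium or simply never trained.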