(yzy-KEAR) jizhi2@jizhi2-MS-7A78:/media/jizhi2/软件/yzy/KEAR$ bash/task_train.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
start is 1660190764.6195216
start is 1660190764.6195297
[1858634] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
[1858635] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
[1858635]: world_size = 2, rank = 1, backend=nccl
[1858634]: world_size = 2, rank = 0, backend=nccl
batch size: 2, total_batch_size: 10
batch size: 2, total_batch_size: 10
clearing output folder.
args.fp16 is 0
args.fp16 is 0
load_vocab google/electra-large-discriminator
load_vocab google/electra-large-discriminator
load_data data/csqa_ret_3datasets/train_data.json
load_data data/csqa_ret_3datasets/train_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 1222, world_size: 2
get dir test/
make dataloader ...
data: 1222, world_size: 2
get dir test/
make dataloader ...
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
model_type= electra
model_type= electra
init model finished.
init model finished.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
load successfully.
load successfully.
2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
File "task.py", line 410, in <module>
srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
File "task.py", line 93, in train
self.trainer.train(
File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
for step, batch in enumerate(train_looper):
File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
batch = next(self.dataloader_iter)
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
index = self._next_index() # may raise StopIteration
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/media/jizhi2/软件/yzy/KEAR/utils/resumable_sampler.py", line 31, in __iter__
assert len(self.perm) == self.total_size
AssertionError
Traceback (most recent call last):
File "task.py", line 410, in <module>
srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
File "task.py", line 93, in train
self.trainer.train(
File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
for step, batch in enumerate(train_looper):
File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
batch = next(self.dataloader_iter)
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
index = self._next_index() # may raise StopIteration
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/media/jizhi2/软件/yzy/KEAR/utils/resumable_sampler.py", line 31, in __iter__
assert len(self.perm) == self.total_size
AssertionError
Traceback (most recent call last):
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '1e-5', '--append_answer_text', '1', '--weight_decay', '0.01', '--preset_model_type', 'electra', '--batch_size', '2', '--max_seq_length', '50', '--num_train_epochs', '10', '--save_interval_step', '2', '--continue_train', '--print_number_per_epoch', '2', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--print_loss_step', '10', '--clear_output_folder']' returned non-zero exit status 1.
While debugging, I found that len(self.perm) is 9741 but self.total_size is 9742, so the assertion at line 31 of resumable_sampler.py fails.
What causes this off-by-one mismatch?
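
One arithmetic observation that may be relevant: 9741 is odd and world_size is 2. Distributed samplers in the style of PyTorch's DistributedSampler round the dataset length up to a multiple of world_size and pad the index list to match, which would give exactly 9742. The sketch below only illustrates that arithmetic. It is not the code in resumable_sampler.py, which is not shown here, and the helper names padded_total_size and pad_indices are hypothetical.

```python
import math


def padded_total_size(dataset_len: int, world_size: int) -> int:
    """Round dataset_len up to the nearest multiple of world_size.

    Hypothetical helper; assumes the sampler follows the usual
    ceil(N / world_size) * world_size convention.
    """
    num_samples_per_rank = math.ceil(dataset_len / world_size)
    return num_samples_per_rank * world_size


def pad_indices(indices: list, total_size: int) -> list:
    """Pad the index list to total_size by repeating its first entries."""
    padding = total_size - len(indices)
    return indices + indices[:padding]


dataset_len, world_size = 9741, 2            # values taken from the log above
total_size = padded_total_size(dataset_len, world_size)

perm = list(range(dataset_len))              # an unpadded permutation: 9741 entries
padded_perm = pad_indices(perm, total_size)  # padded to 9742 entries

print(len(perm), total_size, len(padded_perm))
```

If this assumption holds, a permutation built from the raw dataset length without the padding step would have 9741 entries while total_size is 9742, which matches the values reported above. The dev set (1222 examples) is even, so it would not show the same mismatch.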