cofe-ai / fast-gector
License: Apache License 2.0
Memory usage keeps growing during training until the process crashes
Hi,
I'm having trouble training on an RTX 3090. No matter what I try, training does not seem to run on the GPU.
When I start training, there is a log line saying it is using the GPU (setup device: cuda:0), but GPU memory usage does not change at all, and CPU usage is very high on one core.
I'm using conda with Python 3.7 and torch 1.11.0, which I set up with this command:
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
Am I missing anything? Please help me if possible. Thank you!
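A quick way to narrow this down is to ask PyTorch directly whether it can see the GPU. The sketch below is a generic diagnostic, not part of fast-gector; the helper name `cuda_report` is made up for illustration. It relies only on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`, and returns early if torch cannot be imported.

```python
def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch was installed.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the environment is probably running a CPU-only build, and reinstalling with a matching `cudatoolkit` would be the first thing to try.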
Is the special_tokens_fix parameter specific to RoBERTa? If I use a BERT model, should I set it to 0?
Hello Jason,
I am using the fine-tuned RoBERTa model from the GECToR repo to run inference over a file of sentences. I am able to make predictions using the original GECToR, but I am having issues with fast-gector.
This is what I have in predict.sh:
#!/bin/bash
mkdir result
deepspeed --include localhost:0 --master_port 42991 predict.py \
    --batch_size 256 \
    --iteration_count 5 \
    --min_len 3 \
    --max_len 128 \
    --min_error_probability 0.0 \
    --additional_confidence 0.0 \
    --sub_token_mode "average" \
    --max_pieces_per_token 5 \
    --model_dir "/home/ec2-user/fast-gector" \
    --ckpt_id "roberta_1_gectorv2.th" \
    --detect_vocab_path "./data/vocabulary/d_tags.txt" \
    --correct_vocab_path "./data/vocabulary/labels.txt" \
    --pretrained_transformer_path "roberta-base" \
    --input_path "sentences_sample_100.txt" \
    --out_path "result/sentences_sample_100.preds" \
    --special_tokens_fix 1 \
    --detokenize 1 \
    --deepspeed_config "./configs/ds_config_zero1.json"
The error message is listed below. It looks like DeepSpeed is looking for .pt files under the model directory. How can I use the fine-tuned model to make predictions in this case?
Traceback (most recent call last):
File "/home/ec2-user/fast-gector/predict.py", line 101, in <module>
main(args)
File "/home/ec2-user/fast-gector/predict.py", line 39, in main
predictor = Predictor(args)
File "/home/ec2-user/fast-gector/src/predictor.py", line 38, in __init__
self.model = self.init_model(args)
File "/home/ec2-user/fast-gector/src/predictor.py", line 59, in init_model
ds_engine.load_checkpoint(args.model_dir, args.ckpt_id)
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2759, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2793, in _load_checkpoint
sd_loader = SDLoaderFactory.get_sd_loader(
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 44, in get_sd_loader
return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 216, in __init__
super().__init__(ckpt_list, version, checkpoint_engine)
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 56, in __init__
self.check_ckpt_list()
File "/home/ec2-user/miniconda/envs/gector_env/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 179, in check_ckpt_list
assert len(self.ckpt_list) > 0
AssertionError
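The assertion at the bottom of this traceback fires when the checkpoint file list is empty. DeepSpeed's `load_checkpoint(load_dir, tag)` appears to look inside `load_dir/tag/` for files named along the lines of `mp_rank_00_model_states.pt`, so passing the name of a single `.th` file as the tag would leave it with nothing to load. The sketch below only inspects the directory the way such a lookup would. `list_model_state_files` is a hypothetical helper, and the `*model_states.pt` pattern is an assumption based on DeepSpeed's usual naming, so it is worth checking against the installed version.

```python
import glob
import os


def list_model_state_files(model_dir, ckpt_id):
    """Return the files a DeepSpeed-style lookup would likely find (hypothetical helper).

    Assumes the layout <model_dir>/<ckpt_id>/*model_states.pt.
    """
    ckpt_dir = os.path.join(model_dir, str(ckpt_id))
    if not os.path.isdir(ckpt_dir):
        # A plain file such as "roberta_1_gectorv2.th" is not a checkpoint directory.
        return []
    return sorted(glob.glob(os.path.join(ckpt_dir, "*model_states.pt")))


if __name__ == "__main__":
    files = list_model_state_files("/home/ec2-user/fast-gector", "roberta_1_gectorv2.th")
    print(files or "no DeepSpeed model-state files found")
```

An empty result suggests the weights would have to be read some other way than through `load_checkpoint`, or first converted into that directory layout.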
Hey Jason,
Thanks for making this AllenNLP-free GECToR library.
I am trying to run inference on a CPU-only machine, and I received an error message from DeepSpeed saying that no GPU resources are available.
Does this mean I have to remove the DeepSpeed dependency from the codebase if I want to run on the CPU only?
How can I export the model to ONNX for inference?
Hello. What are your reproduced results on the test sets?
0it [00:00, ?it/s]
/home/coulombc/wheels_builder/tmp.17380/python-3.10/torch/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [0,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/coulombc/wheels_builder/tmp.17380/python-3.10/torch/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [0,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at /home/coulombc/wheels_builder/tmp.17380/python-3.10/torch/c10/cuda/CUDAException.cpp:31 (most recent call first):
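An `indexSelectLargeIndex` assertion of the form `srcIndex < srcSelectDimSize` usually means that some index passed to a lookup (often an embedding layer) is at least as large as the table it indexes. One common cause is a vocabulary or tokenizer that does not match the checkpoint. The sketch below checks a batch of ids in plain Python before they reach the model; `find_out_of_range_ids` and the sample numbers are illustrative only and do not come from fast-gector.

```python
def find_out_of_range_ids(batch_ids, table_size):
    """Return (row, col, value) for every id outside [0, table_size)."""
    bad = []
    for row, ids in enumerate(batch_ids):
        for col, value in enumerate(ids):
            if value < 0 or value >= table_size:
                bad.append((row, col, value))
    return bad


if __name__ == "__main__":
    # Illustrative numbers: a table with 10 rows and one id that is too large.
    batch = [[0, 3, 9], [4, 10, 2]]
    print(find_out_of_range_ids(batch, table_size=10))
```

Running with `CUDA_LAUNCH_BLOCKING=1`, as the message suggests, or running the same batch on the CPU generally gives a clearer stack trace for this kind of problem.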
@Jason3900
Hi, first of all, thank you for removing the AllenNLP dependency from GECToR, because I had problems using it.
I have started running your version but ran into some problems and have some questions. I would be thankful if you could answer them.
I didn't use CUDA and NVIDIA. Does that cause problems when running? I should say that I'm running the code in Colab with a T4 GPU.
Also, I decided to use only part of the synthetic data just to see if it works, but it has been running for about 2 hours and is still going. Is that normal?
After it finished, it gave this error:
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2330722it [1:57:15, 26.72s/it]
[2023-10-25 16:50:16,631] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6518
[2023-10-25 16:50:16,632] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'configs/ds_config_zero1_fp16.json', '--num_epochs', '10', '--max_num_tokens', '128', '--valid_batch_size', '256', '--cold_step_count', '0', '--warmup', '0.1', '--cold_lr', '1e-3', '--skip_correct', '0', '--skip_complex', '0', '--sub_token_mode', 'average', '--special_tokens_fix', '1', '--unk2keep', '0', '--tp_prob', '1', '--tn_prob', '1', '--detect_vocab_path', './data/vocabulary/d_tags.txt', '--correct_vocab_path', './data/vocabulary/labels.txt', '--do_eval', '--train_path', '/content/drive/MyDrive/GEC_test/gector/OUTPUT_FILE', '--valid_path', '/content/drive/MyDrive/GEC_test/gector/OUTPUT_Test', '--save_dir', 'ckpts/ckpt_20231025_14:52:16', '--use_cache', '0', '--log_interval', '1', '--eval_interval', '50', '--save_interval', '50', '--pretrained_transformer_path', 'roberta-base', '--tensorboard_dir', 'logs/tb/gector_20231025_14:52:16'] exits with return code = -9
Should I change ckpt_path="ckpts/globalstep-xxxx" in predict.sh? What should it be?
Maybe my questions are too naive, but I'm fairly new to GEC, so any help would be really appreciated.
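The `exits with return code = -9` at the end of the log above follows the usual subprocess convention in which a negative code means the process was terminated by that signal number, here signal 9. On Linux that is SIGKILL, which is frequently (though not always) the kernel's out-of-memory killer, so host RAM on the Colab machine would be a plausible suspect rather than GPU memory. A small sketch for decoding such codes, where `describe_return_code` is a made-up helper name:

```python
import signal


def describe_return_code(code):
    """Translate a subprocess-style return code into a readable string."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with status {code}"
    try:
        # Negative codes encode the number of the signal that ended the process.
        name = signal.Signals(-code).name
    except ValueError:
        name = f"unknown signal {-code}"
    return f"killed by {name}"


if __name__ == "__main__":
    print(describe_return_code(-9))
```

Watching system memory while the dataset is being loaded would help confirm or rule out this explanation.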
Hello sir,
I was wondering how fast-gector captures the context of long sentences when it is only trained on some English tokens/words.
Hello! First of all, thank you for this implementation.
I'm currently trying to apply it to Arabic, and I have a few questions.
If you could answer them, I would be very thankful!
In the get_target_sent_by_levels function, the very first edit level (level 0) is skipped by the statement rest_labels = label_list[1:]. Why is that?
In my data there is only one level of edits (one error per sentence), so would it be wrong to use label_list[0:] instead?
How is the labels vocabulary generated? Should I only take the words that appear in my training data, or can I use a vocabulary from another source? What would you recommend?
Sorry if this is a lot of questions! I just want to make sure that I'm training my model correctly.
Thank you so much in advance!
It looks like MisMatchedTokenizer now takes three positional arguments, but only two are given in the predictor class.
I received the following error message when running inference:
Traceback (most recent call last):
File "/home/ec2-user/gector/fast-gector/predict.py", line 101, in <module>
main(args)
File "/home/ec2-user/gector/fast-gector/predict.py", line 49, in main
pred_batch, cnt = predictor.handle_batch(batch_text)
File "/home/ec2-user/gector/fast-gector/src/predictor.py", line 76, in handle_batch
batch_input_dict = self.preprocess(ori_batch)
File "/home/ec2-user/gector/fast-gector/src/predictor.py", line 114, in preprocess
input_ids, offsets = self.mismatched_tokenizer.encode(tokens)
File "/home/ec2-user/gector/fast-gector/utils/mismatched_utils.py", line 22, in encode
wordpiece_ids = [self.tokenizer_vocab[wordpiece]
File "/home/ec2-user/gector/fast-gector/utils/mismatched_utils.py", line 22, in <listcomp>
wordpiece_ids = [self.tokenizer_vocab[wordpiece]
TypeError: 'int' object is not subscriptable
Thank you for offering this AllenNLP-free version of GECToR. I was trying it out but found that Seq2EditDataset can be quite slow, at about 8 it/s, which makes it impractical to process the original dataset used by the paper (9M for pretraining). Is this normal, or am I missing something important that would speed it up?
CPU memory usage increases during training until the process crashes.
Hi,
I want to use this method for another language and build the dataset myself.
Can I use it that way, or do this method and model only give good results for high-resource languages and large datasets? I would be thankful for an answer.
Hello
Would it be possible to publish a branch without the DeepSpeed dependency? Our use case requires running on CPU, which I'm afraid is not compatible with DeepSpeed. Thanks!
Hi,
How can I get the data mentioned in the data preparation script?
SUBSET="train-stage2"
SOURCE="../gec_private_train_data/${SUBSET}.src"
TARGET="../gec_private_train_data/${SUBSET}.trg"
OUTPUT="../gec_private_train_data/${SUBSET}.edits"