Comments (11)
I have the same problem while doing preprocessing locally.
I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000
Tried it with the data from the "scraped" dir provided with the repo.
Please find the log in the attached file.
I've installed the dependencies using conda, as follows:
conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece
conda list output:
from gpt-2-tensorflow2.0.
Hi @vincsous and @RomanPlusPlus
Thanks for reporting the issue.
I have fixed the issue; please pull the code and test.
Thanks
Hi @akanyaani and thank you. Preprocessing is working for me now. But I have another problem with training.
First, since I am using Colab, I do not have multiple GPUs, so I chose --distributed=False.
It seems that training starts but then stops ("Training Done....") at step 20, with 11% accuracy.
Here is the log.
log_train.txt
Thanks again
Hi @akanyaani, thank you for your speedy response.
Unfortunately, the problem persists. I still get the same [!sentences_.empty()] error.
Please find the log in the attached file.
But it's working on my system. Could you please print the files in that directory? Add a print in the pre_process.py train method:
text_files = glob.glob((data_dir + "/*.txt"))
print(text_files)  # Add this and see whether it prints the text files
process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")
This error occurs when text_files does not contain any text files.
If text_files is an empty list, try to resolve the path issue.
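A quick way to check this is a small standalone sketch: it runs the same glob the train method uses and, if nothing matches, prints the absolute path it searched and what is actually in that directory. The `data_dir` value here is only a placeholder; substitute whatever you pass as --data-dir.

```python
import glob
import os

# Placeholder path for illustration; substitute your own --data-dir value.
data_dir = "data/scraped"

text_files = glob.glob(data_dir + "/*.txt")
if not text_files:
    # The glob matched nothing: either the directory path is wrong
    # or the files there don't end in .txt.
    print("No .txt files found in", os.path.abspath(data_dir))
    contents = os.listdir(data_dir) if os.path.isdir(data_dir) else "(directory does not exist)"
    print("Directory contents:", contents)
else:
    print("Found", len(text_files), "text files")
```

If this prints an empty result even though you can see files in the directory, the mismatch is almost always the working directory or the file extensions.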
Hi @vincsous
I will look into that.
Thanks
Hi @akanyaani ,
I added the line you suggested.
It prints out the following:
['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']
I also checked the "processed.txt" file. It's empty.
You are getting this error because you are passing the wrong data directory. This repo ships sample data in /data/scraped, so try this:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000
I am also getting this error. My command:
python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000
Checked the processed.txt file - it's got PLENTY of data.
Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.
My OS: Ubuntu Linux 20 (latest version)
Running in conda custom environment.
My conda env.yaml file:
`name: tf
channels:
- anaconda
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _tflow_select=2.1.0=gpu
- absl-py=0.9.0=py36_0
- astunparse=1.6.3=py_0
- blas=1.0=mkl
- blinker=1.4=py36_0
- brotlipy=0.7.0=py36h7b6447c_1000
- c-ares=1.15.0=h7b6447c_1001
- ca-certificates=2020.6.24=0
- cachetools=4.1.0=py_1
- certifi=2020.6.20=py36_0
- cffi=1.14.0=py36he30daa8_1
- chardet=3.0.4=py36_1003
- click=7.1.2=py_0
- cryptography=2.9.2=py36h1ba5d50_0
- cudatoolkit=10.1.243=h6bb024c_0
- cudnn=7.6.5=cuda10.1_0
- cupti=10.1.168=0
- ftfy=5.7=py_0
- gast=0.3.3=py_0
- google-auth=1.14.1=py_0
- google-auth-oauthlib=0.4.1=py_2
- google-pasta=0.2.0=py_0
- grpcio=1.27.2=py36hf8bcb03_0
- h5py=2.10.0=py36hd6299e0_1
- hdf5=1.10.6=hb1b8bf9_0
- idna=2.10=py_0
- intel-openmp=2020.1=217
- keras-preprocessing=1.1.0=py_1
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20191231=h14c3975_1
- libffi=3.3=he6710b0_2
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libprotobuf=3.12.3=hd408876_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- markdown=3.1.1=py36_0
- mkl=2019.4=243
- mkl-service=2.3.0=py36he904b0f_0
- mkl_fft=1.1.0=py36h23d657b_0
- mkl_random=1.1.0=py36hd6b4f25_0
- ncurses=6.2=he6710b0_1
- numpy=1.18.5=py36ha1c710e_0
- numpy-base=1.18.5=py36hde5b4d6_0
- oauthlib=3.1.0=py_0
- openssl=1.1.1g=h7b6447c_0
- opt_einsum=3.1.0=py_0
- pip=20.1.1=py36_1
- protobuf=3.12.3=py36he6710b0_0
- pyasn1=0.4.8=py_0
- pyasn1-modules=0.2.7=py_0
- pycparser=2.20=py_0
- pyjwt=1.7.1=py36_0
- pyopenssl=19.1.0=py36_0
- pysocks=1.7.1=py36_0
- python=3.6.10=h7579374_2
- readline=8.0=h7b6447c_0
- requests=2.24.0=py_0
- requests-oauthlib=1.3.0=py_0
- rsa=4.0=py_0
- scipy=1.5.0=py36h0b6359f_0
- setuptools=47.3.1=py36_0
- six=1.15.0=py_0
- sqlite=3.32.3=h62c20be_0
- tensorboard=2.2.1=pyh532a8cf_0
- tensorboard-plugin-wit=1.6.0=py_0
- tensorflow=2.2.0=gpu_py36hf933387_0
- tensorflow-base=2.2.0=gpu_py36h8a81be8_0
- tensorflow-estimator=2.2.0=pyh208ff02_0
- tensorflow-gpu=2.2.0=h0d30ee6_0
- termcolor=1.1.0=py36_1
- tk=8.6.10=hbc83047_0
- tqdm=4.47.0=py_0
- urllib3=1.25.9=py_0
- wcwidth=0.2.5=py_0
- werkzeug=1.0.1=py_0
- wheel=0.34.2=py36_0
- wrapt=1.12.1=py36h7b6447c_1
- xz=5.2.5=h7b6447c_0
- zlib=1.2.11=h7b6447c_3
- pip:
- sentencepiece==0.1.85
prefix: /home/b/anaconda3/envs/tf
`
You can run into this error even if your path is correct, because the train method assumes your data files use the .txt file extension. Files without a .txt extension won't be picked up, causing the error.
I'd recommend changing the train method to:
def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*")
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")
In other words, change "/*.txt" to "/*".
Better yet, gather the file paths recursively (note that glob only expands "**" when recursive=True is passed):
text_files = glob.glob(data_dir + "/**/*", recursive=True)
This allows you to have your data files within their own directories - useful if you have thousands of them and want to work with subsets of those thousands sometimes.
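One caveat with the recursive pattern: it matches the subdirectories themselves as well as the files inside them, so it's worth filtering to regular files before handing the list to process_text. A minimal sketch, using a hypothetical layout like data/patents_a/0001.txt, data/patents_b/0002.txt:

```python
import glob
import os

# Hypothetical top-level data directory for illustration.
data_dir = "data"

# "**" only recurses when recursive=True is passed (Python 3.5+);
# without it, the pattern is treated literally and matches nothing useful.
paths = glob.glob(data_dir + "/**/*", recursive=True)

# The recursive glob also returns the subdirectories themselves,
# so keep only regular files.
text_files = [p for p in paths if os.path.isfile(p)]
print(text_files)
```

Passing a directory path into the downstream open() calls would raise IsADirectoryError, which is why the isfile filter matters here.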
I encountered this error when running the code on Windows. I fixed it by editing all calls to open like this:
with open(PROCESS_DATA_PATH, 'r', encoding='utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding='utf-8', newline='') as f_output:
The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
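The reason this bites specifically on Windows: when open() is called without encoding=, Python falls back to the platform default (locale.getpreferredencoding()), which is typically cp1252 on Windows rather than UTF-8, so non-ASCII bytes in a UTF-8 file raise UnicodeDecodeError. A small sketch showing the default and an explicit UTF-8 round trip that behaves the same on every platform (the file path here is just a throwaway temp file):

```python
import locale
import os
import tempfile

# What open() uses when no encoding= is given; often cp1252 on Windows.
print(locale.getpreferredencoding(False))

# Round-trip a non-ASCII string with an explicit encoding, which is
# platform-independent.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("déjà vu")
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
print(text)
```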