Comments (11)

RomanPlusPlus commented on May 19, 2024

I have the same problem while doing preprocessing locally.

I cd'ed to the gpt-2-tensorflow2.0 directory and ran the following command:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000

Tried it with the data from the "scraped" dir provided with the repo.

Please find the log in the attached file.

log.txt

I've installed the dependencies using conda, as follows:
conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece

conda list output:

packages_versions.txt

from gpt-2-tensorflow2.0.

akanyaani commented on May 19, 2024

Hi @vincsous and @RomanPlusPlus

Thanks for reporting the issue.
I have fixed the issue; please pull the latest code and test.

Thanks

vincsous commented on May 19, 2024

Hi @akanyaani, and thank you. Preprocessing is working for me now, but I have another problem with training.
First, as I am using Colab, I do not have multiple GPUs, so I chose --distributed=False.
It seems to start training, but it stops ("Training Done....") at step 20, at 11% accuracy.
Here is the log.
log_train.txt

Thanks again

RomanPlusPlus commented on May 19, 2024

Hi @akanyaani, thank you for your speedy response.

Unfortunately, the problem persists. I still get the same [!sentences_.empty()] error.

Please find the log in the attached file.

log200517.txt

akanyaani commented on May 19, 2024

Hi @RomanPlusPlus

It's working on my system. Could you please print the files in that directory?

Add a print in the pre_process.py train method:

text_files = glob.glob(data_dir + "/*.txt")
print(text_files)  # add this to see whether any text files are found

process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")

This error occurs when text_files contains no text files. If it prints an empty list, check the path you are passing as --data-dir.
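The check above can be pulled out into a small standalone helper (a minimal sketch; the function name is mine, not from the repo):

```python
import glob
import os

def find_text_files(data_dir):
    """Return the .txt files directly under data_dir, sorted.
    An empty list means sentencepiece training would fail with
    the [!sentences_.empty()] error."""
    return sorted(glob.glob(os.path.join(data_dir, "*.txt")))
```

If this returns [] for the directory you pass as --data-dir, the path is the problem, not the preprocessing code.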

akanyaani commented on May 19, 2024

Hi @vincsous

I will look into that.

Thanks

from gpt-2-tensorflow2.0.

RomanPlusPlus commented on May 19, 2024

Hi @akanyaani ,

I added the line you suggested.
It prints out the following:

['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']

I also checked the "processed.txt" file. It's empty.
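For anyone debugging the same symptom, a quick way to confirm that an intermediate file such as processed.txt came out empty (a minimal sketch; the helper name is mine):

```python
import os

def is_empty_file(path):
    """True if the file exists but contains zero bytes."""
    return os.path.exists(path) and os.path.getsize(path) == 0
```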

akanyaani commented on May 19, 2024

Hi @RomanPlusPlus

You are getting this error because you are passing the wrong data directory. This repo keeps its sample data in /data/scraped, so try this instead:

python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000

apteryxlabs commented on May 19, 2024

I am also getting this error. My command:
python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000

Checked the processed.txt file - it's got PLENTY of data.

Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.

My OS: Ubuntu 20 (the latest release at the time).

Running in conda custom environment.

My conda env.yaml file:
name: tf
channels:
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.9.0=py36_0
  - astunparse=1.6.3=py_0
  - blas=1.0=mkl
  - blinker=1.4=py36_0
  - brotlipy=0.7.0=py36h7b6447c_1000
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.6.24=0
  - cachetools=4.1.0=py_1
  - certifi=2020.6.20=py36_0
  - cffi=1.14.0=py36he30daa8_1
  - chardet=3.0.4=py36_1003
  - click=7.1.2=py_0
  - cryptography=2.9.2=py36h1ba5d50_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudnn=7.6.5=cuda10.1_0
  - cupti=10.1.168=0
  - ftfy=5.7=py_0
  - gast=0.3.3=py_0
  - google-auth=1.14.1=py_0
  - google-auth-oauthlib=0.4.1=py_2
  - google-pasta=0.2.0=py_0
  - grpcio=1.27.2=py36hf8bcb03_0
  - h5py=2.10.0=py36hd6299e0_1
  - hdf5=1.10.6=hb1b8bf9_0
  - idna=2.10=py_0
  - intel-openmp=2020.1=217
  - keras-preprocessing=1.1.0=py_1
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libprotobuf=3.12.3=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - markdown=3.1.1=py36_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.1.0=py36h23d657b_0
  - mkl_random=1.1.0=py36hd6b4f25_0
  - ncurses=6.2=he6710b0_1
  - numpy=1.18.5=py36ha1c710e_0
  - numpy-base=1.18.5=py36hde5b4d6_0
  - oauthlib=3.1.0=py_0
  - openssl=1.1.1g=h7b6447c_0
  - opt_einsum=3.1.0=py_0
  - pip=20.1.1=py36_1
  - protobuf=3.12.3=py36he6710b0_0
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.20=py_0
  - pyjwt=1.7.1=py36_0
  - pyopenssl=19.1.0=py36_0
  - pysocks=1.7.1=py36_0
  - python=3.6.10=h7579374_2
  - readline=8.0=h7b6447c_0
  - requests=2.24.0=py_0
  - requests-oauthlib=1.3.0=py_0
  - rsa=4.0=py_0
  - scipy=1.5.0=py36h0b6359f_0
  - setuptools=47.3.1=py36_0
  - six=1.15.0=py_0
  - sqlite=3.32.3=h62c20be_0
  - tensorboard=2.2.1=pyh532a8cf_0
  - tensorboard-plugin-wit=1.6.0=py_0
  - tensorflow=2.2.0=gpu_py36hf933387_0
  - tensorflow-base=2.2.0=gpu_py36h8a81be8_0
  - tensorflow-estimator=2.2.0=pyh208ff02_0
  - tensorflow-gpu=2.2.0=h0d30ee6_0
  - termcolor=1.1.0=py36_1
  - tk=8.6.10=hbc83047_0
  - tqdm=4.47.0=py_0
  - urllib3=1.25.9=py_0
  - wcwidth=0.2.5=py_0
  - werkzeug=1.0.1=py_0
  - wheel=0.34.2=py36_0
  - wrapt=1.12.1=py36h7b6447c_1
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - sentencepiece==0.1.85
prefix: /home/b/anaconda3/envs/tf

elbowdonkey commented on May 19, 2024

You can run into this error even if your path is correct, because the train method assumes your data files use the .txt extension. Files with any other extension are ignored, which causes the same error.

I'd recommend that the train method be changed to:

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    text_files = glob.glob(data_dir + "/*")
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")

In other words, change "/*.txt" to "/*".

Better yet, gather the file paths recursively. Note that ** only recurses when glob's recursive=True flag is set, and a filter is needed because the pattern also matches directories (this assumes os is imported in pre_process.py):

text_files = [f for f in glob.glob(data_dir + "/**/*", recursive=True) if os.path.isfile(f)]

This allows you to have your data files within their own directories - useful if you have thousands of them and want to work with subsets of those thousands sometimes.
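The same recursive gathering can be sketched with pathlib instead of glob (the helper name is mine; both approaches skip directories and descend into subdirectories):

```python
from pathlib import Path

def collect_text_files(data_dir):
    """Recursively list regular files under data_dir, sorted.
    Path.rglob("*") walks subdirectories; is_file() drops the
    directory entries the pattern also matches."""
    return sorted(str(p) for p in Path(data_dir).rglob("*") if p.is_file())
```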

tkahn commented on May 19, 2024

I encountered this error when running the code on Windows. I fixed it by adding an explicit encoding to every open call, like this:

with open(PROCESS_DATA_PATH, 'r', encoding = 'utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding = 'utf-8', newline='') as f_output:

The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
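As a minimal standalone sketch of the same idea (the helper names are mine; in the repo the real paths are PROCESS_DATA_PATH and BPE_TSV_PATH): passing encoding="utf-8" makes reads and writes behave the same on Windows, whose default encoding is often cp1252, and newline="" stops Windows from translating line endings on write.

```python
def write_tsv(path, rows):
    # newline="" prevents \r\n translation on Windows;
    # utf-8 keeps Linux, macOS, and Windows consistent.
    with open(path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")

def read_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```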
