- I'm working as a researcher at the AI Center of Samsung Life Insurance.
- NLP, including language modeling and representation learning
- Also ML and recommender systems
- Fine-tuning and serving large language models
- Retrieval-augmented generation (RAG)
Builds a WordPiece (subword) vocabulary compatible with Google Research's BERT
Hi M. H. Kwon,
Your tokenization script is really helpful.
I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.). As a result I got a vocab file, .tfrecord files, a .json file, and checkpoint files.
Now, how do I use those files for the tasks below?
I need your help.
Hi, thank you for sharing.
I am trying to build vocab.txt for the IMDB movie review dataset, like below.
python3 subword_builder.py --corpus_filepattern IMDB_review.txt --output_filename vocab.txt --min_count 30000
WARNING:tensorflow:From subword_builder.py:81: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.
W0304 18:26:35.470829 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.
['./IMDB_review.txt']
WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.
W0304 18:26:35.492865 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.
19.23373532295227 for reading read file : ./IMDB_review.txt
read all files
WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
W0304 18:26:54.772613 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:Iteration 0
I0304 18:26:54.772828 140030738089792 text_encoder.py:588] Iteration 0
INFO:tensorflow:vocab_size = 668
I0304 18:26:59.560518 140030738089792 text_encoder.py:660] vocab_size = 668
INFO:tensorflow:Iteration 1
I0304 18:26:59.560930 140030738089792 text_encoder.py:588] Iteration 1
INFO:tensorflow:vocab_size = 378
I0304 18:27:02.865697 140030738089792 text_encoder.py:660] vocab_size = 378
INFO:tensorflow:Iteration 2
I0304 18:27:02.866119 140030738089792 text_encoder.py:588] Iteration 2
INFO:tensorflow:vocab_size = 403
I0304 18:27:06.409686 140030738089792 text_encoder.py:660] vocab_size = 403
INFO:tensorflow:Iteration 3
I0304 18:27:06.409908 140030738089792 text_encoder.py:588] Iteration 3
INFO:tensorflow:vocab_size = 397
I0304 18:27:10.208530 140030738089792 text_encoder.py:660] vocab_size = 397
INFO:tensorflow:Iteration 4
I0304 18:27:10.208930 140030738089792 text_encoder.py:588] Iteration 4
INFO:tensorflow:vocab_size = 399
I0304 18:27:13.905530 140030738089792 text_encoder.py:660] vocab_size = 399
total vocab size : 456, 19.1799635887146 seconds elapsed
INFO:tensorflow:vocab_size = 456
I0304 18:27:13.912348 140030738089792 text_encoder.py:686] vocab_size = 456
But the vocab size is very small. What's wrong?
IMDB_review.txt
I thought this was wonderful way to spend time on too hot summer weekend sitting in the air conditioned theater and watching light hearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point Risk Addiction thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love This was the most d laughed at one of Woody comedies in years dare say decade While ve never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into average but spirited young woman This may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman great comedy to go see with friends .
Basically there a family where little boy Jake thinks there a zombie in his closet his parents are fighting all the time This movie is slower than soap opera and suddenly Jake decides to become Rambo and kill the zombie OK first of all when you re going to make film you must Decide if its thriller or drama As drama the movie is watchable Parents are divorcing arguing like in real life And then we have Jake with his closet which totally ruins all the film expected to see BOOGEYMAN similar movie and instead watched drama with some meaningless thriller spots out of just for the well playing parents descent dialogs As for the shots with Jake just ignore them .
And my TensorFlow version is 1.15.
Hi @kwonmha,
your project is exactly what I had in mind when dealing with BERT vocab creation. I'm currently doing some vocab optimizations for my BERT project, too.
Can you say something about the improvements/degradations related to your vocab changes? I'm really curious whether this approach delivers better results.
The code in the repo works with neither TensorFlow 1.11 nor 2.0. I got it to work under 1.11 by changing the code to match the 1.11 API:
filenames = sorted(tf.gfile.Glob(filepattern))
print(filenames)
lines_read = 0
for filename in filenames:
  start = time.time()
  with tf.gfile.Open(filename) as f:
Alternatively, update the README to reflect the TF version you used.
I wanted to build a vocab on my corpus. I made a folder named data, put the file in it (a small text file, just for a sanity check), set corpus_max_lines to 8 (the number of lines in my test text), and ran the following command:
python subword_builder.py --corpus_filepattern=/media/ayushjain1144/"New Volume"/"IGCAR PS"/data --corpus_max_lines=8 --output_filename=/media/ayushjain1144/"New Volume"/"IGCAR PS"/out
The output I received is this:
['/media/ayushjain1144/New Volume/IGCAR PS/data']
0.00014281272888183594 for reading read file : /media/ayushjain1144/New Volume/IGCAR PS/data
read all files
61
61
61
61
The out directory is empty. I also can't understand what the repeated 61 means. Please help!
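A likely cause, offered as an assumption: --corpus_filepattern was given a directory path, so the glob matched the directory itself instead of the text file inside it. A minimal sketch of the difference:

```python
import glob
import os
import tempfile

# Create a throwaway directory with one text file in it.
d = tempfile.mkdtemp()
with open(os.path.join(d, "corpus.txt"), "w") as f:
    f.write("hello world\n")

print(glob.glob(d))                         # matches the directory itself, no files
print(glob.glob(os.path.join(d, "*.txt")))  # matches the text file inside it
```

If that is the issue, pointing --corpus_filepattern at a pattern like .../data/*.txt should make the builder read the actual file.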
Hello,
first of all, thank you for this project.
I tried to run it using tensorflow==1.11.0 and got this error:
AttributeError: module 'tensorflow.io' has no attribute 'gfile'
I also tried to run it using tensorflow==2.0.0 and got this error:
AttributeError: module 'tensorflow' has no attribute 'flags'
Could you list all the requirements or suggest a solution? Thank you!
I am getting an error while running the vocab builder.
Code and files used for the vocab builder:
!git clone https://github.com/kwonmha/bert-vocab-builder.git
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review_nopunct.txt
!python ./bert-vocab-builder/subword_builder.py --corpus_filepattern "restaurant_review_nopunct.txt" --output_filename "vocab.txt" --min_count 1
Issue 1: fixed by replacing tf.flags with tf.compat.v1.flags (version issue).
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 37, in <module>
    tf.flags.DEFINE_string('output_filename', '/tmp/my.subword_text_encoder',
AttributeError: module 'tensorflow' has no attribute 'flags'
Issue 2:
The number of files to read : 1
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 86, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./bert-vocab-builder/subword_builder.py", line 67, in main
    split_on_newlines=FLAGS.split_on_newlines, additional_chars=FLAGS.additional_chars)
  File "/content/bert-vocab-builder/tokenizer.py", line 191, in corpus_token_counts
    split_on_newlines=split_on_newlines):
  File "/content/bert-vocab-builder/tokenizer.py", line 139, in _read_filepattern
    tf.logging.INFO("Start reading ", filename)
TypeError: 'int' object is not callable
Could anyone please help me out with this issue? Thanks in advance.
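The final frame shows tokenizer.py calling tf.logging.INFO, which is the integer log-level constant, not the logging function; the lowercase tf.logging.info is what was intended. Python's stdlib logging module makes the same distinction, sketched here:

```python
import logging

logging.basicConfig(level=logging.INFO)

# logging.INFO is an integer level constant, not a callable; calling it
# raises the same "TypeError: 'int' object is not callable" seen above.
assert isinstance(logging.INFO, int)

# The lowercase function is the one that actually logs.
logging.info("Start reading %s", "corpus.txt")
```

The same rename in tokenizer.py line 139, from tf.logging.INFO(...) to tf.logging.info(...), should clear the TypeError.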
Windows fatal exception: access violation
Current thread 0x00002078 (most recent call first):
File "D:\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 384 in get_matching_files_v2
File "D:\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 363 in get_matching_files
File "F:\BERT\bert-vocab-builder-master\tokenizer.py", line 133 in _read_filepattern
File "F:\BERT\bert-vocab-builder-master\tokenizer.py", line 188 in corpus_token_counts
File "subword_builder.py", line 63 in main
File "D:\Anaconda3\lib\site-packages\absl\app.py", line 250 in _run_main
File "D:\Anaconda3\lib\site-packages\absl\app.py", line 299 in run
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 40 in run
File "subword_builder.py", line 81 in <module>
bert-vocab-builder/text_encoder.py, line 688 (commit 2e7d107)
I am facing the issue below while running the command to create data for input to the ALBERT model.
python create_pretraining_data.py --input_file /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train --output_file /media/xxxx/NewVolume/ALBERT/ouput --vocab_file /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/models_toy/vocab.txt
WARNING:tensorflow:From create_pretraining_data.py:653: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
WARNING:tensorflow:From create_pretraining_data.py:618: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
W0114 11:29:58.957636 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:618: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From create_pretraining_data.py:618: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
W0114 11:29:58.957761 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:618: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
WARNING:tensorflow:From create_pretraining_data.py:626: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.
W0114 11:29:58.958572 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:626: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.
WARNING:tensorflow:From create_pretraining_data.py:628: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
W0114 11:29:58.959418 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:628: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:*** Reading from input files ***
I0114 11:29:58.959500 140552204957504 create_pretraining_data.py:628] *** Reading from input files ***
INFO:tensorflow: /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train
I0114 11:29:58.959625 140552204957504 create_pretraining_data.py:630] /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train
WARNING:tensorflow:From create_pretraining_data.py:228: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
W0114 11:29:58.960052 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:228: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
Traceback (most recent call last):
  File "create_pretraining_data.py", line 653, in <module>
    tf.app.run()
  File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/xxxx/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/xxxx/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "create_pretraining_data.py", line 636, in main
    rng)
  File "create_pretraining_data.py", line 230, in create_training_instances
    line = reader.readline()
  File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/lib/io/file_io.py", line 179, in readline
    return self._prepare_value(self._read_buf.ReadLineAsString())
  File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/lib/io/file_io.py", line 98, in _prepare_value
    return compat.as_str_any(val)
  File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/util/compat.py", line 123, in as_str_any
    return as_str(value)
  File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/util/compat.py", line 93, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 8: invalid start byte
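The corpus apparently contains bytes that aren't valid UTF-8 (0xb9 suggests a legacy single-byte encoding; which one is an assumption). A minimal sketch that re-encodes a file to clean UTF-8 before it is passed to create_pretraining_data.py, treating the source as Latin-1 and replacing anything undecodable (the function name is hypothetical):

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite src_path as UTF-8, substituting undecodable characters."""
    with open(src_path, "r", encoding=src_encoding, errors="replace") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

If the corpus is actually in a Korean encoding such as EUC-KR or CP949, passing that as src_encoding would preserve the original characters instead of producing mojibake.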
I tried using the vocab builder on German Wikipedia, but some words aren't split into the subwords I expected. For example, "eintausendneunhundertneunzig" is kept as a single subword, although I expected "ein", "tausend", "neun", "hundert", "neun", "zig". Are there any tweaks to make the model handle German, which compounds heavily, more accurately?
Thank you
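One possible tweak, sketched under assumptions (the morpheme list below is a toy for illustration, not part of the repo): pre-split known compounds in the corpus before running subword_builder.py, so frequent long compounds never reach the vocabulary counter intact. A greedy longest-match splitter:

```python
# Toy German morpheme list -- an assumption for illustration only.
MORPHEMES = ["ein", "tausend", "neun", "hundert", "zig"]

def split_compound(word, morphemes=MORPHEMES):
    """Greedy longest-match segmentation; falls back to the whole word."""
    by_length = sorted(morphemes, key=len, reverse=True)
    parts, rest = [], word
    while rest:
        match = next((m for m in by_length if rest.startswith(m)), None)
        if match is None:
            return [word]  # no full parse: keep the original token
        parts.append(match)
        rest = rest[len(match):]
    return parts

print(split_compound("eintausendneunhundertneunzig"))
# -> ['ein', 'tausend', 'neun', 'hundert', 'neun', 'zig']
```

A real setup would need a proper German decompounding lexicon or tool; greedy matching can also mis-split words whose prefix happens to be a listed morpheme.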
I was trying to use the repo for building a vocab and I realized that the encode(text) function is used as a tokenizer. I am not sure if I am right, but I am not able to get the last token in the returned result.
def encode(text):
  """Encode a unicode string as a list of tokens.

  Args:
    text: a unicode string

  Returns:
    a list of tokens as Unicode strings
  """
  if not text:
    return []
  ret = []
  token_start = 0
  # Classify each character in the input string
  is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
  add_remaining = False
  for pos in range(1, len(text)):
    add_remaining = False
    if is_alnum[pos] != is_alnum[pos - 1]:
      if not is_alnum[pos]:
        token = text[token_start:pos]
        if token != u" " or token_start == 0:
          add_remaining = False
          ret.append(token)
      else:
        add_remaining = True
      token_start = pos
  final_token = text[token_start:] if text[-1] in _ALPHANUMERIC_CHAR_SET else text[token_start:-1]
  if add_remaining:
    ret.append(final_token)
  return ret
The following is a sample result:
print(encode("knee injury present"))
>>['knee', 'injury']
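The drop happens because add_remaining is reset to False at the top of every loop iteration, so it is only True when the very last character begins a new alphanumeric run. The upstream tensor2tensor tokenizer avoids this by appending the final token unconditionally; a self-contained sketch of that behavior (the _is_alnum helper approximates _ALPHANUMERIC_CHAR_SET and is an assumption):

```python
import unicodedata

def _is_alnum(c):
    # Letters (L*) and numbers (N*), approximating _ALPHANUMERIC_CHAR_SET.
    return unicodedata.category(c)[0] in ("L", "N")

def encode(text):
    """Split text into runs of alphanumeric / non-alphanumeric characters,
    dropping single separating spaces. The final token is appended
    unconditionally, so nothing is lost at the end of the string."""
    if not text:
        return []
    ret = []
    token_start = 0
    is_alnum = [_is_alnum(c) for c in text]
    for pos in range(1, len(text)):
        if is_alnum[pos] != is_alnum[pos - 1]:
            token = text[token_start:pos]
            if token != u" " or token_start == 0:
                ret.append(token)
            token_start = pos
    ret.append(text[token_start:])
    return ret

print(encode("knee injury present"))
# -> ['knee', 'injury', 'present']
```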
Hi Kwonmha,
Thanks for open-sourcing the repo. May I ask whether the general preprocessing steps for the vocab builder, for an uncased BERT model, are as follows:
Thanks!
Regards
I'm trying to pretrain https://github.com/google-research/bert/ from scratch, and I'm using this library to make vocab.txt. After successfully making a vocab.txt with this library, should I match the vocabulary size in bert_config.json to the newly created vocab.txt?
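Yes, vocab_size in bert_config.json is generally expected to equal the number of entries in vocab.txt, since it sizes the token embedding matrix. A hedged sketch that syncs the config (the file paths, the one-token-per-line vocab format, and the helper name are assumptions):

```python
import json

def sync_vocab_size(vocab_path, config_path):
    """Set "vocab_size" in a BERT config JSON to the number of lines
    in vocab.txt (one WordPiece token per line)."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_size = sum(1 for _ in f)
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config["vocab_size"] = vocab_size
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return vocab_size
```

A mismatch typically surfaces as a shape error when restoring the embedding table, or as out-of-range token IDs during training.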