bojone / bytepiece
A purer tokenizer with a higher compression ratio
License: Apache License 2.0
After converting to an sp model via the class method convert_to_sentencepiece, loading it raises an error:
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor()
sp_model.Load("sp.model")
libc++abi: terminating due to uncaught exception of type Darts::Details::Exception: /Users/runner/work/sentencepiece/sentencepiece/third_party/darts_clone/darts.h:1143: exception: failed to insert key: zero-length key
Related issue: google/sentencepiece#156
The model contains "\0" pieces. Should they be stripped during conversion, and would doing so have side effects?
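Pending the author's answer on side effects, one possible workaround is to filter such pieces out of the model JSON before conversion. This is an untested sketch of mine, assuming the model file is the JSON dict seen in the released models (base64-encoded raw piece bytes as keys); `strip_null_pieces` is a hypothetical helper, not part of bytepiece:

```python
import base64
import json

def strip_null_pieces(model_path: str, output_path: str) -> int:
    """Drop zero-length / NUL-only pieces that darts_clone rejects as keys.

    Assumes the model file is a JSON dict mapping base64-encoded
    piece bytes to metadata, as in the released bytepiece models.
    """
    with open(model_path, encoding='utf-8') as f:
        model = json.load(f)
    cleaned = {
        key: value
        for key, value in model.items()
        if base64.b64decode(key).strip(b'\x00')  # keep only pieces with real content
    }
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(cleaned, f, ensure_ascii=False)
    return len(model) - len(cleaned)  # number of pieces dropped
```

Whether dropping "\0" changes segmentation for inputs that actually contain NUL bytes is exactly the open question; presumably byte_fallback on the sentencepiece side would still cover them.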
The tokenizer-evaluation section reports only the tokenizer's intrinsic metrics, such as compression ratio.
However, a tokenizer with a higher compression ratio does not necessarily yield a better model. Could results at the level of the final model be provided as well?
For example, the BLEU scores in the sentencepiece experiments:
https://github.com/google/sentencepiece/blob/master/doc/experiments.md#english-to-japanese
Hey, I hit an error during training.
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32)
trainer.train(w, workers=2, batch_size=1000)
This raised: AttributeError: Can't pickle local object 'Trainer.pcount.<locals>.worker_func'
My environment:
python 3.8.16
multiprocess 0.70.14
bytepiece-0.6.3
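For context, this error comes from CPython's pickle, which cannot serialize a function defined inside another function. A minimal reproduction of the same failure mode (my own example, not bytepiece internals):

```python
import pickle

def make_worker():
    # a local function, analogous to Trainer.pcount.<locals>.worker_func
    def worker_func(x):
        return x + 1
    return worker_func

try:
    pickle.dumps(make_worker())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_worker.<locals>.worker_func'
```

The environment above lists the dill-based `multiprocess` package, which can serialize such local functions; if the stdlib `multiprocessing` gets used instead, this AttributeError is the expected failure. That diagnosis is my guess, not a confirmed fix.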
line 356, in convert_to_sentencepiece
p = re.sub(' ', '▁', p.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
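The failing line calls p.decode() on raw piece bytes, and byte-level pieces such as b'\x80…' are not valid UTF-8 on their own. A sketch of the failure and one lenient workaround (doing the substitution at the bytes level, then decoding with an error handler); whether sentencepiece then accepts such pieces is a separate question:

```python
import re

piece = b'\x80\xef\xbc\x8c'  # a byte-level piece: stray continuation byte + '，'

try:
    piece.decode()
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x80 in position 0 ...

# substitute at the bytes level first, then decode leniently
p = re.sub(b' ', '\u2581'.encode('utf-8'), piece).decode('utf-8', errors='backslashreplace')
print(p)  # \x80，
```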
The current bytepiece algorithm and implementation are already excellent, but compared with sp there is little documentation and there are few demos, so right now it feels more like a research toy than a production tool. I therefore suggest the author improve the documentation and add a demo or tutorial on vocabulary extension. Thanks!
There seems to be redundancy in the model you released.
Please correct me if I am wrong. 😁
For example, when I run:
grep -A 2 -B 2 -n '","' bytepiece_80k.model > test.txt
I got:
67637- "77yM": [
67638- 13530,
67639: ",",
67640- 52141276
67641- ],
--
114017- "gO+8jA==": [
114018- 22806,
114019: ",",
114020- 27651
114021- ],
--
114062- "ge+8jA==": [
114063- 22815,
114064: ",",
114065- 30470
114066- ],
--
114077- "gu+8jA==": [
114078- 22818,
114079: ",",
114080- 17876
114081- ],
--
114092- "g++8jA==": [
114093- 22821,
114094: ",",
114095- 15013
114096- ],
--
114112- "hO+8jA==": [
114113- 22825,
114114: ",",
114115- 28134
114116- ],
--
114127- "he+8jA==": [
114128- 22828,
114129: ",",
114130- 24900
114131- ],
--
114142- "hu+8jA==": [
114143- 22831,
114144: ",",
114145- 33913
114146- ],
--
114157- "h++8jA==": [
114158- 22834,
114159: ",",
114160- 19583
114161- ],
--
114177- "iO+8jA==": [
114178- 22838,
114179: ",",
114180- 29143
114181- ],
--
114232- "ie+8jA==": [
114233- 22849,
114234: ",",
114235- 56579
114236- ],
--
114252- "iu+8jA==": [
114253- 22853,
114254: ",",
114255- 27238
114256- ],
--
114307- "i++8jA==": [
114308- 22864,
114309: ",",
114310- 55373
114311- ],
--
114397- "jO+8jA==": [
114398- 22882,
114399: ",",
114400- 27059
114401- ],
--
114427- "je+8jA==": [
114428- 22888,
114429: ",",
114430- 26193
114431- ],
--
114437- "ju+8jA==": [
114438- 22890,
114439: ",",
114440- 24179
114441- ],
--
114462- "j++8jA==": [
114463- 22895,
114464: ",",
114465- 38457
114466- ],
--
114477- "kO+8jA==": [
114478- 22898,
114479: ",",
114480- 23832
114481- ],
--
114517- "ke+8jA==": [
114518- 22906,
114519: ",",
114520- 45492
114521- ],
114522- "ku+8jA==": [
114523- 22907,
114524: ",",
114525- 14602
114526- ],
--
114537- "k++8jA==": [
114538- 22910,
114539: ",",
114540- 16818
114541- ],
--
114557- "lO+8jA==": [
114558- 22914,
114559: ",",
114560- 25217
114561- ],
--
114572- "le+8jA==": [
114573- 22917,
114574: ",",
114575- 22115
114576- ],
--
114592- "lu+8jA==": [
114593- 22921,
114594: ",",
114595- 35881
114596- ],
--
114612- "l++8jA==": [
114613- 22925,
114614: ",",
114615- 20337
114616- ],
--
114632- "mO+8jA==": [
114633- 22929,
114634: ",",
114635- 20429
114636- ],
--
114657- "me+8jA==": [
114658- 22934,
114659: ",",
114660- 23524
114661- ],
--
114682- "mu+8jA==": [
114683- 22939,
114684: ",",
114685- 25417
114686- ],
--
114707- "m++8jA==": [
114708- 22944,
114709: ",",
114710- 39254
114711- ],
--
114722- "nO+8jA==": [
114723- 22947,
114724: ",",
114725- 21866
114726- ],
--
114737- "ne+8jA==": [
114738- 22950,
114739: ",",
114740- 30855
114741- ],
114742- "nu+8jA==": [
114743- 22951,
114744: ",",
114745- 13346
114746- ],
--
114777- "n++8jA==": [
114778- 22958,
114779: ",",
114780- 19853
114781- ],
--
114797- "oO+8jA==": [
114798- 22962,
114799: ",",
114800- 15771
114801- ],
--
114812- "oe+8jA==": [
114813- 22965,
114814: ",",
114815- 15774
114816- ],
--
114827- "ou+8jA==": [
114828- 22968,
114829: ",",
114830- 21457
114831- ],
--
114837- "o++8jA==": [
114838- 22970,
114839: ",",
114840- 19483
114841- ],
--
114847- "pO+8jA==": [
114848- 22972,
114849: ",",
114850- 21556
114851- ],
--
114877- "pe+8jA==": [
114878- 22978,
114879: ",",
114880- 31149
114881- ],
--
114892- "pu+8jA==": [
114893- 22981,
114894: ",",
114895- 24183
114896- ],
--
114907- "p++8jA==": [
114908- 22984,
114909: ",",
114910- 27728
114911- ],
--
114922- "qO+8jA==": [
114923- 22987,
114924: ",",
114925- 28618
114926- ],
--
114947- "qe+8jA==": [
114948- 22992,
114949: ",",
114950- 19995
114951- ],
--
114957- "qu+8jA==": [
114958- 22994,
114959: ",",
114960- 13975
114961- ],
--
114982- "q++8jA==": [
114983- 22999,
114984: ",",
114985- 20687
114986- ],
--
114997- "rO+8jA==": [
114998- 23002,
114999: ",",
115000- 32861
115001- ],
--
115012- "re+8jA==": [
115013- 23005,
115014: ",",
115015- 28320
115016- ],
--
115032- "ru+8jA==": [
115033- 23009,
115034: ",",
115035- 20423
115036- ],
--
115047- "r++8jA==": [
115048- 23012,
115049: ",",
115050- 35625
115051- ],
--
115072- "sO+8jA==": [
115073- 23017,
115074: ",",
115075- 25750
115076- ],
--
115082- "se+8jA==": [
115083- 23019,
115084: ",",
115085- 26685
115086- ],
--
115092- "su+8jA==": [
115093- 23021,
115094: ",",
115095- 11848
115096- ],
--
115107- "s++8jA==": [
115108- 23024,
115109: ",",
115110- 17627
115111- ],
--
115127- "tO+8jA==": [
115128- 23028,
115129: ",",
115130- 28616
115131- ],
115132- "te+8jA==": [
115133- 23029,
115134: ",",
115135- 12874
115136- ],
--
115147- "tu+8jA==": [
115148- 23032,
115149: ",",
115150- 23628
115151- ],
--
115182- "t++8jA==": [
115183- 23039,
115184: ",",
115185- 25204
115186- ],
--
115212- "ue+8jA==": [
115213- 23045,
115214: ",",
115215- 37675
115216- ],
--
115252- "uu+8jA==": [
115253- 23053,
115254: ",",
115255- 39151
115256- ],
--
115282- "u++8jA==": [
115283- 23059,
115284: ",",
115285- 26261
115286- ],
115287- "vO+8jA==": [
115288- 23060,
115289: ",",
115290- 20389
115291- ],
--
115312- "ve+8jA==": [
115313- 23065,
115314: ",",
115315- 32564
115316- ],
--
115332- "vu+8jA==": [
115333- 23069,
115334: ",",
115335- 15513
115336- ],
--
115352- "v++8jA==": [
115353- 23073,
115354: ",",
115355- 27048
115356- ],
--
118962- "77yM5A==": [
118963- 23795,
118964: ",",
118965- 9190
118966- ],
118967- "77yM5Q==": [
118968- 23796,
118969: ",",
118970- 205451
118971- ],
118972- "77yM5g==": [
118973- 23797,
118974: ",",
118975- 172391
118976- ],
118977- "77yM5w==": [
118978- 23798,
118979: ",",
118980- 67781
118981- ],
118982- "77yM6A==": [
118983- 23799,
118984: ",",
118985- 128861
118986- ],
118987- "77yM6Q==": [
118988- 23800,
118989: ",",
118990- 55129
118991- ],
--
158192- "uK3vvIw=": [
158193- 31641,
158194: ",",
158195- 9172
158196- ],
--
160082- "77yM5Lg=": [
160083- 32019,
160084: ",",
160085- 113663
160086- ],
160087- "77yM5Lk=": [
160088- 32020,
160089: ",",
160090- 47189
160091- ],
160092- "77yM5Lo=": [
160093- 32021,
160094: ",",
160095- 27242
160096- ],
160097- "77yM5Ls=": [
160098- 32022,
160099: ",",
160100- 61788
160101- ],
160102- "77yM5Lw=": [
160103- 32023,
160104: ",",
160105- 14111
160106- ],
160107- "77yM5L0=": [
160108- 32024,
160109: ",",
160110- 36471
160111- ],
160112- "77yM5YU=": [
160113- 32025,
160114: ",",
160115- 45865
160116- ],
160117- "77yM5YY=": [
160118- 32026,
160119: ",",
160120- 20102
160121- ],
160122- "77yM5Yc=": [
160123- 32027,
160124: ",",
160125- 18068
160126- ],
160127- "77yM5Yg=": [
160128- 32028,
160129: ",",
160130- 22664
160131- ],
160132- "77yM5Y0=": [
160133- 32029,
160134: ",",
160135- 45655
160136- ],
160137- "77yM5Y8=": [
160138- 32030,
160139: ",",
160140- 78634
160141- ],
160142- "77yM5ZA=": [
160143- 32031,
160144: ",",
160145- 40057
160146- ],
160147- "77yM5ZI=": [
160148- 32032,
160149: ",",
160150- 10479
160151- ],
160152- "77yM5Zs=": [
160153- 32033,
160154: ",",
160155- 10167
160156- ],
160157- "77yM5aQ=": [
160158- 32034,
160159: ",",
160160- 35717
160161- ],
160162- "77yM5aU=": [
160163- 32035,
160164: ",",
160165- 24338
160166- ],
160167- "77yM5aY=": [
160168- 32036,
160169: ",",
160170- 13069
160171- ],
160172- "77yM5a4=": [
160173- 32037,
160174: ",",
160175- 31083
160176- ],
160177- "77yM5a8=": [
160178- 32038,
160179: ",",
160180- 22719
160181- ],
160182- "77yM5bA=": [
160183- 32039,
160184: ",",
160185- 60831
160186- ],
160187- "77yM5bw=": [
160188- 32040,
160189: ",",
160190- 12414
160191- ],
160192- "77yM5b4=": [
160193- 32041,
160194: ",",
160195- 15229
160196- ],
160197- "77yM5b8=": [
160198- 32042,
160199: ",",
160200- 11786
160201- ],
160202- "77yM5oM=": [
160203- 32043,
160204: ",",
160205- 9275
160206- ],
160207- "77yM5og=": [
160208- 32044,
160209: ",",
160210- 25038
160211- ],
160212- "77yM5ok=": [
160213- 32045,
160214: ",",
160215- 30394
160216- ],
160217- "77yM5oo=": [
160218- 32046,
160219: ",",
160220- 19145
160221- ],
160222- "77yM5os=": [
160223- 32047,
160224: ",",
160225- 18043
160226- ],
160227- "77yM5o4=": [
160228- 32048,
160229: ",",
160230- 10899
160231- ],
160232- "77yM5pU=": [
160233- 32049,
160234: ",",
160235- 12918
160236- ],
160237- "77yM5pc=": [
160238- 32050,
160239: ",",
160240- 9034
160241- ],
160242- "77yM5pg=": [
160243- 32051,
160244: ",",
160245- 16571
160246- ],
160247- "77yM5pw=": [
160248- 32052,
160249: ",",
160250- 21640
160251- ],
160252- "77yM5p0=": [
160253- 32053,
160254: ",",
160255- 13906
160256- ],
160257- "77yM5q0=": [
160258- 32054,
160259: ",",
160260- 7468
160261- ],
160262- "77yM5q8=": [
160263- 32055,
160264: ",",
160265- 12216
160266- ],
160267- "77yM5rI=": [
160268- 32056,
160269: ",",
160270- 11066
160271- ],
160272- "77yM55Q=": [
160273- 32057,
160274: ",",
160275- 11502
160276- ],
160277- "77yM55s=": [
160278- 32058,
160279: ",",
160280- 11973
160281- ],
160282- "77yM57s=": [
160283- 32059,
160284: ",",
160285- 16988
160286- ],
160287- "77yM6IA=": [
160288- 32060,
160289: ",",
160290- 15794
160291- ],
160292- "77yM6Jk=": [
160293- 32061,
160294: ",",
160295- 8488
160296- ],
160297- "77yM6K4=": [
160298- 32062,
160299: ",",
160300- 18199
160301- ],
160302- "77yM6K8=": [
160303- 32063,
160304: ",",
160305- 33792
160306- ],
160307- "77yM6LA=": [
160308- 32064,
160309: ",",
160310- 13559
160311- ],
160312- "77yM6LU=": [
160313- 32065,
160314: ",",
160315- 14349
160316- ],
160317- "77yM6L8=": [
160318- 32066,
160319: ",",
160320- 51748
160321- ],
160322- "77yM6YA=": [
160323- 32067,
160324: ",",
160325- 23202
160326- ],
160327- "77yM6YE=": [
160328- 32068,
160329: ",",
160330- 10723
160331- ],
160332- "77yM6YI=": [
160333- 32069,
160334: ",",
160335- 25998
160336- ],
160337- "77yM6Zk=": [
160338- 32070,
160339: ",",
160340- 8209
160341- ],
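For what it's worth, the keys can be base64-decoded to check whether these entries actually collide. A quick inspection snippet (mine, not from the repo) shows each key is a distinct byte sequence even when the displayed text field looks identical:

```python
import base64

# keys taken from the grep output above
for key in ['77yM', 'gO+8jA==', '77yM5A==']:
    raw = base64.b64decode(key)
    print(key, raw, raw.decode('utf-8', errors='replace'))
# 77yM      b'\xef\xbc\x8c'      ，    (the full-width comma itself)
# gO+8jA==  b'\x80\xef\xbc\x8c'  �，   (a stray continuation byte + comma)
# 77yM5A==  b'\xef\xbc\x8c\xe4'  ，�   (comma + the first byte of a CJK character)
```

So the entries are different byte-level pieces rather than duplicates, though whether such partial-UTF-8 pieces are useful is still a fair question.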
Training runs fine on 20 GB of data, but on 100 GB it gets stuck at some step. bytepiece==0.6.3
Below is the stack trace of one of the threads. I can't tell much from it; asking GPT suggests a multiprocessing problem:
#0 0x00007f168f6207a4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1 0x00007f168f620898 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007f1683848699 in semlock_acquire ()
from /opt/rh/rh-python38/root/usr/lib64/python3.8/lib-dynload/_multiprocessing.cpython-38-x86_64-linux-gnu.so
#3 0x00007f168f7ed4e6 in PyCFunction_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#4 0x00007f168f7ac932 in _PyObject_MakeTpCall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#5 0x00007f168f862c5c in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#6 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#7 0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#8 0x00007f168f860081 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#9 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#10 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#11 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#12 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#13 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#14 0x00007f168f8507cb in method_vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#15 0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#16 0x00007f168f8ad6d1 in t_bootstrap () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#17 0x00007f168f86bbc4 in pythread_wrapper () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#18 0x00007f168f6174e2 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f168f3f25b3 in clone () from /lib64/libc.so.6
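When a run hangs like this, one generic way to get the Python-level stack of every thread (rather than only the C backtrace above) is the stdlib faulthandler. This is a general diagnostic suggestion of mine, not a bytepiece feature:

```python
import faulthandler
import signal

# dump the Python stack of every thread when the process receives SIGUSR1
faulthandler.register(signal.SIGUSR1, all_threads=True)

# start training as usual; if it hangs, run `kill -USR1 <pid>` from another
# shell and the tracebacks are written to stderr without killing the process
```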
After converting the model file with the code below, a bytepiece model can be loaded directly through the huggingface tokenizers interface. The converted model still supports features such as byte_fallback.
The code below is only a starting point and has not been fully tested; I hope the author (苏神) can give it a try.
# convert.py
from itertools import product
from pathlib import Path
from dataclasses import dataclass
import math
import json


@dataclass(slots=True)
class WordItem:
    word: str
    log_probability: float

    def aslist(self):
        return [self.word, self.log_probability]


def convert_bytepiece_model_to_hugging_face(bytepiece_model_file: str | Path, output_file: str | Path):
    bytepiece_model_file = Path(bytepiece_model_file)
    if not bytepiece_model_file.exists():
        raise FileNotFoundError(f'{bytepiece_model_file} not found.')
    output_file = Path(output_file)
    if output_file.exists():
        raise FileExistsError(f'{output_file} already exists.')
    # 256 byte tokens <0x00>..<0xFF> at the front of the vocab, for byte_fallback
    initial_word_items = [WordItem(f'<0x{a}{b}>', 0.0) for a, b in product('0123456789ABCDEF', '0123456789ABCDEF')]
    bytepiece_model_data = json.load(bytepiece_model_file.open())
    # log of the total piece count, so counts can be turned into log-probabilities
    log_sum = math.log(sum(item[2] for item in bytepiece_model_data.values()))
    vocab_word_items: list[WordItem] = []
    for item in bytepiece_model_data.values():
        _, word, frequence = item
        if not word.strip():
            continue
        word_item = WordItem(word, math.log(frequence) - log_sum)
        vocab_word_items.append(word_item)
    word_items = initial_word_items + vocab_word_items
    hugging_face_dict = {
        'version': '1.0',
        'pre_tokenizer': {
            'type': 'CharDelimiterSplit',
            'delimiter': '\x00',
        },
        'model': {
            'type': 'Unigram',
            'unk_id': 0,
            'vocab': [item.aslist() for item in word_items],
            'byte_fallback': True,
        },
    }
    json.dump(hugging_face_dict, output_file.open('w'), indent=2, ensure_ascii=False)


if __name__ == '__main__':
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument('bytepiece_model_file', type=str)
    parser.add_argument('output_file', type=str)
    args = parser.parse_args()
    convert_bytepiece_model_to_hugging_face(args.bytepiece_model_file, args.output_file)
python convert.py bytepiece.model bytepiece.json
from tokenizers import Tokenizer

bytepiece_tokenizer = Tokenizer.from_file('bytepiece_plus_240k.json')
sentences = [
    '中外科学名著',
    '提高产品质量',
    '鞭炮声响彻夜空',
    '这事的确定不下来',
    '邓颖超生前使用过的物品',
    '쯈',
]
for sentence in sentences:
    print(bytepiece_tokenizer.encode(sentence).tokens)
for sentence in sentences:
    print(bytepiece_tokenizer.encode(sentence).offsets)
# OUTPUT:
# ['中外', '科学', '名著']
# ['提高', '产品', '质量']
# ['鞭炮', '声', '响彻', '夜空']
# ['这事', '的确', '定', '不', '下来']
# ['邓', '颖', '超', '生前', '使用', '过的', '物品']
# ['<0xEC>', '<0xAF>', '<0x88>']
# [(0, 2), (2, 4), (4, 6)]
# [(0, 2), (2, 4), (4, 6)]
# [(0, 2), (2, 3), (3, 5), (5, 7)]
# [(0, 2), (2, 4), (4, 5), (5, 6), (6, 8)]
# [(0, 1), (1, 2), (2, 3), (3, 5), (5, 7), (7, 9), (9, 11)]
# [(0, 1), (0, 1), (0, 1)]
I spent a whole day installing today and ran into a few problems. Sharing them here for anyone who hits the same issues later.
1. AHOCORASICK_BYTES=1 pip install git+https://github.com/WojciechMula/pyahocorasick.git did not install successfully for me in a virtual environment. Instead: git clone https://github.com/WojciechMula/pyahocorasick.git, set the build_as_bytes variable directly to True (this ensures the AHOCORASICK_BYTES build), then python setup.py install. Of course, a C++ library is also needed; just download and install it as the error message instructs. That part is easy, just a bit large.
2. pieces = json.load(open(pieces)) failed until I specified the encoding explicitly: pieces = json.load(open(pieces, encoding="utf-8"))
3. Also, for example, treating \n as <n>.
Thanks!
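The \n-to-<n> idea above can be sketched as a reversible pre/post-processing pair (the function names are my own, hypothetical; note that a literal "<n>" would then need to be kept out of, or escaped in, the raw corpus):

```python
def encode_newlines(text: str) -> str:
    # make newlines visible to the tokenizer as a dedicated marker
    return text.replace('\n', '<n>')

def decode_newlines(text: str) -> str:
    # invert the mapping after detokenization
    return text.replace('<n>', '\n')

print(encode_newlines('第一行\n第二行'))  # 第一行<n>第二行
```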