
[UNK] token in v2 models · albert · open · 5 comments

kkkppp commented on May 19, 2024
[UNK] token in v2 models


Comments (5)

aarmstrong78 commented on May 19, 2024

If you only pass the .vocab file, the init function falls back to the Basic and WordPiece tokenizers, which use [UNK]. You need to pass the SentencePiece (spm) model file as well:

tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab", spm_model_file="/content/30k-clean.model")
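
For context, a minimal end-to-end sketch with both files passed (the import assumes the repo sits on the Python path as albert; adjust it to match your checkout, e.g. from ALBERT import tokenization as used later in this thread):

from albert import tokenization

# With the SentencePiece model supplied, the tokenizer stays on the SentencePiece
# path; its unknown piece is "<unk>", which the v2 vocab actually contains, so
# convert_tokens_to_ids never has to look up the missing "[UNK]" entry.
tokenizer = tokenization.FullTokenizer(
    "/content/30k-clean.vocab",
    spm_model_file="/content/30k-clean.model")

tokens = tokenizer.tokenize("Hello, my dog is cute")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)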


tvinith commented on May 19, 2024

Same here, running ALBERT from TF-Hub with my own data set; I'm getting the following error:

/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
623 segment_ids.append(1)
624
--> 625 input_ids = tokenizer.convert_tokens_to_ids(tokens)
626
627 # The mask has 1 for real tokens and 0 for padding tokens. Only real

/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
266 printable_text(token)) for token in tokens]
267 else:
--> 268 return convert_by_vocab(self.vocab, tokens)
269
270 def convert_ids_to_tokens(self, ids):

/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
208 output = []
209 for item in items:
--> 210 output.append(vocab[item])
211 return output
212

KeyError: '[UNK]'


aarmstrong78 commented on May 19, 2024

I had a similar issue; my problem was that I wasn't setting the spm_model_file flag correctly, so the tokenizer fell back to the Basic and WordPiece tokenizers, which use [UNK].
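
A quick way to confirm that this fallback is the problem is to check which unknown token the v2 vocab file actually contains (a sketch, assuming the usual SentencePiece piece-then-score tab-separated layout of the .vocab file):

# The v2 .vocab files ship SentencePiece's "<unk>" rather than the WordPiece "[UNK]",
# so any code path that emits "[UNK]" will raise a KeyError on lookup.
with open("/content/30k-clean.vocab", encoding="utf-8") as f:
    pieces = {line.split("\t")[0] for line in f}

print("[UNK] in vocab:", "[UNK]" in pieces)
print("<unk> in vocab:", "<unk>" in pieces)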


JKP0 commented on May 19, 2024

I have the same issue. Does anyone have a solution? Please help.

cd ALBERT
/content/ALBERT

from ALBERT import tokenization
from ALBERT import tokenization_test
tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
tc=tokenizer.tokenize("Hello, my dog is cute")
ec=tokenizer.convert_tokens_to_ids(tc)

Logs:


KeyError Traceback (most recent call last)
in ()
1 tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
2 tc=tokenizer.tokenize("Hello, my dog is cute")
----> 3 ec=tokenizer.convert_tokens_to_ids(tc)

1 frames
/content/ALBERT/tokenization.py in convert_by_vocab(vocab, items)
209 output = []
210 for item in items:
--> 211 output.append(vocab[item])
212 return output
213

KeyError: '[UNK]'


JKP0 commented on May 19, 2024

Thanks! @aarmstrong78

