
[UNK] token in v2 models · albert · open · 5 comments

kkkppp commented on May 19, 2024
[UNK] token in v2 models


Comments (5)

aarmstrong78 commented on May 19, 2024

If you only pass the .vocab file, the init function falls back to the Basic and WordPiece tokenizers, which use [UNK]. You need to pass the SentencePiece (spm) model file as well:

tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab", spm_model_file="/content/30k-clean.model")
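
For context, a minimal end-to-end sketch with both files passed (the import assumes the repo sits on the Python path as albert; adjust it to match your checkout, e.g. from ALBERT import tokenization as used later in this thread):

from albert import tokenization

# With the SentencePiece model supplied, the tokenizer stays on the SentencePiece
# path; its unknown piece is "<unk>", which the v2 vocab actually contains, so
# convert_tokens_to_ids never has to look up the missing "[UNK]" entry.
tokenizer = tokenization.FullTokenizer(
    "/content/30k-clean.vocab",
    spm_model_file="/content/30k-clean.model")

tokens = tokenizer.tokenize("Hello, my dog is cute")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)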


tvinith commented on May 19, 2024

Same here, running ALBERT from TF-Hub with my own data set; I'm getting the following error:

/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
623 segment_ids.append(1)
624
--> 625 input_ids = tokenizer.convert_tokens_to_ids(tokens)
626
627 # The mask has 1 for real tokens and 0 for padding tokens. Only real

/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
266 printable_text(token)) for token in tokens]
267 else:
--> 268 return convert_by_vocab(self.vocab, tokens)
269
270 def convert_ids_to_tokens(self, ids):

/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
208 output = []
209 for item in items:
--> 210 output.append(vocab[item])
211 return output
212

KeyError: '[UNK]'


aarmstrong78 commented on May 19, 2024

I had a similar issue; my problem was that I wasn't setting the spm_model_file flag correctly, so the tokenizer fell back to the Basic and WordPiece tokenizers, which use [UNK].
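
A quick way to confirm that this fallback is the problem is to check which unknown token the v2 vocab file actually contains (a sketch, assuming the usual SentencePiece piece-then-score tab-separated layout of the .vocab file):

# The v2 .vocab files ship SentencePiece's "<unk>" rather than the WordPiece "[UNK]",
# so any code path that emits "[UNK]" will raise a KeyError on lookup.
with open("/content/30k-clean.vocab", encoding="utf-8") as f:
    pieces = {line.split("\t")[0] for line in f}

print("[UNK] in vocab:", "[UNK]" in pieces)
print("<unk> in vocab:", "<unk>" in pieces)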


JKP0 commented on May 19, 2024

I have the same issue. Does anyone have a solution? Please help.

cd ALBERT
/content/ALBERT

from ALBERT import tokenization
from ALBERT import tokenization_test
tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
tc=tokenizer.tokenize("Hello, my dog is cute")
ec=tokenizer.convert_tokens_to_ids(tc)

Logs:


KeyError Traceback (most recent call last)
in ()
1 tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
2 tc=tokenizer.tokenize("Hello, my dog is cute")
----> 3 ec=tokenizer.convert_tokens_to_ids(tc)

1 frames
/content/ALBERT/tokenization.py in convert_by_vocab(vocab, items)
209 output = []
210 for item in items:
--> 211 output.append(vocab[item])
212 return output
213

KeyError: '[UNK]'


JKP0 commented on May 19, 2024

Thanks! @aarmstrong78

