Comments (5)
If you pass only the .vocab file, the init function falls back on the Basic and Wordpiece tokenizers, which emit [UNK]. That token is not a key in the vocab you loaded, hence the KeyError. You need to pass the spm model file as well:
tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab", spm_model_file="/content/30k-clean.model")
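To see why the fallback ends in `KeyError: '[UNK]'`, here is a minimal, self-contained sketch. It does not import the ALBERT code: `convert_by_vocab` only mimics the lookup shown in the tracebacks in this thread, and `toy_vocab` is a made-up stand-in that assumes the SentencePiece vocab spells its unknown piece `<unk>` rather than `[UNK]`.

```python
def convert_by_vocab(vocab, items):
    """Look up every item in vocab, as the function in the traceback does."""
    output = []
    for item in items:
        output.append(vocab[item])  # raises KeyError if item is not a key
    return output


def try_convert(vocab, tokens):
    """Return the ids, or a short message if the lookup fails."""
    try:
        return convert_by_vocab(vocab, tokens)
    except KeyError as err:
        return "KeyError: {}".format(err)


# Hypothetical toy vocab (assumption: unknown piece is "<unk>", no "[UNK]" entry).
toy_vocab = {"<pad>": 0, "<unk>": 1, "[CLS]": 2, "[SEP]": 3, "\u2581hello": 4}

# A Wordpiece-style fallback emits "[UNK]" for text it cannot match.
result_fallback = try_convert(toy_vocab, ["[UNK]"])

# A piece that really is in the vocab converts without trouble.
result_spm = try_convert(toy_vocab, ["\u2581hello"])

print(result_fallback)
print(result_spm)
```

With the spm model passed in, the tokens come from SentencePiece itself, so they should match the keys in the vocab and the lookup goes through.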
from albert.
Same here. I am running ALBERT from TF Hub on my own dataset and get this error:
`/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
623 segment_ids.append(1)
624
--> 625 input_ids = tokenizer.convert_tokens_to_ids(tokens)
626
627 # The mask has 1 for real tokens and 0 for padding tokens. Only real
/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
266 printable_text(token)) for token in tokens]
267 else:
--> 268 return convert_by_vocab(self.vocab, tokens)
269
270 def convert_ids_to_tokens(self, ids):
/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
208 output = []
209 for item in items:
--> 210 output.append(vocab[item])
211 return output
212
KeyError: '[UNK]'`
I had a similar issue. My problem was that I wasn't setting the spm_model_file flag correctly, so the tokeniser was falling back to the Basic and Wordpiece tokenisers, which use [UNK].
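One way to catch this earlier is to check the value before building the tokenizer. The helper below is a hypothetical sketch, not part of the ALBERT repo: `require_spm_model` is a name made up for illustration. The empty-value check covers the case described in this thread, where the tokenizer quietly falls back; the file-exists check is an extra precaution I am assuming is useful.

```python
import os


def require_spm_model(spm_model_file):
    """Hypothetical guard: fail loudly instead of silently falling back."""
    if not spm_model_file:
        # An empty or None value is the case that triggers the fallback.
        raise ValueError(
            "spm_model_file is not set; the tokenizer would fall back to "
            "the Basic and Wordpiece tokenizers"
        )
    if not os.path.isfile(spm_model_file):
        # Assumed precaution: a mistyped path is easier to debug here.
        raise FileNotFoundError(
            "spm model file not found: {}".format(spm_model_file)
        )
    return spm_model_file
```

You would call it on the path first and pass the returned value as `spm_model_file` when constructing the tokenizer.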
I have the same issue. Does anyone have a solution? Please help.
cd ALBERT
/content/ALBERT
from ALBERT import tokenization
from ALBERT import tokenization_test
tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
tc=tokenizer.tokenize("Hello, my dog is cute")
ec=tokenizer.convert_tokens_to_ids(tc)
logs
KeyError Traceback (most recent call last)
in ()
1 tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
2 tc=tokenizer.tokenize("Hello, my dog is cute")
----> 3 ec=tokenizer.convert_tokens_to_ids(tc)
1 frames
/content/ALBERT/tokenization.py in convert_by_vocab(vocab, items)
209 output = []
210 for item in items:
--> 211 output.append(vocab[item])
212 return output
213
KeyError: '[UNK]'
Thanks! @aarmstrong78