Comments (5)
The failure is due to the missing file sentencepiece.bpe.model. The error is easy to reproduce with AutoTokenizer.from_pretrained using the slow tokenizer:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('osiria/minilm-l12-h384-italian-cased', use_fast=False)
Returns the error:
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
With the fast tokenizer, loading does not raise an error. I'm assuming the fast tokenizer downloads the sentencepiece model automatically.
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('osiria/minilm-l12-h384-italian-cased', use_fast=True)
Eland needs to use the slow tokenizer. One option is to take sentencepiece.bpe.model from the xlm-roberta-base repo and add it to 'osiria/minilm-l12-h384-italian-cased'. To do this, first run git clone https://huggingface.co/osiria/minilm-l12-h384-italian-cased (you will need to install Git LFS), then add sentencepiece.bpe.model to the cloned repo.
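The copy step can be sketched in Python as follows. This is a minimal sketch, assuming both repositories have already been cloned locally; the helper name `add_sentencepiece_model` and the directory names in the usage note are hypothetical, not part of Eland or transformers.

```python
import shutil
from pathlib import Path

SPM_FILENAME = "sentencepiece.bpe.model"


def add_sentencepiece_model(source_repo: str, target_repo: str) -> Path:
    """Copy sentencepiece.bpe.model from one local clone into another.

    Both arguments are paths to local git clones (hypothetical locations).
    Returns the path of the copied file inside the target clone.
    """
    source = Path(source_repo) / SPM_FILENAME
    if not source.is_file():
        # A missing or stub file usually means Git LFS was not installed
        # when the source repository was cloned.
        raise FileNotFoundError(f"{source} not found")

    target_dir = Path(target_repo)
    if not target_dir.is_dir():
        raise NotADirectoryError(f"{target_dir} is not a directory")

    destination = target_dir / SPM_FILENAME
    shutil.copy2(source, destination)
    return destination
```

For example, `add_sentencepiece_model("xlm-roberta-base", "minilm-l12-h384-italian-cased")` would copy the file between two clones sitting in the current directory. The local directory can then be passed to `AutoTokenizer.from_pretrained(..., use_fast=False)` in place of the hub name.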
@davidkyle I followed your instructions, but now I get another error:
torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Is there another way to make it work?
@davidkyle this is the complete error:
Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
index out of range in self
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 126, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 830, in forward
embedding_output = self.embeddings(
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 98, in forward
output_states = self.auto_model(**trans_features, return_dict=False)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 387, in forward
return self._st_model(inputs)[self._output_key]
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 471, in sample_output
return self._model(*inputs)
File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 786, in _create_config
sample_embedding = self._traceable_model.sample_output()
File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 669, in __init__
self._config = self._create_config(es_version)
File "/home/federico/Desktop/work2/eland/eland/cli/eland_import_hub_model.py", line 269, in main
tm = TransformerModel(
File "/home/federico/Desktop/work2/eland/eland/cli/eland_import_hub_model.py", line 334, in <module>
main()
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
IndexError: index out of range in self
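The trace ends in torch.embedding, which raises this IndexError when an input id is negative or not smaller than the number of rows in the embedding table. A minimal diagnostic sketch for that condition follows; the function name `out_of_range_ids` and the sample numbers are hypothetical, and whether a vocabulary mismatch is the actual cause here is not confirmed in this thread.

```python
from typing import Iterable, List


def out_of_range_ids(input_ids: Iterable[int], num_embeddings: int) -> List[int]:
    """Return the ids that an embedding table with num_embeddings rows cannot look up."""
    return [i for i in input_ids if i < 0 or i >= num_embeddings]


# Hypothetical values for illustration only, not taken from the model above.
sample_ids = [0, 17, 4021, 250001, 2]
print(out_of_range_ids(sample_ids, num_embeddings=32000))  # prints [250001]
```

With the real objects, the ids would come from `tokenizer(text)["input_ids"]` and the row count from `model.get_input_embeddings().num_embeddings`. A non-empty result would mean the tokenizer produces ids the model has no embedding for.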
Thanks for the stack trace.
When Eland is used to import the model, it runs a test evaluation to measure the size of the embedding produced by the model. This size is part of the config and is useful when configuring the dims parameter of the dense vector field mapping in Elasticsearch. The error comes from this test evaluation; in this case it is probably because the inputs to forward(...) are not in the expected format.
I should be able to reproduce this.
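To illustrate why the embedding size matters, here is a minimal sketch of deriving a dense_vector mapping from a sample embedding. The helper name `dense_vector_mapping`, the field name, and the 384-element placeholder vector are assumptions for illustration; this is not how Eland builds its config.

```python
from typing import Any, Dict, Sequence


def dense_vector_mapping(field_name: str, sample_embedding: Sequence[float]) -> Dict[str, Any]:
    """Build an Elasticsearch mapping body whose dims matches the sample embedding."""
    dims = len(sample_embedding)
    if dims == 0:
        raise ValueError("sample embedding is empty")
    return {
        "properties": {
            field_name: {"type": "dense_vector", "dims": dims},
        }
    }


# Placeholder vector standing in for a real model output (assumed length).
mapping = dense_vector_mapping("text_embedding", [0.0] * 384)
print(mapping)
```

The returned dictionary has the shape of the mappings section used when creating an index, so a wrong or missing embedding size would surface as a dims mismatch at indexing time.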
I looked into this again and there is another issue that prevents this model being used in Elasticsearch.
Elasticsearch uses the libtorch C++ library to run the NLP models. The models must be converted to the TorchScript format before they can run in libtorch; this conversion is one of the things the Eland script does. Tracing this particular model fails with an error:
RuntimeError: Encountering a dict at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
Here is a Python snippet that reproduces the failed trace operation. I used a local copy of the repository with sentencepiece.bpe.model added as mentioned above:
from transformers import AutoModel, AutoTokenizer
import torch
# load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained('<directory of the downloaded model to which we added sentencepiece>', use_fast=False)
model = AutoModel.from_pretrained('<directory of the downloaded model to which we added sentencepiece>')
# create sample input
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
trace_inputs = (encoded_input["input_ids"], encoded_input["attention_mask"])
# trace model fails
traced = torch.jit.trace(model, example_inputs=trace_inputs)
Closing this issue: if the model cannot be traced, it cannot be supported.