Comments (2)
Hi @foreveronehundred! The code does not break when running very large segments, but the models truncate the input if it goes above 512 tokens. For models like wmt22-cometkiwi-da, the source and translation share a single input, which means that their combined token count should not exceed 512.
Still, 512 tokens is a long input. It's more than enough to input several sentences together and evaluate entire paragraphs, though maybe not enough for an entire two-page document.
A quick way to test it is to tokenize both inputs and get their length:
from transformers import XLMRobertaTokenizer

source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize each sentence on its own, without special tokens
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)
    # Jointly encode the pair (with special tokens) and count tokens
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)
    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("=" * 30)
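If you want to scan a whole dataset for pairs that would get truncated, something like the sketch below might help. It is only an illustration: `find_truncated_pairs` and `count_tokens` are hypothetical names, not part of COMET, and the 4 extra special tokens assume XLM-R style pair encoding (`<s> A </s></s> B </s>`). The demo uses a whitespace counter so it runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

MAX_TOKENS = 512          # input limit discussed above
PAIR_SPECIAL_TOKENS = 4   # assumption: XLM-R style pair encoding, <s> A </s></s> B </s>


def find_truncated_pairs(
    sources: Sequence[str],
    translations: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
    special_tokens: int = PAIR_SPECIAL_TOKENS,
) -> List[Tuple[int, int]]:
    """Return (index, joint_length) for every pair longer than max_tokens."""
    flagged = []
    for i, (src, mt) in enumerate(zip(sources, translations)):
        # Joint length = source tokens + translation tokens + special tokens
        total = count_tokens(src) + count_tokens(mt) + special_tokens
        if total > max_tokens:
            flagged.append((i, total))
    return flagged


# Toy demo: whitespace "tokenizer" and a tiny limit so a pair gets flagged
sources = ["a b c", "a"]
translations = ["d e", "b"]
print(find_truncated_pairs(sources, translations, lambda s: len(s.split()), max_tokens=8))
```

With the real tokenizer from the snippet above, `count_tokens` could be `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`. Any pair this returns is one where part of the input would be cut off before scoring.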
Thanks for the reply. I think the length is enough for general cases.
By the way, I want to know the token length of the training data. Could you give some statistics (Mean, STD, etc.)?