For translation quality estimation of COMET, I think there is no limitation of the tex


[INPUT] Text Length of Input (source, reference, and hypothesis)

foreveronehundred commented on September 22, 2024

[INPUT] Text Length of Input (source, reference, and hypothesis)


Comments (2)

ricardorei commented on September 22, 2024

Hi @foreveronehundred! The code does not break when running very large segments, but the models truncate the input if it goes above 512 tokens. For models like wmt22-cometkiwi-da, the source and translation share a single input, which means the combined number of tokens from source and translation should not exceed 512.

Still, 512 tokens is a long input. It's more than enough to input several sentences together and evaluate entire paragraphs, though maybe not enough for an entire two-page document.

A quick way to test it is to tokenize both inputs and get their length:

from transformers import XLMRobertaTokenizer
source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]
# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize sentences
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)

    # Jointly encode and count tokens (truncation=True caps the count at the tokenizer's maximum length)
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)

    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("="*30)
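Since truncation hides how far over the limit a pair is, it can also help to flag over-length pairs before scoring. The following is only a minimal sketch, not part of COMET: `find_truncated_pairs`, `MAX_TOKENS`, and `PAIR_OVERHEAD` are hypothetical names, and the overhead of 4 special tokens assumes the Hugging Face XLM-R pair layout (`<s> source </s></s> translation </s>`). COMET's own input construction may add a different number of special tokens.

```python
from typing import Callable, List, Tuple

MAX_TOKENS = 512    # truncation limit mentioned above
PAIR_OVERHEAD = 4   # assumption: <s> src </s></s> mt </s> in XLM-R pair encoding


def find_truncated_pairs(
    sources: List[str],
    translations: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
    overhead: int = PAIR_OVERHEAD,
) -> List[Tuple[int, int]]:
    """Return (index, total_tokens) for each pair whose joint length exceeds max_tokens."""
    flagged = []
    for i, (src, mt) in enumerate(zip(sources, translations)):
        # Source and translation share one input, so their lengths add up.
        total = count_tokens(src) + count_tokens(mt) + overhead
        if total > max_tokens:
            flagged.append((i, total))
    return flagged
```

With the tokenizer from the snippet above, `count_tokens` could be `lambda text: len(tokenizer.encode(text, add_special_tokens=False))`. Any pair returned here would be truncated, so it may be worth splitting it into smaller aligned chunks before scoring.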


foreveronehundred commented on September 22, 2024

Thanks for the reply. I think the length is enough for general cases.
By the way, I would like to know the token length of the training data. Could you give some statistics (mean, standard deviation, etc.)?

