Comments (2)
I did find examples where the offsets and text differ beyond leading whitespace, and I think it's due to weird characters in the input:
{
  "src": "高龄老人坐着面包车从南方到东北看雪??",
  "mt": "The elderly are watching the snow from the south to the northeast in a van? ?",
  "ref": "An elderly goes to watch the snow in the northeast from the south in a minibus? ?",
  "COMET": 0.8957364559173584,
  "errors": [
    {
      "text": "ly are watching",
      "confidence": 0.39867228269577026,
      "severity": "minor",
      "start": 9,
      "end": 24
    },
    {
      "text": "to",
      "confidence": 0.3982390761375427,
      "severity": "minor",
      "start": 48,
      "end": 51
    },
    {
      "text": "van??",
      "confidence": 0.4210003614425659,
      "severity": "minor",
      "start": 70,
      "end": 77
    }
  ]
}
The second question mark in the mt text is a non-standard character. In the span text it gets normalized to a regular question mark, and the space before it is dropped.
{
  "src": "There's a circularity to it...",
  "mt": "Darin besteht eine Zirkularität …",
  "ref": "Das ist ein richtiger Kreislauf...",
  "COMET": 0.9912000298500061,
  "errors": [
    {
      "text": "Darin besteht eine Zirkularität...",
      "confidence": 0.5221754312515259,
      "severity": "minor",
      "start": 0,
      "end": 33
    }
  ]
}
In the mt text, there's a space followed by an ellipsis encoded as a single character (…), but the span text removes the whitespace and uses three separate period characters.
I assume the offsets are ok to use. I only ran into this because I was verifying that COMET doesn't change the source text when it makes span predictions. This is important when you evaluate the spans so you can directly compare the predicted spans to the MQM spans. If the text is edited, the mapping might not be correct.
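For anyone who wants to run the same check, here is a minimal sketch of how the comparison could be done. The function names are hypothetical and not part of COMET, and the NFKC normalization is only an assumption about which character changes to tolerate (it happens to map the single-character ellipsis to three periods); the actual tokenizer may normalize differently.

```python
import unicodedata


def normalize(text):
    """NFKC-normalize and drop all whitespace (an assumed, lenient comparison)."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def classify_span(mt, error):
    """Compare mt[start:end] with the span's "text" field."""
    sliced = mt[error["start"]:error["end"]]
    if sliced == error["text"]:
        return "exact"
    if sliced.strip() == error["text"].strip():
        return "whitespace"  # differs only in leading/trailing whitespace
    if normalize(sliced) == normalize(error["text"]):
        return "normalized"  # differs only by Unicode normalization or spacing
    return "mismatch"


# The second example above: the mt ends with a space and U+2026 (ellipsis).
mt = "Darin besteht eine Zirkularit\u00e4t \u2026"
error = {"text": "Darin besteht eine Zirkularit\u00e4t...", "start": 0, "end": 33}
print(classify_span(mt, error))
```

Anything reported as "mismatch" would be worth inspecting by hand, since there the offsets and the text field disagree in a way this sketch cannot explain.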
Hi @danieldeutsch, you are right! The offsets are correct and the "text" field is more informative. I get the text field by detokenizing the token IDs belonging to a span. However, if you detokenize only part of the original input, you might get slightly different output. Whitespace is a good example, as it is sometimes encoded as part of a token, using the marker "_".
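To illustrate the whitespace effect, here is a toy sketch. It is not COMET's tokenizer or decoding code; it only mimics the SentencePiece-style convention of marking a preceding space with "▁" (U+2581), and the pieces and helper names are made up for the example.

```python
SPACE_MARK = "\u2581"  # SentencePiece-style marker for a preceding space


def raw_text(pieces):
    """Turn pieces back into text, keeping every marked space."""
    return "".join(pieces).replace(SPACE_MARK, " ")


def decode(pieces):
    """Decode pieces as a standalone string: the leading space is dropped."""
    return raw_text(pieces).lstrip(" ")


# Hypothetical pieces for the fragment "south to the".
pieces = [SPACE_MARK + "south", SPACE_MARK + "to", SPACE_MARK + "the"]
span = pieces[1:2]

in_context = raw_text(span)  # what the character offsets cover
standalone = decode(span)    # what decoding the span alone returns
print(repr(in_context), repr(standalone))
```

Under these assumptions, the offsets cover the marked space while the decoded span text does not, which would be consistent with the "to" span in the first example.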