Question I am working with a Named Entity Recognition (NER) datase

[Question]: Subtoken Labeling? (flair issue, open)

quantarb commented on May 26, 2024

[Question]: Subtoken Labeling?

from flair.

Comments (3)

alanakbik commented on May 26, 2024

Hello @quantarb, we've had such issues before. In this case, I first use a regular tokenizer and then additionally split all tokens at the offset positions to get the final tokenization. There is no helper function in Flair for this, so you would need to write your own tokenization code.
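
For illustration, the two-step approach described above (regular tokenization first, then an extra split at the annotation offsets) might look roughly like the sketch below. It is plain Python and not part of Flair: the function name `split_on_offsets`, the regex used as the "regular tokenizer", and the example text and entity span are all assumptions made for this example.

```python
import re


def split_on_offsets(text, boundaries):
    """Tokenize with a simple regex, then split tokens at character offsets.

    Returns a list of (start, end, token_text) tuples.
    `boundaries` is any iterable of character offsets into `text`.
    """
    cuts = sorted(set(boundaries))
    tokens = []
    # Step 1: a stand-in "regular tokenizer": runs of word characters,
    # or single non-space symbols.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.span()
        # Step 2: cut this token at every boundary strictly inside it.
        inner = [c for c in cuts if start < c < end]
        edges = [start, *inner, end]
        for left, right in zip(edges, edges[1:]):
            tokens.append((left, right, text[left:right]))
    return tokens


text = "Pay USD100 by 2024-05-26"
# Hypothetical annotation: an entity covering characters 4..7 ("USD").
entity_spans = [(4, 7)]
boundaries = [offset for span in entity_spans for offset in span]
tokens = split_on_offsets(text, boundaries)
print([token_text for _, _, token_text in tokens])
```

The resulting token strings could then be handed to `Sentence` as a pre-tokenized list, in the same way the snippets later in this thread construct a `Sentence` from tokens; the character offsets are kept alongside so labels can be mapped back to the original text.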

quantarb commented on May 26, 2024

Hi @alanakbik, thank you for your quick response. I tried to split up the tokens based on the offset positions, but I'm having problems reconstructing my original flair sentence from tokens. What is the best way to reconstruct a flair sentence from tokens?

I tried several different approaches but my new_sentence never matches the original sentence.

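from flair.data import Sentence, Token

# Attempt 1: rebuild the sentence from each token's text
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(token.text) for token in old_sentence]
new_sentence = Sentence(tokens)

# Attempt 2: rebuild the sentence from slices of the original text
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(text[token.start_position:token.end_position]) for token in old_sentence]
new_sentence = Sentence(tokens)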

MostHumble commented on May 26, 2024

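It might have something to do with the fact that some (all?) tokenizers are lossy, i.e. they do not keep everything from the original string, such as the exact whitespace between tokens. You can try a different tokenizer: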

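from flair.data import Sentence

# `your_tokenizer`, `raw`, and `tagger` are placeholders for your own tokenizer, input text, and loaded tagger
tokenized = your_tokenizer.tokenize(raw)  # expected to return a list of token strings
# print(tokenized)
sentence = Sentence(tokenized)
tagger.predict(sentence)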
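
To make the "lossy" point concrete, here is a small plain-Python sketch (no Flair involved). It assumes each token keeps its start and end character offsets, like the `start_position` and `end_position` attributes used earlier in this thread. The helper name `rebuild_text`, the regex tokenizer, and the sample string are assumptions for illustration only.

```python
import re


def rebuild_text(tokens):
    """Rebuild a string from (start, end, token_text) tuples.

    Gaps between tokens are padded with spaces, which assumes the
    original gaps were plain spaces (not tabs or newlines).
    """
    pieces = []
    cursor = 0
    for start, end, token_text in tokens:
        pieces.append(" " * (start - cursor))
        pieces.append(token_text)
        cursor = end
    return "".join(pieces)


original = "  Hello,  world!"
# Stand-in tokenizer that records character offsets for every token.
tokens = [
    (m.start(), m.end(), m.group())
    for m in re.finditer(r"\w+|[^\w\s]", original)
]

# Joining token texts with single spaces loses the original spacing.
naive = " ".join(token_text for _, _, token_text in tokens)
print(repr(naive))

# Using the offsets recovers the text up to the end of the last token.
print(repr(rebuild_text(tokens)))
```

Here the naive join differs from the original string, while the offset-based rebuild matches it. Trailing whitespace after the last token is still not recoverable from the tokens alone.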
