This is a freaking amazing overview & super useful! Many thanks!

Suggestion: Add basic problems like sentence segmentation and tokenization about nlp-progress HOT 5 CLOSED

sebastianruder commented on April 28, 2024 2

Suggestion: Add basic problems like sentence segmentation and tokenization

from nlp-progress.

Comments (5)

stefan-it commented on April 28, 2024 2

I implemented a sentence boundary detection system using several neural network architectures: deep-eos. (Sadly, my paper was rejected at a conference, so I decided to open-source my implementation.)

Corpora are a bit hard to find. Some papers I read used Europarl, but the problem is that Europarl is not 100% sentence-segmented.

Another resource could be the Universal Dependencies datasets, because a good sentence boundary detection system should also work for non-English languages.
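
To make the Universal Dependencies idea a bit more concrete, here is a minimal sketch of how gold sentence boundaries could be pulled out of a CoNLL-U file. It assumes the file carries `# text = ...` comment lines for each sentence, as UD v2 treebanks generally do. The sample data and the helper names `gold_sentences` and `boundary_offsets` are made up for illustration and are not part of deep-eos or any UD tooling.

```python
from typing import List

# Toy CoNLL-U fragment (hypothetical data, annotation columns left as "_").
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Dr. Smith arrived.\n"
    "1\tDr.\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tSmith\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tarrived\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = He was late.\n"
    "1\tHe\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\twas\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tlate\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
)


def gold_sentences(conllu: str) -> List[str]:
    """Collect the raw sentence strings from the `# text = ...` comment lines."""
    prefix = "# text = "
    return [
        line[len(prefix):].strip()
        for line in conllu.splitlines()
        if line.startswith(prefix)
    ]


def boundary_offsets(sentences: List[str], sep: str = " ") -> List[int]:
    """Character offset at which each sentence ends once joined with `sep`."""
    offsets, pos = [], 0
    for sentence in sentences:
        pos += len(sentence)
        offsets.append(pos)
        pos += len(sep)
    return offsets


sentences = gold_sentences(SAMPLE)
print(sentences)                    # ['Dr. Smith arrived.', 'He was late.']
print(boundary_offsets(sentences))  # [18, 31]
```

Joining the sentences with a single space is a simplification; real treebanks may record the original spacing differently, so the offsets here are only valid for text rebuilt the same way.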

I'm currently working on sentence boundary detection using (universal) language models :)

sebastianruder commented on April 28, 2024 1

Thanks so much!
That's a great idea! I think it'd be great to make people aware that these tasks are actually not solved.

Do you know what datasets are used for sentence segmentation and tokenization, by any chance? I guess we could look at what spaCy uses for evaluation.

sebastianruder commented on April 28, 2024 1

Cool. Feel free to add them to the wish list for now then. I'll look up results once I have some time or maybe someone gets around to it first.

pwichmann commented on April 28, 2024

It'd be great to see how well basic tasks are actually solved and what the conditions are under which they still perform very poorly. I'd hope that some people then address these special cases with better techniques.
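
One possible way to quantify "how well solved" is to score predicted sentence boundaries against gold ones with precision, recall, and F1. The sketch below assumes boundaries are represented as sets of character offsets; the function name `boundary_prf` and the example offsets are hypothetical, and this is not the metric of any particular benchmark.

```python
from typing import Set, Tuple


def boundary_prf(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sentence-end character offsets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Gold: two sentence ends. Prediction: the same two plus one spurious split.
precision, recall, f1 = boundary_prf({18, 31}, {3, 18, 31})
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 1.0 0.8
```

A score like this hides which cases fail, so it would probably need to be reported alongside a breakdown by condition (abbreviations, ellipses, quotes, and so on).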

I was surprised by this simple sentence that could not be sentence-tokenized correctly. spaCy made two sentences out of it.

pwichmann commented on April 28, 2024

Same with inconsistent splitting when I use "...". This was surprising.
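
To illustrate why "..." is hard, here is a toy rule-based splitter. It is explicitly not how spaCy segments sentences; it is a deliberately naive rule (split after `.`, `!` or `?` when followed by whitespace and an uppercase letter), and `naive_split` is a made-up name. It shows how the same ellipsis gets split or not depending only on the case of the next word, and how an abbreviation triggers a false split.

```python
import re
from typing import List

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text: str) -> List[str]:
    """Naive sentence splitter, for illustration only."""
    return _BOUNDARY.split(text)


print(naive_split("Wait... Then it happened."))
# ['Wait...', 'Then it happened.']
print(naive_split("Wait... then it happened."))
# ['Wait... then it happened.']
print(naive_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```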
