Comments (5)
I implemented a sentence boundary detection system using several neural network architectures: deep-eos. (Sadly, my paper was rejected at a conference, so I decided to open-source my implementation.)
Corpora are a bit hard to find. Some papers I read used Europarl. The problem is that Europarl is not 100% sentence-segmented.
Another resource could be the Universal Dependencies datasets, because a good sentence boundary detection system should also work for non-English languages.
I'm currently working on sentence boundary detection using (universal) language models :)
from nlp-progress.
Thanks so much!
That's a great idea! I think it'd be great to make people aware that these tasks are actually not solved.
Do you know what datasets are used for sentence segmentation and tokenization, by any chance? I guess we could look at what spaCy uses for evaluation.
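For anyone wanting to compare systems on whatever dataset ends up being used, here is a minimal sketch of one possible way to score sentence segmentation: precision, recall, and F1 over sentence-end positions. The function names `boundary_offsets` and `boundary_prf` are hypothetical, and this is not the metric of any particular library or paper; it only assumes gold and predicted sentences cover the same text.

```python
def boundary_offsets(sentences):
    """Return the set of sentence-end offsets, counted in non-whitespace characters.

    Ignoring whitespace makes the offsets comparable even if two segmenters
    strip or keep spaces differently.
    """
    offsets = set()
    pos = 0
    for sentence in sentences:
        pos += len("".join(sentence.split()))
        offsets.add(pos)
    return offsets


def boundary_prf(gold, pred):
    """Precision, recall, and F1 of predicted sentence ends against gold ones.

    Note: the final boundary (end of text) is always shared, so it counts
    as a trivially correct prediction in this simple sketch.
    """
    g, p = boundary_offsets(gold), boundary_offsets(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the prediction wrongly splits after the abbreviation "Dr."
gold = ["Dr. Smith arrived.", "He sat down."]
pred = ["Dr.", "Smith arrived.", "He sat down."]
print(boundary_prf(gold, pred))
```

Here the extra split after "Dr." lowers precision while recall stays perfect, which is the typical failure pattern for abbreviation-heavy text.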
Cool. Feel free to add them to the wish list for now then. I'll look up results once I have some time or maybe someone gets around to it first.
It'd be great to see how well basic tasks are actually solved and what the conditions are under which they still perform very poorly. I'd hope that some people then address these special cases with better techniques.
I was surprised by this simple sentence that could not be sentence-tokenized correctly. spaCy (spacy.io) made 2 sentences out of it.
Same here: I get inconsistent splitting when the text contains "...". This was surprising.
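To illustrate why "..." can be ambiguous, here is a small sketch using a hypothetical naive rule-based splitter (`naive_split`, invented for this example). It is not how spaCy or any other library segments text; it only shows how a simple "punctuation, whitespace, capital letter" rule behaves inconsistently around an ellipsis.

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part for part in parts if part]


# Ellipsis followed by a capitalized word: treated as a sentence end.
print(naive_split("I waited... Then she left."))
# Ellipsis followed by a lowercase word: no split at all.
print(naive_split("I waited... then she left."))
# Trailing-off ellipsis inside one sentence: split anyway, because "I" is uppercase.
print(naive_split("Well... I guess so."))
```

The third case is the problematic one: the rule has no way to tell a pause inside a sentence from a real boundary, so the result depends on whatever word happens to follow the ellipsis.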