Comments (5)
+1 for CoNLL 2003. I think that training data is more important than pre-trained models, because you learn by training your own models, not by using prebuilt ones.
from nltk.
I think this is a good idea. It might also be interesting to add the entity disambiguation layer annotated by the YAGO-AIDA project (Hoffart et al.).
Sorry to disagree, but I think having the data might be interesting for a variety of tasks. Also, topic extraction is moving from text classification to entity linking, so datasets like the ones documented in the first task of http://www.wise2013.org/wise2013challenge.html and https://code.google.com/p/wiki-link/wiki/ExpandedDataset might be relevant. Providing an easy way to package this type of corpus in nltk might be very useful.
+1 for CoNLL 2003. Could really use that for NER work.
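For anyone who already has the data locally, here is a minimal sketch of reading CoNLL 2003-style files in plain Python. It assumes the usual four whitespace-separated columns (word, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The function name `read_conll2003` and the sample lines are made up for illustration; this is not an existing nltk API.

```python
from typing import Iterable, List, Tuple

# One token row: (word, pos_tag, chunk_tag, ner_tag)
Token = Tuple[str, str, str, str]


def read_conll2003(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL 2003-style lines into sentences of 4-tuples.

    Hypothetical helper: assumes four whitespace-separated columns per
    token, blank lines as sentence breaks, and -DOCSTART- document markers.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        current.append((parts[0], parts[1], parts[2], parts[3]))
    # Flush the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in the assumed layout (not copied from a real file).
sample = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll2003(sample)
for sentence in sentences:
    # Print only the (word, NER tag) pairs, which is what NER work needs.
    print([(word, ner) for word, _pos, _chunk, ner in sentence])
```

To read a real file, pass an open file handle instead of `sample`, since any iterable of lines works.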
Closing, as we don't have capacity to take this on.