Comments (4)
Hey, sorry about my late response. This sounds like a useful idea. I just have two concerns:
- Even though arXiv is very popular, not all papers are on arXiv. Many are just available in the refereed proceedings (e.g. ACL Anthology, AAAI), which don't have an API. How would you deal with these?
- As far as I can see, anyone who wants to contribute to the repo needs to run gener_yaml.py to produce the full yaml. Is there another way? If not, I think this places too heavy a burden on contributors; I also think having two yaml files (one template and the full version) might get confusing.
from nlp-progress.
Hi Sebastian,
this time it is my turn to apologize for a slow response. I was offline for two weeks.
Concerning the first bullet: the script can work not only with the arXiv API, but also with the DOI API and the Semantic Scholar API. Semantic Scholar in particular has a huge database. For example, for dependency parsing I could easily fetch metadata for every paper via the API. My feeling is that these three APIs will cover >95% of the listed resources. If all data were in YAML format, one could easily write a short script to check the coverage.
Of course, for the "unAPIzed" papers one can still fall back to entering all the details by hand, as is done now.
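A coverage check of the kind mentioned above might look roughly like the sketch below. It assumes each paper entry has already been loaded from YAML into a dict, and the field names `arxiv`, `doi` and `s2` are hypothetical, since the actual schema is not shown in this thread.

```python
from typing import Iterable, Mapping, Optional

# Hypothetical identifier fields; the real YAML schema may use other names.
ID_FIELDS = ("arxiv", "doi", "s2")


def api_source(entry: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty identifier field of an entry, or None."""
    for field in ID_FIELDS:
        if entry.get(field):
            return field
    return None


def coverage(entries: Iterable[Mapping[str, str]]) -> float:
    """Fraction of entries that carry at least one API-resolvable identifier."""
    entries = list(entries)
    if not entries:
        return 0.0
    covered = sum(1 for e in entries if api_source(e) is not None)
    return covered / len(entries)
```

Entries for which `api_source` returns `None` are the ones that would still need their details entered by hand.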
As far as the second bullet is concerned, that is essentially my question: if and how would you see this being integrated into your maintenance workflow? The idea is that contributors would only need to enter an arXiv ID, a DOI, or a Semantic Scholar ID, and the tooling would do the rest. I believe it is worth considering. Your repo will keep growing, e.g. with the addition of new languages or tasks, so tools that improve the consistency of the data would definitely increase the overall quality of NLP-progress. For example, this would mean no more title/link inconsistencies like in #95, or the inconsistencies between arxiv.org/abs and arxiv.org/pdf links that are there now.
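To illustrate the abs/pdf point, here is a minimal sketch of how a tool could map both link styles to one canonical form. The function name `canonical_arxiv_url` is made up for this example, and the pattern only handles new-style identifiers (four digits, a dot, four or five digits); old-style identifiers are simply left unmatched.

```python
import re
from typing import Optional

# New-style arXiv identifier, optional version suffix, optional ".pdf".
_ARXIV_URL = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/"
    r"(?P<id>\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?/?$"
)


def canonical_arxiv_url(url: str) -> Optional[str]:
    """Map an abs or pdf arXiv link to its abs form; None if not recognized."""
    match = _ARXIV_URL.match(url.strip())
    if match is None:
        return None
    return f"https://arxiv.org/abs/{match.group('id')}"
```

Links that return `None` (non-arXiv links, or formats this pattern does not cover) would be left untouched.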
Another benefit of having full metadata would be that the tooling could easily generate a downloadable BibTeX file for every task/language, which could be helpful for many users.
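As a rough sketch of the BibTeX idea, assuming the fetched metadata ends up as a dict with `title`, `authors`, `year` and an optional `url` (these keys and the `to_bibtex` helper are assumptions, not the script's actual interface):

```python
from typing import Any, Mapping


def to_bibtex(key: str, meta: Mapping[str, Any]) -> str:
    """Render one metadata record as a generic @misc BibTeX entry."""
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),
        "year": str(meta["year"]),
    }
    if meta.get("url"):
        fields["url"] = meta["url"]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"
```

Joining the output of `to_bibtex` over all entries of a task would give the per-task file. The sketch does not escape special LaTeX characters, which real tooling would have to handle.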
Best regards,
Michał
Hey Michał, I'm really sorry about my late reply. I meant to answer sooner, but somehow this slipped through the cracks.
I really like the idea and would love to offer more functionality on top of this. However, we've decided that we'll stick with storing data in Markdown for now. I'm not sure if this is still compatible with that.