Mostly public datasets. Use at your own risk!
-
(!) AwesomeData public datasets: previously github.com/caesar0301/awesome-public-datasets
-
(!) Wikipedia, List of datasets for machine learning research
-
UCI Machine Learning Repository: UC Irvine Machine Learning Repository (497 datasets on 2020.05.07)
-
(check licenses) Kaggle datasets (23,298 on 2019.11)
-
Registry of Open Data on AWS (119 on 2019.11): previously aws.amazon.com/datasets/
-
Stanford NLP group: mainly annotated corpora and treebanks, plus actual NLP tools
-
Yahoo Webscope: also lists papers that use the provided data
-
CrowdFlower: Data for Everyone (last updated 2017.11): crowdsourced data for specific tasks, including many small surveys they conducted
-
AI Hub: public image and text data for CV and NLP
-
threads
- Quora post about NLP (corpora) datasets: mainly annotated corpora
- /r/datasets: endless list of datasets, most of them scraped by amateurs and not well documented or licensed
- rs.io: another big list
- (?) not sure... Stack Exchange: Open Data
-
site-specific data
-
gov data
- Korea
- EU
  - EU Open Data Portal
  - Eurostat: ec.europa.eu/eurostat/data/bulkdownload
  - European Environment Agency's data, maps, graphs...
- U.S.