🌊 Machine learning dataset loaders for testing and example scripts

License: MIT License

Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets, intended for testing and example scripts. These loaders previously lived in thinc.extra.datasets.

Setup and installation

The package can be installed via pip:

pip install ml-datasets

Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()
# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
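
Because loaders can be looked up by string name, a script can choose its dataset from a command-line flag. The following is a minimal sketch rather than part of the package: the `--dataset` flag and the `load_by_name` and `main` helpers are hypothetical names chosen for illustration, and the only registry call it relies on is `loaders.get`, as used above.

```python
import argparse


def load_by_name(argv, registry):
    """Parse a hypothetical --dataset flag and call the matching loader.

    `registry` is any object with a `get(name)` method that returns a
    loader function, such as `ml_datasets.loaders`.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="imdb", help="loader string name")
    args = parser.parse_args(argv)
    loader = registry.get(args.dataset)
    return loader()


def main():
    # Imported here so the helper above can be reused without the package.
    from ml_datasets import loaders

    # argv=None makes argparse read the real command line.
    train_data, dev_data = load_by_name(None, loaders)
    print(f"{len(train_data)} training and {len(dev_data)} dev instances")
```

Calling `main()` from a script and passing `--dataset dbpedia` on the command line would then select the DBPedia loader instead of the default. This sketch assumes the chosen loader needs no arguments and returns a training/dev pair.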

Available loaders

NLP datasets

| ID / Function | Description | NLP task | From URL |
| --- | --- | --- | --- |
| imdb | IMDB sentiment dataset | Binary classification: sentiment analysis | ✓ |
| dbpedia | DBPedia ontology dataset | Multi-class single-label classification | ✓ |
| cmu | CMU movie genres dataset | Multi-class, multi-label classification | ✓ |
| quora_questions | Duplicate Quora questions dataset | Detecting duplicate questions | ✓ |
| reuters | Reuters dataset (texts not included) | Multi-class multi-label classification | ✓ |
| snli | Stanford Natural Language Inference corpus | Recognizing textual entailment | ✓ |
| stack_exchange | Stack Exchange dataset | Question Answering | |
| ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging | ✓ |
| ud_ewtb_pos_tags | Universal Dependencies English EWT corpus | POS tagging | ✓ |
| wikiner | WikiNER data | Named entity recognition | |

Other ML datasets

| ID / Function | Description | ML task | From URL |
| --- | --- | --- | --- |
| mnist | MNIST data | Image recognition | ✓ |

Dataset details

IMDB

Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.

import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 25000 | 25000 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Balanced (50/50) | Balanced (50/50) |

DBPedia

Each instance contains an ontological description and its classification into one of 14 distinct labels.

import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 560000 | 70000 |
| Label values | 1-14 | 1-14 |
| Labels per instance | Single | Single |
| Label distribution | Balanced | Balanced |

CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 41793 | 0 |
| Label values | 363 different genres | - |
| Labels per instance | Multiple | - |
| Label distribution | Imbalanced: 147 labels with fewer than 20 examples, while Drama occurs more than 19000 times | - |
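
To inspect a multi-label distribution like the one described above, the genre lists can be tallied with a counter. This is a minimal sketch that assumes each instance is a `(text, genres)` pair whose second element is a list of genre strings; the helper names `genre_counts` and `rare_genres` are hypothetical, and the sample data is made up for illustration.

```python
from collections import Counter


def genre_counts(data):
    """Count how often each genre occurs across (text, genres) pairs."""
    counts = Counter()
    for _text, genres in data:
        counts.update(genres)
    return counts


def rare_genres(counts, min_examples=20):
    """Return genres with fewer than `min_examples` occurrences, sorted."""
    return sorted(g for g, n in counts.items() if n < min_examples)


# Tiny made-up sample in the assumed (text, genres) shape, for illustration only.
sample = [
    ("A family falls apart.", ["Drama"]),
    ("Two detectives chase a thief.", ["Crime", "Drama"]),
    ("A robot learns to sing.", ["Science Fiction", "Musical"]),
]
counts = genre_counts(sample)
print(counts.most_common(1))
print(rare_genres(counts, min_examples=2))
```

Passing the training data returned by the `cmu` loader to `genre_counts` in place of `sample` should give the per-genre totals, provided the data has the assumed shape.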

Quora

Each instance contains two Quora questions and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels contain some noise and are not guaranteed to be perfect.

import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 363859 | 40429 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |
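
With an imbalanced label distribution, a model that always predicts the majority label already scores as well as that label's share of the data (about 63% here), so it is a useful floor when reading accuracy numbers. The sketch below computes that share for any list of `(inputs, label)` pairs; the helper name `majority_baseline` is hypothetical and the sample data is made up.

```python
from collections import Counter


def majority_baseline(data):
    """Return (most common label, its share of instances) for (inputs, label) pairs."""
    counts = Counter(label for _inputs, label in data)
    label, n = counts.most_common(1)[0]
    return label, n / len(data)


# Made-up question pairs in the (questions, label) shape shown above.
sample = [
    (("How do I learn Python?", "What is the capital of France?"), 0),
    (("How old is the Earth?", "What is the age of the Earth?"), 1),
    (("Is tea healthy?", "How do planes fly?"), 0),
    (("Who wrote Hamlet?", "Who is the author of Hamlet?"), 1),
    (("What is Rust?", "How do I bake bread?"), 0),
]
label, share = majority_baseline(sample)
print(f"Always predicting {label} would be right {share:.0%} of the time")
```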

Registering loaders

Loaders can be registered externally using the loaders registry as a decorator. For example:

import ml_datasets

@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data is a placeholder for your own loading code
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
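
For readers unfamiliar with the pattern, the toy class below shows how a decorator-based registry of this kind can work in principle. It is an illustrative sketch only and not the implementation used by ml_datasets: `SimpleRegistry`, `toy_loaders` and `toy_loader` are hypothetical names, and the data is made up.

```python
class SimpleRegistry:
    """Toy name-to-function registry; not the ml_datasets implementation."""

    def __init__(self):
        self._functions = {}

    def __call__(self, name):
        # Used as @registry("name"): returns a decorator that stores the function.
        def decorator(func):
            self._functions[name] = func
            return func

        return decorator

    def get(self, name):
        # Look up a previously registered function by its string name.
        return self._functions[name]

    def __contains__(self, name):
        return name in self._functions


toy_loaders = SimpleRegistry()


@toy_loaders("toy_loader")
def toy_loader():
    # Made-up data standing in for a real dataset.
    train = [("good film", 1), ("dull film", 0)]
    dev = [("fine film", 1)]
    return train, dev


assert "toy_loader" in toy_loaders
train_data, dev_data = toy_loaders.get("toy_loader")()
```

The decorator returns the original function unchanged, so a registered loader can still be imported and called directly as well as looked up by name.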
