joke-dataset's Introduction

A dataset of English plaintext jokes

There are about 208 000 jokes in this database scraped from three sources.

I make no claim on ownership of these files, nor do I necessarily endorse the jokes in them. This dataset is provided for research purposes (see License section below).

This repository was archived in December 2022 and receives no further support.

Files

Currently the dataset contains jokes from three sources, each in a different file.

----------------------------------------------
reddit_jokes.json |  195K jokes | 7.40M tokens
stupidstuff.json  | 3.77K jokes |  396K tokens
wocka.json        | 10.0K jokes | 1.11M tokens
----------------------------------------------
TOTAL             |  208K jokes | 8.91M tokens
----------------------------------------------

Format

Each file is a JSON document containing a flat list of joke objects. Every joke object has a body field; the additional fields vary by dataset and are described below.

Not all of the jokes are funny; to find the best ones, sort on the relevant additional field (score in reddit_jokes.json, rating in stupidstuff.json).
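
A minimal sketch of that sorting step in Python, assuming the files are read with the standard json module. The helper names load_jokes and top_jokes are hypothetical and not part of the dataset, and the sample records are made-up placeholders that only mimic the documented fields.

```python
import json


def load_jokes(path):
    """Load one dataset file, which is a flat JSON list of joke objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_jokes(jokes, field, n=10):
    """Return the n jokes with the highest value of `field`.

    Jokes that lack the field (or have it set to null) are skipped.
    """
    ranked = [j for j in jokes if j.get(field) is not None]
    return sorted(ranked, key=lambda j: j[field], reverse=True)[:n]


# Placeholder records shaped like reddit_jokes.json entries (not real data).
sample = [
    {"id": "a", "title": "T1", "body": "B1", "score": 3},
    {"id": "b", "title": "T2", "body": "B2", "score": 40},
    {"id": "c", "title": "T3", "body": "B3", "score": 12},
]

best = top_jokes(sample, "score", n=2)
print([j["id"] for j in best])  # ['b', 'c']
```

With the real files, the same call would look like top_jokes(load_jokes("reddit_jokes.json"), "score") or top_jokes(load_jokes("stupidstuff.json"), "rating").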

Note that the title is part of the joke in many cases (especially for Reddit submissions).
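
Since the title often carries the setup, one option is to join title and body before using a joke as text. The sketch below assumes a newline separator and a hypothetical helper name full_text; neither is prescribed by the dataset.

```python
def full_text(joke):
    """Join title and body into one string, tolerating a missing title.

    stupidstuff.json entries have no title field, so the body is returned alone.
    The newline separator is an arbitrary choice for this sketch.
    """
    title = (joke.get("title") or "").strip()
    body = (joke.get("body") or "").strip()
    if not title:
        return body
    if not body:
        return title
    return f"{title}\n{body}"


# The Reddit example entry from this README.
joke = {
    "title": "My boss said to me, \"you're the worst train driver ever. How many have you derailed this year?\"",
    "body": "I said, \"I'm not sure; it's hard to keep track.\"",
    "id": "5tyytx",
    "score": 3,
}
print(full_text(joke))
```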

reddit_jokes.json

Scraped from /r/jokes. Contains all submissions to the subreddit as of 13 February 2017.

These jokes may have additional comments in them (example).

Additional fields:

  • id -- submission ID in the subreddit.
  • score -- post score displayed on Reddit.
  • title -- title of the submission.

    {
        "title": "My boss said to me, \"you're the worst train driver ever. How many have you derailed this year?\"",
        "body": "I said, \"I'm not sure; it's hard to keep track.\"",
        "id": "5tyytx",
        "score": 3
    }

stupidstuff.json

Scraped from stupidstuff.org.

Additional fields:

  • id -- page ID on stupidstuff.org.
  • category -- see available categories here.
  • rating -- mean user rating on a scale of 1 to 5.

    {
        "category": "Blonde Jokes",
        "body": "A blonde is walking down the street with her blouse open, exposing one of her breasts. A nearby policeman approaches her and remarks, \"Ma'am, are you aware that I could cite you for indecent exposure?\" \"Why, officer?\" asks the blonde. \"Because your blouse is open and your breast is exposed.\" \"Oh my goodness,\" exclaims the blonde, \"I must have left my baby on the bus!\"",
        "id": 14,
        "rating": 3.5
    }

wocka.json

Scraped from wocka.com.

Additional fields:

  • id -- page ID on wocka.com.
  • category -- see available categories here.
  • title -- title of the joke.

    {
        "title": "Infants vs Adults",
        "body": "Do infants enjoy infancy as much as adults enjoy adultery?",
        "category": "One Liners",
        "id": 17
    }

License

I provide this dataset for research purposes and make no ownership claim on any part of it. The question of copyright in the case of jokes is unclear and I recommend not using the dataset commercially.

For removal of copyrighted content, please contact me on GitHub.

Citing

If you use this dataset in academic work, please cite as follows:

@misc{pungas,
        title={A dataset of English plaintext jokes.},
        url={https://github.com/taivop/joke-dataset},
        author={Pungas, Taivo},
        year={2017},
        publisher = {GitHub},
        journal = {GitHub repository}
}

joke-dataset's People

Contributors

stampyzfanz, taivop

joke-dataset's Issues

Bad data in wocka.json

wocka.json contains the following:

    {
        "body": "With over %d jokes submitted and ranked by users like you, Wocka has the largest collection anywhere on the internet. All of these jokes have been submitted by user like you. Join our community and chat with your fellow commedians and jokers.",
        "category": "Other / Misc",
        "id": 12490,
        "title": "Community of Comedians and Jokers"
    },
    {
        "body": "An active message board with hundreds of topics in which to participate.",
        "category": "One Liners",
        "id": 12491,
        "title": "Public Forums"
    },
    {
        "body": "Send private messages to your friends.",
        "category": "One Liners",
        "id": 12492,
        "title": "Private Messages"
    },
    {
        "body": "First click the Community button.\r\nThen click the Public Fourum button.\r\nThe first forum is for writing jokes.  \r\nDiscuss how to write funny jokes here.",
        "category": "One Liners",
        "id": 12493,
        "title": "Writing Jokes"
    },
    {
        "body": "Have a funny story? Share it here.",
        "category": "One Liners",
        "id": 12494,
        "title": "Funny Stories"
    },
    {
        "body": "Play or chat about games in here.",
        "category": "One Liners",
        "id": 12495,
        "title": "Games"
    },
    {
        "body": "Talk about anything you want here.\r\n\r\nO.K. I will. You really need to quit submitting jokes like this. Anybody agree with me?",
        "category": "One Liners",
        "id": 12496,
        "title": "General Discussion"
    },
    {
        "body": "TV, Movies, Music, Books...",
        "category": "One Liners",
        "id": 12497,
        "title": "Entertainment"
    },
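
One possible way for a user to drop these entries after loading is to filter by id. This is only a sketch: the id range 12490 to 12497 covers the entries quoted above and nothing else, so any other non-joke entries in the file would not be caught, and the names SUSPECT_IDS and drop_suspect are hypothetical.

```python
# IDs of the non-joke entries quoted in this issue (12490 through 12497).
# Assumption: only these eight are filtered; other bad entries may exist.
SUSPECT_IDS = range(12490, 12498)


def drop_suspect(jokes, suspect_ids=SUSPECT_IDS):
    """Return the jokes whose id is not in suspect_ids."""
    return [j for j in jokes if j.get("id") not in suspect_ids]


# Two entries taken from this page: one real joke and one site blurb.
sample = [
    {"id": 17, "category": "One Liners", "title": "Infants vs Adults",
     "body": "Do infants enjoy infancy as much as adults enjoy adultery?"},
    {"id": 12491, "category": "One Liners", "title": "Public Forums",
     "body": "An active message board with hundreds of topics in which to participate."},
]

cleaned = drop_suspect(sample)
print([j["id"] for j in cleaned])  # [17]
```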

Request for scraping code

The post date would be a useful parameter to weigh the score against.
Could you either:

  1. Change the code and upload a new version of the data that includes the post date, or
  2. Share the scraping code so that some of us can make the changes and send a PR once done?

Thanks

Is there an API for this?

Hey, thanks for releasing this, so funny. I'm wondering if anyone has published (or intends to release) an API for this data.

😆
