corpora's Introduction

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

  • Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
  • Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
    • For example, Corpora will not contain any complete "dictionary" style files. Instead we host a sampling of 1000 common nouns, adjectives, and verbs.
    • Some lists are small enough by nature that we may contain a complete list of things in their category. For example, a list of heavily populated U.S. cities may only have 75 cities and be considered complete.
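As a rough illustration of the points above, here is a minimal Python sketch of how a prototype might read one of these files and pull a few random entries. The helper names `load_list` and `sample_items` are made up for this example, and the sketch assumes a file that stores its items as a list under a single top-level key (for example a hypothetical "nouns" key in nouns.json); check the actual file for its key name.

```python
import json
import random
from pathlib import Path


def load_list(path, key):
    """Load one corpus file and return the list stored under `key`.

    Assumes the file is a JSON object whose items live in a list
    under a single top-level key; the key name varies per file.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    return data[key]


def sample_items(items, count, seed=None):
    """Return up to `count` distinct items chosen at random."""
    rng = random.Random(seed)
    return rng.sample(items, min(count, len(items)))
```

Because the data is plain JSON, swapping this out later for a larger data source only means replacing `load_list` with something that returns a list from elsewhere.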

List of Corpora-related tools

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

  • BY SUBMITTING DATA AS A PULL REQUEST, YOU AGREE TO OUR APPLYING A CC0 FREE CULTURE LICENSE TO THE DATA, MEANING THAT ANYONE CAN USE THE DATA FOR ANY REASON WITHOUT ATTRIBUTION IN PERPETUITY.
  • Please submit all data in JSON format, in a file with a .json extension, and please JSONLint your files before submitting. Also, thanks to Matt Rothenberg, we have Travis-CI testing, which will jsonlint your pull request automatically. If you see a test failure notification in your PR after you submit, there's a problem with your JSON!
  • Keep individual files to about 1000 "things" maximum. Fewer than 1000 is fine, too.
  • If you'd like attribution, I'm happy to include your name in this Readme file. Just remember that nobody who uses this data is obligated to include attribution in their own projects.

Contributors

By Darius Kazemi and Many Wonderful Contributors.

corpora's People

Contributors

amarriner, aparrish, barbeque, charlesreid1, ckolderup, coleww, dariusk, dazzlingdevelopment, enkiv2, fitnr, greg-kennedy, hectate, hugovk, javierarce, jerimee, jimkang, kswedberg, lcooke, lee2sman, michaelpaulukonis, miguelznunez, mroth, msbrown, norwd, pikapower9080, samplereality, serin-delaunay, suisea, techdubb, thisisparker

corpora's Issues

Plural and invariable nouns?

Would it be possible to add a list of plural and invariable nouns?

I can’t seem to find such a list anywhere.

Exclude "bad words" from Corpora?

I ran wordfilter on Corpora:

./data/foods/herbs_n_spices.json may contain bad words: [u'spices']
./data/religion/fictional_religions.json may contain bad words: [u'Esoteric Order of Dagon']
./data/religion/religions.json may contain bad words: [u'Ancient Egyptian religion', u'Syrian-Egyptic Gnosticism']
./data/societies_and_groups/animal_welfare.json may contain bad words: [u'Japan', u'Pakistan', u'Egypt']
./data/words/word_clues/clues_five.json may contain bad words: [u'spade', u'tardy', u'blame', u'spice', u'flame', u'spook']
./data/words/word_clues/clues_four.json may contain bad words: [u'gash', u'lame', u'gimp']
./data/words/word_clues/clues_six.json may contain bad words: [u'spooks', u'spooky', u'blames', u'flames', u'script', u'spices', u'retard']

(This excludes keys matching "Description", "description", "descriptions", "scripts", "wine_descriptions".)

  1. Should any of those found words be removed from Corpora?
  2. Would it be useful to pop this into the CI?
    a. Just for information, or
    b. To fail a build if it finds something?

If yes to 2a, should some of those be added to a whitelist?
If yes to 2b, there needs to be a whitelist for any which aren't removed.
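To make the whitelist idea concrete, here is a small Python sketch of such a check. It does not use the wordfilter library itself; it assumes plain case-insensitive substring matching (which is what the report above suggests, but that is a guess), and the names `iter_strings` and `flag_strings`, along with their parameters, are invented for this example.

```python
def iter_strings(node, skip_keys=frozenset()):
    """Yield every string value in parsed JSON, skipping values under `skip_keys`."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item, skip_keys)
    elif isinstance(node, dict):
        for key, value in node.items():
            if key not in skip_keys:
                yield from iter_strings(value, skip_keys)


def flag_strings(data, blocklist, whitelist=(), skip_keys=frozenset()):
    """Return strings that contain a blocklisted substring.

    `blocklist` entries are expected in lowercase. Strings that appear
    in `whitelist` (compared case-insensitively) are never flagged.
    """
    allowed = {entry.lower() for entry in whitelist}
    flagged = []
    for text in iter_strings(data, skip_keys):
        lowered = text.lower()
        if lowered in allowed:
            continue
        if any(term in lowered for term in blocklist):
            flagged.append(text)
    return flagged
```

For option 2a a CI step could simply print the returned list; for option 2b it could exit with a non-zero status whenever the list is non-empty.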

herbs_n_spices.json: Bad text encoding

herbs_n_spices.json has non-ASCII characters, but is not valid UTF-8. Lines 186 & 187 both have problems:

"Qâlat daqqa",
"Quatre épices"

Python and iconv both have problems with it:

$ cat data/foods/herbs_n_spices.json | python -m json.tool
'utf8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
$ cat data/foods/herbs_n_spices.json | iconv -f utf8 -t utf8
{
    "description": "A list of herbs and spices, and mixtures of the two.",
[...]
        "Pumpkin pie spice",
        "Q
iconv: (stdin):186:18: cannot convert
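One possible way to locate and repair this kind of problem is sketched below in Python. The byte 0xe2 is "â" in Latin-1, so the sketch assumes the file was saved as Latin-1 (Windows-1252 would give the same result for these two lines); that is a guess, not something confirmed here. The function names are made up for the example.

```python
def find_invalid_utf8(raw):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


def reencode_as_utf8(raw, fallback="latin-1"):
    """Return `raw` as UTF-8 bytes.

    Valid UTF-8 input is returned unchanged. Anything else is decoded
    with `fallback` (an assumed source encoding) and re-encoded as UTF-8.
    """
    if find_invalid_utf8(raw) is None:
        return raw
    return raw.decode(fallback).encode("utf-8")
```

Reading the file with `open(path, "rb")`, passing the bytes through `reencode_as_utf8`, and writing the result back would be one way to apply it, after checking the repaired names by eye.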

Missing dog

Hi there,
I can't seem to find a German Shepherd on the list.

[Idea] Enforce 1000 element limit in GitHub Actions workflow

Hiyya! I was wondering if it would be a good idea to expand the GitHub Actions workflow to run a small script that checks each JSON file (on pushes and pull requests) and enforces the 1000 element limit on lists.

I can make the PR and changes to the workflow if it sounds like a good / useful contribution!
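A rough Python sketch of what such a check could look like is below. It is only one interpretation of the limit: it counts the length of every JSON list (not the number of keys in an object), and it assumes the files live under a "data" directory as in the paths quoted in other issues. The function names are hypothetical.

```python
import json
from pathlib import Path

MAX_ITEMS = 1000  # limit taken from the contribution guidelines


def oversized_lists(node, limit=MAX_ITEMS, path="$"):
    """Yield (path, length) for every list in parsed JSON longer than `limit`."""
    if isinstance(node, list):
        if len(node) > limit:
            yield path, len(node)
        for index, item in enumerate(node):
            yield from oversized_lists(item, limit, f"{path}[{index}]")
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from oversized_lists(value, limit, f"{path}.{key}")


def check_tree(root, limit=MAX_ITEMS):
    """Return (file, path, length) for each oversized list under `root`."""
    problems = []
    for file in sorted(Path(root).rglob("*.json")):
        with file.open(encoding="utf-8") as handle:
            data = json.load(handle)
        for path, length in oversized_lists(data, limit):
            problems.append((str(file), path, length))
    return problems


def main(argv):
    """Print each problem and return an exit status (1 if any were found)."""
    root = argv[1] if len(argv) > 1 else "data"
    problems = check_tree(root)
    for file, path, length in problems:
        print(f"{file}: {path} has {length} items (limit {MAX_ITEMS})")
    return 1 if problems else 0
```

A workflow step could run this with `sys.exit(main(sys.argv))` so that the build fails when any list is over the limit. Since the guideline says "about 1000", a softer version might only print warnings.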

About Json file format and golang lib support

golang lib support

When I wanted to use the corpora project, I found that there was no golang version of the library, so I spent some time supporting a golang version. The repository is artikell/gocorpora.

json file format

But when I tried to build the model, I found that some JSON file formats are not friendly to strongly typed languages such as golang and Java. For example, hexagrams.json in dariusk/corpora (master):

{
  "description": "I Ching hexagrams and descriptions, by Ashley Blewer.",
  "source": "https://bits.ashleyblewer.com/i-ching/",
  "hexagrams": {
    "111111": {
      "definition": "01. Force (乾 qián); The Creative; Possessing Creative Power & Skill",
      "hexagram": " ䷀ ",
      "number": "1",
      "description": "Heaven above and Heaven below: Heaven in constant motion. With the strength of the dragon, the Superior Person steels herself for ceaseless activity. Productive activity. Potent Influence. Sublime success if you keep to your course."
    },
    "000000": {
      "definition": "02. Field (坤 kūn); The Receptive; Needing Knowledge & Skill; Do not force matters and go with the flow",
      "hexagram": " ䷁ ",
      "number": "2",
      "description": "Earth above and Earth below: The Earth contains and sustains. In th"
    }
  }
}

This model will be transformed into the following structure:

// HexagramsFile is a placeholder type name.
type HexagramsFile struct {
	Description string `json:"description"`
	Hexagrams   struct {
		// One struct field per hexagram key.
		_000000 struct {
			Definition  string `json:"definition"`
			Description string `json:"description"`
			Hexagram    string `json:"hexagram"`
			Number      int64  `json:"number,string"`
		} `json:"000000"`
		_000001 struct {
			Definition  string `json:"definition"`
			Description string `json:"description"`
			Hexagram    string `json:"hexagram"`
			Number      int64  `json:"number,string"`
		} `json:"000001"`
	} `json:"hexagrams"`
}

000000 and 000001 have each become a separate attribute of hexagrams. As more data is added, the model structure will keep growing.
Of course, the advantage of this layout is that it guarantees the uniqueness of each entry and makes lookups more efficient.
However, I think the data should be consistent in structure, without an unbounded number of attributes, which would be friendlier to strongly typed languages.
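To illustrate what a more uniform structure could look like, here is a small Python sketch that turns the keyed object into a list of records with a fixed set of fields. The function name and the "lines" field name are invented for this example and are not part of the existing file.

```python
def hexagrams_to_records(corpus, key_field="lines"):
    """Turn the "hexagrams" object into a list of uniform records.

    Each dynamic key (such as "111111") moves into `key_field`, so every
    record ends up with the same fixed set of fields.
    """
    records = []
    for key, entry in corpus["hexagrams"].items():
        record = {key_field: key}
        record.update(entry)
        records.append(record)
    return records
```

With this shape, a typed language can decode the data into a list of one record type. Alternatively, without changing the data at all, Go's encoding/json can decode the existing object into a map keyed by string whose values are a single hexagram struct, which avoids one struct field per key.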

ITP / NYU Course Assignment Incoming. . .

Hello!

I am assigning the students of "Open Source Studio" at ITP to make a contribution to Corpora! The details are here:

https://github.com/Open-Source-Studio-at-ITP/Syllabus/blob/source/data-assignment.md

I have forked the Corpora repo into our org so students can have a "sandbox" to submit their pull requests to and then we'll send useful ones upstream here.

Any comments, thoughts, or feedback on the assignment welcome!

I'll close this issue on Sept 25th when the assignment is due!

❤️

Dashes in category/file names make retrieval in pycorpora difficult

At the moment there are categories in corpora like "film-tv" and files like "materials/abridged-body-fluids". When using tools like pycorpora, these names cause problems because they prevent the user from retrieving files using standard syntax, such as pycorpora.category_name.file_name['key'], because - is not a legal character in Python identifiers.
In pycorpora I can work around this as follows:
getattr(pycorpora, 'film-tv').tv_shows['tv_shows']
pycorpora.materials.get_file('abridged-body-fluids')['abridged body fluids']
However, this isn't ideal. Either pycorpora and similar libraries should perform these workarounds internally (translating - to _, for instance), or corpora should restrict category and file names to valid JS/Python/C (for example) identifiers.
I've opened a similar issue in pycorpora: aparrish/pycorpora#11.
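As a sketch of the first option (the library doing the translation internally), something along these lines might work. This is not pycorpora's actual implementation; `resolve_name` and `Category` are hypothetical names, and the class wraps a plain dict rather than real files to keep the example self-contained.

```python
def resolve_name(requested, available):
    """Map an identifier-style name to the real category or file name.

    Tries an exact match first, then a match where "-" in the real
    name is treated as "_". Raises KeyError if nothing matches.
    """
    if requested in available:
        return requested
    for name in available:
        if name.replace("-", "_") == requested:
            return name
    raise KeyError(requested)


class Category:
    """Attribute-style access over a dict of name -> contents."""

    def __init__(self, entries):
        self._entries = entries

    def __getattr__(self, attribute):
        # Only called when normal attribute lookup fails.
        if attribute.startswith("_"):
            raise AttributeError(attribute)
        try:
            return self._entries[resolve_name(attribute, self._entries)]
        except KeyError:
            raise AttributeError(attribute) from None
```

With this in place, a category stored as "film-tv" could be reached as `root.film_tv`. One caveat of this approach: if both "film-tv" and "film_tv" existed, the underscore form would always resolve to the exact match.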

Paint Colors

Currently working on a project that requires a bunch of different, often ridiculous, names of various paints (from Valspar, Benjamin Moore, etc.). Currently scraping color pickers and random sites for paint names.

Slowly...but surely...

Would anyone be interested in a big list of them? Not really familiar with packing things up in JSON but I can do my best :-).

A list of profane English words?

(Not sure if an issue should be opened)

Is there such a list in the corpora? I couldn't find one with a quick search. Would it be interesting to have?

us_cities.json has bunk population data

It looks like a lot of the cities have the last digit of their population truncated.
For example, New York lists its population as 8,405,82, which is not any number I recognize.

Fix attributions?

A TON of folks have contributed files to corpora, but only three people are cited in the Contributors section of the README. Would you accept a PR that adds the other contributors, listed in a separate file?
