dariusk / corpora

A collection of small corpuses of interesting data for the creation of bots and similar stuff.

JavaScript 100.00%

bots corpus language words

corpora's Introduction

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
- For example, Corpora will not contain any complete "dictionary" style files. Instead we host a sampling of 1000 common nouns, adjectives, and verbs.
- Some lists are small enough by nature that we may contain a complete list of things in their category. For example, a list of heavily populated U.S. cities may only have 75 cities and be considered complete.
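
As a rough sketch of what "any language that can parse JSON" looks like in practice, here is a minimal Python example. It assumes a file shaped like the ones described above: a "description" string plus one key holding a list. The inline SAMPLE document and the load_items helper are hypothetical stand-ins, not part of this repository.

```python
import json
import random

# Hypothetical stand-in for a corpora file; a real one would be read from disk.
SAMPLE = """
{
  "description": "A tiny stand-in for a nouns list.",
  "nouns": ["apple", "bridge", "cloud", "drum"]
}
"""


def load_items(text, key):
    """Parse a corpora-style JSON document and return the list stored under `key`."""
    data = json.loads(text)
    items = data[key]
    if not isinstance(items, list):
        raise TypeError(f"expected a list under {key!r}")
    return items


nouns = load_items(SAMPLE, "nouns")
# Pick one item at random, as a bot prototype might.
print(random.choice(nouns))
```

Swapping in a more exhaustive data source later should only mean replacing load_items, which is the point of prototyping against a small list first.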

List of Corpora-related tools

corpora-project, a Node.js NPM package for accessing corpora data offline.
pycorpora, a simple Python interface for corpora
corpora-api, a Node.js server that offers up the corpora as a JSON API (now live at https://corpora-api.glitch.me)

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

BY SUBMITTING DATA AS A PULL REQUEST, YOU AGREE TO OUR APPLYING A CC0 FREE CULTURE LICENSE TO THE DATA, MEANING THAT ANYONE CAN USE THE DATA FOR ANY REASON WITHOUT ATTRIBUTION IN PERPETUITY.
Please submit all data in JSON format in a file with a .json extension, and please JSONLint your files before submitting. Also, thanks to Matt Rothenberg we have Travis-CI testing, which will jsonlint your pull request automatically. If you see a test failure notification in your PR after you submit, there's a problem with your JSON!
Keep individual files to about 1000 "things" maximum. Fewer than 1000 is fine, too.
If you'd like attribution, I'm happy to include your name in this Readme file. Just remember that nobody who uses this data is obligated to include attribution in their own projects.

Contributors

By Darius Kazemi and Many Wonderful Contributors.

corpora's People

Contributors

techdubb rfreebern mroth silky miguelbermudez atmccann timecolour andy-gorman muffinista ckolderup robsolomon kuro5hin inky jmahoney heyitsgarrett kadamwhite samplereality boodoo atduskgreg jonesbp muranava xanhast mkurian alyphen swartzcr lbstone amarriner fridiculous ahurriyetoglu pkropf fototo rossbarclay alforj xn-github prabhjotsl minskybelieve basilleaf iarnaud dfmooreqqq arturochian shooki morphogencc motiteux ares7 tyler-eto aethersg nosuchtype snazz2001 jampod-dev risatrix anissen jonathanmarvens noma4i simianlogic thenathanblack mlnsvbd borozanov mcat-ee davidcolbyreed alvations kangabell dewb mikeasilva aparrish timkelty lostineverland johng42 iabw virginia moonmilk ottoman91 bradparks zarawesome hugovk kylemcdonald y-a-v-a robhuzzey wiseman dcanetma constantx torrez prophetgoddess organisciak metavida schollz oknono amanda chengzhizhao hanzhang hhhaiai sampyxis edlea niteshkhilwani chrisspurgeon ariddell bdetweiler todrobbins polosecki aeeilllmrx sgilchrist

corpora's Issues

List of machine parts

There is a great list of parts of machinery, but I'm unclear on the legality of incorporating it.

Plural and invariable nouns?

Would it be possible to add a list of plural and invariable nouns?

I can’t seem to find such a list anywhere.

add an emoji folder to language

pls and ty

Plants

I extracted these from here: https://en.wikipedia.org/wiki/List_of_plants_by_common_name

Figured I might as well post them here in case someone has time to convert them to required format and make a pull request 👍

plants.txt

Ultraconserved words

I read this and immediately thought of this repo.

Linguists identify 15,000-year-old 'ultraconserved words'

I don't know how I'd begin finding and processing them.

Human Universals

Just found a lovely list that felt appropriate for this repo http://condor.depaul.edu/mfiddler/hyphen/humunivers.htm

Add Scents

Change of capital name for Kazakhstan

Change in capital city name for Kazakhstan to Nur-Sultan.
Here is the link of a report.

file path: data/geography/countries_with_capitals.json
file: countries_with_capitals.json

Came to this repo from a youtube video.

Exclude "bad words" from Corpora?

I ran wordfilter on Corpora:

./data/foods/herbs_n_spices.json may contain bad words: [u'spices']
./data/religion/fictional_religions.json may contain bad words: [u'Esoteric Order of Dagon']
./data/religion/religions.json may contain bad words: [u'Ancient Egyptian religion', u'Syrian-Egyptic Gnosticism']
./data/societies_and_groups/animal_welfare.json may contain bad words: [u'Japan', u'Pakistan', u'Egypt']
./data/words/word_clues/clues_five.json may contain bad words: [u'spade', u'tardy', u'blame', u'spice', u'flame', u'spook']
./data/words/word_clues/clues_four.json may contain bad words: [u'gash', u'lame', u'gimp']
./data/words/word_clues/clues_six.json may contain bad words: [u'spooks', u'spooky', u'blames', u'flames', u'script', u'spices', u'retard']

(This excludes keys matching "Description", "description", "descriptions", "scripts", "wine_descriptions".)

1. Should any of those found words be removed from Corpora?
2. Would it be useful to pop this into the CI?
   a. Just for information, or
   b. To fail a build if it finds something?

If yes to 2a, should some of those be added to a whitelist?
If yes to 2b, there needs to be a whitelist for any which aren't removed.
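
A minimal sketch of what such a check with a whitelist might look like, written in plain Python rather than with wordfilter itself. It assumes substring matching, which is what the output above suggests (for example, 'blame' and 'flame' being flagged alongside 'lame'). BLACKLIST, ALLOWLIST, SKIPPED_KEYS and both function names are hypothetical placeholders, not wordfilter's API.

```python
import json

# Placeholder lists for illustration only; a real check would use wordfilter's list.
BLACKLIST = ["lame"]
ALLOWLIST = {"blame", "blames", "flame", "flames"}
# Keys whose values are skipped entirely, as in the run described above.
SKIPPED_KEYS = {"Description", "description", "descriptions"}


def iter_strings(node, key=None):
    """Yield every string value in a parsed JSON document, skipping excluded keys."""
    if key in SKIPPED_KEYS:
        return
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, dict):
        for child_key, value in node.items():
            yield from iter_strings(value, child_key)


def flagged(doc, blacklist=BLACKLIST, allowlist=ALLOWLIST):
    """Return strings that contain a blacklisted substring and are not allowlisted."""
    hits = []
    for text in iter_strings(doc):
        lowered = text.lower()
        if lowered in allowlist:
            continue
        if any(bad in lowered for bad in blacklist):
            hits.append(text)
    return hits


doc = json.loads(
    '{"description": "lame example text",'
    ' "words": ["lame", "blame", "flame", "tardy"]}'
)
hits = flagged(doc)
print(hits)  # ['lame']
```

For option 2a the CI job would only print hits; for option 2b it would exit non-zero whenever hits is non-empty.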

Add "feeling cold, warm, cool, hot" in the activities.json file

corpora-api no longer maintained

herbs-n-spices.json: Bad text encoding

herbs-n-spices.json has non-ASCII characters, but is not valid UTF-8. Lines 186 & 187 both have problems:

"Qâlat daqqa",
"Quatre épices"

Python and iconv both have problems with it:

$ cat data/foods/herbs_n_spices.json | python -m json.tool
'utf8' codec can't decode byte 0xe2 in position 1: invalid continuation byte

$ cat data/foods/herbs_n_spices.json | iconv -f utf8 -t utf8
{
    "description": "A list of herbs and spices, and mixtures of the two.",
[...]
        "Pumpkin pie spice",
        "Q
iconv: (stdin):186:18: cannot convert
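
To locate every offending line rather than only the first, a small Python sketch along these lines could help. The sample bytes are built by hand for illustration, and the Latin-1 guess is an assumption: byte 0xe2 is "â" in Latin-1, which would match "Qâlat", but the file's actual encoding has not been confirmed here.

```python
def invalid_utf8_lines(raw: bytes):
    """Return (line_number, byte_offset, byte_value) for each line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in this line.
            bad.append((number, err.start, line[err.start]))
    return bad


# Hand-built sample: the first line is ASCII, the next two are Latin-1 encoded.
sample = (
    b'"Pumpkin pie spice",\n'
    + '"Qâlat daqqa",\n'.encode("latin-1")
    + '"Quatre épices"\n'.encode("latin-1")
)

for number, offset, value in invalid_utf8_lines(sample):
    print(f"line {number}, byte offset {offset}: 0x{value:02x}")

# Assuming the source really is Latin-1, re-encoding would repair it.
repaired = sample.decode("latin-1").encode("utf-8")
print(invalid_utf8_lines(repaired))  # []
```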

Missing dog

Hi there,
I can't seem to find a German Shepherd on the list.

[Idea] Enforce 1000 element limit in GitHub Actions workflow

Hiyya! I was wondering if it would be a good idea to expand the GitHub Actions workflow to run a small script that checks each JSON file (on pushes and pull requests) and enforces the 1000 element limit on lists.

I can make the PR and changes to the workflow if it sounds like a good / useful contribution!
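
A rough sketch of the kind of check I have in mind, in Python. The oversized_lists function and its "$"-style path strings are made up for this example, and the workflow step that would call it is not shown.

```python
import json

LIMIT = 1000  # the "about 1000 things" guideline from the README


def oversized_lists(node, limit=LIMIT, path="$"):
    """Return (path, length) for every list in a parsed JSON document longer than `limit`."""
    found = []
    if isinstance(node, list):
        if len(node) > limit:
            found.append((path, len(node)))
        for index, item in enumerate(node):
            found.extend(oversized_lists(item, limit, f"{path}[{index}]"))
    elif isinstance(node, dict):
        for key, value in node.items():
            found.extend(oversized_lists(value, limit, f"{path}.{key}"))
    return found


# Synthetic document: one list is a single item over the limit.
doc = json.loads(json.dumps({
    "description": "demo",
    "things": list(range(1001)),
    "few": ["a", "b"],
}))
print(oversized_lists(doc))  # [('$.things', 1001)]
```

In CI the script would run this over each changed .json file and exit non-zero if the result is non-empty.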

Pamelo in fruit dataset should be Pomelo

About Json file format and golang lib support

golang lib support

When I wanted to use the corpora project, I found that there was no Go library for it, so I spent some time writing one. The repository is artikell/gocorpora.

json file format

But when I tried to build the model, I found that some of the JSON file formats are unfriendly to strongly typed languages such as Go and Java. For example, corpora/hexagrams.json at master · dariusk/corpora:

{
  "description": "I Ching hexagrams and descriptions, by Ashley Blewer.",
  "source": "https://bits.ashleyblewer.com/i-ching/",
  "hexagrams": {
    "111111": {
      "definition": "01. Force (乾 qián); The Creative; Possessing Creative Power & Skill",
      "hexagram": " ䷀ ",
      "number": "1",
      "description": "Heaven above and Heaven below: Heaven in constant motion. With the strength of the dragon, the Superior Person steels herself for ceaseless activity. Productive activity. Potent Influence. Sublime success if you keep to your course."
    },
    "000000": {
      "definition": "02. Field (坤 kūn); The Receptive; Needing Knowledge & Skill; Do not force matters and go with the flow",
      "hexagram": " ䷁ ",
      "number": "2",
      "description": "Earth above and Earth below: The Earth contains and sustains. In th"
    }
  }
}

This model is transformed into the following structure:

{
	Description string `json:"description"`
	Hexagrams   struct {
		_000000 struct {
			Definition  string `json:"definition"`
			Description string `json:"description"`
			Hexagram    string `json:"hexagram"`
			Number      int64  `json:"number,string"`
		} `json:"000000"`
		_000001 struct {
			Definition  string `json:"definition"`
			Description string `json:"description"`
			Hexagram    string `json:"hexagram"`
			Number      int64  `json:"number,string"`
		} `json:"000001"`
	}
}

Here 000000 and 000001 have each become a separate field of hexagrams, so the model keeps growing as more data is added.
Of course, keying by pattern has advantages: it guarantees that entries are unique and makes lookups efficient.
However, I think the data should have a consistent structure in which the set of attributes does not grow without bound, which would be friendlier to strongly typed languages.
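
One way to keep the keyed layout while still getting a fixed record type is to treat hexagrams as a map from pattern to record instead of generating one field per key. A minimal Python sketch of that idea follows; the Hexagram dataclass, the parse_hexagrams helper and the shortened sample strings are illustrative assumptions, not part of corpora or gocorpora. In Go, the analogous shape would presumably be a map with string keys and a single struct type as its value.

```python
import json
from dataclasses import dataclass


@dataclass
class Hexagram:
    definition: str
    hexagram: str
    number: int
    description: str


# Shortened sample in the same shape as hexagrams.json (texts abbreviated).
SAMPLE = """
{
  "description": "sample",
  "hexagrams": {
    "111111": {"definition": "01. Force", "hexagram": " ䷀ ",
               "number": "1", "description": "Heaven above and Heaven below."},
    "000000": {"definition": "02. Field", "hexagram": " ䷁ ",
               "number": "2", "description": "Earth above and Earth below."}
  }
}
"""


def parse_hexagrams(text):
    """Map each line pattern (e.g. "111111") to one fixed-shape Hexagram record."""
    raw = json.loads(text)["hexagrams"]
    return {
        pattern: Hexagram(
            definition=entry["definition"],
            hexagram=entry["hexagram"].strip(),
            number=int(entry["number"]),  # stored as a string in the file
            description=entry["description"],
        )
        for pattern, entry in raw.items()
    }


hexagrams = parse_hexagrams(SAMPLE)
print(hexagrams["000000"].number)  # 2
```

With this shape, adding more hexagrams to the data adds map entries, not new fields in the type.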

ITP / NYU Course Assignment Incoming. . .

Hello!

I am assigning the students of "Open Source Studio" at ITP to make a contribution to Corpora! The details are here:

https://github.com/Open-Source-Studio-at-ITP/Syllabus/blob/source/data-assignment.md

I have forked the Corpora repo into our org so students can have a "sandbox" to submit their pull requests to and then we'll send useful ones upstream here.

Any comments, thoughts, or feedback on the assignment welcome!

I'll close this issue on Sept 25th when the assignment is due!

❤️

Capital of Bolivia

https://github.com/dariusk/corpora/blob/master/data/geography/countries_with_capitals.json

The capital of Bolivia is Sucre, not La Paz.

The constitutional capital is Sucre, while the seat of government and executive is La Paz.

dulux colours

Hopefully someone can format and make a pull request :)
dulux colours.txt
source: http://www.dulux.com.au/specifier/colour/colour-atlas

Dashes in category/file names make retrieval in pycorpora difficult

At the moment there are categories in corpora like "film-tv" and files like "materials/abridged-body-fluids". When using tools like pycorpora, these names cause problems because they prevent the user from retrieving files using standard syntax, such as pycorpora.category_name.file_name['key'], because - is not a legal character in Python identifiers.
In pycorpora I can work around this as follows:
getattr(pycorpora, 'film-tv').tv_shows['tv_shows']
pycorpora.materials.get_file('abridged-body-fluids')['abridged body fluids']
However, this isn't ideal. Probably either pycorpora and similar libraries should perform these workarounds internally (translating - to _, for instance), or corpora should restrict category and file names to valid JS/Python/C (for example) identifiers.
I've opened a similar issue in pycorpora: aparrish/pycorpora#11.
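
For illustration, here is a minimal sketch of the first option, with a library translating _ to - internally. This is not pycorpora's actual implementation: the Category class is hypothetical and the data is a placeholder held in memory.

```python
class Category:
    """Hypothetical wrapper exposing dashed names through underscore attributes."""

    def __init__(self, files):
        self._files = files

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        files = self.__dict__.get("_files", {})
        # Try the name as given first, then with underscores turned into dashes.
        for candidate in (name, name.replace("_", "-")):
            if candidate in files:
                return files[candidate]
        raise AttributeError(name)


# Placeholder data mirroring the names mentioned above.
corpora = Category({
    "film-tv": Category({"tv_shows": {"tv_shows": ["Example Show"]}}),
    "materials": Category(
        {"abridged-body-fluids": {"abridged body fluids": ["example"]}}
    ),
})

print(corpora.film_tv.tv_shows["tv_shows"])
print(corpora.materials.abridged_body_fluids["abridged body fluids"])
```

A name that mixes dashes and underscores would not resolve with this approach, which is one argument for restricting names in corpora itself.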

Paint Colors

Currently working on a project that requires a bunch of different, often ridiculous, names of various paints (from Valspar, Benjamin Moore, etc.). Currently scraping color pickers and random sites for paint names.

Slowly...but surely...

Would anyone be interested in a big list of them? Not really familiar with packing things up in JSON but I can do my best :-).
