datasets / awesome-data

Curated list of quality open datasets

Home Page: https://datahub.io/collections

data datasets datasets-csv open-data open-datasets opendata

awesome-data's Introduction

Awesome collections on DataHub

The awesome section presents collections of high quality datasets organized by topic.

The home page for the awesome collections lives in the frontend repo and should be modified there. See the live Collections page at https://datahub.io/collections.

awesome-data's Issues

Insurance industry combined ratios timeseries

Probably need to disaggregate (i.e. P/C market versus others, take out catastrophes, etc.).
The best source is Best's, but it is very restrictive.
Does this merit inclusion?

Sources

Monthly prices for a wide range of commodities from IMF

Exchange Rates

http://datahub.io/dataset/exchange-rates

[super] Inflation / Price Index / PPP / Deflator data

Price indices
- ~~World CPI (per country) #170~~ https://github.com/datasets/cpi
- Country specific price indexes e.g.
  - https://github.com/datasets/cpi-gb
  - https://github.com/datasets/cpi-us
Inflation
- World: #165
PPP
- ~~Global PPP #40~~ https://github.com/datasets/ppp
Price deflators

Granularity

Spatial:
- "World" i.e. per country (probably annual time series)
- Per country: probably want a sub-selection and greater granularity
Temporal:
- Year, month, day (?)
- Span: historical is special. For uniform values probably more recent.

Purchasing power parity (PPP) dataset

Source data series would be one or more of:

PPP conversion factor (GDP) to market exchange rate ratio - http://data.worldbank.org/indicator/PA.NUS.PPPC.RF
PPP conversion factor, GDP (LCU per international $) - http://data.worldbank.org/indicator/PA.NUS.PPP

NUTS Administrative Boundaries

Initial shortlist of datasets to include

Original in this google spreadsheet

Please add new suggestions as a new issue in this issue tracker.

US House price index (case-shiller)

http://www.standardandpoors.com/indices/sp-case-shiller-home-price-indices/en/us/?indexId=spusa-cashpidff--p-us----

ICD-10 and ICD-9 Classification of Medical Conditions

The WHO maintains a listing of known diseases at http://www.who.int/classifications/icd/en/ - the data download is only available upon registration and under an NC (non-commercial) license. Is there an open version of this somewhere?

Earth Temperature Time Series

Crunchbase (?)

Does this merit inclusion?

Getting the data

http://www.crunchbase.com/ - stats as of Aug 2013

~175k companies, 193k people

Where to get bulk ...

http://info.crunchbase.com/about/crunchbase-data-exports/ - Excel file dumps (can use REST API)
- Download URL http://static.crunchbase.com/exports/crunchbase_monthly_export.xlsx (13Mb)
- This is not the whole DB but a small portion dealing with the latest "deals"
Data from ~ 2y ago via petewarden: https://github.com/petewarden/crunchcrawl/
All use of Crunchbase API requires registration as of Dec 1 2012

License

cc-by according to http://info.crunchbase.com/docs/licensing-policy/ with a bunch of specific attribution requirements

Oil prices

US EIA has a variety of prices: https://www.eia.gov/dnav/pet/pet_pri_spt_s1_d.htm (US EIA is a great source: high quality and, as federal government data, public domain)

There are various types of oil for which we could get prices:

brent crude - https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RBRTE&f=D
- XLS file: https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls
wti (west texas intermediate) - http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RWTC&f=M
- note there are several versions: first purchase price, FOB spot price
pump price (in various places in the world)

I propose we store:

Brent crude
WTI

For granularity I'd say it is worth storing all of daily, weekly, monthly and annual but prioritise daily. (note naming conventions: http://data.okfn.org/doc/publish-faq#data-package-name)

Question: Do we do this as one data package or one data package per oil type? (And if one data package, do we store Brent and WTI in the same file or in separate files? Answer: separate files.)

All in one:

Convenient to prepare, since the data all comes from the same source and the scraper is easy to run (that said, we already have natural gas prices separate ...)

Separate:

One data package for one dataset approach.
Data package is small and lightweight

My instinct here is all in one, so the data package will look like:

data/wti-day.csv
data/wti-year.csv
data/wti-month.csv
# etc

United Nations Code for Trade and Transport Locations (UN/LOCODE)

http://www.unece.org/cefact/locode/service/location.html

Mimetypes / Mediatypes / File formats Dataset

List of mimetypes / mediatypes / file formats.

S&P 500

The index value and associated info (as per shiller). Good for this to be historical.
Constituents

List of all public health insurances in Germany with contact details

There is no open repository of contact details for health insurances in Germany apart from one PDF listing URLs. Assisted by web scraping we have compiled a complete list with email, address and telephone number. This should be helpful for healthcare system researchers trying to access policies or data from all insurances. There are 137 of them! Does this belong in the registry?

Country Boundaries (vector)

This would be country polygons at the crudest scale (e.g. 1:110m). Suggest packaging Natural Earth data (public domain, etc.).

package name: geo-boundaries-world-110m

Long-term: best way would be to get primary natural earth folks to add in "packaging" - they are already on github - see https://github.com/nvkelso/natural-earth-vector. But we need an exemplar ...

What format should we use?

geojson
topojson - already have this here https://github.com/mbostock/topojson/tree/master/examples
(geocsv ?)
(sqlite)

/cc @jalbertbowden @amercader - thoughts here very welcome :-)

Data

natural earth geojson from @nvkelso - just boundaries
topojson from @mbostock
Natural earth site: http://www.naturalearthdata.com/downloads/110m-cultural-vectors/

Population - UK by NUTS region

http://datahub.io/dataset/nuts-region-populations

Euribor

http://www.euribor-rates.eu/euribor-rates-by-year.asp

We probably don't need all 15 rates they used to have and which they are now reducing:

Until November 1st 2013 Euribor-EBF published 15 Euribor rates (1-3 weeks and 1-12 months) daily (working days only). As of November 1st 2013 the number of Euribor rates is reduced to 8 (1-2 weeks, 1, 2, 3, 6, 9 and 12 months). This adjustment is a consequence of the problems which arose over the last couple of years when determining the Euribor rates. An EBA/ESMA report published in January 2013 recommends calculating and publishing only those Euribor rates which are used by banks on a frequent basis. The rationale is that it is easier to calculate a reliable rate if there are many transactions for a specific rate (maturity).

I suggest we record the following rates at monthly intervals (which is what you get from historical data):

1-week
1-month
3-month
1-year

Though it may turn out that getting all 8 is the same effort, in which case we may as well.

Reseau de Dakar Dem Dikk (DDD)

The current bus network of the Dakar Dem Dikk public transport service.

Download is via some javascript-y thing but some dev tools analysis reveals source as:

http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=tableCode:240&DataMartId=POP&Format=csv&c=2,3,5,7,9,11,13,15,16,17&s=_countryEnglishNameOrderBy:asc,refYear:desc,areaCode:asc

Bloomberg Open Symbology

http://www.openbloomberg.com/open-symbology/

Stock Symbols List - US

A complete list of all NYSE stock symbols (plus company name).

TODO: work out what symbol list(s) we want.

Note EDGAR also have a symbol list: http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html plus see bloomberg list in #25

CO2 "Price" (Emission trading permits)

Where can we get CO2 price and emission trading scheme info? Which regions run emissions trading schemes?

EU data

http://www.eea.europa.eu/data-and-maps/data/european-union-emissions-trading-scheme-eu-ets-data-from-citl-7

Data about the EU emission trading system (ETS). The EU ETS data viewer provides aggregated data on emissions and allowances, by country, sector and year. The data mainly comes from the EU Transaction Log (EUTL). Additional information on auctioning and scope corrections is included.

(Major) Cities of the World

See http://www.unece.org/cefact/locode/service/location.html - looks like we would have to scrape (and not sure what the license is ...)

See also #30 (city population time series) - this provides a nice CSV file, so maybe we extract from that ...

CBOE Ticker list

http://www.cboe.com/publish/ScheduledTask/MktData/cboesymboldir2.csv

Airport Codes

Does this merit being included as a reference dataset?

Public Bodies (e.g. Government Departments, Local Authorities etc)

Some initial work in this project including a DB: https://github.com/okfn/publicbodies.org

[meta] Get Involved and Helping Out - start here!

If you are interested in getting involved and helping to create and maintain datasets, just add your GitHub username in a comment below, plus any relevant info on skills / interests.

Access to write in datasets directories

Hi, can I have access to add this datapackage https://github.com/aliounedia/senegal-companies to the registry?

UK House Prices

IBAN / BIC codes (SWIFT)

Official ISO registrar http://swiftref.swift.com/

World Bank - World Development Indicators

http://data.worldbank.org/
http://data.worldbank.org/summary-terms-of-use

EDGAR company identifiers (CIK)

There's a list here: http://www.sec.gov/edgar/NYU/cik.coleft.c

It would also be nice to have a ticker to CIK mapping (which EDGAR must have, as they use it in their search).

To do this you probably need to search by ticker on the EDGAR standard search and request atom output, e.g.

http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom

Then parse the atom to grab the CIK. (If you prefer HTML output just omit output=atom).
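
The lookup described above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested scraper: the query parameters are copied from the example URL above, while the assumption that the atom output carries the CIK in an element named `cik` is not confirmed by this issue, and `SAMPLE_FEED` is a hypothetical hand-written document used only to exercise the parser.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the example URL in this issue.
EDGAR_SEARCH = "http://www.sec.gov/cgi-bin/browse-edgar"


def edgar_search_url(ticker):
    """Build the company search URL for a ticker, requesting atom output."""
    params = {
        "CIK": ticker,
        "Find": "Search",
        "owner": "exclude",
        "action": "getcompany",
        "output": "atom",
    }
    return EDGAR_SEARCH + "?" + urlencode(params)


def extract_cik(atom_text):
    """Return the text of the first element whose local name is 'cik'.

    Assumption: the feed exposes the CIK in an element named 'cik'.
    Returns None when no such element is found.
    """
    root = ET.fromstring(atom_text)
    for element in root.iter():
        # Strip an XML namespace prefix such as '{http://www.w3.org/2005/Atom}'.
        local_name = element.tag.rsplit("}", 1)[-1].lower()
        if local_name == "cik" and element.text:
            return element.text.strip()
    return None


# Hypothetical feed (placeholder CIK), only for demonstrating the parser.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <company-info><cik>0000000000</cik></company-info>
</feed>"""

if __name__ == "__main__":
    print(edgar_search_url("ibm"))
    print(extract_cik(SAMPLE_FEED))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose, so the sketch runs without network access.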

US Boundaries

Similar to #38 (country boundaries)

Name: geo-boundaries-us-10m

JODI Oil Database (?)

http://www.jodidata.org/database/access-database.aspx

Not quite sure what is in there but seems to be oil reserves etc

Gold Prices

https://github.com/datasets/gold-prices

Mime-types / Media-types / File formats

http://www.iana.org/assignments/media-types

http://svn.apache.org/viewvc/httpd/httpd/branches/2.2.x/docs/conf/mime.types?view=annotate

Discussion

Would prefer to include file extension

Suggested Schema

{
  id: # mimetype identifier
  fileextensions: # space separated list (?)
  link: # link to authoritative mimetype?
}
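
One way to populate this schema is to parse a file in the Apache `mime.types` layout (one media type per line, followed by whitespace-separated extensions, with `#` starting a comment). The sketch below assumes that layout; `parse_mime_types` and `SAMPLE` are illustrative names, and `link` is left empty because the issue does not say where the authoritative link would come from.

```python
def parse_mime_types(text):
    """Turn mime.types-style text into rows matching the suggested schema."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        media_type, *extensions = line.split()
        rows.append({
            "id": media_type,
            # Space-separated list, as floated in the schema above.
            "fileextensions": " ".join(extensions),
            # Placeholder: the source of this link is undecided.
            "link": "",
        })
    return rows


# Small illustrative excerpt in the mime.types layout.
SAMPLE = """\
# MIME type         Extensions
application/json    json
image/jpeg          jpeg jpg jpe
text/html           html htm
"""

if __name__ == "__main__":
    for row in parse_mime_types(SAMPLE):
        print(row)
```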

current countries of earth

I had a python script kicking around for fetching up-to-date country code standards and putting them all together.

I love the work you are doing on dataprotocols.org so I reorganized it as a datapackage.
This probably duplicates some of the data already included in the registry, so feel free to ignore.

https://github.com/ewheeler/current-countries-of-earth

Currency Codes

The links need to be updated - (coincidentally I commented on this in the datahub http://datahub.io/dataset/iso-4217-currency-codes a couple of hours ago)

This table is not really currency codes; it is country/currency codes, so it is denormalized and USD appears in several places. The table is misnamed and less useful as a result.

Oddly too, the reference to a country is by name not by ISO 3166 code. Do you have a policy around linking/foreign keys?

Of course, some folk would use the XML 'package' directly http://www.currency-iso.org/dam/downloads/dl_iso_table_a1.xml :)
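
The normalization suggested above could look something like this: split the country/currency table into one table with a single row per currency code and a second table recording which entity uses which code. This is only a sketch; the column names and the rows in `SAMPLE` are assumptions for illustration, not the actual layout of the published file.

```python
import csv
import io

# Illustrative rows only; column names are hypothetical.
SAMPLE = """Entity,Currency,AlphabeticCode
UNITED STATES,US Dollar,USD
ECUADOR,US Dollar,USD
GERMANY,Euro,EUR
FRANCE,Euro,EUR
JAPAN,Yen,JPY
"""


def split_tables(csv_text):
    """Split a denormalized country/currency table into two tables.

    Returns (currencies, usage):
      currencies: dict mapping each currency code to its name, once per code
      usage: list of (entity, code) pairs, one per input row
    """
    currencies = {}
    usage = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row["AlphabeticCode"]
        # Keep the first name seen for each code; later rows repeat it.
        currencies.setdefault(code, row["Currency"])
        usage.append((row["Entity"], code))
    return currencies, usage


if __name__ == "__main__":
    currencies, usage = split_tables(SAMPLE)
    print(currencies)
    print(usage)
```

In a fuller version the `usage` table would key on an ISO 3166 code rather than the country name, which is the foreign-key question raised above.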

Datasets

Think we have multiple:

co2-{geo}
- total and per capita emissions at level {geo} which is one of global or national
co2-fossil-{geo} where geo is one of global | national | regional
- global: http://cdiac.esd.ornl.gov/trends/emis/tre_glob_2010.html
- country data is at http://cdiac.esd.ornl.gov/trends/emis/tre_coun.html
- preliminary estimates for 2011/2012 - http://cdiac.ornl.gov/ftp/trends/co2_emis/Preliminary_CO2_emissions_2012.xlsx
co2-fossil-gridded - http://cdiac.esd.ornl.gov/epubs/ndp/ndp058/ndp058_v2013.html

Sources

http://cdiac.esd.ornl.gov/

Data Should Look Like

Global:

Year, Emissions, .... could have other columns for more fine-grained breakdown

Country:

Year, Country, Emissions, Per Capita Emissions

Long term interest rates / Long term government bonds

[meta] Naming Conventions

Establish various naming conventions both for datasets / repos and also for files.

Datasets

For country specific datasets:

{topic}                   # e.g. gdp
{topic}-{2-letter-iso}    # e.g. gdp-us

For Data Files

Temporal granularity

[...-]year.csv
[...-]quarter.csv
[...-]month.csv
[...-]day.csv
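
As a rough illustration, the dataset and data file conventions above could be encoded in two small helpers. The function names and the validation rules are assumptions made for this sketch; the issue itself only fixes the name patterns.

```python
# Temporal granularities listed in the convention above.
GRANULARITIES = ("year", "quarter", "month", "day")


def dataset_name(topic, country=None):
    """Return '{topic}' or '{topic}-{country}' for a country-specific dataset."""
    if country is None:
        return topic
    # Assumption: the country suffix is a two-letter ISO code, lowercased.
    if len(country) != 2 or not country.isalpha():
        raise ValueError(f"expected a two-letter country code, got {country!r}")
    return f"{topic}-{country.lower()}"


def data_file_name(granularity, prefix=""):
    """Return '[prefix-]{granularity}.csv' for a supported granularity."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity!r}")
    stem = f"{prefix}-{granularity}" if prefix else granularity
    return f"{stem}.csv"


if __name__ == "__main__":
    print(dataset_name("gdp"))
    print(dataset_name("gdp", "US"))
    print(data_file_name("day", prefix="wti"))
```

For example, `data_file_name("day", prefix="wti")` yields the `wti-day.csv` name used in the oil prices discussion.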

For README

Intro summary paragraph

Headings (all h2)

Data - about the data
Wrangling - how we had to process the data (maybe we should call this Processing)
License - about the license

LEI (legal entity identity) database

See http://p-lei.org/about

See also http://openleis.com/ - seems to be a dump at http://openleis.com/legal_entities.json and http://openleis.com/legal_entities.xml (not sure about license)

registry - quantarctica (qgis antarctica data)

http://www.quantarctica.org/
