Proposal to add command line search for data sets with entity type and year specificity

We need the ability to search the data by specifying a tag and then, optionally, an entity type and year:

heydan search population --type all --year all

Dan, I found something!

population
american_community_survey_population, 1990 to 2013
decennial_census, 1990 to 2013

heydan search population --year 2015 --type school_district

population
american_community_survey_estimates_population, 2015
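
A minimal sketch of how such a search could filter dataset metadata. The Dataset struct, the CATALOGUE contents, and the search method are all hypothetical and only illustrate the tag/type/year filtering; the real heydan metadata may be shaped differently.

```ruby
# Hypothetical in-memory catalogue; the entries are made-up sample values.
Dataset = Struct.new(:name, :tags, :types, :years)

CATALOGUE = [
  Dataset.new("american_community_survey_population", ["population"],
              ["state", "school_district"], (2007..2015).to_a),
  Dataset.new("decennial_census", ["population"],
              ["state", "county"], [1990, 2000, 2010])
].freeze

# Returns the datasets carrying the tag, optionally narrowed by entity type
# and year. "all" (the default) disables the corresponding filter.
def search(tag, type: "all", year: "all", catalogue: CATALOGUE)
  catalogue.select do |d|
    d.tags.include?(tag) &&
      (type == "all" || d.types.include?(type)) &&
      (year == "all" || d.years.include?(year.to_i))
  end
end

search("population", type: "school_district", year: "2015").each do |d|
  puts "#{d.name}, #{d.years.min} to #{d.years.max}"
end
```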

Rename steps and variables in the scripts

Suggest renaming to:

@DaTa is the main variable used in each step

Use this block to download the data and put it into a csv file

download
return @DaTa

Use this step to transform the @DaTa variable into just an identifier column plus the variable's value for each year, in descending order by year

ansi_id, 2015, 2014, 2013

#1, 1, 10, 20, 40

transform
return @DaTa
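
The transform step could be sketched roughly as follows, assuming the downloaded data arrives as one row per identifier and year. That input shape, the transform method name, and the id_header parameter are assumptions for illustration.

```ruby
# Pivot long rows (id, year, value) into one row per id with a column per
# year, newest year first. The input shape is an assumption.
def transform(rows, id_header: "ansi_id")
  years = rows.map { |r| r[:year] }.uniq.sort.reverse
  table = rows.group_by { |r| r[:id] }.map do |id, group|
    values = group.to_h { |r| [r[:year], r[:value]] }
    [id] + years.map { |y| values[y] }
  end
  [[id_header] + years] + table
end

rows = [
  { id: "1", year: 2013, value: 40 },
  { id: "1", year: 2015, value: 10 },
  { id: "1", year: 2014, value: 20 }
]
puts transform(rows).map { |row| row.join(", ") }
```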

this step compiles the necessary jurisdiction files

compile

this step imports into elastic search

import

These method names are simpler and cleaner, and make more sense.

Generalize update_files method in the scripts

Generalize update_files method in the script.rb to be leveraged in the

It should look at the column header for the id in the download.csv, and then map accordingly to the file structure needed. At the moment, the code is slightly different in each of the script implementations.

Create a new Monkey Patch Labs inspired name

Reduce the amount of metadata needed

we have a ton of fields that we don't need:

too many descriptions
notes
etc
we can add any additional fields as needed, but get rid of the base number

Decennial Census Total Population

Add total population for available years from the api:

http://www.census.gov/data/developers/data-sets/decennial-census-data.html

Windows compatibility issues

Here are some roadblocks I hit when attempting to install Heydan on a Windows 10 box:

(1) It seems that kgio and unicorn are not compatible with Windows
(2) I couldn’t perform downloads because Ruby returned a certificate error. I found a couple of web pages that had downloadable certificates and suggestions about setting up an SSL_CERT_FILE environment variable but couldn’t get it working. I suspect this could be resolved with more effort.
(3) A more minor issue is that I couldn’t invoke heydan from the command prompt, but could work around this with “ruby heydan”.
(4) Windows versions of Ruby apparently do not support Process.fork, which seems to be required for sync.
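
One possible way around item (4) is to check for fork at runtime and fall back to a thread. Ruby reports Process.respond_to?(:fork) as false on platforms without fork; the helper names below are hypothetical, and whether a thread is an acceptable substitute for sync is an assumption.

```ruby
# Pick a strategy based on what the platform supports.
def background_strategy
  Process.respond_to?(:fork) ? :fork : :thread
end

# Runs the block in the background and returns a lambda that waits for it.
def run_in_background(strategy = background_strategy, &block)
  case strategy
  when :fork
    pid = fork(&block)
    -> { Process.wait(pid) }
  else
    thread = Thread.new(&block)
    -> { thread.join }
  end
end

# The thread path works on every platform, including Windows.
wait_for_it = run_in_background(:thread) { sleep 0.01 }
wait_for_it.call
```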

Update template.rb.erb with comments and sample code from population example

Add an md5 hash and version for downloads

Currently, the downloads process just checks to see if the download is present or not. If the download file gets updated (new data), then it won't be downloaded. Perhaps we need a version control system? I.e. census_decennial_population_1.csv?

We should create an md5 hash of the download file, and store it in the metadata at upload time. That way, we can also check to see if the file downloaded is exactly the same as the file we need to process.
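
A minimal sketch of that check using Ruby's standard Digest library. The download_current? method and the "md5" metadata key are hypothetical names.

```ruby
require "digest"
require "tempfile"

# True when the file on disk matches the MD5 recorded in the metadata.
# The "md5" key is an assumed field name.
def download_current?(path, metadata)
  File.exist?(path) && Digest::MD5.file(path).hexdigest == metadata["md5"]
end

Tempfile.create("download") do |f|
  f.write("ansi_id,2015\n1,10\n")
  f.flush
  metadata = { "md5" => Digest::MD5.file(f.path).hexdigest }
  puts download_current?(f.path, metadata) # file unchanged since upload

  f.write("2,20\n")
  f.flush
  puts download_current?(f.path, metadata) # file changed, needs a new download
end
```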

heydan import

Bulk updates elasticsearch from the jurisdictions folder

Create ElasticSearch helper HeyDan::ElasticSearch that creates an index, updates mapping and bulk posts json files to an elasticsearch server
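
As a rough sketch of the bulk-post part: the Elasticsearch _bulk endpoint takes newline-delimited JSON, with an action line followed by a document line for each record and a trailing newline. The bulk_body method, the "jurisdictions" index name, and the sample document are assumptions; the HTTP call itself is left out.

```ruby
require "json"

# Build a _bulk request body from a hash of id => document.
def bulk_body(index, docs)
  lines = docs.flat_map do |id, doc|
    [JSON.generate("index" => { "_index" => index, "_id" => id }),
     JSON.generate(doc)]
  end
  lines.join("\n") + "\n"
end

docs = { "abc123" => { "name" => "Example County", "population" => { "2010" => 100 } } }
print bulk_body("jurisdictions", docs)
```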

Add the OpenCivicIdentifiers from Sunlight/Google

Add the data (ids and names) from the Open Civic Identifiers project

https://github.com/opencivicdata/ocd-division-ids/blob/master/identifiers/country-us.csv

as part of the import into elasticsearch, include the source metadata files with linkages

Add Fips, Place and Other IDs from Open Civic Identifier datasets

Ability to sync Jurisdictions folder to a bucket

Create the ability to sync a jurisdictions folder to a bucket

Add Birth data at the state and county levels for 1995 to 2013

Add Birth data at the state and county levels for 1968 to 2013
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality/

Simplify naming schema for variables/script names

the file names for sources/folder/scripts/ should be reduced to just the variable name, we can scope the script by its containing folder
after we call it, unset the name to prevent writing over

heydan upload

uploads download folder to s3. If you have a heydan.yml file with keys, it will upload all the files to an s3 bucket

uploads data from the download folder to an s3 bucket

heydan start

heydan server

Starts a Sinatra server, first checking that the data has been processed and that the files are present in the jurisdictions folder.

Proposal needed for reminder to process new data

Data is updated over time at the source (e.g. catalogues). How do we surface the need to update from the source (e.g. a "next updated at" date)? Maybe the metadata has a nextUpdate field with a date stamp?
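
If the metadata did carry such a field, the check could be as small as the sketch below. The nextUpdate key holding an ISO 8601 date and the update_due? method are assumptions.

```ruby
require "date"

# True when the metadata's nextUpdate date has been reached.
# Datasets without the field are never flagged.
def update_due?(metadata, today = Date.today)
  stamp = metadata["nextUpdate"]
  return false if stamp.nil?

  Date.iso8601(stamp) <= today
end

puts update_due?({ "nextUpdate" => "2016-01-01" }, Date.new(2016, 6, 1))
```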

Add an md5 hexdigest for ocd-identifiers for elastic search and other types of ids where we can't have nonnumber/string characters
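
A minimal sketch: an MD5 hexdigest turns any identifier into a 32-character string of lowercase hex digits. The safe_id name and the sample identifier are for illustration only.

```ruby
require "digest"

# Map an identifier containing colons and slashes to a hex-only string.
def safe_id(identifier)
  Digest::MD5.hexdigest(identifier)
end

puts safe_id("ocd-division/country:us/state:ks")
```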

Proposal needed for Ontology

We need to tag and easily surface data sets to the developers. What tagging/data system can we develop for developers to quickly identify and add/grab data to the system?

Add Cause of Death Data for counties and states

Pull out mortality data from ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/ from 1968 to 2013 for:
age
race
cause of death

Turn Parallel into Helpers to make it less scary to use/call

HeyDan::Helpers.process_array...
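
A sketch of what such a helper might look like. It uses plain Ruby threads rather than the Parallel library so that it runs without extra gems, and the process_array signature (a threads: option and a block) is an assumption.

```ruby
module HeyDan
  module Helpers
    # Runs the block over every item using a small pool of threads and
    # returns the results in the same order as the input.
    def self.process_array(items, threads: 4, &block)
      results = Array.new(items.size)
      queue = Queue.new
      items.each_with_index { |item, index| queue << [item, index] }

      workers = Array.new([threads, 1].max) do
        Thread.new do
          loop do
            item, index = begin
              queue.pop(true) # non-blocking pop raises once the queue is empty
            rescue ThreadError
              break
            end
            results[index] = block.call(item)
          end
        end
      end
      workers.each(&:join)
      results
    end
  end
end

p HeyDan::Helpers.process_array([1, 2, 3]) { |n| n * 2 }
```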

heydan download

Downloads all datasets from the CDN

Download class that scans the file names in the datasets folder and attempts to download them from an s3 bucket

Add USPS Vacancy Data

Census American Community Survey Total Population

Census American Community Survey Total Population for 2007 to 2013
http://www.census.gov/data/developers/data-sets/acs-survey-5-year-data.html#notes

Interview 5 developers about using the system

Get feedback on the initial implementation

Proposal needed for creating/merging a bunch of variables/years into central csv file

heydan generate [name] --variables population var2 var3 --entitytype state:kansas --year 1990 --path optional

would generate a csv file for the state of kansas with identifiers and variables as columns
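
The merge itself could look something like this sketch, assuming each variable is already available as an id => value lookup for the requested year. The merge_variables name, the input shape, and the sample values are hypothetical.

```ruby
# Merge per-variable lookups into a header row plus one row per identifier.
# Identifiers missing from a variable get a nil (empty) cell.
def merge_variables(variables)
  ids = variables.values.flat_map(&:keys).uniq.sort
  header = ["id"] + variables.keys
  rows = ids.map do |id|
    [id] + variables.keys.map { |name| variables[name][id] }
  end
  [header] + rows
end

tables = {
  "population" => { "20001" => 100, "20003" => 200 },
  "var2" => { "20001" => 7 }
}
puts merge_variables(tables).map { |row| row.join(",") }
```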

Generate new dataset from templates

running heydan new name

generates new files to add a new dataset datasets/name.json, scripts/name.rb

Copies scripts/template.rb.erb and replaces the filename and class name to scripts/dataset_name.rb
Copies datasets/template.json.erb and replaces the filename to datasets/dataset_name.json

It also checks for name collisions
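
A rough sketch of the script half of the generator using Ruby's standard ERB library. The inline TEMPLATE stands in for scripts/template.rb.erb, whose real contents are not shown here, and generate_script is a hypothetical name.

```ruby
require "erb"
require "tmpdir"

# Stand-in for scripts/template.rb.erb; the real template will differ.
TEMPLATE = "# script for <%= name %>\nclass <%= class_name %>\nend\n"

# Renders the template into dir/name.rb, refusing to overwrite an existing file.
def generate_script(name, dir)
  path = File.join(dir, "#{name}.rb")
  raise ArgumentError, "#{path} already exists" if File.exist?(path)

  class_name = name.split("_").map(&:capitalize).join
  File.write(path, ERB.new(TEMPLATE).result_with_hash(name: name, class_name: class_name))
  path
end

Dir.mktmpdir do |dir|
  puts File.read(generate_script("total_population", dir))
end
```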

Add parallel processing to setup

Command Line Tools

Add Thor-enabled command line tools for heydan, including the commands below. This is just the skeleton and does not include the actual processing:

  heydan sync 
  #Downloads all data sets from the CDN & imports them into elasticsearch. 

  heydan start
  #starts up the webserver for heydan

  heydan download
  #Downloads all datasets from the CDN

  heydan process
  #Grab all datasets from original source process them and output into downloads folder. Mostly used to test original download.

  heydan import
  #Bulk updates elasticsearch from the jurisdictions folder

  heydan list
  #Output a nicely formatted list of names

  heydan upload
  #uploads download folder to s3. If you have a heydan.yml file with keys, it will upload all the files to an s3 bucket 

  heydan new name
  #generates new files to add a new dataset datasets/name.json, scripts/name.rb

Proposal Needed for Versioning Scripts/Identifiers

How do we version scripts? As an example, developer 1 adds a script for a dataset. Developer 2 installs/processes that script. Then developer 1 updates the script, and developer 2 needs to run the update. Is there a heydan update *names?

Rename folders

The downloads folder doesn't make a lot of sense...we should rename that to data sets.

Rename datasets to sources, because it's really about metadata from the sources.

Rename tmp to downloads, b/c we are really just downloading into that folder.

heydan process

heydan process dataset_name

Grab all datasets from original source process them and output into downloads folder. Mostly used to test original download.

Create a central HeyDan::Process class with methods:

class HeyDan::Process

  # Connects to a source (an api, ftp, or csv download) and saves the data into the tmp folder.
  def get_data
    super
  end

  # Transforms the data, e.g. pulling in data from another file and computing new data like trends, sums, etc.
  # Note: you don't need to do anything here if you don't want to process any data.
  def transform_data
    super
  end

  # Saves the file into downloads.
  def save_data
    super
  end

  # Loops through each item and saves it to the jurisdictions/entity_id.
  def process_data
    super
  end

end

heydan sync

Downloads all data sets from the CDN & imports them into elasticsearch.

essentially runs the download and import commands
