Contributors: danmelton

heydan's Issues

Proposal to add command line search for data sets with entity type and year specificity

We need the ability to search datasets by tag, optionally narrowed by entity type and year:

heydan search population --type all --year all

Dan, I found something!

population
american_community_survey_population, 1990 to 2013
decennial_census, 1990 to 2013

heydan search population --year 2015 --type school_district

population
american_community_survey_estimates_population, 2015
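
The filtering behind such a search command might look roughly like the sketch below. It assumes an in-memory catalog of hashes; the field names (tags, type, years), the entity types, and the helper names search and format_result are all hypothetical and only loosely modeled on the example output above.

```ruby
# Hypothetical catalog; in practice this would be built from the dataset metadata.
CATALOG = [
  { name: 'american_community_survey_population', tags: ['population'],
    type: 'county', years: (1990..2013).to_a },
  { name: 'decennial_census', tags: ['population'],
    type: 'county', years: (1990..2013).to_a },
  { name: 'american_community_survey_estimates_population', tags: ['population'],
    type: 'school_district', years: [2015] }
].freeze

# Returns the datasets carrying `tag`, narrowed by entity type and year.
# The string 'all' disables a filter, as in the command examples.
def search(catalog, tag, type: 'all', year: 'all')
  catalog.select do |dataset|
    dataset[:tags].include?(tag) &&
      (type == 'all' || dataset[:type] == type) &&
      (year == 'all' || dataset[:years].include?(year.to_i))
  end
end

# Formats one result as "name, first_year to last_year" (or a single year).
def format_result(dataset)
  first, last = dataset[:years].minmax
  range = first == last ? first.to_s : "#{first} to #{last}"
  "#{dataset[:name]}, #{range}"
end
```

Calling search(CATALOG, 'population', type: 'school_district', year: '2015') would then return only the datasets that cover that entity type and year.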

Rename steps and variables in the scripts

Suggest renaming to:

@data is the main variable used in each step.

Use this step to download the data and put it into a CSV file:

download
return @data

Use this step to transform the @data variable into just an identifier column and the variable's value for each year, in descending order by year:

ansi_id, 2015, 2014, 2013

#1, 1, 10, 20, 40

transform
return @data

This step compiles the necessary jurisdiction files:

compile

This step imports into Elasticsearch:

import

These method names are simpler and cleaner, and make more sense.
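
As a rough illustration of the first two steps, here is a minimal sketch. Only the step names and the data instance variable come from the proposal; the ExampleScript class, the CSV text argument, and the id_column parameter are assumptions made for the example. The compile and import steps are left out.

```ruby
require 'csv'

# Hypothetical script class, not the actual heydan script base class.
class ExampleScript
  attr_reader :data

  # Stand-in for a real download: parses CSV text into an array of row hashes.
  def download(csv_text)
    @data = CSV.parse(csv_text, headers: true).map(&:to_h)
  end

  # Reduces @data to the identifier column plus one value column per year,
  # with the year columns in descending order. Non-year columns are dropped.
  def transform(id_column: 'ansi_id')
    years = @data.first.keys.grep(/\A\d{4}\z/).sort.reverse
    @data = @data.map do |row|
      [row[id_column]] + years.map { |year| row[year] }
    end
    @data.unshift([id_column] + years)
  end
end
```

Each step stores its result in the data instance variable and returns it, so a later step can pick up where the previous one stopped.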

Generalize update_files method in the scripts

Generalize update_files method in the script.rb to be leveraged in the

It should look at the column header for the id in download.csv, and then map accordingly to the file structure needed. At the moment, the code is slightly different in each of the script implementations.
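
One way a generalized version could work is sketched below. The ID_FOLDERS mapping, the nces_id header, the plan_file_updates name, and the jurisdictions/folder/id.json layout are all assumptions for illustration, since the real file structure is not described here.

```ruby
require 'csv'

# Hypothetical mapping from an id column header to a folder name.
ID_FOLDERS = { 'ansi_id' => 'ansi', 'nces_id' => 'nces' }.freeze

# Reads the id column header from the CSV text and returns a hash of
# target file path => remaining row values. Nothing is written to disk.
def plan_file_updates(csv_text, root: 'jurisdictions')
  table = CSV.parse(csv_text, headers: true)
  id_header = table.headers.first
  folder = ID_FOLDERS.fetch(id_header) do
    raise ArgumentError, "unknown id column: #{id_header}"
  end
  table.each_with_object({}) do |row, plan|
    values = row.to_h
    id = values.delete(id_header)
    plan[File.join(root, folder, "#{id}.json")] = values
  end
end
```

With a single lookup keyed on the header, each script would no longer need its own variant of the mapping code.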

Reduce the amount of metadata needed

We have a ton of fields that we don't need:

too many descriptions
notes
etc.
We can add additional fields as needed, but we should cut down the base number.

Windows compatibility issues

Here are some roadblocks I hit when attempting to install Heydan on a Windows 10 box:

(1) It seems that kgio and unicorn are not compatible with Windows
(2) I couldn’t perform downloads because Ruby returned a certificate error. I found a couple of web pages that had downloadable certificates and suggestions about setting up an SSL_CERT_FILE environment variable but couldn’t get it working. I suspect this could be resolved with more effort.
(3) A more minor issue is that I couldn’t invoke heydan from the command prompt, but could work around this with “ruby heydan”.
(4) Windows versions of Ruby apparently do not support Process.fork, which seems to be required for sync.
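
For item (4), one possible workaround is to check whether fork is available and fall back to a thread when it is not. The sketch below is not heydan's code; run_in_background is a hypothetical helper, and a thread is not a full replacement for a separate process.

```ruby
# Runs the block in the background and returns a lambda that waits for it.
# Uses fork where the platform supports it and falls back to a thread otherwise.
def run_in_background(&block)
  if Process.respond_to?(:fork)
    pid = fork do
      block.call
      exit!(0) # leave the child without running the parent's at_exit handlers
    end
    -> { Process.wait(pid) }
  else
    thread = Thread.new(&block)
    -> { thread.join }
  end
end
```

On platforms without fork, Process.respond_to?(:fork) returns false, so the thread branch is used instead of raising NotImplementedError.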

Add an md5 hash and version for downloads

Currently, the download process only checks whether the download is present. If the download file gets updated (new data), the new version won't be downloaded. Perhaps we need a versioning scheme, e.g. census_decennial_population_1.csv?

We should create an md5 hash of the download file, and store it in the metadata at upload time. That way, we can also check to see if the file downloaded is exactly the same as the file we need to process.
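
A minimal sketch of that check, using Ruby's standard Digest library. The md5 metadata key and the two helper names are assumptions, not existing heydan fields or methods.

```ruby
require 'digest/md5'

# Returns the MD5 hex digest of the file at `path`.
def file_md5(path)
  Digest::MD5.file(path).hexdigest
end

# True when the file exists and matches the hash recorded in the metadata.
# The 'md5' key is hypothetical.
def download_current?(path, metadata)
  File.exist?(path) && file_md5(path) == metadata['md5']
end
```

A download step could then skip files for which download_current? returns true and fetch everything else again.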

heydan import

heydan import

Bulk updates Elasticsearch from the jurisdictions folder.

Create an Elasticsearch helper, HeyDan::ElasticSearch, that creates an index, updates the mapping, and bulk-posts JSON files to an Elasticsearch server.
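
The bulk-posting part could start from a helper like the one sketched below, which only builds the request body and does not talk to a server. The bulk_body name and the index name used in the usage note are assumptions. The body follows the newline-delimited format of the Elasticsearch bulk API as used by recent versions; older versions also expected a _type field in the action line.

```ruby
require 'json'

# Builds a newline-delimited body for the Elasticsearch bulk API:
# one action line followed by one document line per entry.
# `docs` maps a document id to its parsed JSON content.
def bulk_body(index, docs)
  lines = docs.flat_map do |id, doc|
    [JSON.generate(index: { _index: index, _id: id }), JSON.generate(doc)]
  end
  lines.join("\n") + "\n" # the bulk API requires a trailing newline
end
```

For example, bulk_body('jurisdictions', '01' => { 'population' => 10 }) yields two lines, the index action and the document, ready to be posted to the _bulk endpoint.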

Simplify naming schema for variables/script names

The file names for sources/folder/scripts/ should be reduced to just the variable name; we can scope the script by its containing folder.
After we call it, unset the name to prevent overwriting.

heydan upload

heydan upload

Uploads the download folder to S3. If you have a heydan.yml file with keys, it will upload all the files to an S3 bucket.

Uploads data from the download folder to an S3 bucket.

heydan start

heydan server

Starts a Sinatra server, first checking that the data has been processed and that the files are present in the jurisdictions folder.

Proposal needed for reminder to process new data

Data is updated over time at the source (i.e. catalogues). How do we surface the need to update from the source (e.g. a "next updated at" field)? Maybe the metadata has a nextUpdate with a date stamp?
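
If the metadata did carry a nextUpdate date stamp, a reminder check could be as small as the sketch below. The due_for_update helper and the ISO 8601 date format are assumptions.

```ruby
require 'date'

# Returns the names of datasets whose nextUpdate date has passed.
# `metadata` maps a dataset name to its metadata hash; entries without
# a nextUpdate value are skipped. Dates are assumed to be ISO 8601 strings.
def due_for_update(metadata, today: Date.today)
  metadata.select do |_name, meta|
    meta['nextUpdate'] && Date.iso8601(meta['nextUpdate']) <= today
  end.keys
end
```

A command such as heydan list could print these names as a reminder to reprocess the source.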

Proposal needed for Ontology

We need to tag and easily surface data sets to the developers. What tagging/data system can we develop for developers to quickly identify and add/grab data to the system?

heydan download

heydan download

Downloads all datasets from the CDN

Add a Download class that scans the file names in the datasets folder and attempts to download them from an S3 bucket.

Generate new dataset from templates

Running heydan new name

generates the files needed to add a new dataset: datasets/name.json and scripts/name.rb.

It copies scripts/template.rb.erb to scripts/dataset_name.rb, replacing the filename and class name.
It copies datasets/template.json.erb to datasets/dataset_name.json, replacing the filename.

It also checks for name collisions.
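
A sketch of the template step using ERB from the standard library. The helper names, the template variables name and class_name, and the CamelCase conversion are assumptions; the real templates may use different variables.

```ruby
require 'erb'
require 'fileutils'

# Renders an ERB template string with the dataset name and a CamelCase class name.
def render_template(template, name)
  class_name = name.split('_').map(&:capitalize).join
  ERB.new(template).result_with_hash(name: name, class_name: class_name)
end

# Writes the rendered template to `target`, refusing to overwrite an
# existing file so that name collisions are caught.
def generate_file(template, name, target)
  raise ArgumentError, "#{target} already exists" if File.exist?(target)

  FileUtils.mkdir_p(File.dirname(target))
  File.write(target, render_template(template, name))
  target
end
```

Running generate_file twice with the same target raises an ArgumentError the second time, which is one simple way to surface a name collision.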

Command Line Tools

Add Thor-enabled command-line tools for heydan, including the commands below. This is just the skeleton and does not include the actual processing:

  heydan sync 
  #Downloads all data sets from the CDN & imports them into elasticsearch. 

  heydan start
  #starts up the webserver for heydan

  heydan download
  #Downloads all datasets from the CDN

  heydan process
  #Grab all datasets from original source process them and output into downloads folder. Mostly used to test original download.

  heydan import
  #Bulk updates elasticsearch from the jurisdictions folder

  heydan list
  #Output a nicely formatted list of names

  heydan upload
  #uploads download folder to s3. If you have a heydan.yml file with keys, it will upload all the files to an s3 bucket 

  heydan new name
  #generates new files to add a new dataset datasets/name.json, scripts/name.rb 

Proposal Needed for Versioning Scripts/Identifiers

How do we version scripts? As an example, developer 1 adds a script for a dataset. Developer 2 installs/processes that script. Then developer 1 updates the script, and developer 2 needs to run the update. Is there a heydan update *names?

Rename folders

The downloads folder doesn't make a lot of sense; we should rename it to data sets.

Rename datasets to sources, because it's really about metadata from the sources.

Rename tmp to downloads, because we are really just downloading into that folder.

heydan process

heydan process dataset_name

Grabs all datasets from the original source, processes them, and outputs them into the downloads folder. Mostly used to test the original download.

Create a central HeyDan::Process class with methods:

module HeyDan
  class Process
    # Connects to a source (an API, FTP server, or CSV download) and saves
    # the result into the tmp folder.
    def get_data
      raise NotImplementedError, "#{self.class} must implement get_data"
    end

    # Transforms the data, e.g. pulls in data from another file and computes
    # new values such as trends or sums. This is a no-op by default, so
    # nothing is needed here if you don't want to process any data.
    def transform_data
    end

    # Saves the file into downloads.
    def save_data
      raise NotImplementedError, "#{self.class} must implement save_data"
    end

    # Loops through each item and saves it to jurisdictions/entity_id.
    def process_data
      raise NotImplementedError, "#{self.class} must implement process_data"
    end
  end
end

heydan sync

heydan sync

Downloads all data sets from the CDN & imports them into elasticsearch.

essentially runs the download and import commands
