dssg / hitchhikers-guide

The Hitchhiker's Guide to Data Science for Social Good


hitchhikers-guide's Introduction

Welcome to the Hitchhiker's Guide to Data Science for Social Good.

What is the Data Science for Social Good Fellowship?

The Data Science for Social Good Fellowship (DSSG) is a hands-on, project-based summer program that launched in 2013 at the University of Chicago. It has since expanded to multiple locations globally and is currently coordinated by the Data Science for Social Good Foundation and Carnegie Mellon University. It brings together a group of fellows, typically graduate students (or, in some cases, senior undergraduates), from across the world to work on machine learning, artificial intelligence, and data science projects that have a social impact, in partnership with social good organizations. From a pool of typically around 1000 applicants, 20-40 fellows are selected from diverse computational and quantitative disciplines including computer science, statistics, math, engineering, psychology, sociology, economics, and public policy.

The fellows work in small, cross-disciplinary teams on social good projects spanning education, health, energy, transportation, criminal justice, social services, economic development, and international development, in collaboration with global government agencies and non-profits. This work is done under close, hands-on mentorship from full-time, dedicated senior data science mentors and dedicated project managers with industry and/or government experience. The result is highly trained fellows, improved data science capacity at the social good organization, and a high-quality data science project that is ready for field trial and implementation at the end of the program.

In addition to hands-on project-based training, the summer program includes workshops, tutorials, and ethics discussion groups based on our data science for social good curriculum, which is designed to train fellows in practical data science and artificial intelligence for social impact.

Who is this guide for?

The primary audience for this guide is the set of fellows coming to DSSG, but we want everything we create to be open and accessible to the larger world. We hope it is useful to people beyond the summer fellows.

If you are applying to the program or have been accepted as a fellow, check out the manual to see how you can prepare before arriving, what orientation and training will cover, and what to expect from the summer.

If you are interested in learning at home, check out the tutorials and teach-outs developed by our staff and fellows throughout the summer, and feel free to suggest or contribute additional resources.

Another one of our goals is to encourage collaborations. We invite anyone interested in doing this type of work, or in starting a DSSG program, to build on what we've learned by using and contributing to these resources.

What is in this guide?

Our number one priority at DSSG is to train fellows to do responsible data science/ML/AI for social good work. This curriculum includes many things you'd find in a data science course or bootcamp, but with an emphasis on solving problems with social impact, integrating data science with the social sciences, understanding and discussing the ethical implications of the work, and addressing privacy and confidentiality issues.

We have spent many (sort of) early mornings waxing existential over Dunkin' Donuts while trying to define what makes a "data scientist for social good," that enigmatic breed combining one part data scientist, one part helper, one part educator, and one part bleeding heart idealist. We've come to a rough working definition in the form of the skills and knowledge one would need, which we categorize as follows:

Programming, because you'll need to tell your computer what to do, usually by writing code.
Computer science, because you'll need to understand how your data is - and should be - structured, as well as the algorithms you use to analyze it.
Math and stats, because everything else in life is just applied math, and numerical results are meaningless without some measure of uncertainty.
Machine learning, because you'll want to build predictive or descriptive models that can learn, evolve, and improve over time.
Social science, because you'll need to know how to design experiments to validate your models in the field, to understand when correlation can plausibly suggest causation, and sometimes even to do causal inference.
Problem and Project Scoping, because you'll need to be able to go from a vague and fuzzy project description to a problem you can solve, understand the goals of the project, the interventions you are informing, the data you have and need, and the analysis that needs to be done.
Project management, because you'll need to make progress as a team, work effectively with your project partner, and make that useful solution actually happen.
Privacy and security, because data is people and needs to be kept secure and confidential.
Ethics, fairness, bias, and transparency, because your work has the potential to be misused or have a negative impact on people's lives, so you have to consider the biases in your data and analyses, the ethical and fairness implications, and how to make your work interpretable and transparent to the users and to the people impacted by it.
Communications, because you'll need to be able to tell the story of why what you're doing matters and the methods you're using to a broad audience.
Social issues, because you're doing this work to help people, and you don't live or work in a vacuum, so you need to understand the context and history surrounding the people, places and issues you want to impact.

All material is licensed under CC-BY 4.0

The links below will help you find things quickly.

DSSG Manual

Summer Overview

This section covers general information on projects, working with partners, presentations, orientation information, and the following schedules:

High level summer plan: details what the goals are for each week of the program
Sample Orientation schedules 2016 and 2022: sample detailed schedules for the first two weeks of the program

Conduct, Culture, and Communications

This section details the DSSG anti-harassment policy, goals of the fellowship, what we hope fellows get out of the experience, the expectations of the fellows, and the DSSG environment. A slideshow version of this can also be found here.

Curriculum

This section details the various topics we will be covering throughout the summer. This includes:

Wiki

In the wiki, you will find information and instructions that people have found helpful along the way. It covers topics like:

Accessing S3 from the command line
Creating an alias to make Python 3 your default (rather than Python 2)
Installing RStudio on your EC2
Killing your query
Creating a custom jupyter setup
Mounting box from ubuntu
Pretty Print psql and less output
Remotely editing text files in your favorite text editor
SQL Server to Postgres
Using rpy2
VNC Viewer

Contributing

This guide is compiled with mkdocs and served with GitHub Pages. When updating the docs, you can serve them locally to test your changes (run from the top level of this repo):

mkdocs serve -f "$(pwd)/mkdocs.yml"

Once you're ready to publish them, you can do so with:

mkdocs gh-deploy -f "$(pwd)/mkdocs.yml"

(Note that a bug in the version of mkdocs we currently use requires specifying the full path to the configuration file, hence the $(pwd) in the command. We should be able to remove this in the future if we update the dependency.)


hitchhikers-guide's Issues

update command for changing column names and get the create table

Show standard, temporal, and "standard-temporal"

For the standard vs temporal cross validation notebook:

Add another common approach: do standard CV up to a point and then a holdout set at the end
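As a rough illustration of the three approaches named in this issue, the sketch below generates index splits for rows that are assumed to be sorted by time. The function names, fold counts, and holdout size are hypothetical choices for this example and are not taken from the notebook.

```python
def standard_folds(n_rows, n_folds):
    """Standard k-fold: every row is tested once, ignoring time order."""
    indices = list(range(n_rows))
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th row, starting at k
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def temporal_folds(n_rows, n_folds):
    """Temporal CV: train on earlier rows, test on the next block of rows."""
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * block))
        test = list(range(k * block, (k + 1) * block))
        yield train, test


def standard_then_holdout(n_rows, n_folds, holdout_size):
    """Standard k-fold on the early rows, plus a final holdout of the latest rows."""
    cutoff = n_rows - holdout_size
    folds = list(standard_folds(cutoff, n_folds))
    holdout = list(range(cutoff, n_rows))
    return folds, holdout
```

In the temporal version every training index precedes every test index, which is the property that standard k-fold does not guarantee.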

Fix Sanergy and Police links in reproducible-ETL directory

the linked repos are private https://github.com/dssg/hitchhikers-guide/tree/master/curriculum/reproducible-ETL

Missing curriculum content: Spatial Analysis Tools

Need content for Ongoing Curriculum: Spatial Analysis Tools

Broken links throughout the repo.

root README

sources/curriculum/programming_best_practices/reproducible-software/README.md

Probably others....

Missing section: Typical Week at DSSG

"Typical Week at DSSG" section of Summer Overview README is blank

update SSH section in software setup file

https://github.com/dssg/hitchhikers-guide/blob/master/curriculum/0_before_you_start/software-setup/README.md

Network tutorial: missing N-1 factor in closeness centrality formula

While the formula for closeness centrality is correct in Introduction_to_Networks_clean.ipynb, it is incorrect in Introduction_to_Networks_Karate.ipynb: a factor of n-1 is missing on the right-hand side, where n is the number of nodes.
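For reference, the usual normalized closeness centrality divides n-1 by the sum of shortest-path distances from a node to all other nodes. The sketch below is a minimal pure-Python version on a hypothetical adjacency-list graph. It is not the notebook's code, and it assumes an unweighted, connected graph with at least two nodes.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def closeness_centrality(adjacency, node):
    """(n - 1) / sum of distances; assumes a connected graph with n >= 2."""
    n = len(adjacency)
    total = sum(bfs_distances(adjacency, node).values())
    return (n - 1) / total


# Hypothetical example: a path graph 0 - 1 - 2
path_graph = {0: [1], 1: [0, 2], 2: [1]}
center_score = closeness_centrality(path_graph, 1)  # (3 - 1) / (1 + 1)
end_score = closeness_centrality(path_graph, 0)     # (3 - 1) / (1 + 2)
```

Without the n-1 factor, the scores would shrink as the graph grows, which makes values from graphs of different sizes hard to compare.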

Consolidate git/github tutorials

There are 2 git tutorials:
curriculum/2_data_exploration_and_analysis/git_and_github
&
curriculum/2_data_exploration_and_analysis/intro_to_git_and_python

Should these be consolidated into a single .md or .ipynb? Or perhaps split into basics and advanced?
Should they be moved to curriculum/4_programming_best_practices?

Network tutorial: wrong shortest path length

The length of the shortest path between nodes 12 and 15 in the Karate network should be:

len(networkx.shortest_path(Graph_Karate, 12, 15)) - 1

because the output of shortest_path includes both the starting and ending nodes. However, the notebooks Introduction_to_Networks_Karate.ipynb and Introduction_to_Networks_clean.ipynb show, instead:

len(networkx.shortest_path(Graph_Karate, 12,15))
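The off-by-one can be seen on a small example. The sketch below uses a hypothetical four-node chain and a plain breadth-first search instead of networkx, so the helper name and the graph are assumptions made for illustration only.

```python
from collections import deque


def shortest_path_nodes(adjacency, source, target):
    """Return one shortest path as a list of nodes, including both endpoints."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor in adjacency[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    if target not in parents:
        raise ValueError("no path between source and target")
    # Walk back from target to source through the recorded parents.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


# Hypothetical chain: a - b - c - d
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
path = shortest_path_nodes(chain, "a", "d")
n_nodes_on_path = len(path)   # counts both endpoints
path_length = len(path) - 1   # number of edges, i.e. the path length
```

networkx also provides shortest_path_length, which returns the number of edges directly and avoids the manual subtraction.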

Multiple SSH keys

I have two machines I would like to use during dssg. Do I need to make two different SSH keys and if so what should I name them? Can I use the same SSH key on both machines?

Missing curriculum content: Feature Generation Workshop

Need content for Ongoing Curriculum: Feature Generation Workshop

Missing curriculum content: Pipelines and Evaluation

Need content for Ongoing Curriculum: Pipelines and Evaluation

setup - anaconda and pycharm

Consolidate README files

Note that the top-level README.md and sources/README.md have overlapping but different content (only the latter is actually in the served hitchhikers guide page). If we want these to overlap, can they both be sourced from the same place (I think we did something similar with triage for a while, which seemed relatively straightforward)? Otherwise, perhaps the github README should provide a link out to the guide itself but otherwise be geared towards people contributing to it?

Missing curriculum content: Model Evaluation

Need content for Ongoing Curriculum: Model Evaluation

Update machine learning to most recent version

https://github.com/dssg/hitchhikers-guide/blob/master/curriculum/3_modeling_and_machine_learning/machine-learning/machine_learning_lecture.pdf

Missing curriculum content: Making the Fellowship

Need content for Orientation Curriculum: Making the Fellowship

Machine Learning Outline

Outline

Explore the Data

Previously univariate data

Clustering

Clusters over time windows, to see if we can group people

Define Features

Clusters over subset of features.

Outcomes

Recidivism 1 or 2 years

Build Classifiers

Models

In-sample testing
- why that is wrong
Holdout set
Go over cross-validation, but we will use temporal cross-validation

Evaluation

Accuracy
Precision-Recall
Confusion Matrix
Feature Importances
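To make the first three evaluation items concrete, here is a minimal sketch that computes a confusion matrix, accuracy, precision, and recall for binary labels. The labels and predictions are made-up toy values, and the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


# Toy data, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
precision = counts["tp"] / (counts["tp"] + counts["fp"])  # of predicted positives, share correct
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # of actual positives, share found
```

Precision and recall are undefined when their denominators are zero, so real code would need to handle a model that predicts no positives or data with no positive labels.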

Feature Engineering

Pre-build features.

Missing curriculum content: Legal Agreements

Need content for Orientation Curriculum: Legal Agreements

Fix typo ("techincal" in week 1 "goal") in high level plan pdf document and re-upload to repo

Here is the link: https://github.com/dssg/hitchhikers-guide/blob/master/dssg-manual/summer-overview/High%20Level%20Plan%20for%20the%20Summer.pdf

Missing curriculum content: Educational Data and Testing (Kevin Wilson)

Need content for Ongoing Curriculum: Educational Data and Testing (Kevin Wilson)

Deleted Google Doc link - update or remove?

Source link on bottom of the page refers to a Google Doc that no longer exists: https://github.com/dssg/hitchhikers-guide/blob/master/curriculum/project-management/README.md

Need to either find correct link or remove link.

Missing curriculum content: The Work We Do

Need content for Orientation Curriculum: The Work We Do

Update githubflow

githubflow needs a README explaining the empty files and github-flow.*

fix software setup tutorial

visualization basics

Missing curriculum content: Open and Closed Data (Jen Helsby)

Need content for Ongoing Curriculum: Open and Closed Data (Jen Helsby)

update tools documentation to also have info on non-tech tools like slack, trello, etc.

Software Setup: PSQL Windows instructions

The instructions used were:

Make sure you have at least one of DBeaver or PostgreSQL installed, as instructed in the prerequisites.
Same as (2) in the Mac instructions, done from the ssh client (e.g. Cygwin) to create the tunnel.
Use DBeaver or the psql terminal (accessible from the Start menu once you install PostgreSQL) to enter the server, port number, username, and database name one by one.

Remove apparently superfluous modules in requirements.txt

It seems like the requirements.txt file here should contain just those requirements required to compile and serve the documentation, but there are many packages like sklearn, spacy, psycopg2-binary that seem unlikely to be needed for that task (but perhaps are related to example code/notebooks provided here). It may be helpful to provide a requirements file for running the code examples, but that should be separate from the main requirements file for the repo itself.

Start with Project Repo

Make changes

Commit

-- diff
-- git rm
-- git mv

Push

Pull