
ds4humans's People

Contributors

joshclinton, nickeubank

ds4humans's Issues

Exercise: Descriptive / Prescriptive

  • Give them some prescriptive problems. Ask them to come up with descriptive questions they might want to answer to help.
  • Differentiate descriptive and prescriptive questions.
  • Ask whether data scientists have "privileged authority" to speak to either class of question.

If you wanna see where I'm at...

It occurs to me that because I'm mostly pushing my work directly to the main branch, you don't actually have a lot of visibility into what I'm up to (sorry!).

If you wanna see where I'm at, so far I've re-written all of these readings:

[image: list of re-written readings]

And I'm working on passive-prediction questions now.

I've taken a "fuck figuring out the ideal audience, just write it for the class I'm teaching now" approach, which means what's there assumes that the readers have taken a basic statistical modeling course and are taking an ML course concurrently. It's what I need right now, so I figure start there and then we can reshape/organize for the future.

But if you want to read through it, I'd be very curious what you think. It is finally starting to feel like this kinda vague framework I've been working to express for a while is coming together. It's definitely not "a social scientist does data science" book, but rather (I hope!) a real data science book that tries to put all perspectives on similar footing in terms of usefulness.

Kyle: soften motivation language a little in EDA rant?

Kyle:

In the EDA reading, you have one paragraph that I was reflecting on:

“This is problematic because any activity that involves data but lacks a clear motivation is doomed to be unending and unproductive. Data science has emerged precisely because our datasets are far too complex for us to understand directly; indeed, I would argue that the job of a data scientist can be summed up, in part, as a person who identifies meaningful patterns in our data and makes them comprehensible.”

My question on this is whether or not a clear motivation is necessary for a preliminary analysis/EDA? A data scientist might “explore” the data in the process of conducting an EDA (whether directed or undirected); won’t what matters be what they synthesize and then present to the stakeholder?

Me:

This is definitely the branch I'm exploring walking out on, and I recognize that in taking such a strong position, I'm sure I'm not quite right. But given our students are coming from the opposite extreme, I'm feeling out the more extreme position on the other side.

With that said, I am finding it pretty compelling. To some degree I think it depends on your definition of "clear motivation" and how narrow-precise an interpretation one takes. I don't expect people to dive into their data with most of their paper already written and a need for nothing but the values to plug into their tables (if they're the research-paper-writing sort); but I do think you need a clear sense of what matters in the sense of "what outcomes are problem-relevant? what independent variables might you have leverage to manipulate (if your goal is to be impactful)?"

Put differently, what you synthesize and present to the stakeholder is the conclusion or answer to a question, but the metric for whether something is substantively significant comes from the stakeholder's problem.

But your point is well taken and I'll keep thinking about it.

Kyle:

That perspective makes perfect sense. Erring on the side of overcorrection from many students’ current default of approaching an EDA like a treasure hunt without a map is reasonable. I agree it hinges around “how narrow-precise” the interpretation of what a “clear motivation” is. If there’s some wiggle room to make that a bit less narrow-precise, then the paradigm you present here feels very well-composed.

Let me know

If you wanna get on Zoom some time to talk about notebooks / how this book works, let me know.

Intro reading: Add rain/umbrella

Students don't quite grok the distinction between passive prediction and causal questions. Add the rain example? Umbrella use (passively) predicts rain (if we see one, we expect the other), but there's no causal relationship (if I were to manipulate umbrella use, it wouldn't cause rain / we wouldn't "predict" rain to start).
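The umbrella example could even be shown to students as a toy simulation (hypothetical data-generating process, invented probabilities, just for illustration): rain causes umbrella use, so umbrellas are a great passive predictor of rain, but forcing everyone to carry an umbrella changes nothing about the weather.

```python
import random

random.seed(42)

def simulate(intervene_umbrella=None, n=10_000):
    """Toy world: rain causes umbrella use; umbrella use never causes rain."""
    rainy = umbrella = rainy_and_umbrella = 0
    for _ in range(n):
        rain = random.random() < 0.3               # rain happens on its own
        if intervene_umbrella is None:
            umb = rain and random.random() < 0.9   # people react to rain
        else:
            umb = intervene_umbrella               # we force umbrella use
        rainy += rain
        umbrella += umb
        rainy_and_umbrella += rain and umb
    p_rain = rainy / n
    p_rain_given_umbrella = rainy_and_umbrella / umbrella if umbrella else 0.0
    return p_rain, p_rain_given_umbrella

# Observationally: P(rain | umbrella) is far above the base rate --
# umbrellas "predict" rain.
base_rate, rain_given_umbrella = simulate()

# Interventionally: forcing umbrella use leaves the rain rate unchanged.
rate_when_forced, _ = simulate(intervene_umbrella=True)
```

Seeing `rain_given_umbrella` near 1 while `rate_when_forced` stays at the base rate is exactly the passive-prediction vs. causal gap in two lines.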

Good correlation not causation example: blood pressure and surgery complications

Suppose high blood pressure predicts surgical complications. Maybe it's the high blood pressure itself; maybe people with three jobs who take public transit are just leading more stressful lives, and the stress drives both.

So blood pressure is a great predictor of complications; but would complications respond to blood-pressure drugs?

(And does that matter if we're just targeting patients for follow-up care?)
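The pure-confounding version of this example could be sketched as a simulation (entirely made-up numbers; in this toy model blood pressure has no direct effect at all, which is the extreme case):

```python
import random

random.seed(0)

def patient(bp_drug=False):
    """Toy model: latent stress raises both blood pressure and
    complication risk; blood pressure itself does nothing here."""
    stress = random.random()                    # unobserved confounder
    high_bp = (stress > 0.6) and not bp_drug    # drug lowers BP, not stress
    complication = random.random() < 0.1 + 0.4 * stress
    return high_bp, complication

def rates(n=50_000, bp_drug=False):
    comp_high = comp_low = n_high = n_low = 0
    for _ in range(n):
        hb, comp = patient(bp_drug)
        if hb:
            n_high += 1
            comp_high += comp
        else:
            n_low += 1
            comp_low += comp
    overall = (comp_high + comp_low) / n
    p_high = comp_high / n_high if n_high else float("nan")
    return p_high, comp_low / n_low, overall

# Observationally, high BP strongly predicts complications...
p_high, p_low, baseline = rates()

# ...but a drug that only lowers BP leaves the complication rate untouched.
_, _, treated = rates(bp_drug=True)
```

Great for prediction (and for targeting follow-up care), useless as an intervention target.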

added R tutorials

Will need to group them better so their list in the table-of-contents sidebar on the left isn't so long, but it was fun to put them in place!

intro: causal & why?

Adriane:

Passive prediction doesn't care about "why", right? Whether you understand why a patient is likely to get cancer or not is irrelevant to your task. With enough data, you can put anything in your model; you don't even need to think about whether it's a good idea to put it in the model. A good prediction is a good prediction regardless of your understanding. Causality begins to move you closer to questions of why: you likely have at least an inkling of an understanding in order to choose to manipulate a cause, or to think about a cause in the first place.

Potential Examples (with data)

2020 (or 2022) Exit Polls for a state - "Gender" gap?

  • Good for data wrangling, mutation, if else, missing data, univariate descriptives (mean, median, mode), data types
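A minimal pandas sketch of the kind of exercise this bullet imagines (invented toy rows, not actual exit-poll data): drop a missing response, recode with an if/else-style `np.where`, and compute a univariate descriptive by group.

```python
import numpy as np
import pandas as pd

# Toy exit-poll rows (invented values), with one missing response to handle.
exit_poll = pd.DataFrame({
    "gender": ["woman", "man", "woman", "man", "woman"],
    "dem_vote": [1, 0, 1, np.nan, 0],
})

# Missing-data handling first, then an if/else-style recode.
complete = exit_poll.dropna(subset=["dem_vote"]).copy()
complete["voted_dem"] = np.where(complete["dem_vote"] == 1, "yes", "no")

# Univariate descriptive: Democratic vote share by gender -- the "gender gap".
gap = complete.groupby("gender")["dem_vote"].mean()
```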

Pre-Election Polls: 2020 (but have all data since 2008) (but also used in Imai)

  • Good for looping/group_by, error/uncertainty, prediction
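The group_by/uncertainty piece could look something like this sketch (toy poll numbers, not the actual 2008–2020 files): compute poll error, then group by year and summarize its mean and spread.

```python
import pandas as pd

# Toy pre-election poll results (made-up numbers, just to show the pattern).
polls = pd.DataFrame({
    "year": [2008, 2008, 2012, 2012, 2016, 2016, 2020, 2020],
    "state": ["NC", "OH"] * 4,
    "dem_share": [49.9, 51.2, 48.5, 50.1, 46.8, 45.9, 49.3, 45.2],
    "actual_dem": [49.7, 51.5, 48.4, 50.7, 46.2, 43.6, 48.6, 45.2],
})

# Poll error by year: group, aggregate, and summarize uncertainty.
polls["error"] = polls["dem_share"] - polls["actual_dem"]
by_year = polls.groupby("year")["error"].agg(["mean", "std"])
```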

Election Returns: County level & state level over time

  • visualization, prediction, merging (demographics), long vs wide data? mapping?
  • segmentation? X different Americas (segmentation analysis of counties based on demographics & voting)
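For the long-vs-wide piece, a hedged pandas sketch (invented county values) of reshaping returns in both directions with `pivot` and `melt`:

```python
import pandas as pd

# Toy county-level returns in long format (invented values).
long = pd.DataFrame({
    "county": ["Durham", "Durham", "Wake", "Wake"],
    "year": [2016, 2020, 2016, 2020],
    "dem_share": [77.8, 80.3, 57.4, 62.3],
})

# Long -> wide: one row per county, one column per year.
wide = long.pivot(index="county", columns="year", values="dem_share")

# Wide -> long again with melt (after resetting the index).
back = wide.reset_index().melt(
    id_vars="county", var_name="year", value_name="dem_share"
)
```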

Covid incidence

  • Measurement issues, mapping? (Correlation with vote returns - correlation vs causality)

Trump Twitter data, House Twitter data

  • Text as data? classification, "sentiment analysis"

Federalist Papers (but from Imai)

  • Classification, prediction

Open-ended responses from JDC Covid Survey

  • classification/description

Survey data for segmentation study for NBC. ("6 voter types")

  • classification

Crime/Incarceration data?

Voter file data

  • North Carolina? All?
