
blog-posts's People

Contributors

matteocourthoud, mksalawa


blog-posts's Issues

Question on Conclusion of this ROI Notebook

This is a great notebook, enjoyed reading it. I do have one question that is really bugging me.

It is established that creating an auxiliary variable of revenue divided by cost and regressing it on the treatment:

df["rho"] = df["revenue"] / df["cost"]
smf.ols("rho ~ new_machine", df).fit().summary().tables[1]

does not represent $\frac{\Delta R}{\Delta C}$

But if that is the case, how does the regression at the end, conducted on an auxiliary variable that is essentially
df["revenue"] - df["cost"] plus a couple of constants, adequately represent $\Delta R - \Delta C$? Isn't this the same thing in concept as the ratio above?

Bayesian bootstrap is not more precise after accounting for oversampling

Hey Matteo -

Thank you for your blog post on the Bayesian bootstrap! I've found it quite helpful in adapting it to my own problems and in gaining a better understanding of the differences between the Bayesian and classic bootstrap.

I was trying to replicate your analysis by rewriting some of the code, and I noticed that in the two-level sampling part of your blog you oversample from the dataframe 10x (cell 19). This is the reason you get a more precise, narrower posterior distribution, not just the use of the Bayesian bootstrap. You can check this yourself by oversampling in your classic bootstrap procedure, which results in this:

[image: classic bootstrap distribution with 10x oversampling]

Within the wider context of the blog post, I think you do need to oversample to account for the rare-event cases you describe later in the post. If you don't oversample, there will be draws in which the rare event never appears. You could check this yourself with a regression model that cannot take weights and therefore requires the two-level sampling procedure; since that procedure actually resamples, some draws may fail to fit the model or produce parameter estimates at extreme values.
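To illustrate the point with a hedged sketch (this is not the notebook's code, and the data here are invented): resampling from a 10x-oversampled dataframe mechanically shrinks the spread of a classic bootstrap distribution by roughly a factor of three (the square root of the oversampling factor), independently of whether the weights are Bayesian.

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)   # stand-in for the original sample
oversampled = np.repeat(data, 10)             # 10x oversampling, as in cell 19

def classic_bootstrap_means(x, n_boot=2_000):
    # Classic bootstrap: resample with replacement and record the mean of each resample.
    return np.array([rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)])

print("sd without oversampling:", classic_bootstrap_means(data).std())
print("sd with 10x oversampling:", classic_bootstrap_means(oversampled).std())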

Seemingly wrong chart in the published versions of the CUPED notebook

Hi,

First of all, I found your post on CUPED and its comparisons to diff-in-diff and a simple regression with covariates very useful. I'm in the process of "upgrading" how we analyse experiments at my workplace and your work has helped a lot to clarify things.

However, there's one thing that bothered me in your post: the chart below and the associated table just beneath it.

[image: published chart]

[image: published table]

According to it, the autoregression has the highest variance among all the methods, which I found very counter-intuitive. Surely it would not perform worse than a simple t-test. The text in the article also suggests otherwise, which made me wonder if there was some strange mistake or issue when the blog post was rendered.

I just cloned your repo and re-ran the notebook, and indeed I get the results I would expect:

[image: re-run chart]

[image: re-run table]

I'm not sure what exactly happened - but it would be great to have those corrected! I'm sure I am not the only one who found your blog posts helpful, and another person may take away the wrong conclusion (that auto-regression is really bad).
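For reference, here is a hedged simulation sketch (not the notebook's code; the data-generating process is invented) of why conditioning on the pre-experiment outcome should reduce, not inflate, the variance of the estimated effect relative to a simple difference in means:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
diff_in_means, with_pre_covariate = [], []

# Repeat a small experiment many times and compare the spread of the two estimators.
for _ in range(500):
    n = 500
    treated = rng.integers(0, 2, n)
    y_pre = rng.normal(0, 1, n)                                   # pre-experiment outcome
    y_post = 0.8 * y_pre + 0.2 * treated + rng.normal(0, 0.5, n)  # true effect = 0.2
    df = pd.DataFrame({"treated": treated, "y_pre": y_pre, "y_post": y_post})

    diff_in_means.append(smf.ols("y_post ~ treated", df).fit().params["treated"])
    with_pre_covariate.append(smf.ols("y_post ~ treated + y_pre", df).fit().params["treated"])

print("sd, difference in means:  ", np.std(diff_in_means))
print("sd, regression with y_pre:", np.std(with_pre_covariate))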

Requirements file

I love this series of blog posts. Thanks for writing them!

I'm trying to get some of these notebooks to run and I'm struggling to get versions of the packages to play well together. Could you push a requirements.txt, a pyproject.toml, or the output of pip freeze? I'm using poetry and would be happy to contribute a working .toml file once I have it.
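In the meantime, a minimal sketch of how the pinned versions could be collected; the package list below is a guess based on the imports visible in the notebooks, not the repository's actual dependency set:

from importlib.metadata import version, PackageNotFoundError

# Guessed package list; adjust to match the notebooks' actual imports.
packages = ["numpy", "pandas", "statsmodels", "matplotlib", "seaborn"]

for pkg in packages:
    try:
        print(f"{pkg}=={version(pkg)}")   # lines in requirements.txt format
    except PackageNotFoundError:
        print(f"# {pkg} is not installed")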

ERROR: cannot import name 'dgp_educ_wages' from 'src.dgp'

The line:
from src.dgp import dgp_educ_wages

throws me this error:
ImportError: cannot import name 'dgp_educ_wages' from 'src.dgp' (/content/Blog-Posts/src/dgp.py)

And indeed, searching through the file dgp.py, I could not find 'dgp_educ_wages'.
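A quick way to check which generators the module actually exposes (a sketch, assuming the notebook runs from the repository root so that src is importable):

# List the dgp_* data-generating functions defined in src/dgp.py.
import src.dgp as dgp

available = [name for name in dir(dgp) if name.startswith("dgp")]
print(available)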

dag_collections and dag class imports

Hi there, I was trying to run some of the notebooks to follow along. However, I couldn't seem to import the functions/classes correctly.

[image: import error]

Is the current dag/folder hierarchy functional for the notebook, or am I missing something?
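In case it is a path issue, a hypothetical workaround (assuming the notebook is run from a subfolder of the repo) is to put the repository root on sys.path so that src and its modules resolve:

import sys
from pathlib import Path

# Adjust this if the notebook lives somewhere else in the repository.
repo_root = Path.cwd().parent
sys.path.append(str(repo_root))

# Imports of the form `from src.dgp import ...` should now resolve.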
