matteocourthoud / blog-posts
Code and notebooks for my Medium blog posts
I love this series of blog posts. Thanks for writing them!
I'm trying to get some of these notebooks to run, and I'm struggling to get versions of the packages to play well together. Could you push a requirements.txt, a poetry .toml, or the output of pip freeze? I'm using poetry and would be happy to contribute a working .toml file once I have one.
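Until then, for anyone pinning versions themselves, a minimal sketch of how such a file is usually produced, assuming the environment the notebooks originally ran in is still available (the poetry variant is hypothetical, since the repo isn't poetry-managed):

```shell
# Capture the exact package versions installed in the current environment
pip freeze > requirements.txt

# Or, if the project were managed with poetry (hypothetical here):
# poetry export -f requirements.txt --output requirements.txt
```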
ModuleNotFoundError: No module named 'src.utils'
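In case it helps others hitting this: the usual cause is running a notebook from a subfolder, so the repo root (the directory containing src/) is not on the import path. A minimal sketch of a workaround, assuming the repo was cloned as Blog-Posts (adjust the path to your setup):

```python
import sys
from pathlib import Path

# Hypothetical location of the cloned repo; point this at wherever
# Blog-Posts actually lives on your machine
repo_root = Path("Blog-Posts").resolve()

# Put the repo root on the import path so `src` is found as a package
sys.path.insert(0, str(repo_root))

# After this, `from src.utils import ...` should resolve,
# assuming src/ contains the utils module
```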
Hey Matteo -
Thank you for your blog post on the Bayesian bootstrap! I've found it quite helpful in adapting it to my own problems and in better understanding the differences between the Bayesian and classic bootstrap.
I was trying to replicate your analysis by rewriting some of the code, and I noticed that in the two-level sampling part of your blog, you oversample from the dataframe 10x (cell 19). This oversampling is the reason you get a more precise / narrower posterior distribution, not just the use of the Bayesian bootstrap. You can check this yourself by oversampling in your classic bootstrap procedure, which results in this:
Within the wider context of the blog post, I think you do need to oversample to account for the rare-event cases you describe later on. If you don't oversample, some of your samples simply won't contain the rare event. You could try this yourself with a regression that cannot take observation weights directly and therefore requires the two-level sampling procedure. Because that procedure actually resamples, you would also get instances where the model cannot be fit at all, or where the parameter estimates end up at extreme values.
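To make the point concrete, here is a minimal sketch (with made-up exponential data standing in for the notebook's dataframe) showing that oversampling 10x in a plain classic bootstrap already narrows the distribution of bootstrap means, independently of anything Bayesian:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=100)  # hypothetical data, stands in for the notebook's dataframe

def classic_bootstrap_means(x, n_boot=1000, oversample=1):
    """Classic bootstrap of the mean, optionally drawing oversample * len(x) rows."""
    n = len(x) * oversample
    return np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(n_boot)])

plain = classic_bootstrap_means(data)
over = classic_bootstrap_means(data, oversample=10)

# The oversampled bootstrap distribution is artificially narrower
# (by roughly a factor of sqrt(10))
print(plain.std(), over.std())
```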
The line:
from src.dgp import dgp_educ_wages
throws me this error:
ImportError: cannot import name 'dgp_educ_wages' from 'src.dgp' (/content/Blog-Posts/src/dgp.py)
And indeed, searching through the file dgp.py, I could not find 'dgp_educ_wages'.
This is a great notebook; I enjoyed reading it. I do have one question that is really bugging me.
It is established that creating an auxiliary variable of revenue divided by cost:
import statsmodels.formula.api as smf

df["rho"] = df["revenue"] / df["cost"]
smf.ols("rho ~ new_machine", df).fit().summary().tables[1]
does not represent
But if this is the case, how does the regression at the end, conducted on an auxiliary variable, which is essentially
df["revenue"] - df["cost"]
and a couple of constants, adequately represent essentially
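For what it's worth, a minimal simulated sketch (made-up normal revenue and cost figures, not from the notebook) of why the two auxiliary variables behave so differently: the mean of a ratio is not the ratio of the means, whereas the difference is linear and averages cleanly, so a regression on it stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(1)
revenue = rng.normal(10, 2, size=100_000)
cost = rng.normal(5, 1, size=100_000)

# Mean of the ratio vs ratio of the means: these disagree
mean_of_ratio = (revenue / cost).mean()
ratio_of_means = revenue.mean() / cost.mean()

# The difference is linear, so its mean equals the difference of means
mean_of_diff = (revenue - cost).mean()
diff_of_means = revenue.mean() - cost.mean()

print(mean_of_ratio, ratio_of_means)  # disagree
print(mean_of_diff, diff_of_means)    # agree
```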
Hi,
First of all - I found your post on CUPED and comparisons to diff-in-diff and a simple regression with covariates very useful. I'm in a process of "upgrading" how we analyse experiments at my workplace and your work has helped a lot to clarify things.
However, there's one thing that bothered me in your post: the chart below (and the associated table just beneath it):
According to it, the autoregression has the highest variance of all the methods, which I found very counter-intuitive. Surely it would not perform worse than a simple t-test. The text in the article also suggests otherwise, which made me wonder whether some strange mistake or issue occurred when the blog post was rendered.
I just cloned your repo and re-ran the notebook, and indeed - I get results that I would expect:
I'm not sure what exactly happened, but it would be great to have those corrected! I'm sure I'm not the only one who found your blog posts helpful, and another reader might otherwise take away the wrong conclusion (that autoregression is really bad).
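For readers comparing the methods, a minimal sketch of the CUPED adjustment itself on simulated data (all names and parameters here are made up, not taken from the post): the pre-experiment covariate is used to strip predictable variance out of the outcome before comparing treatment and control.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)                 # pre-experiment metric
t = rng.integers(0, 2, size=n)         # treatment assignment
y = 0.1 * t + x + rng.normal(size=n)   # outcome correlated with x

# CUPED: y_cuped = y - theta * (x - mean(x)), theta = cov(x, y) / var(x)
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

# The adjusted outcome has lower variance, so the difference in means
# between treatment groups is estimated more precisely
print(y.var(), y_cuped.var())
```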