aeturrell / coding-for-economists
This repository hosts the code behind the online book, Coding for Economists.
Home Page: https://aeturrell.github.io/coding-for-economists
License: MIT License
One section says:
High-level languages like Python and R do not get compiled into highly performant machine code ahead of being run, unlike C++ and FORTRAN. What this means is that although they are much less unwieldy to use, some types of operation can be very slow–and for loops are particularly cumbersome. (Although you may not notice this unless you’re working on a bigger computation.)
But there is a way around this, and it’s with something called a list comprehension. These can combine what a for loop and a condition do in a single line of efficiently executable code. Say we had a list of numbers and wanted to filter it according to whether the numbers divided by 3 or not:
Public sector data science colleagues have pointed out that this isn't right. List comprehensions can actually be slower, and it's not about compilation of code. See for example this article, this SO post, and this video (tl;dr: which one is faster isn't constrained by the spec, so their relative performance can change with every version, and in fact it does).
It was also noted that most time is spent reading rather than optimising code, and list comprehensions are arguably a clearer pattern.
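One way to check this on your own Python version is to time both patterns directly. The snippet below is only a sketch: the function names, the list size, and the number of repetitions are arbitrary choices for illustration, and the timings will differ between machines and interpreter versions.

```python
import timeit

numbers = list(range(10_000))


def filter_with_loop(values):
    """Keep the values divisible by 3 using an explicit for loop."""
    result = []
    for value in values:
        if value % 3 == 0:
            result.append(value)
    return result


def filter_with_comprehension(values):
    """Keep the values divisible by 3 using a list comprehension."""
    return [value for value in values if value % 3 == 0]


# Both approaches should produce exactly the same list.
assert filter_with_loop(numbers) == filter_with_comprehension(numbers)

# Time each approach; the absolute numbers depend on machine and Python version.
loop_time = timeit.timeit(lambda: filter_with_loop(numbers), number=200)
comprehension_time = timeit.timeit(
    lambda: filter_with_comprehension(numbers), number=200
)

print(f"for loop:           {loop_time:.4f} s")
print(f"list comprehension: {comprehension_time:.4f} s")
```

Whichever comes out faster, the difference is usually small enough that readability is the better reason to choose one over the other.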
@aeturrell Thanks for the fantastic work!
Just to let you know that I tried to access the link in the description, but it's not working.
Cheers!
Various issues:
Thanks for this amazing resource!
By the way, re:
…the number of Baths is a floating point number rather than an integer (is it possible to have half a bathroom? Maybe, but it doesn't sound very private), and there are some NaNs in there too. It's not clear what the fractional values of bathrooms mean (including from the documentation) so we'll just have to take care with that variable.
It's very much the norm in American real estate to count a bathroom with only a toilet and sink as a "half bath" (and sometimes one with a shower but no bathtub as a "three-quarter bath," which also shows up in the data). Nothing surprising in that data ;)
When running environment.yml from the Anaconda Prompt I get the following:
PackagesNotFoundError: The following packages are not available from current channels:
- datatable
Current channels:
- https://conda.anaconda.org/oxfordcontrol/win-64
- https://conda.anaconda.org/oxfordcontrol/noarch
- http://conda.anaconda.org/gurobi/win-64
- http://conda.anaconda.org/gurobi/noarch
- https://repo.anaconda.com/pkgs/main/win-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/win-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://repo.anaconda.com/pkgs/msys2/win-64
- https://repo.anaconda.com/pkgs/msys2/noarch
I'm pretty sure this is because datatable needs to be installed with pip rather than conda.
A less likely explanation could be operating system dependency (I'm on Windows 10), in which case appending --no-builds may be a solution.
I have moved datatable to the end of the pip section as below:
  - pip:
    - specification_curve
    - twopiece
    - stargazer
    - matplotlib-scalebar
    - black-nb
    - pyhdfe
    - skimpy
    - dataprep
    - graphviz
    - pygraphviz
    - ruptures
    - deadlinks
    - datatable
This works, but the current set of dependencies seems to have conflicts, as I get:
Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
You may be wondering why Lets-Plot isn’t featured here: its functions have almost exactly the same names as those in lets-plot, and we have opted to include the latter as it is currently the more mature plotting package.
Did you mean "You may be wondering why plotnine isn’t featured here..."?
It may make more sense to start with scatters before introducing faceted scatters
In this section: Connected scatter plot
Req: Lets-Plot v4.3.0
Problem: arrowheads are sunk into circles.
Solution: use the "spacer" option with the value 5 (i.e. the point size in this chart) + 1 (to account for the circle stroke):
(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),  # <-- Slightly smaller arrow (was 20)
        spacer=5 + 1,  # <-- The spacer!
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)
Just as an option: geom_curve() often looks nicer :)
(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_curve(  # <-- New!
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),
        spacer=5 + 1,  # <-- The spacer!
        curvature=-0.1,  # <-- Not too curved.
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)
See this issue.
Hi @aeturrell, the pyfixest version with which you ran the Coding for Economists regression chapter did not report R2 values, only the RMSE. If you upgrade to pyfixest 0.14.0, this should be fixed =)
Best, Alex
Errors can be seen here:
https://aeturrell.github.io/coding-for-economists/vis-intro.html#categorical-data
IndexError: index 0 is out of bounds for axis 0 with size 0
Although the instructions for installing the environment and packages are fairly straightforward, it would be good to have a start-up script that also handled extras such as the installation of nltk and spacy models.
In section Common Plots / Marginal histograms you could replace the code for Lets-Plot with
from lets_plot.bistro.joint import *
(
    joint_plot(penguins, x="bill_length_mm", y="bill_depth_mm", reg_line=False)
    + labs(
        x="Bill length (mm)",
        y="Bill depth (mm)",
    )
)
This simplifies the code a bit and uses the function that is designed for the task at hand.
It's already been replaced in PR #43, if you prefer that way of updating code.
See this link for a good example in matplotlib.
In Working With Data there is an exercise:
Create a pandas dataframe using the data=, index=, and columns= keyword arguments. The data should consist of one column with ascending integers from 0 to 5, the column name should be “series”, and the index should be the first 5 letters of the alphabet. Remember that the index and columns keyword arguments expect an iterable of some kind (not just a string).
I believe this is impossible due to there being 6 integers 0 to 5 but only 5 letters of the alphabet.
import pandas as pd
data = {"Series": list(range(6))}
index = list("abcdef")
df = pd.DataFrame(data=data, index=index)
print(df)
I believe this satisfies the exercise requirements.
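For what it's worth, here is a minimal sketch that also uses the columns= keyword and the lowercase column name "series" that the exercise asks for. It assumes the intended index is the first six letters of the alphabet, so that it matches the six integers 0 to 5; that reading of the exercise is an assumption, not something the book confirms.

```python
import pandas as pd

# Six ascending integers, 0 to 5 inclusive.
values = list(range(6))

# Assumption: six index labels are needed to match the six values.
letters = list("abcdef")

# columns= expects an iterable of names, hence the one-element list.
df = pd.DataFrame(data=values, index=letters, columns=["series"])
print(df)
```

If the exercise instead intends five rows, the same call works with list(range(5)) and list("abcde").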
Look into pros and cons of adding watermarks to scripts using watermark, eg as PyMC3 do for their examples.
%load_ext watermark
%watermark -n -u -v -iv -w
See the Hotkey list from https://github.com/tchapi/markdown-cheatsheet for examples.
In section Common Plots / Pyramid there are a few issues with the plot:
Clipped labels: unfortunately, the 20 character limit is hardcoded, so y labels are cut off. The full text can still be seen in the axis tooltip.
Odd-looking tooltips on top of the pyramid: to improve how the tooltips display, I suggest not using the identity statistic; instead, you can pass the user counts as a weight, as shown below:
g = (
    ggplot(df, aes(x="Stage", y="Users", fill="Gender", weight="Users"))
    + geom_bar(width=0.8)  # baseplot
    + coord_flip()  # flip coordinates
    + theme_minimal()
    + ylab("Users (millions)")
)
g
It's already been replaced in PR #43, if you prefer that way of updating code.
The Lets-Plot library is not mentioned in the Geo-Spatial Visualization section. However, it can work with cartographic data. Detailed information about geocoding can be found here.
Basic example with UK districts:
from lets_plot.geo_data import *
country = geocode_counties().scope('UK').inc_res().get_boundaries()
ggplot() + geom_map(data=country, show_legend=False, size=0.2)
Also, you can add an interactive basemap layer to create a beautiful map:
(
    ggplot()
    + geom_livemap()
    + geom_map(aes(fill="found name"), data=country, show_legend=False, size=0.2)
)
There is an error message when using the ' ::rocket:: -> Binder' option on pages with code. This does not appear to be a main/master issue, but to do with the URL that JupyterBook uses to load a given Binder page. Rather than (for example)
being loaded, instead
gets loaded (with some parts of the path repeated).
There may be good examples on the Jupyter Book website.
As someone somewhat experienced with EDA and data cleaning but new-ish to this work in python, I was interested in learning your (perhaps pythonic) solution:
start_code = 16436
end_code = df["Date"].max() + 1  # +1 because of how ranges are computed; we want to *include* the last date
datetime_dict = dict(
    zip(
        range(start_code, end_code),
        pd.date_range(start="2005/01/01", periods=end_code - start_code),
    )
)
df["datetime"] = df["Date"].apply(lambda x: datetime_dict[x])
but thought I'd mention that the solution that occurred to me first and seems perhaps easier both to develop and explain was
def convert_date(d):
    # Offset the base date by the number of days elapsed since code 16436
    return pd.to_datetime("01-01-2005") + pd.DateOffset(d - 16436)


df["datetime"] = df["Date"].apply(convert_date)
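A vectorised variant is also possible, offered here only as a sketch. It keeps the same mapping as the code above (code 16436 corresponds to 2005-01-01) and avoids apply altogether. The toy df below is made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: integer day codes like those in the "Date" column.
df = pd.DataFrame({"Date": [16436, 16437, 16801]})

START_CODE = 16436  # taken from the code above: this code maps to 2005-01-01
START_DATE = pd.Timestamp("2005-01-01")

# Convert the day offsets to timedeltas in one go, then add them to the base date.
df["datetime"] = START_DATE + pd.to_timedelta(df["Date"] - START_CODE, unit="D")
print(df)
```

Because the arithmetic is done on the whole column at once, this tends to scale better than a row-by-row apply on large frames.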
Hi @aeturrell , please see my comment in the associated PR: #69.
Best, Alex
In section Common Plots / Overlapping Area plot you could replace the code for Lets-Plot with
(
    ggplot(
        planets.groupby(["year", "method"])["number"].sum().reset_index(),
        aes(x="year", y="number", fill="method", group="method", color="method"),
    )
    + geom_area(alpha=0.5)
    + scale_x_continuous(format="d")
)
This produces a nicer-looking plot.
It's already been replaced in PR #43, if you prefer that way of updating code.
Also, make clear that pass is a special word
In column-and-row-exercises, there is a missing question mark:
- Compare air_time with arr_time - dep_time. What do you expect to see? What do you see**?** What do you need to do to fix it?
There are no Lets-Plot examples in section Common Plots / Ridge, or 'joy', plots, but the library does have a suitable function for it: geom_area_ridges(). You can add the following code:
final_year = df["Year"].max()
first_year = df["Year"].min()
breaks = [y for y in list(df.Year.unique()) if y % 10 == 0]
(
    ggplot(df, aes("Anomaly", "Year", fill="Year"))
    + geom_area_ridges(scale=20, alpha=1, size=0.2, trim=True, show_legend=False)
    + scale_y_continuous(breaks=breaks, trans="reverse")
    + scale_fill_viridis(option="inferno")
    + ggtitle(
        "Global daily temperature anomaly {0}-{1} \n(°C above 1951-80 average)".format(
            first_year, final_year
        )
    )
)
It's already been replaced in PR #43, if you prefer that way of updating code.
Some pages, eg on narrative data visualisation (which uses 'varta'), need special fonts. These are not currently available in the Dockerfile.
In principle, this is possible and some example code to achieve it would be:
FROM continuumio/miniconda3:4.10.3-alpine
WORKDIR /app
COPY ./my-custom-font.ttf ./
RUN mkdir -p /usr/share/fonts/truetype/
RUN install -m644 my-custom-font.ttf /usr/share/fonts/truetype/
RUN rm ./my-custom-font.ttf
But it would be good to pull the font directly from a website, eg using Google fonts.
This article goes into detail of how to install fonts in docker containers:
https://axellarsson.com/blog/install-fonts-in-docker-containers/
In section Common Plots / Connected scatter plot you could replace the code for Lets-Plot with
path_df = (
    df.iloc[:-1]
    .reset_index(drop=True)
    .join(df.iloc[1:].reset_index(drop=True), lsuffix="_from", rsuffix="_to")
)
(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=20, angle=15),
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)
This produces a nicer-looking plot.
It's already been replaced in PR #43, if you prefer that way of updating code.
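To make the path_df step easier to follow, here is a small pandas-only sketch of what the join produces. The three rows of data are invented for illustration; only the column names and the join pattern come from the code above.

```python
import pandas as pd

# Hypothetical values, purely to show the shape of the result.
df = pd.DataFrame(
    {
        "Year": [2001, 2002, 2003],
        "Unemployment": [5.0, 5.2, 5.1],
        "Vacancies": [2.5, 2.3, 2.4],
    }
)

# Pair each row (the segment start) with the following row (the segment end).
starts = df.iloc[:-1].reset_index(drop=True)
ends = df.iloc[1:].reset_index(drop=True)

# Every column name overlaps, so each one receives the "_from" or "_to" suffix.
path_df = starts.join(ends, lsuffix="_from", rsuffix="_to")
print(path_df)
```

Each row of path_df then describes one arrow: it starts at the "_from" coordinates and ends at the "_to" coordinates, so n observations give n - 1 segments.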
In section Common Plots / Contour Plot you could replace the code for Lets-Plot with
contour_data = {'x': X.flatten(), 'y': Y.flatten(), 'z': Z.flatten()}
(
    ggplot(contour_data)
    + geom_contourf(aes(x="x", y="y", z="z", fill="..level.."))
    + scale_fill_viridis(option="plasma")
    + ggtitle("Maths equations don't currently work")
)
This produces a nicer-looking plot.
It's already been replaced in PR #43, if you prefer that way of updating code.
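The snippet above relies on X, Y and Z already existing as 2D grids. As a rough sketch of how such grids could be built and flattened into the long format used for contour_data, assuming NumPy and a made-up surface (the Gaussian bump and the grid ranges below are hypothetical, not the function used in the book):

```python
import numpy as np

# Hypothetical grid ranges and resolution.
x = np.linspace(-3.0, 3.0, 50)
y = np.linspace(-2.0, 2.0, 40)

# meshgrid returns 2D arrays of shape (len(y), len(x)).
X, Y = np.meshgrid(x, y)

# Hypothetical surface, chosen only for illustration.
Z = np.exp(-(X**2 + Y**2))

# Flatten each grid so that every (x, y, z) triple becomes one row of data.
contour_data = {"x": X.flatten(), "y": Y.flatten(), "z": Z.flatten()}
print({name: values.shape for name, values in contour_data.items()})
```

All three flattened arrays have the same length, which is what lets the x, y and z aesthetics line up row by row.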