gedeck / practical-statistics-for-data-scientists

Code repository for O'Reilly book

License: GNU General Public License v3.0

Python 0.89% R 0.59% Jupyter Notebook 98.49% Makefile 0.01% HTML 0.02% SCSS 0.01%

practical-statistics-for-data-scientists's Introduction

Code repository

Practical Statistics for Data Scientists:

50+ Essential Concepts Using R and Python
by Peter Bruce, Andrew Bruce, and Peter Gedeck

Publisher: O'Reilly Media; 2nd edition (June 9, 2020)
ISBN-13: 978-1492072942
Buy on Amazon
Errata: http://oreilly.com/catalog/errata.csp?isbn=9781492072942

Online

View the notebooks online:

Execute the notebooks in Binder:

This can take some time if the binder environment needs to be rebuilt.

Other language versions

	English: Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python 2020: ISBN 149207294X Google books, Amazon
	Japanese (2020-06-11): データサイエンスのための統計学入門第2版 ―予測、分類、統計モデリング、統計的機械学習とR/Pythonプログラミング 2020: ISBN 978-4-873-11926-7, Shinya Ohashi (supervised), Toshiaki Kurokawa (translated), O'Reilly Japan Inc. Google books, Amazon, Order here
	German (2021-03-29): Praktische Statistik für Data Scientists: 50+ essenzielle Konzepte mit R und Python 2021: ISBN 978-3-960-09153-0, Marcus Fraaß (Übersetzer), dpunkt.verlag GmbH Google books, Amazon Order here
	Korean (2021-05-07): Practical Statistics for Data Scientists: 데이터 과학을 위한 통계(2판) 2021: ISBN 979-1-162-24418-0, Junyong Lee (translation), Hanbit Media, Inc. Google books, Order here
	Polish (2021-06-16): Statystyka praktyczna w data science. 50 kluczowych zagadnien w jezykach R i Python 2021: ISBN 978-8-328-37427-0, Helion Google books, Amazon, Order here
	Russian (2021-05-31): Практическая статистика для специалистов Data Science, 2-е изд. 2021: ISBN 978-5-9775-6705-3, BHV St Petersburg Google books, Order here
	Chinese complex (2021-07-29): Practical Statistics for Data Scientists: 資料科學家的實用統計學第二版 2021: ISBN 978-9-865-02841-1, Hong Weien (translation), GoTop Information Inc. Order here
	Chinese simplified (2021-10-15): Practical Statistics for Data Scientists: 数据科学中的实用统计学（第2版） 2021: ISBN 978-7-115-56902-8, Chen Guangxin (translation), Posts & Telecom Press Order here
	English (Indian subcontinent & select countries only): Practical Statistics for Data Scientists: 50+ Essential Concepts Using R And Python, Second Edition 2021: ISBN 978-8-194-43500-6, Shroff Publishers and Distributors Pvt. Ltd. Order here
	Spanish (2022-02-22): Estadística práctica para ciencia de datos con R y Python, Second Edition 2022: ISBN 978-8-426-73443-3, Marcombo S.A. Google books, Amazon, Order here

Setup of R and Python environments

We recommend using a conda environment to run the Python and R code.

conda create -n sfds #Create the conda environment named sfds.
conda activate sfds #Activate the environment we created.
conda env update -n sfds -f environment.yml #Update the dependencies of the environment from environment.yml

The full list of Python and R dependencies from the environment.yml file:

python
jupyter
pandas
matplotlib
scipy
statsmodels
wquantiles
seaborn
scikit-learn
pygam
dmba
pydotplus
imbalanced-learn
prince
xgboost
graphviz
numpy
adjustText
r-essentials
r-base
r-vioplot
r-corrplot
r-gmodels
r-matrixstats
r-lmperm
r-pwr
r-fnn
r-klar
r-dmwr
r-xgboost
r-ellipse
r-mclust
r-ca
r-ggplot2
r-irkernel
r-boot
r-randomforest


practical-statistics-for-data-scientists's Issues

chi-square, resampling approach

Hi, I hope it is OK that I am commenting on this here.
In chapter 3 I am stuck at this step:
3. Find the squared differences between the shuffled counts and expected counts then sum them.
Do you mean "calculate chi-square statistics" for each resampled sample set, where you calculate Pearson residuals first, or you just literally sum the squared differences between observed and expected counts? Thank you.
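One possible reading of step 3, offered as a sketch rather than the authors' confirmed intent: square the Pearson residual (observed minus expected, divided by the square root of expected) for every cell and sum the results, which gives the chi-square statistic. The click counts below (14, 8, and 12 clicks out of 1,000 impressions per headline) are assumed for illustration, and `chi2_statistic` is a hypothetical helper name.

```python
import math


def chi2_statistic(observed, expected):
    """Sum of squared Pearson residuals over all cells of the table."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            residual = (obs - exp) / math.sqrt(exp)  # Pearson residual
            total += residual ** 2
    return total


# Assumed example data: rows are click / no-click, columns are headlines A, B, C.
observed = [[14, 8, 12], [986, 992, 988]]
n_headlines = 3

# Under the null hypothesis every headline shares the same click rate,
# so each row total is spread evenly across the three headlines.
expected = [[sum(row) / n_headlines] * n_headlines for row in observed]

chi2_observed = chi2_statistic(observed, expected)
print(f"Observed chi2: {chi2_observed:.4f}")
```

In a resampling test the same function would be applied to every shuffled table, and the p-value is the share of shuffled statistics that exceed `chi2_observed`.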

Again in Ch 5, 6, 7

Naive Bayes, The Naive Solution

The predicted probabilities differ from the book's. They should be 0.4798964 (paid off) and 0.5201036 (default).

I ran the code in Colab. Could you check this notebook?
https://colab.research.google.com/drive/1ChitMlzaMHYDru6ngI1qBHhJGcIP-RhI#scrollTo=1EnynWD14l2R&line=7&uniqifier=1

Variable importance

A line break is needed at line 318.

practical-statistics-for-data-scientists/python/code/Chapter 6 - Statistical Machine Learning.py

Line 318 in 3e1bf1c

print('Features sorted by their score:')

Hyperparameters and Cross-Validation

A line break is needed at line 453.

practical-statistics-for-data-scientists/python/code/Chapter 6 - Statistical Machine Learning.py

Line 453 in 3e1bf1c

error.append({

Line 452 also raises a type error. Could you check this line?
"TypeError: Object with dtype category cannot perform the numpy op subtract"

Python XGBoost codes in Ch6

It would be better to set eval_metric='error' in the Python code too.

Resampling in chi square test

In this function:

https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%203%20-%20Statistical%20Experiments%20and%20Significance%20Testing.ipynb?short_path=69496c2#L873

def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]

Shouldn't it be

def perm_fun(box):
    random.shuffle(box)
    sample_clicks = [sum(box[0:1000]),
                     sum(box[1000:2000]),
                     sum(box[2000:3000])]

to ensure the total count of clicks is always 34?
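The difference between the two versions can be checked directly. Three independent `random.sample(box, 1000)` calls each draw from the full box, so their click totals are not tied together, whereas a single shuffle split into three slices always accounts for every click exactly once. The sketch below assumes a box of 3,000 impressions with 34 clicks, as described in the question; it permutes a copy instead of shuffling in place so that `box` is left unchanged.

```python
import random


def perm_fun(box):
    """Permute the box once and split it into three groups of 1,000."""
    shuffled = random.sample(box, len(box))  # full permutation, box is not mutated
    return [sum(shuffled[0:1000]),
            sum(shuffled[1000:2000]),
            sum(shuffled[2000:3000])]


# Box of 3,000 impressions: 34 clicks (1) and 2,966 non-clicks (0).
box = [1] * 34 + [0] * 2966

random.seed(1)
sample_clicks = perm_fun(box)
print(sample_clicks, sum(sample_clicks))
```

Whatever the seed, the three group counts add up to 34 in this version.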

Data Issue: house_sales.csv

On 29 July 2020, gedeck added two records to house_sales.csv, with zip codes 9800 and 89118. Because of them, the results of many code examples in the book, especially in Chapter 4, no longer match the output of the GitHub code. Neither 9800 nor 89118 is a King County zip code, and the records appear neither in the original data nor in the printed book or the O'Reilly Learning content. Are they really needed?

Anaconda - ResolvePackageNotFound

I cloned the repo and tried to run the Python files. While updating the sfds environment from environment.yml after creating it, I got the error shown below.

Chapter 2: Specification of exponential distribution is incorrect

Feedback on errata page:

The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the mean number of events per time period is 0.2. However, for the Python code given in the book as stats.expon.rvs(0.2, size=100) we have the mean of the random values generated ~1.2, where loc=0.2 is the starting location for the exponential distribution. To get the same range of random values as those obtained with R we need to use stats.expon.rvs(scale=5, size=100) instead.
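The point about `loc` versus `scale` can be verified without sampling noise by asking the frozen distributions for their theoretical means. This is a minimal sketch, assuming SciPy is installed; the variable names `shifted` and `rate_02` are chosen for illustration only.

```python
from scipy import stats

# The first positional argument of expon is loc (a shift), not the rate.
shifted = stats.expon(0.2)       # loc=0.2, scale=1
rate_02 = stats.expon(scale=5)   # scale = 1 / rate = 1 / 0.2

print(shifted.mean())   # theoretical mean of the shifted distribution
print(rate_02.mean())   # theoretical mean for a rate of 0.2

# Random values comparable to R's rexp(n=100, rate=0.2).
sample = stats.expon.rvs(scale=5, size=100, random_state=42)
print(sample.mean())
```

The shifted distribution has mean `loc + scale = 1.2`, while the `scale=5` version has mean 5, matching the R result described above.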

Make change to notebook.

Python code for Chapter 3 - Web Stickiness - TypeError in the original code

There is a TypeError when running the Chapter 3 Web Stickiness notebook:

The line:
print(np.mean(perm_diffs > mean_b - mean_a))

results in the following TypeError: '>' not supported between instances of 'list' and 'float'

which can be fixed using a mapObj such as:

mapObj = map(lambda _: _>(mean_b-mean_a), perm_diffs)
print (f'{sum(mapObj)*100/len(perm_diffs):4.2f}%')

Mito-san

Hey there!

Enable github CI for pull requests

Ch. 2 - R Code Data and Sampling Distributions Lines 35, 36

I am getting an error with the three sampling examples in R version 4.1.0:

"Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'"

This error is also generated for the sample of 5 and sample of 20 starting on line 39 and 45.
I fixed it by passing loans_income$x to sample() instead of loans_income, based on the suggestion in this post: https://stackoverflow.com/questions/19648238/r-says-cannot-take-a-sample-larger-than-the-population-but-i-am-not-taking/19648272

I'm using R 4.1.0, arm64 build.

Add R build to CI

At a minimum make sure that the R code executes without a problem.

Example github action to run R
https://blog--simonpcouch.netlify.app/blog/r-github-actions-commit/

Ch 3. Line 77 in Python Code

practical-statistics-for-data-scientists/python/code/Chapter 3 - Statistial Experiments and Significance Testing.py

Line 77 in 0db4dbb

print(np.mean(perm_diffs > mean_b - mean_a))

This line raises a TypeError: '>' not supported between instances of 'list' and 'float'

It would be better to change the line to
print(np.mean(np.array(perm_diffs) > mean_b - mean_a))
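The failing and the corrected comparison can be reproduced in isolation. The sketch below uses made-up numbers for `perm_diffs`, `mean_a`, and `mean_b` (the notebook's actual values are not reproduced here) and assumes NumPy is installed.

```python
import numpy as np

# Hypothetical values standing in for the notebook's permutation results.
perm_diffs = [12.0, -5.5, 40.2, 3.1, 37.0]
mean_a, mean_b = 126.3, 162.0

threshold = mean_b - mean_a

# A plain Python list cannot be compared elementwise with a float ...
try:
    perm_diffs > threshold
except TypeError as exc:
    print(f"TypeError: {exc}")

# ... but a NumPy array can. The mean of the boolean mask is the proportion
# of permuted differences that exceed the observed difference.
p_value = np.mean(np.array(perm_diffs) > threshold)
print(p_value)
```

Converting once with `np.array(perm_diffs)` keeps the rest of the line unchanged, which is why this fix is smaller than building the comparison with `map`.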

Incorrect variable reference Chi2 (Chapter 3 page 127)

The following code prints the chi2 value calculated by the permutation test (chi2observed) rather than the chi2 value computed by the scipy stats module (chisq).

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')

I believe the first print line should be:
print(f'Observed chi2: {chisq:.4f}'), since the purpose is to demonstrate the chi2 module for statistical tests rather than the previous section's permutation test.

Thanks!
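For reference, a self-contained version of the corrected snippet might look as follows. The contingency table is an assumption made for illustration (14, 8, and 12 clicks out of 1,000 impressions per headline), and SciPy and NumPy are assumed to be installed.

```python
import numpy as np
from scipy import stats

# Assumed example table: rows are click / no-click, columns are headlines A, B, C.
clicks = np.array([[14, 8, 12],
                   [986, 992, 988]])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the table of expected frequencies.
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)

print(f'Observed chi2: {chisq:.4f}')  # uses chisq, not chi2observed
print(f'p-value: {pvalue:.4f}')
```

Printing `chisq` keeps this block independent of the permutation-test variables defined earlier in the chapter.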

perm_fun use of set()

Using the perm_fun(x, nA, nB) for the permutation tests on pages 99-101 results in a deprecation warning now.

"FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead."

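One way to silence the warning, sketched under the assumption that `perm_fun` builds its two index groups as sets and passes them to `.loc`, is to convert each set to a list before indexing. The data in `x` is hypothetical.

```python
import random

import pandas as pd


def perm_fun(x, nA, nB):
    """Difference in means (B minus A) after randomly reassigning group labels."""
    n = nA + nB
    idx_B = set(random.sample(range(n), nB))
    idx_A = set(range(n)) - idx_B
    # Convert the sets to lists: pandas does not accept sets as indexers.
    return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()


# Hypothetical measurements for 4 observations in group A and 3 in group B.
x = pd.Series([10.0, 12.0, 9.0, 11.0, 20.0, 22.0, 19.0])

random.seed(1)
diff = perm_fun(x, nA=4, nB=3)
print(diff)
```

The set arithmetic that separates the two groups is unchanged; only the final lookup uses lists.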
Errors and Questions in Ch5, 6, 7

1. In Chapter 5, some notebook results differ from those in the printed book.

[Confusion Matrix]

In [18]:
# Confusion matrix
pred <- predict(logistic_gam, newdata=loan_data)
pred_y <- as.numeric(pred > 0)
true_y <- as.numeric(loan_data$outcome=='default')
true_pos <- (true_y==1) & (pred_y==1)
true_neg <- (true_y==0) & (pred_y==0)
false_pos <- (true_y==0) & (pred_y==1)
false_neg <- (true_y==1) & (pred_y==0)
conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                     sum(false_neg), sum(true_neg)), 2, 2)
colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
rownames(conf_mat) <- c('Y = 1', 'Y = 0')
conf_mat

        Yhat = 1    Yhat = 0
Y = 1      14293        8378
Y = 0       8051       14620

In the R notebook, the correctly predicted defaults are 14,293 and incorrectly predicted ones are 8,378. But, in the printed book they are 14,295 and 8,376.

And in Python, I got yet another set of numbers.

              Yhat = default  Yhat = paid off
Y = default            14336             8335
Y = paid off            8148            14523

Which one is correct? If the notebook's results are right, the numbers in the first paragraph of page 222 should be edited.

2. This is also about code results that differ from the printed book.

[AUC]

In [21]: 
sum(roc_df$recall[-1] * diff(1-roc_df$specificity))
head(roc_df)
0.692623197044616

The result in the notebook is 0.692623197044616, but it is 0.6926172 in the book. Please check the Python code and result too.

3. XGBoost was updated to 1.3.0, which causes some errors in the code in Chapters 6 and 7 (pages 272, 275, 276, 280).

The code runs fine up to page 276. But without explicitly setting eval_metric="error", you will get errors on page 280. I think it would be better to edit the GitHub code.

4. In Chapter 7, K-Means Clustering - A Simple Example

In [12]:
set.seed(1010103)
df <- sp500_px[row.names(sp500_px)>='2011-01-01', c('XOM', 'CVX')]
km <- kmeans(df, centers=4, nstart=1)

df$cluster <- factor(km$cluster)
head(df)
XOM	CVX	cluster
2011-01-03	0.73680496	0.2406809	1
2011-01-04	0.16866845	-0.5845157	4
2011-01-05	0.02663055	0.4469854	1
2011-01-06	0.24855834	-0.9197513	4
2011-01-07	0.33732892	0.1805111	1
2011-01-10	0.00000000	-0.4641675	4

In the notebook, the first six records are assigned to either cluster 1 or cluster 4. The means of the clusters are shown below.

In [13]:
centers <- data.frame(cluster=factor(1:4), km$centers)
centers

cluster	XOM	CVX
1	 0.2315403	 0.3169645
2	 0.9270317	 1.3464117
3	-1.1439800	-1.7502975
4	-0.3287416	-0.5734695

But the execution results in the book are a little different: the records are assigned to cluster 1 or 2. However, as Figure 7-5 shows, clusters 3 and 4 are in the negative area (lower left of the graph), and they appear to represent a "down" market. So I think the code results and some sentences on pages 296-297 should be changed.

5. In Chapter 7, on page 323, the first line of the data table shows the wrong columns.

> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x

    dti payment_inc_ratio   home             purpose  
  <dbl>             <dbl> <fctr>             <fctr>
1  1.00           2.39320   RENT                car
...

It should be changed like this.

> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x

    dti payment_inc_ratio   home_             purpose_  
  <dbl>             <dbl> <fctr>             <fctr>
1  1.00           2.39320   RENT                major_purchase
...

Please check them all and let me know if I misunderstood or did something wrong. :) Thanks in advance.

Chapter 7 Unsupervised Learning Cell #18 Dendrogram is giving error

Chapter 7 Unsupervised Learning
Cell #18 Dendrogram is giving following error:

Please fix the error and upload corrected code to Github web page.
Thanks

ValueError Traceback (most recent call last)
in
1 fig, ax = plt.subplots(figsize=(5, 5))
2
----> 3 dendrogram(Z, labels=df.index, color_threshold=0)
4 plt.xticks(rotation=90)
5 ax.set_ylabel('distance')

C:\ProgramData\Anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
3275 "'bottom', or 'right'")
3276
-> 3277 if labels and Z.shape[0] + 1 != len(labels):
3278 raise ValueError("Dimensions of Z and labels must be consistent.")
3279

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __nonzero__(self)
2148 def __nonzero__(self):
2149 raise ValueError(
-> 2150 f"The truth value of a {type(self).__name__} is ambiguous. "
2151 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
2152 )

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
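The traceback shows SciPy evaluating `if labels and ...`, which fails because a pandas Index has no unambiguous truth value. A workaround that should not depend on the installed SciPy version is to pass the labels as a plain list. The sketch below uses a small hypothetical data frame and `no_plot=True`, so it runs without a plotting backend; the notebook's own data and figure code are not reproduced.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: four labelled observations with two features each.
df = pd.DataFrame(
    np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]),
    index=['AAA', 'BBB', 'CCC', 'DDD'],
)

Z = linkage(df.to_numpy(), method='complete')

# Pass the labels as a plain list so that truth-testing them is unambiguous.
result = dendrogram(Z, labels=list(df.index), color_threshold=0, no_plot=True)
print(result['ivl'])  # leaf labels in dendrogram order
```

In the notebook cell, the equivalent change would be `labels=list(df.index)` in the existing `dendrogram(...)` call.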

Adjust code to changes in Python packages

Chapter 4 code fails with

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Error: Process completed with exit code 1.

Root cause of this failure is a change in the pandas get_dummies function. It used to create 0/1 and now creates True/False.
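A possible adjustment, shown here on a small hypothetical data frame rather than the chapter's data, is to request integer dummy columns explicitly through the `dtype` argument of `get_dummies`. The column names are illustrative only.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical predictor.
houses = pd.DataFrame({
    'SqFtLot': [5000, 7200, 4100],
    'PropertyType': ['Single Family', 'Townhouse', 'Multiplex'],
})

# dtype=int restores 0/1 indicator columns instead of True/False,
# so the design matrix is purely numeric.
X = pd.get_dummies(houses, columns=['PropertyType'], drop_first=True, dtype=int)
print(X.dtypes)
```

With numeric indicator columns, the frame no longer has to be cast to a NumPy object array when it is passed to a model.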

The prince package changed its plotting API to use Vega. Replace the plots with a custom plot.

Pull request

I'm trying to do a pull request for some files which I've added to this project. They are the Python files Chapter..N...py broken down into smaller files to make them easier to read. I couldn't see how to do a pull request unless I had write access to this repo, so I cloned, and created my own, at https://github.com/pdxrod/practical-statistics-for-data-scientists. I'll delete this repo if requested to do so by Peter Gedeck.

The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, being able to see the code in smaller sections, whereas the Chapter..N...py files are 395 lines on average.

Statistics

Graphs in Chapter 5 Classification are not displaying in the Jupyter Notebook

The Jupyter notebook for Chapter 5 Classification gives the following error:

Matplotlib is currently using agg, which is a non-GUI backend, so can't show the figure.

Please fix these errors and update notebook's code files on this book's Github webpage.

Thanks and best regards,
SSJ

Chapter 1, Correlation, filtering through data gives an error.

Chapter 1, the Correlation section, the first 2 cells give the same error,

TypeError Traceback (most recent call last)
in ()
4
5 # Filter data for dates July 2012 through June 2015
----> 6 telecom = sp500_px.loc[sp500_px.index >= '2012-07-01', telecomSymbols]
7 telecom.corr()
8 telecom

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in cmp_method(self, other)
120 else:
121 with np.errstate(all="ignore"):
--> 122 result = op(self.values, np.asarray(other))
123
124 if is_bool_dtype(result):

TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'numpy.ndarray'

FYI, I'm running the cells on Google Colab.

sp500_data.csv.gz & kc_tax.csv.gz

Hi Peter,
I am new to this platform, to Python, and to your book. I was able to download all the data files to follow along except the two compressed files above; they give error 79 - Inappropriate file type or format. I am on a Mac (Catalina, 10.15.6).

Please upload a better copy.
Thanks

Python Jupyter Notebook program output is different from what is shown there

This is in reference to Python Jupyter Notebook for Chapter 5: Classification, section: Undersampling.

The code and outputs in the original notebook are shown below:

However, when I rerun the notebook, the output is as shown below:

The output is drastically different from that in the original notebook. I reran the same code in a different notebook, and the output still differs from the original.

Please look into this.

Different histogram under the same number of bins

In Chapter 1, in the section "Frequency Tables and Histograms", I tried to replicate the histogram with a different Python package, lets-plot, which should behave like the hist() plot in R. However, the y-axis (the frequency) differs from what the R and Python code in the book generate with the same number of bins.

The histogram generated from the textbook code:

Code:

ax = (state['Population'] / 1_000_000).plot.hist(bins=10)  
ax.set_xlabel('Population (millions)')

The histogram generated by lets-plot (aka ggplot in Python):

Code:

temp_df = pd.DataFrame(state['Population'] / 1_000_000)  
ggplot(temp_df, aes(x="Population")) + geom_histogram(bins=10)
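One possible explanation, not confirmed in this thread, is that the two libraries place the bin boundaries differently. The pandas `plot.hist` call uses Matplotlib, which by default splits the interval from the data minimum to the data maximum into equal-width bins. A library that positions the same number of bins over a different interval will report different counts. The sketch below illustrates the effect with NumPy on hypothetical values; the alternative range `(0, 12)` is an arbitrary assumption and does not reproduce the lets-plot algorithm.

```python
import numpy as np

# Hypothetical values standing in for state populations in millions.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Equal-width bins spanning exactly [min, max], NumPy's default.
counts_default, edges_default = np.histogram(values, bins=4)

# Same number of bins, but spread over a wider, assumed range.
counts_shifted, edges_shifted = np.histogram(values, bins=4, range=(0, 12))

print(edges_default, counts_default)
print(edges_shifted, counts_shifted)
```

Both histograms contain all ten values, yet the per-bin frequencies differ, so comparing the bin edges of the two plots is a reasonable first check.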

Figure 7.1 (Python) - Broken

Using given code creates an error (1).

Per Prince CA documentation, I was able to get it working (2).

Python version: 3.11.4 (Using Jupyter Notebook)

(1) Original Python Code & Error:
`housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)

ca = prince.CA(n_components=2)
ca = ca.fit(housetasks)

ca.plot_coordinates(housetasks, figsize=(6, 6))
plt.tight_layout()
plt.show()`

ERROR: AttributeError: 'CA' object has no attribute 'plot_coordinates'

(2) Updated Python Code:
`import pandas as pd
import prince
import altair as alt

#Load the data
housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)

#Create the model
ca = prince.CA(n_components=2)

#Fit the model
ca = ca.fit(housetasks)

#Extract the column coordinate dataframe, and change the column names
cc = ca.column_coordinates(housetasks).reset_index()
cc.columns = ['name', 'x', 'y']

#Extract the row coordinates dataframe, and change the column names
rc = ca.row_coordinates(housetasks).reset_index()
rc.columns = ['name', 'x', 'y']

#Combine the dataframes
crc_df = pd.concat([cc, rc], ignore_index=True)

#Plot and annotate
points = ca.plot(housetasks, x_component=0, y_component=1)

annot = alt.Chart(crc_df).mark_text(
    align='left',
    baseline='middle',
    fontSize=10,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)

points + annot`

Possible Considerations on moving R into conda environment for consistency

Due to conda being able to handle the R dependencies as well, I'd recommend adding the following to the existing environment file:

r-vioplot
r-corrplot
r-gmodels
r-matrixstats
r-lmperm
r-pwr
r-fnn
r-klar
r-dmwr
r-xgboost
r-ellipse
r-mclust
r-ca

Optional: add the conda-forge build of rstudio-desktop, which is better supported than the rstudio-desktop version in the default channel:
conda install -c conda-forge rstudio-desktop

Here's the link to the conda-forge version of rstudio-desktop.