gedeck / practical-statistics-for-data-scientists

Code repository for O'Reilly book

License: GNU General Public License v3.0

Python 0.89% R 0.59% Jupyter Notebook 98.49% Makefile 0.01% HTML 0.02% SCSS 0.01%

practical-statistics-for-data-scientists's Introduction

Python

Code repository

Practical Statistics for Data Scientists:

50+ Essential Concepts Using R and Python
by Peter Bruce, Andrew Bruce, and Peter Gedeck

Online

View the notebooks online: nbviewer

Execute the notebooks in Binder: Binder

This can take some time if the binder environment needs to be rebuilt.

Other language versions

English:
Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
2020: ISBN 149207294X
Google books, Amazon
Japanese (2020-06-11):
データサイエンスのための統計学入門 第2版 ―予測、分類、統計モデリング、統計的機械学習とR/Pythonプログラミング
2020: ISBN 978-4-873-11926-7, Shinya Ohashi (supervised), Toshiaki Kurokawa (translated), O'Reilly Japan Inc.
Google books, Amazon, Order here
German (2021-03-29):
Praktische Statistik für Data Scientists: 50+ essenzielle Konzepte mit R und Python 
2021: ISBN 978-3-960-09153-0, Marcus Fraaß (Übersetzer), dpunkt.verlag GmbH
Google books, Amazon, Order here
Korean (2021-05-07):
Practical Statistics for Data Scientists: 데이터 과학을 위한 통계(2판)
2021: ISBN 979-1-162-24418-0, Junyong Lee (translation), Hanbit Media, Inc.
Google books, Order here
Polish (2021-06-16):
Statystyka praktyczna w data science. 50 kluczowych zagadnień w językach R i Python
2021: ISBN 978-8-328-37427-0, Helion
Google books, Amazon, Order here
Russian (2021-05-31):
Практическая статистика для специалистов Data Science, 2-е изд.
2021: ISBN 978-5-9775-6705-3, BHV St Petersburg
Google books, Order here
Chinese complex (2021-07-29):
Practical Statistics for Data Scientists: 資料科學家的實用統計學 第二版
2021: ISBN 978-9-865-02841-1, Hong Weien (translation), GoTop Information Inc.
Order here
Chinese simplified (2021-10-15):
Practical Statistics for Data Scientists: 数据科学中的实用统计学(第2版)
2021: ISBN 978-7-115-56902-8, Chen Guangxin (translation), Posts & Telecom Press
Order here
English (Indian subcontinent & select countries only):
Practical Statistics for Data Scientists: 50+ Essential Concepts Using R And Python, Second Edition
2021: ISBN 978-8-194-43500-6, Shroff Publishers and Distributors Pvt. Ltd.
Order here
Spanish (2022-02-22):
Estadística práctica para ciencia de datos con R y Python, Second Edition
2022: ISBN 978-8-426-73443-3, Marcombo S.A.
Google books, Amazon, Order here

See also

Setup of R and Python environments

We recommend using a conda environment to run the Python and R code.

conda create -n sfds #Create the conda environment named sfds.
conda activate sfds #Activate the environment we created.
conda env update -n sfds -f environment.yml #Update the dependencies of the environment from environment.yml

The full list of Python and R dependencies from the environment.yml file:

python
jupyter
pandas
matplotlib
scipy
statsmodels
wquantiles
seaborn
scikit-learn
pygam
dmba
pydotplus
imbalanced-learn
prince
xgboost
graphviz
numpy
adjustText
r-essentials
r-base
r-vioplot
r-corrplot
r-gmodels
r-matrixstats
r-lmperm
r-pwr
r-fnn
r-klar
r-dmwr
r-xgboost
r-ellipse
r-mclust
r-ca
r-ggplot2
r-irkernel
r-boot
r-randomforest

practical-statistics-for-data-scientists's People

Contributors

gedeck, gregorywaynepower, jan-janssen

practical-statistics-for-data-scientists's Issues

Chapter 7 Unsupervised Learning Cell #18 Dendrogram is giving error

Chapter 7 Unsupervised Learning
Cell #18 (dendrogram) gives the following error:

Please fix the error and upload the corrected code to the GitHub repository.
Thanks

ValueError Traceback (most recent call last)
in
1 fig, ax = plt.subplots(figsize=(5, 5))
2
----> 3 dendrogram(Z, labels=df.index, color_threshold=0)
4 plt.xticks(rotation=90)
5 ax.set_ylabel('distance')

C:\ProgramData\Anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
3275 "'bottom', or 'right'")
3276
-> 3277 if labels and Z.shape[0] + 1 != len(labels):
3278 raise ValueError("Dimensions of Z and labels must be consistent.")
3279

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __nonzero__(self)
2148 def __nonzero__(self):
2149 raise ValueError(
-> 2150 f"The truth value of a {type(self).__name__} is ambiguous. "
2151 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
2152 )

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
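
The traceback points at the `if labels and ...` check, which tries to evaluate a pandas Index as a boolean. One possible workaround, sketched here under the assumption that the installed SciPy still contains that check, is to pass the labels as a plain list. The data frame below is a made-up stand-in for the notebook's data, and `no_plot=True` is used only so that the sketch runs without matplotlib.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in for the notebook's data frame (values are made up).
df = pd.DataFrame(
    {"x": [0.0, 0.1, 1.0, 1.1, 5.0], "y": [0.0, 0.2, 1.0, 0.9, 5.0]},
    index=["A", "B", "C", "D", "E"],
)
Z = linkage(df.to_numpy(), method="complete")

# A plain list has an unambiguous truth value, unlike a pandas Index.
# no_plot=True computes the dendrogram structure without drawing it.
result = dendrogram(Z, labels=list(df.index), color_threshold=0, no_plot=True)
leaf_labels = result["ivl"]
print(leaf_labels)
```

In the notebook itself the equivalent change would be `labels=list(df.index)` in the existing `dendrogram` call, keeping the plotting code as it is.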

Ch. 2 - R Code Data and Sampling Distributions Lines 35, 36

I am getting an error with the three sampling examples in R version 4.1.0:

"Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'"

This error is also generated for the sample of 5 and sample of 20 starting on line 39 and 45.
I've fixed it by passing into the sample arguments loans_income$x, not just loans_income, based on the suggestion on this post: https://stackoverflow.com/questions/19648238/r-says-cannot-take-a-sample-larger-than-the-population-but-i-am-not-taking/19648272

I'm using R 4.1.0; but the arm64 version.

Errors and Questions in Ch5, 6, 7

1. In Chapter 5, some notebook code results differ from those in the printed book.

[Confusion Matrix]

In [18]:
# Confusion matrix
pred <- predict(logistic_gam, newdata=loan_data)
pred_y <- as.numeric(pred > 0)
true_y <- as.numeric(loan_data$outcome=='default')
true_pos <- (true_y==1) & (pred_y==1)
true_neg <- (true_y==0) & (pred_y==0)
false_pos <- (true_y==0) & (pred_y==1)
false_neg <- (true_y==1) & (pred_y==0)
conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                     sum(false_neg), sum(true_neg)), 2, 2)
colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
rownames(conf_mat) <- c('Y = 1', 'Y = 0')
conf_mat
      Yhat = 1 Yhat = 0
Y = 1    14293     8378
Y = 0     8051    14620

In the R notebook, the correctly predicted defaults are 14,293 and incorrectly predicted ones are 8,378. But, in the printed book they are 14,295 and 8,376.

In Python, I got yet another set of numbers.

    Yhat = default  Yhat = paid off
Y = default       14336        8335
Y = paid off        8148      14523

Which one is right? If the notebook's results are right, the numbers in the first paragraph of page 222 should be corrected.

2. This is also about code results that differ from the printed book.

[AUC]

In [21]: 
sum(roc_df$recall[-1] * diff(1-roc_df$specificity))
head(roc_df)
0.692623197044616

The result in the notebook is 0.692623197044616, but it is 0.6926172 in the book. Please check the Python code and result too.

3. XGBoost was updated to 1.3.0, which causes some errors in the code in Chapters 6 and 7 (pages 272, 275, 276, 280).

The code runs fine up to page 276. But without explicitly setting eval_metric="error", you eventually get errors on page 280. I think it would be better to update the code on GitHub.

4. In Chapter 7, K-Means Clustering - A Simple Example

In [12]:
set.seed(1010103)
df <- sp500_px[row.names(sp500_px)>='2011-01-01', c('XOM', 'CVX')]
km <- kmeans(df, centers=4, nstart=1)

df$cluster <- factor(km$cluster)
head(df)
XOM	CVX	cluster
2011-01-03	0.73680496	0.2406809	1
2011-01-04	0.16866845	-0.5845157	4
2011-01-05	0.02663055	0.4469854	1
2011-01-06	0.24855834	-0.9197513	4
2011-01-07	0.33732892	0.1805111	1
2011-01-10	0.00000000	-0.4641675	4

In the notebook, the first six records are assigned to either cluster 1 or cluster 4. The means of the clusters are shown below.

In [13]:
centers <- data.frame(cluster=factor(1:4), km$centers)
centers

cluster	XOM	CVX
1	 0.2315403	 0.3169645
2	 0.9270317	 1.3464117
3	-1.1439800	-1.7502975
4	-0.3287416	-0.5734695

But the execution results in the book are a little different: there, the records are assigned to cluster 1 or 2. However, as [Figure 7-5] shows, clusters 3 and 4 lie in the negative region (lower left of the graph), and they appear to represent a "down" market. So I think the code results and some sentences on pages 296-297 should be changed.

5. In Chapter 7, on page 323, the first line of the data table shows the wrong columns.

> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x

    dti payment_inc_ratio   home             purpose  
  <dbl>             <dbl> <fctr>             <fctr>
1  1.00           2.39320   RENT                car
...

It should be changed like this.

> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x

    dti payment_inc_ratio   home_             purpose_  
  <dbl>             <dbl> <fctr>             <fctr>
1  1.00           2.39320   RENT                major_purchase
...

Please check all of these and let me know if I misunderstood (or did) something wrong. :) Thanks in advance.

Data Issue: house_sales.csv

On 29 July 2020, gedeck added two rows to house_sales.csv, with zip codes 9800 and 89118. Because of them, many of the results of the book's code, especially in Chapter 4, no longer match the output of the GitHub code. 9800 and 89118 are not even King County zip codes. These rows were not in the original data, the printed book, or the O'Reilly Learning content. Are they really needed?

Python code for Chapter 3 - Web Stickiness - TypeError in the original code

There is a TypeError when running the Chapter 3 Web Stickiness notebook:

The line:
print(np.mean(perm_diffs > mean_b - mean_a))

results in the following TypeError: '>' not supported between instances of 'list' and 'float'

which can be fixed using a mapObj such as:

mapObj = map(lambda _: _>(mean_b-mean_a), perm_diffs)
print (f'{sum(mapObj)*100/len(perm_diffs):4.2f}%')
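
Another way to avoid the error, sketched here with made-up numbers, is to convert the list to a NumPy array so that the comparison is applied element by element. The values of `perm_diffs` and `observed_diff` below are hypothetical; in the notebook, `observed_diff` would be `mean_b - mean_a`.

```python
import numpy as np

# Hypothetical permutation differences and observed difference (made-up values).
perm_diffs = [-12.0, 3.5, 40.2, -7.1, 36.0, 18.3, -25.4, 51.0]
observed_diff = 35.67

# A Python list cannot be compared with a float, but a NumPy array can:
# the comparison yields a boolean array, and its mean is the fraction of
# permutation differences that exceed the observed difference.
p_value = np.mean(np.array(perm_diffs) > observed_diff)
print(p_value)
```

This keeps the original one-line form of the calculation and returns a proportion rather than a percentage.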

Again in Ch 5, 6, 7

Naive Bayes, The Naive Solution

The predicted probabilities differ. They should be 0.4798964 (paid off) and 0.5201036 (default).

I ran the code in Colab. Could you check this notebook?
https://colab.research.google.com/drive/1ChitMlzaMHYDru6ngI1qBHhJGcIP-RhI#scrollTo=1EnynWD14l2R&line=7&uniqifier=1

Variable importance

A line break is needed at line 318.

Hyperparameters and Cross-Validation

A line break is needed at line 453.

Line 452 also raises a type error. Could you check this line?
"TypeError: Object with dtype category cannot perform the numpy op subtract"

Python XGBoost codes in Ch6

It would be better to set eval_metric='error' in the Python code too.

Anaconda - ResolvePackageNotFound

I was trying to clone the repo and run the Python files. When updating the environment from environment.yml after creating the sfds environment, I got the error shown below.
Screenshot from 2023-06-29 16-58-41

Figure 7.1 (Python) - Broken

Using given code creates an error (1).

Per Prince CA documentation, I was able to get it working (2).

Python version: 3.11.4 (Using Jupyter Notebook)


(1) Original Python Code & Error:
`housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)

ca = prince.CA(n_components=2)
ca = ca.fit(housetasks)

ca.plot_coordinates(housetasks, figsize=(6, 6))
plt.tight_layout()
plt.show()`

ERROR: AttributeError: 'CA' object has no attribute 'plot_coordinates'

(2) Updated Python Code:
`import pandas as pd
import prince
import altair as alt

#Load the data
housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)

#Create the model
ca = prince.CA(n_components=2)

#Fit the model
ca = ca.fit(housetasks)

#Extract the column coordinate dataframe, and change the column names
cc = ca.column_coordinates(housetasks).reset_index()
cc.columns = ['name', 'x', 'y']

#Extract the row coordinates dataframe, and change the column names
rc = ca.row_coordinates(housetasks).reset_index()
rc.columns = ['name', 'x', 'y']

#Combine the dataframes
crc_df = pd.concat([cc, rc], ignore_index=True)

#Plot and annotate
points = ca.plot(housetasks, x_component=0, y_component=1)

annot = alt.Chart(crc_df).mark_text(
    align='left',
    baseline='middle',
    fontSize=10,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)

points + annot`

perm_fun use of set()

Using the perm_fun(x, nA, nB) for the permutation tests on pages 99-101 results in a deprecation warning now.

"FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead."
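
The warning suggests using a list instead of a set as the indexer. A minimal sketch of that change follows, assuming `perm_fun` selects rows with `x.loc[...]` from randomly drawn positions; the function body and the sample data here are illustrative and may not match the book's code line for line.

```python
import random

import pandas as pd


def perm_fun(x, nA, nB):
    """Difference in means between two randomly reassigned groups."""
    n = nA + nB
    # Keep the sampled indices as a list so that .loc receives a list.
    idx_B = random.sample(range(n), nB)
    chosen = set(idx_B)  # used only for fast membership tests
    idx_A = [i for i in range(n) if i not in chosen]
    return x.loc[idx_B].mean() - x.loc[idx_A].mean()


# Hypothetical data with a default integer index.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
random.seed(1)
diff = perm_fun(x, 3, 3)
print(diff)
```

The set is still useful for building the complement group, but it is never passed to `.loc`, which is what triggers the warning.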

Pull request

I'm trying to do a pull request for some files which I've added to this project. They are the Python files Chapter..N...py broken down into smaller files to make them easier to read. I couldn't see how to do a pull request unless I had write access to this repo, so I cloned, and created my own, at https://github.com/pdxrod/practical-statistics-for-data-scientists. I'll delete this repo if requested to do so by Peter Gedeck.

The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, being able to see the code in smaller sections, whereas the Chapter..N...py files are 395 lines on average.

Different histogram under the same number of bins

In Chapter 1, in the section on "Frequency Tables and Histograms", I tried to replicate the histogram code with a different Python package, lets-plot, which should behave like the hist() plot in R. However, the y-axis (the frequency) differs from what R and Python generate with the same number of bins.

The histogram generated from the textbook code:
image

Code:

ax = (state['Population'] / 1_000_000).plot.hist(bins=10)  
ax.set_xlabel('Population (millions)')

The histogram generated by lets-plot (aka ggplot in Python):
image

Code:

temp_df = pd.DataFrame(state['Population'] / 1_000_000)  
ggplot(temp_df, aes(x="Population")) + geom_histogram(bins=10)

sp500_data.csv.gz & kc_tax.csv.gz

Hi Peter,
I am new to this platform, Python, and your book. I was able to download all the data files to follow along except the two compressed files above; they give error 79 - Inappropriate file type or format. I am on a Mac (Catalina, 10.15.6).

Please upload a better copy.
Thanks

Chapter 1, Correlation, filtering through data gives an error.

Chapter 1, the Correlation section, the first 2 cells give the same error,

TypeError Traceback (most recent call last)
in ()
4
5 # Filter data for dates July 2012 through June 2015
----> 6 telecom = sp500_px.loc[sp500_px.index >= '2012-07-01', telecomSymbols]
7 telecom.corr()
8 telecom

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in cmp_method(self, other)
120 else:
121 with np.errstate(all="ignore"):
--> 122 result = op(self.values, np.asarray(other))
123
124 if is_bool_dtype(result):

TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'numpy.ndarray'

FYI, I'm running the cells on Google Colab.

Resampling in chi square test

In this function:

https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%203%20-%20Statistical%20Experiments%20and%20Significance%20Testing.ipynb?short_path=69496c2#L873

def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]

Shouldn't it be

def perm_fun(box):
    random.shuffle(box)
    sample_clicks = [sum(box[0:1000]),
                     sum(box[1000:2000]),
                     sum(box[2000:3000])]

to ensure the total count of clicks is always 34?
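
The property the question refers to can be checked directly. The sketch below implements the shuffle-and-split variant without modifying the caller's list, assuming a box of 3000 entries with 34 clicks (the size is inferred from the three slices of 1000 and is an assumption here).

```python
import random


def perm_fun(box):
    """Permute the box and split it into three groups of 1000."""
    # random.sample with k=len(box) returns a shuffled copy,
    # so the original box is left untouched.
    shuffled = random.sample(box, len(box))
    return [
        sum(shuffled[0:1000]),
        sum(shuffled[1000:2000]),
        sum(shuffled[2000:3000]),
    ]


# Assumed box: 34 clicks (1) and 2966 non-clicks (0).
box = [1] * 34 + [0] * 2966
random.seed(0)
sample_clicks = perm_fun(box)
print(sample_clicks, sum(sample_clicks))
```

Because the three groups partition one permutation of the box, their click counts always add up to 34. With three independent calls to `random.sample(box, 1000)`, the same entry can appear in more than one group, so the total can vary.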

Chapter 2: Specification of exponential distribution is incorrect

Feedback on errata page:

The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the mean number of events per time period is 0.2. However, for the Python code given in the book as stats.expon.rvs(0.2, size=100) we have the mean of the random values generated ~1.2, where loc=0.2 is the starting location for the exponential distribution. To get the same range of random values as those obtained with R we need to use stats.expon.rvs(scale=5, size=100) instead.
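
The difference between the two calls can be seen without sampling by asking SciPy for the theoretical means. This is a small check of the parameterization described above, not the notebook's code.

```python
from scipy import stats

# The first positional argument of expon is loc, so this shifts the
# distribution by 0.2 and leaves scale at its default of 1.
mean_shifted = stats.expon.mean(0.2)

# scale is 1 / rate, so rate = 0.2 corresponds to scale = 5.
mean_scaled = stats.expon.mean(scale=5)

print(mean_shifted, mean_scaled)
```

The first mean is loc + scale = 1.2 and the second is 5, which matches the sample means reported in the feedback.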

Make change to notebook.

chi-square, resampling approach

Hi, I hope it is OK that I am commenting on this here.
In chapter 3 I am stuck at this step:
3. Find the squared differences between the shuffled counts and expected counts then sum them.
Do you mean "calculate chi-square statistics" for each resampled sample set, where you calculate Pearson residuals first, or you just literally sum the squared differences between observed and expected counts? Thank you.
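
For reference, under the usual definition of Pearson's chi-square statistic, each squared difference is divided by the expected count before summing. Whether step 3 in the book intends exactly this is the question being asked, so the following is one interpretation, written as a sketch with illustrative counts.

```python
def chi2(observed, expected):
    """Sum of squared Pearson residuals: (observed - expected)^2 / expected."""
    total = 0.0
    for row, expect in zip(observed, expected):
        # Every cell in a row shares the same expected count here.
        total += sum((obs - expect) ** 2 / expect for obs in row)
    return total


# Illustrative counts: clicks and no-clicks for three groups of 1000.
observed = [[14, 8, 12], [986, 992, 988]]
# Expected counts per cell if all groups share the same click rate.
expected = [34 / 3, 1000 - 34 / 3]

statistic = chi2(observed, expected)
print(round(statistic, 4))
```

In a resampling test, the same function would be applied to each shuffled table and the results compared with the statistic for the observed table.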

Python Jupyter Notebook program output is different from what is shown there

This is in reference to Python Jupyter Notebook for Chapter 5: Classification, section: Undersampling.

The code and output, as shown in the notebook, are below:

original

However, when I rerun that notebook, the output is as shown below

actual

Needless to say, the output is drastically different from what is in original notebook. I have rerun the same code in different notebook and yet the output is different from the original.

Please look into this.

Incorrect variable reference Chi2 (Chapter 3 page 127)

The following code refers to the chi2 value calculated using the permutation test (chi2observed) instead of the chi2 value computed using the scipy stats module (chisq).

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')

I believe the first print line should be:
print(f'Observed chi2: {chisq:.4f}') since the purpose is to demonstrate using the chi2 module for statistical tests rather than the previous section's permutation test.

Thanks!
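
A self-contained version of the proposed fix might look like the sketch below. The `clicks` table is an illustrative stand-in with made-up layout (clicks in the first row, no-clicks in the second); only the variable used in the first print line differs from the quoted code.

```python
import numpy as np
from scipy import stats

# Illustrative contingency table: rows are click / no-click, columns are groups.
clicks = np.array([[14, 8, 12], [986, 992, 988]])

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)

# Report the statistic returned by scipy, not the permutation-test value.
print(f'Observed chi2: {chisq:.4f}')
print(f'p-value: {pvalue:.4f}')
```

With this change the printed statistic and p-value both come from the same `chi2_contingency` call.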

Possible Considerations on moving R into conda environment for consistency

Due to conda being able to handle the R dependencies as well, I'd recommend adding the following to the existing environment file:

r-vioplot
r-corrplot
r-gmodels
r-matrixstats
r-lmperm
r-pwr
r-fnn
r-klar
r-dmwr
r-xgboost
r-ellipse
r-mclust
r-ca

Optional: add the better-supported conda-forge version of rstudio-desktop
(conda install -c conda-forge rstudio-desktop) rather than the rstudio-desktop version in the default channel.

Here's the link to the conda-forge version of rstudio-desktop.

Adjust code to changes in Python packages

Chapter 4 code fails with

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Error: Process completed with exit code 1.

Root cause of this failure is a change in the pandas get_dummies function. It used to create 0/1 and now creates True/False.
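
One way to restore the earlier 0/1 behaviour is the `dtype` argument of `pd.get_dummies`. The sketch below uses a hypothetical column name and values; it is not the notebook's code.

```python
import pandas as pd

# Hypothetical categorical column (name and values are made up).
houses = pd.DataFrame(
    {"PropertyType": ["Multiplex", "Single Family", "Townhouse"]}
)

# dtype=int requests integer 0/1 columns instead of boolean True/False.
dummies = pd.get_dummies(houses["PropertyType"], dtype=int)
print(dummies.dtypes)
```

Integer dummy columns convert to a numeric NumPy array, which avoids the object-dtype cast that the error message complains about.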

The Prince package changed its plotting API to use Vega. Replace the affected plots with a custom plot.
