ethen8181 / machine-learning Goto Github PK



:earth_americas: machine learning tutorials (mainly in Python3)

License: MIT License

R 0.06% HTML 73.19% Python 0.15% Jupyter Notebook 26.58% CSS 0.01% Dockerfile 0.01% CMake 0.01% C++ 0.01% Cython 0.01%

data-science deep-learning jupyter-notebook machine-learning python python3


machine-learning's Issues

Minor calculation mistake in "compute_calibration_error"

The formula for ECE (expected calibration error) weights each bin's squared error by the bin's share of the samples, |B_m|/n, when taking the weighted average.

The function that uses this formula in the code is called "compute_calibration_error":
https://github.com/ethen8181/machine-learning/blob/master/model_selection/prob_calibration/calibration_module/utils.py#L66

(Link to the code line that sums the errors without weight for each bin size)

Although the bins are created to be of approximately equal size, their sizes can differ slightly, and the code does not take this into account. I think bin_error should be multiplied by the bin size, and the sum of all the errors should be divided by the number of samples (for example, the length of y_true) instead of the number of bins (line 68).

I hope the issue is clear and easy to understand. If not, feel free to ask me for clarification.
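
The weighting proposed in this issue can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's code: the function name `weighted_calibration_error`, the binning by sorted score, and the default `n_bins` are assumptions made for the example. It follows the squared-gap form described in the issue; the usual ECE definition uses the absolute gap instead.

```python
def weighted_calibration_error(y_true, y_prob, n_bins=3):
    """Weighted average of squared per-bin gaps, each bin weighted by |B_m| / n."""
    n = len(y_true)
    # Sort samples by predicted probability so each bin holds similar scores.
    pairs = sorted(zip(y_prob, y_true))
    # Split into n_bins contiguous chunks whose sizes differ by at most one.
    base, extra = divmod(n, n_bins)
    weighted_sum = 0.0
    start = 0
    for b in range(n_bins):
        size = base + (1 if b < extra else 0)
        if size == 0:
            continue  # more bins than samples: skip empty bins
        chunk = pairs[start:start + size]
        start += size
        avg_prob = sum(p for p, _ in chunk) / size
        avg_true = sum(t for _, t in chunk) / size
        bin_error = (avg_prob - avg_true) ** 2
        # Multiply by the bin size instead of treating all bins equally.
        weighted_sum += bin_error * size
    # Divide by the number of samples, not by the number of bins.
    return weighted_sum / n
```

With five samples and two bins, the bins have sizes 3 and 2, so the weighted result differs from a plain average over bins. That size mismatch is the case the issue describes.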

Possible error in the torch transformer

class MultiHeadAttention(nn.Module):

This class does not scale the product of Query and Key (the usual division by the square root of the key dimension).
Also, in the forward function, it seems that the function should return linear_proj, not output?
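
For reference, the scaling this issue refers to can be sketched for a single head as below. This is a plain-Python illustration on nested lists, not the repository's torch `MultiHeadAttention` class; the function names and the list-based inputs are assumptions chosen to keep the example self-contained.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, key, value):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V on nested lists."""
    d_k = len(key[0])
    scale = 1.0 / math.sqrt(d_k)  # the scaling factor the issue says is missing
    weights = []
    for q in query:
        # Scaled dot product of one query row against every key row.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        weights.append(softmax(scores))
    n_cols = len(value[0])
    # Each output row is the attention-weighted sum of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, value)) for j in range(n_cols)]
        for row in weights
    ]
    return output, weights
```

Without the `scale` factor the dot products grow with the key dimension, which pushes the softmax toward near one-hot weights.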

Columns and DataType Not Explicitly Set on line 290 of utils.py

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, it may silently continue the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios.
Solution: It is recommended to set the columns and DataType explicitly in data processing.
Impact: Readability

Example:

### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]

### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})

You can find the code related to this smell in this link: https://github.com/ethen8181/machine-learning/blob/916fc7fe0e5e788a1cc8b8f4d24d44f05c492d5e/model_selection/prob_calibration/calibration_module/utils.py#L280-L300.

I also found instances of this smell in other files, such as:

File: https://github.com/ethen8181/machine-learning/blob/master/big_data/sparkml/get_data.py#L21-L31 Line: 26
File: https://github.com/ethen8181/machine-learning/blob/master/data_science_is_software/src/features/build_features.py#L4-L14 Line: 9
File: https://github.com/ethen8181/machine-learning/blob/master/deep_learning/contrastive/clip/clip/utils.py#L5-L15 Line: 10
File: https://github.com/ethen8181/machine-learning/blob/master/model_selection/partial_dependence/partial_dependence.py#L307-L317 Line: 312

I hope this information is helpful!

Possible mistake in sanity check function

The z variable isn't used in the interval calculation:

def sanity_check(size1, size2, significance = 0.05):
    n = size1 + size2
    confidence = 1 - significance
    z = stats.norm.ppf(confidence + significance / 2)
    confint = n * 0.5 + np.array([-1, 1]) * np.sqrt(n * 0.5 * 0.5)
    return confint

Source: http://ethen8181.github.io/machine-learning/ab_tests/frequentist_ab_test.html#Sanity-Check
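
One way the interval could use z is sketched below. It is an assumption that this is what the author intended: the sketch multiplies the standard deviation of a Binomial(n, 0.5) count by the two-sided critical value, and it uses the standard library's `statistics.NormalDist` in place of `stats.norm.ppf` so that it runs without SciPy or NumPy.

```python
from math import sqrt
from statistics import NormalDist


def sanity_check(size1, size2, significance=0.05):
    """Interval for one group's size under a 50/50 split (normal approximation)."""
    n = size1 + size2
    # Two-sided critical value; the same quantile as
    # stats.norm.ppf(confidence + significance / 2) in the original snippet.
    z = NormalDist().inv_cdf(1 - significance / 2)
    center = n * 0.5
    # Standard deviation of a Binomial(n, 0.5) count.
    std = sqrt(n * 0.5 * 0.5)
    # Scale the half-width by z so the interval reflects the significance level.
    return center - z * std, center + z * std
```

For example, `sanity_check(5000, 5000)` gives an interval of roughly 4902 to 5098, so either group size falling inside that range would be consistent with an even split at the 5% level.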

Machine learning

Why does Logistic Regression Solver impact the conclusion?

Ethen, I have an interesting finding.

If we change the solver of LogisticRegression from 'liblinear' to the default 'lbfgs', the effect is no longer significant (p-value = 0.1605910849805837). What is the reason behind this change? Why did you choose 'liblinear' instead of any other solver? Thanks!
