mml-book / mml-book.github.io
Companion webpage to the book "Mathematics For Machine Learning"
There are m rows in A (not k).
Eq. 2.12 uses different dimensions from those in line 844, which does not seem necessary to me. Shouldn't the dimensions be consistent here?
Describe the mistake
The way this line is worded seems to suggest that the given Moore-Penrose pseudo-inverse can work for non-invertible square matrices.
Location
Proposed solution
Perhaps give the mild assumptions required for 2.46 to hold. (The columns of the matrix must be linearly independent).
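As a minimal numerical sketch of the proposed assumption (the matrix `A` here is a hypothetical example, not one from the book): when the columns of a tall matrix are linearly independent, the left pseudo-inverse formula agrees with the Moore-Penrose pseudo-inverse.

```python
import numpy as np

# Hypothetical example: a tall matrix A whose columns are linearly
# independent, so A^T A is invertible and Eq. 2.46 applies.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

left_pinv = np.linalg.inv(A.T @ A) @ A.T  # (A^T A)^{-1} A^T
print(np.allclose(left_pinv, np.linalg.pinv(A)))  # True
```

For a matrix with linearly dependent columns, `A.T @ A` is singular and the formula breaks down, which is exactly the assumption being requested.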
end of sentence "use a bold letter to them," --> "use a bold letter to denote them,"
Describe the mistake
While the line "If A is invertible (A^{-1})^T = (A^T)^{-1}" is certainly true, it would be useful to state somewhere that we know that (A^T)^{-1} exists.
Location
Proposed solution
Replace the line with: "If A is invertible then so is A^T and (A^{-1})^T = (A^T)^{-1}"
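The proposed statement can be sketched numerically (the matrix below is an assumed example, not taken from the book):

```python
import numpy as np

# Assumed example: A is invertible (det = 1), so A^T is invertible too,
# and transposition and inversion commute.
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

lhs = np.linalg.inv(A).T   # (A^{-1})^T
rhs = np.linalg.inv(A.T)   # (A^T)^{-1}
print(np.allclose(lhs, rhs))  # True
```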
Why aren't anti-parallel vectors the most dissimilar?
In chapter 5, page 118, line 2732: univariate is spelled incorrectly as "Univeriate"
Version 30/1/2018
Line 8383
There is a typo in the word "misspelled".
(Demo entry)
Hi,
Firstly, I love the style of this book - very clear and precise. I'm a big fan of Strang's lectures too.
On page 175 you mention some hardware-oriented constraints that encourage the use of large batch sizes to optimise performance in SGD, but these are predicated upon the use of GPUs, which will not remain the dominant form of processor for ML for much longer. New devices are being produced right now that don't have the memory bandwidth or data-path width issues of GPUs, and papers such as this one: https://arxiv.org/abs/1804.07612 show that small batches are better for a number of reasons.
I assume you don't want your book to be out of date, or bound to existing and dated hardware practices.
Will definitely be buying this book as a reference resource!
Cheers,
Chris
Add "represent" or a similar word to the following phrase in line 751 "use a bold letter to them"
Missing line numbers and words in chapter 10
Version number: Draft chapter (May 28, 2018)
Many lines in chapter 10 PDF have line numbers missing. Examples:
-See line number 4442-4443 (page 255)
-See line number 4501-4502 (page 257)
-See line number 4518-4519 (page 257-258)
Also, in some places words seem to be missing or contain minor typos.
Thanks.
Describe the mistake
In Equation 5.46 the (2x+1) term should have the 3 exponent in the last two steps.
Location
The definite article "The" implies that there are no other definitions for an inner product of functions. But there is, for instance, a more general form with a weighting function corresponding to a metric depending on x. Maybe "An inner product of two functions can be defined as ..." would be more appropriate.
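The more general weighted form being referred to could, for instance, be written as (a sketch; the interval and weight are assumptions, not the book's notation):

```latex
\langle f, g \rangle = \int_a^b f(x)\, g(x)\, w(x)\, \mathrm{d}x,
\qquad w(x) > 0,
```

with the unweighted definition recovered for $w(x) \equiv 1$.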
I appreciate that the authors are being careful about the scope of the book, but in my own studies of ML, I've noticed that basic information theory would go a long way in helping understand some important concepts. I'm thinking of KL-Divergence, information gain, AIC, cross entropy and other concepts which show up even in basic ML.
Some ML books that do introduce information theory start by talking about coding theory and communication channels, which may have been what motivated information theory but seems like the wrong approach for teaching it to data scientists or ML practitioners.
This paper takes an interesting approach: it starts with KL divergence before even introducing entropy.
Divergence, Entropy, Information
https://arxiv.org/abs/1708.07459
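To illustrate how little machinery the core concept needs, here is a minimal sketch of KL divergence between two discrete distributions (the distributions `p` and `q` are made-up examples):

```python
import numpy as np

# D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))      # 0.0: a distribution diverges from itself by zero
print(kl_divergence(p, q) > 0)  # True: KL divergence is non-negative
```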
I don't see the purpose of the additional complexity by mentioning subspaces here.
The London-Munich example seems to imply that orthogonal vectors are independent while a third (non-orthogonal) one is not. Of course, any pair is independent here, and a third one makes the set dependent. Maybe use East and Southeast as the first pair and South as the third. (A more natural non-orthogonal coordinate system would be even better, but none comes to mind.)
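The suggested replacement example can be checked with a quick rank computation (the coordinate choices below are my own assumption for the compass directions):

```python
import numpy as np

# Assumed coordinates: East = (1, 0), Southeast = (1, -1), South = (0, -1).
east = np.array([1.0, 0.0])
southeast = np.array([1.0, -1.0])
south = np.array([0.0, -1.0])

# The non-orthogonal pair is independent; adding South makes the set dependent,
# since South = Southeast - East.
print(np.linalg.matrix_rank(np.column_stack([east, southeast])))         # 2
print(np.linalg.matrix_rank(np.column_stack([east, southeast, south])))  # 2
```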
It should be (8, 2, -1, 0) instead of (8, 12, -1, 0).
From Hacker News:
Not sure if the authors will read this or not, but I beg of you, please put a table of notation in the foreword. The Sutton and Barto Reinforcement Learning book did that for basically every notation that wasn't basic algebra and it's been extremely helpful.
Just labeling things I had never seen before, like indicator functions, was extremely valuable. Especially for this kind of book that is introducing mathematics to people from a broad background, I think it's important to understand how much of an impediment it is to not know notation by sight. Trying to Google or search for notation is a nightmare.
Describe the mistake
The label used in between the Linear/Affine Mapping boxes overlaps with the box.
Location
Proposed solution
Maybe positioning it at the center of the arrow would help.
Describe the mistake
The line: "Adding the first two equations yields (1) + (2) = 2x_1 + 3x_3 = 5"
feels a little off because the two uses of = here are fundamentally different.
Really, (1) + (2) = (2x_1 + 3x_3 = 5), but this bracketing could be confusing.
Location
Proposed solution
Reword the line to be: Adding the first two equations ((1) + (2)) yields 2x_1 + 3x_3 = 5
or: Adding the first two equations yields 2x_1 + 3x_3 = 5
Describe the mistake
The caption of Figure 10.11 is not visible
Location
Proposed solution
There must be a bug in the figure environment.
Verification by plugging in is actually incorrect. The reason this is a unique solution to the system of linear equations stems from the theorem: a non-homogeneous system of linear equations has a unique solution if and only if the determinant is non-zero; otherwise it has either no solutions or infinitely many solutions. It is also possible I misunderstood the "verify by plugging in" statement: is it indicative of both uniqueness and existence, or only of existence? This may just be my lack of understanding of the statement, but maybe it can be made clearer.
Location
Proposed solution
What about something along the lines of: "The existence of the solution is verified by plugging in the vector, and the uniqueness is a result of determinants" (and provide a link/extra reading on the matter for those interested)?
Additional context
Determinants are a fairly useful concept when it comes to the understanding of basics of matrices, would it be valuable to have a mention of it in the chapter apart from the additional exercises presented at the end?
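The determinant criterion described above can be sketched numerically (the system `A`, `b` below is an assumed example, not the one from the book):

```python
import numpy as np

# Assumed example system Ax = b with a non-zero determinant,
# so the solution exists and is unique.
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])
b = np.array([5.0, 0.0])

assert abs(np.linalg.det(A)) > 1e-12  # non-zero det => unique solution
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))  # True: existence verified by plugging in
```

Plugging in confirms existence; the determinant check is what establishes uniqueness.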
Would you mind using
What about
Use
Also, isn't the gradient (column vector) the transpose of the Jacobian (row vector)?
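The shape relationship in question can be sketched for a scalar-valued function (the example f(x) = x^T x is my own choice, not the book's):

```python
import numpy as np

# For f(x) = x^T x: the Jacobian is the 1 x n row vector 2 x^T,
# and the gradient is its transpose, the n x 1 column vector 2 x.
x = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)

jacobian = 2 * x.T   # shape (1, 3): row vector
gradient = 2 * x     # shape (3, 1): column vector
print(np.array_equal(gradient, jacobian.T))  # True
```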
What about having the differential operator d upright, rather than italic like a variable?
Figure 5.6 uses bold upright font, instead of the italic one.
Coloured equations (like 5.189) have bad spacing when the colour is changed. Have a look at the "+" spacing.
Is it possible to have a look at the TikZ source? Do you use any GUI, or you code them up from scratch? Say, Figure 5.9, for example.
Unless it is for educational purposes (to force the reader to really parse the equations carefully), the use of Greek letters for different purposes may be confusing. Can one use \Psi for instance?
For which space is A in eq 2.67 not a generating set? Eq 2.65 and 2.66 deal with R^3.
There is no precise explanation that the vector space operations + and \cdot are inherited from space V.
Furthermore, the term "R-vector space" seems not to be defined.
Both the explanation of figure 3.3 and the text after eq. 3.3 refer to norm 1. Adding axis units would convey the concept much clearer.
"The intersection of all subspaces U_i ⊆ V is called linear hull of V."
As {0} is a subspace of V, the intersection of all subspaces is always {0}.
The linear hull (linear span) is defined for a set of vectors.
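For reference, the definition for a set of vectors could be stated as (a sketch in generic notation, not necessarily the book's):

```latex
\operatorname{span}(\{\boldsymbol{x}_1, \dots, \boldsymbol{x}_k\})
= \Bigl\{ \textstyle\sum_{i=1}^{k} \lambda_i \boldsymbol{x}_i
\;\Bigm|\; \lambda_1, \dots, \lambda_k \in \mathbb{R} \Bigr\}.
```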
Describe the mistake
The wording in lines 920, 921 directly after the definition of a group seems to suggest that R, N, Z are groups with \otimes = + or \cdot, and that P(B) is a group under \cap or \cup.
Of those pairs, only (R, +) and (Z, +) actually form groups.
Location
Please provide the
Proposed solution
Just delete these lines. You give examples of these things directly below anyway.
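To make the counterexample concrete, here is a small enumeration (the two-element base set `B` is my own assumption) showing that P(B) under intersection has no inverses, so it cannot be a group:

```python
from itertools import combinations

# Powerset of B = {1, 2}. Under intersection, the identity would have to
# be B itself, since S & B == S for every subset S.
B = frozenset({1, 2})
powerset = [frozenset(c) for r in range(3) for c in combinations(B, r)]

# {1} has no inverse: {1} & X is always a subset of {1}, never all of B.
identity = B
has_inverse = any(frozenset({1}) & X == identity for X in powerset)
print(has_inverse)  # False: no inverse exists, so (P(B), ∩) is not a group
```

A symmetric argument (with the empty set as identity) shows (P(B), ∪) fails too.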
Doesn't the M-step maximize the expected joint likelihood p(x, z), where the expectation is taken under the posterior distribution? Eq. 12.82 and the following lines seem to suggest that the expected p(x | \theta) is maximized.
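The M-step objective being referred to could be written as the expected complete-data log-likelihood (a sketch in standard EM notation, which may differ from the book's):

```latex
\theta^{\text{new}} = \arg\max_{\theta}\;
\mathbb{E}_{p(\boldsymbol{z} \mid \boldsymbol{x}, \theta^{\text{old}})}
\bigl[ \log p(\boldsymbol{x}, \boldsymbol{z} \mid \theta) \bigr],
```

where the expectation is over the posterior computed in the E-step with the old parameters.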
three lines above:
homomorphism
Please find below a list of what I think are errors, and what I think could require modification.
Errata:
Line 2716: "where we look" instead of "where look"
Line 2855 (+5): "the gradient, we compute" instead of "the gradient compute"
Eq. (5.86): Index over the sum should be N instead of D
Eq. (5.103): Index over the sum should be m instead of n
Line 2895: ", and every" instead of ", end every"
Line 2906: "a taste" instead of "a taster"
Eq. (5.117), (5.139), (5.146): Error of sign in the derivative of the square root function
Line 2920: "f_{i}(x_{i-1})" instead of "f_{i}(x)"
Line 2925: "j" instead of "i"
Eq. (5.142): The second element of the second term of the RHS should be the partial derivative of e with respect to c instead of the partial derivative of d with respect to c.
Eq. (5.184), (5.185): Not being very knowledgeable about it, this felt very counter-intuitive. I would have derived equation (5.184) by applying the operator d/dx on H, but it seems the equation was derived by applying it from the inside. Though it does not modify the final result, it feels odd.
Feedback:
Eq. (5.98): Would have found it more intuitive to have a transpose of the zero vector right before the transpose of x in the RHS of the equation.
Line 2927: Notation seems ambiguous. Clarifying to make sure that people understand that \theta is the set of all A_{.} and b_{.}, in contrast with \theta_{j} which contains only the associated A_{j} and b_{j}, would be nice.
I did truly appreciate that chapter.
When am I going to be able to buy and review this book?
Also, the link https://www.mml-book.com/ gives me a "site cannot be reached" error.
Rationale: Scaling is still a binary relation (as opposed to the more general checks starting from line 1260) and is almost as easy to check as identity.
According to eq. 3.11 the inner product <x, y> (LHS) is defined by both the co-ordinates of x and y (\hat{x} and \hat{y}) and A (RHS).
line 761: "There is a 1:1 correspondence between any kind of vector and R^n." Does kind here mean any dimension? Or any field? What about infinite dimensional vectors?
"Tuples" is written as "tupels".
Version: Draft (2018-02-25)
Chapter 2, Page 46, Line 1399
"Inversion" is misspelled.