In case of any errors on displaying the math formulas, please use this this file for reviewing instead.
Ref: https://www.youtube.com/watch?v=zzTbptEdKhY
- What the video is about?
Answer: How to recommend content to users in YouTube.
- Decribe a little bit about the initial neural network.
Answer: Inputs are embedded videos watched and embedded search tokens with geographic information. There are three ReLU layers with a KNN and a softmax in the end.
- What is the cold start problem in this video?
Answer: If you have a new video, there will not be much signals or information to train the neural networks.
- What is one solution of the cold start problem?
Answer: The topic of the video would help.
- How they modify the initial model to solve the cold start problem?
Answer: Adding topic vector to the input.
- What's the method they use to generate topic vector?
Answer: Deep CNN, which means to train on the video shots or audio.
- How to maintain a huge list of topics?
Answer: Use a structured knowledge graph.
- What are the components of triples?
Answer: Entity, property, value.
- What is the solution if the names are not exactly match?
Answer: Look at all the hyperlinks inside wikipedia and these always links to many different names.
- How to decide which topic is the central topic of a video?
Answer: Looks at co-occurrence of topics on wikipedia, so some topics get more votes.
- Describe the steps of entity linking using textual metadata.
Answer: mention detection, disambiguation, pruning.
Ref: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf
- What are the steps for building this recommendation system?
candidate generate, scoring, reranking
- What are the two models they use in building this recommendation system?
candidate generation model, ranking model
- Why do we need a ranking model?
By using a ranking model, they reduced the number of the recommendations based on candidate generations from hundreds to dozens, which makes it easier for users to browse.
- What is gradient descent?
Answer: gradient descent is a iterative optimization algorithm for finding the minimum of a function.
- Suppose we are considering a matrix factorization model with
n
users andm
items, with a embedding size ofK
. Then how many parameters show we have in this model.
K * (m + n)
- Given the gradient descent equations of matrix factorization with MSE in algebra expressions.
- Given the gradient descent equations of matrix factorization with bias using MSE as the loss in algebra expressions.
- Given the gradient descent equations of matrix factorization with L2 regularization using MSE as the loss in algebra expressions.
- Given the gradient equations of matrix factorization in matrix expressions.
Where,
- Given a loss function used for optimizing matrix factorization.
- How would you use a content based recommendation system for online blogs?
Answer: Analyze content of the blogs, then use the similarity between blogs to recommend blogs similar to what a user likes.
- How would you use a collaborative filtering recommendation system for online blogs?
Answer: Use past user behaviours and similarities between users and items simultaneously to provide recommendations.
- Select "explicit feedback" (E) or "implicit feedback" (I) for the following user data.
rating
clicks
transactions
purchases
navgigation history
Solution:
rating E
clicks I
transactions I
purchases I
navgigation history I
- Explain the difference between implicit feedback or explicit feedback?
- Implicit rating: make inference from user's behavior
- Explicit rating: ask users to rate items
- What are the benefits and drawbacks for implicit feedbacks?
- Benefits:
- available and easy to get
- Drawbacks:
- no negative feedback
- no preference level
- What are the benefits and drawbacks for explicit ratings?
- Benefits:
- balanced pos and neg feedbacks
- can have preference level
- clear and direct
- Drawbacks:
- biased data
- sparse data
- difficult to get
- Suppose we want to build a recommendation system based on clicks and the data of clicks are listed as follows,
user item
0 1
0 0
1 1
1 4
2 3
2 2
3 3
What is the Utility matrix we can use for this task?
Solution:
- Following the last question, what is the utility matrix if we fill missing as negatives?
- Following the last question, what is the utility matrix if we use negative samples? (Just give one possible answer)
- Embedded user ratings are implicit feedbacks.
False
- What is the difference between model-based filtering and memory-based filtering?
Answer:
- Memory based: Remember the utility matrix and make recommendations based on the KNN algorithm. It tends to be slow.
- Model based: Fit a model and make recommendations based on model predictions. It tends to have better performance. This is usually called matrix factorization.
- Suppose we want to predict the rating of user Alice on a movie by memory based collaborative filtering. Then what methods can we use?
- User-based KNN filtering: find users the nearnest to Alice, then make predictions based on the average of the ratings on that movie.
- Item-based KNN filtering: find the ratings of movies similar to that movie and then predict the average ratings based on these movies.
- What is the cold start problem?
Answer: If you have a new user or a new video, there will not be much signals or information to train the neural networks.
- What are the solutions for the user cold start problem?
Answer: recommend popular, require interests information, link to social media.
- How can we recommend based on the content of documents? What are the steps?
Answer: use TF-IDF. The steps are
- Remove stopwords
- Compute TF-IDF scores
- Keep works with high score
- Make a vector of size N to represent the document
- Given the following two users' ratings on items, calculate the Jaccard distance between them.
user 1 4 5 5 1 3 2
user 2 3 4 3 1 2 1
Solution:
sim = 1/8
- Given the following two users' ratings on items, and treating rating 1 and 2 to 0 and 3, 4, 5 to 1. Then calculate the Jaccard distance between them.
user 1 4 5 5 1 3 2
user 2 3 4 3 1 2 1
Solution:
user 1 1 1 1 0 1 0
user 2 1 1 1 0 0 0
If we fill missing with negatives,
user 1 1 1 0 1 0 0 1 0
user 2 0 1 1 1 0 0 0 0
Then,
sim = 5/8
- Given the following two users' ratings on items, calculate the cosine similarity between them.
user 1 1 5 3
user 2 1 2 3
Solution:
cosine sim = (1 + 10 + 9) / (sqrt(1 + 25 + 9) * sqrt(1 + 4 + 9))
= 20 / (sqrt(35) * sqrt(14))
= 20 / (7 * sqrt(10))
- Given the following two users' ratings on items, calculate the pearson similarity between them.
user 1 1 5 3
user 2 1 2 3
Solution:
avg_u1 = 3
avg_u2 = 2
pearson sim = (2 + 0 + 0) / (sqrt(4 + 4 + 0) + sqrt(1 + 0 + 1))
= 2 / (3 * sqrt(2))
- What are the steps for content based kNN?
- Compute profile vectors for users and items
- Find a similarity measure and compute similarity between users and items
- Recommend to a user items with high similarity by kNN
- What are the benefits and drawbacks for a content based system?
Benefits:
- no need data on other users
- don't have a cold start problem
- can be explained
Drawbacks:
- difficult to construct feature vector
- difficult to recognize new genres
- hard to exploit the qualify of judgements from other users
- List two features we can use to build a user profile for movie recommendations.
Answer: Demographics, interests, search history, current location, item set, etc.
- What are the three types of negative sampling?
Answer: User-oriented sampling, item-oriented sampling, uniform random sampling
Suppose we are given the embedding matrices emb_user
and emb_movie
from fitting the matrix factorization problem and the model is,
y_hat = U * V.T
We also have a tabular dataset df
which has the data as follows,
userId | movieId | rating |
---|---|---|
0 | 0 | 4 |
0 | 1 | 3 |
1 | 1 | 5 |
1 | 2 | 5 |
Please write a function in numpy
(alias np
) and pandas
(alias pd
) to compute the prediction column without loops.
Solution:
df['prediction'] = np.sum(emb_user[df['userId'].values]*emb_movie[df['moiveId'].values], axis=1)
Suppose we are given the embedding matrices emb_user
, emb_movie
, emb_user_bias
, emb_movie_bias
from fitting the matrix factorization problem and the model is,
y_hat = U * V.T + B_U + B_V
We also have a tabular dataset df
which has the data as follows,
userId | movieId | rating |
---|---|---|
0 | 0 | 4 |
0 | 1 | 3 |
1 | 1 | 5 |
1 | 2 | 5 |
Please write a function in numpy
(alias np
) and pandas
(alias pd
) to compute the prediction column without loops.
Solution:
df['prediction'] = np.sum(emb_user[df['userId'].values]*emb_movie[df['moiveId'].values], axis=1) + emb_user_bias[df['userId'].values] + emb_movie_bias[df['moiveId'].values]
Suppose we are using MSE for gradient descent and we are given the embedding matrices emb_user
and emb_movie
from fitting the matrix factorization problem.
We also have a tabular dataset df
which has the data as follows,
userId | movieId | rating |
---|---|---|
0 | 0 | 4 |
0 | 1 | 3 |
1 | 1 | 5 |
1 | 2 | 5 |
Y
is a sparse representation of df
, and we are given Y_hat
which is a sparse matrix of the prediction if the items appears in df
.
Please write a function in numpy
(alias np
) and pandas
(alias pd
) to compute the gradients without loops.
Solution:
grad_user = -2 / N * np.dot((Y.toarray() - Y_hat.toarray()), emb_movie)
grad_movie = -2 / N * np.dot((Y.toarray() - Y_hat.toarray()), emb_user)
Suppose we are using momentum for calculating gradient descents and we are given the embedding matrices emb_user
and emb_movie
from fitting the matrix factorization problem. The dataset df
and its sparse representation Y
are provided.
There are also some other variables we have defined,
v_user
: the history velocity of user embeddings for momentum gradient descentv_movie
: the history velocity of movie embeddings for momentum gradient descentbeta
: the weight for historical velocityiterations
: the numer of iterations for gradient descentlearning_rate
: the given learning rate
The gradient of user and movie and be calculated through calling gradient
within each iteration.
def gradient_descent(df, Y, emb_user, emb_movie):
v_user = np.zeros_like(emb_user)
v_movie = np.zeros_like(emb_movie)
beta = 0.9
iterations=100
learning_rate=0.01
for i in range(iterations):
grad_user, grad_movie = gradient(df, Y, emb_user, emb_movie)
# TODO: Implement here.
return emb_user, emb_movie
Please implement the function in numpy
(alias np
) and pandas
(alias pd
) to compute the gradient descents.
Solution:
v_user = beta * v_user + (1 - beta) * grad_user
v_movie = beta * v_movie + (1 - beta) * grad_movie
emb_user -= learning_rate * v_user
emb_movie -= learning_rate * v_movie
- What is the shape of the tensor
out
?
embed = nn.Embedding(5,7)
x = torch.LongTensor([[1,0,1,4,2]])
out = embed(x)
Answer:
[1, 5, 7]
- What is the minimum valid integer for the blank?
embed = nn.Embedding(_______,7)
x = torch.LongTensor([[1,1,2,2,3],[5,4,4,2,1]])
out = embed(x)
Answer:
6
- What is the shape of the tensor
out
?
embed = nn.Embedding(10,5)
x = torch.LongTensor([1,0,1,4,2,0])
out = embed(x)
Answer:
[6, 5]
- What is the value of
result
in the last line?
embed = nn.Embedding(10,5)
x = torch.LongTensor([1,0,1,4,2,1])
out = embed(x)
bools = out[0] == out[2]
result = bools.detach().numpy()[0]
Answer:
True
- What is the value of the tensor x.grad after running the next few lines?
x = torch.tensor([1.0,3.0,2.0], requires_grad=True)
L = (3*x + 7).sum()
L.backward()
Answer:
tensor([3., 3., 3.])
- What is the value of the tensor x.grad after running the next few lines?
x = torch.tensor([1.0,3.0,2.0], requires_grad=True)
L = (-2*x**2 + 3*x + 7).mean()
L.backward()
Answer:
tensor([-0.3333, -3.0000, -1.6667])
- How to create a tensor of zeros shaped
[2,2,2]
?
torch.zeros(2,2,2)
- How to create a tensor of ones shaped
[2,2,2]
?
torch.ones(2,2,2)
- How to create a tensor of normal distributed random values ranged [0, 1) of shape
[2,2,2]
?
torch.rand(2,2,2)
- How to create a tensor of uniform distributed random integers ranged [3, 10) of shape
[2,2,2]
?
torch.randint(3, 10, (2,2,2))
- What is the shape of the
x
after the following lines?
x = torch.tensor([[1,2,3,4]])
x = x.squeeze()
Answer:
[4]
- What is the shape of the
x
after the following lines?
x = torch.tensor([[1,2,3,4]])
x = x.squeeze(1)
Answer:
[1, 4]
- What is the shape of the
x
after the following lines?
x = torch.tensor([[1],[2],[3],[4]])
x = x.squeeze()
Answer:
[4]
- What is the shape of the
x
after the following lines?
x = torch.tensor([[1,2,3,4]])
x = x.unsqueeze(1)
Answer:
[1,1,4]
- What is the shape of the
x
after the following lines?
x = torch.tensor([[1],[2],[3],[4]])
x = x.unsqueeze(1)
Answer:
[4,1,1]
- What is the shape of the
x
after the following lines?
x = torch.tensor([1,2,3,4])
x = x.unsqueeze(1)
Answer:
[4,1]
- Describe what each of the following lines of code are doing during a training loop.
optimizer.zero_grad()
loss.backward()
optimizer.step()
Answer:
line 1: zero out the gradients for all the tensors in the model
line 2: computing the gradients for all tensors based on loss
line 3: iterate all thee tensors and use the interally stored grad to update their values
- Suppose we are given batch size of 1,000 and the data size is 1,000,000, then calculate how many batches do we have in one epoch.
1000000/1000 = 1000
- Suppose we are given batch size of 1,000 and the data size is 1,000,000. The total number of epoches is 10 per training. Calculate the number of iterations we have in one training.
1000000/1000 * 10 = 10000
Suppose we have a linear regression problem with 4 input variables. Write a model in PyTorch that can be used for this task. Suppose we have imported,
import torch
import torch.nn as nn
import torch.nn.functional as F
Use the following template for your implementation.
class LinearRegression(torch.nn.Module):
def __init__(self):
super(LinearRegression, self).__init__()
# TODO: Implement this method
def forward(self, x):
# TODO: Implement this method
model = LinearRegression()
Solution:
class LinearRegression(torch.nn.Module):
def __init__(self):
super(LinearRegression, self).__init__()
self.linear = torch.nn.Linear(4, 1)
def forward(self, x):
y_pred = self.linear(x)
return y_pred
model = LinearRegression()
Suppose we have a logistic regression problem with 4 input variables. Write a model in PyTorch that can be used for this task. Suppose we have imported,
import torch
import torch.nn as nn
import torch.nn.functional as F
Use the following template for your implementation.
class LogisticLinearRegression(torch.nn.Module):
def __init__(self):
super(LinearRegression, self).__init__()
# TODO: Implement this method
def forward(self, x):
# TODO: Implement this method
model = LogisticLinearRegression()
Solution:
class LogisticRegression(torch.nn.Module):
def __init__(self):
super(LogisticRegression, self).__init__()
self.linear = torch.nn.Linear(4, 1)
def forward(self, x):
y_pred = F.sigmoid(self.linear(x))
return y_pred
model = LogisticRegression()
Suppose we have a classification problem with 10 input variables and 3 output classes. Write a model in PyTorch that can be used for this task. Suppose we have imported,
import torch
import torch.nn as nn
import torch.nn.functional as F
Use the following template for your implementation.
class Classification(torch.nn.Module):
def __init__(self):
super(Classification, self).__init__()
# TODO: Implement this method
def forward(self, x):
# TODO: Implement this method
model = Classification()
Solution:
class Classification(torch.nn.Module):
def __init__(self):
super(Classification, self).__init__()
self.linear = torch.nn.Linear(10, 3)
self.softmax = torch.nn.Softmax(dim=1)
def forward(self, x):
y_pred = self.softmax(self.linear(x))
return y_pred
model = Classification()
Suppose we have a matrix factorization problem with embedding size of 100 without bias. Write a model in PyTorch that can be used for this task. Suppose we have imported,
import torch
import torch.nn as nn
import torch.nn.functional as F
Use the following template for your implementation.
class MF(nn.Module):
def __init__(self, num_users, num_items, emb_size=100):
super(MF, self).__init__()
# TODO: Implement this method
def forward(self, u, v):
# TODO: Implement this method
model = MF()
Solution:
class MF(nn.Module):
def __init__(self, num_users, num_items, emb_size=100):
super(MF, self).__init__()
self.user_emb = nn.Embedding(num_users, emb_size)
self.item_emb = nn.Embedding(num_items, emb_size)
self.user_emb.weight.data.uniform_(0, 0.05)
self.item_emb.weight.data.uniform_(0, 0.05)
def forward(self, u, v):
U = self.user_emb(u)
V = self.item_emb(v)
return (U * V).sum(1)
model = MF()
Suppose we have a matrix factorization problem with embedding size of 100 with bias. Write a model in PyTorch that can be used for this task. Suppose we have imported,
import torch
import torch.nn as nn
import torch.nn.functional as F
Use the following template for your implementation.
class BiasedMF(nn.Module):
def __init__(self, num_users, num_items, emb_size=100):
super(BiasedMF, self).__init__()
# TODO: Implement this method
def forward(self, u, v):
# TODO: Implement this method
model = BiasedMF()
Solution:
class BiasedMF(nn.Module):
def __init__(self, num_users, num_items, emb_size=100):
super(BiasedMF, self).__init__()
self.user_emb = nn.Embedding(num_users, emb_size)
self.item_emb = nn.Embedding(num_items, emb_size)
self.user_bias = nn.Embedding(num_users, 1)
self.item_bias = nn.Embedding(num_item, 1)
self.user_emb.weight.data.uniform_(0, 0.05)
self.item_emb.weight.data.uniform_(0, 0.05)
self.user_bias.weight.data.uniform_(-0.01, 0.01)
self.item_bias.weight.data.uniform_(-0.01, 0.01)
def forward(self, u, v):
U = self.user_emb(u)
V = self.item_emb(v)
b_u = self.user_bias(u).squeeze()
b_v = self.item_bias(v).squeeze()
return (U * V).sum(1) + b_u + b_v
model = BiasedMF()
Write a training loop in PyTorch that can train a linear regression model. The template is given as follows,
def train_model(model, optimizer, train_dl, epoch=10):
for i in range(epochs):
model.train()
total = 0
sum_loss = 0
for x, y in train_dl:
batch = y.shape[0]
# TODO: Implement here.
total += batch
sum_loss += batch*(loss.item())
train_loss = sum_loss / total
print(train_loss)
Solution:
y_hat = model(x)
loss = F.mse_loss(y_hat, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Write a training loop in PyTorch that can train a logistic regression model (assume the model is linear and doesn't apply sigmoid in the end). The template is given as follows,
def train_model(model, optimizer, train_dl, epoch=10):
for i in range(epochs):
model.train()
total = 0
sum_loss = 0
for x, y in train_dl:
batch = y.shape[0]
# TODO: Implement here.
total += batch
sum_loss += batch*(loss.item())
train_loss = sum_loss / total
print(train_loss)
Solution:
y_hat = model(x)
loss = F.binary_cross_entropy_with_logit(y_hat, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Write a training loop in PyTorch that can train a multiclass classification model. The template is given as follows,
def train_model(model, optimizer, train_dl, epoch=10):
for i in range(epochs):
model.train()
total = 0
sum_loss = 0
for x, y in train_dl:
batch = y.shape[0]
# TODO: Implement here.
total += batch
sum_loss += batch*(loss.item())
train_loss = sum_loss / total
print(train_loss)
Solution:
y_hat = model(x)
loss = F.cross_entropy(y_hat, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Suppose we are given x_train
and y_train
, construct a dataloader train_dl
with batch size of 1000, and shuffle the data.
Solution:
class TrainingDataset(data_utils.Dataset):
def __init__(self, x, y):
x = torch.tensor(x).float()
y = torch.tensor(y).float().unsqueeze(1)
self.x, self.y = x, y
def __len__(self):
return len(self.y)
def __getitem__(self, idx):
return self.x[idx], self.y[idx]
train_ds = TrainingDataset(x_train, y_train)
train_dl = data_utils.DataLoader(train_ds, batch_size=1000, shuffle=True)
Suppose we want to get one batch of data from the dataloader train_dl
. Please give the function for this task.
Solution:
x, y = next(iter(train_dl))
Write a function that given a model model
and a dataloader train_dl
that computes the balanced accuracy. Assume that you have a binary classification problem and you can use balanced_accuracy_score
from sklearn
.
Solution:
def metric_accuracy(model, dataload):
model.eval()
accuracies = []
for x, y in dataload:
y_hat = model(x)
accuracy = sklearn.metrics.balanced_accuracy_score(y, y_hat)
accuracies.append(accuracy)
return np.mean(accuracies)
metric_accuracy(model, train_dl)
- What are the features of a bushy tree?
Answer:
Tends to overfit, high variance, low bias
- What are the features of a shallow tree?
Answer:
Tends to underfit, high bias.
- Suppose we have the following dataset. What is the best split for a minimized misclassification error? And report this misclassification error. Let's suppose all the points have the same weight.
X1 | X2 | y |
---|---|---|
0 | 0 | 1 |
0 | 1 | 1 |
1 | 2 | 1 |
1 | 0 | -1 |
2 | 1 | -1 |
Solution:
We can drow a plot as follows.
From this plot, we can know the best split can be 0.2
.
- Continue with the last question, what are the Gini impurity and the cross entropy for the classification after that split?
Answer: $$ Gini = 1/5 * (1-1/5) + 4/5 * (1-4/5) = 0.32 $$
- Suppose we have the following decision tree. Then how many terminal regions do we have finally?
Split 1: x1 > 1.3
Split 2: x2 <= 0.45
Answer:
4, because we have two splits on different features
- Suppose we have the following decision tree. Then how many terminal regions do we have finally?
Split 1: x1 > 1.3
Split 2: x1 <= 0.45
Answer:
3, because we have two splits on the same feature
- Suppose we have a decision tree as follows.
Then how many parameters do we have for this tree? And how many regions do we have as a result?
Answer:
We have 2J parameters (R1 ... RJ and b1 ... bJ).
We have J regions (R1 ... RJ).
- Give three examples of the weak classifiers.
Answer:
logistic regression, shallow decision tree, decision stumps
- What are the features of the weak classifiers?
Answer:
- fast to fit
- don't overfit
- tend to underfit
- high bias
- do not have good accuracy
- How to get a strong classifier? (Give three examples)
Answer:
- add more features or add polynomial features
- use an ensemble model with bagging
- combine weak classifiers
Let's suppose the following information is given, $$ x = (120K, Bad, 50K, Good) $$ And the weights for each tree in the model are, $$ \alpha_1 = 2,\ \alpha_2 = 1.5,\ \alpha_3 = 1.5,\ \alpha_4 = 0.5 $$ Solution:
Let's suppose the labels are Risky
.
- Consider we have the following weighted dataset.
x1 | X2 | y | w |
---|---|---|---|
0 | 0 | 1 | 1/5 |
1 | 1 | -1 | 3/5 |
2 | 2 | 1 | 2/5 |
Find a stump that minimize the weighted classification error.
Solution: We can draw the following plot.
So that the stump with
Another possible answer can be
- Consider the following dataset.
x1 | x2 | y | w |
---|---|---|---|
0 | 0 | -1 | 2/10 |
0 | 1 | -1 | 2/10 |
1 | 1 | -1 | 3/10 |
1 | 3 | 1 | 1/10 |
2 | 1 | 1 | 2/10 |
Compute the weighted classification error of the following classifier. Which points are misclassified? $$ g(x)= \begin{cases}1, & \text { if } x_{2} \geq 0.5 \ -1, & \text { otherwise }\end{cases} $$ Answer:
Point 2 and 3 are misclassified. The weighted misclassification error is, $$ error = (2+3)/(2+2+3+1+2) = 1/2 $$
- Continue with the last problem, find a stump that minimizes the weighted misclassification error and report this error.
Solution: We can draw a plot as follows.
Then the stump with
- Consider the following dataset and
play tennis
is the label.
Tempature | Wind | play tennis |
---|---|---|
hot | F | no |
hot | T | no |
hot | F | yes |
mild | F | yes |
cool | F | yes |
cool | T | no |
Run Adaboosting by hand for 2 iterations.
Solution:
X1 | X2 | Y | W0 |
---|---|---|---|
0 | 0 | -1 | 1/6 |
0 | 1 | -1 | 1/6 |
0 | 0 | 1 | 1/6 |
1 | 0 | 1 | 1/6 |
2 | 0 | 1 | 1/6 |
2 | 1 | -1 | 1/6 |
The first stump should be 1/6
. So
X1 | X2 | Y | W0 | T1 | W1 |
---|---|---|---|---|---|
0 | 0 | -1 | 1/6 | 1 | 5/6 |
0 | 1 | -1 | 1/6 | -1 | 1/6 |
0 | 0 | 1 | 1/6 | 1 | 1/6 |
1 | 0 | 1 | 1/6 | 1 | 1/6 |
2 | 0 | 1 | 1/6 | 1/6 | |
2 | 1 | -1 | 1/6 | -1 | 1/6 |
The second stump should be
X1 | X2 | Y | W0 | T1 | W1 | T2 | W2 |
---|---|---|---|---|---|---|---|
0 | 0 | -1 | 1/6 | 1 | 5/6 | -1 | 5/6 |
0 | 1 | -1 | 1/6 | -1 | 1/6 | -1 | 1/6 |
0 | 0 | 1 | 1/6 | 1 | 1/6 | -1 | 4/6 |
1 | 0 | 1 | 1/6 | 1 | 1/6 | 1 | 1/6 |
2 | 0 | 1 | 1/6 | 1 | 1/6 | 1 | 1/6 |
2 | 1 | -1 | 1/6 | -1 | 1/6 | 1 | 4/6 |
The final predictions are, $$ \hat{y}_1 = sign(log5 - log 4) = 1\ \hat{y}_1 = sign(-log5 - log 4) = -1\ \hat{y}_1 = sign(log5 - log 4) = 1\ \hat{y}_1 = sign(log5 + log 4) = 1\ \hat{y}_1 = sign(log5 + log 4) = 1\ \hat{y}_1 = sign(-log5 + log 4) = -1\ $$
- Fill in the following blanks.
Solution:
- What's the meaning of forward stepwise additive model?
Answer:
- forward: means we don't go back
- stepwise: means we take one step for each iteration
- additive: means we sum up the models to get a prediction
- Loss functions for binary classification are written as a function of margin.
True
- Hinge loss is also called SVM loss.
True
- What is the difference between adaboosting and gradient boosting?
- trees in adaboosting are fitted on the same data changing weights
- trees in gradient boosting are fitted by pseudo residual which changes every iteration
- Gradient boosting is a generalized Adaboosting.
True
- What is the difference between gradient boosting and random forest?
- random forest uses bagging and grow fully grown decision trees, while gradient boosting use weak learners like shallow trees.
- the initial bias for random forest is made as low as possible, but gradient boosting reduces the bias by adding more trees.
- What is the difference between gradient boosting and neural network?
- gradient boosting works well on tabular data
- neural networks are useful when applied to raw sensory data
- Suppose we have the following data.
Sqfeet | Rent |
---|---|
750 | 1160 |
800 | 1200 |
850 | 1280 |
900 | 1450 |
950 | 2000 |
Compute by hand with gradient boosting for two iterations with MAE loss.
Solution:
The first step is,
x | y | f0 | y-f0 | Pr0 |
---|---|---|---|---|
750 | 1160 | 1280 | -120 | -1 |
800 | 1200 | 1280 | -80 | -1 |
850 | 1280 | 1280 | 0 | 0 |
900 | 1450 | 1280 | 170 | 1 |
950 | 2000 | 1280 | 720 | 1 |
To fit a tree based on pseudo residual 0, we have to split at
x | y | f0 | y-f0 | Pr0 | T1 | f1 = f0 + T1 |
---|---|---|---|---|---|---|
750 | 1160 | 1280 | -120 | -1 | -100 | 1180 |
800 | 1200 | 1280 | -80 | -1 | -100 | 1180 |
850 | 1280 | 1280 | 0 | 0 | 170 | 1450 |
900 | 1450 | 1280 | 170 | 1 | 170 | 1450 |
950 | 2000 | 1280 | 720 | 1 | 170 | 1450 |
Continue this process,
x | y | f0 | y-f0 | Pr0 | T1 | f1 | y-f1 | Pr1 |
---|---|---|---|---|---|---|---|---|
750 | 1160 | 1280 | -120 | -1 | -100 | 1180 | -20 | -1 |
800 | 1200 | 1280 | -80 | -1 | -100 | 1180 | 20 | 1 |
850 | 1280 | 1280 | 0 | 0 | 170 | 1450 | -170 | -1 |
900 | 1450 | 1280 | 170 | 1 | 170 | 1450 | 0 | 0 |
950 | 2000 | 1280 | 720 | 1 | 170 | 1450 | 550 | 1 |
To fit a tree, we have to split on
x | y | f0 | y-f0 | Pr0 | T1 | f1 | y-f1 | Pr1 | T2 |
---|---|---|---|---|---|---|---|---|---|
750 | 1160 | 1280 | -120 | -1 | -100 | 1180 | -20 | -1 | -20 |
800 | 1200 | 1280 | -80 | -1 | -100 | 1180 | 20 | 1 | 10 |
850 | 1280 | 1280 | 0 | 0 | 170 | 1450 | -170 | -1 | 10 |
900 | 1450 | 1280 | 170 | 1 | 170 | 1450 | 0 | 0 | 10 |
950 | 2000 | 1280 | 720 | 1 | 170 | 1450 | 550 | 1 | 10 |
Then by
x | y | f0 | y-f0 | Pr0 | T1 | f1 | y-f1 | Pr1 | T2 | f2 |
---|---|---|---|---|---|---|---|---|---|---|
750 | 1160 | 1280 | -120 | -1 | -100 | 1180 | -20 | -1 | -20 | 1160 |
800 | 1200 | 1280 | -80 | -1 | -100 | 1180 | 20 | 1 | 10 | 1190 |
850 | 1280 | 1280 | 0 | 0 | 170 | 1450 | -170 | -1 | 10 | 1460 |
900 | 1450 | 1280 | 170 | 1 | 170 | 1450 | 0 | 0 | 10 | 1460 |
950 | 2000 | 1280 | 720 | 1 | 170 | 1450 | 550 | 1 | 10 | 1460 |
And
- Suppose we have the following data.
Sqfeet | Rent |
---|---|
750 | 1160 |
800 | 1200 |
850 | 1280 |
900 | 1450 |
950 | 2000 |
Compute by hand with gradient boosting for one iteration with MSE loss.
Solution:
Because we have MSE loss, the best initial model is to take the mean value of y.
x | y | f0 | Pr0 = y-f0 |
---|---|---|---|
750 | 1160 | 1418 | -258 |
800 | 1200 | 1418 | -218 |
850 | 1280 | 1418 | -138 |
900 | 1450 | 1418 | 32 |
950 | 2000 | 1418 | 582 |
To minimize MSE loss, we have to fit a stump that split at
x | y | f0 | Pr0 | T1 | f1=f0+T1 |
---|---|---|---|---|---|
750 | 1160 | 1418 | -258 | -145.5 | 1272.5 |
800 | 1200 | 1418 | -218 | -145.5 | 1272.5 |
850 | 1280 | 1418 | -138 | -145.5 | 1272.5 |
900 | 1450 | 1418 | 32 | -145.5 | 1272.5 |
950 | 2000 | 1418 | 582 | 582 | 2000 |
And
- Suppose we have the following data.
Sqfeet | Rent |
---|---|
750 | 1160 |
800 | 1200 |
850 | 1280 |
900 | 1450 |
950 | 2000 |
Compute by hand with gradient boosting for one iteration with MSE loss. Note that we have a shrinkage (learning rate) of 0.8 in this case.
Solution:
This question is similar to the question above and what we have to change is to mutiply that shrinkage when calculating
x | y | f0 | Pr0 | T1 | $\nu$T1 | f1=f0+$\nu$T1 |
---|---|---|---|---|---|---|
750 | 1160 | 1418 | -258 | -145.5 | -116.4 | 1301.6 |
800 | 1200 | 1418 | -218 | -145.5 | -116.4 | 1301.6 |
850 | 1280 | 1418 | -138 | -145.5 | -116.4 | 1301.6 |
900 | 1450 | 1418 | 32 | -145.5 | -116.4 | 1301.6 |
950 | 2000 | 1418 | 582 | 582 | 465.6 | 1883.6 |
Here,
- Suppose we have a learning rate for gradient boosting as
$\eta$ and we find an optimized iteration number$M$ . Then if we want to devide the learning rate by 2 for finding the best learning rate, how should we change$M$ ?
Answer:
Increase M to 2M
- Suppose we have the loss function for gradient boosting as,
Then calculate the best initial constant for its initial model.
Answer:
So the best constant
- Calculate the pseudo residual for MSE loss.
Solution:
For MSE, the pseudo residual is quite simple with, $$ MSE = (y-f)^2 $$ Then, $$ r_{im} = -\left.\frac{\partial (y-f)^2}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} = -2(y^{(i)}-f_{m-1}(x^{(i)})) $$
- Calculate the pseudo residual for MAE loss.
Solution:
For MAE, the loss function is, $$ MAE = |y - f| $$ Then, $$ r_{im} = -\left.\frac{\partial |y - f|}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} $$ This is to say we have, $$ r_{im} = \begin{cases} -\left.\frac{\partial (y - f)}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} , & y-f\geq0 \ -\left.\frac{\partial (f-y)}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}}, & y-f<0 \end{cases} $$ Then, $$ r_{im} = \begin{cases} 1 , & y-f\geq0 \ -1 , & y-f<0 \end{cases} $$ So that we can conclude that, for MAE, the pseudo result can be written as, $$ r_{im} = sign(y^{(i)}-f_{m-1}(x^{(i)})) $$
- Calculate the pseudo residual for the following loss function,
Solution:
For the definition of pseudo residual, $$ r_{im} = -\left.\frac{\partial e^{-yf}}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} = y^{(i)} \cdot e^{-y^{(i)}f_{m-1}\left(x^{(i)}\right)} $$
- Calculate the pseudo residual for the following loss function,
Solution:
The loss function can also be written as,
$$
L(y, f)= \begin{cases}
0, & yf \geq 1\
1-yf, & yf < 1
\end{cases}
$$
Then the gradient of
- Calculate the pseudo residual for the following loss function,
Solution:
Then, $$ r_{im} = -\left.\frac{\partial \log {(1 + e^{-yf})}}{\partial f}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} $$ This is also, $$ r_{im} = -\left.\frac{-y \cdot e^{-yf}}{\log {(1 + e^{-yf})}}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} $$ So, $$ r_{im} = -\left.\frac{-y}{\log {(1 + e^{yf})}}\right|{f=f{m-1}\left(x^{(i)}\right),\ y^{(i)}} = \frac{y^{(i)}}{\log {(1 + e^{y^{(i)}f_{m-1}(x^{(i)})})}} $$
- Calculate the best constant per region for logloss.
Solution:
After getting the pseudo residual and fitting the model , then, the next step is to find the optimal constant per region. For logloss, that is,
$$
\begin{aligned}
\beta_{j} &=\arg \min {\beta} \sum{x^{(i)} \in R_{j}} L\left(y^{(i)}, f_{m-1}\left(x^{(i)}\right)+\beta\right) \
&=\arg \min {\beta} \sum{x^{(i)} \in R_{j}} \log \left(1+e^{-y^{(i)}\left[f_{m-1}\left(x^{(i)}\right)+\beta\right]}\right)
\end{aligned}
$$
Then the derivative is,
$$
G(\beta) = \frac {\partial \sum_{x^{(i)} \in R_{j}} \log \left(1+e^{-y^{(i)}\left[f_{m-1}\left(x^{(i)}\right)+\beta\right]}\right)}{\partial \beta} = -\sum_{x^{(i)} \in R_{j}} \frac{y^{(i)}}{1+e^{y^{(i)}\left[f_{m-1}\left(x^{(i)}\right)+\beta\right]}}
$$
Because
- Fill in the blanks for Gradient boosting with MSE loss.
Answer:
- Fill in the blanks for Gradient boosting with MAE loss.
Solution:
$$ \text{median}{x_i \in R{jm}}(y_i - f_{m-1}(x_i)) $$
$$ f_{m-1}(x) + \sum_{j=1}^{Jm}\beta_{jm}\mathbb{1}{[x_i \in R{jm}]} $$
- Fill in the blanks for general Gradient tree boosting.
Solution:
$$ -\left.\frac{\partial \log {(1 + e^{-yf})}}{\partial f}\right|{f=f{m-1}\left(x_i\right),\ y_i} $$
$$ f_{m-1}(x) + \sum_{j=1}^{Jm}\beta_{jm}\mathbb{1}{[x_i\in R{jm}]} $$
- Fill in the blank for this gradient boosting with log loss.
Solution:
$$ f_{m-1}(x) + \sum_{j=1}^{Jm}\beta_{jm}\mathbb{1}{[x_i\in R{jm}]} $$