yaringal / multi-task-learning-example
A multi-task learning example for the paper https://arxiv.org/abs/1705.07115
License: MIT License
Hi! Really nice work. I wonder how you calculate the noise σ when training a real network?
Why do my results log_var_a and log_var_b always equal zero?
I'm not in the field of deep learning and computer science, but I found this work very interesting. I am confused about what I should do if I want to use the trained model for prediction. Can I achieve this through prediction_model.predict(new_x)? I see that only the trainable_model was trained, but it cannot make predictions. Has the prediction_model been trained at the same time? Thanks very much.
As described in the paper, as the noise σ increases, the corresponding L(W) decreases. But if we understand σ as the uncertainty of y,
maybe it would be better for L to increase with uncertainty, because higher uncertainty means y is harder to learn, so it needs more attention?
The loss function can be optimized in a way that keeps decreasing the log_var values, which I observe in my experiments. One simple solution is to use torch.abs(log_var). Any thoughts on how this might affect the derivation?
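For what it's worth, the per-task term exp(-s)*L + s is bounded below in s: it is minimised at s = log(L), where it equals 1 + log(L). So a steadily decreasing log_var may simply be tracking a shrinking task loss rather than diverging. A quick numpy check (the mse value is chosen purely for illustration):

```python
import numpy as np

def task_term(mse, s):
    # one task's contribution to the multi-task loss: exp(-s) * L + s
    return np.exp(-s) * mse + s

mse = 0.05  # hypothetical per-task squared error
s_grid = np.linspace(-10.0, 5.0, 20001)
best_s = s_grid[np.argmin(task_term(mse, s_grid))]
print(best_s)                  # close to log(0.05) ~ -3.0, not -inf
print(task_term(mse, best_s))  # minimum value 1 + log(mse), negative here
```

Note that torch.abs(log_var) does change the formulation: |s| ≥ 0 means σ² = exp(|s|) ≥ 1, so it implicitly assumes every task's noise scale is at least 1.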
I think this is a lucky demo. When I change the data-generation code, the optimization is guided wrongly and the variance prediction is wrong.
So I think this uncertainty method only works in a situation where the value of the diff is close in scale to the precision and log-variance terms. If their values are not at the same scale, the method breaks.
```python
import numpy as np

Q, D1, D2 = 1, 1, 1  # input/output dims, as in the repo's demo

def gen_data(N):
    X = np.random.randn(N, Q)
    w1 = 2. * 1e2
    b1 = 8. * 1e2
    sigma1 = 10  # ground truth
    Y1 = X.dot(w1) + b1 + sigma1 * np.random.randn(N, D1)
    w2 = 3 * 1e2
    b2 = 3 * 1e2
    sigma2 = 1 * 1e2  # ground truth
    Y2 = X.dot(w2) + b2 + sigma2 * np.random.randn(N, D2)
    return X, Y1, Y2
```
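One way to see the scale issue: once the linear map is fitted, the typical squared residual for task 2 is around sigma2**2 = 1e4, so its optimal log-variance sits near log(1e4) ≈ 9.2, and starting from log_var = 0 its precision-weighted term dwarfs task 1's. A rough numpy estimate (sample count and dims assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 1  # sample count and output dim assumed for illustration
sigma1, sigma2 = 10.0, 100.0

# typical squared residual for each task once the mean is fitted
mse1 = np.mean((sigma1 * rng.standard_normal((N, D))) ** 2)
mse2 = np.mean((sigma2 * rng.standard_normal((N, D))) ** 2)

# The per-task term exp(-s) * mse + s is minimised at s = log(mse), so
# the two log-variances must travel to very different values; starting
# from log_var = 0, task 2 contributes ~1e4 to the loss and its
# gradients initially swamp task 1's.
print(np.log(mse1), np.log(mse2))  # roughly 4.6 and 9.2
```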
Thanks for the great research and code sharing.
After reading the paper and using it in my research, I got a question.
There are two styles for the implementation of weighted loss.
Case 1) L = w_a * L_a + w_b * L_b + w_c * L_c
Case 2) L = L_a + w_b * L_b + w_c * L_c
In Case 2, the weight of the loss L_a is fixed to 1. In my humble opinion, I would guess that w_b and w_c are then learned relative to L_a, with the log_vars adjusting accordingly.
In your paper and code, on the other hand, all weights, i.e., all log_vars, are set to be learnable, as in Case 1.
Is there any reason to prefer Case 1? Could it be a problem if I use the style of Case 2?
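The two styles can be compared directly; in a minimal numpy sketch (toy loss values, function names hypothetical), Case 2 is just Case 1 with the first log-variance pinned at 0:

```python
import numpy as np

def multitask_loss(task_losses, log_vars):
    # Case 1: every task gets its own learnable log-variance
    return sum(np.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

def multitask_loss_anchored(task_losses, log_vars):
    # Case 2: pin the first task's log-variance at 0 (weight 1) and
    # learn the remaining ones relative to it
    return multitask_loss(task_losses, [0.0] + list(log_vars))

losses = [1.0, 2.0, 4.0]  # toy per-task losses
print(multitask_loss(losses, [0.0, 0.5, 1.0]))
print(multitask_loss_anchored(losses, [0.5, 1.0]))  # identical by construction
```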
There are a couple of instances where the code doesn't agree with the paper (https://arxiv.org/pdf/1705.07115.pdf):
For each individual task's loss, the paper suggests adding the log of the task-dependent standard deviation to the multi-task loss. However, the code adds the log of the task-dependent variance instead. Why this discrepancy? Is there a typo in the paper?
Weights given to each loss in the code don't correspond to those in the equations presented in the paper; please see my follow-up here: #1
@yaringal Hi, I have a question about your multi-task loss function.
Below you return the loss as torch.mean(loss), but if I understand this function correctly, loss is just a single tensor value and not a list, so torch.mean(loss) will be the same as loss. What was your motivation for using torch.mean(loss)?
Thank you!
```python
import torch

def criterion(y_pred, y_true, log_vars):
    loss = 0
    for i in range(len(y_pred)):
        # precision = 1 / sigma**2, with log_vars[i] = log(sigma**2)
        precision = torch.exp(-log_vars[i])
        diff = (y_pred[i] - y_true[i]) ** 2.
        # sum over the output dimension, leaving one entry per sample
        loss += torch.sum(precision * diff + log_vars[i], -1)
    return torch.mean(loss)
```
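One detail worth checking: after the sum over the last axis, loss is a vector with one entry per sample, not a scalar, so the mean does average over the batch. A numpy mirror of the function (batch and output dims assumed for illustration) makes the shapes visible:

```python
import numpy as np

def criterion_np(y_pred, y_true, log_vars):
    # numpy mirror of the torch criterion, returning the mean and the
    # shape of the per-sample loss vector before the final mean
    loss = 0
    for i in range(len(y_pred)):
        precision = np.exp(-log_vars[i])
        diff = (y_pred[i] - y_true[i]) ** 2.0
        loss = loss + np.sum(precision * diff + log_vars[i], axis=-1)
    return loss.mean(), loss.shape

batch, D1, D2 = 4, 1, 1  # assumed dims
y_pred = [np.zeros((batch, D1)), np.zeros((batch, D2))]
y_true = [np.ones((batch, D1)), np.ones((batch, D2))]
mean_loss, shape = criterion_np(y_pred, y_true, [0.0, 0.0])
print(shape)      # (4,): one loss entry per sample
print(mean_loss)  # 2.0
```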
I read the paper carefully, and I think the formula in the paper is fundamentally wrong.
Under formulas (2) and (3), the probability output has a Gaussian distribution. However, the probability can't have a Gaussian distribution, as it is distributed in [0, 1] rather than (-∞, +∞).
Under the independence assumption (formula (4)) and the Gaussian assumption mentioned above, formula (7) is correct. However, looking at just the first line of formula (7): if the independence assumption holds, then -log p(y1, y2 | f(W, x)) = -log p(y1 | f(W, x)) - log p(y2 | f(W, x)), which is just a sum of cross-entropy losses over the different tasks. This apparently contradicts the result derived under the additional Gaussian assumption.
Somehow, the paper replaces the cross-entropy loss with MSE, which finally reaches the result that higher-loss tasks should have higher θ weights. If the paper's reported results are correct, I think the benefit comes from loss re-balancing. In other words, does re-balancing the task losses benefit multi-task performance?
Thanks for your good work, but I have a question. I ported your code into my project and it worked at first. However, after several steps the loss became negative, and I found that it was the log_var term that led to that. When I removed the log_var term, the loss was all right. So I want to know whether there is a better solution for that. Thanks again!
```python
loss += K.sum(precision * (y_true - y_pred)**2. + log_var[0], -1)
```
@yaringal Hi, thank you very much for releasing the code here!
I am currently using your technique in adversarial loss training for semantic segmentation.
However, I would like to know the actual weights applied to each loss function (cross-entropy, adversarial loss).
Could you please help me with how to get these weights?
(I am currently using your equation for the std and computing 1/(2*std**2) to get the weight. Is this correct?)
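If log_var is taken to be log(sigma**2), as in the repo's code, the learned weight and std can be read off directly; note that the code drops the paper's factor of 1/2, so whether you report 1/sigma**2 or 1/(2*sigma**2) depends on which form you follow. A small sketch (function name hypothetical):

```python
import math

def weight_and_std(log_var):
    # assuming log_var = log(sigma**2), as in the repo's code
    sigma = math.exp(log_var / 2.0)
    weight = math.exp(-log_var)  # = 1 / sigma**2
    return weight, sigma

w, sigma = weight_and_std(math.log(4.0))
print(w, sigma)  # 0.25 2.0
```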
Hi @yaringal !
I have read the paper and it is really amazing. Thanks for your team's hard work!
However, I have a question regarding the final equation and also your keras implementation.
The equation above has the loss multiplied by 1/2, but you didn't include it in the Keras implementation.
I tried experimenting with it and included the 1/2 in the loss function, but it couldn't converge. I am wondering whether the problem is in the paper or in the Keras implementation, because if I exclude the 1/2, it converges to the ground-truth std.
Best regards,
Hardian
Thank you for your example; it helps a lot in understanding the paper. I am currently using the proposed formula, exp(-log_var)*loss + log_var, in self-supervised learning with uncertainty estimation.
In my project, the loss is the L1 distance between input image pixels and warped image pixels, and the loss works well on its own. But when I bring uncertainty into training using the above formula, performance drops a lot.
I have no idea why. Do you have any advice? By the way, before taking the L1 distance, diff = warp_pixel - input_pixel follows a Gaussian distribution almost perfectly.
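One possible mismatch, offered as an assumption rather than a diagnosis: exp(-s)*loss + s is derived from a Gaussian likelihood, i.e. a squared-error loss, while an L1 photometric loss corresponds to a Laplace likelihood, whose negative log-likelihood is |diff|/b + log(2b); writing s = log(b) gives exp(-s)*|diff| + s up to a constant. A sketch of that term (error value chosen for illustration):

```python
import numpy as np

def laplace_term(abs_err, s):
    # NLL of Laplace(scale=b) is |diff|/b + log(2b); with s = log(b)
    # this is exp(-s) * |diff| + s, up to a constant
    return np.exp(-s) * abs_err + s

abs_err = 0.2  # hypothetical mean L1 photometric error
s_grid = np.linspace(-8.0, 4.0, 20001)
best_s = s_grid[np.argmin(laplace_term(abs_err, s_grid))]
print(best_s)  # close to log(0.2): the minimiser tracks the L1 error scale
```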
Hi, thanks for your great work. I have some questions about formula (10).
Thanks for your reply.
Hi, thanks for your excellent work! I wonder whether there is any way to easily incorporate this method into another multi-task learning pipeline? I'm still trying to understand the formulas and have no idea where I can obtain a noise scalar for each task. Looking forward to your reply :)