yaringal / multi-task-learning-example
A multi-task learning example for the paper https://arxiv.org/abs/1705.07115
License: MIT License
Hi! Really nice work. I wonder how you calculate the noise σ when training a real network?
Why do my results log_var_a and log_var_b always equal zero?
I'm not in the field of deep learning and computer science, but I found this work very interesting. I am confused about what I should do if I want to use the trained model for prediction. Can I achieve this through prediction_model.predict(new_x)? I see that only the trainable_model was trained, but it cannot make predictions. Has the prediction_model been trained at the same time? Thanks very much.
As described in the paper, as the noise σ increases, the corresponding L(W) decreases. But if we understand σ as the uncertainty of y,
maybe it would be better for L to increase with uncertainty, because higher uncertainty means y is harder to learn, so it needs more attention?
The loss function can be optimized in a way that keeps decreasing the log_var values, which I observe in my experiments. One simple solution is to use torch.abs(log_var). Any thoughts on how this might affect the derivation?
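For what it's worth, the per-task term exp(-s)*L + s is bounded below in s: it is minimised at s = log(L), where it equals 1 + log(L). So a steadily decreasing log_var may simply be tracking a shrinking task loss rather than diverging. A quick numpy check (the mse value is chosen purely for illustration):

```python
import numpy as np

def task_term(mse, s):
    # one task's contribution to the multi-task loss: exp(-s) * L + s
    return np.exp(-s) * mse + s

mse = 0.05  # hypothetical per-task squared error
s_grid = np.linspace(-10.0, 5.0, 20001)
best_s = s_grid[np.argmin(task_term(mse, s_grid))]
print(best_s)                  # close to log(0.05) ~ -3.0, not -inf
print(task_term(mse, best_s))  # minimum value 1 + log(mse), negative here
```

Note that torch.abs(log_var) does change the formulation: |s| ≥ 0 means σ² = exp(|s|) ≥ 1, so it implicitly assumes every task's noise scale is at least 1.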
I think this is a lucky demo. When I change the data-generation code, the optimization is guided wrongly and the variance prediction is wrong.
So I think this uncertainty method only works in a situation where the value of the diff is close in scale to the precision and log-variance terms. If their values are not at the same scale, the method breaks.
```python
import numpy as np

Q, D1, D2 = 1, 1, 1  # input/output dims, as in the repo's demo

def gen_data(N):
    X = np.random.randn(N, Q)
    w1 = 2. * 1e2
    b1 = 8. * 1e2
    sigma1 = 10  # ground truth
    Y1 = X.dot(w1) + b1 + sigma1 * np.random.randn(N, D1)
    w2 = 3 * 1e2
    b2 = 3 * 1e2
    sigma2 = 1 * 1e2  # ground truth
    Y2 = X.dot(w2) + b2 + sigma2 * np.random.randn(N, D2)
    return X, Y1, Y2
```
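One way to see the scale issue: once the linear map is fitted, the typical squared residual for task 2 is around sigma2**2 = 1e4, so its optimal log-variance sits near log(1e4) ≈ 9.2, and starting from log_var = 0 its precision-weighted term dwarfs task 1's. A rough numpy estimate (sample count and dims assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 1  # sample count and output dim assumed for illustration
sigma1, sigma2 = 10.0, 100.0

# typical squared residual for each task once the mean is fitted
mse1 = np.mean((sigma1 * rng.standard_normal((N, D))) ** 2)
mse2 = np.mean((sigma2 * rng.standard_normal((N, D))) ** 2)

# The per-task term exp(-s) * mse + s is minimised at s = log(mse), so
# the two log-variances must travel to very different values; starting
# from log_var = 0, task 2 contributes ~1e4 to the loss and its
# gradients initially swamp task 1's.
print(np.log(mse1), np.log(mse2))  # roughly 4.6 and 9.2
```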
Thanks for the great research and code sharing.
After reading the paper and using it in my research, I got a question.
There are two styles for the implementation of weighted loss.
Case 1) L = w_a * L_a + w_b * L_b + w_c * L_c
Case 2) L = L_a + w_b * L_b + w_c * L_c
In Case 2, the weight of the loss L_a is fixed to 1. In my humble opinion, I would guess that w_b and w_c are then learned relative to L_a, with the log_vars adjusting accordingly.
In your paper and code, on the other hand, all weights, i.e., all log_vars, are set to be learnable, as in Case 1.
Is there any reason to prefer Case 1? Could it be a problem if I use the style of Case 2?
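The two styles can be compared directly; in a minimal numpy sketch (toy loss values, function names hypothetical), Case 2 is just Case 1 with the first log-variance pinned at 0:

```python
import numpy as np

def multitask_loss(task_losses, log_vars):
    # Case 1: every task gets its own learnable log-variance
    return sum(np.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

def multitask_loss_anchored(task_losses, log_vars):
    # Case 2: pin the first task's log-variance at 0 (weight 1) and
    # learn the remaining ones relative to it
    return multitask_loss(task_losses, [0.0] + list(log_vars))

losses = [1.0, 2.0, 4.0]  # toy per-task losses
print(multitask_loss(losses, [0.0, 0.5, 1.0]))
print(multitask_loss_anchored(losses, [0.5, 1.0]))  # identical by construction
```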
There are a couple of instances where the code doesn't agree with the paper (https://arxiv.org/pdf/1705.07115.pdf):
For each individual task's loss, the paper suggests adding the log of the task-dependent standard deviation to the multi-task loss. However, the code adds the log of the task-dependent variance instead. Why this discrepancy? Is there a typo in the paper?
Weights given to each loss in the code don't correspond to those in the equations presented in the paper; please see my follow-up here: #1
@yaringal Hi, I have a question about your multi-task loss function.
Below you return the loss as torch.mean(loss), but if I understand this function correctly, loss is just a single tensor value and not a list, so torch.mean(loss) will be the same as loss. What was your motivation for using torch.mean(loss)?
Thank you!
```python
import torch

def criterion(y_pred, y_true, log_vars):
    loss = 0
    for i in range(len(y_pred)):
        # precision = 1 / sigma**2, with log_vars[i] = log(sigma**2)
        precision = torch.exp(-log_vars[i])
        diff = (y_pred[i] - y_true[i]) ** 2.
        # sum over the output dimension, leaving one entry per sample
        loss += torch.sum(precision * diff + log_vars[i], -1)
    return torch.mean(loss)
```
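One detail worth checking: after the sum over the last axis, loss is a vector with one entry per sample, not a scalar, so the mean does average over the batch. A numpy mirror of the function (batch and output dims assumed for illustration) makes the shapes visible:

```python
import numpy as np

def criterion_np(y_pred, y_true, log_vars):
    # numpy mirror of the torch criterion, returning the mean and the
    # shape of the per-sample loss vector before the final mean
    loss = 0
    for i in range(len(y_pred)):
        precision = np.exp(-log_vars[i])
        diff = (y_pred[i] - y_true[i]) ** 2.0
        loss = loss + np.sum(precision * diff + log_vars[i], axis=-1)
    return loss.mean(), loss.shape

batch, D1, D2 = 4, 1, 1  # assumed dims
y_pred = [np.zeros((batch, D1)), np.zeros((batch, D2))]
y_true = [np.ones((batch, D1)), np.ones((batch, D2))]
mean_loss, shape = criterion_np(y_pred, y_true, [0.0, 0.0])
print(shape)      # (4,): one loss entry per sample
print(mean_loss)  # 2.0
```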
I read the paper carefully, and I think the formula in the paper is fundamentally wrong.
Under formulas (2) and (3), the probability output has a Gaussian distribution. However, the probability can't have a Gaussian distribution, as it is distributed in [0, 1] rather than (-∞, +∞).
Under the independence assumption (formula (4)) and the Gaussian assumption mentioned above, formula (7) is correct. However, looking at just the first line of formula (7): if the independence assumption holds, then -log p(y1, y2 | f(W, x)) = -log p(y1 | f(W, x)) - log p(y2 | f(W, x)), which is just a sum of cross-entropy losses over the different tasks. This apparently contradicts the result derived under the additional Gaussian assumption.
Somehow, the paper replaces the cross-entropy loss with MSE, which finally reaches the result that higher-loss tasks should have higher θ weights. If the paper's reported results are correct, I think the benefit comes from loss re-balancing. In other words, does re-balancing the task losses benefit multi-task performance?
Thanks for your good work, but I have a question. I ported your code into my project and it worked at first. However, after several steps the loss became negative, and I found that it was the log_var term that led to that. When I removed the log_var term, the loss was all right. So I want to know whether there is a better solution for that. Thanks again!
```python
loss += K.sum(precision * (y_true - y_pred)**2. + log_var[0], -1)
```
@yaringal Hi, thank you very much for releasing the code here!
I am currently using your technique in adversarial loss training for semantic segmentation.
However, I would like to know the actual weights applied to each loss function (cross-entropy, adversarial loss).
Could you please help me with how to get these weights?
(I am currently using your equation for the std and computing 1/(2*std**2) to get the weight. Is this correct?)
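If log_var is taken to be log(sigma**2), as in the repo's code, the learned weight and std can be read off directly; note that the code drops the paper's factor of 1/2, so whether you report 1/sigma**2 or 1/(2*sigma**2) depends on which form you follow. A small sketch (function name hypothetical):

```python
import math

def weight_and_std(log_var):
    # assuming log_var = log(sigma**2), as in the repo's code
    sigma = math.exp(log_var / 2.0)
    weight = math.exp(-log_var)  # = 1 / sigma**2
    return weight, sigma

w, sigma = weight_and_std(math.log(4.0))
print(w, sigma)  # 0.25 2.0
```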
Hi @yaringal !
I have read the paper and it is really amazing. Thanks for your team's hard work!
However, I have a question regarding the final equation and also your keras implementation.
The equation above has the loss multiplied by 1/2, but you didn't include it in the Keras implementation.
I tried experimenting with it and included the 1/2 in the loss function, but it couldn't converge. I am wondering whether the problem is in the paper or in the Keras implementation, because if I exclude the 1/2, it converges to the ground-truth std.
Best regards,
Hardian
Thank you for your example; it helps a lot in understanding the paper. I am currently using the proposed formula, exp(-log_var)*loss + log_var, in self-supervised learning with uncertainty estimation.
In my project, the loss is the L1 distance between input image pixels and warped image pixels, and the loss works well on its own. But when I bring uncertainty into training using the above formula, performance drops a lot.
I have no idea why. Do you have any advice? By the way, before taking the L1 distance, diff = warp_pixel - input_pixel follows a Gaussian distribution almost perfectly.
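One possible mismatch, offered as an assumption rather than a diagnosis: exp(-s)*loss + s is derived from a Gaussian likelihood, i.e. a squared-error loss, while an L1 photometric loss corresponds to a Laplace likelihood, whose negative log-likelihood is |diff|/b + log(2b); writing s = log(b) gives exp(-s)*|diff| + s up to a constant. A sketch of that term (error value chosen for illustration):

```python
import numpy as np

def laplace_term(abs_err, s):
    # NLL of Laplace(scale=b) is |diff|/b + log(2b); with s = log(b)
    # this is exp(-s) * |diff| + s, up to a constant
    return np.exp(-s) * abs_err + s

abs_err = 0.2  # hypothetical mean L1 photometric error
s_grid = np.linspace(-8.0, 4.0, 20001)
best_s = s_grid[np.argmin(laplace_term(abs_err, s_grid))]
print(best_s)  # close to log(0.2): the minimiser tracks the L1 error scale
```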
Hi, thanks for your great work. I have some questions about formula (10).
Thanks for your reply.
Hi, thanks for your excellent work! I wonder whether there is any way to easily incorporate this method into another multi-task learning pipeline? I'm still trying to understand the formulas and have no idea where I can obtain a noise scalar for each task. Looking forward to your reply :)