juho-lee / set_transformer
PyTorch implementation of Set Transformer
License: MIT License
Hi! As described in the paper, you also experimented with Amortized Clustering on CIFAR-10 with the SetTransformer. However, I did not find that code in the repo; could you make that part of the code available as well? Thanks!
Hi,
Could you please explain the meanings of the inputs of SetTransformer:
dim_input, num_outputs, dim_output, num_inds=32, dim_hidden=128, num_heads=4, ln=False
Thanks.
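For readers with the same question, a rough usage sketch with the parameter meanings as they appear in the paper and in the repo's models.py (the toy task and shapes below are made up):

import torch
from models import SetTransformer  # assuming the repo's models.py is on the path

# Hypothetical toy task: map each set of 2-D points to a single 3-D output vector.
model = SetTransformer(
    dim_input=2,     # feature dimension of each element of the input set
    num_outputs=1,   # number of seed vectors in the PMA decoder, i.e. output vectors per set
    dim_output=3,    # dimension of each output vector
    num_inds=32,     # number of inducing points m used by each ISAB in the encoder
    dim_hidden=128,  # width of the hidden representations
    num_heads=4,     # attention heads in each attention block
    ln=False,        # whether LayerNorm is applied inside the blocks
)

X = torch.randn(8, 100, 2)   # (batch, set size, dim_input)
out = model(X)               # expected shape: (8, num_outputs, dim_output) = (8, 1, 3)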
Hi juho-lee!
I have two little puzzles about your paper. In Section 1 (Introduction), you say: "A model for set-input problems should satisfy two critical requirements. First, it should be permutation invariant: the output of the model should not change under any permutation of the elements in the input set. Second, such a model should be able to process input sets of any size."
But after reading the whole paper, I still don't see how you address these two requirements.
For the first, I guess you remove the position embeddings from the original Transformer?
As for the second, I have no idea how you achieved it.
Thank you!
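For context, both properties follow from using attention without positional encodings and pooling the set with PMA; a quick check one could run (a sketch, assuming the repo's models.py is importable):

import torch
from models import SetTransformer  # assuming the repo's models.py

model = SetTransformer(dim_input=2, num_outputs=1, dim_output=3).eval()

X = torch.randn(1, 50, 2)                       # a set of 50 two-dimensional elements
out_a = model(X)
out_b = model(X[:, torch.randperm(50), :])      # same set, elements shuffled
print(torch.allclose(out_a, out_b, atol=1e-5))  # expected: True (permutation invariance)

# Variable set sizes: attention has no length-dependent weights, so other sizes work too.
out_c = model(torch.randn(1, 17, 2))            # a set of 17 elements, same model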
Not sure if I understand the Induced Set Attention Block correctly.
So basically SAB is a Transformer block without positional encoding (and dropout?). In the paper, you say that SAB is "too expensive for large sets". But the set size here is analogous to the maximum sequence length in a Transformer, which is usually 512. Why not just use SAB for the SetTransformer? Is there any reason other than efficiency to use ISAB instead?
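For context, a back-of-the-envelope comparison of the attention cost of the two blocks (not repo code; the sizes are hypothetical):

# SAB computes self-attention among all n elements; ISAB routes it through m inducing points.
n, m, d = 100_000, 32, 128   # hypothetical set size, number of inducing points, hidden dim

sab_cost = n * n * d         # SAB: O(n^2) pairwise attention
isab_cost = 2 * n * m * d    # ISAB: two n-by-m attentions, O(nm)
print(sab_cost / isab_cost)  # roughly 1560x fewer multiply-adds at this set size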
Not an issue, but a question: why is the LayerNorm option set to False by default? In particular, LayerNorm is not used in the point cloud example.
Can you comment on the importance of activating the nested LayerNorm for the model? The paper does not discuss using LayerNorm versus not.
Thanks!
What if I have a set of matrices instead of a set of vectors? Is it possible to extend the Set Transformer framework to cover that scenario?
I played around with it a little (including making some small tweaks) but got bogged down with the .bmm call in the MAB module:
RuntimeError: Expected 3-dimensional tensor, but got 4-dimensional tensor for argument #1 'batch1' (while checking arguments for bmm)
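One possible workaround (not from the repo): the MAB works on 3-D (batch, set size, feature) tensors, so a set of matrices can be handled by flattening each matrix into a vector first:

import torch
from models import SetTransformer  # assuming the repo's models.py

batch, set_size, rows, cols = 8, 20, 5, 7      # hypothetical sizes
X = torch.randn(batch, set_size, rows, cols)   # each example is a set of 5x7 matrices
X_flat = X.view(batch, set_size, rows * cols)  # (8, 20, 35): now a set of vectors

model = SetTransformer(dim_input=rows * cols, num_outputs=1, dim_output=10)
out = model(X_flat)                            # (8, 1, 10)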
Dear Juho,
Thanks for making the code public!
One quick question: if I read the code correctly, LayerNorm was never used in any of the three examples you open-sourced in this repo; is that correct?
If so, is it because it gives slightly inferior performance? And have you tried moving the LayerNorm layer inside the skip connections, instead of before/after them, as done in several more recent papers, so that there is a connection directly from output to input?
Thanks in advance and looking forward to your reply!
Dear Juho,
is it possible that the implementation of the MAB diverges from the paper?
In more detail: The paper states
Multihead(Q, K, V; λ, ω) = concat(O_1, ..., O_h) W_O
H = LayerNorm(X + Multihead(X, Y, Y; ω))
MAB(X, Y) = LayerNorm(H + rFF(H))
but the code does
A = torch.softmax(Q_.bmm(K_.transpose(1,2))/math.sqrt(self.dim_V), 2)
O = torch.cat((Q_ + A.bmm(V_)).split(Q.size(0), 0), 2) # This is output of multihead
O = O if getattr(self, 'ln0', None) is None else self.ln0(O)
O = O + F.relu(self.fc_o(O))
O = O if getattr(self, 'ln1', None) is None else self.ln1(O)
It seems that the matrix W_O is not used in the code at all to mix the outputs of the different heads?
The skip connection Q_ + A.bmm(V_) also diverges from what is stated in the paper, given that Q_ is derived from Q, which is linearly transformed via Q = self.fc_q(Q) in the first line of forward() and is therefore no longer equal to the original query. (On second thought, this may be a necessary requirement, since the output of the MAB has a different shape than the input. In that case, the paper is imprecise.)
Thanks a lot and best wishes
Jannik
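For comparison, a minimal sketch of an MAB written to follow the paper's equations literally, including the W_O head-mixing matrix (this is not the repo's code; it assumes dim_Q == dim_V so that the residual X + Multihead(X, Y, Y) is well defined, and uses the conventional per-head scaling):

import math
import torch
import torch.nn as nn

class PaperMAB(nn.Module):
    """Sketch only: MAB following the paper's equations verbatim."""
    def __init__(self, dim, num_heads, ln=False):
        super().__init__()
        self.dim, self.num_heads = dim, num_heads
        self.fc_q = nn.Linear(dim, dim)
        self.fc_k = nn.Linear(dim, dim)
        self.fc_v = nn.Linear(dim, dim)
        self.fc_o = nn.Linear(dim, dim)  # W_O: mixes the concatenated head outputs
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln0 = nn.LayerNorm(dim) if ln else nn.Identity()
        self.ln1 = nn.LayerNorm(dim) if ln else nn.Identity()

    def forward(self, X, Y):
        h = self.dim // self.num_heads
        Q_ = torch.cat(self.fc_q(X).split(h, 2), 0)      # fold the heads into the batch dimension
        K_ = torch.cat(self.fc_k(Y).split(h, 2), 0)
        V_ = torch.cat(self.fc_v(Y).split(h, 2), 0)
        A = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(h), 2)
        O = torch.cat(A.bmm(V_).split(X.size(0), 0), 2)  # concat(O_1, ..., O_h)
        H = self.ln0(X + self.fc_o(O))                   # H = LayerNorm(X + Multihead(X, Y, Y))
        return self.ln1(H + self.rff(H))                 # MAB(X, Y) = LayerNorm(H + rFF(H))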
I was wondering, do you still have the code for Section 5.2 (counting unique characters)? It would be really helpful. Thanks!
Hi juho-lee,
I have many sets, each of which has a different size. I want to take several sets as a mini-batch for the set-transformer model, but I find that every set in a mini-batch must have the same size. Have you ever faced this problem? How did you deal with it? Padding, or other methods?
thank you!
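A common workaround is to pad each mini-batch to its longest set and carry a Boolean mask; a sketch below (note that the repo's MAB does not take a mask argument, so the mask would have to be threaded into the attention, or sets of equal size bucketed into the same batch):

import torch

def pad_sets(sets, pad_value=0.0):
    # sets: a list of (n_i, d) tensors with varying n_i
    max_len = max(s.size(0) for s in sets)
    d = sets[0].size(1)
    batch = torch.full((len(sets), max_len, d), pad_value)
    mask = torch.zeros(len(sets), max_len, dtype=torch.bool)
    for i, s in enumerate(sets):
        batch[i, :s.size(0)] = s
        mask[i, :s.size(0)] = True
    return batch, mask

# Usage with hypothetical sizes:
sets = [torch.randn(5, 3), torch.randn(9, 3), torch.randn(7, 3)]
X, mask = pad_sets(sets)   # X: (3, 9, 3); mask marks the real (non-padding) elements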
Hi, could you provide the code for the Set Anomaly Detection experiment in your paper? Thanks!
Hi @juho-lee,
First of all, thanks for making this code publicly available. It's very useful.
One question, though. I am looking at your implementation of the Zaheer et al. network ("Deep Sets"). In that paper, we have something like rho(sum(phi(x))), where we sum over each element of the set (I believe you call this a set pooling method in your paper).
In your DeepSet class, we have a succession of Linear -> ReLU -> Linear -> ReLU layers that operate on the entire data set and are then pooled at the end.
Could you explain a little about why these are equivalent?
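For context, the equivalence rests on nn.Linear acting only on the last dimension, i.e. independently on every set element; a sketch of the rho(pool(phi(x))) form that mirrors the structure of the repo's DeepSet class (the sizes are hypothetical):

import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())  # per-element encoder
rho = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 4))             # decoder on the pooled vector

X = torch.randn(8, 50, 3)   # (batch, set size, feature dim)
Z = phi(X)                  # phi applied element-wise: (8, 50, 128)
pooled = Z.mean(dim=1)      # permutation-invariant pooling (mean here rather than sum)
out = rho(pooled)           # (8, 4)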
Hello,
Would you please explain the necessity of using dim_split in MAB?
For example, if I have a batch of shape 2x387x768, I see that the A tensor has shape 24x387x387 because it uses Q_ instead of Q.
Would appreciate your response!
Thank you!
Sharmi
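For context, a sketch of what the split/cat trick in MAB does with those sizes (assuming num_heads=12, which would explain 24 = 2 x 12): the heads are folded into the batch dimension so a single bmm computes attention for every head at once:

import torch

batch, n, dim_V, num_heads = 2, 387, 768, 12
dim_split = dim_V // num_heads            # 64 features per head

Q = torch.randn(batch, n, dim_V)
K = torch.randn(batch, n, dim_V)
Q_ = torch.cat(Q.split(dim_split, 2), 0)  # (batch * num_heads, n, dim_split) = (24, 387, 64)
K_ = torch.cat(K.split(dim_split, 2), 0)

A = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / dim_V ** 0.5, 2)
print(A.shape)                            # torch.Size([24, 387, 387]): one attention map per head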
Hi, @yoonholee ,
Thanks a lot for adding the code for the point cloud part. After looking at the network, it seems that SAB modules are not included in the decoder part? Is that because of the increased time complexity of appending SAB modules to enhance the expressiveness of the representations? It seems that classification accuracy would increase by doing so. Have you performed related experiments?
THX!
Hi, @juho-lee ,
Thanks for releasing this package. Where can I find the code for the ModelNet40 shape classification mentioned in the paper?
THX!
Hi
Just want to know if you have plans to extend the functionality of the code itself instead of using PyTorch's MAB block?
Thank you!
Hi!
I would like to ask about the 1/sqrt(self.dim_V) normalization inside the softmax in the MAB. Usually the attention scaling uses the reciprocal of the square root of the key dimensionality, and since here dim_V is split into num_heads equal parts, the size of the key vectors is dim_V // num_heads.
Is this intentional or a "bug"? Although calling it a bug is an exaggeration, since it only introduces an extra 1/sqrt(num_heads) scale.
If this is unintentional I'm happy to make a pull request (although it's only a one-word change), or if it was intentional, could you explain the idea behind it?
Thanks!
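For reference, a tiny numeric illustration of the difference between the two scalings (the sizes are hypothetical):

import math

dim_V, num_heads = 128, 4           # hypothetical sizes
dim_split = dim_V // num_heads      # the actual per-head key dimension

repo_scale = 1 / math.sqrt(dim_V)       # what the MAB code currently uses
usual_scale = 1 / math.sqrt(dim_split)  # the conventional 1/sqrt(d_k) scaling
print(usual_scale / repo_scale)         # sqrt(num_heads) = 2.0: the two differ only by this constant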
Hello,
Just read your paper and was very happy to see that you've made this implementation available. Would you be willing to add a license to this repo (MIT, for instance), so that others can build on this code?
Dear Juho,
First of all, thank you for the implementation! It has been very helpful to my understanding of the architecture.
I ran into an apparent discrepancy between the code and the paper, and I was wondering if you could help clear it up. In particular, it seems to me that the PMA implementation is missing the row-wise feed-forward layer mentioned in the paper:
PMA(S, Z) = MAB(S, rFF(Z))
The PMA code:
class PMA(nn.Module):
    def __init__(self, dim, num_heads, num_seeds, ln=False):
        super(PMA, self).__init__()
        self.S = nn.Parameter(torch.Tensor(1, num_seeds, dim))
        nn.init.xavier_uniform_(self.S)
        self.mab = MAB(dim, dim, dim, num_heads, ln=ln)

    def forward(self, X):
        return self.mab(self.S.repeat(X.size(0), 1, 1), X)
To me this reads PMA(S, X) = MAB(S, X), rather than the MAB(S, rFF(X)) of the paper.
Thanks!
Tim
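For comparison, a minimal sketch of a PMA with the rFF(Z) from the paper inserted before the MAB (not the repo's code; it assumes the repo's MAB class is importable from models.py):

import torch
import torch.nn as nn
from models import MAB  # assuming the repo's models.py

class PMAWithRFF(nn.Module):
    """Sketch only: PMA(S, Z) = MAB(S, rFF(Z)), as written in the paper."""
    def __init__(self, dim, num_heads, num_seeds, ln=False):
        super().__init__()
        self.S = nn.Parameter(torch.Tensor(1, num_seeds, dim))
        nn.init.xavier_uniform_(self.S)
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mab = MAB(dim, dim, dim, num_heads, ln=ln)

    def forward(self, X):
        return self.mab(self.S.repeat(X.size(0), 1, 1), self.rff(X))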