
poly-encoder's Introduction

Bi-Encoder, Poly-Encoder, and Cross-Encoder for Response Selection Tasks

  • This repository is an unofficial re-implementation of Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring.

  • Special thanks to sfzhou5678! Some of the data preprocessing (dataset.py) and training loop code is adapted from his GitHub repo. However, the model architecture and data representation in that repository do not follow the paper exactly, which leads to worse performance. I re-implemented the Bi-Encoder and Poly-Encoder models in encoder.py. In addition, the model and data processing pipeline for the Cross-Encoder are also implemented.

  • Most of the training code in run.py is adapted from examples in the Hugging Face Transformers repository.

  • The most important architectural difference between this implementation and the original paper is that only one BERT encoder is used (instead of two separate ones); see the sketch after this list. Please refer to this issue for details. However, this should not affect performance much.

  • This repository does not implement every detail of the original paper, for example, decaying the learning rate by a factor of 0.4 on plateau. Also, due to limited computing resources, I cannot use the exact parameter settings (such as batch size or context length) from the original paper, and a much smaller BERT model is used. Feel free to tune them or use larger models if you have more computing resources.
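
For illustration only, here is a minimal sketch of the shared-encoder idea mentioned above: a single BERT instance encodes both the context and the candidate responses. This is not the exact code in encoder.py, and the class and method names are made up:

    import torch
    from transformers import BertModel

    class SharedBiEncoder(torch.nn.Module):
        def __init__(self, bert_name_or_path):
            super().__init__()
            # one BERT encoder shared by the context side and the response side
            self.bert = BertModel.from_pretrained(bert_name_or_path)

        def score(self, context_ids, context_mask, response_ids, response_mask):
            ctx_vec = self.bert(context_ids, attention_mask=context_mask)[0][:, 0, :]     # [bs, dim]
            resp_vec = self.bert(response_ids, attention_mask=response_mask)[0][:, 0, :]  # [bs, dim]
            return torch.matmul(ctx_vec, resp_vec.t())  # [bs, bs] in-batch similarity scores

Training with in-batch negatives then amounts to a cross-entropy loss over each row of this score matrix, which is the standard bi-encoder training setup.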

Requirements

  • Please see requirements.txt.

BERT Model Setup

  1. Download a BERT model from Google.

  2. Pick the model you like (I am using uncased_L-4_H-512_A-8.zip), move it into bert_model/, and unzip it.

  3. cd bert_model/, then bash run.sh to convert the checkpoint (a sketch of how the converted weights are later loaded follows below).
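
The exact loading code lives in run.py; as a rough sketch (the file names inside bert_model/ are assumptions about what run.sh produces), the converted checkpoint can be loaded with Hugging Face transformers like this, mirroring the pattern quoted in the last issue below:

    import torch
    from transformers import BertModel, BertTokenizer

    bert_dir = "bert_model/"
    # converted PyTorch weights produced by run.sh (file name assumed)
    model_state_dict = torch.load(bert_dir + "pytorch_model.bin", map_location="cpu")
    tokenizer = BertTokenizer.from_pretrained(bert_dir)  # uses the vocab.txt shipped with the checkpoint
    bert = BertModel.from_pretrained(bert_dir, state_dict=model_state_dict)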

Ubuntu Data

  1. Download and unzip the Ubuntu data.

  2. Rename valid.txt to dev.txt for consistency.

DSTC 7 Data

  1. Download the data from the official competition site. Specifically, download the train (ubuntu_train_subtask_1.json), valid (ubuntu_dev_subtask_1.json), and test (ubuntu_responses_subtask_1.tsv, ubuntu_test_subtask_1.json) splits of subtask 1 and put them in the dstc7/ folder.

  2. cd dstc7/, then bash parse.sh (a rough sketch of what this parsing step produces is shown below).
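
For reference, a rough sketch of the parsing step. The real logic lives in the scripts called by parse.sh; the JSON field names below follow the official DSTC 7 subtask 1 schema, and the "label<TAB>context<TAB>response" output format is an assumption based on how the tab-separated data files are read:

    import json

    def parse_split(json_path, out_path):
        with open(json_path) as f:
            dialogs = json.load(f)
        with open(out_path, "w") as out:
            for dialog in dialogs:
                # concatenate the dialogue history into a single context string
                context = " ".join(m["utterance"] for m in dialog["messages-so-far"])
                correct = {o["candidate-id"] for o in dialog.get("options-for-correct-answers", [])}
                for option in dialog["options-for-next"]:
                    label = 1 if option["candidate-id"] in correct else 0
                    out.write("{}\t{}\t{}\n".format(label, context, option["utterance"]))

    # e.g. parse_split("ubuntu_dev_subtask_1.json", "dev.txt")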

DSTC 7 Augmented Data (from ParlAI)

  1. This dataset setting does not work for the Cross-Encoder. For details, please refer to this issue.

  2. Download the data from the ParlAI website and keep only ubuntu_train_subtask_1_augmented.json.

  3. Move ubuntu_train_subtask_1_augmented.json into dstc7_aug/, then run python3 parse.py.

  4. Copy the dev.txt and test.txt files from dstc7/ into dstc7_aug/, since only the training file is augmented.

  5. You can refer to the original post discussing the construction of this augmented data.

Run Experiments (on dstc7)

  1. Train a Bi-Encoder:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture bi
  2. Train a Poly-Encoder with 16 codes:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture poly --poly_m 16
  3. Train a Cross-Encoder:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture cross
  4. Simply change the directory names to ubuntu to run experiments on the Ubuntu dataset.

Inference

  1. Test the Bi-Encoder:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture bi --eval
  2. Test the Poly-Encoder with 16 codes:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture poly --poly_m 16 --eval
  3. Test the Cross-Encoder:

    python3 run.py --bert_model bert_model/ --output_dir output_dstc7/ --train_dir dstc7/ --use_pretrain --architecture cross --eval

Results

  • All experiments were run on a single GTX 1080 GPU with 8 GB of memory and an i7-6700K CPU @ 4.00GHz.

  • The default parameters in run.py are used; please refer to run.py for details.

  • The results are calculated on a sampled portion (1000 instances) of the dev set.

  • da = data augmentation. We report only one result with 64 poly vectors and bert-base (uncased_L-12_H-768_A-12) with data augmentation (dstc7_aug). This result is very close to the numbers reported in the original paper.

Ubuntu:

Model R@1 R@2 R@5 R@10 MRR
Bi-Encoder 0.760 0.855 0.971 1.00 0.844
Poly-Encoder 16 0.766 0.868 0.974 1.00 0.851
Poly-Encoder 64 0.767 0.880 0.979 1.00 0.854
Poly-Encoder 360 0.754 0.858 0.970 1.00 0.842

DSTC 7:

Model R@1 R@2 R@5 R@10 MRR
Bi-Encoder 0.437 0.524 0.644 0.753 0.538
Poly-Encoder 16 0.447 0.534 0.668 0.760 0.550
Poly-Encoder 64 0.438 0.540 0.668 0.755 0.546
Poly-Encoder 360 0.453 0.553 0.665 0.751 0.545
Cross-Encoder 0.502 0.595 0.712 0.790 0.599
da + bert base 0.561 0.659 0.765 0.858 0.659


poly-encoder's People

Contributors

chijames, kaisugi, lydhr


poly-encoder's Issues

Why is the vector at the first token position used here?

Poly-Encoder/encoder.py

Lines 20 to 27 in 6f0d9c4

context_vec = self.bert(context_input_ids, context_input_masks)[0][:,0,:] # [bs,dim]
batch_size, res_cnt, seq_length = responses_input_ids.shape
responses_input_ids = responses_input_ids.view(-1, seq_length)
responses_input_masks = responses_input_masks.view(-1, seq_length)
responses_vec = self.bert(responses_input_ids, responses_input_masks)[0][:,0,:] # [bs,dim]
responses_vec = responses_vec.view(batch_size, res_cnt, -1)

After a sentence passes through BERT we get a representation of the whole sentence; why is the vector at the first token position taken here: [0][:,0,:]?
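
For context, [0][:, 0, :] selects the hidden state of the first token, which for a BERT tokenizer is the [CLS] token; a quick check (illustration only, not from the repository):

    from transformers import BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    ids = tok.encode("how do i install this package ?")  # special tokens are added by default
    print(tok.convert_ids_to_tokens(ids)[0])             # -> '[CLS]'

The [CLS] hidden state is the conventional choice for a single sentence-level vector in BERT-style models.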

About the implementation of Poly Encoder

Hi @chijames, thanks so much for this wonderful project!
After digging into the code, I have two questions:

  • Is there any special reason why masking is not implemented in this section?

    Poly-Encoder/encoder.py

    Lines 72 to 78 in e5299e3

    def dot_attention(self, q, k, v):
    # q: [bs, poly_m, dim] or [bs, res_cnt, dim]
    # k=v: [bs, length, dim] or [bs, poly_m, dim]
    attn_weights = torch.matmul(q, k.transpose(2, 1)) # [bs, poly_m, length]
    attn_weights = F.softmax(attn_weights, -1)
    output = torch.matmul(attn_weights, v) # [bs, poly_m, dim]
    return output

  • Can we speed up the construction of poly_code_embeddings by using nn.Parameter? That way, we would not need to create poly_ids and move it to the GPU in every batch. (A sketch of both ideas is given after this issue.)

Thanks for your reply!
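
For reference, a minimal sketch of the two ideas raised above. This is not the code in encoder.py: the existing dot_attention applies no mask, and the poly codes are looked up from poly_code_embeddings via a poly_ids tensor.

    import torch
    import torch.nn.functional as F

    def masked_dot_attention(q, k, v, mask=None):
        # q: [bs, poly_m, dim] or [bs, res_cnt, dim]; k = v: [bs, length, dim]
        # mask: [bs, length] with 1 for real tokens and 0 for padding
        attn_weights = torch.matmul(q, k.transpose(2, 1))  # [bs, poly_m, length]
        if mask is not None:
            attn_weights = attn_weights.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
        attn_weights = F.softmax(attn_weights, dim=-1)
        return torch.matmul(attn_weights, v)                # [bs, poly_m, dim]

    # nn.Parameter-style poly codes (hypothetical sizes), so no poly_ids tensor has to be
    # created and moved to the GPU every batch:
    poly_m, dim = 16, 512
    poly_codes = torch.nn.Parameter(torch.empty(poly_m, dim).normal_(mean=0.0, std=0.02))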

About Performance

Hi! Thanks a lot for sharing your code!
I have a question about the performance.

You report the performance of your code on DSTC 7 with the Bi-Encoder in the results table above. However, the original paper reports higher numbers for the Bi-Encoder on DSTC 7: with your code I get R@1 = 0.437, while the original paper reports 0.565 on the dev set and 0.668 on the test set. I read your code carefully but found little difference from the setting in the original paper. I also changed your default single BERT into two separate BERTs for the Bi-Encoder, but still cannot match the numbers reported in the original paper. Why?

Hyperparameters when training Cross-Encoder.

Hi! I'm using your code and want to reproduce your results on the DSTC 7 dataset.
When training the Cross-Encoder, I use BERT-small (uncased_L-4_H-512_A-8.zip) and leave all hyperparameters unchanged as in run.py (batch size = 32, max context length = 128, max response length = 32). However, I ran into an OOM error on my Tesla M40 GPU, which has 11 GB of memory.
I wonder how you were able to train the Cross-Encoder on your GPU. I guess the default hyperparameters in run.py are designed for training the Bi-Encoder and Poly-Encoder. Could you please share the hyperparameters you used when training the Cross-Encoder?

Something wrong when calculating t_total

First of all, I really appreciate the nice repo.

The t_total in run.py is calculated by t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs, and t_total is passed into transformers.get_linear_schedule_with_warmup. This indicates the total number of steps of the training process.

However, I guess the total number of steps should be calculated as the number of batches * epochs. Therefore, the code for calculating t_total should be t_total = len(train_dataloader) // (args.train_batch_size * args.gradient_accumulation_steps) * args.num_train_epochs.

If I'm wrong, please let me know what I am missing.
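
For reference, a quick check (illustration only, not from the repository) of what len() returns for a standard PyTorch DataLoader:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(100, 3))
    loader = DataLoader(dataset, batch_size=32)
    print(len(dataset), len(loader))  # 100 examples, 4 batches per epoch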

Code understanding

Hi, the work is really great.

I am just trying to understand: if labels are None, then the encoders output matrices instead of scalars, but you have not made any provision for this in your code.

Also, what is neg in the Cross-Encoder? Could you please provide some context on the variables you use via comments?

Also, why do the Bi-Encoder and Poly-Encoder models use responses as a 3-dimensional tensor?

A bug in parse.py

I have noticed that in parse.py, the candidate response is concatenated to the context with '\t'. This leads to a mistake when reading the record for training. Consider the case where the candidate response is "", which actually occurs in the DSTC 7 dataset: when splitting this record by '\t' to extract the response, the last utterance of the context is chosen instead.

(This is my first time submitting an issue; I hope I have depicted the bug clearly.)
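
A small illustration of how a trailing empty field can get lost (hypothetical line, not real data; whether this is exactly what happens depends on how the record is read back):

    line = "0\tlast utterance in context\t"  # label, context, and an empty candidate response
    print(line.split("\t"))                  # ['0', 'last utterance in context', ''] - empty response kept
    print(line.strip().split("\t"))          # ['0', 'last utterance in context'] - empty response lost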

Config file missing

There doesn't seem to be the config file you used to run this code; I'm just curious what some of the values you are using are, specifically the hidden size referenced in the Poly-Encoder section that is used to calculate your m.

Licensing

Dear @chijames ,

I came across your Poly-Encoder and would like to adapt it for some work purposes. I was told that I can't use it unless it is open-source licensed. I was wondering if you are willing to allow for that, perhaps through an MIT license etc.?

https://choosealicense.com/licenses/mit/

Hope to hear from you and thank you very much!

Best Regards,
Chor Seng

Are the transformers of bi-encoder trained separately?

(To be honest, I'm not used to "deep learning coding" (PyTorch, Huggingface, etc...), so this might be a silly question. Keep in mind I'm a beginner.)

The original paper says that the context encoder and the candidate encoder are trained separately.

[two screenshots of the relevant passages from the paper]

However, I found in your code that both transformers are invoked through the same self.bert().

https://github.com/chijames/Poly-Encoder/blob/master/encoder.py#L20-L27


Is this OK? I doubt that these two encoders will end up with different weights after training.

FYI: in the official implementation of the BLINK (https://arxiv.org/pdf/1911.03814.pdf) paper, they prepare two separate encoders: https://github.com/facebookresearch/BLINK/blob/master/blink/biencoder/biencoder.py#L37-L48
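
For illustration, a minimal sketch of the two-tower alternative described in the question (two encoders with their own weights, as in the linked BLINK code). This is not how encoder.py is written, and the names are made up:

    import torch
    from transformers import BertModel

    class TwoTowerBiEncoder(torch.nn.Module):
        def __init__(self, bert_name_or_path):
            super().__init__()
            # two separate BERT instances, so the context and candidate encoders
            # can develop different weights during training
            self.context_bert = BertModel.from_pretrained(bert_name_or_path)
            self.response_bert = BertModel.from_pretrained(bert_name_or_path)

        def forward(self, ctx_ids, ctx_mask, resp_ids, resp_mask):
            ctx_vec = self.context_bert(ctx_ids, attention_mask=ctx_mask)[0][:, 0, :]
            resp_vec = self.response_bert(resp_ids, attention_mask=resp_mask)[0][:, 0, :]
            return ctx_vec, resp_vec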

Why not directly use the Hugging Face BERT pretrained weights?

Why do you convert the Google BERT weights instead of directly using the BERT weights from Hugging Face? Is there any performance difference between the two?

# converted weight from google-bert
bert = BertModelClass.from_pretrained(args.bert_model, state_dict=model_state_dict) 

# huggingface weight
bert = BertModelClass.from_pretrained('bert-base-uncased') 
