bert-ls's Introduction

jqiang.github.io

Jipeng Qiang

bert-ls's Issues

Bug in convert_whole_word_to_feature

ind = 0
for pos in mask_position:
    true_word = true_word + tokens[pos]
    if ind == 0:
        tokens[pos] = '[MASK]'
    else:
        del tokens[pos]
        del input_type_ids[pos]
    ind = ind + 1

This code in convert_whole_word_to_feature is problematic because the positions of the remaining tokens change after each del. Instead of deleting forward, I would delete backward to avoid this problem. I confirmed the error with the following tokenized sentence:
['His', 'stories', 'g', '##lit', '##tered', 'with', 'color', ';']

count = 0
mask_position_length = len(mask_position)
while count in range(mask_position_length):
    index = mask_position_length - 1 - count
    pos = mask_position[index]
    if index == 0:
        tokens[pos] = '[MASK]'
    else:
        del tokens[pos]
        del input_type_ids[pos]
    count += 1
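For what it's worth, an equivalent and arguably more idiomatic fix is to iterate over the mask positions in reverse with `reversed(enumerate(...))`, so each `del` only touches indices that have already been handled. A sketch, assuming mask_position is sorted ascending as the tokenizer produces it (the helper name `mask_whole_word` is mine, not from the repo):

```python
# Walk the sub-word positions from last to first: deleting at a high
# index never shifts the lower indices that are still to be processed.
def mask_whole_word(tokens, input_type_ids, mask_position):
    for i, pos in reversed(list(enumerate(mask_position))):
        if i == 0:
            tokens[pos] = '[MASK]'   # keep one slot for the mask token
        else:
            del tokens[pos]          # drop the remaining sub-word pieces
            del input_type_ids[pos]
    return tokens, input_type_ids

tokens = ['His', 'stories', 'g', '##lit', '##tered', 'with', 'color', ';']
type_ids = [0] * len(tokens)
mask_whole_word(tokens, type_ids, [2, 3, 4])
print(tokens)  # ['His', 'stories', '[MASK]', 'with', 'color', ';']
```

This also keeps input_type_ids the same length as tokens, which the forward-deleting original silently breaks.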

Cheers!

May I ask whether you did an ablation study?

In the paper, you suggested four types of features. Did you do any ablation study, e.g. removing one or two of these features, to see if the model still gives similar performance?

Files request

Dear colleagues,
thank you for sharing your code.

May I also ask you to share the files referenced in the code and mentioned in the report but not present in the repo (if possible)? Or the scripts that prepare them. I mean word_frequency_wiki.txt and the CBT dictionary.

Thank you in advance.

How to use other models ?

Which files should be changed to load a different pretrained BERT model with PyTorch, or different fastText embeddings?

Dependencies unknown

This repository does not have a requirements.txt, which makes reproduction much harder because all dependencies have to be installed manually. The only way to discover that a dependency is missing is to hit a runtime error.
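As a starting point, the two packages that appear in the tracebacks elsewhere in these issues could seed a requirements.txt. This is a guess from the visible imports only; versions and any further dependencies (e.g. for the fastText embeddings) are untested assumptions:

```text
# Guessed from the imports visible in the error logs in these issues.
torch
pytorch-pretrained-bert
```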

`NoneType` object has no attribute `to`

I've just downloaded the repo, installed all the dependencies and tried to run ./run_LSBert1.sh. This is the error I got:

AttributeError: 'NoneType' object has no attribute 'to' 
Full log:
(XXXXXXX) XXXXX@XXXXXX:XXXXXX$ ./run_LSBert1.sh                                                                                             
INFO:__main__:device: cpu n_gpu: 0, distributed training: False, 16-bits training: False                    
INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt not found in cache, downloading to /tmp/tmppn72yug9                         
100%|█████████████████████████████████████████████████████████████| 231508/231508 [00:03<00:00, 61222.48B/s]INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmppn72yug9 to cache at /XXXXXXXXXXXX/.cache/torch/pytorch_pretrained_bert/b3a6b2c6d7ea2ffa06d0e7577c1e88b94fad470ae0f060a4ffef3fe0bdf86730.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084                                                                   
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /XXXXXXXXXXXX/.cache/torch/pytorch_pretrained_bert/b3a6b2c6d7ea2ffa06d0e7577c1e88b94fad470ae0f060a4ffef3fe0bdf86730.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084                                                                             
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmppn72yug9                                 
INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt from cache at /XXXXXXXXXXXX/.cache/torch/pytorch_pretrained_bert/b3a6b2c6d7ea2ffa06d0e7577c1e88b94fad470ae0f060a4ffef3fe0bdf86730.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084                                                                        
ERROR:pytorch_pretrained_bert.modeling:Couldn't reach server at '/home/qiang/Desktop/pytorch-pretrained-BERT/bert-large-uncased-whole-word-masking-pytorch_model.bin' to download pretrained weights.                   
Traceback (most recent call last):                                                                          
  File "/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/LSBert1.py", line 951, in <module>            
    main()                                                                                                  
  File "/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/LSBert1.py", line 822, in main                
    model.to(device)                                                                                        
AttributeError: 'NoneType' object has no attribute 'to'  

Referring to local files

'bert-large-uncased-whole-word-masking': "/home/qiang/Desktop/pytorch-pretrained-BERT/bert-large-uncased-whole-word-masking-pytorch_model.bin",

'bert-large-uncased-whole-word-masking': "/home/qiang/Desktop/pytorch-pretrained-BERT/bert-large-uncased-whole-word-masking-config.json",

Why not replace those with https://s3.amazonaws.com/models.huggingface.co/bert/... links?
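If I read the pytorch_pretrained_bert layout correctly, these hard-coded paths live in archive-map dicts in the modeling code. A hedged sketch of the suggested change, where the URLs are inferred from the naming pattern of the vocab URL in the log above and should be verified before use:

```python
# Hypothetical patch: point the archive maps at the public S3 bucket
# instead of a local path on the author's machine. The exact URLs are
# assumptions based on the vocab URL seen in the log.
S3 = "https://s3.amazonaws.com/models.huggingface.co/bert"

PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-large-uncased-whole-word-masking':
        S3 + "/bert-large-uncased-whole-word-masking-pytorch_model.bin",
}
PRETRAINED_CONFIG_ARCHIVE_MAP = {
    'bert-large-uncased-whole-word-masking':
        S3 + "/bert-large-uncased-whole-word-masking-config.json",
}
```

That way from_pretrained would download and cache the weights automatically, as it already does for the vocabulary file.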

cosine similarity ? sentence loss ?

  1. In the paper, it is claimed that the cosine similarity is computed by concatenating the first 4 layers from BERT, but in the code it is computed from fastText word embeddings. Why?

  2. To compute the proposal score, why do you compute the masked loss over all words in a sentence? Is this the same as defined in your paper?

Now I know where the difference comes from: I had read an incomplete version of this paper.

mask words not in list ?

The following error occurs when running with the dataset lex_mturk.txt:

Traceback (most recent call last):
  File "LS_Bert.py", line 953, in <module>
    main()
  File "LS_Bert.py", line 867, in main
    mask_index = words.index(mask_words[i])
ValueError: 'companies' is not in list
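The mismatch likely comes from tokenization differences (case or attached punctuation) between the dataset's target word and the re-split sentence. A defensive lookup, sketched under that assumption (the names `words` and `find_mask_index` follow the traceback; the helper itself is mine):

```python
# Hedged sketch: fall back to a case- and punctuation-insensitive search,
# and return None (so the caller can skip the sentence) instead of crashing.
def find_mask_index(words, target):
    try:
        return words.index(target)
    except ValueError:
        strip_chars = '.,;!?"\''
        lowered = [w.lower().strip(strip_chars) for w in words]
        t = target.lower().strip(strip_chars)
        return lowered.index(t) if t in lowered else None

print(find_mask_index(['The', 'Companies,', 'merged'], 'companies'))  # 1
print(find_mask_index(['no', 'match', 'here'], 'companies'))          # None
```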

Issue tested

Thanks for your reply! Sorry, I should have described the problem in more detail.
Here is the result of running your code.

mask_position = [2,3,4]
tokens = ['His', 'stories', 'g', '##lit', '##tered', 'with', 'color', ';']

ind = 0
for pos in mask_position:
    if (ind == 0):
        tokens[pos] = '[MASK]'
    else:
        del tokens[pos]
    ind = ind + 1

print(tokens)

['His', 'stories', '[MASK]', '##tered', 'color', ';']
I don't think this is what you wanted to get, is it?
I thought the tokens should look like this after running:

mask_position = [2,3,4]
tokens = ['His', 'stories', 'g', '##lit', '##tered', 'with', 'color', ';']

count = 0
mask_position_length = len(mask_position)
while count in range(mask_position_length):
    index = mask_position_length - 1 - count
    pos = mask_position[index]
    if index == 0:
        tokens[pos] = '[MASK]'
    else:
        del tokens[pos]
    count += 1
print(tokens)

['His', 'stories', '[MASK]', 'with', 'color', ';']

Thank you !

Originally posted by @TheoSeo93 in #2 (comment)
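Both loops above mutate the lists in place with manual index bookkeeping. For readers following along, the same transformation can be written with slicing, under the same assumption that mask_position is a contiguous ascending run of sub-word indices (the function name is mine, not from the repo):

```python
def collapse_to_mask(tokens, mask_position):
    # Replace the whole sub-word span [first, last] with a single [MASK].
    first, last = mask_position[0], mask_position[-1]
    return tokens[:first] + ['[MASK]'] + tokens[last + 1:]

tokens = ['His', 'stories', 'g', '##lit', '##tered', 'with', 'color', ';']
print(collapse_to_mask(tokens, [2, 3, 4]))
# ['His', 'stories', '[MASK]', 'with', 'color', ';']
```

Returning a new list also sidesteps the forward-vs-backward deletion question entirely.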
