cliang1453 / bond Goto Github PK

BOND: BERT-Assisted Open-Domain Name Entity Recognition with Distant Supervision

License: Apache License 2.0

Python 82.41% Shell 17.59%

ner named-entity-recognition distant-supervision roberta open-domain self-training weakly-supervised-learning weak-supervision weakly-supervised bert

bond's People

Contributors

Stargazers

Watchers

bond's Issues

Reproducing distant labels with gazetteer information

Congratulations on this paper getting accepted into KDD 2020!

I'm a Computer Science masters student at the National University of Singapore. I'm hoping to explore whether a better F1 score can be produced by improving the process of distant label generation. Would you have the gazetteers and code used to generate the distant labels to share?

Happy to share my progress with this as it is a semester project I am looking at. I am reachable at [email protected]

Thank you,
Jeanne

Trying BOND on new datasets and languages

Thank you for the great work. Would you be able to provide the Gazetteers data and distant label generation code? I would like to try BOND on new datasets. @cliang1453 @HMJiangGatech

Thank you in advance

Questions About variable `self_training_hp_label`

What do the different values of the variable self_training_hp_label in the code mean?
What does the variable hp_label in https://github.com/cliang1453/BOND/blob/32f26988a58ee44eb4f50772c6d6c6eb116c83cf/data_utils.py#L111 mean?
Thanks a lot!

what is the format of the dataset? how to convert any new dataset into this format?

what is the format of the dataset? what are json keys, and how are IDs assigned? Do these correspond to BERT tokenizer or something else?
how to convert any new dataset into this format?

RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

when i was running my own dataset, the following problem arised:
Traceback (most recent call last):
File "run_self_training_ner.py", line 752, in
main()
File "run_self_training_ner.py", line 681, in main
model, global_step, tr_loss, best_dev, best_test = train(args, train_dataset, model_class, config, tokenizer, labels, pad_token_label_id)
File "run_self_training_ner.py", line 353, in train
loss.backward()
File "/anaconda3/envs/bond/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/anaconda3/envs/bond/lib/python3.7/site-packages/torch/autograd/init.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

Are there any specific requirements for the data format, such as the sentence length?

What if i would like to use Electra model from huggingface transformers?

Hi,

Thanks for the amazing work.

Just one simple question, if i would like to use electra model from transformers library , besides, the BertConfig, tokenizers, and model load are all converted into electra tools,   any other code i have to edit in order to make the electra into Bond model? 

Thanks a lot.

Marcus

Comparison with Positive-unlabeled learning

Thank you for the great work! Just wondering if you may consider to compared BOND with the paper "Distantly Supervised Named Entity Recognition using Positive-Unlabeled Learning", since you share similar motivations and challenges of partially matched labels in distantly supervised NER.

I did notice that you may have used different entity dictionaries and datasets, but it would be great if you can share some comparison results.

Thanks!

Testing new dataset

Hi,
Thanks for providing the code.
The code works fine with datasets given in the repository, but, if I want to use it to process some other dataset, how should I construct the "words" list given in your current datasets, which has values for the attention-mask?

Results reproduction

I'm having trouble reproducing your results on CoNNL dataset.

All I have changed is batch sizes:

TRAIN_BATCH=8
EVAL_BATCH=8

Loss on eval consistently goes up on second stage of self training. Can you help me figure out what am I doing wrong?

two questions about your paper

thanks for sharing! i want to ask two quesions:
1、 the teacher model just initialized with the student model and generate pseudo labels,so why use it ? why not just use the student model to generate pseudo labels?
2、 if I have small full annotated data how to comine with you model?
thanks

question on stage 2 learning rate

Hi thanks for the work! I have some question on some implementations for stage 2
https://github.com/cliang1453/BOND/blob/master/run_self_training_ner.py#L204-L215

From the code, I can see stage 1 and stage 2 share the same scheduler, which means the learning rate for stage 2 is very small. Is this designed deliberately? The alternative is that I first train a baseline teacher's model, and pass the model to stage 2. And stage 2 can have its own learning rate scheduler then.

I am asking because I think learning rate is very important to BERT model training. Thanks.

Are pseudo labels with high confidence retained ?

In the paper, its mentioned that ' we select samples based on the prediction confidence of
the student model to further improve the quality of soft labels.' But its also mentioned that 'we discard all pseudo-labels from the (t-1)-th iteration, and only train the student model using pseudo-labels generated by the teacher model at the t-th iteration'.

Is the fist statement talking about calculating the loss of student model only on the high confidence pseudo labels or its something else because in the code i could'nt find any other justification for this line. Please suggest.

one question about "tags_hp" in the preprocessing stage

Hi,

In the /data_utils.py ,there is a strange point i would like to ask before trying implement stage 1.

In the function of "read_examples_from_file", there is a command hp_labels = item["tags_hp"]

However, i cannot search '"tags_hp" in the train.json of the five datasets you provided.

Would you mind to explain the function of this 'tags_hp' ? Thanks a lot.

def read_examples_from_file(data_dir, mode):

    file_path = os.path.join(data_dir, "{}.json".format(mode))
    guid_index = 1
    examples = []

    with open(file_path, 'r') as f:
        data = json.load(f)
    
        for item in data:
            words = item["str_words"]
            labels = item["tags"]
            if "tags_hp" in labels:
                hp_labels = item["tags_hp"]
            else:
                hp_labels = [None]*len(labels)
            examples.append(InputExample(guid="%s-%d".format(mode, guid_index), words=words, labels=labels, hp_labels=hp_labels))
            guid_index += 1

    return examples

Marcus

Question about the results

Hello, thanks for your good work. I am a beginner of the NLP. I want to know how to reproduce the results reported in the paper. what's the version of the transformers used?

The file `dataset/BC5CDR-chem/turn.py` is missing

About data format, it is said that we can transform file, e.g., BIO-format data, into json by referring to dataset/BC5CDR-chem/turn.py (in the semi_script dir). However, this file is not available. Could you help me with that?

Many thanks in advance.

About the gazetteers information and distant label generation code

Thanks for your great work! Could you please provide the gazetteers information and distant label generation code?

Could you please provide the codes for matching distant labels?

Questions about "soft labels"

BOND/run_self_training_ner.py

Line 240 in 3651a92

    
           inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": pred_labels, "label_mask": label_mask}

Hello, a few questions came to mind when I read the code dealing with "soft labels", and I wonder if you could kindly help:

What's the difference between "label_mask" and "attention_mask" here? And since BERT forward doesn't take "label_mask" as input, including the keyword "label_mask" seems to cause an unexpected keyword argument error when inputs is passed to model
It seems to me that pred_labels is of shape (batch_size , sequence_length , num_labels), which is different than what is accepted by BERT labels, i.e. (batch_size, sequence_length)(thus potential source of size does not match error when passed to BERT forward). And accoding to the paper, losses associated with soft labels are calculate in a different way than BERT loss, but in the code, soft label loss is also calculated by passing inputs to BERT model, so I'm a bit confused.

It would be grateful if you could kindly help, and have a nice day!

Distant label generation code

Thank you for the great work. Would you be able to provide the Gazetteers data and distant label generation code? I would like to try BOND on new datasets. My email address is [email protected].

Thank you in advance!

Can the NER model change from BERT+Linear_layer to BERT+CRF?

Thank you very much for the work! I have a question,

If keeping your framework unchanged, just use BERT+CRF instead of BERT+Linear_layer as the NER model, is it available? Thank you very much!

cliang1453 / bond Goto Github PK

bond's People

Contributors

Stargazers

Watchers

Forkers

bond's Issues

Recommend Projects

Recommend Topics

Recommend Org