cliang1453 / bond

289 stars · 34 forks · 117.27 MB

BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

License: Apache License 2.0

Languages: Python 82.41%, Shell 17.59%
Topics: bert, dataset, distant-supervision, fine-tuning, named-entity-recognition, natural-language-processing, ner, nlp, open-domain, pre-trained, roberta, self-training, weak-supervision, weakly-supervised, weakly-supervised-learning

bond's People

Contributors

cliang1453, hmjianggatech


bond's Issues

The file `dataset/BC5CDR-chem/turn.py` is missing

About the data format: the documentation says we can transform files, e.g. BIO-format data, into JSON by referring to dataset/BC5CDR-chem/turn.py (in the semi_script dir). However, this file is not available. Could you help me with that?

Many thanks in advance.
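Until the original turn.py surfaces, a minimal sketch of such a converter is below. The field names "str_words" and "tags" are taken from the provided train.json files; the tag2id mapping and the token/tag column layout of the BIO file are assumptions that must match your label list.

```python
import json

def bio_to_json(bio_path, json_path, tag2id):
    """Convert BIO-format data (one token and tag per line, blank line
    between sentences) into the repo's JSON format. tag2id maps tag
    strings (e.g. "B-PER") to the integer ids used for training."""
    sentences, words, tags = [], [], []
    with open(bio_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                # blank line: flush the finished sentence
                if words:
                    sentences.append({"str_words": words, "tags": tags})
                    words, tags = [], []
                continue
            parts = line.split()
            words.append(parts[0])       # first column: token
            tags.append(tag2id[parts[-1]])  # last column: tag
    if words:  # flush a trailing sentence with no final blank line
        sentences.append({"str_words": words, "tags": tags})
    with open(json_path, "w") as f:
        json.dump(sentences, f)
```

This is a sketch, not the authors' script; in particular it ignores any extra columns (POS tags, chunk tags) that a CoNLL-style file may carry.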

Are pseudo labels with high confidence retained?

In the paper, it's mentioned that "we select samples based on the prediction confidence of the student model to further improve the quality of soft labels." But it's also mentioned that "we discard all pseudo-labels from the (t-1)-th iteration, and only train the student model using pseudo-labels generated by the teacher model at the t-th iteration."

Is the first statement saying that the student model's loss is calculated only on the high-confidence pseudo labels, or is it something else? I couldn't find any other justification for this line in the code. Please suggest.
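For reference, the usual way to implement confidence-based selection is to mask out low-confidence tokens so the student loss only covers the confident ones. A minimal sketch (the 0.9 threshold is an assumption, not the paper's value):

```python
import torch

def confidence_mask(logits, threshold=0.9):
    """Given teacher logits of shape (batch, seq_len, num_labels),
    return hard pseudo labels plus a boolean mask that is True only
    where the max softmax probability exceeds the threshold. The
    training loss can then be averaged over masked tokens only."""
    probs = torch.softmax(logits, dim=-1)
    max_probs, pseudo_labels = probs.max(dim=-1)
    mask = max_probs > threshold
    return pseudo_labels, mask
```

Whether BOND applies this as a hard loss mask or as a sample-reweighting is exactly the question above; this sketch only shows the masking variant.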

Reproducing distant labels with gazetteer information

Congratulations on this paper getting accepted into KDD 2020!

I'm a Computer Science master's student at the National University of Singapore. I'm hoping to explore whether a better F1 score can be achieved by improving the process of distant label generation. Would you be able to share the gazetteers and the code used to generate the distant labels?

Happy to share my progress with this, as it is a semester project I am working on. I am reachable at [email protected]

Thank you,
Jeanne

Testing new dataset

Hi,
Thanks for providing the code.
The code works fine with the datasets given in the repository, but if I want to use it on some other dataset, how should I construct the "words" list in your current datasets, which has values for the attention mask?
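For what it's worth, the record layout inferred from the provided train.json files looks like the sketch below (field names from data_utils.py; the integer tag ids are illustrative assumptions). The attention mask itself is built later during BERT tokenization, not stored in the file:

```python
import json

# One sentence per record: "str_words" holds the raw tokens and
# "tags" holds one integer label id per token (same length).
# The ids below (B-PER=1, I-PER=2, O=0, B-ORG=3) are illustrative
# assumptions; use whatever mapping your label list defines.
record = {
    "str_words": ["Steve", "Jobs", "founded", "Apple"],
    "tags": [1, 2, 0, 3],
}
assert len(record["str_words"]) == len(record["tags"])
print(json.dumps([record]))
```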

two questions about your paper

thanks for sharing! I want to ask two questions:
1. The teacher model is just initialized from the student model and generates pseudo labels, so why use it? Why not just use the student model to generate pseudo labels?
2. If I have a small amount of fully annotated data, how can I combine it with your model?
thanks

one question about "tags_hp" in the preprocessing stage

Hi,

In data_utils.py, there is a strange point I would like to ask about before trying to implement stage 1.

In the function read_examples_from_file, there is a line hp_labels = item["tags_hp"]

However, I cannot find "tags_hp" in the train.json of any of the five datasets you provided.

Would you mind explaining the function of this "tags_hp"? Thanks a lot.

import json
import os

def read_examples_from_file(data_dir, mode):

    file_path = os.path.join(data_dir, "{}.json".format(mode))
    guid_index = 1
    examples = []

    with open(file_path, 'r') as f:
        data = json.load(f)

        for item in data:
            words = item["str_words"]
            labels = item["tags"]
            # check the item dict, not the labels list, for the optional key
            if "tags_hp" in item:
                hp_labels = item["tags_hp"]
            else:
                hp_labels = [None] * len(labels)
            # "{}-{}".format, not "%s-%d".format, so the guid is actually filled in
            examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
                                         words=words, labels=labels, hp_labels=hp_labels))
            guid_index += 1

    return examples

Marcus

Results reproduction

I'm having trouble reproducing your results on the CoNLL dataset.

All I have changed is batch sizes:

TRAIN_BATCH=8
EVAL_BATCH=8

Loss on eval consistently goes up in the second stage of self-training. Can you help me figure out what I am doing wrong?

What if I would like to use the Electra model from Hugging Face transformers?

Hi,

Thanks for the amazing work.

Just one simple question: if I would like to use the Electra model from the transformers library, and the BertConfig, tokenizer, and model loading are all converted to their Electra counterparts, is there any other code I have to edit in order to make Electra work in the BOND model?

Thanks a lot.

Marcus
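For context, swapping in the classes is the easy half; a sketch of registering Electra in a MODEL_CLASSES-style lookup is below. The dict name and tuple order are assumptions about the script's structure, and note that the repo's soft-label/label_mask logic lives in its modified *ForTokenClassification_v2 classes, so that forward would still need to be ported to an Electra subclass:

```python
from transformers import (ElectraConfig, ElectraForTokenClassification,
                          ElectraTokenizer)

# Hypothetical registration mirroring the (config, model, tokenizer)
# triples used for BERT/RoBERTa in run_self_training_ner.py; a plain
# ElectraForTokenClassification will reject the extra soft-label
# inputs until its forward is adapted.
MODEL_CLASSES = {
    "electra": (ElectraConfig, ElectraForTokenClassification, ElectraTokenizer),
}
```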

Distant label generation code

Thank you for the great work. Would you be able to provide the Gazetteers data and distant label generation code? I would like to try BOND on new datasets. My email address is [email protected].

Thank you in advance!

RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

When I was running my own dataset, the following problem arose:
Traceback (most recent call last):
File "run_self_training_ner.py", line 752, in
main()
File "run_self_training_ner.py", line 681, in main
model, global_step, tr_loss, best_dev, best_test = train(args, train_dataset, model_class, config, tokenizer, labels, pad_token_label_id)
File "run_self_training_ner.py", line 353, in train
loss.backward()
File "/anaconda3/envs/bond/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/anaconda3/envs/bond/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

Are there any specific requirements for the data format, such as the sentence length?
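A device-side assert during loss.backward() is very often a label id outside [0, num_labels), with over-long sentences as the other usual suspect. A quick pre-training sanity check (a sketch; max_len=510 assumes BERT's 512 limit minus [CLS]/[SEP]) might look like:

```python
import json

def validate_dataset(json_path, num_labels, max_len=510):
    """Scan a BOND-style JSON dataset and report records that commonly
    trigger CUDA device-side asserts: token/tag length mismatches,
    label ids outside [0, num_labels), and over-long sentences."""
    with open(json_path) as f:
        data = json.load(f)
    problems = []
    for i, item in enumerate(data):
        if len(item["str_words"]) != len(item["tags"]):
            problems.append((i, "length mismatch"))
        if any(t < 0 or t >= num_labels for t in item["tags"]):
            problems.append((i, "label id out of range"))
        if len(item["str_words"]) > max_len:
            problems.append((i, "sentence too long"))
    return problems
```

Running with CUDA_LAUNCH_BLOCKING=1 (or on CPU) also surfaces the true failing line instead of the deferred copy_if error.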

question on stage 2 learning rate

Hi, thanks for the work! I have some questions about the stage 2 implementation:
https://github.com/cliang1453/BOND/blob/master/run_self_training_ner.py#L204-L215

From the code, I can see that stage 1 and stage 2 share the same scheduler, which means the learning rate for stage 2 is very small. Is this designed deliberately? The alternative is to first train a baseline teacher model, pass it to stage 2, and give stage 2 its own learning rate scheduler.

I am asking because I think learning rate is very important to BERT model training. Thanks.
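The alternative described above, a fresh optimizer and linear warmup/decay schedule for stage 2, can be sketched as follows (this is not the repo's code; the hyperparameters and the decision to reset the optimizer state are assumptions):

```python
import torch

def fresh_stage2_scheduler(model, lr, total_steps, warmup_steps):
    """Build a new AdamW optimizer and a linear warmup-then-decay
    schedule for self-training (stage 2), instead of inheriting the
    nearly fully decayed stage-1 scheduler."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def lr_lambda(step):
        # linear warmup from 0, then linear decay back to 0
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

This mirrors what transformers' get_linear_schedule_with_warmup does, but scoped to stage 2 only.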

Question about the results

Hello, thanks for your good work. I am a beginner in NLP. I want to know how to reproduce the results reported in the paper. What version of transformers was used?

Questions about "soft labels"

inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": pred_labels, "label_mask": label_mask}

Hello, a few questions came to mind when I read the code dealing with "soft labels", and I wonder if you could kindly help:

  1. What's the difference between "label_mask" and "attention_mask" here? And since BERT's forward doesn't take "label_mask" as input, including the keyword "label_mask" seems to cause an unexpected-keyword-argument error when inputs is passed to the model.

  2. It seems to me that pred_labels is of shape (batch_size, sequence_length, num_labels), which is different from what BERT's labels argument accepts, i.e. (batch_size, sequence_length), and thus a potential source of a size-mismatch error when passed to BERT's forward. And according to the paper, losses associated with soft labels are calculated in a different way than the standard BERT loss, but in the code, the soft-label loss is also computed by passing inputs to the BERT model, so I'm a bit confused.

I would be grateful if you could kindly help. Have a nice day!
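On point 2, the usual soft-label objective is a cross-entropy between the teacher's soft distribution and the student's log-probabilities, masked by label_mask. A minimal sketch of that loss (an illustration of the idea, not the repo's exact implementation, which lives inside its modified token-classification forward and is why the plain BERT classes reject these kwargs):

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_labels, label_mask):
    """Soft cross-entropy between student logits and teacher soft
    labels, both of shape (batch, seq_len, num_labels), averaged over
    tokens where label_mask is 1."""
    log_probs = F.log_softmax(logits, dim=-1)
    per_token = -(soft_labels * log_probs).sum(dim=-1)  # (batch, seq_len)
    mask = label_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

With one-hot soft_labels this reduces to ordinary cross-entropy, which is why a single forward can serve both hard and soft pseudo-labels.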
