jayyip / m3tl

BERT for Multitask Learning

Home Page: https://jayyip.github.io/m3tl/

License: Apache License 2.0

Languages: Python 27.99%, Jupyter Notebook 71.98%, Makefile 0.03%
Topics: bert, cws, encoder-decoder, multi-task-learning, multitask-learning, named-entity-recognition, ner, nlp, part-of-speech, pretrained-models, text-classification, transformer, word-segmentation

m3tl's People

Contributors

aiikai, dependabot[bot], dlperf, jayyip, nivekney, skypow2012, yejunpeng


m3tl's Issues

Shape Mismatch error for new data set

Hey, I have been trying to use a sentiment analysis dataset with the imdb class (mentioned in the notebook) as a multitask problem.

This is the sample format of the sentiment data:

train_data = [['I', 'am', 'going', 'to', 'school', '.'], ['I', 'am', 'not', 'feeling', 'good', '.']]
train_labels = [0, 1]
test_data = [['I', 'wass', 'so', 'sick', 'yesterday', '.']]
test_labels = [1]

Unfortunately, this runs into the error:

ValueError: generator yielded an element of shape (48,) where an element of shape () was expected.

Can you kindly help me solve this issue?
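
A minimal sketch of how such a scalar-label classification problem is typically registered, assuming the preprocessing_fn decorator and the train_bert_multitask arguments that appear in later issues on this page (treat the exact names and label handling as assumptions for your installed version). The key constraint is that a 'cls' problem expects one scalar label per example, not a per-token label sequence, which is a common source of this kind of shape-() mismatch.

from bert_multitask_learning import preprocessing_fn, train_bert_multitask

@preprocessing_fn
def sentiment_cls(params, mode):
    if mode == 'train':
        inputs = [['I', 'am', 'going', 'to', 'school', '.'],
                  ['I', 'am', 'not', 'feeling', 'good', '.']]
        labels = [0, 1]   # one scalar label per example for a 'cls' problem
    else:
        inputs = [['I', 'wass', 'so', 'sick', 'yesterday', '.']]
        labels = [1]
    return inputs, labels

train_bert_multitask(
    problem='sentiment_cls',
    num_epochs=1,
    problem_type_dict={'sentiment_cls': 'cls'},
    processing_fn_dict={'sentiment_cls': sentiment_cls})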

How can two tasks with different input formats be trained together?

I have wanted to do multi-task learning with BERT for a while. I have separately done single-sentence multi-label classification with BERT, and NER with BERT. As I understand this project, I can run one fine-tuning pass that covers both tasks and end up with a fairly general fine-tuned model that performs reasonably well on both classification and NER. Is my understanding correct?

From the code, cws & NER are two tasks with the same input format (chained with &), while | is used for two tasks with different input formats.
I would like to ask whether I can do (multilabel_classification | NER). Could you explain how the data loading should be handled for two tasks with different input formats? I do not quite understand it myself. Many thanks. (See the sketch below.)
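
As a hedged illustration (based only on how the two operators are described in this and later issues, not verified against the current source), the problem strings could be chained like this; multilabel_cls and my_ner are placeholder problem names.

from bert_multitask_learning import train_bert_multitask

# '&' chains problems that share the same input text and are trained on the
# same examples; '|' chains problems with different corpora, which are then
# sampled alternately during training.
train_bert_multitask(problem='weibo_ner&weibo_cws')       # same input text
train_bert_multitask(problem='multilabel_cls|my_ner')     # different inputs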

Out-of-memory issue

I tried to run the notebook Run Pre-defined problems.ipynb
after

train_bert_multitask(problem='weibo_ner&weibo_cws', num_gpus=1, num_epochs=3)

I got the error message:

Traceback (most recent call last):
File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/params.py", line 206, in assign_problem
self.get_data_info(self.problem_list, self.ckpt_dir)
File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/params.py", line 270, in get_data_info
list(self.read_data_fn[problem](self, 'train')))
File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/create_generators.py", line 300, in create_single_problem_generator
example_list=example) for example in example_list
File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
self.retrieve()
File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
return future.result(timeout=timeout)
File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/cluster/tufts/**/lib/anaconda3/envs/1001-nlp/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

How much RAM do I need?

question for training process

I have a few questions about the training process. My task is a "cls1&cls2&cls3" task.

  1. For a classification task, the model uses pre-trained BERT to obtain a sentence representation of each input. How is this representation generated (how is it pooled)?
  2. What is the loss function of the classification task?
  3. Is the loss used for backpropagation the mean of these three classification losses?
  4. During backpropagation, does the model update the entire model (including BERT) or only the top layer?

(See the sketch below for questions 1-3.)
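
For reference, a hedged sketch of the standard BERT classification head (TF1 style, consistent with the snippets pasted later in these issues). Whether this repo pools exactly this way is an assumption, but the original BERT takes the final-layer [CLS] hidden state through a dense+tanh pooler and a softmax cross-entropy classifier, and a multi-problem loss is commonly the mean (or sum) of the per-task losses.

import tensorflow as tf

def cls_head(sequence_output, label_ids, num_labels, hidden_size=768):
    # pooled sentence representation: dense+tanh over the [CLS] token vector
    first_token = sequence_output[:, 0, :]
    pooled = tf.layers.dense(first_token, hidden_size, activation=tf.tanh)
    logits = tf.layers.dense(pooled, num_labels)
    per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=label_ids, logits=logits)
    return tf.reduce_mean(per_example_loss)

# For a 'cls1&cls2&cls3' problem the batch loss would then be something like
# (loss_1 + loss_2 + loss_3) / 3, and gradients flow into BERT itself unless
# layers are explicitly frozen (see the freeze_step discussion further down).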

Does this project support a text summarization (generation) task?

The introduction mentions that this project supports the seq2seq_text task:
seq2seq_text: Sequence to Sequence text generation problem

Could you advise how the data input (data_preprocessing) function should be written? Thanks!
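
A hedged sketch of what a seq2seq_text preprocessing function could look like, following the same decorator pattern as the classification problems in other issues on this page; the exact target format the package expects (token list vs. raw string) is an assumption to verify against the installed version.

from bert_multitask_learning import preprocessing_fn

@preprocessing_fn
def toy_summarization(params, mode):
    # hypothetical article/summary pairs; targets are returned like labels
    articles = [['今天', '天气', '很', '好', '，', '适合', '出门', '。']]
    summaries = [['天气', '好']]
    return articles, summaries

It would then presumably be registered with problem_type_dict={'toy_summarization': 'seq2seq_text'} when calling train_bert_multitask.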

Multi-task classification

Hi, thanks for your great work! I am wondering if I can use it for multi-task classification.

How to modify optimization.py of original BERT code to implement multi GPU training?

Hello! I am working on training BERT on multiple GPUs these days (using the official code released by Google Research). After specifying the RunConfig of the estimator with MirroredStrategy, I encountered "ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context." This error is the same as the one in https://github.com/tensorflow/tensorflow/issues/23986#issuecomment-444389363, where I found your reply. You said "You can take my implementation as a reference:", and that is why I came to this repo.
I modified the original optimization.py code with reference to your src/optimizer.py but still got the same error. Could you give me some advice about how to re-implement the optimizer in the original optimization.py?
The original official code of optimization.py is here https://github.com/google-research/bert/blob/master/optimization.py
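
For what it is worth, the change that usually resolves this particular error in BERT's optimization.py is to let apply_gradients own the global-step update instead of assigning the mirrored global step manually. A hedged sketch of the relevant lines inside create_optimizer follows (not a verified patch of this repo or of the official code):

train_op = optimizer.apply_gradients(
    zip(grads, tvars), global_step=tf.train.get_or_create_global_step())
# ...and drop the original manual increment:
#   new_global_step = global_step + 1
#   train_op = tf.group(train_op, [global_step.assign(new_global_step)])
# A bare assign() on the (mirrored) global step inside a tower/replica context
# is what typically triggers the aggregation-method error under MirroredStrategy.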

Using CollectiveAllReduce to train BERT on multiple GPUs

Dear author:
I see that you have implemented multi-GPU training for BERT with MirroredStrategy. Now I want to train on multiple GPUs with CollectiveAllReduce instead. I use tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2) to set up the distribution strategy, then call train_and_evaluate to start training, but I run into the error: unsupported operand type(s) for +: 'PerReplica' and 'str'. I do not know how to solve it. (I have 2 V100 GPUs.)

error: Invalid argument: cycle_length must be > 0

With version 0.3.4, running the notebook example Run Self Defined Problem produces the following error:
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: cycle_length must be > 0
[[node ExperimentalParallelInterleaveDataset (defined at /home/appadmin/anaconda3/envs/multi_task/lib/python3.6/site-packages/bert_multitask_learning-0.3.4-py3.6.egg/bert_multitask_learning/read_write_tfrecord.py:520) ]]
[[MultiDeviceIteratorToStringHandle/_8865]]
(1) Invalid argument: cycle_length must be > 0
[[node ExperimentalParallelInterleaveDataset (defined at /home/appadmin/anaconda3/envs/multi_task/lib/python3.6/site-packages/bert_multitask_learning-0.3.4-py3.6.egg/bert_multitask_learning/read_write_tfrecord.py:520) ]]
0 successful operations.
0 derived errors ignored.

The environment is as follows:
tensorflow==1.14.0
keras==2.3.1
tensor2tensor==1.15.5

How can I solve this?

Does your project support TF 1.15?

Hello, thank you very much for sharing.
While using your project, I found that tensor2tensor seems to only support TF 2.2 and above, while your project targets TF 1.13. After installing TF 1.15 and running the code from your demo, I got the following error:

File "/root/data/glusterfs_sharing_04_v3/11117720/bert-multitask-learning-master/bert_multitask_learning/run_bert_multitask.py", line 120, in train_bert_multitask
input_fn=train_input_fn, max_steps=params.train_steps, hooks=[train_hook])
File "/root/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 163, in __new__
'Must specify max_steps > 0, given: {}'.format(max_steps))
ValueError: Must specify max_steps > 0, given: 0

I am using the code from this notebook: https://github.com/JayYip/bert-multitask-learning/blob/master/notebooks/Run%20Pre-defined%20problems.ipynb
I hope you can help me with this; many thanks.

Possible error in model_fn

In model_fn.py, line 284:
train_op = optimizer.apply_gradients(
zip(grads, tvars), global_step=global_step)
but there is no apply_gradients method defined in optimizer.py.

A question about NaN loss

Hello, thanks for sharing the code! We are doing multi-task learning with several tasks. When training them together with this framework we get a NaN error, but training each task separately works fine. Do you have any idea where the problem might be? (I noticed a comment in the code that says "# WARNING: Potential nan created here! # TODO: Fix this.") Thanks!

[Question]OOM occurred when using larger batch_size, maybe data parallelism didn't work well

The multi-GPU-related issues under this project have helped me a lot.

Based on the original BERT code, with a batch_size of 24 and TensorFlow 1.13.1, I recently used the AdamWeightDecayOptimizer from your project to successfully train a classifier with bert-large-uncased on 2 x Tesla P40, and the predictions look fine.

But when I increased the batch_size to 32, I got the following OOM error. I then increased the number of GPUs to 3 but still hit OOM, so I suspect MirroredStrategy is not actually parallelizing the data. When I reduced the number of GPUs to 1 and the batch_size to 24, no OOM occurred.

Do you have any clues for solving this problem? Thank you very much!


error message:

WARNING:tensorflow:Efficient allreduce is not supported for IndexedSlices.
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = hierarchical_copy, num_packs = 0, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
...
Limit:                 22654317364
InUse:                 22621908992
MaxInUse:              22621909760
NumAllocs:                   13050
MaxAllocSize:            247209984
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/gradients/replica_1/bert/encoder/layer_20/intermediate/dense/Pow_grad/Pow (defined at /scripts/bert/custom_optimization.py:74) ]]

mirrored strategy:

# imports assumed by this excerpt
import os
import tensorflow as tf
from tensorflow.contrib.distribute import AllReduceCrossDeviceOps
from tensorflow.estimator import Estimator, RunConfig

dist_strategy = tf.contrib.distribute.MirroredStrategy(
    cross_device_ops=AllReduceCrossDeviceOps('nccl'))
log_every_n_steps = 8
run_config = RunConfig(
    train_distribute=dist_strategy,
    eval_distribute=dist_strategy,
    log_step_count_steps=log_every_n_steps,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
estimator = Estimator(
    model_fn=model_fn,
    params={},
    config=run_config)
...
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=FLAGS.max_seq_length,
    is_training=True,
    drop_remainder=True,
    batch_size=FLAGS.train_batch_size)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
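
A hedged observation on the input pipeline above (worth verifying against the TF 1.13 Estimator + MirroredStrategy semantics): the batch_size given to the input_fn is treated as a per-replica batch, so adding GPUs grows the global batch rather than splitting it, and per-GPU memory use does not go down. If that is what is happening here, keeping a global batch of 32 on 2 GPUs would mean feeding 16 per replica, for example:

num_gpus = 2  # hypothetical
train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=FLAGS.max_seq_length,
    is_training=True,
    drop_remainder=True,
    batch_size=FLAGS.train_batch_size // num_gpus)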

custom_optimization:

# imports assumed by this excerpt (as in BERT's optimization.py)
import re
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops, math_ops, resource_variable_ops, state_ops
from tensorflow.python.training.optimizer import Optimizer


def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps):
    """Creates an optimizer training op."""
    global_step = tf.train.get_or_create_global_step()

    learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)

    # Implements linear decay of the learning rate.
    learning_rate = tf.train.polynomial_decay(
        learning_rate,
        global_step,
        num_train_steps,
        end_learning_rate=0.0,
        power=1.0,
        cycle=False)

    # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
    # learning rate will be `global_step/num_warmup_steps * init_lr`.
    if num_warmup_steps:
        global_steps_int = tf.cast(global_step, tf.int32)
        warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

        global_steps_float = tf.cast(global_steps_int, tf.float32)
        warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

        warmup_percent_done = global_steps_float / warmup_steps_float
        warmup_learning_rate = init_lr * warmup_percent_done

        is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
        learning_rate = (
                (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

    # It is recommended that you use this optimizer for fine tuning, since this
    # is how the model was trained (note that the Adam m/v variables are NOT
    # loaded from init_checkpoint.)
    optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)

    # This is how the model was pre-trained.
    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)

    train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=global_step)

    # Normally the global step update is done inside of `apply_gradients`.
    # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
    # a different optimizer, you should probably take this line out.
    new_global_step = global_step + 1
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op


class AdamWeightDecayOptimizer(Optimizer):
    """A basic Adam optimizer that includes "correct" L2 weight decay."""

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="AdamWeightDecayOptimizer"):
        """Constructs a AdamWeightDecayOptimizer."""
        super(AdamWeightDecayOptimizer, self).__init__(False, name)

        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.exclude_from_weight_decay = exclude_from_weight_decay

    def _prepare(self):
        self.learning_rate_t = ops.convert_to_tensor(
            self.learning_rate, name='learning_rate')
        self.weight_decay_rate_t = ops.convert_to_tensor(
            self.weight_decay_rate, name='weight_decay_rate')
        self.beta_1_t = ops.convert_to_tensor(self.beta_1, name='beta_1')
        self.beta_2_t = ops.convert_to_tensor(self.beta_2, name='beta_2')
        self.epsilon_t = ops.convert_to_tensor(self.epsilon, name='epsilon')

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, 'm', self._name)
            self._zeros_slot(v, 'v', self._name)

    def _apply_dense(self, grad, var):
        learning_rate_t = math_ops.cast(
            self.learning_rate_t, var.dtype.base_dtype)
        beta_1_t = math_ops.cast(self.beta_1_t, var.dtype.base_dtype)
        beta_2_t = math_ops.cast(self.beta_2_t, var.dtype.base_dtype)
        epsilon_t = math_ops.cast(self.epsilon_t, var.dtype.base_dtype)
        weight_decay_rate_t = math_ops.cast(
            self.weight_decay_rate_t, var.dtype.base_dtype)

        m = self.get_slot(var, 'm')
        v = self.get_slot(var, 'v')

        # Standard Adam update.
        next_m = (
                tf.multiply(beta_1_t, m) +
                tf.multiply(1.0 - beta_1_t, grad))
        next_v = (
                tf.multiply(beta_2_t, v) + tf.multiply(1.0 - beta_2_t,
                                                       tf.square(grad)))

        update = next_m / (tf.sqrt(next_v) + epsilon_t)

        if self._do_use_weight_decay(var.name):
            update += weight_decay_rate_t * var

        update_with_lr = learning_rate_t * update

        next_param = var - update_with_lr

        return control_flow_ops.group(*[var.assign(next_param),
                                        m.assign(next_m),
                                        v.assign(next_v)])

    def _resource_apply_dense(self, grad, var):
        learning_rate_t = math_ops.cast(
            self.learning_rate_t, var.dtype.base_dtype)
        beta_1_t = math_ops.cast(self.beta_1_t, var.dtype.base_dtype)
        beta_2_t = math_ops.cast(self.beta_2_t, var.dtype.base_dtype)
        epsilon_t = math_ops.cast(self.epsilon_t, var.dtype.base_dtype)
        weight_decay_rate_t = math_ops.cast(
            self.weight_decay_rate_t, var.dtype.base_dtype)

        m = self.get_slot(var, 'm')
        v = self.get_slot(var, 'v')

        # Standard Adam update.
        next_m = (
                tf.multiply(beta_1_t, m) +
                tf.multiply(1.0 - beta_1_t, grad))
        next_v = (
                tf.multiply(beta_2_t, v) + tf.multiply(1.0 - beta_2_t,
                                                       tf.square(grad)))

        update = next_m / (tf.sqrt(next_v) + epsilon_t)

        if self._do_use_weight_decay(var.name):
            update += weight_decay_rate_t * var

        update_with_lr = learning_rate_t * update

        next_param = var - update_with_lr

        return control_flow_ops.group(*[var.assign(next_param),
                                        m.assign(next_m),
                                        v.assign(next_v)])

    def _apply_sparse_shared(self, grad, var, indices, scatter_add):
        learning_rate_t = math_ops.cast(
            self.learning_rate_t, var.dtype.base_dtype)
        beta_1_t = math_ops.cast(self.beta_1_t, var.dtype.base_dtype)
        beta_2_t = math_ops.cast(self.beta_2_t, var.dtype.base_dtype)
        epsilon_t = math_ops.cast(self.epsilon_t, var.dtype.base_dtype)
        weight_decay_rate_t = math_ops.cast(
            self.weight_decay_rate_t, var.dtype.base_dtype)

        m = self.get_slot(var, 'm')
        v = self.get_slot(var, 'v')

        m_t = state_ops.assign(m, m * beta_1_t,
                               use_locking=self._use_locking)

        m_scaled_g_values = grad * (1 - beta_1_t)
        with ops.control_dependencies([m_t]):
            m_t = scatter_add(m, indices, m_scaled_g_values)

        v_scaled_g_values = (grad * grad) * (1 - beta_2_t)
        v_t = state_ops.assign(v, v * beta_2_t, use_locking=self._use_locking)
        with ops.control_dependencies([v_t]):
            v_t = scatter_add(v, indices, v_scaled_g_values)

        update = m_t / (math_ops.sqrt(v_t) + epsilon_t)

        if self._do_use_weight_decay(var.name):
            update += weight_decay_rate_t * var

        update_with_lr = learning_rate_t * update

        var_update = state_ops.assign_sub(var,
                                          update_with_lr,
                                          use_locking=self._use_locking)
        return control_flow_ops.group(*[var_update, m_t, v_t])

    def _apply_sparse(self, grad, var):
        return self._apply_sparse_shared(
            grad.values, var, grad.indices,
            lambda x, i, v: state_ops.scatter_add(  # pylint: disable=g-long-lambda
                x, i, v, use_locking=self._use_locking))

    def _resource_scatter_add(self, x, i, v):
        with ops.control_dependencies(
                [resource_variable_ops.resource_scatter_add(
                    x.handle, i, v)]):
            return x.value()

    def _resource_apply_sparse(self, grad, var, indices):
        return self._apply_sparse_shared(
            grad, var, indices, self._resource_scatter_add)

    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay_rate:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True

model_fn:

is_training = (mode == tf.estimator.ModeKeys.TRAIN)

(total_loss, per_example_loss, logits, probabilities) = create_model(
    bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
    num_labels, use_one_hot_embeddings)

tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
    (assignment_map, initialized_variable_names
     ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
    
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

tf.logging.info("**** Trainable Variables ****")
for var in tvars:
    init_string = ""
    if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
    tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                    init_string)

if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = custom_optimization.create_optimizer(
        total_loss, learning_rate, num_train_steps, num_warmup_steps)
    output_spec = tf.estimator.EstimatorSpec(
        mode=mode,
        loss=total_loss,
        train_op=train_op,
        scaffold=scaffold_fn)
 ...

ValueError: The two structures don't have the same nested structure

Sorry to bother you again. When training multiple problems with freeze_step set to > 0, an exception like the one in the title is raised. I tried to debug it myself but could not resolve it, so I am hoping for your advice.

Environment:

  • python 3.6
  • tensorflow-gpu 1.15.0 (after pip install bert-multitask-learning, running the example reported that tensorflow >= 1.14 is required, so I upgraded to 1.15)

When the ValueError occurs:

  • Training a single problem with freeze_step set to 0 or > 0 raises no exception
  • Training two problems ('cls|seq_tag') with freeze_step set to 0 raises no exception, but setting it to > 0 raises the ValueError
  • The exception shows an IndexedSlicesSpec mismatch; the full traceback is below

Traceback (most recent call last):
File "debug_freeze.py", line 161, in <module>
model_dir=model_dir)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/bert_multitask_learning/run_bert_multitask.py", line 126, in train_bert_multitask
train_and_evaluate(estimator, train_spec, eval_spec)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1159, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1222, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1302, in _actual_train_model_distributed
self.config))
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 662, in _call_for_each_replica
fn, args, kwargs)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 880, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/bert_multitask_learning/model_fn.py", line 496, in model_fn
features, hidden_feature, loss_eval_pred, mode, warm_start)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/bert_multitask_learning/model_fn.py", line 457, in create_spec
train_scaffold)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/bert_multitask_learning/model_fn.py", line 384, in create_train_spec
aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)

File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
return grad_fn() # Exit early
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_grad.py", line 84, in _SwitchGrad
return merge(grad, name="cond_grad")[0], None
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 413, in merge
nest.assert_same_structure(inputs[0], v, expand_composites=True)
File "/home/jp/anaconda3/envs/jp_test/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 326, in assert_same_structure
% (str(e), str1, str2))
ValueError: The two structures don't have the same nested structure.

First structure: type=IndexedSlices str=IndexedSlices(indices=Tensor("gradients/cond/Merge_grad/cond_grad/Switch_1:0", shape=(?,), dtype=int64, device=/replica:0/task:0/device:GPU:0), values=Tensor("gradients/cond/Merge_grad/cond_grad/Switch:0", shape=(?, ?, 768), dtype=float32, device=/replica:0/task:0/device:GPU:0), dense_shape=Tensor("gradients/cond/Merge_grad/cond_grad/Switch_2:0", shape=(3,), dtype=int64, device=/replica:0/task:0/device:GPU:0))

Second structure: type=IndexedSlices str=IndexedSlices(indices=Tensor("gradients/cond/Switch_1_grad/cond_grad/Cast:0", shape=(?,), dtype=int64, device=/replica:0/task:0/device:GPU:0), values=Tensor("gradients/zeros:0", shape=(?, ?, 768), dtype=float32, device=/replica:0/task:0/device:GPU:0), dense_shape=Tensor("gradients/cond/Switch_1_grad/cond_grad/Shape:0", shape=(3,), dtype=int32, device=/replica:0/task:0/device:GPU:0))

More specifically: Incompatible CompositeTensor TypeSpecs: type=IndexedSlicesSpec str=IndexedSlicesSpec(TensorShape([Dimension(None), Dimension(None), Dimension(768)]), tf.float32, tf.int64, tf.int64, TensorShape([Dimension(None)])) vs. type=IndexedSlicesSpec str=IndexedSlicesSpec(TensorShape([Dimension(None), Dimension(None), Dimension(768)]), tf.float32, tf.int64, tf.int32, TensorShape([Dimension(None)]))
Entire first structure:
.
Entire second structure:
.

Some questions

Let me revise my questions:
1. Hello, when I run the Run Pre-defined problems.ipynb notebook with my own model, set batch_size to 1, set multiprocess to False, and use a single task, the entire GPU memory is still filled.

Actually I tried all kinds of batch sizes and it is always like this; I do not know where the problem is. This is how I modified your code:
[screenshot]

Besides installing via pip, I also tried downloading your code and running it directly, with the same problem. Also, apart from batch_size and multiprocess, which other parameters can I adjust to reduce GPU memory usage?

2. Also, when calling train_bert_multitask, eval_feature_desc.json is not created.
[screenshot]

I eventually traced it to an error in train_and_evaluate(estimator, train_spec, eval_spec).

3. The last question: sometimes the following problem appears when running the program, but it succeeds when I run it again.
[screenshot]

Classification question

In the classification example in the notebook, is params.init_checkpoint = 'models/cased_L-12_H-768_A-12' the original BERT model? Also, is there any documentation of the model, or a paper, for this project? Thanks.

When I use the new version, there are some problems.

'''
WARNING:root:bert_config not exists. will load model from huggingface checkpoint.
Traceback (most recent call last):
File "run_weibo_ner_cws.py", line 31, in <module>
train_bert_multitask(problem='weibo_ner&weibo_cws', params=params, problem_type_dict=problem_type_dict,
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/bert_multitask_learning/run_bert_multitask.py", line 113, in train_bert_multitask
params.assign_problem(problem, gpu=int(num_gpus),
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/bert_multitask_learning/params.py", line 221, in assign_problem
self.prepare_dir(base_dir, dir_name, self.problem_list)
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/bert_multitask_learning/params.py", line 491, in prepare_dir
tokenizer = load_transformer_tokenizer(
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/bert_multitask_learning/utils.py", line 278, in load_transformer_tokenizer
tok = getattr(transformers, load_module_name).from_pretrained(
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 188, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/data/home/likai/.conda/envs/lkai_tf2/lib/python3.8/site-packages/transformers/configuration_auto.py", line 289, in from_pretrained
raise ValueError(
ValueError: Unrecognized model in models/weibo_cws_weibo_ner_ckpt/tokenizer. Should have a model_type key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert
'''
The file 'config.json' under the path 'models/weibo_cws_weibo_ner_ckpt/tokenizer' is regenerated each time I run the program, and it contains no model_type. Do you know what the problem is? Hoping for a response, thank you.
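
A hedged workaround (an assumption about the cause rather than a confirmed fix for this repo): transformers' AutoConfig refuses any config.json without a model_type key, so patching the generated tokenizer config to declare the underlying model type may unblock loading, e.g.:

import json

cfg_path = 'models/weibo_cws_weibo_ner_ckpt/tokenizer/config.json'
with open(cfg_path) as f:
    cfg = json.load(f)
cfg.setdefault('model_type', 'bert')   # assuming a BERT checkpoint
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)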

TopLayer for Regression

It would be useful to create a TopLayer for regression-type problems, where the label is a score, for instance similarity metrics between embeddings.
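
A hedged sketch of what such a layer could look like (plain Keras, in the style of the TF2 code path referenced in other issues); this is not an existing class in the package, and how it would plug into the package's top-layer machinery is left open.

import tensorflow as tf

class RegressionTop(tf.keras.layers.Layer):
    """Dense(1) head with a mean-squared-error training loss."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(1)

    def call(self, pooled_hidden, labels=None):
        score = tf.squeeze(self.dense(pooled_hidden), axis=-1)
        if labels is not None:
            labels = tf.cast(labels, score.dtype)
            self.add_loss(tf.reduce_mean(tf.square(score - labels)))
        return score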

I would like to understand the training process in more detail

Thank you very much; your repo has given me great material for learning multi-task learning.
I am planning to use pre-trained BERT for multi-task learning of text classification and NER (the two tasks have different inputs). As I understand it, hard-parameter-sharing multi-task learning has two training schemes:

  • Joint training: the losses of the two tasks are added into a single loss, and only that combined loss is optimized
  • Alternating training: the losses are not added; the two tasks are trained alternately, e.g. A_batch1, B_batch1, A_batch2, B_batch2, ...

I want to study the effect of joint vs. alternating training on the two tasks above. I saw in your documentation that the problem parameter must use | when the inputs differ. I have two questions:

  1. When the inputs differ, is alternating training (using |, randomly sampling one task per step) the only option?
  2. When the inputs are the same, does & give joint training and | give alternating training? (A conceptual sketch follows below.)
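
A conceptual sketch of the distinction the question draws (plain Python, not this repo's code; the loss values are placeholders):

import random

def joint_step(loss_cls, loss_ner):
    # '&'-style joint training: one shared batch feeds both heads and the
    # combined loss is optimized in a single step
    return (loss_cls + loss_ner) / 2.0

def alternating_step(per_task_loss):
    # '|'-style alternating training: sample one task per step and optimize
    # only that task's loss on its own batch
    task = random.choice(list(per_task_loss))
    return per_task_loss[task]

# e.g. joint_step(0.7, 1.1) vs. alternating_step({'cls': 0.7, 'ner': 1.1})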

How to solve the loss-0 error

Just as you note, "Therefore, in a particular batch, some tasks might not be sampled, and their loss could be 0 in this batch." Now I get the error:

tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

with a multitask problem. How can I solve that?

How to prepare two sequences as input for bert-multitask-learning?

Hi, I have a dataset that involves 2 sequences and the task is classifying the sequence pair. I am not sure how to prepare the input in this case. So far, I have been working with only one sequence where I used the following format:

["Everyone", "should", "be", "happy", "."]
How do I extend this to 2 sequences? Do I have to insert a [SEP] token myself?

Example for Pre-training

Hi, is this only for fine-tuning?
I see all examples are about fine-tuning.
Does it support pre-training? Any examples?

Multi-GPU BERT training problem

ValueError: You must specify an aggregation method to update a MirroredVariable in Tower Context
How should the optimizer be modified?

Asking about some details of the multi-GPU support

Hi, I'm new to TensorFlow and not very familiar with multi-GPU training using tf.estimator. Could you explain the key points of how you modified the original code to implement multi-GPU support, especially in optimizer.py? By the way, are there any big changes in estimator.py compared with the original code? Thanks in advance!

Why such large architectural changes?

Hi! Great to see such tremendous work done. One question: why do you consider such large changes to the project architecture necessary? Would simply passing a DistributionStrategy to the estimator not be enough? And did you try Horovod from Uber for the same purpose?

question

from .ner_data import *
from .test_data import *
from .test_data import *

Excuse me, is there a small error in the file at 'bert_multitask_learning/predefined_problems/__init__.py'? Should one of these imports be cws_data?
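
Presumably (an assumption about the intended contents, based only on the question above) the duplicated import was meant to be cws_data:

from .ner_data import *
from .cws_data import *
from .test_data import *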

notebook is not working

I think your patch broke some notebook files, such as Run Self Defined Problem with Modified Model.ipynb.
It tries to import create_single_problem_generator, which does not exist anymore.
Can you fix it?

question for tokenization

Do you apply WordPiece tokenization during preprocessing? In the example, your input data is already tokenized, so I am wondering about this.

problem

How can this problem be solved?
[screenshot]

installation issue

Collecting googleapis-common-protos (from tensorflow-metadata->tensorflow-datasets->tensor2tensor->bert-multitask-learning)
Downloading https://mirrors.aliyun.com/pypi/packages/eb/ee/e59e74ecac678a14d6abefb9054f0bbcb318a6452a30df3776f133886d7d/googleapis-common-protos-1.6.0.tar.gz
ERROR: Complete output from command python setup.py egg_info:
ERROR: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 11, in <module>
from setuptools.extern.six.moves import filterfalse, map
File "/usr/lib/python3/dist-packages/setuptools/extern/__init__.py", line 1, in <module>
from pkg_resources.extern import VendorImporter
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2927, in <module>
@_call_aside
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2913, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2952, in _initialize_master_working_set
add_activation_listener(lambda dist: dist.activate())
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 956, in subscribe
callback(dist)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2952, in <lambda>
add_activation_listener(lambda dist: dist.activate())
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2515, in activate
declare_namespace(pkg)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2097, in declare_namespace
_handle_ns(packageName, path_item)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2047, in _handle_ns
_rebuild_mod_path(path, packageName, module)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2066, in _rebuild_mod_path
orig_path.sort(key=position_in_sys_path)
AttributeError: '_NamespacePath' object has no attribute 'sort'
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-s47a0wqk/googleapis-common-protos/
ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/bert-multitask-learning$

confused about the bert-multitask-learning/baseline.md

Hi~

I'm a bit confused about the results in baseline.md:
Does "baseline" mean the BERT model without multi-task learning, and "multitask_label_transfer_first_train" the BERT model with multi-task learning?
It seems that all of the results show that multitask_label_transfer_first_train is not as good as the baseline?

Hope to get your reply. Thanks a lot.

global step double increment

https://github.com/JayYip/bert-multitask-learning/blob/9f6950eb72a98e877558701d4ade9aec9ba76bc9/src/model_fn.py#L326

optimizer.apply_gradients() increments the global step once when the global_step parameter is not None. So on this line the global step is increased again, which does not make sense. This issue can affect the number of training steps when using the max_steps or steps parameter in estimator.train().

BTW, thanks for sharing the optimization implementation for BERT multi-gpu.

Reference: https://github.com/tensorflow/tensorflow/blob/743dea5427b11ff08782f4c352b8424aa7fad982/tensorflow/python/training/optimizer.py#L546
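
A hedged sketch of the two usual ways to avoid the double increment, reusing the names from the create_optimizer excerpt earlier on this page (either one, not both):

# Option A: let apply_gradients own the increment and drop the manual update.
train_op = optimizer.apply_gradients(
    zip(grads, tvars), global_step=global_step)

# Option B: keep the manual update and pass global_step=None, so the optimizer
# does not increment the step a second time.
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=None)
new_global_step = global_step + 1
train_op = tf.group(train_op, [global_step.assign(new_global_step)])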

About training time

Have you compared training time between multiple GPUs and a single GPU?
I implemented run_squad from BERT following this approach and found that 2 GPUs take about the same time as a single GPU, and 2 GPUs are even a bit faster than 4 GPUs.
I am using TF 1.13 with 1080 GPUs.

Multi-GPU training step count problem

Hello, using the optimizer from this repo I found that the total number of training steps does not decrease. With multiple GPUs the number of training steps should in theory decrease linearly. Have you run into this problem? Thanks.

'TFBertEmbeddings' object has no attribute 'word_embeddings'

Code:
from bert_multitask_learning import train_bert_multitask, eval_bert_multitask, predict_bert_multitask
problem_type_dict = {'toy_cls': 'cls', 'toy_seq_tag': 'seq_tag'}

problem = 'toy_cls&toy_seq_tag'
model = train_bert_multitask(
problem=problem,
num_epochs=1,
problem_type_dict=problem_type_dict,
processing_fn_dict=processing_fn_dict,
#continue_training=True
)

Error:
/root/.local/lib/python3.7/site-packages/bert_multitask_learning/run_bert_multitask.py in train_bert_multitask(problem, num_gpus, num_epochs, model_dir, params, problem_type_dict, processing_fn_dict, model, create_tf_record_only, steps_per_epoch, warmup_ratio, continue_training, mirrored_strategy)
257
258 model = create_keras_model(
--> 259 mirrored_strategy=mirrored_strategy, params=params, mode=mode, inputs_to_build_model=one_batch)
260
261 _train_bert_multitask_keras_model(

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/run_bert_multitask.py in create_keras_model(mirrored_strategy, params, mode, inputs_to_build_model, model)
91 if mirrored_strategy is not None:
92 with mirrored_strategy.scope():
---> 93 model = _get_model_wrapper(params, mode, inputs_to_build_model, model)
94 else:
95 model = _get_model_wrapper(params, mode, inputs_to_build_model, model)

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/run_bert_multitask.py in _get_model_wrapper(params, mode, inputs_to_build_model, model)
51 def _get_model_wrapper(params, mode, inputs_to_build_model, model):
52 if model is None:
---> 53 model = BertMultiTask(params)
54 # model.run_eagerly = True
55 if mode == 'resume':

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/model_fn.py in __init__(self, params, name)
261 self.params = params
262 # initialize body model, aka transformers
--> 263 self.body = BertMultiTaskBody(params=self.params)
264 # mlm might need word embedding from bert
265 # build sub-model

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/model_fn.py in __init__(self, params, name)
63 super(BertMultiTaskBody, self).init(name=name)
64 self.params = params
---> 65 self.bert = MultiModalBertModel(params=self.params)
66 if self.params.custom_pooled_hidden_size:
67 self.custom_pooled_layer = tf.keras.layers.Dense(

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/modeling.py in __init__(self, params, use_one_hot_embeddings)
40 # multimodal input dense
41 embedding_dim = get_embedding_table_from_model(
---> 42 self.bert_model).shape[-1]
43 self.modal_name_list = ['image', 'others']
44 self.multimodal_dense = {modal_name: tf.keras.layers.Dense(

/root/.local/lib/python3.7/site-packages/bert_multitask_learning/utils.py in get_embedding_table_from_model(model)
397 def get_embedding_table_from_model(model):
398 base_model = get_transformer_main_model(model)
--> 399 return base_model.embeddings.word_embeddings
400
401

AttributeError: 'TFBertEmbeddings' object has no attribute 'word_embeddings'

Bad recall f1 precision

Hello, I am still trying to train an NER task, but every time I get a very low loss after 100 steps, and after evaluating I get:
Acc Score: 0.839600
Precision Score: 0.125084
Recall Score: 0.216169
F1 Score: 0.158471
The scores still do not change after 1000 or more steps.
Can you help me? Maybe my data processing function is bad, but I compared its outputs with the predefined functions and everything looks okay.

running issue

Hi,
I am trying to run the example in "Run Self Defined Problem", but I get an error in the "Train Model" part. I am using Python 3.6.3 and bert-multitask-learning 0.2.7.

Adding new problem imdb_cls, problem type: cls
INFO:tensorflow:Saving preprocessing files to tmp/imdb_cls_train_data.pkl
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/home/chiyu94/bert_multitask/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/chiyu94/bert_multitask/lib/python3.6/site-packages/bert_multitask_learning/__init__.py", line 4, in <module>
    from .model_fn import *
  File "/home/chiyu94/bert_multitask/lib/python3.6/site-packages/bert_multitask_learning/model_fn.py", line 5, in <module>
    from .bert import modeling
ModuleNotFoundError: No module named 'bert_multitask_learning.bert'
'''

The above exception was the direct cause of the following exception:

BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-7-3420e0b145e9> in <module>
      4 train_bert_multitask(problem='imdb_cls', num_gpus=0, 
      5                      num_epochs=10, params=params,
----> 6                      problem_type_dict=new_problem_type, processing_fn_dict=new_problem_process_fn_dict)

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/run_bert_multitask.py in train_bert_multitask(problem, num_gpus, num_epochs, model_dir, params, problem_type_dict, processing_fn_dict, model)
    106                 problem_name=new_problem, problem_type=problem_type_dict[new_problem], processing_fn=new_problem_processing_fn)
    107     params.assign_problem(problem, gpu=int(num_gpus),
--> 108                           base_dir=base_dir, dir_name=dir_name)
    109     params.to_json()
    110 

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/params.py in assign_problem(self, flag_string, gpu, base_dir, dir_name, is_serve)
    191         self.prepare_dir(base_dir, dir_name, self.problem_list)
    192 
--> 193         self.get_data_info(self.problem_list, self.ckpt_dir)
    194 
    195         if not is_serve:

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/params.py in get_data_info(self, problem_list, base)
    255 
    256                     self.data_num_dict[problem] = len(
--> 257                         list(self.read_data_fn[problem](self, 'train')))
    258                     self.data_num += self.data_num_dict[problem]
    259                 else:

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/create_generators.py in create_single_problem_generator(problem, inputs_list, target_list, label_encoder, params, tokenizer, mode)
    295 
    296             return_dict_list_list = Parallel(num_process)(delayed(partial_fn)(
--> 297                 example_list=example) for example in example_list
    298             )
    299 

~/bert_multitask/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
    932 
    933             with self._backend.retrieval_context():
--> 934                 self.retrieve()
    935             # Make sure that we get a last message telling us we are done
    936             elapsed_time = time.time() - self._start_time

~/bert_multitask/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    831             try:
    832                 if getattr(self._backend, 'supports_timeout', False):
--> 833                     self._output.extend(job.get(timeout=self.timeout))
    834                 else:
    835                     self._output.extend(job.get())

~/bert_multitask/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    519         AsyncResults.get from multiprocessing."""
    520         try:
--> 521             return future.result(timeout=timeout)
    522         except LokyTimeoutError:
    523             raise TimeoutError()

/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

How to add vocab

Hello, thank you for this great implementation. I want to use a different vocab and init checkpoint, but DynamicBatchSizeParams() only has a 'decode_vocab_file' attribute. Is that what is used for adding a vocab?

What runtime environment does the code require?

Thanks for sharing the code. As a beginner, may I ask what environment this code requires? Previously I ran BERT fine-tuning on Python 2.7 with TF 1.12.0 without problems, but running this code I hit the following error:

def train_eval_input_fn(config: Params, mode='train', epoch=None):
SyntaxError: invalid syntax

I have never seen this "config colon Params" syntax before. Am I missing a package, or is my environment wrong? Thanks.
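
For reference, config: Params is a Python 3 function annotation (a type hint), which is a SyntaxError on Python 2.7; a brief illustration:

# Valid on Python 3.5+ (which this repo targets), SyntaxError on Python 2.7:
def train_eval_input_fn(config: 'Params', mode='train', epoch=None):
    pass

# Python 2 has no annotation syntax; the un-annotated equivalent would be
#   def train_eval_input_fn(config, mode='train', epoch=None): ...
# so the code itself needs Python 3 to run.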

Will multi-modal input be supported in the future?

Hello, I found your repo while searching GitHub for BERT multitask. I can hardly find anyone on GitHub who has integrated a BERT multitask multi-modal architecture; there are BERT multi-modal repos but without multitask, so it would be perfect if multi-modal could be supported. Thanks!

NoneTypes of labels

Hi, I am trying to run classification tasks. I can successfully run my code the first time, but I get an error when I remove the results of the first run and rerun the model. The error is:

Adding new problem country_cls, problem type: cls
Adding new problem gender_cls, problem type: cls
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-50089851ce6e> in <module>
      4 train_bert_multitask(problem='country_cls&gender_cls', num_gpus=2, 
      5                      num_epochs=2, params=params,
----> 6                      problem_type_dict=new_problem_type, processing_fn_dict=new_problem_process_fn_dict)

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/run_bert_multitask.py in train_bert_multitask(problem, num_gpus, num_epochs, model_dir, params, problem_type_dict, processing_fn_dict, model)
    106                 problem_name=new_problem, problem_type=problem_type_dict[new_problem], processing_fn=new_problem_processing_fn)
    107     params.assign_problem(problem, gpu=int(num_gpus),
--> 108                           base_dir=base_dir, dir_name=dir_name)
    109     params.to_json()
    110 

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/params.py in assign_problem(self, flag_string, gpu, base_dir, dir_name, is_serve)
    191         self.prepare_dir(base_dir, dir_name, self.problem_list)
    192 
--> 193         self.get_data_info(self.problem_list, self.ckpt_dir)
    194 
    195         if not is_serve:

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/params.py in get_data_info(self, problem_list, base)
    255 
    256                     self.data_num_dict[problem] = len(
--> 257                         list(self.read_data_fn[problem](self, 'train')))
    258                     self.data_num += self.data_num_dict[problem]
    259                 else:

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/data_preprocessing/preproc_decorator.py in wrapper(params, mode)
     13         if os.path.exists(pickle_file) and params.multiprocess:
     14             label_encoder = get_or_make_label_encoder(
---> 15                 params, problem=problem, mode=mode)
     16             return create_single_problem_generator(
     17                 func.__name__,

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/utils.py in get_or_make_label_encoder(params, problem, mode, label_list, zero_class)
    136             label_encoder = LabelEncoder()
    137 
--> 138             label_encoder.fit(label_list, zero_class=zero_class)
    139             label_encoder.dump(le_path)
    140         else:

/lustre03/project/6007993/chiyu94/bert-multitask-learning-master/bert_multitask_learning/utils.py in fit(self, y, zero_class)
     28         self.encode_dict = {}
     29         self.decode_dict = {}
---> 30         label_set = set(y)
     31         if zero_class is None:
     32             zero_class = '[PAD]'

TypeError: 'NoneType' object is not iterable
