
geyingli / unif

113 stars · 6 watchers · 27 forks · 6.44 MB

A deep learning framework for natural language processing, built on TensorFlow with a Scikit-Learn-style API. Supports more than 40 model classes covering language modeling, text classification, NER, MRC, knowledge distillation, and other areas.

License: Apache License 2.0

Python 100.00%
nlp tensorflow language-modeling mrc classification transformer distillation gpu-training bert deep-learning

unif's People

Contributors

anychnn, geyingli, yupeijei1997


unif's Issues

Adversarial training

Hello,

with tf.control_dependencies([init_op]):    # fix perturbation
    # Scale the randomly initialized perturbation to make sure the
    # norm of r is smaller than epsilon.
    shape = tf.cast(np.prod(init_r.shape.as_list()), tf.float32)
    r = tf.divide(init_r, tf.sqrt(shape))
    r = tf.IndexedSlices(values=r,
                         indices=grad.indices,
                         dense_shape=grad.dense_shape)
    attack_op = param.assign(param + r)

with tf.control_dependencies([init_op]):    # fix perturbation
    # Scale the randomly initialized perturbation to make sure the
    # norm of r is smaller than epsilon.
    r = tf.divide(init_r, tf.norm(init_r, np.inf))
    r = tf.IndexedSlices(values=r,
                         indices=grad.indices,
                         dense_shape=grad.dense_shape)
    attack_op = param.assign(param + r)

Regarding the initial perturbation r in the FreeLB mode of adversarial training: the two code blocks above compute r differently. Since both assign to r and attack_op, doesn't the second block simply overwrite the first when run in sequence, so the first should be commented out? The logic here is a bit unclear to me; a standalone sketch contrasting the two scalings follows below.
Also, it would be great to support keeping the dropout mask consistent across the perturbation steps of adversarial training.
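
For clarity, here is a minimal standalone sketch (not from the repo; the shape and epsilon are made up) contrasting the two scalings: the first shrinks the overall L2 norm by dividing by the square root of the element count, while the second normalizes by the infinity norm (the maximum absolute value).

    import tensorflow as tf

    epsilon = 0.3
    init_r = tf.random.uniform([8, 4], minval=-epsilon, maxval=epsilon)

    # Block 1: divide by sqrt(number of elements), shrinking the L2 norm.
    n = tf.cast(tf.size(init_r), tf.float32)
    r1 = tf.divide(init_r, tf.sqrt(n))

    # Block 2: divide by the infinity norm (the maximum absolute value).
    r2 = tf.divide(init_r, tf.reduce_max(tf.abs(init_r)))

    # Executed back to back, the second assignment to `r` (and `attack_op`)
    # replaces the first, so only the infinity-norm scaling takes effect.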

callbacks

Does the framework support running validation during training via callback-style functions, similar to Keras callbacks? A sketch of the kind of hook I mean follows.
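
For reference, a minimal sketch of the kind of hook meant here, written with the standard tf.keras callback API (the validation data names are made up):

    import tensorflow as tf

    class ValidationCallback(tf.keras.callbacks.Callback):
        """Runs an extra evaluation pass at the end of every epoch."""

        def __init__(self, x_val, y_val):
            super().__init__()
            self.x_val = x_val
            self.y_val = y_val

        def on_epoch_end(self, epoch, logs=None):
            results = self.model.evaluate(self.x_val, self.y_val, verbose=0)
            print('epoch %d validation results: %s' % (epoch, results))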

Single-machine multi-GPU

Hello,

While using unif, I ran into some questions about the function below; please take a look when you have time.

When the function below averages gradients, if grad is of type IndexedSlices it averages the values but takes the indices from the first grad. However, each grad's indices should differ: with four GPUs, a batch is split into four parts with different data, so each part touches different rows of embedding_table.

If so, taking the first grad's indices directly seems to drop the gradients for some rows of embedding_table. And averaging the values directly means averaging gradients that belong to different rows of embedding_table across sub-batches, i.e. averaging gradients of different parameters, whereas intuitively gradients should be averaged per parameter, which seems odd. Some single-machine multi-GPU gradient-averaging implementations I've seen online simply use tf.divide(tf.add_n(split_grads), len(split_grads)) regardless of whether the gradient is an IndexedSlices; would that resolve this concern?
https://github.com/geyingli/unif/blob/master/uf/utils.py#L748

def average_n_grads(split_grads):
    split_grads = [grad for grad in split_grads if grad is not None]

    # Deal with IndexedSlices for a large embedding matrix. The
    # gradient of an embedding matrix is not a plain tensor but a
    # tuple-like object named `IndexedSlices`, which needs special
    # processing.
    if split_grads[0].__str__().startswith('IndexedSlices'):
        all_values = [grad.values for grad in split_grads]

        values = tf.divide(tf.add_n(all_values), len(split_grads))
        indices = split_grads[0].indices
        dense_shape = split_grads[0].dense_shape

        return tf.IndexedSlices(
            values=values,
            indices=indices,
            dense_shape=dense_shape)
    return tf.divide(tf.add_n(split_grads), len(split_grads))
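
For what it's worth, a common way to average IndexedSlices gradients without dropping rows is to concatenate the values and indices from all devices; duplicate indices are summed when the slices are later applied to the variable. A minimal sketch (not the repo's implementation):

    import tensorflow as tf

    def average_indexed_slices(split_grads):
        # Pre-divide each device's values by the device count, then
        # concatenate; rows that repeat across devices are accumulated
        # when the IndexedSlices is applied to the variable, yielding
        # the same result as averaging the densified gradients.
        n = len(split_grads)
        values = tf.concat([g.values / n for g in split_grads], axis=0)
        indices = tf.concat([g.indices for g in split_grads], axis=0)
        return tf.IndexedSlices(values=values,
                                indices=indices,
                                dense_shape=split_grads[0].dense_shape)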

I also tried using tf.divide(tf.add_n(split_grads), len(split_grads)) directly, but then the FreeLB code below fails with an error that grad.indices cannot be found. I noticed that when an IndexedSlices is not returned, grad comes back as a clipped dense tensor; when it is returned, grad is an IndexedSlices and grad.indices exists.
https://github.com/geyingli/unif/blob/master/uf/processing.py#L483

r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)

Also, the init_r variable created here has shape [batch_size * max_seq_length, embedding_dim]. On a single machine with four GPUs, grad.indices should have shape [batch_size / 4 * max_seq_length], but the values passed in have shape [batch_size * max_seq_length], four times larger, which raises an error.

init_r = tf.get_variable(
    'init_r',
    shape=[module.batch_size * module.max_seq_length,
           param.shape.as_list()[-1]],
    initializer=tf.random_uniform_initializer(
        minval=-epsilon, maxval=epsilon),
    trainable=False)

r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)
InvalidArgumentError (see above for traceback): data.shape = [4096,768] does not start with segment_ids.shape = [1024]
	 [[node add_1/y (defined at /root/unif-tencent/uf/processing.py:590)  = UnsortedSegmentSum[T=DT_FLOAT, Tindices=DT_INT32, Tnumsegments=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](truediv_202, bert/embeddings/Reshape/_457, add_1/strided_slice)]]
	 [[{{node Assign/_476}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2147_Assign", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I discovered this problem while modifying the FreeLB implementation as follows.
The main change is that instead of using control dependencies, after adding the perturbation to embedding_table I build a new attack_embedding_table and wrap it in tf.stop_gradient to keep it from affecting the gradients of other parameters. During the forward pass, attack_embedding_table is passed to module._parallel_forward() as an argument so the embedding table can be swapped dynamically.

    def _freelb(self, module, alpha=0.3, epsilon=0.3, n_loop=3, **kwargs):
        # FreeLB is similar to PGD, but uses the average of the gradients
        # from the loop, i.e. grad = (first_grad + ... + last_grad) / n_loop
        #
        # Also, it initializes the perturbation not from the usual forward
        # propagation but from a uniform distribution within the epsilon
        # range; the actual gradient is not used for initialization. The
        # perturbation is then iterated in the same way as in PGD.
        # (epsilon: the norm of the perturbation, must be smaller than the
        # norm of the gradients)

        # initialize
        (d_grads, module._losses, module._probs, module._preds) = \
            module._parallel_forward(**self._kwargs)
        grad, param = utils.get_grad_and_param(
            module.trainable_variables, d_grads, 'word_embedding')
        init_r = tf.get_variable(
            'init_r',
            shape=[module.batch_size * module.max_seq_length,
                   param.shape.as_list()[-1]],
            initializer=tf.random_uniform_initializer(
                minval=-epsilon, maxval=epsilon),
            trainable=False)
        init_op = tf.variables_initializer([init_r])
        with tf.control_dependencies([init_op]):    # fix perturbation
            # Scale the randomly initialized perturbation to make sure
            # the norm of `r` is smaller than epsilon.
            shape = tf.cast(np.prod(init_r.shape.as_list()), tf.float32)
            r = tf.divide(init_r, tf.sqrt(shape))
            r = tf.IndexedSlices(values=r,
                                 indices=grad.indices,
                                 dense_shape=grad.dense_shape)

        # with tf.control_dependencies([init_op]):    # fix perturbation
        #     # Scale the randomly initialized perturbation to make sure
        #     # the norm of `r` is smaller than epsilon.
        #     r = tf.divide(init_r, tf.norm(init_r, np.inf))
        #     r = tf.IndexedSlices(values=r,
        #                          indices=grad.indices,
        #                          dense_shape=grad.dense_shape)
        #     attack_op = param.assign(param + r)

        # attack
        acc_r = r
        all_grads = []
        for k in range(n_loop):
            attack_param = param + acc_r  # <-- modified
            attack_param = tf.stop_gradient(attack_param)  # <-- modified
            module.attack_trainable_variables = [
                attack_param if v.name == 'bert/embeddings/word_embeddings:0'
                else v for v in module.trainable_variables]  # <-- modified
            (attack_grads, _, _, _) = \
                module._parallel_forward(
                    attack_embeddings=attack_param, **self._kwargs)  # <-- modified
            all_grads.append(attack_grads)
            grad, _ = utils.get_grad_and_param(
                module.attack_trainable_variables,
                attack_grads, attack_param.name)
            tmp_r = tf.multiply(alpha, grad / (tf.norm(grad) + 1e-9))

            # In order not to disturb the distribution of the gradient-
            # induced perturbation, we scale by the norm instead of
            # simply clipping the values.
            norm = tf.norm(acc_r + tmp_r)
            cur_r = tf.cond(norm > epsilon,
                            lambda: (acc_r + tmp_r) * tf.divide(epsilon, norm),
                            lambda: (acc_r + tmp_r))
            acc_r = cur_r

        attack_param = param + acc_r  # <-- modified
        attack_param = tf.stop_gradient(attack_param)  # <-- modified
        module.attack_trainable_variables = [
            attack_param if v.name == 'bert/embeddings/word_embeddings:0'
            else v for v in module.trainable_variables]  # <-- modified
        (attack_grads, _, _, _) = \
            module._parallel_forward(
                attack_embeddings=attack_param, **self._kwargs)  # <-- modified
        all_grads.append(attack_grads)

        # sum up
        grads = [utils.average_n_grads(split_grad)
                 for split_grad in zip(*all_grads)]
        update_params_op = utils.update_global_params(
            module.trainable_variables, module._global_step,
            module._optimizer, grads)
        update_step_op = module._global_step.assign(module._global_step + 1)
        module._train_op = tf.group([update_params_op, update_step_op])

This produces the error below, which essentially says that 1024 values were expected but 4096 were given. I set batch_size to 128 on a single machine with four GPUs, and max_seq_length to 32; 1024 is exactly 128 / 4 * 32, and 4096 is exactly 128 * 32, which is where my confusion above comes from. When I instead set init_r's shape to [batch_size * max_seq_length / n_device, embedding_dim], it runs correctly (see the sketch after the traceback).

Error location: https://github.com/geyingli/unif/blob/master/uf/modeling/bert.py#L174

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 1024 values, but the requested shape has 4096
         [[node gradients_4/bert_4/embeddings/embedding_look_up_grad/Reshape_1 (defined at /jizhi/jizhi2/worker/trainer/uf/core.py:859)  = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert_16/embeddings/ExpandDims, gradients_16/bert_16/embeddings/embedding_look_up_grad/ExpandDims)]]
         [[{{node concat_2/_10269}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_124285_concat_2", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
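
A minimal sketch of the per-device fix described above (n_device is an assumed variable holding the GPU count; the other names follow the code above):

    # Per-device perturbation: each GPU looks up batch_size / n_device
    # sequences, i.e. 128 / 4 * 32 = 1024 embedding rows in this setup.
    init_r = tf.get_variable(
        'init_r',
        shape=[module.batch_size // n_device * module.max_seq_length,
               param.shape.as_list()[-1]],
        initializer=tf.random_uniform_initializer(
            minval=-epsilon, maxval=epsilon),
        trainable=False)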

SMART adversarial training

Hello! I recently read your adversarial training code with great interest, and I have a small question about the SMART part.

runs at the start of each epoch

self.init_tilda_op = tilda.assign(param)

runs at the end of each epoch

self.update_tilda_op = tilda.assign(
    (1 - tilda_beta) * param + tilda_beta * tilda)

In this code, tilda_embedding is reset to the current word embedding at the start of every epoch, and then updated with momentum again at the end of the epoch. Doesn't the start-of-epoch reset overwrite the momentum update from the end of the previous epoch? If so, the final momentum update seems to have no effect. A small sketch of the ordering I mean follows.
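
To make the concern concrete, here is a minimal session-style sketch of the run order in question (n_epochs, batches, and train_op are placeholder names; the assumption is that the two ops run exactly at the epoch boundaries):

    # Hypothetical training loop illustrating the ordering concern.
    for epoch in range(n_epochs):
        # Start of epoch: tilda <- param. This discards whatever
        # update_tilda_op wrote at the end of the previous epoch.
        sess.run(self.init_tilda_op)
        for batch in batches:
            sess.run(train_op)    # param changes; tilda stays fixed
        # End of epoch: tilda <- (1 - tilda_beta) * param + tilda_beta * tilda
        sess.run(self.update_tilda_op)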

I hope you can find the time to resolve this small question. Thanks a lot!

Any bugs?

You are welcome to leave problems here. Any questions will be answered ASAP.

Adversarial training in TF2

class FreeAT(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        last_r = 0.0
        K = 3
        ep = 1e-3

        for t in range(K):
            with tf.GradientTape() as tape:
                y_pred = self(x, training=True)
                loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
            # One gradient call for all variables; the embedding table is
            # assumed to be the first trainable variable.
            gradients = tape.gradient(loss, self.trainable_variables)
            # Densify the (possibly IndexedSlices) embedding gradient.
            grad_values = tf.zeros_like(self.trainable_variables[0]) + gradients[0]
            sign = tf.cast(tf.greater(grad_values, 0.0), tf.float32)
            r = last_r + tf.multiply(ep, sign) if t > 0 else tf.multiply(ep, sign)
            r *= tf.divide(ep, tf.norm(r))
            # Apply the new perturbation and remove the previous one.
            self.trainable_variables[0].assign_add(r - last_r)
            last_r = r

        # Apply the gradients from the last inner step.
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)

        return {m.name: m.result() for m in self.metrics}

Hello, this is FreeAT rewritten with TF2 features, but it performs noticeably worse in experiments. Could that be because there is no restore_grad? I would also like to ask how the custom r variables of SMART and FreeLB should be implemented in TF2. Many thanks.
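
For the custom r variable, one option is a non-trainable tf.Variable that is re-initialized before each attack and added to the embedding table inside the lookup. A minimal sketch (not from the repo; the layer name, shapes, and scaling are assumptions, with the infinity-norm scaling modeled on the FreeLB initialization discussed above):

    import tensorflow as tf

    class PerturbedEmbedding(tf.keras.layers.Layer):
        """Embedding lookup with a persistent, non-trainable perturbation."""

        def __init__(self, vocab_size, dim, epsilon=1e-3, **kwargs):
            super().__init__(**kwargs)
            self.table = self.add_weight('table', shape=[vocab_size, dim])
            # Persistent perturbation, analogous to `init_r` in graph mode.
            self.r = tf.Variable(tf.zeros([vocab_size, dim]), trainable=False)
            self.epsilon = epsilon

        def reset_perturbation(self):
            # Uniform init, rescaled so the infinity norm of r equals epsilon.
            init = tf.random.uniform(self.r.shape, -self.epsilon, self.epsilon)
            self.r.assign(self.epsilon * init / tf.reduce_max(tf.abs(init)))

        def call(self, ids):
            return tf.nn.embedding_lookup(self.table + self.r, ids)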

question on wide and deep

Hi, I've noticed that you have implemented a wide-and-deep structure which is different from the classical "YouTube wide and deep". Here are my questions:

  1. What is the input of the wide side?
  2. What is the purpose of using an attention mechanism between the wide side and the deep features?

Thanks a lot.
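
For context, the classical formulation the question contrasts against simply sums the wide (linear) logits and the deep (MLP) logits, with no attention between the two branches. A minimal sketch (feature inputs and hidden sizes are made up):

    import tensorflow as tf

    def wide_and_deep_logits(wide_features, deep_features, hidden_units=(128, 64)):
        # Wide side: a linear model over (typically sparse/crossed) features.
        wide_logits = tf.keras.layers.Dense(1)(wide_features)

        # Deep side: an MLP over dense (e.g. embedded) features.
        h = deep_features
        for units in hidden_units:
            h = tf.keras.layers.Dense(units, activation='relu')(h)
        deep_logits = tf.keras.layers.Dense(1)(h)

        # Classical combination: a plain sum of the two logits.
        return wide_logits + deep_logits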
