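On Ascend, the AdamWeightDecay optimizer is composed of many small operators; see the appendix for the code.
In PyNative mode, the forward pass executes exactly as Python would run it: each small operator is a call from the Python side into its C++ implementation through pybind11. Running many small operators therefore means switching between the Python side and the C++ side many times.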

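These two points suggest one way to speed up network training in PyNative mode: expose AdamWeightDecay as a single large operator on the Python side, then let the backend Fission pass split that large operator back into the corresponding small operators. This greatly reduces the number of switches between Python and C++ and improves performance (on BERT Base, each step is about 500 ms faster).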
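As a rough illustration of why this helps, the toy model below counts boundary crossings for one optimizer step. It is only a sketch: `dispatch`, `step_small_ops`, and `step_fused` are hypothetical names, `dispatch` merely stands in for one Python-to-C++ call, and scalars replace tensors. The real small-operator version also issues Cast, Reshape, and Assign calls, so its count would be higher still.

```python
import math

# Toy model only: `dispatch` stands in for one Python -> C++ crossing.
# It is NOT MindSpore's dispatch path; every name here is hypothetical.
calls = {"count": 0}


def dispatch(fn, *args):
    """Pretend to cross the Python/C++ boundary once, then run fn."""
    calls["count"] += 1
    return fn(*args)


def step_small_ops(param, m, v, grad, lr, beta1, beta2, eps, weight_decay):
    """One AdamWeightDecay step built from small ops, one dispatch each."""
    def mul(a, b):
        return dispatch(lambda x, y: x * y, a, b)

    def add(a, b):
        return dispatch(lambda x, y: x + y, a, b)

    def sub(a, b):
        return dispatch(lambda x, y: x - y, a, b)

    def div(a, b):
        return dispatch(lambda x, y: x / y, a, b)

    def square(a):
        return dispatch(lambda x: x * x, a)

    def sqrt(a):
        return dispatch(math.sqrt, a)

    next_m = add(mul(beta1, m), mul(sub(1.0, beta1), grad))
    next_v = add(mul(beta2, v), mul(sub(1.0, beta2), square(grad)))
    update = div(next_m, add(eps, sqrt(next_v)))
    update = add(mul(weight_decay, param), update)
    next_param = sub(param, mul(lr, update))
    return next_param, next_m, next_v


def _fused_body(param, m, v, grad, lr, beta1, beta2, eps, weight_decay):
    """The same arithmetic, executed entirely on one side of the boundary."""
    next_m = beta1 * m + (1.0 - beta1) * grad
    next_v = beta2 * v + (1.0 - beta2) * (grad * grad)
    update = next_m / (eps + math.sqrt(next_v))
    update = weight_decay * param + update
    next_param = param - lr * update
    return next_param, next_m, next_v


def step_fused(*args):
    """One AdamWeightDecay step as a single large op: one dispatch in total."""
    return dispatch(_fused_body, *args)


if __name__ == "__main__":
    args = (1.0, 0.0, 0.0, 2.0, 0.1, 0.9, 0.999, 1e-6, 0.01)
    calls["count"] = 0
    step_small_ops(*args)
    print("small-op crossings:", calls["count"])
    calls["count"] = 0
    step_fused(*args)
    print("fused crossings:", calls["count"])
```

Both variants compute the same values; only the number of crossings differs, and that is the cost the large operator removes on the Python side.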

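The split on the C++ side is needed because kernels on Ascend come from the CANN package and cannot be written by hand. Where the kernel can be self-implemented, for example on GPU or CPU, the kernel can be written directly for the large operator (see the appendix), and the performance gain is even more pronounced.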
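To make "implementing the kernel directly as a large operator" concrete, here is a minimal sketch of what such a kernel computes, written in plain Python over flat lists rather than as a real GPU or CPU kernel. The function name `adam_weight_decay_fused` and its argument layout are assumptions for illustration, not MindSpore's kernel interface. The whole update happens in one pass over the elements, in place, with no intermediate tensors.

```python
import math


def adam_weight_decay_fused(param, m, v, grad, lr, beta1, beta2, eps, weight_decay):
    """Hypothetical fused update: one in-place pass over flat lists of floats.

    Mirrors the arithmetic of the small-operator version (no bias correction).
    """
    for i in range(len(param)):
        g = grad[i]
        # First and second moment estimates, updated in place.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # Adam direction plus decoupled weight decay.
        update = m[i] / (eps + math.sqrt(v[i])) + weight_decay * param[i]
        param[i] -= lr * update
    return param


if __name__ == "__main__":
    p, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
    adam_weight_decay_fused(p, m, v, [0.5, 0.25], 0.01, 0.9, 0.999, 1e-6, 0.01)
    print(p, m, v)
```

A real kernel would run the loop body once per thread or per vector lane, but the per-element arithmetic would be the same.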

Appendix

MindSpore AdamWeightDecay optimizer operator: official documentation

Fission code

AdamWeightDecay implemented with small operators:

# mindspore/python/mindspore/nn/optim/adam.py
@_adam_opt.register("Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor",
                    "Tensor", "Bool", "Bool")
def _update_run_op(beta1, beta2, eps, lr, weight_decay, param, m, v, gradient, decay_flag, optim_filter):
    """
    Update parameters.

    Args:
        beta1 (Tensor): The exp decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
        beta2 (Tensor): The exp decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
        eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
        lr (Tensor): Learning rate.
        weight_decay (Tensor): Weight decay. Should be equal to or greater than 0.
        param (Tensor): Parameters.
        m (Tensor): m value of parameters.
        v (Tensor): v value of parameters.
        gradient (Tensor): Gradient of parameters.
        decay_flag (bool): Applies weight decay or not.
        optim_filter (bool): Applies parameter update or not.

    Returns:
        Tensor, the updated parameter cast to the dtype of `param`, or the gradient cast to that dtype
        when `optim_filter` is False.
    """
    op_cast = P.Cast()
    if optim_filter:
        op_mul = P.Mul()
        op_square = P.Square()
        op_sqrt = P.Sqrt()
        op_cast = P.Cast()
        op_reshape = P.Reshape()
        op_shape = P.Shape()
        param_fp32 = op_cast(param, mstype.float32)
        m_fp32 = op_cast(m, mstype.float32)
        v_fp32 = op_cast(v, mstype.float32)
        gradient_fp32 = op_cast(gradient, mstype.float32)

        next_m = op_mul(beta1, m_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
                                                - beta1, gradient_fp32)

        next_v = op_mul(beta2, v_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
                                                - beta2, op_square(gradient_fp32))

        update = next_m / (eps + op_sqrt(next_v))
        if decay_flag:
            update = op_mul(weight_decay, param_fp32) + update

        update_with_lr = op_mul(lr, update)
        next_param = param_fp32 - op_reshape(update_with_lr, op_shape(param_fp32))

        next_param = F.depend(next_param, F.assign(param, op_cast(next_param, F.dtype(param))))
        next_param = F.depend(next_param, F.assign(m, op_cast(next_m, F.dtype(m))))
        next_param = F.depend(next_param, F.assign(v, op_cast(next_v, F.dtype(v))))

        return op_cast(next_param, F.dtype(param))
    return op_cast(gradient, F.dtype(param))
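
Read as math, the `optim_filter` branch of `_update_run_op` amounts to the update below. The symbols are introduced here only for illustration: $\theta$ is `param`, $g$ is `gradient`, $\eta$ is `lr`, $\lambda$ is `weight_decay`, and $\epsilon$ is `eps`. The $\lambda\,\theta_{t-1}$ term is present only when `decay_flag` is true, and this code applies no bias correction to $m_t$ or $v_t$.

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
u_t &= \frac{m_t}{\epsilon + \sqrt{v_t}} + \lambda\, \theta_{t-1} \\
\theta_t &= \theta_{t-1} - \eta\, u_t
\end{aligned}
$$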

AdamWeightDecay large-operator implementation on GPU
