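On Ascend, the AdamWeightDecay optimizer is built from many small operators (see the appendix for the code). In PyNative mode, the forward pass executes exactly as Python would run it: each small operator is called from the Python side through pybind11 into its C++ implementation. Running many small operators therefore means switching between the Python side and the C++ side many times.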
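These two points suggest a way to speed up network training in PyNative mode: expose AdamWeightDecay as a single large operator on the Python side, then let a backend Fission pass split that large operator back into the corresponding small operators. This greatly reduces the number of switches between the Python side and the C++ side and improves performance (on Bert Base, a single step is 500 ms faster).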
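The effect on the number of Python-to-C++ switches can be illustrated with a toy model. The sketch below is not MindSpore code: `Dispatcher`, `adamw_small_ops`, and `adamw_fused` are hypothetical names, plain Python floats stand in for tensors, and each `Dispatcher.run` call is assumed to represent one crossing of the pybind11 boundary. The real small-operator version also issues casts, reshapes, and assigns, so its actual count is higher.

```python
import math
import operator


class Dispatcher:
    """Toy stand-in for the Python -> C++ boundary (hypothetical, not a MindSpore class)."""

    def __init__(self):
        self.calls = 0  # number of simulated boundary crossings

    def run(self, fn, *args):
        self.calls += 1
        return fn(*args)


def adamw_small_ops(d, beta1, beta2, eps, lr, weight_decay, param, m, v, grad):
    """Dispatch every arithmetic primitive separately, like the small-operator version."""
    add, sub, mul, div = operator.add, operator.sub, operator.mul, operator.truediv
    next_m = d.run(add, d.run(mul, beta1, m), d.run(mul, d.run(sub, 1.0, beta1), grad))
    next_v = d.run(add, d.run(mul, beta2, v),
                   d.run(mul, d.run(sub, 1.0, beta2), d.run(mul, grad, grad)))
    update = d.run(div, next_m, d.run(add, eps, d.run(math.sqrt, next_v)))
    update = d.run(add, d.run(mul, weight_decay, param), update)
    next_param = d.run(sub, param, d.run(mul, lr, update))
    return next_param, next_m, next_v


def _large_op(beta1, beta2, eps, lr, weight_decay, param, m, v, grad):
    """The same arithmetic, executed entirely behind a single boundary crossing."""
    next_m = beta1 * m + (1.0 - beta1) * grad
    next_v = beta2 * v + (1.0 - beta2) * (grad * grad)
    update = weight_decay * param + next_m / (eps + math.sqrt(next_v))
    return param - lr * update, next_m, next_v


def adamw_fused(d, *args):
    """Dispatch the whole update as one large operator."""
    return d.run(_large_op, *args)


if __name__ == "__main__":
    # beta1, beta2, eps, lr, weight_decay, param, m, v, grad (illustrative values)
    args = (0.9, 0.999, 1e-6, 0.01, 0.01, 1.0, 0.1, 0.2, 0.5)
    small, fused = Dispatcher(), Dispatcher()
    adamw_small_ops(small, *args)
    adamw_fused(fused, *args)
    print("small-operator dispatches:", small.calls)  # 16 in this toy model
    print("large-operator dispatches:", fused.calls)  # 1
```

In this model the per-parameter cost drops from 16 crossings to 1, and the saving is multiplied by the number of parameter tensors the optimizer updates in each step.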
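The split on the C++ side is needed because kernels on Ascend depend on the CANN package and cannot be implemented independently. Where the kernel can be implemented directly, for example on GPU or CPU, the large operator can be implemented as a single kernel (see the appendix for the code), which yields an even larger performance gain.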
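To make the idea of a single large kernel concrete, here is a minimal reference sketch in pure Python of what such a kernel computes. It assumes the same update rule as the small-operator code in the appendix, uses Python lists in place of device buffers, and leaves out the float32 casts. The function name `fused_adam_weight_decay` and its argument order are illustrative assumptions, not the signature of any MindSpore kernel.

```python
import math


def fused_adam_weight_decay(param, m, v, grad, lr, beta1, beta2, eps,
                            weight_decay, decay_flag=True):
    """Update param, m and v in place in one pass over the elements.

    Hypothetical reference for a fused CPU/GPU kernel: each element is read
    once, all intermediate values stay in local variables, and no temporary
    tensors are materialized between steps.
    """
    for i in range(len(param)):
        g = grad[i]
        m[i] = beta1 * m[i] + (1.0 - beta1) * g          # first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g      # second moment
        update = m[i] / (eps + math.sqrt(v[i]))
        if decay_flag:
            update += weight_decay * param[i]            # decoupled weight decay
        param[i] -= lr * update
    return param


if __name__ == "__main__":
    # Illustrative values only.
    param, m, v, grad = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0], [0.5, -0.25]
    fused_adam_weight_decay(param, m, v, grad, lr=0.01, beta1=0.9,
                            beta2=0.999, eps=1e-6, weight_decay=0.01)
    print(param, m, v)
```

A reference like this could also serve as a numerical check for a real fused kernel, by comparing its output with the small-operator path on the same inputs.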
Appendix
MindSpore AdamWeightDecay optimizer operator: official documentation
AdamWeightDecay small-operator implementation:
# mindspore/python/mindspore/nn/optim/adam.py
@_adam_opt.register("Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor",
                    "Tensor", "Bool", "Bool")
def _update_run_op(beta1, beta2, eps, lr, weight_decay, param, m, v, gradient, decay_flag, optim_filter):
    """
    Update parameters.

    Args:
        beta1 (Tensor): The exp decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
        beta2 (Tensor): The exp decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
        eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
        lr (Tensor): Learning rate.
        weight_decay (numbers.Number): Weight decay. Should be equal to or greater than 0.
        param (Tensor): Parameters.
        m (Tensor): m value of parameters.
        v (Tensor): v value of parameters.
        gradient (Tensor): Gradient of parameters.
        decay_flag (bool): Applies weight decay or not.
        optim_filter (bool): Applies parameter update or not.

    Returns:
        Tensor, the new value of v after updating.
    """
    op_cast = P.Cast()
    if optim_filter:
        op_mul = P.Mul()
        op_square = P.Square()
        op_sqrt = P.Sqrt()
        op_cast = P.Cast()
        op_reshape = P.Reshape()
        op_shape = P.Shape()
        param_fp32 = op_cast(param, mstype.float32)
        m_fp32 = op_cast(m, mstype.float32)
        v_fp32 = op_cast(v, mstype.float32)
        gradient_fp32 = op_cast(gradient, mstype.float32)

        next_m = op_mul(beta1, m_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
                                                - beta1, gradient_fp32)

        next_v = op_mul(beta2, v_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
                                                - beta2, op_square(gradient_fp32))

        update = next_m / (eps + op_sqrt(next_v))
        if decay_flag:
            update = op_mul(weight_decay, param_fp32) + update

        update_with_lr = op_mul(lr, update)
        next_param = param_fp32 - op_reshape(update_with_lr, op_shape(param_fp32))

        next_param = F.depend(next_param, F.assign(param, op_cast(next_param, F.dtype(param))))
        next_param = F.depend(next_param, F.assign(m, op_cast(next_m, F.dtype(m))))
        next_param = F.depend(next_param, F.assign(v, op_cast(next_v, F.dtype(v))))

        return op_cast(next_param, F.dtype(param))
    return op_cast(gradient, F.dtype(param))
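
For reference, the update that this function performs when optim_filter is true can be written out as equations. They are read directly off the code above. The symbols are notation chosen here, not taken from the source: $\theta$ for param, $g_t$ for gradient, $\eta$ for lr, $\lambda$ for weight_decay, and $u_t$ for update. Note that the code applies no bias correction to $m_t$ or $v_t$, and that the $\lambda\,\theta_{t-1}$ term is added only when decay_flag is true.

```latex
\begin{aligned}
m_t      &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t      &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
u_t      &= \frac{m_t}{\epsilon + \sqrt{v_t}} + \lambda\, \theta_{t-1} \\
\theta_t &= \theta_{t-1} - \eta\, u_t
\end{aligned}
```

All four steps are computed in float32. The results are then cast back to the original dtypes of param, m, and v and assigned in place.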