What is Gradient Clipping

Deep Learning

Published: 2021-04-11

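Why is gradient clipping needed?

Gradient clipping shows up in many DL projects: a grad_clip argument is passed on the command line, and the clip_grad_norm_() function is then called with it, as follows: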

parser.add_argument("--grad_clip", default=1.0, type=float)

clip_grad_norm_(self.model.parameters(), self.args.grad_clip)
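For context, here is a minimal sketch of where those two lines usually sit in a training step. The model, data, and optimizer are placeholders of my own choosing, not taken from any particular project. The only point is the ordering: backward(), then clip_grad_norm_(), then optimizer.step().

```python
import argparse

import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

parser = argparse.ArgumentParser()
parser.add_argument("--grad_clip", default=1.0, type=float)
args = parser.parse_args([])  # empty list: use the defaults instead of sys.argv

torch.manual_seed(0)
model = nn.Linear(4, 1)                       # placeholder model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4) * 100.0                 # large inputs to provoke large gradients
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                               # gradients now live in p.grad

# Clip after backward() and before step(); the return value is the
# total norm measured before any scaling was applied.
norm_before = clip_grad_norm_(model.parameters(), args.grad_clip)

# Recompute the joint 2-norm of all gradients to see the effect of clipping.
norm_after = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters()]), 2
)
optimizer.step()
```

Comparing norm_before with norm_after shows whether clipping actually kicked in on a given step.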

After some reading, my conclusion is that gradient clipping is used to address the exploding gradients problem. I have not studied exploding gradients in depth and only have a rough intuition for them:

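In the figure above, the line slanting to the right in the left panel is a clear case of exploding gradients. The problem is that an overly large gradient carries the model parameters out of the "good region" in a single step. So when the gradients are too large, they should be scaled down appropriately, and that is the job of gradient clipping.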
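The same intuition can be reproduced numerically. The sketch below uses a one-dimensional toy loss of my own invention (it is not the surface from the figure): a gentle bowl with a very steep wall for w > 1. One unclipped step from the wall throws w far past the minimum at 0, while the clipped step stays nearby.

```python
def grad(w: float) -> float:
    """Gradient of the toy loss w**2 + 1000 * max(0, w - 1)**2."""
    return 2.0 * w + 2000.0 * max(0.0, w - 1.0)


def clip(g: float, max_norm: float) -> float:
    """Scale g down so that |g| <= max_norm; leave small gradients unchanged."""
    norm = abs(g)
    return g * max_norm / norm if norm > max_norm else g


w, lr, max_norm = 1.5, 0.1, 1.0
g = grad(w)                                  # very large gradient on the wall
w_plain = w - lr * g                         # unclipped step: jumps far away
w_clipped = w - lr * clip(g, max_norm)       # clipped step: moves by at most lr * max_norm
print(g, w_plain, w_clipped)
```

Clipping keeps the direction of the step and only limits its length.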

Implementation

Its mathematical form is as follows:
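In the usual norm-based formulation, with g denoting all gradients concatenated into one vector and c the threshold (max_norm in the PyTorch function), the update can be written as shown here. The symbols g and c are my own notation.

```latex
\[
\mathbf{g} \leftarrow
\begin{cases}
\dfrac{c}{\lVert \mathbf{g} \rVert}\,\mathbf{g}, & \text{if } \lVert \mathbf{g} \rVert > c,\\[1ex]
\mathbf{g}, & \text{otherwise.}
\end{cases}
\]
```

The direction of g is preserved in both cases; only its length is capped at c.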

PyTorch implements gradient clipping with clip_grad_norm_, whose function signature is:

def clip_grad_norm_(parameters: _tensor_or_tensors, max_norm: float, norm_type: float = 2.0) -> torch.Tensor:
    r"""Clips gradient norm of an iterable of parameters.

    The norm is computed over all gradients together, as if they were
    concatenated into a single vector. Gradients are modified in-place.

    Args:
        parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
            single Tensor that will have gradients normalized
        max_norm (float or int): max norm of the gradients
        norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for
            infinity norm.

    Returns:
        Total norm of the parameters (viewed as a single vector).
    """

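As the docstring says, a single norm is computed over the gradients of all parameter tensors together. max_norm acts as a threshold: the gradients are scaled down only when that total norm exceeds max_norm, in which case they are rescaled so that the total norm becomes approximately max_norm.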
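To make that behavior concrete, here is a hand-written sketch of the same idea, compared against clip_grad_norm_ on two hypothetical parameters. The helper manual_clip_ is my own. Its small eps term and the cap of the coefficient at 1 follow my reading of PyTorch's implementation, so treat those details as assumptions.

```python
import torch
from torch.nn.utils import clip_grad_norm_


def manual_clip_(grads, max_norm: float, eps: float = 1e-6) -> torch.Tensor:
    """Scale the tensors in `grads` in place so their joint 2-norm is at most max_norm."""
    # The 2-norm of the per-tensor 2-norms equals the 2-norm of the concatenation.
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    # Capping the coefficient at 1 leaves already-small gradients unchanged.
    coef = torch.clamp(max_norm / (total_norm + eps), max=1.0)
    for g in grads:
        g.mul_(coef)
    return total_norm  # norm measured before scaling


# Two hypothetical parameters with hand-picked gradients:
# joint norm = sqrt(3**2 + 4**2) = 5, which exceeds max_norm = 1.
p1 = torch.nn.Parameter(torch.zeros(1))
p2 = torch.nn.Parameter(torch.zeros(1))
p1.grad = torch.tensor([3.0])
p2.grad = torch.tensor([4.0])

manual = [p1.grad.clone(), p2.grad.clone()]
manual_norm = manual_clip_(manual, max_norm=1.0)
torch_norm = clip_grad_norm_([p1, p2], max_norm=1.0)
```

Both versions should report a pre-clipping norm of 5 and leave gradients of roughly 0.6 and 0.8, that is, the original direction with length 1.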