CRF Layer on the Top of BiLSTM(BiLSTM-CRF)


原论文:

@article{DBLP:journals/corr/LampleBSKD16,
  author    = {Guillaume Lample and
               Miguel Ballesteros and
               Sandeep Subramanian and
               Kazuya Kawakami and
               Chris Dyer},
  title     = {Neural Architectures for Named Entity Recognition},
  journal   = {CoRR},
  volume    = {abs/1603.01360},
  year      = {2016},
  url       = {http://arxiv.org/abs/1603.01360},
  archivePrefix = {arXiv},
  eprint    = {1603.01360},
  timestamp = {Mon, 13 Aug 2018 16:47:39 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/LampleBSKD16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

关于BiLSTM-CRF的详细讲解,可以通过https://createmomo.github.io/ 里面的文章进行系统学习,里面讲得非常非常好,给出了大量的实例。对CRF的train阶段(loss function)和test阶段(decode)都讲得非常详细。

Application

BiLSTM-CRF常见的应用场景如下:

  1. Named Entity Recognition

Model Architecture⭐⭐⭐⭐⭐

如果只是当一个掉包侠,那么一定要把握好模型的Architecture,知道输入和输出,以及需要调整的参数。

模型由BiLSTM和CRF两个部分构成,BiLSTM的输出作为CRF的输入。

BiLSTM做的是多分类任务,其输出是某个word被标记为某个tag的概率(即Emission Score)。

CRF吃进去这些Emission Score,然后在内部训练Transition Score Matrix,最后直接输出预测的tag序列。

Q&A

为什么需要CRF层?

首先,如果不使用CRF层,只使用BiLSTM怎么做?

如下图所示,直接选择每个word的最大概率标签,然后将它们拼接在一起即可:

The BiLSTM model with out CRF layer output correct labels

但是,我们并不能每次都能得到合理的结果,例如下图:

The BiLSTM model with out CRF layer output some invalid label sequences

而此时正是CRF的用武之地,**CRF can automatically learn some constraints from the training dataset to make output predictions valid**。

The constrains could be:

  • The label of the first word in a sentence should start with “B-“ or “O”, not “I-“
  • “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
  • “O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label”

CRF的损失函数中使用了哪两种score?哪种score可以帮助CRF学习到constraints?

CRF的损失函数由两种类型的score构成——Emission ScoreTransition Score,这两个score是CRF的核心概念。

Emission Score

如下图红框部分,Emission Score就是BiLSTM的输出:

the emission scores come from the BiLSTM layer

Transition Score

我们使用$t_{y_iy_j}$来表示transition score,比如$t_{B-Person,I-Person} \ = \ 0.9$意味着标签转移$B-Person -> I-Person$的score为0.9。以此类推,我们应该能够得到一个transition score matrix,如下图是一个示例:

transition score matrix

可以通过transition score matrix看出,CRF能够学习到许多有用的constraints:

  • The label of the first word in a sentence should start with “B-“ or “O”, not “I-“ (The transtion scores from “START” to “I-Person or I-Organization” are very low.)
  • “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid. (For example, the score from “B-Organization” to “I-Person” is only 0.0003 which is much lower than the others.)
  • “O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label” (Again, for instance, the score tO,I−PersontO,I−Person is very small.)

CRF的loss function是什么?

The CRF loss function is consist of the real path score and the total score of all the possible paths.The real path should have the highest score among those of all the possible paths.
$$
LossFunction =\frac {P_{RealPath}} {P_{total}} = \frac {P_{RealPath}} {P_1+P_2+…+P_N}
$$
Now, the questions are:

  1. How to define the score of a path?
  2. How to calculate the total score of all possible paths?
  3. When we calculate the total score, do we have to list all the possible paths? (The answer to this question is NO. )

如何计算Real Path Score?

首先是定义计算公式:$P_i = e^{S_i}$

Take the real path, “START B-Person I-Person O B-Organization O END”, we used before, for example:

  • We have a sentence which has 5 words, $w_1,w_2,w_3,w_4,w_5$
  • We add two more extra words which denote the start and the end of a sentence, $w_0,w_6$
  • SiSi consists of 2 parts: $S_i=EmissionScore+TransitionScore$

Emission Score

$$
EmissionScore=x_{0,START}+x_{1,B−Person}+x_{2,I−Person}+x_{3,O}+x_{4,B−Organization}+x_{5,O}+x_{6,END}
$$

Transition Score

$$
TransitionScore=t_{START,B-Person} + t_{B-Person,I-Person} + t_{I-Person,O} + t_{O,B-Organization} + t_{B_Organazation,O} + t_{O,END}
$$

其实上述的公式用可以表示为:
$$
P_i = e^{EmissionScore} \cdot e^{TransitionScore}
$$

如何计算The total score of all the possible paths?

采用动态规划算法求解:

推导-2.jpg

推导-3.jpg

如何得到Transition Score Matrix?

它们都是超参数,需要通过机器学习,自己得到!

如何使用CRF进行inference?

过程和计算The total score of all the possible paths类似,同样是动态规划,只不过是将previous更新时的sum()变成了max()操作。

API

使用torchcrf包可以直接使用CRF Layer。

CRF Module包含forward()decode()两个重要的函数,分别对应traintest

forward()函数的输出是conditional log likelihood,至于为什么是log,这个是在我们推导loss function的时候定义的。而且需要注意的是,这里得到的是条件概率,而不是概率,这也是我们推导loss function的时候定义的。

参考文献

  1. CRF Layer on the Top of BiLSTM - 1

文章作者: CarlYoung
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 CarlYoung !
 上一篇
实验室服务器深度学习环境搭建 实验室服务器深度学习环境搭建
[TOC] 下载Anaconda下载Anaconda,推荐到清华镜像里面下载,这样下载会比去Anaconda官网下载快很多。 首先检查Linux的版本:
2021-01-25
下一篇 
【李宏毅-深度学习】Structured Learning: Sequence Labeling 【李宏毅-深度学习】Structured Learning: Sequence Labeling
讲解HMM和CRF的概念,以及两者之间的关系,最后提出Deep Learning + HMM/CRF是一种常见的做法。
2021-01-23
  目录