CRF Layer on the Top of BiLSTM(BiLSTM-CRF)

深度学习

发布日期: 2021-01-24

原论文：

@article{DBLP:journals/corr/LampleBSKD16,
  author    = {Guillaume Lample and
               Miguel Ballesteros and
               Sandeep Subramanian and
               Kazuya Kawakami and
               Chris Dyer},
  title     = {Neural Architectures for Named Entity Recognition},
  journal   = {CoRR},
  volume    = {abs/1603.01360},
  year      = {2016},
  url       = {http://arxiv.org/abs/1603.01360},
  archivePrefix = {arXiv},
  eprint    = {1603.01360},
  timestamp = {Mon, 13 Aug 2018 16:47:39 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/LampleBSKD16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

关于BiLSTM-CRF的详细讲解，可以通过https://createmomo.github.io/ 里面的文章进行系统学习，里面讲得非常非常好，给出了大量的实例。对CRF的train阶段（loss function）和test阶段（decode）都讲得非常详细。

Application

BiLSTM-CRF常见的应用场景如下：

Named Entity Recognition

Model Architecture⭐⭐⭐⭐⭐

如果只是当一个掉包侠，那么一定要把握好模型的Architecture，知道输入和输出，以及需要调整的参数。

模型由BiLSTM和CRF两个部分构成，BiLSTM的输出作为CRF的输入。

BiLSTM做的是多分类任务，其输出是某个word被标记为某个tag的概率（即Emission Score）。

CRF吃进去这些Emission Score，然后在内部训练Transition Score Matrix，最后直接输出预测的tag序列。

Q&A

为什么需要CRF层？

首先，如果不使用CRF层，只使用BiLSTM怎么做？

如下图所示，直接选择每个word的最大概率标签，然后将它们拼接在一起即可：

The BiLSTM model with out CRF layer output correct labels

但是，我们并不能每次都能得到合理的结果，例如下图：

The BiLSTM model with out CRF layer output some invalid label sequences

而此时正是CRF的用武之地，**CRF can automatically learn some constraints from the training dataset to make output predictions valid**。

The constrains could be:

The label of the first word in a sentence should start with “B-“ or “O”, not “I-“
“B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
“O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label”
…

CRF的损失函数中使用了哪两种score？哪种score可以帮助CRF学习到constraints？

CRF的损失函数由两种类型的score构成——Emission Score和Transition Score，这两个score是CRF的核心概念。

Emission Score

如下图红框部分，Emission Score就是BiLSTM的输出：

the emission scores come from the BiLSTM layer

Transition Score

我们使用$t_{y_iy_j}$来表示transition score，比如$t_{B-Person,I-Person} \ = \ 0.9$意味着标签转移$B-Person -> I-Person$的score为0.9。以此类推，我们应该能够得到一个transition score matrix，如下图是一个示例：

transition score matrix

可以通过transition score matrix看出，CRF能够学习到许多有用的constraints：

The label of the first word in a sentence should start with “B-“ or “O”, not “I-“ (The transtion scores from “START” to “I-Person or I-Organization” are very low.)
“B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid. (For example, the score from “B-Organization” to “I-Person” is only 0.0003 which is much lower than the others.)
“O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label” (Again, for instance, the score tO,I−PersontO,I−Person is very small.)
…

CRF的loss function是什么？

The CRF loss function is consist of the real path score and the total score of all the possible paths.The real path should have the highest score among those of all the possible paths.
$$
LossFunction =\frac {P_{RealPath}} {P_{total}} = \frac {P_{RealPath}} {P_1+P_2+…+P_N}
$$
Now, the questions are:

How to define the score of a path?
How to calculate the total score of all possible paths?
When we calculate the total score, do we have to list all the possible paths? (The answer to this question is NO. )

如何计算Real Path Score？

首先是定义计算公式：$P_i = e^{S_i}$

Take the real path, “START B-Person I-Person O B-Organization O END”, we used before, for example:

We have a sentence which has 5 words, $w_1,w_2,w_3,w_4,w_5$
We add two more extra words which denote the start and the end of a sentence, $w_0,w_6$
SiSi consists of 2 parts: $S_i=EmissionScore+TransitionScore$