CRF Layer on Top of BiLSTM (BiLSTM-CRF)


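For a detailed explanation of BiLSTM-CRF, the referenced article series is worth studying systematically: it is very well written and gives a large number of worked examples. Both the CRF training stage (loss function) and the test stage (decoding) are covered in detail.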



  1. Named Entity Recognition

Model Architecture⭐⭐⭐⭐⭐



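The BiLSTM performs a multi-class classification task: its output is the probability of a given word being labeled with a given tag (the Emission Score).

The CRF takes these Emission Scores as input, learns a Transition Score Matrix internally, and directly outputs the predicted tag sequence.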





The BiLSTM model without a CRF layer outputs correct labels


The BiLSTM model without a CRF layer outputs some invalid label sequences

This is exactly where the CRF comes in: **CRF can automatically learn some constraints from the training dataset to make output predictions valid**.

The constraints could be:

  • The label of the first word in a sentence should start with “B-” or “O”, not “I-”
  • “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
  • “O I-label” is invalid. The first label of a named entity should start with “B-”, not “I-”; in other words, the valid pattern is “O B-label”
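The three constraints above can also be written down as an explicit check, which makes it easier to see what the CRF has to learn from data. The following is a minimal sketch in Python; the function name `is_valid_bio` is hypothetical and is only an illustration, since the CRF itself learns these rules softly through transition scores rather than through hard-coded logic.

```python
def is_valid_bio(tags):
    """Return True if a tag sequence obeys the three BIO constraints."""
    # Treat the sentence start like "O": an entity cannot be continued from it.
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # "I-X" may only follow "B-X" or "I-X" of the same entity type.
            if prev not in ("B-" + entity, "I-" + entity):
                return False
        prev = tag
    return True


print(is_valid_bio(["B-Person", "I-Person", "O", "B-Organization", "O"]))  # True
print(is_valid_bio(["B-Person", "I-Organization"]))                        # False
print(is_valid_bio(["O", "I-Person"]))                                     # False
```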


The CRF loss function is built from two types of score, the Emission Score and the Transition Score. These two scores are the core concepts of the CRF.

Emission Score

As the red box in the figure below shows, the Emission Score is simply the output of the BiLSTM:

the emission scores come from the BiLSTM layer

Transition Score

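We use $t_{y_i y_j}$ to denote a transition score. For example, $t_{B-Person,I-Person} = 0.9$ means that the score of the label transition $B-Person \rightarrow I-Person$ is 0.9. Collecting these scores for every pair of labels gives a transition score matrix; the figure below shows an example: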

transition score matrix

The transition score matrix shows that the CRF can learn many useful constraints:

  • The label of the first word in a sentence should start with “B-” or “O”, not “I-” (The transition scores from “START” to “I-Person” or “I-Organization” are very low.)
  • “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid. (For example, the score from “B-Organization” to “I-Person” is only 0.0003, which is much lower than the others.)
  • “O I-label” is invalid. The first label of a named entity should start with “B-”, not “I-”; in other words, the valid pattern is “O B-label” (Again, for instance, the score $t_{O,I-Person}$ is very small.)

What is the CRF loss function?

The CRF loss function consists of the real path score and the total score of all the possible paths. The real path should have the highest score among all the possible paths, so training pushes the following ratio toward 1:
$$LossFunction = \frac{P_{RealPath}}{P_{total}} = \frac{P_{RealPath}}{P_1 + P_2 + \dots + P_N}$$
Now, the questions are:

  1. How to define the score of a path?
  2. How to calculate the total score of all possible paths?
  3. When we calculate the total score, do we have to list all the possible paths? (The answer to this question is NO.)

How is the Real Path Score computed?

First, define the formula: $P_i = e^{S_i}$

Take the real path we used before, “START B-Person I-Person O B-Organization O END”, as an example:

  • We have a sentence which has 5 words, $w_1,w_2,w_3,w_4,w_5$
  • We add two more extra words which denote the start and the end of a sentence, $w_0,w_6$
  • $S_i$ consists of 2 parts: $S_i = EmissionScore + TransitionScore$

Emission Score
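
The emission part of the real path score is the sum of the emission scores that the path's labels pick out, one per word. A sketch of that sum, assuming the notation $x_{i,label}$ (not defined in the original text) for the emission score of word $w_i$ with the given label:

$$EmissionScore = x_{0,START} + x_{1,B-Person} + x_{2,I-Person} + x_{3,O} + x_{4,B-Organization} + x_{5,O} + x_{6,END}$$

The terms for $w_1$ to $w_5$ come from the BiLSTM output; the terms for the two extra words $w_0$ and $w_6$ can simply be set to 0.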


Transition Score

$$TransitionScore = t_{START,B-Person} + t_{B-Person,I-Person} + t_{I-Person,O} + t_{O,B-Organization} + t_{B-Organization,O} + t_{O,END}$$

$$P_i = e^{S_i} = e^{EmissionScore} \cdot e^{TransitionScore}$$
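
To make the arithmetic concrete, here is a minimal Python sketch that scores the real path. All numbers are made up for illustration, except the 0.9 for B-Person to I-Person, which is the example value used earlier; the names `path_score`, `emission_on_path`, and `transition` are hypothetical.

```python
import math

# The real path, including the two extra START/END labels.
path = ["START", "B-Person", "I-Person", "O", "B-Organization", "O", "END"]

# Illustrative emission scores of w_0..w_6 for the labels on the real path.
# The START and END entries are set to 0.
emission_on_path = [0.0, 1.5, 0.4, 0.1, 0.9, 0.2, 0.0]

# Illustrative transition scores for the label pairs that occur on the path.
transition = {
    ("START", "B-Person"): 0.8,
    ("B-Person", "I-Person"): 0.9,
    ("I-Person", "O"): 0.6,
    ("O", "B-Organization"): 0.7,
    ("B-Organization", "O"): 0.5,
    ("O", "END"): 0.4,
}


def path_score(path, emission_on_path, transition):
    """S_i = EmissionScore + TransitionScore for a single path."""
    emission_score = sum(emission_on_path)
    # Sum the transition score of every adjacent pair of labels on the path.
    transition_score = sum(transition[(a, b)] for a, b in zip(path, path[1:]))
    return emission_score + transition_score


s_real = path_score(path, emission_on_path, transition)
p_real = math.exp(s_real)  # P_i = e^{S_i}
print(s_real, p_real)
```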

How is the total score of all the possible paths computed?
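
The total score does not require listing every path: it can be accumulated word by word with dynamic programming. The sketch below is one way to do this in plain Python, working in log space for numerical stability. It makes some simplifying assumptions that are not from the original text: START/END labels are omitted, tags are integer indices, and the names `log_total_score` and `log_total_score_brute_force` are hypothetical. The brute-force version is there only to check the result on a toy input.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_total_score(emissions, transitions):
    """log(P_1 + ... + P_N) over all tag paths, via dynamic programming."""
    num_tags = len(transitions)
    # previous[j]: log of the summed e^{score} of all partial paths ending in tag j.
    previous = list(emissions[0])
    for obs in emissions[1:]:
        previous = [
            obs[j] + log_sum_exp([previous[i] + transitions[i][j]
                                  for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return log_sum_exp(previous)


def log_total_score_brute_force(emissions, transitions):
    """Same quantity, computed by enumerating every path (for checking only)."""
    num_tags = len(transitions)
    scores = []
    for path in product(range(num_tags), repeat=len(emissions)):
        s = sum(emissions[t][y] for t, y in enumerate(path))
        s += sum(transitions[a][b] for a, b in zip(path, path[1:]))
        scores.append(s)
    return log_sum_exp(scores)


# Toy input: 3 words, 2 tags. emissions[t][j], transitions[i][j] (from i to j).
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.1]]
transitions = [[0.6, 0.4], [0.2, 0.9]]

dp = log_total_score(emissions, transitions)
brute = log_total_score_brute_force(emissions, transitions)
print(abs(dp - brute) < 1e-9)
```

Enumeration visits every one of the num_tags^T paths, while the dynamic program needs only on the order of T * num_tags^2 additions.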




How is the Transition Score Matrix obtained?



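Decoding the best path works much like computing the total score of all the possible paths: it is also dynamic programming, except that the sum() used when updating previous is replaced by max().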
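
A minimal Python sketch of this max-based dynamic program (Viterbi decoding) follows. As assumptions for illustration only, START/END labels are omitted, tags are integer indices, the scores are toy values, and the function name `viterbi_decode` is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_path, best_score) for emissions[t][j] and transitions[i][j]."""
    num_tags = len(transitions)
    # previous[j]: best score of any partial path ending in tag j.
    previous = list(emissions[0])
    backpointers = []
    for obs in emissions[1:]:
        current, pointers = [], []
        for j in range(num_tags):
            # max() instead of sum(): keep only the best predecessor of tag j.
            best_i = max(range(num_tags),
                         key=lambda i: previous[i] + transitions[i][j])
            pointers.append(best_i)
            current.append(previous[best_i] + transitions[best_i][j] + obs[j])
        previous = current
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to recover the path.
    best_last = max(range(num_tags), key=lambda j: previous[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, previous[best_last]


# Toy input: 3 words, 2 tags.
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.1]]
transitions = [[0.6, 0.4], [0.2, 0.9]]

best_path, best_score = viterbi_decode(emissions, transitions)
print(best_path, best_score)
```

On this toy input the best path is tag 0, then tag 1, then tag 1, which can be confirmed by scoring all eight paths by hand.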


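The torchcrf package provides a ready-to-use CRF layer.

The CRF module has two important functions, forward() and decode(), which correspond to training and testing respectively.

The output of forward() is the conditional log likelihood. The log comes from the way the loss function was defined in the derivation. Note also that the result is a conditional probability (of the tag sequence given the sentence) rather than a plain probability, which likewise follows from how the loss function was defined.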
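
Taking the log of the ratio defined earlier shows where both properties come from. A short derivation using the definition $P_i = e^{S_i}$; the symbols $x$ (the input sentence) and $y$ (the real tag path) are notation assumed here, not taken from the original:

$$\log p(y \mid x) = \log \frac{P_{RealPath}}{P_1 + P_2 + \dots + P_N} = S_{RealPath} - \log\left(e^{S_1} + e^{S_2} + \dots + e^{S_N}\right)$$

The ratio normalizes over tag paths for one fixed sentence, which is why it is conditional, and the log turns it into the real path score minus a log-sum-exp over all paths. Since this quantity should be as large as possible, a loss to minimize would be its negation.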


Author: CarlYoung
