原论文:
@article{DBLP:journals/corr/LampleBSKD16, author = {Guillaume Lample and Miguel Ballesteros and Sandeep Subramanian and Kazuya Kawakami and Chris Dyer}, title = {Neural Architectures for Named Entity Recognition}, journal = {CoRR}, volume = {abs/1603.01360}, year = {2016}, url = {http://arxiv.org/abs/1603.01360}, archivePrefix = {arXiv}, eprint = {1603.01360}, timestamp = {Mon, 13 Aug 2018 16:47:39 +0200}, biburl = {https://dblp.org/rec/journals/corr/LampleBSKD16.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
关于BiLSTM-CRF的详细讲解,可以通过https://createmomo.github.io/ 里面的文章进行系统学习,里面讲得非常非常好,给出了大量的实例。对CRF的train阶段(loss function)和test阶段(decode)都讲得非常详细。
Application
BiLSTM-CRF常见的应用场景如下:
- Named Entity Recognition
Model Architecture⭐⭐⭐⭐⭐
如果只是当一个掉包侠,那么一定要把握好模型的Architecture,知道输入和输出,以及需要调整的参数。
模型由BiLSTM和CRF两个部分构成,BiLSTM的输出作为CRF的输入。
BiLSTM
做的是多分类任务,其输出是某个word被标记为某个tag的概率(即Emission Score
)。
CRF
吃进去这些Emission Score
,然后在内部训练Transition Score Matrix
,最后直接输出预测的tag序列。
Q&A
为什么需要CRF层?
首先,如果不使用CRF层,只使用BiLSTM怎么做?
如下图所示,直接选择每个word的最大概率标签,然后将它们拼接在一起即可:
但是,我们并不能每次都能得到合理的结果,例如下图:
而此时正是CRF的用武之地,**CRF can automatically learn some constraints from the training dataset to make output predictions valid
**。
The constrains could be:
- The label of the first word in a sentence should start with “B-“ or “O”, not “I-“
- “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
- “O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label”
- …
CRF的损失函数中使用了哪两种score?哪种score可以帮助CRF学习到constraints?
CRF的损失函数由两种类型的score构成——Emission Score
和Transition Score
,这两个score是CRF的核心概念。
Emission Score
如下图红框部分,Emission Score就是BiLSTM的输出:
Transition Score
我们使用$t_{y_iy_j}$来表示transition score,比如$t_{B-Person,I-Person} \ = \ 0.9$意味着标签转移$B-Person -> I-Person$的score为0.9。以此类推,我们应该能够得到一个transition score matrix,如下图是一个示例:
可以通过transition score matrix看出,CRF能够学习到许多有用的constraints:
- The label of the first word in a sentence should start with “B-“ or “O”, not “I-“ (The transtion scores from “START” to “I-Person or I-Organization” are very low.)
- “B-label1 I-label2 I-label3 I-…”, in this pattern, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid. (For example, the score from “B-Organization” to “I-Person” is only 0.0003 which is much lower than the others.)
- “O I-label” is invalid. The first label of one named entity should start with “B-“ not “I-“, in other words, the valid pattern should be “O B-label” (Again, for instance, the score tO,I−PersontO,I−Person is very small.)
- …
CRF的loss function是什么?
The CRF loss function is consist of the real path score and the total score of all the possible paths.The real path should have the highest score among those of all the possible paths.
$$
LossFunction =\frac {P_{RealPath}} {P_{total}} = \frac {P_{RealPath}} {P_1+P_2+…+P_N}
$$
Now, the questions are:
- How to define the score of a path?
- How to calculate the total score of all possible paths?
- When we calculate the total score, do we have to list all the possible paths? (The answer to this question is NO. )
如何计算Real Path Score?
首先是定义计算公式:$P_i = e^{S_i}$
Take the real path, “START B-Person I-Person O B-Organization O END”, we used before, for example:
- We have a sentence which has 5 words, $w_1,w_2,w_3,w_4,w_5$
- We add two more extra words which denote the start and the end of a sentence, $w_0,w_6$
- SiSi consists of 2 parts: $S_i=EmissionScore+TransitionScore$
Emission Score
$$
EmissionScore=x_{0,START}+x_{1,B−Person}+x_{2,I−Person}+x_{3,O}+x_{4,B−Organization}+x_{5,O}+x_{6,END}
$$
Transition Score
$$
TransitionScore=t_{START,B-Person} + t_{B-Person,I-Person} + t_{I-Person,O} + t_{O,B-Organization} + t_{B_Organazation,O} + t_{O,END}
$$
其实上述的公式用可以表示为:
$$
P_i = e^{EmissionScore} \cdot e^{TransitionScore}
$$
如何计算The total score of all the possible paths?
采用动态规划算法求解:
如何得到Transition Score Matrix?
它们都是超参数,需要通过机器学习,自己得到!
如何使用CRF进行inference?
过程和计算The total score of all the possible paths类似,同样是动态规划,只不过是将previous更新时的sum()
变成了max()
操作。
API
使用torchcrf
包可以直接使用CRF Layer。
CRF
Module包含forward()
和decode()
两个重要的函数,分别对应train
和test
。
forward()
函数的输出是conditional log likelihood,至于为什么是log,这个是在我们推导loss function的时候定义的。而且需要注意的是,这里得到的是条件概率,而不是概率,这也是我们推导loss function的时候定义的。