Published: 2021-08-08

Template-Based Named Entity Recognition Using BART

Authors: Yue Zhang's group at Westlake University

Venue: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Code: https://github.com/Nealcly/templateNER

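Background (the problem to solve)

Few-shot NER (as well as RE and EE) has been a hot topic recently. There are many approaches to few-shot tasks, and transfer learning is one of the most important.

Transfer learning generally means training a model on a resource-rich source-domain dataset and then transferring the learned knowledge to a low-resource target-domain dataset. There are many ways to do the transfer; a common one is fine-tuning on the few available target samples.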

In fact, quite a few transfer-learning-based few-shot NER models already exist. The simplest uses a pre-trained model with a softmax/CRF output layer (the traditional methods):

As the figure above shows, this approach has one problem:

Problem 1: the entity labels of the resource-rich source-domain dataset do not necessarily match those of the low-resource target-domain dataset. So each time new entity labels arrive, the model has to redefine its output layer, which is clearly expensive. Simply discarding the original output layer also ignores the association between entity types (for example, "person" and "character" may be related), and it rules out zero-shot learning.

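The pre-trained model with a softmax/CRF output layer is the most basic approach, and the clever "alchemists" always have plenty of other ideas. A recently popular one is the similarity-based metric: train a similarity function on the source domain, then apply that function to the target domain.

Distance-based methods adapt across domains better than a pre-trained model with a softmax/CRF output layer, but they still have problems:

Problem 2: the few samples from the target domain are used only to find the best hyperparameters for the nearest-neighbor algorithm, not to update the representations of target-domain instances. In plain terms, the model parameters are not updated;
Problem 3: these methods rely heavily on textual pattern similarity between the source and target domains (also called writing style; news and reviews, for instance, are written quite differently). If the textual patterns differ, model performance drops.

To address these three problems of the two kinds of few-shot NER methods (pre-trained model with a softmax/CRF output layer, and distance-based methods), the paper proposes a template-based approach that uses a generative language model for few-shot NER.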

Contributions

The method can perform zero-shot recognition.

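Method: Template-Based Method

As the figure above shows, the paper treats few-shot NER as a language model ranking problem under a seq2seq framework. The source sequence is the original text $\mathbf{X} = \{x_1,\dots,x_n\}$, and the target sequence is the template filled with the candidate span $x_{i:j}$ and the entity type $y_k$: $\mathbf{T}_{y_k,x_{i:j}} = \{t_1,\dots,t_m\}$.

The basic idea is easy to follow. The method is described below in three parts: Template Creation, Inference, and Training.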

Template Creation

Templates are written manually. For positive examples the template is "<candidate_span> is a <entity_type> entity"; for negative examples it is "<candidate_span> is not a named entity". Here <candidate_span> and <entity_type> are the slots to be filled.
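To make the slot filling concrete, here is a minimal sketch. The function names are hypothetical, and the exact wording of the template strings is my assumption about how the templates are verbalized.

```python
# Hypothetical helpers for illustration; not taken from the released code.
POSITIVE_TEMPLATE = "{span} is a {entity_type} entity"
NEGATIVE_TEMPLATE = "{span} is not a named entity"


def fill_template(span, entity_type=None):
    """Fill the positive template, or the negative one when entity_type is None."""
    if entity_type is None:
        return NEGATIVE_TEMPLATE.format(span=span)
    return POSITIVE_TEMPLATE.format(span=span, entity_type=entity_type)


def candidate_templates(span, entity_types):
    """One filled template per entity type, plus the non-entity template."""
    return [fill_template(span, t) for t in entity_types] + [fill_template(span)]


templates = candidate_templates("Bangkok", ["person", "location"])
```

Each candidate span therefore yields one target sequence per entity type plus one for the non-entity case.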

Inference

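As the figure above shows, inference requires enumerating every possible <candidate_span> in a sentence. To reduce computation, the span length is capped at 8 based on experience. A sentence of length $n$ therefore has roughly $8n$ candidate spans, so there are about $8n$ candidates to classify, each scored against the template for every entity type.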
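The span enumeration can be sketched as follows. The function name and the end-exclusive index convention are my own choices; only the length cap of 8 comes from the text. The exact count for $n \ge 8$ is $8n - 28$, which is where the "roughly $8n$" figure comes from.

```python
def enumerate_spans(tokens, max_len=8):
    """Return (start, end, text) for every span of 1..max_len tokens, end exclusive."""
    spans = []
    n = len(tokens)
    for start in range(n):
        # The span may not run past the end of the sentence.
        for end in range(start + 1, min(start + max_len, n) + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


sentence = "ACL will be held in Bangkok".split()
spans = enumerate_spans(sentence)
```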

Given the templates, the fine-tuned pre-trained generative language model assigns each template a score:
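Written in the notation above, the score is presumably the log-likelihood of the template tokens given the input sentence. This is my reconstruction from the surrounding definitions, not a copy of the paper's typesetting:

$$
f(\mathbf{T}_{y_k,x_{i:j}}) = \sum_{c=1}^{m} \log p(t_c \mid t_{1:c-1}, \mathbf{X})
$$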

For each <candidate_span>, the entity type with the largest $f(\cdot)$ is taken as its entity type (the non-entity case is included among the options).
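A minimal sketch of this ranking step, assuming the score is the sum of token log-probabilities of the filled template. `token_prob_fn` is a hypothetical stand-in for the fine-tuned BART model; here it is a toy lookup table with made-up probabilities, and `None` denotes the non-entity template.

```python
import math


def template_score(token_probs):
    """Sum of log-probabilities of the template tokens."""
    return sum(math.log(p) for p in token_probs)


def classify_span(span, entity_types, token_prob_fn):
    """Return the best-scoring label for the span; None means 'not an entity'."""
    best_label, best_score = None, -math.inf
    for label in list(entity_types) + [None]:
        score = template_score(token_prob_fn(span, label))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy per-token probabilities, purely illustrative.
TOY_PROBS = {
    ("Bangkok", "location"): [0.9, 0.8, 0.9],
    ("Bangkok", "person"): [0.9, 0.1, 0.9],
    ("Bangkok", None): [0.9, 0.2, 0.5],
}


def toy_token_probs(span, label):
    return TOY_PROBS[(span, label)]


label, score = classify_span("Bangkok", ["person", "location"], toy_token_probs)
```

In a real system the lookup would be replaced by a forward pass of the seq2seq model over the sentence and the filled template.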

Training

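During training, positive examples build their templates from gold entities, while negative examples build theirs from non-entity text spans in the sentence. The number of negatives is kept at 1.5 times the number of positives.

Note that negatives should not be sampled carelessly, because a span that is not annotated as a gold entity is not necessarily a non-entity! On this point, see the paper Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.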
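A rough sketch of sampling negatives at the 1.5:1 ratio. Uniform random sampling, rounding up, the length cap of 8, and the function name are all my assumptions. This only excludes spans that exactly match a gold span, so it does nothing about the unlabeled-entity issue mentioned above.

```python
import math
import random


def sample_negative_spans(num_tokens, gold_spans, ratio=1.5, max_len=8, seed=0):
    """Sample non-gold (start, end) spans, end exclusive, at `ratio` times the gold count."""
    gold = set(gold_spans)
    candidates = [
        (i, j)
        for i in range(num_tokens)
        for j in range(i + 1, min(i + max_len, num_tokens) + 1)
        if (i, j) not in gold
    ]
    # Rounding up is an arbitrary choice for odd numbers of gold entities.
    k = min(len(candidates), math.ceil(ratio * len(gold)))
    return random.Random(seed).sample(candidates, k)


# "ACL will be held in Bangkok": gold spans are "ACL" and "Bangkok".
gold = [(0, 1), (5, 6)]
negatives = sample_negative_spans(6, gold)
```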

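Given a sequence pair (source sequence + target sequence), $\mathbf{X}$ is first fed into the BART encoder, which yields the encoder representation $\mathbf{h}^{enc}$.

At decoding step $c$, the decoder takes $\mathbf{h}^{enc}$ and the previous $c-1$ tokens $t_{1:c-1}$ as input and produces the representation for step $c$.

The conditional probability of the token decoded at step $c$, $p(t_c \mid t_{1:c-1}, \mathbf{X})$, is then defined from this representation.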
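Putting the three steps above into formulas, a reconstruction consistent with the surrounding text would be the following. The names $\mathbf{h}^{dec}_c$, $\mathbf{W}_{lm}$ and $\mathbf{b}_{lm}$ (the decoder state and the output projection onto the vocabulary) are assumptions:

$$
\begin{aligned}
\mathbf{h}^{enc} &= \mathrm{Encoder}(x_{1:n}) \\
\mathbf{h}^{dec}_c &= \mathrm{Decoder}(\mathbf{h}^{enc}, t_{1:c-1}) \\
p(t_c \mid t_{1:c-1}, \mathbf{X}) &= \mathrm{softmax}(\mathbf{h}^{dec}_c \mathbf{W}_{lm} + \mathbf{b}_{lm})
\end{aligned}
$$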

The loss for a sequence pair (source sequence + target sequence) can then be defined with cross-entropy.
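Concretely, the cross-entropy over the target tokens would be $\mathcal{L} = -\sum_{c=1}^{m} \log p(t_c \mid t_{1:c-1}, \mathbf{X})$. The sketch below computes it in plain Python from per-step logits, assuming standard teacher forcing (the decoder is conditioned on the gold prefix $t_{1:c-1}$ during training). The logits would come from the decoder; here they are just lists of numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def template_loss(step_logits, target_ids):
    """Negative log-likelihood of the gold template tokens.

    step_logits[c] holds the vocabulary logits at decoding step c, assumed
    to have been computed with the gold prefix as decoder input.
    """
    loss = 0.0
    for logits, gold_id in zip(step_logits, target_ids):
        loss -= math.log(softmax(logits)[gold_id])
    return loss


# Three decoding steps over a toy vocabulary of four tokens with uniform logits.
uniform_loss = template_loss([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])
```

With uniform logits every token has probability 1/4, so the loss is $3 \log 4$; a trained model should drive this down on gold templates.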

Experiments

Datasets

The dataset setup is as follows. The goal is effective transfer across domains, from resource-rich datasets to low-resource datasets.

Resource-rich datasets:

CoNLL03 (a news dataset)

Low-resource datasets:

MIT Movies
MIT Restaurants
ATIS

Baselines

Sequence Labeling BERT
Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5363–5369, Florence, Italy. Association for Computational Linguistics.
Yi Yang and Arzoo Katiyar. 2020. Simple and effective few-shot named entity recognition with structured nearest neighbor learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6365–6375, Online. Association for Computational Linguistics.
Morteza Ziyadi, Yuting Sun, Abhishek Goswami, Jade Huang, and Weizhu Chen. 2020. Example-based named entity recognition.
Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2020. Few-shot named entity recognition: A comprehensive study.

Template Influence

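In prompt-based learning the choice of template matters a great deal: different templates can give very different results! The paper therefore runs a comparison on the CoNLL03 development set, with the following results:

Judging from these results, it seems that the more complex (or more sophisticated?) the wording of a template, the worse its performance.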

CoNLL03 Results

First, the paper runs experiments on the CoNLL03 dataset alone, under two settings:

standard NER setting
in-domain few-shot NER setting

Standard NER setting

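This is the standard experimental setting; the results are:

The results show that although the proposed method does not surpass the SOTA, it still performs well. For example, compared with Sequence Labeling BERT and Sequence Labeling BART, it clearly improves recall. The likely reason is that the method enumerates every <candidate_span>, which raises recall.

In addition, multi-template BART, which fuses several templates, performs somewhat better than Template BART with a single template. The gain is mainly in precision, which suggests that different templates capture different information and that template ensembling is a direction worth exploring.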

In-domain Few-Shot NER Setting

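In-domain is defined relative to cross-domain: it means the source and target entity types belong to the same subject area and share the same writing style. Here the news dataset CoNLL03 is used, with PER and ORG as the source entity types and LOC and MISC as the target entity types.

Judging from these results, the paper's method beats Sequence Labeling BERT outright on both source and target entity types. But Sequence Labeling BERT is known to be a weak baseline, so is this comparison really representative? Shouldn't the method be compared with dedicated few-shot NER techniques? (Perhaps it could not beat them? Otherwise why compare against them only in the cross-domain setting and not in-domain?)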

Cross-domain Few-Shot NER Result

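In the cross-domain setting, the source domain is CoNLL03 and the target domains are MIT Movie, MIT Restaurant, and ATIS.

Compared with training from scratch without the source domain, using the source domain works better, which shows that the proposed model does transfer knowledge from the source domain to the target domain. The gain is also somewhat larger than for Sequence Labeling BERT. The authors' explanation: One possible explanation is that our model makes more use of the correlations between different entity type labels in the vocabulary as mentioned earlier, which BERT cannot achieve due to treating the output as discrete class labels. (I don't fully understand this sentence.)
Compared with the traditional methods (Sequence Labeling BERT and Sequence Labeling BART), Template-based BART is better whether the target domain has few samples (10) or many (500). This is probably because the template-based method does not need to redefine the output layer, so it retains the source-domain information and achieves continual learning;
Compared with distance-based methods, the proposed method not only wins when there are very few instances; as the number of samples grows, the distance-based methods barely improve while the proposed method does. This is probably because the proposed method uses the instances to update the model parameters and so learns better features, whereas the distance-based methods do not.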

Discussion

Impact of entity frequencies in training data

Thinking

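Valuable references

Limitations

Open questions

Why not solve this problem directly with MLM?
I'm not very familiar with generative LMs, so I don't know how seq2seq training is done in code. In Equation 4, is $t_{1:c-1}$ the gold sequence or the decoder's predictions? (My guess is gold.)
The paper gives an example in which "in Bangkok" may be recognized as "ORG" and "Bangkok" as "LOC". Since the dataset itself contains no nested entities, for overlapping predictions the one with the higher probability is kept as the final result. How is this actually implemented in code? Surely not by brute-force enumeration?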
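On the last question, one straightforward possibility (my guess, not confirmed against the released code) is a greedy pass over the scored spans: sort all predicted entities by score and keep a span only if it does not overlap one already kept. The function name and tuple layout are hypothetical.

```python
def resolve_overlaps(predictions):
    """Greedily keep the highest-scoring non-overlapping predictions.

    predictions: list of (start, end, label, score) tuples, end exclusive.
    """
    kept = []
    for start, end, label, score in sorted(
        predictions, key=lambda p: p[3], reverse=True
    ):
        # Two spans are disjoint if one ends before the other starts.
        if all(end <= s or start >= e for s, e, _, _ in kept):
            kept.append((start, end, label, score))
    return sorted(kept)


# "in Bangkok" (tokens 4..6) as ORG versus "Bangkok" (tokens 5..6) as LOC.
predictions = [(4, 6, "ORG", -1.2), (5, 6, "LOC", -0.3), (0, 1, "ORG", -0.5)]
final = resolve_overlaps(predictions)
```

Since the candidate set is only about $8n$ spans per sentence, scoring all of them and filtering afterwards is cheap compared with the model forward passes.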