Extensively Matching for Few-shot Learning Event Detection

Reading notes on the ACL 2020 paper "Extensively Matching for Few-shot Learning Event Detection".

1. Background

  1. Typical event detection (ED) methods are feature-engineering-based supervised learning and neural networks, but supervised models handle unseen event types poorly: the usual remedy is to annotate new data and retrain, which is costly.
  2. One potential problem of prior FSL methods is that the model relies solely on the training signal between the query instance and the support set; the matching information between samples within the support set has not been exploited yet. In other words, earlier few-shot work only uses the $L_{query}$ term introduced below, without the $L_{intra}$ and $L_{inter}$ terms.

2. Contributions

  1. First work to formulate ED as a few-shot learning problem;

  2. Augments the loss function with two kinds of matching information:

    1. matching information between query instance and the support set;
    2. matching information between the samples in the support set themselves;

    The paper calls these two kinds of matching information training signals.

  3. The two proposed training signals bring significant gains and can be applied to any metric-based FSL model

3. Terminology

Few-shot learning

In FSL, a trained model rapidly learns a new concept from a few examples while keeping great generalization from observed examples. Hence, if we need to extend event detection into a new domain, only a few examples are needed to activate the system in the new domain without retraining the model. By formulating ED as FSL, we can significantly reduce the annotation cost and training cost while maintaining highly accurate results.

How to do few-shot learning?

In a few-shot learning iteration, the model is given a support set and a query instance. The support set consists of examples from a small set of classes. The model must predict the label of the query instance from the set of classes appearing in the support set.
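
To make this concrete, here is a minimal sketch of one N-way K-shot episode in plain Python/NumPy, using a nearest-prototype rule (all names and the toy data are illustrative, not the paper's code):

```python
import numpy as np

def run_episode(support: dict[str, np.ndarray], query: np.ndarray) -> str:
    """One FSL episode: `support` maps each of N class labels to a
    (K, dim) array of encoded examples; `query` is a (dim,) vector.
    Predict the query's label among the support-set classes only."""
    # One prototype per class: the mean of its K support vectors.
    prototypes = {label: vecs.mean(axis=0) for label, vecs in support.items()}
    # The class with the nearest prototype (Euclidean distance) wins.
    return min(prototypes, key=lambda lbl: np.linalg.norm(query - prototypes[lbl]))

# Toy 5-way 3-shot episode with random 4-d "encodings".
rng = np.random.default_rng(0)
support = {f"event_{i}": rng.normal(size=(3, 4)) for i in range(5)}
query = rng.normal(size=4)
print(run_episode(support, query))  # one of event_0 .. event_4
```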

4. Models

Event Detection as Few-shot Learning

FSL models usually follow the N-way K-shot paradigm to classify the query instance. The authors add one extra class, NULL, for non-event instances, turning the task into an (N+1)-way K-shot problem.

With the NULL class added to the support set ((N+1)-way K-shot), the label set becomes (see the snippet after the list):

  • $(t_{1},…,t_{N})$: positive labels
  • $t_{null}$: a special label for non-event
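
Continuing the episode sketch above, the NULL class is just one more entry in the support set (how the non-event examples are sampled is an implementation detail; random toy vectors are assumed here):

```python
# Extend the toy support set from the sketch above with the special
# non-event class, turning 5-way into (5+1)-way 3-shot.
support["t_null"] = rng.normal(size=(3, 4))   # K=3 "non-event" encodings
print(run_episode(support, query))            # now event_0..event_4 or t_null
```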

Framework

Instance Encoder

Each word of the sentence $s$ is first mapped to a word embedding.

A neural network (CNN, LSTM, or GCN) is then run over the word sequence of $s$ to produce a representation of the whole sentence.
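
A minimal PyTorch sketch of the CNN variant (the embedding size, filter width, and all names are assumptions for illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class CNNInstanceEncoder(nn.Module):
    """Embed each word of a sentence, then pool a CNN over the
    word sequence into a single sentence representation."""
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len, emb_dim)
        x = self.embedding(token_ids)
        # Conv1d expects (batch, channels, seq_len).
        x = torch.relu(self.conv(x.transpose(1, 2)))
        # Max-pool over the sequence -> one vector per sentence.
        return x.max(dim=2).values

enc = CNNInstanceEncoder(vocab_size=10_000)
sentences = torch.randint(0, 10_000, (8, 20))   # batch of 8 sentences, 20 tokens
print(enc(sentences).shape)                     # torch.Size([8, 128])
```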

Prototype Encoder

This module computes a representative vector, called a prototype, for each class in the support set.

There are two ways to obtain it: simple averaging of the support vectors, or a weighted average whose weights come from an attention mechanism.
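
A sketch of both options in PyTorch (the query-conditioned softmax attention is one common weighting scheme, assumed here for illustration):

```python
import torch

def mean_prototype(support_vecs: torch.Tensor) -> torch.Tensor:
    """Simple averaging: (K, dim) support encodings -> (dim,) prototype."""
    return support_vecs.mean(dim=0)

def attentive_prototype(support_vecs: torch.Tensor, query_vec: torch.Tensor) -> torch.Tensor:
    """Weighted averaging: support examples that look more like the
    query get larger attention weights (illustrative scheme)."""
    scores = support_vecs @ query_vec        # (K,) dot-product scores
    weights = torch.softmax(scores, dim=0)   # normalized to sum to 1
    return weights @ support_vecs            # (dim,) weighted prototype
```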

Classification Module

The classification module scores the encoded query $v$ against every prototype and normalizes with a softmax, following the standard metric-based form:
$$
P(y=t_{i}|x,S)=\frac{\exp(d(v,c_{i}))}{\sum_{j=1}^{N+1}\exp(d(v,c_{j}))}
$$
Here $d()$ is a matching score (distances such as Euclidean enter with a negative sign) and can be one of four choices, sketched in code after the list:

  1. Cosine similarity with averaged prototypes, as in the Matching network
  2. Euclidean distance with averaged prototypes, as in the Prototypical (Proto) network
  3. Euclidean distance with attention-weighted prototypes, as in the Proto+Att network
  4. A learnable distance function with averaged prototypes, as in the Relation network
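
The four options, sketched in PyTorch as scoring functions over a query encoding `v` and a prototype `c` (the Relation-network scorer is reduced to a tiny MLP for illustration; the original architectures differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_score(v: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # 1. Matching-network style: cosine similarity (higher = closer).
    return F.cosine_similarity(v, c, dim=-1)

def euclidean_score(v: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # 2./3. Proto and Proto+Att style: negative squared Euclidean
    # distance, so that higher still means closer before the softmax.
    return -((v - c) ** 2).sum(dim=-1)

class RelationScore(nn.Module):
    """4. Relation-network style: a small learnable network scores the
    (query, prototype) pair instead of using a fixed metric."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, v: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([v, c], dim=-1)).squeeze(-1)
```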

Training Objectives

Training the ED model with the matching information both between the query instance and the support set and between the samples within the support set significantly reduces annotation and training cost while maintaining high accuracy. Concretely, this is done by adding auxiliary terms to the loss function to constrain the learning process.

  • Query loss (negative log-likelihood)
    $$
    L_{query}(x,S)=-\log P(y=t|x,S) \tag{1}
    $$
    where $x$, $t$, and $S$ are the query instance, the ground-truth label, and the support set

  • Intra-cluster matching

    Vectors of samples from the same class should be similar, so their pairwise distances are minimized ($v_{i}^{j}$ is the encoding of the $j$-th support example of class $i$):
    $$
    L_{intra}=\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\sum\limits_{j=k+1}^{K}mse(v_{i}^{j},v_{i}^{k}) \tag{2}
    $$

  • Inter-cluster information

    Maximize the distance between different classes, where $c_{i}$ denotes the prototype of class $i$:
    $$
    L_{inter}=1-\sum\limits_{i=1}^{N}\sum\limits_{j=i+1}^{N}cosine(c_{i},c_{j}) \tag{3}
    $$

  • Overall loss

    Combining (1), (2), and (3), presumably as a weighted sum with trade-off hyperparameters $\alpha$ and $\beta$ (a code sketch follows below):
    $$
    L(x,S)=L_{query}(x,S)+\alpha L_{intra}+\beta L_{inter}
    $$
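
A minimal PyTorch sketch of the three terms and their weighted combination (tensor layouts, the loop-based implementation, and the weights `alpha`/`beta` are illustrative assumptions, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def query_loss(logits: torch.Tensor, target: int) -> torch.Tensor:
    # Eq. (1): negative log-likelihood of the query's true class.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))

def intra_loss(support: torch.Tensor) -> torch.Tensor:
    # Eq. (2): support is (N, K, dim); pull same-class encodings
    # together by penalizing pairwise MSE within each class.
    N, K, _ = support.shape
    total = support.new_zeros(())
    for i in range(N):
        for k in range(K):
            for j in range(k + 1, K):
                total = total + F.mse_loss(support[i, j], support[i, k])
    return total

def inter_loss(prototypes: torch.Tensor) -> torch.Tensor:
    # Eq. (3): prototypes is (N, dim); push different classes apart
    # via their pairwise cosine similarities.
    N = prototypes.shape[0]
    sim = prototypes.new_zeros(())
    for i in range(N):
        for j in range(i + 1, N):
            sim = sim + F.cosine_similarity(prototypes[i], prototypes[j], dim=0)
    return 1 - sim

def total_loss(logits, target, support, prototypes, alpha=0.1, beta=0.1):
    # Weighted combination of (1), (2), (3); alpha/beta are assumed.
    return (query_loss(logits, target)
            + alpha * intra_loss(support)
            + beta * inter_loss(prototypes))
```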

5. Experiments

  • Table 1 shows:
    • 5+1-way 5-shot always outperforms 10+1-way 10-shot, because the latter has twice as many classes to discriminate
    • Proto and Proto+Att perform best overall
    • In the 10+1-way 10-shot setting, Proto+Att is slightly better than Proto
  • Table 2 shows:
    • With the proposed loss terms, the F1 score improves markedly for all of the neural models

6. Ablation Study

The ablation table above reports, for each model, results with neither auxiliary loss, with only $L_{inter}$, with only $L_{intra}$, and with both. Both loss terms clearly improve the results, and removing either one causes a considerable drop in accuracy.

