Distant supervision for relation extraction without labeled data


[TOC]

This paper is quite old; I am reading it mainly to understand the concept of Distant Supervision, so some parts of the paper will not be covered very closely.

Background

As of 2009, relation extraction mainly took the following three forms, each with its own drawbacks:

  1. Supervised approaches: on the data side they require large amounts of annotated corpora, and the trained models tend to be corpus-specific (i.e. they transfer poorly across domains).
  2. Unsupervised approaches: the relation names they produce are not standardized.
  3. Bootstrap learning: the results suffer from low precision and semantic drift.

Contributions

  1. As the title says, relation extraction without labeled data: it does not need the annotated corpora that supervised approaches require;
  2. It is the first to bring the idea of distant supervision to the relation extraction task.

Terminology

Freebase

A database that stores relations and their corresponding relation instances.

p.s. This database is extremely large.

Distant Supervision

One thing to be clear about first: distant supervision is a learning paradigm, not a concrete model, in the same sense that supervised and unsupervised learning are paradigms.

Taken literally, "distant supervision" implies that there is supervision. As mentioned earlier, supervision requires labels, i.e. the data must come with labels, so where do the labels come from? The answer: from a distance :)

The paper states the intuition behind distant supervision twice:

  • The intuition of distant supervision is that any sentence that contains a pair of entities that participate in a known Freebase relation is likely to express that relation in some way.
  • The intuition of our distant supervision approach is to use Freebase to give us a training set of relations and entity pairs that participate in those relations.

In plain terms: if two entities stand in some relation in the knowledge base, then every unstructured sentence that contains both entities is taken to express that relation.

Model

The model's input is a feature vector for an entity pair and its relation, and the model itself is internally very simple: a multi-class logistic classifier.
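
To make this concrete, here is a minimal sketch (my own toy code, not the paper's implementation; the feature names and relation labels below are invented) of a multi-class logistic regression classifier over sparse, aggregated feature vectors, built with scikit-learn:

```python
# Toy sketch: multi-class logistic regression over aggregated feature vectors.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One example per (entity1, entity2) tuple; the feature counts are pooled
# over every sentence that mentions the pair (all values here are made up).
X_dicts = [
    {"between:was born in": 1, "order:E1_first": 1},
    {"between:is the capital of": 1, "order:E1_first": 1},
    {"between:, founder of": 1, "order:E1_first": 1},
]
y = ["/people/person/place_of_birth",
     "/location/country/capital",
     "/business/company/founders"]

vec = DictVectorizer()                      # sparse one-hot feature encoding
clf = LogisticRegression(max_iter=1000)     # handles multiple classes
clf.fit(vec.fit_transform(X_dicts), y)

print(clf.predict(vec.transform([{"between:was born in": 1}])))
```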

Next, following the model's two stages, train and test, let's go over what needs to be done in each:

Train Step⭐

In the training step, all entities are identified in sentences using a named entity tagger that labels persons, organizations and locations. If a sentence contains two entities and those entities are an instance of one of our Freebase relations, features are extracted from that sentence and are added to the feature vector for the relation.

The distant supervision assumption is that if two entities participate in a relation, any sentence that contain those two entities might express that relation. Because any individual sentence may give an incorrect cue, our algorithm trains a multiclass logistic regression classifier, learning weights for each noisy feature. In training, the features for identical tuples (relation, entity1, entity2) from different sentences are combined, creating a richer feature vector.

For every relation instance (i.e. entity pair) that appears in Freebase, find all sentences in the available training corpus that contain that entity pair, extract features from each of these sentences separately, and then merge the per-sentence features into the feature vector for the tuple (relation, entity1, entity2).
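
A minimal sketch of this alignment-and-merge step, with toy data (the Freebase entries, sentences, and feature template below are invented for illustration; the real features are described in the Feature Engineering section):

```python
# Toy sketch: build distantly supervised training examples by aligning
# Freebase entity pairs with corpus sentences and merging their features.
from collections import Counter, defaultdict

# Toy "Freebase": (entity1, entity2) -> relation
freebase = {("Barack Obama", "Honolulu"): "/people/person/place_of_birth"}

# Toy corpus: assume NER has already marked the entity mentions.
corpus = [
    {"text": "Barack Obama was born in Honolulu .",
     "entities": ["Barack Obama", "Honolulu"]},
    {"text": "Barack Obama visited Honolulu last week .",
     "entities": ["Barack Obama", "Honolulu"]},
]

def extract_features(text, e1, e2):
    """Stand-in for the paper's lexical/syntactic features: here we only
    keep the words between the two entities."""
    between = text.split(e1)[-1].split(e2)[0].strip()
    return Counter({f"between:{between}": 1})

# Merge features of identical (relation, entity1, entity2) tuples.
training = defaultdict(Counter)
for sent in corpus:
    for e1 in sent["entities"]:
        for e2 in sent["entities"]:
            if e1 != e2 and (e1, e2) in freebase:
                rel = freebase[(e1, e2)]
                training[(rel, e1, e2)] += extract_features(sent["text"], e1, e2)

for key, feats in training.items():
    print(key, dict(feats))
```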

Test Step⭐

In the testing step, entities are again identified using the named entity tagger. This time, every pair of entities appearing together in a sentence is considered a potential relation instance, and whenever those entities appear together, features are extracted on the sentence and added to a feature vector for that entity pair. For example, if a pair of entities occurs in 10 sentences in the test set, and each sentence has 3 features extracted from it, the entity pair will have 30 associated features. Each entity pair in each sentence in the test corpus is run through feature extraction, and the regression classifier predicts a relation name for each entity pair based on the features from all of the sentences in which it appeared.

At test time, we take every entity pair that co-occurs in the test set (not just the pairs already in Freebase), repeat the feature extraction and aggregation from the train step, and let the classifier predict a relation for each pair.
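
A minimal, self-contained sketch of the test stage, again with toy data and a toy stand-in for both the feature extraction and the trained classifier:

```python
# Toy sketch: every co-occurring entity pair in the test corpus is a candidate;
# its features are pooled over all of its sentences before one prediction is made.
from collections import Counter, defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def between_words(text, e1, e2):
    """Toy feature: the words between the two entities."""
    return Counter({f"between:{text.split(e1)[-1].split(e2)[0].strip()}": 1})

# Toy "trained" classifier (in reality, trained on the distantly labelled data).
vec = DictVectorizer()
X = vec.fit_transform([{"between:founded": 1}, {"between:was born in": 1}])
clf = LogisticRegression(max_iter=1000).fit(
    X, ["/business/company/founders", "/people/person/place_of_birth"])

test_corpus = [
    {"text": "Steve Jobs founded Apple in 1976 .",
     "entities": ["Steve Jobs", "Apple"]},
    {"text": "Steve Jobs returned to Apple in 1997 .",
     "entities": ["Steve Jobs", "Apple"]},
]

candidates = defaultdict(Counter)
for sent in test_corpus:
    for e1 in sent["entities"]:
        for e2 in sent["entities"]:
            if e1 != e2:   # any co-occurring pair, not only pairs already in Freebase
                candidates[(e1, e2)] += between_words(sent["text"], e1, e2)

for pair, feats in candidates.items():
    print(pair, clf.predict(vec.transform([dict(feats)]))[0])
```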

Feature Engineering

We said above that features are extracted from every sentence; next, let's look at how these features are extracted.

For traditional machine learning models, feature engineering has always been one of the trickier parts. This paper is no exception and relies on a large amount of feature engineering:

Lexical Feature

Word-level features, mainly capturing the following:

  • The sequence of words between the two entities
  • The part-of-speech tags of these words
  • A flag indicating which entity came first in the sentence
  • A window of k words to the left of Entity 1 and their part-of-speech tags
  • A window of k words to the right of Entity 2 and their part-of-speech tags

(Figure from the paper: an example of the lexical features extracted for a sentence.)
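
As a rough illustration, here is a toy sketch of these lexical features (my own simplification: it assumes tokenization and POS tagging have already been done elsewhere, and it uses the paper's Edwin Hubble example sentence):

```python
# Toy sketch of the lexical features: words/POS between the entities,
# an entity-order flag, and a k-word window (with POS) on each side.
def lexical_features(tokens, pos, e1_span, e2_span, k=2):
    """tokens/pos are parallel lists; e*_span are (start, end) token indices."""
    first, second = sorted([e1_span, e2_span])
    li, ri = max(0, first[0] - k), second[1] + k
    return {
        "between_words": " ".join(tokens[first[1]:second[0]]),
        "between_pos":   " ".join(pos[first[1]:second[0]]),
        "e1_first":      e1_span < e2_span,
        "left_window":   " ".join(tokens[li:first[0]]),
        "left_pos":      " ".join(pos[li:first[0]]),
        "right_window":  " ".join(tokens[second[1]:ri]),
        "right_pos":     " ".join(pos[second[1]:ri]),
    }

tokens = ["Astronomer", "Edwin", "Hubble", "was", "born", "in",
          "Marshfield", ",", "Missouri"]
pos    = ["NN", "NNP", "NNP", "VBD", "VBN", "IN", "NNP", ",", "NNP"]
print(lexical_features(tokens, pos, e1_span=(1, 3), e2_span=(6, 7), k=2))
```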

Syntactic Feature

Syntax-level features, mainly capturing the following:

  • A dependency path between the two entities
  • For each entity, one ‘window’ node that is not part of the dependency path

(Figure from the paper: an example of the syntactic features extracted for a sentence.)
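
And a toy sketch of the dependency-path feature (my own simplification; the paper obtains its parses with MINIPAR, whereas a real system today would use any modern dependency parser):

```python
# Toy sketch: the dependency path between two entities, plus a "window" node
# attached to an entity but not on the path itself.
from collections import deque

# Toy parse of "Edwin Hubble was born in Marshfield": (head, dependent, label).
edges = [
    ("born", "Hubble", "nsubjpass"),
    ("born", "was", "auxpass"),
    ("born", "in", "prep"),
    ("in", "Marshfield", "pobj"),
    ("Hubble", "Edwin", "nn"),
]

# Undirected adjacency map, keeping the dependency label and direction.
adj = {}
for head, dep, label in edges:
    adj.setdefault(head, []).append((dep, f"->{label}"))
    adj.setdefault(dep, []).append((head, f"<-{label}"))

def dependency_path(start, goal):
    """Shortest path between two tokens over the dependency graph (BFS)."""
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None

path = dependency_path("Hubble", "Marshfield")
window = [n for n, _ in adj["Hubble"] if n not in set(path)]
print(" ".join(path))  # Hubble <-nsubjpass born ->prep in ->pobj Marshfield
print(window)          # ['Edwin']
```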

Experiment

Dataset

Besides Freebase, the corpus is text from Wikipedia.

Held-out evaluation

Human evaluation

Reflections

  • This ancient paper can hardly be reproduced today; the main point of reading it is to understand the concept of distant supervision.

  • To me, distant supervision feels like artificially synthesizing a special training set.

    It requires a very large knowledge base like Freebase, and at the same time a corpus that should cover as many of the knowledge base's entities as possible. Knowledge base + corpus: neither can be missing, and together they are combined into a new training set for the downstream relation extraction task.

  • The distant supervision assumption is a very strong one. In reality, many co-occurring entities have no relation at all; they merely appear in the same sentence. And some entity pairs actually stand in more than one relation: Obama and the United States, for example, could be related by "born in" or by "is the president of".

    Relation extraction built on this assumption usually has two obvious weaknesses:

    1. Because of the assumption, the training set contains a large number of wrong labels, e.g. when two entities have several relations or no relation at all in a given sentence; such noisy training data hurts the relation extractor.

    2. Errors introduced by NLP tools such as NER and parsing: the more feature engineering, the more errors, and these errors propagate and accumulate along the pipeline, degrading the precision of the downstream relation extraction.

    For the wrong-label problem in (1), some work casts relation extraction as a multi-instance multi-label learning problem, e.g. Multi-instance Multi-label Learning for Relation Extraction, where each instance in the training set may take any of several labels.

    Other work casts it as a multi-instance single-label problem: co-occurring entity pairs are assumed to have either a single relation or none, all instances containing the same entity pair form a bag, each bag carries one label, and training optimizes the accuracy of the bag labels. The first formulation is closer to reality, and correspondingly harder to work on.

    For the pipeline problem in (2), replacing feature engineering with deep learning is a very natural idea: represent the entities and words of a sentence with word embeddings, model the sentence with an RNN or CNN (or one of their many variants) to obtain a sentence vector, and then classify the relation. Several works in recent years follow this line (a toy sketch follows this list), for example:

    [3] Relation Classification via Convolutional Deep Neural Network

    [4] Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks

    [5] Neural Relation Extraction with Selective Attention over Instances

    [6] Distant Supervision for Relation Extraction with Sentence-Level Attention and Entity Descriptions
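
As a rough illustration of that deep-learning pipeline (my own toy PyTorch code, not a reimplementation of any of the cited models), the sketch below embeds the tokens, applies a 1-D convolution with max pooling to obtain a sentence vector, and classifies the relation with a linear layer:

```python
# Toy sketch: word embeddings -> CNN -> max pooling -> relation classifier.
import torch
import torch.nn as nn

class CNNRelationClassifier(nn.Module):
    def __init__(self, vocab_size, num_relations, emb_dim=50, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden, num_relations)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))               # (batch, hidden, seq_len)
        sent_vec = h.max(dim=2).values             # max pooling over time
        return self.out(sent_vec)                  # relation logits

model = CNNRelationClassifier(vocab_size=1000, num_relations=5)
dummy_batch = torch.randint(0, 1000, (2, 12))      # 2 sentences, 12 token ids each
print(model(dummy_batch).shape)                    # torch.Size([2, 5])
```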

