[AI Yanxishe] KDD 2020 | The State of Fraud Detection Research and Modeling Fraudsters' Adversarial Behavior

Study notes

GNN

Published: 2020-12-12

About the livestream

https://live.yanxishe.com/room/865

Review 1 and Review 3 are spam reviews written by paid posters. Review 1 was generated automatically by a GAN, which is seriously impressive.

1. Background: a history of fraud

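The evolution of fraud: fraud follows the money. Today a lot of money moves through online finance, so financial fraud has appeared there.

Fraudsters deceive by definition, whereas hackers do not necessarily commit fraud.

That is an anomaly, not fraud.

The difference between fraud and anomaly: an anomaly is objective and defined by the data distribution (an outlier, for example), whereas fraud is subjective and can only be defined with domain knowledge.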
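To make the distinction concrete, here is a minimal sketch of a purely distributional anomaly detector (a z-score test). The transaction amounts and the threshold are made-up values for illustration only. The detector flags whatever is statistically unusual; whether the flagged point is a stolen card or a legitimate large purchase still has to be decided with domain knowledge.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return indices of values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; the last one is unusually large.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = zscore_outliers(amounts)
print(flagged)  # indices of statistical outliers, not of confirmed fraud
```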

It is an interdisciplinary field.

2. Background: fraud type and fraud detectors

To do fraud detection, we first need to know what counts as fraud. Grouped by the domain in which it occurs, fraud today falls into three main categories:

Social Network
- Fake Reviews –> GraphConsis
- Social Bots
- Misinformation (false information)
- Disinformation (deliberately false information)
- Fake Accounts
- Social Sybils (many fake identities controlled by one actor)
- Link Advertising
Finance (Ant Financial is working on this)
- Insurance Fraud
- Loan Defaulter –> SemiGNN
- Money Laundering
- Malicious Account –> GEM, GeniePath
- Transaction Fraud
- Cash-out User (users who illegally convert credit into cash)
- Credit Card Fraud
Others
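- Advertisement (fake CTR)
- Mobile Apps (faked download counts)
- Ecommerce (coupon and promotion abuse)
- Crowdturfing (shady merchants hire people to write positive reviews of their own products, or even to smear competitors)
- Bitcoin Fraud
- Game
- Email, SMS, Phones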


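Classification by modality:

Content-based Detectors: for fake reviews, for example, judge by the content of the review itself

Behavior-based Detectors: judge by how behavior changes over time

Graph-based Detectors

Classification by technique:

Rule-based Detectors: expert knowledge

Feature-based Detectors: logistic regression, decision trees

Deep learning-based Detectors: end-to-end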
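The first two technique classes can be contrasted with a small sketch. Everything in it is hypothetical: the review fields (`rating`, `num_words`, `account_age_days`), the rule thresholds, and the hand-set weights that stand in for a trained logistic regression. It is meant only to show the difference in shape between an expert rule and a feature-based score; a deep learning detector would learn the features end-to-end and is not sketched here.

```python
import math

def rule_based_detector(review):
    """Expert rule: flag very short 5-star reviews from brand-new accounts."""
    return (
        review["rating"] == 5
        and review["num_words"] < 10
        and review["account_age_days"] < 7
    )

# Hand-set weights standing in for a trained logistic regression (assumption).
WEIGHTS = {"rating": 0.4, "num_words": -0.05, "account_age_days": -0.02}
BIAS = -1.0

def feature_based_score(review):
    """Logistic score in (0, 1); higher means more likely fraudulent."""
    z = BIAS + sum(WEIGHTS[name] * review[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

suspicious = {"rating": 5, "num_words": 6, "account_age_days": 1}
ordinary = {"rating": 4, "num_words": 80, "account_age_days": 400}

print(rule_based_detector(suspicious), feature_based_score(suspicious))
print(rule_based_detector(ordinary), feature_based_score(ordinary))
```

The rule gives a hard yes/no and is easy to audit, while the score can be thresholded or ranked, which is why feature-based detectors are usually tuned on labeled data.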


![Slide screenshot](http://ww1.sinaimg.cn/large/9b63ed6fgy1gllf2w4ue4j20vo0kr7k6.jpg)

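Yingyongbao (an app store) and Xianyu (a second-hand marketplace)

Complex adversarial behavior, such as buying positive reviews for you.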

3. Resources: datasets, toolboxes, papers, surveys, companies, etc.

SafeGraph

https://github.com/safe-graph

Open-source GitHub repositories

Dataset

ODDS dataset (Outlier Detection DataSets)

http://odds.cs.stonybrook.edu/

It contains five categories of data:
1. Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
2. Time series graph datasets for event detection: Temporal graph data where the graph changes dynamically over time in which new nodes and edges arrive or existing nodes and edges disappear.
3. Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
4. Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review system. Cyber security data, e.g. intrusion detection with DoS, DDoS etc. attack scenario.
5. Crowded scene video data for anomaly detection: Video clips acquired with camera.

Bitcoin dataset❗

https://www.kaggle.com/ellipticco/elliptic-data-set

Description

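Bitcoin transactions are treated as nodes, and machine learning is used to classify each transaction as licit or illicit. This is a node-level binary classification problem.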

The Elliptic Data Set maps Bitcoin transactions to real entities belonging to licit categories (exchanges, wallet providers, miners, licit services, etc.) versus illicit ones (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.). The task on the dataset is to classify the illicit and licit nodes in the graph.

Content

This anonymized data set is a transaction graph collected from the Bitcoin blockchain. A node in the graph represents a transaction, an edge can be viewed as a flow of Bitcoins between one transaction and the other. Each node has 166 features and has been labeled as being created by a “licit”, “illicit” or “unknown” entity.

Nodes and edges

The graph is made of 203,769 nodes and 234,355 edges.
Two percent (4,545) of the nodes are labelled class1 (illicit).
Twenty-one percent (42,019) are labelled class2 (licit).
The remaining transactions are not labelled with regard to licit versus illicit.
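The counts above imply a strong class imbalance. The short check below reproduces the quoted percentages from the node counts and derives the number of unlabelled nodes; the only inputs are the three figures stated in the description.

```python
TOTAL_NODES = 203_769
ILLICIT = 4_545   # class1
LICIT = 42_019    # class2

unknown = TOTAL_NODES - ILLICIT - LICIT

def pct(count, total=TOTAL_NODES):
    """Percentage of `count` relative to `total`."""
    return 100.0 * count / total

print(f"illicit: {pct(ILLICIT):.1f}% of all nodes")
print(f"licit:   {pct(LICIT):.1f}% of all nodes")
print(f"unknown: {unknown} nodes ({pct(unknown):.1f}%)")
# Share of illicit nodes among the labelled ones only.
print(f"illicit among labelled: {pct(ILLICIT, ILLICIT + LICIT):.1f}%")
```

Since roughly three quarters of the nodes carry no label, a supervised model only sees the labelled subset, and within it illicit nodes are about one in ten.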

Features

There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset.
There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.
The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).
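The 94 + 72 layout described above can be sketched as a simple slice. This is an illustration under two assumptions that the description does not state explicitly: that a feature row holds exactly the 166 features in the documented order (node id already removed), and that the time step is the first of the local features. `split_features` is a hypothetical helper, not part of any dataset tooling.

```python
N_LOCAL = 94        # local transaction features (time step included)
N_AGGREGATED = 72   # one-hop neighbourhood aggregates

def split_features(row):
    """Split one node's features into (time_step, local, aggregated)."""
    if len(row) != N_LOCAL + N_AGGREGATED:
        raise ValueError(f"expected 166 features, got {len(row)}")
    local = row[:N_LOCAL]
    aggregated = row[N_LOCAL:]
    time_step = int(local[0])  # assumption: time step is stored first
    return time_step, local, aggregated

# Synthetic row: time step 3 followed by 165 placeholder values.
example_row = [3.0] + [0.0] * 165
time_step, local, aggregated = split_features(example_row)
print(time_step, len(local), len(aggregated))
```

Keeping the two groups separate makes it easy to compare a model trained on local features only against one that also uses the neighbourhood aggregates.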

Papers that use this dataset include:

EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

Yelp and Amazon

https://github.com/safe-graph/DGFraud

For GraphConsis, we preprocessed Yelp Spam Review Dataset with reviews as nodes and three relations as edges.

The dataset with .mat format is located at /dataset/YelpChi.zip. The .mat file includes:

net_rur, net_rtr, net_rsr: three sparse matrices representing three homo-graphs defined in the GraphConsis paper;
features: a sparse matrix of 100-dimension Bag-of-words features;
label: a numpy array with the ground truth of nodes. 1 represents spam and 0 represents benign.
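A minimal sketch of reading these fields, assuming the archive has been unzipped and SciPy is installed. The file name `YelpChi.mat` in the usage comment is a guess at what the zip contains, and `load_yelpchi` and `count_labels` are hypothetical helpers written for this note; only the key names come from the description above.

```python
def load_yelpchi(mat_path):
    """Load the .mat file into a dict keyed by variable name.

    Requires SciPy; the import is local so count_labels works without it.
    """
    from scipy.io import loadmat
    return loadmat(mat_path)

def count_labels(labels):
    """Count benign (0) and spam (1) entries in a flat iterable of labels."""
    counts = {0: 0, 1: 0}
    for y in labels:
        counts[int(y)] += 1
    return counts

# Typical use (path is an assumption):
#   data = load_yelpchi("YelpChi.mat")
#   relations = {k: data[k] for k in ("net_rur", "net_rtr", "net_rsr")}
#   features = data["features"]
#   # loadmat returns arrays that are at least 2-D, so flatten the labels first
#   print(count_labels(data["label"].ravel()))

print(count_labels([0, 1, 1, 0, 0]))
```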

To get the complete metadata of the Yelp dataset, please send an email to ytongdou@gmail.com for inquiry.