[AI Yanxishe] KDD 2020 | The State of Fraud Detection Research and Modeling Fraudsters' Adversarial Behavior


About the livestream

https://live.yanxishe.com/room/865

Review 1 and Review 3 are spam written by paid posters; Review 1 was even generated automatically by a GAN, which is seriously impressive.

1. Background: a history of fraud

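The evolution of fraud: fraud follows the money. Online finance is where the money is today, so that is where financial fraud has appeared.

Fraudsters engage in deception, whereas hackers do not necessarily commit fraud.

This is an anomaly, not fraud.

The difference between fraud and anomaly: an anomaly is objective and distributional, for example an outlier. Fraud is different: it is subjective and can only be defined with domain knowledge.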
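
A minimal sketch can make this distinction concrete. Everything in it is an illustrative assumption rather than material from the talk: the amounts, the 2-standard-deviation threshold, and the hypothetical rule `is_fraud_by_rule`. The z-score flags a point only because it is far from the rest of the distribution, while the rule encodes domain knowledge and can fire on a value that looks perfectly ordinary.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; the last one is a statistical outlier.
amounts = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0, 500.0]

def zscore_outliers(values, threshold=2.0):
    """Return indices whose value is more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def is_fraud_by_rule(tx):
    """Hypothetical domain rule: a refund sent to a card other than the paying card."""
    return tx["type"] == "refund" and tx["refund_card"] != tx["pay_card"]

transactions = [
    # Ordinary amount, but a fraudulent pattern according to the rule.
    {"amount": 20.0, "type": "refund", "pay_card": "A", "refund_card": "B"},
    # Outlier amount, but a legitimate purchase according to the rule.
    {"amount": 500.0, "type": "purchase", "pay_card": "C", "refund_card": None},
]

outlier_idx = zscore_outliers(amounts)
rule_flags = [is_fraud_by_rule(tx) for tx in transactions]
print(outlier_idx, rule_flags)
```

In this toy data the outlier and the fraud case are different records, which is exactly why an anomaly detector alone is not a fraud detector.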

It is an interdisciplinary field.

2. Background: fraud type and fraud detectors

To do fraud detection, you first need to know what counts as fraud. Today, grouped by the domain in which the fraud occurs, there are three main categories:

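  1. Social Network
    • Fake Reviews –> GraphConsis
    • Social Bots
    • Misinformation (false or inaccurate information)
    • Disinformation (deliberately false information)
    • Fake Accounts
    • Social Sybils (many fake identities controlled by a single party)
    • Link Advertising
  2. Finance (Ant Financial works on this)
    • Insurance Fraud
    • Loan Defaulter –> SemiGNN
    • Money Laundering
    • Malicious Account –> GEM, GeniePath
    • Transaction Fraud
    • Cash-out User (users who illegitimately convert credit into cash)
    • Credit Card Fraud
  3. Others
    • Advertisement (fake CTR)
    • Mobile Apps (faked download counts)
    • Ecommerce (promotion and coupon abuse)
    • Crowdturfing (unscrupulous merchants hire people to write positive reviews of their own products, or even to smear competitors)
    • Bitcoin Fraud
    • Game
    • Email, SMS, Phones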

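Classification by modality

Content-based Detectors: for fake reviews, for example, tell them apart by examining the review content.

Behavior-based Detectors: tell them apart by how behavior changes over time.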

Graph-based Detectors
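
To illustrate the behavior-based idea of looking at change over time, here is a minimal sketch that flags days on which a product receives an unusual burst of reviews. The review log, the function name `burst_days`, and the factor of 3 are all assumptions made for illustration; real detectors use much richer temporal signals.

```python
from collections import Counter
from statistics import median

# Hypothetical review log for one product: (day index, reviewer id).
reviews = [(1, "u1"), (2, "u2"), (3, "u3"), (4, "u4"),
           (5, "u5"), (5, "u6"), (5, "u7"), (5, "u8"), (5, "u9"),
           (6, "u10"), (7, "u11")]

def burst_days(events, factor=3.0):
    """Flag days whose review count exceeds `factor` times the median daily count."""
    per_day = Counter(day for day, _ in events)
    typical = median(per_day.values())
    return sorted(day for day, n in per_day.items() if n > factor * typical)

print(burst_days(reviews))
```

Day 5 stands out only relative to the other days, so the same five reviews would not be flagged on a product that normally gets five reviews a day.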

Classification by technique

Rule-based Detectors: expert knowledge

Feature-based Detectors: logistic regression, decision trees

Deep learning-based Detectors: end-to-end
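
The first two families can be contrasted in a few lines. This is only a sketch under stated assumptions: the feature names, thresholds, weights, and bias below are invented for illustration, and in a real feature-based detector the weights would be learned from labeled data rather than written by hand.

```python
import math

# Hypothetical account features (names invented for illustration).
suspicious = {"reviews_per_day": 12.0, "account_age_days": 3.0, "avg_rating": 5.0}
ordinary = {"reviews_per_day": 0.1, "account_age_days": 400.0, "avg_rating": 4.0}

def rule_based(acc):
    """Expert rule: a very new account posting many reviews per day is suspicious."""
    return acc["account_age_days"] < 7 and acc["reviews_per_day"] > 10

# Hypothetical weights; a real model would fit these on labeled examples.
WEIGHTS = {"reviews_per_day": 0.3, "account_age_days": -0.05, "avg_rating": 0.2}
BIAS = -3.0

def feature_based(acc):
    """Logistic-regression-style score in (0, 1) from a weighted sum of features."""
    z = BIAS + sum(WEIGHTS[name] * acc[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

for acc in (suspicious, ordinary):
    print(rule_based(acc), round(feature_based(acc), 3))
```

The rule gives a hard yes or no that an expert can read and audit, while the feature-based score is graded and can be thresholded to trade precision against recall.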

![](http://ww1.sinaimg.cn/large/9b63ed6fgy1gllf2w4ue4j20vo0kr7k6.jpg)

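Yingyongbao (Tencent's app store) and Xianyu (Alibaba's second-hand marketplace)

Complex adversarial behavior, such as services that buy positive reviews for you.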

3. Resource: dataset, toolbox, paper, survey, company, etc.

SafeGraph

https://github.com/safe-graph

An open-source collection of repositories on GitHub.

Dataset

ODDS dataset (Outlier Detection DataSets)

http://odds.cs.stonybrook.edu/

  • It contains five categories of data:
    1. Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
    2. Time series graph datasets for event detection: Temporal graph data where the graph changes dynamically over time in which new nodes and edges arrive or existing nodes and edges disappear.
    3. Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
    4. Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review system. Cyber security data, e.g. intrusion detection with DoS, DDoS etc. attack scenario.
    5. Crowded scene video data for anomaly detection: Video clips acquired with camera.

Bitcoin dataset❗

https://www.kaggle.com/ellipticco/elliptic-data-set

Description

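Bitcoin transactions are treated as nodes, and machine learning is used to classify each transaction as licit or illicit. This is a node-level binary classification problem.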

The Elliptic Data Set maps Bitcoin transactions to real entities belonging to licit categories (exchanges, wallet providers, miners, licit services, etc.) versus illicit ones (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.). The task on the dataset is to classify the illicit and licit nodes in the graph.

Content

This anonymized data set is a transaction graph collected from the Bitcoin blockchain. A node in the graph represents a transaction, an edge can be viewed as a flow of Bitcoins between one transaction and the other. Each node has 166 features and has been labeled as being created by a “licit”, “illicit” or “unknown” entity.

Nodes and edges

  • The graph is made of 203,769 nodes and 234,355 edges.
  • Two percent (4,545) of the nodes are labelled class1 (illicit).
  • Twenty-one percent (42,019) are labelled class2 (licit).
  • The remaining transactions are not labelled with regard to licit versus illicit.
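
The counts above imply the size of the unlabelled set and the class balance, which is worth working out before modeling. The short calculation below uses only the numbers quoted in the list; the helper name `share` is an assumption made for illustration.

```python
TOTAL_NODES = 203_769
ILLICIT = 4_545   # class1
LICIT = 42_019    # class2

# Nodes with no licit/illicit label.
unknown = TOTAL_NODES - ILLICIT - LICIT

def share(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

print("unknown nodes:", unknown)
print("illicit % of all nodes:", share(ILLICIT, TOTAL_NODES))
print("licit % of all nodes:", share(LICIT, TOTAL_NODES))
print("illicit % of labelled nodes:", share(ILLICIT, ILLICIT + LICIT))
```

Fewer than one in ten labelled nodes is illicit, so plain accuracy is a weak metric here; precision, recall, or F1 on the illicit class says more.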

Features

  • There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset.

  • There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.

    time_step

  • The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).
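
The aggregated features described above can be pictured with a toy example. This is a sketch, not the dataset's actual feature pipeline, which is not published: the graph, the single `fee` feature, and the choice of population standard deviation are assumptions, and the correlation coefficients are left out for brevity.

```python
from statistics import pstdev

# Toy directed transaction graph: edge (a, b) means Bitcoin flows from a to b.
edges = [("t1", "t3"), ("t2", "t3"), ("t3", "t4"), ("t3", "t5")]
# One hypothetical local feature per transaction (say, the fee).
fee = {"t1": 1.0, "t2": 3.0, "t3": 2.0, "t4": 4.0, "t5": 8.0}

def one_hop_neighbours(node, edge_list):
    """Neighbours one hop backward (senders) and forward (receivers) of `node`."""
    backward = [a for a, b in edge_list if b == node]
    forward = [b for a, b in edge_list if a == node]
    return backward + forward

def aggregate(node, edge_list, feature):
    """Max, min and std of `feature` over one-hop neighbours (needs at least one)."""
    values = [feature[n] for n in one_hop_neighbours(node, edge_list)]
    return {"max": max(values), "min": min(values), "std": pstdev(values)}

agg = aggregate("t3", edges, fee)
print(agg)
```

Each center node thus gets a fixed-length summary of its neighbourhood, which lets non-graph models such as gradient-boosted trees use some structural information.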

Papers that use this dataset include:

  1. EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

Yelp and Amazon

https://github.com/safe-graph/DGFraud

For GraphConsis, we preprocessed Yelp Spam Review Dataset with reviews as nodes and three relations as edges.

The dataset with .mat format is located at /dataset/YelpChi.zip. The .mat file includes:

  • net_rur, net_rtr, net_rsr: three sparse matrices representing three homo-graphs defined in the GraphConsis paper;
  • features: a sparse matrix of 100-dimension Bag-of-words features;
  • label: a numpy array with the ground truth of nodes. 1 represents spam and 0 represents benign.

To get the complete metadata of the Yelp dataset, please send an email to ytongdou@gmail.com for inquiry.

Baidu PaddlePaddle datasets

Code & Toolbox

Key resources

Scholar

Key people

These scholars all collaborate with industry.

| Name | School | Lab | Link | Overseas/Domestic |
| --- | --- | --- | --- | --- |
| Christos Faloutsos | CMU | | | Overseas |

Company

Key people

Summary and Q&A

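Directions where papers are easier to publish:

  • Discover a new type of fraud (with a new problem, adapting traditional methods a little is often enough to solve it)
  • Improve model efficiency (GNNs are less efficient than traditional models)
  • Model Ensemble (combining and fusing models)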

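Industry:

  • Define the problem clearly and find an appropriate model (are paid posters best modeled with rules, a graph, or features?)
  • Sampling matters (especially for GNNs; a paper from Xianyu is a good reference)
  • Cost & return trade-off
  • Old but good (XGBoost)
  • Early detection is a challenge (it is hard to do well)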
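
On the sampling point, one generic way to keep GNN training tractable is to cap how many neighbours each node contributes. The sketch below shows uniform fixed-size neighbour sampling on a toy adjacency list. It is an assumption-laden illustration, not the method of the Xianyu paper mentioned above, which these notes do not describe; the graph and the name `sample_neighbours` are made up.

```python
import random

# Toy adjacency list; node "hub" has far more neighbours than the others.
graph = {
    "hub": [f"n{i}" for i in range(100)],
    "a": ["hub", "b"],
    "b": ["a"],
}

def sample_neighbours(adj, node, k, rng):
    """Return at most `k` neighbours of `node`, sampled uniformly without replacement."""
    neighbours = adj.get(node, [])
    if len(neighbours) <= k:
        return list(neighbours)
    return rng.sample(neighbours, k)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sampled_hub = sample_neighbours(graph, "hub", 5, rng)
sampled_a = sample_neighbours(graph, "a", 5, rng)
print(len(sampled_hub), sampled_a)
```

Capping the neighbourhood bounds the cost per node regardless of degree, at the price of some variance in the aggregated signal.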

Several of the available datasets come from social networks.

KDD20: spammer adversarial behavior and spamming practical effect

Not deep learning.

Modeling spammers as a dynamic game:

A new evaluation metric:

![](http://ww1.sinaimg.cn/large/9b63ed6fgy1gllgba1z6nj20y20k511g.jpg)

I could not follow this part, so I am skipping it.

SIGIR20&CIKM20: how to apply GNN to fraud detection problems

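camouflage /ˈkæməflɑːʒ/ v. to disguise

A hybrid model, meaning a model with a bit of everything in it.

IP addresses are no longer a reliable signal: through proxies and similar means, fraudsters change the relations between accounts and camouflage themselves successfully. This camouflage poses a challenge for GNNs.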
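
Why camouflage hurts a GNN can be seen with a single mean-aggregation step. This is a minimal sketch under stated assumptions: the account names, the one-dimensional "suspiciousness" feature, and the two neighbour lists are invented, and real GNN layers aggregate learned vector embeddings rather than one number.

```python
# Hypothetical suspiciousness feature per account (1.0 = looks fraudulent).
feature = {"f1": 1.0, "f2": 1.0, "f3": 1.0,
           "b1": 0.0, "b2": 0.0, "b3": 0.0, "b4": 0.0}

def mean_aggregate(node, adj, feat):
    """Average the feature over the node and its neighbours (one GNN-style step)."""
    group = [node] + adj[node]
    return sum(feat[n] for n in group) / len(group)

# Before camouflage: fraudster f1 is linked only to its accomplices.
original_view = {"f1": ["f2", "f3"]}
# After camouflage: f1 adds relations to benign accounts (for example via proxies).
camouflaged_view = {"f1": ["f2", "f3", "b1", "b2", "b3", "b4"]}

before = mean_aggregate("f1", original_view, feature)
after = mean_aggregate("f1", camouflaged_view, feature)
print(before, round(after, 3))
```

The fraudster's own features have not changed, yet its aggregated representation drifts toward the benign accounts it attached itself to, which is the difficulty the model has to overcome.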

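The model is designed to overcome these difficulties.

Three modules address the problem.

Module 1:

Some external knowledge must be introduced to assist training.

Module 2:

There is no need to learn the relations.

Attention mechanisms are slow to train.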


Author: CarlYoung
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit CarlYoung when reposting.