Livestream introduction
https://live.yanxishe.com/room/865


Review 1 and Review 3 were written by paid spammers; Review 1 was even generated automatically by a GAN, which is genuinely impressive.
1. Background: a history of fraud

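Fraud evolves by following the money: wherever there is money, there is fraud. Online finance now holds a lot of money, so financial fraud has appeared there.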

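Fraudsters engage in deception, whereas hackers do not necessarily commit fraud.
Some cases are anomalies rather than fraud.
The difference between fraud and anomaly: an anomaly is objective and defined by the data distribution (an outlier, for example), whereas fraud is subjective and depends on domain knowledge.
Fraud detection is an interdisciplinary field.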
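
To make the anomaly/fraud distinction concrete, here is a minimal sketch using a plain z-score outlier test. The transaction amounts and labels are made up for illustration; they are not from the talk. The point is only that a distribution-based test can flag a legitimate but unusual case while missing a fraud that blends in.

```python
from statistics import mean, pstdev

# Hypothetical transactions with hand-assigned labels (illustrative only).
transactions = [
    {"amount": 20.0, "is_fraud": False},
    {"amount": 25.0, "is_fraud": False},
    {"amount": 22.0, "is_fraud": True},    # fraud that looks statistically normal
    {"amount": 18.0, "is_fraud": False},
    {"amount": 24.0, "is_fraud": False},
    {"amount": 21.0, "is_fraud": False},
    {"amount": 500.0, "is_fraud": False},  # legitimate but statistically unusual
]


def zscore_outliers(rows, threshold=2.0):
    """Flag rows whose amount lies more than `threshold` std devs from the mean."""
    amounts = [r["amount"] for r in rows]
    mu, sigma = mean(amounts), pstdev(amounts)
    return [abs(a - mu) / sigma > threshold for a in amounts]


flags = zscore_outliers(transactions)
for row, is_outlier in zip(transactions, flags):
    print(row["amount"], "outlier" if is_outlier else "normal",
          "fraud" if row["is_fraud"] else "legit")
```

Here the only outlier is the legitimate large purchase, and the fraudulent transaction is not flagged, which is why domain knowledge (labels, rules, relations) is needed on top of outlier detection.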
2. Background: fraud type and fraud detectors

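To do fraud detection, you first need to know what counts as fraud. Today, grouped by the domain in which it occurs, fraud falls into three main categories:
- Social Network
  - Fake Reviews -> GraphConsis
  - Social Bots
  - Misinformation (false information)
  - Disinformation (deliberately false information)
  - Fake Accounts
  - Social Sybils (one party controlling many fake identities)
  - Link Advertising
- Finance (Ant Financial works on this)
  - Insurance Fraud
  - Loan Defaulter -> SemiGNN
  - Money Laundering
  - Malicious Account -> GEM, GeniePath
  - Transaction Fraud
  - Cash-out User (users who fraudulently convert credit into cash)
  - Credit Card Fraud
- Others
  - Advertisement (fake CTR)
  - Mobile Apps (faked download counts)
  - Ecommerce (abuse of coupons and promotions)
  - Crowdturfing (unscrupulous merchants hire people to write positive reviews of their own products, or even to smear competitors)
  - Bitcoin Fraud
  - Game
  - Email, SMS, Phones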

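Classification by modality:
- Content-based Detectors: for example, fake reviews are identified by examining the review content
- Behavior-based Detectors: identify fraud from how behavior changes over time
- Graph-based Detectors

Classification by technique:
- Rule-based Detectors: expert knowledge
- Feature-based Detectors: logistic regression, decision trees
- Deep learning-based Detectors: end-to-end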
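
A small sketch of the difference between the first two technique families, under made-up data: a rule-based detector uses a threshold fixed by an expert, while a feature-based detector learns it from labels. The learner here is a decision stump (a depth-1 decision tree); the feature `reviews_per_day`, the limit of 20, and all values are hypothetical.

```python
# Hypothetical accounts: (reviews_per_day, is_spammer). Values are illustrative.
accounts = [(1, 0), (2, 0), (3, 0), (4, 0), (12, 1), (15, 1), (9, 1), (5, 0)]


def rule_based(reviews_per_day, limit=20):
    """Rule-based: the threshold is set by hand from expert knowledge."""
    return int(reviews_per_day > limit)


def fit_stump(data):
    """Feature-based: learn the threshold (a depth-1 decision tree) from labels."""
    values = sorted({x for x, _ in data})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum(int(x > t) != y for x, y in data)

    return min(candidates, key=errors)


threshold = fit_stump(accounts)
rule_errors = sum(rule_based(x) != y for x, y in accounts)
stump_errors = sum(int(x > threshold) != y for x, y in accounts)
print(threshold, rule_errors, stump_errors)
```

On this toy data the hand-set rule misses every spammer because its limit is too high, while the learned threshold separates the two groups. A deep learning-based detector would go one step further and learn the features themselves end-to-end.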
![7QPSQQ6[W5@DE~74]P1AGC1.png](http://ww1.sinaimg.cn/large/9b63ed6fgy1gllevfl01ej20y60hg45j.jpg)

Tencent MyApp (应用宝) and Xianyu (闲鱼).
Complex adversarial behavior, such as buying positive reviews.
3. Resources: datasets, toolboxes, papers, surveys, companies, etc.
SafeGraph

Dataset
ODDS dataset (Outlier Detection DataSets)
It contains five categories of data:
- Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
- Time series graph datasets for event detection: Temporal graph data where the graph changes dynamically over time in which new nodes and edges arrive or existing nodes and edges disappear.
- Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
- Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review system. Cyber security data, e.g. intrusion detection with DoS, DDoS etc. attack scenario.
- Crowded scene video data for anomaly detection: Video clips acquired with camera.
Bitcoin dataset❗
Description
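Bitcoin transactions are treated as entity nodes, and machine learning is used to classify each transaction as licit or illicit. This is a node-level binary classification problem.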
The Elliptic Data Set maps Bitcoin transactions to real entities belonging to licit categories (exchanges, wallet providers, miners, licit services, etc.) versus illicit ones (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.). The task on the dataset is to classify the illicit and licit nodes in the graph.
Content
This anonymized data set is a transaction graph collected from the Bitcoin blockchain. A node in the graph represents a transaction, an edge can be viewed as a flow of Bitcoins between one transaction and the other. Each node has 166 features and has been labeled as being created by a “licit”, “illicit” or “unknown” entity.
Nodes and edges
- The graph is made of 203,769 nodes and 234,355 edges.
- Two percent (4,545) of the nodes are labelled class1 (illicit).
- Twenty-one percent (42,019) are labelled class2 (licit).
- The remaining transactions are not labelled with regard to licit versus illicit.
Features
There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset.
There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.

The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).
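
A quick arithmetic check of the numbers quoted above, which also shows how imbalanced the labeled part of the Elliptic data is. The counts come from the description; the inverse-frequency class weights at the end are a common heuristic that I am adding as an assumption, not something the dataset authors prescribe.

```python
# Counts quoted in the dataset description above.
N_NODES = 203_769
N_EDGES = 234_355
N_ILLICIT = 4_545   # class1
N_LICIT = 42_019    # class2

n_labeled = N_ILLICIT + N_LICIT
n_unknown = N_NODES - n_labeled

illicit_share_all = N_ILLICIT / N_NODES        # about 2% of all nodes
licit_share_all = N_LICIT / N_NODES            # about 21% of all nodes
illicit_share_labeled = N_ILLICIT / n_labeled  # share among labeled nodes only

# Feature split: 94 local features + 72 aggregated features.
n_features = 94 + 72

# Assumed heuristic: "balanced" weights, n_labeled / (n_classes * class_count).
class_weights = {
    "illicit": n_labeled / (2 * N_ILLICIT),
    "licit": n_labeled / (2 * N_LICIT),
}

print(n_unknown, round(illicit_share_all, 4), round(illicit_share_labeled, 4))
print(n_features, class_weights)
```

Even among labeled nodes, illicit transactions are under 10%, so plain accuracy is a poor metric here and some form of reweighting or resampling is usually worth considering.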
Papers that use this dataset include:
Yelp and Amazon
For GraphConsis, we preprocessed Yelp Spam Review Dataset with reviews as nodes and three relations as edges.
The dataset with .mat format is located at /dataset/YelpChi.zip. The .mat file includes:
- net_rur, net_rtr, net_rsr: three sparse matrices representing three homo-graphs defined in the GraphConsis paper;
- features: a sparse matrix of 100-dimension Bag-of-words features;
- label: a numpy array with the ground truth of nodes. 1 represents spam and 0 represents benign.
To get the complete metadata of the Yelp dataset, please send an email to ytongdou@gmail.com for inquiry.
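
A minimal sketch of reading the .mat file described above, assuming SciPy is installed. The key names (net_rur, net_rtr, net_rsr, features, label) come from the description; the file name `YelpChi.mat` inside the archive and the helper names `load_yelpchi` and `summarize` are my own assumptions.

```python
from scipy.io import loadmat

RELATION_KEYS = ("net_rur", "net_rtr", "net_rsr")


def load_yelpchi(mat_path):
    """Load the three relation graphs, the features, and the labels."""
    data = loadmat(mat_path)
    relations = {name: data[name] for name in RELATION_KEYS}
    features = data["features"]
    # loadmat returns arrays as 2-D, so flatten the label vector.
    labels = data["label"].ravel()
    return relations, features, labels


def summarize(relations, features, labels):
    """Basic sanity statistics: sizes, spam ratio, stored entries per relation."""
    return {
        "n_nodes": features.shape[0],
        "n_features": features.shape[1],
        "spam_ratio": float((labels == 1).mean()),
        "nnz_per_relation": {k: int(v.nnz) for k, v in relations.items()},
    }


if __name__ == "__main__":
    # Assumed path: adjust to wherever the archive was unpacked.
    rel, feat, y = load_yelpchi("YelpChi.mat")
    print(summarize(rel, feat, y))
```

Checking the spam ratio and the number of stored entries per relation first is a cheap way to confirm the file was read as expected before training anything on it.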

Baidu PaddlePaddle datasets
Code & Toolbox
- PyOD
- PyODDS
- Paper & Code list

- KDD 2020 TrueFact Workshop: Making a Credible Web for Tomorrow
- KDD 2020 Workshop on Machine Learning in Finance
- KDD 2020 Deep Anomaly Detection Tutorial
- AI for Anti-Money Laundering Blog
- Awesome Fraud Detection Research Papers.
Scholar

These scholars all collaborate with industry.
| Name | School | Lab Link | Overseas/Domestic |
|---|---|---|---|
| Christos Faloutsos | CMU | | Overseas |
Company

Summary and Q&A

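Promising directions for publishing papers:
- Discover a new type of fraud (for a new problem, adapting traditional methods is often enough to solve it)
- Improve model efficiency (GNNs are less efficient than traditional models)
- Model Ensemble (combining and fusing models)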

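Industry:
- Define the problem clearly and find an appropriate model (for paid spammers, is it best modeled with rules, a graph, or features?)
- Sampling matters (especially for GNNs; a paper from Xianyu (闲鱼) is a good reference)
- Cost & return trade-off
- Old but good (XGBoost)
- Early detection is a challenge

Several of the datasets come from social networks.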
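
To illustrate the cost & return trade-off point from the industry list, here is a small sketch that picks the alert threshold maximizing net return instead of accuracy. The scores, labels, and both cost constants are made-up assumptions.

```python
# Hypothetical model outputs: (fraud_score, is_fraud). Values are illustrative.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1),
    (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0),
]

LOSS_PER_FRAUD = 100.0   # assumed loss prevented when a fraud case is caught
COST_PER_REVIEW = 40.0   # assumed cost of investigating one flagged case


def net_return(threshold, cases):
    """Prevented loss minus review cost when flagging scores >= threshold."""
    flagged = [(s, y) for s, y in cases if s >= threshold]
    prevented = sum(y for _, y in flagged) * LOSS_PER_FRAUD
    return prevented - len(flagged) * COST_PER_REVIEW


def best_threshold(cases):
    """Try every observed score as a threshold and keep the most profitable."""
    candidates = sorted({s for s, _ in cases})
    return max(candidates, key=lambda t: net_return(t, cases))


t = best_threshold(scored)
print(t, net_return(t, scored))
```

With these numbers the best threshold deliberately lets the low-scoring fraud case through, because catching it would require reviewing too many legitimate cases. Changing the two cost constants moves the threshold, which is the trade-off in practice.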
KDD20: spammer adversarial behavior and spamming practical effect
Not based on deep learning.

Modeling spammers as a dynamic game:

A new evaluation metric:









I couldn't follow this part, so I skipped it.
SIGIR20&CIKM20: how to apply GNN to fraud detection problems


camouflage /ˈkæməflɑːʒ/ v. to disguise

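A hybrid model, i.e., one that has a bit of everything in it.
IP addresses are no longer a reliable signal: by using proxies and similar means, fraudsters change the relations between accounts and thereby camouflage themselves successfully. This camouflage poses a challenge for GNNs.
The goal of the model design is to overcome these difficulties.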
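
To show why relation camouflage hurts neighbor aggregation, here is a generic sketch that drops dissimilar neighbors before averaging. This is my own illustration of one possible mitigation, not the model presented in the talk; the cosine threshold `min_sim`, the toy features, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filtered_mean_aggregate(node, neighbors, features, min_sim=0.5):
    """Average the center node with only those neighbors that resemble it."""
    kept = [n for n in neighbors
            if cosine(features[node], features[n]) >= min_sim]
    rows = [features[node]] + [features[n] for n in kept]
    dim = len(features[node])
    return kept, [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


# Hypothetical toy graph: node 0 is a fraudster that links to benign
# nodes 2 and 3 to blend in; node 1 is another fraudster.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
neighbors_of_0 = [1, 2, 3]

kept, h0 = filtered_mean_aggregate(0, neighbors_of_0, features)
print(kept, h0)
```

Without the filter, the mean over all three neighbors would pull node 0's representation toward the benign nodes. Note that raw feature similarity can itself be fooled when fraudsters also disguise their features, so a real system would need a more robust similarity signal.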
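Three modules address the problem.
The first module:
Some external knowledge must be introduced to assist training.
The second module:
There is no need to learn the relations.
Attention mechanisms are slow to train.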