Datasets
DBLP
We use the pre-processed DBLP dataset from Jhy1993/HAN. You can run FdGars, Player2Vec, GeniePath and GEM on the DBLP dataset. Unzip the archive before using the dataset:
cd dataset
unzip DBLP4057_GAT_with_idx_tra200_val_800.zip
We extract a subset of DBLP that contains 14328 papers (P), 4057 authors (A), 20 conferences (C), and 8789 terms (T). The authors are divided into four areas: database, data mining, machine learning, and information retrieval. We label each author's research area according to the conferences they submitted to. Author features are the elements of a bag-of-words representation of keywords.
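The dataset contains the following entity types: papers, authors, conferences, and terms.
- label indicates the research area each author belongs to; it is the target of our multi-class classification.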
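To make the bag-of-words author features more concrete, here is a minimal sketch. The vocabulary and keywords are toy values invented for illustration, and `bag_of_words` is a hypothetical helper, not part of this repository. The description above does not say whether the real features are term counts or binary indicators; this sketch uses counts.

```python
def bag_of_words(keywords, vocabulary):
    """Return a count vector with one entry per vocabulary term.

    Hypothetical helper for illustration only; the actual DBLP
    features are shipped pre-computed in the archive.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for keyword in keywords:
        position = index.get(keyword)
        if position is not None:  # ignore keywords outside the vocabulary
            vector[position] += 1
    return vector


# Toy example: one author described by three keywords.
toy_vocabulary = ["data", "graph", "mining"]
toy_author_keywords = ["graph", "mining", "graph"]
author_features = bag_of_words(toy_author_keywords, toy_vocabulary)
```

Each author row in the feature matrix would then be one such vector, with the author's label as the classification target.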
Example dataset
We implement example graphs for SemiGNN, GAS and GEM in data_loader.py, because those models require unique graph structures or node types that cannot be found in open-source datasets.
Yelp dataset
For GraphConsis, we preprocessed the Yelp Spam Review Dataset, with reviews as nodes and three relations as edges.
The dataset in .mat format is located at /dataset/YelpChi.zip. The .mat file includes:
- net_rur, net_rtr, net_rsr: three sparse matrices representing the three homo-graphs defined in the GraphConsis paper;
- features: a sparse matrix of 100-dimension bag-of-words features;
- label: a numpy array with the ground truth of nodes. 1 represents spam and 0 represents benign.
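The fields listed above can be read with SciPy roughly as follows. This is a sketch under assumptions, not the repository's own loader: `load_yelpchi` and `summarize_labels` are hypothetical names, the name of the .mat file inside the archive is not stated here (so the first `.mat` member is used), and the archive path in the usage comment is assumed to be relative to the repository root.

```python
import io
import zipfile
from collections import Counter


def load_yelpchi(zip_path):
    """Read the .mat file inside the archive and return its documented fields."""
    # Imported here so the label helper below works without SciPy installed.
    from scipy.io import loadmat

    with zipfile.ZipFile(zip_path) as archive:
        mat_names = [n for n in archive.namelist() if n.endswith(".mat")]
        if not mat_names:
            raise FileNotFoundError(f"no .mat file found in {zip_path}")
        # loadmat accepts a file-like object, so no extraction to disk is needed.
        data = loadmat(io.BytesIO(archive.read(mat_names[0])))

    relations = {key: data[key] for key in ("net_rur", "net_rtr", "net_rsr")}
    features = data["features"]
    labels = data["label"].ravel()  # flatten to a 1-D array of 0/1 labels
    return relations, features, labels


def summarize_labels(labels):
    """Count spam (1) and benign (0) nodes in a 1-D sequence of labels."""
    counts = Counter(int(value) for value in labels)
    return {"spam": counts[1], "benign": counts[0]}


# Usage (assumed path, adjust to your checkout):
# relations, features, labels = load_yelpchi("dataset/YelpChi.zip")
# print(features.shape, summarize_labels(labels))
```

Checking the label counts after loading is a quick way to confirm that the file was read correctly and to see how imbalanced the spam and benign classes are.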
To get the complete metadata of the Yelp dataset, please send an email to ytongdou@gmail.com for inquiry.