import snap
Question 1. Analyzing the Wikipedia voters network [27 points]
Download the Wikipedia voting network wiki-Vote.txt.gz: http://snap.stanford.edu/data/wiki-Vote.html.
Using one of the network analysis tools above, load the Wikipedia voting network. Note that Wikipedia is a directed network. Formally, we consider the Wikipedia network as a directed graph G = (V, E), with node set V and edge set E ⊂ V × V where (edges are ordered pairs of nodes). An edge (a, b) ∈ E means that user a voted on user b.
To make our questions clearer, we will use the following small graph as a running example: Gsmall = (Vsmall, Esmall), where Vsmall = {1, 2, 3} and Esmall = {(1, 2), (2, 1), (1, 3), (1, 1)}.
Compute and print out the following statistics for the wiki-Vote network:
0. 从txt中读取图模型
G = snap.LoadEdgeList(snap.PNGraph,"Wiki-Vote.txt",0,1)
1. The number of nodes in the network. (Gsmall has 3 nodes.)
numOfNodes = G.GetNodes()
print("The number of nodes in the network is %d"%(numOfNodes))
The number of nodes in the network is 7115
2. The number of nodes with a self-edge (self-loop), i.e., the number of nodes a ∈ V where (a, a) ∈ E. (Gsmall has 1 self-edge.)
numOfSelfLoopNodes = snap.CntSelfEdges(G)
print("The number of nodes with a self-edge is %d"%(numOfSelfLoopNodes))
The number of nodes with a self-edge is 0
3. The number of directed edges in the network, i.e., the number of ordered pairs (a, b) ∈ E for which a = b. (Gsmall has 3 directed edges.)
numOfEdges = snap.CntUniqDirEdges(G)
print("The number of directed edges in the network is %d"%(numOfEdges))
The number of directed edges in the network is 103689
4. The number of undirected edges in the network, i.e., the number of unique unordered pairs (a, b), a = b, for which (a, b) ∈ E or (b, a) ∈ E (or both). If both (a, b) and (b, a) are edges, this counts a single undirected edge. (Gsmall has 2 undirected edges.)
print("The number of undirected edges in the network is %d"%(snap.CntUniqUndirEdges(G)))
The number of undirected edges in the network is 100762
5. The number of reciprocated edges in the network, i.e., the number of unique unordered pairs of nodes (a, b), a = b, for which (a, b) ∈ E and (b, a) ∈ E. (Gsmall has 1 reciprocated edge.)
print("The number of reciprocated edges in the network is %d"%(snap.CntUniqBiDirEdges(G)))
The number of reciprocated edges in the network is 2927
6. The number of nodes of zero out-degree. (Gsmall has 1 node with zero out-degree.)
snap.CntOutDegNodes(G,0)
print("The number of nodes of zero out-degree is %d"%(snap.CntOutDegNodes(G,0)))
The number of nodes of zero out-degree is 1005
7. The number of nodes of zero in-degree. (Gsmall has 0 nodes with zero in-degree.)
snap.CntInDegNodes(G,0)
print("The number of nodes of zero in-degree is %d"%(snap.CntInDegNodes(G,0)))
The number of nodes of zero in-degree is 4734
8. The number of nodes with more than 10 outgoing edges (out-degree > 10).
cnt = snap.TInt(0)
for outDegree in range(0,11):
cnt += snap.CntOutDegNodes(G,outDegree)
G.GetNodes()-cnt
print("The number of nodes with more than 10 outgoing edges is %d"%(G.GetNodes()-cnt))
The number of nodes with more than 10 outgoing edges is 1612
9. The number of nodes with fewer than 10 incoming edges (in-degree < 10).
cnt = snap.TInt(0)
for inDegree in range(0,10):
cnt += snap.CntInDegNodes(G,inDegree)
cnt
print("The number of nodes with fewer than 10 incoming edges is %d"%(cnt))
The number of nodes with fewer than 10 incoming edges is 5165
关于第8小问和第9小问也可以像下面这样写:
DegToCntV = snap.TIntPrV()
snap.GetOutDegCnt(G, DegToCntV)
Out10_cnt = 0
for item in DegToCntV:
if item.GetVal1() > 10:
Out10_cnt += 1
print('There are {} nodes having >10 out-degree'.format(Out10_cnt))
There are 227 nodes having >10 out-degree
DegToCntV = snap.TIntPrV()
snap.GetInDegCnt(G, DegToCntV)
In10_cnt = 0
for item in DegToCntV:
if item.GetVal1() < 10:
In10_cnt += 1
print('There are {} nodes having <10 in-degree'.format(In10_cnt))
There are 10 nodes having <10 in-degree
Question 2. Further Analyzing the Wikipedia voters network [33 points]
For this problem, we use the Wikipedia voters network. If you are using Python, you might
want to use NumPy, SciPy, and/or Matplotlib libraries.
- (18 points) Plot the distribution of out-degrees of nodes in the network on a log-log scale.
Each data point is a pair (x; y) where x is a positive integer and y is the number of nodes
in the network with out-degree equal to x. Restrict the range of x between the minimum
and maximum out-degrees. You may filter out data points with a 0 entry. For the log-log
scale, use base 10 for both x and y axes.
OutDegV = snap.TIntPrV()
snap.GetOutDegCnt(G,OutDegV)
# 可以使用自带的方法绘制图像~
snap.PlotOutDegDistr(G,"Q2-1","Directed graph - out-degree Distribution")
- (15 points) Compute and plot the
least-square regression
line for the out-degree distribution in the log-log scale plot. Note we want to find coefficients $a$ and $b$ such that the function $log_{10} y = a · log_{10} x + b$, equivalently, $y = 10^b · x^a$, best fits the out-degree distribution. What are the coefficients a and b? For this part, you might want to use the method calledpolyfit
in NumPy withdeg
parameter equal to 1.
import numpy as np
# 需要将OutDegV里面的数字对都取log
x=[]
y=[]
for pair in OutDegV:
if(pair.GetVal1()>0 and pair.GetVal2()>0):#剔除0 entry
x.append(pair.GetVal1())
y.append(pair.GetVal2())
logx=np.log10(x)
logy=np.log10(y)
#获取斜率和截距
slope,intercept=np.polyfit(logx,logy,1)
print("The best fit line is given by equation "
"log(y) = %s * log(x) + %s" % (slope, intercept))
The best fit line is given by equation log(y) = -1.2810647056745657 * log(x) + 3.1324547044999123
#接下来就是绘制图像了
import matplotlib.pyplot as plt
predict=lambda x:10**(intercept)*x**slope
fig = plt.plot(x,y,'bo--',
x,predict(x),'g',
markersize=2)[0]
fig.axes.set_xscale('log')
fig.axes.set_yscale('log')
fig.axes.set_xlim(min(x),max(x))
fig.axes.set_ylim(min(y),max(y))
fig.axes.set_title("Log-Log Degree Distribution Plot And Fit for WikiGraph")
fig.axes.set_xlabel("Node Degree")
fig.axes.set_ylabel("Node Count")
Text(0, 0.5, 'Node Count')
Question 3. Finding Experts on the Java Programming Language on StackOverow [40 points]
Download the StackOverflow network stackoverflow-Java.txt.gz: http://snap.stanford.edu/class/cs224w-data/hw0/stackoverflow-Java.txt.gz. An edge (a; b) in the network
means that person a endorsed an answer from person b on a Java-related question.
Using one of the network analysis tools above, load the StackOverflow network. Note that StackOverflow is a directed network.
Compute and print out the following statistics for the stackoverflow-Java network:
GS = snap.LoadEdgeList(snap.PNGraph,"stackoverflow-Java.txt",0,1)
- The number of weakly connected components in the network. This value can be calculated in Snap.py via function GetWccs.
Components = snap.TCnComV()
snap.GetWccs(GS,Components)
print("There are %d weakly connected components in the network."%(Components.Len()))
There are 10143 weakly connected components in the network.
- The number of edges and the number of nodes in the largest weakly connected component.The largest weakly connected component is calculated in Snap.py with function GetMxWcc.
GS_MxWcc = snap.GetMxWcc(GS)
print("There are %d edges in the largest weakly connected component."%(GS_MxWcc.GetEdges()))
print("There are %d nodes in the largest weakly connected component."%(GS_MxWcc.GetNodes()))
There are 322486 edges in the largest weakly connected component.
There are 131188 nodes in the largest weakly connected component.
- IDs of the top 3 most central nodes in the network by PagePank scores. PageRank scores are calculated in Snap.py with function GetPageRank.
PRankH=snap.TIntFltH()
snap.GetPageRank(GS,PRankH)
PRankH.SortByDat(False)
cnt=1
for key in PRankH:
if(cnt>3):
break
print("The ID of Top %d most central nodes is %d"%(cnt,key))
cnt+=1
The ID of Top 1 most central nodes is 992484
The ID of Top 2 most central nodes is 135152
The ID of Top 3 most central nodes is 22656
- IDs of the top 3 hubs and top 3 authorities in the network by HITS scores. HITS scores are calculated in Snap.py with function GetHits.
NIdHubH = snap.TIntFltH()
NIdAuthH = snap.TIntFltH()
snap.GetHits(GS, NIdHubH, NIdAuthH)
NIdHubH.SortByDat(False)
cnt=1
for key in NIdHubH:
if(cnt>3):
break
print("The ID of Top %d hubs is %d"%(cnt,key))
cnt+=1
The ID of Top 1 hubs is 892029
The ID of Top 2 hubs is 1194415
The ID of Top 3 hubs is 359862
NIdAuthH.SortByDat(False)
cnt=1
for key in NIdAuthH:
if(cnt>3):
break
print("The ID of Top %d authorities is %d"%(cnt,key))
cnt+=1
The ID of Top 1 authorities is 22656
The ID of Top 2 authorities is 157882
The ID of Top 3 authorities is 571407