【斯坦福cs224w-图机器学习】homework 0


import snap

Question 1. Analyzing the Wikipedia voters network [27 points]

Download the Wikipedia voting network wiki-Vote.txt.gz: http://snap.stanford.edu/data/wiki-Vote.html.

Using one of the network analysis tools above, load the Wikipedia voting network. Note that Wikipedia is a directed network. Formally, we consider the Wikipedia network as a directed graph G = (V, E), with node set V and edge set E ⊂ V × V where (edges are ordered pairs of nodes). An edge (a, b) ∈ E means that user a voted on user b.

To make our questions clearer, we will use the following small graph as a running example: Gsmall = (Vsmall, Esmall), where Vsmall = {1, 2, 3} and Esmall = {(1, 2), (2, 1), (1, 3), (1, 1)}.

Compute and print out the following statistics for the wiki-Vote network:

0. 从txt中读取图模型

G = snap.LoadEdgeList(snap.PNGraph,"Wiki-Vote.txt",0,1)

1. The number of nodes in the network. (Gsmall has 3 nodes.)

numOfNodes = G.GetNodes()
print("The number of nodes in the network is %d"%(numOfNodes))
The number of nodes in the network is 7115

2. The number of nodes with a self-edge (self-loop), i.e., the number of nodes a ∈ V where (a, a) ∈ E. (Gsmall has 1 self-edge.)

numOfSelfLoopNodes = snap.CntSelfEdges(G)
print("The number of nodes with a self-edge is %d"%(numOfSelfLoopNodes))
The number of nodes with a self-edge is 0

3. The number of directed edges in the network, i.e., the number of ordered pairs (a, b) ∈ E for which a = b. (Gsmall has 3 directed edges.)

numOfEdges = snap.CntUniqDirEdges(G)
print("The number of directed edges in the network is %d"%(numOfEdges))
The number of directed edges in the network is 103689

4. The number of undirected edges in the network, i.e., the number of unique unordered pairs (a, b), a = b, for which (a, b) ∈ E or (b, a) ∈ E (or both). If both (a, b) and (b, a) are edges, this counts a single undirected edge. (Gsmall has 2 undirected edges.)

print("The number of undirected edges in the network is %d"%(snap.CntUniqUndirEdges(G)))
The number of undirected edges in the network is 100762

5. The number of reciprocated edges in the network, i.e., the number of unique unordered pairs of nodes (a, b), a = b, for which (a, b) ∈ E and (b, a) ∈ E. (Gsmall has 1 reciprocated edge.)

print("The number of reciprocated edges in the network is %d"%(snap.CntUniqBiDirEdges(G)))
The number of reciprocated edges in the network is 2927

6. The number of nodes of zero out-degree. (Gsmall has 1 node with zero out-degree.)

snap.CntOutDegNodes(G,0)
print("The number of nodes of zero out-degree is %d"%(snap.CntOutDegNodes(G,0)))
The number of nodes of zero out-degree is 1005

7. The number of nodes of zero in-degree. (Gsmall has 0 nodes with zero in-degree.)

snap.CntInDegNodes(G,0)
print("The number of nodes of zero in-degree is %d"%(snap.CntInDegNodes(G,0)))
The number of nodes of zero in-degree is 4734

8. The number of nodes with more than 10 outgoing edges (out-degree > 10).

cnt = snap.TInt(0)
for outDegree in range(0,11):
    cnt += snap.CntOutDegNodes(G,outDegree)
G.GetNodes()-cnt
print("The number of nodes with more than 10 outgoing edges is %d"%(G.GetNodes()-cnt))
The number of nodes with more than 10 outgoing edges is 1612

9. The number of nodes with fewer than 10 incoming edges (in-degree < 10).

cnt = snap.TInt(0)
for inDegree in range(0,10):
    cnt += snap.CntInDegNodes(G,inDegree)
cnt
print("The number of nodes with fewer than 10 incoming edges is %d"%(cnt))
The number of nodes with fewer than 10 incoming edges is 5165

关于第8小问和第9小问也可以像下面这样写:

DegToCntV = snap.TIntPrV()
snap.GetOutDegCnt(G, DegToCntV)
Out10_cnt = 0
for item in DegToCntV:
    if item.GetVal1() > 10:
        Out10_cnt += 1
print('There are {} nodes having >10 out-degree'.format(Out10_cnt))
There are 227 nodes having >10 out-degree
DegToCntV = snap.TIntPrV()
snap.GetInDegCnt(G, DegToCntV)
In10_cnt = 0
for item in DegToCntV:
    if item.GetVal1() < 10:
        In10_cnt += 1
print('There are {} nodes having <10 in-degree'.format(In10_cnt))
There are 10 nodes having <10 in-degree

Question 2. Further Analyzing the Wikipedia voters network [33 points]

For this problem, we use the Wikipedia voters network. If you are using Python, you might
want to use NumPy, SciPy, and/or Matplotlib libraries.

  1. (18 points) Plot the distribution of out-degrees of nodes in the network on a log-log scale.
    Each data point is a pair (x; y) where x is a positive integer and y is the number of nodes
    in the network with out-degree equal to x. Restrict the range of x between the minimum
    and maximum out-degrees. You may filter out data points with a 0 entry. For the log-log
    scale, use base 10 for both x and y axes.
OutDegV = snap.TIntPrV()
snap.GetOutDegCnt(G,OutDegV)
# 可以使用自带的方法绘制图像~
snap.PlotOutDegDistr(G,"Q2-1","Directed graph - out-degree Distribution")
  1. (15 points) Compute and plot the least-square regression line for the out-degree distribution in the log-log scale plot. Note we want to find coefficients $a$ and $b$ such that the function $log_{10} y = a · log_{10} x + b$, equivalently, $y = 10^b · x^a$, best fits the out-degree distribution. What are the coefficients a and b? For this part, you might want to use the method called polyfit in NumPy with deg parameter equal to 1.
import numpy as np
# 需要将OutDegV里面的数字对都取log
x=[]
y=[]
for pair in OutDegV:
    if(pair.GetVal1()>0 and pair.GetVal2()>0):#剔除0 entry
        x.append(pair.GetVal1())
        y.append(pair.GetVal2())
logx=np.log10(x)
logy=np.log10(y)
#获取斜率和截距
slope,intercept=np.polyfit(logx,logy,1)
print("The best fit line is given by equation "
      "log(y) = %s * log(x) + %s" % (slope, intercept))
The best fit line is given by equation log(y) = -1.2810647056745657 * log(x) + 3.1324547044999123
#接下来就是绘制图像了
import matplotlib.pyplot as plt
predict=lambda x:10**(intercept)*x**slope
fig = plt.plot(x,y,'bo--',
              x,predict(x),'g',
              markersize=2)[0]
fig.axes.set_xscale('log')
fig.axes.set_yscale('log')
fig.axes.set_xlim(min(x),max(x))
fig.axes.set_ylim(min(y),max(y))
fig.axes.set_title("Log-Log Degree Distribution Plot And Fit for WikiGraph")
fig.axes.set_xlabel("Node Degree")
fig.axes.set_ylabel("Node Count")
Text(0, 0.5, 'Node Count')

output_36_1.png

Question 3. Finding Experts on the Java Programming Language on StackOverow [40 points]

Download the StackOverflow network stackoverflow-Java.txt.gz: http://snap.stanford.edu/class/cs224w-data/hw0/stackoverflow-Java.txt.gz. An edge (a; b) in the network
means that person a endorsed an answer from person b on a Java-related question.

Using one of the network analysis tools above, load the StackOverflow network. Note that StackOverflow is a directed network.

Compute and print out the following statistics for the stackoverflow-Java network:

GS = snap.LoadEdgeList(snap.PNGraph,"stackoverflow-Java.txt",0,1)
  1. The number of weakly connected components in the network. This value can be calculated in Snap.py via function GetWccs.
Components = snap.TCnComV()
snap.GetWccs(GS,Components)
print("There are %d weakly connected components in the network."%(Components.Len()))
There are 10143 weakly connected components in the network.
  1. The number of edges and the number of nodes in the largest weakly connected component.The largest weakly connected component is calculated in Snap.py with function GetMxWcc.
GS_MxWcc = snap.GetMxWcc(GS)
print("There are %d edges in the largest weakly connected component."%(GS_MxWcc.GetEdges()))
print("There are %d nodes in the largest weakly connected component."%(GS_MxWcc.GetNodes()))
There are 322486 edges in the largest weakly connected component.
There are 131188 nodes in the largest weakly connected component.
  1. IDs of the top 3 most central nodes in the network by PagePank scores. PageRank scores are calculated in Snap.py with function GetPageRank.
PRankH=snap.TIntFltH()
snap.GetPageRank(GS,PRankH)
PRankH.SortByDat(False)
cnt=1
for key in PRankH:
    if(cnt>3):
        break
    print("The ID of Top %d most central nodes is %d"%(cnt,key))
    cnt+=1
The ID of Top 1 most central nodes is 992484
The ID of Top 2 most central nodes is 135152
The ID of Top 3 most central nodes is 22656
  1. IDs of the top 3 hubs and top 3 authorities in the network by HITS scores. HITS scores are calculated in Snap.py with function GetHits.
NIdHubH = snap.TIntFltH()
NIdAuthH = snap.TIntFltH()
snap.GetHits(GS, NIdHubH, NIdAuthH)
NIdHubH.SortByDat(False)
cnt=1
for key in NIdHubH:
    if(cnt>3):
        break
    print("The ID of Top %d hubs is %d"%(cnt,key))
    cnt+=1
The ID of Top 1 hubs is 892029
The ID of Top 2 hubs is 1194415
The ID of Top 3 hubs is 359862
NIdAuthH.SortByDat(False)
cnt=1
for key in NIdAuthH:
    if(cnt>3):
        break
    print("The ID of Top %d authorities is %d"%(cnt,key))
    cnt+=1
The ID of Top 1 authorities is 22656
The ID of Top 2 authorities is 157882
The ID of Top 3 authorities is 571407

参考资料

  1. LFhase/Learning_CS224w
  2. kandluis/cs224w

文章作者: CarlYoung
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 CarlYoung !
  目录