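karateclub is a Swiss Army knife for small-scale graph mining research. It performs unsupervised learning on graph-structured data.

  • First, it can compute feature vectors (embeddings) for nodes and for whole graphs.
  • Second, it includes a variety of overlapping and non-overlapping community detection methods.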


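Data format

karateclub assumes that the NetworkX graph supplied for node embedding and community detection has the following important properties:

  • Nodes are indexed with integers.
  • Node indexing starts at zero and the indices are consecutive.

Node attribute matrices can be supplied as scipy sparse matrices or numpy arrays. The returned community membership dictionaries and embedding matrices use the same consecutive numeric indexing.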
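
If a graph does not already meet these requirements (for example, its nodes are labelled with strings), it can be relabelled before being passed to karateclub. The sketch below uses networkx's convert_node_labels_to_integers. The helper name ensure_consecutive_index, the attribute name original_label, and the toy graph are illustrative assumptions, not part of karateclub.

```python
import networkx as nx


def ensure_consecutive_index(graph):
    """Return a graph whose nodes are labelled 0..n-1.

    Hypothetical helper for illustration; not part of karateclub.
    """
    expected = list(range(graph.number_of_nodes()))
    try:
        already_ok = sorted(graph.nodes()) == expected
    except TypeError:
        # Mixed or unorderable labels cannot form a 0..n-1 index
        already_ok = False
    if already_ok:
        return graph
    # Relabel and keep each old label in the "original_label" node attribute
    return nx.convert_node_labels_to_integers(
        graph,
        first_label=0,
        ordering="default",
        label_attribute="original_label",
    )


# Toy graph with string labels (assumed example data)
toy = nx.Graph([("a", "b"), ("b", "c")])
fixed = ensure_consecutive_index(toy)
print(sorted(fixed.nodes()))
```

Keeping the old labels as a node attribute makes it possible to map community memberships and embedding rows back to the original node names afterwards.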


Installation

pip3 install karateclub

Preparing the data

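import pandas as pd

# Load the edge list; each row is one edge (src, tgt)
df = pd.read_csv('karate_club_graph.csv')

print(df.columns)
print()
print(df.head().to_markdown())  # to_markdown() requires the tabulate package
print()

# Turn the two columns into a list of (src, tgt) tuples
edges = list(zip(df['src'], df['tgt']))
print(edges)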

Run

Index(['src', 'tgt'], dtype='object')

|    |   src |   tgt |
|---:|------:|------:|
|  0 |     0 |     1 |
|  1 |     0 |     2 |
|  2 |     0 |     3 |
|  3 |     0 |     4 |
|  4 |     0 |     5 |

[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 10), (0, 11), (0, 12), (0, 13), (0, 17), (0, 19), (0, 21), (0, 31), (1, 2), (1, 3), (1, 7), (1, 13), (1, 17), (1, 19), (1, 21), (1, 30), (2, 3), (2, 7), (2, 8), (2, 9), (2, 13), (2, 27), (2, 28), (2, 32), (3, 7), (3, 12), (3, 13), (4, 6), (4, 10), (5, 6), (5, 10), (5, 16), (6, 16), (8, 30), (8, 32), (8, 33), (9, 33), (13, 33), (14, 32), (14, 33), (15, 32), (15, 33), (18, 32), (18, 33), (19, 33), (20, 32), (20, 33), (22, 32), (22, 33), (23, 25), (23, 27), (23, 29), (23, 32), (23, 33), (24, 25), (24, 27), (24, 31), (25, 31), (26, 29), (26, 33), (27, 33), (28, 31), (28, 33), (29, 32), (29, 33), (30, 32), (30, 33), (31, 32), (31, 33), (32, 33)]

import networkx as nx

graph = nx.Graph()
graph.add_edges_from(edges)
nx.draw(graph)

Run


[Figure: the karate club graph as drawn by nx.draw]
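
If the CSV file is not at hand, networkx ships a built-in copy of Zachary's karate club network with 34 nodes and 78 edges. The sketch below loads it and checks the indexing requirements from the data format section. Whether it matches your CSV edge for edge is an assumption worth verifying against your own file; the variable names are illustrative.

```python
import networkx as nx

# Built-in copy of Zachary's karate club network
builtin_graph = nx.karate_club_graph()
print(builtin_graph.number_of_nodes(), builtin_graph.number_of_edges())

# karateclub expects node labels 0..n-1 with no gaps
n = builtin_graph.number_of_nodes()
is_consecutive = sorted(builtin_graph.nodes()) == list(range(n))
print(is_consecutive)
```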


Community detection

Now let's use the LabelPropagation algorithm to discover the community structure of the network.

from karateclub import LabelPropagation

model = LabelPropagation()
model.fit(graph)
cluster_membership = model.get_memberships()
cluster_membership

Run

{23: 8,
 33: 8,
 5: 10,
 7: 1,
 28: 31,
 4: 10,
 3: 1,
 31: 31,
 20: 8,
 19: 1,
 6: 10,
 32: 8,
 29: 8,
 9: 1,
 14: 8,
 2: 1,
 0: 1,
 17: 1,
 25: 31,
 22: 8,
 11: 1,
 13: 1,
 1: 1,
 24: 31,
 15: 8,
 18: 8,
 26: 8,
 27: 8,
 16: 10,
 12: 1,
 30: 8,
 21: 1,
 8: 8,
 10: 10}

In this 34-node graph, four communities were found, labelled 1, 8, 10, and 31.
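
The membership dictionary maps each node to a community label, so reading off the communities themselves requires inverting it. A minimal sketch, using the dictionary copied from the output above (in a live session the variable returned by get_memberships() can be used directly):

```python
from collections import defaultdict

# Membership dictionary copied from the output above (node -> community label)
cluster_membership = {
    23: 8, 33: 8, 5: 10, 7: 1, 28: 31, 4: 10, 3: 1, 31: 31, 20: 8,
    19: 1, 6: 10, 32: 8, 29: 8, 9: 1, 14: 8, 2: 1, 0: 1, 17: 1,
    25: 31, 22: 8, 11: 1, 13: 1, 1: 1, 24: 31, 15: 8, 18: 8, 26: 8,
    27: 8, 16: 10, 12: 1, 30: 8, 21: 1, 8: 8, 10: 10,
}

# Invert the mapping: community label -> list of member nodes
communities = defaultdict(list)
for node, label in cluster_membership.items():
    communities[label].append(node)

for label in sorted(communities):
    members = sorted(communities[label])
    print(f"community {label}: {len(members)} nodes -> {members}")
```

For this output the four communities contain 12, 13, 5, and 4 nodes (labels 1, 8, 10, and 31 respectively).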


Node embeddings

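Next we compute a vector for each node. We fit a Diff2Vec node embedding of the graph with a low number of dimensions, few diffusions per source node, and short Euler walks.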

from karateclub import Diff2Vec

model = Diff2Vec(diffusion_number=2, 
                 diffusion_cover=20, 
                 dimensions=5)

model.fit(graph)
X = model.get_embedding()
X.shape

Run

(34, 5)

X

Run

array([[ 1.3687179 , -0.33502993, -0.3294797 ,  0.40154558,  1.0270709 ],
       [ 0.88167036, -0.3201618 , -0.34293872,  0.41519755,  0.71964073],
       [ 0.8756805 , -0.21934716, -0.33261183,  0.33785722,  0.51631075],
       [ 0.9768452 , -0.39260587, -0.39460638,  0.28851682,  0.8665034 ],
       [ 0.4809215 , -0.28729865, -0.19276802,  0.22588767,  0.07305563],
       [ 0.5580538 , -0.28137547, -0.1947159 ,  0.23712516,  0.49257705],
       [ 0.23477663,  0.04262228,  0.07154325,  0.02909669,  0.33999097],
       [ 1.1882199 , -0.21742308, -0.26985615,  0.44171503,  0.6679048 ],
       [ 1.0287609 , -0.27409104, -0.04119629,  0.30143994,  0.704676  ],
       [ 0.5700088 , -0.26341844,  0.01560158, -0.08039217,  0.41796318],
       [ 0.5753763 , -0.2242508 , -0.1795436 ,  0.0705331 ,  0.46571913],
       [ 0.46763912, -0.17108741, -0.22459361,  0.03058788,  0.05998428],
       [ 0.5500626 , -0.12745889, -0.28661036,  0.16889155,  0.48200938],
       [ 0.6217582 , -0.10251168, -0.0713837 ,  0.13550574,  0.60422456],
       [ 0.9797377 , -0.46282482, -0.09380057,  0.2749968 ,  0.7020155 ],
       [ 0.38830167, -0.30841848, -0.20950563, -0.02130592,  0.0836651 ],
       [ 0.57225037, -0.04150235, -0.1246101 ,  0.06918757,  0.23083903],
       [ 0.6431406 , -0.04898892, -0.05708801,  0.1311793 ,  0.46377632],
       [ 0.541667  , -0.16031542, -0.33119023,  0.10385639,  0.39525154],
       [ 0.65543544, -0.27534947, -0.28757   ,  0.2080029 ,  0.5288213 ],
       [ 0.46381798, -0.07729273, -0.09209982,  0.11292508,  0.36836028],
       [ 0.53826964, -0.09915172, -0.09243581,  0.15036733,  0.5449071 ],
       [ 0.31599265, -0.22078821, -0.02872767,  0.07436654,  0.28573534],
       [ 1.0706906 , -0.27783617, -0.16653039,  0.2631594 ,  0.6408689 ],
       [ 0.67875004, -0.34441757, -0.10262538,  0.2588695 ,  0.38405937],
       [ 0.41786563, -0.10344986, -0.19508548,  0.19657765,  0.22006002],
       [ 0.7855942 , -0.27200857,  0.02204541,  0.09168041,  0.42220354],
       [ 0.7773458 , -0.11727296, -0.24145149,  0.04537854,  0.5737133 ],
       [ 0.75732976, -0.314953  , -0.15383345,  0.02065313,  0.51843405],
       [ 0.7226543 , -0.31919608, -0.18878649,  0.15413427,  0.42012522],
       [ 0.43411565, -0.17342259, -0.28042233,  0.26853496,  0.49947587],
       [ 1.1565564 , -0.36802933, -0.12613232,  0.32381424,  0.75113887],
       [ 1.1192797 , -0.162529  , -0.17195942,  0.39265418,  0.83656436],
       [ 1.2231556 , -0.5336606 , -0.14015286,  0.14054438,  0.5695296 ]],
      dtype=float32)
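
Because row i of the embedding matrix corresponds to node i, the embedding can be used to look up which nodes lie closest to a given node. The sketch below ranks nodes by cosine similarity using numpy only. The helper most_similar and the toy matrix are hypothetical and not part of karateclub; cosine similarity is one reasonable choice of measure, not something the library prescribes.

```python
import numpy as np


def most_similar(X, node, k=3):
    """Return the k nodes most similar to `node` by cosine similarity.

    Hypothetical helper for illustration; not part of karateclub.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms[:, None]
    sims = unit @ unit[node]  # cosine similarity of every row to the query row
    sims[node] = -np.inf  # exclude the query node itself
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]


# Toy embedding (assumed data): row 1 points almost the same way as row 0,
# row 2 is orthogonal to row 0
toy_X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(most_similar(toy_X, node=0, k=2))
```

In a live session, passing the matrix X from above, for example most_similar(X, node=0), would list the nodes whose embeddings point in the most similar direction to node 0.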


