Lecture 5: Recommendation Systems on Graphs
xuxh@dp.tech
Recommended image: Basic Image: bohrium-notebook:2023-04-07
Recommended machine type: c2_m4_cpu
Dataset: 辅助文件和数据-推荐系统(v1) (auxiliary files and data for the recommendation-systems labs)
This course is offered by the Department of Computer Science at the National University of Singapore (NUS), course code CS6208. Course homepage: Xavier Bresson — Teaching Resources; GitHub: GML2023. Readers can visit the original links for the complete slides.
Lecture 5 : Recommendation on Graphs
Summary
Intro
- A recommender system makes recommendations from product features, user features, historical or live rating data, and the relations among products, users, and user-product pairs.
- In this lecture we focus on:
  - Recommender systems built on graphs alone: PageRank and content recommendation
  - Recommender systems built on features alone: low-rank recommendation, a.k.a. collaborative recommendation
  - Recommender systems that handle graphs and features jointly: hybrid systems
PageRank
- A ranking algorithm that orders the nodes of an arbitrary graph, mainly by node popularity (measured by in-degree).
- The stationary-state vector measures each node's importance according to the influence of its connections --> the ranking problem can be solved by computing the stationary state of the graph G.
- This is an eigenvalue (EVD) problem, but a full EVD is computationally expensive, so the Power Method is used instead; the formulation is sketched below.
- Lab 1: implement PageRank, compare the speed of EVD and the Power Method, and visualize the PageRank function on a small network.
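A sketch of the PageRank formulation implemented in Lab 1 below, reconstructed from the code (α = 0.85 is the damping factor, e = 1/n the uniform teleportation vector, a the dangling-node indicator, D the diagonal out-degree matrix):

$$\hat{P} = \alpha P + \alpha\, e a^{\top} + (1-\alpha)\, e \mathbf{1}^{\top}, \qquad P = (D^{-1} W)^{\top}$$

$$x^{(k+1)} = \hat{P}\, x^{(k)} \;\to\; x^{\star}, \quad \text{the stationary state, i.e. the leading eigenvector of } \hat{P}$$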
Collaborative recommendation
- Matrix completion on the sparse rating matrix: predict the users' missing ratings and recommend accordingly.
- The problem to solve: rank minimization subject to the observed entries.
- This problem is NP-hard; two relaxations are used:
  - Convex relaxation with the nuclear norm
  - Non-convex relaxation with matrix factorization
- Lab 2: collaborative filtering; compute a low-rank solution and visualize the matrix reconstruction process. The convex objective is sketched after this list.
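A sketch of the convex relaxation solved in Lab 2, written with the names used in the code below (lambdaNuc, lambdaDF; O is the binary observation mask, M the observed ratings; constant factors are absorbed into the λ's):

$$\min_{X}\; \lambda_{\mathrm{Nuc}} \lVert X \rVert_{*} \;+\; \lambda_{\mathrm{DF}}\, \lVert O \odot (X - M) \rVert_{F}^{2}$$

The nuclear norm $\lVert X \rVert_{*}$ (the sum of singular values) is the convex surrogate of rank; the lab minimizes this objective with a primal-dual iteration that alternates singular-value shrinkage with a pointwise data-fidelity step.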
Content recommendation
- Predict ratings from the similarity between users and between items, where similarity is measured on the user-user and item-item relation graphs.
- Graph-based content filtering amounts to collaborative recommendation by graph diffusion.
- The solution is a quadratic optimization problem, i.e. a single linear system (sketched after this list).
- Lab 3: implement content filtering.
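A sketch of the quadratic problem solved in Lab 3, with the names used in the code below (lambdaDir, lambdaDF; $L_r$ and $L_c$ are the Laplacians of the user and item graphs, lifted with Kronecker products and blended as $L = \alpha L_c + (1-\alpha) L_r$):

$$\big(\lambda_{\mathrm{Dir}}\, L + \lambda_{\mathrm{DF}}\, \mathrm{diag}(\mathrm{vec}(O))\big)\, x \;=\; \lambda_{\mathrm{DF}}\, \mathrm{vec}(M)$$

This is the normal equation of the Dirichlet-energy objective $\lambda_{\mathrm{Dir}}\, x^{\top} L x + \lambda_{\mathrm{DF}} \lVert \mathrm{vec}(O) \odot (x - \mathrm{vec}(M)) \rVert^{2}$ (up to constant factors); the code solves it with conjugate gradients and reshapes $x$ back into the rating matrix $X$.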
Hybrid system
- Integrates the collaborative and content recommendation techniques.
- Performance comparison of the different approaches.
- Lab 4: compute the hybrid solution and visualize it; the combined objective is sketched after this list.
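A sketch of the hybrid objective solved in Lab 4, combining the nuclear-norm term of Lab 2 with the Dirichlet term of Lab 3 (same naming as the code; constant factors absorbed into the λ's):

$$\min_{X}\; \lambda_{\mathrm{Nuc}} \lVert X \rVert_{*} + \lambda_{\mathrm{Dir}}\, \mathrm{vec}(X)^{\top} L\, \mathrm{vec}(X) + \lambda_{\mathrm{DF}}\, \lVert O \odot (X - M) \rVert_{F}^{2}$$

The lab reuses the primal-dual iteration of Lab 2, except that the proximal step for the Dirichlet and data-fidelity terms no longer has a closed form and is computed by conjugate gradients at each iteration.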
Lab 01 : PageRank
[1]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, 'lib/')
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
Synthetic small graph
[2]
# Data matrix
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/pagerank_synthetic.mat')
W = mat['W']
W = scipy.sparse.csr_matrix(W)
Wref = W
X = mat['X']
n = X.shape[0]
d = X.shape[1]
E = mat['E']
XE = mat['X2']
print('num_nodes:',n)
num_nodes: 11
[3]
plt.figure(1)
size_vertex_plot = 100
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*np.ones(n))
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('Visualization of the artificial WWW')
plt.axis('equal')
plt.axis('off')
plt.show()
[5]
# Solve eigenproblem
# Uniform teleportation vector e = 1/n and vector of ones
e = np.ones([n,1])/n
one = np.ones([n,1])
# Dangling-node indicator vector (nodes with no out-links)
D = np.array(W.sum(axis=1),dtype='float32').squeeze()
a_idx = np.zeros([n],dtype='int32')
a_idx[np.where(D<1./2)] = 1
a = (1.0* a_idx)[:,None]
# Compute column-stochastic transition matrix P = (D^{-1} W)^T
invD = 1./(D+1e-10)
invD[a_idx==1] = 0
invD = np.diag(invD)
W = Wref.todense()
P = invD.dot(W).T
# EVD
alpha = 0.85
start = time.time()
Phat = alpha* P + alpha* e.dot(a.T) + (1.0-alpha)* e.dot(one.T)
Phat = scipy.sparse.csr_matrix(Phat)
lamb, U = scipy.sparse.linalg.eigs(Phat, k=1, which='LM')
x_pagerank = np.abs(U[:,0])/ np.sum(np.abs(U[:,0]))
# Computational time
print('Computational time for PageRank solution with EIGEN Method (sec):',time.time() - start)
Computational time for PageRank solution with EIGEN Method (sec): 0.001756429672241211
[6]
plt.figure(2)
size_vertex_plot = 1e3*6
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*x_pagerank)
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('PageRank solution with EIGEN Method.')
plt.axis('equal')
plt.axis('off')
plt.show()
[7]
# PageRank values
x = x_pagerank
val = np.sort(x)[::-1]
idx = np.argsort(x)[::-1]
index = np.array(range(1,1+n))
in_degree = np.array(W.sum(axis=0)).squeeze(axis=0)
out_degree = np.array(W.sum(axis=1)).squeeze(axis=1)
index = index[idx]
in_degree = in_degree[idx]
out_degree = out_degree[idx]
print('\n ''Node'' | ''PageRank'' | ''In-degree'' | ''Out-degree'' ')
for i in range(n):
    print(' ',index[i], ' ', round(val[i],3) ,' ', in_degree[i],' ', out_degree[i], end='\n')
 Node | PageRank | In-degree | Out-degree
  2   0.384   7.0   1.0
  3   0.343   1.0   1.0
  5   0.081   6.0   3.0
  4   0.039   1.0   2.0
  6   0.039   1.0   2.0
  1   0.033   1.0   0.0
  10  0.016   0.0   1.0
  8   0.016   0.0   2.0
  9   0.016   0.0   2.0
  7   0.016   0.0   2.0
  11  0.016   0.0   1.0
[8]
# Power Method
# Initialization
x = e
diffx = 1e10
k = 0
# Iterative scheme
start = time.time()
alpha = 0.85
while (k<1000) & (diffx>1e-6):
    # Update iteration
    k += 1
    # Update x
    xold = x
    x = alpha* P.dot(x) + e.dot( alpha* a.T.dot(x) + (1.0-alpha) )
    # Stopping condition
    diffx = np.linalg.norm(x-xold,1)
x_pagerank_PM = np.array(x).squeeze(axis=1)
# Computational time
print('Computational time for PageRank solution with POWER Method (sec):',time.time() - start)
plt.figure(3)
size_vertex_plot = 1e3*6
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*x_pagerank_PM)
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('PageRank solution with POWER Method.')
plt.axis('equal')
plt.axis('off')
plt.show()
Computational time for PageRank solution with POWER Method (sec): 0.0045511722564697266
Real-world dataset CALIFORNIA
[9]
###########################
# California graph
# http://vlado.fmf.uni-lj.si/pub/networks/data/mix/mixed.htm
# This graph was constructed by expanding a 200-page response set
# to a search engine query 'California'.
###########################
network = np.loadtxt('/bohr/dataset-6exj/v1/datasets/california.dat')
row = network[:,0]-1
col = network[:,1]-1
n = int(np.max(network))+1 # nb of vertices
ne = len(row)
print('nb of nodes=',n)
print('nb of edges=',ne)
# Create Adjacency matrix W
data = np.ones([ne])
W = scipy.sparse.csr_matrix((data, (row, col)), shape=(n, n))
Wref = W
print(W.shape)
# Plot adjacency matrix
plt.figure(4)
plt.spy(W,precision=0.01, markersize=1)
plt.show()
nb of nodes= 9665
nb of edges= 16150
(9665, 9665)
[4]
# Solve eigenproblem
# Uniform teleportation vector e = 1/n and vector of ones
e = np.ones([n,1])/n
one = np.ones([n,1])
# Dangling-node indicator vector (nodes with no out-links)
D = np.array(W.sum(axis=1),dtype='float32').squeeze()
a_idx = np.zeros([n],dtype='int32')
a_idx[np.where(D<1./2)] = 1
a = (1.0* a_idx)[:,None]
# Compute column-stochastic transition matrix P = (D^{-1} W)^T
invD = 1./(D+1e-10)
invD[a_idx==1] = 0
invD = np.diag(invD)
W = Wref.todense()
P = invD.dot(W).T
# EVD
alpha = 0.85
start = time.time()
Phat = alpha* P + alpha* e.dot(a.T) + (1.0-alpha)* e.dot(one.T)
Phat = scipy.sparse.csr_matrix(Phat)
lamb, U = scipy.sparse.linalg.eigs(Phat, k=1, which='LM')
x_pagerank = np.abs(U[:,0])/ np.sum(np.abs(U[:,0]))
# Computational time
print('Computational time for PageRank solution with EIGEN Method (sec):',time.time() - start)
Computational time for PageRank solution with EIGEN Method (sec): 0.024279356002807617
[ ]
# Power Method
# Initialization
e = np.ones([n,1])/n
x = e
diffx = 1e10
k = 0
W = Wref.todense()
P = invD.dot(W).T
# Iterative scheme
start = time.time()
alpha = 0.85
while (k<1000) & (diffx>1e-6):
    # Update iteration
    k += 1
    # Update x
    xold = x
    x = alpha* P.dot(x) + e.dot( alpha* a.T.dot(x) + (1.0-alpha) )
    # Stopping condition
    diffx = np.linalg.norm(x-xold,1)
x_pagerank_PM = np.array(x).squeeze(axis=1)
# Computational time
print('Computational time for PageRank solution with POWER Method (sec):',time.time() - start)
Lab 02 : Collaborative filtering
[2]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from IPython.display import display, clear_output
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
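The shrink helper imported from lib.utils is not shown in this notebook. A minimal sketch of the standard soft-thresholding operator it is assumed to implement (the actual lib/utils.py version may differ):

import numpy as np

def shrink(S, tau):
    # Soft-thresholding: move every entry of S toward zero by tau.
    # Applied to singular values, this is the proximal operator of the nuclear norm.
    return np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)

In the iterations below, shrink(S, lambdaNuc/sigma) damps the singular values of the dual variable, which is what enforces the low-rank prior.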
Synthetic dataset
[3]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[4]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[5]
# Collaborative filtering / low-rank approximation by nuclear norm
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 138.75
lambdaNuc = normOM/4; lambdaDF = 1e3 * 1e-2
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma)
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    X = ( X + tau* lambdaDF* O* M)/ (1 + tau* lambdaDF* O)
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%50:
        clear_output(wait=True)
        plt.figure(1)
        plt.imshow(X, interpolation='nearest', cmap='jet')
        plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,2)))
plt.show()
diffX 2.347719681709583
diffX 1.803709307199022
diffX 1.4771937408778328
diffX 1.1861059871897621
diffX 0.9225570691441073
diffX 0.688543926854264
diffX 0.49170614216425157
diffX 0.32219978409335964
diffX 0.194261221724262
diffX 0.11464841985336849
Reconstruction Error: 138.75954
Real-world dataset SWEETRS
[6]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[7]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[8]
# Collaborative filtering / low-rank approximation by nuclear norm
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, 1.3%, error metric = 744.10
lambdaNuc = normOM/4; lambdaDF = 1e3 * 1e-2
# scenario 2 : intermediate number of ratings, 13.1%, error metric = 412.01
lambdaNuc = normOM/4 * 1e2; lambdaDF = 1e3 * 1e0
# scenario 3 : large number of ratings, 52.7%, error metric = 698.97
# lambdaNuc = normOM/4 * 1e2; lambdaDF = 1e3
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & ( diffX>1e-1 or k<100 ) :
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma)
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    X = ( X + tau* lambdaDF* O* M)/ (1 + tau* lambdaDF* O)
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%50:
        clear_output(wait=True)
        plt.figure(figsize=(10,10))
        plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
        plt.colorbar(shrink=0.65)
        plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
diffX 9.715413387006436
diffX 1.3421208842881098
diffX 0.6940819839265959
diffX 10.996647752952654
diffX 4.6520503707714935
diffX 2.7190054210572567
diffX 2.2671834772418658
diffX 0.6633559110697017
diffX 1.7002771816583142
diffX 0.3379443310380693
diffX 0.13896755511412323
diffX 0.19190345371713774
Reconstruction Error: 409.76023
Lab 03 : Content recommendation
[3]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
from lib.utils import graph_laplacian
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
from lib.utils import compute_ncut, reindex_W_with_classes, construct_knn_graph
import torch
import networkx as nx
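graph_laplacian is also provided by lib.utils and not shown here. A plausible minimal sketch, assuming it returns the symmetric normalized Laplacian (an assumption; the exact normalization in lib/utils.py may differ):

import numpy as np
import scipy.sparse

def graph_laplacian_sketch(W):
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}, D = diagonal degree matrix of W
    d = np.array(W.sum(axis=1)).squeeze()
    Dinvsqrt = scipy.sparse.diags(1.0 / np.sqrt(d + 1e-10))
    I = scipy.sparse.identity(W.shape[0])
    return I - Dinvsqrt @ W @ Dinvsqrt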
Synthetic dataset
[5]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[6]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[7]
# Content Filtering / Graph Regularization by Dirichlet Energy
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 161.32
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
# Solve by linear system
x,_ = scipy.sparse.linalg.cg(At, bt, x0=bt, tol=1e-9, maxiter=100)
X = np.reshape(x,[m,n]).T
# Reconstruction error
err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Content Filtering\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
Reconstruction Error: 161.32238
Real-world dataset SWEETRS
[8]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining)/(n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[14]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[9]
# Visualize graph of users and graph of products
# Plot adjacency matrix w.r.t. NCut communities
# plot graph of users
W = Wrow
nc = 10; Cncut, _ = compute_ncut(W, np.zeros(Mgt.shape[0]), nc)# compute NCut clustering
[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W,Cncut)
plt.figure(1)
plt.spy(reindexed_W_ncut, precision=0.01, markersize=1)
plt.title('Adjacency matrix of users indexed \naccording to the NCut communities')
plt.show()
A = W.copy()
A.setdiag(0)
A.eliminate_zeros()
G_nx = nx.from_scipy_sparse_array(A)
plt.figure(2,figsize=[30,30])
nx.draw_networkx(G_nx, with_labels=True, node_color=np.array(Cncut), cmap='jet')
# plot graph of products
W = Wcol
nc = 10; Cncut, _ = compute_ncut(W, np.zeros(Mgt.shape[1]), nc)# compute NCut clustering
[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W,Cncut)
plt.figure(3)
plt.spy(reindexed_W_ncut, precision=0.01, markersize=1)
plt.title('Adjacency matrix of products indexed \naccording to the NCut communities')
plt.show()
A = W.copy()
A.setdiag(0)
A.eliminate_zeros()
G_nx = nx.from_scipy_sparse_array(A)
plt.figure(4,figsize=[30,30])
nx.draw_networkx(G_nx, with_labels=True, node_color=np.array(Cncut), cmap='jet')
[10]
# Content Filtering / Graph Regularization by Dirichlet Energy
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, e.g. 1.3%, error metric = 399.89
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# scenario 2 : intermediate number of ratings, e.g. 13.1%, error metric = 411.24
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# scenario 3 : large number of ratings, e.g. 52.7%, error metric = 748.52
# lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
# Solve by linear system
x,_ = scipy.sparse.linalg.cg(At, bt, x0=bt, tol=1e-9, maxiter=100)
X = np.reshape(x,[m,n]).T
# Reconstruction error
err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Content Filtering\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
Reconstruction Error: 411.24489
Lab 04 : Hybrid recommendation
[11]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from IPython.display import display, clear_output
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
from lib.utils import graph_laplacian
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
Synthetic dataset
[12]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[13]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[14]
# Hybrid system : Matrix Completion on graphs
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 112.68
lambdaDir = 1e-1 * 1e0; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.1
#Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
Id = scipy.sparse.identity(n*m)
Id = scipy.sparse.csr_matrix(Id)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma) # Y/sigma = U*S*V'
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    A = tau* At + Id
    vecX = np.reshape(X.T,[-1])
    vecX = scipy.sparse.csr_matrix(vecX)
    b = tau* bt + vecX
    b = np.array(b).squeeze()
    # Solve by linear system
    x,_ = scipy.sparse.linalg.cg(A, b, x0=b, tol=1e-6, maxiter=25)
    X = np.reshape(x,[m,n]).T
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%10:
        clear_output(wait=True)
        plt.figure(1)
        plt.imshow(X, interpolation='nearest', cmap='jet')
        plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,2)))
plt.show()
diffX 5.423475545647113
diffX 4.415389182853458
diffX 4.005068522915616
diffX 3.7415004227665705
diffX 3.50594108628272
diffX 3.2803732236271568
diffX 3.0570482317689125
diffX 2.843310938486321
diffX 2.6247594661855036
diffX 2.405232860108669
diffX 2.1879916044501577
diffX 1.9776791515588914
diffX 1.7658093663297512
diffX 1.5595074395634871
diffX 1.3598147102245886
diffX 1.1652339022755627
diffX 0.9849909079879685
diffX 0.8174318867983877
diffX 0.6653730707325888
diffX 0.5346109617226651
diffX 0.4207750354153088
diffX 0.3237873083897306
diffX 0.24437741854587874
diffX 0.18459323256238605
diffX 0.1364669834151825
diffX 0.10174444993658202
Reconstruction Error: 112.68308
Real-world dataset SWEETRS
[16]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining)/(n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[17]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[18]
# Hybrid system : Matrix Completion on graphs
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, e.g. 1.3%, error metric = 402.71
lambdaDir = 1e-1 * 1e3 * 0.5; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.02
# scenario 2 : intermediate number of ratings, e.g. 13.1%, error metric = 397.47
lambdaDir = 1e-1; lambdaDF = 1e3 * 10; lambdaNuc = normOM/4 /10; alpha = 0.25
# scenario 3 : large number of ratings, e.g. 52.7%, error metric = 695.00
# lambdaDir = 1e-1 * 1e1; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.02
#Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
Id = scipy.sparse.identity(n*m)
Id = scipy.sparse.csr_matrix(Id)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma) # Y/sigma = U*S*V'
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    A = tau* At + Id
    vecX = np.reshape(X.T,[-1])
    vecX = scipy.sparse.csr_matrix(vecX)
    b = tau* bt + vecX
    b = np.array(b).squeeze()
    # Solve by linear system
    x,_ = scipy.sparse.linalg.cg(A, b, x0=b, tol=1e-6, maxiter=25)
    X = np.reshape(x,[m,n]).T
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%10:
        clear_output(wait=True)
        plt.figure(figsize=(10,10))
        plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
        plt.colorbar(shrink=0.65)
        plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
diffX 12.97330201424323
diffX 10.214306996079092
diffX 7.865498569445742
diffX 5.817653613498639
diffX 4.159594766250172
diffX 2.9581038165169664
diffX 2.1708126517625774
diffX 1.6884458895360115
diffX 1.3985506569506554
diffX 1.2169039432337356
diffX 1.0916012474096615
diffX 0.9948503362171524
diffX 0.912897115326623
diffX 0.8391816776714556
diffX 0.7706457184758235
diffX 0.705924882621576
diffX 0.64448477047351
diffX 0.5861982384190937
diffX 0.5311253492817951
diffX 0.47939264184555236
diffX 0.4311249834863117
diffX 0.3864073896313916
diffX 0.3452663227139773
diffX 0.3076645073517481
diffX 0.27350489526463656
diffX 0.24264018047040123
diffX 0.21488492611358004
diffX 0.1900280978213643
diffX 0.16784455040381902
diffX 0.14810468029652096
diffX 0.130581966254459
diffX 0.11505845924160628
diffX 0.10132847119460266
Reconstruction Error: 397.4795