Lecture 5: Recommendation Systems on Graphs
xuxh@dp.tech
Recommended image: Basic Image: bohrium-notebook:2023-04-07
Recommended machine type: c2_m4_cpu
Dataset: 辅助文件和数据-推荐系统(v1) (auxiliary files and data for the recommendation-systems labs)
This course is offered by the Department of Computer Science at the National University of Singapore (NUS), course code CS6208. Course homepage: Xavier Bresson — Teaching Resources; GitHub: GML2023. Readers can visit the original links for the complete slides.
Lecture 5 : Recommendation on Graphs
Summary
Intro
- A recommender system makes recommendations from product features, user features, historical or live rating data, and the relations among products, users, and user-product pairs.
- In this lecture we focus on:
  - Recommender systems built on graphs alone: PageRank and content recommendation
  - Recommender systems built on features alone: low-rank recommendation, a.k.a. collaborative recommendation
  - Recommender systems that handle graphs and features jointly: hybrid systems
PageRank
- A ranking algorithm that orders the nodes of an arbitrary graph, mainly by node popularity (measured by in-degree).
- The stationary-state vector measures each node's importance according to the influence of its connections --> the ranking problem can be solved by computing the stationary state of the graph G.
- This is an eigenvalue (EVD) problem, but a full EVD is computationally expensive, so the Power Method is used instead; the formulation is sketched below.
- Lab 1: implement PageRank, compare the speed of EVD and the Power Method, and visualize the PageRank function on a small network.
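A sketch of the PageRank formulation implemented in Lab 1 below, reconstructed from the code (α = 0.85 is the damping factor, e = 1/n the uniform teleportation vector, a the dangling-node indicator, D the diagonal out-degree matrix):

$$\hat{P} = \alpha P + \alpha\, e a^{\top} + (1-\alpha)\, e \mathbf{1}^{\top}, \qquad P = (D^{-1} W)^{\top}$$

$$x^{(k+1)} = \hat{P}\, x^{(k)} \;\to\; x^{\star}, \quad \text{the stationary state, i.e. the leading eigenvector of } \hat{P}$$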
Collaborative recommendation
- Matrix completion on the sparse rating matrix: predict the users' missing ratings and recommend accordingly.
- The problem to solve: rank minimization subject to the observed entries.
- This problem is NP-hard; two relaxations are used:
  - Convex relaxation with the nuclear norm
  - Non-convex relaxation with matrix factorization
- Lab 2: collaborative filtering; compute a low-rank solution and visualize the matrix reconstruction process. The convex objective is sketched after this list.
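A sketch of the convex relaxation solved in Lab 2, written with the names used in the code below (lambdaNuc, lambdaDF; O is the binary observation mask, M the observed ratings; constant factors are absorbed into the λ's):

$$\min_{X}\; \lambda_{\mathrm{Nuc}} \lVert X \rVert_{*} \;+\; \lambda_{\mathrm{DF}}\, \lVert O \odot (X - M) \rVert_{F}^{2}$$

The nuclear norm $\lVert X \rVert_{*}$ (the sum of singular values) is the convex surrogate of rank; the lab minimizes this objective with a primal-dual iteration that alternates singular-value shrinkage with a pointwise data-fidelity step.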
Content recommendation
- Predict ratings from the similarity between users and between items, where similarity is measured on the user-user and item-item relation graphs.
- Graph-based content filtering amounts to collaborative recommendation by graph diffusion.
- The solution is a quadratic optimization problem, i.e. a single linear system (sketched after this list).
- Lab 3: implement content filtering.
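A sketch of the quadratic problem solved in Lab 3, with the names used in the code below (lambdaDir, lambdaDF; $L_r$ and $L_c$ are the Laplacians of the user and item graphs, lifted with Kronecker products and blended as $L = \alpha L_c + (1-\alpha) L_r$):

$$\big(\lambda_{\mathrm{Dir}}\, L + \lambda_{\mathrm{DF}}\, \mathrm{diag}(\mathrm{vec}(O))\big)\, x \;=\; \lambda_{\mathrm{DF}}\, \mathrm{vec}(M)$$

This is the normal equation of the Dirichlet-energy objective $\lambda_{\mathrm{Dir}}\, x^{\top} L x + \lambda_{\mathrm{DF}} \lVert \mathrm{vec}(O) \odot (x - \mathrm{vec}(M)) \rVert^{2}$ (up to constant factors); the code solves it with conjugate gradients and reshapes $x$ back into the rating matrix $X$.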
Hybrid system
- Integrates the collaborative and content recommendation techniques.
- Performance comparison of the different approaches.
- Lab 4: compute the hybrid solution and visualize it; the combined objective is sketched after this list.
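A sketch of the hybrid objective solved in Lab 4, combining the nuclear-norm term of Lab 2 with the Dirichlet term of Lab 3 (same naming as the code; constant factors absorbed into the λ's):

$$\min_{X}\; \lambda_{\mathrm{Nuc}} \lVert X \rVert_{*} + \lambda_{\mathrm{Dir}}\, \mathrm{vec}(X)^{\top} L\, \mathrm{vec}(X) + \lambda_{\mathrm{DF}}\, \lVert O \odot (X - M) \rVert_{F}^{2}$$

The lab reuses the primal-dual iteration of Lab 2, except that the proximal step for the Dirichlet and data-fidelity terms no longer has a closed form and is computed by conjugate gradients at each iteration.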
Lab 01 : PageRank
[1]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, 'lib/')
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
Synthetic small graph
[2]
# Data matrix
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/pagerank_synthetic.mat')
W = mat['W']
W = scipy.sparse.csr_matrix(W)
Wref = W
X = mat['X']
n = X.shape[0]
d = X.shape[1]
E = mat['E']
XE = mat['X2']
print('num_nodes:',n)
num_nodes: 11
[3]
plt.figure(1)
size_vertex_plot = 100
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*np.ones(n))
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('Visualization of the artificial WWW')
plt.axis('equal')
plt.axis('off')
plt.show()
[5]
# Solve eigenproblem
# Uniform teleportation vector e = 1/n and vector of ones
e = np.ones([n,1])/n
one = np.ones([n,1])
# Dangling-node indicator vector (nodes with no out-links)
D = np.array(W.sum(axis=1),dtype='float32').squeeze()
a_idx = np.zeros([n],dtype='int32')
a_idx[np.where(D<1./2)] = 1
a = (1.0* a_idx)[:,None]
# Compute column-stochastic transition matrix P = (D^{-1} W)^T
invD = 1./(D+1e-10)
invD[a_idx==1] = 0
invD = np.diag(invD)
W = Wref.todense()
P = invD.dot(W).T
# EVD
alpha = 0.85
start = time.time()
Phat = alpha* P + alpha* e.dot(a.T) + (1.0-alpha)* e.dot(one.T)
Phat = scipy.sparse.csr_matrix(Phat)
lamb, U = scipy.sparse.linalg.eigs(Phat, k=1, which='LM')
x_pagerank = np.abs(U[:,0])/ np.sum(np.abs(U[:,0]))
# Computational time
print('Computational time for PageRank solution with EIGEN Method (sec):',time.time() - start)
Computational time for PageRank solution with EIGEN Method (sec): 0.001756429672241211
[6]
plt.figure(2)
size_vertex_plot = 1e3*6
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*x_pagerank)
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('PageRank solution with EIGEN Method.')
plt.axis('equal')
plt.axis('off')
plt.show()
[7]
# PageRank values
x = x_pagerank
val = np.sort(x)[::-1]
idx = np.argsort(x)[::-1]
index = np.array(range(1,1+n))
in_degree = np.array(W.sum(axis=0)).squeeze(axis=0)
out_degree = np.array(W.sum(axis=1)).squeeze(axis=1)
index = index[idx]
in_degree = in_degree[idx]
out_degree = out_degree[idx]
print('\n ''Node'' | ''PageRank'' | ''In-degree'' | ''Out-degree'' ')
for i in range(n):
    print(' ',index[i], ' ', round(val[i],3) ,' ', in_degree[i],' ', out_degree[i], end='\n')
 Node | PageRank | In-degree | Out-degree
  2   0.384   7.0   1.0
  3   0.343   1.0   1.0
  5   0.081   6.0   3.0
  4   0.039   1.0   2.0
  6   0.039   1.0   2.0
  1   0.033   1.0   0.0
  10  0.016   0.0   1.0
  8   0.016   0.0   2.0
  9   0.016   0.0   2.0
  7   0.016   0.0   2.0
  11  0.016   0.0   1.0
[8]
# Power Method
# Initialization
x = e
diffx = 1e10
k = 0
# Iterative scheme
start = time.time()
alpha = 0.85
while (k<1000) & (diffx>1e-6):
    # Update iteration
    k += 1
    # Update x
    xold = x
    x = alpha* P.dot(x) + e.dot( alpha* a.T.dot(x) + (1.0-alpha) )
    # Stopping condition
    diffx = np.linalg.norm(x-xold,1)
x_pagerank_PM = np.array(x).squeeze(axis=1)
# Computational time
print('Computational time for PageRank solution with POWER Method (sec):',time.time() - start)
plt.figure(3)
size_vertex_plot = 1e3*6
plt.scatter(X[:,0], X[:,1], s=size_vertex_plot*x_pagerank_PM)
plt.quiver(XE[:,0], XE[:,1], E[:,0], E[:,1], scale=1., units='xy')
plt.title('PageRank solution with POWER Method.')
plt.axis('equal')
plt.axis('off')
plt.show()
Computational time for PageRank solution with POWER Method (sec): 0.0045511722564697266
Real-world dataset CALIFORNIA
[9]
###########################
# California graph
# http://vlado.fmf.uni-lj.si/pub/networks/data/mix/mixed.htm
# This graph was constructed by expanding a 200-page response set
# to a search engine query 'California'.
###########################
network = np.loadtxt('/bohr/dataset-6exj/v1/datasets/california.dat')
row = network[:,0]-1
col = network[:,1]-1
n = int(np.max(network))+1 # nb of vertices
ne = len(row)
print('nb of nodes=',n)
print('nb of edges=',ne)
# Create Adjacency matrix W
data = np.ones([ne])
W = scipy.sparse.csr_matrix((data, (row, col)), shape=(n, n))
Wref = W
print(W.shape)
# Plot adjacency matrix
plt.figure(4)
plt.spy(W,precision=0.01, markersize=1)
plt.show()
nb of nodes= 9665
nb of edges= 16150
(9665, 9665)
[4]
# Solve eigenproblem
# Uniform teleportation vector e = 1/n and vector of ones
e = np.ones([n,1])/n
one = np.ones([n,1])
# Dangling-node indicator vector (nodes with no out-links)
D = np.array(W.sum(axis=1),dtype='float32').squeeze()
a_idx = np.zeros([n],dtype='int32')
a_idx[np.where(D<1./2)] = 1
a = (1.0* a_idx)[:,None]
# Compute column-stochastic transition matrix P = (D^{-1} W)^T
invD = 1./(D+1e-10)
invD[a_idx==1] = 0
invD = np.diag(invD)
W = Wref.todense()
P = invD.dot(W).T
# EVD
alpha = 0.85
start = time.time()
Phat = alpha* P + alpha* e.dot(a.T) + (1.0-alpha)* e.dot(one.T)
Phat = scipy.sparse.csr_matrix(Phat)
lamb, U = scipy.sparse.linalg.eigs(Phat, k=1, which='LM')
x_pagerank = np.abs(U[:,0])/ np.sum(np.abs(U[:,0]))
# Computational time
print('Computational time for PageRank solution with EIGEN Method (sec):',time.time() - start)
Computational time for PageRank solution with EIGEN Method (sec): 0.024279356002807617
[ ]
# Power Method
# Initialization
e = np.ones([n,1])/n
x = e
diffx = 1e10
k = 0
W = Wref.todense()
P = invD.dot(W).T
# Iterative scheme
start = time.time()
alpha = 0.85
while (k<1000) & (diffx>1e-6):
    # Update iteration
    k += 1
    # Update x
    xold = x
    x = alpha* P.dot(x) + e.dot( alpha* a.T.dot(x) + (1.0-alpha) )
    # Stopping condition
    diffx = np.linalg.norm(x-xold,1)
x_pagerank_PM = np.array(x).squeeze(axis=1)
# Computational time
print('Computational time for PageRank solution with POWER Method (sec):',time.time() - start)
Lab 02 : Collaborative filtering
[2]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from IPython.display import display, clear_output
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
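The shrink helper imported from lib.utils is not shown in this notebook. A minimal sketch of the standard soft-thresholding operator it is assumed to implement (the actual lib/utils.py version may differ):

import numpy as np

def shrink(S, tau):
    # Soft-thresholding: move every entry of S toward zero by tau.
    # Applied to singular values, this is the proximal operator of the nuclear norm.
    return np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)

In the iterations below, shrink(S, lambdaNuc/sigma) damps the singular values of the dual variable, which is what enforces the low-rank prior.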
Synthetic dataset
[3]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[4]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[5]
# Collaborative filtering / low-rank approximation by nuclear norm
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 138.75
lambdaNuc = normOM/4; lambdaDF = 1e3 * 1e-2
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma)
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    X = ( X + tau* lambdaDF* O* M)/ (1 + tau* lambdaDF* O)
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%50:
        clear_output(wait=True)
        plt.figure(1)
        plt.imshow(X, interpolation='nearest', cmap='jet')
        plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,2)))
plt.show()
diffX 2.347719681709583
diffX 1.803709307199022
diffX 1.4771937408778328
diffX 1.1861059871897621
diffX 0.9225570691441073
diffX 0.688543926854264
diffX 0.49170614216425157
diffX 0.32219978409335964
diffX 0.194261221724262
diffX 0.11464841985336849
Reconstruction Error: 138.75954
Real-world dataset SWEETRS
[6]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[7]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[8]
# Collaborative filtering / low-rank approximation by nuclear norm
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, 1.3%, error metric = 744.10
lambdaNuc = normOM/4; lambdaDF = 1e3 * 1e-2
# scenario 2 : intermediate number of ratings, 13.1%, error metric = 412.01
lambdaNuc = normOM/4 * 1e2; lambdaDF = 1e3 * 1e0
# scenario 3 : large number of ratings, 52.7%, error metric = 698.97
# lambdaNuc = normOM/4 * 1e2; lambdaDF = 1e3
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & ( diffX>1e-1 or k<100 ) :
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma)
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    X = ( X + tau* lambdaDF* O* M)/ (1 + tau* lambdaDF* O)
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%50:
        clear_output(wait=True)
        plt.figure(figsize=(10,10))
        plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
        plt.colorbar(shrink=0.65)
        plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Collaborative Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
diffX 9.715413387006436
diffX 1.3421208842881098
diffX 0.6940819839265959
diffX 10.996647752952654
diffX 4.6520503707714935
diffX 2.7190054210572567
diffX 2.2671834772418658
diffX 0.6633559110697017
diffX 1.7002771816583142
diffX 0.3379443310380693
diffX 0.13896755511412323
diffX 0.19190345371713774
Reconstruction Error: 409.76023
Lab 03 : Content recommendation
[3]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
from lib.utils import graph_laplacian
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
from lib.utils import compute_ncut, reindex_W_with_classes, construct_knn_graph
import torch
import networkx as nx
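graph_laplacian is also provided by lib.utils and not shown here. A plausible minimal sketch, assuming it returns the symmetric normalized Laplacian (an assumption; the exact normalization in lib/utils.py may differ):

import numpy as np
import scipy.sparse

def graph_laplacian_sketch(W):
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}, D = diagonal degree matrix of W
    d = np.array(W.sum(axis=1)).squeeze()
    Dinvsqrt = scipy.sparse.diags(1.0 / np.sqrt(d + 1e-10))
    I = scipy.sparse.identity(W.shape[0])
    return I - Dinvsqrt @ W @ Dinvsqrt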
Synthetic dataset
[5]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[6]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[7]
# Content Filtering / Graph Regularization by Dirichlet Energy
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 161.32
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
# Solve by linear system
x,_ = scipy.sparse.linalg.cg(At, bt, x0=bt, tol=1e-9, maxiter=100)
X = np.reshape(x,[m,n]).T
# Reconstruction error
err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Content Filtering\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
Reconstruction Error: 161.32238
Real-world dataset SWEETRS
[8]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining)/(n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[14]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[9]
# Visualize graph of users and graph of products
# Plot adjacency matrix w.r.t. NCut communities
# plot graph of users
W = Wrow
nc = 10; Cncut, _ = compute_ncut(W, np.zeros(Mgt.shape[0]), nc)# compute NCut clustering
[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W,Cncut)
plt.figure(1)
plt.spy(reindexed_W_ncut, precision=0.01, markersize=1)
plt.title('Adjacency matrix of users indexed \naccording to the NCut communities')
plt.show()
A = W.copy()
A.setdiag(0)
A.eliminate_zeros()
G_nx = nx.from_scipy_sparse_array(A)
plt.figure(2,figsize=[30,30])
nx.draw_networkx(G_nx, with_labels=True, node_color=np.array(Cncut), cmap='jet')
# plot graph of products
W = Wcol
nc = 10; Cncut, _ = compute_ncut(W, np.zeros(Mgt.shape[1]), nc)# compute NCut clustering
[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W,Cncut)
plt.figure(3)
plt.spy(reindexed_W_ncut, precision=0.01, markersize=1)
plt.title('Adjacency matrix of products indexed \naccording to the NCut communities')
plt.show()
A = W.copy()
A.setdiag(0)
A.eliminate_zeros()
G_nx = nx.from_scipy_sparse_array(A)
plt.figure(4,figsize=[30,30])
nx.draw_networkx(G_nx, with_labels=True, node_color=np.array(Cncut), cmap='jet')
[10]
# Content Filtering / Graph Regularization by Dirichlet Energy
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, e.g. 1.3%, error metric = 399.89
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# scenario 2 : intermediate number of ratings, e.g. 13.1%, error metric = 411.24
lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# scenario 3 : large number of ratings, e.g. 52.7%, error metric = 748.52
# lambdaDir = 1e-1; lambdaDF = 1e3; alpha = 0.02
# Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
# Solve by linear system
x,_ = scipy.sparse.linalg.cg(At, bt, x0=bt, tol=1e-9, maxiter=100)
X = np.reshape(x,[m,n]).T
# Reconstruction error
err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Content Filtering\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
Reconstruction Error: 411.24489
Lab 04 : Hybrid recommendation
[11]
# Load libraries
import numpy as np
import scipy.io
%matplotlib inline
#%matplotlib notebook
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from IPython.display import display, clear_output
import time
import sys; sys.path.insert(0, '/bohr/dataset-6exj/v1/')
from lib.utils import shrink
from lib.utils import graph_laplacian
import scipy.sparse.linalg
import warnings; warnings.filterwarnings("ignore")
Synthetic dataset
[12]
# Load graphs of rows/users and columns/movies
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/synthetic_netflix.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining) / (n*m)
print('perc_obs_training=',perc_obs_training)
n,m= 150 200
perc_obs_training= 0.03
[13]
# Visualize the rating matrix
plt.figure(1)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Low-rank Matrix M.\nNote: We NEVER observe it\n in real-world applications')
plt.show()
plt.figure(2)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M\n for TRAINING.\n Percentage=' + str(perc_obs_training))
plt.show()
[14]
# Hybrid system : Matrix Completion on graphs
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario : very low number of ratings, 3%, error metric = 112.68
lambdaDir = 1e-1 * 1e0; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.1
#Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
Id = scipy.sparse.identity(n*m)
Id = scipy.sparse.csr_matrix(Id)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma) # Y/sigma = U*S*V'
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    A = tau* At + Id
    vecX = np.reshape(X.T,[-1])
    vecX = scipy.sparse.csr_matrix(vecX)
    b = tau* bt + vecX
    b = np.array(b).squeeze()
    # Solve by linear system
    x,_ = scipy.sparse.linalg.cg(A, b, x0=b, tol=1e-6, maxiter=25)
    X = np.reshape(x,[m,n]).T
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%10:
        clear_output(wait=True)
        plt.figure(1)
        plt.imshow(X, interpolation='nearest', cmap='jet')
        plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plot
plt.figure(2)
plt.imshow(Mgt, interpolation='nearest', cmap='jet')
plt.title('Ground truth low-rank matrix M')
plt.figure(3)
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet')
plt.title('Observed values of M')
plt.figure(4)
plt.imshow(X, interpolation='nearest', cmap='jet')
plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,2)))
plt.show()
diffX 5.423475545647113
diffX 4.415389182853458
diffX 4.005068522915616
diffX 3.7415004227665705
diffX 3.50594108628272
diffX 3.2803732236271568
diffX 3.0570482317689125
diffX 2.843310938486321
diffX 2.6247594661855036
diffX 2.405232860108669
diffX 2.1879916044501577
diffX 1.9776791515588914
diffX 1.7658093663297512
diffX 1.5595074395634871
diffX 1.3598147102245886
diffX 1.1652339022755627
diffX 0.9849909079879685
diffX 0.8174318867983877
diffX 0.6653730707325888
diffX 0.5346109617226651
diffX 0.4207750354153088
diffX 0.3237873083897306
diffX 0.24437741854587874
diffX 0.18459323256238605
diffX 0.1364669834151825
diffX 0.10174444993658202
Reconstruction Error: 112.68308
Real-world dataset SWEETRS
[16]
# Load graphs of rows/users and columns/products
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario1.mat')
mat = scipy.io.loadmat('/bohr/dataset-6exj/v1/datasets/real_sweetrs_scenario2.mat') # the last load wins: scenario 2 is used below
# mat = scipy.io.loadmat('datasets/real_sweetrs_scenario3.mat')
M = mat['M']
Otraining = mat['Otraining']
Otest = mat['Otest']
Wrow = mat['Wrow']
Wcol = mat['Wcol']
print('M', M.shape)
print('Otraining', Otraining.shape)
print('Otest', Otest.shape)
print('Wrow', Wrow.shape)
print('Wcol', Wcol.shape)
n,m = M.shape
print('n,m=',n,m)
Mgt = M # Ground truth
O = Otraining
M = O* Mgt
perc_obs_training = np.sum(Otraining)/(n*m)
print('perc_obs_training=',perc_obs_training)
perc_obs_test = np.sum(Otest) / (n*m)
M (664, 77)
Otraining (664, 77)
Otest (664, 77)
Wrow (664, 664)
Wcol (77, 77)
n,m= 664 77
perc_obs_training= 0.1317868878109842
[17]
# Visualize the original rating matrix
plt.figure(1,figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
# Visualize the observed rating matrix
plt.figure(2, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
[18]
# Hybrid system : Matrix Completion on graphs
# Norm of the operator
OM = O*M
normOM = np.linalg.norm(OM,2)
#######################################
# Select the set of hyper-parameters
#######################################
# scenario 1 : low number of ratings, e.g. 1.3%, error metric = 402.71
lambdaDir = 1e-1 * 1e3 * 0.5; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.02
# scenario 2 : intermediate number of ratings, e.g. 13.1%, error metric = 397.47
lambdaDir = 1e-1; lambdaDF = 1e3 * 10; lambdaNuc = normOM/4 /10; alpha = 0.25
# scenario 3 : large number of ratings, e.g. 52.7%, error metric = 695.00
# lambdaDir = 1e-1 * 1e1; lambdaDF = 1e3; lambdaNuc = normOM/4; alpha = 0.02
#Compute Graph Laplacians
Lr = graph_laplacian(Wrow)
Lc = graph_laplacian(Wcol)
I = scipy.sparse.identity(m, dtype=Lr.dtype)
Lr = scipy.sparse.kron( I, Lr )
Lr = scipy.sparse.csr_matrix(Lr)
I = scipy.sparse.identity(n, dtype=Lc.dtype)
Lc = scipy.sparse.kron( Lc, I )
Lc = scipy.sparse.csr_matrix(Lc)
# Identify zero columns and zero rows in the data matrix X
idx_zero_cols = np.where(np.sum(Otraining,axis=0)<1e-9)[0]
idx_zero_rows = np.where(np.sum(Otraining,axis=1)<1e-9)[0]
nb_zero_cols = len(idx_zero_cols)
nb_zero_rows = len(idx_zero_rows)
# Pre-processing
L = alpha* Lc + (1.-alpha)* Lr
vecO = np.reshape(O.T,[-1])
vecO = scipy.sparse.diags(vecO, 0, shape=(n*m, n*m) ,dtype=L.dtype)
vecO = scipy.sparse.csr_matrix(vecO)
At = lambdaDir* L + lambdaDF* vecO
vecM = np.reshape(M.T,[-1])
bt = lambdaDF* scipy.sparse.csr_matrix( vecM ).T
bt = np.array(bt.todense()).squeeze()
Id = scipy.sparse.identity(n*m)
Id = scipy.sparse.csr_matrix(Id)
# Initialization
X = M; Xb = X;
Y = np.zeros([n,m])
normA = 1.
sigma = 1./normA
tau = 1./normA
diffX = 1e10
min_nm = np.min([n,m])
k = 0
while (k<2000) & (diffX>1e-1):
    # Update iteration
    k += 1
    # Update dual variable y
    Y = Y + sigma* Xb
    U,S,V = np.linalg.svd(Y/sigma) # Y/sigma = U*S*V'
    Sdiag = shrink( S , lambdaNuc/ sigma )
    I = np.array(range(min_nm))
    Sshrink = np.zeros([n,m])
    Sshrink[I,I] = Sdiag
    Y = Y - sigma* U.dot(Sshrink.dot(V))
    # Update primal variable x
    Xold = X
    X = X - tau* Y
    A = tau* At + Id
    vecX = np.reshape(X.T,[-1])
    vecX = scipy.sparse.csr_matrix(vecX)
    b = tau* bt + vecX
    b = np.array(b).squeeze()
    # Solve by linear system
    x,_ = scipy.sparse.linalg.cg(A, b, x0=b, tol=1e-6, maxiter=25)
    X = np.reshape(x,[m,n]).T
    # Fix issue with no observations along some rows and columns
    r,c = np.where(X>0.0); median = np.median(X[r,c])
    if nb_zero_cols>0: X[:,idx_zero_cols] = median
    if nb_zero_rows>0: X[idx_zero_rows,:] = median  # index with idx_zero_rows, not the count nb_zero_rows
    # Update primal variable xb
    Xb = 2.* X - Xold
    # Difference between two iterations
    diffX = np.linalg.norm(X-Xold)
    # Reconstruction error
    err_test = np.sqrt(np.sum((Otest*(X-Mgt))**2)) / np.sum(Otest) * (n*m)
    # Plot
    if not k%10:
        clear_output(wait=True)
        plt.figure(figsize=(10,10))
        plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
        plt.colorbar(shrink=0.65)
        plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
        plt.show()
        print('diffX',diffX)
clear_output(wait=True)
print('Reconstruction Error: '+ str(round(err_test,5)))
# Final plots
plt.figure(2, figsize=(10,10))
plt.imshow(Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Original rating matrix\n Percentage observed ratings: ' + str(100*np.sum(Mgt>0)/(n*m))[:5])
plt.show()
plt.figure(3, figsize=(10,10))
plt.imshow(Otraining*Mgt, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Observed rating matrix\n Percentage observed ratings: ' + str(100*perc_obs_training)[:5])
plt.show()
plt.figure(4, figsize=(10,10))
plt.imshow(X, interpolation='nearest', cmap='jet', aspect=0.1)
plt.colorbar(shrink=0.65)
plt.title('Hybrid Filtering\nIteration='+ str(k)+'\nReconstruction Error= '+ str(round(err_test,5)))
plt.show()
diffX 12.97330201424323
diffX 10.214306996079092
diffX 7.865498569445742
diffX 5.817653613498639
diffX 4.159594766250172
diffX 2.9581038165169664
diffX 2.1708126517625774
diffX 1.6884458895360115
diffX 1.3985506569506554
diffX 1.2169039432337356
diffX 1.0916012474096615
diffX 0.9948503362171524
diffX 0.912897115326623
diffX 0.8391816776714556
diffX 0.7706457184758235
diffX 0.705924882621576
diffX 0.64448477047351
diffX 0.5861982384190937
diffX 0.5311253492817951
diffX 0.47939264184555236
diffX 0.4311249834863117
diffX 0.3864073896313916
diffX 0.3452663227139773
diffX 0.3076645073517481
diffX 0.27350489526463656
diffX 0.24264018047040123
diffX 0.21488492611358004
diffX 0.1900280978213643
diffX 0.16784455040381902
diffX 0.14810468029652096
diffX 0.130581966254459
diffX 0.11505845924160628
diffX 0.10132847119460266
Reconstruction Error: 397.4795