Learning Steady-States of Iterative Algorithms over Graphs

ICML 2018

Motivation

Many graph analytics problems can be solved by iterative algorithms that operate over the graph structure, and the solutions of these algorithms are often characterized by a set of steady-state conditions.

  • PageRank: score of a node

  • Mean field inference: the posterior distribution of a variable

Instead of designing algorithms for each individual graph problem, we take a different perspective:

Can we design a learning framework, applicable to a diverse range of graph problems, that learns such algorithms over large graphs and reaches the steady-state solutions efficiently and effectively?

How should we represent the meta-learner for such algorithms, and how should the learning of these algorithms be carried out?

Iterative algorithm over graphs

For a graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, with node set $\mathcal{V}$ and edge set $\mathcal{E}$, many iterative algorithms over graphs can be formulated as (a small code sketch follows the notation list below):

$$h_{v}^{(t+1)} \leftarrow \mathcal{T}\left(\left\{h_{u}^{(t)}\right\}_{u \in \mathcal{N}(v)}\right), \forall t \geqslant 1, \quad \text{and} \quad h_{v}^{(0)} \leftarrow \text{constant}, \forall v \in \mathcal{V} \qquad (1)$$

until the steady-state conditions are met:

$$h_{v}^{*}=\mathcal{T}\left(\left\{h_{u}^{*}\right\}_{u \in \mathcal{N}(v)}\right), \forall v \in \mathcal{V} \qquad (2)$$

  • node: $v, u \in \mathcal{V}$

  • the set of neighbor nodes of $v$: $\mathcal{N}(v)$

  • some operator: $\mathcal{T}(\cdot)$

  • intermediate representation of node $v$: $h_v$

  • intermediate representation of node $v$ at step $t$: $h_v^{(t)}$

  • final (converged/steady) representation of node $v$: $h_v^{*}$
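To make the generic scheme of Eqs. (1)–(2) concrete, here is a minimal Python sketch, assuming (for simplicity) that the per-node representation is a scalar and that `operator` is just a placeholder function of the neighbor values:

```python
# Generic fixed-point iteration of Eqs. (1)-(2): start from a constant and repeatedly
# apply the operator T over 1-hop neighborhoods until the values stop changing.
def iterate_to_steady_state(neighbors, operator, init=0.0, tol=1e-8, max_iter=1000):
    # neighbors[v]: list of neighbor ids of node v; operator(values) -> new scalar value.
    h = {v: init for v in neighbors}                                            # h_v^(0) <- constant
    for _ in range(max_iter):
        new_h = {v: operator([h[u] for u in neighbors[v]]) for v in neighbors}  # Eq. (1)
        if max(abs(new_h[v] - h[v]) for v in neighbors) < tol:                  # Eq. (2), up to tol
            return new_h
        h = new_h
    return h
```

The two examples below (component detection and PageRank) are essentially this loop with different choices of `operator`.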

Specifically:

1) Graph component detection problem

Goal: find all nodes within the same connected component as a source node $s\in \mathcal{V}$

How: iteratively propagate the label at node $s$ to the other nodes (see the sketch below the steady-state condition):

$$y_{v}^{(t+1)}=\max _{u \in \mathcal{N}(v)} y_{u}^{(t)}, \qquad y_{s}^{(0)}=1, \quad y_{v}^{(0)}=0, \forall v \in \mathcal{V} \setminus \{s\}$$

  • $y_s$: label of node $s$

  • $y_s^{(t)}$: label of node $s$ at step $t$

  • at the initial step $t=0$, the label at node $s$ is set to 1 (infected) and 0 for all other nodes.

Steady state: nodes in the same connected component as $s$ are infected (labeled 1):

$$y_v^{*}= \max_{u\in \mathcal{N}(v)}y_u^{*}$$
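A small sketch of this propagation, assuming an adjacency-dict encoding of the graph; I also keep each node's own label inside the max (a minor practical tweak) so the propagation is monotone and the loop terminates cleanly:

```python
# Label propagation for component detection: y_v <- max over {y_v} U {y_u : u in N(v)}.
def connected_component_labels(neighbors, source):
    y = {v: 0 for v in neighbors}
    y[source] = 1                                             # y_s^(0) = 1, all others 0
    while True:
        new_y = {v: max([y[v]] + [y[u] for u in neighbors[v]]) for v in neighbors}
        if new_y == y:                                        # steady state: y_v* = max_u y_u*
            return y                                          # y_v* = 1 iff v is in s's component
        y = new_y

# Example: connected_component_labels({0: [1], 1: [0, 2], 2: [1], 3: []}, source=0)
# returns {0: 1, 1: 1, 2: 1, 3: 0}.
```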

2) PageRank scores for node importance

Goal: estimate the importance of each node in a graph

How: update the score of each node iteratively:

$$r_{v}^{(t+1)} \leftarrow \frac{1-\lambda}{|\mathcal{V}|}+\frac{\lambda}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} r_{u}^{(t)}, \forall v \in \mathcal{V}$$

  • $r_v^{(t)}$: score of node $v$ at step $t$

  • Initialization: $r_v^{(0)} = 0, \forall v \in \mathcal{V}$

Steady state: $r_{v}^{*} = \frac{1-\lambda}{|\mathcal{V}|}+\frac{\lambda}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} r_{u}^{*}$
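A quick sketch of this iteration, written exactly as the update above (note it normalizes by $|\mathcal{N}(v)|$, following these notes); the adjacency-dict encoding, tolerance, and iteration cap are my own choices:

```python
# PageRank-style iteration: damped neighbor averaging plus a uniform teleport term.
def pagerank_scores(neighbors, lam=0.85, tol=1e-10, max_iter=1000):
    n = len(neighbors)
    r = {v: 0.0 for v in neighbors}                             # r_v^(0) = 0
    for _ in range(max_iter):
        new_r = {v: (1 - lam) / n
                    + lam / max(len(neighbors[v]), 1) * sum(r[u] for u in neighbors[v])
                 for v in neighbors}
        if max(abs(new_r[v] - r[v]) for v in neighbors) < tol:  # steady state (up to tol)
            return new_r
        r = new_r
    return r
```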

3) Mean field inference in graphical models

Goal: approximate the marginal distributions of the variables $x_v$ in a graphical model defined on $\mathcal{G}$

In the graphical model, the joint distribution factorizes as:

$$p\left(\left\{x_{v}\right\}_{v \in \mathcal{V}}\right) \propto \prod_{v \in \mathcal{V}} \phi\left(x_{v}\right) \prod_{(u, v) \in \mathcal{E}} \phi\left(x_{u}, x_{v}\right)$$

  • $p\left(\left\{x_{v}\right\}_{v \in \mathcal{V}}\right)$: the joint distribution over all variables

  • $\phi(x_v)$ and $\phi(x_u,x_v)$ are the node and edge potentials, respectively

How: the marginal approximations can be obtained in an iterative fashion by the following mean field update (a discrete-variable sketch follows below):

$$q^{(t+1)}\left(x_{v}\right) \leftarrow \phi\left(x_{v}\right) \prod_{u \in \mathcal{N}(v)} \exp \left(\int q^{(t)}\left(x_{u}\right) \log \phi\left(x_{u}, x_{v}\right) \mathrm{d} x_{u}\right)$$

  • $q^{(t+1)}\left(x_{v}\right)$: approximate marginal of variable $x_v$ at step $t+1$

Steady state: $q^{*}\left(x_{v}\right) = \phi\left(x_{v}\right) \prod_{u \in \mathcal{N}(v)} \exp \left(\int q^{*}\left(x_{u}\right) \log \phi\left(x_{u}, x_{v}\right) \mathrm{d} x_{u}\right)$
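For discrete variables the integral becomes a sum over the states of $x_u$, which gives a compact sketch; the array-based potentials (`node_pot`, `edge_pot`, assumed to be indexed for both orientations of each edge), the uniform initialization, and the fixed number of sweeps are my own assumptions:

```python
# Mean field update for a discrete MRF:
# q(x_v) ∝ phi(x_v) * prod_{u in N(v)} exp(E_{q(x_u)}[log phi(x_u, x_v)]).
import numpy as np

def mean_field(neighbors, node_pot, edge_pot, n_states, n_iters=50):
    # node_pot[v]: [n_states] array phi(x_v); edge_pot[(u, v)]: [n_states, n_states] phi(x_u, x_v).
    q = {v: np.full(n_states, 1.0 / n_states) for v in neighbors}
    for _ in range(n_iters):
        for v in neighbors:
            log_q = np.log(node_pot[v])
            for u in neighbors[v]:
                log_q += q[u] @ np.log(edge_pot[(u, v)])      # expectation over q(x_u), per value of x_v
            q[v] = np.exp(log_q - log_q.max())                # subtract max for numerical stability
            q[v] /= q[v].sum()                                # normalize the approximate marginal
    return q
```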

4) Compute long range graph convolution features (node classification)

Goal: extract long-range features from the graph and use them to capture the relation between the graph topology and external labels

How: one possible parametrization of the graph convolution features $h_v$, updated from zero initialization, is (see the sketch below):

$$h_{v}^{(t+1)} \leftarrow \sigma\left(W_{1} x_{v}+W_{2} \sum_{u \in \mathcal{N}(v)} h_{u}^{(t)}\right)$$

  • $h_{v}^{(t+1)}$: graph convolution features of node $v$ at step $t+1$

  • $\sigma$: a nonlinear element-wise operation

  • $W_1, W_2$: parameters of the operator

Steady state: $h_{v}^{*} = \sigma\left(W_{1} x_{v}+W_{2} \sum_{u \in \mathcal{N}(v)} h_{u}^{*}\right)$

After that: the label of each node is determined from the steady-state feature $h_v^{*}$ by a labeling function $f(h_v^{*})$.
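A numpy sketch of this fixed-point feature computation from zero initialization; $W_1, W_2$ are passed in as plain matrices, $\sigma$ is taken to be tanh, and the fixed number of sweeps is my own choice (in practice the iteration only converges when the map is a contraction, e.g. when $W_2$ is small enough):

```python
# Fixed-point graph convolution: h_v = sigma(W1 x_v + W2 * sum_{u in N(v)} h_u), from h = 0.
import numpy as np

def graph_conv_steady_state(x, neighbors, W1, W2, n_iters=50):
    # x: [N, d_in] node features; W1: [d_h, d_in]; W2: [d_h, d_h]; neighbors[v]: list of node ids.
    sigma = np.tanh
    h = np.zeros((x.shape[0], W1.shape[0]))                   # zero initialization
    for _ in range(n_iters):
        agg = np.stack([h[neighbors[v]].sum(axis=0) for v in range(len(neighbors))])
        h = sigma(x @ W1.T + agg @ W2.T)
    return h
```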

The Algorithm Learning Problem: framework of algorithm design

Assumption: we have collected the output of an iterative algorithm $\mathcal{T}$ over a single large graph.

Training dataset (input of the proposed algorithm) consists of:

  • input graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$

  • output of the iterative algorithm for a subset of nodes $\mathcal{V}^{(y)} \subseteq \mathcal{V}$ (the labeled nodes) of the graph

$$\mathcal{D}=\left\{f_v^{*}:=f(h_v^{*}) \;\middle|\; h_v^{*}=\mathcal{T}\left[\{h_u^{*}\}_{u\in \mathcal{N}(v)}\right],\; v\in \mathcal{V}^{(y)}\right\} \qquad (3)$$

  • $h_v^{*}$: the quantity in the iterative algorithm satisfying the steady-state conditions

  • $f(\cdot)$: an additional labeling function that takes the steady-state quantity as input and produces the final label for each node

  • $f_v^{*}$: the ground-truth output for node $v$

Given the above $\mathcal{D}$,

Goal: learn a parameterized algorithm $\mathcal{A}_{\Theta}$ such that the output of $\mathcal{A}_{\Theta}$ mimics the output of the original algorithm $\mathcal{T}$.

Namely:

The output of $\mathcal{A}_{\Theta}$ is $\mathcal{A}_{\Theta}[\mathcal{G}]=\{\hat{f}_v\}_{v\in \mathcal{V}^{(y)}}$, which should be close to $f_v^{*}$ according to some loss function.

The algorithm learning problem for $\mathcal{A}_{\Theta}$ can be formulated as the following optimization problem:

$$\min_{\Theta} \sum_{v \in \mathcal{V}^{(y)}} \ell\left(f_{v}^{*}, \widehat{f}_{v}\right) \qquad (4)$$

$$\text{s.t. } \left\{\widehat{f}_{v}\right\}_{v \in \mathcal{V}^{(y)}}=\mathcal{A}_{\Theta}[\mathcal{G}] \qquad (5)$$

  • $\ell\left(f_{v}^{*}, \widehat{f}_{v}\right)$: loss function

Design goal: respect steady-state conditions and learn fast.

Core:

  • a steady-state operator $\mathcal{T}_{\Theta}$ between vector embedding representations of nodes

  • a link function $g$ mapping the embedding to the algorithm output

  • Note: $\mathcal{A}_{\Theta}$ thus consists of $\mathcal{T}_{\Theta}$ and $g$

$$\text{output}: \left\{\widehat{f}_{v}:=g(\widehat{h}_{v})\right\}_{v \in \mathcal{V}} \qquad (6)$$

$$\text{s.t. } \widehat{h}_{v}=\mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right] \qquad (7)$$

  • initialization: $\widehat{h}_v \leftarrow \text{constant}$ for all $v\in \mathcal{V}$

  • update using equation (7)

Operator $\mathcal{T}_{\Theta}$: a two-layer NN

  • General nonlinear function class

  • The operator enforces the steady-state condition of node embeddings based on 1-hop local neighborhood information.

  • Due to the variety of graph structures, this function should be able to handle a varying number of inputs (i.e., different numbers of neighbor nodes).

$$\widehat{h}_v = \mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right]=W_{1}\, \sigma\left(W_{2}\left[x_{v}, \sum_{u \in \mathcal{N}(v)}\left[\widehat{h}_{u}, x_{u}\right]\right]\right) \qquad (9)$$

  • $\sigma(\cdot)$: element-wise activation function (e.g., sigmoid, ReLU)

  • $W_1, W_2$: weight matrices of the NN; in Eq. (9), $W_2$ is the first (inner) layer and $W_1$ the second (outer) layer

  • $x_v$: the optional feature representation of node $v$

Link function $g$: a two-layer NN

  • General nonlinear function class

  • input: node embeddings

  • predict: the corresponding algorithm output (e.g., the label of a node)

$$g\left(\widehat{h}_{v}\right)=\sigma\left(V_{2}^{\top} \operatorname{ReLU}\left(V_{1}^{\top} \widehat{h}_{v}\right)\right) \qquad (10)$$

  • $\widehat{h}_{v}$: node embedding

  • $V_1, V_2$: parameters of $g$; $V_1$: first layer, $V_2$: second layer

  • $\sigma$: task-specific activation function

    • linear regression: identity function $\sigma(x)=x$

    • multi-class classification: $\sigma(\cdot)$ is the softmax (the output lies on the probability simplex)
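A minimal PyTorch sketch of the two networks in Eqs. (9) and (10). The module and dimension names, the choice of ReLU for the inner $\sigma$ of Eq. (9), and the per-node aggregation loop are my own assumptions, not the paper's reference implementation; the task-specific $\sigma$ of Eq. (10) is left to the loss (e.g., softmax inside cross-entropy):

```python
import torch
import torch.nn as nn

class SteadyStateOperator(nn.Module):
    """T_Theta of Eq. (9): h_v = W1 * sigma(W2 [x_v, sum_{u in N(v)} [h_u, x_u]])."""
    def __init__(self, d_x, d_h, d_hidden):
        super().__init__()
        self.W2 = nn.Linear(2 * d_x + d_h, d_hidden)   # inner layer, acts on [x_v, aggregate]
        self.W1 = nn.Linear(d_hidden, d_h)             # outer layer, back to embedding space

    def forward(self, x, h, nodes, neighbors):
        # x: [N, d_x] node features, h: [N, d_h] current embeddings,
        # nodes: ids of the nodes to update, neighbors[v]: LongTensor of v's neighbor ids.
        agg = torch.stack([torch.cat([h[neighbors[v]], x[neighbors[v]]], dim=1).sum(0)
                           for v in nodes])
        z = torch.cat([x[nodes], agg], dim=1)
        return self.W1(torch.relu(self.W2(z)))

class LinkFunction(nn.Module):
    """g of Eq. (10), up to the task-specific output activation."""
    def __init__(self, d_h, d_hidden, d_out):
        super().__init__()
        self.V1 = nn.Linear(d_h, d_hidden)
        self.V2 = nn.Linear(d_hidden, d_out)

    def forward(self, h):
        return self.V2(torch.relu(self.V1(h)))
```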

The overall optimization problem

$$\begin{array}{c} \min _{\left\{W_{i}, V_{i}\right\}_{i=1}^{2}} \mathcal{L}\left(\left\{W_{i}, V_{i}\right\}_{i=1}^{2}\right):=\frac{1}{\left|\mathcal{V}^{(y)}\right|} \sum_{v \in \mathcal{V}^{(y)}} \ell\left(f_{v}^{*}, g(\widehat{h}_{v})\right) \\ \text{s.t. } \widehat{h}_{v}=\mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right], \forall v \in \mathcal{V} \end{array} \qquad (11)$$

  • $W_1, W_2$: parameters of $\mathcal{T}_{\Theta}$

  • $V_1, V_2$: parameters of $g$

  • My understanding: this is a form of semi-supervised learning.

How to solve (11): Stochastic Steady-State Embedding (SSE)

An alternating algorithm that alternates between:

  • using the current model to compute the embeddings and make predictions

  • using the gradient of the loss with respect to $\{W_1, W_2, V_1, V_2\}$ to update these parameters

Intuition:

  • RL (policy iteration): improve the "policy" by updating the parameters of $\mathcal{T}_{\Theta}$ and $g$ so as to minimize the loss against $f^{*}$

    • steady-state embedding $\hat{h}_v$ of each node: the "value function"

    • embedding operator $\mathcal{T}_{\Theta}$ and classifier function $g$: the "policy"

  • analogy with k-means and EM (my own observation)

"Value" estimation: estimate steady-state hv^\hat{h_v}

limitation: it is prohibitive to solve the steady-state equation exactly in large-scale graph with millions of vertices since it requires visiting all the nodes in the graph.

Solution: stochastic fixed point iteration, the extra randomness on the constraints for sampling the constraints to tackle the groups of equations approximately.

In k-thk\text{-th} step, first sample a set of nodes V~={v1,v2,,vN}V\tilde{\mathcal{V}}=\{v_1,v_2,\dots,v_N\}\in \mathcal{V} from the entire node set rather of the labeled set. Update the new embedding by moving average:

h^vi(k)(1α)h^vi(k1)+αTΘ[{h^u(k1)}uN(vi)],viV~        (12)\widehat{h}_{v_{i}}^{(k)} \leftarrow(1-\alpha) \hat{h}_{v_{i}}^{(k-1)}+\alpha \mathcal{T}_{\Theta}\left[\{\widehat{h}_{u}^{(k-1)}\}_{u \in \mathcal{N}\left(v_{i}\right)}\right], \forall v_{i} \in \tilde{\mathcal{V}} \space\space\space\space\space\space\space\space (12)
  • α:0α1\alpha: 0 \leq \alpha\leq1
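A tiny numpy illustration of Eq. (12): only the sampled nodes are refreshed, and the refresh is damped by $\alpha$. Here `t_theta` is just a stand-in callable for the learned operator acting on the 1-hop embeddings, which is an assumption of this sketch:

```python
# Stochastic fixed-point step: refresh a random subset of nodes with a damped T_Theta update.
import numpy as np

def stochastic_fixed_point_step(h, neighbors, t_theta, alpha=0.5, n_sample=32, rng=None):
    # h: [N, d] array of embeddings; neighbors[v]: list of neighbor indices of node v.
    rng = rng or np.random.default_rng()
    sampled = rng.choice(len(h), size=min(n_sample, len(h)), replace=False)
    new_vals = {int(v): t_theta(h[neighbors[int(v)]]) for v in sampled}   # reads the old h only
    for v, h_new in new_vals.items():
        h[v] = (1 - alpha) * h[v] + alpha * h_new                         # Eq. (12)
    return h
```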

"Policy" improvement: update parameters of TΘ and g\mathcal{T}_{\Theta} \text{ and } g

At the k-thk\text{-th} step, once we have {h^v(k)}vV\{\widehat{h}_{v}^{(k)}\}_{v\in \mathcal{V}} satisfying the steady-state equation, we use vanilla stochastic gradient descent to update parameters {W1,W2,V1,V2}\{W_1, W_2, V_1, V_2\} :

LVi=E^[(fv,g(h^vk))g(h^vk)g(h^vk)Vi]LWi=E^[(fv,g(h^vk))h^vk.TΘWi]\begin{aligned} \frac{\partial \mathcal{L}}{\partial \color{red}V_{i}} &=\widehat{\mathbb{E}}\left[\frac{\partial \ell\left(f_{v}^{*}, g\left(\hat{h}_{v}^{k}\right)\right)}{\partial g\left(\hat{h}_{v}^{k}\right)} \frac{\partial g\left(\hat{h}_{v}^{k}\right)}{\partial \color{red}V_{i}}\right] \\ \frac{\partial \mathcal{L}}{\partial \color{red}W_{i}} &=\widehat{\mathbb{E}}\left[\frac{\partial \ell\left(f_{v}^{*}, g\left(\hat{h}_{v}^{k}\right)\right)}{\partial \widehat{h}_{v}^{k}} \frac{.\partial \mathcal{T}_{\Theta}}{\partial \color{red}W_{i}}\right] \end{aligned}
  • E^[]\widehat{\mathbb{E}}[\cdot] : the expectation is w.r.t. uniform distribution over labeled nodes V(y)\mathcal{V}^{(y)} .

  • nhn_h : the # of inner loops in "value" estimation

  • nfn_f : the # of inner loops in "policy" improvement
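A hedged sketch of how one outer round of the alternation could look, reusing the `SteadyStateOperator` / `LinkFunction` modules sketched earlier (any torch modules with those call signatures would do). The mini-batch sizes, the optimizer passed in as `opt`, and the cross-entropy loss (i.e., a node-classification task) are my own choices, not the paper's:

```python
# One SSE round: n_h damped "value" updates (Eq. 12), then n_f "policy" gradient steps
# on mini-batches of labeled nodes.
import random
import torch

def sse_round(x, h, neighbors, labels, labeled_idx, t_theta, g, opt,
              n_h=5, n_f=1, alpha=0.5, batch=64):
    loss_fn = torch.nn.CrossEntropyLoss()
    all_nodes = list(range(x.size(0)))
    for _ in range(n_h):                                  # "value" estimation
        vs = random.sample(all_nodes, min(batch, len(all_nodes)))
        with torch.no_grad():
            h[vs] = (1 - alpha) * h[vs] + alpha * t_theta(x, h, vs, neighbors)
    for _ in range(n_f):                                  # "policy" improvement
        vs = random.sample(labeled_idx, min(batch, len(labeled_idx)))
        h_batch = t_theta(x, h, vs, neighbors)            # re-apply T so grads reach W1, W2
        loss = loss_fn(g(h_batch), labels[vs])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h
```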

Complexity

  • Memory space

    • $O(|\mathcal{V}|)$: the dominating part is the persistent node embedding matrix $\{\widehat{h}_v\}_{v\in \mathcal{V}}$

    • by comparison, the $T$-hop GNN family needs $O(T|\mathcal{V}|)$, since the embeddings of all $T$ propagation steps are kept

  • Time: the computational cost in each iteration is just proportional to the number of edges in each mini-batch.

    • "policy" improvement: O(MEV)O(M\frac{|\mathcal{E}|}{|\mathcal{V}|})

    • "value" estimation: O(NEV)O(N\frac{|\mathcal{E}|}{|\mathcal{V}|})
