📒
PaperNotes
  • PAPER NOTES
  • Meta-Learning with Implicit Gradient
  • DARTS: Differentiable Architecture Search
  • Meta-Learning of Neural Architectures for Few-Shot Learning
  • Towards Fast Adaptation of Neural Architectures with Meta Learning
  • Editable Neural Networks
  • ANIL (Almost No Inner Loop)
  • Meta-Learning Representation for Continual Learning
  • Learning to learn by gradient descent by gradient descent
  • Modular Meta-Learning with Shrinkage
  • NADS: Neural Architecture Distribution Search for Uncertainty Awareness
  • Modular Meta Learning
  • Sep
    • Incremental Few Shot Learning with Attention Attractor Network
    • Learning Steady-States of Iterative Algorithms over Graphs
      • Experiments
    • Learning combinatorial optimization algorithms over graphs
    • Meta-Learning with Shared Amortized Variational Inference
    • Concept Learners for Generalizable Few-Shot Learning
    • Progressive Graph Learning for Open-Set Domain Adaptation
    • Probabilistic Neural Architecture Search
    • Large-Scale Long-Tailed Recognition in an Open World
    • Learning to stop while learning to predict
    • Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift
    • Learning to Generalize: Meta-Learning for Domain Generalization
  • Oct
    • Meta-Learning Acquisition Functions for Transfer Learning in Bayesian Optimization
    • Network Architecture Search for Domain Adaptation
    • Continuous Meta Learning without tasks
    • Learning Causal Models Online
    • Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples
    • Conditional Neural Progress (CNPs)
    • Reviving and Improving Recurrent Back-Propagation
    • Meta-Q-Learning
    • Learning Self-train for semi-supervised few shot classification
    • Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
  • Nov
    • Neural Process
    • Adversarially Robust Few-Shot Learning: A Meta-Learning Approach
    • Learning to Adapt to Evolving Domains
  • Tutorials
    • Relax constraints to continuous
    • MAML, FO-MAML, Reptile
    • Gradient Descent
      • Steepest Gradient Descent
      • Conjugate Gradient Descent
  • KL, Entropy, MLE, ELBO
  • Coding tricks
    • Python
    • Pytorch
  • ml
    • kmeans
Powered by GitBook
On this page
  • Motivation
  • Iterative algorithm over graphs
  • 1) Graph component detection problem
  • 2) PageRank scores for node importance
  • 3) Mean field inference in graphical model
  • 4) Compute long range graph convolution features (node classification)
  • The Algorithm Learning Problem: framework of algorithm design
  • Operator : a two-layer NN
  • Link function (prediction function) : a two-layer NN
  • The overall optimization problem
  • How to solve (11): Stochastic Steady-state Embedding(SSE)
  • "Value" estimation: estimate steady-state
  • "Policy" improvement: update parameters of
  • Complexity
  • Reference

Was this helpful?

  1. Sep

Learning Steady-States of Iterative Algorithms over Graphs

ICML 2018 9/15/2020

Motivation

Many graph analytics problems can be solved via iterative algorithms according to the graph structure, and the solutions of the algorithms are often characterized by a set of steady-state conditions.

  • PageRank: score of a node

  • Mean field inference: the posterior distribution of a variable

Instead of designing algorithms for each individual graph problem, we take a different perspective:

Can we design a learning framework for a diverse range of graph problems that learns the algorithm over large graphs achieving the steady-state solutions efficiently and effectively?

How to represent the meta learner for such algorithm and how to carry out the learning of these algorithms?

Iterative algorithm over graphs

For a graph: G=(V,E)\mathcal{G}=(\mathcal{V}, \mathcal{E})G=(V,E) , with node set V\mathcal{V} V and edge set E\mathcal{E}E , many iterative algorithms over graphs can be formulated into:

hv(t+1)←T({hu(t)}u∈N(v)),∀t⩾1, and hv(0)← constant ,∀v∈V      (1)\begin{array}{l} h_{v}^{(t+1)} \leftarrow \mathcal{T}\left(\left\{h_{u}^{(t)}\right\}_{u \in \mathcal{N}(v)}\right), \forall t \geqslant 1, \text { and } \\ h_{v}^{(0)} \leftarrow \text { constant }, \forall v \in \mathcal{V} \end{array} \space\space\space\space\space\space (1)hv(t+1)​←T({hu(t)​}u∈N(v)​),∀t⩾1, and hv(0)​← constant ,∀v∈V​      (1)

until the steady-state conditions are met:

hv∗=T({hu∗}u∈N(v)),∀v∈V      (2)h_{v}^{*}=\mathcal{T}\left(\left\{h_{u}^{*}\right\}_{u \in \mathcal{N}(v)}\right), \forall v \in \mathcal{V} \space\space\space\space\space\space (2)hv∗​=T({hu∗​}u∈N(v)​),∀v∈V      (2)
  • node:v,u∈Vnode: v,u\in \mathcal{V}node:v,u∈V

  • the set of neighbor nodes of v:N(v)\text{the set of neighbor nodes of } v: \mathcal{N}(v)the set of neighbor nodes of v:N(v)

  • some operator :T(⋅)\text{some operator }: \mathcal{T}(\cdot)some operator :T(⋅)

  • Intermediate representation of node v:hv\text{Intermediate representation of node }v: h_vIntermediate representation of node v:hv​

  • Intermediate representation of node v at step t:hv(t)\text{Intermediate representation of node }v \text{ at step } t: h_v^{(t)}Intermediate representation of node v at step t:hv(t)​

  • Final (converged/steady) intermediate representation of node v:hv∗\text{Final (converged/steady) intermediate representation of node }v : h_v^{*}Final (converged/steady) intermediate representation of node v:hv∗​

Specifically:

1) Graph component detection problem

Goal: find all nodes within the same connected component as source node s∈Vs\in \mathcal{V}s∈V

How: iteratively propagate the label at node sss to other nodes:

yv(t+1)=max⁡u∈N(v)yu(t),ys(0)=1,yv(0)=0,∀v∈Vy_{v}^{(t+1)}=\max _{u \in \mathcal{N}(v)} y_{u}^{(t)}, y_{s}^{(0)}=1, y_{v}^{(0)}=0, \forall v \in \mathcal{V}yv(t+1)​=u∈N(v)max​yu(t)​,ys(0)​=1,yv(0)​=0,∀v∈V
  • ys:y_s:ys​: label of node sss

  • ys(t):y_s^{(t)}:ys(t)​: label of node sss at step ttt

  • at initial step t=0t=0t=0, the label at node sss are set to 1 (infected), 0 for all other nodes.

Steady state: nodes in the same connected component as sss are infected, (labelled as 1)

yv∗=max⁡u∈N(v)yu∗y_v^{*}= \max_{u\in \mathcal{N}(v)}y_u^{*}yv∗​=u∈N(v)max​yu∗​

2) PageRank scores for node importance

Goal: estimate the importance of each node in a graph

How: update score of each node iteratively

rv(t+1)←(1−λ)∣V∣+λ∣N(v)∣∑u∈N(v)ru(t),∀v∈Vr_{v}^{(t+1)} \leftarrow \frac{(1-\lambda)}{|\mathcal{V}|}+\frac{\lambda}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} r_{u}^{(t)}, \forall v \in \mathcal{V}rv(t+1)​←∣V∣(1−λ)​+∣N(v)∣λ​u∈N(v)∑​ru(t)​,∀v∈V
  • rv(t)r_v^{(t)}rv(t)​ : score of node vvv at step ttt

  • Initialization: rv(0)=0,∀v∈Vr_v^{(0)} = 0, \forall v \in \mathcal{V}rv(0)​=0,∀v∈V

Steady state: rv∗=(1−λ)∣V∣+λ∣N(v)∣∑u∈N(v)ru∗r_{v}^{*} = \frac{(1-\lambda)}{|\mathcal{V}|}+\frac{\lambda}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} r_{u}^{*}rv∗​=∣V∣(1−λ)​+∣N(v)∣λ​∑u∈N(v)​ru∗​

3) Mean field inference in graphical model

Goal: approximate the marginal distributions of a set of variables xvx_vxv​ in a graph model defined on G\mathcal{G}G

In graphical model, we know:

p({xv}v∈V)∝∏v∈Vϕ(xv)∏(u,v)∈Eϕ(xu,xv)p\left(\left\{x_{v}\right\}_{v \in \mathcal{V}}\right) \propto \prod_{v \in \mathcal{V}} \phi\left(x_{v}\right) \prod_{(u, v) \in \mathcal{E}} \phi\left(x_{u}, x_{v}\right)p({xv​}v∈V​)∝v∈V∏​ϕ(xv​)(u,v)∈E∏​ϕ(xu​,xv​)
  • p({xv}v∈V)p\left(\left\{x_{v}\right\}_{v \in \mathcal{V}}\right) p({xv​}v∈V​) : true marginal distribution

  • ϕ(xv) and ϕ(xu,xv)\phi(x_v) \text{ and } \phi(x_u,x_v)ϕ(xv​) and ϕ(xu​,xv​) are node and edge potential respectively

How: The marginal approximation can be obtained in an iterative fashion by the following mean field update:

q(t+1)(xv)←ϕ(xv)∏u∈N(v)exp⁡(∫uq(t)(xu)log⁡ϕ(xu,xv)du)\begin{aligned} q^{(t+1)}\left(x_{v}\right) \leftarrow \phi\left(x_{v}\right) \prod_{u \in \mathcal{N}(v)} & \exp \left(\int_{u}q^{(t)}\left(x_{u}\right) \log \phi\left(x_{u}, x_{v}\right) \mathrm{d} u\right) \end{aligned}q(t+1)(xv​)←ϕ(xv​)u∈N(v)∏​​exp(∫u​q(t)(xu​)logϕ(xu​,xv​)du)​
  • q(t+1)(xv)q^{(t+1)}\left(x_{v}\right) q(t+1)(xv​) : marginal approximation of a set of variables xvx_vxv​ at step t+1t+1t+1

Steady State: q∗(xv)=ϕ(xv)∏u∈N(v)exp⁡(∫uq∗(xu)log⁡ϕ(xu,xv)du)\begin{aligned} q^{*}\left(x_{v}\right) = \phi\left(x_{v}\right) \prod_{u \in \mathcal{N}(v)} & \exp \left(\int_{u}q^{*}\left(x_{u}\right) \log \phi\left(x_{u}, x_{v}\right) \mathrm{d} u\right) \end{aligned}q∗(xv​)=ϕ(xv​)u∈N(v)∏​​exp(∫u​q∗(xu​)logϕ(xu​,xv​)du)​

4) Compute long range graph convolution features (node classification)

Goal: extract long range features from graph and use that features to figure to capture the relation between graph topology and external labels

How: One possible parametrization of graph convolution features hvh_vhv​ can be updated from zeros initialization as:

hv(t+1)←σ(W1xv+W2∑u∈N(v)hu(t))h_{v}^{(t+1)} \leftarrow \sigma\left(W_{1} x_{v}+W_{2} \sum_{u \in \mathcal{N}(v)} h_{u}^{(t)}\right)hv(t+1)​←σ​W1​xv​+W2​u∈N(v)∑​hu(t)​​
  • hv(t+1):graph convolution features for node v at step t+1h_{v}^{(t+1)} : \text{graph convolution features for node } v \text{ at step } t+1 hv(t+1)​:graph convolution features for node v at step t+1

  • σ:a nonlinear element-wise operation\sigma : \text{a nonlinear element-wise operation}σ:a nonlinear element-wise operation

  • W1,W2:parameters of the operatorW_1,W_2: \text{parameters of the operator}W1​,W2​:parameters of the operator

Steady State: hv∗←σ(W1xv+W2∑u∈N(v)hu∗)h_{v}^{*} \leftarrow \sigma\left(W_{1} x_{v}+W_{2} \sum_{u \in \mathcal{N}(v)} h_{u}^{*}\right)hv∗​←σ(W1​xv​+W2​∑u∈N(v)​hu∗​)

After that: the label for each node will be determined by the steady state feature hv∗h_v^*hv∗​ by a labeling function: f(hv∗)f(h_v^*)f(hv∗​)

The Algorithm Learning Problem: framework of algorithm design

Assumption: we have collected the output of an iterative algorithm T\mathcal{T}T over a single large graph.

Training dataset (input of the proposed algorithm) consists of:

  • input graph G=(V,E)\mathcal{G}=(\mathcal{V},\mathcal{E})G=(V,E)

  • output of the iterative algorithm for a subset of nodes, V(y)⊆V\mathcal{V}^{(y)} \subseteq\mathcal{V}V(y)⊆V , (Note: labeled nodes) from the graph

D={fv∗:=f(hv∗)∣hv∗=T[{hu∗}u∈N(v)],v∈V(y)}     (3)\mathcal{D}=\left\{f_v^{*}:=f(h_v^{*})|h_v^{*}=\mathcal{T[\{h_u^{*}\}_{u\in \mathcal{N}(v)}], v\in \mathcal{V}}^{(y)}\right\} \space\space\space\space\space (3)D={fv∗​:=f(hv∗​)∣hv∗​=T[{hu∗​}u∈N(v)​],v∈V(y)}     (3)

  • hv∗h_v^{*}hv∗​ : is the quantity in the iterative algorithm satisfying the steady-state conditions

  • f(⋅)f(\cdot)f(⋅) : an additional labeling function taking input the steady-state quantity, produces the final label for each node

  • fv∗:f_v^{*}:fv∗​: ground truth of node vvv

Given the above D\mathcal{D}D ,

Goal: to learn a parameterized algorithm AΘ\mathcal{A}_{\Theta}AΘ​ , such that the output of the algorithm AΘ\mathcal{A}_{\Theta}AΘ​ can mimic the output of the original algorithm T\mathcal{T}T .

Namely:

The output of AΘ\mathcal{A}_{\Theta}AΘ​ is: AΘ[G]={fv^}v∈V(y)\mathcal{A}_{\Theta}[\mathcal{G}]=\{\hat{f_v}\}_{v\in \mathcal{V}^{(y)}}AΘ​[G]={fv​^​}v∈V(y)​ , which are close to fv∗f_v^{*}fv∗​ according to some loss function.

The algorithm learning problem for AΘ\mathcal{A}_{\Theta}AΘ​ can be formulated into the following optimization problem:

min⁡Θ∑v∈V(y)ℓ(fv∗,f^v)       (4) s.t. {f^v}v∈V(y)=AΘ[G]       (5) \min_{\Theta} \sum_{v \in \mathcal{V}^{(y)}} \ell\left(f_{v}^{*}, \widehat{f}_{v}\right) \space\space\space\space\space\space\space (4)\\ \text { s.t. }\left\{\widehat{f}_{v}\right\}_{v \in \mathcal{V}^{(y)}}=\mathcal{A}_{\Theta}[\mathcal{G}] \space\space\space\space\space\space\space (5) Θmin​v∈V(y)∑​ℓ(fv∗​,f​v​)       (4) s.t. {f​v​}v∈V(y)​=AΘ​[G]       (5)
  • ℓ(fv∗,f^v)\ell\left(f_{v}^{*}, \widehat{f}_{v}\right) ℓ(fv∗​,f​v​) : loss function

Design goal: respect steady-state conditions and learn fast.

Core:

  • a steady-state operator TΘ\mathcal{T}_{\Theta}TΘ​between vector embedding representations of nodes

  • a link function ggg mapping the embedding to the algorithm output.

  • Note: namely AΘ:TΘ and g\mathcal{A}_{\Theta}: \mathcal{T}_{\Theta} \text{ and } gAΘ​:TΘ​ and g

Stead-state operator and link function

 output :{f^v:=g(h^v)}v∈V         (6) s.t. h^v=TΘ[{h^u}u∈N(v)]         (7)\begin{aligned} \text { output }: &\left\{\widehat{f}_{v}:=g(\hat{h}_{v})\right\}_{v \in \mathcal{V}} \space\space\space\space\space\space\space\space\space (6)\\ \text { s.t. } & \widehat{h}_{v}=\mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right] \space\space\space\space\space\space\space\space\space (7) \end{aligned} output : s.t. ​{f​v​:=g(h^v​)}v∈V​         (6)hv​=TΘ​[{hu​}u∈N(v)​]         (7)​
  • initialization: h^v←constant for all v∈V\widehat{h}_v \leftarrow \text{constant for all } v\in \mathcal{V}hv​←constant for all v∈V

  • update using equation (7)

Operator TΘ\mathcal{T}_{\Theta}TΘ​ : a two-layer NN

  • General nolinear function class

  • The operator enforces the steady-state condition of node embeddings based on 1-hop

    local neighborhood information.

  • Due to the variety of graph structures, this function should be able to handle different

    number of inputs (i.e., different number of neighbor nodes)

h^v=TΘ[{h^u}u∈N(v)]=W1σ(W2[xv,∑u∈N(v)[h^u,xu]])        (9)\widehat{h}_v = \mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right]=W_{1} \sigma\left(W_{2}\left[x_{v}, \sum_{u \in \mathcal{N}(v)}\left[\widehat{h}_{u}, x_{u}\right]\right]\right) \space\space\space\space\space\space\space\space (9)hv​=TΘ​[{hu​}u∈N(v)​]=W1​σ​W2​​xv​,u∈N(v)∑​[hu​,xu​]​​        (9)
  • σ(⋅)\sigma(\cdot)σ(⋅) : element-wise activation function: Sigmoid, ReLU

  • W1,W2:weight matrices of NN.W_1, W_2 : \text{weight matrices of NN} .W1​,W2​:weight matrices of NN. W1:first layer, W2:2nd layerW_1: \text{first layer, } W_2: \text{2nd layer}W1​:first layer, W2​:2nd layer

  • xvx_vxv​ : the optional feature representation of nodes

Link function (prediction function) ggg : a two-layer NN

  • General nolinear function class

  • input: node embeddings

  • predict: the corresponding algorithm outputs (like label of node)

g(h^v)=σ(V2⊤ReLU⁡(V1⊤h^v))      (10)g\left(\widehat{h}_{v}\right)=\sigma\left(V_{2}^{\top} \operatorname{ReLU}\left(V_{1}^{\top} \widehat{h}_{v}\right)\right) \space\space\space\space\space\space (10)g(hv​)=σ(V2⊤​ReLU(V1⊤​hv​))      (10)
  • h^v\widehat{h}_{v}hv​ : node embeddings

  • V1,V2:parameters of g.V_1, V_2 : \text{parameters of } g.V1​,V2​:parameters of g. V1:first layer, V2:2nd layerV_1: \text{first layer, } V_2: \text{2nd layer}V1​:first layer, V2​:2nd layer

  • σ:task-specific activation function\sigma:\text{task-specific activation function}σ:task-specific activation function

    • linear regression: identity function σ(x)=x\sigma(x)=xσ(x)=x

    • multi-class classification: σ(⋅)\sigma(\cdot)σ(⋅) is softmax (output a probabilistic simplex)

The overall optimization problem

min⁡{Wi,Vi}i=12L({Wi,Vi}i=12):=1∣Vy∣∑v∈V(y)ℓ(fv∗,g(h^v)) s.t. h^v=TΘ[{h^u}u∈N(v)],∀v∈V      (11)\begin{array}{c} \min _{\left\{W_{i}, V_{i}\right\}_{i=1}^{2}} \mathcal{L}\left(\left\{W_{i}, V_{i}\right\}_{i=1}^{2}\right):=\frac{1}{\left|\mathcal{V}^{y}\right|} \sum_{v \in \mathcal{V}^{(y)}} \ell(f_{v}^{*}, g(\hat{h}_{v})) \\ \text { s.t. } \widehat{h}_{v}=\mathcal{T}_{\Theta}\left[\left\{\widehat{h}_{u}\right\}_{u \in \mathcal{N}(v)}\right], \forall v \in \mathcal{V} \end{array} \space\space\space\space\space\space (11)min{Wi​,Vi​}i=12​​L({Wi​,Vi​}i=12​):=∣Vy∣1​∑v∈V(y)​ℓ(fv∗​,g(h^v​)) s.t. hv​=TΘ​[{hu​}u∈N(v)​],∀v∈V​      (11)
  • W1,W2:parameters of TΘW_1,W_2: \text{parameters of } \mathcal{T}_{\Theta}W1​,W2​:parameters of TΘ​

  • V1,V2:parameters of gV_1,V_2: \text{parameters of } gV1​,V2​:parameters of g

  • my understanding: semi-surpervised learning

How to solve (11): Stochastic Steady-state Embedding(SSE)

An alternating algorithm: alternate between:

  • using most current model to find the embeddings and make prediction

  • using the gradient of the loss with respect to {W1,W2,V1,V2}\{W_1, W_2, V_1, V_2\}{W1​,W2​,V1​,V2​} for update these parameters

Intuition:

  • RL (policy iteration): improve the policy minimizing the cost proportional to f∗f^{*}f∗ by updating the parameters TΘ and g\mathcal{T}_{\Theta} \text{ and } gTΘ​ and g

    • steady-state hv^\hat{h_v}hv​^​ for each node: "value function"

    • embedding operator TΘ\mathcal{T}_{\Theta}TΘ​ and classifier function ggg : "policy"

  • K-means and EM (mine)

"Value" estimation: estimate steady-state hv^\hat{h_v}hv​^​

limitation: it is prohibitive to solve the steady-state equation exactly in large-scale graph with millions of vertices since it requires visiting all the nodes in the graph.

Solution: stochastic fixed point iteration, the extra randomness on the constraints for sampling the constraints to tackle the groups of equations approximately.

In k-thk\text{-th}k-th step, first sample a set of nodes V~={v1,v2,…,vN}∈V\tilde{\mathcal{V}}=\{v_1,v_2,\dots,v_N\}\in \mathcal{V}V~={v1​,v2​,…,vN​}∈V from the entire node set rather of the labeled set. Update the new embedding by moving average:

h^vi(k)←(1−α)h^vi(k−1)+αTΘ[{h^u(k−1)}u∈N(vi)],∀vi∈V~        (12)\widehat{h}_{v_{i}}^{(k)} \leftarrow(1-\alpha) \hat{h}_{v_{i}}^{(k-1)}+\alpha \mathcal{T}_{\Theta}\left[\{\widehat{h}_{u}^{(k-1)}\}_{u \in \mathcal{N}\left(v_{i}\right)}\right], \forall v_{i} \in \tilde{\mathcal{V}} \space\space\space\space\space\space\space\space (12)hvi​(k)​←(1−α)h^vi​(k−1)​+αTΘ​[{hu(k−1)​}u∈N(vi​)​],∀vi​∈V~        (12)
  • α:0≤α≤1\alpha: 0 \leq \alpha\leq1α:0≤α≤1

"Policy" improvement: update parameters of TΘ and g\mathcal{T}_{\Theta} \text{ and } gTΘ​ and g

At the k-thk\text{-th}k-th step, once we have {h^v(k)}v∈V\{\widehat{h}_{v}^{(k)}\}_{v\in \mathcal{V}}{hv(k)​}v∈V​ satisfying the steady-state equation, we use vanilla stochastic gradient descent to update parameters {W1,W2,V1,V2}\{W_1, W_2, V_1, V_2\}{W1​,W2​,V1​,V2​} :

∂L∂Vi=E^[∂ℓ(fv∗,g(h^vk))∂g(h^vk)∂g(h^vk)∂Vi]∂L∂Wi=E^[∂ℓ(fv∗,g(h^vk))∂h^vk.∂TΘ∂Wi]\begin{aligned} \frac{\partial \mathcal{L}}{\partial \color{red}V_{i}} &=\widehat{\mathbb{E}}\left[\frac{\partial \ell\left(f_{v}^{*}, g\left(\hat{h}_{v}^{k}\right)\right)}{\partial g\left(\hat{h}_{v}^{k}\right)} \frac{\partial g\left(\hat{h}_{v}^{k}\right)}{\partial \color{red}V_{i}}\right] \\ \frac{\partial \mathcal{L}}{\partial \color{red}W_{i}} &=\widehat{\mathbb{E}}\left[\frac{\partial \ell\left(f_{v}^{*}, g\left(\hat{h}_{v}^{k}\right)\right)}{\partial \widehat{h}_{v}^{k}} \frac{.\partial \mathcal{T}_{\Theta}}{\partial \color{red}W_{i}}\right] \end{aligned}∂Vi​∂L​∂Wi​∂L​​=E​∂g(h^vk​)∂ℓ(fv∗​,g(h^vk​))​∂Vi​∂g(h^vk​)​​=E​∂hvk​∂ℓ(fv∗​,g(h^vk​))​∂Wi​.∂TΘ​​​​
  • E^[⋅]\widehat{\mathbb{E}}[\cdot]E[⋅] : the expectation is w.r.t. uniform distribution over labeled nodes V(y)\mathcal{V}^{(y)}V(y) .

  • nhn_hnh​ : the # of inner loops in "value" estimation

  • nfn_fnf​ : the # of inner loops in "policy" improvement

Complexity

  • Memory space

    • O(∣V∣):O(|\mathcal{V}|): O(∣V∣): The dominating part is the persistent node embedding matrix{h^v}v∈V\{\widehat{h}_v\}_{v\in \mathcal{V}}{hv​}v∈V​

    • O(T∣V∣):O(T|\mathcal{V}|):O(T∣V∣): T-hopsT\text{-hops}T-hops for GNN family

  • Time: the computational cost in each iteration is just proportional to the number of edges in each mini-batch.

    • "policy" improvement: O(M∣E∣∣V∣)O(M\frac{|\mathcal{E}|}{|\mathcal{V}|})O(M∣V∣∣E∣​)

    • "value" estimation: O(N∣E∣∣V∣)O(N\frac{|\mathcal{E}|}{|\mathcal{V}|})O(N∣V∣∣E∣​)

Reference

PreviousIncremental Few Shot Learning with Attention Attractor NetworkNextExperiments

Last updated 4 years ago

Was this helpful?

http://proceedings.mlr.press/v80/dai18a/dai18a.pdf
https://github.com/Hanjun-Dai/steady_state_embedding
https://docs.dgl.ai/en/latest/tutorials/models/1_gnn/8_sse_mx.html?highlight=sse
https://www.cnblogs.com/lart/p/10463706.html
https://jameszhan.github.io/2017/09/16/fixed-point-iteration.html
https://blog.csdn.net/StreamRock/article/details/88718051
https://zhuanlan.zhihu.com/p/58728914
https://blog.csdn.net/jbb0523/article/details/52459797
https://www.wikiwand.com/en/Fixed-point_iteration
Policy Iteration for comparison