
NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention

Overview

This repo contains an example implementation of Non-Parametric Attention (NPA) with SGC as the baseline. Equipped with NPA, SGC achieves significant performance gains without losing scalability.

Requirements

Environments: Win-64 or Linux-64; GPU: TITAN RTX. The remaining requirements are listed in requirements.txt.

Note that we offer both a pure Python version and a Python+C++ version that accelerates the Python loop. See run.sh for details.

Run the Codes

To evaluate the performance of SGC+NPA, run the commands in run.sh.

Test Results

| Method  | Cora | Citeseer | Pubmed |
|---------|------|----------|--------|
| SGC     | 81.0 | 71.9     | 78.9   |
| SGC+NPA | 83.0 | 73.6     | 80.1   |

Theoretical Analysis

Theorem 1: In the discussed non-parametric GNNs, the classifier $\Phi(\mathbf{x})$ has two properties:

  • Property 1: Given a feature $\mathbf{x_a} \in R^f$ ($f$ is the feature dimension) and a positive scalar $m > 0$, $\mathop{\arg\max} \Phi(\mathbf{x_a}) = \mathop{\arg\max}\Phi(m\mathbf{x_a})$.

  • Property 2: Given two features $\mathbf{x_a}, \mathbf{x_b} \in R^f$, if $\mathop{\arg\max} \Phi(\mathbf{x_a}) = \mathop{\arg\max}\Phi(\mathbf{x_b})$, then $\mathop{\arg\max}\Phi(\mathbf{x_a} + \mathbf{x_b}) = \mathop{\arg\max}\Phi(\mathbf{x_a}) = \mathop{\arg\max}\Phi(\mathbf{x_b})$.

Proof. When $\Phi$ is a linear classifier, i.e., $\Phi(\mathbf{x}) = softmax(\mathbf{Wx})$, Property 1 and Property 2 hold due to the properties of the linear transformation and softmax. When $\Phi$ is a Multi-Layer Perceptron, i.e., $\Phi(\mathbf{x}) = softmax(MLP(\mathbf{x}))$, assume that the activation in the MLP is ReLU. We decompose the proof over the layers of the MLP, each of which consists of a linear transformation followed by ReLU. Since ReLU is monotonically increasing for $x \geq 0$, it does not distort the relative order of two scalars. Due to the properties of the linear transformation and ReLU, Property 1 and Property 2 hold in each hidden layer. In the output layer, softmax does not distort the relative order of two scalars either, so Property 1 and Property 2 also hold for the final output.
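
The two properties can also be checked numerically. The following NumPy sketch (illustrative only; it is not part of this repo) verifies Property 1 and Property 2 for a randomly initialized linear classifier:

```python
# Minimal numerical check of Theorem 1 for a linear classifier Phi(x) = softmax(W x).
# Illustrative sketch only; the variables below are not from the repo.
import numpy as np

rng = np.random.default_rng(0)
f, c = 16, 4                      # feature dimension, number of classes
W = rng.normal(size=(c, f))       # random linear classifier

def predict(x):
    logits = W @ x
    return np.argmax(logits)      # softmax is monotone, so argmax of the logits suffices

x_a = rng.normal(size=f)
m = 2.5                           # positive scalar
assert predict(x_a) == predict(m * x_a)          # Property 1

# Draw x_b until it falls in the same predicted class as x_a, then test Property 2.
while True:
    x_b = rng.normal(size=f)
    if predict(x_b) == predict(x_a):
        break
assert predict(x_a + x_b) == predict(x_a)        # Property 2
print("Properties 1 and 2 hold for this random linear classifier.")
```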

We use $\mathbf{x_i}$ to denote the feature of node $v_i$ and $y_i$ to denote its label. Assume that the graph is ideally homogeneous, i.e., $\forall v_j \in N(v_i), y_j = y_i$. In other words, for a well-trained classifier $\Phi(\mathbf{x})$ we have:

$$ \forall v_j \in N(v_i), \mathop{\arg\max} \Phi(\mathbf{x_i}) = \mathop{\arg\max}\Phi(\mathbf{x_j}). \tag{1} \label{eq: ideal homogeneous} $$

Considering a node $v_i$, its immediate neighbors $v_j \in N(v_i)$, and the corresponding weights $w_{ij}$, we assume that each neighbor's feature is close to $\mathbf{x_i}$ up to an error $\mathbf{\epsilon_{ij}}$:

$$ \mathbf{x_j} = \mathbf{x_i} + \mathbf{\epsilon_{ij}}, $$

thus, with weights $w_{ij}$ satisfying $\sum_{v_j \in N(v_i)}w_{ij} = 1$, the node feature $\mathbf{x_i}$ can be reconstructed from its immediate neighbors with errors:

$$ \mathbf{x_i} = \sum_{v_j \in N(v_i)} w_{ij}\mathbf{x_j} - \sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}}. \tag{2} \label{eq: reconstruction} $$
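
Eq. $\eqref{eq: reconstruction}$ follows directly from substituting $\mathbf{x_j} = \mathbf{x_i} + \mathbf{\epsilon_{ij}}$ and using $\sum_{v_j \in N(v_i)} w_{ij} = 1$. As a quick sanity check, the following sketch (illustrative only; the names are not from the repo) confirms the identity on random data:

```python
# Illustrative check of Eq. (2): x_i == sum_j w_ij * x_j - sum_j w_ij * eps_ij,
# given x_j = x_i + eps_ij and weights w_ij that sum to 1.
import numpy as np

rng = np.random.default_rng(1)
f, n_neighbors = 8, 5

x_i = rng.normal(size=f)
eps = rng.normal(scale=0.1, size=(n_neighbors, f))   # per-neighbor errors eps_ij
x_j = x_i + eps                                      # neighbor features

w = rng.random(n_neighbors)
w /= w.sum()                                         # normalize so the weights sum to 1

reconstruction = w @ x_j - w @ eps                   # right-hand side of Eq. (2)
assert np.allclose(reconstruction, x_i)
```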

Here we sort the edge weights $w_{ij}$ in descending order, i.e., $w_{ij_1} > w_{ij_2} > \dots > w_{ij_{|N(v_i)|}}$, and rearrange the immediate neighbors' features $\{\mathbf{x_j}\} = \{\mathbf{x_{j_1}}, ..., \mathbf{x_{j_k}}, \mathbf{x_{j_{k+1}}}, ..., \mathbf{x_{j_{|N(v_i)|}}}\}$ so that the $L_2$ norms of the errors are in ascending order, i.e., $\forall k \in [1, |N(v_i)|-1], \|\mathbf{\epsilon_{ij_k}}\|_2 < \|\mathbf{\epsilon_{ij_{k+1}}}\|_2$. We then have:

Theorem 2: Assigning weight $w_{ij_k}$ to neighbor feature $\mathbf{x_{j_k}}$ during propagation makes it easier to learn a well-trained classifier.

Proof. Since we assume an ideally homogeneous graph, by Eq. $\eqref{eq: ideal homogeneous}$ and Properties 1 and 2 of the well-trained classifier $\Phi$ in Theorem 1, we have:

$$ \begin{equation} \begin{split} \mathop{\arg\max} \Phi(\mathbf{x_i}) =& \mathop{\arg\max} \Phi(\mathbf{x_j}; v_j \in N(v_i)) \\ =& \mathop{\arg\max} \Phi(w_{ij}\mathbf{x_j}; v_j \in N(v_i)) \\ =& \mathop{\arg\max} \Phi(w_{ij}\mathbf{x_j}+w_{ik}\mathbf{x_k}; v_j, v_k\in N(v_i))\\ =& \mathop{\arg\max} \Phi(\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j}). \end{split} \end{equation} $$

From another perspective, with Eq. $\eqref{eq: reconstruction}$, we have:

$$ \begin{equation} \small \mathop{\arg\max} \Phi(\mathbf{x_i}) = \mathop{\arg\max} \Phi(\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j} - \sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}}). \end{equation} $$

Since both $\mathop{\arg\max} \Phi(\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j})$ and $\mathop{\arg\max} \Phi(\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j} - \sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}})$ are equal to $\mathop{\arg\max} \Phi(\mathbf{x_i})$, a well-trained classifier should make the same prediction for the two inputs. The closer the two inputs are, the easier it is to train the classifier to predict the same class for both. Conversely, if the two inputs are far apart, the classifier must approximate a complex decision manifold in the feature space, which can be intractable and lead to sub-optimal results. Thus, we measure the difference between $\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j}$ and $\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j} - \sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}}$ as:

$$\begin{equation}\begin{split} \mathrm{diff} =& \|(\sum_{v_j \in N(v_i)}w_{ij}\mathbf{x_j}) - (\sum_{v_j \in N(v_i)} w_{ij}\mathbf{x_j} - \sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}})\|_2 \\ =& \|\sum_{v_j \in N(v_i)} w_{ij}\mathbf{\epsilon_{ij}}\|_2 \\ \leq& \sum_{v_j\in N(v_i)}w_{ij}\|\mathbf{\epsilon_{ij}}\|_2, \end{split}\end{equation}$$

and according to the rearrangement inequality, pairing $\{w_{ij}\}$ with $\{\|\mathbf{\epsilon_{ij}}\|_2\}$ in reverse order minimizes this upper bound on the difference, which in turn may reduce the actual difference between the two inputs. To conclude, we have shown that re-assigning propagation weights according to how close each neighbor's feature is to its center node's feature is beneficial for achieving optimal prediction performance. From this perspective, our local attention is designed to assign additional weight to neighbors whose features are more similar to those of their center nodes.
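
The pairing argument can be illustrated with a small sketch. The code below (illustrative only; it is not the repo's NPA implementation, and all names are hypothetical) checks that matching the largest weight with the smallest error norm minimizes the bound, and shows one simple way such weights could be derived from feature similarity:

```python
# Illustrative sketch of the pairing argument (not the repo's NPA implementation):
# given fixed weights and per-neighbor error norms, pairing the largest weight
# with the smallest error norm minimizes sum_j w_ij * ||eps_ij||_2
# (rearrangement inequality).
import numpy as np

rng = np.random.default_rng(2)
n_neighbors = 6

w = np.sort(rng.random(n_neighbors))[::-1]           # weights, descending
w /= w.sum()
eps_norm = np.sort(rng.random(n_neighbors))          # ||eps_ij||_2, ascending

reverse_order_bound = np.dot(w, eps_norm)            # largest weight <-> smallest error
random_bound = np.dot(w, rng.permutation(eps_norm))  # an arbitrary pairing
assert reverse_order_bound <= random_bound + 1e-12

# In the same spirit, a simple (hypothetical) local attention could derive weights
# from the distance between each neighbor's feature and its center node's feature:
x_i = rng.normal(size=8)
x_j = x_i + rng.normal(scale=0.3, size=(n_neighbors, 8))
dist = np.linalg.norm(x_j - x_i, axis=1)             # proxy for ||eps_ij||_2
att = np.exp(-dist)
att /= att.sum()                                     # closer neighbors get larger weights
```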

