Graph Neural Networks.
Learning on graphs is notoriously hard for traditional neural networks. This comes from the fact that graphs, unlike images and text, have no canonical ordering. This raises many issues when we try to perform analysis tasks on graphs with labels and feature vectors. A number of methods have been suggested for dealing with this issue. (As a heads up, the first seven paragraphs are background. If you want to jump to what I have worked on, feel free to go to the end.)
Broadly speaking, this problem is referred to as graph embedding [1]. The idea of graph embedding is to transform an unordered graph into an ordered vector which can be analyzed using traditional analysis techniques. There are many parts of a graph one may wish to embed. If we are comparing graphs, we want to reduce each whole graph to a vector. Alternatively, if we are performing a clustering task, we may wish to come up with a node embedding for each node. This can be hard to imagine, so let’s do a simple example embedding an entire graph using the degrees of its nodes. The degree of a node in a graph is simply the number of connections that it has. Our embedding will work as follows:
Obtain the degree of every node in the graph up to degree 100
Form a 100-dimensional vector, where each entry corresponds to the number of nodes with that degree.
In this case, the first entry in the vector is the number of degree-1 nodes, the second is the number of degree-2 nodes, and so on. This is just a histogram of the graph’s degree sequence, truncated at degree 100. We now have an embedding for any arbitrary graph that can be analyzed. This is not a great embedding. For instance, say that we want to know if there is any node in the graph which would disconnect the graph if it were removed, i.e., if this node is removed, our original graph is split into multiple pieces which have no edges between them. Our embedding tells us nothing about this.
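For concreteness, here is a minimal sketch of this degree-histogram embedding (using networkx purely for illustration; the example graph and the function name are placeholders I chose for this post):

```python
import networkx as nx
import numpy as np

def degree_histogram_embedding(G, max_degree=100):
    """Embed a graph as a vector of degree counts: entry i counts the nodes of degree i+1."""
    vec = np.zeros(max_degree)
    for _, deg in G.degree():
        if 1 <= deg <= max_degree:
            vec[deg - 1] += 1  # entry 0 holds the number of degree-1 nodes, etc.
    return vec

# Two structurally very different graphs can land near each other in this embedding,
# which is exactly the weakness described above.
G = nx.erdos_renyi_graph(500, 0.02, seed=0)
print(degree_histogram_embedding(G)[:10])
```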
Obviously, for general graph analysis, our embedding should be more expressive. There are many ways people do this. Sometimes spectral information is used: the eigenvalues of the graph Laplacian can be computed and sorted to produce a vector. This is useful for determining which graphs have similar clustering properties, and for which graphs spreading processes will behave similarly. Many variations of spectral embeddings exist for graphs, each with their own advantages and disadvantages.
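A minimal sketch of one such spectral embedding, under the assumption that we simply truncate (or zero-pad) the sorted Laplacian spectrum to a fixed length:

```python
import networkx as nx
import numpy as np

def spectral_embedding(G, k=50):
    """Embed a graph as its k smallest Laplacian eigenvalues, sorted ascending."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    eigvals = np.sort(np.linalg.eigvalsh(L))  # eigvalsh: L is symmetric
    eigvals = eigvals[:k]                     # truncate to at most k values
    return np.pad(eigvals, (0, max(0, k - len(eigvals))))  # zero-pad small graphs

G = nx.karate_club_graph()
print(spectral_embedding(G, k=10))
```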
Some very popular methods for graph embeddings use random walks. DeepWalk [2] uses a very simple method for this that borrows the SkipGram model from natural language processing (Managed to get that buzzword in here. Nice.). The idea is that instead of viewing the entire graph, we can view the graph as a series of samples, obtained by performing random walks of length ‘t’. These random walks are stored as vectors, where each entry is the identity of the next node in the random walk, and SkipGram is applied to them. This has a very interesting parallel to language. We can view each node as a word and each sampled walk of length ‘t’ as a sentence. Then, we can view our sampled walks as a text corpus, and SkipGram learns a representation for each word (node) by predicting which other words are likely to appear near it within the sentences (walks) of that corpus.
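To make the ‘walks as sentences’ analogy concrete, here is a rough sketch of the sampling step (the walk length, walks per node, and helper name are placeholder choices of mine, not DeepWalk’s exact settings). The resulting ‘sentences’ would then be handed to any SkipGram implementation, e.g. gensim’s Word2Vec:

```python
import random
import networkx as nx

def sample_walks(G, walks_per_node=10, walk_length=40, seed=0):
    """Generate DeepWalk-style uniform random walks; each walk is a 'sentence' of node IDs."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(G.nodes())
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])  # stringify so SkipGram treats nodes as words
    return walks

walks = sample_walks(nx.karate_club_graph())
# e.g. gensim.models.Word2Vec(walks, sg=1, window=5) would train SkipGram on these "sentences"
print(walks[0])
```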
Again, random-walk-based methods fall short of what we want from general graph embeddings. They can be very powerful, but for many tasks the results they produce are lackluster. Enter the Kipf and Welling GCN architecture [3]. In 2017 Kipf and Welling published a very interesting paper in which they outlined a graph embedding technique based on the normalized graph Laplacian (which implicitly includes random walk information).
The Kipf and Welling GCN works as follows for node classification. We are given a graph where each node has a feature vector. To obtain node embeddings, each node’s feature vector is combined with its neighbors’ vectors at each step. This is repeated for some number of steps, until we obtain our embedding. In a basic implementation, we could take this combination to be the average and stop there; however, this lacks the expressiveness we may want. Instead, the task of a GNN/GCN is to learn how the combination at each step should be weighted. To obtain the embedding for a given node ‘v’, we average its neighbors’ embeddings, weight this average, and add it to the embedding of node ‘v’. The weighting is determined by a neural network which we train. A visualization of a simpler averaging operation is shown in Figure 2.
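To give a flavor of the ‘weighted neighborhood average’ a GCN layer performs, here is the Kipf and Welling propagation rule written out in plain NumPy (the random features, weights, and layer sizes are placeholders; a real implementation would use a framework such as PyTorch Geometric):

```python
import numpy as np
import networkx as nx

def gcn_layer(A, H, W):
    """One GCN layer: symmetrically normalized neighborhood average, then a learned linear map + ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops so a node keeps its own features
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # D^{-1/2} (A + I) D^{-1/2}
    return np.maximum(A_norm @ H @ W, 0)           # aggregate neighbors, transform, ReLU

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
rng = np.random.default_rng(0)
H = rng.normal(size=(A.shape[0], 8))               # placeholder node features
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))
embeddings = gcn_layer(A, gcn_layer(A, H, W1), W2) # two rounds of neighborhood mixing
print(embeddings.shape)                            # (34, 4): one 4-d embedding per node
```

In a real model the weight matrices W1 and W2 are the parameters learned by backpropagation; here they are random just to show the shape of the computation.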
GNNs still have their own problems. Graph data is often enormous. If we want to perform node classification on the Facebook graph, training becomes prohibitively expensive as the averaging operations grow larger and larger due to neighborhood blow-up. Perhaps the most popular way to deal with this in practice is graph pooling [4]. The idea of graph pooling is to analyze a graph that is simpler than the whole thing, and hope that any inaccuracies come out in the wash. This can be done in many ways. One is graph coarsening, where nodes are merged based on some similarity metric to form a simpler graph. Another is subgraph pooling, where sparse subgraphs are sampled for training to limit neighborhood blow-up.
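As a rough illustration of the subgraph-sampling idea (this is in the spirit of fanout-limited neighbor sampling, not any specific method surveyed in [4]; the fanout and hop counts are arbitrary choices of mine):

```python
import random
import networkx as nx

def sample_subgraph(G, node, fanout=5, hops=2, seed=0):
    """Keep at most `fanout` random neighbors per node per hop, so the neighborhood cannot blow up."""
    rng = random.Random(seed)
    frontier, kept = {node}, set()
    for _ in range(hops):
        nxt = set()
        for u in frontier:
            nbrs = list(G.neighbors(u))
            for v in rng.sample(nbrs, min(fanout, len(nbrs))):
                kept.add((u, v))
                nxt.add(v)
        frontier = nxt
    sub = nx.Graph()
    sub.add_edges_from(kept)
    return sub

G = nx.barabasi_albert_graph(10_000, 5, seed=0)
print(sample_subgraph(G, node=0).number_of_nodes())  # far fewer nodes than the full 2-hop neighborhood
```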
That is enough background, maybe too much for a short post! Let’s talk about what I have worked on regarding this topic. In 2020 two papers were published which use something called a ‘Koopman operator’ to accelerate the training of deep neural networks (DNNs). Koopman operators deserve their own post, but for the purpose of this write-up think of them as a way to approximate complicated dynamics with a simple linear system. That is, we can take a nonlinear dynamical system and turn it into a matrix. These two papers, written by Tano and Portwood [5] and Dogra and Redman [6] respectively, study how backpropagation can be thought of as a dynamical system, and how Koopman operators can be used to approximate the way a neural network’s weights evolve during training. This work, while novel, was very limited. Neither paper parallelized their results on GPU, which makes them of little use to practitioners. It should be noted that speed-ups for backpropagation have the potential to be applied to any neural network. This means that Koopman methods may potentially be applied to GNNs.
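To give a flavor of the underlying idea (a toy illustration of the general DMD/Koopman recipe in the spirit of [5, 6], not the GPU method from our paper): collect snapshots of the flattened network weights over a few optimizer steps, fit a best-fit linear operator mapping each snapshot to the next, and apply that operator repeatedly to ‘fast-forward’ training.

```python
import numpy as np

def fit_linear_operator(snapshots):
    """Least-squares fit of A such that w_{k+1} ≈ A w_k, from a (d, T) matrix of weight snapshots."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    return Y @ np.linalg.pinv(X)  # solve A X ≈ Y in the least-squares sense

def extrapolate(A, w, steps):
    """Apply the fitted operator repeatedly to predict future weights without backpropagation."""
    for _ in range(steps):
        w = A @ w
    return w

# Toy example: snapshots of a 3-parameter "network" decaying toward an optimum at zero.
d, T = 3, 20
true_A = np.diag([0.95, 0.9, 0.8])              # stand-in for the training dynamics
snapshots = np.zeros((d, T))
snapshots[:, 0] = np.array([1.0, -2.0, 0.5])
for k in range(1, T):
    snapshots[:, k] = true_A @ snapshots[:, k - 1]

A = fit_linear_operator(snapshots)
print(extrapolate(A, snapshots[:, -1], steps=50))  # predicted weights many "epochs" ahead
```

In practice the weight dimension is far larger than the number of snapshots, so DMD-style methods first project onto a low-rank subspace (via the SVD) before fitting the operator; that detail is omitted here.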
I conducted research as part of the High Performance Computing and Graph Analytics Laboratory (link) to create a GPU-parallel Koopman accelerator for backpropagation. This work is currently under review under the title "Acceleration of GNN Node Classification Using Koopman Operator Theory on GPU" [7]. In the paper we show that, through this method, practical multiple-fold speed-ups can be achieved for GNN node classification trained with the Adam optimizer. Details of this paper will be updated on this site as the paper goes through review. A figure from the paper showing the loss and accuracy curves for GPU-Koopman training versus standard Adam is shown in Figure 3.
Beyond this, I have worked with Sandia National Labs studying partitioning techniques for training physics-informed graph neural networks. This is discussed in a presentation given by my collaborator and PI, Andy Huang. The associated presentation slides can be found on my Google Scholar under the title "Machine learning of Physics Informed Graph Neural Networks from TCAD models". The general idea is that reduced-order models for dynamical systems may be learned by combining clever graph coarsening with graph attention networks, which are a whole other topic we won’t get into here.
[1] Cai, Hongyun, Vincent W. Zheng, and Kevin Chen-Chuan Chang. "A comprehensive survey of graph embedding: Problems, techniques, and applications." IEEE transactions on knowledge and data engineering 30.9 (2018): 1616-1637.
[2] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
[3] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
[4] Liu, Chuang, et al. "Graph pooling for graph neural networks: Progress, challenges, and opportunities." arXiv preprint arXiv:2204.07321 (2022).
[5] Tano, Mauricio E., Gavin D. Portwood, and Jean C. Ragusa. "Accelerating training in artificial neural networks with dynamic mode decomposition." arXiv preprint arXiv:2006.14371 (2020).
[6] Dogra, Akshunna S., and William Redman. "Optimizing neural networks via Koopman operator theory." Advances in Neural Information Processing Systems 33 (2020): 2087-2097.
[7] Brissette, Hawkins, and Slota. "Acceleration of GNN Node Classification Using Koopman Operator Theory on GPU." Under review (2023).