Multilevel Schemes Without Memory Movement.
Oftentimes a problem is too large to solve with the compute power at hand. For graph and matrix problems (which are more or less one and the same), this is often overcome with multilevel schemes. In a multilevel scheme, the practitioner considers the problem at multiple fidelities. The general idea is that the problem is fast to solve at a coarse fidelity, where it is small, and slow to solve at a fine-grained fidelity. A multilevel scheme scaffolds up from the coarse solution to obtain an approximate solution in the original problem domain. I have written a bit about this in the “Graph Coarsening” section of my research interests.
Multilevel schemes suffer from an issue common to algorithm design: the tradeoff between memory and speed. In general, you must pick one or the other when designing an efficient algorithm. Obviously, with bad algorithm design I can pick neither and be inefficient in all respects, but we will ignore that case. Consider dynamic programming, for instance. If you are unfamiliar with this paradigm, the idea is as follows. Say I have a problem where each optimal solution can be constructed from “smaller optimal solutions”. That is, if I am looking for a solution OPT(n), it can be constructed from some solution OPT(n-1)+A, where A requires relatively little computation once I know OPT(n-1). In problems of this sort, it is often incredibly fast to solve for OPT(n) by scaffolding small solutions, whereas a direct solution of OPT(n) may require a combinatorially infeasible search. This is the idea of dynamic programming. The issue is that we must store each intermediate optimal solution, which creates a memory overhead. In our example we have to store ‘n’ intermediate solutions, but in general we may have more than one parameter, and the memory requirements may grow polynomially, or pseudo-polynomially, in those parameters.
“…it is often incredibly fast to solve for OPT(n) by scaffolding small solutions, whereas a direct solution of OPT(n) may require a combinatorially infeasible search”
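To make the pattern concrete, here is a small, self-contained sketch of my own (the rod-cutting problem is just an illustrative choice, not something from the discussion above): each OPT(length) is assembled from smaller optimal solutions, and every intermediate value has to be kept around, which is exactly the memory overhead in question.

```python
# A minimal dynamic-programming sketch (rod cutting): OPT(length) is built
# from smaller optimal solutions OPT(length - i), so every intermediate
# solution is stored -- an O(n) memory overhead for this one-parameter problem.

def rod_cutting(prices, n):
    """prices[i] is the value of a piece of length i (prices[0] is unused)."""
    opt = [0] * (n + 1)  # one stored solution per subproblem
    for length in range(1, n + 1):
        opt[length] = max(prices[i] + opt[length - i] for i in range(1, length + 1))
    return opt[n]

print(rod_cutting([0, 1, 5, 8, 9], 4))  # -> 10 (two pieces of length 2)
```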
Multilevel schemes are very similar to dynamic programming, although they generally search for approximate solutions instead of optimal ones. This gives us a bit more room to wave our hands and ignore annoying details. A multilevel scheme for graph algorithms typically works by taking an initial graph and performing some coarsening, usually a maximal matching augmented with some clever heuristic. This is repeated on the ever-coarser representations, and each representation is stored in main memory (RAM) for later use. Then, whatever problem we are solving, say community detection, is run on the coarsest graph. This provides an initial guess at a solution. The guess is used at the second-coarsest graph, the third, the fourth, and so on until it is “lifted” to the original problem domain. Now, if we only lifted the solution, there would be no great reason to store the intermediate graphs; we could simply jump from the coarsest representation back to the initial one. The reason they are stored is that many multilevel algorithms refine the solution at each layer. In the case of partitioning, an initial solution on the coarsest graph is lifted to the second-coarsest representation, and then a refinement algorithm such as the Kernighan-Lin algorithm [2] may be run to improve the cluster quality. This is repeated up to the initial problem size. It should be noted that this is essentially how METIS [1], a popular graph partitioner, works.
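A rough skeleton of this V-cycle is sketched below. The helper callables are placeholders for whatever matching heuristic, base solver, lifting rule, and refinement (e.g. Kernighan-Lin) a real tool would supply; the point is where the intermediate graphs get stored and where refinement happens.

```python
# A hedged sketch of the multilevel pattern, not any particular tool's code.

def multilevel_solve(graph, coarsen, solve, lift, refine, levels):
    # Coarsening phase: every intermediate graph is kept in memory
    # so the refinement phase can revisit it later.
    hierarchy = [graph]
    for _ in range(levels):
        hierarchy.append(coarsen(hierarchy[-1]))

    # Solve on the coarsest (smallest, cheapest) representation.
    solution = solve(hierarchy[-1])

    # Uncoarsening phase: lift the solution one level at a time and
    # refine it against each finer graph on the way back up.
    for finer in reversed(hierarchy[:-1]):
        solution = lift(solution, finer)
        solution = refine(solution, finer)
    return solution
```

The list `hierarchy` is precisely the stack of stored representations whose memory cost the rest of this post is about.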
You may wonder how this is faster. After all, we still have to solve a problem on the original graph. The idea here is that each refinement is drastically faster than running the base algorithm directly at the original scale. Because of this, the sum of the run times is still less than that of a direct solution. While this is amazing, there is a glaring problem mentioned earlier. It is the same problem as with dynamic programming. What happens when our initial graph is huge? What happens if it is so big it takes up nearly all of our main memory?
“…there is a glaring problem mentioned earlier. It is the same problem as with dynamic programming. What happens when our initial graph is huge? What happens if it is so big it takes up nearly all of our main memory?”
In the case where almost all of our main memory is taken up by the initial graph representation, we cannot store the intermediate representations. There are a few workarounds. A big one that comes to mind is to distribute the graph over multiple computers and use message passing to communicate between subgraphs. This can be clunky, and not everyone has access to a compute center, which puts it technologically out of reach for some. Instead, what if we could store all representations in constant space, at the cost of some extra look-up time?
Consider a graph’s adjacency matrix. Each row of this matrix represents a node, and each element in that row gives the weight of the connection to the node associated with that column. For example, a one in the (3,5) position means there is a connection from node 3 to node 5 with weight one. For speed, adjacency representations tend to be more popular than edge lists because they require less “searching around”; however, as we have described it, the adjacency matrix is very memory inefficient. This is because we store a full matrix in which non-connections are represented by zeros. This is senseless memory usage, so we typically store the matrix in a sparse format like CSR. CSR is great because we only end up storing data for the edges that exist; however, sparse matrix formats other than COO are generally slow for insertion and deletion of elements. For this reason, if we want to store many graphs without memory movement, we need to do it without altering the original adjacency matrix.
“…if we want to store many graphs without memory movement, we need to do it without altering the original adjacency matrix.”
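As a quick SciPy illustration of the tradeoff: CSR keeps only the existing edges (a value array, a column-index array, and a row-pointer array), but changing the sparsity pattern means rebuilding those arrays.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense adjacency matrix: every absent edge still costs a stored zero.
dense = np.array([
    [0, 1, 0, 0],
    [1, 0, 2, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# CSR keeps only the existing edges.
sparse = csr_matrix(dense)
print(sparse.data)     # [1. 1. 2. 2. 1. 1.]  -> edge weights only
print(sparse.indices)  # column index of each stored weight
print(sparse.indptr)   # where each row's entries start in `data`

# Inserting or deleting an edge changes the sparsity pattern and forces the
# arrays to be rebuilt, which is why we want to avoid touching the matrix.
```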
In order to work with this ‘single adjacency matrix’ method we require some add-ons to our data structure. For this purpose we accept an additional O(n) memory overhead, where ‘n’ is the number of nodes in the graph. In general this is minimal compared with the edge set, so we are comfortable with it. The idea is as follows. Each node in the graph gets two values associated with it. One value tells you which other node has subsumed this one, and at what level of coarsening. The other value tells you which node this one has subsumed, and at what level of coarsening. This effectively adds a pair of linked lists to our adjacency matrix. If I want to determine the structure of the graph at the third level of coarsening, I know that the nodes in use at that level are those which are either never subsumed, or not subsumed until a level after three. The connections of a coarse node can then be obtained by following its linked list down to the unique member that has not coarsened anything else, and adding together the connections of the nodes visited along the way.
These two linked lists take the same form, although their entries differ. They are filled with numbers of the form ‘x = n*r + v’, where ‘n’ is the number of nodes in the graph, ‘r’ is the level of coarsening, and ‘v’ is the node that was either coarsened, or did the coarsening, at that step. Then, as a practitioner, I can recover the level of coarsening by computing the integer division of ‘x’ by ‘n’, and the node associated with that step by computing ‘x mod n’. How does this affect run times?
“How does this affect run times?”
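Before getting to run times, here is a minimal sketch of this bookkeeping. The array names, the -1 sentinel for “never subsumed”, and the helper functions are my own illustrative choices, not something prescribed by the scheme itself.

```python
n = 8                      # number of nodes in the original graph
NOT_SET = -1               # sentinel: this node never took part in a merge

subsumed_by = [NOT_SET] * n   # subsumed_by[v] = n*r + u: node u absorbed v at level r
subsumed    = [NOT_SET] * n   # subsumed[u]    = n*r + v: node u absorbed v at level r

def encode(level, node):
    return n * level + node

def decode(x):
    level, node = divmod(x, n)   # integer division gives the level, mod gives the node
    return level, node

# Example: at coarsening level 2, node 3 absorbs node 5.
subsumed_by[5] = encode(2, 3)
subsumed[3]    = encode(2, 5)

def active_at_level(v, k):
    """A node is present in the level-k graph if it was never subsumed,
    or was only subsumed at a level strictly greater than k."""
    if subsumed_by[v] == NOT_SET:
        return True
    level, _ = decode(subsumed_by[v])
    return level > k

print(decode(subsumed_by[5]))   # -> (2, 3)
print(active_at_level(5, 1))    # True: node 5 still exists before level 2
print(active_at_level(5, 3))    # False: it has been merged into node 3 by level 3
```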
Let’s loosely examine how difficult it is to get the neighbors of a node ‘u’ at a certain level of coarsening ‘k’. I start at the given node and have to search through each node it has absorbed. This number depends on whether we have performed a matching-based or an agglomerative coarsening. In agglomerative coarsening we may merge many nodes at each level, and unfortunately we don’t have a good estimate on that number. In the case of matching, however, we know this is bounded above by the number of coarsening levels ‘k’, which is bounded above by ‘log(n)’. Then, for each edge, I need to do the same check backwards, looking for its root node at the desired level of coarsening. Again, each of these is bounded above by ‘log(n)’ checks. In the worst case, every sub-node has the maximum degree ‘Dmax’ of the graph, meaning this requires O(Dmax log(n) log(n)) work to find the neighbors of a node at an arbitrary level of coarsening. On average we can wave our hands a little and say this is more realistically O(<D> log(n) log(n)), with ‘<D>’ the average degree of the graph. For a standard coarsened multigraph stored explicitly, we require nearly the same amount of work! In that scenario we need to check a worst case of O(Dmax log(n)) edges to obtain all of our neighbors.
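Below is a hedged sketch of such a neighbor query, under one possible reading of the linked-list traversal described above. It reuses the `decode`, `NOT_SET`, `subsumed`, and `subsumed_by` names from the earlier sketch, assumes the original graph sits in an unmodified CSR matrix `adj`, and assumes the queried node ‘u’ is itself still alive at level ‘k’.

```python
from collections import defaultdict

def members_at_level(u, k, subsumed):
    """Walk the 'I absorbed this node' chain, keeping merges made at level <= k."""
    members, x = [u], subsumed[u]
    while x != NOT_SET:
        level, v = decode(x)
        if level > k:
            break
        members.append(v)
        x = subsumed[v]
    return members

def representative_at_level(v, k, subsumed_by):
    """Follow 'who absorbed me' links until reaching a node still present at level k."""
    while subsumed_by[v] != NOT_SET:
        level, parent = decode(subsumed_by[v])
        if level > k:
            break
        v = parent
    return v

def neighbors_at_level(u, k, adj, subsumed, subsumed_by):
    """Aggregate the edge weights of super-node u in the level-k graph.
    `adj` is the (never modified) CSR adjacency matrix of the original graph."""
    weights = defaultdict(float)
    for member in members_at_level(u, k, subsumed):
        start, end = adj.indptr[member], adj.indptr[member + 1]
        for col, w in zip(adj.indices[start:end], adj.data[start:end]):
            rep = representative_at_level(col, k, subsumed_by)
            if rep != u:                 # skip edges internal to the super-node
                weights[rep] += w
    return dict(weights)
```

The outer loop is the O(log(n)) chain of absorbed nodes, and each edge scan pays another O(log(n)) to find its representative, matching the bounds above.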
How does this affect a breadth-first search (BFS), for instance? In a BFS we essentially go node by node, check which neighbors we have already visited, and visit all new neighbors at each stage. This requires knowing the neighbors at our level of coarsening, which means we pick up an extra factor of ‘log(n)’ work (a sketch of such a BFS appears below). What if, instead, we wanted to do spectral clustering on our coarsest graph? This gets significantly more difficult, because it requires global information that we cannot compute directly without some extra memory overhead. Given our coarsening linked lists we CAN construct a coarsened matrix, but requiring additional memory is against the spirit of the method. This suggests that constant-memory coarsening may work well for local analysis, but requires significant additional overhead for more global problems. This is definitely worth further study; perhaps there are faster ways to infer global information than what is suggested here.
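To close out the local-analysis case, here is what that level-k BFS would look like, assuming the hypothetical `neighbors_at_level` and `representative_at_level` helpers from the previous sketch: an ordinary BFS in which every neighbor query simply pays the extra logarithmic factor.

```python
from collections import deque

def bfs_at_level(source, k, adj, subsumed, subsumed_by):
    """Breadth-first search over the level-k coarse graph, never materializing it."""
    source = representative_at_level(source, k, subsumed_by)
    visited, queue, order = {source}, deque([source]), []
    while queue:
        u = queue.popleft()
        order.append(u)
        # Each neighbor query costs the extra log factors discussed above.
        for v in neighbors_at_level(u, k, adj, subsumed, subsumed_by):
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order
```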
[1] Karypis, George, and Vipin Kumar. "METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices." (1997).
[2] Kernighan, Brian W., and Shen Lin. "An efficient heuristic procedure for partitioning graphs." The Bell System Technical Journal 49, no. 2 (1970): 291-307.