The Shaky Grounds of Modularity Maximization.
Modularity maximization [1,2] is a widely studied method for community detection [3]. The idea is very simple. If I have a graph and I want to know if the communities I have defined are “well clustered” I can compare the number of connections within each community, to the number of connections I expect to be inside each community. If each cluster has more connections than I expect, it is a reasonable clustering. There are some well known issues with modularity, such as the well studied “resolution limit” [4] which states that, for certain edge densities, communities under a certain size cannot be discerned. However, there is a more fundamental issue with the theory behind modularity, and it has received little recognition historically. For many graphs, we don’t have a good estimate for the number of edges we should expect within each community.
“For many graphs, we don’t have a good estimate for the number of edges we should expect within each community.”
Another way of phrasing this issue is that the choice of null-model is a poor choice for many graphs. To understand why, we need to understand where this null model comes from, and why it lies at the basis of modularity maximization. We may ask ourselves, “What do we want from our null model?” and there are a couple properties that are ideal.
Easy to compute: This is fairly straight forward. If it is difficult to construct our null model, then we can use any number of more intricate and expensive community detection algorithms. Ultimately without an easy to compute model it kind of defeats the purpose of relying on a heuristic.
Calculate it from a single graph: We are trying to cluster a single graph, and it may be the only one that we have. Because of this, we cannot rely on having a large sample from which to infer our model.
Theoretically sound: Our method ideally will have some mathematical justification. This allows for us to prove things about our communities in relation to our null model. Without this we have to rely entirely on experimental data.
All three of these criteria are met by the Chung-Lu random graph model [5] . This model can be calculated with only a degree sequence, has strong theory backing it, and the probability of any given edge existing in a Chung-Lu graph is trivial to compute. The way it works is as follows. Consider a set of nodes with no connections, and a given degree sequence. Match each degree in this sequence with a node: we will call these the “weights” of our nodes. Then if I want to know the probability that two nodes are connected it can be computed by multiplying their respective weights and dividing by the sum of weights across all nodes.
Let’s check that this satisfies all of our criteria! It is easy to compute, all we need to know in order to compute the connection probability between two nodes is the sum of degrees in the graph and the degree of each node. We can compute it from a single graph, because again, we only need to know the degree sequence. And it can be proven that multi-graphs generated using this as a null model have the same degree distribution as desired, on average. So, why do we need to fix it? The problem really comes from this last statement.
“And it can be proven that multi-graphs generated using this as a null model have the same degree distribution as desired, on average.”
The theoretical guarantees of Chung-Lu generation are statistical in nature. They say that, the sum of connection probabilities for all possible edges connecting to an individual node will be equal to the degree we initially assigned it. This just means that we expect the degree sequence to be the same as the one we input. However, there is a sneaky term in there, “multi-graph”. When constructing a graph null model we have to decide where this null model is coming from. We do not want to cluster a protein interaction network by comparing it to a random social network for instance. In this case we are comparing every single graph we want to cluster with one that is allowed to have multiple edges connecting any two nodes: this is a multi-graph. This is in contrast to the fact that a large proportion of test instances, either real world, or generated, are simple graphs. That is, they only have a maximum of one connection per pair of nodes.
When generating simple graphs, Chung-Lu does not have the same nice guarantees as before. For starters, if any degree is larger than the square root of the sum of all degrees, there are over-unity probabilities which cannot be accounted for in a simple graph. Furthermore, graphs with many nodes of a single degree cannot be reliably generated. For details relating to this see some of my work [6,7] on the Chung-Lu graph model. Bringing this back to modularity, why would we use a null model that does not represent our data?
“…why would we use a null model that does not represent our data?”
The sensible answer is that we shouldn’t, but it’s convenient. Indeed, the ease of use brought to us by Chung-Lu is hard to debate. It is something, and it is a better null model than guessing. There are a small number of publications which study this issue. Some work in Nature Scientific Reports [8] has studied how modularity-like metrics can be applied to meshes, which break many of the Chung-Lu model’s assumptions. Other work has studied graph sampling, and how randomly swapping edges in a graph can be used to derive new models, and be used to compute modularity [9]. Despite this work being eight and five years old at the time of writing this, this is a problem still in its infancy, and not much has been written about it.
It does come with unique issues to be sure. For instance, in the realm of constructing null models from sampled edge swaps, the number of samples required for a “good” null model is not well understood, and the bounds we do have are unreasonable. For such a method to work it will require efficient parallel algorithms, or a lot of compute time, and the latter of these violates our modularity null model rule 1. Domain specific null models could be used which consider more topological features beyond the degree sequence. For instance, we may be interested in graphs with a given degree sequence and set diameter. Unfortunately this creates a new “node labeling” problem because there isn’t an obvious way to determine each edge probability. In the Chung-Lu model it was based on the degree we wanted each node to have. Now each connection depends on global information we don’t know at the outset, and finding the “most-likely” weights for each node is likely difficult to determine.
Few researchers have dove into the question, and fewer yet seem to see it as a problem. Despite this, the generation of domain-specific modularity maximization has the possibility to drastically alter the usefulness of this well established method.
[1] Newman, Mark EJ. "Modularity and community structure in networks." Proceedings of the national academy of sciences 103, no. 23 (2006): 8577-8582.
[2] Newman, Mark EJ. "Fast algorithm for detecting community structure in networks." Physical review E 69, no. 6 (2004): 066133.
[3] Fortunato, Santo. "Community detection in graphs." Physics reports 486, no. 3-5 (2010): 75-174.
[4] Fortunato, Santo, and Marc Barthelemy. "Resolution limit in community detection." Proceedings of the national academy of sciences 104, no. 1 (2007): 36-41.
[5] Winlaw, M., H. DeSterck, and G. Sanders. An in-depth analysis of the chung-lu model. No. LLNL-TR-678729. Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2015.
[6] Brissette, Christopher, David Liu, and George M. Slota. "Correcting Output Degree Sequences in Chung-Lu Random Graph Generation." In International Conference on Complex Networks and Their Applications, pp. 69-80. Cham: Springer International Publishing, 2022.
[7] Brissette, Christopher, and George Slota. "Limitations of Chung Lu random graph generation." In Complex Networks & Their Applications X: Volume 1, Proceedings of the Tenth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2021 10, pp. 451-462. Springer International Publishing, 2022.
[8] Kishore, Raj, Ajay K. Gogineni, Zohar Nussinov, and Kisor K. Sahu. "A nature inspired modularity function for unsupervised learning involving spatially embedded networks." Scientific reports 9, no. 1 (2019): 2631.
[9] Fosdick, Bailey K., Daniel B. Larremore, Joel Nishimura, and Johan Ugander. "Configuring random graph models with fixed degree sequences." Siam Review 60, no. 2 (2018): 315-355.