In one of my first network projects, as a student, I studied how networks break down when you remove edges in order of their betweenness. Simultaneously, Girvan and Newman used exactly the same approach to make the first modern community detection algorithm. This was before authors used to make their code publicly available, so when I realized I was one of few people with a community detection algorithm up and running, I rushed to detect some communities. I might be one of the first netsci’ers to have applied community detection methods, and I still do. On the other hand, I have never proposed an algorithm myself. I am a consumer of community detection, not a producer. With Sweden’s great tradition in defending the consumers’ rights—I could spell “konsumentombudsmannen” by the age of six and idolized our national consumer-rights hero Sverker Olofsson in my teens—you might have seen this blog post coming.
These days, I find myself trying out different community-detection algorithms once again, Infomap, Louvain, SBMs…. I have a really cool project (that you will hear more about later), where there are obvious communities. Yet none of the methods seem to agree with my intuition. In my experience, this is always the case. Sometimes, the algorithms don’t split the graphs where I expect it from the graph layout. Sometimes, I have metadata (“ground truth”) that doesn’t seem to match the communities that comes out.
“Sure,” you might say, “that is exactly why you should use community detection algorithms—to see how the network clusters the nodes, not your intuition.” . . but then how should I interpret the discrepancy between the detected clusters and my intuition? Certainly, there could be many factors behind the difference: 1. I could have misunderstood the reason why the nodes form communities. 2. There could be errors in the network data. 3. There could be errors in the metadata. 4. The graph layout program could fail to give the right impression. It is rarely (maybe never?) possible to weight these factors against each other (which is why so-called ground truth approaches often fail).
Assume I can ignore all the above caveats (which I probably will do anyway). Now, what do the detected communities mean? If I expect A and B to be grouped together, but they are not, then what can I conclude? Even if I understand the mechanisms of community formation, there might be other forces at work, blurring the communities. I.e. I wouldn’t have a full model of the network formation. A community detection algorithm could—making all the strong assumptions of this paragraph—tell me that my model is wrong, but it wouldn’t help me to fix it. On the other hand, if I had a very accurate model of the network formation, I wouldn’t need community detection, I could fit the model to the data in one way or another.
In addition to the above, there are oh so many network clustering algorithms. They all have their rationales and features. Even though the detected communities follow the general idea of being densely connected within and sparsely connected between each other, they do split the network in different ways (usually in slightly differently, but sometimes in surprisingly different ways). If you read the fine print, the assumptions of the algorithms are always very restrictive. You should be able to model a community as a random graph; you should be able to map your problem to an information-theoretic compression problem; etc. No matter what your purposes are, it is hard to meet these preconditions. This level of detail is just a consequence of that one has to code up community detection as a program—training a deep-learning network by classifying partitions by our intuition would take too many life times.
Of course, we need community detection! We all need to describe networks with clusters as systematically as possible. But we have to be aware that, if one starts out with an arbitrary empirical network, the clusters that come out will not tell your story for you. The most important aspect might actually be that the tinkering with different clustering algorithms itself will give us many insights about our data. In other words, experimenting with community detection algorithms helps us to build an intuition about the data . . . and we made a full circle. I mentioned training neural networks, but really it is our brains that are trained.
Many smart people have thought much deeper than me about these issues. Although they are producers rather than consumers, I’ll list ten personal favorites (probably much because of the order which I read the literature, and the many papers I missed . . it is not a stab at a real top list):
- Fortunato, Hric, 2016. Community detection in networks: A user guide — a clear and rather complete introduction to community detection; my go-to reference.
- Peel, Larremore, Clauset, 2017. The ground truth about metadata and community detection in networks — I agree with computer scientists: It is always a good idea to have an impossibility theorem (you can get two things you wish for, but not three) in your paper!
- Kernighan, Lin, 1970. An efficient heuristic procedure for partitioning graphs — A true classic. And even though it doesn’t exactly do community detection, it could be recast as such an algorithm. CS papers from the 1970’s are always nice to read, always evoking images of buzzing, fridge-sized IBM machines, tweed suits and the cold war.
- Schaub, Delvenne, Rosvall, Lambiotte, 2017. The many facets of community detection in complex networks. — This I also helped me to see the field from different perspectives, and it has a nice, unifying, let’s-be-friends-again undertone.
- Zachary, 1977. An information flow model for conflict and fission in small groups. — We all know the network itself, but this paper also contains a surprisingly modern community detection algorithm.
- Schuster, Pfeiffer, Moldenhauer, Koch, Dandekar, 2002. Exploring the pathway structure of metabolism: Decomposition into subnetworks and application to Mycoplasma pneumoniae. — Community detection is a field mostly driven by social data, but there was a time biological applications was also a source of inspiration. This paper develops a method similar to the Girvan-Newman algorithm, for situations (like metabolic networks) where the highest-degree nodes are also the ones least important to the functionality of the detected communities.
- Girvan, Newman, 2002. Community structure in social and biological networks — The paper that introduced the problem to me. It was slightly bizarre feeling that I coded it up before I knew what it could be used for (see above). This taught me that finding a good problem is sometimes more important than finding a solution.
- Peixoto, 2018. Bayesian stochastic blockmodeling — This is maybe the most readable intro to SBMs, by the high priest himself. It’s also the lead I’m currently following for the project mentioned above.
- Palla, Derenyi, Farkas, Viscek, 2005. Uncovering the overlapping community structure of complex networks in nature and society — This is interesting because the rationale is so different. It feels like I could have come up with all of the papers above, but this one . . never.
- White, Boorman, Breiger, 1976. Social structure from multiple networks. I. Blockmodels of roles and positions — Identity and Control has a certain charm, but I prefer the early White.