This is not really a blog post, but rather a checklist of more or less painful mistakes in networks-science programs by me, or my students or teachers. (As a side note, computer scientists tend to be better a this, but are by no means immune to it.)
This list is incomplete. I plan to fill it up as I go along (suggestions are welcome). I let n denote the number of nodes and m the number of edges, and assume we are dealing with simple graphs.
Isolated nodes are also nodes
The most common format for network data (and the one I always use) is the edge list—an m×2 matrix where every row represents an edge. Isolated nodes will never appear in an edge list. So a common mistake is, of course, to assume that they do. It typically happened to me when I compared an empirical network without isolated nodes with a model network, or a randomized version of the original, that happens to have them.
We can get around this problem by setting the largest node index to n – 1 (or n) and assuming nodes between 0 (1) and n – 1 (n) that do not appear in the file are isolates. However, I many times forgot to check that the largest index node is not an isolate.
Swap it right (and left)
To update networks while keeping the degree sequence fixed, we often rewire edges. The way not to do it is:
- Pick two random edges (i,j) and (i’,j’).
- Check that i and j’ are different, that i’ and j are different and that (i,j’) and (i’,j) are not edges. If so, then:
- Replace (i,j) by (i,j’) and (i’,j’) by (i’,j).
The problem with this is that a node in the first argument of an edge will never leave the first argument. So if in the input data, i and j both only occur in the first argument, there will never be an edge between them. Many times the edges are sorted, so this could cause very biased results. There are many ways of fixing this. One can occasionally flip the edges (i,j) → (j,i). One can occasionally (or maybe always?) rewire to (i,i’) and (j,j’) instead, or to (i,j’) and (j,i’).
Getting a random node (or edge)
This should be just routine for us all, and I assume if you use some network library in Python or R there is nothing to worry about. (Is there?) If you use C++ or such, however, don’t just take a random number modulo n. I think the fastest and most elegant way of getting it correct is as in this routine (slightly rewritten from the PCG random number generator)
unsigned int pcg_32_bounded (unsigned int bound) {
unsigned int threshold = -bound % bound, r;do r = pcg_32(); // getting a 32-bit random number
while (r < threshold);return r % bound;
}
Here threshold is 232 minus the largest multiple of bound that fits into 32 bits (note that nifty trick with the two’s complement of bound that wouldn’t work for a signed integer). In practice, I don’t think I ever dealt with networks where just pcg_32() % bound would give probabilities that are more than 1 percent off, but it is wrong not to do things right. And if one would get the chance to play with an n = 3,221,225,472 network then the naïve approach would sample half of the nodes twice as frequently as the other half.
List of nodes in random order
(Of course, in Python you would use random.shuffle or in R sample, but if you have to do it from scratch.) The common mistake is to go through all nodes and swap each node with a random node. This is el classico of simulation-code errors and I’ve seen it committed by prominent members of my field. What you should really use is the Fisher-Yates shuffle. Once again, the naïve algorithm does not get it that wrong, and it is hard to believe any textbooks need to be rewritten if it was weeded out, but still.