A graph is an abstraction representing some objects and connections between pairs of objects. In its simplest form, a graph $G = (V, E)$ has just two pieces: a set $V$ of vertices (the objects), and a set $E$ of edges, where each edge $\{u, v\}$ is a pair of vertices (the connections).
More generally, a graph contains information about what pairs of objects are related to each other in some way.
Let us first define some terminology that will make talking about graphs less cumbersome. If $G$ has an edge $\{u, v\}$, we say $u$ and $v$ are adjacent. We also say $u$ is a neighbor of $v$ and $v$ is a neighbor of $u$. The edge $\{u, v\}$ is also referred to as being incident on $u$ (and $v$, as well).
The degree of a node $v$ (denoted $\deg(v)$) is the number of other nodes adjacent to it, or in other words, the number of neighbors it has. Equivalently, $\deg(v)$ is the number of edges incident on $v$.
We define a path as a sequence of nodes $v_1, v_2, \ldots, v_k$ such that any two consecutive nodes in the sequence are adjacent in $G$. Two nodes $u$ and $v$ are then reachable from each other if there is a path that starts at $u$ and ends at $v$ (or vice versa).
As with most sufficiently complex mathematical notions, there is an infinite black hole of graph theory terminology, much of which we will see in the future. For now, however, these terms are sufficient for making basic claims about graphs. The following is one of the most important facts about graphs, especially for the purpose of runtime analysis.
The Handshaking Lemma. For any graph $G = (V, E)$, the sum of the degrees of every node is twice the number of edges. That is,
$$\sum_{v \in V} \deg(v) = 2|E|.$$
Intuition. It is more useful to consider the edge-incidence definition of degree here. First, observe that every edge is incident on exactly two vertices. Thus, in our sum of degrees, any given edge $\{u, v\}$ contributes 1 to each of exactly two terms: $\deg(u)$ and $\deg(v)$. Therefore, we would expect this sum to be equal to twice the number of edges.
Proof. We can formalize this intuitive idea. For a vertex $v$ and an edge $e$, let $x_{v,e} = 1$ if and only if $e$ is incident on vertex $v$; otherwise, it is defined as 0. So, we can now rewrite $\deg(v) = \sum_{e \in E} x_{v,e}$. So, the sum of all of these degrees is
$$\sum_{v \in V} \deg(v) = \sum_{v \in V} \sum_{e \in E} x_{v,e} = \sum_{e \in E} \sum_{v \in V} x_{v,e} = \sum_{e \in E} 2 = 2|E|.$$
Here we simply rearrange the sum and use the fact that there are exactly 2 vertices incident on any edge, so $\sum_{v \in V} x_{v,e} = 2$ for every $e \in E$.
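As a quick sanity check, take the triangle graph: three vertices, with all three possible edges present. Every vertex has degree 2, so the degree sum is $2 + 2 + 2 = 6 = 2 \cdot 3 = 2|E|$, just as the lemma predicts.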
We have discussed the mathematical idea of a graph, but it remains to define the abstract data type for use in our code:
Operation | Description |
---|---|
Vertex iteration | Iterate over every vertex $v \in V$. |
Edge iteration | Iterate over every edge $\{u, v\} \in E$. |
Edge detection | For two vertices $u$ and $v$, determine whether $\{u, v\} \in E$. |
Neighbor iteration | For some specified $v$, iterate over every $u$ where $\{u, v\} \in E$. |
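To make these operations concrete, here is a minimal sketch of the interface in Python. The class and method names (`Graph`, `has_edge`, and so on) are my own choices for illustration, not a standard library API:

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class Graph(ABC):
    """Abstract data type for an undirected graph on vertices 0..n-1."""

    @abstractmethod
    def vertices(self) -> Iterator[int]:
        """Vertex iteration: iterate over every vertex v in V."""

    @abstractmethod
    def edges(self) -> Iterator[Tuple[int, int]]:
        """Edge iteration: iterate over every edge {u, v} in E."""

    @abstractmethod
    def has_edge(self, u: int, v: int) -> bool:
        """Edge detection: determine whether {u, v} is in E."""

    @abstractmethod
    def neighbors(self, v: int) -> Iterator[int]:
        """Neighbor iteration: iterate over every u with {u, v} in E."""
```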
There are two common ways of implementing this data type.
One way to interpret a graph is that it is simply a binary relation on the vertices – that is, any two vertices are either adjacent or they aren't. This leads to a natural programmatic representation – for every pair of vertices, keep track of whether they are adjacent or not. We can use any data type that acts as a two-place boolean predicate, but the most obvious choice is a two-dimensional array. True to the relation interpretation, we could use a two-dimensional boolean array, but for reasons that will make sense later, it is more typical to use an integer array and store a 1 for adjacency and a 0 for non-adjacency. This array is known as an adjacency matrix.
Adjacency matrices perform one operation very quickly: to detect whether two vertices are connected by an edge, we can simply test whether the adjacency matrix has a 1 in the corresponding position. This can be done with (approximately) two pointer dereferences, so this operation takes $O(1)$ time.
Let us investigate the other operations. To iterate over every neighbor of $v$, we can iterate through the $v$-th row and report all the entries which are 1. Since a row has size $|V|$, this always takes $\Theta(|V|)$ time. Similarly, to iterate over all edges, we can go through the entire matrix and return all of the entries with a 1 in them, taking $\Theta(|V|^2)$ time. As we will see, neither of these is particularly good – this is because we waste a lot of time iterating over entries which don't actually represent any edges, but are just empty placeholders for where edges could be. The next representation, the adjacency list, addresses this issue.
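As a rough sketch of how this might look in Python (details like the constructor and `add_edge` are my own scaffolding, not prescribed above):

```python
class AdjacencyMatrixGraph:
    """Undirected graph stored as a |V| x |V| matrix of 0s and 1s."""

    def __init__(self, n: int):
        self.n = n
        # matrix[u][v] == 1 iff {u, v} is in E
        self.matrix = [[0] * n for _ in range(n)]

    def add_edge(self, u: int, v: int) -> None:
        # Undirected, so record the edge in both positions.
        self.matrix[u][v] = self.matrix[v][u] = 1

    def has_edge(self, u: int, v: int) -> bool:
        # Edge detection: one lookup, O(1) time.
        return self.matrix[u][v] == 1

    def neighbors(self, v: int):
        # Neighbor iteration: scan the entire v-th row, Theta(|V|) time.
        return (u for u in range(self.n) if self.matrix[v][u] == 1)

    def edges(self):
        # Edge iteration: scan the whole matrix, Theta(|V|^2) time.
        # Reporting only u <= v lists each undirected edge once.
        return ((u, v) for u in range(self.n)
                       for v in range(u, self.n) if self.matrix[u][v] == 1)
```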
One useful metric of graphs is their density, which is how many edges they have relative to the number of vertices they have. A graph is called dense if its number of edges is asymptotically close to the maximum possible, $\binom{|V|}{2} = \Theta(|V|^2)$. A graph is called sparse if it is not dense; that is, if its number of edges is far less than the maximum. Some formalizations of these terms exist (see tangent).
In practice, many graphs are sparse. For example, imagine a graph representing locations in the real world. You could imagine a complex enough graph of this type having, say, a million different locations. For any particular location, however, the maximum number of edges between that location and adjacent locations is probably no more than 4 – maybe up to 5 or 6 if you have some really crazy intersection. It is certainly nowhere close to a million. We would expect, then, that this graph probably has no more than 5 to 6 million edges, even though it could in principle have up to $\binom{10^6}{2} \approx 5 \times 10^{11}$ edges.
Sparse graphs benefit from a more efficient representation called an adjacency list. Here, for each vertex, we simply store a “list” of adjacent vertices. I put list in scare quotes since it really makes more sense to think of this as a set (there are no duplicates, nor is there any particular order), but conventionally it’s just called a list anyway.
Comparing the two figures, it should hopefully be clear where the speedup comes from: instead of checking every node to see if it is adjacent to some vertex $v$, we instantly have access to a list of $v$'s neighbors and can iterate over it. So, neighbor iteration for adjacency lists is in $\Theta(\deg(v))$. Since there are $\deg(v)$ neighbors anyway, this is literally the best we can do. We can also iterate over all edges quickly by iterating through every list; this takes $\Theta(|V| + |E|)$ time.
On the other hand, edge detection is not as nice. In order to determine if $\{u, v\} \in E$, we have to go through $u$'s list and check if $v$ is in it, or vice versa. As long as we know the length of each list, we can pick the shorter one, so edge detection takes $O(\min(\deg(u), \deg(v)))$ time. Note that for dense graphs, this can be quite bad… $\Theta(|V|)$ in the worst case! Much worse than the $O(1)$ we had for adjacency matrices. But in sparse graphs, where we expect degrees to be relatively small, this is still quite efficient.
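Here is a matching sketch in Python, again with illustrative names of my own choosing:

```python
class AdjacencyListGraph:
    """Undirected graph stored as one list of neighbors per vertex."""

    def __init__(self, n: int):
        self.n = n
        # adj[v] holds every u with {u, v} in E
        self.adj = [[] for _ in range(n)]

    def add_edge(self, u: int, v: int) -> None:
        # Undirected, so each endpoint records the other.
        self.adj[u].append(v)
        self.adj[v].append(u)

    def neighbors(self, v: int):
        # Neighbor iteration: walk v's list directly, Theta(deg(v)) time.
        return iter(self.adj[v])

    def has_edge(self, u: int, v: int) -> bool:
        # Edge detection: scan the shorter of the two lists,
        # O(min(deg(u), deg(v))) time.
        if len(self.adj[u]) <= len(self.adj[v]):
            return v in self.adj[u]
        return u in self.adj[v]

    def edges(self):
        # Edge iteration: walk every list, Theta(|V| + |E|) time.
        # Reporting only u <= v lists each undirected edge once.
        return ((u, v) for u in range(self.n)
                       for v in self.adj[u] if u <= v)
```

Note that Python's `in` operator performs exactly the linear scan described above, which is why `has_edge` inspects the shorter list.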
So, we can see that there is a tradeoff – by using lists, we get faster neighbor and edge iteration, but slower edge detection. By using matrices, we get faster edge detection, but slower neighbor and edge iteration. As it turns out, however, neighbor iteration is used wayyy more than any other operation. This operation is essentially the fundamental operation used when searching through graphs, which is kind of the thing you do with graphs. Hence, adjacency lists are generally used unless there is a good reason in particular to use adjacency matrices.
So far, we have only answered local questions about graphs: whether two particular vertices are adjacent, which vertices neighbor a given vertex, and so on. But we have not yet answered questions about higher-level, global properties of graphs. Here is an example:
The reachability problem: Given a graph $G = (V, E)$ and two vertices $s$ and $t$, determine whether there is a path between $s$ and $t$ using edges in $E$.
It is easy to see why this is a practically useful problem to have a solution for.
The most common way to solve the reachability problem is with a searching algorithm. There are many different ways to search a graph, but almost all of them are basic variations of the following high level algorithm.
As it turns out, it’s not much harder to solve a more general problem: which vertices are reachable from $s$? This is the problem searching algorithms actually solve. Here’s how: starting at $s$, we begin building a set $X$ of explored vertices, which we know for sure are reachable from $s$. We repeatedly expand $X$ by finding vertices which are connected to already explored vertices. Eventually, we run out of vertices to add, and we stop – at this point, $X$ contains all of the vertices reachable from $s$!
Let’s try to describe this in a way that is a little closer to an actual algorithm. Let’s say a vertex $v$ is a fringe vertex if $v \notin X$, but $\{u, v\} \in E$ for some $u \in X$. Basically, $v$ is a fringe vertex if it hasn’t been explored, but it is one edge away from a vertex that has been explored. Then our algorithm should repeatedly find a fringe vertex and add it to $X$. Algorithm 1 summarizes this approach, and Figure 4 visualizes the idea.
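Algorithm 1 itself is not reproduced here, but a minimal Python sketch of the same idea might look as follows. The deliberately naive inner loops (rescanning all of $X$ to find a fringe vertex) mirror the fact that the high-level algorithm doesn't yet specify how fringe vertices are found; `graph.neighbors` is assumed to behave as in the sketches above:

```python
def reachable(graph, s: int) -> set:
    """Return the set X of all vertices reachable from s."""
    explored = {s}  # X starts as just the source vertex
    while True:
        # Find any fringe vertex: not in X, but adjacent to some u in X.
        fringe = None
        for u in explored:
            for v in graph.neighbors(u):
                if v not in explored:
                    fringe = v
                    break
            if fringe is not None:
                break
        if fringe is None:
            # No fringe vertices remain, so X is complete.
            return explored
        explored.add(fringe)
```

Solving reachability is then just a membership test: $t$ is reachable from $s$ exactly when `t in reachable(graph, s)`.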
We have left out a lot of details here, but let’s first prove that this high-level algorithm accomplishes our goal before we try to flesh it out.
Theorem. After Algorithm 1 terminates, $X$ contains exactly the vertices in $G$ which are reachable from $s$.
Let’s first prove the following invariant: all vertices in $X$ are reachable from $s$. This certainly holds before the loop starts ($s$ is reachable from itself), so we just need to show that it holds after each iteration of the loop. We need to show that the new element we’ve added to $X$, the fringe vertex $v$, is reachable from $s$. By definition of fringe vertex, there is some $u \in X$ such that $\{u, v\} \in E$. By the invariant, $u$ is reachable from $s$, and so there is some path
$$s = w_1, w_2, \ldots, w_k = u.$$
If we construct a new path by simply appending the edge $\{u, v\}$, we get the path
$$s = w_1, w_2, \ldots, w_k = u, v,$$
and so $v$ is reachable from $s$. Thus, the invariant is preserved.
So, when the algorithm terminates, $X$ contains only vertices reachable from $s$. Aren’t we done, then? Not quite. We have shown that if a vertex is in $X$, it is reachable from $s$, but we still need to show that if a vertex is reachable from $s$, it is in $X$. Essentially, thus far we have only shown that $X$ is a subset of the vertices reachable from $s$, but we need to show it is equal.
This direction is slightly more involved. There are many ways to proceed, but here is one idea: for each vertex $t$ that is $s$-reachable, we can prove the following invariant: either $X$ contains $t$, or there is a fringe vertex $f$ such that there is a path from $s$ to $t$ that passes through $f$.
In other words, the invariant says that we have either found $t$, or we are on our way to finding it through some fringe node $f$. Let us now prove the invariant, again by induction. First, the base case: when $X = \{s\}$, does this invariant hold? If $t = s$, then the first part of the “or” is clearly true, since $s \in X$. Otherwise, consider any path from $s$ to $t$. Since $t \neq s$, the path consists of more than one node, and so there must be some node $f$ on the path such that $\{s, f\} \in E$. But since $s$ is the only explored node, $f$ must be a fringe node.
Now, assuming the invariant is true for some iteration, we need to show that it holds for the next as well. Every iteration we add some fringe node $v$ to $X$. If $v = t$, then clearly the invariant is still true, since now $t \in X$. If $v \neq f$, then $f$ is still a fringe node, and so the second part of the “or” is still true. The more complicated case is when $v = f$. We know there is some path $s = w_1, \ldots, w_j = f, w_{j+1}, \ldots, w_k = t$ that connects $s$ and $t$ and contains $f$. Notice the suspiciously placed $w_{j+1}$, which is the node after $f$ in the path. Since this is a path, $\{f, w_{j+1}\} \in E$. After exploring $f$, then, $w_{j+1}$ is adjacent to an explored node: if $w_{j+1} \notin X$, it is a fringe node on a path from $s$ to $t$, and the invariant holds. If instead $w_{j+1} \in X$, we can repeat the argument with $w_{j+2}$, and so on along the path: either we find some $w_i \notin X$ whose predecessor is explored, giving a fringe node the path passes through, or every node on the path, including $t$, is in $X$. Either way, the invariant is preserved.
So we’ve shown that the algorithm correctly finds exactly the vertices we need to find. However, we haven’t specified completely how we even find fringe vertices in the first place. Until we do that, we can’t even implement this algorithm, let alone analyze its runtime.