Control Flow Analysis
Consecutive branch-free instructions make up segments of linear code. The first instruction of the program is a leader, as is every target of a branch and every instruction immediately following a branch (including conditional branches and procedure returns). A basic block is the maximal segment of linear code beginning at a leader -- it ends with the program's last instruction, the first branch following the leader, or the instruction immediately preceding the next leader, whichever comes first. A basic block a flows to b if and only if:
- either b immediately follows a, and a does not end in an unconditional branch,
- or a ends in a branch, of which b is a potential target.
Note that an indirect branch, absent context information about its possible targets, trivializes all blocks (every instruction becomes a leader, since any instruction might be a target), and its block flows to every one of them (an arborescence rooted at that block is formed)!
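The leader and flow rules above can be sketched as follows. This is a minimal illustration only: the four-opcode toy instruction set, the tuple encoding, and the function names are assumptions made for the example, and indirect branches are deliberately left out.

```python
# Toy instruction encoding (hypothetical, for illustration only):
#   ("op",  None)  ordinary instruction
#   ("jmp", t)     unconditional branch to instruction index t
#   ("br",  t)     conditional branch to t, otherwise fall through
#   ("ret", None)  procedure return (no successor within the procedure)

def find_leaders(prog):
    """Indices of all leaders, in program order."""
    leaders = {0} if prog else set()
    for i, (op, target) in enumerate(prog):
        if op in ("jmp", "br", "ret"):
            if target is not None:
                leaders.add(target)      # target of a branch
            if i + 1 < len(prog):
                leaders.add(i + 1)       # instruction following a branch
    return sorted(leaders)

def basic_blocks(prog):
    """List of (first, last) instruction indices, last inclusive."""
    leaders = find_leaders(prog)
    lasts = [nxt - 1 for nxt in leaders[1:]] + [len(prog) - 1]
    return list(zip(leaders, lasts))

def flow_edges(prog):
    """Edges (a, b) between block numbers, following the two flow rules."""
    blocks = basic_blocks(prog)
    block_of = {first: n for n, (first, _) in enumerate(blocks)}
    edges = []
    for n, (_, last) in enumerate(blocks):
        op, target = prog[last]
        if op in ("jmp", "br"):
            edges.append((n, block_of[target]))   # a ends in a branch to b
        if op in ("op", "br") and n + 1 < len(blocks):
            edges.append((n, n + 1))              # b immediately follows a
    return edges

prog = [
    ("op", None),    # 0
    ("br", 4),       # 1: loop test, exits to 4
    ("op", None),    # 2: loop body
    ("jmp", 1),      # 3: back to the test
    ("ret", None),   # 4
]
blocks = basic_blocks(prog)   # [(0, 0), (1, 1), (2, 3), (4, 4)]
edges = flow_edges(prog)      # [(0, 1), (1, 3), (1, 2), (2, 1)]
```

Edges are kept in a list rather than a set, so a conditional branch whose target is also its fall-through successor contributes two parallel edges, in keeping with the multigraph view.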
Interpreting basic blocks as vertices and flow relationships as edges yields a directed multigraph, the program's control flow graph (CFG). Every CFG has a start node, corresponding to the basic block whose leader is the first instruction of the program.
The antisymmetric, transitive, reflexive domination relation is defined on vertices of a CFG (and thus basic blocks of the underlying program). A vertex a dominates b (a <= b) if every path from the start node s to b passes through a. A vertex a properly dominates b (a < b) if a dominates b and is not equal to b. A vertex a directly/immediately dominates b (a <d b) if a properly dominates b, and there is no vertex c such that a properly dominates c and c properly dominates b. This relation induces the dominator tree, where nodes dominate all their descendants in the tree. The start node s dominates all nodes, properly dominates all nodes but itself, and roots the dominator tree.
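- Note that a need not be an immediate predecessor of b in the CFG (no edge from a to b is required) even for a to immediately dominate b. For example, if a branches to x and y, each of which flows only to b, then a immediately dominates b although no edge joins them.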
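A minimal sketch of computing the domination relation, using the simple iterative set-intersection formulation rather than any faster algorithm. The successor-map encoding and the function names are assumptions made for the example; the diamond-shaped CFG shows immediate dominance without a connecting edge.

```python
def dominators(succ, start):
    """Map each node reachable from start to the set of nodes dominating it."""
    # Collect the nodes reachable from start.
    nodes, seen, stack = [], {start}, [start]
    while stack:
        n = stack.pop()
        nodes.append(n)
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    # Predecessor lists, restricted to reachable nodes.
    pred = {n: [] for n in nodes}
    for n in nodes:
        for m in succ.get(n, ()):
            pred[m].append(n)
    # Start from "everything dominates everything" and shrink to a fixed point:
    # dom(n) = {n} | intersection of dom(p) over all predecessors p of n.
    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == start:
                continue
            new = set.intersection(*(dom[p] for p in pred[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def immediate_dominators(dom, start):
    """Map each node other than start to its immediate dominator."""
    idom = {}
    for n, ds in dom.items():
        if n == start:
            continue
        # The proper dominators of n form a chain; the closest one is
        # dominated by all the others, so its dominator set is the largest.
        idom[n] = max(ds - {n}, key=lambda d: len(dom[d]))
    return idom

# Diamond: s branches to x and y, both of which flow to b.
succ = {"s": ["x", "y"], "x": ["b"], "y": ["b"], "b": []}
dom = dominators(succ, "s")
idom = immediate_dominators(dom, "s")
```

Here idom maps b to s even though s has no edge to b, and the idom map as a whole is the dominator tree expressed as parent links.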
Dataflow analysis is most usefully performed into and out of regions: subsets of the nodes such that a header exists which dominates all nodes in the region, and all edges between nodes in the region are themselves in the region. A loop is a strongly connected region in which every back edge to the header originates within the region.
Loops can be discovered via domination analysis. It is important to note that this refers to loops in the generated code, not loop constructs of the source language, and furthermore that all loops will be found, including unstructured loops constructed from C gotos. Discover all strongly connected components (SCCs) of the CFG, i.e. the maximal subgraphs in which a path (not necessarily an edge) exists from each vertex to every other vertex of the subgraph; if a subgraph contains a node dominating all that subgraph's nodes, the subgraph is a loop. The trivial case is, of course, a statement which jumps to itself, à la the BASIC program 10 GOTO 10. Implementation via Kosaraju's Algorithm is simple, with O(|V|+|E|) time complexity using adjacency lists and O(|V|^2) time complexity using an adjacency matrix:
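- Perform a recursive depth-first traversal of the graph starting from s. Each time the recursion returns from a node, append that node to an auxiliary vector. Upon the traversal's completion, this vector holds the nodes in postorder (increasing finishing time); were the graph acyclic, its reverse would be a topological sort.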
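- Until the vector is empty, take its last node and traverse the transpose graph (the CFG with every edge reversed) from it, visiting only nodes still in the vector. Remove the visited nodes from the vector; the sets of nodes removed by each traversal partition the graph into SCCs.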
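The two steps above can be sketched as follows, assuming a successor-map encoding of the CFG in which every node is reachable from s (the function and variable names are invented for the example). Instead of physically deleting nodes from the middle of the vector, the sketch marks them as assigned and skips them when they are popped, which has the same effect.

```python
def kosaraju_sccs(succ, start):
    """Strongly connected components of the graph reachable from start."""
    # Step 1: recursive depth-first traversal, appending each node on return.
    order, seen = [], set()

    def visit(n):
        seen.add(n)
        for m in succ.get(n, ()):
            if m not in seen:
                visit(m)
        order.append(n)

    visit(start)

    # Build the transpose graph (every edge reversed).
    pred = {n: [] for n in order}
    for n in order:
        for m in succ.get(n, ()):
            pred[m].append(n)

    # Step 2: traverse the transpose from the last node of the vector.
    sccs, assigned = [], set()

    def collect(n, comp):
        assigned.add(n)
        comp.add(n)
        for p in pred[n]:
            if p not in assigned:
                collect(p, comp)

    while order:
        n = order.pop()
        if n not in assigned:       # already assigned means already removed
            comp = set()
            collect(n, comp)
            sccs.append(comp)
    return sccs

# 0 -> 1, 1 -> 2, 1 -> 3, 2 -> 1: blocks 1 and 2 form a cycle.
succ = {0: [1], 1: [2, 3], 2: [1], 3: []}
sccs = kosaraju_sccs(succ, 0)   # [{0}, {1, 2}, {3}]
```

A single-node component such as {0} or {3} contains a cycle only if the node has an edge to itself, so such components need that extra check before being treated as loops. Deep graphs may exceed Python's default recursion limit; an explicit stack avoids this at the cost of some clarity.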
Kosaraju's algorithm is improved upon by Tarjan's Algorithm and Gabow's Algorithm, each of which requires only a single depth-first traversal. Natural loop identification proceeds via identification of back edges (edges from a node b to a node a, where a dominates b). A natural loop is associated with every such back edge; if a back edge exists from b to a, the associated loop is entered at a, and consists additionally of all nodes which can reach b without going through a. Similarly, a loop is associated with the target of every back edge, this being the union of the natural loops of all back edges sharing that target.
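Natural loop construction might be sketched as below. The successor-map encoding and the function names are assumptions made for the example, and the dominator sets are written out by hand for this small CFG rather than computed.

```python
def natural_loop(pred, head, tail):
    """Nodes of the natural loop of the back edge tail -> head."""
    # head is placed in the loop up front, so the backward search from tail
    # never walks through it: exactly the nodes reaching tail without head.
    loop = {head, tail}
    work = [tail] if tail != head else []
    while work:
        n = work.pop()
        for p in pred[n]:
            if p not in loop:
                loop.add(p)
                work.append(p)
    return loop

def loops_by_header(succ, dom):
    """Map each back-edge target to the union of its natural loops."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for m in targets:
            pred[m].append(n)
    loops = {}
    for tail, targets in succ.items():
        for head in targets:
            if head in dom[tail]:       # back edge: head dominates tail
                body = natural_loop(pred, head, tail)
                loops.setdefault(head, set()).update(body)
    return loops

# 0 -> 1 -> 2 -> 3 -> 4, with back edges 2 -> 1 and 3 -> 1.
succ = {0: [1], 1: [2], 2: [3, 1], 3: [1, 4], 4: []}
# Dominator sets for this CFG, written out by hand for the example.
dom = {0: {0}, 1: {0, 1}, 2: {0, 1, 2}, 3: {0, 1, 2, 3}, 4: {0, 1, 2, 3, 4}}
loops = loops_by_header(succ, dom)   # {1: {1, 2, 3}}
```

The back edge 2 -> 1 alone yields {1, 2} and the back edge 3 -> 1 yields {1, 2, 3}; the loop associated with header 1 is their union.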