Control Flow Analysis
Subsequent branch-free instructions make up segments of linear code. The first instruction of the program is a leader, as is every target of a branch, and every instruction immediately following a branch (including conditional branches and procedure returns) is a leader. A basic block is the maximum segment of linear code associated with each leader -- it ends with either the program's last instruction or the first branch following a leader. Points exist before and after each statement in a basic block (there are thus a minimum of 2 points per block, and s + 1 for a block with s statements). A basic block a flows to b if and only if:
- either b immediately follows a, and a does not end in an unconditional branch,
- or a ends in a branch, of which b is a potential target.
Note that an indirect branch, without context information, trivializes all blocks (every instruction becomes a leader) and flows to them all from at least that point (an arborescence is formed)!
The directed multigraph defined by interpreting basic blocks as vertices, and flow relationships as edges, yields its control flow graph (CFG). A start node exists for each CFG, corresponding to the basic block whose header is the first instruction of the program.
The antisymmetric, transitive, reflexive domination relation is defined on vertices of a CFG (and thus basic blocks of the underlying program). A vertex a dominates b (a <= b) if every path from the start node s to b passes through a. A vertex a properly dominates b (a < b) if a dominates and is not equal to b. A vertex a directly/immediately dominates b (a <d b) if a properly dominates b, and a dominates no vertex c that dominates b. This relation induces the dominator tree, where nodes dominate all descendents in the tree. The start node s dominates all nodes, properly dominates all nodes but itself, and roots the dominator tree.
- It should be obvious that a's preceding of b in the CFG is not necessary for even immediate dominance of b by a.
Paths p1,p2...pn are defined as sequences of points such that either pi+1 immediately follows pi in a basic block, or pi+1 begins a successor block to a block terminated by pi. It is on these points that dataflow equations are evaluated.
Dataflow analysis is most usefully performed into and out of regions, subsets of the nodes such that a header exists which dominates all nodes in the region, and all edges between nodes in the region are themselves in the region. A loop is a region which is strongly connected, where all back-edges to the header are themselves within the region FIXME -- unclear.
Loops can be discovered via domination analysis (it is important to note that this refers to loops in the generated code, not loop constructs of the source language, and furthermore that all possible loops will be found (ie, unstructured loops constructed from C gotos)). Discover all strongly-connected subgraphs (SCCs) of the CFG (subgraphs where, for each vertex, a path (not necessarily an edge) exists from that vertex to all other nodes of the subgraph); if a subgraph contains a node dominating all that subgraph's nodes, the subgraph is a loop. The trivial case is, of course, a statement which jumps to itself, ala the BASIC program 10 GOTO 10. Implementation via Kosaraju's Algorithm is simple, with O(|V|+|E|) time complexity using graph encoding and O(N2) time complexity using adjacency matrices:
- Perform a recursive depth-first traversal of the graph starting from s. Each time you return, add that node onto an auxiliary vector. Upon the traversal's completion, this vector sorts the nodes topologically.
- Until the the vector is empty, use the last node of the vector to begin traversing the transpose graph. Remove the path from the vector; these paths partition the graph into SCCs.
Kosaraju's algorithm is improved upon by Tarjan's Algorithm and Gabow's Algorithm. Natural loop identification proceeds via identification of back edges (edges from a node b to a node a, where a dominates b). A loop is associated with every such back edge; if a backedge exists from b to a, the associated loop is entered at a, and consists additionally of all nodes which can reach b without going through a. Similarly, a loop is associated with the target of every back edge, this being the union of all such backedges' associated natural loops.
One major purpose of dataflow analysis is observing safety constraints across reordering transformations. If a statement S writes to a variable V, S is said to define V. If statement S reads from a variable V, S is said to use V (the two are not mutually exclusive). A definition of v is killed between p1 and p2 if every path between them contains a definition of v; conversely, if a path exists from p1 to p2 which does not redefine v, and v has been defined on input to p1, it reaches p2 from p1. In this situation, v would furthermore be live at p1 (or immediately after p1, if p1 assigned it); a variable is live at a point if that instantiation of the variable might be used in the future.
- Reaching analysis -- forwards dataflow problem, set union as confluence operator
- Copy propagation eliminates assignment statements, when safe (when modification of upwards-exposed uses of the target may be modified)
- Liveness analysis -- backwards-may dataflow problem, set union as confluence operator
- Use-Define chain -- a use U of a variable, and all definitions of that variable which can reach it (those live relative to U)
- Def-Use chain -- a definition D of a variable, and all uses of that variable it can reach (those reached from D)
- Common subexpression elimination lexically matches subexpressions, and can preserve them in registers
- SSA (Single Static Assignment) and CPS (Continuation-Passing Style) are equivalent (Kelsey 1995, Appel 1998 FIXME full citations!)
- SSA makes use-def chains explicit, and restricts them to a single element each
- It's much easier to perform global value numbering optimizations in this representation (see Muchnick, p378-396)
- Global value numbering is much more complete that basic common subexpression elimination
- "On the perfect accuracy of an approximate subscript analysis test" (Klappholz, Psarris, Kong, 1990) analyzes the GCD and Banerjee inequalities, explaining the crappiness of the former and general robustness of the latter. "On the Accuracy of the Banerjee Test" (same authors, 1991) suggests improvements on the Banerjee test.
- "Eliminating False Data Dependencies using the Omega Test" (Pugh, Wonnacott, 1992) moves from integer programming-based (Diophantine) solutions to a subclass of the Presburger formulae.
- Primary references: Allen/Kennedy ch. 2-3, Muchnick ch. 14
- Statement order within a loop is never a factor in preserving loop-carried dependencies
- Why people still write in FORTRAN instead of C
- Type-based analysis (see ANSI C99's restrict keyword), gcc's -fstrict-aliasing directive)
- Andersen's points-to analysis is cubic but more precise than Steensgaard's essentially-linear algorithm (See "Fast and Accurate Flow-Insensitive Points-To Analysis" (Shapiro, Horwitz, 1997))
The role of data dependence in memory optimization
- Allen/Kennedy p.383 sums up dataflow analysis's dual nature beautifully. FIXME rip it off
- A value computed at the dependence's source (write) and used at its sink (read) is a true/flow dependence
- Once brought to a lower-level memory, it can be used without refetch so long as it's preserved
- A value read prior to being recomputed is an antidependence
- Can exploit sub-unit spatial dependence, primarily with cache blocks
- A value assigned, and then reassigned prior to use, is an output dependence
- Restricts life of usages (provides more precise temporal locality information)
- A value read, and then read again prior to use, is an input dependence
- Groups life of usages together (provides more complete temporal locality information)
- Primary references: Allen/Kennedy ch. 8, Muchnick ch. 16
- Take advantage of temporal locality and, in a sense, spatial locality at the subword level.
- Loop fusion merges loops in order to enhance temporal locality
- Especially relevant for machine-generated (preprocessed) input
- Primary references: Allen/Kennedy ch. 9, Muchnick ch. 20. See also Caches on the Architecture page.
- The benefits of spatial locality can now be taken advantage of at the more natural word level.
- Loop interchange -- are we striding through nested loops optimally? If not, we can interchange levels, provided all dependencies are preserved. Especially powerful when combined with hardware- or software-based prefetching and further augmented with non-blocking caches. Optimal interpositioning is NP-hard in general but excellent heuristics exist. See Allen/Kennedy p. 472-474.
- This can be valuable enough that we perform small transformations to accommodate it: loop alignment, loop peeling, etc
- Loop blocking/tiling/strip-mining -- breaking out subloops to work on a cache-friendly block at once
- Especially good for parallelizing across coherent processing elements with local, coherent caches -- coherency traffic is minimized
- Also well-suited for cases of reuse between iterations of a non-innermost loop
- Loop safety is closed under blocking operations, but not necessarily under interchange.
- Blocking can, in some cases, make subsequent interchange safe! "If the strip size is less than or equal to the threshold of the dependence that might prevent interchange, then the strip-mine-and-interchange is legal, even though the interchange would not be." (Allen/Kennedy p. 480)