[left]
1-1 Introduction
Branch prediction has been a problem for CPU designers since the advent of pipelining. A pipelined processor must fetch the next instruction before the current one has executed. If the current instruction is a conditional branch, the processor must decide whether to fetch from the target address, assuming the branch will be taken, or from the next sequential address, assuming the branch will not be taken. An incorrect guess causes the pipeline to stall until it is refilled with valid instructions; this delay is called the mispredicted branch penalty.
Processors with a simple five-stage pipeline typically have a two-cycle branch penalty. For a four-way superscalar design, however, this could mean a loss of eight instructions. If the pipeline is extended, the branch penalty usually increases, resulting in the loss of even more instructions. Since programs typically encounter branches every 4–6 instructions, inaccurate branch prediction causes severe performance degradation in highly super scalar or deeply pipelined designs.
Initial efforts at branch prediction used simple algorithms based on the direction of the branch. Among commercial microprocessors, the MIPS R6000 pioneered the use of compiler “hints” to direct branch prediction. Digital’s 21064 was the first microprocessor to store branch history information, with the P6 leading the way to two-level prediction.
1-2 Simple Hardware Can Achieve 65%
For scalar processors with relatively short pipelines, branch prediction is less of a concern. In fact, for processors with a branch delay slot, the branch penalty can be as little as one cycle. The default “prediction” method for simple pipelined designs is to assume that branches are not taken, always fetching sequential instructions. The 486 and most embedded processors use this scheme because of its simplicity and low cost.
It turns out, however, that conditional branches are taken more often than not. Most programs make heavy use of loops, which repeatedly branch to the same address. Simulations show that conditional branches are taken about 60% of the time in the SPECint89 suite and more often in scientific code such as the SPECfp89 benchmarks [1]. Thus, a simple optimization is to always predict branches to be taken.
A better algorithm takes into account the direction of the branch. Backward branches typically complete loop iterations and thus are taken as much as 80% of the time or more. Forward branches are more difficult to predict but tend to be not taken more often than taken. Thus, by simply looking at the direction of the branch (usually available as the sign bit of the offset), a processor can predict backward branches taken and forward branches not taken. This BTFN algorithm succeeds about 65% of the time for SPECint89. MicroSparc-2 and most PA-RISC processors use BTFN.
1-3 Dynamic Prediction Uses History
The previous algorithms are classified as static schemes, because any particular branch is always predicted in the same way whenever it is encountered. To achieve greater accuracy, dynamic algorithms take into account run-time information. The processor learns from its mistakes and changes its predictions to match the behavior of each particular branch.
A dynamic algorithm keeps a record of previous branch behavior, allowing it to improve its predictions over time. A simple scheme, published by James Smith in 1981[1], maintains a single history bit for each branch. When a branch is encountered, it is predicted to go the same way it did the previous time, as indicated by the bit. This technique can push accuracy to 80%. As a practical matter, there are two ways to implement this scheme. The history bits can be kept in the instruction cache, for example, one per every four instructions. When instructions are fetched from the cache, the history bit comes along. If the bit is set, that group of instructions contains a predicted-taken branch, and the fetch stream is redirected. In this example, the storage overhead would be less than 1% of the cache area. Although this method—used by Digital’s Alpha, AMD’s K5, and other processors—provides dynamic prediction with minimal cost, it has some drawbacks. Some groups of instructions will not contain a branch, wasting the history bit. Groups with multiple branches create interference, as the history of one branch overwrites that of another in the same group.
Processors such as Pentium store the history bits in a separate branch history table (BHT), assigning one entry per branch. By avoiding the interference and unused bits of the previous scheme, the BHT offers improved accuracy. Alternatively, similar accuracy is achieved with fewer entries. The BHT, however, must maintain its own set of tags, greatly increasing the amount of storage required.
Given the overhead of tag storage, most processors with a separate BHT store two bits of history per entry instead of just one bit. In this method, also elucidated by Smith[2], the two bits can be thought of as a saturating counter that is incremented when the branch is taken and decremented when it is not;
the most-significant bit is used to predict future occurrences. Another way to look at this implementation is as a state machine, which is depicted in Figure (1-1).
Figure (1-1).
In the two-bit Smith algorithm, the two history bits implement a state machine with four possible states: strongly taken (ST),weakly taken (WT), weakly not taken (WNT), and strongly not taken(SNT). In ST and WT, future branches are predicted taken; in WNTand SNT, branches are predicted not taken.
The advantage of the two-bit method is that a single unusual iteration will not change the predicted direction.For example, if a branch has been taken many times in succession, the state machine will be in the Strongly Taken state (3). If the branch is then not taken, the history bits will indicate Weakly Taken but still predict the next iteration as taken. Only if the branch is not taken two or more times consecutively will the prediction change to not taken. This hysteresis effect can boost prediction accuracy to 85% on SPECint92, depending on the size and type of history table that is used.
يتبع-------------