Posted by: Eng. Saad Ayada Al-Hadithi (Admin)
Subject: Branch Prediction (2) — Sunday, May 25, 2008, 12:30 am
1-4 Two-Level Algorithm Improves Accuracy
Even at 85% accuracy, roughly one branch in seven is mispredicted, a rate that can significantly degrade performance in a highly superscalar design. Consider a four-way superscalar processor with a mispredicted-branch penalty of three cycles. With one branch every five instructions, an accuracy of 85% creates a penalty of (4 × 3 × 15%) ÷ 5, or about 0.4 cycles per instruction (CPI). This penalty is a major factor for a processor with a peak throughput of four instructions per cycle (0.25 CPI). For further improvement in prediction accuracy, Intel’s P6 designers turned to the two-level algorithm developed by Yeh, a Ph.D. candidate, and Prof. Patt at the University of Michigan. The two researchers first published this technique in 1991[1] and continued refining it[3,4] until Yeh graduated in 1993; he is now employed by Intel on the P7 program.
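The penalty arithmetic above can be checked in a few lines. The figures are the article's own assumptions (four-way issue, three-cycle penalty, one branch per five instructions, 85% accuracy):

```python
# Misprediction penalty in cycles per instruction (CPI), using the
# article's assumptions.
issue_width = 4          # instructions issued per cycle (peak)
penalty_cycles = 3       # cycles lost per mispredicted branch
branch_every = 5         # one branch per this many instructions
accuracy = 0.85          # fraction of branches predicted correctly

# Each misprediction wastes issue_width * penalty_cycles issue slots,
# amortized over the branch_every instructions per branch.
penalty_cpi = (issue_width * penalty_cycles * (1 - accuracy)) / branch_every
print(round(penalty_cpi, 2))  # 0.36, i.e. roughly 0.4 CPI
```

Against the machine's 0.25-CPI peak, this overhead alone more than doubles the effective CPI, which is why the 85% figure is not good enough.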
A simplistic approach to improving the Smith algorithm is to increase the number of history bits beyond two using the same up/down counter. This approach retains more history information but does little to improve accuracy. In fact, it is actually more sluggish than the two-bit design when a branch changes from consistently taken to consistently not taken, or vice versa. The two-level algorithm instead looks for patterns in an extended history register. For example, suppose a branch has been taken three times in a row, then not taken once. Will the branch return to “taken” on the next iteration, or will it now be consistently not taken? To resolve this question, the algorithm looks at what happened on previous occasions when this same sequence occurred, using that historical data to generate a prediction. Implementation of this algorithm requires two levels of storage, as Figure (1-2) shows.
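The sluggishness of a wider counter is easy to quantify with a short sketch (a hypothetical n-bit saturating counter that predicts taken whenever it is in its upper half):

```python
def mispredictions_after_flip(n_bits):
    """Count mispredictions when a branch that was consistently taken
    (counter saturated at its maximum) becomes consistently not taken."""
    counter = (1 << n_bits) - 1      # saturated at "strongly taken"
    threshold = 1 << (n_bits - 1)    # predict taken if counter >= threshold
    misses = 0
    for _ in range(100):             # branch is now always not taken
        if counter >= threshold:     # predicted taken -> mispredict
            misses += 1
        counter = max(counter - 1, 0)
    return misses

print(mispredictions_after_flip(2))  # 2 misses before the counter adapts
print(mispredictions_after_flip(4))  # 8 -- wider counters adapt more slowly
```

The extra bits buy inertia, not insight: every added bit doubles the number of mispredictions needed to reverse a saturated prediction.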
Figure (1-2): The two-level algorithm uses the contents of a branch history register as an index into a pattern table. Each table entry consists of a two-bit saturating up/down counter. The most-significant bit of the indicated table entry provides the branch prediction.
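The mechanism in Figure (1-2) can be sketched in a few lines of Python for a single branch (the class name and the k = 4 choice are illustrative, not from the original design):

```python
class TwoLevelPredictor:
    """Two-level predictor for one branch: a k-bit history shift register
    indexes a pattern table of two-bit saturating counters (0-3); the
    counter's most-significant bit is the prediction."""
    def __init__(self, k=4):
        self.k = k
        self.history = 0              # last k outcomes, 1 = taken
        self.table = [1] * (1 << k)   # counters start "weakly not taken"

    def predict(self):
        return self.table[self.history] >= 2   # True = predict taken

    def update(self, taken):
        entry = self.history
        if taken:
            self.table[entry] = min(self.table[entry] + 1, 3)
        else:
            self.table[entry] = max(self.table[entry] - 1, 0)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)

# A branch that repeats taken, taken, taken, not-taken becomes fully
# predictable once each history pattern has been seen twice.
p = TwoLevelPredictor(k=4)
misses = 0
for taken in [True, True, True, False] * 12:
    if p.predict() != taken:
        misses += 1
    p.update(taken)
print(misses)  # 6 -- all misses occur in the first two passes of the pattern
```

A two-bit Smith counter on the same stream would keep mispredicting the not-taken branch on every fourth iteration indefinitely; the two-level scheme learns the whole repeating pattern.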
For each branch, a history register maintains the outcomes of the last k branches. Unlike a saturating counter, this circuit is a shift register that records a taken branch as a 1 and a not-taken branch as a 0. Each branch has a pattern table consisting of 2^k entries. Each entry implements a two-bit saturating up/down counter that tracks the results of previous iterations that occurred when the history register was in a given state. When a branch is encountered, the contents of the history register index into the pattern table, selecting the entry that corresponds to the recent history of that branch. As in the Smith algorithm, the two bits of that entry indicate the prediction. After the branch is resolved, the result (taken or not taken) is shifted into the history register and used to update the appropriate entry in the pattern table. To improve performance on tight loops, designs may use the prediction to speculatively update these fields, correcting them in the case of a misprediction.

1-5 Practicality Forces Simplification
As with simpler dynamic approaches, storing a unique history for every possible branch is impractical, so history is kept for only the most recent branches. The two-level approach, however, requires much more storage than the two-bit design: for each branch, there is a pattern table of 2 × 2^k bits, as Figure (1-3) shows.
Figure (1-3): A branch history table consists of a number of entries, each with its own history register. Each history register may map to its own pattern table (PAp), or groups of entries may map to each pattern table (PAs).
Even a relatively limited implementation with k = 4 requires 18 times as much storage as the simple two-bit design. To allow practical implementations with larger values of k, Yeh and Patt propose reducing the number of pattern tables by combining multiple branches in each table[4]. Branches can be grouped into sets based on their address (using a hashing algorithm), opcode, or some other characteristic; branches in the same set share a pattern table. In their taxonomy, Yeh and Patt call this the PAs algorithm, as opposed to the PAp method described above. Although combining multiple branch histories causes some interference, this loss is compensated by the ability to implement larger pattern tables and thus larger history registers. For a given hardware cost, PAs delivers better accuracy than the full-blown PAp. Yeh and Patt simulated a variety of designs that all use about 8 Kbits of storage; at this size, the most accurate configuration is a 1,024 × 6-bit branch history table (i.e., k = 6) with 16 pattern tables (each of 128 bits). Note that, in this case, the BHT consumes 75% of the storage budget, with the remainder for the pattern tables.

For further simplification, the BHT can be reduced to a single branch history register. This global resource simply tracks the results of the last k branches, regardless of their addresses. The pattern tables, however, can still be maintained on a per-address basis (GAp) or can be grouped by set (GAs). By eliminating the BHT, the width of the history register (k) and the number of pattern tables can be increased within a fixed hardware budget. For small implementations such as the 8-Kbit design above, the PAs method provides better results. With a budget of 128 Kbits, however, GAs is superior: a GAs design with an 11-bit history register and 32 pattern tables (each of 2 × 2^11, or 4,096 bits) delivers the best accuracy within this storage budget[4]. The above analysis does not include the effect of context switches.
The GAs approach has an advantage when context switches occur, because it takes fewer iterations to develop reliable history information from a single history register. This may give the global history register an advantage in real-world designs even for smaller implementations.
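The storage figures quoted above are easy to verify with a quick calculation (checking the 18× claim and the two simulated budgets):

```python
# Storage arithmetic for the configurations discussed above.

# Per-branch cost of the two-level scheme: a k-bit history register
# plus a pattern table of 2**k two-bit counters.
def per_branch_bits(k):
    return k + 2 * 2**k

print(per_branch_bits(4) / 2)   # 18.0 -> 18x the 2-bit Smith counter

# 8-Kbit PAs design: a 1,024 x 6-bit branch history table plus
# 16 pattern tables of 2 * 2**6 = 128 bits each.
bht = 1024 * 6
tables = 16 * (2 * 2**6)
print(bht + tables)             # 8192 bits = 8 Kbits
print(bht / (bht + tables))     # 0.75 -> the BHT is 75% of the budget

# 128-Kbit GAs design: an 11-bit global history register plus
# 32 pattern tables of 2 * 2**11 = 4,096 bits each.
print(32 * (2 * 2**11))         # 131072 bits = 128 Kbits
```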
Target Addresses Must Also Be Predicted

Predicting whether the branch is taken is only half the battle; for seamless handling of taken branches, the processor must also be able to immediately redirect the fetch stream to the target address. This feat can be tricky, since the branch instruction typically is not fully decoded in time. If the processor takes an extra cycle to decode the branch and calculate the target, the instructions fetched during that cycle must be discarded if the branch is predicted taken. Many processors use fetch queues to buffer instructions, hoping that a short stall can be absorbed by instructions already in the buffer. To achieve true zero-cycle branches, a few processors cache predicted target addresses. Most Pentium-class designs use a branch target address cache (BTAC) that contains
predicted target addresses. The BTAC is accessed in parallel with the instruction fetch. If the BTAC indicates a branch, the predicted target address is used for the next instruction fetch, redirecting the fetch stream without penalty. The BTAC is often combined with the BHT, forming a single structure that uses a single set of tags for both target addresses and history bits; this combination is often called a branch target buffer (BTB). Current BTAC sizes are typically 256 or 512 entries, smaller than typical instruction caches, so addresses that hit in the cache may not be present in the BTAC. If an address misses the BTAC, the instructions at that address may still contain a taken branch; this fact is detected later in the pipeline, when the instructions are decoded. Such a branch will cause a penalty of one or more cycles. Like any cache, the BTAC can be increased in size or associativity to reduce the miss rate.

No matter how large the BTAC is, some branch targets are difficult to predict. Subroutine returns and other register-based branches do not have a fixed target but instead use the contents of a register to determine their destination. To handle these instructions, some processors use special storage for return addresses. When a subroutine is called, processors such as Cyrix’s M1 and all Alpha chips save the return address on a special stack. When the subroutine later returns, this address is taken from the stack and used to redirect the fetch stream, avoiding the need to wait for the return instruction to actually execute. If subroutine calls are nested more deeply than the number of entries in this address stack (typically four), this method cannot provide the correct target address for returns beyond this limit. One BTAC variation is to cache target instructions instead of addresses.
AMD’s early 29K processors, as well as NexGen’s 586, have a branch target cache (BTC) to store the first several instructions located at various target addresses. When a branch is encountered that hits in the BTC, the processor begins fetching instructions from the BTC while redirecting the main instruction cache to the new address. Each BTC entry must hold enough instructions to feed the CPU until the main cache can resume supplying instructions. The BTC technique is useful when the main instruction cache has a latency of more than one cycle, as in the 586, or if there is no other instruction cache, as in the 29K. Like the 29K, some modern processors are connected to a high-bandwidth memory that has a latency of more than one cycle; sequential instructions can be easily prefetched, but taken branches are a problem. Processors such as Digital’s 21164 and Hal’s Sparc64 use a small (4K–8K) primary instruction cache to provide single-cycle response to branches. In these situations, a BTC can be more effective than a standard instruction cache if only 1K–2K of storage is available.
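The return-address stack described above can be sketched as follows (a fixed-depth stack, assuming the four entries mentioned in the text; silently dropping the oldest entry on overflow is one plausible policy among several):

```python
class ReturnAddressStack:
    """Fixed-depth return-address stack: push on call, pop on return.
    When calls nest deeper than the stack, the oldest entry is lost and
    the corresponding return cannot be predicted."""
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def ret(self):
        # Predicted target of the return; None if nothing is on the stack.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=4)
for addr in [0x100, 0x200, 0x300, 0x400, 0x500]:   # five nested calls
    ras.call(addr)
print([hex(ras.ret()) for _ in range(4)])  # innermost four returns predicted
print(ras.ret())  # None -- the outermost return address was lost
```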
1-6 Alternative Designs Ease Branching
One method of avoiding branch prediction is to follow both the taken and not-taken paths simultaneously, delivering 100% prediction accuracy. The only commercial microprocessor to implement this strategy is SuperSparc, which has enough instruction-cache bandwidth to fetch from two streams (on alternate cycles). SuperSparc also has a relatively short pipeline, only four stages, and can resolve pending branches before unneeded instructions are executed. Few processors have the cache bandwidth to fetch from multiple streams. Furthermore, most processors have longer pipelines, raising the possibility that a second branch could be encountered before the current branch is resolved. In this case, the processor would have to speculate down four (or more) paths at once, a seemingly impractical task.

Regardless of the prediction strategy, keeping the pipeline to a minimal length improves branch performance. If branches are resolved quickly, the percentage of mispredictions is less important. The R10000, for example, uses a six-stage pipeline to keep the mispredicted-branch penalty to two cycles, improving performance on code with many branches or with branches that are difficult to predict. In many advanced processors, however, extra pipeline stages are required to issue several instructions per cycle. As clock speeds increase, some vendors are extending the pipeline to allow multicycle caches. In decoupled designs, branches can stall in execution queues, further increasing the potential branch penalty. These issues all force processor designers to seek improved branch prediction.
1-7 Two-Level Prediction Is Very Accurate
Yeh and Patt found that a two-level BTB similar to the one postulated for the P6 will correctly predict the direction of about 96% of all conditional branches in the SPECint89 suite. This figure does not include mispredicted target addresses, nor does it include the effect of context switches. The P6’s branch-target buffer and return-address stack should deliver high accuracy on target addresses as well. Intel would not quote branch-prediction accuracy for the P6; based on the Yeh and Patt study, we expect that it will achieve 90% to 95% accuracy on programs such as SPECint92 and do even better on SPECfp92. Many real-world applications, however, are notoriously difficult to predict with such accuracy. All processors achieve lower accuracy on these applications, but the P6 will suffer greater performance degradation than many other processors because of its longer branch penalty. As Figure (1-4) shows, the P6 has the most extensive branch-prediction hardware of any announced microprocessor and is the first to implement the two-level prediction algorithm. Just as Alpha’s debut of dynamic branch prediction foreshadowed extensive use of that algorithm in other designs, we expect Yeh and Patt’s two-level method to be adopted in future microprocessors, both x86 and RISC, as vendors continue the quest for greater performance.