Systems and methods for performing branch prediction in a variable length instruction set microprocessor
A method of performing branch prediction in a microprocessor using variable length instructions is provided. An instruction is fetched from memory based on a specified fetch address and a branch prediction is made based on the address. The prediction is selectively discarded if the look-up was based on a non-sequential fetch to an unaligned instruction address and a branch target alignment cache (BTAC) bit of the instruction is equal to zero. In order to remove the inherent latency of branch prediction, an instruction prior to a branch instruction may be fetched concurrently with a branch prediction unit look-up table entry containing prediction information for a next instruction word. Then, the branch instruction is fetched and a prediction is made on this branch instruction based on information fetched in the previous cycle. The predicted target instruction is fetched on the next clock cycle. If zero overhead loops are used, a look-up table of a branch prediction unit is updated whenever the zero-overhead loop mechanism is updated. A last fetch address of a last instruction of a loop body of a zero overhead loop is stored in the branch prediction look-up table. Then, whenever an instruction fetch hits the end of the loop body, the instruction fetch is predictively re-directed to the start of the loop body. The last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
This application claims priority to provisional application No. 60/572,238 filed May 19, 2004, entitled “Microprocessor Architecture,” hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
This invention relates generally to microprocessor architecture and more specifically to an improved architecture and mode of operation of a microprocessor for performing branch prediction.
BACKGROUND OF THE INVENTION
A typical component of a multistage microprocessor pipeline is the branch prediction unit (BPU). Usually located in or near a fetch stage of the pipeline, the branch prediction unit increases effective processing speed by predicting whether a branch to a non-sequential instruction will be taken based upon past instruction processing history. The branch prediction unit contains a branch look-up or prediction table that stores the address of branch instructions, an indication as to whether the branch was taken, and a speculative target address for a taken branch. When an instruction is fetched, if the instruction is a conditional branch, the result of the conditional branch is speculatively predicted based on past branch history. This speculative or predictive result is injected into the pipeline. Thus, referencing a branch history table, the next instruction is speculatively loaded into the pipeline. Whether or not the prediction is correct will not be known until a later stage of the pipeline. However, if the prediction is correct, clock cycles will be saved by not having to go back to get the next instruction address. Otherwise, the pipeline stages behind the stage in which the actual address of the next instruction is determined must be flushed and the correct branch inserted back in the first stage. While this may seem like a harsh penalty for incorrect predictions, in applications where the instruction set is limited and small loops are repeated many times, such as, for example, applications typically implemented with embedded processors, branch prediction is usually accurate enough that the benefits associated with correct predictions outweigh the cost of occasional incorrect predictions, i.e., pipeline flushes. In these types of applications, branch prediction can achieve accuracy over ninety percent of the time.
Thus, the risk of predicting an incorrect branch resulting in a pipeline flush is outweighed by the benefit of saved clock cycles.
While branch prediction is effective at increasing effective processing speed, problems may arise that reduce or eliminate these efficiency gains when dealing with a variable length microprocessor instruction set. For example, if the look-up table is comprised of entries associated with 32-bit wide fetch entities and instructions have lengths varying from 16 to 64 bits, a specific look-up table address entry may not be sufficient to reference a particular instruction.
The description herein of various advantages and disadvantages associated with known apparatus, methods, and materials is not intended to limit the scope of the invention to their exclusion. Indeed, various embodiments of the invention may include one or more of the known apparatus, methods, and materials without suffering from their disadvantages.
As background to the techniques discussed herein, the following references are incorporated herein by reference: U.S. Pat. No. 6,862,563 issued Mar. 1, 2005 entitled “Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design” (Hakewill et al.); U.S. Ser. No. 10/423,745 filed Apr. 25, 2003, entitled “Apparatus and Method for Managing Integrated Circuit Designs”; and U.S. Ser. No. 10/651,560 filed Aug. 29, 2003, entitled “Improved Computerized Extension Apparatus and Methods”, all assigned to the assignee of the present invention.
SUMMARY OF THE INVENTION
Thus, there exists a need for a microprocessor architecture with reduced power consumption, improved performance, a reduced silicon footprint, and improved branch prediction as compared with state-of-the-art microprocessors.
In various embodiments of this invention, a microprocessor architecture is disclosed in which branch prediction information is selectively ignored by the instruction pipeline in order to avoid injection of erroneous instructions into the pipeline. These embodiments are particularly useful for branch prediction schemes in which variable length instructions are predictively fetched. In various exemplary embodiments, a 32-bit word is fetched based on the address in the branch prediction table. However, in branch prediction systems based on the addresses of 32-bit fetch objects, the instruction memory is comprised of 32-bit entries regardless of instruction length. This address may therefore reference a word comprising two 16-bit instruction words, a 16-bit instruction word and part of an unaligned instruction word of larger length (32, 48 or 64 bits), or parts of two unaligned instruction words of such larger lengths.
In various embodiments, the branch prediction table may contain a tag coupled to the lower bits of a fetch instruction address. If the entry at the location specified by the branch prediction table contains more than one instruction, for example, two 16-bit instructions, or a 16-bit instruction and a portion of a 32, 48 or 64-bit instruction, a prediction may be made based on an instruction that will ultimately be discarded. Though the instruction aligner will discard the incorrect instruction, a predicted branch will already have been injected into the pipeline and will not be discovered until branch resolution in a later stage of the pipeline causing a pipeline flush.
Thus, in various exemplary embodiments, to prevent such an incorrect prediction from being made, a prediction will be discarded beforehand if two conditions are satisfied. In various embodiments, a prediction will be discarded if a branch prediction look-up is based on a non-sequential fetch to an unaligned address, and secondly, if the branch target alignment cache (BTAC) bit is equal to zero. This second condition will only be satisfied if the prediction is based on an instruction having an aligned instruction address. In various exemplary embodiments, an alignment bit of zero will indicate that the prediction information is for an aligned branch. This will prevent the predictions based on incorrect instructions from being injected into the pipeline.
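The two-condition discard rule described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name, the use of address bit [1] to detect an unaligned fetch address, and the encoding of the BTAC bit (0 meaning the stored prediction is for an aligned branch) are assumptions made for this sketch.

```python
def should_discard_prediction(nonseq_fetch: bool,
                              fetch_addr: int,
                              btac_alignment_bit: int) -> bool:
    """Sketch of the prediction-discard rule (names are illustrative).

    A fetch address is treated as unaligned when address bit [1] is set,
    i.e. the address does not fall on a 32-bit word boundary.
    """
    unaligned = (fetch_addr & 0x2) != 0
    # Discard only when the look-up came from a non-sequential fetch to an
    # unaligned address AND the BTAC bit indicates the stored prediction
    # is for an aligned branch (bit == 0).
    return nonseq_fetch and unaligned and btac_alignment_bit == 0
```

If either condition fails, the prediction is passed into the pipeline as usual.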
In various embodiments of this invention, a microprocessor architecture is disclosed which utilizes dynamic branch prediction while removing the inherent latency involved in branch prediction. In this embodiment, an instruction fetch address is used to look up in a BPU table recording historical program flow to predict when a non-sequential program flow is to occur. However, instead of using the instruction address of the branch instruction to index the branch table, the address of the instruction prior to the branch instruction in the program flow is used to index the branch in the branch table. Thus, fetching the instruction prior to the branch instruction will cause a prediction to be made and eliminate the inherent one step latency in the process of dynamic branch prediction caused by the fetching the address of the branch instruction itself. In the above embodiment, it should be noted that in some cases, a delay slot instruction may be inserted after a conditional branch such that the conditional branch is not the last sequential instruction. In such a case, because the delay slot instruction is the actual sequential departure point, the instruction prior to the non-sequential program flow would actually be the branch instruction. Thus, the BPU would index such an entry by the address of the conditional branch instruction itself, since it would be the instruction prior to the non-sequential instruction.
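The indexing scheme of the preceding paragraph, including the delay slot special case, can be illustrated with a small table sketch. The class, method names, and addresses below are hypothetical; the sketch only shows which instruction address serves as the table index.

```python
class BPUSketch:
    """Illustrative look-up table keyed by the instruction fetched just
    before the sequential departure point (structure is an assumption)."""

    def __init__(self):
        self.table = {}  # index address -> predicted target address

    def record(self, branch_addr, prev_addr, target, has_delay_slot):
        # Without a delay slot, the branch is the departure point, so the
        # entry is indexed by the address of the instruction before it.
        # With a delay slot, the delay slot instruction is the departure
        # point, so the branch's own address becomes the index.
        index = branch_addr if has_delay_slot else prev_addr
        self.table[index] = target

    def predict(self, fetch_addr):
        # A hit while fetching the indexed instruction yields the target
        # one cycle before the departure point itself is fetched.
        return self.table.get(fetch_addr)
```

Fetching the indexed instruction thus triggers the prediction early, hiding the one-cycle look-up latency.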
In various embodiments, use of a delay slot instruction will also affect branch resolution in the selection stage. In various exemplary embodiments, if a delay slot instruction is utilized, update of the BPU must be deferred for one execution cycle after the branch instruction. This process is further complicated by the use of variable length instructions. Performance of branch resolution after execution requires updating of the BPU table. However, when the processor instruction set includes variable length instructions it becomes essential to determine the last fetch address of the current instruction as well as the update address, i.e., the fetch address prior to the sequential departure point. In various exemplary embodiments, if the current instruction is an aligned or non-aligned 16-bit or an aligned 32-bit instruction, the last fetch address will be the instruction fetch address of the current instruction. The update address of an aligned 16-bit or aligned 32-bit instruction will be the last fetch address of the prior instruction. For a non-aligned 16-bit instruction, if it was arrived at sequentially, the update address will be the update address of the prior instruction. Otherwise, the update address will be the last fetch address of the prior instruction.
In the same embodiment, if the current instruction is a non-aligned 32-bit or an aligned 48-bit instruction, the last fetch address will simply be the address of the next instruction. The update address will be the current instruction address. The last fetch address of a non-aligned 48-bit instruction or an aligned 64-bit instruction will be the address of the next instruction minus one, and the update address will be the current instruction address. If the current instruction is a non-aligned 64-bit instruction, the last fetch address will be the same as the next instruction address and the update address will be the next instruction address minus one.
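As a sketch only, the per-scenario rules of the two preceding paragraphs can be collected into a single function. All addresses here are taken to be 32-bit fetch-word indices (i.e., addr[31:2]), and the parameter names are illustrative assumptions rather than terms from the source.

```python
def last_fetch_and_update(size_bits: int, aligned: bool,
                          instr_word: int, next_word: int,
                          prev_last: int, prev_update: int,
                          sequential: bool):
    """Return (L0, U0): last fetch address and update address of the
    current instruction, per the scenarios described in the text."""
    if size_bits == 16 or (size_bits == 32 and aligned):
        # Aligned/non-aligned 16-bit or aligned 32-bit: L0 is the
        # instruction's own fetch address.
        last = instr_word
        if size_bits == 16 and not aligned:
            # Non-aligned 16-bit: U0 depends on how it was reached.
            update = prev_update if sequential else prev_last
        else:
            # Aligned 16-bit or aligned 32-bit: U0 is the previous
            # instruction's last fetch address.
            update = prev_last
    elif (size_bits == 32 and not aligned) or (size_bits == 48 and aligned):
        # Spans into the next fetch word: L0 is the next instruction's
        # address, U0 is the current instruction address.
        last, update = next_word, instr_word
    elif (size_bits == 48 and not aligned) or (size_bits == 64 and aligned):
        last, update = next_word - 1, instr_word
    else:  # non-aligned 64-bit
        last, update = next_word, next_word - 1
    return last, update
```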
In exemplary embodiments of this invention, a microprocessor architecture is disclosed which employs dynamic branch prediction and zero overhead loops. In such a processor, the BPU is updated whenever the zero-overhead loop mechanism is updated. Specifically, the BPU needs to store the last fetch address of the last instruction of the loop body. This allows the BPU to predictively re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. In this embodiment, the last fetch address of the loop body can be derived from the address of the first instruction after the end of the loop, despite the use of variable length instructions, by exploiting the fact that instructions are fetched in 32-bit word chunks and that instruction sizes are in general an integer multiple of 16 bits. Therefore, in this embodiment, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body. Otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has the same fetch address as the next instruction after the loop body.
At least one exemplary embodiment of the invention provides a method of performing branch prediction in a microprocessor using variable length instructions. The method of performing branch prediction in a microprocessor using variable length instructions according to this embodiment comprises fetching an instruction from memory based on a specified fetch address, making a branch prediction based on the address of the fetched instruction, and discarding the branch prediction if (1) the branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address and (2) if a branch target alignment cache (BTAC) bit of the instruction is equal to zero.
At least one additional exemplary embodiment provides a method of performing dynamic branch prediction in a microprocessor. The method of performing dynamic branch prediction in a microprocessor according to this embodiment may comprise fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction word on a first clock cycle, fetching the last instruction word prior to a non-sequential program flow and making a prediction on non-sequential program flow based on information fetched in the previous cycle on a second clock cycle, and fetching the predicted target instruction on a third clock cycle.
Yet an additional exemplary embodiment provides a method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor. The method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor may comprise storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table, and predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The following description is intended to convey a thorough understanding of the invention by providing specific embodiments and details involving various aspects of a new and useful microprocessor architecture. It is understood, however, that the invention is not limited to these specific embodiments and details, which are exemplary only. It further is understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
In
Returning to step 220, in this step a second determination is made as to whether the branch target address cache (BTAC) alignment bit is 0, indicating that the prediction information is for an aligned branch. This bit will be 0 for all aligned branches and 1 for all unaligned branches because it is derived from the instruction address. The second bit of the instruction address will always be 0 for aligned branches (i.e., addresses ending in 0, 4, 8, c, etc.) and will always be 1 for unaligned branches (i.e., addresses ending in 2, 6, a, e, etc.). If, in step 220, it is determined that the branch target address cache (BTAC) alignment bit is not 0, operation proceeds to step 225 where the prediction is passed. Otherwise, if in step 220 it is determined that the BTAC alignment bit is 0, operation of the method proceeds to step 230, where the prediction is discarded. Thus, rather than causing an incorrect instruction to be injected into the pipeline, which would ultimately cause a pipeline flush, the next sequential instruction will be correctly fetched. After step 230, operation of the method is the same as after step 225: the next fetch address is updated in step 235 based on whether a branch was predicted, and operation returns to step 205 where the next fetch occurs.
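Because the alignment bit is derived directly from the instruction address, computing it reduces to extracting address bit [1]. The helper below is an illustrative sketch of that derivation.

```python
def btac_alignment_bit(branch_addr: int) -> int:
    """Derive the alignment bit from address bit [1] (sketch).

    Aligned branch addresses (0x0, 0x4, 0x8, 0xc, ...) yield 0;
    unaligned ones (0x2, 0x6, 0xa, 0xe, ...) yield 1.
    """
    return (branch_addr >> 1) & 0x1
```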
As discussed above, dynamic branch prediction is an effective technique to reduce branch penalty in a pipeline processor architecture. This technique uses the instruction fetch address to look up in internal tables recording program flow history to predict the target of a non-sequential program flow. Also, discussed above, branch prediction is complicated when a variable-length instruction architecture is used. In a variable-length instruction architecture, the instruction fetch address cannot be assumed to be identical to the actual instruction address. This makes it difficult for the branch prediction algorithm to guarantee sufficient instruction words are fetched and at the same time minimize unnecessary fetches.
One known method of ameliorating this problem is to add extra pipeline stages to the front of the processor pipeline to perform branch prediction prior to the instruction fetch to allow more time for the prediction mechanism to make a better decision. A negative consequence of this approach is that extra pipeline stages increase the penalty to correct an incorrect prediction. Alternatively, the extra pipeline stages would not be needed if prediction could be performed concurrent to instruction fetch. However, such a design has an inherent latency in which extra instructions are already fetched by the time a prediction is made.
Traditional branch prediction schemes use the instruction address of a branch instruction (non-sequential program instruction) to index its internal tables.
Referring specifically to
It should be noted that in some cases, due to the use of delay slot instructions, the branch instruction may not be the departure point (the instruction prior to non-sequential flow). Rather another instruction may appear after the branch instruction. Therefore, though the non-sequential jump is dictated by the branch instruction, the last instruction to be executed may not be the branch instruction, but may rather be the delay slot instruction. A delay slot is used in some processor architectures with short pipelines to hide branch resolution latency. Processors with dynamic branch prediction might still have to support the concept of delay slots to be compatible with legacy code. Where a delay slot instruction is used after the branch instruction, utilizing the above branch prediction scheme will cause the instruction address of the branch instruction, not the instruction before the branch instruction, to be used to index the BPU tables, because this instruction is actually the instruction before the last instruction. This fact has significant consequences for branch resolution as will be discussed below. Namely, in order to effectively perform branch resolution, we must know the last fetch address of the previous instruction.
As stated above, branch resolution occurs in the selection stage of the pipeline and causes the BPU to be updated to reflect the outcome of the conditional branch during the write-back stage. Referring to
Two pieces of information need to be computed for every instruction under this scheme: namely, the last fetch address of the current instruction, L0, and the update address of the current instruction, U0. In the case of the scenarios of group one, it is also necessary to know L−1, the last fetch address of the previous instruction, and U−1, the update address of the previous instruction. Looking at the first and second scenarios (a non-aligned 16-bit instruction, and an aligned 16- or 32-bit instruction, respectively), L0 is simply the 30 most significant bits of the fetch address, denoted as instr_addr[31:2]. Because in both of these scenarios the instruction spans only one fetch address line, the update address U0 depends on whether the instruction was arrived at sequentially or as the result of a non-sequential instruction. However, in keeping with the method discussed in the context of
Scenarios 3-5 can be handled in the same manner by taking advantage of the fact that each instruction fetch fetches a contiguous 32-bit word. Therefore, when the instruction is sufficiently long and/or unaligned to span two or more consecutive fetched instruction words in memory, we know with certainty that L0, the last fetch address, can be derived from the instruction address of the next sequential instruction, denoted as next_addr[31:2] in
In yet another embodiment of the invention, a method and apparatus are provided for computing the last instruction fetch of a zero-overhead loop for dynamic branch prediction in a variable length instruction set microprocessor. Zero-overhead loops, as well as the previously discussed dynamic branch prediction, are both powerful techniques for improving effective processor performance. In a microprocessor employing both techniques, the BPU has to be updated whenever the zero-overhead loop mechanism is updated. In particular, the BPU needs the last instruction fetch address of the loop body. This allows the BPU to re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. However, in a variable-length instruction architecture, determining the last fetch address of a loop body is not trivial. Typically, a processor with a variable-length instruction set only keeps track of the first address an instruction is fetched from. However, the last fetch address of a loop body is the fetch address of the last portion of the last instruction of the loop body and is not readily available.
Typically, a zero-overhead loop mechanism requires an address related to the end of the loop body to be stored as part of the architectural state. In various exemplary embodiments, this address can be denoted as LP_END. If LP_END is assigned the address of the next instruction after the last instruction of the loop body, the last fetch address of the loop body, designated in various exemplary embodiments as LP_LAST, can be derived by exploiting two facts. Firstly, despite the variable length nature of the instruction set, instructions are fetched in fixed size chunks, namely 32-bit words. The BPU works only with the fetch address of these fixed size chunks. Secondly, instruction sizes in a variable-length instruction set are usually an integer multiple of a fixed size, namely 16 bits. Based on these facts, an instruction can be classified as aligned if the start address of the instruction is the same as the fetch address. If LP_END is an aligned address, LP_LAST must be the fetch address that precedes that of LP_END. If LP_END is non-aligned, LP_LAST is the fetch address of LP_END. Thus, the equation LP_LAST = LP_END[31:2] − (~LP_END[1]) can be used to derive LP_LAST whether or not LP_END is aligned.
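The derivation of LP_LAST from LP_END described above can be sketched as follows. Treating lp_end as a byte address and returning a 32-bit fetch-word index are assumptions made for illustration.

```python
def lp_last(lp_end: int) -> int:
    """Sketch of LP_LAST = LP_END[31:2] - (~LP_END[1])."""
    word = lp_end >> 2              # LP_END[31:2], the fetch-word index
    unaligned = (lp_end >> 1) & 1   # LP_END[1]: 1 when LP_END is unaligned
    # Subtract 1 only when LP_END is aligned (~LP_END[1] == 1); an
    # unaligned LP_END shares its fetch word with the loop's last fetch.
    return word - (1 - unaligned)
```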
Referring to
While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only, and are not to be interpreted as limitations of the present invention. Many modifications to the embodiments described above can be made without departing from the spirit and scope of the invention.
Claims
1. In a microprocessor, a method of performing branch prediction using variable length instructions, the method comprising:
- fetching an instruction from memory based on a specified fetch address;
- making a branch prediction based on the address of the fetched instruction; and
- discarding the branch prediction if: (1) a branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address; and (2) a branch target alignment cache (BTAC) bit of the instruction is equal to a predefined value.
2. The method according to claim 1, further comprising passing a predicted instruction associated with the branch prediction if either (1) or (2) is false.
3. The method according to claim 2, further comprising updating a next fetch address if a branch prediction is incorrect.
4. The method according to claim 3, wherein the microprocessor comprises an instruction pipeline having a select stage, and updating comprises, after resolving a branch in the select stage, updating the branch prediction unit (BPU) with the address of the next instruction resulting from that branch.
5. The method according to claim 1, wherein making a branch prediction comprises parsing a branch look-up table of a branch prediction unit (BPU) that indexes non-sequential branch instructions by their addresses in association with the next instruction taken.
6. The method according to claim 1, wherein an instruction is determined to be unaligned if it does not start at the beginning of a memory address line.
7. The method according to claim 1, wherein a BTAC alignment bit will be one of a 0 or a 1 for an aligned branch instruction and the other of a 0 or a 1 for an unaligned branch instruction.
8. In a microprocessor, a method of performing dynamic branch prediction comprising:
- fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction on a first clock cycle;
- fetching the last instruction word prior to a non-sequential program flow and making a prediction on this non-sequential program flow based on information fetched in the previous cycle on a second clock cycle; and
- fetching the predicted target instruction on a third clock cycle.
9. The method according to claim 8, wherein fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the instruction just prior to the branch instruction in the program flow to index the branch in the branch table.
10. The method according to claim 9, wherein if a delay slot instruction appears after the branch instruction, fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the branch instruction, not the instruction before the branch instruction, to index the BPU tables.
11. A method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor, the method comprising:
- storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table; and
- predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
12. The method according to claim 11, wherein storing comprises, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body, otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has a last fetch address the same as the address of the next instruction after the loop body.
Type: Application
Filed: May 19, 2005
Publication Date: Dec 15, 2005
Inventors: Kar-Lik Wong (Wokingham), James Hakewill (London), Nigel Topham (Midlothian), Rich Fuhler (Santa Cruz, CA)
Application Number: 11/132,428