Fused instruction operation for performing an operation and adjusting a sign of the operation result
Techniques are disclosed involving fusing instruction pairs and executing corresponding fused instruction operations. A processor includes fusion detection circuitry to detect a pair of fetched instructions and fuse the instructions into a fused instruction operation, and execution circuitry to execute the fused instruction operation. In one embodiment, a first instruction is executable to perform an operation and a second instruction is executable to adjust a sign of a result of the operation. In another embodiment, the first instruction is executable to perform an operation and the second instruction is executable to find a maximum or minimum, as compared to a comparison operand, of a result of the operation. In another embodiment, the first instruction is executable to perform a vector operation and the second instruction is executable to read a first element of the vector result and overwrite one or more additional elements of the vector result.
The present application claims priority to U.S. Provisional App. No. 63/376,699 entitled “Instruction Fusion,” filed Sep. 22, 2022, the disclosure of which is incorporated by reference herein in its entirety.
BACKGROUND
Technical Field
This disclosure relates generally to a computer processor and, more specifically, to the fusion of certain instructions.
Description of the Related Art
Modern computer systems often include processors that are integrated onto a chip with other computer components, such as memories or communication interfaces. During operation, those processors execute instructions to implement various software routines, such as user software applications and an operating system. As part of implementing a software routine, a processor normally executes various different types of instructions, such as instructions to generate values needed by the software routine. The specific set of instructions executed by a given processor is defined by the processor's instruction set architecture (ISA).
Instructions executed by a processor may perform operations on data represented using various formats, such as integer format or floating-point format. Some processor embodiments use separate execution units, or execution circuits, for integer instructions and floating-point instructions. Processors may also use separate execution circuits for vector instructions and scalar instructions. In some cases, vector instructions are handled by an execution circuit that also handles floating-point instructions.
As mentioned above, the set of instructions available to a programmer using a given processor is defined by the processor's instruction set architecture (ISA). There are a variety of instruction set architectures in existence (e.g., the x86 architecture originally developed by Intel, ARM from ARM Holdings, Power and PowerPC from IBM/Motorola, etc.). Each instruction is defined in the instruction set architecture, including its coding in memory, its operation, and its effect on registers, memory locations, and/or other processor state. For a given ISA, there are often operations that programmers want to implement that do not correspond to a single instruction in the ISA. Such operations may therefore be implemented using two or more instructions.
Using a pair (or more) of instructions to implement an operation that could be done with one instruction can cause technical problems that reduce processor performance in multiple ways. As one example, execution of two instructions may increase the latency, or number of clock cycles required, to implement an operation. An increase in latency may particularly result if one or both of the two instructions implements a simple operation that can be done in a single cycle.
In addition to potentially increasing latency of a processor operation, using a pair of instructions rather than a single instruction can reduce performance by adding to traffic in the processor's instruction pipeline, potentially increasing power usage or congestion in elements such as the scheduler and reservation stations. Therefore, “fusing” a pair of instructions for execution as a single decoded instruction (or “instruction operation” as used herein) can reduce the amount of resources that would otherwise be consumed by processing those instructions separately. For example, an entry of a re-order buffer may be saved by storing one instead of two decoded instructions and an additional physical register may not need to be allocated. As another example, dispatch bandwidth, or a number of instruction operations dispatched to a reservation station per cycle, may be lowered by instruction fusion. In addition, issue bandwidth, or a number of instruction operations scheduled to an execution unit per cycle, may be lowered by fusion. More efficient and/or lower-power operation of the processor at multiple stages may therefore result from instruction fusion.
The inventors have recognized certain instruction pairs that can be fused for implementation as single instruction operations using additional or modified execution logic. The present disclosure describes techniques for detecting, fusing, and executing such instruction pairs. Embodiments of the disclosed processors and methods implement fused execution of one or more of the types of instruction pairs described herein.
In one embodiment, an instruction pair detected for fusing includes a first instruction for performing an operation to produce an operation result followed by a second instruction for adjusting a sign of the operation result. The second instruction for adjusting the sign may include, for example, a negation instruction or an absolute value instruction. The first instruction may, in various embodiments, include one of various arithmetic operation instructions, a move instruction, or a maximum instruction. In embodiments described herein, the first and second instructions are fused into a single instruction operation that is executable using specific execution circuitry to perform the operation and adjust the sign of the operation result.
In an embodiment described herein, an instruction pair detected for fusing includes a first instruction for performing an operation followed by a second instruction for finding a maximum or a minimum of the result of the operation as compared to an additional operand. The first instruction may be an instruction for implementing various arithmetic and other operations. In an embodiment, the first instruction is an instruction having fewer operands than the maximum number of operands supported by the ISA. For example, if the ISA supports no more than three operands for an instruction, the first instruction is in some embodiments an instruction using no more than two operands. In such an embodiment, the first and second instructions can be fused such that the additional operand for the maximum or minimum operation is carried by the fused instruction operation. In some embodiments, the first instruction is also a maximum or minimum instruction, so that the fused instruction operation is executable to take a maximum, minimum or combination of the two among a group of operands, such as a group of three operands. Such a fusion may reduce the number of instructions needed to perform a three-way comparison, which may reduce power and bandwidth demands on the processor and decrease latency for performing the comparison.
Another embodiment of an instruction pair detected for fusing includes a first instruction that is executable to perform a vector operation and write a result of the operation to a vector register and a second instruction that is executable to read a first element of the vector result from the vector register and overwrite one or more additional elements of the vector result with the first element. In embodiments described herein, the first and second instructions are fused into a single instruction operation that is executable using execution circuitry to perform the vector operation and write the first element of the vector result to the portions of the vector register that the first element would have been written to by the second instruction. In this way, cycles associated with initially storing the full vector result, then reading the first element to overwrite some of the initially stored elements may be saved. In various embodiments, the operands and result of the vector operation are stored in any of various formats including a packed integer format or a packed floating-point format.
In some embodiments, the first and second instructions for fusion techniques as described herein are floating-point instructions operating on floating-point values. Fusion of floating-point instructions as described herein may be particularly advantageous because some floating-point processors use a minimum of two clock cycles to schedule an instruction. Each instruction operation executed therefore contributes at least two cycles of latency in this type of processor, so that executing an instruction operation to complete a simple operation that could be accomplished in one cycle wastes at least one cycle if there is a way to combine the simple operation with execution of another instruction operation. Embodiments of execution circuitry as described herein may allow implementation of two-instruction operations under a given ISA by executing a single fused instruction operation. Fusion of floating-point instructions may also be particularly advantageous in the above-described embodiment of fusing an instruction to perform an operation with an instruction to adjust a sign of the result of the operation, because floating-point representations may use a single bit to represent the sign of a number. Changing a sign in such an embodiment is therefore a matter of changing a single bit, which can be implemented efficiently in execution circuitry.
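The single-bit nature of a floating-point sign adjustment can be illustrated in software. The following is a minimal sketch, assuming IEEE 754 binary64 values (the format of a Python float); the helper names to_bits, from_bits, negate, and absolute are hypothetical and are not part of the disclosure. Negation toggles the sign bit and absolute value clears it, leaving the exponent and fraction bits untouched.

```python
import struct

SIGN_BIT = 1 << 63  # position of the sign bit in an IEEE 754 binary64 value


def to_bits(x: float) -> int:
    """Reinterpret a binary64 float as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a binary64 float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def negate(x: float) -> float:
    """Negation: toggle only the sign bit."""
    return from_bits(to_bits(x) ^ SIGN_BIT)


def absolute(x: float) -> float:
    """Absolute value: clear only the sign bit."""
    return from_bits(to_bits(x) & (SIGN_BIT - 1))
```

Because neither helper inspects the exponent or fraction, the same bit manipulation applies to any value produced by the preceding operation, which is consistent with the sign adjustment being inexpensive to append to another operation.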
In some embodiments, the first and second instructions for fusion techniques as described herein are instructions for performing operations on packed integer data, in which multiple integer values are stored in a single register, or packed floating-point data, in which multiple floating-point values are stored in a single register. Such instructions may be referred to as vector instructions. In some processors, scheduling of vector instructions uses a minimum of two cycles, so that fusion of packed integer operations may have latency advantages in a manner similar to fusion of floating-point operations in floating-point processors having a minimum two-cycle scheduling delay.
In an embodiment, execution circuitry 104 includes an operation execution circuit configured to produce an operation result and an additional execution circuit configured to perform an additional operation using the operation result. In some embodiments, the additional execution circuit includes a sign adjust circuit, as shown in
Turning to
Fetch and decode circuit 210, in various embodiments, is configured to fetch instructions for execution by processor 200 and decode the instructions into instruction operations (briefly “ops”) for execution. More particularly, fetch and decode circuit 210 may be configured to cache instructions fetched from a memory (e.g., memory 1110 of
In various embodiments, fetch and decode circuit 210 is configured to identify candidate instructions for fusion and provide an indication of those candidate instructions to MDR circuit 220. Fetch and decode circuit 210 may scan across its decode lanes to search for particular combinations of instructions. Such combinations may include but are not limited to an instruction for performing an operation and an instruction for adjusting a sign of the result of the operation, an instruction for performing an operation and an instruction for finding a maximum or minimum of the result of the operation as compared to an additional operand, and an instruction for performing a vector operation and an instruction for writing a first element of the vector operation result to additional elements of the result. In some embodiments conditions may be applied to determine whether an instruction pair is eligible for fusion. The instructions of a combination might not be eligible for fusion, for example, if the instructions are not sequential or otherwise within a specified instruction distance (e.g., three instructions) of each other in program order, or if the instructions fall within different batches of instructions (“instruction groups”). In various embodiments, fetch and decode circuit 210 marks eligible combinations (e.g., by setting bits of the decoded instructions) and provides them to MDR circuit 220. In some embodiments, the fusion of eligible instructions occurs within fetch and decode circuit 210. Fusion detection circuitry 102 from
ICache 215 and DCache 217, in various embodiments, may each be a cache having any desired capacity, cache line size, and configuration. A cache line may be allocated/deallocated in a cache as a unit and thus may define the unit of allocation/deallocation for the cache. Cache lines may vary in size (e.g., 32 bytes, 64 bytes, or larger or smaller). Different caches may have different cache line sizes. There may further be additional levels of cache between ICache 215/DCache 217 and a main memory, such as a last level cache. In various embodiments, ICache 215 is used to cache fetched instructions and DCache 217 is used to cache data fetched or generated by processor 200.
MDR circuit 220, in various embodiments, is configured to map ops received from fetch and decode circuit 210 to speculative resources (e.g., physical registers) in order to permit out-of-order and/or speculative execution. As shown, MDR circuit 220 can dispatch the ops to RS 227 and RS 232. The ops may be mapped to physical registers in register file 245 from the architectural registers used in the corresponding instructions. That is, register file 245 may implement a set of physical registers that are greater in number than the architectural registers specified by the instruction set architecture implemented by processor 200. As such, MDR circuit 220 may manage a mapping between the architectural registers and the physical registers. In some embodiments, there may be separate physical registers for different operand types (e.g., integer, floating-point, etc.). The physical registers, however, may be shared between different operand types in some embodiments. MDR circuit 220, in various embodiments, tracks the speculative execution and retires ops (or flushes misspeculated ops). In various embodiments, reorder buffer 225 is used in tracking the program order of ops and managing retirement/flush.
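The mapping between architectural and physical registers can be sketched as follows. This is an illustrative model only, assuming a simple free list and omitting the reclamation of physical registers at retirement; the class name RenameTable and its methods are hypothetical. It shows that each op with a destination consumes one free physical register, so an unfused pair writing the same architectural register consumes two.

```python
class RenameTable:
    """Toy architectural-to-physical register map (assumption: simple free list)."""

    def __init__(self, num_arch: int, num_phys: int):
        # More physical registers than architectural registers, as described above.
        assert num_phys > num_arch
        self.map = {a: a for a in range(num_arch)}
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: list) -> tuple:
        """Map sources through the current table, then allocate a new destination."""
        phys_srcs = [self.map[s] for s in srcs]
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


table = RenameTable(num_arch=4, num_phys=8)
# Unfused pair: an operation writing r2, then a sign adjustment reading and writing r2.
first_dest, first_srcs = table.rename(dest=2, srcs=[0, 1])
second_dest, second_srcs = table.rename(dest=2, srcs=[2])
```

In this model the second op reads the physical register allocated by the first and then allocates another one. A fused instruction operation would call rename once, which corresponds to the physical register saving mentioned earlier in this disclosure.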
In various embodiments, MDR circuit 220 is configured to fuse eligible combination pairs that are marked by fetch and decode circuit 210 if certain criteria are met. While fusion of instructions (or corresponding instruction operations) occurs at MDR circuit 220 in various embodiments, in some embodiments fusion occurs at a different stage in the instruction pipeline, such as at the instruction buffer or the instruction cache. That is, the fusion decoder circuitry used to perform the fusion of instructions may reside at different stages of the instruction pipeline in different implementations.
LSU 234, in various embodiments, is configured to execute memory ops received from MDR circuit 220. Generally, a memory op is an instruction op specifying an access to memory (such as memory 1110 of
LSU 234 may implement multiple load pipelines (“pipes”). As an example, three load pipelines may be implemented, although more or fewer pipelines can be implemented in other cases. Each pipeline may execute a different load, independent and in parallel with other loads in other pipelines. Consequently, reservation station 232 may issue any number of loads up to the number of load pipes in the same clock cycle. Similarly, LSU 234 may further implement one or more store pipes. In some embodiments, the number of store pipes is not equal to the number of load pipes. For example, two store pipes may be used together with three load pipes. Likewise, reservation station 232 may issue any number of stores up to the number of store pipes in the same clock cycle.
Load/store ops, in various embodiments, are received at reservation station 232, which may be configured to monitor the source operands of the load/store ops to determine when they are available and then issue the ops to the load or store pipelines, respectively. Some source operands may be available when the instruction operations are received at reservation station 232, which may be indicated in the data received by reservation station 232 from MDR circuit 220 for the corresponding instruction operation. Other operands may become available via execution of instruction operations by execution circuits 240 or even via execution of earlier load ops. The operands may be gathered by reservation station 232 or may be read from register file 245 upon issue from reservation station 232 as shown in
Execution circuitry 104 of
Processor 300 of
First instruction 316 implements an operation OP, which may include an addition operation, a subtraction operation, a maximum or minimum operation, or some other arithmetic, logical or bitwise operation. In an embodiment, first instruction 316 implements an operation that may be carried out by an arithmetic logic unit (ALU) of a processor. In an embodiment, first instruction 316 is a floating-point instruction for implementing a floating-point operation. Second instruction 318 is an instruction for adjusting the sign of an operand, as indicated by a “+/−” symbol in
Pair detector circuit 304 within fetch and decode circuit 302 is configured to identify pairs of fetched instructions eligible for fusion into an instruction operation for adjusting a sign of an operation result. In determining whether first instruction 316 and second instruction 318 are eligible for fusion, one criterion that may be used by pair detector circuit 304 is that the source and destination registers of second instruction 318 are the same as the destination register of first instruction 316. Pair detector circuit 304 may also look for one or more specific operations as the operation OP performed by first instruction 316, where the specific operations are designated for potential fused execution with a sign adjustment implemented by second instruction 318. Other criteria may also be used in identifying eligible instructions for fused execution, such as whether the instructions are consecutive or both within a group of instructions such as a dispatch group.
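The register-matching criterion described above can be expressed as a short predicate. This is a minimal sketch, not the disclosed circuit: the Instr record, the opcode mnemonics, and the two opcode sets are hypothetical placeholders, and the additional criteria mentioned above (adjacency, dispatch group membership) are omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    opcode: str
    dest: str
    srcs: Tuple[str, ...]


# Hypothetical mnemonics standing in for the operations designated for fusion.
FUSIBLE_FIRST_OPS = {"fadd", "fsub", "fmax", "fmov"}
SIGN_ADJUST_OPS = {"fneg", "fabs"}


def eligible_for_sign_fusion(first: Instr, second: Instr) -> bool:
    """True if second adjusts the sign of first's result in place."""
    return (
        first.opcode in FUSIBLE_FIRST_OPS
        and second.opcode in SIGN_ADJUST_OPS
        and second.srcs == (first.dest,)  # source is the first instruction's destination
        and second.dest == first.dest     # destination is the same register
    )
```

For example, a pair such as Instr("fadd", "d0", ("d1", "d2")) followed by Instr("fneg", "d0", ("d0",)) satisfies the predicate, while the same pair with the negation writing a different register does not.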
In an embodiment, when instructions 316 and 318 are identified by fetch and decode circuit 302 as eligible for fusion, they are marked so that MDR circuit 306 can recognize the corresponding instruction operations 320 and 324 as fusion candidates. In the embodiment of
In the embodiment of
For one or more eligible instruction pairs, MDR circuit 306 may fuse, using fusion circuit 308, the corresponding first and second instruction operations into a single fused instruction operation such as fused instruction operation 328. In an embodiment, determination by MDR circuit 306 of whether to fuse an eligible instruction pair includes checking an availability of execution circuitry configured to execute a fused instruction operation. In the embodiment of
Operation execution circuit 312 within execution circuit 310 is configured to perform operation OP during execution of fused instruction operation 328. Sign adjust circuit 314 is configured to adjust a sign of the result of operation execution circuit 312. For example, sign adjust circuit 314 is configured to change the sign of the result of operation execution circuit 312 if second instruction 318 is a negation instruction. If second instruction 318 is an absolute value instruction, sign adjust circuit 314 is configured to ensure that the result of operation execution circuit 312 is positive. Sign adjust indicator 326 may serve to identify to execution circuit 310 what type of sign adjustment (such as negation or absolute value) is needed. In an embodiment, logic within execution circuit 310 is similar to logic in other execution circuitry of the processor (not shown in
In an embodiment, execution of fused instruction operation 328 using execution circuit 310 results in a single write, to the destination register, of the sign-adjusted operation result. Such execution may avoid an additional read or write of a portion of the operation result reflecting the sign of the result, when executing second instruction 318. In some embodiments, the sign adjustment portion of execution of fused instruction operation 328 is performed using a single cycle, while sign adjustment via second instruction 318 takes at least two cycles. Fusion of the instruction pair therefore may reduce latency as well as providing the other resource savings associated with having one instruction operation rather than two dispatched from MDR circuit 306.
An operation result 412 is converted by sign adjust circuit 314 to a sign-adjusted result 414. Although the operation result 412 is shown conceptually as being passed to sign adjust circuit 314, in some embodiments only a portion of the operation result, such as one or more bits indicating the sign of the result, is affected by sign adjust circuit 314. Sign adjust circuit 314 includes logic for adjusting the sign of operation result 412 in the manner specified by the instruction operation executed using execution circuit 310. In an embodiment for which an absolute value of operation result 412 is taken, sign adjust circuit 314 includes logic to ensure that sign-adjusted result 414 is a positive value. In an embodiment for which operation result 412 is negated, sign adjust circuit 314 includes logic to change the sign of sign-adjusted result 414 as compared to operation result 412. A particular sign adjust circuit 314 may be selected using an indicator associated with the incoming fused instruction operation, such as sign adjust indicator 326. In various embodiments, such selection of an appropriate sign adjust circuit 314 may be performed within execution circuit 310 or within a different area of the processor such as MDR circuit 306 or a reservation station. Sign-adjusted result 414 is written to a destination register corresponding to the architectural destination register of the original pair of instructions that was fused to form the fused instruction operation executed by execution circuit 310.
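The effect of fused execution on destination register writes can be modeled behaviorally. The sketch below is an assumption-laden illustration: RegisterFile, run_unfused, run_fused, and the indicator strings "neg" and "abs" are hypothetical names, and the model counts writes only, not cycles. It shows the unfused pair writing the destination twice and the fused instruction operation writing the sign-adjusted result once.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "max": max}
SIGN_ADJUST = {"neg": operator.neg, "abs": abs}  # selected by the sign adjust indicator


class RegisterFile:
    """Toy register file that counts writes."""

    def __init__(self):
        self.values = {}
        self.writes = 0

    def write(self, reg, value):
        self.values[reg] = value
        self.writes += 1

    def read(self, reg):
        return self.values[reg]


def run_unfused(rf, op, dest, a, b, indicator):
    """Two instruction operations: write the result, then read and rewrite it."""
    rf.write(dest, OPS[op](a, b))
    rf.write(dest, SIGN_ADJUST[indicator](rf.read(dest)))


def run_fused(rf, op, dest, a, b, indicator):
    """One fused instruction operation: a single write of the sign-adjusted result."""
    rf.write(dest, SIGN_ADJUST[indicator](OPS[op](a, b)))
```

Running both functions with the same inputs leaves the same value in the destination register, with the fused version recording one write instead of two.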
Processor 500 of
First instruction 516 implements an operation OP in a manner similar to first instruction 316 of
Pair detector circuit 504 within fetch and decode circuit 502 is configured to identify pairs of fetched instructions eligible for fusion into an instruction operation for finding a maximum or minimum of an operation result as compared to another operand. In determining whether first instruction 516 and second instruction 518 are eligible for fusion, one criterion that may be used by pair detector circuit 504 is that the source and destination registers of second instruction 518 are the same as the destination register of first instruction 516. In an embodiment, first instruction 516 has fewer operands than the maximum number of operands for an instruction supported by the ISA. For example, first instruction 516 may have no more than two operands, so that including an additional operand in the fused instruction operation for use in the maximum or minimum operation results in no more than three operands. Pair detector circuit 504 may also look for one or more specific operations as the operation OP performed by first instruction 516, where the specific operations are designated for potential fused execution with a maximum or minimum operation implemented by second instruction 518. Other criteria may also be used in identifying eligible instructions for fused execution, such as whether the instructions are consecutive or both within a group of instructions such as a dispatch group.
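The operand-count condition can be checked with simple arithmetic. The following sketch assumes an ISA limit of three source operands, matching the example above; MAX_OPERANDS and fits_operand_budget are hypothetical names. The fused instruction operation carries the first instruction's sources plus any source of the second instruction that is not the first instruction's result.

```python
MAX_OPERANDS = 3  # assumption: the ISA supports at most three operands per instruction


def fits_operand_budget(first_srcs, second_srcs, first_dest):
    """True if the fused op needs no more source operands than the ISA allows."""
    # Sources of the second instruction other than the first instruction's result,
    # i.e. the comparison operand for a maximum or minimum.
    extra = [s for s in second_srcs if s != first_dest]
    return len(first_srcs) + len(extra) <= MAX_OPERANDS
```

Under this assumption, a two-operand first instruction followed by a maximum against one comparison operand fits, while a three-operand first instruction followed by the same maximum does not.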
In an embodiment, when instructions 516 and 518 are identified by fetch and decode circuit 502 as eligible for fusion, they are marked so that MDR circuit 506 can recognize the corresponding instruction operations 520 and 524 as fusion candidates. In the embodiment of
In the embodiment of
For one or more eligible instruction pairs, MDR circuit 506 may fuse, using fusion circuit 508, the corresponding first and second instruction operations into a single fused instruction operation such as fused instruction operation 528. In an embodiment, determination by MDR circuit 506 of whether to fuse an eligible instruction pair includes checking an availability of execution circuitry configured to execute a fused instruction operation. In the embodiment of
Operation execution circuit 512 within execution circuit 510 is configured to perform operation OP during execution of fused instruction operation 528. Max/min circuit 514 is configured to select a maximum or minimum of the result of operation execution circuit 512 as compared to comparison operand 530. For example, max/min circuit 514 is configured to write the larger of comparison operand 530 and the result of operation execution circuit 512 to a destination register if second instruction 518 is a maximum instruction. If second instruction 518 is a minimum instruction, max/min circuit 514 is configured to write the smaller of comparison operand 530 and the result of operation execution circuit 512 to the destination register. Max/min indicator 526 may serve to identify to execution circuit 510 what type of maximum or minimum operation is needed. In an embodiment, logic within execution circuit 510 is similar to logic in other execution circuitry of the processor (not shown in
In an embodiment, execution of fused instruction operation 528 using execution circuit 510 results in a single write to the destination register of the maximum or minimum of the operation result as compared to the comparison operand. Such execution avoids an additional writing of the operation result when executing first instruction 516 and reading of the result again when executing second instruction 518. In some embodiments, the maximum/minimum portion of execution of fused instruction operation 528 is performed using a single cycle, while the same operation via second instruction 518 takes at least two cycles. Fusion of the instruction pair therefore may reduce latency as well as providing the other resource savings associated with having one instruction operation rather than two dispatched from MDR circuit 506.
The operation result 612 and comparison operand 530 are compared by max/min circuit 514 to produce maximum or minimum result 614. Max/min circuit 514 includes logic for determining a maximum or minimum, as specified by the instruction operation executed using execution circuit 510, between operation result 612 and comparison operand 530. In an embodiment, max/min circuit 514 includes one or more comparators and one or more multiplexors. A particular max/min circuit 514 may be selected using an indicator associated with the incoming fused instruction operation, such as max/min indicator 526. In various embodiments, such selection of an appropriate max/min circuit 514 may be performed within execution circuit 510 or within a different area of the processor such as MDR circuit 506 or a reservation station. Maximum or minimum result 614 is written to a destination register corresponding to the architectural destination register of the original pair of instructions that was fused to form the fused instruction operation executed by execution circuit 510.
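The comparator-and-multiplexor arrangement can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not the disclosed circuit: the function names and the "max"/"min" indicator strings are hypothetical, and special cases such as floating-point NaN operands are not handled.

```python
import operator


def max_min_select(op_result, comparison_operand, want_max: bool):
    """One comparator feeding a two-input multiplexor."""
    result_is_greater = op_result > comparison_operand  # comparator output
    # Multiplexor select: keep the operation result when it wins the comparison.
    pick_result = result_is_greater if want_max else not result_is_greater
    return op_result if pick_result else comparison_operand


def fused_op_max_min(op, a, b, comparison_operand, indicator: str):
    """Perform the operation, then select against the comparison operand."""
    return max_min_select(op(a, b), comparison_operand, indicator == "max")
```

When the first operation is itself a maximum or minimum, the same model yields a three-operand comparison. For instance, fused_op_max_min(max, 4, 9, 7, "min") computes the minimum of 7 and the maximum of 4 and 9.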
In the embodiment of
The method further includes, at block 820, detecting a second instruction that is executable to adjust a sign of a result of the arithmetic/logic operation. In one embodiment, the second instruction is a negation instruction that is executable to change the sign of the result of the arithmetic/logic operation. In another embodiment, the second instruction is an absolute value instruction that is executable to take an absolute value of the result of the arithmetic/logic operation. An example of a second instruction that may be detected at block 820 is second instruction 318 of
Method 800 further includes, at block 830, fusing the first and second instructions into a fused instruction operation that is executable to perform the arithmetic/logic operation and adjust the sign of the result of the operation. In an embodiment, fusing the first and second instructions is performed at an MDR circuit of a processor, such as MDR circuit 306 of
In some embodiments, method 800 may further include decoding of the first and second instructions into corresponding first and second instruction operations such as first instruction operation 320 and second instruction operation 324 of
The method further includes, at block 920, detecting a second instruction that is executable to find a maximum or minimum, as compared to a comparison operand, of a result of the arithmetic/logic operation. In various embodiments, the second instruction may be executable to find either a maximum or a minimum. An example of a second instruction that may be detected at block 920 is second instruction 518 of
Method 900 further includes, at block 930, fusing the first and second instructions into a fused instruction operation that is executable to find a maximum or minimum of the result of the arithmetic/logic operation as compared to the comparison operand. In an embodiment, fusing the first and second instructions is performed at an MDR circuit of a processor, such as MDR circuit 506 of
In some embodiments, method 900 may further include decoding of the first and second instructions into corresponding first and second instruction operations such as first instruction operation 520 and second instruction operation 524 of
Method 1000 is one embodiment of a method performed by a processor, such as processor 200 of
Method 1000 further includes, at block 1030, fusing the first and second instructions into a fused instruction operation that is executable to perform the vector operation and write the modified vector result to the vector destination register. In an embodiment, performing the vector instruction is limited to computing the first element of the vector result and writing the modified vector result involves writing the first element into each partition of the vector destination register corresponding to an element of a vector result. In an embodiment, fusing the first and second instructions is performed at an MDR circuit of a processor, such as MDR circuit 220 of
Execution of the fused instruction operation at block 1040 may result in a single write to the vector destination register of the modified vector result. Such execution may avoid writing of the initial vector result via execution of the first instruction and reading of the first element of the vector result via execution of the second instruction. Fusion of the instruction pair therefore may reduce latency as well as providing the other resource savings associated with having one instruction operation rather than two dispatched from an MDR circuit.
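The equivalence between the two-instruction sequence and the fused instruction operation can be sketched with Python lists standing in for vector registers. The function names are hypothetical, and the sketch assumes the embodiment in which the first element is written to every element position of the destination.

```python
import operator


def run_unfused_vector(a, b, op):
    """First instruction writes the full vector result; second overwrites it."""
    vec = [op(x, y) for x, y in zip(a, b)]  # full vector result written
    first = vec[0]                          # first element read back
    return [first] * len(vec)               # remaining elements overwritten


def run_fused_vector(a, b, op):
    """Fused op: compute only the first element and write it to every position."""
    first = op(a[0], b[0])
    return [first] * len(a)
```

Both functions return the same modified vector result, but the fused version performs one element-wise computation and one write, which reflects the savings described above.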
In some embodiments, method 1000 further includes decoding of the first and second instructions into corresponding first and second instruction operations and associating one or more of the first and second instruction operations with an indicator of eligibility for fused execution. Such an indicator of eligibility may in some embodiments signal to a fusion circuit such as fusion circuit 204 of
Turning now to
Memory 1110, in various embodiments, is usable to store data and program instructions that are executable by CPU complex 1120 to cause a system having SOC 1100 and memory 1110 to implement operations described herein. Memory 1110 may be implemented using different physical memory media, such as hard disk storage, floppy disk storage, removable disk storage, flash memory, random access memory (RAM, such as SRAM, EDO RAM, SDRAM, DDR SDRAM, RAMBUS RAM, etc.), read only memory (PROM, EEPROM, etc.), etc. Memory available to SOC 1100 is not limited to primary storage such as memory 1110. Rather, SOC 1100 may further include other forms of storage such as cache memory (e.g., L1 cache, L2 cache, etc.) in CPU complex 1120.
CPU complex 1120, in various embodiments, includes a set of processors 1125 that serve as a CPU of the SOC 1100. Processors 1125 may execute the main control software of the system, such as an operating system. Generally, software executed by the CPU during use controls the other components of the system to realize the desired functionality of the system. Processors 1125 may further execute other software, such as application programs. An application program may provide user functionality and rely on the operating system for lower-level device control, scheduling, memory management, etc. Consequently, processors 1125 may also be referred to as application processors. CPU complex 1120 may include other hardware such as an L2 cache and/or an interface to the other components of the system (e.g., an interface to communication fabric 1150).
A processor 1125, in various embodiments, includes any circuitry and/or microcode that is configured to execute instructions defined in an instruction set architecture implemented by that processor 1125. Processors 1125 may fetch instructions and data from memory 1110 as a part of executing load instructions and store the fetched instructions and data within caches of CPU complex 1120. In various embodiments, processors 1125 share a common last level cache (e.g., an L2 cache) while including their own caches (e.g., an L0 cache, an L1 cache, etc.) for storing instructions and data. Processors 1125 may retrieve instructions and data (e.g., from the caches) and execute the instructions (e.g., conditional branch instructions, ALU instructions, etc.) to perform operations that involve the retrieved data. Processors 1125 may then write a result of those operations back to memory 1110. Processors 1125 may encompass discrete microprocessors, processors and/or microprocessors integrated into multichip module implementations, processors implemented as multiple integrated circuits, etc.
Memory controller 1130, in various embodiments, includes circuitry that is configured to receive, from the other components of SOC 1100, memory requests (e.g., load/store requests) to perform memory operations, such as accessing data from memory 1110. Memory controller 1130 may be configured to access any type of memory 1110, such as those discussed earlier. In various embodiments, memory controller 1130 includes queues for storing memory operations, for ordering and potentially reordering the operations and presenting the operations to memory 1110. Memory controller 1130 may further include data buffers to store write data awaiting write to memory 1110 and read data awaiting return to the source of a memory operation. In some embodiments, memory controller 1130 may include a memory cache to store recently accessed memory data. In SOC implementations, for example, the memory cache may reduce the power consumption in SOC 1100 by avoiding re-access of data from memory 1110 if it is expected to be accessed again soon. In some cases, the memory cache may also be referred to as a system cache, as opposed to private caches (e.g., L1 caches) in processors 1125 that serve only certain components. But, in some embodiments, a system cache need not be located within memory controller 1130.
Peripherals 1140, in various embodiments, are sets of additional hardware functionality included in SOC 1100. For example, peripherals 1140 may include video peripherals such as an image signal processor configured to process image capture data from a camera or other image sensor, GPUs, video encoder/decoders, scalers, rotators, blenders, display controllers, etc. As other examples, peripherals 1140 may include audio peripherals such as microphones, speakers, interfaces to microphones and speakers, audio processors, digital signal processors, mixers, etc. Peripherals 1140 may include interface controllers for various interfaces external to SOC 1100, such as Universal Serial Bus (USB), peripheral component interconnect (PCI) including PCI Express (PCIe), serial and parallel ports, etc. The interconnection to external devices is illustrated by the dashed arrow in
Communication fabric 1150 may be any communication interconnect and protocol for communicating among the components of SOC 1100. For example, communication fabric 1150 may enable processors 1125 to issue and receive requests from peripherals 1140 to access, store, and manipulate data. In some embodiments, communication fabric 1150 is bus-based, including shared bus configurations, cross bar configurations, and hierarchical buses with bridges. In some embodiments, communication fabric 1150 is packet-based, and may be hierarchical with bridges, cross bar, point-to-point, or other interconnects.
Turning now to
Integrated circuit 1230 may additionally or alternatively include other circuits such as a wireless network circuit. In the illustrated embodiment, semiconductor fabrication system 1220 is configured to process design information 1215 to fabricate integrated circuit 1230.
Non-transitory computer-readable medium 1210 may include any of various appropriate types of memory devices or storage devices. For example, non-transitory computer-readable medium 1210 may include at least one of an installation medium (e.g., a CD-ROM, floppy disks, or tape device), a computer system memory or random-access memory (e.g., DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.), a non-volatile memory such as a Flash, magnetic media (e.g., a hard drive, or optical storage), registers, or other types of non-transitory memory. Non-transitory computer-readable medium 1210 may include two or more memory mediums, which may reside in different locations (e.g., in different computer systems that are connected over a network).
Design information 1215 may be specified using any of various appropriate computer languages, including hardware description languages such as, without limitation: VHDL, Verilog, SystemC, SystemVerilog, RHDL, M, MyHDL, etc. Design information 1215 may be usable by semiconductor fabrication system 1220 to fabricate at least a portion of integrated circuit 1230. The format of design information 1215 may be recognized by at least one semiconductor fabrication system 1220. In some embodiments, design information 1215 may also include one or more cell libraries, which specify the synthesis and/or layout of integrated circuit 1230. In some embodiments, the design information is specified in whole or in part in the form of a netlist that specifies cell library elements and their connectivity. Design information 1215, taken alone, may or may not include sufficient information for fabrication of a corresponding integrated circuit (e.g., integrated circuit 1230). For example, design information 1215 may specify circuit elements to be fabricated but not their physical layout. In this case, design information 1215 may be combined with layout information to fabricate the specified integrated circuit.
Semiconductor fabrication system 1220 may include any of various appropriate elements configured to fabricate integrated circuits. This may include, for example, elements for depositing semiconductor materials (e.g., on a wafer, which may include masking), removing materials, altering the shape of deposited materials, modifying materials (e.g., by doping materials or modifying dielectric constants using ultraviolet processing), etc. Semiconductor fabrication system 1220 may also be configured to perform various testing of fabricated circuits for correct operation.
In various embodiments, integrated circuit 1230 is configured to operate according to a circuit design specified by design information 1215, which may include performing any of the functionality described herein. For example, integrated circuit 1230 may include any of various elements described with reference to
As used herein, a phrase of the form “design information that specifies a design of a circuit configured to . . . ” does not imply that the circuit in question must be fabricated in order for the element to be met. Rather, this phrase indicates that the design information describes a circuit that, upon being fabricated, will be configured to perform the indicated actions or will include the specified components.
In some embodiments, a method of initiating fabrication of integrated circuit 1230 is performed. Design information 1215 may be generated using one or more computer systems and stored in non-transitory computer-readable medium 1210. The method may conclude when design information 1215 is sent to semiconductor fabrication system 1220 or prior to design information 1215 being sent to semiconductor fabrication system 1220. Accordingly, in some embodiments, the method may not include actions performed by semiconductor fabrication system 1220. Design information 1215 may be sent to semiconductor fabrication system 1220 in a variety of ways. For example, design information 1215 may be transmitted (e.g., via a transmission medium such as the Internet) from non-transitory computer-readable medium 1210 to semiconductor fabrication system 1220 (e.g., directly or indirectly). As another example, non-transitory computer-readable medium 1210 may be sent to semiconductor fabrication system 1220. In response to the method of initiating fabrication, semiconductor fabrication system 1220 may fabricate integrated circuit 1230 as discussed above.
Turning next to
As illustrated, system 1300 is shown to have application in a wide range of areas. For example, system 1300 may be utilized as part of the chips, circuitry, components, etc., of a desktop computer 1310, laptop computer 1320, tablet computer 1330, cellular or mobile phone 1340, or television 1350 (or set-top box coupled to a television). Also illustrated is a wearable device 1360, such as a smartwatch and/or health monitoring device. In some embodiments, a smartwatch may include a variety of general-purpose computing related functions. For example, a smartwatch may provide access to email, cellphone service, a user calendar, and so on. In various embodiments, a health monitoring device may be a dedicated medical device or otherwise include dedicated health related functionality. For example, a health monitoring device may monitor a user's vital signs, track proximity of a user to other users for the purpose of epidemiological social distancing, contact tracing, provide communication to an emergency service in the event of a health crisis, and so on. In various embodiments, the above-mentioned smartwatch may or may not include some or any health monitoring related functions. Other wearable devices are contemplated as well, such as devices worn around the neck, devices that are implantable in the human body, glasses designed to provide an augmented and/or virtual reality experience, and so on.
System 1300 may further be used as part of a cloud-based service(s) 1370. For example, the previously mentioned devices, and/or other devices, may access computing resources in the cloud (e.g., remotely located hardware and/or software resources). Still further, system 1300 may be utilized in one or more devices of a home 1380 other than those previously mentioned. For example, appliances within home 1380 may monitor and detect conditions that warrant attention. For example, various devices within home 1380 (e.g., a refrigerator, a cooling system, etc.) may monitor the status of the device and provide an alert to the homeowner (or, for example, a repair facility) should a particular event be detected. Alternatively, a thermostat may monitor the temperature in home 1380 and may automate adjustments to a heating/cooling system based on a history of responses to various conditions by the homeowner. Also illustrated in
The present disclosure includes references to “embodiments,” which are non-limiting implementations of the disclosed concepts. References to “an embodiment,” “one embodiment,” “a particular embodiment,” “some embodiments,” “various embodiments,” and the like do not necessarily refer to the same embodiment. A large number of possible embodiments are contemplated, including specific embodiments described in detail, as well as modifications or alternatives that fall within the spirit or scope of the disclosure. Not all embodiments will necessarily manifest any or all of the potential advantages described herein.
This disclosure may discuss potential advantages that may arise from the disclosed embodiments. Not all implementations of these embodiments will necessarily manifest any or all of the potential advantages. Whether an advantage is realized for a particular implementation depends on many factors, some of which are outside the scope of this disclosure. In fact, there are a number of reasons why an implementation that falls within the scope of the claims might not exhibit some or all of any disclosed advantages. For example, a particular implementation might include other circuitry outside the scope of the disclosure that, in conjunction with one of the disclosed embodiments, negates or diminishes one or more of the disclosed advantages. Furthermore, suboptimal design execution of a particular implementation (e.g., implementation techniques or tools) could also negate or diminish disclosed advantages. Even assuming a skilled implementation, realization of advantages may still depend upon other factors such as the environmental circumstances in which the implementation is deployed. For example, inputs supplied to a particular implementation may prevent one or more problems addressed in this disclosure from arising on a particular occasion, with the result that the benefit of its solution may not be realized. Given the existence of possible factors external to this disclosure, it is expressly intended that any potential advantages described herein are not to be construed as claim limitations that must be met to demonstrate infringement. Rather, identification of such potential advantages is intended to illustrate the type(s) of improvement available to designers having the benefit of this disclosure.
That such advantages are described permissively (e.g., stating that a particular advantage “may arise”) is not intended to convey doubt about whether such advantages can in fact be realized, but rather to recognize the technical reality that realization of such advantages often depends on additional factors.
Unless stated otherwise, embodiments are non-limiting. That is, the disclosed embodiments are not intended to limit the scope of claims that are drafted based on this disclosure, even where only a single example is described with respect to a particular feature. The disclosed embodiments are intended to be illustrative rather than restrictive, absent any statements in the disclosure to the contrary. The application is thus intended to permit claims covering disclosed embodiments, as well as such alternatives, modifications, and equivalents that would be apparent to a person skilled in the art having the benefit of this disclosure.
For example, features in this application may be combined in any suitable manner. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of other dependent claims where appropriate, including claims that depend from other independent claims. Similarly, features from respective independent claims may be combined where appropriate.
Accordingly, while the appended dependent claims may be drafted such that each depends on a single other claim, additional dependencies are also contemplated. Any combinations of features in the dependent claims that are consistent with this disclosure are contemplated and may be claimed in this or another application. In short, combinations are not limited to those specifically enumerated in the appended claims.
Where appropriate, it is also contemplated that claims drafted in one format or statutory type (e.g., apparatus) are intended to support corresponding claims of another format or statutory type (e.g., method).
Because this disclosure is a legal document, various terms and phrases may be subject to administrative and judicial interpretation. Public notice is hereby given that the following paragraphs, as well as definitions provided throughout the disclosure, are to be used in determining how to interpret claims that are drafted based on this disclosure.
References to a singular form of an item (i.e., a noun or noun phrase preceded by “a,” “an,” or “the”) are, unless context clearly dictates otherwise, intended to mean “one or more.” Reference to “an item” in a claim thus does not, without accompanying context, preclude additional instances of the item. A “plurality” of items refers to a set of two or more of the items.
The word “may” is used herein in a permissive sense (i.e., having the potential to, being able to) and not in a mandatory sense (i.e., must).
The terms “comprising” and “including,” and forms thereof, are open-ended and mean “including, but not limited to.”
When the term “or” is used in this disclosure with respect to a list of options, it will generally be understood to be used in the inclusive sense unless the context provides otherwise. Thus, a recitation of “x or y” is equivalent to “x or y, or both,” and thus covers 1) x but not y, 2) y but not x, and 3) both x and y. On the other hand, a phrase such as “either x or y, but not both” makes clear that “or” is being used in the exclusive sense.
A recitation of “w, x, y, or z, or any combination thereof” or “at least one of . . . w, x, y, and z” is intended to cover all possibilities involving a single element up to the total number of elements in the set. For example, given the set [w, x, y, z], these phrasings cover any single element of the set (e.g., w but not x, y, or z), any two elements (e.g., w and x, but not y or z), any three elements (e.g., w, x, and y, but not z), and all four elements. The phrase “at least one of . . . w, x, y, and z” thus refers to at least one element of the set [w, x, y, z], thereby covering all possible combinations in this list of elements. This phrase is not to be interpreted to require that there is at least one instance of w, at least one instance of x, at least one instance of y, and at least one instance of z.
Various “labels” may precede nouns or noun phrases in this disclosure. Unless context provides otherwise, different labels used for a feature (e.g., “first circuit,” “second circuit,” “particular circuit,” “given circuit,” etc.) refer to different instances of the feature. Additionally, the labels “first,” “second,” and “third” when applied to a feature do not imply any type of ordering (e.g., spatial, temporal, logical, etc.), unless stated otherwise.
The phrase “based on” is used to describe one or more factors that affect a determination. This term does not foreclose the possibility that additional factors may affect the determination. That is, a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors. Consider the phrase “determine A based on B.” This phrase specifies that B is a factor that is used to determine A or that affects the determination of A. This phrase does not foreclose that the determination of A may also be based on some other factor, such as C. This phrase is also intended to cover an embodiment in which A is determined based solely on B. As used herein, the phrase “based on” is synonymous with the phrase “based at least in part on.”
The phrases “in response to” and “responsive to” describe one or more factors that trigger an effect. This phrase does not foreclose the possibility that additional factors may affect or otherwise trigger the effect, either jointly with the specified factors or independent from the specified factors. That is, an effect may be solely in response to those factors or may be in response to the specified factors as well as other, unspecified factors. Consider the phrase “perform A in response to B.” This phrase specifies that B is a factor that triggers the performance of A, or that triggers a particular result for A. This phrase does not foreclose that performing A may also be in response to some other factor, such as C. This phrase also does not foreclose that performing A may be jointly in response to B and C. This phrase is also intended to cover an embodiment in which A is performed solely in response to B. As used herein, the phrase “responsive to” is synonymous with the phrase “responsive at least in part to.” Similarly, the phrase “in response to” is synonymous with the phrase “at least in part in response to.”
Within this disclosure, different entities (which may variously be referred to as “units,” “circuits,” other components, etc.) may be described or claimed as “configured” to perform one or more tasks or operations. This formulation—[entity] configured to [perform one or more tasks]—is used herein to refer to structure (i.e., something physical). More specifically, this formulation is used to indicate that this structure is arranged to perform the one or more tasks during operation. A structure can be said to be “configured to” perform some task even if the structure is not currently being operated. Thus, an entity described or recited as being “configured to” perform some task refers to something physical, such as a device, circuit, a system having a processor unit and a memory storing program instructions executable to implement the task, etc. This phrase is not used herein to refer to something intangible.
In some cases, various units/circuits/components may be described herein as performing a set of tasks or operations. It is understood that those entities are “configured to” perform those tasks/operations, even if not specifically noted.
The term “configured to” is not intended to mean “configurable to.” An unprogrammed FPGA, for example, would not be considered to be “configured to” perform a particular function. This unprogrammed FPGA may be “configurable to” perform that function, however. After appropriate programming, the FPGA may then be said to be “configured to” perform the particular function.
For purposes of United States patent applications based on this disclosure, reciting in a claim that a structure is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that claim element. Should Applicant wish to invoke Section 112(f) during prosecution of a United States patent application based on this disclosure, it will recite claim elements using the “means for” [performing a function] construct.
Different “circuits” may be described in this disclosure. These circuits or “circuitry” constitute hardware that includes various types of circuit elements, such as combinatorial logic, clocked storage devices (e.g., flip-flops, registers, latches, etc.), finite state machines, memory (e.g., random-access memory, embedded dynamic random-access memory), programmable logic arrays, and so on. Circuitry may be custom designed or taken from standard libraries. In various implementations, circuitry can, as appropriate, include digital components, analog components, or a combination of both. Certain types of circuits may be commonly referred to as “units” (e.g., a decode unit, an arithmetic logic unit (ALU), functional unit, memory management unit (MMU), etc.). Such units also refer to circuits or circuitry.
The disclosed circuits/units/components and other elements illustrated in the drawings and described herein thus include hardware elements such as those described in the preceding paragraph. In many instances, the internal arrangement of hardware elements within a particular circuit may be specified by describing the function of that circuit. For example, a particular “decode unit” may be described as performing the function of “processing an opcode of an instruction and routing that instruction to one or more of a plurality of functional units,” which means that the decode unit is “configured to” perform this function. This specification of function is sufficient, to those skilled in the computer arts, to connote a set of possible structures for the circuit.
In various embodiments, as discussed in the preceding paragraph, circuits, units, and other elements may be defined by the functions or operations that they are configured to implement. The arrangement of such circuits/units/components with respect to each other and the manner in which they interact form a microarchitectural definition of the hardware that is ultimately manufactured in an integrated circuit or programmed into an FPGA to form a physical implementation of the microarchitectural definition. Thus, the microarchitectural definition is recognized by those of skill in the art as structure from which many physical implementations may be derived, all of which fall into the broader structure described by the microarchitectural definition. That is, a skilled artisan presented with the microarchitectural definition supplied in accordance with this disclosure may, without undue experimentation and with the application of ordinary skill, implement the structure by coding the description of the circuits/units/components in a hardware description language (HDL) such as Verilog or VHDL. The HDL description is often expressed in a fashion that may appear to be functional. But to those of skill in the art in this field, this HDL description is the manner that is used to transform the structure of a circuit, unit, or component to the next level of implementational detail.
Such an HDL description may take the form of behavioral code (which is typically not synthesizable), register transfer language (RTL) code (which, in contrast to behavioral code, is typically synthesizable), or structural code (e.g., a netlist specifying logic gates and their connectivity). The HDL description may subsequently be synthesized against a library of cells designed for a given integrated circuit fabrication technology, and may be modified for timing, power, and other reasons to result in a final design database that is transmitted to a foundry to generate masks and ultimately produce the integrated circuit. Some hardware circuits or portions thereof may also be custom-designed in a schematic editor and captured into the integrated circuit design along with synthesized circuitry. The integrated circuits may include transistors and other circuit elements (e.g., passive elements such as capacitors, resistors, inductors, etc.) and interconnect between the transistors and circuit elements. Some embodiments may implement multiple integrated circuits coupled together to implement the hardware circuits, and/or discrete elements may be used in some embodiments. Alternatively, the HDL design may be synthesized to a programmable logic array such as a field programmable gate array (FPGA) and may be implemented in the FPGA. This decoupling between the design of a group of circuits and the subsequent low-level implementation of these circuits commonly results in the scenario in which the circuit or logic designer never specifies a particular set of structures for the low-level implementation beyond a description of what the circuit is configured to do, as this process is performed at a different stage of the circuit implementation process.
The fact that many different low-level combinations of circuit elements may be used to implement the same specification of a circuit results in a large number of equivalent structures for that circuit. As noted, these low-level circuit implementations may vary according to changes in the fabrication technology, the foundry selected to manufacture the integrated circuit, the library of cells provided for a particular project, etc. In many cases, the choices made by different design tools or methodologies to produce these different implementations may be arbitrary.
Moreover, it is common for a single implementation of a particular functional specification of a circuit to include, for a given embodiment, a large number of devices (e.g., millions of transistors). Accordingly, the sheer volume of this information makes it impractical to provide a full recitation of the low-level structure used to implement a single embodiment, let alone the vast array of equivalent possible implementations. For this reason, the present disclosure describes structure of circuits using the functional shorthand commonly employed in the industry.
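As an illustration of such functional shorthand, the following is a minimal behavioral sketch of fusing an arithmetic/logic instruction with a sign adjust instruction. It is written in Python rather than an HDL, and it is not the disclosed implementation. The opcode mnemonics, the `InstrOp` fields, the function names `mark_pairs`, `mdr_fuse`, and `execute`, and in particular the eligibility rule (the sign adjust reads only the first result and overwrites the same register) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed mnemonics; the disclosure does not name specific opcodes.
ALU_OPS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
}
SIGN_OPS = {
    "neg": lambda x: -x,
    "abs": lambda x: abs(x),
}


@dataclass
class InstrOp:
    opcode: str
    dest: str
    srcs: Tuple[str, ...]
    fusion_eligible: bool = False    # indicator of eligibility for fusion
    sign_op: Optional[str] = None    # set only on a fused instruction operation


def mark_pairs(ops: List[InstrOp]) -> List[InstrOp]:
    """Pair detector model: tag the first op of each eligible pair."""
    for first, second in zip(ops, ops[1:]):
        # Assumed eligibility rule: the sign adjust reads only the first
        # op's result and overwrites that same register.
        if (first.opcode in ALU_OPS and second.opcode in SIGN_OPS
                and second.srcs == (first.dest,)
                and second.dest == first.dest):
            first.fusion_eligible = True
    return ops


def mdr_fuse(ops: List[InstrOp], unit_available: bool = True) -> List[InstrOp]:
    """MDR model: fuse tagged pairs when the execution unit is available."""
    out: List[InstrOp] = []
    i = 0
    while i < len(ops):
        op = ops[i]
        if op.fusion_eligible and unit_available:
            sign = ops[i + 1]  # a tagged op always has a successor
            out.append(InstrOp(op.opcode, sign.dest, op.srcs,
                               sign_op=sign.opcode))
            i += 2
        else:
            out.append(op)
            i += 1
    return out


def execute(ops: List[InstrOp], regs: Dict[str, int]) -> Dict[str, int]:
    """Execute ops against a copy of the register file."""
    regs = dict(regs)
    for op in ops:
        if op.opcode in ALU_OPS:
            result = ALU_OPS[op.opcode](*(regs[s] for s in op.srcs))
        else:
            result = SIGN_OPS[op.opcode](regs[op.srcs[0]])
        if op.sign_op is not None:
            result = SIGN_OPS[op.sign_op](result)  # sign adjust stage
        regs[op.dest] = result
    return regs


program = mark_pairs([
    InstrOp("sub", "r2", ("r0", "r1")),
    InstrOp("neg", "r2", ("r2",)),
])
initial = {"r0": 3, "r1": 5, "r2": 0}
fused_ops = mdr_fuse(program)
unfused_ops = mdr_fuse(program, unit_available=False)
```

In this model the fused path dispatches one instruction operation instead of two and yields the same architectural register state as the unfused path. The `unit_available` flag loosely mirrors a fusion decision that depends in part on availability of execution circuitry.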
Claims
1. A processor, comprising:
- a hardware fetch and decode circuit including hardware pair detector circuitry, wherein the hardware pair detector circuitry is configured to: receive fetched instructions; detect a first instruction that is executable to perform an arithmetic/logic operation and a second instruction that is executable to adjust a sign of a result of the arithmetic/logic operation; determine whether the first instruction and the second instruction are eligible for fusion; and in response to determining that the first instruction and the second instruction are eligible for fusion, generate, for a first instruction operation decoded from the first instruction, an indicator of eligibility for fusion with a second instruction operation decoded from the second instruction; and forward from the hardware fetch and decode circuit the indicator of eligibility with the first instruction operation and the second instruction operation;
- a hardware map-dispatch-rename (MDR) circuit including hardware fusion circuitry, wherein: the hardware MDR circuit is configured to: receive the first instruction operation and the second instruction operation; detect the indicator of eligibility, if received; and based in part on whether the indicator of eligibility is detected, determine whether to fuse the first instruction operation and the second instruction operation; and the hardware fusion circuitry is configured to fuse the first instruction operation and the second instruction operation into a fused instruction operation that is executable to perform the arithmetic/logic operation and adjust the sign of the result; and
- hardware execution circuitry coupled to the hardware MDR circuit and configured to execute the fused instruction operation.
2. The processor of claim 1, wherein the second instruction is a negation instruction, and the fused instruction operation is executable to negate the result of the arithmetic/logic operation.
3. The processor of claim 1, wherein the second instruction is an absolute value instruction, and the fused instruction operation is executable to take an absolute value of the result of the arithmetic/logic operation.
4. The processor of claim 1, wherein the hardware execution circuitry comprises:
- a hardware arithmetic/logic operation execution circuit; and
- a hardware sign adjust circuit.
5. The processor of claim 4, further comprising an additional hardware sign adjust circuit, and wherein the processor is configured to use the additional hardware sign adjust circuit to execute non-fused sign adjust instruction operations but not to execute the fused instruction operation.
6. The processor of claim 1, wherein:
- the first instruction and the second instruction are floating-point instructions;
- the fused instruction operation is a floating-point instruction operation; and
- the hardware execution circuitry is floating-point hardware execution circuitry.
7. The processor of claim 1, wherein:
- the hardware pair detector circuitry is further configured to, in response to determining that the first instruction and the second instruction are eligible for fusion, generate for the second instruction operation an indicator of a type of operation performed by execution of the second instruction, and forward from the hardware fetch and decode circuit the indicator of a type of operation with the second instruction operation;
- the hardware pair detector circuitry is further configured to, in response to determining that the first instruction and the second instruction are not eligible for fusion, forward from the hardware fetch and decode circuit the first instruction operation and the second instruction operation without generating the indicator of eligibility or the indicator of a type of operation; and
- the hardware MDR circuit is further configured to: detect the indicator of a type of operation, if received; and determine whether to fuse the first instruction operation and the second instruction operation based in part on the indicator of a type of operation.
8. The processor of claim 7, wherein the hardware fusion circuitry is further configured to associate the indicator of a type of operation with the fused instruction operation.
9. The processor of claim 1, wherein the indicator of eligibility for fusion is within the first instruction operation.
10. The processor of claim 1, wherein the hardware MDR circuit is further configured to determine whether to fuse the first instruction operation and the second instruction operation based in part on an availability of the hardware execution circuitry.
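The division of work between the pair detector and the MDR circuit in claims 7 through 10 can be sketched in software as follows. The eligibility rule used here is an assumption, not something recited in the claims: the second instruction must read only the first instruction's destination and write that same register. The names `InstrOp`, `pair_detect`, and `mdr_fuse` and the opcode sets are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ALU_OPCODES = {"add", "sub", "mul"}   # assumed example operations
SIGN_ADJUST_OPCODES = {"neg", "abs"}


@dataclass
class InstrOp:
    opcode: str
    dst: str
    srcs: tuple
    fusion_eligible: bool = False            # eligibility indicator, carried in the first op (claim 9)
    sign_adjust_type: Optional[str] = None   # type-of-operation indicator (claims 7 and 8)


def pair_detect(first, second):
    """Decode stage: tag the ops only when the pair is judged eligible."""
    eligible = (
        first.opcode in ALU_OPCODES
        and second.opcode in SIGN_ADJUST_OPCODES
        and second.srcs == (first.dst,)   # assumed rule: consumes the first result
        and second.dst == first.dst       # assumed rule: overwrites the first result
    )
    if eligible:
        first.fusion_eligible = True
        second.sign_adjust_type = second.opcode
    return first, second


def mdr_fuse(first, second, execution_available):
    """MDR stage: return one fused op, or the original two ops unchanged."""
    if first.fusion_eligible and second.sign_adjust_type and execution_available:
        fused = InstrOp(
            opcode=f"{first.opcode}+{second.sign_adjust_type}",
            dst=second.dst,
            srcs=first.srcs,
            sign_adjust_type=second.sign_adjust_type,  # indicator stays with the fused op
        )
        return [fused]
    return [first, second]


a, b = pair_detect(InstrOp("add", "x3", ("x1", "x2")), InstrOp("neg", "x3", ("x3",)))
dispatched = mdr_fuse(a, b, execution_available=True)
```

Because the MDR stage also checks execution availability, a pair tagged as eligible at decode may still be dispatched as two separate instruction operations.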
11. A method, comprising:
- detecting, by a hardware fetch and decode circuit of a processor, a first instruction that is executable by the processor to perform an arithmetic/logic operation and a second instruction that is executable by the processor to adjust a sign of a result of the arithmetic/logic operation;
- decoding, by the hardware fetch and decode circuit, the first instruction and the second instruction into a first instruction operation and a second instruction operation for execution;
- determining, by the hardware fetch and decode circuit, that the first instruction and the second instruction are eligible for fusion;
- in response to determining that the first instruction and the second instruction are eligible for fusion, generating for the first instruction operation an indicator of eligibility for fusion with the second instruction operation;
- forwarding, to a hardware map-dispatch-rename (MDR) circuit of the processor, the first instruction operation, the indicator of eligibility and the second instruction operation;
- based in part on the indicator of eligibility, determining, in the hardware MDR circuit, that the first instruction operation and the second instruction operation should be fused;
- in response to determining that the first instruction operation and the second instruction operation should be fused, fusing the first instruction operation and the second instruction operation into a fused instruction operation that is executable by the processor to perform the arithmetic/logic operation and adjust the sign of the result; and
- executing, by the processor, the fused instruction operation.
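The steps of claim 11 can be walked through on a short instruction stream. The sketch below is a simplified software model that collapses the decode-stage eligibility check and the MDR fuse decision into one function. The tuple encoding, the eligibility rule (the second instruction reads and writes the first instruction's destination), and all names are assumptions made for illustration.

```python
ALU = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}
ADJ = {"neg": lambda v: -v, "abs": abs}


def decode_and_fuse(program):
    """program: list of (opcode, dst, srcs). Returns (opcode, adjust, dst, srcs) ops."""
    ops, i = [], 0
    while i < len(program):
        op, dst, srcs = program[i]
        if i + 1 < len(program):
            nxt_op, nxt_dst, nxt_srcs = program[i + 1]
            # Assumed eligibility rule; the claims leave the criteria open.
            if op in ALU and nxt_op in ADJ and nxt_srcs == (dst,) and nxt_dst == dst:
                ops.append((op, nxt_op, dst, srcs))   # one fused op replaces the pair
                i += 2
                continue
        ops.append((op, None, dst, srcs))             # not fused
        i += 1
    return ops


def execute(ops, regs):
    """Execute ops on a copy of the register file and return the new state."""
    regs = dict(regs)
    for op, adj, dst, srcs in ops:
        if op in ALU:
            value = ALU[op](*(regs[s] for s in srcs))
        else:
            value = ADJ[op](regs[srcs[0]])            # stand-alone sign adjust
        if adj is not None:
            value = ADJ[adj](value)                   # sign adjust folded into the fused op
        regs[dst] = value
    return regs


program = [
    ("sub", "x3", ("x1", "x2")),
    ("abs", "x3", ("x3",)),       # fusible with the preceding sub
    ("add", "x4", ("x3", "x1")),
    ("neg", "x5", ("x4",)),       # different destination, so not fused under the assumed rule
]
fused_ops = decode_and_fuse(program)
final_regs = execute(fused_ops, {"x1": 3, "x2": 10})
```

Under these assumptions the four instructions become three instruction operations, and the final register state matches what unfused execution of the same program produces.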
12. The method of claim 11, wherein the second instruction is a negation instruction and the fused instruction operation is executable to negate the result of the arithmetic/logic operation.
13. The method of claim 11, wherein the second instruction is an absolute value instruction, and the fused instruction operation is executable to take an absolute value of the result of the arithmetic/logic operation.
14. The method of claim 11, wherein executing the fused instruction operation comprises using an arithmetic/logic operation execution circuit dedicated to execution of fused instruction operations in lieu of an additional arithmetic/logic operation execution circuit used for execution of non-fused instruction operations.
15. The method of claim 11, wherein the first instruction and the second instruction are floating-point instructions and the fused instruction operation is a floating-point instruction operation.
16. A system, comprising:
- one or more memory controllers;
- one or more peripheral components;
- a processor; and
- a communication fabric configured to interconnect the one or more memory controllers, the one or more peripheral components and the processor, wherein the processor includes:
- a hardware fetch and decode circuit including hardware pair detector circuitry, wherein the hardware pair detector circuitry is configured to: receive fetched instructions; detect a first instruction that is executable to perform an arithmetic/logic operation and a second instruction that is executable to adjust a sign of a result of the arithmetic/logic operation; determine whether the first instruction and the second instruction are eligible for fusion; and in response to determining that the first instruction and the second instruction are eligible for fusion, generate, for a first instruction operation decoded from the first instruction, an indicator of eligibility for fusion with a second instruction operation decoded from the second instruction; and forward from the hardware fetch and decode circuit the first instruction operation with the indicator of eligibility and the second instruction operation;
- a hardware map-dispatch-rename (MDR) circuit including hardware fusion circuitry, wherein: the hardware MDR circuit is configured to: receive the first instruction operation and the second instruction operation; detect the indicator of eligibility, if received; and based on whether the indicator of eligibility is detected, determine whether to fuse the first instruction operation and the second instruction operation; and the hardware fusion circuitry is configured to fuse the first instruction operation and the second instruction operation into a fused instruction operation that is executable to perform the arithmetic/logic operation and adjust the sign of the result; and
- hardware execution circuitry coupled to the hardware fusion circuitry and configured to execute the fused instruction operation.
17. The system of claim 16, wherein the second instruction is a negation instruction, and the fused instruction operation is executable to negate the result of the arithmetic/logic operation.
18. The system of claim 16, wherein the second instruction is an absolute value instruction, and the fused instruction operation is executable to take an absolute value of the result of the arithmetic/logic operation.
19. The system of claim 16, wherein the hardware execution circuitry comprises:
- a hardware arithmetic/logic operation execution circuit; and
- a hardware sign adjust circuit.
20. The system of claim 19, wherein the processor further comprises an additional hardware sign adjust circuit, and wherein the processor is configured to use the additional hardware sign adjust circuit to execute non-fused sign adjust instruction operations but not to execute the fused instruction operation.
Type: Grant
Filed: Mar 22, 2023
Date of Patent: May 19, 2026
Assignee: Apple Inc. (Cupertino, CA)
Inventors: Francesco Spadini (Sunset Valley, TX), Skanda K. Srinivasa (Austin, TX), Zhaoxiang Jin (Austin, TX)
Primary Examiner: Keith E Vicary
Application Number: 18/188,123
International Classification: G06F 9/30 (20180101)