Multi-die processor
Disclosed are a multi-die processor apparatus and system. Processor logic to execute one or more instructions is allocated among two or more face-to-face stacked dice. The processor includes a conductive interface between the stacked dice to facilitate die-to-die communication.
1. Technical Field
The present disclosure relates generally to information processing systems and, more specifically, to processors whose logic is partitioned among a plurality of stacked dice.
2. Background Art
Electronic devices such as cellular telephones and notebook computers typically contain a number of integrated circuit (IC) packages mounted to a printed circuit board (PCB). IC packages typically include a single IC die on a substrate or leadframe. The die and substrate are encapsulated in a material such as plastic. The encapsulated packages are then mounted to another substrate such as a PCB. Various packaging approaches have been employed to improve performance for such electronic devices.
Multichip modules (MCM) are IC packages that can contain two or more “bare” or unpackaged integrated circuit dice interconnected on a common substrate. The size of the electronic device that uses MCMs can be reduced because MCMs typically have a number of individual IC dice mounted within a single package in a laterally adjacent manner.
System on a Chip (SoC) technology is the packaging of most or all of the necessary electronic circuits and parts for a “system” (such as a cell phone or digital camera) on a single IC die. For example, a system-on-a-chip for a sound-detecting device may include an audio receiver, an analog-to-digital converter, a microprocessor, memory, and input/output control logic on a single IC die.
Another type of IC package configuration that attempts to decrease the footprint and volume of the IC package is known as a Stacked Chip Scale Package (Stacked-CSP). The Stacked-CSP is essentially a space-efficient MCM, where multiple dice are stacked (in a face-to-back orientation) and integrated into a single package. Stacked-CSP packaging allows manufacturers of mobile phones and other portable devices to make their products smaller by vertically stacking heterogeneous dice, such as stacking flash and SRAM (Static Random Access Memory) dice, within a single package. By utilizing Stacked-CSP products that vertically mount two or more heterogeneous IC dice in a single package, wireless devices may be built with lower cost, weight and board space than devices made of traditional single-die packages.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of an apparatus and system for a multiple-die processor for which logic of the processor is partitioned among the multiple dice.
FIG. 3 is a data flow diagram illustrating at least one embodiment of an illustrative instruction execution pipeline.
Described herein are selected embodiments of a multi-die processor apparatus and system. In the following description, numerous specific details such as inter-component communication mechanisms, specified pipeline stages, overlap configurations for split logic, and the like have been set forth to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.
Disclosed herein is a packaging approach to stack multiple dice that, together, implement a processor device in a single package. For example, efficiencies in processor performance (as measured, for instance, by instructions per clock cycle) and heat and power management may be realized by splitting the logic of a processor core among two stacked dice that work together to cooperatively execute instructions.
At least one embodiment of each of the first die 102 and second die 104 has a face side and a back side. By “face” it is intended to refer to the side of the die with an integrated circuit formed on it. This face side may be referred to as the side of the die having active silicon. The “back side” of a die is the side having inactive matter (such as silicon substrate) that may be coupled to another structure, such as a heat sink, C4 I/O bumps, a substrate, or the like.
From
Brief reference to
The I/O bumps 212 provide a means for communicating with structures outside the multi-die processor 200, such as an interface portion of a processing system (see 1704,
The techniques disclosed herein may be utilized on a processor whose pipeline 300 may include different or additional pipeline stages to those illustrated in
Similarly, other performance-critical loops may occur during a processor's execution of instructions. For example,
A multi-die processor, such as, for example, the embodiments 100, 200 illustrated in
Regarding the schedule-execute data path 520,
At least one allocation scheme for splitting processor logic between two dice 802, 804 may be designed, for example, to ameliorate power-density concerns. That is, processors often strive to achieve a current-per-region value that is at or lower than a predetermined threshold. A relatively high power-density region requires a relatively large amount of current. By allocating a portion of the logic for the high power-density region to a first die and the remaining portion of the logic for the high power-density region to a second die, the implementation constraints for the region may be relaxed, leading to a lower power-density design. This ability to partition the logic of a high power-density region to reduce its footprint and to lower its power consumption is only one advantage of the stacking approach illustrated in
Further partitioning of scalar processor logic is also illustrated in
Turning to
The embodiment illustrated in
Additionally,
It should be noted, of course, that the two portions 1208a, 1208b may, but need not necessarily, completely overlap each other. For instance, in order to offset potential thermal effects that may be associated with overlapping portions of “hot” processor logic blocks over each other, the overlapping portions may be offset such that only part of the portions 1208a, 1208b overlap each other.
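By way of a non-limiting illustration, the degree to which two logic portions overlap may be estimated geometrically. The following Python sketch assumes that each portion's footprint is an axis-aligned rectangle and that both footprints are expressed in a shared coordinate frame after the dice are stacked face-to-face. The names Rect, overlap_area and overlap_fraction, as well as the dimensions used, are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned footprint of a logic portion (arbitrary units)."""
    x: float       # left edge in the shared coordinate frame
    y: float       # bottom edge in the shared coordinate frame
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two footprints; 0.0 if they do not overlap."""
    dx = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    dy = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    if dx <= 0 or dy <= 0:
        return 0.0
    return dx * dy


def overlap_fraction(a: Rect, b: Rect) -> float:
    """Overlap area as a fraction of the smaller footprint."""
    smaller = min(a.area, b.area)
    if smaller == 0:
        return 0.0
    return overlap_area(a, b) / smaller


# Assumed example: two equal portions, the second shifted by half its width.
portion_a = Rect(x=0.0, y=0.0, width=4.0, height=2.0)
portion_b = Rect(x=2.0, y=0.0, width=4.0, height=2.0)
partial = overlap_fraction(portion_a, portion_b)
```

In this assumed example the two portions share half of their area, which corresponds to the case in which only part of each portion lies over the other. The sketch does not model thermal behavior; it only quantifies the geometric overlap that a designer might trade off against thermal effects.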
While
Also for example, while the illustrated embodiments indicate a two-die processor, with each die having a logic portion of the processor disposed thereon, the logic of a processor may be partitioned among a plurality of dice. For example, face-to-face dice may overlap such that a portion of a first top die and a portion of a second top die overlap a third bottom die. The partitioned logic on the multiple dice, whatever the number, cooperatively operates to execute one or more instructions.
That is, as disclosed herein the logic portions allocated to respective multiple dice may be invoked to perform one or more execution operations associated with an instruction. The logic portions operate to cooperatively accomplish execution operations, such as those operations indicated for an execution pipeline (see, for example, sample pipeline 300 illustrated in
The logic portions may be allocated among the multiple dice such that certain functions are split. That is, address generation unit logic may be split into a first portion and a second portion, with the first portion being allocated to a first die and a second portion being allocated to a second die. The first and second logic portions may at least partially overlap and may act together to cooperatively perform the operations of an address generation unit. Similarly, a scheduling unit may be split, as may an array such as a general register file, a cache, a floating point register file or a microcode memory array. A memory controller may also be split, as may a cache, a translation lookaside buffer, decode logic, rename logic, fetch logic, retirement logic, and floating point execution unit logic.
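As one hedged illustration of how two portions of a split array might act together, the following Python sketch models a register file whose entries are divided between two dice. The disclosure does not specify how an array is divided; the sketch assumes, purely for illustration, that each entry is bit-sliced so that the low half of every entry resides on the first die and the high half on the second die. The class names ArrayPortion and SplitRegisterFile are hypothetical.

```python
class ArrayPortion:
    """Hypothetical portion of an array held on one die."""

    def __init__(self, entries: int, bits: int) -> None:
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.cells = [0] * entries

    def write(self, index: int, value: int) -> None:
        # Keep only the bits that this portion is responsible for.
        self.cells[index] = value & self.mask

    def read(self, index: int) -> int:
        return self.cells[index]


class SplitRegisterFile:
    """Two portions that cooperatively behave as a single register file.

    Assumption: entries are bit-sliced, low half on the first die and
    high half on the second die.
    """

    def __init__(self, entries: int, width: int) -> None:
        if width % 2 != 0:
            raise ValueError("width must be even to split into two halves")
        self.half = width // 2
        self.low = ArrayPortion(entries, self.half)    # first die
        self.high = ArrayPortion(entries, self.half)   # second die

    def write(self, index: int, value: int) -> None:
        # Both portions are invoked for every write.
        self.low.write(index, value)
        self.high.write(index, value >> self.half)

    def read(self, index: int) -> int:
        # Both portions are invoked and their halves are recombined.
        return (self.high.read(index) << self.half) | self.low.read(index)


rf = SplitRegisterFile(entries=8, width=32)
rf.write(3, 0xDEADBEEF)
value = rf.read(3)
```

Neither portion alone can return a complete entry in this model; a read or a write involves both portions, which is one way to picture logic on two dice cooperating as a single unit. Other divisions, such as allocating different entries to different dice, would be equally consistent with the description above.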
As is indicated above, logic portions may also be allocated such that, rather than splitting a block of logic, the intact logic blocks for successive pipeline stages are allocated among the multiple dice of the processor. Such allocation of the logic for pipeline stages may result in a zigzag communication path 1106 through the die-to-die interface 275 as illustrated in
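The allocation of intact stage logic among the dice can be pictured with a short Python sketch. The sketch assumes a strict alternation of successive stages between two dice, which is only one possible allocation, and uses an assumed list of stage names drawn from the logic blocks named in this description; the function names allocate_alternating and count_crossings are hypothetical.

```python
from typing import List, Tuple

# Assumed stage names for illustration only; an actual pipeline may differ.
STAGES = ["fetch", "decode", "rename", "schedule", "execute", "retire"]


def allocate_alternating(stages: List[str]) -> List[Tuple[str, int]]:
    """Place successive stages on alternating dice (0 = first die, 1 = second die)."""
    return [(stage, i % 2) for i, stage in enumerate(stages)]


def count_crossings(allocation: List[Tuple[str, int]]) -> int:
    """Count how often the signal path moves from one die to the other."""
    return sum(
        1
        for (_, die_a), (_, die_b) in zip(allocation, allocation[1:])
        if die_a != die_b
    )


allocation = allocate_alternating(STAGES)
crossings = count_crossings(allocation)
```

With six stages alternated between two dice, the path crosses the die-to-die interface between every pair of successive stages, five times in all, which corresponds to a zigzag path. Placing all stages on one die would yield zero crossings under the same model.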
The execution operations associated with an execution stage of an execution pipeline may further include execution, by an execution unit, of arithmetic instruction codes such as integer or floating point instruction codes. As used herein, the term “instruction code” is intended to encompass any unit of work that can be understood and executed by an execution unit, such as a floating point execution unit, arithmetic logic unit, or load/store execution unit. An instruction code may be a micro-operation.
The execution operations associated with the execution pipeline stage may also include execution, by an execution unit, of a memory instruction code such as a memory read or memory write instruction code.
The foregoing discussion discloses selected embodiments of a multi-die processor. A multi-die processor 1702 such as described herein may be utilized on a processing system such as the processing system 1700 illustrated in
Processing system 1700 includes a memory system 1705 and a processor 1702. Memory system 1705 may store instructions 1740 and data 1741 for controlling the operation of the processor 1702. Memory system 1705 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory and related circuitry. Memory system 1705 may store instructions 1740 and/or data 1741 represented by data signals that may be executed by the processor 1702.
The processing system 1700 includes an interface portion 1704. Rather than the die-to-die interface 275 between the first die 102 and second die 104 of the processor 1702, the interface portion 1704 may be coupled to only one or both of the dice 102, 104. The interface portion 1704 is to generate inter-component signals between the processor 1702 and another component of the system 1700. For example, the interface portion 1704 may generate inter-component signals between the processor 1702 and the memory system 1705. For instance, the interface portion 1704 may generate signals between the processor 1702 and memory system 1705 in order to perform a memory transaction such as a data-retrieval read operation from memory or a data write to memory. The interface portion 1704 may also generate signals between the processor 1702 and another system component 1707, such as an RF unit, keyboard, external memory device, monitor, mouse or the like.
In the preceding description, various aspects of an apparatus and system for a multi-die processor are disclosed. For purposes of explanation, specific numbers, examples, systems and configurations were set forth in order to provide a more thorough understanding. However, it is apparent to one skilled in the art that the described apparatus and system may be practiced without the specific details. It will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects. While particular embodiments of the present invention have been shown and described, the appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.
Claims
1. An apparatus comprising:
- a first die having a first face side and a first back side, the first die comprising a first logic portion;
- a second die having a second face side and a second back side, the second die comprising a second logic portion;
- said first and second dice being coupled together with their faces opposed to each other;
- wherein said first logic portion and said second logic portion are to cooperatively execute an instruction.
2. The apparatus of claim 1, wherein:
- said first and second dice are further coupled such that the first logic portion and said second logic portion at least partially overlap.
3. The apparatus of claim 1, wherein:
- to cooperatively execute an instruction is further to cooperatively accomplish sub-instruction level tasks in response to an instruction.
4. The apparatus of claim 1, further comprising:
- a conductive inter-die interface between the opposing faces of the first and second dice.
5. The apparatus of claim 4, wherein:
- the inter-die interface is disposed between a subset of the face side of the first die and a subset of the face side of the second die.
6. The apparatus of claim 5, wherein:
- the subset of the face side of the first die is a central region.
7. The apparatus of claim 5, wherein:
- the subset of the face side of the second die is a central region.
8. The apparatus of claim 5, wherein:
- the subset of the face side of the first die is a perimeter region.
9. The apparatus of claim 5, wherein:
- the subset of the face side of the second die is a perimeter region.
10. The apparatus of claim 4, further comprising:
- an interface portion, said interface portion being operatively coupled to at least one of said first logic portion and said second logic portion to generate inter-component signals between the processor and a component.
11. The apparatus of claim 10, wherein:
- the interface portion is coupled to the first die.
12. The apparatus of claim 10, wherein:
- the component is a memory system.
13. The apparatus of claim 1, wherein:
- said first logic portion and said second logic portion collectively form address generation logic.
14. The apparatus of claim 1, wherein:
- said first logic portion and said second logic portion collectively form scheduling logic.
15. The apparatus of claim 14, wherein:
- said first logic portion comprises arithmetic scheduling logic and wherein said second logic portion comprises memory request scheduling logic.
16. The apparatus of claim 1, wherein:
- said first logic portion comprises a first portion of an array and wherein said second logic portion comprises a second portion of the array.
17. The apparatus of claim 16, wherein:
- said array is a register file array.
18. The apparatus of claim 16, wherein:
- said array is a microcode memory array.
19. The apparatus of claim 1, wherein:
- said first logic portion comprises a hot logic block and said second logic portion comprises a cool logic block.
20. The apparatus of claim 19, wherein:
- said first logic portion at least partially overlaps said second logic portion.
21. The apparatus of claim 19, wherein:
- said first logic portion further comprises an execution unit and wherein said second logic portion further comprises a data cache.
22. The apparatus of claim 1, wherein:
- said first logic portion comprises a first execution unit and said second logic portion comprises a second execution unit.
23. The apparatus of claim 22, wherein:
- said first execution unit comprises an integer execution unit and said second execution unit comprises a floating point execution unit.
24. The apparatus of claim 22, wherein:
- said first execution unit comprises a floating point execution unit and said second execution unit comprises a single-instruction-multiple-data (SIMD) execution unit.
25. The apparatus of claim 1, wherein:
- the first logic portion is disposed on the face side of the first die.
26. The apparatus of claim 1, wherein:
- the second logic portion is disposed on the face side of the second die.
27. The apparatus of claim 1, wherein:
- said first logic portion comprises logic to execute a first pipeline stage to execute the instruction; and
- said second logic portion comprises logic to execute a second pipeline stage to execute the instruction.
28. The apparatus of claim 27, wherein:
- logic blocks for additional pipeline stages are disposed on said first and second dice such that a signal path for the pipeline follows a zigzag path between the first and second dice.
29. A processor comprising:
- a first partition on a first die;
- a second partition on a second die; and
- execution logic to invoke the first partition and the second partition to perform an execution operation associated with an instruction.
30. The processor of claim 29, wherein:
- said execution operation further comprises a scheduling operation.
31. The processor of claim 29, wherein:
- said execution operation further comprises an address generation operation.
32. The processor of claim 29, wherein:
- said execution logic, in response to the instruction, is further to invoke a partition on the first die to invoke a second execution operation associated with the instruction and is to invoke a partition on the second die to perform a third execution operation associated with the instruction.
33. The processor of claim 32, wherein:
- said second execution operation further comprises an operation associated with an execute stage of an instruction pipeline.
34. The processor of claim 32, wherein:
- said second execution operation further comprises an instruction pointer generation operation.
35. The processor of claim 29, wherein:
- said execution operation further comprises an instruction fetching operation.
36. The processor of claim 29, wherein:
- said second execution operation further comprises a decoding operation.
37. The processor of claim 29, wherein:
- said second execution operation further comprises a renaming operation.
38. The processor of claim 29, wherein:
- said second execution operation further comprises a retirement operation.
39. An apparatus, comprising:
- a first die comprising: an execution unit; and a first array fraction; and
- a second die comprising: a second array fraction coupled to said first register file fraction by die-to-die couplings to cooperatively operate as an array in conjunction with said first array fraction.
40. The apparatus of claim 39, wherein:
- said first die further comprises a first scheduling fraction; and
- said second die further comprises a second scheduling fraction coupled to said first scheduling fraction by die-to-die couplings to cooperatively operate as a scheduling unit in conjunction with said first scheduling fraction.
41. The apparatus of claim 39, wherein:
- said first die further comprises a first address generation fraction; and
- said second die further comprises a second address generation fraction;
- wherein said first address generation fraction is coupled to said second address generation fraction to cooperatively operate as an address generation unit in conjunction with the second address generation fraction.
42. The apparatus of claim 39, wherein:
- the array is a register file.
43. The apparatus of claim 39, wherein:
- the array is a microcode memory array.
44. The apparatus of claim 39, wherein a microprocessor comprises said first die comprising said first partition and said second die comprising said second partition as well as an interface disposed on said first die, and further wherein said apparatus is a system further comprising:
- a memory coupled to the interface portion of the microprocessor, said memory to store an instruction which when executed by the microprocessor causes said microprocessor to invoke said first partition on said first die and said second partition on said second die.
45. The system of claim 38, further comprising:
- an additional system component comprising an RF unit.
46. The apparatus of claim 1, wherein:
- said first logic portion comprises a low power-density region and said second logic portion comprises a high power-density region.
47. The apparatus of claim 46, wherein:
- said first logic portion at least partially overlaps said second logic portion.
Type: Application
Filed: Dec 16, 2003
Publication Date: Jun 16, 2005
Inventors: Bryan Black (Austin, TX), Nicholas Samra (Austin, TX), M. Webb (Aloha, OR)
Application Number: 10/738,680