Method and apparatus for processor to operate at its natural clock frequency in the system
A mechanism generates a self-clock within a synchronous processing unit of an asynchronous digital device. The self-clock is designed to match the worst-case delay of the pipeline processing unit, so that the pipeline processing unit operates at its own natural clock frequency and shuts off when there is no valid data to process. The synchronization logic of the processing unit consists of a self-clock that generates an output clock synchronized with the internal clock edge if the processing unit is active, or with the input clock edge if the processing unit is inactive.
The present disclosure relates to digital systems (such as mobile devices, processors, memory devices, and computer systems) and, more particularly, to mechanisms and techniques for clocking digital designs.
BACKGROUND
In general, microprocessors (processors) achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle. The term “clock cycle” refers to an interval of time accorded to various stages of processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to a rising or falling edge of a clock signal defining the clock cycle. The storage devices store the values until a subsequent rising or falling edge of the clock signal, respectively. The phrase “instruction processing pipeline” is used herein to refer to the logic circuits employed to process instructions in a pipeline fashion. Although the pipeline may include any number of stages, where each stage processes at least a portion of an instruction, instruction processing generally includes the steps of: decoding the instruction, fetching data operands, executing the instruction and storing the execution results in the destination identified by the instruction.
Processor design consists of a central clock, generally a phase-locked loop (PLL) clock, with a clock tree network. The clock tree consists of many global clock buffers and local clock buffers. The clock buffers can be clock-gated to save power, but the clock tree itself can still consume considerable power. By some estimates, the clock tree can consume 15% to 35% of the total dynamic power of the processor. Distributed clock networks with local clock generators can significantly reduce the power consumption of a microprocessor, as suggested in U.S. Pat. No. 5,987,620. Unfortunately, at the system level, the clocking network is still inefficient with a single PLL clock or multiple PLL clocks. Globally-asynchronous-locally-synchronous (GALS) clocking allows the system modules to operate at different clock frequencies, but these clock frequencies are still fixed by PLL clocks.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. Embodiments of the present disclosure are illustrated by way of examples and are not limited by the accompanying figures, in which like references indicate similar elements. The use of the same reference symbols in different drawings indicates similar or identical items. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
The problems outlined above are in large part solved by a design in accordance with the various embodiments of this disclosure. Embodiments of this disclosure are adaptable for use in any mobile device, computer system, or other digital design.
In particular, the disclosure contemplates using a self-clock mechanism that conditionally generates clocks when there is a valid operation to be performed. The self-clock modules are used inside the processing unit as well as in the interface block of the processing unit for communication with other processing units. The interface block includes asynchronous buffers that allow the processing unit to receive data from, and send data to, other processing units with different clock frequencies. The self-clock modules within a processing unit are designed to operate at the same clock frequency, which matches the worst-case speed path or the target frequency of the processing unit. This mechanism enables power reduction at the processor level as well as at the system level. The system can include many cores, such as a general-purpose microprocessor, a DSP, a peripheral device, an I/O device, a hardware accelerator, and memory modules. Instead of using a single PLL clock or multiple PLL clocks to force these cores and memory modules to operate at certain clock frequencies, the cores and memory modules should operate at their own natural clock frequencies. The natural clock module is designed in accordance with the design technology so that it matches the frequency of the pipeline operation of the processor.
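The relationship between the worst-case speed path and the natural clock frequency can be illustrated with a short calculation. The following is a minimal sketch, not part of the disclosed circuitry: the function name `natural_clock_frequency_hz`, the stage names, and the delay values are hypothetical and chosen only for illustration.

```python
def natural_clock_frequency_hz(stage_delays_ns):
    """Return the highest frequency (Hz) at which the slowest stage still meets timing."""
    delays = list(stage_delays_ns)
    if not delays:
        raise ValueError("at least one pipeline stage delay is required")
    # The clock period cannot be shorter than the worst-case stage delay.
    worst_case_ns = max(delays)
    return 1e9 / worst_case_ns


# Hypothetical per-stage delays in nanoseconds (assumed values, not from the disclosure).
stage_delays_ns = {"fetch": 0.8, "decode": 0.5, "execute": 1.0, "writeback": 0.4}
frequency_hz = natural_clock_frequency_hz(stage_delays_ns.values())
```

With these assumed delays the slowest stage takes 1.0 ns, so the self-clock period would be 1.0 ns regardless of how fast the other stages are.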
This disclosure provides various embodiments of mechanisms that generate a clock only when there is a need to perform a valid operation.
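As a rough illustration of generating a clock only for valid operations, the sketch below compares the number of clock pulses produced by a free-running clock with the number produced by a clock that pulses only in slots that carry a valid operation. The function name `gated_pulse_count` and the activity pattern are assumptions made for this example, and pulse count is only a coarse proxy for dynamic clock power.

```python
def gated_pulse_count(valid_per_slot):
    """Count the clock pulses generated when a pulse is issued only for valid operations."""
    return sum(1 for valid in valid_per_slot if valid)


# Hypothetical activity pattern: True means a valid operation is present in that slot.
valid_per_slot = [True, True, False, False, False, True, False, False]

free_running_pulses = len(valid_per_slot)          # a free-running clock pulses in every slot
self_clock_pulses = gated_pulse_count(valid_per_slot)
fraction_of_pulses_avoided = 1 - self_clock_pulses / free_running_pulses
```

In this assumed pattern the self-clock pulses in three of eight slots, so five of the eight pulses a free-running clock would produce are never generated.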
A further understanding of the nature and advantages of the present disclosure may be realized by reference to the remaining portions of the specification and the drawings.
DETAILED DESCRIPTION
PC interface unit 16 communicates with external devices (not shown) through bidirectional bus 26. Processing unit 100 receives an external clock 30 with which PC interface unit 16 can synchronize with external devices when transferring data. The external clock 30 connects to the PLL clocks of processor 10, processor 12, and memory module 14. The PLL clocks generate internal clocks at different clock frequencies for processor 10, processor 12, and memory module 14. In another configuration, the PLL clocks can be generated by an external clock module instead of internally to the processors or memory modules.
In the processor 10, the PLL clock frequency can be a multiple of the clock frequency of the external clock 30. The internal clock of processor 10 connects to a clock tree to supply the clock to all internal functional units and the bus interface unit. Similarly, the PLL clock of processor 12 connects to the clock tree to supply the clock to its internal functional units and bus interface unit. Memory module 14 may use the PLL clock in a different manner than processor 10 and processor 12. For example, it may generate multiple internal clocks with different clock frequencies for the internal SRAM or DRAM arrays and for the I/O interfaces with processor 10, processor 12, and PC interface unit 16.
In an alternate embodiment, processing device 100 may include any number of processors, hardware accelerators, and I/O devices. In another embodiment, the processor 10 may be a general-purpose microprocessor and processor 12 may be a DSP processor or graphics unit. The memory module 14 may include memory modules and a hierarchical memory subsystem for processors 10 and 12.
Traditionally, each of the functional units constitutes one or more pipeline stages in a processor. A first instruction is fetched from instruction cache 54 during a clock cycle. During the next clock cycle, the first instruction is in the decode unit 58 while a second instruction is being fetched from the instruction cache 54. Pipelining thus enables simultaneous processing of multiple instructions. In general, the number of pipeline stages increases with design complexity and with higher clock frequency. The term “clock frequency” refers to the number of clock cycles within a unit of time, usually a second.
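The overlap of instructions across pipeline stages described above can be traced cycle by cycle. The sketch below is a simplified software model under stated assumptions: the function `pipeline_trace`, the four stage names, and the instruction labels are hypothetical, every stage is assumed to take exactly one clock cycle, and stalls are ignored.

```python
def pipeline_trace(instructions, stages=("fetch", "decode", "execute", "writeback")):
    """Return one dict per clock cycle mapping each stage to the instruction it holds."""
    total_cycles = len(instructions) + len(stages) - 1
    trace = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            # An instruction issued in cycle i reaches the stage at depth d in cycle i + d.
            index = cycle - depth
            row[stage] = instructions[index] if 0 <= index < len(instructions) else None
        trace.append(row)
    return trace


trace = pipeline_trace(["i1", "i2", "i3"])
second_cycle = trace[1]  # "i1" is being decoded while "i2" is being fetched
```

In the second cycle of this model the first instruction occupies the decode stage while the second instruction is being fetched, which is the overlap the paragraph describes.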
In
The local clock module 42 in BIU 52 differs slightly from the local clock modules 40 because it interfaces with external devices at a different clock frequency. The BIU 52 receives clock signals and output data from the instruction cache or data cache and generates output clock signal 32b for external devices. The local clock module 42 in BIU 52 also receives input clock signal 32a and input data on bus 20 from an external device to generate internal clock signal 52a. Since the processing unit 10 in
Referring now to
Turning now to
The asynchronous output FIFO 72 receives BIU output valid 88 and data 87 and uses sync clock 84 to buffer the data before sending it out to the external device. The asynchronous output FIFO 72 can be a simple buffer that sends output data to the external device when the external device is not busy or full. The output clock 32b is based on sync clock 84 and is sent along with output valid 96 and output data 94 to the external device. In another embodiment, the asynchronous output FIFO 72 may consist of a control block and FIFO as shown for input data. In this case, the sync clock 84 must be synchronized with the clock edge of input clock 32a to generate output valid 96 and output data 94. For example, the memory module 14 can use this mechanism to send data to PCI 16 in
The input FIFO data 85 are consumed by BIU control logic 76 and processor 10 using internal clock generator 78. The input clock 32a and valid signal 90 enable the clock generator 78 to activate output clock 81 for sending valid data to either instruction cache 54 or data cache 56. The clock generator 78 is part of the self-clock module 42, which generates the natural clock frequency of processor 10. The input FIFO valid 86 from asynchronous control block 74 is sent to BIU control logic 76 to steer input FIFO data 85 to instruction cache 54 or data cache 56. The output clock 81 is sent to local clock modules 40 of instruction cache 54 and data cache 56 along with input FIFO valid 86 and data 85. Output clock 81 is part of the clock bus 52a. The clock bus 52a also includes feedback clock 80 and active signal 82. The feedback clock 80 and active signal 82 indicate that instruction cache 54 or data cache 56 is not idle. In the absence of input valid 90, the active signal 82 is used to shut down the clock generator 78 to save power. The active clock module 75 continuously generates a clock signal that is synchronous with the internal clock of processor 10. In one embodiment, the sync clock 84 and the output clock 81 come from the same self-clock module, with different enable signals internal to the clock generator 78. The BIU control logic 76 receives requests from instruction cache 54 and data cache 56 and sends BIU output data 87 and valid 88 to external devices via asynchronous output FIFO 72.
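The buffering role of a FIFO between a sender and a receiver that run on different clocks can be modeled in software. The following is only a behavioral sketch under stated assumptions: the class `BoundedFifo`, its method names, and the depth of two entries are hypothetical, and the model ignores the clock-domain-crossing synchronization that a hardware asynchronous FIFO would need.

```python
from collections import deque


class BoundedFifo:
    """Behavioral model of a fixed-depth FIFO with full and valid indications."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self._depth = depth
        self._entries = deque()

    @property
    def full(self):
        return len(self._entries) >= self._depth

    @property
    def valid(self):
        return bool(self._entries)

    def write(self, data):
        """Sender-side clock edge: store data unless full; return whether it was accepted."""
        if self.full:
            return False
        self._entries.append(data)
        return True

    def read(self):
        """Receiver-side clock edge: return the oldest entry, or None when empty."""
        return self._entries.popleft() if self._entries else None


fifo = BoundedFifo(depth=2)
accepted = [fifo.write(word) for word in ("a", "b", "c")]  # the third write finds the FIFO full
first_out = fifo.read()                                    # the receiver drains one entry
retry_accepted = fifo.write("c")                           # the sender can now retry
```

A faster sender must hold its data whenever `write` reports that the FIFO is full, while the receiver reads at its own pace and sees entries in arrival order.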
The clock generator 78 includes an active clock module 75. The output clock 81 and sync clock 84 operate at the same frequency and match the worst-case pipeline delay of processor 10, which defines the natural clock frequency of processor 10. Output clock 81 is generated differently depending on the state of the processor 10. If the processor 10 is active, the output clock 81 is generated from sync clock 84. The active clock module 75 continuously generates the sync clock 84 while processor 10 is active, and clock generator 78 uses this sync clock 84 to generate output clock 81. When the processor 10 is idle (the active clock module 75 is disabled), upon receiving valid external input data 90, the active clock module 75 is enabled and clock generator 78 generates output clock 81 and sync clock 84 based on the clock edge of input clock 32a. In another embodiment, the clock generator 78 can randomly generate an output clock 81 and sync clock 84.
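The choice of which edge the output clock follows can be summarized as a small decision function. This is a minimal sketch of the selection described above, not the disclosed circuit: the function name `select_clock_source` and the string labels are hypothetical, and the shut-off case reflects the earlier statement that the clock generator is shut down when the processor is idle and no valid input is present.

```python
def select_clock_source(processor_active, input_valid):
    """Return which reference the output clock follows for the given processor state."""
    if processor_active:
        # Active processor: follow the continuously running internal (sync) clock.
        return "internal"
    if input_valid:
        # Idle processor woken by valid external input: follow the input clock edge.
        return "input"
    # Idle with nothing to process: generate no clock at all.
    return "off"


sources = {
    (active, valid): select_clock_source(active, valid)
    for active in (False, True)
    for valid in (False, True)
}
```

The internal clock takes priority whenever the processor is already active, so newly arriving input in that state does not change the reference edge in this model.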
In yet another embodiment, the active clock module 75 is active based solely on the feedback clocks 80 from instruction cache 54 and data cache 56. The output clock 81 and sync clock 84 are synchronized with the clock edge of feedback clock 80. The feedback clocks 80 are generated when instruction cache 54 is active or data cache 56 is active. The feedback clock 80 is a combination of both active clock 54a of instruction cache 54 and active clock 56a of data cache 56.
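One way to picture the combination of the two active clocks is as a logical OR of sampled waveforms. This is an assumption made for illustration, since the text does not state how the clocks are combined; the function name `combine_active_clocks` and the sample waveforms are likewise hypothetical.

```python
def combine_active_clocks(icache_clock, dcache_clock):
    """OR two sampled clock waveforms (lists of 0/1) into one feedback waveform."""
    if len(icache_clock) != len(dcache_clock):
        raise ValueError("waveforms must cover the same number of samples")
    return [a | b for a, b in zip(icache_clock, dcache_clock)]


# Hypothetical sampled waveforms: 1 where a clock pulse is present, 0 otherwise.
icache_active_clock = [1, 0, 1, 0, 0, 0]
dcache_active_clock = [0, 0, 1, 0, 1, 0]

feedback_clock = combine_active_clocks(icache_active_clock, dcache_active_clock)
any_cache_active = any(feedback_clock)
```

Under this assumption a feedback pulse appears whenever either cache produces an active clock pulse, and the feedback waveform goes quiet only when both caches are idle.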
Turning now to
The above examples in
Some of the above embodiments, as applicable, may be implemented using a variety of different information processing systems. For example, although
Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.
Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above-described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
In one embodiment, the local clocks of this disclosure are applicable to all digital ICs, such as a custom chip, an Application Specific IC (ASIC), or a Field Programmable Gate Array (FPGA). They are applicable to practically any digital design, such as processing units, memory systems, communication systems, and I/O systems.
In one embodiment, system 100 is a computer system such as a personal computer system. Other embodiments may include different types of computer systems. Computer systems are information handling systems which can be designed to give independent computing power to one or more users. Computer systems may be found in many forms including but not limited to mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices. A typical computer system includes at least one processing unit, associated memory and a number of input/output (I/O) devices.
Although the disclosure is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
The term “coupled,” as used herein, is not intended to be limited to a direct coupling or a mechanical coupling.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to disclosures containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.
Claims
1. A digital circuitry comprising:
- a first processing unit wherein the first processing unit generates a first clock; and
- a second processing unit that receives the said first clock comprising of a self-clock circuitry that generates an internal clock; wherein the said self-clock circuitry further comprises of: a mechanism to generate a first output clock that synchronizes with the internal clock; a mechanism to generate a second output clock that synchronizes with the first clock from the first processing unit; and a mechanism to generate a select between the first output clock and the second output clock to generate an output clock.
2. The apparatus of claim 1, wherein the second processing unit further comprises of a first-in-first-out register to receive a data from the first processing unit.
3. The apparatus of claim 2, wherein:
- the first clock from the first processing unit is at a faster clock frequency in comparison to the internal clock frequency of the self-clock circuitry of the second processing unit wherein an acknowledge signal is needed to avoid overrun of the first-in-first-out register.
4. The apparatus of claim 3, wherein the output clock from the self-clock circuitry of the second processing unit is used to read a data from the first-in-first-out register.
5. The apparatus of claim 1, wherein the self-clock circuitry of the second processing unit continuously generates the internal clock as long as there is a valid operation within the second processing unit.
6. The apparatus of claim 1, wherein the second processing unit comprises of a second self-clock circuitry; wherein the second self-clock circuitry generates an output clock that:
- has the same clock frequency with the internal clock of the first self-clock circuitry of the second processing unit; and
- synchronizes with the internal clock of the first self-clock circuitry of the second processing unit.
7. The apparatus of claim 1, wherein the second processing unit is a memory storage device.
8. The apparatus of claim 1, wherein the self-clock circuitry of the second processing unit further comprises of:
- a mechanism to generate a third output clock that synchronizes with an internal feedback clock within the second processing unit; and
- a mechanism to generate a select between the first output clock, the second output clock, and the third output clock to generate an output clock.
9. The apparatus of claim 1, wherein the self-clock circuitry of the second processing unit further comprises of:
- an active indication to generate the output clock; and
- an idle indication to generate no clock.
10. The apparatus of claim 1, wherein the internal clock period is designed to match a target clock frequency of the second processing unit.
11. The apparatus of claim 1, wherein the internal clock period is designed to match a worst-case delay of an internal pipeline logic of the second processing unit.
12. The apparatus of claim 1, wherein the second processing unit includes a clock synchronous logic and a second first-in-first-out register to send an output clock and a packet of data to the first processing unit.
Type: Grant
Filed: Jul 22, 2012
Date of Patent: May 23, 2017
Patent Publication Number: 20120317434
Inventor: Thang Tran (Saratoga, CA)
Primary Examiner: Mark Connolly
Application Number: 13/555,178
International Classification: G06F 1/00 (20060101); G06F 1/04 (20060101); G06F 1/12 (20060101); H04L 7/00 (20060101); G06F 9/38 (20060101); G06F 1/10 (20060101);