NOVEL CONTEXT INSTRUCTION CACHE ARCHITECTURE FOR A DIGITAL SIGNAL PROCESSOR
Improved thrashing-aware and self-configuring cache architectures reduce cache thrashing in a DSP without increasing cache size or degrading cache-hit access time. In one example embodiment, this is accomplished by selectively caching only those instructions that have a higher probability of recurrence, which considerably reduces cache thrashing.
The present invention relates to digital signal processors, and more particularly to real-time memory management for digital signal processors.
BACKGROUND OF THE INVENTION
A digital signal computer, or digital signal processor (DSP), is a special-purpose computer designed to optimize performance for digital signal processing applications, such as, for example, fast Fourier transforms, digital filters, image processing, and speech recognition. DSP applications are characterized by real-time operation, high interrupt rates, and intensive numeric computations. In addition, DSP applications tend to be intensive in memory access operations and to require the input and output of large quantities of data. Thus, designs of DSPs may be quite different from those of general-purpose processors.
One approach that has been used in the architecture of DSPs is the Harvard architecture, which utilizes separate, independent program and data memories so that two memories may be accessed simultaneously. This permits instructions and data to be accessed in a single clock cycle. Frequently, the program occupies less memory space than data. To achieve full memory utilization, a modified Harvard architecture utilizes the program memory for storing both instructions and data. Typically, the program and data memories are interconnected to the core processor by separate program and data buses.
When instructions and data are stored in the program memory, conflicts may arise in the fetching of instructions. Further, in a Harvard architecture, an instruction fetch and a data access can take place in the same clock cycle, which can lead to a conflict on the program memory bus. In this scenario, an instruction that can generally be fetched in a single clock cycle can stall for a cycle due to the conflict. This happens when the instruction fetch phase coincides with the memory access phase of a preceding load or store instruction on the program memory bus. Such instructions are cached in a conflict cache so that the next time the same instructions are encountered, they can be fetched from the conflict cache to avoid instruction fetch stalls. In addition to the conflict cache, a traditional instruction cache is also required for fetching instructions from the external main memory. This results in requiring two different cache architectures.
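The conflict condition described above can be summarized with a minimal C sketch; the cycle model, structure, and function names here are illustrative assumptions, not taken from the disclosure:

```c
#include <stdbool.h>

/* One pipeline cycle, reduced to the two program-memory-bus users that
 * matter for the conflict described above. Field names are illustrative. */
typedef struct {
    bool fetch_from_pm;  /* instruction fetch phase targets program memory */
    bool data_access_pm; /* memory access phase of a preceding load/store
                            also targets program memory */
} cycle_t;

/* A fetch stalls for a cycle when both phases need the program memory bus
 * at once; such an instruction is a candidate for the conflict cache. */
static bool is_bus_conflict(const cycle_t *c)
{
    return c->fetch_from_pm && c->data_access_pm;
}
```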
Further, conventional instruction cache architectures exploit the locality of code to maximize cache hits. Most cache architectures suffer from performance degradation due to cache thrashing, i.e., loading an instruction into the cache and then evicting it, while it is still needed, before it can be used by the computer system. Cache thrashing is, of course, undesirable, as it reduces the performance gains of caching.
Conventional techniques reduce cache thrashing by increasing the cache size, increasing cache associativity, adding a victim cache, and so on. However, these techniques come with overheads such as extra hardware, increased cache-hit access time, and/or higher software overhead. Another conventional technique identifies frequently executed instructions through code profiling and locks the cache through software to minimize cache thrashing. However, this technique requires additional overhead in that the user must profile the code and add extra instructions to the code to lock the cache. Further, this can make the code very cumbersome.
SUMMARY OF THE INVENTION
According to an aspect of the subject matter, there is provided a method for reducing cache thrashing in a DSP, comprising the steps of dynamically enabling caching of instructions upon encountering current frequently executed instructions in a program, and dynamically disabling the caching of the instructions upon encountering an exit point associated with the frequently executed instructions.
According to another aspect of the subject matter, there is provided a method for self-configuring a cache memory in a digital signal processor, comprising determining during run-time execution of a program whether a current instruction is coming from an external main memory or an internal memory, and outputting an execution-space control signal based on the determination. If the code is executed from the internal memory, the method determines whether a fetch phase of the current instruction coincides with the memory access phase of a preceding load or store instruction on the program memory bus and, if so, outputs a conflict instruction load enable signal so that the cache memory behaves like a conflict cache and stores the current instruction upon receiving the execution-space control signal. If the code is executed from the external memory, the method enables a traditional instruction load enable signal so that the cache memory behaves like a traditional cache and stores the current instruction upon receiving the execution-space control signal.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.
In the following detailed description of the various embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
The terms “cache”, “cache memory”, “instruction cache memory”, and “conflict cache memory” are used interchangeably throughout the document. Also, the terms “thrashing” and “cache thrashing” are used interchangeably throughout the document. In addition, the terms “code”, “instructions”, and “program” are used interchangeably throughout the document. In addition, the term “current frequently executed instructions” means the first-encountered one or more frequently executed instructions in the program during run-time.
In the example method 100, at step 110, current frequently executed instructions in a program are dynamically identified during run-time. At step 120, current instructions are cached upon encountering the current frequently executed instructions in the program by dynamically enabling the instruction cache memory. Generally, instruction cache memory is useful only if the same instruction is required again before it is thrashed during run-time of the program. In some embodiments, the instruction cache is therefore enabled only for those instructions which have a higher probability of recurrence, to reduce thrashing.
In some embodiments, caching of the instructions is dynamically disabled upon encountering an exit point in the current frequently executed instructions. The exit point refers to an exit found in the frequently executed instructions, such as a loop termination, a call return, and the like. At step 130, an N-bit up-counter is incremented upon caching each instruction of the current frequently executed instructions in the instruction cache memory. In these embodiments, the N-bit up-counter has a number of states that is equal to the number of entries available in the instruction cache memory.
At step 140, the method 100 determines whether the exit point in the current frequently executed instructions is encountered before the N-bit up-counter reaches saturation. If the exit point has not been encountered, the method 100 goes to step 150. At step 150, the method 100 determines whether the N-bit up-counter has reached saturation. If the N-bit up-counter has not reached saturation, the method 100 goes to step 120 and repeats steps 120-150. If the N-bit up-counter has reached saturation, the method 100 goes to step 160 and dynamically disables caching of the current frequently executed instructions. In these embodiments, saturation of the N-bit up-counter signifies that the instruction cache memory is saturated with instructions.
If, at step 140, the exit point in the current frequently executed instructions is encountered before the N-bit up-counter reaches saturation, the method 100 goes to step 160 and dynamically disables caching of the current frequently executed instructions.
At step 170, the method 100 determines whether there are next frequently executed instructions in the program. If so, the method 100 goes to step 120 and repeats steps 120-170; in these embodiments, the instruction cache memory is dynamically re-enabled upon encountering the next frequently executed instructions. If there is no other frequently executed set of instructions in the program, the method 100 goes to step 110 and repeats steps 110-170.
In the case of a hardware loop or other frequently occurring code whose length is greater than that of the instruction cache memory, thrashing can occur, causing a performance loss. As described above, the proposed thrashing-aware scheme dynamically disables caching of the current frequently executed instructions once the instruction cache memory reaches saturation. The instruction cache memory is re-enabled when either the loop including the frequently executed instructions terminates or a nested loop starts executing during run-time. This technique improves performance by reducing thrashing and increasing the hit-ratio during run-time of the program. The above-described thrashing-aware technique is generally suitable for small instruction cache memories.
For example, in the case of a DSP having a small cache memory of 32 entries, the cache memory is very susceptible to thrashing if every instruction is cached during run-time. Thrashing can lead to performance loss for big loops (i.e., loop sizes greater than about 32 instructions) or for call/CJUMP-based subroutines that are greater than about 32 instructions. To avoid this problem, a 5-bit up-counter that counts 32 ACAM (address content addressable memory) loads can be used in conjunction with instruction-based caching, including a decoder logic circuit which decodes the frequently executed instructions, such as loops, calls, nested loops, negative jumps, and the like as described above, to increase the cache hit-ratio. In this scenario, the 5-bit up-counter starts incrementing, upon encountering frequently executed instructions, with every instruction load to the instruction cache memory until it reaches saturation at 32 loads. The instruction cache memory is disabled for that particular loop/call upon saturation of the 5-bit up-counter.
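A minimal C sketch of this 5-bit saturating-counter scheme follows; it models in software what the decoder logic circuit and up-counter do in hardware, and all names and the callback structure are illustrative assumptions rather than part of the disclosure:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_ENTRIES 32u  /* instruction cache entries; 5-bit counter saturates here */

typedef struct {
    uint8_t loads;           /* models the 5-bit up-counter: cache loads so far */
    bool    caching_enabled; /* instruction cache load enable */
} thrash_ctl_t;

/* Decoder logic recognized a frequently executed region: a loop, call,
 * nested loop, or negative (backward) jump. */
static void on_region_entry(thrash_ctl_t *t)
{
    t->loads = 0;
    t->caching_enabled = true;  /* dynamically (re-)enable caching */
}

/* Called once per instruction loaded into the instruction cache. */
static void on_cache_load(thrash_ctl_t *t)
{
    if (t->caching_enabled && ++t->loads >= CACHE_ENTRIES)
        t->caching_enabled = false; /* counter saturated: cache is full, so
                                       stop caching to avoid thrashing */
}

/* Exit point (loop termination, call return) reached before saturation. */
static void on_region_exit(thrash_ctl_t *t)
{
    t->caching_enabled = false; /* disabled until the next region is decoded */
}
```

Under this model, a 48-instruction loop caches only its first 32 instructions and then leaves them resident, instead of continuously overwriting them on every iteration.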
The following equation illustrates the benefits of using the above-described technique to reduce thrashing and increase hit-ratio during run-time of a program:
Consider a case where an instruction cache memory has X entries and a frequently occurring set of instructions (a code segment) of length Y executes N times.
For Conventional Cache Architecture:
If Y < X, then the hit-ratio = (N − 1)/N
If X < Y < 2X, then the hit-ratio = (Y − (Y − X)·2)(N − 1)/(N·Y) = (2X − Y)(N − 1)/(N·Y)
If Y > 2X, then the hit-ratio = 0
For Thrashing-Aware Cache Architecture:
If Y < X, then the hit-ratio = (N − 1)/N
If Y > X, then the hit-ratio = X(N − 1)/(N·Y)
Now, for X < Y < 2X, the cache-hit advantage factor of the thrashing-aware cache architecture over the conventional cache architecture is
X/(Y − (Y − X)·2) = X/(2X − Y)
Since Y > X implies 2X − Y < X, the cache-hit advantage factor X/(2X − Y) is always greater than 1 for X < Y < 2X. This confirms that the hit-ratio of the thrashing-aware cache architecture is always greater than that of the conventional cache architecture in this range. Similarly, for Y > 2X, the conventional cache architecture returns zero hits, whereas the thrashing-aware cache architecture continues to return X hits per iteration.
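As an illustrative numeric check (the specific values are assumptions, not from the original disclosure), let X = 32 entries, Y = 48 instructions, and N = 100 iterations. The conventional architecture yields a hit-ratio of (2·32 − 48)·99/(100·48) = 1584/4800 ≈ 0.33, while the thrashing-aware architecture yields 32·99/(100·48) = 3168/4800 ≈ 0.66, for a cache-hit advantage factor of 32/(2·32 − 48) = 2.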
The above example clearly illustrates that the thrashing-aware cache architecture gives a better hit-ratio than the conventional cache architecture, by combining caching of the frequently executed instructions with exiting upon cache counter saturation, without increasing the cache size or degrading cache-hit access time. In some embodiments, the current frequently executed instructions are held in the instruction cache memory until a next set of frequently executed instructions in the program is identified and caching is enabled for it. In these embodiments, caching of instructions is dynamically re-enabled upon encountering the next frequently executed instructions.
Referring now to FIG. 2, an example thrashing-aware cache architecture is illustrated, including an instruction cache memory 210 and a computational unit 240 that includes a decoder logic circuit 250, an N-bit up-counter 260, an enabler/disabler logic circuit 270, and a cache controller 280.
In operation, the computational unit 240 coupled to the instruction cache memory 210 dynamically enables loading of instructions upon encountering frequently executed instructions. Further, the computational unit 240 dynamically disables loading the instructions upon encountering an exit point associated with the frequently executed instructions in a program.
In some embodiments, the N-bit up-counter 260 has a number of states that is equal to a predetermined number of entries in the instruction cache memory 210. In these embodiments, the decoder logic circuit 250 locates the current frequently executed instructions in the program. Also, in these embodiments, the enabler/disabler logic circuit 270 enables storing of the instructions associated with the located frequently executed instructions via the cache controller 280. The N-bit up-counter 260 then increments upon storing each instruction in the instruction cache memory 210. The enabler/disabler logic circuit 270 then disables the storing of the instructions in the instruction cache memory 210 via the cache controller 280 upon the N-bit up-counter 260 reaching a saturation point or upon encountering the exit point in the instructions associated with the frequently executed instructions before reaching the saturation point.
In some embodiments, the instruction cache memory 210 has a predetermined number of entries 205. Also, in these embodiments, the N-bit up-counter 260 has a number of states that is equal to the predetermined number of entries in the instruction cache memory 210. The N-bit up-counter 260 then increments a counter value for each instruction that is stored in the instruction cache memory 210. The enabler/disabler logic circuit 270 then disables the storing of the instructions associated with the frequently executed instructions via the cache controller 280 upon the N-bit up-counter 260 reaching a counter value equal to the number of states in the N-bit up-counter 260, or upon encountering the exit point in the instructions before the counter value in the N-bit up-counter 260 becomes equal to the number of states in the N-bit up-counter 260.
The operation of the thrashing-aware cache architecture shown in FIG. 2 is substantially as described above with reference to the method 100. The self-configuring operation is now explained with reference to a method 300, which begins at step 310 by determining, during run-time, whether a current instruction in a program is coming from an external main memory or an internal memory.
Based on the determination at step 310, the method 300 goes to step 330 and outputs an internal execution-space control signal if the current instruction is coming from the internal memory. At step 350, the method 300 determines whether the fetch phase of the current instruction coincides with the memory access of a preceding load or store instruction. If the fetch phase of the current instruction coincides with the memory access of the preceding load or store instruction, which generally indicates a conflict condition, the method 300 goes to step 360 and outputs a conflict instruction load enable signal so that the cache memory behaves like a conflict cache. If the fetch phase of the current instruction does not coincide with the memory access of the preceding load or store instruction, the method 300 goes to step 310 via step 355 to fetch a next current instruction and repeats steps 310-360.
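The per-instruction decision of the method 300 reduces to a small selection function, sketched below in C; the enum and function names are illustrative assumptions, not part of the disclosure:

```c
#include <stdbool.h>

typedef enum { EXEC_INTERNAL, EXEC_EXTERNAL } exec_space_t;  /* step 310 result */
typedef enum { LOAD_NONE, LOAD_CONFLICT, LOAD_TRADITIONAL } load_enable_t;

/* exec_space: where the current instruction is executing from (steps 310/330).
 * fetch_conflicts: fetch phase coincides with the memory access phase of a
 * preceding load/store on the program memory bus (step 350). */
static load_enable_t select_load_enable(exec_space_t exec_space,
                                        bool fetch_conflicts)
{
    if (exec_space == EXEC_EXTERNAL)
        return LOAD_TRADITIONAL;  /* behave like a traditional cache */
    if (fetch_conflicts)
        return LOAD_CONFLICT;     /* behave like a conflict cache (step 360) */
    return LOAD_NONE;             /* no conflict: fetch next instruction (step 355) */
}
```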
Referring now to FIG. 4, an example self-configuring cache architecture is illustrated, including a cache memory 410, an internal memory 420, an external memory 430, and a computational unit 440 that includes an execution-space decode logic circuit 450, a cache control logic circuit 460, a conflict instruction cache enabler 470, a traditional instruction cache enabler 480, a MUX 490, and a cache controller 495.
In operation, the execution-space decode logic circuit 450 dynamically determines whether a current instruction in an executable program is coming from the external memory 430 or the internal memory 420. The cache control logic circuit 460 then configures the cache memory 410 to behave like a traditional cache or a conflict cache based on an outcome of the determination by the execution-space decode logic circuit 450. The cache control logic circuit 460 then transfers the current instruction to and between the cache memory 410, the internal memory 420 and the external memory 430 based on the configured cache memory.
In some embodiments, the execution-space decode logic circuit 450 determines, during run-time execution of the executable program, whether a current instruction is coming from the external memory 430 or the internal memory 420. The execution-space decode logic circuit 450 then outputs an external execution-space control signal if the current instruction is coming from the external memory 430, and outputs an internal execution-space control signal if the current instruction is coming from the internal memory 420.
In some embodiments, the conflict instruction cache enabler 470 determines whether the current instruction in the executable program has a memory conflict condition and then outputs a conflict instruction load enable signal upon finding the memory conflict condition. The traditional instruction cache enabler 480 then enables a traditional instruction load enable signal for the current instruction in the executable program upon receiving the current instruction from the external memory 430. The MUX 490 then outputs an instruction load enable signal via the cache controller 495 and configures the cache memory 410 to behave like a traditional cache or a conflict cache based on the instruction load enable signal. The instruction load enable signal then transfers the current instruction to and between the cache memory 410, the internal memory 420, and the external memory 430 based on the configuration of the cache memory 410.
In some embodiments, upon finding a memory conflict condition and receiving the internal execution-space control signal from the conflict instruction cache enabler 470, the MUX 490 outputs the instruction load enable signal, enables the cache memory 410 to behave like a conflict cache via the cache controller 495, and transfers the current instruction to and between the internal memory 420, the cache memory 410, and the computational unit 440. In these embodiments, upon receiving the current instruction from the external memory 430 and the traditional instruction load enable signal from the traditional instruction cache enabler 480, the MUX 490 outputs the instruction load enable signal, enables the cache memory 410 to behave like a traditional cache via the cache controller 495, and transfers the current instruction, coming from the external memory 430, to and between the cache memory 410 and the computational unit 440.
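The MUX 490 behavior described above amounts to a two-way select gated by the execution-space control signal; a minimal sketch, assuming active-high signals with illustrative names:

```c
#include <stdbool.h>

/* The internal execution-space control signal selects the conflict-cache
 * enabler's output; otherwise the traditional-cache enabler's output is
 * passed through as the instruction load enable signal. */
static bool mux_instruction_load_enable(bool internal_exec_space,
                                        bool conflict_load_enable,
                                        bool traditional_load_enable)
{
    return internal_exec_space ? conflict_load_enable
                               : traditional_load_enable;
}
```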
Although the flowcharts 100 and 300 shown in FIGS. 1 and 3 illustrate the operations in a particular serial order, the embodiments are not limited to that order; various operations can be performed in a different order, or substantially concurrently, without departing from the scope of the subject matter.
The above thrashing-aware architecture increases digital signal processor performance by reducing cache thrashing and increasing the hit-ratio. Further, the above process lowers power dissipation by reducing the loading of unwanted instructions into the cache memory. Further, the above thrashing-aware process is suitable for the small cache sizes used in digital signal processors.
The above-described self-configuring cache architecture facilitates significantly improved cache functionality by using the same cache hardware as both a traditional cache and a conflict cache, thereby eliminating the need for two physically different caches in a DSP. The above-described context-switching self-configuring cache seamlessly switches between conflict cache and traditional cache behavior, and vice versa, without any user intervention. The above process uses the same cache hardware as a conflict cache to avoid resource conflicts during code execution from internal memory, and as a traditional instruction cache to improve performance during code execution from external memory, where there is no resource conflict.
The above techniques can be implemented using an apparatus controlled by a processor, where the processor is provided with instructions in the form of a computer program constituting an aspect of the above technique. Such a computer program may be stored in a storage medium as computer-readable instructions, so that the storage medium constitutes a further aspect of the present subject matter.
The above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those skilled in the art. The scope of the subject matter should therefore be determined by the appended claims, along with the full scope of equivalents to which such claims are entitled.
As shown herein, the present subject matter can be implemented in a number of different embodiments, including various methods, a circuit, an I/O device, a system, and an article comprising a machine-accessible medium having associated instructions.
Other embodiments will be readily apparent to those of ordinary skill in the art. The elements, algorithms, and sequence of operations can all be varied to suit particular requirements. The operations described above with respect to the methods illustrated in FIGS. 1 and 3 can be implemented in hardware, software, or a combination thereof.
In the foregoing detailed description of the embodiments of the invention, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description of the embodiments of the invention, with each claim standing on its own as a separate preferred embodiment.
Claims
1. A method for reducing cache thrashing in a digital signal processor (DSP), comprising:
- dynamically enabling caching of instructions upon encountering current frequently executed instructions in a program; and
- dynamically disabling the caching of the instructions upon encountering an exit point in the frequently executed instructions.
2. The method of claim 1, further comprising:
- dynamically identifying the current frequently executed instructions during run-time of the program.
3. The method of claim 2, further comprising:
- holding the current frequently executed instructions in instruction cache memory until identifying and enabling caching of the instructions in next frequently executed instructions.
4. The method of claim 1, wherein the frequently executed instructions comprise instructions selected from the group consisting of a hardware loop, a nested hardware loop, a call, and a backward jump.
5. The method of claim 1, wherein disabling the caching of the instructions upon encountering an exit point associated with the frequently executed instructions comprises:
- incrementing an N-bit up-counter upon caching each of the instructions associated with the current frequently executed instructions into the instruction cache memory, wherein the N-bit up-counter has a number of states equal to the number of entries available in the instruction cache memory; and
- dynamically disabling the caching of the instructions associated with the current frequently executed instructions into the instruction cache memory upon the N-bit up-counter reaching a counter value equal to the number of states in the N-bit up-counter or upon encountering an exit point, associated with the frequently executed instructions, before the counter value becomes equal to the number of states in the N-bit up-counter.
6. The method of claim 1, further comprising:
- dynamically re-enabling caching of instructions upon encountering next frequently executed instructions.
7. An article comprising:
- a storage medium having instructions, that when executed by a computing platform, result in execution of a method for reducing cache thrashing comprising: dynamically enabling caching of instructions upon encountering current frequently executed instructions in a program; and dynamically disabling the caching of the instructions upon encountering an exit point in the frequently executed instructions.
8. The article of claim 7, further comprising:
- dynamically identifying the current frequently executed instructions during run-time.
9. The article of claim 8, further comprising:
- holding the instructions in instruction cache memory until identifying a next frequently executed instructions and enabling caching of the instructions in the next frequently executed instructions.
10. The article of claim 7, wherein the frequently executed instructions comprise instructions selected from the group consisting of a hardware loop, a nested hardware loop, a call, and a backward jump.
11. The article of claim 7, wherein disabling the caching of the instructions upon encountering an exit point associated with the frequently executed instructions comprises:
- incrementing an N-bit up-counter upon caching each of the instructions into the instruction cache memory, wherein the N-bit up-counter has a number of states equal to the number of entries available in the instruction cache memory; and
- dynamically disabling the caching of the instructions into the instruction cache memory upon the N-bit up-counter reaching a counter value equal to the number of states in the N-bit up-counter or upon encountering an exit point, associated with the frequently executed instructions, before the counter value becomes equal to the number of states in the N-bit up-counter.
12. A digital signal processor, comprising:
- an instruction cache memory; and
- a computational unit coupled to the instruction cache memory to dynamically enable loading of instructions upon encountering frequently executed instructions in a program and to dynamically disable loading of instructions upon encountering an exit point associated with the frequently executed instructions.
13. The digital signal processor of claim 12, wherein the computational unit comprises:
- an N-bit up-counter having a number of states that is equal to a predetermined number of entries in the instruction cache memory;
- a decoder logic circuit that locates the current frequently executed instructions in the program;
- a cache controller; and
- an enabler/disabler logic circuit that enables caching of the instructions associated with the located current frequently executed instructions via the cache controller, wherein the N-bit up-counter increments upon storing each instruction in the instruction cache memory, and wherein the enabler/disabler circuit disables the caching of the instructions in the instruction cache memory via the cache controller upon the N-bit up-counter reaching a saturation point or upon encountering an exit point in the instructions associated with the frequently executed instructions before reaching the saturation point.
14. The digital signal processor of claim 13, wherein the instruction cache memory has a predetermined number of entries, wherein the N-bit up-counter has a number of states that is equal to the predetermined number of entries in the internal cache memory, wherein the N-bit up-counter increments a counter value for each instruction stored in the instruction cache memory, and wherein the enabler/disabler logic circuit disables the storing of the instructions via the cache controller upon the N-bit up-counter reaching a counter value equal to the number of states in the N-bit up-counter or upon encountering an exit point, associated with the frequently executed instructions, before the counter value becomes equal to the number of states in the N-bit up-counter.
15. The digital signal processor of claim 12, wherein the frequently executed instructions comprise instructions selected from the group consisting of a hardware loop, a nested hardware loop, a call, and a backward jump.
16. A self-configuring cache architecture for a digital signal processor, comprising:
- cache memory;
- an internal memory;
- an external memory; and
- a computational unit comprising:
- an execution-space decode logic circuit that dynamically determines whether a current instruction in an executable program is coming from an external memory or an internal memory; and
- a cache control logic circuit that configures the cache memory to behave like a traditional cache or a conflict cache based on the outcome of the determination, wherein the cache control logic circuit transfers the current instruction to and between the cache memory, the internal memory, and the external memory based on the configuration of the cache memory.
17. The self-configuring cache architecture of claim 16, wherein the execution-space decode logic circuit determines, during run-time execution of the executable program, whether the current instruction is coming from the external memory or the internal memory, and then outputs an external execution-space control signal if the current instruction is coming from the external memory and outputs an internal execution-space control signal if the current instruction is coming from the internal memory.
18. The self-configuring cache architecture of claim 17, wherein the cache control logic circuit comprises:
- a cache controller;
- a conflict instruction cache enabler that determines whether the current instruction in the executable program has a memory conflict condition and then outputs a conflict instruction load enable signal upon finding the memory conflict condition;
- a traditional instruction cache enabler that enables a traditional instruction load enable signal for the current instruction in the executable program upon receiving the current instruction from the external memory; and
- a MUX, coupled to the execution-space decode logic circuit, the conflict instruction cache enabler, and the traditional instruction cache enabler, that outputs an instruction load enable signal via the cache controller to configure the cache memory to behave like a traditional cache or a conflict cache based on the instruction load enable signal, wherein the instruction load enable signal transfers the current instruction to and between the cache memory, the internal memory, and the external memory based on the configuration of the cache memory.
19. The self-configuring cache architecture of claim 18, wherein the MUX outputs the instruction load enable signal and enables the cache memory to behave like a conflict cache via the cache controller and transfers the current instruction to and between the internal memory, cache memory and the computational unit upon finding the memory conflict condition and receiving the internal execution-space control signal.
20. The self-configuring cache architecture of claim 19, wherein the MUX outputs the instruction load enable signal and enables the cache memory to behave like a traditional cache via the cache controller and transfers the current instruction, coming from the external memory, to and between the cache memory and the computational unit upon receiving the current instruction from the external memory and the traditional instruction load enable signal from the traditional instruction cache enabler.
21. A method for self-configuring a cache memory in a digital signal processor, comprising:
- determining during run-time execution of a program whether a current instruction is coming from an external memory or an internal memory;
- outputting an external execution-space control signal or an internal execution-space control signal based on the determination;
- determining whether a fetch phase of the current instruction coincides with the memory access phase of a preceding load or store instruction on a program memory bus;
- if so, outputting a conflict instruction load enable signal so that the cache memory behaves like a conflict cache and stores the current instruction in the cache memory upon receiving the internal execution-space control signal; and
- outputting a traditional instruction load enable signal so that the cache memory behaves like a traditional cache and then stores the current instruction in the cache memory upon receiving the external execution-space control signal.
Type: Application
Filed: Jan 17, 2007
Publication Date: Jul 17, 2008
Inventors: Tushar Prakash Ringe (Indore), Abhijit Giri (Bangalore)
Application Number: 11/623,760
International Classification: G06F 12/08 (20060101);