Variable length pipeline with parallel functional units
Method and apparatus for implementing a variable length pipeline in a packet-driven memory control system, including a command front end and one or more parallel command sequencers. The command front end decodes an external command packet into an internal command and issues it to a selected one of the command sequencers. The command has associated therewith a desired latency value. A first group of one or more memory control steps for the given command is performed by the command front end if the desired latency value is less than a threshold latency value, or by the selected command sequencer if the desired latency value is greater than or equal to the threshold latency value. The remainder of the memory control steps required for the command are performed by the selected command sequencer. If the first control steps are to be performed by the selected command sequencer, then depending on the desired latency value, the command sequencer further may insert one or more wait states before doing so.
[0001] The following pending application is owned by the assignee of the present application, and its contents are hereby incorporated by reference:
[0002] Serial No. 09/132,158 [Attorney Docket No. SLDM1025] filed Aug. 10, 1998, invented by Gustavson et al. and entitled MEMORY SYSTEM HAVING SYNCHRONOUS-LINK DRAM (SLDRAM) DEVICES AND CONTROLLER.
BACKGROUND TO THE INVENTION
[0003] 1. Field of the Invention
[0004] The present invention relates generally to command processing applications in high bandwidth memory systems.
[0005] 2. Description of the Related Art
[0006] The evolution of the dynamic random access memories used in computer systems has been driven by ever-increasing speed requirements mainly dictated by the microprocessor industry. Dynamic random access memories (DRAMs) have generally been the predominant memories used for computers due to their optimized storage capabilities. This large storage capability comes at the price of slower access time and the requirement for more complicated interaction between memories and microprocessors/microcontrollers than in the case of, say, static random access memories (SRAMs) or non-volatile memories.
[0007] In an attempt to address this speed deficiency, DRAM design has implemented various major improvements, all of which are well documented. Most recently, the transition from Fast Page Mode (FPM) DRAM to Extended Data Out (EDO) DRAMs and synchronous DRAMs (SDRAMs) has been predominant. Further speed increases have been achieved with double data rate (DDR) SDRAM, which synchronizes data transfers on both clock edges. New protocol based memory interfaces have recently been developed to further increase the bandwidth and operating frequencies of synchronous memories.
[0008] As the complexity of these memories has increased, the associated control systems responsible for internally managing the operation of the memories have also become more complex. These command-driven control systems internally must typically process a stream of commands or instructions that overlap in execution time and have programmable latency (time from receipt of command to first control outputs asserted in response). Programmable latency is desirable in such systems in order to allow the memory controller to schedule the use of shared data, address or control buses for optimum usage. Since the processing of two or more commands may be required to occur simultaneously, many control systems implement multiple functional units operating in parallel. The minimum latency of the control system is therefore limited by the need to (i) decode the command control field(s), (ii) determine the programmed latency associated with the identified command, and (iii) issue the command to a number of parallel functional units before the first control output action can be determined for use by the memory.
[0009] A conventional implementation of such a memory system control block comprises a single front end decoding block which decodes external commands and issues internal commands to multiple identical functional elements capable of operating in parallel. The minimum latency therefore cannot be shorter than the time it takes to decode the command in the front end block, plus the time required to issue the command to a parallel functional unit, plus the time that the functional unit takes to initialize and issue its first control action. The common approach to reducing this minimum latency is to replicate the command decoding logic within each parallel functional unit and feed the command stream to all parallel functional units simultaneously, eliminating the issue and initialization delay. This advantage comes at the cost of a large increase in overall logic complexity, redundant logic, and increased power consumption. As frequency and bandwidth requirements increase, there is a need for a memory system control block which makes optimum use of area and power and which can process commands with a lower minimum latency than previously achieved in the prior art.
SUMMARY OF THE INVENTION
[0010] It is therefore an object of the present invention to provide a command processing system for use in a high bandwidth memory interface which processes commands with a minimum latency.
[0011] It is another object of the present invention to provide the command processing system with a minimum increase to the command circuitry.
[0012] According to the invention, roughly described, a packet-driven memory control system which implements a variable length pipeline includes a command front end and one or more parallel command sequencers. The command front end decodes an external command packet into an internal command and issues it to a selected one of the command sequencers. The command has associated therewith a desired latency value. A first group of one or more memory control steps for the given command is performed by the command front end if the desired latency value is less than a threshold latency value, or by the selected command sequencer if the desired latency value is greater than or equal to the threshold latency value. The remainder of the memory control steps required for the command are performed by the selected command sequencer. If the first control steps are to be performed by the selected command sequencer, then depending on the desired latency value, the command sequencer further may insert one or more wait states before doing so.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] FIG. 1 is a simplified block diagram of a synchronous link memory system incorporating features of the invention.
[0014] FIGS. 2A and 2B together represent a simplified block diagram of the synchronous link DRAM (SLDRAM) module or integrated circuit of FIG. 1.
[0015] FIGS. 3A and 3B are schematics and timing diagrams illustrating the conversion of external command, address and flag signals of FIG. 2 into internal command and address signals to be processed.
[0016] FIG. 4 is a simplified block diagram of the command processing pipeline incorporating an embodiment of the invention.
[0017] FIG. 5A is a conceptual diagram illustrating the processing of a minimum latency page read command.
[0018] FIG. 5B is a conceptual diagram illustrating the processing of a non-minimum latency page read command.
[0019] FIG. 6 is a block diagram of a command sequencer according to an embodiment of the invention.
[0020] FIG. 7 is a block diagram of a control signal variable delay circuit with a delay resolution shorter than one clock period.
[0021] FIG. 8 is a block diagram illustrating input and output circuits connected to shared bus signal lines CTL/ADDR in FIG. 4.
[0022] FIG. 9 is a timing diagram illustrating the operation of the circuits of FIG. 8.
DETAILED DESCRIPTION OF THE INVENTION
[0023] FIG. 1 provides a simplified view of a memory system employing a packet-based synchronous link architecture (SLDRAM). The system, which is described more fully in the above-incorporated patent application, generally comprises a command module 150 (typically implemented by a memory controller) and a plurality of SLDRAM modules or individual ICs 110-180. The command link 151 is used by the command module to issue commands, a command system clock, control signals and address information to each of the SLDRAMs. Data is written into or read out of the SLDRAMs via the DataLinks 155a and 156a in synchronization with source-synchronous clocks 155b, 155c, 156b and 156c.
[0024] Within this system, an embodiment of the command processing in accordance with the invention will be described.
[0025] FIGS. 2A and 2B together illustrate the general structure of an SLDRAM memory integrated circuit of FIG. 1. The structure and operation of the circuit are described broadly in the above-incorporated patent application. The command decode and sequencer unit 504 will be described in more detail below.
[0026] FIG. 3A illustrates the input stage of the command decoder 504 of FIG. 2A. The incoming external command and address signals CA[9:0], along with the FLAG and command clock CCLK signals, are received via input cells each comprising an input protection device 50, an input buffer 51 and two D-type flip-flops 52 and 53 for latching the command/address and FLAG signals on both rising and falling edges of the command clock CCLK. As a result, the eleven (11) incoming signals made up of FLAG and CA[9:0] operating at 400 Mbps are converted internally into twenty-two (22) internal command/address signals consisting of FLAG_R, FLAG_F, CA_R[9:0], and CA_F[9:0], operating at 200 Mbps. The command clock also has a delay locked loop (DLL) and vernier delay in its input path, which are used to latch the incoming command and address signals at the appropriate time within the system.
[0027] FIG. 3B illustrates the relative timing of the input stage. CCLK is a free-running clock. Upon assertion of the FLAG signal, command/address words begin to be latched on the rising edge of the delayed internal version of the command clock CCLKH. On the subsequent rising edge of the internal flag signal FLAG_R, the internal command/address words begin to be accepted into the system at one half the frequency of the external CCLK. The command/address words are alternated between the rising and falling edge command/address internal busses CA_R[9:0] and CA_F[9:0] as indicated by A0, A1, A2, A3, etc.
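The double-edge capture described above can be pictured with a short Python sketch. It is only an illustration: the function name `demux_ddr` and the list-of-words representation are assumptions, as is the choice that the first word of a packet is captured on a rising edge.

```python
def demux_ddr(words):
    """Split a per-tick word stream into rising- and falling-edge streams.

    `words` holds one sample per clock tick (half a CCLK period). Assumes,
    for illustration, that the first sample is captured on a rising edge.
    """
    ca_r = words[0::2]  # words latched on rising edges (CA_R bus)
    ca_f = words[1::2]  # words latched on falling edges (CA_F bus)
    return ca_r, ca_f


# Aggregate bandwidth is unchanged: 11 lines at 400 Mbps equal 22 at 200 Mbps.
EXTERNAL_MBPS = 11 * 400
INTERNAL_MBPS = 22 * 200

ca_r, ca_f = demux_ddr(["A0", "A1", "A2", "A3"])
print(ca_r, ca_f)  # ['A0', 'A2'] ['A1', 'A3']
```

Each internal bus therefore carries every other word, which is why the internal logic can run at half the external rate.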
[0028] FIG. 4 is a block diagram illustrating the command processing pipeline according to an embodiment of the invention. A command decoder front end CFE 200 receives the command packet as four consecutive 10-bit words on CA[9:0]. It then internally assembles and decodes the command packet into a 31-bit internal command COM[30:0] which is issued to a selected one of a plurality of parallel functional units or command sequencers 201-208. The CFE 200 also generates a 6-bit command delay signal COMDEL[5:0] which is determined by comparing the latency in the selected latency register with a predetermined threshold. The CFE 200 initializes each of the sequencers by asserting the ISSUE0-ISSUE7 signals. The available or busy state of a sequencer is fed back to the CFE via the BUSY0-BUSY7 signals. Both the CFE 200 and each of the sequencers also have a multi-bit control/address output CTL/ADD which is used to send out the control signals to the memory banks, the data path, etc. The CTL/ADD signal coming from the CFE 200 corresponds to control signals being generated by the CFE itself, as will be described in more detail below.
[0029] With reference to FIG. 2A and FIG. 4, in accordance with an embodiment of the present invention, an SLDRAM memory device receives streams of command packets as 4 consecutive 10-bit words on the CA[9:0] bus. Each 40 bit command packet is assembled and then decoded by the command front end or CFE block 200. For SLDRAM commands that utilize user-programmable latencies (such as memory reads, memory writes, and register reads), the CFE 200 selects the appropriate latency value based upon the command type and issues the command packet and latency value to one of eight identical parallel functional units called command sequencers 201-208, with sequencer 0, 201 having the highest priority and sequencer 7, 208 having the lowest priority. The determination of whether to perform the first group of control steps within the CFE 200 or to forward the entire command to a selected command sequencer depends on the command's specified latency. Once a command is decoded, if the desired latency is determined to be shorter than a predetermined threshold, then the CFE 200 executes the first several control steps using control logic located within the CFE block 200, and simultaneously issues the command to a parallel functional unit and initializes that unit. Subsequently, the control action sequence is seamlessly taken over by the selected command sequencer which recognizes (based on the latency value accompanying the command) that the CFE 200 has already performed the initial control actions. The selected command sequencer therefore skips performing these actions which it would normally do, and instead proceeds directly to execute the remaining control actions necessary to process the command. For example, if a page read command is dispatched by the system's command module 150, the CFE 200 within a particular SLDRAM recognizes this command as a special case if the page read latency register located in the SLDRAM device was programmed to the minimum value by a previous command. 
When this occurs, special logic in the CFE 200 performs the first two control actions for a page read (column open (select) for the low array, and initiation of a read operation within the data output path) simultaneously with issuing the command to an idle sequencer. The sequencer itself is designed to recognize the special case of a minimum latency page read and will skip the first two control steps performed by the special logic in the CFE 200 and instead directly proceed to the remaining instructions to complete the page read command.
[0030] FIG. 5A illustrates the relative timing of processing of a minimum latency page read operation by the CFE 200 and a selected command sequencer. Upon a predetermined rising edge of the command clock, CCLKH, the CFE opens a selected column, initiates the data path (precharging and equalizing of data buses) and issues the page read command to the available sequencer, all within the first CCLK period. Then, during a subsequent CCLK period, the command sequencer opens a second column, column high (previously column low was opened) to initiate the read of the second column. Prior to the end of this subsequent clock cycle, the read data path begins to receive the low data bits corresponding to the low column which was opened by the CFE in the first CCLK cycle. Subsequently, after some delay, the read data path receives the high data bits corresponding to the high column which was opened by the sequencer, as described above. In this fashion, the labour (reading column low and column high) was divided between the CFE and a selected sequencer by having the CFE perform the first portion of the operation, and the sequencer perform the remaining portion of the operation.
[0031] If, on the other hand, the desired latency is determined to be greater than or equal to the predetermined threshold, i.e., if the actual page read latency register is programmed to a value greater than the minimum latency, then the CFE 200 executes none of the control actions and instead forwards them all to an available command sequencer. In this case, the selected sequencer also recognizes that the page read latency is greater than minimum and performs all control actions to accomplish the page read (after inserting any necessary wait states).
[0032] FIG. 5B illustrates the relative timing of processing of a non-minimum latency page read operation by the CFE 200 and a selected command sequencer. Upon a predetermined rising edge of the command clock, CCLKH, the CFE 200 recognizes that the requested command is a non-minimum latency command by the value written into the latency register, and immediately issues the command to an available command sequencer within the first CCLK period. The selected sequencer is initialized and a number of latency states are inserted depending on the value in the page read delay register. Once the latency wait states have elapsed, the sequencer proceeds to execute the command in a manner similar to that described in FIG. 5A, i.e. a column open low is performed along with the initialization of the data path. Subsequently, a column open high read operation is performed during the second clock cycle and during that same cycle, low data starts to appear on the read data path. Finally, an optional row close command is executed during the third clock cycle and the high data appears on the read data path. By optionally performing the initial page read control actions simultaneously with the issuing of the page read command to an idle sequencer, the minimum page read latency is reduced by one clock cycle.
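The two cases of FIGS. 5A and 5B can be summarized in a minimal Python sketch. The function name, the step strings and the unitless integer latencies are illustrative assumptions; the sketch models only which unit performs which step for a page read, not the timing of those steps.

```python
from typing import List, Tuple

# Step names follow the page read sequence described in the text.
FIRST_STEPS = ["open column low", "initiate DPO transfer"]
REMAINING_STEPS = ["open column high", "optional row close"]
ISSUE = "issue command to sequencer"


def split_page_read(latency: int, threshold: int) -> Tuple[List[str], List[str]]:
    """Return (CFE steps, sequencer steps) for one page read command."""
    if latency < threshold:
        # Short latency: the CFE performs the first steps while issuing.
        return FIRST_STEPS + [ISSUE], list(REMAINING_STEPS)
    # Otherwise the sequencer performs every memory control step itself,
    # after inserting whatever wait states the latency calls for.
    return [ISSUE], ["insert wait states"] + FIRST_STEPS + REMAINING_STEPS


print(split_page_read(3, 4))
print(split_page_read(4, 4))
```

In both cases the same memory control steps are performed in the same order; only the unit that performs the first group changes.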
[0033] In general, the CFE block 200 has the following procedure for receiving and processing a command:
[0034] Assemble a 40-bit command packet when FLAG is asserted
[0035] Compare the packet ID with the device ID in order to determine whether the packet is addressed to this device
[0036] Decode 6-bit command field to:
[0037] Determine the command type (buffered, immediate, etc.)
[0038] Determine command latency
[0039] Issue command if all the following conditions are satisfied:
[0040] Command field contains a valid opcode
[0041] Command ID matches device ID
[0042] FLAG protocol is obeyed, i.e., the FLAG bit is asserted for one clock tick only (half a clock period)
[0043] Command processing mode is enabled
[0044] An idle command sequencer is available
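The issue conditions listed above amount to a conjunction of checks followed by the selection of an idle sequencer. The following Python sketch illustrates this under stated assumptions: the class and field names are hypothetical, and choosing the lowest-numbered idle sequencer simply reflects the priority order mentioned earlier (sequencer 0 highest), not a confirmed arbitration circuit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PacketChecks:
    """Results of the CFE checks; field names are illustrative only."""
    opcode_valid: bool
    id_matches: bool
    flag_protocol_ok: bool
    processing_enabled: bool


def pick_sequencer(checks: PacketChecks, busy: Sequence[bool]) -> Optional[int]:
    """Return the index of the sequencer to issue to, or None if not issued."""
    if not (checks.opcode_valid and checks.id_matches
            and checks.flag_protocol_ok and checks.processing_enabled):
        return None
    for index, is_busy in enumerate(busy):
        if not is_busy:
            return index  # lowest-numbered idle sequencer
    return None  # every sequencer is busy


ok = PacketChecks(True, True, True, True)
print(pick_sequencer(ok, [True, False, False, False, False, False, False, False]))
```

Here `busy` stands in for the BUSY0-BUSY7 signals, and a returned index corresponds to asserting the matching ISSUE signal.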
[0045] FIG. 6 illustrates a command sequencer 321 corresponding to one of the command sequencers 201-208 in FIG. 4, according to an embodiment of the invention. The command signals COM[30:0] are received by a latch 300 which is enabled by a signal G from an Idle/Active module select block 303 (for a more detailed breakdown of the command packet, see Table 2.0 in the above-incorporated patent application). The output of the latch 300 is broken down into bank address signals BNK[2:0], register address signals REGA[3:0], column address signals COLA[6:0] and the actual command instructions CM[5:0]. The BNK[2:0] signals are decoded by a 3-to-8 decoder 304 and then fed into output buffers 314 for high and low column block address signals YBKLO[7:0] and YBKHI[7:0], as well as being input into a miscellaneous decoder 317 for closing an open row via RCLOSE[7:0]. The register addresses REGA[3:0] are output via buffers 315, while the column addresses COLA[6:0] are latched and then output via buffers 316; the LSB COLA[0] is optionally inverted by an LSB inverter 305 for performing the second half of the word burst operation. The miscellaneous decoder 317 also receives the command instruction signals CMD[5:0] as inputs. The required command latency delay is input into the sequencer via lines COMDEL[5:1] into a 5-bit counter 301, with the least significant bit COMDEL0 input into a latch 302. The counter 301 and latch 302 also receive the G control signal from the Idle/Active module select block 303. The output of the counter 301 feeds into read latency decoders 306, read command decoders 307, write latency decoders 308 and write command decoder 309. If the sequencer is available, the Idle/Active module select block 303 generates and asserts an ACTIVE signal in response to an asserted ISSUE signal from the CFE. The ACTIVE signal in turn enables the decoder combining circuitry, AND gates 310 and 311.
OR gate 312 selects between read and write command decoder outputs from 310 or 311 respectively to initiate a column operation via block 319. The column operation block 319 also produces a control signal which is used to control the buffers 314, 315 and 316, and also produces the control signals COLENLO and COLENHI for internally enabling the selected columns within the device. If a read command is decoded along with its corresponding latency via 306, 307 and 310, a data output path command encoder 318 is used to generate the data path output control signals DPO[4:0]. If a write command is decoded along with its corresponding latency via 308, 309 and 311, a data input path command encoder 320 is used to generate the data path input control signals DPI[4:0]. The data path output and input command encoders 318 and 320 are also controlled by the LSB from latch 302.
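Two of the address-handling elements of FIG. 6 are simple enough to sketch in Python. The function names are hypothetical, and the active-high one-hot encoding of the decoder output is an assumption; the text states only that BNK[2:0] passes through a 3-to-8 decoder and that COLA[0] may be inverted for the second half of a burst.

```python
def decode_bank(bnk: int) -> int:
    """3-to-8 decode: return an 8-bit one-hot value for BNK[2:0]."""
    if not 0 <= bnk <= 7:
        raise ValueError("BNK is a 3-bit field")
    return 1 << bnk


def second_half_column(cola: int) -> int:
    """Complement COLA[0] of a 7-bit column address."""
    if not 0 <= cola <= 0x7F:
        raise ValueError("COLA is a 7-bit field")
    return cola ^ 1


print(format(decode_bank(5), "08b"))         # 00100000
print(format(second_half_column(6), "07b"))  # 0000111
```

Complementing only the least significant bit lets the second half of a burst address the neighbouring column without a second address transfer.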
[0046] The sequencer 321 is one of eight identical functional units 201-208 as illustrated in FIG. 4. There is no interlocking between the sequencers or between a particular sequencer and the CFE 200. Therefore, the command module (memory controller) must be aware of the actual delay values and schedule commands appropriately. The sequencer performs any one of the following operations:
[0047] all bank read/write commands except for row open which is performed by the command front end CFE
[0048] all page read commands unless actual delay is programmed to minimum, in which case the CFE performs the data path initiate and part of the column open
[0049] all register read and read synch commands unless page read actual delay is programmed to minimum, in which case the CFE performs the data path initiate
[0050] As a further clarification as to the Division of Labour between the CFE and the command sequencers, Tables 1A, 1B and 1C are included below. These tables set forth the memory control steps performed by the CFE or by a sequencer, as the case may be, in response to a received command. As used herein, a “memory control step” is a step which drives the operation of a DRAM bank in a desired manner. The memory control steps set forth in Tables 1A, 1B and 1C are illustrative ones of such steps which are used in the present embodiment.

TABLE 1A Division of Labor - Read Operations

| Command | Command Front End Memory Control Steps | Sequencer Memory Control Steps |
| --- | --- | --- |
| Read Page (BURST4)** | If latency = minimum: open column low, initiate DPO transfer. Issue command to sequencer. | If latency > minimum: insert necessary wait states; open column low, initiate DPO transfer. Then: open column high; optional precharge. |
| Read Page (BURST8)*** | If latency = minimum: open column low, initiate DPO transfer. Issue command to sequencer. | If latency > minimum: insert necessary wait states; open column low, initiate DPO transfer. Then: open column high; open column low*, initiate DPO transfer; open column high*; optional precharge. |
| Read Bank (BURST4) | Open row. Issue command to sequencer. | Insert necessary wait states; open column low, initiate DPO transfer; open column high; optional precharge. |
| Read Bank (BURST8) | Open row. Issue command to sequencer. | Insert necessary wait states; open column low, initiate DPO transfer; open column high; open column low*, initiate DPO transfer; open column high*; optional precharge. |

*LSB of column address is complemented.
**BURST4 refers to a burst of 4 consecutive 18-bit data words.
***BURST8 refers to a burst of 8 consecutive 18-bit data words.
[0051] TABLE 1B Division of Labor - Write Operations

| Command | Command Front End Memory Control Steps | Sequencer Memory Control Steps |
| --- | --- | --- |
| Write Page (BURST4) | Issue command to sequencer. | Insert necessary wait states; initiate DPI transfer; open column low; open column high; optional precharge. |
| Write Page (BURST8) | Issue command to sequencer. | Insert necessary wait states; initiate DPI transfer; open column low; open column high, initiate DPI transfer; open column low*; open column high*; optional precharge. |
| Write Bank (BURST4) | Open row. Issue command to sequencer. | Insert necessary wait states; initiate DPI transfer; open column low; open column high; optional precharge. |
| Write Bank (BURST8) | Open row. Issue command to sequencer. | Insert necessary wait states; initiate DPI transfer; open column low; open column high, initiate DPI transfer; open column low*; open column high*; optional precharge. |

*LSB of column address is complemented.
[0052] TABLE 1C Event Operations

| Command | Command Front End Memory Control Steps | Sequencer Memory Control Steps |
| --- | --- | --- |
| Read Register | If latency = minimum: initiate DPO transfer (register). Issue command to sequencer. | If latency > minimum: insert necessary wait states; initiate DPO transfer (register). Then: drive address to register selection MUX. |
| Read Sync | If latency = minimum: initiate DPO read sync. Issue command to sequencer. | If latency > minimum: insert necessary wait states; initiate DPO read sync. |
| Row Open | Open row. | |
| Row Close | Close row. | |
| Register Write, Event, Stop Read Sync, Drive DCLKs, Disable DCLKs | Issue command to immediate command block. | |
[0053] In general, the command sequencers perform bank reads and writes, page writes, and all other operations with a programmable latency that is not set to its minimum value. It will be appreciated that the CFE and the sequencers never perform the same control step at the same time, the memory controller being responsible for scheduling instructions in such a way that the CFE and sequencers will not be generating control signals which create contention on the CTL/ADDR bus of FIG. 4. Similarly, the memory controller is responsible for ensuring that the parallel sequencers do not create contention. Note that two parallel sequencers can operate simultaneously and still not create contention if, for example, they are generating control signals for different banks of memory controlled by different signal lines.
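Because scheduling is left to the memory controller, a controller model might check a planned schedule for contention before committing to it. The Python sketch below is one hypothetical way to express that check; the tuple layout, the unit names and the line names are all illustrative assumptions rather than anything specified in this description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One scheduled control action: (clock cycle, control line, driving unit).
Action = Tuple[int, str, str]


def find_contention(actions: Iterable[Action]) -> List[Tuple[int, str, List[str]]]:
    """List every (cycle, line) pair driven by more than one unit."""
    drivers: Dict[Tuple[int, str], Set[str]] = defaultdict(set)
    for cycle, line, unit in actions:
        drivers[(cycle, line)].add(unit)
    return [(cycle, line, sorted(units))
            for (cycle, line), units in sorted(drivers.items())
            if len(units) > 1]


schedule = [(0, "line_a", "CFE"), (1, "line_b", "SEQ0"), (1, "line_a", "SEQ1")]
print(find_contention(schedule))                           # []
print(find_contention(schedule + [(1, "line_b", "SEQ2")]))
```

Two units active in the same cycle are acceptable as long as they drive different lines, which matches the note above about sequencers serving different banks.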
[0054] The command pipeline described above gives rise to one timing outcome which must be compensated. Namely, since command latencies are programmed in increments of clock “ticks” i.e. half clock cycles, and the command pipeline operates with a full clock period (i.e. 2 ticks), for latencies requiring an odd number of delays, a mismatch arises between the latency ticks and the command clock period, since the command pipeline cannot insert the appropriate number of tick delays based solely on its clock period. For an even number of delays, there is no mismatch between the number of delays required and the command pipeline clock period. As a result, a method for inserting an additional tick delay for odd-numbered latencies is implemented in a preferred embodiment of the invention, as will be discussed below.
[0055] More generally, in order to generate control signals with timing resolution Tres using conventional synchronous logic design techniques, it is necessary to clock the logic with a clock period shorter than or equal to Tres. For a high timing resolution system (i.e. short Tres), this requires a high operating frequency for the control logic, resulting in relatively high power consumption, especially in CMOS implementations due to the CV²f term, and also resulting in the minimum timing resolution Tres being limited by the maximum operating frequency of the synchronous control logic. Conventional approaches to resolving this issue included simply designing the control logic to operate at the frequency necessary for the desired timing resolution Tres by use of an increased control pipeline depth or the use of special circuit level design techniques such as dynamic storage elements to achieve the desired frequency/resolution. However, as the operating frequencies have increased, simply forcing the control logic by design to operate at those frequencies is becoming more and more challenging.
[0056] According to a preferred embodiment of the invention, a half-period adjust scheme is implemented to address this timing resolution drawback. The control logic is designed to operate with a clock period that is an integral multiple N of the desired timing resolution Tres, i.e., the control logic operates with clock period Tcp=N×Tres. As a result, control signal timing is represented in terms of an integral number P of Tcp clock periods plus a fraction F/N where F is an integer between 0 and N−1, tcs=(P+F/N)×Tcp. The implementation of the control logic to handle this timing is as follows:
[0057] 1) Store the parameter F while using P to count out the desired number of clock periods.
[0058] 2) Upon completion of P synchronous counting steps, use the parameter F to generate the output signal delayed from the logic clock by (F/N)×Tcp.
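The arithmetic behind these two steps can be illustrated with a minimal Python sketch. The function names are hypothetical, and the numeric values in the example (N=2, Tcp=5 ns) are chosen only for illustration.

```python
def split_timing(ticks: int, n: int):
    """Split a delay of `ticks` resolution units into (P, F), 0 <= F < n."""
    if ticks < 0 or n < 1:
        raise ValueError("ticks must be >= 0 and n must be >= 1")
    return divmod(ticks, n)


def control_delay(p: int, f: int, n: int, t_cp: float) -> float:
    """Evaluate tcs = (P + F/N) x Tcp."""
    return (p + f / n) * t_cp


p, f = split_timing(7, 2)             # 7 resolution units with N = 2
print(p, f)                           # 3 1
print(control_delay(p, f, 2, 5.0))    # 17.5
```

With N=2, a delay of 7 resolution units is counted as P=3 full clock periods followed by a fractional delay of F/N = 1/2 of a period.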
[0059] One possible implementation is to use the parameter F to control the insertion of appropriately scaled delay elements within the signal path of the output control signal. An alternate implementation is to pass the parameter F to the functional logic being controlled for the delay to be effected there.
[0060] Specifically, with respect to the command processing described earlier, with command latencies programmed in ticks and the command pipeline operating with a full clock period (in this case 5 ns), the half-clock adjust solution according to the preferred embodiment of the invention consists of implementing the latency within the command pipeline to within the nearest clock count, effectively dividing the tick count by two, and then adjusting for the final fractional portion according to the number of tick delays required by the latency. In the case of an even tick count latency, the resulting tick count implemented in the command pipeline is equivalent to the tick count programmed. For an odd tick count latency, the command pipeline delay ends up being early by half a clock period. In order to compensate for this effect, the command is flagged as requiring a “half-period adjustment” and the data path introduces an extra half clock delay.
[0061] FIG. 7 illustrates a general implementation of this aspect of the invention. A latency value is input along with a command and stored in a latency register, in this case a 6-bit unsigned value. For a read operation for example, the read latency associated with the read operation is processed as follows:
[0062] 1) the control logic takes the upper 5 bits of the latency value and inserts that number of 5 ns wait states within the command pipeline;
[0063] 2) the least significant bit of the programmed latency value is passed along through the command pipeline as the “half clock adjust bit”. When the wait states inserted in the command pipeline are completed, the control logic asserts a control signal to the data output path logic along with the half clock adjust bit. If the half clock adjust bit is logic 1, then the data path further delays the read data by 2.5 ns; alternatively, if the half clock adjust bit is logic 0, then the data path does not insert any additional delay.
[0064] In general, the half period adjust scheme can be extended as follows. For a system with desired timing resolution Tres, the control pipeline can be clocked with a clock whose period Tcp is Tres multiplied by an integer N that is a power of two, i.e., Tcp=N×Tres, N=2^n. Referring to FIG. 7, a timing parameter is then represented as a binary M-bit fixed point value with the least significant n bits as a fraction of the Tcp clock period. The m timing parameter bits above the least significant n bits specify the synchronous logic delay count P. These bits are loaded into a down counter 710. The least significant n bits carry the fractional delay value F, and are loaded into a latch 712 for temporary storage. After the down counter 710 counts down P clock pulses, a zero detector 714 asserts the desired control signal. This control signal is provided to N−1 delay elements 716.1, 716.2, . . . , 716.N−1 (collectively 716), which delay the control signal by respective amounts 1/N Tcp, 2/N Tcp, . . . , and (N−1)/N Tcp. The control signal is also provided to one input of a multiplexer 718, as are the outputs of each of the delay elements. The n low order bits of the delay value are provided from the latch 712 to the select input of multiplexer 718. Thus the control signal, already delayed by P clock periods Tcp by the counter 710, is then further delayed by the specified fractional part F/N of a clock period by the delay elements 716 and multiplexer 718.
[0065] In the embodiment described herein, M=6, m=5, n=1, and N=2. In this case the control pipeline is clocked with the clock period Tcp which is N=2 times the desired timing resolution Tres. The least significant n=1 bit of the timing value is therefore used to control a 2-to-1 multiplexer 718 to select the synchronous pipeline output signal delayed by 0 or ½ Tcp as the control signal output by the control logic. In another example, with n=3, the control pipeline is clocked with the clock period Tcp which is N=8 times the desired timing resolution Tres. The least significant 3 bits of a timing value are therefore used to control an 8-to-1 multiplexer 718 to select the synchronous pipeline output signal delayed by 0, ⅛ Tcp, 2/8 Tcp, ⅜ Tcp, 4/8 Tcp, ⅝ Tcp, 6/8 Tcp, or ⅞ Tcp as the control signal output by the control logic.
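The arithmetic of the half period adjust scheme can be illustrated with a short Python sketch. It is only a behavioral model under stated assumptions, not the disclosed circuit: the function names are hypothetical, and the example takes the 2.5 ns half-clock delay mentioned above as Tres (so Tcp = 5 ns when n = 1). The integer part P models the count loaded into the down counter, and the fractional part F models the multiplexer select value.

```python
from typing import Tuple


def split_timing_parameter(value: int, n: int) -> Tuple[int, int]:
    """Split a fixed-point timing value into (P, F).

    P is the count of whole clock periods (the upper bits);
    F is the n-bit fractional part (the lower n bits).
    Hypothetical helper, named for illustration only.
    """
    if value < 0 or n < 1:
        raise ValueError("value must be >= 0 and n must be >= 1")
    p = value >> n               # bits above the n fraction bits
    f = value & ((1 << n) - 1)   # least significant n bits
    return p, f


def total_delay_ns(value: int, n: int, t_res_ns: float) -> float:
    """Total delay = P * Tcp + (F / N) * Tcp, with N = 2**n and Tcp = N * Tres."""
    big_n = 1 << n
    t_cp = big_n * t_res_ns
    p, f = split_timing_parameter(value, n)
    return p * t_cp + f * t_cp / big_n


# Example with n = 1 (a 2-to-1 multiplexer) and Tres = 2.5 ns:
# value 11 = binary 001011 gives P = 5 and F = 1.
p, f = split_timing_parameter(11, 1)
delay = total_delay_ns(11, 1, 2.5)   # 5 * 5 ns + 1 * 2.5 ns
```

Because Tcp = N × Tres, the total delay always equals the raw timing value multiplied by Tres; the split only determines which part is realized by the synchronous counter and which part by the delay elements.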
[0066] Note that other implementations are possible within the scope of this aspect of the invention. For example, the delay elements 716 could be replaced if desired by a single delay line having N−1 taps. As another example, the delay elements 716 and the multiplexer 718 in combination could be replaced by a single variable delay element. Other variations will be apparent.
Contention-Free Signaling Scheme
[0067] As shown in FIG. 4, many of the CTL/ADDR leads that are driven by the CFE 200 or any of the command sequencers 201-208 are shared. Thus at different times they might be driven by different controlling units. As mentioned above, the memory controller is responsible for ensuring, through proper scheduling of memory control steps, that no two of the controlling units assert signals on the same control line in the same clock pulse. Though this is not required in every embodiment, the memory module of the present embodiment uses a transition-based signaling scheme in order to achieve enhanced contention-free operation.
[0068] FIG. 8 is a block diagram illustrating the circuits connected immediately to one of the shared control lines 210-X. Control units 830 and 831 represent two units from the group consisting of the CFE 200 and the command sequencers 201-208, any of which can drive the control line 210-X. Functional unit 832 represents any of the functional units which receive commands from the shared bus, such as DRAM banks and data paths in and out. The bus holder cell 833 could be physically part of any of the control units or functional units, or could be physically a separate cell as shown in FIG. 8. The function of the bus holder cell 833 is described below.
[0069] FIG. 8 shows the output driver portion of the control units 830 and 831. Referring to control unit 830, the output driver comprises two D-type flip/flops 836 and 834 as well as a tri-state buffer 835. Flip-flop 836 receives a command “assert-X” at its D-input and the system clock CLK at its clock input and outputs, on its Q output, a control signal to enable the tri-state buffer 835. Flip/flop 834 receives the output of the tri-state buffer 835 at its D-input and CLK on its clock input and outputs its Q\ (“Q-not”) output to the input of the tri-state buffer 835. The resulting output signal from control unit 830 is therefore the output of the tri-state buffer 835. A similar output driver structure exists for control unit 831 as illustrated in FIG. 8.
[0070] The bus holder cell 833 consists of two cross-coupled inverters 843 and 844 which essentially act as a shared SRAM (static random access memory) bit storing the most recently asserted value on control signal line 210-X, until overwritten. The output of each inverter is connected to the input of the other, and the output of inverter 843 (input of inverter 844) is connected to the shared signal line 210-X. The inverter 843 is designed with weak driving characteristics so it can be easily overcome with an opposite polarity signal driven onto the shared control line 210-X by one of the control units 830 or 831.
[0071] The input portion of functional unit 832 comprises two D-type flip/flops 840 and 841 and an exclusive OR (XOR) gate 842. Shared control signal 210-X is input into one of the inputs of the XOR gate 842 as well as to the D-input of flip/flop 840, which in turn is clocked by the system clock CLK. The Q output of flip/flop 840 is input as the second input to the XOR gate 842, which then outputs to the D-input of flip/flop 841. The Q output of flip/flop 841 represents an “asserted-X” control signal within the functional unit which is used to implement some control operation in the functional unit 832.
[0072] FIG. 9 is a timing diagram illustrating the operation of the circuits of FIG. 8. Referring to FIG. 9, prior to an arbitrary system clock cycle, cycle 1 for example, initiated at sampling time t0, control unit 830 evaluates a command and decides to assert the corresponding control signal onto the shared control signal line 210-X. Since this is a fully synchronous system, control unit 830 will assert its request, and upon the next rising edge of CLK at time t1, and after a short time delay, the control signal 210-X will experience a transition in its logic state from a logic low to a logic high (note that prior to this change, at sampling time t1, the shared control signal 210-X had a logic low value). The state of control signal 210-X is maintained by the bus holder cell 833 for the duration of cycle 1 until it is overwritten in the next cycle. At the end of cycle 1 and the beginning of cycle 2, control unit 831 evaluates a command action and chooses to assert X. At the end of cycle 2, demarcated by sampling time t2, the shared control signal 210-X is still logic high, and therefore a state transition is detected by the functional unit 832. The D-flip/flop 840 stores the last 210-X value (output of 840) and the XOR gate 842 compares the current value of 210-X and last 210-X. Since at t2, 210-X is logic high and last 210-X is logic low, the X-asserted output of D-flip/flop 841 is made logic high by the XOR gate 842 on the rising clock edge. The functional unit 832 then proceeds to execute the control steps associated with the X-asserted control signal (not shown). Subsequently, during a third clock cycle, cycle 3, control unit 831 decides to continue to assert X. At time t3, the functional unit 832 samples the shared signal 210-X and finds it to be logic low, thereby indicating another state transition since sampling time t2. 
Since the command to continue to assert X was provided during cycle 3, 210-X will again change states after sampling time t3, and the last 210-X in the functional unit D-flip/flop 840 will also change states. However, since both 210-X and last 210-X still remain opposite in phase, the X-asserted output remains logic high, through the XOR action of 842. As can be seen from FIG. 9, three clock cycles are required between the time when an action is evaluated by a controlling unit and the time when an asserted control signal results in the functional unit. Also, from FIG. 9, it can be seen that the two controlling units used to illustrate the operation, units 830 and 831, did not have to contend for the control signal 210-X bus over consecutive clock cycles. The system can continue to operate in this fashion with alternating control between control units on every clock cycle.
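The behavior of the circuits of FIG. 8 can be approximated with a cycle-level Python sketch. It is a simplified model under stated assumptions rather than the disclosed circuit: gate delays are ignored, at most one control unit requests the line per cycle (as the scheduling requirement above demands), and the function name `simulate` and the unit labels are hypothetical. Each loop iteration stands for one rising clock edge, at which all flip-flops capture their inputs simultaneously.

```python
def simulate(requests, initial_line=0):
    """Cycle-level model of transition-based signaling on one shared line.

    requests: one entry per clock edge, naming the control unit whose
              "assert-X" input is high before that edge, or None.
    Returns a list of (line_value, asserted_x) pairs, one per cycle.
    Illustrative sketch only; names are assumptions, not from the source.
    """
    line = initial_line      # value held on the line by the bus holder cell
    sampled = initial_line   # line value captured at the previous edge (834 / 840)
    asserted = 0             # output of the receiver's second flip-flop (841)
    trace = []
    for request in requests:
        # Rising edge: flip-flops capture the values present before the edge.
        new_asserted = line ^ sampled   # XOR of current line and last sample
        new_sampled = line
        enabled_unit = request          # registered enable of the tri-state buffer
        asserted, sampled = new_asserted, new_sampled
        # After the edge: an enabled driver drives the inverse of the value it
        # sampled, toggling the line; otherwise the bus holder keeps the value.
        if enabled_unit is not None:
            line = 1 - sampled
        trace.append((line, asserted))
    return trace


# Units "830" and "831" assert X on consecutive cycles, then the line idles.
trace = simulate(["830", "831", None, None])
# trace == [(1, 0), (0, 1), (0, 1), (0, 0)]
```

In this model every request produces exactly one toggle of the line and one cycle of the asserted output, and a driver never has to drive the line back to an idle level, which is why two different units can use the line on back-to-back cycles without contention.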
Alternate Embodiments and Applications
[0073] As higher clock frequencies are required in future applications, deeper pipelining will also be required. According to an embodiment of the invention, two or more clock cycles of control activity are selectively moved up into the command front end based on early or partial decoding of certain commands. For example, consider a case where commands require 4 consecutive pipeline stages D1, D2, D3 and D4 to completely capture, decode and issue to a parallel functional unit (sequencer), illustrated in Table 2. The commands themselves take control actions C1, C2, and C3 in three consecutive clock cycles to perform. Without the division of the commands between the CFE and a selected sequencer, the minimum control latency is five clocks as shown below in Table 2.
TABLE 2
programmed latency = minimum (5)
Clock cycle: |0| 1| 2| 3| 4| 5| 6| 7|
Command Decoder: D1 D2 D3 D4 | command issue
Functional Unit: → C1 C2 C3
↑ Command Received ↑ First control action
[0074] In this system the minimum control latency is five cycles, and programmed latencies greater than this are implemented by the sequencer inserting wait states between command issue and the control sequence C1, C2, and C3. If enough is known about the command (including the associated programmed latency) by decode stage D3, it is possible to reduce the minimum control latency by two cycles by allowing the command decoder to optionally perform control actions C1 and C2. This is shown below in Table 3.
TABLE 3
programmed latency = minimum (3)
Clock cycle: |0| 1| 2| 3| 4| 5| 6| 7|
Command Decoder: D1 D2 D3 D4 C1 C2 | command issue
Functional Unit: → C3 C4
↑ Command Received ↑ First control action
[0075]
TABLE 4
programmed latency = minimum + 1 (4)
Clock cycle: |0| 1| 2| 3| 4| 5| 6| 7|
Command Decoder: D1 D2 D3 D4 | command issue
Functional Unit: → C1 C2 C3
↑ Command Received ↑ First control action
[0076]
TABLE 5
programmed latency = minimum + 2 (5)
Clock cycle: |0| 1| 2| 3| 4| 5| 6| 7|
Command Decoder: D1 D2 D3 D4 | command issue
Functional Unit: → C1 C2 C3
↑ Command Received ↑ First control action
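The division of work illustrated by Tables 3 through 5 can be summarized in a short Python sketch. It is a sketch under stated assumptions, not the disclosed logic: the names `Plan` and `plan_command` are hypothetical, the threshold of 4 is inferred from the tables (latency 3 is handled by the command decoder, latencies 4 and 5 by the functional unit), and the wait-state count of latency minus threshold is an illustrative assumption that the disclosure does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    first_group_by: str   # which unit performs the first control actions
    wait_states: int      # wait states inserted by the sequencer beforehand


def plan_command(latency: int, threshold: int) -> Plan:
    """Decide who performs the first group of control actions.

    Below the threshold the command decoder (front end) performs them;
    otherwise the selected sequencer does, after optional wait states.
    The wait-state formula is an assumption made for illustration.
    """
    if latency < threshold:
        return Plan(first_group_by="front end", wait_states=0)
    return Plan(first_group_by="sequencer", wait_states=latency - threshold)


# With an assumed threshold of 4, mirroring Tables 3, 4 and 5:
plans = {latency: plan_command(latency, 4) for latency in (3, 4, 5)}
```

Under these assumptions a programmed latency of 3 keeps the first control actions in the front end, a latency of 4 moves them to the sequencer with no wait state, and a latency of 5 moves them to the sequencer with one wait state.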
[0077] It is also possible to incorporate an embodiment of the invention within a control system with only a single functional unit. In that case, command processing is broken down into a front end block which performs command decoding and issues commands to a single back end block that executes control actions. The decomposition of the control system into two parts therefore allows parallelism even with a single functional unit because the back end block can perform the control actions for command N even as the front end decoder processes command N+1. The invention may be applied as in the case of multiple functional units to reduce minimum control latency.
[0078] In general, this invention may be used in any application where it is important to reduce the minimum latency within a control system processing a stream of commands or instructions where the control actions for two or more separate commands may overlap in time and control latency is programmable. These include high speed pipelined interchip communications interfaces, packet based network routing and bridging equipment, specialized data processing and digital signal processing systems, and control and stimulus generation within automated test equipment (ATE).
[0079] The improvements attained through the implementation of the present invention include a reduction in the minimum control action latency compared to the conventional scheme, in which a front end decoder unit issues commands to multiple parallel functional units and all control actions are implemented within the parallel functional units. The invention achieves the same minimum latency as an aggressive implementation with replicated command decoding logic in each parallel functional unit, while avoiding most of that implementation's extra complexity and power consumption relative to the conventional scheme.
[0080] With respect to the general implementation of the half period adjust scheme, the proposed solution can be used for any application where digital control signals must be generated with a timing resolution too small to be practical or desirable for conventional synchronous control logic with timing resolution equal to the clock period. This could include high speed interchip communication schemes, output waveform shaping circuits, programmable time base generators, automated test equipment (ATE), direct digital synthesis (DDS) signal generators, and high frequency signal modulation.
[0081] The above disclosure is to be taken as illustrative of the invention, not as limiting its scope or spirit. Numerous modifications and variations will become apparent to those skilled in the art after studying the above disclosure. For example, apparatus according to the invention need not issue commands to a sequencer exactly simultaneously with the performance of the first memory control step(s). It is sufficient for the apparatus to issue the command “substantially” simultaneously with the performance of the first memory control step(s), such as within one clock cycle.
[0082] As used herein, a given signal or event is “responsive” to, or “depends upon”, a predecessor signal or event if the predecessor signal or event influenced the given signal or event. If there is an intervening processing element or time period, the given event or signal can still be “responsive” to, or “dependent upon”, the predecessor signal or event. If the intervening processing element combines more than one signal or event, the signal output of the processing element is considered “responsive” to, or “dependent upon”, each of the signal or event inputs. If the given signal or event is the same as the predecessor signal or event, this is merely a degenerate case in which the given signal or event is still considered to be “responsive” to, or “dependent upon”, the predecessor signal or event.
[0083] Given the above disclosure of general concepts and specific embodiments, the scope of protections sought is to be defined by the claims appended hereto.
Claims
1. A memory system having a plurality of memories each having a command decoder front end receiving incoming command packets, and a set of at least one command sequencer,
- wherein said command decoder front end has facilities for (1) at least partially decoding incoming command packets, (2) issuing commands to at least one sequencer in said set of command sequencers in response to said incoming command packets, and (3) performing a first group of at least one memory control step of a decoded command in response to said incoming command packets,
- and wherein each of said command sequencers has facilities for performing a second group of memory control steps of decoded commands issued to the command sequencer from the command decoder front end.
2. A system according to claim 1, wherein said command decoder front end further has facilities for assembling each of said incoming command packets from a respective plurality of consecutive incoming command words.
3. A system according to claim 1, wherein said command decoder front end further has facilities for determining whether or not to perform said first group of memory control steps for a given incoming command packet.
4. A system according to claim 1, wherein each of said incoming command packets has associated therewith a respective desired latency value, and wherein said command decoder front end further has facilities for performing said first group of memory control steps for a given incoming command packet only if the desired latency value associated with said given command packet is below a predetermined threshold latency value.
5. A system according to claim 4, wherein each of said command sequencers further has facilities for performing said first group of memory control steps for said given incoming command packet, if said command decoder front end does not perform said first group of memory control steps for said given incoming command.
6. A system according to claim 4, wherein said incoming command packets include a command type indicator, and wherein said command decoder front end includes facilities to determine the desired latency value for the given command packet in dependence upon the command type indicator in the given command packet.
7. A system according to claim 1, wherein the facilities of said command decoder front end for issuing commands to at least one sequencer in response to said incoming command packets, issues such commands for a given incoming command packet substantially simultaneously with the performance by said command decoder front end of a memory control step for said given incoming command packet.
8. A method for managing a memory system, for use with an incoming command packet, comprising the steps of:
- receiving said incoming command packet in a command decoder front end;
- said command decoder front end decoding said command packet, issuing a command to a first command sequencer in response to said command packet, and further performing a first group of at least one memory control step in response to said incoming command packet; and
- said first command sequencer performing a second group of at least one memory control step in response to receipt of said command from said command decoder front end.
9. A method according to claim 8, further comprising the step of assembling said command packet from a plurality of consecutive incoming command words.
10. A method according to claim 8, further comprising the step of said first command sequencer determining that said first group of memory control steps are performed by said command decoder front end,
- further comprising the step of said first command sequencer abstaining from performing said first group of memory control steps in response to said step of determining.
11. A method according to claim 8, wherein said step of said command decoder front end issuing a command to a first command sequencer in response to said incoming command packet occurs substantially simultaneously with the step of said command decoder front end performing a first group of at least one memory control step in response to said incoming command packet.
12. A method according to claim 8, further comprising the steps of:
- said command decoder front end further indicating a latency value to said first command sequencer in conjunction with said step of said command decoder issuing a command to a first command sequencer; and
- said first command sequencer inserting at least one latency wait state in dependence upon said latency value indicated by said command decoder front end, after receipt of said command from said command decoder front end and prior to said step of performing a second group of at least one memory control step.
13. A method according to claim 8, further comprising the step of said command decoder front end selecting said first command sequencer from among a plurality of parallel command sequencers in response to receipt of said command packet.
14. A method for managing a memory system, for use with a first incoming command packet, comprising the steps of:
- receiving said first incoming command packet in a command decoder front end;
- said command decoder front end decoding said first command packet, issuing a command to a first command sequencer in response to said first command packet, and determining whether to perform a first group of at least one memory control step in response to said first command packet; and
- said first command sequencer performing a second group of at least one memory control step in response to receipt of said command from said command decoder front end.
15. A method according to claim 14, further comprising the step of assembling said command packet from a plurality of consecutive incoming command words.
16. A method according to claim 14, wherein said command decoder front end determines to perform said first group of memory control steps, further comprising the step of said command decoder front end performing said first group of memory control steps in response to said first command packet,
- wherein said second group of memory control steps excludes said first group of memory control steps.
17. A method according to claim 16, wherein said step of said command decoder front end issuing a command to a first command sequencer in response to said first command packet occurs substantially simultaneously with said step of said command decoder front end performing said first group of memory control steps in response to said first command packet.
18. A method according to claim 14, wherein each of said incoming command packets has associated therewith a respective desired latency value, and wherein said command decoder front end performs said step of determining whether to perform said first group of memory control steps in response to a determination of whether the desired latency value associated with said first command packet is below a predetermined threshold latency value.
19. A method according to claim 18, wherein said command decoder front end determines that the desired latency value associated with said first command packet is below said predetermined threshold latency value, further comprising the step of said command decoder front end performing said first group of memory control steps in response to said first command packet,
- wherein said second group of memory control steps excludes said first group of memory control steps.
20. A method according to claim 18, wherein said command decoder front end determines that the desired latency value associated with said first command packet is not below said predetermined threshold latency value, further comprising the step of said first command sequencer performing said first group of memory control steps in response to receipt of said command from said command decoder front end.
21. A method according to claim 18, wherein said incoming command packets include a command type indicator, further comprising the step of said command decoder front end determining the desired latency value for said first command packet in dependence upon the command type indicator in the first command packet.
22. A method according to claim 14, further comprising the steps of:
- said command decoder front end further indicating a latency value to said first command sequencer in conjunction with said step of said command decoder issuing a command to a first command sequencer; and
- said first command sequencer inserting at least one latency wait state in dependence upon said latency value indicated by said command decoder front end, after receipt of said command from said command decoder front end and prior to said step of performing a second group of at least one memory control step.
23. A method according to claim 14, further comprising the step of said command decoder front end selecting said first command sequencer from among a plurality of parallel command sequencers in response to receipt of said command packet.
24. A method of operating a memory device for use in a packet-driven memory system comprising the steps of:
- receiving external command packets in a command front end circuit;
- decoding said external command packets into internal commands in said command front end circuit;
- issuing said internal commands to respective selected ones of a plurality of command sequencers;
- receiving each of said internal commands from the command front end circuit into the respective selected sequencer;
- performing a first group of control steps for a respective given internal command decoded from each given one of said external command packets, either in the command front end circuit or in the sequencer selected for the given internal command, selectably in dependence upon a comparison of a latency value associated with the given external command packet with a threshold latency value; and
- performing a second group of control steps for the given internal command in the sequencer selected for the given internal command.
25. A method according to claim 24, further comprising the steps of:
- receiving in the command sequencer selected for each given internal command a latency indication from the command front end circuit; and
- entering a wait state for a selected number of clock cycles in dependence upon the command delay indication for each given internal command, after receipt of the given internal command in said step of receiving internal commands, and prior to said step of performing a second group of at least one memory control step.
26. A method for processing commands in a memory system having a command module and multiple memory modules coupled together via command and data links, the method comprising the steps of:
- issuing a command packet from the command module to a selected memory module, the command packet having a latency value associated therewith;
- receiving the command packet in the selected memory module via a command decoder front end;
- decoding the issued command packet into an internal command;
- internally issuing the decoded command to a selected one of a plurality of parallel functional units;
- performing a first group of control actions in the command decoder front end if the latency value is less than a predetermined latency threshold; and
- performing a remaining group of control actions in the selected parallel functional unit.
27. A method of operating a memory device for use in a packet-driven memory system comprising the steps of:
- receiving external command packets in a command front end circuit;
- decoding one of the external command packets to produce an internal command in the command front end circuit;
- issuing the internal command to a selected one of a plurality of command sequencers;
- performing a first group of control steps in the command front end circuit;
- receiving the internal command from the command front end circuit into the selected sequencer;
- receiving a command delay output from the command front end circuit into the selected sequencer;
- entering a wait state for a selected number of clock cycles if a latency value associated with the internal command is greater than a predetermined latency threshold, and
- executing remaining control steps in the selected command sequencer.
28. A method for generating a control signal delayed by a delay time specified with a resolution smaller than one period of a clock signal, comprising the steps of:
- receiving a desired delay time specified as a digital delay value which includes an m-bit integral multiple and an n-bit fractional multiple of the period of said clock signal, m>0 and n>0;
- loading said m-bit integral multiple into a counter clocked synchronously with said clock signal;
- generating said control signal in response to count completion of said counter; and
- further delaying said control signal by (F/2^n) × Tcp, where F is the integer value of said n-bit fractional multiple, and Tcp is the period of said clock signal.
29. A method according to claim 28, wherein said step of further delaying said control signal comprises the steps of:
- providing said control signal to respective inputs of N delay elements, each i'th one of said delay elements inserting a respective relative delay of ((i−1)/N) Tcp; and
- selecting an output of the F'th one of said delay elements.
30. A method according to claim 28, further comprising the step of latching said n-bit fractional multiple while said counter counts.
31. Selectable control signal delay apparatus, for use with a delay value specified as a fixed point value with m>0 integer bits carrying a value P and n>0 fraction bits carrying a value F, comprising:
- a counter having a load input port, a count output port and a clock input, said load input port being coupled to receive said integer bits and said clock input being coupled to receive a clock signal having a clock period Tcp;
- a control signal generator coupled to generate said control signal in response to count completion by said counter; and
- a fractional delay circuit coupled to receive said control signal and said fraction bits, said fractional delay circuit delaying said control signal by (F/2^n) × Tcp.
32. Apparatus according to claim 31, wherein said fractional delay circuit comprises N delay elements each having an input coupled to receive said control signal, each i'th one of said delay elements having an output and inserting a respective relative delay of ((i−1)/N) Tcp; and
- a multiplexer coupled to receive the outputs of said N delay elements, said multiplexer further having a select input coupled to receive said fraction bits.
33. Apparatus according to claim 32, wherein the 1st one of said delay elements consists of a conductor connecting the input of said 1st delay element to the output of said 1st delay element.
34. Apparatus according to claim 32, further comprising a storage element having an input port coupled to receive said fraction bits and an output port coupled to the select input of said multiplexer.
35. Selectable control signal delay apparatus, for use with a delay value specified as a 6-bit fixed point value, comprising:
- a counter having a load input port, a count output port and a clock input, said load input port being coupled to receive the high order 5 bits of said delay value and said clock input being coupled to receive a clock signal having a clock period;
- a latch having a data input and a data output, the data input of said latch being coupled to receive the low order bit of said delay value;
- a count completion detector coupled to generate a control signal in response to count completion by said counter;
- a half-clock-period delay element having an input and an output, the input of said half-clock-period delay element being coupled to receive said control signal; and
- a multiplexer having first and second inputs and a select input, the first input of said multiplexer being coupled to receive said control signal from said count completion detector, the second input of said multiplexer being coupled to the output of said half-clock-period delay element.
Type: Application
Filed: Mar 9, 2001
Publication Date: Jan 24, 2002
Applicant: Advanced Memory International, Inc.
Inventors: Paul W. DeMone (Kanata), Peter B. Gillingham (Kanata)
Application Number: 09803076
International Classification: G06F012/00;