FAST, ENERGY EFFICIENT CMOS 2P1R1W REGISTER FILE ARRAY USING HARVESTED DATA
A transistor memory device includes storage elements storing a capacitance including (1) a capacitance at a source of PFETs, (2) a capacitance at each storage element connected to a storage node and (3) a capacitance at a gate input of inverter transistors from the plurality of transistor storage elements. Each storage element is configured to perform (i) a read data access or (ii) a write data access, to increase static noise margin. The transistor memory device further includes a harvest node that is coupled to a ground and is configured to store a harvested charge transferred from a selected bitline to increase an output voltage at the harvest node. The transistor memory device further includes a capacitor divider configured to maintain a voltage swing on a bitline. The transistor memory device further includes a harvest circuit configured to, in response to the read data access, decouple the harvest node and invert a voltage.
This application claims priority to and is a continuation of U.S. patent application Ser. No. 17/578,482, filed Jan. 19, 2022 entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, which claims priority to U.S. Provisional Application No. 63/247,136, filed Sep. 22, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, and U.S. Provisional Application No. 63/138,456, filed Jan. 17, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, each of which is hereby incorporated by reference in its entirety. The application claims priority to U.S. Provisional Application No. 63/247,136, filed Sep. 22, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”.
FIELD
The present disclosure generally relates to digital integrated circuits. In particular, the present disclosure is related to a fast, energy efficient CMOS 2P1R1W Register File Array using Harvested Data.
BACKGROUND
While power density of CMOS chips was held constant with constant electric field (Dennard) scaling for over 30 years, increases in CMOS device variability at lower operating voltages and scaled geometries in tandem with reductions in circuit speed from non-scaling of gate overdrive due to exponential increases in leakage from scaling MOSFET threshold voltages limited CMOS voltages from scaling to much below 1 V. These trends brought an end to Dennard scaling in (
In some embodiments, a transistor memory device includes a plurality of transistor storage elements storing a collective capacitance including (1) a capacitance at a source terminal of each p-channel field-effect transistors (PFETs) from a plurality of PFETs, (2) a capacitance at each transistor storage element from the plurality of transistor storage elements electrically connected to a storage node and (3) a capacitance at a gate input of a plurality of inverter transistors from the plurality of transistor storage elements. Each transistor storage element from the plurality of transistor storage elements includes a word line port configured to select (a) a bitcell and (b) a first bitline or a second bitline. Each transistor storage element from the plurality of transistor storage elements is configured to perform (i) a read data access from or (ii) a write data access to each remaining transistor storage element from the plurality of transistor storage elements, to increase a static noise margin in response to a decrease of a read current and a voltage on the storage node. The collective capacitance of the plurality of transistor storage elements is greater than a terminal capacitance of the selected bitline. The transistor memory device further includes a harvest node electrically coupled to a ground and that is configured to store a harvested charge transferred from the selected bitline to increase an output voltage at the harvest node. The transistor memory device further includes a capacitor divider electrically connected between the selected bitline and the harvest node of a first transistor storage element from the plurality of transistor storage elements that shares the selected bitline and the harvest node. The capacitor divider is configured to maintain a voltage swing on the selected bitline. 
The transistor memory device further includes a harvest circuit electrically coupled to the harvest node and configured to, in response to the read data access performed by the first transistor storage element, decouple the harvest node from the ground and invert a voltage equal to a potential difference between the selected bitline and the harvest node.
Global I/O, Control—Address<0:10>, Data in, out<0:31>, CLK, R/W.
8:1 Column mux for Write assumed. Global I/O & Control placed in middle of instance to limit R of pitch constrained Global BLs.
RWL, WWL: 128 b (100 um): Cw=128×0.767 um×1.02×0.185 fF/um=18.52 fF. (R=0.95 ohms/sq with double metal lines RRWL=475 ohms)
LBL: 16 b (3.45 um): Cw=16×0.18 um×1.2×0.185 fF/um=0.64 fF. RLBL=33 ohms.
GRBL, GWBL: 224 b (50.4 um): Cw=224×0.18 um×1.25×0.185 fF/um=9.324 fF. RGRBL=475 ohms.
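The wire capacitance figures above all follow the same pattern: cell count, times cell pitch, times a length multiplier, times capacitance per micron. The sketch below reproduces that arithmetic as a cross-check. The function name and the label `length_factor` for the 1.02, 1.2 and 1.25 multipliers are illustrative assumptions; the source does not name them.

```python
# Lumped wire capacitance: n_cells * pitch_um * length_factor * c_per_um.
# "length_factor" is an assumed name for the 1.02 / 1.2 / 1.25 multipliers
# in the figures above; the source does not label them.

def wire_cap_fF(n_cells, pitch_um, length_factor, c_per_um_fF=0.185):
    """Return the lumped wire capacitance in fF."""
    return n_cells * pitch_um * length_factor * c_per_um_fF

# (cell count, cell pitch in um, length multiplier) taken from the text.
wires = {
    "RWL/WWL": (128, 0.767, 1.02),
    "LBL": (16, 0.18, 1.2),
    "GRBL/GWBL": (224, 0.18, 1.25),
}

for name, (n_cells, pitch_um, factor) in wires.items():
    print(f"{name}: {wire_cap_fF(n_cells, pitch_um, factor):.3f} fF")
```

The printed values should agree with the 18.52 fF, 0.64 fF and 9.324 fF quoted above to within rounding.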
The end of Dennard scaling and the lack of greater instruction-level parallelism forced the industry to switch from a single energy-intensive core per microprocessor to multiple efficient cores per chip, with roll-outs of the industry's first dual-core processors. The move to parallel processing allowed each core to be more energy efficient by having a lower peak performance (at reduced supply voltage), with multiple cores on the die to increase the overall throughput performance. With Dennard scaling dead since 2004 and Moore's Law slowing to a doubling of transistor count every 20 years, transistors are not getting much faster while the peak power per mm² increases because voltages cannot scale anymore. Power budgets cannot increase either due to heat removal limits (
The energy consumption for various arithmetic operations and memory accesses in
Large last-level caches are included on the CPU chip to scale memory stall time with performance by lowering the miss rate of the processor's caches. Since most of the memory bitcells are idle most of the time, the energy dissipation of large on-chip CPU cache memory is dominated by its leakage. The importance of memory leakage is evident from the fraction of processor power consumed by leakage in large caches with caches and register files (RF) consuming over 50% of the CPU's energy.
GPUs are widely preferred over CPUs to accelerate AI workloads because Deep Neural Network (DNN) model training is composed of simple matrix math and convolution calculations, the speed of which can be greatly enhanced if the computations can be carried out in parallel. GPUs use tens of thousands of threads to pursue high throughput performance with extreme multithreading. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. In GPUs, the bottleneck for DNN processing is in the memory read access, with each multiply-and-accumulate (MAC) operation requiring three memory read accesses and one memory write access. Row Stationary Dataflows (
Each thread in a GPU must store its register context on-chip. Unlike CPUs that hide latency of a single thread by using a large last-level on-chip cache, GPUs use a large number of threads and switch between them to hide memory access latency. Just holding the register context of these threads requires substantial on-chip storage. With so many threads, register files are one of the largest on-chip memory resources in current GPUs. Recently announced commercial GPUs report aggregate on-chip RF array sizes up to 256 Mb, much larger than last-level, on-chip caches in CPUs.
Note that while this paper details the circuit schemes proposed for a 2 port 1R1W 8T register file bitcell array, these are easily extended to Register File arrays with additional Read Ports by adding an NFET transistor pair (corresponding to the decoupled Read stack in the ‘Read port’ box in
Similarly, each additional Write port j is added to the schematic in
2. Conventional Two-Port 1R1W Register File Array Circuits
2-Port Register File bitcells
The conventional RF bitpath assumed serves as a baseline reference relative to which improvements are typically reported by industry and academia alike. All of these recent (within the last 4 years) references assume this ‘Domino Read’ full-swing technique as the baseline against which they compare their Register File array implementations.
2.1 Full-Swing, Short-BL sensing with Logic Gates: Small signal differential sensing—typically used in 6T arrays due to small area overheads and robust operation, is not as attractive for RF arrays because differential sense amps do not track delay scaling in logic circuits and because the small signal development rate on the bitline depends on bitline loading capacitance—dominated by local interconnects in each bitcell which don't scale with device geometries. The scaling of transistor dimensions also degrades random mismatch at the sense amplifier input that translates into larger sense amplifier voltage offsets the BL signal must overcome as a performance overhead.
Alternative large signal sensing schemes for RF arrays, shown in
2.2 Dynamic Read-Access: Dynamic circuits that precharge output nodes so they evaluate much faster on arrival of the clock edge with inputs stable during evaluation—are found in practically all fast-memory arrays. Precharge of local and global bitlines and their evaluation by bitcells at the arrival edge of the Read WL select transition are an example in 2P RF bitcell arrays. However, these techniques are energy inefficient since all of the charge discarded (from the LBL and the GRBL in
2.3 Disturb Current Read Failure avoidance with BL Keeper: The read stack also increases the risk of read failure from disturb current if data at cell node ‘Bit’ in
2.4 An Industry Solution to Disturb Current read failure: One alternative solution to the keeper described above for disturb current read failure has been to use PFETs instead of NFETs for access devices in the RF bitcell driven by the Write WL, using precharged-low Local Write BLs in half-selected bitcells during simultaneous read and write access of the RF bitcell. This eliminates the voltage bump at the gate of the lower NFET NR1 in the Read stack when ‘Bit’ is 0, but Ion of NR1 is degraded by up to 35% due to a drop in the high node storage level at ‘Bit’ when both RWL and WWL in the same row are simultaneously turned on, effectively degrading read current. The RWL voltage is boosted by 15-20% to recover performance when using Write PFET access transistors to eliminate Disturb Current driven Read failure. The power and area overheads in doing so appear significant given the size of bootstrap capacitors required to deliver sufficient charge to the WL. Also, this solution assumes approximately equal drive strengths of NFETs and PFETs due to the introduction of embedded Si/Ge source/drain that enhances hole mobility. Absent this feature in older CMOS platforms, other complications of lowering write margins substantially (and raising write VMIN) could arise when using weaker PFET gates instead of NFETs as access devices driven by the write WL.
2.5 High Leakage through Fast Read Stack: Another negative consequence of the use of the Keeper PFET solution is that when the Bitline is held at VDD by the keeper during active or standby mode, all bitcells attached to a Local Bitline are draining high leakage current from the bitline (due to a drop of almost VDD across the top NFET of the read stack) through an already leaky stack. Some stacks are worse than others: those whose bitcells have ‘Bit’=1, which turns on the lower of the two devices in the Read stack. This leakage path is ‘live’ for practically every LBL across the aggregate RF array in a GPU that is powered on. The presence of a keeper circuit also holds the LBL at VDD following a read access where the Bit read in the column was a ‘0’.
2.6 Reliability of NR1 in read stack: NMOS Transistor aging mostly arises from positive bias temperature instability (PBTI), hot carrier injection (HCI), time-dependent dielectric breakdown (TDDB) and electro-migration (EM). In an NFET stack as shown in
New CMOS harvesting circuits are proposed that improve component performance and substantially lower the energy cost of moving data across 2-port/multiport Register File (2P/MP RF) arrays typically implemented in GPU based AI Hardware accelerators. These circuits lower switching energy in local and global bitpaths by over 70% for Read and by over 30% for Write when engaging harvested data to self-limit energy dissipation during a memory access. They also lower bitcell leakage currents along the Read transistor stack pair by over an order of magnitude as a result of self-disabling of current flow by the rising electric potential barrier of harvested charge.
Proposed sensing circuits double signal development speed along local and global bitlines by comparing a decreasing BL voltage to the increasing electric potential of harvested charge as the evaluation energy expended on the local or global bit path is harvested. These improvements in sensing speed reduce the WL Select to Output Data delay of a conventional RF array by 50% or more. The proposed bit path circuits also engage harvested charge to provide immunity to disturb current noise during concurrent Read and Write access along a WL, eliminating the performance, area and energy overheads of BL keeper circuits used in conventional 2 port RF Memory arrays.
Proposed circuits improve the reliability of Read performance-limiting bitcell devices by lowering the voltages across their terminals using harvested charge during most of the active and standby periods. Area overheads of proposed circuits are expected to be marginal based on device widths of replacements to conventional peripheral circuits and can be further minimized by sharing devices and their connections between bit slices of the array. Moreover, proposed circuits do not require any changes to the CMOS platform, to the bitcell or to the array architecture, with much of the flow for design, verification and test of 2P RF Memory arrays expected to remain unchanged, minimizing risk and allowing integration of proposed circuits into existing products with minimal disruption to schedule and cost. Circuit simulations are run on a 16 nm FinFET CMOS technology using ASU parameter decks developed and available in the public domain. Additional data on wiring parasitics and bitcell geometries were obtained from IEDM/ISSCC publications by the foundry. The Array architecture assumed in simulations is mostly identical to that reported by the Foundry except for a few opportunities to improve circuit and wiring delays.
3. Example Array for Circuit Analysis and Comparison
To make quantitative comparisons between proposed circuits and those used by baseline industry standard designs, a simple, common 8 KB RF Array architecture (
The 8 KB array, shown in
The ASU decks along with the wiring parasitic data from TSMC reported at IEDM are used in the same array architecture with the same bitcell assumed in both the baseline reference Register File array and the proposed charge harvesting circuit schemes. This apples-to-apples comparison is what this paper mostly relies on to make quantitative comparisons of performance and power metrics from circuit simulations.
4. Operation of Proposed 2P RF Array Bitpath
4.1 Harvest of LBL & GRBL Evaluation Energy: In the proposed scheme, the Source terminal of the NFET read stack in the RF bitcell, NR1 shown in
The Read access proceeds as with a conventional RF bitcell, except that charge flowing into the selected bitcells (with ‘Bit’=1) from the precharged Local BL in any given column is harvested on V2L. This harvesting action raises the voltage on V2L at the same time that LBL loses charge, practically doubling the signal development rate asserted at the gate-source input of the sense-amp (inverter I1 with NFET footer LBR1), until the Read stack self-disables. (Note that the implementation could use a NAND gate instead of inverter I1, with the other input of the NAND driven by a Column select signal if the column is selected by the column multiplexor.) The self-disabling action occurs when the read stack devices of the selected bitcells have insufficient gate overdrive to stay in the linear region and move into the subthreshold region as LBL and V2L converge in voltage (Shown by Red and Green waveforms in
In
So, when charge moves from LBL to V2L on selection of any of the bitcells along this column by a Read WL (RWL), at any time, the change in voltage (reduction of LBL and increase in V2L voltage) is about the same. This is verified in
The capacitance of V2L is fixed and cannot be changed to charge V2L to a different voltage. So, the sensing inverter for the local BL, I1 triggers when LBL and V2L are within a VT of each other causing its output L_out to make a 0→1 transition as seen in
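From charge conservation, initial charge = final charge.
So,
$$C_{GRBL} \cdot V_{DD} = (C_{GRBL} + C_{V2G}) \cdot V_{2G,final}$$
Since $C_{V2G} = 0.4\,C_{GRBL}$ (
we get
$$V_{2G,final} = \frac{C_{GRBL}}{C_{GRBL} + C_{V2G}} \cdot V_{DD} = \frac{1}{1.4} \cdot 0.85\ \mathrm{V} \approx 0.61\ \mathrm{V}$$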
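The charge balance above can be checked numerically. The following is a minimal sketch that assumes ideal charge sharing between the precharged GRBL and an initially grounded V2G, with no charge lost elsewhere. The function and argument names are illustrative and do not come from the source.

```python
def harvest_final_voltage(vdd, c_bitline, c_harvest):
    """Voltage both nets settle to after ideal charge sharing.

    Assumes the bitline starts at vdd, the harvest net starts at 0 V,
    and total charge is conserved: c_bitline * vdd = (c_bitline + c_harvest) * v.
    """
    return vdd * c_bitline / (c_bitline + c_harvest)

# Values from the text: VDD = 0.85 V and C_V2G = 0.4 * C_GRBL.
# Only the capacitance ratio matters, so C_GRBL is normalized to 1.0.
v2g_final = harvest_final_voltage(vdd=0.85, c_bitline=1.0, c_harvest=0.4)
print(f"V2G final = {v2g_final:.3f} V")
```

Because only the ratio of the two capacitances enters the result, the same function covers the local bitline case by passing equal capacitances.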
4.2 Fast, energy and area efficient Sense amp action: As the LBL voltage drops, the gate input voltage of I1 approaches I1's logic threshold, which itself moves to a higher voltage as V2L rises with more harvested charge. As the LBL voltage meets the rising logic threshold voltage of I1, the output of I1, L_out, rises fast due to the high gain of a CMOS inverter. Since L_out directly drives the gate input of NFET GBE, GBE turns on and the precharged Global Read BL (GRBL) begins discharging as soon as L_out makes its 0→1 transition past the device threshold voltage of NFET GBE.
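The claim that the signal develops about twice as fast can be illustrated with a toy time-stepped model. This is a sketch under strong assumptions that are not in the source: a constant read current, equal LBL and V2L capacitances, and a sense inverter that trips once the gap between the bitline and its source rail falls to a fixed `v_trip`. The numeric values for the current and trip voltage are placeholders chosen only for illustration.

```python
def time_to_trip(vdd, v_trip, cap_fF, i_read_uA, harvest, dt_ps=0.01):
    """Step a constant-current discharge until (bitline - source rail) <= v_trip.

    harvest=False: the source rail stays at GND (conventional bitpath).
    harvest=True:  the charge leaving the bitline lands on an equal-capacitance
                   harvest net, so the rail rises as fast as the bitline falls.
    """
    v_bitline, v_rail = vdd, 0.0
    t_ps = 0.0
    # dV = I * dt / C; with uA, ps and fF this comes out in mV, hence 1e-3.
    dv = i_read_uA * dt_ps / cap_fF * 1e-3
    while v_bitline - v_rail > v_trip:
        v_bitline -= dv
        if harvest:
            v_rail += dv
        t_ps += dt_ps
    return t_ps

# Placeholder operating point (assumed, not from the source).
params = dict(vdd=0.85, v_trip=0.3, cap_fF=0.64, i_read_uA=20.0)
t_conventional = time_to_trip(harvest=False, **params)
t_harvest = time_to_trip(harvest=True, **params)
print(f"conventional: {t_conventional:.2f} ps, harvest: {t_harvest:.2f} ps")
```

In this idealized model the harvesting case trips in roughly half the time, because the bitline and the rail each move by the same amount per step. A real read stack loses gate overdrive as the two nets converge, so the current is not constant and circuit simulation is needed for actual delays.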
The precharged Global GRBL discharges to V2G instead of discharging to GND as in the conventional Global RF bitpath. As with the LBL, the converging voltages on GRBL and V2G trigger a low→high transition at the output of inverter I2. A dropping GRBL voltage meets the rising logic threshold voltage of I2. The converging waveforms of GRBL and V2G (red and green waveforms at bottom of
Note that the V2L net has about the same capacitance and dimensions as the LBL. The LBL and V2L voltages thus converge to about the same value, VDD/2, by this balanced capacitive divider when they share charge. If V2L were to have a smaller capacitance, V2L could rise to a higher voltage and self-limit the LBL to discharging less than half of its charge. Given the impracticality of using a shorter V2L line (it must connect to the GND terminal of the Read stack in each of the RF bit cells along a LBL) and given the smaller capacitance of the LBL (compared to the much larger and longer GRBL), an imbalanced capacitive divider is pursued in the Global BL to raise the voltage of V2G higher than ½ VDD so that V2G can self-limit GRBL discharge sooner, at a voltage closer to VDD than to GND, and can thus consume much less charge from the VDD grid during a Read access.
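The effect of the divider ratio on supply charge can be sketched with the same ideal charge-sharing assumption. If the bitline settles at VDD·C_BL/(C_BL+C_H), restoring it to VDD takes a fraction C_H/(C_BL+C_H) of the charge a full-swing discharge to GND would need. The function name and the `cap_ratio` parameter are illustrative assumptions, and the sketch ignores reset, driver and leakage energy.

```python
def recharge_fraction(cap_ratio):
    """Fraction of the full-swing charge (C_BL * VDD) needed to restore the bitline.

    cap_ratio = C_harvest / C_bitline. Assumes the bitline and harvest net
    fully converge by ideal charge sharing before precharge.
    """
    return cap_ratio / (1.0 + cap_ratio)

# Balanced divider (local bitline case) versus the 0.4 ratio used for the global bitline.
for label, ratio in (("balanced, ratio 1.0", 1.0), ("imbalanced, ratio 0.4", 0.4)):
    fraction = recharge_fraction(ratio)
    print(f"{label}: {fraction:.1%} of full-swing charge per read")
```

A smaller harvest capacitance leaves the bitline closer to VDD, so less charge is drawn from the grid on the next precharge. The figures from this sketch are idealized bounds, not the simulated energy numbers reported in the text.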
4.3 Reset of Dynamic nodes before Read Access: The Block Select signal from pre-decoders (
Now that L_out is discharged and GBE is turned off, GRBL can be precharged to VDD from its partially discharged state from a previous Read access. Once RST1 has moved charge from V2G to V2, RST2 ‘resets’ V2G to GND, readying it for the impending Read. Also, since L_out has been discharged during RST1, the NFET GBE is turned off, enabling the precharged GRBL to hold its precharge voltage of VDD when V2G is discharged to GND by RST2.
All of the 4 signal outputs shown in
4.3 Immunity to Disturb Current Failure: The proposed scheme does not require keeper circuitry found in conventional RF array bit paths to avoid read failure when RWL and WWL concurrently select the same row of bit cells as seen in a conventional bitpath. This is illustrated in the circuit simulations of a conventional bitpath without keeper circuits. Cell noise at node ‘Bit’—modeled with a voltage bump at the gate of NR1, can initiate an unintended discharge of the LBL—as seen in
When using the proposed bitpath circuits, keepers are not required since the rising voltage on V2L due to noise voltage at the gate of NFET NR1, self-disables the discharge of the LBL as V2L asymptotically approaches the noise voltage (
4.4 Compact, fast Decoders:
4.5 Write Data Path: For a Multiply Accumulate operation, 3 Reads and a Write access are typical. Thus, with an 8:1 Write column multiplexer, a Write access exercises a bit column about once for every 24 times it is exercised by a Read access.
The GWBL driver schematic in
4.6 Metal Grid that holds Harvest Charge: V2 lines are charged up by Read accesses as shown in (at top of)
4.7 Circuit Speed, Switching Energy Comparisons:
The improvements in Read performance of the RF bitcell demonstrated in
Note: As shown in
4.8 Leakage reduction:
Claims
1. A Register File memory device comprising:
- a plurality of conventional 8 transistor 2 port storage elements each with 1 read port and 1 write port and each with a decoupled read stack of a pair of NFETs with the gate input of one driven by a Read word line and the gate input of the other in the pair driven by a cell storage node.
- a harvest terminal that replaces the reference ground potential terminal of the decoupled read stack of FETs in a conventional Register File storage element.
- a harvest circuit coupled to the harvest terminal of a plurality of storage elements whose Read ports are coupled along a common bitline with the harvest circuit responsive to a read access by self-disabling the development of signal on the bitline, eliminating the uncertainty of signal voltage development on the bitline due to the statistical variation of read current read stack and at least doubling the rate at which data sensed in the selected storage element is resolved.
2. An apparatus, comprising:
- a plurality of transistor storage elements, a transistor storage element from the plurality of transistor storage elements including a read port and a write port, the transistor storage element from the plurality of transistor storage elements electrically coupled to a first n-channel field effect transistor (NFET) and a second NFET, the second NFET including a gate terminal configured to be driven by a bitcell including a read word line, the first NFET including a source terminal and a gate terminal such that the gate terminal of the first NFET is configured to be driven by a cell storage node for the read port from the transistor storage element;
- a bitline electrically coupled to the read port of the transistor storage element from the plurality of transistor storage elements and configured to be precharged;
- a harvesting node electrically coupled to the source terminal of the first NFET and configured to harvest, at the harvesting node, voltage that was precharged at the bitline in response to a read access action and an activation of the cell storage node;
- a harvest inverter including a reference ground potential terminal configured to be replaced with the harvesting node, the harvest inverter including a gate terminal configured to be electrically coupled to the read port of the transistor storage element from the plurality of transistor storage elements; and
- a harvesting grid electrically coupled to the harvesting node, the harvesting grid configured to self-disable a signal development on the bitline to eliminate an uncertainty of the signal development on the bitline when an electric potential of harvested data on the harvesting node matches a voltage at the bitline from the signal development.
3. The apparatus of claim 2, wherein the harvesting node is configured to discharge a voltage to the reference ground potential before the read access through the first NFET having an active high pulse at the gate terminal of the first NFET enabling discharge of the voltage.
4. The apparatus of claim 3, wherein the gate terminal of the harvest inverter is electrically coupled to the bitline, the harvest inverter including an output terminal configured to be triggered by (1) a decreasing electric potential difference between the bitline that is precharged and the harvesting node during the read access when the cell storage node is active and electrically coupled to the first NFET or the second NFET or (2) a voltage substantially equal to a voltage at a power supply terminal that is electrically coupled to the transistor storage element, the output terminal configured to be triggered by movement of electric charge from the gate terminal of the harvest inverter to the harvesting node.
5. The apparatus of claim 2, wherein:
- the apparatus is configured to perform a global sensing scheme such that the gate terminal of the harvest inverter is electrically coupled to a global bitline that is electrically coupled to a drain terminal of the first NFET and a drain terminal of the second NFET, the gate terminal of the first NFET and the gate terminal of the second NFET being driven by the bitline,
- the reference ground potential terminal of the harvest inverter configured to be replaced with a global harvest terminal that is electrically coupled to the source terminal of the first NFET, the global harvest terminal configured to harvest charge from the global bitline that is precharged as the first NFET is triggered by a rising output at an output terminal of the harvest inverter.
6. The apparatus of claim 2, wherein the transistor storage element from the plurality of transistor storage elements is configured to be decoupled from the first NFET and the second NFET to enable a higher read performance without compromising read stability margins.
7. The apparatus of claim 2, wherein a discharge of voltage at the bitline is configured to stop in response to a rising voltage at the harvesting node asymptotically approaching a noise voltage at the gate terminal of the first NFET.
8. The apparatus of claim 2, wherein a voltage signal developed at the bitline is determined by a capacitive divider electrically coupled between the bitline and the harvesting node.
9. The apparatus of claim 2, wherein harvested charge stored at the harvesting node is configured to self-disable a flow of read current as the harvested charge at the harvesting node approaches a noise voltage at the bit storage node.
10. The apparatus of claim 2, wherein the signal development on the bitline is configured to self-disable as the electric potential of the harvested data at the harvesting node rises to equalize a dropping voltage of the bitline.
11. The apparatus of claim 2, wherein the harvesting grid is configured to self-disable when the first NFET and the second NFET have insufficient gate overdrive.
12. The apparatus of claim 2, wherein a capacitance of the harvesting node is fixed and cannot be changed to charge the harvesting node to a different voltage.
13. The apparatus of claim 2, wherein a voltage at the harvesting node is configured to increase while the bitline loses charge, to increase a signal development rate for the voltage signal.
14. The apparatus of claim 2, wherein a change in voltage, in response to a charge being transferred from the bitline to the harvesting node when the bitline including the read word line is selected, is the same as when a bitline including a write word line is selected.
15. The apparatus of claim 2, further comprising an inverter electrically coupled to the bitline and the harvesting node, the inverter including an input terminal and an output terminal, the output terminal configured to perform a low-to-high transition in response to a voltage at the bitline and a voltage at the harvesting node being within a voltage threshold.
16. The apparatus of claim 2, wherein increasing a voltage at the harvesting node occurs at a same time as, and at a same rate as, a voltage at the bitline is lowered.
Type: Application
Filed: Sep 22, 2022
Publication Date: Aug 24, 2023
Applicant: Metis Microsystems, LLC (Newtown, CT)
Inventor: Azeez BHAVNAGARWALA (Newtown, CT)
Application Number: 17/951,049