Method for Specifying Stateful, Transaction-Oriented Systems for Flexible Mapping to Structurally Configurable In-Memory Processing Semiconductor Device
A method for specifying stateful, transaction-oriented systems is provided. The method initiates with designating a plurality of primitive FlowModules. The method includes defining at least one FlowGate within each of the plurality of FlowModules, wherein each FlowGate includes a non-interruptible sequence of procedure code, a single point of entry and is invoked by a named concurrent call. An Arc is designated from a calling FlowGate to a called FlowGate and a Signal is generated for each named invocation of the called FlowGate. A Channel is defined for carrying the Signal. Methods for synthesizing a semiconductor device and routing signals in the semiconductor device are provided.
The present application is a divisional application of U.S. application Ser. No. 11/426,882 filed Jun. 27, 2006, and claims priority under 35 U.S.C. §119(e) from U.S. Provisional Patent Application No. 60/694,538, filed Jun. 27, 2005, U.S. Provisional Patent Application No. 60/694,546, filed Jun. 27, 2005, and U.S. Provisional Patent Application No. 60/694,537, filed Jun. 27, 2005, all of which are incorporated by reference in their entirety for all purposes. The present application is related to U.S. Pat. No. 7,676,783 issued Mar. 9, 2010 (Atty Docket ARITP001) entitled APPARATUS FOR PERFORMING COMPUTATIONAL TRANSFORMATIONS AS APPLIED TO IN-MEMORY PROCESSING OF STATEFUL, TRANSACTION ORIENTED SYSTEMS, and U.S. Pat. No. 7,614,020 issued Nov. 3, 2009 (Atty Docket ARITP003) entitled STRUCTURALLY FIELD-CONFIGURABLE SEMI-CONDUCTOR ARRAY FOR IN-MEMORY PROCESSING OF STATEFUL, TRANSACTION-ORIENTED SYSTEMS, each of which is incorporated by reference in its entirety for all purposes.
BACKGROUND
System on a chip (SOC) implementation is predominantly based on design capture at the register-transfer level using design languages such as Verilog and VHDL, followed by logic synthesis of the captured design and placement and routing of the synthesized netlist in physical design. Current efforts to improve design productivity have aimed at design capture at a higher level of abstraction, via more algorithmic/system approaches such as C++, C, SystemC and System Verilog.
As process technology advances, physical design issues such as timing closure and power consumption management have dominated the design cycle time as much as design capture and verification. Methodology advances currently in development and under consideration for adoption using higher levels of abstraction in design capture do not address these physical design issues, and manufacturability issues. It is recognized in the semiconductor industry that with process technologies at 90 nm and below, physical design issues will have even more significant cost impacts in design cycle time and product quality.
CAD tools for placement and route of synthesized logic netlists have delivered limited success in addressing the physical design requirements of deep submicron process technologies. To take full advantage of deep submicron process technology, the semiconductor industry needs a design methodology and a supporting tool suite that can improve productivity through the entire design cycle, from design capture and verification through physical design, while guaranteeing product manufacturability at the same time. It is also well-known in the semiconductor industry that SOC implementations of stateful, transaction-oriented applications depend heavily on on-chip memory bandwidth and capacity for performance and power savings. Placement and routing of a large number of memory modules becomes another major bottleneck in SOC physical design.
Another important requirement for an advanced SOC design methodology for deep submicron process technology is to allow integration of on-chip memory with significant bandwidth and capacity without impacting product development schedule or product manufacturability. High level design capture, product manufacturability, and support for significant memory resources are also motivating factors in the development of processor-in-memory. Processor-in-memory architectures are driven by requirements to support advanced software programming concepts such as virtual memory, global memory, dynamic resource allocation, and dynamic load balancing. The hardware and software complexity and costs of these architectures are justified by the requirement to deliver good performance for a wide range of software applications. Due to these overheads, multiple processor-in-memory chips are required in any practical system to meet realistic performance and capacity requirements, as witnessed by the absence of any system product development incorporating a single processor-in-memory chip package.
There is thus an added requirement for cost effective SOC applications that resource management in processor-in-memory architectures be completely controllable by the designer through program structuring and annotations, and compile-time analysis. It is also important to eliminate all cost and performance overheads in software and hardware complexity attributed to the support of hierarchical memory systems. Based on these observations, there is a need in the semiconductor industry for a cost-effective methodology for implementing SOCs for stateful, transaction-oriented applications.
SUMMARY
Broadly speaking, the present invention fills these needs by providing a method and apparatus for performing in-memory computation for stateful, transaction-oriented applications. It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, or a device. Several inventive embodiments of the present invention are described below.
In one embodiment, a method for specifying stateful, transaction-oriented systems is provided. The method initiates with designating a plurality of primitive FlowModules. The method includes defining at least one FlowGate within each of the plurality of FlowModules, wherein each FlowGate includes a non-interruptible sequence of procedure code, a single point of entry and is invoked by a named concurrent call. An Arc is designated from a calling FlowGate to a called FlowGate and a Signal is generated for each named invocation of the called FlowGate. A Channel is defined for carrying the Signal.
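The specification elements named above can be illustrated with a small software model. The following Python sketch is illustrative only and is not the FlowLogic language itself: the class names `System`, `FlowModule` and `Signal`, the decorator `flowgate`, and the convention that a FlowGate returns a list of outgoing Signals are all assumptions made for this example. Each FlowGate is modeled as a function that runs to completion, each Arc is declared before use, and each Channel is a queue feeding one called FlowGate.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Tuple

GateId = Tuple[str, str]  # (FlowModule name, FlowGate name); hypothetical key


@dataclass
class Signal:
    """One named invocation of a called FlowGate (assumed layout)."""
    target_module: str
    target_gate: str
    payload: dict


@dataclass
class FlowModule:
    """A primitive FlowModule: private state plus named FlowGates."""
    name: str
    state: dict = field(default_factory=dict)
    gates: Dict[str, Callable] = field(default_factory=dict)

    def flowgate(self, gate_name: str):
        """Register a function as a FlowGate with a single point of entry."""
        def register(fn):
            self.gates[gate_name] = fn
            return fn
        return register


class System:
    def __init__(self):
        self.modules: Dict[str, FlowModule] = {}
        self.arcs = set()                               # declared (caller, callee) pairs
        self.channels: Dict[GateId, Deque[Signal]] = {}  # one Channel per called gate

    def add_module(self, module: FlowModule) -> None:
        self.modules[module.name] = module

    def add_arc(self, src: GateId, dst: GateId) -> None:
        self.arcs.add((src, dst))
        self.channels.setdefault(dst, deque())

    def inject(self, signal: Signal) -> None:
        """Deliver a Signal from outside the system (no Arc check)."""
        dst = (signal.target_module, signal.target_gate)
        self.channels.setdefault(dst, deque()).append(signal)

    def send(self, src: GateId, signal: Signal) -> None:
        dst = (signal.target_module, signal.target_gate)
        if (src, dst) not in self.arcs:
            raise ValueError(f"no Arc from {src} to {dst}")
        self.channels[dst].append(signal)

    def run(self) -> int:
        """Fire FlowGates until every Channel is empty; return the firing count."""
        fired, progress = 0, True
        while progress:
            progress = False
            for (mod, gate), channel in list(self.channels.items()):
                if channel:
                    sig = channel.popleft()
                    module = self.modules[mod]
                    # The gate body runs to completion before anything else fires.
                    for out in module.gates[gate](module.state, sig.payload):
                        self.send((mod, gate), out)
                    fired, progress = fired + 1, True
        return fired


counter = FlowModule("counter", state={"total": 0})
logger = FlowModule("logger", state={"lines": []})


@counter.flowgate("add")
def add(state, payload):
    state["total"] += payload["amount"]
    return [Signal("logger", "log", {"total": state["total"]})]


@logger.flowgate("log")
def log(state, payload):
    state["lines"].append(payload["total"])
    return []


system = System()
system.add_module(counter)
system.add_module(logger)
system.add_arc(("counter", "add"), ("logger", "log"))
system.inject(Signal("counter", "add", {"amount": 2}))
system.inject(Signal("counter", "add", {"amount": 3}))
fired = system.run()
```

In this model a Signal sent along an undeclared Arc is rejected, which mirrors the idea that the communication structure is fixed when the design is captured.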
In another embodiment, a method for synthesizing a stateful, transaction-oriented system for flexible mapping to a structurally field-configurable semiconductor device having a multi-level array of storage elements, for in-memory processing is provided. The method initiates with mapping FlowLogic to a network of FlowVirtualMachines (FVM). A FlowModule is mapped into a corresponding FlowVirtualMachine (FVM) and one or more FVMs are integrated into an AggregateFVM (AFVM). One or more AFVMs are composed into a FlowTile, and Signals are routed between FlowModules.
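The mapping hierarchy described above may be pictured as a resource-accounting exercise. The Python sketch below is a simplified illustration under stated assumptions: memory sizes are abstract units, an AFVM is represented with the same record type as an FVM, the aggregation rule (summing code, state and channel memory while sharing one stack sized for the largest member) is only one of several possible compositions, and first-fit-decreasing packing into FlowTiles is a placeholder for whatever placement strategy a real tool would use. The names `aggregate` and `compose_tiles` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FVM:
    """Memory footprint of one FlowVirtualMachine, in abstract memory units."""
    name: str
    code: int
    state: int
    stack: int
    channel: int

    @property
    def total(self) -> int:
        return self.code + self.state + self.stack + self.channel


def aggregate(name: str, fvms: List[FVM]) -> FVM:
    """Integrate FVMs into an AFVM (assumed rule: sum memories, share the stack)."""
    return FVM(
        name=name,
        code=sum(f.code for f in fvms),
        state=sum(f.state for f in fvms),
        stack=max(f.stack for f in fvms),  # shared stack, sized for the largest user
        channel=sum(f.channel for f in fvms),
    )


def compose_tiles(afvms: List[FVM], tile_capacity: int) -> List[List[FVM]]:
    """Pack AFVMs into FlowTiles with first-fit decreasing (an assumed heuristic)."""
    tiles: List[List[FVM]] = []
    for afvm in sorted(afvms, key=lambda f: f.total, reverse=True):
        if afvm.total > tile_capacity:
            raise ValueError(f"{afvm.name} does not fit in any FlowTile")
        for tile in tiles:
            if sum(f.total for f in tile) + afvm.total <= tile_capacity:
                tile.append(afvm)
                break
        else:
            tiles.append([afvm])
    return tiles


a = FVM("a", code=4, state=2, stack=3, channel=1)
b = FVM("b", code=2, state=2, stack=1, channel=1)
c = FVM("c", code=3, state=2, stack=2, channel=1)
d = FVM("d", code=4, state=1, stack=2, channel=1)

ab = aggregate("ab", [a, b])
tiles = compose_tiles([ab, c, d], tile_capacity=16)
```

Here sharing the stack saves one unit relative to a purely linear aggregation of `a` and `b`, which is what lets `ab` fit a 16-unit FlowTile in this toy configuration.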
In yet another embodiment, a method for routing FlowLogic Signals over a structurally configurable in-memory processing array is provided. The method initiates with configuring a pool of memory resource units into corresponding OutputBuffers, CommuteBuffers and ChannelMemories, the pool of memory units shared with a FlowLogicMachine. A producer-consumer relationship between the corresponding OutputBuffers and CommuteBuffers is configured and a producer-consumer relationship between the CommuteBuffers and VirtualChannels residing in the ChannelMemories is configured. Producer-consumer relationships between the OutputBuffers and VirtualChannels residing in said ChannelMemories are configured and producer-consumer relationships between the CommuteBuffers and neighboring CommuteBuffers are configured.
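The producer-consumer chain described above can be sketched in software. The following Python model is a rough illustration only: the 4-byte Flit size, the `Flit` field layout, and the helper names `segment`, `interleave` and `ChannelMemory` are assumptions of this example. It shows Signals being segmented into small fixed-size Flits on the producer side, Flits from two sources interleaving in a CommuteBuffer, and each Signal being reassembled in its own VirtualChannel purely by addressed writes.

```python
from collections import deque
from dataclasses import dataclass
from itertools import zip_longest
from typing import Dict, Iterator, List

FLIT_BYTES = 4  # assumed Flit payload size for this sketch


@dataclass(frozen=True)
class Flit:
    channel: int  # destination VirtualChannel identifier
    offset: int   # byte address inside that VirtualChannel
    data: bytes


def segment(signal: bytes, channel: int) -> List[Flit]:
    """OutputBuffer side: cut a Signal into fixed-size, addressed Flits."""
    return [
        Flit(channel, i, signal[i:i + FLIT_BYTES])
        for i in range(0, len(signal), FLIT_BYTES)
    ]


def interleave(*streams: List[Flit]) -> Iterator[Flit]:
    """Model Flits from several sources arriving in alternation."""
    for group in zip_longest(*streams):
        for flit in group:
            if flit is not None:
                yield flit


class ChannelMemory:
    """Consumer side: one byte region per VirtualChannel."""

    def __init__(self, sizes: Dict[int, int]):
        self.channels = {cid: bytearray(size) for cid, size in sizes.items()}

    def write(self, flit: Flit) -> None:
        # The addressed write both segregates and reassembles: the Flit lands
        # in its own channel at its own offset regardless of arrival order.
        region = self.channels[flit.channel]
        region[flit.offset:flit.offset + len(flit.data)] = flit.data


sig_a = b"HELLO-FLOW"
sig_b = b"tile42"

commute_buffer = deque(interleave(segment(sig_a, 0), segment(sig_b, 1)))
arrival_order = [flit.channel for flit in commute_buffer]

memory = ChannelMemory({0: len(sig_a), 1: len(sig_b)})
while commute_buffer:
    memory.write(commute_buffer.popleft())
```

Because every Flit carries its own destination address in this model, no separate sorting step is needed at the receiver even though the two Signals arrive interleaved.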
In still yet another embodiment, a method for debugging a stateful, transaction-oriented runtime system having a multi-level array of storage elements is provided. The method includes instructing the stateful transaction oriented system to pause and instructing the stateful transaction oriented system to single step until a given point. Information for selected FlowGate invocations is tracked and areas within the multi-level array of storage elements are queried for the debugging session.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, and like reference numerals designate like structural elements.
An invention is described for a structurally reconfigurable intelligent memory device for efficient implementation of stateful, transaction-oriented systems in silicon. It will be obvious, however, to one skilled in the art, that the present invention may be practiced without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The embodiments of the present invention described below provide a method and apparatus enabling a flexible design capture methodology which allows a designer to select the granularity at which a stateful, transaction-oriented application is captured. An efficient methodology to implement a stateful, transaction-oriented application on a platform economically superior with respect to design effort, implementation costs and manufacturability is further described below. The embodiments utilize an execution model that allows for efficient compiler optimization and resource allocation, efficient hardware implementation, and accurate performance analysis and prediction when a design is captured and analyzed. It should be appreciated that no significant uncertainty is introduced by design compilation, mapping into the physical platform, or resource conflicts during system operation. The resource requirements are specified explicitly when the design is captured, using annotations or compiler analysis. Allocation of hardware resources can be determined statically at compile time.
In another aspect of the invention a simple and effective chip architecture that uses a single level real memory organization to eliminate the costs of managing a caching hierarchy associated with virtual memory systems in applications development, compiler optimization, run-time system support, and hardware complexity is provided. As will be explained in more detail below, the embodiments described herein meet the tremendous demands of memory capacity and bandwidth in future generation SOCs with solutions that are economical in die area, product development cycle and power consumption. At the same time, the embodiments reap the cost, performance and power consumption benefits of advanced deep submicron fabrication processes with exceedingly high manufacturability and reliability.
The FlowLogic architecture allows flexible design space exploration of performance and quantitative behavior, followed by flexible mapping of the results into the structurally field-configurable semiconductor device. The parameters related to Arcs 108, among others, are determined interactively during system simulations using FlowLogic. It may be noted that the performance behavior of such systems will only be as good as the traffic pattern assumptions made in the simulation. In one embodiment, FlowGates referred to as DynamicFlowGates can be dynamically loaded and linked at run-time. In one embodiment, DynamicFlowGates are limited to serving the purposes of run-time system diagnostics and debug. Thus, an overview of the FlowLogic system and language has been provided above and further details are provided with reference to the Figures referenced below.
It should be noted that the sizes of the logical memory partitions in an FVM are arbitrary and the partitions have physically independent access paths. The code related to FlowGates and FlowMethods is compiled into relocatable machine code, which in turn determines the logical size of the corresponding FVM CodeMemory. The FlowGateIndex contains a jump table indexed on a unique FlowGate identifier along with the pointer to the FlowGate code, among other context data for proper FlowGate execution. The StackMemory is used for storing intermediate states as required during the FlowGate execution. There are no register files in the FVM. The working of the FVM is analogous to that of a stack machine. The Stack is always empty before a FlowGate starts since the FlowGate by itself does not have a persistent state, and the FlowGate is not allowed to suspend.
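The stack-machine analogy can be made concrete with a toy interpreter. The sketch below is an assumption-laden illustration, not the FVM instruction set: the opcodes `LOAD`, `STORE`, `PUSH`, `ARG`, `ADD` and `RET` are invented for this example, as is the `ToyFVM` class. It shows a FlowGateIndex acting as a jump table into CodeMemory, a stack that is empty when a FlowGate starts, and persistent data living only in StateMemory.

```python
from typing import Dict, List, Tuple


class ToyFVM:
    """A toy stack machine with separate code, state and stack partitions."""

    def __init__(self, code: List[Tuple], gate_index: Dict[int, int], state: dict):
        self.code = code              # CodeMemory: list of (opcode, operand...) tuples
        self.gate_index = gate_index  # FlowGateIndex: gate id -> code address
        self.state = state            # StateMemory: the only persistent data
        self.stack: list = []         # StackMemory: temporaries for one firing

    def fire(self, gate_id: int, arg=None) -> None:
        """Run one FlowGate from its single entry point to completion."""
        assert not self.stack, "stack must be empty before a FlowGate starts"
        pc = self.gate_index[gate_id]
        while True:
            op, *operand = self.code[pc]
            pc += 1
            if op == "PUSH":
                self.stack.append(operand[0])
            elif op == "ARG":
                self.stack.append(arg)
            elif op == "LOAD":
                self.stack.append(self.state[operand[0]])
            elif op == "STORE":
                self.state[operand[0]] = self.stack.pop()
            elif op == "ADD":
                right = self.stack.pop()
                left = self.stack.pop()
                self.stack.append(left + right)
            elif op == "RET":
                break
            else:
                raise ValueError(f"unknown opcode {op!r}")
        self.stack.clear()  # temporaries never outlive the firing


code = [
    ("LOAD", "count"),   # address 0: gate 0, count += arg
    ("ARG",),
    ("ADD",),
    ("STORE", "count"),
    ("RET",),
    ("PUSH", 0),         # address 5: gate 1, count = 0
    ("STORE", "count"),
    ("RET",),
]

fvm = ToyFVM(code, gate_index={0: 0, 1: 5}, state={"count": 0})
fvm.fire(0, 3)
fvm.fire(0, 4)
count_after_adds = fvm.state["count"]
fvm.fire(1)
count_after_reset = fvm.state["count"]
```

There are no registers in this model; every intermediate value passes through the stack, and nothing on the stack survives from one firing to the next.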
The size or the depth of the Stack is determined at compile-time by the FlowLogic compiler. As may be evident, FlowLogic programming style does not support nested calls and recursive function calls whose depths are not predictable at compile-time. Furthermore, there is no dynamic allocation or garbage collection in FlowLogic because memory resource allocations are fixed at compile-time. Other than temporary variables whose lifetimes span the FlowGate call, State variables are all pre-allocated at compile-time. The size of the StateMemory 126 for an FVM is known at compile time. OutputBuffer 128 and ChannelMemory 130 are managed by the run-time system and are visible to the system designer only via annotation in one embodiment. OutputBuffer 128 is a small memory area for temporarily staging outgoing Signals. ChannelMemory 130, on the other hand, hosts the Channels and is as large as is required by the corresponding FVM. It is useful to point out at this time that although these memories have different access data paths, the memories all use the same resource types in the structurally configurable in-memory processing array. In fact, memories are the only resources directly allocated in the array, with other necessary logic, including processing elements, being fixed to such memory resources.
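One way a compiler could bound the Stack depth statically is to walk the call graph reachable from a FlowGate and reject any cycle. The Python sketch below is a minimal illustration under assumptions: per-function frame sizes and the call graph are given as plain dictionaries, and the function name `worst_case_stack_depth` is hypothetical. It is not a description of the actual FlowLogic compiler.

```python
from typing import Dict, List


def worst_case_stack_depth(
    entry: str,
    frame_size: Dict[str, int],
    calls: Dict[str, List[str]],
) -> int:
    """Return the deepest stack use reachable from `entry`.

    Raises ValueError on recursion, since a recursive call chain has no
    depth that can be fixed at compile time.
    """
    memo: Dict[str, int] = {}
    visiting = set()

    def depth(fn: str) -> int:
        if fn in memo:
            return memo[fn]
        if fn in visiting:
            raise ValueError(f"recursive call through {fn!r}")
        visiting.add(fn)
        deepest_callee = max((depth(c) for c in calls.get(fn, ())), default=0)
        visiting.discard(fn)
        memo[fn] = frame_size[fn] + deepest_callee
        return memo[fn]

    return depth(entry)


# Hypothetical FlowGate "gate" calling helper FlowMethods.
frame_size = {"gate": 4, "parse": 6, "crc": 2, "update": 3}
calls = {"gate": ["parse", "update"], "parse": ["crc"]}

stack_words = worst_case_stack_depth("gate", frame_size, calls)
```

In this example the deepest chain is `gate` to `parse` to `crc`, so the StackMemory partition would need 12 units; the `update` branch is shallower and does not add to the bound.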
The FlowLogicMachine can itself be thought of as an array of structurally configurable memory units that implements a plurality of FlowTiles, where the computational logic is fixed and distributed. As a further analogy, the FlowLogic language described herein may be thought of as the JAVA language, while the FlowLogicMachine may be analogized to the JAVA Virtual machine, since the FlowLogic Language has some attributes of object oriented programming languages. One skilled in the art will appreciate that most of the resources in question are memory units in one form or another, i.e., code, state, stack, channels, and buffer. Motivated by the above observation, the FlowLogicMachine is designed to provide the ability to configure these memory units, also referred to as memory resources, as required by a particular application, and the FlowLogic representation allows a system description to be re-cast in flexible ways to achieve the targeted capacity, performance, and functionality.
Some of the FlowTiles, for example those on the periphery of the array, are configured to interface with the external world. This interface is also Signal based and is accomplished through an Adapter as shown in
The FlowLogicMachine has novel features that help in system diagnosis among others. FlowGates are by-design atomic and always go to completion, once fired. There is no notion of run-time instruction-level single-stepping in the context of FlowLogicMachine. Instead, it can be stepped on FlowGate boundaries. FlowTiles can be instructed to execute one FlowGate at a time. An external debug controller can observe the StateMemory, ChannelMemory and other partitions of the FVM by making explicit system read calls when the FlowLogicMachine is paused between steps of FlowGate execution. The debug controller may even launch DynamicFlowGates to achieve diagnostic goals. The FlowLogicMachine has built-in FlowGates called SystemFlowGates for read, write and configuration purposes. The SystemFlowGates come into existence on device boot, independent of applications. These SystemFlowGates are also used for booting application-specific FVMs.
The embodiments described herein also support runtime debugging of the FlowLogicMachine. The FlowLogic runtime system can be controlled from an outside machine (host) through sending and receiving of signals with specific debugging payloads. The host sends debugging commands to the runtime system in signals; it also receives data and state information back from the runtime system in signals.
The following debugging techniques are supported by the FlowLogicMachine:
- The runtime system can be instructed to pause (break) execution on a given condition. These conditions may include invocation of a specific FlowGate, the contents of any input signal, any expression on FlowGate invocations (e.g., the nth invocation of a given FlowGate), or any other internal state of the runtime system. Upon halting execution, the runtime system will notify the host by sending a signal indicating that execution has stopped. The host can then control the debugging process by sending further instructions encapsulated in signals.
- The runtime system can be instructed to resume execution (step) until a given condition. This is analogous to single-stepping in a compiled code environment. Several variants of this behavior are supported, such as “step to the next FlowGate invocation”, “step to the nth invocation of a given FlowGate”, or “step until a FlowGate receives a signal with a given content”.
- The runtime system can be instructed to capture information (trace) about selected or all FlowGate invocations and communicate this information to the host. The information communicated is essentially a trace of the firings of FlowGates, their input signals, and their output signals.
- The runtime system can be instructed to query certain memory areas in the tile and return data (dump) to the host system. The information communicated can be the current positions of the context pointers (such as MP), the contents of any memory or a sub-range of that memory, or the current utilization of VirtualChannels.
- To support diagnostics and debugging, executable FlowGate code can be sent from the host to the runtime system of a given FlowTile. The runtime system will load this code into its CodeMemory and execute it to support the debugging session.
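The break, step, trace and dump techniques listed above can be sketched as a small command loop. The Python model below is illustrative only: the command names (`"break"`, `"run"`, `"step"`, `"trace"`, `"dump"`), the `DebugRuntime` class, and the use of plain lists to stand in for signals exchanged with the host are all assumptions of this example. The point it shows is that control returns to the host only on FlowGate boundaries, never in the middle of a FlowGate.

```python
from collections import deque


class DebugRuntime:
    """Toy runtime that can be paused and stepped on FlowGate boundaries."""

    def __init__(self, gates, state):
        self.gates = gates        # gate name -> callable(state, payload) -> outputs
        self.state = state
        self.pending = deque()    # queued (gate name, payload) invocations
        self.breakpoints = set()
        self.trace_on = False
        self.trace = []           # (gate, input payload, outputs) per firing
        self.to_host = []         # stands in for signals sent back to the host

    def post(self, gate, payload=None):
        self.pending.append((gate, payload))

    def _next_gate(self):
        return self.pending[0][0] if self.pending else None

    def _fire_one(self):
        gate, payload = self.pending.popleft()
        outputs = self.gates[gate](self.state, payload)  # runs to completion
        if self.trace_on:
            self.trace.append((gate, payload, outputs))

    def handle(self, command, arg=None):
        """Process one debugging command received from the host."""
        if command == "break":
            self.breakpoints.add(arg)
        elif command == "trace":
            self.trace_on = bool(arg)
        elif command == "step":
            if self.pending:
                self._fire_one()
            self.to_host.append(("stopped", self._next_gate()))
        elif command == "run":
            # Stops *before* a gate with a breakpoint; use "step" to pass it.
            while self.pending and self._next_gate() not in self.breakpoints:
                self._fire_one()
            self.to_host.append(("stopped", self._next_gate()))
        elif command == "dump":
            self.to_host.append(("dump", dict(self.state)))
        else:
            raise ValueError(f"unknown command {command!r}")


def inc(state, payload):
    state["n"] += payload
    return []


def double(state, payload):
    state["n"] *= 2
    return []


rt = DebugRuntime({"inc": inc, "double": double}, {"n": 0})
for gate, payload in [("inc", 1), ("inc", 2), ("double", None), ("inc", 5)]:
    rt.post(gate, payload)

rt.handle("trace", True)
rt.handle("break", "double")
rt.handle("run")    # fires both "inc" gates, stops before "double"
rt.handle("dump")   # host reads state while paused
rt.handle("step")   # fires "double" only
rt.handle("run")    # runs the remaining invocation
```

Loading executable FlowGate code from the host, the last technique in the list, is not modeled here.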
One skilled in the art may note that FlowLogic is not a general method for describing any digital system for system-on-chip implementation. Some of its notable distinctions include:
- 1. It raises the level of abstraction for design capture, verification and analysis. To allow for implementation flexibility, it is not required to preserve cycle accuracy among different levels of design representation.
- 2. At a higher level of design capture, it is not deemed necessary to support arbitrary combinational logic oriented systems efficiently.
- 3. The performance of the system designed using FlowLogic depends on the mix of workload used in simulation.
- 4. Functionality and performance of FlowLogic designs are not efficiently implemented on systems that primarily span over bandwidth constrained networks. FlowLogic is optimized for implementation on bandwidth over-provisioned on-chip intelligent memory with Flit based communications.
FlowLogic relies on the assumption that quantitative behavior at the FlowLogic level is perturbed minimally as it is translated to the physical implementation.
The embodiments described above provide a memory centric approach for a processing system design and architecture, as well as the FlowLogic language for designing, synthesizing, and placing and routing techniques for this unique processing system design. Terms of the FlowLogic language have been analogized to some object oriented terms for ease of understanding. For example, a FlowGate may be thought of as a Function, Procedure or Task, while a FlowModule may be analogized to an object in object oriented programming. A Signal may be referred to as a message or a packet. It should be appreciated that while these analogies are used for explanatory purposes, there are significant differences between the embodiments described herein and the corresponding analogies.
Traditional processors incorporate the notion of virtual memories to push physical memory away from the processing core. To do so, they introduce accumulators, registers and caching hierarchies. The embodiments described above embrace the incorporation of processing core(s) directly within the physical memory. Furthermore, the data paths in the above-described embodiments are significantly different than the data paths within the traditional processor architecture.
The invention has been described herein in terms of several exemplary embodiments. Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention. The embodiments and preferred features described above should be considered exemplary, with the invention being defined by the appended claims.
With the above embodiments in mind, it should be understood that the invention may employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing.
Any of the operations described herein that form part of the invention are useful machine operations. The invention also relates to a device or an apparatus for performing these operations. The apparatus may be specially constructed for the required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.
Claims
1. A method for synthesizing a stateful, transaction-oriented system for flexible mapping to a structurally field-configurable semiconductor device having a multi-level array of storage elements, for in-memory processing, comprising method operations of:
- mapping FlowLogic to a network of FlowVirtualMachines (FVM);
- mapping a FlowModule into a corresponding FlowVirtualMachine (FVM);
- integrating one or more FVMs into an AggregateFVM (AFVM);
- composing one or more AFVMs into a FlowTile; and
- routing Signals between FlowModules.
2. The method of claim 1, wherein the FVM is an array of similar memory unit resources configured into partitions, the partitions accessible via a plurality of independent access paths.
3. The method of claim 1 wherein the partitions define a FlowGateIndex, a StackMemory space, a CodeMemory space, a StateMemory space, an OutputBuffer space and a ChannelMemory space.
4. The method of claim 2, further comprising:
- relocating the partitions; and
- repeating the method operations of mapping FlowLogic to a network of FlowVirtualMachines (FVM); mapping a FlowModule into a corresponding FlowVirtualMachine (FVM); integrating one or more FVMs into an AggregateFVM (AFVM); composing one or more AFVMs into a FlowTile; and routing Signals between FlowModules.
5. The method of claim 1 wherein the AFVM is derived from a composition of FVMs by one of linearly aggregating, merging or sharing memory unit resources of the composition of FVMs.
6. The method of claim 1, wherein the FlowTile is derived from a composition of AFVMs by one of linearly aggregating, merging or sharing of the memory unit resources of the composition of AFVMs.
7. The method of claim 1, wherein the FlowTile provides scheduling functionality through run-time flow control, reception of Signals and invoking of appropriate FlowGates.
8. The method of claim 3, wherein the FlowTile enables signals to be commuted out of the OutputBuffer space and into the ChannelMemory space.
9. The method of claim 2, wherein the FVM is without memory or caching hierarchies, and wherein all elements in the partitions are accessible in a same access time, the method further comprising:
- allocating and defining initial contents for all memories at compile time.
10. The method of claim 1, further comprising:
- designating SystemFlowGates that are application independent, built-in and available on power-on boot;
- providing access to the storage elements for read, write and configuration operations; and
- providing booting application specific FVMs.
11. The method of claim 1, further comprising:
- splitting the Signals into two portions, a first portion defining header information and a second portion defining a payload, the first portion residing in a different part of the memory from the second portion.
12. A method for routing FlowLogic Signals over a structurally configurable in-memory processing array, the method comprising:
- configuring a pool of memory resource units into corresponding OutputBuffers, CommuteBuffers and ChannelMemories, the pool of memory units shared with a FlowLogicMachine;
- configuring a producer-consumer relationship between the corresponding OutputBuffers and CommuteBuffers;
- configuring a producer-consumer relationship between the CommuteBuffers and VirtualChannels residing in the ChannelMemories;
- configuring producer-consumer relationships between the OutputBuffers and VirtualChannels residing in said ChannelMemories; and
- configuring producer-consumer relationships between the CommuteBuffers and neighboring CommuteBuffers.
13. The method of claim 12 wherein configuring a producer-consumer relationship between the corresponding OutputBuffers and CommuteBuffers includes,
- enabling simultaneous access of the memory resource units through independent ports, asynchronous clocks and physical addressing; and
- segmenting signals into small fixed size entities (Flits).
14. The method of claim 12 wherein configuring producer-consumer relationship between the CommuteBuffers and VirtualChannels residing in the ChannelMemories includes,
- enabling simultaneous access of the memory resource units through independent ports, asynchronous clocks and physical addressing;
- reassembling the small fixed size entities in the VirtualChannels;
- segregating small fixed size entities arriving simultaneously for different signals from different sources prior to the reassembling.
15. The method of claim 14, wherein physically addressed writes into corresponding memory units achieve the reassembling and the segregating.
16. The method of claim 12 wherein configuring a producer-consumer relationship between the OutputBuffers and VirtualChannels includes,
- enabling simultaneous access of the memory resource units through independent ports, asynchronous clocks and physical addressing;
- reassembling Flits into Signals in the VirtualChannels; and
- segregating Flits arriving simultaneously for different Signals from different sources prior to reassembly, wherein the reassembling and the segregating are achieved through physically addressed writes into corresponding memory units.
17. The method of claim 12, wherein the method operation of configuring producer-consumer relationships between the CommuteBuffers and neighboring CommuteBuffers includes,
- enabling simultaneous access of the memory resource units through independent ports, asynchronous clocks and physical addressing; and
- switching an input Flit from a neighbor to a corresponding CommuteBuffer.
18. The method of claim 12, wherein the pool of memory resource units are single ported memories with time division access.
19. The method of claim 12, wherein the pool of memory resource units are enabled for synchronous access using a global clock.
20. A method for debugging a stateful, transaction-oriented runtime system having a multi-level array of storage elements, comprising method operations of:
- instructing the stateful transaction oriented system to pause;
- instructing the stateful transaction oriented system to single step until a given point;
- tracking information for selected FlowGate invocations; and
- querying contents of a portion within the multi-level array of storage elements.
21. The method of claim 20, wherein the method operation of instructing the stateful transaction oriented system to pause includes,
- transmitting a signal to a host system indicating that the system has paused; and
- controlling the debugging process through the host system by sending further instructions encapsulated in signals.
22. The method of claim 20, wherein the method operation of tracking information for selected FlowGate invocations includes,
- tracing of firings of FlowGates, including FlowGate input signals, and FlowGate output signals.
23. The method of claim 20, wherein the method operation of querying contents of the portion within the multi-level array of storage elements includes,
- communicating information to a host system, the information including current position of context pointers, contents of a portion of the multi-level array of storage elements or utilization of VirtualChannels.
24. The method of claim 20, further comprising:
- sending executable FlowGate code from a host to the runtime system of a given tile;
- loading the FlowGate code into the multi-level storage array; and
- executing the FlowGate code.
Type: Application
Filed: Oct 18, 2010
Publication Date: Feb 10, 2011
Inventors: Shridhar Mukund (San Jose, CA), Anjan Mitra (Santa Clara, CA), Jed Krohnfeldt (Los Gatos, CA), Clement Leung (Fremont, CA)
Application Number: 12/906,967