INTERCONNECT THAT ELIMINATES ROUTING CONGESTION AND MANAGES SIMULTANEOUS TRANSACTIONS

- SONICS, INC.

A method, apparatus, and system are described, which generally relate to an integrated circuit having an interconnect. The flow control logic for the interconnect applies a flow control splitting protocol to permit transactions from each initiator thread and/or each initiator tag stream to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual targets within an aggregate target at once. The combined flow control logic and flow control protocol allows the interconnect to manage simultaneous requests to multiple channels in an aggregate target from the same thread or tag at the same time.

Description
RELATED APPLICATIONS

This application is a continuation of patent application Ser. No. 12/144,987, filed Jun. 24, 2008, titled “Various methods and apparatus to support outstanding requests to multiple targets while maintaining transaction ordering,” which is related to and claims the benefit of U.S. Provisional Patent Application Ser. No. 60/946,096, titled “An interconnect implementing internal controls,” filed Jun. 25, 2007.

NOTICE OF COPYRIGHT

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the software engine and its modules, as it appears in the Patent and Trademark Office Patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF THE INVENTION

Embodiments of the invention generally relate to an interconnect implementing internal controls to eliminate routing congestion.

BACKGROUND OF THE INVENTION

When an SOC has multiple DRAM interfaces for accessing multiple DRAMs in parallel at differing addresses, each DRAM interface is commonly referred to as a memory “channel”. In the traditional approach, the channels are not interleaved, so the application software and all hardware blocks that generate traffic need to make sure that they spread their traffic evenly across the channels to balance the loading. Also, past systems used address generators that split a thread into multiple requests, each request being sent to its own memory channel. This forced the software and the system functional blocks to be aware of the organization and structure of the memory system when generating initiator requests. Also, some prior supercomputer systems forced a memory channel to be divided up at the size of a burst length request. Also, in some prior art, requests from a processor to perform memory operations are expanded into individual memory addresses by one or more address generators (AGs). To supply adequate parallelism, each AG is capable of generating multiple addresses per cycle to the multiple segments of a divided-up memory channel. The memory channel performs the requested accesses and returns read data to a reorder buffer (RB) associated with the originating AG. The reorder buffer collects and reorders replies from the memory channels so they can be presented to the initiator core.

In the traditional approach, the traffic may be split deeply in the memory subsystem in central routing units, which increases traffic and routing congestion, increases design and verification complexity, eliminates topology freedom, and increases latencies. The created centralized point can act as a bandwidth choke point, a routing congestion point, and a cause of longer propagation path lengths that would lower achievable frequency and increase switching power consumption. Also, some systems use re-order buffers to maintain an expected execution order of transactions in the system.

In the typical approach, area-consuming reorder buffering is used at the point where the traffic is being merged, to hold response data that comes back too early from a target.

SUMMARY OF THE INVENTION

A method, apparatus, and system are described, which generally relate to an integrated circuit having an interconnect that has multiple initiator IP cores and multiple target IP cores that communicate request transactions over an interconnect. The interconnect provides a shared communications bus between the multiple initiator IP cores and multiple target IP cores. The flow control logic for the interconnect applies a flow control splitting protocol to permit transactions from each initiator thread and/or each initiator tag stream to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual targets within an aggregate target at once. The combined flow control logic and flow control protocol allows the interconnect to manage simultaneous requests to multiple channels in an aggregate target from the same thread or tag at the same time.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings refer to embodiments of the invention as follows.

FIG. 1 illustrates a block diagram of an embodiment of a System-on-a-Chip having multiple initiator IP cores and multiple target IP cores that communicate transactions over an interconnect.

FIG. 2 illustrates an embodiment of a map of contiguous address space in which distinct memory IP cores are divided up in defined memory interleave segments and then interleaved with memory interleave segments from other memory IP cores.

FIG. 3 shows an embodiment of a map of an address region for multiple interleaved memory channels.

FIG. 4A illustrates a block diagram of an embodiment of an integrated circuit having multiple initiator IP cores and multiple target IP cores that maintains request order for read and write requests over an interconnect that has multiple thread merger and thread splitter units.

FIG. 4B illustrates a block diagram of an embodiment of flow control logic implemented in a centralized merger splitter unit to maintain request path order.

FIG. 5 illustrates a block diagram of an embodiment of one or more thread splitter units to route requests from an initiator IP core generating a set of transactions in a thread down two or more different physical paths.

FIG. 6 illustrates an example timeline of a thread splitter unit in an initiator agent using flow control protocol logic that allows multiple write requests from a given thread to be outstanding at any given time but restricts the issuance of a subsequent write request from that thread.

FIG. 7a illustrates an example timeline of an embodiment of flow logic to split a 2D WRITE Burst request.

FIG. 7b also illustrates an example timeline of an embodiment of flow logic to split a 2D WRITE Burst request.

FIG. 7c illustrates an example timeline of an embodiment of flow logic to split a 2D READ Burst.

FIG. 8 illustrates a block diagram of an embodiment of a response path from two target agents back to two initiator agents through two thread splitting units and two thread merger units.

FIG. 9 shows the internal structure of an example interconnect maintaining the request order within a thread and the expected response order to those requests.

FIG. 10 illustrates a diagram of an embodiment of chopping logic to directly support chopping individual transactions that cross the channel address boundaries into two or more transactions/requests from the same thread.

FIG. 11 illustrates a diagram of an embodiment of a path across an interconnect from an initiator agent to multiple target agents including a multiple channel aggregate target.

FIGS. 12a-12e illustrate five types of channel based chopping for block burst requests: normal block chopping, block row chopping, block height chopping, block deadlock chopping, and block deadlock chopping and then block height chopping.

FIG. 13 illustrates a flow diagram of an embodiment of an example of a process for generating a device, such as a System on a Chip, with the designs and concepts discussed above for the Interconnect.

While the invention is subject to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. The invention should be understood to not be limited to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

DETAILED DISCUSSION

In the following description, numerous specific details are set forth, such as examples of specific data signals, named components, connections, number of memory channels in an aggregate target, etc., in order to provide a thorough understanding of the present invention. It will be apparent, however, to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known components or methods have not been described in detail but rather shown in a block diagram in order to avoid unnecessarily obscuring the present invention. Further, specific numeric references, such as a first target, may be made. However, the specific numeric reference should not be interpreted as a literal sequential order but rather interpreted to mean that the first target is different than a second target. Thus, the specific details set forth are merely exemplary. The specific details may be varied from and still be contemplated to be within the spirit and scope of the present invention.

In general, a method, apparatus, and system are described, which generally relate to an integrated circuit having an interconnect that implements internal controls. The interconnect may maintain request path order; maintain response path order; interleave channels in an aggregate target with unconstrained burst sizes; have configurable parameters for channels in an aggregate target; chop individual transactions that cross channel boundaries headed for channels in an aggregate target; chop individual transactions that cross channel boundaries headed for channels in an aggregate target so that two or more of the chopped portions retain their 2D burst attributes; as well as implement many other internal controls.

In an embodiment, the flow control logic for the interconnect applies a flow control splitting protocol to permit transactions from each initiator thread and/or each initiator tag stream to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual targets within an aggregate target at once. The combined flow control logic and flow control protocol allows the interconnect to manage simultaneous requests to multiple channels in an aggregate target from the same thread or tag at the same time.

Most aspects of the invention may be applied in most networking environments, and an example integrated circuit, such as a System-on-a-Chip environment, will be used to flesh out these aspects of the invention.

FIG. 1 illustrates a block diagram of an embodiment of a System-on-a-Chip having multiple initiator IP cores and multiple target IP cores that communicate read and write requests as well as responses to those requests over an interconnect. Each initiator IP core, such as a CPU IP core 102, an on-chip security IP core 104, a Digital Signal Processor (DSP) IP core 106, a multimedia IP core 108, a Graphics IP core 110, a streaming Input-Output (I/O) IP core 112, a communications IP core 114 (such as a wireless transmit and receive IP core for communicating with devices or components external to the chip), and other similar IP cores, may have its own initiator agent 116 to interface with the interconnect 118. Each target IP core, such as a first DRAM IP core 120 through a fourth DRAM IP core 126 as well as a FLASH memory IP core 128, may have its own target agent 130 to interface with the interconnect 118. Each DRAM IP core 120-126 may have an associated memory scheduler 132 as well as a DRAM controller 134.

The Intellectual Property (IP) cores have self-contained designed functionality to provide a macro function to the system. The interconnect 118 implements an address map 136 with assigned addresses for the target IP cores 120-128, and potentially the initiator IP cores 102-114, in the system to route the requests, and potentially responses, between the target IP cores 120-128 and initiator IP cores 102-114 in the integrated circuit. One or more address generators may be in each initiator IP core to provide the addresses associated with data transfers that the IP core will initiate to memories or other target IP cores. All of the IP cores may operate at different performance rates (i.e., peak bandwidth, which can be calculated as the clock frequency times the number of data bit lines (also known as data width), and sustained bandwidth, which represents a required or intended performance level). Most of the distinct IP cores communicate to each other through the memory IP cores 120-126 on and off chip. The DRAM controller 134 and address map 136 in each initiator agent 116 and target agent 130 abstract the real IP core addresses of each DRAM IP core 120-126 from other on-chip cores by maintaining the address map and performing address translation of assigned logical addresses in the address map to physical IP addresses.

The address mapping hardware logic may also be located inside an initiator agent. The DRAM scheduler & controller may be connected downstream of a target agent. Accordingly, one method for determining the routing of requests from initiators to targets is to implement an address mapping apparatus that associates incoming initiator addresses with specific target IP cores. One embodiment of such an address mapping apparatus is to implement target address decoding logic in each initiator agent. In order for a single initiator to be able to access all of the target IP core locations, the initiator may need to provide more total address values than a single target IP core contains, so the interconnect may translate the initiator address into a target IP core address. One embodiment of such a translation is to remove the initiator address bits that were used to decode the selected target IP core from the address that is presented to the target IP core.
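
As a minimal sketch of this style of target decode (the region layout, names, and subtractive translation below are illustrative assumptions, not the patented implementation), an initiator agent can match an incoming address against per-target ranges and strip the address bits that selected the target before presenting the remainder to that target:

    # Hypothetical initiator-agent target decode; regions are invented examples.
    TARGET_REGIONS = [
        # (target name, base address, size in bytes)
        ("DRAM0", 0x0000_0000, 0x1000_0000),
        ("DRAM1", 0x1000_0000, 0x1000_0000),
        ("FLASH", 0x2000_0000, 0x0100_0000),
    ]

    def decode(initiator_addr):
        """Return (target, target-local address) for an initiator address."""
        for name, base, size in TARGET_REGIONS:
            if base <= initiator_addr < base + size:
                # Remove the initiator address bits that selected the target,
                # leaving an offset the target IP core can interpret locally.
                return name, initiator_addr - base
        raise ValueError("address 0x%x hits no mapped region" % initiator_addr)

    # An address in the second region decodes to DRAM1 at local offset 0x40.
    assert decode(0x1000_0040) == ("DRAM1", 0x40)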

The interconnect 118 provides a shared communications bus between IP core sub-systems 120-128 and 102-114 of the system. All the communication paths in the shared communication bus need not pass through a single choke point rather many distributed pathways may exist in the shared communication bus. The on-chip interconnect 118 may be a collection of mechanisms that may be adapters and/or other logical modules along with interconnecting wires that facilitate address-mapped and arbitrated communication between the multiple Intellectual Property cores 102-114 and 120-128.

The interconnect 118 may be part of an integrated circuit, such as System-on-a-Chip, that is pipelined with buffering to store and move requests and responses in stages through the System-on-a-Chip. The interconnect 118 may have flow control logic that is 1) non-blocking with respect to requests from another thread as well as with respect to requiring a response to an initial request before issuing a subsequent request from the same thread, 2) implements a pipelined protocol, and 3) maintains each thread's expected execution order. The interconnect also may support multiple memory channels, with 2D and address tiling features, response flow control, and chopping of individual burst requests. Each initiator IP core may have its own initiator agent to interface with the interconnect. Each target IP core may have its own target agent to interface with the interconnect.

The System-on-a-Chip may be pipelined to store and move requests and responses in stages through the System-on-a-Chip. The flow control logic in the interconnect is non-blocking with respect to requests from another thread as well as with respect to requiring a response to a first request before issuing a second request from the same thread, pipelined, and maintains each thread's execution order.

Each memory channel may be an IP core, or multiple external DRAM chips ganged together to act as a single memory, that makes up the width of a data word, such as 64 bits or 128 bits. Each IP core and DRAM chip may have multiple banks inside that IP core/chip. Each channel may contain one or more buffers that can store requests and/or responses associated with the channel. These buffers can hold request addresses, write data words, read data words, and other control information associated with channel transactions, and can help improve memory throughput by supplying requests and write data to the memory, and receiving read data from the memory, in a pipelined fashion. The buffers can also improve memory throughput by allowing a memory scheduler to exploit address locality to favor requests that target a memory page that is already open, as opposed to servicing a different request that forces that page to be closed in order to open a different page in the same memory bank.

One benefit of a multi-channel aggregate target is that it provides spatial concurrency to target access, thus increasing effective bandwidth over that achievable with a single target of the same width. An additional benefit is that the total burst size of each channel is smaller than the total burst size of a single channel target with the same bandwidth, since the single channel target would need a data word that is as wide as the sum of the data word sizes of each of the multiple channels in an aggregate target. The multi-channel aggregate target can thus move data between the SoC and memory more efficiently than a single channel target in situations where the data size is smaller than the burst size of the single channel target. In an embodiment, this interconnect supports a strict super-set of the feature set of the previous interconnects.

Connectivity of multi-channel targets may be primarily provided by cross-bar exchanges that have a chain of pipeline points to allow groups of channel targets to be separated on the die. The multiple channel aggregate target covers the high performance needs of digital media dominated SOCs in the general purpose (memory reference and DMA) interconnect space.

Also, the memory channels in an aggregate target may support configurable configuration parameters. The configurable parameters flexibly support a multiple channel configuration that is dynamically changeable, and they enable a single already-designed System-on-a-Chip design to support a wide range of packaging or printed circuit board-level layout options that use different on-chip or external memory configurations, by re-configuring channel-to-region assignments and interleaving boundaries between channels to better support different modes of operation of a single package.
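
One way such re-configuration could be modeled (field names and values here are our own illustrative assumptions, not the patent's parameters) is a boot-time table that re-binds channels to regions and adjusts interleave boundaries without touching the SoC design itself:

    # Hypothetical boot-time channel-to-region configuration table.
    address_regions = {
        # region number: base, size, interleave segment size, active channels
        2: dict(base=0x4000_0000, size=0x2000_0000, seg=0x1000, channels=[0, 1]),
        4: dict(base=0x8000_0000, size=0x4000_0000, seg=0x0400, channels=[0, 1, 2, 3]),
    }

    def reconfigure(region, seg=None, channels=None):
        """Re-bind interleave granularity or channel set for one package option."""
        if seg is not None:
            address_regions[region]["seg"] = seg
        if channels is not None:
            address_regions[region]["channels"] = channels

    # A two-channel package variant drops channels 2 and 3 from region 4.
    reconfigure(4, channels=[0, 1])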

Interleaved Channels in an Aggregate Target with Unconstrained Burst Sizes

Many kinds of IP core target blocks can be combined and have their address space interleaved. The below discussion will use discrete memory blocks as the target blocks being interleaved to create a single aggregate target in the system address space. An example “aggregate target” described below is a collection of individual memory channels, such as distinct external DRAM chips, that share one or more address regions that support interleaved addressing across the aggregate target set. Another aggregate target is a collection of distinct IP blocks that are being recognized and treated as a single target by the system.

FIG. 2 illustrates an embodiment of a map of contiguous address space in which distinct memory IP cores are divided up in defined memory interleave segments and then interleaved with memory interleave segments from other memory IP cores. Two or more discrete memory channels, including on-chip IP cores and off-chip memory cores, may be interleaved with each other to appear to system software and other IP cores as a single memory (i.e., an aggregate target) in the system address space. Each memory channel may be an on-chip IP memory core, an off-chip IP memory core, a standalone memory bank, or similar memory structure. For example, the system may interleave a first DRAM channel 220, a second DRAM channel 222, a third DRAM channel 224, and a fourth DRAM channel 226. Each memory channel 220-226 has two or more defined memory interleave segments, such as a first memory interleave segment 240 and a second memory interleave segment 242. The two or more defined memory interleave segments from a given discrete memory channel are interleaved with two or more defined memory interleave segments from other discrete memory channels in the address space of a memory map 236b. The address map 236a may be divided up into two or more regions, such as Region 1 through Region 4, and each interleaved memory segment is assigned to at least one of those regions and populates the system address space for that region as shown in 236b, eventually being mappable to a physical address in the address space.

For example, memory interleave segments from the first and second DRAM channels 220 and 222 are sized and then interleaved in region 2 of the address map 236b. Also, memory interleave segments from the third and fourth DRAM channels 224 and 226 are sized (at a granularity smaller than interleave segments in the first and second DRAM channels) and then interleaved in region 4 of the address map 236b. Memory interleave segments from the first and second DRAM channels 220 and 222 are also interleaved in region 4 of the address map 236b. Thus, a memory channel may have defined memory interleave segments in the address space of two or more regions and can be implemented through an aliasing technique. Memory interleave segments from the first DRAM channel 220 of a first size, such as a first memory interleave segment 240, are controlled by a configurable parameter of the second region in the address map 236b and interleave segments of a second size, such as a third memory interleave segment 244, are controlled by a configurable parameter of the fourth region in the address map 236b.

Thus, each memory channel 220-226 has defined memory interleave segments and may have memory interleave segments of different sizes. Each corresponding region in the system address map 236b has a configurable parameter, which may be programmable at run time or design time by software, to control the size granularity of the memory interleave segments in the address space assigned to that region, potentially based on the anticipated type of application expected to have transactions (including read and write requests) with the memory interleave segments in that region. As discussed, for example, the second region in the address map 236b has defined memory interleave segments allocated to that region from the first memory channel 220 that have a configured granularity at a first amount of bytes. Also, the fourth region in the address map 236b has defined memory interleave segments allocated to that region from the first memory channel 220 that have a configured granularity at a second amount of bytes. Also, each region, such as region 4, may have defined memory interleave segments allocated to that region from two or more memory channels 220-226.

FIG. 3 shows an embodiment of a map of an address region for multiple interleaved memory channels. The address region 346 of the address map 336 may have address space, for example, from 00000 to 3FFFF in the hexadecimal numbering system. The address region 346 has interleaved addressing across multiple channels in an aggregated target. The global address space covered by the address region 346 may be partitioned into the set of defined memory interleave segments from the distinct memory channels. The defined memory interleave segments are non-overlapping in address space and collectively cover and populate the entire region 346 in that address space. Each interleaved memory segment from an on-chip or off-chip IP memory core/channel is then sequentially stacked with the defined interleaved segments from the other on-chip IP memory cores to populate address space in the address map. The maximum number of channels associated with a region may be a static value derived from the number of individual targets associated with the region, and from the nature of the target. Individual targets and multi-ported targets may have a single channel; multi-channel targets have up to 2, 4, or 8 channels. In an embodiment, a num_channels attribute is introduced for the “region” construct provided in the RTL.conf syntax and is used to indicate the maximum number of active channels an address region can have. It may be possible to configure the address map to use fewer than the static number of individual targets associated with the region. The first defined memory interleave segment 340 in the region is mapped to channel 0. The second defined memory interleave segment 342 in the region is mapped to channel 1. The third defined memory interleave segment 344 in the region is mapped to channel 2. The next defined memory interleave segment 346 in the region is mapped to channel 3. This process continues until a memory interleave segment is mapped to the last channel active in this region. This completes what is known as a “channel round”. The sequential stacking process of memory interleave segments in the address space assigned to a region is then repeated until enough channel rounds are mapped to completely cover the address space assigned to a particular region. This address region will be treated as an aggregate target. A request for data, such as a first request 348, to that aggregate target in this region may then require response data that spans multiple defined memory interleave segments, and thus multiple discrete memory IP cores. Also, a physical memory location in an on-chip or off-chip memory may actually be assigned to multiple regions in the system address space, and thus have multiple assigned system addresses from that address map to the same physical memory location. Such multiple mapping, sometimes termed address aliasing, can be used to support multiple ways of addressing the same memory location, or to support dynamic allocation of the memory location to either one region or the other when the different regions have different interleaving sizes or channel groupings and may therefore have different access performance characteristics.
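
The sequential stacking of segments into channel rounds reduces to a few lines of arithmetic. The sketch below is illustrative only (the function name is ours, and it assumes segments of one fixed size per region):

    # Illustrative model of interleaved channel addressing within one region.
    def channel_route(region_addr, segment_size, num_channels):
        segment_index = region_addr // segment_size
        channel = segment_index % num_channels         # position within the channel round
        channel_round = segment_index // num_channels  # full rounds preceding this segment
        # The channel-local address collapses out the other channels' segments.
        local_addr = channel_round * segment_size + (region_addr % segment_size)
        return channel, local_addr

    # With 4 channels and 4 KB segments, address 0x5000 falls in segment 5,
    # i.e. channel 1 of the second channel round.
    assert channel_route(0x5000, 0x1000, 4) == (1, 0x1000)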

Each memory interleave segment is defined and interleaved in the system address space by a system designer at a size granularity unconstrained by the burst length request allowed by the DRAM memory design specification. The size granularity of a memory interleave segment may be a defined length between a minimum DRAM burst length request allowed by the DRAM memory design specification configured into the DRAM and an anticipated maximum DRAM memory page length as recognized by the memory configuration. The size of this granularity is a configurable value supplied by the user, such as software programmable. For example, the defined length supplied by the user may be between 64 Bytes and 64 Kilobytes.

Logically, this aggregated target presents itself as a single target to other IP cores but interleaves the memory interleave segments in the address map of the system from multiple on-chip IP memory cores/memory channels. Thus, each DRAM IP core/channel may be physically divided up into interleave segments at a size granularity supplied by the user. An initiator agent interfacing the interconnect for a first initiator IP core interrogates the address map based on a logical destination address associated with a request to the aggregate target of the interleaved two or more memory channels, and determines which memory channels will service the request and how to route the request to the physical IP addresses of each memory channel in the aggregate target servicing that request, so that any IP core need not know the physical IP addresses of each memory channel in the aggregate target.

The access load to each memory core automatically statistically spreads application traffic across the channels by virtue of the system designer configuring the granularity of the interleave segments based on the address patterns associated with expected request traffic to that region/aggregated target. Requests sent by a single initiating thread to a multi-channel address region can cross the interleave boundary such that some transfers are sent to one channel target while others are sent to another channel target within the aggregate target. These requests can be part of a request burst that crossed a channel interleave boundary, or independent transactions. Thus, if the expected request traffic for the system is dominated by requests that linearly access memory locations by virtue of the code in the programs they run, the size granularity is set up such that several requests will be serviced by a first memory channel, followed by maybe one request falling on both sides of a memory channel boundary, followed by several requests being serviced by a second memory channel. The traffic spreading is due to the system addressing, the size granularity of the memory segment, and the memory channels being stacked sequentially. Thus, for example, requests a-c 350 from a same thread may be serviced exclusively by memory channel 2, while request d 352 is partially serviced by both memory channel 2 and memory channel 3. This way of sequentially stacking defined memory interleave segments in the address space from different memory cores/channels allows the inherent spreading/load balancing between memory cores, as well as takes advantage of the principle of locality (i.e., requests in a thread tend to access memory addresses locally close to the last request and potentially reuse the same access data).
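
To make that example concrete (the addresses below are invented, assuming 4 KB interleave segments stacked across four channels), a short trace shows requests a-c landing in one channel while request d straddles a segment boundary:

    # Each address maps to channel (address // segment_size) % num_channels.
    seg, nch = 0x1000, 4
    ch = lambda addr: (addr // seg) % nch
    assert [ch(a) for a in (0x2000, 0x2400, 0x2800)] == [2, 2, 2]  # requests a-c
    # Request d spans 0x2E00-0x31FF, so its chopped halves hit channels 2 and 3.
    assert (ch(0x2E00), ch(0x3000)) == (2, 3)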

Referring to FIG. 1, an initiator IP core itself and system software are decoupled from knowing the details of the organization and structure of the memory system when generating the addresses of requests going to memory targets. Requests from the initiator cores, such as a CPU 102, to perform memory operations can be expanded into individual memory addresses by one or more address generators (AGs). To supply adequate parallelism, an AG in the initiator agent generates a single address per request, and several AGs may operate in parallel, with each generating accesses from different threads. The address generators translate system addresses in the memory map into real addresses of memory cells within a particular IP memory core or, in some cases, across a channel boundary. A generated request may have an address with additional fields for memory channel select bits, which aid in decoding where to retrieve the desired information in a system having one or more aggregated targets. The initiator agents, such as a first initiator agent 158, may have address generators with logic to add channel select bits into the address of a generated request from an IP core. At least part of the address decode of a target's address may occur at the interface when a request first enters the interconnect, such as at an initiator agent. An address decoder may decode an address of a request to route the request to the proper IP memory core based on, for example, the low bits of the memory address. The address decoder removes the channel select bits from the address and then passes the address to the address decoders/generator(s) in the memory controller. The addresses presented to a channel target may be shifted, for example, to the right to compensate for channel selection bit(s). The memory scheduler 132 may also decode/translate a system's memory target address sent in a request to determine a defined memory segment's physical location on a chip (i.e., rank, bank, row, and column address information). Each access can be routed to the appropriate memory channel (MC) via a look-up table. The address map 136 with details of the organization and structure of the memory system exists in each initiator agent coupled to an IP core. The memory scheduler 132 schedules pending accesses in a channel-buffer, selecting one access during each DRAM command cycle, sending the appropriate command to the DRAM, and updating the state of the pending access. Note that a single memory access may require as many as three DRAM commands to complete. The memory channel then performs the requested accesses and returns one or more responses with the read data to a buffer. The target agent collects replies from the memory channels so they can be presented to the initiator core in the expected in-order response order.

Thus, the initiator cores 102-114 do not need hardware and software built in to keep track of the memory address structure and organization, and do not need a priori knowledge of that structure and organization. The initiator agents 116 have this information and isolate the cores from needing this knowledge. The initiator agents 116 have this information to choose the true address of the target, the route to the target from the initiator across the interconnect 118, and then the channel route within an aggregated target. The memory scheduler 132 may receive a request sent by the initiator agent and translate the target address and channel route to rank, bank, row, and column address information in the various memory channels/IP cores. In an embodiment, the multiple channel nature of an aggregate target is abstracted from the IP cores in the system, which puts that structural and organizational knowledge of memory channels onto either each initiator agent 116 in the system or the centralized memory scheduler 132 in the system.
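
The scheduler-side translation can be pictured as simple bit slicing. The field widths below are invented for illustration; the real layout depends on the memory configuration:

    # Hypothetical DRAM address slicing: column, then bank, then row, then rank.
    COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 14

    def dram_fields(channel_local_addr):
        col = channel_local_addr & ((1 << COL_BITS) - 1)
        rest = channel_local_addr >> COL_BITS
        bank = rest & ((1 << BANK_BITS) - 1)
        rest >>= BANK_BITS
        row = rest & ((1 << ROW_BITS) - 1)
        rank = rest >> ROW_BITS
        return rank, bank, row, col

    assert dram_fields(0x2C01) == (0, 3, 1, 1)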

The flow control protocol and flow control logic ensure that the transactions are re-assembled correctly in the response path before the corresponding responses are returned to the initiator IP core.

It is desirable in interleaved multi-channel systems that each initiator distributes its accesses across the channels roughly equally. The interleave size has an impact on this. The expected method to allocate bandwidth N to a thread is to program each channel QOS allocation as (N/channels) plus a small tolerance margin. If the application is known to have a channel bias, non-symmetric allocations can be made instead. If region re-definition is used, the number of active channels may differ in different boot setups. Having separate allocations at each channel is useful to accommodate this.
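
As a worked example of this allocation rule (the bandwidth figure and the 5% tolerance margin are our own illustrative choices):

    # Program each channel's QOS allocation as (N / channels) plus a margin.
    def per_channel_allocation(total_bw, channels, margin=0.05):
        return total_bw / channels * (1.0 + margin)

    # 800 MB/s spread over 4 channels -> program roughly 210 MB/s per channel.
    assert round(per_channel_allocation(800, 4)) == 210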

For multiple channel DRAM, the percentage of service bandwidth that is elastic is greater than for single DRAM. Each channel still has the page locality based elasticity. Additionally, there is elasticity related to the portion of service bandwidth from a single channel that is available to each initiator. If the address streams distribute nicely across channels, then there is a certain level of contention at each channel. If the address streams tend to concentrate at a few channels, then other channels are lightly used (less than 100% utilized), and therefore the aggregate service rate is reduced.

In an embodiment, the flow control logic internal to the interconnect may interrogate the address map and a known structural organization of an aggregated target in the integrated circuit to decode an interleaved address space of the aggregated target, to determine the physical distinctions between the targets making up the first aggregated target, in order to determine which targets making up the first aggregated target need to service a first request. The flow control logic applies a flow control splitting protocol to allow multiple transactions from the same thread to be outstanding to multiple channels of an aggregated target at any given time, where the multiple channels in the aggregated target map to IP memory cores having physically different addresses. The flow control logic internal to the interconnect is configured to maintain the request order routed to the target IP core. The flow control mechanism cooperates with the flow control logic to allow multiple transactions from the same thread to be outstanding to multiple channels of an aggregated target at any given time.

The interconnect implements an address map with assigned addresses for target IP cores in the integrated circuit to route the requests between the target IP cores and initiator IP cores in the integrated circuit. A first aggregate target of the target IP cores includes two or more memory channels that are interleaved in an address space for the first aggregate target in the address map. Each memory channel is divided up in defined memory interleave segments and then interleaved with memory interleave segments from other memory channels. Each memory interleave segment of those memory channels is defined and interleaved in the address space by a system designer at a size granularity unconstrained by the burst length request allowed by the memory design specification. The size granularity of a memory interleave segment can be a defined length between a minimum burst length request allowed by a DRAM memory design specification configured into the DRAM and an anticipated maximum DRAM memory page length as recognized by the memory configuration, and the size of this granularity is configurable.

The two or more discrete memory channels may include on-chip IP memory cores and off-chip memory cores that are interleaved with each other to appear to system software and other IP cores as a single memory in the address space.

An initiator agent interfacing the interconnect for a first initiator IP core is configured to interrogate the address map based on a logical destination address associated with a first request to the aggregate target of the interleaved two or more memory channels, and to determine which memory channels will service the first request and how to route the first request to the physical IP addresses of each memory channel in the aggregate target servicing that request, so that the first IP core need not know the physical IP addresses of each memory channel in the aggregate target.

The two or more memory channels are interleaved in the address space of the system address map to enable automatic statistical spreading of application requests across each of the memory channels over time, to avoid locations of uneven load balancing between distinct memory channels that can arise when too much traffic targets a subset of the memory channels making up the aggregated target.

The address map can be divided up into two or more regions, and each memory interleave segment is assigned to at least one of those regions and populates the address space for that region. Memory channels can have defined memory interleave segments in the address space of two or more regions. Memory interleave segments in the address space assigned to a given region may have a unique tiling function used in two-dimensional (2D) memory page retrieval for a 2D block request. The memory interleave segments are addressable through a memory scheduler.

Chopping logic internal to the interconnect chops individual burst transactions that cross channel boundaries headed for channels in the first aggregate target into two or more requests. The chopping logic chops the individual transactions that cross channel boundaries headed for channels in the aggregate target so that the two or more resulting requests retain their 2D burst attributes.
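
A minimal sketch of the boundary-cut portion of such chopping (ours, and deliberately omitting the 2D burst attribute bookkeeping the patent also describes) cuts a linear burst at each channel interleave boundary it crosses:

    # Chop a linear burst at channel interleave boundaries (simplified sketch).
    def chop(addr, nbytes, segment_size):
        pieces = []
        while nbytes > 0:
            room = segment_size - (addr % segment_size)  # bytes left in this segment
            take = min(nbytes, room)
            pieces.append((addr, take))                  # one same-thread sub-request
            addr, nbytes = addr + take, nbytes - take
        return pieces

    # A 0x600-byte burst at 0x2E00 with 0x1000-byte segments becomes two requests.
    assert chop(0x2E00, 0x600, 0x1000) == [(0x2E00, 0x200), (0x3000, 0x400)]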

The address map register block contains the following two types of registers: the base register and the control register. Each pair of base and control registers corresponds to a multi-channel address region. A base register contains the base address of the multi-channel address region. The fields of the control register contain the other configuration parameters.
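
Modeled in code, that pairing might look like the following (a sketch; the control-register fields shown are illustrative assumptions, not the actual register layout):

    # Each multi-channel address region is described by a base/control register pair.
    class RegionRegisters:
        def __init__(self, base_addr, num_channels, interleave_size, enabled=True):
            self.base = base_addr  # base register: region base address
            # Control register fields (illustrative packing only).
            self.control = dict(num_channels=num_channels,
                                interleave_size=interleave_size,
                                enabled=enabled)

    region2 = RegionRegisters(base_addr=0x4000_0000, num_channels=2,
                              interleave_size=0x1000)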

The system may also support enhanced concurrency management. Given the system's support for Open Core Protocol (OCP) threads and OCP tags, and its connectivity to AXI with its master IDs, it is important that the interconnect have flexible mappings between the external and internal units of concurrency. This will likely take the form of flexible thread/tag mappings. The interconnect has an efficient mechanism for managing concurrency cost versus performance trade-offs. Thread mapping and thread collapsing may be used to manage concurrency cost versus performance trade-off needs, along with a fine granularity of control. Providing combined OCP thread and OCP tag support is one way to address these needs. Also, additional control may be supplied by specifying tag handling where initiator thread merging to target threads occurs. Support for partial thread collapsing is another feature that can address these trade-off needs.
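
One way to picture partial thread collapsing (our illustration, not the patented mapping mechanism) is a table that collapses several external threads onto fewer internal threads while tags keep their transactions distinguishable:

    # Hypothetical partial thread collapse: external threads 0-3 map onto two
    # internal threads; the (thread, tag) pair stays unique for ordering.
    THREAD_MAP = {0: 0, 1: 0, 2: 1, 3: 1}

    def internal_id(ext_thread, ext_tag):
        return THREAD_MAP[ext_thread], (ext_thread, ext_tag)

    # Threads 0 and 1 share internal thread 0 but remain distinguishable by tag.
    assert internal_id(0, 5)[0] == internal_id(1, 5)[0]
    assert internal_id(0, 5)[1] != internal_id(1, 5)[1]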

In an embodiment, if an initiator agent connects to one individual target agent in a multi-channel target, this initiator agent should connect to all individual target agents in the multi-channel target.

Maintaining Request Path Order

FIG. 4A illustrates a block diagram of an embodiment of an integrated circuit, such as a SoC, having multiple initiator IP cores and multiple target IP cores that maintains request order for read and write requests over an interconnect that has multiple thread merger and thread splitter units. Each initiator IP core, such as a Central Processor Unit IP core 602, may have its own initiator agent 658 to interface with the interconnect 618. Each target IP core, such as a first DRAM IP core, may have its own target agent to interface with the interconnect 618. Each DRAM IP core 620-624 may have an associated memory scheduler 632, DRAM controller 634, and PHY unit 635. The interconnect 618 implements flow control logic internal to the interconnect itself to manage the order in which each issued request in a given thread arrives at its destination address, on a per-thread basis. The interconnect 618 also implements a flow control protocol internal to the interconnect in the response network to enforce ordering restrictions of when to return responses within a same thread, in the order in which the corresponding requests were transmitted. The interconnect 618 implements flow control logic and a flow control protocol internal to the interconnect itself to manage the expected execution ordering of a set of issued requests within the same thread, so that the requests are serviced and responses returned in order with respect to each other but independent of the ordering of another thread. The flow control logic at a thread splitter unit permits transactions from one initiator thread to be outstanding to multiple channels at once, and therefore to multiple individual targets within a multi-channel target at once. This includes a transaction targeted at two different channels, as well as two transactions (from the same initiator thread) each targeted at a single but different channel, where these two different channels are mapped to two individual targets within a multi-channel target.

Thread splitter units near or in an initiator agent send parts of the thread, such as requests, to multiple separate physical pathways on the chip. For example, a thread splitter unit in the first initiator agent 658 associated with the CPU core 602 can route transactions in a given thread down a first physical pathway 662 to a first combined thread merger-splitter unit 668, down a second physical pathway 664 to a second combined thread merger-splitter unit 670, or down a third physical pathway 666 to a third combined thread merger-splitter unit 672. The flow control logic applies the flow control splitting protocol to split the traffic early where it makes sense, because parts of that set of transactions are routed on separate physical pathways in the system as well as to targets physically located in different areas in the system/on the chip.

Thread merger units near or in a target agent ensure that responses to the requests from that thread segment come back from the target core to the initiator core in the expected in-order response order. For example, the first thread merger unit 668 near the first target agent 631 ensures that responses to the requests from a given thread come back from the first target DRAM IP core 620 and the second target DRAM IP core 622 to the first initiator core in the expected in-order response order.

Threads from two different initiators may be combined into a single third thread in a thread merger unit. Parts of a single thread may be split into two different threads in a thread splitter unit. The merger and splitter units may use thread id mapping to combine or split threads having different thread identifiers. Each thread merger unit and thread splitter unit may maintain a local order of transactions at that splitting-merger point and couple that system with a simple flow control mechanism for responses.

As discussed, a thread splitter unit in an initiator agent, such as a first initiator agent 658, may split a set of transactions in a given thread from a connected initiator IP core where the split-up parts of the set of transactions are being routed on separate physical pathways to their intended targets (i.e., two different channels and two different target IP cores). The flow control logic associated with that splitter unit implements flow control that stops the issuance of a next request from the same thread headed to a physical pathway other than the physical pathway being used by outstanding requests in that same thread; the switch to the other physical pathway, and the routing of requests from the same thread with destination addresses down the other physical pathway, occur when all acknowledge notifications from outstanding requests in that same thread going to the current physical pathway are returned to the splitter unit. The flow control logic may be part of a thread splitter unit or a separate block of logic coordinating with a thread splitter unit. Thus, the thread splitter unit implements flow control to prevent the issuance of a next request from the same thread headed to a first physical pathway 662, such as a link, other than a current physical pathway being used by outstanding requests in that same thread, until all acknowledge notifications from outstanding requests in that same thread going to the current physical pathway are communicated back to the thread splitter unit.

The flow control logic tracks acknowledge notifications from requests within the same thread, indicating safe arrival of those requests, to ensure all previous requests headed toward an intended target have reached the last thread merger unit prior to the intended target IP core before requests from the same thread are routed along a separate physical path to a second intended target. The flow control logic applies a flow control protocol to stop issuance of requests from the same thread only when requests from that thread are being routed to separate physical pathways in the system. The thread splitter unit and associated flow control logic allow much more flexibility about where in the interconnect topology each target or channel is attached, and minimize the traffic and routing congestion issues associated with a centralized target/channel splitter.
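
The blocking rule reduces to a small behavioral model (ours, not the patented RTL): per thread, a splitter remembers which branch its outstanding requests took and how many acknowledge notifications are still pending, and a request for a different branch stalls until that count drains to zero:

    # Sketch of per-thread splitter flow control using acknowledge counts.
    class ThreadSplitter:
        def __init__(self):
            self.open_branch = {}   # thread id -> branch currently in use
            self.pending_acks = {}  # thread id -> unacknowledged request count

        def can_issue(self, thread, branch):
            # Same branch as outstanding requests: always allowed (pipelined).
            # Different branch: only after every prior request is acknowledged.
            return (self.open_branch.get(thread) in (None, branch)
                    or self.pending_acks.get(thread, 0) == 0)

        def issue(self, thread, branch):
            assert self.can_issue(thread, branch)
            self.open_branch[thread] = branch
            self.pending_acks[thread] = self.pending_acks.get(thread, 0) + 1

        def acknowledge(self, thread):
            self.pending_acks[thread] -= 1

    s = ThreadSplitter()
    s.issue("T0", "link 662"); s.issue("T0", "link 662")  # pipelined, same branch
    assert not s.can_issue("T0", "link 664")              # blocked: 2 acks pending
    s.acknowledge("T0"); s.acknowledge("T0")
    assert s.can_issue("T0", "link 664")                  # switch now permitted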

In an embodiment, address decoding of the intended address of the request from a thread happens as soon as the request enters the interconnect interface, such as at the initiator agent. The flow control logic interrogates the address map and a known structural organization of each aggregated target in the system to decode an interleaved address space of the aggregated targets, to determine the physical distinctions between the targets making up a particular aggregated target, in order to determine which targets making up the first aggregated target need to service a current request. The multiple channels in the aggregated target 637 map to IP memory cores 620 and 622 having physically different addresses. The flow logic may cooperate with the chopping logic, which understands the known structural organization of the aggregated targets, including how the memory interleave segments wrap across channel boundaries of different channels in a channel round back to the original channel and then repeat this wrapping. Thus, the flow logic of an initiator agent may route requests to both a proper channel, such as 620 and 622, in an aggregated target 637, and a specific target 628 amongst all the other targets on the chip. Overall, the flow control logic applies the flow control splitting protocol to allow multiple transactions from the same thread to be outstanding to multiple channels at any given time.

Requests being routed through separate physical pathways can be split at an initiator agent as well as other splitter units in a cascaded splitter unit highly pipelined system. FIG. 4B illustrates a block diagram of an embodiment of flow control logic 657 implemented in a centralized merger splitter unit 668b to maintain request path order.

FIG. 5 illustrates a block diagram of an embodiment of one or more thread splitter units to route requests from an initiator IP core 716 generating a set of transactions in a thread down two or more different physical paths by routing a first request with a destination address headed to a first physical location on the chip, such as a first target 724, and other requests within that thread having a destination address headed to different physical locations on the chip from the first physical location such as a first channel 722 and a second channel 720 making up an aggregate second target 737. The first and second channels 720 and 722 share an address region to appear as single logical aggregated target 737. The initiator agent 716 may route requests from the thread to a first thread splitter unit. The first thread splitter unit 761 may route the request depending on its destination address down one or more different physical pathways such as a first link 762 and a second link 764.

In the IA 716, when the address lookup is done, a determination of the request's destination physical route and of the return route for acknowledge notifications is also made. The IA 716 looks up the acknowledge notification return route statically at the time when the sending address/route lookup takes place. An ordered flow queue, such as a first ordered flow queue 717, exists per received thread in each thread splitter unit 761 and 763, and thread merger unit 765, 767, 769 and 771. The ordered flow queue may have a First-In-First-Out ordering structure. One turnaround First-In-First-Out ordered flow queue may be maintained per received thread in that first splitter unit. Logic circuitry and one or more tables locally maintain a history of the requests entering/being stored in each ordered flow queue. As discussed, the flow logic tracks acknowledge notifications/signals from requests within the same thread to ensure all previous requests headed toward an intended target have reached the last merger unit prior to the intended target before requests from the same thread are routed along a separate physical path to a second intended target.

The first-in-first-out inherent ordering of the queue may be used to establish a local order of received requests in a thread and this maintained local order of requests in a particular thread may be used to compare requests to other requests in that same thread to ensure a subsequent request to a different link is not released from the splitter unit until all earlier requests from that same thread going to the same target have communicated acknowledge signals back to the splitter unit that is splitting parts of that thread.

The thread splitter units are typically located where a single transaction in a given thread may be split into two or more transactions, and the split-up parts of the single transaction are being routed on separate physical pathways to their intended targets. In an embodiment, when a transaction transfer/part of the original request reaches a last serialization point (such as the last thread merger unit) prior to the intended target of that transfer, then an acknowledge notification is routed back to the initial thread splitter unit. Note, pending transactions are serialized until an acknowledge signal is received from all previous requests on a different physical path, but are not serialized with respect to receiving a response to any of those requests in that thread. The flow control protocol for requests also has a non-blocking nature with respect to other threads, as well as being non-blocking with respect to requiring a response to a first request before issuing a second request from the same thread.

A first thread splitting unit 761 may be cascaded in the request path with a second thread splitting unit 763. Subsequent thread splitter units in the physical path between an initial thread splitter unit and the intended aggregated target channel may be treated as a target channel by the flow control logic associated with the initial thread splitter unit. Request path thread merger units can be cascaded too, but the acknowledge notification for each thread should come from the last thread merger on the path to the intended target channel. As discussed, thread splitter units can be cascaded, but acknowledge notification needs to go back to all splitters in the path. The flow logic in each splitter unit in the physical path blocks changes to a different ‘branch/physical pathway’ until all acknowledge notifications from the open branch are received. Note, the response return network may be an exact parallel of the forward request network illustrated in FIG. 5 and could even use the same interconnect links 762 and 764 with reverse flow control added.

In an embodiment, an upstream splitter unit will continue to send multiple requests from a given thread to another splitter until a subsequent request needs to be split down a separate physical pathway at the downstream thread splitter unit. The downstream splitter unit causing the pathway splitting then implements flow control buffering of the subsequent request from the same thread, which is heading down a different physical pathway than all of the outstanding requests from that thread, until all of the outstanding requests from that thread headed down the initial physical pathway have communicated an acknowledge notification of receipt back to the downstream thread splitter unit causing the pathway splitting.

In an embodiment, the interconnect for the integrated circuit communicates transactions between the one or more initiator Intellectual Property (IP) cores and multiple target IP cores coupled to the interconnect. The interconnect may implement a flow control mechanism having logic configured to support multiple transactions issued from a first initiator in parallel with respect to each other and issued to, at least one of, 1) multiple discrete target IP cores and 2) an aggregate target that includes two or more memory channels that are interleaved in an address space for the aggregate target in an address map, while maintaining an expected execution order within the transactions. The flow control mechanism has logic that supports a second transaction to be issued from the first initiator IP core to a second target IP core before a first transaction issued from the same first initiator IP core to a first target IP core has completed, while ensuring that the first transaction completes before the second transaction and while ensuring an expected execution order within the first transaction is maintained. The first and second transactions are part of a same thread from the same initiator IP core. The first and second transactions are each composed of one or more requests and one or more optional responses. An initiator sending a request and a target sending a response to the request would be a transaction. Thus, a write from the initiator and a write from the target in response to the original write would still be a transaction.

A thread splitting unit may be cascaded in the request path with another thread splitting unit. An upstream thread splitter unit may continuously send requests from a given thread to a downstream thread splitter unit until a subsequent request needs to be split down a separate physical pathway at the downstream thread splitter unit. The downstream thread splitter unit implements flow control buffering of the subsequent request from the same thread, which is heading down a different physical pathway than all of the outstanding requests from that thread, until all of the outstanding requests from that thread headed down the initial physical pathway have communicated an acknowledge notification of receipt back to the downstream thread splitter unit causing the pathway splitting.

The system can be pipelined with buffers in the interconnect component to store and move requests and responses in stages through the system. The system also uses a pipeline storage system so multiple requests may be sent from the same initiator, each request sent out on a different cycle, without the initiator having to wait to receive a response to the initial request before generating the next request. The thread splitter units in the interconnect must simply wait for an acknowledge notification of an issued request before sending a next request down a different physical pathway than the one used by the previous request.

The flow logic prevents a request path deadlock by using acknowledge notifications, which are propagated back up the request network from the last thread merge unit. The flow logic uses the above flow control protocol as an interlock that virtually assures no initiator thread will have transactions outstanding to more than one target at a time. Yet, the flow control protocol does permit transactions from one initiator thread to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual targets within an aggregate target at once. Since the rate of progress at these individual targets may be different, it is possible that responses will be offered to an initiator core out of order with respect to how the requests were issued by the initiator core. A simple response flow control protocol may be used to ensure responses to these requests will be offered to the initiator core in the expected order with respect to how the requests were issued by the initiator core. The combined request flow control logic and simple response flow control protocol allows the interconnect to manage simultaneous requests to multiple channels in an aggregate target from the same thread at the same time.

The combined request flow control logic and simple response flow control protocol implemented at each thread splitter unit and thread merger unit allows this control to be distributed over the interconnect. The distributed implementation in each thread splitter unit and thread merger unit allows them to interrogate a local system address map to determine both thread routing and thread buffering until a switch of physical paths can occur. This results in a lower average latency for requests. It also provides software transparency, because software and, in fact, the IP cores themselves need not be aware of the actual aggregated target structure. The thread splitter units and thread merger units cooperate end-to-end to ensure ordering without a need to install full transaction reorder buffers within the interconnect.

Similarly, FIG. 11 illustrates a diagram of an embodiment of a path across an interconnect from an initiator agent to multiple target agents including a multiple channel aggregate target 1579.

As discussed, the interconnect for the integrated circuit is configured to communicate transactions between one or more initiator Intellectual Property (IP) cores and multiple target IP cores coupled to the interconnect. The interconnect implements logic configured to support multiple transactions issued from a first initiator IP core to the multiple target IP cores while maintaining an expected execution order within the transactions. The logic supports a second transaction to be issued from the first initiator IP core to a second target IP core before a first transaction issued from the same first initiator IP core to a first target IP core has completed, while ensuring that the first transaction completes before the second transaction. The logic does not include any reorder buffering, and ensures that an expected execution order for the first and second transactions is maintained. The first and second transactions may be part of a same thread from the first initiator IP core, and the expected execution order within the first transaction is independent of the ordering of other threads. The logic may be configured to support one or more transactions issued from a second initiator IP core to at least the first target IP core, simultaneous with the multiple transactions issued from the first initiator IP core to the first and second target IP cores, while maintaining the expected execution order for all of the transactions, thereby allowing transactions from several initiators to be outstanding simultaneously to several targets. The flow control logic is associated with a thread splitter unit in a request path to a destination address of a target IP core. The first and second transactions may each be composed of one or more requests and one or more optional responses. The aggregate target IP core of the multiple target IP cores may include two or more memory channels that are interleaved in an address space for the aggregate target in an address map. The thread splitter unit implements flow control to prevent an issuance of a next request from a same thread from the first initiator IP core headed to a first physical pathway, other than a current physical pathway being used by outstanding requests in that same thread, until all acknowledge notifications from outstanding requests in that same thread going to the current physical pathway are communicated back to the thread splitter unit.

FIG. 6 illustrates an example timeline of the thread splitter unit in an initiator agent's use of flow control protocol logic that allows multiple write requests from a given thread to be outstanding at any given time, such as a first write burst request 851 and a second write burst request 853, but restricts an issuance of a subsequent write request from that thread, such as a third write burst request 855, that has a destination address down a separate physical pathway from all of the outstanding requests in that thread. All initiator agents may have a thread splitter unit that splits requests from a given thread when requests in that set are routed down a separate physical pathway from other requests in that thread. A burst request may be a set of word requests that are linked together into a transaction having a defined address sequence, a defined pattern, and a number of word requests. The first write burst request 851 and the second write burst request 853 each have eight words in their request and a destination address of channel 0. The third burst request 855 also has eight words in its request but a destination address of channel 1, which is down a separate physical pathway from channel 0.

The flow control logic 857 associated with the thread splitter unit that split the set of transactions in that given thread issues the third burst request 855, which is routed down a separate physical pathway from the other outstanding requests in that thread such as the first and second burst requests 851 and 853, 1) no earlier than one cycle after the number of words in the immediately previous request, if that previous request was a burst request, and 2) no earlier than the sum of the anticipated time for the immediately previous request to arrive at the last thread merger unit prior to that previous request's target address plus the time to communicate the acknowledge notification back to the thread splitter, whichever bound is later. If the flow logic were based only on the sum of the anticipated time for the immediately previous request to arrive at the last thread merger unit plus the time in cycles to communicate an acknowledge notification of the previous request back to the thread splitter unit, then the third request 855 could have issued 3 cycles earlier. Note, neither the response to the first burst request 851 nor the response to the second burst request 853 needs to be even generated, let alone arrive in its entirety back at the initiating core, prior to the issuing of the third request 855.
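
As a worked illustration of the two timing bounds just described, the following sketch computes the earliest issue cycle for a pathway switch; the function name and cycle parameters are hypothetical assumptions, not values taken from FIG. 6.

```python
def earliest_switch_cycle(prev_issue_cycle, prev_burst_words,
                          cycles_to_last_merger, ack_return_cycles):
    """Earliest cycle a thread splitter may issue a request down a new
    physical pathway, per the two lower bounds described above."""
    bound_burst = prev_issue_cycle + prev_burst_words + 1   # rule 1
    bound_ack = (prev_issue_cycle + cycles_to_last_merger
                 + ack_return_cycles)                       # rule 2
    return max(bound_burst, bound_ack)

# For example, an 8-word burst issued at cycle 0, 2 cycles to the last
# merger, and 2 cycles for the acknowledge to return: rule 1 (cycle 9)
# dominates rule 2 (cycle 4), so the switch waits for cycle 9.
assert earliest_switch_cycle(0, 8, 2, 2) == 9
```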

FIGS. 7b and 7c illustrate additional example timelines of embodiments of the flow control logic to split target request traffic, such as a 2D WRITE Burst and a 2D READ Burst. Referring to FIG. 5, in an embodiment, the acknowledgement mechanism generates confirmation information from the last channel merge point at which two links merge threads. This information confirms that the channel requests from different links have been serialized. The acknowledgement information is propagated back up the request network to all the channel thread splitter units. If the channel splitter and the last serialization point exist within the same cycle boundary (i.e., there are no registers between them), then no explicit acknowledgement signals are needed—the acceptance of a transfer on the link between the channel splitter and channel merger can also be used to indicate acknowledgement.

In an embodiment, the merger unit is configured structurally to store the incoming branch/thread for a successful request that has ack_req set. When an ack_req_return signal is set high, the turnaround queue is 'popped', which causes the corresponding ack_req_return signal to be driven high on the correct branch/thread. At the serialization merger for a given thread, where the thread merging happens, the merger unit is configured structurally to reflect the incoming ack_req signal back on the ack_req_return signal on the incoming branch/thread that sent the current request.

The initiator agent generates m_ack_req signals. The signal is driven low by default and is driven high on the first transfer of any split burst that leaves the initiator agent going to a multi-channel target. Channel splitting happens at a thread splitter in an embedded register point or in a pipeline point, and is needed in the request path. Inside the splitter, an acknowledge control unit (ACU) is added. The ACU prevents requests from proceeding on a thread if the outgoing splitter branch and/or thread changes from that of the previous transfer and there are outstanding acknowledge signals. There is at most one ACU for each (input) thread at the RS.

The m_ack_req signals travel in-band with a request transfer. At some point the request transfer with the m_ack_req will reach the serialization merger—this is the last point where the connection merges with another connection on the same merger (outgoing) thread. If the transfer wins arbitration at the merger, the merger will extract the m_ack_req signal and return it back upstream on the same request DL link path via the s_ack_req_return signal. The s_ack_req_return signals are propagated upstream on the request DL links. These signals do not encounter any backpressure or have any flow control. Wherever there is a PP RS, the s_ack_req_return signals will be registered. The s_ack_req_return signals are used at each channel splitter ACU along the path. The ACU keeps a count of outstanding acknowledgements. When s_ack_req_return is set to one, the ACU will decrement its count of outstanding acknowledgements. The s_ack_req_return propagates back to the first channel split point in the request network. For the example shown in FIG. 5, this first channel split point is at the embedded register point RS just downstream of the initiator agent component. However, the first channel split point in a request acknowledgement network could also be at a PP RS component.
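
The ACU bookkeeping described above can be sketched as a simple counter per input thread. The following Python model is illustrative only; the signal names m_ack_req and s_ack_req_return follow the text, while the class shape is an assumption.

```python
class AckControlUnit:
    """One ACU per input thread at a channel-splitting RS. Stalls the thread
    when the outgoing branch changes while acknowledges are outstanding."""

    def __init__(self):
        self.outstanding = 0     # count of acknowledgements not yet returned
        self.last_branch = None  # outgoing branch of the previous transfer

    def may_issue(self, branch):
        # A branch change is only legal once every acknowledge has returned.
        return branch == self.last_branch or self.outstanding == 0

    def issue(self, branch, ack_req):
        assert self.may_issue(branch)
        self.last_branch = branch
        if ack_req:              # m_ack_req travels in-band with the transfer
            self.outstanding += 1

    def on_ack_return(self):     # s_ack_req_return observed high this cycle
        assert self.outstanding > 0
        self.outstanding -= 1
```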

If the path leading into an RS that performs channel splitting is thread collapsed, then the DL link is treated as single threaded for the purposes of the acknowledgement mechanism.

The architecture intentionally splits multi-channel paths in the initiator agent (IA), or as early as possible along the path to the multiple channel target agents. This approach avoids creating a centralized point that could act as a bandwidth choke point, a routing congestion point, and a cause of longer propagation path lengths that would lower achievable frequency and increase switching power consumption.

FIG. 7A illustrates an example timeline of an embodiment of flow logic to split a 2D WRITE Burst request. In this example, the number of words in the 2D WRITE Burst request is 4, N=3, M=2, and ChannelInterleaveSize=4<N+M. The WRITE Burst request 1081 is shown over time. FIG. 7B also illustrates an example timeline of an embodiment of flow logic to split a 2D WRITE Burst request 1083. FIG. 7C illustrates an example timeline of an embodiment of flow logic to split a 2D READ Burst 1085. The flow control logic, in conjunction with the other above features, allows high throughput/deep pipelining of transactions. As shown in FIG. 7A, multiple transactions are being issued and serviced in parallel, which increases the efficiency of each initiator in being able to start having more transactions serviced in the same period of time. Also, the utilization of the memory is greater because, as seen in the bubbles in FIG. 7A, there are very few periods of idle time in the system. The first four bubbles show the initial write burst being issued. Next, two bubbles of inactivity occur. After that, the next four bubbles of the next write burst are issued and serviced by the system. The initiator and memory are working on multiple transactions at the same time. The latency of the ACK loop may limit the effective data bandwidth of the initiator. The initiator has to wait for first-row responses to order them; there is no need to wait for the next rows. Channel responses may become available too early for the initiator to consume them. This will create back-pressure on this thread at the channels, forcing them to service other threads. Initiators that send 2D bursts may have dedicated threads, because of the way they can occupy their thread on multiple channels for the duration of the 2D burst. Note: for 2D WRITE bursts, because of the channel switching, the split WRITE bursts will remain open until the original 2D burst is closed; that is, while a splitter is sweeping all other branches before switching back to a given branch, all the resources for that thread of the branch remain idle (maybe minus N cycles). A similar situation exists for 2D READ requests, on the response path.

As discussed, the interconnect for the integrated circuit is configured to communicate transactions between the one or more initiator Intellectual Property (IP) cores and the multiple target IP cores coupled to the interconnect. Two or more memory channels may make up a first aggregate target of the target IP cores. The two or more memory channels may populate an address space assigned to the first aggregate target and appear as a single target to the initiator IP cores. The interconnect may be configured to implement chopping logic to chop individual two-dimensional (2D) transactions that cross the memory channel address boundaries from a first memory channel to a second memory channel within the first aggregate target into two or more 2D transactions, each with a height value greater than one, as well as stride and width dimensions, chopped to fit within the memory channel address boundaries of the first aggregate target. The flow control logic internal to the interconnect may be configured to maintain ordering for transactions routed to the first aggregate target IP core. The flow control logic is configured to allow multiple transactions from the same initiator IP core thread to be outstanding to multiple channels of an aggregated target at the same time, where the multiple channels in the first aggregated target map to target IP cores having physically different addresses. The transactions may include one or more requests and one or more optional responses, and the transactions are part of a same thread from the same initiator IP core.

Maintaining Response Path Order

FIG. 8 illustrates a block diagram of an embodiment of a response path from two target agents back to two initiator agents through two thread splitting units and two thread merger units. The two target agents 1120, 1122 may each have one or more associated thread splitting units, such as a first thread splitting unit 1141 for the first target agent 1120 and a second thread splitting unit 1143 for the second target agent 1122. The two target agents 1120, 1122 may each have one or more associated thread merging units, such as a first thread merging unit 1145 for the first target agent 1120 and a second thread merging unit 1147 for the second target agent 1122. A target agent or memory scheduler may have FIFO response flow buffers, such as a first response flow buffer 1149, which cooperate with the merger units 1145, 1147 implementing a flow control protocol to return responses within a same thread in the order in which the corresponding requests were transmitted, rather than using re-order buffers.

The flow logic in the target agent and merger unit uses the inherent first-in first-out ordering to compare responses against other responses in that same thread, ensuring the next response is not released from the target agent until all earlier responses from that same thread have been transmitted back toward a thread merger unit in the response path toward the initiator IP core issuing that thread. The FIFO response flow buffers are filled on a per-thread basis. Alternatively, the turnaround state of the response buffers may be distributed to other channels making up the aggregated target, or even to other targets on the chip, to implement a response flow order protocol.

The merger unit closest to the target/channel may determine which physical branch pathway should be delivering the next response, and routes a threadbusy signal from the correct branch back to the target. The merger unit closest to the target agent, or the merger unit closest to the initiator IP core generating the thread, may assert this flow control protocol to backpressure all responses from a particular thread from all physical pathways connected to that thread merger unit, except responses from the physical pathway expected to send the next in-order response for that thread. For example, the first thread merger unit controls when responses come from the first target agent 1120 and the second target agent 1122. Logic, counters, and tables associated with the merger unit keep track of which physical pathway, such as a link, should be supplying the next response in sequential order for that thread, and stop responses from that thread from all other physical branches until that next response in sequential order for that thread is received on the active/current physical pathway.

The flow control logic maintains the expected execution order of the responses within a given thread by referencing the maintained order history of the request queue, which records the physical path that each request was routed to and thus the expected execution order of the corresponding responses. The flow control logic then allows only the target agent on the physical branch expected to supply the next in-order response to send responses for that thread to the merger unit, and blocks responses from that thread from the other physical branches. The flow logic in a merger unit thereby establishes a local order with respect to the issued requests and the expected response order down those separate physical pathways.

The flow control mechanism asserts response flow control on a per-thread basis; it blocks with respect to other out-of-order responses within a given thread and is non-blocking with respect to responses from any other thread. The flow control mechanism and associated circuitry maintain the expected execution order of the responses within a given thread by 1) referencing an ordered history of which physical path the requests in that thread were routed to, 2) referencing the expected execution order of the responses corresponding to those requests, and 3) allowing the target agent to send responses for that given thread to the thread merger unit only from the physical pathway where the next expected in-order response is to come from, while blocking responses from that given thread from the other physical pathways.

The thread splitter and merger units, in combination with buffers in the memory controller, eliminate the need for dedicated reorder buffers and allow a non-blocking flow control so that multiple transactions may be serviced in parallel rather than merely in series.

FIG. 9 shows the internal structure of an example interconnect maintaining the request order within a thread and the expected response order to those requests. The interconnect includes three initiator agents 1331, 1333, and 1335 and three target agents, where target agent0 1343 and target agent1 1339 are target agents that belong to a multi-channel target, DRAM. Only one multi-channel aggregate target 1337 exists in this example.

On the request network, for initiator agent0 1331, the multi-channel path going to the multi-channel target DRAM splits at initiator agent0's 1331 embedded, request-side thread splitter unit, Req_rs10. Since there are two channels, the two outgoing single-threaded (ST) DL links 1362, 1364 each go to a different channel target. The third outgoing ST DL link 1366 is a normal path leading to a normal individual target agent TA2 1341. A request-side channel splitter 1368b is embedded in the initiator agent 1331. For the channel target agent0 1343, the merger splitter unit component, tat00_ms0 1368a, upstream to target agent0 1343 acts as a channel merger and regulates channel traffic coming from two different initiator agents, initiator agent0 1331 and initiator agent1 1333.

On the response network, for target agent1 1339, the embedded RS component, Resp_rs01, acts as a response channel splitter—it has three outgoing links 1371, 1373, 1375 for delivering channel responses back to initiator agent0 1331, normal responses back to the normal initiator agent2 1335, and channel responses back to initiator agent1 1333, respectively. For initiator agent1 1333, its upstream merger splitter unit component, lah11_ms0, is a channel merger, which not only regulates responses coming back from channel 0 (i.e., target agent0) and channel 1 (i.e., target agent1) in the aggregate target 1337, but also handles responses returned by the normal target agent2 1341. The response-side channel merger 1381 receives responses from target agent0 1343, target agent1 1339, and target agent2 1341.

Since a response-side channel merger unit needs to regulate channel responses but it may not have enough information to act upon, additional re-ordering information can be passed to the merger unit from the request-side channel splitter of the initiator agent. For instance, the DRL link 1391 is used to pass response re-ordering information between the request-side channel thread splitter unit, Req_rs11, and the response-side channel thread merger unit, lah11_ms0, for initiator agent1 1333.

Target agent TA0 1343 is assigned to channel 0 and target agent TA1 1339 is assigned to channel 1 for the multi-channel target DRAM. Connectivity between initiators and individual targets of the multi-channel target DRAM is done via connectivity statements that specify the initiator agent (connected to an initiator) and the specific target agent (connected to an individual target of the multi-channel target DRAM) as shown in the example.

Also disclosed are two multi-channel address regions: SMS_reg and USB_mem. The specification of the SMS_reg region can be explained as follows: The size of this region is 0x1000 bytes. A channel_interleave_size of 8 means that each interleave is of size 0x100 (2^8 bytes). This results in 16 non-overlapping memory interleave segments (region size 0x1000/interleave size 0x100=16). As discussed, each interleave is assigned to a channel using the “channel round” idea. In this case there are 2 channels, so interleaves 0, 2, 4, 6, 8, 10, 12, 14 are assigned to channel 0 (target agent TA0) and interleaves 1, 3, 5, 7, 9, 11, 13, 15 are assigned to channel 1 (target agent TA1). Note that if an initiator agent connects to one individual target agent in a multi-channel target, this initiator agent should connect to all individual target agents in the multi-channel target. That is, as indicated in FIG. 9, the connection between IA2 and TA1 is NOT ALLOWED unless IA2 is also connected to TA0 at the same time.
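
The interleave-to-channel arithmetic in this example can be checked with a short sketch; the helper name and defaults below are illustrative assumptions, not the disclosed address decoder.

```python
def channel_for(addr, region_base=0x0, interleave_bits=8, num_channels=2):
    """Round-robin ("channel round") assignment of 2**interleave_bits-byte
    interleaves to channels within a multi-channel address region."""
    interleave_index = (addr - region_base) >> interleave_bits
    return interleave_index % num_channels

# Interleaves 0, 2, 4, ... land in channel 0 (TA0); 1, 3, 5, ... in channel 1.
assert channel_for(0x000) == 0
assert channel_for(0x100) == 1
assert channel_for(0x2FF) == 0   # still inside interleave 2
```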

In an embodiment, in the response path ordering, the interconnect maintains OCP thread order and has a mechanism to re-order responses in the response path. This is achieved by passing information from a request path channel splitter RS component to the corresponding response path channel merger MS component. The information is passed via a turnaround queue, which maintains FIFO order. The information passed tells the thread merger splitter unit component which incoming branch/thread the next response burst should come from. The thread merger splitter unit component applies backpressure to all branches/threads that map to the same outgoing thread, except for the one indicated by the turnaround queue. When the burst completes, the turnaround queue entry is popped. This mechanism ensures that all responses are returned in the correct order.
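
For exposition, the turnaround-queue handshake described above can be modeled as follows; the class and method names are hypothetical, and the model abstracts away the DL link signaling.

```python
from collections import deque

class TurnaroundQueue:
    """FIFO record, written by the request-path channel splitter, of which
    branch each response burst is expected to return on."""

    def __init__(self):
        self.q = deque()

    def push(self, branch):            # splitter side: record the branch the
        self.q.append(branch)          # next response burst must come from

    def expected_branch(self):         # merger side: only this branch is
        return self.q[0] if self.q else None   # released; others backpressured

    def pop_on_burst_complete(self):
        self.q.popleft()

class ResponseMerger:
    """Response-path channel merger applying per-thread backpressure."""

    def __init__(self, turnaround):
        self.turnaround = turnaround

    def may_pass(self, branch):
        expected = self.turnaround.expected_branch()
        return expected is None or branch == expected

    def on_response_transfer(self, branch, burst_last):
        assert self.may_pass(branch)
        if burst_last:                 # burst completes: pop the queue entry
            self.turnaround.pop_on_burst_complete()
```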

Chopping Individual Transactions that Cross Channel Boundaries Headed for Channels in an Aggregate Target

FIG. 10 illustrates a diagram of an embodiment of chopping logic to directly support chopping individual transactions that cross the channel address boundaries into two or more transactions/requests from the same thread, which makes the software and hardware that generates such traffic less dependent on the specific multiple channel configuration of a given SoC.

The interconnect implements chopping logic 1584 to chop individual burst requests that cross the memory channel address boundaries from a first memory channel 1520 to a second memory channel 1522 within the first aggregate target into two or more burst requests from the same thread. The chopping logic 1584 cooperates with a detector 1585 to detect when the starting address of an initial word of requested bytes in the burst request 1548 and the ending address of the last word of requested bytes in the burst request 1548 cause the requested bytes in that burst request 1548 to span across one or more channel address boundaries to fulfill all of the word requests in the burst request 1548. The chopping logic 1584 includes a channel chopping algorithm and one or more tables 1586 to track thread ordering in each burst request 1548 issued by an IP initiator core, to maintain a global target ordering among the chopped up portions of the burst request 1548 that are spread over the individual memory channels 1520 and 1522. Either in a distributed implementation with each initiator agent in the system or in a centralized memory scheduler, the system may have a detector 1585, chopping logic 1584, some buffers 1587, a state machine 1588, and counters to facilitate the chopping process as well as to ensure that the sequential order within the original chopped transaction is maintained.

The chopping logic supports transaction splitting across channels in an aggregate target. The chopping logic 1584 chops a burst when an initiator burst stays within a single region but spans a channel boundary. The chopping logic may be embedded in an initiator agent at the interface between the interconnect and a first initiator core. The chopping logic chops an initial burst request that spans across one or more memory channel address boundaries, to fulfill all of the word requests in the burst request, into two or more burst requests of a same height dimension for each memory channel. As shown in FIG. 12a, the chopping algorithm in the flow control logic 1657 chops a series of requests in the burst request so that a starting address of an initial request in the series has the same offset from a channel boundary in a first memory channel as the starting address of the next request starting the series of requests in the burst request in a neighboring row in the first memory channel, as shown in FIG. 12b. Also, if the burst request vertically crosses into another memory channel, then the chopping algorithm chops the series of requests in the burst request so that a starting address of an initial request has the same offset from a channel boundary in a first DRAM page of a first memory channel as the starting address of the next request starting the sequence of requests in the burst request in a second DRAM page of the first memory channel, as shown in FIG. 12c.

The detector 1585, in detecting 2D block type burst requests, also detects whether the initial word of the 2D burst request starts in a higher address numbered memory channel than the memory channels servicing subsequent requests in that 2D burst request from the chopped transaction. If the detector detects that the initial words in a first row of the 2D block burst that crosses a memory channel boundary start in a higher address numbered memory channel than subsequent requests to be serviced in a lower address numbered memory channel, then the state machine chops this first row into multiple bursts capable of being serviced independently of each other. The request containing the initial words in the first row of the 2D block burst request, which is headed to the higher address numbered memory channel, must be acknowledged as being received at the last thread merger unit prior to the intended higher address numbered memory channel before the chopping logic allows the second burst, containing the remainder of the first row, to be routed to the lower address numbered memory channel.

A state machine 1588 in the chopping logic chops a transaction based upon the type of burst request crossing the memory channel address boundary. The detector 1585 detects the type of burst. The detector detects a request containing burst information that communicates one or more read requests in a burst from an initiator Intellectual Property (IP) core that are going to related addresses in a single target IP core. A burst type communicates the address sequence of the requested data within the target IP core. The state machine 1588 may perform the actual chopping of the individual transactions that cross the initial channel address boundary into two or more transactions/requests from the same thread and put the chopped portions into the buffers 1587. The detector 1585 may then check whether the remaining words in the burst request cross another channel address boundary. The state machine will chop the transaction until the resulting transaction fits within a single channel's address boundary. The state machine 1588 may factor into the chop of a transaction 1) the type of burst request, 2) the starting address of the initial word in the series of requests in the burst request, 3) the burst length indicating the number of words in the series of requests in the burst request, and 4) the word length involved in crossing the channel address boundary. The word length and number of words in the burst request may be used to calculate the ending address of the last word in the original burst request. The design allows the traffic generating elements to allow both their request and response traffic to cross such channel address boundaries.

In an embodiment, a burst length may communicate that multiple read requests in this burst are coming from this same initiator IP core and are going to related addresses in a single target IP core. A burst type may indicate that the request is for a series of incrementing addresses, or for non-incrementing addresses following a related pattern of addresses, such as a block transaction. The burst sequence may be for non-trivial 2-dimensional block, wrap, XOR, or similar burst sequences. If the block transaction is for two-dimensional data, then the request also contains annotations indicating 1) a width of the two-dimensional object measured in the length of a row (such as a width of a raster line), 2) a height of the two-dimensional object measured in the number of rows the two-dimensional object will occupy, and 3) a stride of the two-dimensional object measured in the address spacing between two consecutive rows. The address spacing between two consecutive rows can be 1) a length difference between the starting addresses of two consecutive rows occupied by the target data, 2) a difference between the end of a previous row and the beginning of the next row, or 3) similar spacing. The single 2D block burst request may fully describe the attributes of a two-dimensional data block across the interconnect so that a target can decode the single request.

A request generated for a block transaction may include annotations indicating that an N number of read requests in this burst are going to related addresses in a single target, a length of a row occupied by a target data, a number of rows occupied by the target data, and a length difference between starting addresses of two consecutive row occupied by the target data.
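
For illustration, the annotations of a 2D block burst can be sketched as a small record from which a target can regenerate the row address sequence; the field names below are assumptions, since the OCP interface defines its own signal names (e.g., MBurstHeight, MBurstStride).

```python
from dataclasses import dataclass

@dataclass
class BlockBurst:
    base_addr: int    # starting address of the first row
    width: int        # row length, in words
    height: int       # number of rows occupied by the target data
    stride: int       # address spacing between consecutive row starts

    def row_start_addrs(self):
        """Address sequence a target can reproduce from the single request."""
        return [self.base_addr + r * self.stride for r in range(self.height)]

# A 4-word-wide, 2-row block whose rows start 0x40 bytes apart:
assert BlockBurst(0x1000, 4, 2, 0x40).row_start_addrs() == [0x1000, 0x1040]
```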

Chopping Individual Transactions that Cross Channel Boundaries Headed for Channels in an Aggregate Target so that Two or More of the Chopped Portions Retain their 2D Burst Attributes

FIGS. 12a-12e illustrate five types of channel based chopping for block burst requests: normal block chopping, block row chopping, block height chopping, block deadlock chopping, and block deadlock chopping and then block height chopping. The state machine may be configured to implement channel based chopping rules as follows:

For unknown pattern types of burst requests, the chopping logic breaks the single initiator burst into a sequence of single initiator word transfers with the same sequence code (chop to initiator singles).

For detected types of bursts such as streaming, incrementing address, XOR, and wrap bursts, the chop fits them within a single channel. Streaming bursts, by definition, are always within a single channel. An incrementing burst request is for a series of incrementing addresses, and an XOR burst is for non-incrementing addresses following a related pattern of addresses; either may cross a channel boundary. The state machine breaks the single initiator burst into a sequence of two or more separate burst requests, each with a burst length reduced to fit within an individual channel of an aggregate target (chop to channels). Moreover, for any XOR burst crossing a channel boundary, the resulting channel bursts have a burst byte length equal to 2 times 2^channel_interleave_size bytes, and the second burst starts at MAddr +/- 2^channel_interleave_size. For WRAP bursts that cross a channel boundary, the state machine breaks the single initiator burst into a sequence of single initiator word transfers (chop to initiator singles). Normally the interleave_size is selected to be larger than the cache lines whose movement is the dominant source of WRAP bursts, so channel crossing WRAPs will usually not occur; alternatively, the chopping logic chops up a WRAP burst into two INCR bursts when the WRAP burst crosses a channel boundary.
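
The "chop to channels" rule for an incrementing burst can be illustrated with a short sketch; the word size, parameter names, and single-region assumption are illustrative, not the disclosed state machine.

```python
def chop_incr_to_channels(start_addr, burst_len, word_bytes, interleave_bits):
    """Split one INCR burst into per-channel INCR bursts at each
    2**interleave_bits-byte channel boundary it crosses."""
    bursts, addr, remaining = [], start_addr, burst_len
    interleave = 1 << interleave_bits
    while remaining:
        boundary = (addr | (interleave - 1)) + 1       # next channel boundary
        words_here = min(remaining, (boundary - addr) // word_bytes)
        bursts.append((addr, words_here))              # (start, burst length)
        addr += words_here * word_bytes
        remaining -= words_here
    return bursts

# An 8-word (4-byte words) burst starting 8 bytes below a 0x100 boundary is
# chopped into a 2-word burst in one channel and a 6-word burst in the next:
assert chop_incr_to_channels(0xF8, 8, 4, 8) == [(0xF8, 2), (0x100, 6)]
```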

For any initiator 2-Dimensional block burst to a target that is not capable of supporting the block burst, but where the target does support INCR bursts, the state machine performs block row chopping. Block row chopping breaks the initiator burst into a sequence of INCR bursts, one for each row in the block burst. If the row(s) cross a channel boundary, each row is broken into a sequence of 2 INCR bursts, one to each channel. Each such INCR burst may be further chopped into smaller INCR bursts if the target has user-controlled burst chopping and does not have a sufficiently large chop_length, or if the target supports a shorter OCP MBurstLength.

The chopping logic prevents a deadlock situation, when each smaller burst/portion of the transaction has requests that need to be serviced by their own channel and these requests should be serviced from each channel in a ping-pong fashion, by making sure that a burst request headed to a lower address numbered memory channel is serviced initially and then a burst request in the second portion may be serviced by a higher address numbered memory channel. If the initiator block row(s) cross a channel boundary and the burst starts in a higher address numbered memory channel than the memory channels servicing subsequent requests in that burst, then block deadlock chopping creates 4 target bursts as shown in FIG. 12d. The first of the 4 chopped bursts (resulting from the deadlock block chopping) is a single row block with chopped length for the highest-numbered channel. It corresponds to the leading part of the first row of the initiator block burst that falls into the highest-numbered channel. The last of the 4 chopped bursts (resulting from the deadlock block chopping) is a single row block with chopped length for the first channel (channel 0). It corresponds to the trailing part of the last row of the initiator block burst that falls into channel 0. The first and last single row block bursts are separated by an even number of block bursts, each containing a series of rows that alternately fall into channel 0 and then the highest-numbered channel, ch 3. Each pair of such channel block bursts has a new and the largest possible/affordable MBurstHeight that is a power of two. The 4 target bursts may have a new MBurstStride equal to the initiator-supplied MBurstStride divided by num_active_channels.
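
For exposition only, the shape of the four target bursts produced by block deadlock chopping in a two-channel case can be sketched as below, assuming (height - 1) is already a power of two; otherwise the middle pair is further split by block height chopping, as in FIG. 12e. The function and labels are hypothetical.

```python
def deadlock_chop_shape(height):
    """Shape of the 4 target bursts for a 2-channel block deadlock chop.
    Row leads fall into the highest-numbered channel and row tails into
    channel 0, per the deadlock scenario described above."""
    h = height - 1                    # rows covered by each middle block burst
    return [
        ("highest channel", "single-row block", 1),  # lead of the first row
        ("channel 0",       "block burst",      h),  # tails of rows 0..h-1
        ("highest channel", "block burst",      h),  # leads of rows 1..h
        ("channel 0",       "single-row block", 1),  # tail of the last row
    ]

# e.g. a 5-row block (height - 1 = 4, a power of two) yields heights 1,4,4,1:
assert [b[2] for b in deadlock_chop_shape(5)] == [1, 4, 4, 1]
```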

Whenever normal block chopping or block deadlock chopping is applied to a block Write burst, or to a block Multiple Request Multiple response Data (MRMD) Read burst that is not translated to Single Request Multiple response Data (SRMD) (i.e., MRMD Read to SRMD Read translation is disabled for the given target), the initiator agent sends the two resulting channel block bursts as a single atomic sequence, called an interleaved block burst. The reason is to prevent downstream mergers from interleaving in other traffic from other initiators while an upstream splitter switches among the alternative rows of the two channel block bursts; i.e., the splitter has to lock arbitration (using m_lockarb) on both of its outgoing branches/threads until all rows are processed and then release the lock on both branches/threads. In the alternative, the m_lockarb action at the splitter may be the following: the initiator agent should set m_lockarb properly among the alternative rows to prevent downstream mergers from interleaving in other traffic before these alternative rows reach the first channel splitter RS (only 1 channel crossing). At the channel splitter, m_lockarb needs to be set for the first block burst's last row.

In the interconnect, 2D block bursts are sent as Single Request Multiple response Data bursts whenever possible (i.e., MRMD to SRMD conversion of RD bursts is not disabled). Burst length conversion for block channel bursts (post channel burst chopping) is performed similarly to INCR bursts. For example, for wide-to-narrow conversion, the burst length is multiplied by the ratio of initiator to target data widths; for narrow-to-wide conversions, the initiator agent pads each row at start and end to align it to the target data width, and the resulting initiator burst (row) length is divided to get the target burst length.

As shown in FIG. 12e, a round of block height chopping is applied to the second and third of the 4 chopped bursts resulting from the original block deadlock chopping.

In an embodiment, when the chopping logic chops a request into two, the chopping logic maintains the width of the word request being chopped by figuring out the number of bits in the first portion of the chopped word request serviced by a first channel and subtracting that number of bits from the width of a word to determine the width of the second portion of the chopped word request serviced by the next channel. See FIG. 3 and chopped request d. The second portion of the chopped word request serviced by a second channel has a starting address of the first row of the next channel. Also, each portion of a chopped burst request may be chopped so that the start address for the requested bytes of an initial request in the series of requests in each portion has the same relative position within a channel (the same relative offset in a column from the channel boundary) as the other words in the column. See FIG. 12a and the aligned portions in Channel 0.

A DL link payload signal p_split_info may be used to notify the splitter. The p_split_info field is zero for non-INT_block bursts. For INT_block bursts, split_info identifies the downstream splitter where the INT_block burst will split into two. The channel splitter whose channel_splitter_id matches p_split_info will split the INT_block burst and reset to 0 any m_lockarb=1 in that atomic sequence that is accompanied by a p_burstlast=1.

Higher Performance Access Protection

The chopping logic in the interconnect may also employ a new higher performance architecture for access protection mechanism (PM) checking. The architecture is a dual look-up architecture. Each request burst issued to the target agent is first qualified by the PM using two look-ups in parallel. The first look-up is based upon the starting address for the burst. The second look-up is based upon the calculated ending address for the burst. Qualification of the access as permitted requires all the conditions as currently required in SMX associated with the first look-up, plus one new condition. The new condition is that the first and second look-ups must hit the same protection region. This disqualifies bursts that cross a protection region boundary, even if the proper permissions are set in both the starting and the ending regions. It is expected and required that a single protection region covers the data sets accessed by bursts.

The second look-up is only performed for INCR bursts at targets with burst_aligned=0, and for block bursts. For WRAP, XOR, STRM, and burst-aligned INCR bursts, success of the second look-up is guaranteed (by the aligned nature of the bursts, the range of lengths supported, and the minimum granularity of protection region sizes). UNKN and DFLT2 transactions are still only handled as single word transfers at protected target agents, so the second look-up for these is also assured.
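
A minimal sketch of the dual look-up qualification, under simplified assumptions about region and permission encoding (the names below are hypothetical), could look like the following:

```python
def pm_allows(burst_start, burst_end, regions, initiator):
    """regions: list of (base, size, allowed_initiators). Both look-ups must
    hit, permissions must pass, and both must hit the *same* region."""
    def lookup(addr):
        for i, (base, size, allowed) in enumerate(regions):
            if base <= addr < base + size:
                return i if initiator in allowed else None
        return None

    first = lookup(burst_start)     # look-up on the starting address
    second = lookup(burst_end)      # look-up on the calculated ending address
    return first is not None and first == second

regions = [(0x0000, 0x1000, {"ia0"}), (0x1000, 0x1000, {"ia0", "ia1"})]
assert pm_allows(0x0F00, 0x0FFF, regions, "ia0")       # within one region
assert not pm_allows(0x0F00, 0x10FF, regions, "ia0")   # crosses a region
```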

FIG. 13 illustrates a flow diagram of an embodiment of an example process for generating a device, such as a System on a Chip, with the designs and concepts discussed above for the Interconnect. The example process for generating a device from designs of the Interconnect may utilize an electronic circuit design generator, such as a System on a Chip compiler, to form part of an Electronic Design Automation (EDA) toolset. Hardware logic, coded software, and a combination of both may be used to implement the following design process steps using an embodiment of the EDA toolset. The EDA toolset may be a single tool or a compilation of two or more discrete tools. The information representing the apparatuses and/or methods for the circuitry in the Interconnect, etc., may be contained in an Instance such as in a cell library, soft instructions in an electronic circuit design generator, or a similar machine-readable storage medium storing this information. The information representing the apparatuses and/or methods stored on the machine-readable storage medium may be used in the process of creating the apparatuses, or representations of the apparatuses such as simulations and lithographic masks, and/or methods described herein.

Aspects of the above design may be part of a software library containing a set of designs for components making up the Interconnect and associated parts. The library cells are developed in accordance with industry standards. The library of files containing design elements may be a stand-alone program by itself as well as part of the EDA toolset.

The EDA toolset may be used for making a highly configurable, scalable System-On-a-Chip (SOC) inter block communication system that integrally manages input and output data, control, debug and test flows, as well as other functions. In an embodiment, an example EDA toolset may comprise the following: a graphic user interface; a common set of processing elements; and a library of files containing design elements such as circuits, control logic, and cell arrays that define the EDA tool set. The EDA toolset may be one or more software programs comprised of multiple algorithms and designs for the purpose of generating a circuit design, testing the design, and/or placing the layout of the design in a space available on a target chip. The EDA toolset may include object code in a set of executable software programs. The set of application-specific algorithms and interfaces of the EDA toolset may be used by system integrated circuit (IC) integrators to rapidly create an individual IP core or an entire System of IP cores for a specific application. The EDA toolset provides timing diagrams, power and area aspects of each component and simulates with models coded to represent the components in order to run actual operation and configuration simulations. The EDA toolset may generate a Netlist and a layout targeted to fit in the space available on a target chip. The EDA toolset may also store the data representing the interconnect and logic circuitry on a machine-readable storage medium.

Generally, the EDA toolset is used in two major stages of SOC design: front-end processing and back-end programming.

Front-end processing includes the design and architecture stages, which includes design of the SOC schematic. The front-end processing may include connecting models, configuration of the design, simulating, testing, and tuning of the design during the architectural exploration. The design is typically simulated and tested. Front-end processing traditionally includes simulation of the circuits within the SOC and verification that they should work correctly. The tested and verified components then may be stored as part of a stand-alone library or part of the IP blocks on a chip. The front-end views support documentation, simulation, debugging, and testing.

In block 2005, the EDA tool set may receive a user-supplied text file having data describing configuration parameters and a design for at least part of an individual IP block having multiple levels of hierarchy. The data may include one or more configuration parameters for that IP block. The IP block description may be an overall functionality of that IP block, such as an Interconnect. The configuration parameters for the Interconnect IP block may be the number of address regions in the system, the system addresses, how data will be routed based on system addresses, etc.

The EDA tool set receives user-supplied implementation technology parameters such as the manufacturing process to implement component level fabrication of that IP block, an estimation of the size occupied by a cell in that technology, an operating voltage of the component level logic implemented in that technology, an average gate delay for standard cells in that technology, etc. The technology parameters describe an abstraction of the intended implementation technology. The user-supplied technology parameters may be a textual description or merely a value submitted in response to a known range of possibilities.

The EDA tool set may partition the IP block design by creating an abstract executable representation for each IP sub component making up the IP block design. The abstract executable representation models TAP characteristics for each IP sub component and mimics characteristics similar to those of the actual IP block design. A model may focus on one or more behavioral characteristics of that IP block. The EDA tool set executes models of parts or all of the IP block design. The EDA tool set summarizes and reports the results of the modeled behavioral characteristics of that IP block. The EDA tool set also may analyze an application's performance and allows the user to supply a new configuration of the IP block design or a functional description with new technology parameters. After the user is satisfied with the performance results of one of the iterations of the supplied configuration of the IP design parameters and the technology parameters run, the user may settle on the eventual IP core design with its associated technology parameters.

The EDA tool set integrates the results from the abstract executable representations with potentially additional information to generate the synthesis scripts for the IP block. The EDA tool set may supply the synthesis scripts to establish various performance and area goals for the IP block after the result of the overall performance and area estimates are presented to the user.

The EDA tool set may also generate an RTL file of that IP block design for logic synthesis based on the user supplied configuration parameters and implementation technology parameters. As discussed, the RTL file may be a high-level hardware description describing electronic circuits with a collection of registers, Boolean equations, control logic such as “if-then-else” statements, and complex event sequences.

In block 2010, a separate design path in an ASIC or SOC chip design is called the integration stage. The integration of the system of IP blocks may occur in parallel with the generation of the RTL file of the IP block and synthesis scripts for that IP block.

The EDA toolset may provide designs of circuits and logic gates to simulate and verify that the operation of the design works correctly. The system designer codes the system of IP blocks to work together. The EDA tool set generates simulations of representations of the circuits described above that can be functionally tested, timing tested, debugged, and validated. The EDA tool set simulates the system of IP blocks' behavior. The system designer verifies and debugs the system of IP blocks' behavior. The EDA tool set packages the IP core. A machine-readable storage medium may also store instructions for a test generation program to generate instructions for an external tester and the interconnect to run the test sequences for the tests described herein. One of ordinary skill in the art of electronic design automation knows that a design engineer creates and uses different representations to help generate tangible, useful information and/or results. Many of these representations can be high-level (abstracted and with fewer details) or top-down views, and can be used to help optimize an electronic design starting from the system level. In addition, a design process usually can be divided into phases, and at the end of each phase, a representation tailor-made to that phase is usually generated as output and used as input by the next phase. Skilled engineers can make use of these representations and apply heuristic algorithms to improve the quality of the final results coming out of the final phase. These representations allow the electronic design automation world to design circuits, test and verify circuits, and derive lithographic masks from Netlists of circuits, among other similar useful results.

In block 2015, system integration may occur next in the integrated circuit design process. Back-end programming generally includes programming of the physical layout of the SOC, such as placing and routing, or floor planning, of the circuit elements on the chip layout, as well as the routing of all metal lines between components. The back-end files, such as a layout, physical Library Exchange Format (LEF), etc., are generated for layout and fabrication.

The generated device layout may be integrated with the rest of the layout for the chip. A logic synthesis tool receives synthesis scripts for the IP core and the RTL design file of the IP cores. The logic synthesis tool also receives characteristics of logic gates used in the design from a cell library. RTL code may be generated to instantiate the SOC containing the system of IP blocks. The system of IP blocks with the fixed RTL and synthesis scripts may be simulated and verified. Synthesizing of the design with Register Transfer Level (RTL) may occur. The logic synthesis tool synthesizes the RTL design to create a gate level Netlist circuit design (i.e. a description of the individual transistors and logic gates making up all of the IP sub component blocks). The design may be outputted into a Netlist of one or more hardware design languages (HDL) such as Verilog, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) or SPICE (Simulation Program for Integrated Circuit Emphasis). A Netlist can also describe the connectivity of an electronic design such as the components included in the design, the attributes of each component and the interconnectivity amongst the components. The EDA tool set facilitates floor planning of components including adding of constraints for component placement in the space available on the chip such as XY coordinates on the chip, and routes metal connections for those components. The EDA tool set provides the information for lithographic masks to be generated from this representation of the IP core to transfer the circuit design onto a chip during manufacture, or other similar useful derivations of the circuits described above. Accordingly, back-end programming may further include the physical verification of the layout to verify that it is physically manufacturable and the resulting SOC will not have any function-preventing physical defects.

In block 2020, a fabrication facility may fabricate one or more chips with the signal generation circuit utilizing the lithographic masks generated from the EDA tool set's circuit design and layout. Fabrication facilities may use a standard CMOS logic process having minimum line widths such as 1.0 um, 0.50 um, 0.35 um, 0.25 um, 0.18 um, 0.13 um, 0.10 um, 90 nm, 65 nm or less, to fabricate the chips. The size of the CMOS logic process employed typically defines the smallest minimum lithographic dimension that can be fabricated on the chip using the lithographic masks, which in turn, determines minimum component size. According to one embodiment, light including X-rays and extreme ultraviolet radiation may pass through these lithographic masks onto the chip to transfer the circuit design and layout for the test circuit onto the chip itself.

The EDA toolset may have configuration dialog plug-ins for the graphical user interface. The EDA toolset may have an RTL generator plug-in for the SocComp. The EDA toolset may have a SystemC generator plug-in for the SocComp. The EDA toolset may perform unit-level verification on components that can be included in RTL simulation. The EDA toolset may have a test validation testbench generator. The EDA toolset may have a dis-assembler for virtual and hardware debug port trace files. The EDA toolset may be compliant with open core protocol standards. The EDA toolset may have Transactor models, Bundle protocol checkers, OCPDis2 to display socket activity, OCPPerf2 to analyze performance of a bundle, as well as other similar programs.

As discussed, an EDA tool set may be implemented in software as a set of data and instructions, such as an Instance in a software library callable to other programs, or an EDA tool set consisting of an executable program with the software cell library in one program, stored on a machine-readable medium. A machine-readable storage medium may include any mechanism that provides (e.g., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium may include, but is not limited to: read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; DVDs; EPROMs; EEPROMs; FLASH; magnetic or optical cards; or any other type of media suitable for storing electronic instructions. The instructions and operations also may be practiced in distributed computing environments where the machine-readable media is stored on and/or executed by more than one computer system. In addition, the information transferred between computer systems may either be pulled or pushed across the communication media connecting the computer systems.

Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

While some specific embodiments of the invention have been shown, the invention is not to be limited to these embodiments. For example, most functions performed by electronic hardware components may be duplicated by software emulation. Thus, a software program written to accomplish those same functions may emulate the functionality of the hardware components in input-output circuitry. A target may be single threaded or multiple threaded. The invention is to be understood as not limited by the specific embodiments described herein, but only by the scope of the appended claims.

Claims

1. An integrated circuit having multiple initiator IP cores and multiple target IP cores that communicate request transactions over an interconnect, where the interconnect provides a shared communications bus between the multiple initiator IP cores and multiple target IP cores, comprising:

flow control logic for the interconnect configured to apply a flow control splitting protocol to permit transactions from a first initiator thread or a first initiator tag stream to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual target IP cores within the aggregate target at once, where the combination of the flow control logic and the flow control splitting protocol allows the interconnect to manage simultaneous requests to multiple channels in the aggregate target from the same first initiator thread or the same first initiator tag stream at the same time.
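
Purely as an illustrative aid, and not as part of the claimed subject matter, the following minimal C++ sketch shows one plausible shape for per-thread flow-control state that lets a single initiator thread hold requests outstanding to several channels at once; the channel count, credit limit, and every identifier here are assumptions of this example rather than anything recited in the claims.

```cpp
// Hypothetical sketch: per-thread credit state that lets a single initiator
// thread keep requests outstanding to several channels of an aggregate
// target at the same time. All identifiers and limits are illustrative.
#include <array>

constexpr int kNumChannels   = 2;  // channels in the aggregate target (assumed)
constexpr int kMaxPerChannel = 4;  // assumed per-channel credit limit

struct ThreadFlowState {
    // Requests currently in flight from this thread, counted per channel.
    std::array<int, kNumChannels> outstanding{};

    // Splitting rule: a request may issue whenever its destination channel
    // has credit, even if other channels already hold this thread's requests.
    bool can_issue(int channel) const {
        return outstanding[channel] < kMaxPerChannel;
    }
    void on_issue(int channel)  { ++outstanding[channel]; }
    void on_retire(int channel) { --outstanding[channel]; }
};
```

Under such a rule, a credit stall on one channel does not block the thread from issuing to another channel, which is what permits the simultaneous outstanding transactions described above.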

2. The integrated circuit of claim 1, where the flow control logic also includes thread merger and thread splitter units in its architecture to intentionally split request transactions in an initiator agent, or as early as possible along the path in the interconnect to the target agents for the multiple channels of the aggregate target, and this approach avoids creating a centralized point that could act as a bandwidth choke point and routing congestion point.

3. The integrated circuit of claim 1, where a distribution of the flow control logic eliminates a need to have all the communication paths in the interconnect pass through a single choke point because many distributed pathways exist in this shared communications bus, and the flow control logic for the interconnect is configured to apply a flow control splitting protocol to also split the request transactions early when parts of that set of request transactions are routed on separate physical pathways in the interconnect as well as to target IP cores physically located in different areas on the integrated circuit, and

where the flow control splitting protocol is also configured to allow multiple transactions to be issued and serviced in parallel, which increases the efficiency of each initiator by allowing more transactions to be serviced in the same period of time, where a first and a second transaction from a first initiator IP core are issued prior to the first transaction being completely serviced by a first target IP core, such that the first initiator IP core and the first target IP core are working on multiple transactions at the same time.

4. The integrated circuit of claim 1, where the interconnect has multiple thread merger and thread splitter units in the flow control logic distributed over the interconnect that maintain request order for read and write request transactions over the interconnect, where one or more of the thread splitter units route request transactions from a first initiator IP core generating a set of request transactions in the first initiator thread down two or more different physical paths to the target IP cores physically located in different areas on the integrated circuit.

5. The integrated circuit of claim 1, where the interconnect implements an address map with assigned addresses for the target IP cores in the integrated circuit to route request transactions between the target IP cores and the initiator IP cores in the integrated circuit, where the interconnect is configured to interrogate the address map based on a logical destination address associated with a first request to the aggregate target with two or more interleaved memory channels, and to determine which memory channels will service the first request and how to route the first request to the physical IP addresses of each memory channel in the aggregate target servicing that request, so that an initiator IP core need not know the physical IP addresses of each memory channel in the aggregate target, and

where the flow control splitting protocol implemented in the flow control logic is also configured to allow multiple transactions from either 1) the same initiator IP core thread or 2) the same initiator IP core set of tags to be outstanding to the multiple channels of the aggregated target at the same time, and
the multiple channels in the aggregated target map to target IP cores having physically different addresses.
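
As a hedged illustration of the address-map interrogation described in claim 5, and not an assertion of the actual implementation, the sketch below decodes a logical system address of an interleaved aggregate target into a channel selection and a channel-local address, so the initiator never sees the per-channel physical addresses; the interleave granule, channel count, and all names are invented for this example.

```cpp
// Hypothetical sketch of interrogating an interleaved address map: given a
// logical system address, pick the channel that services it and compute the
// channel-local address presented to that channel's target agent.
#include <cstdint>

constexpr uint64_t kInterleaveBytes = 4096;  // assumed interleave granule
constexpr uint64_t kNumChannels     = 2;     // assumed channel count

struct ChannelRoute {
    uint64_t channel;     // which memory channel services the request
    uint64_t local_addr;  // address within that channel's dense local range
};

ChannelRoute decode(uint64_t system_addr) {
    uint64_t granule = system_addr / kInterleaveBytes;  // interleave slice index
    ChannelRoute r;
    r.channel = granule % kNumChannels;  // slices rotate across the channels
    // Collapse the slices owned by this channel into a dense local range.
    r.local_addr = (granule / kNumChannels) * kInterleaveBytes
                 + (system_addr % kInterleaveBytes);
    return r;
}
```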

6. The integrated circuit of claim 1, where the flow control logic in the interconnect maintains a request order within the first initiator thread and an expected response order to those requests, and where the interconnect includes three or more initiator agents and three or more target agents, where two or more target agents are located at physically different locations coupling to the interconnect but belong to the same aggregate target with multiple channels.

7. The integrated circuit of claim 1, where one or more thread splitter units with the flow control logic in a request network split the path links to the aggregate target with multiple channels, which is a Dynamic Random-Access Memory (DRAM) IP core, where a first request travels a first link to a first channel in the multi-channel target DRAM and a second request travels a second link to a second channel in the multi-channel target DRAM, and two or more target agents are coupled to the multi-channel target DRAM, where a first target agent is assigned to the first channel and a second target agent is assigned to the second channel of the multi-channel target DRAM, where the first and second target agents are at physically different locations coupling to the interconnect and belong to the same aggregate target with multiple channels, and where the thread splitter units and other associated flow control logic minimize the transaction and routing congestion issues associated with a centralized channel splitter.

8. The integrated circuit of claim 1, where a distributed implementation of the flow control logic in each thread splitter unit and thread merger unit allows those units to interrogate a local system address map to determine both 1) thread routing and 2) thread buffering until a switch of physical paths can occur, and where the thread splitter units and thread merger units cooperate end-to-end to ensure ordering without a need to install one or more full transaction reorder buffers within the interconnect.

9. The integrated circuit of claim 1, where the flow control logic internal to 1) the interconnect or 2) the initiator agent interrogates the address map and a known structural organization of the aggregated target in the integrated circuit to decode an interleaved address space of the aggregated target to determine any physical distinctions between the target IP cores making up the aggregated target IP core in order to determine which targets making up the aggregated target need to service a given request from an initiator IP core, and where the flow control logic applies a flow control splitting protocol to allow multiple transactions from the same thread to be outstanding to multiple channels of the aggregated target at any given time, and the multiple channels in the aggregated target map to target IP memory cores having physically different addresses.

10. The integrated circuit of claim 1, where an initiator agent interfacing the interconnect for a first initiator IP core interrogates an address map based on a logical destination address associated with a request to the aggregate target that has two or more interleaved memory channels, and determines which memory channels will service the request and how to route the request to the physical IP addresses of each memory channel in the aggregate target servicing that request, so that any initiator IP core need not know the physical IP addresses of each memory channel in the aggregate target.

11. The integrated circuit of claim 1, where the flow control logic is configured to apply a flow control splitting protocol to allow multiple transactions from the same thread to be outstanding to the multiple channels of the aggregated target at any given time and the multiple channels in the aggregated target map to target memory cores having physically different addresses.

12. The integrated circuit of claim 1, where chopping logic and the flow control logic cooperate to allow requests that are part of a request burst transaction to cross an interleave boundary of the aggregate target such that some request transfers are sent to one channel target while others are sent to another channel target within the aggregate target, where the chopping logic is internal to the interconnect and is configured to chop individual burst transactions that cross channel boundaries headed for channels in the aggregate target into two or more requests.
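
To make the chopping behavior of claim 12 concrete, here is a hedged C++ sketch of one way a linear burst whose byte span crosses an interleave boundary could be cut into per-channel requests; the interleave size and all identifiers are assumptions of this illustration, not the claimed logic itself.

```cpp
// Hypothetical sketch of chopping logic: a burst that crosses an interleave
// boundary is cut into pieces, one per interleave slice touched, so some
// transfers go to one channel target and the rest to another.
#include <cstdint>
#include <vector>

constexpr uint64_t kInterleaveBytes = 4096;  // assumed channel interleave granule

struct BurstPiece {
    uint64_t start;   // starting byte address of this chopped request
    uint64_t length;  // bytes carried by this chopped request
};

std::vector<BurstPiece> chop_burst(uint64_t start, uint64_t length) {
    std::vector<BurstPiece> pieces;
    uint64_t end = start + length;  // one past the last requested byte
    while (start < end) {
        // Detect the next interleave boundary at or above the current start.
        uint64_t boundary  = (start / kInterleaveBytes + 1) * kInterleaveBytes;
        uint64_t piece_end = boundary < end ? boundary : end;
        pieces.push_back({start, piece_end - start});
        start = piece_end;
    }
    return pieces;  // size() == 1 means no channel boundary was crossed
}
```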

13. The integrated circuit of claim 1, where the initiator cores do not need a priori knowledge of a memory's address structure and organization in the aggregate target; rather, one or more initiator agents have this structural and organizational knowledge of the memory channels to choose a true address of the target of a request transaction, a route to the target of the request transaction from a first initiator IP core across the interconnect, and then a channel within the aggregated target.

14. The integrated circuit of claim 1, where address decoding of an intended address of the request transaction from the first initiator thread happens as soon as the request transaction enters an interface of the interconnect, and the flow control logic interrogates an address map and a known structural organization of each aggregated target IP core in the integrated circuit to decode an interleaved address space of the aggregated targets to determine the physical distinctions between the target IP cores making up a particular aggregated target IP core, in order to determine which target IP cores making up a first aggregated target need to service a current request transaction.

15. The integrated circuit of claim 1, where two or more thread splitter units with the flow control logic are configured to route request transactions from an initiator IP core generating a set of transactions in the first initiator thread down two or more different physical paths in the interconnect by routing a first request with a destination address headed to a first physical location on the integrated circuit, which is a first target, and other requests within that first initiator thread having destination addresses headed to physical locations on the integrated circuit different from the first physical location, where the first physical location is a first channel and the different physical location is a second channel making up part of the aggregate target, where the first and second channels share an address region to appear as a single logical aggregated target, and where a channel merger component in a response path maintains response path ordering, where a mechanism to re-order responses in the response path includes passing information from a channel splitter in the request path to a corresponding channel merger component in the response path, and the information passed tells the channel merger component which incoming thread the next response burst transaction should come from.
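
The response-ordering mechanism of claim 15, in which the channel splitter passes information forward so the channel merger knows which branch the next response burst must come from, could look something like the following hedged sketch; the queue-based bookkeeping and all identifiers are illustrative assumptions only.

```cpp
// Hypothetical sketch of request-side/response-side cooperation: for each
// burst it forwards, the channel splitter records which outgoing branch was
// used, and the channel merger only accepts the next response burst from the
// branch at the head of that record, preserving the thread's response order.
#include <deque>

struct OrderChannel {
    std::deque<int> pending;  // branch id per request burst, oldest first

    // Request path: the channel splitter notes the branch it routed to.
    void splitter_sent(int branch) { pending.push_back(branch); }

    // Response path: the channel merger asks whether a response arriving on
    // `branch` is the one the initiator thread expects next.
    bool merger_may_accept(int branch) const {
        return !pending.empty() && pending.front() == branch;
    }
    void merger_accepted() { pending.pop_front(); }
};
```

Because only the head of the queue gates acceptance, responses from different branches may arrive in any order and are simply held until their turn, which avoids a full transaction reorder buffer.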

16. The integrated circuit of claim 1, where a request path in the interconnect includes a series of splitter and merger units in the flow control logic distributed across the interconnect to create different physical paths across the interconnect to the aggregate target with multiple channels, and where the aggregate target with multiple channels has two or more discrete memory channels, including on-chip IP cores and off-chip memory cores, that are interleaved with each other to appear to system software and other IP cores as a single memory in a system address space.

17. The integrated circuit of claim 1, where the interconnect implements the flow control logic and flow control protocol internal to the interconnect itself to manage expected execution ordering of a set of issued requests within the same first initiator thread, so that the requests are serviced and responses returned in order with respect to each other but independent of the ordering of another thread, and the flow control logic at a thread splitter unit permits transactions from one initiator thread to be outstanding to multiple channels at once and therefore to multiple individual target IP cores within a multi-channel target at once, where different channels are mapped to two individual target IP cores within the aggregate target with multiple channels, and

the integrated circuit has chopping logic to chop individual burst requests that cross the memory channel address boundaries from a first memory channel to a second memory channel within the first aggregate target into two or more burst requests from the same thread, where the chopping logic cooperates with a detector to detect when the starting address of an initial word of requested bytes in the burst request and the ending address of the last word of requested bytes in the burst request cause the requested bytes in that burst request to span across one or more channel address boundaries to fulfill all of the word requests in the burst request transaction.

18. A method of communicating requests over an interconnect in an integrated circuit having multiple initiator IP cores and multiple target IP cores, where the interconnect provides a shared communications bus between the multiple initiator IP cores and multiple target IP cores, comprising:

applying a flow control splitting protocol to permit transactions from one initiator thread or one initiator tag stream to be outstanding to multiple channels in a single aggregate target at once, and therefore to multiple individual target IP cores within the aggregate target at once, where the combined flow control logic and flow control protocol allows the interconnect to manage simultaneous requests to multiple channels in the aggregate target from the same thread or tag at the same time.
Patent History
Publication number: 20120036296
Type: Application
Filed: Oct 18, 2011
Publication Date: Feb 9, 2012
Applicant: SONICS, INC. (Milpitas, CA)
Inventors: Drew E. Wingard (Palo Alto, CA), Chien-Chun Chou (Saratoga, CA), Stephen W. Hamilton (Pembroke Pines, FL), Ian Andrew Swarbrick (Sunnyvale, CA), Vida Vakilotojar (Mountain View, CA)
Application Number: 13/276,041
Classifications
Current U.S. Class: Protocol (710/105)
International Classification: G06F 13/42 (20060101);