Abstract: Described herein is a method and system for directly accessing and transferring data between a first memory architecture and a second memory architecture associated with a graphics processing unit (GPU) by treating the first memory architecture, the second memory architecture and system memory as a single physical memory, where the first memory architecture is a non-volatile memory (NVM) and the second memory architecture is a local memory. Upon accessing a virtual address (VA) range by a processor, the requested content is paged in from the single physical memory and is then redirected by a virtual storage driver to the second memory architecture or the system memory, depending on which of the GPU or CPU triggered the access request. The memory transfer occurs without awareness of the application and the operating system.
Type:
Grant
Filed:
August 6, 2018
Date of Patent:
September 1, 2020
Assignees:
ADVANCED MICRO DEVICES, INC., ATI TECHNOLOGIES ULC
Abstract: A processor reduces bus bandwidth consumption by employing a shared load scheme, whereby each shared load retrieves data for multiple compute units (CUs) of a processor. Each CU in a specified group monitors a bus for load accesses directed to a cache shared by the multiple CUs. In response to identifying a load access on the bus, a CU determines if the load access is a shared load access for its share group. In response to identifying a shared load access for its share group, the CU allocates an entry of a private cache associated with the CU for data responsive to the shared load access. The CU then monitors the bus for the data targeted by the shared load. In response to identifying the targeted data on the bus, the CU stores the data at the allocated entry of the private cache.
Abstract: A method and system for compiler optimization includes analyzing a representation of source code to identify an original conditional construct having both a high-latency instruction and one or more instructions dependent on the high-latency instruction in a branch of the conditional construct. A set of one or more instructions following the conditional construct in the representation of source code and independent of the high-latency instruction is selected. An optimized representation of the source code is generated, whereby the optimized representation replaces the original conditional construct with a first split conditional construct positioned prior to the selected set of one or more instructions and a second split conditional construct positioned following the selected set of one or more instructions, The method further includes generating an executable representation of the source code based on the optimized representation of the source code.
Type:
Grant
Filed:
December 20, 2018
Date of Patent:
August 11, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Brian J. Favela, Todd Martin, Robert A. Gottlieb
Abstract: A computer vision processing device is provided which comprises memory configured to store data and a processor. The processor is configured to store captured image data in a first buffer and acquire access to the captured image data in the first buffer when the captured image data is available for processing. The processor is also configured to execute a first group of operations in a processing pipeline, each of which processes the captured image data accessed from the first buffer and return the first buffer for storing next captured image data when a last operation of the first group of operations executes.
Type:
Grant
Filed:
July 28, 2017
Date of Patent:
August 11, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Radhakrishna Giduthuri, Michael L. Schmit
Abstract: A processing system employs an expandable memory buffer that supports enlarging the memory buffer when the processing system generates a large number of long latency memory transactions. The hybrid structure of the memory buffer allows a memory controller of the processing system to store a larger number of memory transactions while still maintaining adequate transaction throughput and also ensuring a relatively small buffer footprint and power consumption. Further, the hybrid structure allows different portions of the buffer to be placed on separate integrated circuit dies, which in turn allows the memory controller to be used in a wide variety of integrated circuit configurations, including configurations that use only one portion of the memory buffer.
Abstract: Methods are provided for creating objects in a way that permits an API client to explicitly participate in memory management for an object created using the API. Methods for managing data object memory include requesting memory requirements for an object using an API and expressly allocating a memory location for the object based on the memory requirements. Methods are also provided for cloning objects such that a state of the object remains unchanged from the original object to the cloned object or can be explicitly specified.
Type:
Grant
Filed:
April 3, 2017
Date of Patent:
August 4, 2020
Assignees:
ATI TECHNOLOGIES ULC, ADVANCED MICRO DEVICES, INC.
Abstract: A set of entries in a branch prediction structure for a set of second blocks are accessed based on a first address of a first block. The set of second blocks correspond to outcomes of one or more first branch instructions in the first block. Speculative prediction of outcomes of second branch instructions in the second blocks is initiated based on the entries in the branch prediction structure. State associated with the speculative prediction is selectively flushed based on types of the branch instructions. In some cases, the branch predictor can be accessed using an address of a previous block or a current block. State associated with the speculative prediction is selectively flushed from the ahead branch prediction, and prediction of outcomes of branch instructions in one of the second blocks is selectively initiated using non-ahead accessing, based on the types of the one or more branch instructions.
Type:
Grant
Filed:
June 18, 2018
Date of Patent:
August 4, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Marius Evers, Aparna Thyagarajan, Ashok T. Venkatachar
Abstract: A system including a stack of two or more layers of volatile memory, such as layers of a 3D stacked DRAM memory, places data in the stack based on a temperature or a refresh rate. When a threshold is exceeded, data are moved from a first region to a second region in the stack, the second region having one or both of a second temperature lower than a first temperature of the first region or a second refresh rate lower than a first refresh rate of the first region.
Type:
Grant
Filed:
August 1, 2018
Date of Patent:
July 28, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Jagadish B. Kotra, Karthik Rao, Joseph L. Greathouse
Abstract: An electronic device handles memory access requests for data in a memory. The electronic device includes a memory controller for the memory, a last-level cache memory, a request generator, and a predictor. The predictor determines a likelihood that a cache memory access request for data at a given address will hit in the last-level cache memory. Based on the likelihood, the predictor determines: whether a memory access request is to be sent by the request generator to the memory controller for the data in parallel with the cache memory access request being resolved in the last-level cache memory, and, when the memory access request is to be sent, a type of memory access request that is to be sent. When the memory access request is to be sent, the predictor causes the request generator to send a memory request of the type to the memory controller.
Type:
Grant
Filed:
February 12, 2019
Date of Patent:
July 21, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Jieming Yin, Yasuko Eckert, Matthew R. Poremba, Steven E. Raasch, Doug Hunt
Abstract: A compute unit configured to execute multiple threads in parallel is presented. The compute unit includes one or more single instruction multiple data (SIMD) units and a fetch and decode logic. The SIMD units have differing numbers of arithmetic logic units (ALUs), such that each SIMD unit can execute a different number of threads. The fetch and decode logic is in communication with each of the SIMD units, and is configured to assign the threads to the SIMD units for execution based on such differing numbers of ALUs.
Type:
Grant
Filed:
September 18, 2014
Date of Patent:
July 14, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Joseph L. Greathouse, Mitesh R. Meswani, Sooraj Puthoor, Dmitri Yudanov, James M. O'Connor
Abstract: A processor includes two or more branch target buffer (BTB) tables for branch prediction, each BTB table storing entries of a different target size or width or storing entries of a different branch type. Each BTB entry includes at least a tag and a target address. For certain branch types that only require a few target address bits, the respective BTB tables are narrower thereby allowing for more BTB entries in the processor separated into respective BTB tables by branch instruction type. An increased number of available BTB entries are stored in a same or a less space in the processor thereby increasing a speed of instruction processing. BTB tables can be defined that do not store any target address and rely on a decode unit to provide it. High value BTB entries have dedicated storage and are therefore less likely to be evicted than low value BTB entries.
Abstract: A processor partitions a coherency directory into different regions for different processor cores and manages the number of entries allocated to each region based at least in part on monitored recall costs indicating expected resource costs for reallocating entries. Examples of monitored recall costs include a number of cache evictions associated with entry reallocation, the hit rate of each region of the coherency directory, and the like, or a combination thereof. By managing the entries allocated to each region based on the monitored recall costs, the processor ensures that processor cores associated with denser memory access patterns (that is, memory access patterns that more frequently access cache lines associated with the same memory pages) are assigned more entries of the coherency directory.
Type:
Grant
Filed:
August 22, 2018
Date of Patent:
July 7, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Michael W. Boyer, Gabriel H. Loh, Yasuko Eckert, William L. Walker
Abstract: An electronic device for decompressing compressed data to recreate original data includes a first string copy engine and a second string copy engine. The first string copy engine processes a first string copy command by acquiring a first string from recreated original data and appending the first string to the recreated original data. The second string copy engine processes a second string copy command by checking the second string copy command for a dependency on the first string and, when the dependency is found, stalling further processing of the second string copy command until the first string copy engine has appended a corresponding portion of the first string to the recreated original data. The second string copy engine processes the second string copy command by acquiring a second string from the recreated original data and appending the second string to the recreated original data.
Abstract: Systems, methods, and devices for generating an image frame for display to a user. Brain activity sensor data correlated with movement of a user is received. A predicted field of view of the user is determined based on the brain activity sensor data. An image frame is generated based on the predicted field of view. The image frame is transmitted to a display for display to a user. Some implementations provide for receiving and displaying a foveated image frame based on a predicted field of view of a user. Brain activity information of a user is captured. The brain activity information is communicated to a transceiver. The brain activity information is transmitted to a rendering device using the transceiver to generate a foveated image frame based on a predicted field of view of the user. The foveated image frame is received from the rendering device, decoded, and displayed to the user.
Type:
Grant
Filed:
October 31, 2018
Date of Patent:
July 7, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Michael L. Schmit, Nathaniel David Naegle
Abstract: A system includes a multi-core processor that includes a scheduler. The multi-core processor communicates with a system memory and an operating system. The multi-core processor executes a first process and a second process. The system uses the scheduler to control a use of a memory bandwidth by the second process until a current use in a control cycle by the first process meets a first setpoint of use for the first process when the first setpoint is at or below a latency sensitive (LS) floor or a current use in the control cycle by the first process exceeds the LS floor when the first setpoint exceeds the LS floor.
Abstract: A heterogeneous processor system includes a first processor implementing an instruction set architecture (ISA) including a set of ISA features and configured to support a first subset of the set of ISA features. The heterogeneous processor system also includes a second processor implementing the ISA including the set of ISA features and configured to support a second subset of the set of ISA features, wherein the first subset and the second subset of the set of ISA features are different from each other. When the first subset includes an entirety of the set of ISA features, the lower-feature second processor is configured to execute an instruction thread by consuming less power and with lower performance than the first processor.
Abstract: An asynchronous pipeline includes a first stage and one or more second stages. A controller provides control signals to the first stage to indicate a modification to an operating speed of the first stage. The modification is determined based on a comparison of a completion status of the first stage to one or more completion statuses of the one or more second stages. In some cases, the controller provides control signals indicating modifications to an operating voltage applied to the first stage and a drive strength of a buffer in the first stage. Modules can be used to determine the completion statuses of the first stage and the one or more second stages based on the monitored output signals generated by the stages, output signals from replica critical paths associated with the stages, or a lookup table that indicates estimated completion times.
Type:
Grant
Filed:
July 21, 2016
Date of Patent:
June 30, 2020
Assignee:
ADVANCED MICRO DEVICES, INC.
Inventors:
Greg Sadowski, John Kalamatianos, Shomit N. Das
Abstract: Described herein are a method and apparatus for memory vulnerability prediction. A memory vulnerability predictor predicts the reliability of a memory region when it is first accessed, based on past program history. The memory vulnerability predictor uses a table to store reliability predictions and predicts reliability needs of a new memory region. A memory management module uses the reliability information to make decisions, (such as to guide memory placement policies in a heterogeneous memory system).
Abstract: A method of message-based communication is provided which includes executing, on one or more accelerated processing units, a plurality of groups of work items, receiving a first message from a first group of work items of the plurality of groups of work items executing on the one or more accelerated processing units and storing the first message at a first segment of memory allocated to a second group of work items of the plurality of groups of work items executing on the accelerated processing unit.
Abstract: The described embodiments include an input-output memory management unit (IOMMU) with two or more memory elements and a controller. The controller is configured to select, based on one or more factors, one or more selected memory elements from among the two or more memory elements for performing virtual address to physical address translations in the IOMMU. The controller then performs the virtual address to physical address translations using the one or more selected memory elements.