Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Patent number: 10795677
    Abstract: Embodiments of systems, apparatuses, and methods for multiplication, negation, and accumulation of data values in a processor are described. For example, execution circuitry executes a decoded instruction to multiply selected data values from a plurality of packed data element positions in first and second packed data source operands to generate a plurality of first result values, sum the plurality of first result values to generate one or more second result values, negate the one or more second result values to generate one or more third result values, accumulate the one or more third result values with one or more data values from the destination operand to generate one or more fourth result values, and store the one or more fourth result values in one or more packed data element positions in the destination operand.
    Type: Grant
    Filed: September 29, 2017
    Date of Patent: October 6, 2020
    Assignee: Intel Corporation
    Inventors: Venkateswara R. Madduri, Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Jesus Corbal, Mark Charney
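The multiply/sum/negate/accumulate dataflow described in the abstract above can be sketched as a toy model. This is plain Python standing in for the execution circuitry, not the patented hardware; the function name is illustrative.

```python
def multiply_negate_accumulate(src1, src2, dest):
    # First result values: per-element products of the two packed sources.
    products = [a * b for a, b in zip(src1, src2)]
    # Second result value: sum of the products.
    summed = sum(products)
    # Third result value: negation of the sum.
    negated = -summed
    # Fourth result value: accumulate with the destination element.
    return dest + negated

# [1, 2] x [3, 4] -> products [3, 8], sum 11, negated -11, plus dest 20 -> 9
assert multiply_negate_accumulate([1, 2], [3, 4], 20) == 9
```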
  • Patent number: 10691464
    Abstract: Systems and methods for virtually partitioning an integrated circuit may include identifying dimensional attributes of a target input dataset and selecting a data partitioning scheme from a plurality of distinct data partitioning schemes for the target input dataset based on the dimensional attributes of the target dataset and architectural attributes of an integrated circuit. The method may include disintegrating the target dataset into a plurality of distinct subsets of data based on the selected data partitioning scheme and identifying a virtual processing core partitioning scheme from a plurality of distinct processing core partitioning schemes for an architecture of the integrated circuit based on the disintegration of the target input dataset.
    Type: Grant
    Filed: January 21, 2020
    Date of Patent: June 23, 2020
    Assignee: quadric.io
    Inventors: Nigel Drego, Aman Sikka, Mrinalini Ravichandran, Robert Daniel Firu, Veerbhan Kheterpal
  • Patent number: 10678662
    Abstract: A computing system includes: a data block including data; a storage engine, coupled to the data block, configured to process the data, as hard information or soft information, through channels including a failed channel and a remaining channel, calculate an aggregated output from a hard decision from the remaining channel, calculate a selected magnitude from a magnitude from the remaining channel with an error detected, calculate extrinsic soft information based on the aggregated output and the selected magnitude, and decode the failed channel with a scaled soft metric based on the extrinsic soft information.
    Type: Grant
    Filed: March 25, 2016
    Date of Patent: June 9, 2020
    Assignee: CNEX LABS, Inc.
    Inventor: Xiaojie Zhang
  • Patent number: 10628155
    Abstract: First and second forms of a complex multiply instruction are provided for operating on first and second operand vectors comprising multiple data elements including at least one real data element for representing the real part of a complex number and at least one imaginary element for representing an imaginary part of the complex number. One of the first and second forms of the instruction targets at least one real element of the destination vector and the other targets at least one imaginary element. By executing one of each instruction, complex multiplications of the form (a+ib)*(c+id) can be calculated using relatively few instructions and with only two vector register read ports, enabling DSP algorithms such as FFTs to be calculated more efficiently using relatively low power hardware implementations.
    Type: Grant
    Filed: February 22, 2017
    Date of Patent: April 21, 2020
    Assignee: ARM Limited
    Inventor: Thomas Christopher Grocutt
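Reading the two instruction forms as one writing the real element of the destination and the other writing the imaginary element, the split can be modelled roughly as follows. This is a simplified sketch of the arithmetic only; the real instruction pair accumulates partial products differently in hardware, and the function names are illustrative.

```python
def cmul_real_part(a, b, c, d):
    # "First form": produces the real element of (a+ib)*(c+id).
    return a * c - b * d

def cmul_imag_part(a, b, c, d):
    # "Second form": produces the imaginary element of (a+ib)*(c+id).
    return a * d + b * c

# (1+2i)*(3+4i) = -5 + 10i
assert (cmul_real_part(1, 2, 3, 4), cmul_imag_part(1, 2, 3, 4)) == (-5, 10)
```

Executing one instruction of each form thus completes a full complex multiply while each instruction reads only two vector register ports, which is the efficiency claim made in the abstract.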
  • Patent number: 10623015
    Abstract: An apparatus and method are described for performing vector compression. For example, one embodiment of a processor comprises: vector compression logic to compress a source vector comprising a plurality of valid data elements and invalid data elements to generate a destination vector in which valid data elements are stored contiguously on one side of the destination vector, the vector compression logic to utilize a bit mask associated with the source vector and comprising a plurality of bits, each bit corresponding to one of the plurality of data elements of the source vector and indicating whether the data element comprises a valid data element or an invalid data element, the vector compression logic to utilize indices of the bit mask and associated bit values of the bit mask to generate a control vector; and shuffle logic to shuffle/permute the data elements of the source vector to the destination vector in accordance with the control vector.
    Type: Grant
    Filed: March 15, 2018
    Date of Patent: April 14, 2020
    Assignee: Intel Corporation
    Inventors: Simon Rubanovich, David M. Russinoff, Amit Gradstein, John W. O'Leary, Zeev Sperber
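The mask-driven compression the abstract describes can be illustrated with a short sketch. This is a behavioral toy model, not the shuffle/permute hardware; zero-filling the leftover slots is an assumption made here purely for illustration.

```python
def vector_compress(src, mask):
    # Elements whose mask bit is 1 are valid; pack them contiguously
    # on one side of the destination vector.
    valid = [x for x, m in zip(src, mask) if m]
    # Remaining slots are zero-filled in this toy model.
    return valid + [0] * (len(src) - len(valid))

assert vector_compress([7, 0, 5, 0, 9, 0, 0, 2],
                       [1, 0, 1, 0, 1, 0, 0, 1]) == [7, 5, 9, 2, 0, 0, 0, 0]
```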
  • Patent number: 10613863
    Abstract: Techniques and mechanisms described herein include a signal processor implemented as an overlay on a field-programmable gate array (FPGA) device that utilizes special purpose, hardened intellectual property (IP) modules such as memory blocks and digital signal processing (DSP) cores. A Processing Element (PE) is built from one or more DSP cores connected to additional logic. Interconnected as an array, the PEs may operate in a computational model such as Single Instruction-Multiple Thread (SIMT). A software hierarchy is described that transforms the SIMT array into an effective signal processor.
    Type: Grant
    Filed: July 3, 2019
    Date of Patent: April 7, 2020
    Assignee: Nextera Video, Inc.
    Inventors: John E. Deame, Steven Kaufmann, Liviu Voicu
  • Patent number: 10592466
    Abstract: A GPU architecture employs a crossbar switch to preferentially store operand vectors in a compressed form, reducing the number of memory circuits that must be activated during an operand fetch and allowing existing execution units to be used for scalar execution. Scalar execution can be performed during branch divergence.
    Type: Grant
    Filed: May 12, 2016
    Date of Patent: March 17, 2020
    Assignee: Wisconsin Alumni Research Foundation
    Inventors: Nam Sung Kim, Zhenhong Liu
  • Patent number: 10514916
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: June 5, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
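The defining property of the in-lane shuffle above is that every destination element is selected only from the corresponding lane of the source, never across lanes. A toy model of the single-source form (lane width and control layout are assumptions of this sketch):

```python
def inlane_shuffle(src, control, lane_width=4):
    # Each destination lane is filled by indexing within the matching
    # source lane; `control` holds one per-lane selector per element.
    dst = []
    for lane in range(0, len(src), lane_width):
        for sel in control[lane:lane + lane_width]:
            dst.append(src[lane + sel])  # sel never escapes its lane
    return dst

# Two 4-element lanes, each independently reversed by its control bits.
assert inlane_shuffle([0, 1, 2, 3, 10, 11, 12, 13],
                      [3, 2, 1, 0, 3, 2, 1, 0]) == [3, 2, 1, 0, 13, 12, 11, 10]
```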
  • Patent number: 10514917
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: November 2, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10514918
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 21, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10509652
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 21, 2017
    Date of Patent: December 17, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10497440
    Abstract: A crossbar array, comprises a plurality of row lines, a plurality of column lines intersecting the plurality of row lines at a plurality of intersections, and a plurality of junctions coupled between the plurality of row lines and the plurality of column lines at a portion of the plurality of intersections. Each junction comprises a resistive memory element, and the junctions are positioned to calculate a matrix multiplication of a first matrix and a second matrix.
    Type: Grant
    Filed: August 7, 2015
    Date of Patent: December 3, 2019
    Assignee: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP
    Inventors: Miao Hu, John Paul Strachan, Zhiyong Li, R. Stanley Williams
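The way a resistive crossbar computes a matrix multiplication can be modelled numerically: each junction conductance encodes one matrix element, each row voltage encodes one input value, and each column line sums the resulting currents (Ohm's law per junction, Kirchhoff's current law per column). The sketch below is the idealized arithmetic only, ignoring device non-idealities.

```python
def crossbar_column_currents(G, V):
    # G[i][j]: conductance of the junction between row i and column j.
    # V[i]: voltage driven onto row line i.
    # Column j collects current I_j = sum_i V[i] * G[i][j],
    # i.e. one element of the vector-matrix product V @ G.
    cols = len(G[0])
    return [sum(V[i] * G[i][j] for i in range(len(G))) for j in range(cols)]

assert crossbar_column_currents([[1, 2], [3, 4]], [1, 1]) == [4, 6]
```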
  • Patent number: 10489155
    Abstract: Systems and methods relate to a mixed-width single instruction multiple data (SIMD) instruction which has at least a source vector operand comprising data elements of a first bit-width and a destination vector operand comprising data elements of a second bit-width, wherein the second bit-width is either half of or twice the first bit-width. Correspondingly, one of the source or destination vector operands is expressed as a pair of registers, a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to odd-numbered data elements of the other vector operand expressed as a single register.
    Type: Grant
    Filed: July 21, 2015
    Date of Patent: November 26, 2019
    Assignee: QUALCOMM Incorporated
    Inventors: Eric Wayne Mahurin, Ajay Anant Ingle
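The even/odd register-pair mapping can be illustrated with a toy model. Bit-widths are ignored here; only the element routing between the single register and the register pair is shown, and the function name is illustrative.

```python
def split_to_register_pair(vec):
    # Even-indexed elements of the single-register operand map to the
    # first register of the pair; odd-indexed elements map to the second.
    return vec[0::2], vec[1::2]

first, second = split_to_register_pair([10, 11, 12, 13, 14, 15])
assert first == [10, 12, 14]
assert second == [11, 13, 15]
```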
  • Patent number: 10437769
    Abstract: A method of transition-minimized low-speed data transfer is described herein. In an embodiment, a data rate of a set of data to be transmitted on a data bus is determined. A one-hot value is encoded on the data bus in response to a low data rate. An XOR operation is performed with a previous state of the data bus and the encoded one-hot value. Additionally, a resulting value of the XOR operation is driven onto the data bus.
    Type: Grant
    Filed: December 26, 2013
    Date of Patent: October 8, 2019
    Assignee: Intel Corporation
    Inventor: Daniel Greenspan
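The point of the XOR-with-one-hot step above is that exactly one wire toggles per transfer. A minimal sketch, assuming the low-data-rate decision has already been made and ignoring bus width limits:

```python
def next_bus_state(prev_state, value):
    # Encode the value one-hot, then XOR with the previous bus state;
    # the result is what gets driven onto the bus.
    return prev_state ^ (1 << value)

state = 0
state = next_bus_state(state, 3)   # bit 3 toggles
assert state == 0b1000
state = next_bus_state(state, 1)   # bit 1 toggles
assert state == 0b1010
# Exactly one transition between consecutive bus states:
assert bin(0b1000 ^ 0b1010).count("1") == 1
```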
  • Patent number: 10419501
    Abstract: A data streaming unit (DSU) and a method for operating a DSU are disclosed. In an embodiment the DSU includes a memory interface configured to be connected to a storage unit, a compute engine interface configured to be connected to a compute engine (CE) and an address generator configured to manage address data representing address locations in the storage unit. The data streaming unit further includes a data organization unit configured to access data in the storage unit and to reorganize the data to be forwarded to the compute engine, wherein the memory interface is communicatively connected to the address generator and the data organization unit, wherein the address generator is communicatively connected to the data organization unit, and wherein the data organization unit is communicatively connected to the compute engine interface.
    Type: Grant
    Filed: December 3, 2015
    Date of Patent: September 17, 2019
    Assignee: Futurewei Technologies, Inc.
    Inventors: Ashish Rai Shrivastava, Alan Gatherer, Sushma Wokhlu
  • Patent number: 10324515
    Abstract: Approaches are provided for a predictive electrical appliance power-saving management mode. An approach includes ascertaining a location and pace of a mobile device. The approach further includes calculating an amount of time that it will take to enable or start programs and services upon a computing device waking from a sleep mode or hybrid sleep mode. The approach further includes determining a distance threshold to the computing device that allows for the calculated amount of time to pass such that the programs and services are enabled or started prior to a user of the mobile device arriving at the computing device when the user is returning to the computing device at the ascertained pace. The approach further includes sending a signal to awaken the computing device from the sleep mode or hybrid sleep mode when the mobile device is within the distance threshold.
    Type: Grant
    Filed: May 8, 2017
    Date of Patent: June 18, 2019
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: James E. Bostick, John M. Ganci, Jr., Sarbajit K. Rakshit, Kimberly G. Starks
  • Patent number: 10311539
    Abstract: A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
    Type: Grant
    Filed: November 2, 2016
    Date of Patent: June 4, 2019
    Assignee: Imagination Technologies Limited
    Inventors: John Howson, Jonathan Redshaw, Yoong Chert Foo
  • Patent number: 10291514
    Abstract: Aspects of this disclosure provide techniques for dynamically configuring flow splitting via software defined network (SDN) signaling instructions. An SDN controller may instruct an ingress network node to split a traffic flow between two or more egress paths, and instruct the ingress network node, and perhaps downstream network nodes, to transport portions of the traffic flow in accordance with a forwarding protocol. In one example, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a link-based forwarding protocol. In other examples, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a path-based or source-based transport protocol.
    Type: Grant
    Filed: August 28, 2017
    Date of Patent: May 14, 2019
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Xu Li, Hang Zhang
  • Patent number: 10241802
    Abstract: A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks.
    Type: Grant
    Filed: November 20, 2015
    Date of Patent: March 26, 2019
    Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
    Inventors: John Lancaster, Martin Whitaker
  • Patent number: 10235732
    Abstract: A method and system are described herein for an optimization technique addressing two aspects of thread scheduling and dispatch when the driver is allowed to pick the scheduling attributes. The present techniques rely on an enhanced GPGPU Walker hardware command and one-dimensional local identification generation to maximize thread residency.
    Type: Grant
    Filed: December 27, 2013
    Date of Patent: March 19, 2019
    Assignee: INTEL CORPORATION
    Inventors: Jayanth N. Rao, Michal Mrozek
  • Patent number: 10229468
    Abstract: Systems, apparatuses and methods may provide for receiving a general purpose graphics processing unit (GPGPU) workload and converting the GPGPU workload to a three-dimensional (3D) workload. Additionally, the 3D workload may be dispatched to a 3D pipeline. In one example, converting the GPGPU workload to the 3D workload includes identifying a plurality of thread groups in the GPGPU workload and mapping the plurality of thread groups to a 3D matrix of cubes.
    Type: Grant
    Filed: June 3, 2015
    Date of Patent: March 12, 2019
    Assignee: Intel Corporation
    Inventors: Robert B. Taylor, Abhishek Venkatesh
  • Patent number: 10228972
    Abstract: In some embodiments, the present invention provides an exemplary computing device, including at least: a scheduler processor; a CPU; a GPU; where the scheduler processor configured to: obtain a computing task; divide the computing task into: a first set of subtasks and a second set of subtasks; submit the first set to the CPU; submit the second set to the GPU; determine, for a first subtask of the first set, a first execution time, a first execution speed, or both; determine, for a second subtask of the second set, a second execution time, a second execution speed, or both; dynamically rebalance an allocation of remaining non-executed subtasks of the computing task to be submitted to the CPU and the GPU, based, at least in part, on at least one of: a first comparison of the first execution time to the second execution time, and a second comparison of the first execution speed to the second execution speed.
    Type: Grant
    Filed: June 21, 2018
    Date of Patent: March 12, 2019
    Assignee: Banuba Limited
    Inventor: Yury Hushchyn
  • Patent number: 10223334
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: March 5, 2019
    Assignee: NOVUMIND LIMITED
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
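The sum-of-outer-products decomposition that the contraction engine relies on can be checked with a short sketch: C = A @ B equals the sum over k of (column k of A) outer (row k of B). Plain Python stands in for the hardware outer product units here.

```python
def matmul_by_outer_products(A, B):
    # Accumulate one outer product per shared index p; each outer
    # product needs only column p of A and row p of B, which is what
    # localizes the tensor elements each partial computation touches.
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p in range(k):
        for i in range(m):
            for j in range(n):
                C[i][j] += A[i][p] * B[p][j]
    return C

assert matmul_by_outer_products([[1, 2], [3, 4]],
                                [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```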
  • Patent number: 10216704
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: February 26, 2019
    Assignee: NOVUMIND LIMITED
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
  • Patent number: 10218635
    Abstract: A network interface controller (NC) that can provide a connection for a device to a network. The NC can include a sideband port controller. The sideband port controller can provide a sideband connection between the network and a sideband endpoint circuit that can communicate information with the network via the sideband. The sideband port controller can include a receive data route that has an input for receiving packets of data from the network and an output for passing the packets of data received from the network to the sideband endpoint circuit. The receive data route may include a buffer to receive the packets of data from the network and to pass the packets of data received from the network to the sideband endpoint.
    Type: Grant
    Filed: September 18, 2015
    Date of Patent: February 26, 2019
    Assignee: International Business Machines Corporation
    Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
  • Patent number: 10218634
    Abstract: A network interface controller for providing a connection for a device to a network. The network interface controller may include a sideband port controller. The sideband port controller may provide a sideband connection between the network and a sideband endpoint circuit that is operative to communicate information with the network via the sideband. The sideband port controller may include a transmit data route having an input for receiving packets from the sideband endpoint circuit and an output for passing packets received from the sideband endpoint to the network. A packet parser is connected to the transmit data route. The packet parser is operative to read data from packets received from the sideband endpoint and is further operative to analyze the data.
    Type: Grant
    Filed: September 18, 2015
    Date of Patent: February 26, 2019
    Assignee: International Business Machines Corporation
    Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
  • Patent number: 10186069
    Abstract: A graphics processing system groups plural initial pilot shader programs into a set of initial pilot shader programs and associates the set of initial pilot shader programs with a set of indexes. The initial pilot shader programs each contain constant program expressions to be executed on behalf of an original shader program. The index for an initial pilot shader program is then used to obtain the instructions contained in the initial pilot shader program for executing the constant program expressions of the initial pilot shader program. The threads for executing a subset of the initial pilot shader programs are also grouped into a thread group and the threads of the thread group are executed in parallel. The graphics processing system provides for efficient preparation and execution of plural initial pilot shader programs.
    Type: Grant
    Filed: February 15, 2017
    Date of Patent: January 22, 2019
    Assignee: Arm Limited
    Inventors: Alexander Galazin, Jörg Wagner, Andreas Due Engh-Halstvedt
  • Patent number: 10152329
    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced.
    Type: Grant
    Filed: February 9, 2012
    Date of Patent: December 11, 2018
    Assignee: NVIDIA CORPORATION
    Inventors: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
  • Patent number: 10073816
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: September 11, 2018
    Assignee: NovuMind Limited
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
  • Patent number: 10061591
    Abstract: A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more thread clusters is suppressed. The particular one thread in each of the one or more thread clusters is executed. Results determined from execution of the particular one thread in each of the one or more thread clusters are broadcasted to other threads in each of the one or more thread clusters.
    Type: Grant
    Filed: February 26, 2015
    Date of Patent: August 28, 2018
    Assignee: Samsung Electronics Company, Ltd.
    Inventors: Boris Beylin, John Brothers, Santosh Abraham, Lingjie Xu, Maxim Lukyanov, Alex Grosul
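The cluster-suppress-broadcast scheme described above can be sketched in a few lines: threads whose inputs match are grouped, one representative runs, and its result is broadcast to the rest. The computation used below is a placeholder assumption, not the patent's detection mechanism.

```python
def dedupe_redundant_threads(thread_inputs):
    # Group thread ids by identical input; threads in a cluster would
    # compute the same result, so only one representative executes.
    clusters = {}
    for tid, inp in enumerate(thread_inputs):
        clusters.setdefault(inp, []).append(tid)
    results = {}
    for inp, tids in clusters.items():
        value = inp * inp  # stand-in for the shared per-thread work
        for tid in tids:   # broadcast to the suppressed threads
            results[tid] = value
    return results

# Four threads, two distinct inputs -> only two executions needed.
assert dedupe_redundant_threads([2, 3, 2, 3]) == {0: 4, 2: 4, 1: 9, 3: 9}
```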
  • Patent number: 10042641
    Abstract: An asynchronous processing system comprising an asynchronous scalar processor and an asynchronous vector processor coupled to the scalar processor. The asynchronous scalar processor is configured to perform processing functions on input data and to output instructions. The asynchronous vector processor is configured to perform processing functions in response to a very long instruction word (VLIW) received from the scalar processor. The VLIW comprises a first portion and a second portion, at least the first portion comprising a vector instruction.
    Type: Grant
    Filed: September 8, 2014
    Date of Patent: August 7, 2018
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Qifan Zhang, Wuxian Shi, Yiqun Ge, Tao Huang, Wen Tong
  • Patent number: 10019264
    Abstract: Methods and apparatuses relating to processors that contextually optimize instructions at runtime are disclosed. In one embodiment, a processor includes a fetch circuit to fetch an instruction from an instruction storage, a format of the instruction including an opcode, a first source operand identifier, and a second source operand identifier; wherein the instruction storage includes a sequence of sub-optimal instructions preceded by a start-of-sequence instruction and followed by an end-of-sequence instruction.
    Type: Grant
    Filed: February 24, 2016
    Date of Patent: July 10, 2018
    Assignee: Intel Corporation
    Inventors: Taylor W. Kidd, Matt S. Walsh
  • Patent number: 10019410
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: April 6, 2017
    Date of Patent: July 10, 2018
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 10013652
    Abstract: Deep Neural Networks (DNNs) with many hidden layers and many units per layer are very flexible models with a very large number of parameters. As such, DNNs are challenging to optimize. To achieve real-time computation, embodiments disclosed herein enable fast DNN feature transformation via optimized memory bandwidth utilization. To optimize memory bandwidth utilization, a rate of accessing memory may be reduced based on a batch setting. A memory, corresponding to a selected given output neuron of a current layer of the DNN, may be updated with an incremental output value computed for the selected given output neuron as a function of input values of a selected few non-zero input neurons of a previous layer of the DNN in combination with weights between the selected few non-zero input neurons and the selected given output neuron, wherein a number of the selected few corresponds to the batch setting.
    Type: Grant
    Filed: April 29, 2015
    Date of Patent: July 3, 2018
    Assignee: Nuance Communications, Inc.
    Inventors: Jan Vlietinck, Stephan Kanthak, Rudi Vuerinckx, Christophe Ris
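The batched sparse update described above — accumulating into an output neuron from only a few non-zero input neurons, with the count set by a batch parameter — can be sketched in plain Python. This is a conceptual illustration under assumed names (`sparse_layer_update`, row-major `weights`), not the optimized memory-bandwidth implementation of the patent:

```python
def sparse_layer_update(outputs, inputs, weights, batch):
    """Accumulate an incremental value into each output neuron using only
    the first `batch` non-zero input neurons of the previous layer,
    reducing the number of memory accesses per pass."""
    nz = [i for i, v in enumerate(inputs) if v != 0][:batch]  # selected few
    for j in range(len(outputs)):
        outputs[j] += sum(inputs[i] * weights[i][j] for i in nz)
    return outputs
```

Repeated passes over successive batches of non-zero inputs would complete the full layer transformation.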
  • Patent number: 9979904
    Abstract: The present invention relates to reading out sensor array pixels. In particular, the present invention provides an approach according to which only a region of interest may be read out from the sensor array, thus leading to substantial time savings. To achieve this, circuitry for configuring a region of interest for the sensor array is provided, as well as read-out circuitry for reading out pixels belonging to the region of interest. In addition, the corresponding methods for programming the region of interest and for reading out the region of interest are provided. The circuitry for programming and/or reading out the region of interest includes per-pixel storage elements for storing an indication of whether a pixel belongs to a region of interest (ROI). These are configured by the programming circuitry and used, when reading out the ROI, to read out only the pixels of the ROI.
    Type: Grant
    Filed: January 24, 2014
    Date of Patent: May 22, 2018
    Assignee: INNOVACIONES MICROELECTRÓNICAS S.L. (ANAFOCUS)
    Inventors: Rafael Dominguez Castro, Sergio Morillas Castillo, Rafael Romay Juárez, Fernando Medeiro Hidalgo
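The per-pixel ROI flags described above can be mimicked in a few lines: one step programs a flag per pixel, the other reads out only flagged pixels. A minimal sketch with hypothetical names (`program_roi`, `read_roi`), not the actual sensor circuitry:

```python
def program_roi(height, width, roi_coords):
    """Set a per-pixel storage element (flag) for every coordinate
    belonging to the region of interest."""
    flags = [[False] * width for _ in range(height)]
    for (r, c) in roi_coords:
        flags[r][c] = True
    return flags

def read_roi(frame, flags):
    """Read out only the pixels whose ROI flag is set, skipping the rest."""
    return [frame[r][c]
            for r in range(len(flags))
            for c in range(len(flags[0]))
            if flags[r][c]]
```

The time saving in hardware comes from never clocking out the unflagged pixels; the sketch only models the selection.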
  • Patent number: 9891912
    Abstract: An array processor includes a managing element having a load streaming unit coupled to multiple processing elements. The load streaming unit provides input data portions to each of a first subset of processing elements and receives output data from each of a second subset of the processing elements based on a comparatively sorted combination of the input data portions. Each processing element is configurable by the managing element to compare input data portions received from the load streaming unit or two or more of the other processing elements. Each processing element can further select an input data portion to be output data based on the comparison, and in response to selecting the input data portion, remove a queue entry corresponding to the selected input data portion. Each processing element can provide the selected output data portion to the managing element or as an input to one of the processing elements.
    Type: Grant
    Filed: October 31, 2014
    Date of Patent: February 13, 2018
    Assignee: International Business Machines Corporation
    Inventors: Ganesh Balakrishnan, Bartholomew Blaner, John J. Reilly, Jeffrey A. Stuecheli
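The compare-select-dequeue behavior described above amounts to a multi-way merge: each element compares queue heads, the selected entry is removed from its queue and forwarded. A minimal software sketch of that pattern (illustrative only; the patent describes hardware processing elements):

```python
from collections import deque

def merge_sorted_streams(streams):
    """Merge sorted input streams: at each step, compare the head of
    every non-empty queue, select the smallest as output, and remove
    the corresponding queue entry."""
    queues = [deque(s) for s in streams]
    merged = []
    while any(queues):
        i = min((k for k, q in enumerate(queues) if q),
                key=lambda k: queues[k][0])
        merged.append(queues[i].popleft())
    return merged
```

In the array processor, the comparisons are distributed across processing elements arranged in a tree rather than done by one loop.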
  • Patent number: 9841957
    Abstract: An apparatus stores a program including a description of loop processing of iterating a plurality of instructions, and rearranges an execution sequence of the plurality of instructions in the program such that the loop processing is pipelined by software pipelining. The apparatus inserts an instruction to use a register for single instruction multiple data (SIMD) extension instruction, into the description of the loop processing in the program.
    Type: Grant
    Filed: April 19, 2016
    Date of Patent: December 12, 2017
    Assignee: FUJITSU LIMITED
    Inventor: Shun Kamatsuka
  • Patent number: 9830164
    Abstract: A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergent point within an identified loop, a compiler arranges instructions within the identified loop into very large instruction words (VLIW's). At least one VLIW includes instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The compiler generates code wherein when executed assigns at runtime instructions within a given VLIW to multiple parallel execution lanes within a target processor. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on branch direction found at runtime for the given lane at the given divergent point. The target processor includes a vector register for storing indications indicating which given instruction within a fetched VLIW for an associated lane to execute.
    Type: Grant
    Filed: January 29, 2013
    Date of Patent: November 28, 2017
    Assignee: Advanced Micro Devices, Inc.
    Inventor: Reza Yazdani
  • Patent number: 9832478
    Abstract: One exemplary video encoding method has the following steps: determining a size of a parallel motion estimation region according to encoding related information; and encoding a plurality of pixels by at least performing motion estimation based on the size of the parallel motion estimation region. One exemplary video decoding method has the following steps: decoding a video parameter stream to obtain a decoded size of a parallel motion estimation region; checking validity of the decoded size of the parallel motion estimation region, and accordingly generating a checking result; when the checking result indicates that the decoded size of the parallel motion estimation region is invalid, entering an error handling process to decide a size of the parallel motion estimation; and decoding a plurality of pixels by at least performing motion estimation based on the decided size of the parallel motion estimation region.
    Type: Grant
    Filed: May 6, 2014
    Date of Patent: November 28, 2017
    Assignee: MEDIATEK INC.
    Inventors: Tung-Hsing Wu, Kun-Bin Lee
  • Patent number: 9804826
    Abstract: System and method for pseudo-random number generation based on a recursion with significantly increased multithreaded parallelism. A single pseudo-random generator program is assigned with multiple threads to process in parallel. N state elements indexed incrementally are arranged into a matrix comprising x rows, where a respective adjacent pair of state elements in a same column are related by g=(M+j)mod N, wherein j and g represent indexes of the pair of state elements. x can be determined through a modular multiplicative inverse of M and N. The matrix can be divided into sections with each section having a number of columns, and each thread is assigned with a section. In this manner, the majority of the requisite interactions among the state elements occur without expensive inter-thread communications, and further each thread may only need to communicate with a single other thread for a small number of times.
    Type: Grant
    Filed: December 5, 2014
    Date of Patent: October 31, 2017
    Assignee: Nvidia Corporation
    Inventors: Przemyslaw Tredak, John Clifton Woolley, Jr.
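The column relation above, g = (M + j) mod N between vertically adjacent state elements, can be sketched directly. This builds illustrative columns from chosen top indices; it is a toy demonstration of the index layout, not the patented generator (the choice of tops and of x via the modular inverse is omitted):

```python
def build_columns(N, M, x, tops):
    """Arrange state-element indices into columns of height x, where each
    entry g is derived from the entry j directly above it by
    g = (M + j) mod N."""
    cols = []
    for top in tops:
        col = [top]
        for _ in range(x - 1):
            col.append((M + col[-1]) % N)
        cols.append(col)
    return cols
```

Assigning a contiguous group of such columns to each thread keeps most state interactions thread-local, which is the point of the partitioning.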
  • Patent number: 9804848
    Abstract: Method, apparatus, and program for performing a string comparison operation. The apparatus includes execution resources to execute a first instruction. In response to the first instruction, the execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
    Type: Grant
    Filed: December 5, 2014
    Date of Patent: October 31, 2017
    Assignee: Intel Corporation
    Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
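The per-element comparison result described above can be modeled as an equality mask over two packed operands. A scalar Python sketch of the behavior (the real instruction operates on packed registers in one step; the function name is illustrative):

```python
def packed_compare_equal(op1, op2):
    """Compare each data element of two equal-length operands (here,
    characters of two text strings) and return one result per element."""
    assert len(op1) == len(op2), "operands must have the same element count"
    return [a == b for a, b in zip(op1, op2)]
```

String-search and parsing kernels typically reduce such a mask to a match position with a subsequent scan.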
  • Patent number: 9772864
    Abstract: When an OpenCL kernel is to be executed, a bitfield index representation to be used for the indices of the kernel invocations is determined based on the number of bits needed to represent the maximum value that will be needed for each index dimension for the kernel. A bitfield placement data structure 33 describing how the bitfield index representation is partitioned is then prepared together with a maximum value data structure 32 indicating the maximum index dimension values to be used for the kernel. A processor then executes the kernel invocations 36 across the index space indicated by the maximum value data structure 32. A bitfield index representation 35, 37, 38 configured in accordance with the bitfield placement data structure 33 is associated with each kernel invocation to indicate its index.
    Type: Grant
    Filed: April 16, 2013
    Date of Patent: September 26, 2017
    Assignee: ARM LIMITED
    Inventor: Jorn Nystad
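The bitfield index representation above — sizing each dimension's field by the bits its maximum value needs, then packing all dimensions into one word — can be sketched as follows. Names (`widths_for`, `pack`, `unpack`) are illustrative, not the actual data structures 32/33/35 of the patent:

```python
def widths_for(max_indices):
    """Bits needed per index dimension to represent its maximum value."""
    return [max(1, m.bit_length()) for m in max_indices]

def pack(idx, widths):
    """Pack a multi-dimensional index into a single bitfield word."""
    value, shift = 0, 0
    for v, w in zip(idx, widths):
        value |= v << shift
        shift += w
    return value

def unpack(value, widths):
    """Recover the per-dimension indices from the bitfield word."""
    idx = []
    for w in widths:
        idx.append(value & ((1 << w) - 1))
        value >>= w
    return tuple(idx)
```

Each kernel invocation then carries one packed word instead of one full-width counter per dimension.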
  • Patent number: 9766888
    Abstract: A processor of an aspect includes packed data registers, and a decode unit to decode an instruction. The instruction may indicate a first source packed data to include at least four data elements, indicate a second source packed data to include at least four data elements, and indicate a destination storage location. An execution unit is coupled with the packed data registers and the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination storage location. The result packed data may include at least four indexes that may identify corresponding data element positions in the first and second source packed data. The indexes may be stored in positions in the result packed data that are to represent a sorted order of corresponding data elements in the first and second source packed data.
    Type: Grant
    Filed: March 28, 2014
    Date of Patent: September 19, 2017
    Assignee: Intel Corporation
    Inventors: Shay Gueron, Vlad Krasnov
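The result described above — indexes stored in positions that represent the sorted order of the combined source elements — can be sketched in one function. A software analogue only; the instruction computes this over packed registers:

```python
def sorted_order_indexes(src1, src2):
    """Return indexes into the concatenation of both source operands,
    arranged in the order that would sort the data elements."""
    combined = list(src1) + list(src2)
    return sorted(range(len(combined)), key=lambda i: combined[i])
```

A gather using these indexes then produces the fully sorted sequence, which is why such an instruction accelerates small sorting networks.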
  • Patent number: 9760530
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: March 31, 2017
    Date of Patent: September 12, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 9760531
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: April 6, 2017
    Date of Patent: September 12, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 9727380
    Abstract: Global register protection in a multi-threaded processor is described. In an embodiment, global resources within a multi-threaded processor are protected by performing checks, before allowing a thread to write to a global resource, to determine whether the thread has write access to the particular global resource. The check involves accessing one or more local control registers or a global control field within the multi-threaded processor and in an example, a local register associated with each other thread in the multi-threaded processor is accessed and checked to see whether it contains an identifier for the particular global resource. Only if none of the accessed local registers contains such an identifier is the instruction issued and the thread allowed to write to the global resource. Otherwise, the instruction is blocked and an exception may be raised to alert the program that issued the instruction that the write failed.
    Type: Grant
    Filed: February 19, 2015
    Date of Patent: August 8, 2017
    Assignee: Imagination Technologies Limited
    Inventors: Guixin Wang, Hugh Jackson, Robert Graham Isherwood
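The write-access check described above — consult every other thread's local control register for an identifier of the target global resource, and block the write if any is found — can be sketched as a predicate. Names are illustrative; the patent performs this in hardware before instruction issue:

```python
def may_write(thread_id, resource_id, local_control):
    """Grant the write only if no *other* thread's local control register
    contains an identifier for the target global resource."""
    return all(resource_id not in regs
               for tid, regs in local_control.items()
               if tid != thread_id)
```

A blocked write would raise an exception to the issuing program rather than silently failing.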
  • Patent number: 9710172
    Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Further, communicating synchronous I/O commands includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated and issuing a fourth synchronous I/O command with a second completion bit set in response to the first mailbox command being initiated.
    Type: Grant
    Filed: June 14, 2016
    Date of Patent: July 18, 2017
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
  • Patent number: 9710171
    Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Further, communicating synchronous I/O commands includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated and issuing a fourth synchronous I/O command with a second completion bit set in response to the first mailbox command being initiated.
    Type: Grant
    Filed: October 1, 2015
    Date of Patent: July 18, 2017
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
  • Patent number: 9703966
    Abstract: A data processing system includes a single instruction multiple data register file and single instruction multiple data processing circuitry. The single instruction multiple data processing circuitry supports execution of cryptographic processing instructions for performing parts of a hash algorithm. The operands are stored within the single instruction multiple data register file. The cryptographic support instructions do not follow normal lane-based processing and generate output operands in which the different portions of the output operand depend upon multiple different elements within the input operand.
    Type: Grant
    Filed: July 7, 2015
    Date of Patent: July 11, 2017
    Assignee: ARM LIMITED
    Inventors: Matthew James Horsnell, Richard Roy Grisenthwaite, Stuart David Biles, Daniel Kershaw
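The non-lane-based behavior above — each portion of the output depending on multiple elements of the input, as hash schedules require — can be illustrated with a toy cross-lane mix. This is not any real hash step, just a demonstration of an operation that ordinary per-lane SIMD semantics cannot express:

```python
def cross_lane_mix(lanes):
    """Unlike a lane-based SIMD op, each output element here depends on
    more than one input element: its own lane XORed with the next lane
    (wrapping around)."""
    n = len(lanes)
    return [lanes[i] ^ lanes[(i + 1) % n] for i in range(n)]
```

Hash algorithms such as SHA-1/SHA-256 chain many such cross-element dependencies, which is why dedicated instructions beat plain vector ops.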
  • Patent number: 9652435
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: October 18, 2016
    Date of Patent: May 16, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master