Single Instruction, Multiple Data (simd) Patents (Class 712/22)
  • Patent number: 12118367
    Abstract: Disclosed subject matter enables early PEI phase initialization of GPU cores and dynamic configuration of the GPU core computing to accept sliced workloads for parallel execution. Disclosed methods dynamically adapt based on various factors to a graphics rendering context determined based on factors such as the connected monitors, their various resolutions, etc., to provide advanced GPU rendering in pre-boot operating environment. Methods and systems may support pre-boot hybrid graphics rendering including dynamic utilization of integrated and discrete GPU cards/memory, along with the central processing unit (CPU) and cache to provide seamless and faster graphics rendering operations for all preboot requirements.
    Type: Grant
    Filed: April 17, 2023
    Date of Patent: October 15, 2024
    Assignee: Dell Products L.P.
    Inventors: Shekar Babu Suryanarayana, Harish Barigi
  • Patent number: 12079627
    Abstract: A processor-implemented method for executing a hardware intrinsic programming instruction, includes performing one or more Boolean operations in combination with one or more permutation operations in response to the hardware intrinsic programming instruction being a single predicated compare-exchange-shuffle programming instruction. The method also includes outputting a sub-sorted list after the performing of the one or more Boolean operation in combination with the one or more permutation operation.
    Type: Grant
    Filed: March 23, 2023
    Date of Patent: September 3, 2024
    Assignee: QUALCOMM Incorporated
    Inventors: Himanshu Pradeep Aswani, Mithil Ramteke, Venkata Prema Sai Sravan Patchala, Sridhar Kandimalla
  • Patent number: 12056490
    Abstract: Methods, systems and apparatus are provided for handling control flow structures in data-parallel architectures. A method includes receiving, by a processing unit (PU), a program for execution. The method further includes applying, by the PU, a branching solution to the program to obtain data on control flow structures of the program. The method further includes determining, by the PU and based at least in part on the obtained data, one or more control flow structures of the program to predicate. The method further includes applying, by the PU, predication to the one or more control flow structures of the program.
    Type: Grant
    Filed: August 12, 2022
    Date of Patent: August 6, 2024
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Kevin Lin, Guansong Zhang
  • Patent number: 12001209
    Abstract: A method of embodiments, as described herein, includes detecting thread groups relating to machine learning associated with one or more processing devices. The method may further include facilitating barrier synchronization of the thread groups across multiple dies such that each thread in a thread group is scheduled across a set of compute elements associated with the multiple dies, where each die represents a processing device of the one or more processing devices, the processing device including a graphics processor.
    Type: Grant
    Filed: May 23, 2022
    Date of Patent: June 4, 2024
    Assignee: Intel Corporation
    Inventors: Abhishek R. Appu, Altug Koker, Joydeep Ray, Balaji Vembu, John C. Weast, Mike B. Macpherson, Dukhwan Kim, Linda L. Hurd, Sanjeev Jahagirdar, Vasanth Ranganathan
  • Patent number: 11934482
    Abstract: A processing device includes a two-dimensional array of processing elements, each processing element including an arithmetic logic unit to perform an operation. The device further includes interconnections among the two-dimensional array of processing elements to provide direct communication among neighboring processing elements of the two-dimensional array of processing elements. A processing element of the two-dimensional array of processing elements is connected to a first neighbor processing element that is immediately adjacent the processing element in a first dimension of the two-dimensional array. The processing element is further connected to a second neighbor processing element that is immediately adjacent the processing element in a second dimension of the two-dimensional array.
    Type: Grant
    Filed: August 3, 2023
    Date of Patent: March 19, 2024
    Assignee: UNTETHER AI CORPORATION
    Inventor: William Martin Snelgrove
  • Patent number: 11848686
    Abstract: A system using accelerated error-correcting code in the storage and retrieval of data, wherein a single-instruction-multiple-data (SIMD) processor, SIMD instructions, non-volatile storage media, and an I/O controller implement a polynomial coding system including: a data matrix including at least one vector and including rows of at least one block of original data; a check matrix including more than two rows of at least one block of check data in the main memory; and a thread that executes on a SIMD CPU core and including: a parallel multiplier that multiplies the at least one vector of the data matrix by a single factor; and a parallel linear feedback shift register (LFSR) sequencer or a parallel syndrome sequencer configured to order load operations of the original data into at least one vector register of the SIMD CPU core and respectively compute the check data or syndrome data with the parallel multiplier.
    Type: Grant
    Filed: May 18, 2022
    Date of Patent: December 19, 2023
    Assignee: STREAMSCALE, INC.
    Inventor: Michael H. Anderson
  • Patent number: 11770258
    Abstract: In one example an apparatus comprises a computer readable memory, hash logic to generate a message hash value based on an input message, signature logic to generate a signature to be transmitted in association with the message, the signature logic to apply a hash-based signature scheme to a private key to generate the signature comprising a public key, and accelerator logic to pre-compute at least one set of inputs to the signature logic. Other examples may be described.
    Type: Grant
    Filed: December 27, 2021
    Date of Patent: September 26, 2023
    Assignee: INTEL CORPORATION
    Inventors: Vikram Suresh, Sanu Mathew, Manoj Sastry, Santosh Ghosh, Raghavan Kumar, Rafael Misoczki
  • Patent number: 11762560
    Abstract: A system including an array of processing elements, a plurality of periphery crossbars and a plurality of storage components is described. The array of processing elements is interconnected in a grid via a network on an integrated circuit. The periphery crossbars are connected to a plurality of edges of the array of processing elements. The storage components are connected to the periphery crossbars.
    Type: Grant
    Filed: December 6, 2021
    Date of Patent: September 19, 2023
    Assignee: Meta Platforms, Inc.
    Inventors: Linda Cheng, Olivia Wu, Abdulkadir Utku Diril, Pankaj Kansal
  • Patent number: 11755504
    Abstract: Embodiments of a multiprocessor system are disclosed that may include a plurality of processors interspersed with a plurality of data memory routers, a plurality of bus interface units, a bus control circuit, and a processor interface circuit. The data memory routers may be coupled together to form a primary interconnection network. The bus interface units and the bus control circuit may be coupled together in a daisy-chain fashion to form a secondary interconnection network. Each of the bus interface units may be configured to read or write data or instructions to a respective one of the plurality of data memory routers and a respective processor. The bus control circuit coupled with the processor interface circuit may be configured to function as a bidirectional bridge between the primary and secondary networks. The bus control circuit may also couple to other interface circuits and arbitrate their access to the secondary network.
    Type: Grant
    Filed: July 14, 2020
    Date of Patent: September 12, 2023
    Assignee: Coherent Logix, Incorporated
    Inventors: Carl S. Dobbs, Michael R. Trocino
  • Patent number: 11710207
    Abstract: A graphics pipeline includes a first shader that generates first wave groups, a shader processor input (SPI) that launches the first wave groups for execution by shaders, and a scan converter that generates second waves for execution on the shaders based on results of processing the first wave groups the one or more shaders. The first wave groups are selectively throttled based on a comparison of in-flight first wave groups and second waves pending execution on the at least one second shader. A cache holds information that is written to the cache in response to the first wave groups finishing execution on the shaders. Information is read from the cache in response to read requests issued by the second waves. In some cases, the first wave groups are selectively throttled by comparing how many first wave groups are in-flight and how many read requests to the cache are pending.
    Type: Grant
    Filed: March 30, 2021
    Date of Patent: July 25, 2023
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Christopher J. Brennan, Nishank Pathak
  • Patent number: 11687615
    Abstract: Systems, apparatuses, and methods for implementing a tiling algorithm for a matrix math instruction set are disclosed. A system includes at least a memory, a cache, a processor, and a plurality of compute units. The memory stores a plurality of matrix elements in a linear format, and the processor converts the plurality of matrix elements from the linear format to a tiling format. Each compute unit retrieves a plurality of matrix elements from the memory into the cache. Each compute unit includes a matrix operations unit which loads the plurality of matrix elements of corresponding tile(s) from the cache and performs a matrix operation on the plurality of matrix elements to generate a result in the tiling format. The system generates a classification of a first dataset based on results of the matrix operations.
    Type: Grant
    Filed: December 26, 2018
    Date of Patent: June 27, 2023
    Inventor: Hua Zhang
  • Patent number: 11681526
    Abstract: A method is provided that includes performing, by a processor in response to a vector finite impulse response (VFIR) filter instruction, generating of a plurality of filter outputs using a plurality of coefficients and a plurality of sequential data elements, the plurality of coefficients specified by a coefficient operand of the VFIR filter instruction and the plurality of sequential data elements specified by a data operand of the VFIR filter instruction, and storing the filter outputs in a storage location specified by the VFIR filter instruction.
    Type: Grant
    Filed: May 20, 2020
    Date of Patent: June 20, 2023
    Assignee: Texas Instmments Incorporated
    Inventors: Mujibur Rahman, Asheesh Bhardwaj, Timothy David Anderson
  • Patent number: 11589066
    Abstract: A method and apparatus for performing transformation and inverse transformation on a current block by using multi-core transform kernels in video encoding and decoding processes. A video decoding method may include obtaining, from a bitstream, multi-core transformation information indicating whether multi-core transformation kernels are to be used according to a size of a current block; obtaining horizontal transform kernel information and vertical transform kernel information from the bitstream when the multi-core transformation kernels are used according to the multi-core transformation information; determining a horizontal transform kernel for the current block according to the horizontal transform kernel information; determining a vertical transform kernel for the current block according to the vertical transform kernel information; and performing inverse transformation on the current block by using the horizontal transform kernel and the vertical transform kernel.
    Type: Grant
    Filed: July 3, 2018
    Date of Patent: February 21, 2023
    Assignee: SAMSUNG ELECTRONICS CO., LTD.
    Inventors: Ki-ho Choi, Min-soo Park, Elena Alshina
  • Patent number: 11586900
    Abstract: Machine learning of model parameters for a neural network using a computing system is provided, that produces error-aware model parameters. An iterative process to converge on trained model parameters to be applied in the inference engine, includes applying a sequence of input training data sets to a neural network to produce inference results for the sequence using a set of model parameters in the neural network combined with factors based on a model of non-ideal characteristics of target memory to provide a training set of model parameters. An inference engine using the target memory technology to store the model parameters can have more stable results across a large number of engines.
    Type: Grant
    Filed: April 20, 2020
    Date of Patent: February 21, 2023
    Assignee: MACRONIX INTERNATIONAL CO., LTD.
    Inventor: Ming-Hsiu Lee
  • Patent number: 11494622
    Abstract: A method for configuring hardware for implementing a Deep Neural Network (DNN) for performing an activation function, the hardware comprising, at an activation module for performing an activation function, a programmable lookup table for storing lookup data approximating the activation function over a first range of input values to the activation module, the method comprising: providing calibration data to a representation of the hardware; monitoring an input to an activation module of the representation of the hardware so as to determine a range of input values to the activation module; generating lookup data for the lookup table representing the activation function over the determined range of input values; and loading the generated lookup data into the lookup table of the hardware, thereby configuring the activation module of the hardware for performing the activation function over the determined range of input values.
    Type: Grant
    Filed: November 5, 2018
    Date of Patent: November 8, 2022
    Assignee: Imagination Technologies Limited
    Inventors: Yuan Li, Antonios Tsichlas, Christopher Martin
  • Patent number: 11481223
    Abstract: Methods, systems and apparatuses for reducing operations of Sum-Of-Multiply-Accumulate (SOMAC) instructions are disclosed. One method includes scheduling, by a scheduler, a thread for execution, executing, by a processor of a plurality of processors, the thread, fetching, by the processor, a plurality of instructions for the thread from a memory, selecting, by a thread arbiter of the processor, an instruction of the plurality of instructions for execution in an arithmetic logic unit (ALU) pipeline of the processor, and reading the instruction, and determining, by a macro-instruction iterator of the processor, whether the instruction is a Sum-Of-Multiply-Accumulate (SOMAC) instruction with an instruction size, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed.
    Type: Grant
    Filed: August 8, 2019
    Date of Patent: October 25, 2022
    Assignee: Blaize, Inc.
    Inventors: Kamaraj Thangam, Palaparthy Venkata Divya Bharathi, Satyaki Koneru
  • Patent number: 11461096
    Abstract: A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
    Type: Grant
    Filed: September 30, 2019
    Date of Patent: October 4, 2022
    Assignee: TEXAS INSTRUMENTS INCORPORATED
    Inventors: Timothy David Anderson, Mujibur Rahman
  • Patent number: 11449337
    Abstract: A pseudorandom logic circuit may be embedded as a hardware within an emulation system, which may generate pseudorandom keephot instructions. A masking logic may mask out portions in each pseudorandom keephot instruction, which may change state elements during execution. A cluster of emulation processors may execute masked pseudorandom keephot instructions to consume power when not executing mission instructions. The cluster of emulation processors may run keephot cycles, during which the cluster of emulation processors may execute the pseudorandom keephot instructions causing the cluster of emulation processors to continue consuming a roughly constant amount of power, either at a same or different voltage level, but supposed outputs of the pseudorandom keephot instructions may have no impact upon inputs and outputs generated during mission cycles.
    Type: Grant
    Filed: December 19, 2019
    Date of Patent: September 20, 2022
    Assignee: CADENCE DESIGN SYSTEMS, INC.
    Inventors: Mitchell Poplack, Yuhei Hayashi
  • Patent number: 11442729
    Abstract: A method and system for processing a bit-packed array using one or more processors, including determining a data element size of the bit-packed array, determining a lane configuration of a single-instruction multiple-data (SIMD) unit for processing the bit-packed array based at least in part on the determined data element size, the lane configuration being determined from among a plurality of candidate lane configurations, each candidate lane configuration having a different number of vector register lanes and a corresponding bit capacity per vector register lane, configuring the SIMD unit according to the determined lane configuration, and loading one or more data elements into each vector register lane of the SIMD unit. SIMD instructions may be executed on the loaded one or more data elements of each vector register lane in parallel, and a result of the SIMD instruction may be stored in memory.
    Type: Grant
    Filed: October 26, 2020
    Date of Patent: September 13, 2022
    Assignee: Google LLC
    Inventors: Junwhan Ahn, Jichuan Chang, Andrew McCormick, Yuanwei Fang, Yixin Luo
  • Patent number: 11409579
    Abstract: An apparatus to facilitate thread barrier synchronization is disclosed. The apparatus includes a plurality of processing resources to execute a plurality of execution threads included in a thread workgroup and barrier synchronization hardware to assign a first named barrier to a first set of the plurality of execution threads in the thread workgroup, assign a second named barrier to a second set of the plurality of execution threads in the thread workgroup, synchronize execution of the first set of execution threads via the first named barrier and synchronize execution of the second set of execution threads via the second named barrier.
    Type: Grant
    Filed: February 24, 2020
    Date of Patent: August 9, 2022
    Assignee: Intel Corporation
    Inventors: James Valerio, Vasanth Ranganathan, Joydeep Ray
  • Patent number: 11372804
    Abstract: A processor includes a vector register configured to load data responsive to a special purpose load instruction. The processor also includes circuitry configured to replicate a selected sub-vector value from the vector register.
    Type: Grant
    Filed: May 16, 2018
    Date of Patent: June 28, 2022
    Assignee: Qualcomm Incorporated
    Inventors: Eric Mahurin, Erich Plondke, David Hoyle
  • Patent number: 11347503
    Abstract: A method is provided that includes performing, by a processor in response to a vector matrix multiply instruction, multiplying an m×n matrix (A matrix) and a n×p matrix (B matrix) to generate elements of an m×p matrix (R matrix), and storing the elements of the R matrix in a storage location specified by the vector matrix multiply instruction.
    Type: Grant
    Filed: May 20, 2020
    Date of Patent: May 31, 2022
    Assignee: TEXAS INSTRUMENTS INCORPORATED
    Inventors: Asheesh Bhardwaj, Mujibur Rahman, Timothy David Anderson
  • Patent number: 11327752
    Abstract: A data processing apparatus, a method of operating a data processing apparatus, a non-transitory computer readable storage medium, and an instruction are provided. The instruction specifies a first source register, a second source register, and an index. In response to the instruction control signals are generated, causing processing circuitry to perform a data processing operation with respect to each data group in the first source register and the second source register to generate respective result data groups forming a result of the data processing operation. Each of the first source register and the second source register has a size which is an integer multiple at least twice a predefined size of the data group, and each data group comprises a plurality of data elements. The operands of the data processing operation for each data group are a selected data element identified in the data group of the first source register by the index and each data element in the data group of the second source register.
    Type: Grant
    Filed: February 2, 2018
    Date of Patent: May 10, 2022
    Assignee: ARM LIMITED
    Inventors: Grigorios Magklis, Nigel John Stephens, Jacob Eapen, Mbou Eyole, David Hennah Mansell
  • Patent number: 11221851
    Abstract: Embodiments of the present disclosure provide a method, executed by a computing device, for configuring a vector operation, an apparatus, a device, and a storage medium. The method includes obtaining information indicating at least one configurable vector operation parameter. The information indicating the at least one configurable vector operation parameter indicates a type and a value of the configurable vector operation parameter. The method further includes: based on the type and the value of the configurable vector operation parameter, configuring multiple vector operation circuits to enable each of the vector operation circuits to execute a target vector operation including two or more basic vector operations and defined based on the type and value of the configurable vector operation parameter.
    Type: Grant
    Filed: July 23, 2020
    Date of Patent: January 11, 2022
    Assignees: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY CO., LTD., KUNLUNXIN TECHNOLOGY (BEIJING) COMPANY LIMITED
    Inventors: Huimin Li, Peng Wu, Jian Ouyang
  • Patent number: 11182170
    Abstract: A processor having a SIMD architecture, including an array of elementary processors, each elementary processor being associated with an elementary memory cell, a central controller connected to the elementary processors by an instruction bus and a status bus. The central controller transmits a sequence of instructions in a loop, each instruction including a calculation flow indicator. Each elementary processor has an instruction filter that makes it possible to reject or take into account an instruction depending on the identifier it contains. This operating mode makes it possible to emulate a MIMD processor on a SIMD architecture.
    Type: Grant
    Filed: June 6, 2019
    Date of Patent: November 23, 2021
    Assignee: COMMISSARIAT A L'ENERGIE ATOMIQUE ET AUX ENERGIES ALTERNATIVES
    Inventors: Stéphane Chevobbe, Marc Duranton
  • Patent number: 11126587
    Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
    Type: Grant
    Filed: April 30, 2019
    Date of Patent: September 21, 2021
    Assignee: Micron Technology, Inc.
    Inventor: Tony M. Brewer
  • Patent number: 11119782
    Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
    Type: Grant
    Filed: April 30, 2019
    Date of Patent: September 14, 2021
    Assignee: Micron Technology, Inc.
    Inventor: Tony M. Brewer
  • Patent number: 11119972
    Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
    Type: Grant
    Filed: April 30, 2019
    Date of Patent: September 14, 2021
    Assignee: Micron Technology, Inc.
    Inventor: Tony M. Brewer
  • Patent number: 11113061
    Abstract: Described herein are techniques for saving registers in the event of a function call. The techniques include modifying a program including a block of code designated as a calling code that calls a function. The modifying includes modifying the calling code to set a register usage mask indicating which registers are in use at the time of the function call. The modifying also includes modifying the function to combine the information of the register usage mask with information indicating registers used by the function to generate registers to be saved and save the registers to be saved.
    Type: Grant
    Filed: September 26, 2019
    Date of Patent: September 7, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventor: Michael John Bedy
  • Patent number: 11074213
    Abstract: Systems, methods, and apparatuses relating to vector processor architecture having an array of identical circuit blocks are described.
    Type: Grant
    Filed: June 29, 2019
    Date of Patent: July 27, 2021
    Assignee: Intel Corporation
    Inventors: Joseph Williams, Jay O'Neill, Jeroen Leijten, Harm Peters, Eugene Scuteri
  • Patent number: 11061766
    Abstract: Examples disclosed herein relate to a fault-tolerant dot product engine. The fault-tolerant dot product engine has a crossbar array having a number l of row lines and a number n of column lines intersecting the row lines to form l×n memory locations, with each memory location having a programmable memristive element and defining a matrix value. A number l of digital-to-analog converters are coupled to the row lines of the crossbar array to receive an input signal and a number n of analog-to-digital converters are coupled to the column lines of the crossbar array to generate an output signal. The output signal is a dot product of the input signal and the matrix values in the crossbar array, wherein a number m<n of the n column lines in the crossbar array are programmed with matrix values used to detect errors in the output signal.
    Type: Grant
    Filed: December 12, 2019
    Date of Patent: July 13, 2021
    Assignee: Hewlett Packard Enterprise Development LP
    Inventors: Ron M. Roth, Richard H. Henze
  • Patent number: 11062680
    Abstract: Systems, apparatuses, and methods for implementing raster order view enforcement techniques are disclosed. A processor includes a plurality of compute units coupled to one or more memories. A plurality of waves are launched in parallel for execution on the plurality of compute units, where each wave comprises a plurality of threads. A dependency chain is generated for each wave of the plurality of waves. The compute units wait for all older waves to complete dependency chain generation prior to executing any threads with dependencies. Responsive to all older waves completing dependency chain generation, a given thread with a dependency is executed only if all other threads upon which the given thread is dependent have become inactive. When executed, the plurality of waves generate a plurality of pixels to be driven to a display.
    Type: Grant
    Filed: December 20, 2018
    Date of Patent: July 13, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Pazhani Pillai, Christopher J. Brennan
  • Patent number: 11055097
    Abstract: One embodiment of the present invention includes techniques to decrease power consumption by reducing the number of redundant operations performed. In operation, a streamlining multiprocessor (SM) identifies uniform groups of threads that, when executed, apply the same deterministic operation to uniform sets of input operands. Within each uniform group of threads, the SM designates one thread as the anchor thread. The SM disables execution units assigned to all of the threads except the anchor thread. The anchor execution unit, assigned to the anchor thread, executes the operation on the uniform set of input operands. Subsequently, the SM sets the outputs of the non-anchor threads included in the uniform group of threads to equal the value of the anchor execution unit output.
    Type: Grant
    Filed: October 8, 2013
    Date of Patent: July 6, 2021
    Assignee: NVIDIA Corporation
    Inventors: Gary M. Tarolli, John H. Edmondson, John Matthew Burgess, Robert Ohannessian
  • Patent number: 11048512
    Abstract: An apparatus and method for processing non-maskable interrupt source information. For example, one embodiment of a processor comprises: a plurality of cores comprising execution circuitry to execute instructions and process data; local interrupt circuitry comprising a plurality of registers to store interrupt-related data including non-maskable interrupt (NMI) data related to a first NMI; and non-maskable interrupt (NMI) processing mode selection circuitry, responsive to a request, to select between at least two NMI processing modes to process the first NMI including: a first NMI processing mode in which the plurality of registers are to store first data related to a first NMI, wherein no NMI source information related to a source of the NMI is included in the first data, and a second NMI processing mode in which the plurality of registers are to store both the first data related to the first NMI and second data comprising NMI source information indicating the NMI source.
    Type: Grant
    Filed: March 28, 2020
    Date of Patent: June 29, 2021
    Assignee: Intel Corporation
    Inventors: Ashok Raj, Andreas Kleen, Gilbert Neiger, Beeman Strong, Jason Brandt, Rupin Vakharwala, Jeff Huxel, Larisa Novakovsky, Ido Ouziel, Sarathy Jayakumar
  • Patent number: 11016706
    Abstract: An example apparatus includes a processing in memory (PIM) capable device having an array of memory cells and sensing circuitry coupled to the array, where the sensing circuitry includes a sense amplifier and a compute component. The PIM capable device includes timing circuitry selectably coupled to the sensing circuitry. The timing circuitry is configured to control timing of performance of compute operations performed using the sensing circuitry. The PIM capable device also includes a sequencer selectably coupled to the timing circuitry. The sequencer is configured to coordinate the compute operations. The apparatus also includes a source external to the PIM capable device. The sequencer is configured to receive a command instruction set from the source to initiate performance of a compute operation.
    Type: Grant
    Filed: June 13, 2019
    Date of Patent: May 25, 2021
    Assignee: Micron Technology, Inc.
    Inventors: Perry V. Lea, Timothy P. Finkbeiner
  • Patent number: 11003455
    Abstract: According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.
    Type: Grant
    Filed: April 29, 2019
    Date of Patent: May 11, 2021
    Assignee: Intel Corporation
    Inventors: Andrew T. Forsyth, Brian J. Hickmann, Jonathan C. Hall, Christopher J. Hughes
  • Patent number: 11003447
    Abstract: A data processing system (2) supports vector processing operations performed upon vector operands comprising a plurality of vector operand elements. The data processing system includes a processor (4) having an instruction decoder (14) which decodes mixed-element-sized vector arithmetic instructions to generate control signals (16) which control processing circuitry (18) to perform arithmetic operations upon a first vector of first source operand elements ai of a first bit size A, and a second vector of second source operand elements bj of a second bit size B. The second bit size B is greater than the first bit size A.
    Type: Grant
    Filed: June 23, 2016
    Date of Patent: May 11, 2021
    Assignee: ARM Limited
    Inventor: Nigel John Stephens
  • Patent number: 10990394
    Abstract: An integrated circuit may include a mixed instruction multiple data (xIMD) computing system. The xIMD computing system may include a plurality of data processors, each data processor representative of a lane of a single instruction multiple data (SIMD) computing system, wherein the plurality of data processors are configured to use a first dominant lane for instruction execution and to fork a second dominant lane when a data dependency instruction that does not share a taken/not-taken state with the first dominant lane is encountered during execution of a program by the xIMD computing system.
    Type: Grant
    Filed: September 28, 2017
    Date of Patent: April 27, 2021
    Assignee: Intel Corporation
    Inventor: Jeffrey L. Nye
  • Patent number: 10990341
    Abstract: Disclosed are a display apparatus, a method of controlling the same, and a recording medium thereof, the display apparatus including: a display comprising a plurality of light source modules arrayed like tiles and mounted with a plurality of light emitting elements; an image processor configured to output a signal for displaying an image on a predetermined area of the display, the signal comprising image data and identification information about at least one light source module corresponding to the predetermined area; and a driver configured to selectively drive the at least one light source module corresponding to the identification information among the plurality of light source modules, based on the image data.
    Type: Grant
    Filed: September 3, 2019
    Date of Patent: April 27, 2021
    Assignee: SAMSUNG ELECTRONICS CO., LTD.
    Inventors: Sangkyun Im, Joowhan Lee
  • Patent number: 10970081
    Abstract: Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.
    Type: Grant
    Filed: June 29, 2017
    Date of Patent: April 6, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Jiasheng Chen, Bin He, Mohammad Reza Hakami, Timothy Lottes, Justin David Smith, Michael J. Mantor, Derek Carson
  • Patent number: 10929503
    Abstract: An apparatus and method for a masked multiply instruction to support neural network pruning operations. For example, one embodiment of a processor comprises: a decoder to decode a matrix multiplication with masking (GEMM) instruction identifying a destination matrix register to store a result, and source registers storing an A-matrix, a B-matrix, and a matrix mask; execution circuitry to execute the GEMM instruction, the execution circuitry to multiply a plurality of B-matrix elements with a plurality of A-matrix elements, each of the B-matrix elements associated with a mask value in the matrix mask, wherein if the mask value is set to a first value, then the execution circuitry is to multiply the B-matrix element with one or more of the A-matrix elements to generate a first partial result, and if the mask value is set to a second value, then the execution circuitry is to multiply an alternate B-matrix element with a one or more of the A-matrix elements to generate a second partial result.
    Type: Grant
    Filed: December 21, 2018
    Date of Patent: February 23, 2021
    Assignee: Intel Corporation
    Inventors: Omid Azizi, Chen Koren, Nitin Garegrat
  • Patent number: 10922086
    Abstract: To perform a reduction operation to combine data values for threads in a thread group using a data processor, the data processor performs combining steps that each combine the stored combined data value result of a previous combining operation for a thread with the combined data value result of the previous combining operation for a selected another execution lane that has not yet contributed to the stored combined data value result for the thread. The data processor selects as the another execution lane of the execution processing circuitry that has not yet contributed to the combined data value result for the thread, an execution lane from a group of execution lanes whose values have been combined in the previous combining step and that have not yet contributed to the combined data value result for the thread, and having a particular relative position in the group of execution lanes.
    Type: Grant
    Filed: June 15, 2019
    Date of Patent: February 16, 2021
    Assignee: Arm Limited
    Inventor: Kevin Petit
  • Patent number: 10909739
    Abstract: In various embodiments, a parallel processor implements a graphics processing pipeline that generates rendered images. In operation, the parallel processor causes execution threads to execute a task shading program on an input mesh to generate a task shader output specifying a mesh shader count. The parallel processor then generates mesh shader identifiers, where the total number of the mesh shader identifiers equals the mesh shader count. For each mesh shader identifier, the parallel processor invokes a mesh shader based on the mesh shader identifier and the task shader output to generate geometry associated with the mesh shader identifier. Subsequently, the parallel processor performs operations on the geometries associated with the mesh shader identifiers to generate a rendered image. Advantageously, unlike conventional graphics processing pipelines, the performance of the graphics processing pipeline is not limited by a primitive distributor.
    Type: Grant
    Filed: January 26, 2018
    Date of Patent: February 2, 2021
    Assignee: NVIDIA Corporation
    Inventors: Ziyad Hakura, Yury Uralsky, Christoph Kubisch, Pierre Boudier, Henry Moreton
  • Patent number: 10891353
    Abstract: Aspects for matrix multiplication in neural network are described herein. The aspects may include a controller unit configured to receive a matrix-addition instruction. The aspects may further include a computation module configured to receive a first matrix and a second matrix. The first matrix may include one or more first elements and the second matrix includes one or more second elements. The one or more first elements and the one or more second elements may be arranged in accordance with a two-dimensional data structure. The computation module may be further configured to respectively add each of the first elements to each of the second elements based on a correspondence in the two-dimensional data structure to generate one or more third elements for a third matrix.
    Type: Grant
    Filed: January 17, 2019
    Date of Patent: January 12, 2021
    Assignee: CAMBRICON TECHNOLOGIES CORPORATION LIMITED
    Inventors: Xiao Zhang, Shaoli Liu, Tianshi Chen, Yunji Chen
  • Patent number: 10877766
    Abstract: An integrated circuit (IC) may include a scheduler for hardware acceleration. The scheduler may include a command queue having a plurality of slots and configured to store commands offloaded from a host processor for execution by compute units of the IC. The scheduler may include a status register having bit locations corresponding to the slots of the command queue. The scheduler may also include a controller coupled to the command queue and the status register. The controller may be configured to schedule the compute units of the IC to execute the commands stored in the slots of the command queue and update the bit locations of the status register to indicate which commands from the command queue are finished executing.
    Type: Grant
    Filed: May 24, 2018
    Date of Patent: December 29, 2020
    Assignee: Xilinx, Inc.
    Inventors: Soren T. Soe, Idris I. Tarwala, Umang Parekh, Sonal Santan, Hem C. Neema
  • Patent number: 10860681
    Abstract: Aspects for matrix addition in neural network are described herein. The aspects may include a controller unit configured to receive a matrix-add-scalar instruction that includes an address of the first matrix and a scalar value. The aspects may further include a computation module configured to receive the first matrix from a storage device based on the address of the first matrix. The first matrix may include one or more first elements. The one or more first elements are arranged in accordance with a two-dimensional data structure. The computation module may be further configured to respectively add the scalar value to each of the one or more first elements of the first matrix in accordance with the matrix-add-scalar instruction to generate one or more second elements for a second matrix.
    Type: Grant
    Filed: October 26, 2018
    Date of Patent: December 8, 2020
    Assignee: CAMBRICON TECHNOLOGIES CORPORATION LIMITED
    Inventors: Xiao Zhang, Shaoli Liu, Tianshi Chen, Yunji Chen
  • Patent number: 10831477
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 21, 2017
    Date of Patent: November 10, 2020
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10824434
    Abstract: Examples described herein relate to dynamically structured single instruction, multiple data (SIMD) instructions, and systems and circuits implementing such dynamically structured SIMD instructions. An example is a method for processing data. A first SIMD structure is determined by a processor. A characteristic of the first SIMD structure is altered by the processor to obtain a second SIMD structure. An indication of the second SIMD structure is communicated from the processor to a numerical engine. Data is packed by the numerical engine into an SIMD instruction according to the second SIMD structure. The SIMD instruction is transmitted from the numerical engine.
    Type: Grant
    Filed: November 29, 2018
    Date of Patent: November 3, 2020
    Assignee: XILINX, INC.
    Inventors: Sean Settle, Ehsan Ghasemi, Ashish Sirasao, Ralph D. Wittig
  • Patent number: 10824934
    Abstract: Described examples include an integrated circuit including a vector multiply unit including a plurality of multiply/accumulate nodes, in which the vector multiply unit is operable to provide an output from the multiply/accumulate nodes, a first data feeder operable to provide first data to the vector multiply unit in vector format, and a second data feeder operable to provide second data to the vector multiply unit in vector format.
    Type: Grant
    Filed: October 16, 2017
    Date of Patent: November 3, 2020
    Assignee: TEXAS INSTRUMENTS INCORPORATED
    Inventors: Mihir Narendra Mody, Shyam Jagannathan, Manu Mathew, Jason T. Jones
  • Patent number: 10795677
    Abstract: Embodiments of systems, apparatuses, and methods for multiplication, negation, and accumulation of data values in a processor are described. For example, execution circuitry executes a decoded instruction to multiply selected data values from a plurality of packed data element positions in first and second packed data source operands to generate a plurality of first result values, sum the plurality of first result values to generate one or more second result values, negate the one or more second result values to generate one or more third result values, accumulate the one or more third result values with one or more data values from the destination operand to generate one or more fourth result values, and store the one or more third result values in one or more packed data element positions in the destination operand.
    Type: Grant
    Filed: September 29, 2017
    Date of Patent: October 6, 2020
    Assignee: Intel Corporation
    Inventors: Venkateswara R. Madduri, Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Jesus Corbal, Mark Charney