Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Patent number: 10795677
    Abstract: Embodiments of systems, apparatuses, and methods for multiplication, negation, and accumulation of data values in a processor are described. For example, execution circuitry executes a decoded instruction to multiply selected data values from a plurality of packed data element positions in first and second packed data source operands to generate a plurality of first result values, sum the plurality of first result values to generate one or more second result values, negate the one or more second result values to generate one or more third result values, accumulate the one or more third result values with one or more data values from the destination operand to generate one or more fourth result values, and store the one or more fourth result values in one or more packed data element positions in the destination operand.
    Type: Grant
    Filed: September 29, 2017
    Date of Patent: October 6, 2020
    Assignee: Intel Corporation
    Inventors: Venkateswara R. Madduri, Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Jesus Corbal, Mark Charney
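The multiply/sum/negate/accumulate dataflow described in the abstract above can be sketched as a toy model. This is plain Python standing in for the execution circuitry, not the patented hardware; the function name is illustrative.

```python
def multiply_negate_accumulate(src1, src2, dest):
    # First result values: per-element products of the two packed sources.
    products = [a * b for a, b in zip(src1, src2)]
    # Second result value: sum of the products.
    summed = sum(products)
    # Third result value: negation of the sum.
    negated = -summed
    # Fourth result value: accumulate with the destination element.
    return dest + negated

# [1, 2] x [3, 4] -> products [3, 8], sum 11, negated -11, plus dest 20 -> 9
assert multiply_negate_accumulate([1, 2], [3, 4], 20) == 9
```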
  • Patent number: 10691464
    Abstract: Systems and methods for virtually partitioning an integrated circuit may include identifying dimensional attributes of a target input dataset and selecting a data partitioning scheme from a plurality of distinct data partitioning schemes for the target input dataset based on the dimensional attributes of the target dataset and architectural attributes of an integrated circuit. The method may include disintegrating the target dataset into a plurality of distinct subsets of data based on the selected data partitioning scheme and identifying a virtual processing core partitioning scheme from a plurality of distinct processing core partitioning schemes for an architecture of the integrated circuit based on the disintegration of the target input dataset.
    Type: Grant
    Filed: January 21, 2020
    Date of Patent: June 23, 2020
    Assignee: quadric.io
    Inventors: Nigel Drego, Aman Sikka, Mrinalini Ravichandran, Robert Daniel Firu, Veerbhan Kheterpal
  • Patent number: 10678662
    Abstract: A computing system includes: a data block including data; a storage engine, coupled to the data block, configured to process the data, as hard information or soft information, through channels including a failed channel and a remaining channel, calculate an aggregated output from a hard decision from the remaining channel, calculate a selected magnitude from a magnitude from the remaining channel with an error detected, calculate extrinsic soft information based on the aggregated output and the selected magnitude, and decode the failed channel with a scaled soft metric based on the extrinsic soft information.
    Type: Grant
    Filed: March 25, 2016
    Date of Patent: June 9, 2020
    Assignee: CNEX LABS, Inc.
    Inventor: Xiaojie Zhang
  • Patent number: 10628155
    Abstract: First and second forms of a complex multiply instruction are provided for operating on first and second operand vectors comprising multiple data elements including at least one real data element for representing the real part of a complex number and at least one imaginary element for representing an imaginary part of the complex number. One of the first and second forms of the instruction targets at least one real element of the destination vector and the other targets at least one imaginary element. By executing one of each instruction, complex multiplications of the form (a+ib)*(c+id) can be calculated using relatively few instructions and with only two vector register read ports, enabling DSP algorithms such as FFTs to be calculated more efficiently using relatively low power hardware implementations.
    Type: Grant
    Filed: February 22, 2017
    Date of Patent: April 21, 2020
    Assignee: ARM Limited
    Inventor: Thomas Christopher Grocutt
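Reading the two instruction forms as one writing the real element of the destination and the other writing the imaginary element, the split can be modelled roughly as follows. This is a simplified sketch of the arithmetic only; the real instruction pair accumulates partial products differently in hardware, and the function names are illustrative.

```python
def cmul_real_part(a, b, c, d):
    # "First form": produces the real element of (a+ib)*(c+id).
    return a * c - b * d

def cmul_imag_part(a, b, c, d):
    # "Second form": produces the imaginary element of (a+ib)*(c+id).
    return a * d + b * c

# (1+2i)*(3+4i) = -5 + 10i
assert (cmul_real_part(1, 2, 3, 4), cmul_imag_part(1, 2, 3, 4)) == (-5, 10)
```

Executing one instruction of each form thus completes a full complex multiply while each instruction reads only two vector register ports, which is the efficiency claim made in the abstract.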
  • Patent number: 10623015
    Abstract: An apparatus and method are described for performing vector compression. For example, one embodiment of a processor comprises: vector compression logic to compress a source vector comprising a plurality of valid data elements and invalid data elements to generate a destination vector in which valid data elements are stored contiguously on one side of the destination vector, the vector compression logic to utilize a bit mask associated with the source vector and comprising a plurality of bits, each bit corresponding to one of the plurality of data elements of the source vector and indicating whether the data element comprises a valid data element or an invalid data element, the vector compression logic to utilize indices of the bit mask and associated bit values of the bit mask to generate a control vector; and shuffle logic to shuffle/permute the data elements of the source vector to the destination vector in accordance with the control vector.
    Type: Grant
    Filed: March 15, 2018
    Date of Patent: April 14, 2020
    Assignee: Intel Corporation
    Inventors: Simon Rubanovich, David M. Russinoff, Amit Gradstein, John W. O'Leary, Zeev Sperber
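The mask-driven compression the abstract describes can be illustrated with a short sketch. This is a behavioral toy model, not the shuffle/permute hardware; zero-filling the leftover slots is an assumption made here purely for illustration.

```python
def vector_compress(src, mask):
    # Elements whose mask bit is 1 are valid; pack them contiguously
    # on one side of the destination vector.
    valid = [x for x, m in zip(src, mask) if m]
    # Remaining slots are zero-filled in this toy model.
    return valid + [0] * (len(src) - len(valid))

assert vector_compress([7, 0, 5, 0, 9, 0, 0, 2],
                       [1, 0, 1, 0, 1, 0, 0, 1]) == [7, 5, 9, 2, 0, 0, 0, 0]
```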
  • Patent number: 10613863
    Abstract: Techniques and mechanisms described herein include a signal processor implemented as an overlay on a field-programmable gate array (FPGA) device that utilizes special purpose, hardened intellectual property (IP) modules such as memory blocks and digital signal processing (DSP) cores. A Processing Element (PE) is built from one or more DSP cores connected to additional logic. Interconnected as an array, the PEs may operate in a computational model such as Single Instruction-Multiple Thread (SIMT). A software hierarchy is described that transforms the SIMT array into an effective signal processor.
    Type: Grant
    Filed: July 3, 2019
    Date of Patent: April 7, 2020
    Assignee: Nextera Video, Inc.
    Inventors: John E. Deame, Steven Kaufmann, Liviu Voicu
  • Patent number: 10592466
    Abstract: A GPU architecture employs a crossbar switch to preferentially store operand vectors in a compressed form, reducing the number of memory circuits that must be activated during an operand fetch and allowing existing execution units to be used for scalar execution. Scalar execution can be performed during branch divergence.
    Type: Grant
    Filed: May 12, 2016
    Date of Patent: March 17, 2020
    Assignee: Wisconsin Alumni Research Foundation
    Inventors: Nam Sung Kim, Zhenhong Liu
  • Patent number: 10514916
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: June 5, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
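The defining property of the in-lane shuffle above is that every destination element is selected only from the corresponding lane of the source, never across lanes. A toy model of the single-source form (lane width and control layout are assumptions of this sketch):

```python
def inlane_shuffle(src, control, lane_width=4):
    # Each destination lane is filled by indexing within the matching
    # source lane; `control` holds one per-lane selector per element.
    dst = []
    for lane in range(0, len(src), lane_width):
        for sel in control[lane:lane + lane_width]:
            dst.append(src[lane + sel])  # sel never escapes its lane
    return dst

# Two 4-element lanes, each independently reversed by its control bits.
assert inlane_shuffle([0, 1, 2, 3, 10, 11, 12, 13],
                      [3, 2, 1, 0, 3, 2, 1, 0]) == [3, 2, 1, 0, 13, 12, 11, 10]
```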
  • Patent number: 10514917
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: November 2, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10514918
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 21, 2017
    Date of Patent: December 24, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10509652
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 21, 2017
    Date of Patent: December 17, 2019
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 10497440
    Abstract: A crossbar array, comprises a plurality of row lines, a plurality of column lines intersecting the plurality of row lines at a plurality of intersections, and a plurality of junctions coupled between the plurality of row lines and the plurality of column lines at a portion of the plurality of intersections. Each junction comprises a resistive memory element, and the junctions are positioned to calculate a matrix multiplication of a first matrix and a second matrix.
    Type: Grant
    Filed: August 7, 2015
    Date of Patent: December 3, 2019
    Assignee: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP
    Inventors: Miao Hu, John Paul Strachan, Zhiyong Li, R. Stanley Williams
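The way a resistive crossbar computes a matrix multiplication can be modelled numerically: each junction conductance encodes one matrix element, each row voltage encodes one input value, and each column line sums the resulting currents (Ohm's law per junction, Kirchhoff's current law per column). The sketch below is the idealized arithmetic only, ignoring device non-idealities.

```python
def crossbar_column_currents(G, V):
    # G[i][j]: conductance of the junction between row i and column j.
    # V[i]: voltage driven onto row line i.
    # Column j collects current I_j = sum_i V[i] * G[i][j],
    # i.e. one element of the vector-matrix product V @ G.
    cols = len(G[0])
    return [sum(V[i] * G[i][j] for i in range(len(G))) for j in range(cols)]

assert crossbar_column_currents([[1, 2], [3, 4]], [1, 1]) == [4, 6]
```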
  • Patent number: 10489155
    Abstract: Systems and methods relate to a mixed-width single instruction multiple data (SIMD) instruction which has at least a source vector operand comprising data elements of a first bit-width and a destination vector operand comprising data elements of a second bit-width, wherein the second bit-width is either half of or twice the first bit-width. Correspondingly, one of the source or destination vector operands is expressed as a pair of registers, a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to odd-numbered data elements of the other vector operand expressed as a single register.
    Type: Grant
    Filed: July 21, 2015
    Date of Patent: November 26, 2019
    Assignee: QUALCOMM Incorporated
    Inventors: Eric Wayne Mahurin, Ajay Anant Ingle
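The even/odd register-pair mapping can be illustrated with a toy model. Bit-widths are ignored here; only the element routing between the single register and the register pair is shown, and the function name is illustrative.

```python
def split_to_register_pair(vec):
    # Even-indexed elements of the single-register operand map to the
    # first register of the pair; odd-indexed elements map to the second.
    return vec[0::2], vec[1::2]

first, second = split_to_register_pair([10, 11, 12, 13, 14, 15])
assert first == [10, 12, 14]
assert second == [11, 13, 15]
```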
  • Patent number: 10437769
    Abstract: A method of transition-minimized low-speed data transfer is described herein. In an embodiment, a data rate of a set of data to be transmitted on a data bus is determined. A one-hot value is encoded on the data bus in response to a low data rate. An XOR operation is performed with a previous state of the data bus and the encoded one-hot value. Additionally, a resulting value of the XOR operation is driven onto the data bus.
    Type: Grant
    Filed: December 26, 2013
    Date of Patent: October 8, 2019
    Assignee: Intel Corporation
    Inventor: Daniel Greenspan
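The point of the XOR-with-one-hot step above is that exactly one wire toggles per transfer. A minimal sketch, assuming the low-data-rate decision has already been made and ignoring bus width limits:

```python
def next_bus_state(prev_state, value):
    # Encode the value one-hot, then XOR with the previous bus state;
    # the result is what gets driven onto the bus.
    return prev_state ^ (1 << value)

state = 0
state = next_bus_state(state, 3)   # bit 3 toggles
assert state == 0b1000
state = next_bus_state(state, 1)   # bit 1 toggles
assert state == 0b1010
# Exactly one transition between consecutive bus states:
assert bin(0b1000 ^ 0b1010).count("1") == 1
```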
  • Patent number: 10419501
    Abstract: A data streaming unit (DSU) and a method for operating a DSU are disclosed. In an embodiment the DSU includes a memory interface configured to be connected to a storage unit, a compute engine interface configured to be connected to a compute engine (CE) and an address generator configured to manage address data representing address locations in the storage unit. The data streaming unit further includes a data organization unit configured to access data in the storage unit and to reorganize the data to be forwarded to the compute engine, wherein the memory interface is communicatively connected to the address generator and the data organization unit, wherein the address generator is communicatively connected to the data organization unit, and wherein the data organization unit is communicatively connected to the compute engine interface.
    Type: Grant
    Filed: December 3, 2015
    Date of Patent: September 17, 2019
    Assignee: Futurewei Technologies, Inc.
    Inventors: Ashish Rai Shrivastava, Alan Gatherer, Sushma Wokhlu
  • Patent number: 10324515
    Abstract: Approaches are provided for a predictive electrical appliance power-saving management mode. An approach includes ascertaining a location and pace of a mobile device. The approach further includes calculating an amount of time that it will take to enable or start programs and services upon a computing device waking from a sleep mode or hybrid sleep mode. The approach further includes determining a distance threshold to the computing device that allows for the calculated amount of time to pass such that the programs and services are enabled or started prior to a user of the mobile device arriving at the computing device when the user is returning to the computing device at the ascertained pace. The approach further includes sending a signal to awaken the computing device from the sleep mode or hybrid sleep mode when the mobile device is within the distance threshold.
    Type: Grant
    Filed: May 8, 2017
    Date of Patent: June 18, 2019
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: James E. Bostick, John M. Ganci, Jr., Sarbajit K. Rakshit, Kimberly G. Starks
  • Patent number: 10311539
    Abstract: A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
    Type: Grant
    Filed: November 2, 2016
    Date of Patent: June 4, 2019
    Assignee: Imagination Technologies Limited
    Inventors: John Howson, Jonathan Redshaw, Yoong Chert Foo
  • Patent number: 10291514
    Abstract: Aspects of this disclosure provide techniques for dynamically configuring flow splitting via software defined network (SDN) signaling instructions. An SDN controller may instruct an ingress network node to split a traffic flow between two or more egress paths, and instruct the ingress network node, and perhaps downstream network nodes, to transport portions of the traffic flow in accordance with a forwarding protocol. In one example, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a link-based forwarding protocol. In other examples, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a path-based or source-based transport protocol.
    Type: Grant
    Filed: August 28, 2017
    Date of Patent: May 14, 2019
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Xu Li, Hang Zhang
  • Patent number: 10241802
    Abstract: A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks.
    Type: Grant
    Filed: November 20, 2015
    Date of Patent: March 26, 2019
    Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
    Inventors: John Lancaster, Martin Whitaker
  • Patent number: 10235732
    Abstract: A method and system are described herein for an optimization technique addressing two aspects of thread scheduling and dispatch when the driver is allowed to pick the scheduling attributes. The present techniques rely on an enhanced GPGPU Walker hardware command and one-dimensional local identification generation to maximize thread residency.
    Type: Grant
    Filed: December 27, 2013
    Date of Patent: March 19, 2019
    Assignee: INTEL CORPORATION
    Inventors: Jayanth N. Rao, Michal Mrozek
  • Patent number: 10229468
    Abstract: Systems, apparatuses and methods may provide for receiving a general purpose graphics processing unit (GPGPU) workload and converting the GPGPU workload to a three-dimensional (3D) workload. Additionally, the 3D workload may be dispatched to a 3D pipeline. In one example, converting the GPGPU workload to the 3D workload includes identifying a plurality of thread groups in the GPGPU workload and mapping the plurality of thread groups to a 3D matrix of cubes.
    Type: Grant
    Filed: June 3, 2015
    Date of Patent: March 12, 2019
    Assignee: Intel Corporation
    Inventors: Robert B. Taylor, Abhishek Venkatesh
  • Patent number: 10228972
    Abstract: In some embodiments, the present invention provides an exemplary computing device, including at least: a scheduler processor; a CPU; a GPU; where the scheduler processor configured to: obtain a computing task; divide the computing task into: a first set of subtasks and a second set of subtasks; submit the first set to the CPU; submit the second set to the GPU; determine, for a first subtask of the first set, a first execution time, a first execution speed, or both; determine, for a second subtask of the second set, a second execution time, a second execution speed, or both; dynamically rebalance an allocation of remaining non-executed subtasks of the computing task to be submitted to the CPU and the GPU, based, at least in part, on at least one of: a first comparison of the first execution time to the second execution time, and a second comparison of the first execution speed to the second execution speed.
    Type: Grant
    Filed: June 21, 2018
    Date of Patent: March 12, 2019
    Assignee: Banuba Limited
    Inventor: Yury Hushchyn
  • Patent number: 10223334
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: March 5, 2019
    Assignee: NOVUMIND LIMITED
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
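The sum-of-outer-products decomposition that the contraction engine relies on can be checked with a short sketch: C = A @ B equals the sum over k of (column k of A) outer (row k of B). Plain Python stands in for the hardware outer product units here.

```python
def matmul_by_outer_products(A, B):
    # Accumulate one outer product per shared index p; each outer
    # product needs only column p of A and row p of B, which is what
    # localizes the tensor elements each partial computation touches.
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p in range(k):
        for i in range(m):
            for j in range(n):
                C[i][j] += A[i][p] * B[p][j]
    return C

assert matmul_by_outer_products([[1, 2], [3, 4]],
                                [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```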
  • Patent number: 10216704
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: February 26, 2019
    Assignee: NOVUMIND LIMITED
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
  • Patent number: 10218635
    Abstract: A network interface controller (NC) that can provide a connection for a device to a network. The NC can include a sideband port controller. The sideband port controller can provide a sideband connection between the network and a sideband endpoint circuit that can communicate information with the network via the sideband. The sideband port controller can include a receive data route that has an input for receiving packets of data from the network and an output for passing the packets of data received from the network to the sideband endpoint circuit. The receive data route may include a buffer to receive the packets of data from the network and to pass the packets of data received from the network to the sideband endpoint.
    Type: Grant
    Filed: September 18, 2015
    Date of Patent: February 26, 2019
    Assignee: International Business Machines Corporation
    Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
  • Patent number: 10218634
    Abstract: A network interface controller for providing a connection for a device to a network. The network interface controller may include a sideband port controller. The sideband port controller may provide a sideband connection between the network and a sideband endpoint circuit that is operative to communicate information with the network via the sideband. The sideband port controller may include a transmit data route having an input for receiving packets from the sideband endpoint circuit and an output for passing packets received from the sideband endpoint to the network. A packet parser is connected to the transmit data route. The packet parser is operative to read data from packets received from the sideband endpoint and is further operative to analyze the data.
    Type: Grant
    Filed: September 18, 2015
    Date of Patent: February 26, 2019
    Assignee: International Business Machines Corporation
    Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
  • Patent number: 10186069
    Abstract: A graphics processing system groups plural initial pilot shader programs into a set of initial pilot shader programs and associates the set of initial pilot shader programs with a set of indexes. The initial pilot shader programs each contain constant program expressions to be executed on behalf of an original shader program. The index for an initial pilot shader program is then used to obtain the instructions contained in the initial pilot shader program for executing the constant program expressions of the initial pilot shader program. The threads for executing a subset of the initial pilot shader programs are also grouped into a thread group and the threads of the thread group are executed in parallel. The graphics processing system provides for efficient preparation and execution of plural initial pilot shader programs.
    Type: Grant
    Filed: February 15, 2017
    Date of Patent: January 22, 2019
    Assignee: Arm Limited
    Inventors: Alexander Galazin, Jörg Wagner, Andreas Due Engh-Halstvedt
  • Patent number: 10152329
    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced.
    Type: Grant
    Filed: February 9, 2012
    Date of Patent: December 11, 2018
    Assignee: NVIDIA CORPORATION
    Inventors: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
  • Patent number: 10073816
    Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
    Type: Grant
    Filed: July 20, 2017
    Date of Patent: September 11, 2018
    Assignee: NovuMind Limited
    Inventors: Chien-Ping Lu, Yu-Shuen Tang
  • Patent number: 10061591
    Abstract: A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more thread clusters is suppressed. The particular one thread in each of the one or more thread clusters is executed. Results determined from execution of the particular one thread in each of the one or more thread clusters are broadcasted to other threads in each of the one or more thread clusters.
    Type: Grant
    Filed: February 26, 2015
    Date of Patent: August 28, 2018
    Assignee: Samsung Electronics Company, Ltd.
    Inventors: Boris Beylin, John Brothers, Santosh Abraham, Lingjie Xu, Maxim Lukyanov, Alex Grosul
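The cluster-suppress-broadcast scheme described above can be sketched in a few lines: threads whose inputs match are grouped, one representative runs, and its result is broadcast to the rest. The computation used below is a placeholder assumption, not the patent's detection mechanism.

```python
def dedupe_redundant_threads(thread_inputs):
    # Group thread ids by identical input; threads in a cluster would
    # compute the same result, so only one representative executes.
    clusters = {}
    for tid, inp in enumerate(thread_inputs):
        clusters.setdefault(inp, []).append(tid)
    results = {}
    for inp, tids in clusters.items():
        value = inp * inp  # stand-in for the shared per-thread work
        for tid in tids:   # broadcast to the suppressed threads
            results[tid] = value
    return results

# Four threads, two distinct inputs -> only two executions needed.
assert dedupe_redundant_threads([2, 3, 2, 3]) == {0: 4, 2: 4, 1: 9, 3: 9}
```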
  • Patent number: 10042641
    Abstract: An asynchronous processing system comprising an asynchronous scalar processor and an asynchronous vector processor coupled to the scalar processor. The asynchronous scalar processor is configured to perform processing functions on input data and to output instructions. The asynchronous vector processor is configured to perform processing functions in response to a very long instruction word (VLIW) received from the scalar processor. The VLIW comprises a first portion and a second portion, at least the first portion comprising a vector instruction.
    Type: Grant
    Filed: September 8, 2014
    Date of Patent: August 7, 2018
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Qifan Zhang, Wuxian Shi, Yiqun Ge, Tao Huang, Wen Tong
  • Patent number: 10019264
    Abstract: Methods and apparatuses relating to processors that contextually optimize instructions at runtime are disclosed. In one embodiment, a processor includes a fetch circuit to fetch an instruction from an instruction storage, a format of the instruction including an opcode, a first source operand identifier, and a second source operand identifier; wherein the instruction storage includes a sequence of sub-optimal instructions preceded by a start-of-sequence instruction and followed by an end-of-sequence instruction.
    Type: Grant
    Filed: February 24, 2016
    Date of Patent: July 10, 2018
    Assignee: Intel Corporation
    Inventors: Taylor W. Kidd, Matt S. Walsh
  • Patent number: 10019410
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: April 6, 2017
    Date of Patent: July 10, 2018
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 10013652
    Abstract: Deep Neural Networks (DNNs) with many hidden layers and many units per layer are very flexible models with a very large number of parameters. As such, DNNs are challenging to optimize. To achieve real-time computation, embodiments disclosed herein enable fast DNN feature transformation via optimized memory bandwidth utilization. To optimize memory bandwidth utilization, a rate of accessing memory may be reduced based on a batch setting. A memory, corresponding to a selected given output neuron of a current layer of the DNN, may be updated with an incremental output value computed for the selected given output neuron as a function of input values of a selected few non-zero input neurons of a previous layer of the DNN in combination with weights between the selected few non-zero input neurons and the selected given output neuron, wherein a number of the selected few corresponds to the batch setting.
    Type: Grant
    Filed: April 29, 2015
    Date of Patent: July 3, 2018
    Assignee: Nuance Communications, Inc.
    Inventors: Jan Vlietinck, Stephan Kanthak, Rudi Vuerinckx, Christophe Ris
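The batched sparse update described above — accumulating into an output neuron from only a few non-zero input neurons, with the count set by a batch parameter — can be sketched in plain Python. This is a conceptual illustration under assumed names (`sparse_layer_update`, row-major `weights`), not the optimized memory-bandwidth implementation of the patent:

```python
def sparse_layer_update(outputs, inputs, weights, batch):
    """Accumulate an incremental value into each output neuron using only
    the first `batch` non-zero input neurons of the previous layer,
    reducing the number of memory accesses per pass."""
    nz = [i for i, v in enumerate(inputs) if v != 0][:batch]  # selected few
    for j in range(len(outputs)):
        outputs[j] += sum(inputs[i] * weights[i][j] for i in nz)
    return outputs
```

Repeated passes over successive batches of non-zero inputs would complete the full layer transformation.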
  • Patent number: 9979904
    Abstract: The present invention relates to reading out sensor array pixels. In particular, the present invention provides an approach according to which only a region of interest may be read out from the sensor array, thus leading to substantial time savings. To achieve this, circuitry for configuring a region of interest for the sensor array is provided, as well as read-out circuitry for reading out pixels belonging to the region of interest. In addition, the corresponding methods for programming the region of interest and for reading out the region of interest are provided. The circuitry for programming and/or reading out the region of interest includes per-pixel storage elements for storing an indication of whether a pixel belongs to a region of interest (ROI). These are configured by the programming circuitry and used, when reading out the ROI, to read out only the pixels of the ROI.
    Type: Grant
    Filed: January 24, 2014
    Date of Patent: May 22, 2018
    Assignee: INNOVACIONES MICROELECTRÓNICAS S.L. (ANAFOCUS)
    Inventors: Rafael Dominguez Castro, Sergio Morillas Castillo, Rafael Romay Juárez, Fernando Medeiro Hidalgo
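The per-pixel ROI flags described above can be mimicked in a few lines: one step programs a flag per pixel, the other reads out only flagged pixels. A minimal sketch with hypothetical names (`program_roi`, `read_roi`), not the actual sensor circuitry:

```python
def program_roi(height, width, roi_coords):
    """Set a per-pixel storage element (flag) for every coordinate
    belonging to the region of interest."""
    flags = [[False] * width for _ in range(height)]
    for (r, c) in roi_coords:
        flags[r][c] = True
    return flags

def read_roi(frame, flags):
    """Read out only the pixels whose ROI flag is set, skipping the rest."""
    return [frame[r][c]
            for r in range(len(flags))
            for c in range(len(flags[0]))
            if flags[r][c]]
```

The time saving in hardware comes from never clocking out the unflagged pixels; the sketch only models the selection.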
  • Patent number: 9891912
    Abstract: An array processor includes a managing element having a load streaming unit coupled to multiple processing elements. The load streaming unit provides input data portions to each of a first subset of processing elements and receives output data from each of a second subset of the processing elements based on a comparatively sorted combination of the input data portions. Each processing element is configurable by the managing element to compare input data portions received from the load streaming unit or two or more of the other processing elements. Each processing element can further select an input data portion to be output data based on the comparison, and in response to selecting the input data portion, remove a queue entry corresponding to the selected input data portion. Each processing element can provide the selected output data portion to the managing element or as an input to one of the processing elements.
    Type: Grant
    Filed: October 31, 2014
    Date of Patent: February 13, 2018
    Assignee: International Business Machines Corporation
    Inventors: Ganesh Balakrishnan, Bartholomew Blaner, John J. Reilly, Jeffrey A. Stuecheli
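The compare-select-dequeue behavior described above amounts to a multi-way merge: each element compares queue heads, the selected entry is removed from its queue and forwarded. A minimal software sketch of that pattern (illustrative only; the patent describes hardware processing elements):

```python
from collections import deque

def merge_sorted_streams(streams):
    """Merge sorted input streams: at each step, compare the head of
    every non-empty queue, select the smallest as output, and remove
    the corresponding queue entry."""
    queues = [deque(s) for s in streams]
    merged = []
    while any(queues):
        i = min((k for k, q in enumerate(queues) if q),
                key=lambda k: queues[k][0])
        merged.append(queues[i].popleft())
    return merged
```

In the array processor, the comparisons are distributed across processing elements arranged in a tree rather than done by one loop.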
  • Patent number: 9841957
    Abstract: An apparatus stores a program including a description of loop processing of iterating a plurality of instructions, and rearranges an execution sequence of the plurality of instructions in the program such that the loop processing is pipelined by software pipelining. The apparatus inserts an instruction to use a register for single instruction multiple data (SIMD) extension instruction, into the description of the loop processing in the program.
    Type: Grant
    Filed: April 19, 2016
    Date of Patent: December 12, 2017
    Assignee: FUJITSU LIMITED
    Inventor: Shun Kamatsuka
  • Patent number: 9830164
    Abstract: A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergent point within an identified loop, a compiler arranges instructions within the identified loop into very large instruction words (VLIW's). At least one VLIW includes instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The compiler generates code wherein when executed assigns at runtime instructions within a given VLIW to multiple parallel execution lanes within a target processor. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on branch direction found at runtime for the given lane at the given divergent point. The target processor includes a vector register for storing indications indicating which given instruction within a fetched VLIW for an associated lane to execute.
    Type: Grant
    Filed: January 29, 2013
    Date of Patent: November 28, 2017
    Assignee: Advanced Micro Devices, Inc.
    Inventor: Reza Yazdani
  • Patent number: 9832478
    Abstract: One exemplary video encoding method has the following steps: determining a size of a parallel motion estimation region according to encoding related information; and encoding a plurality of pixels by at least performing motion estimation based on the size of the parallel motion estimation region. One exemplary video decoding method has the following steps: decoding a video parameter stream to obtain a decoded size of a parallel motion estimation region; checking validity of the decoded size of the parallel motion estimation region, and accordingly generating a checking result; when the checking result indicates that the decoded size of the parallel motion estimation region is invalid, entering an error handling process to decide a size of the parallel motion estimation; and decoding a plurality of pixels by at least performing motion estimation based on the decided size of the parallel motion estimation region.
    Type: Grant
    Filed: May 6, 2014
    Date of Patent: November 28, 2017
    Assignee: MEDIATEK INC.
    Inventors: Tung-Hsing Wu, Kun-Bin Lee
  • Patent number: 9804826
    Abstract: System and method for pseudo-random number generation based on a recursion with significantly increased multithreaded parallelism. A single pseudo-random generator program is assigned with multiple threads to process in parallel. N state elements indexed incrementally are arranged into a matrix comprising x rows, where a respective adjacent pair of state elements in a same column are related by g=(M+j)mod N, wherein j and g represent indexes of the pair of state elements. x can be determined through a modular multiplicative inverse of M and N. The matrix can be divided into sections with each section having a number of columns, and each thread is assigned with a section. In this manner, the majority of the requisite interactions among the state elements occur without expensive inter-thread communications, and further each thread may only need to communicate with a single other thread for a small number of times.
    Type: Grant
    Filed: December 5, 2014
    Date of Patent: October 31, 2017
    Assignee: Nvidia Corporation
    Inventors: Przemyslaw Tredak, John Clifton Woolley, Jr.
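The column relation above, g = (M + j) mod N between vertically adjacent state elements, can be sketched directly. This builds illustrative columns from chosen top indices; it is a toy demonstration of the index layout, not the patented generator (the choice of tops and of x via the modular inverse is omitted):

```python
def build_columns(N, M, x, tops):
    """Arrange state-element indices into columns of height x, where each
    entry g is derived from the entry j directly above it by
    g = (M + j) mod N."""
    cols = []
    for top in tops:
        col = [top]
        for _ in range(x - 1):
            col.append((M + col[-1]) % N)
        cols.append(col)
    return cols
```

Assigning a contiguous group of such columns to each thread keeps most state interactions thread-local, which is the point of the partitioning.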
  • Patent number: 9804848
    Abstract: Method, apparatus, and program for performing a string comparison operation. The apparatus includes execution resources to execute a first instruction. In response to the first instruction, the execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
    Type: Grant
    Filed: December 5, 2014
    Date of Patent: October 31, 2017
    Assignee: Intel Corporation
    Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
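The per-element comparison result described above can be modeled as an equality mask over two packed operands. A scalar Python sketch of the behavior (the real instruction operates on packed registers in one step; the function name is illustrative):

```python
def packed_compare_equal(op1, op2):
    """Compare each data element of two equal-length operands (here,
    characters of two text strings) and return one result per element."""
    assert len(op1) == len(op2), "operands must have the same element count"
    return [a == b for a, b in zip(op1, op2)]
```

String-search and parsing kernels typically reduce such a mask to a match position with a subsequent scan.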
  • Patent number: 9772864
    Abstract: When an OpenCL kernel is to be executed, a bitfield index representation to be used for the indices of the kernel invocations is determined based on the number of bits needed to represent the maximum value that will be needed for each index dimension for the kernel. A bitfield placement data structure 33 describing how the bitfield index representation is partitioned is then prepared together with a maximum value data structure 32 indicating the maximum index dimension values to be used for the kernel. A processor then executes the kernel invocations 36 across the index space indicated by the maximum value data structure 32. A bitfield index representation 35, 37, 38 configured in accordance with the bitfield placement data structure 33 is associated with each kernel invocation to indicate its index.
    Type: Grant
    Filed: April 16, 2013
    Date of Patent: September 26, 2017
    Assignee: ARM LIMITED
    Inventor: Jorn Nystad
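The bitfield index representation above — sizing each dimension's field by the bits its maximum value needs, then packing all dimensions into one word — can be sketched as follows. Names (`widths_for`, `pack`, `unpack`) are illustrative, not the actual data structures 32/33/35 of the patent:

```python
def widths_for(max_indices):
    """Bits needed per index dimension to represent its maximum value."""
    return [max(1, m.bit_length()) for m in max_indices]

def pack(idx, widths):
    """Pack a multi-dimensional index into a single bitfield word."""
    value, shift = 0, 0
    for v, w in zip(idx, widths):
        value |= v << shift
        shift += w
    return value

def unpack(value, widths):
    """Recover the per-dimension indices from the bitfield word."""
    idx = []
    for w in widths:
        idx.append(value & ((1 << w) - 1))
        value >>= w
    return tuple(idx)
```

Each kernel invocation then carries one packed word instead of one full-width counter per dimension.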
  • Patent number: 9766888
    Abstract: A processor of an aspect includes packed data registers, and a decode unit to decode an instruction. The instruction may indicate a first source packed data to include at least four data elements, indicate a second source packed data to include at least four data elements, and indicate a destination storage location. An execution unit is coupled with the packed data registers and the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination storage location. The result packed data may include at least four indexes that may identify corresponding data element positions in the first and second source packed data. The indexes may be stored in positions in the result packed data that are to represent a sorted order of corresponding data elements in the first and second source packed data.
    Type: Grant
    Filed: March 28, 2014
    Date of Patent: September 19, 2017
    Assignee: Intel Corporation
    Inventors: Shay Gueron, Vlad Krasnov
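The result described above — indexes stored in positions that represent the sorted order of the combined source elements — can be sketched in one function. A software analogue only; the instruction computes this over packed registers:

```python
def sorted_order_indexes(src1, src2):
    """Return indexes into the concatenation of both source operands,
    arranged in the order that would sort the data elements."""
    combined = list(src1) + list(src2)
    return sorted(range(len(combined)), key=lambda i: combined[i])
```

A gather using these indexes then produces the fully sorted sequence, which is why such an instruction accelerates small sorting networks.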
  • Patent number: 9760530
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: March 31, 2017
    Date of Patent: September 12, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 9760531
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: April 6, 2017
    Date of Patent: September 12, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 9727380
    Abstract: Global register protection in a multi-threaded processor is described. In an embodiment, global resources within a multi-threaded processor are protected by performing checks, before allowing a thread to write to a global resource, to determine whether the thread has write access to the particular global resource. The check involves accessing one or more local control registers or a global control field within the multi-threaded processor and in an example, a local register associated with each other thread in the multi-threaded processor is accessed and checked to see whether it contains an identifier for the particular global resource. Only if none of the accessed local registers contains such an identifier is the instruction issued and the thread allowed to write to the global resource. Otherwise, the instruction is blocked and an exception may be raised to alert the program that issued the instruction that the write failed.
    Type: Grant
    Filed: February 19, 2015
    Date of Patent: August 8, 2017
    Assignee: Imagination Technologies Limited
    Inventors: Guixin Wang, Hugh Jackson, Robert Graham Isherwood
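The write-access check described above — consult every other thread's local control register for an identifier of the target global resource, and block the write if any is found — can be sketched as a predicate. Names are illustrative; the patent performs this in hardware before instruction issue:

```python
def may_write(thread_id, resource_id, local_control):
    """Grant the write only if no *other* thread's local control register
    contains an identifier for the target global resource."""
    return all(resource_id not in regs
               for tid, regs in local_control.items()
               if tid != thread_id)
```

A blocked write would raise an exception to the issuing program rather than silently failing.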
  • Patent number: 9710172
    Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Further, communicating synchronous I/O commands includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated and issuing a fourth synchronous I/O command with a second completion bit set in response to the first mailbox command being initiated.
    Type: Grant
    Filed: June 14, 2016
    Date of Patent: July 18, 2017
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
  • Patent number: 9710171
    Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Further, communicating synchronous I/O commands includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated and issuing a fourth synchronous I/O command with a second completion bit set in response to the first mailbox command being initiated.
    Type: Grant
    Filed: October 1, 2015
    Date of Patent: July 18, 2017
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
  • Patent number: 9703966
    Abstract: A data processing system includes a single instruction multiple data register file and single instruction multiple data processing circuitry. The single instruction multiple data processing circuitry supports execution of cryptographic processing instructions for performing parts of a hash algorithm. The operands are stored within the single instruction multiple data register file. The cryptographic support instructions do not follow normal lane-based processing and generate output operands in which the different portions of the output operand depend upon multiple different elements within the input operand.
    Type: Grant
    Filed: July 7, 2015
    Date of Patent: July 11, 2017
    Assignee: ARM LIMITED
    Inventors: Matthew James Horsnell, Richard Roy Grisenthwaite, Stuart David Biles, Daniel Kershaw
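The non-lane-based behavior above — each portion of the output depending on multiple elements of the input, as hash schedules require — can be illustrated with a toy cross-lane mix. This is not any real hash step, just a demonstration of an operation that ordinary per-lane SIMD semantics cannot express:

```python
def cross_lane_mix(lanes):
    """Unlike a lane-based SIMD op, each output element here depends on
    more than one input element: its own lane XORed with the next lane
    (wrapping around)."""
    n = len(lanes)
    return [lanes[i] ^ lanes[(i + 1) % n] for i in range(n)]
```

Hash algorithms such as SHA-1/SHA-256 chain many such cross-element dependencies, which is why dedicated instructions beat plain vector ops.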
  • Patent number: 9652435
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: October 18, 2016
    Date of Patent: May 16, 2017
    Assignee: CORNAMI, INC.
    Inventors: Solomon Harsha, Paul Master