Patents Examined by Michael J Metzger

Robust, efficient multiprocessor-coprocessor interface

Patent number: 11138009

Abstract: Systems and methods for an efficient and robust multiprocessor-coprocessor interface that may be used between a streaming multiprocessor and an acceleration coprocessor in a GPU are provided. According to an example implementation, in order to perform an acceleration of a particular operation using the coprocessor, the multiprocessor: issues a series of write instructions to write input data for the operation into coprocessor-accessible storage locations; issues an operation instruction to cause the coprocessor to execute the particular operation; and then issues a series of read instructions to read result data of the operation from coprocessor-accessible storage locations to multiprocessor-accessible storage locations.
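The write, operate, read sequence described in the abstract can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: `Coprocessor`, its methods and the `sum_and_max` operation are hypothetical stand-ins, and real hardware would stage data in registers or shared storage rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Coprocessor:
    """Hypothetical accelerator with its own input and result storage."""
    operations: Dict[str, Callable[[List[float]], List[float]]]
    inputs: List[float] = field(default_factory=list)
    results: List[float] = field(default_factory=list)

    def write(self, value: float) -> None:
        # Write instruction: stage one input word in coprocessor-accessible storage.
        self.inputs.append(value)

    def execute(self, op_name: str) -> None:
        # Operation instruction: run the operation on the staged inputs.
        self.results = self.operations[op_name](self.inputs)
        self.inputs = []

    def read(self, index: int) -> float:
        # Read instruction: copy one result word back to the multiprocessor side.
        return self.results[index]


def accelerate(cop: Coprocessor, op_name: str, data: List[float], n_results: int) -> List[float]:
    """Multiprocessor side: a series of writes, one operation, then a series of reads."""
    for value in data:
        cop.write(value)
    cop.execute(op_name)
    return [cop.read(i) for i in range(n_results)]


cop = Coprocessor({"sum_and_max": lambda xs: [sum(xs), max(xs)]})
print(accelerate(cop, "sum_and_max", [1.0, 4.0, 2.0], 2))  # [7.0, 4.0]
```

Keeping the three phases as separate calls mirrors the point of the interface: the multiprocessor never needs a single wide instruction carrying all operands at once.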

Type: Grant

Filed: August 10, 2018

Date of Patent: October 5, 2021

Assignee: NVIDIA CORPORATION

Inventors: Ronald Charles Babich, Jr., John Burgess, Jack Choquette, Tero Karras, Samuli Laine, Ignacio Llamas, Gregory Muthler, William Parsons Newhall, Jr.
Computer processor for higher precision computations using a mixed-precision decomposition of operations

Patent number: 11126428

Abstract: Embodiments detailed herein relate to arithmetic operations on floating-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, the values of which are in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower-precision values, where an exponent is to be stored for each operand; perform arithmetic operations among the lower-precision values converted from the values of the plurality of operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.
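A software sketch of the decomposition idea, under assumptions the abstract does not state: fp32 operands, three bfloat16-width pieces per operand (formed by zeroing the low 16 bits of the fp32 encoding), and `math.frexp` to pull out the per-operand exponent. The helper names are hypothetical. Because each piece has at most 8 significant bits, every partial product is exact, so summing all nine reproduces the exact product before the single final rounding.

```python
import math
import struct
from typing import List, Tuple


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def truncate_to_bf16(x: float) -> float:
    """Zero the low 16 bits of the fp32 encoding, leaving a bfloat16-sized value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def decompose(x: float) -> Tuple[int, List[float]]:
    """Split an fp32 value into a stored exponent and three low-precision pieces."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    pieces, residual = [], mantissa
    for _ in range(3):
        piece = truncate_to_bf16(residual)
        pieces.append(piece)
        residual -= piece  # exact: the residual loses 8 significant bits each time
    return exponent, pieces


def mixed_precision_mul(a: float, b: float) -> float:
    ea, pa = decompose(a)
    eb, pb = decompose(b)
    # Multiply all pairs of low-precision pieces; each partial product is exact.
    acc = sum(x * y for x in pa for y in pb)
    # Re-apply the stored exponents and round once to fp32.
    return to_fp32(math.ldexp(acc, ea + eb))


a, b = to_fp32(1.1), to_fp32(3.7)
print(mixed_precision_mul(a, b) == to_fp32(a * b))  # True
```

Hardware following this idea would likely drop the smallest cross terms to save work; the sketch keeps all nine so the result is easy to check.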

Type: Grant

Filed: December 17, 2020

Date of Patent: September 21, 2021

Assignee: INTEL CORPORATION

Inventors: Gregory Henry, Alexander Heinecke
Memory-based distributed processor architecture

Patent number: 11126511

Abstract: Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
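The two bus families can be illustrated with a toy model in which each processor subunit may read only its own bank and may write into a connected subunit's bank. This is a sketch under assumptions: the linear chain topology, the `MAILBOX` address and all names are hypothetical, chosen only to show data staying local while partial results travel over the subunit-to-subunit buses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # identity comparison: each subunit is a distinct piece of hardware
class ProcessorSubunit:
    uid: int
    bank: Dict[int, int] = field(default_factory=dict)  # dedicated memory bank
    neighbors: List["ProcessorSubunit"] = field(default_factory=list)

    def load(self, addr: int) -> int:
        # First bus family: a subunit talks only to its own bank.
        return self.bank.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        self.bank[addr] = value

    def send(self, dst: "ProcessorSubunit", addr: int, value: int) -> None:
        # Second bus family: subunit to connected subunit, written into the receiver's bank.
        if dst not in self.neighbors:
            raise ValueError(f"no bus from subunit {self.uid} to {dst.uid}")
        dst.store(addr, value)


def build_chain(n: int) -> List[ProcessorSubunit]:
    units = [ProcessorSubunit(i) for i in range(n)]
    for left, right in zip(units, units[1:]):
        left.neighbors.append(right)  # hypothetical linear topology
    return units


MAILBOX = 1_000  # hypothetical address reserved for the incoming partial sum


def chained_sum(units: List[ProcessorSubunit], data_per_unit: List[List[int]]) -> int:
    # Each subunit's share of the data lives only in its own bank.
    for unit, values in zip(units, data_per_unit):
        for addr, value in enumerate(values):
            unit.store(addr, value)
    total = 0
    for i, unit in enumerate(units):
        local = sum(unit.load(a) for a in range(len(data_per_unit[i])))
        total = local + unit.load(MAILBOX)  # partial sum received from the left neighbour
        if i + 1 < len(units):
            unit.send(units[i + 1], MAILBOX, total)
    return total


print(chained_sum(build_chain(3), [[1, 2], [3], [4, 5, 6]]))  # 21
```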

Type: Grant

Filed: July 16, 2019

Date of Patent: September 21, 2021

Assignee: NeuroBlade, Ltd.

Inventors: Elad Sity, Eliad Hillel
SM3 hash algorithm acceleration processors, methods, systems, and instructions

Patent number: 11128443

Abstract: A processor includes a decode unit to decode an SM3 two round state word update instruction. The instruction is to indicate one or more source packed data operands. The source packed data operand(s) are to have eight 32-bit state words Aj, Bj, Cj, Dj, Ej, Fj, Gj, and Hj that are to correspond to a round (j) of an SM3 hash algorithm. The source packed data operand(s) are also to have a set of messages sufficient to evaluate two rounds of the SM3 hash algorithm. An execution unit coupled with the decode unit is operable, in response to the instruction, to store one or more result packed data operands, in one or more destination storage locations. The result packed data operand(s) are to have at least four two-round updated 32-bit state words Aj+2, Bj+2, Ej+2, and Fj+2, which are to correspond to a round (j+2) of the SM3 hash algorithm.
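The Python model below runs two consecutive rounds of the SM3 compression function as defined in the public SM3 standard (GB/T 32905-2016), which is the computation the instruction fuses. It is a reference sketch rather than the hardware datapath, the function names are illustrative, and the message words are arbitrary stand-ins. The final check shows why four result words suffice: after two rounds, C, D, G and H are rotations of the round-j words A, B, E and F, which the source operands already hold.

```python
MASK = 0xFFFFFFFF
IV = (0x7380166F, 0x4914B2B9, 0x172442D7, 0xDA8A0600,
      0xA96F30BC, 0x163138AA, 0xE38DEE4D, 0xB0FB0E4E)  # SM3 initial value


def rotl(x: int, n: int) -> int:
    """32-bit left rotation."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK


def p0(x: int) -> int:
    return x ^ rotl(x, 9) ^ rotl(x, 17)


def ff(j: int, x: int, y: int, z: int) -> int:
    return x ^ y ^ z if j < 16 else (x & y) | (x & z) | (y & z)


def gg(j: int, x: int, y: int, z: int) -> int:
    return x ^ y ^ z if j < 16 else (x & y) | (~x & z)


def sm3_round(j, state, w_j, w_j_plus_4):
    a, b, c, d, e, f, g, h = state
    t_j = 0x79CC4519 if j < 16 else 0x7A879D8A
    ss1 = rotl((rotl(a, 12) + e + rotl(t_j, j)) & MASK, 7)
    ss2 = ss1 ^ rotl(a, 12)
    tt1 = (ff(j, a, b, c) + d + ss2 + (w_j ^ w_j_plus_4)) & MASK  # W'_j = W_j xor W_{j+4}
    tt2 = (gg(j, e, f, g) + h + ss1 + w_j) & MASK
    return (tt1, a, rotl(b, 9), c, p0(tt2), e, rotl(f, 19), g)


def sm3_two_rounds(j, state, w):
    """Advance the state from round j to round j+2; w holds the 68 expanded W words."""
    mid = sm3_round(j, state, w[j], w[j + 4])
    return sm3_round(j + 1, mid, w[j + 1], w[j + 5])


w = [(0x01000193 * (i + 1)) & MASK for i in range(68)]  # arbitrary stand-in message words
a, b, c, d, e, f, g, h = IV
new = sm3_two_rounds(0, IV, w)
print(new[2] == rotl(a, 9), new[3] == rotl(b, 9), new[6] == rotl(e, 19), new[7] == rotl(f, 19))
# True True True True
```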

Type: Grant

Filed: November 6, 2020

Date of Patent: September 21, 2021

Assignee: Intel Corporation

Inventors: Shay Gueron, Vlad Krasnov
Event messaging in a system having a self-scheduling processor and a hybrid threading fabric

Patent number: 11126587

Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
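The same abstract appears for the two Micron patents that follow. A toy model of the scheduling idea: work starts because a work descriptor packet arrives, and a fiber create reserves return-argument slots before sending descriptors to other cores. This is an illustrative sketch only; `WorkDescriptor`, `CoreControl`, the packet fields and the round-robin distribution are all assumptions, not Micron's design.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Optional, Tuple


@dataclass
class WorkDescriptor:
    """Hypothetical work descriptor packet: where to start and with what arguments."""
    program_counter: int
    args: Tuple
    return_slot: Optional[int] = None


@dataclass
class CoreControl:
    program: Dict[int, Callable]
    ready: Deque[WorkDescriptor] = field(default_factory=deque)
    thread_control_memory: Dict[int, object] = field(default_factory=dict)
    next_slot: int = 0

    def receive(self, packet: WorkDescriptor) -> None:
        # Arrival of a packet is what schedules the work.
        self.ready.append(packet)

    def fiber_create(self, pc: int, arg_list: List[Tuple], peers: List["CoreControl"]) -> List[int]:
        slots = []
        for i, args in enumerate(arg_list):
            slot = self.next_slot
            self.next_slot += 1
            self.thread_control_memory[slot] = None  # reserve space for the return value
            peers[i % len(peers)].receive(WorkDescriptor(pc, args, slot))
            slots.append(slot)
        return slots

    def run(self, parent: "CoreControl") -> None:
        # Execute every scheduled thread and return its result to the parent.
        while self.ready:
            packet = self.ready.popleft()
            result = self.program[packet.program_counter](*packet.args)
            if packet.return_slot is not None:
                parent.thread_control_memory[packet.return_slot] = result


program = {0: lambda x, y: x * y}
parent = CoreControl(program)
workers = [CoreControl(program), CoreControl(program)]
slots = parent.fiber_create(0, [(2, 3), (4, 5), (6, 7)], workers)
for worker in workers:
    worker.run(parent)
print([parent.thread_control_memory[s] for s in slots])  # [6, 20, 42]
```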

Type: Grant

Filed: April 30, 2019

Date of Patent: September 21, 2021

Assignee: Micron Technology, Inc.

Inventor: Tony M. Brewer
Multi-threaded, self-scheduling processor

Patent number: 11119972

Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Type: Grant

Filed: April 30, 2019

Date of Patent: September 14, 2021

Assignee: Micron Technology, Inc.

Inventor: Tony M. Brewer
Thread commencement using a work descriptor packet in a self-scheduling processor

Patent number: 11119782

Abstract: Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Type: Grant

Filed: April 30, 2019

Date of Patent: September 14, 2021

Assignee: Micron Technology, Inc.

Inventor: Tony M. Brewer
Processor with processing cores each including arithmetic unit array

Patent number: 11119765

Abstract: A processor having a systolic array that can perform operations efficiently is provided. The processor includes multiple processing cores aligned in a matrix, and each of the processing cores includes an arithmetic unit array including multiple arithmetic units that can form a systolic array. Each of the processing cores includes a first memory that stores first data, a second memory that stores second data, a first multiplexer that connects a first input for receiving the first data at the arithmetic unit array to an output of the first memory in the processing core or an output of the arithmetic unit array in an adjacent processing core, and a second multiplexer that connects a second input for receiving the second data at the arithmetic unit array to an output of the second memory in the processing core or an output of the arithmetic unit array in an adjacent processing core.
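A cycle-by-cycle sketch of an output-stationary systolic matrix multiply, simulated at the level of individual arithmetic units rather than whole cores. The `if j == 0` and `if i == 0` branches stand in for the two multiplexers: edge units take operands from local memory, interior units take the neighbouring unit's output from the previous cycle. The dataflow and operand skewing are assumptions; the abstract does not specify them.

```python
from typing import List

Matrix = List[List[float]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per arithmetic unit
    a_out = [[0] * m for _ in range(n)]  # value each unit passes to its right neighbour
    b_out = [[0] * m for _ in range(n)]  # value each unit passes to the unit below
    for t in range(n + m + k - 2):
        next_a = [[0] * m for _ in range(n)]
        next_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # First multiplexer: edge units read local memory (skewed by row),
                # interior units read the neighbour's output from the previous cycle.
                if j == 0:
                    s = t - i
                    a_in = a[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_out[i][j - 1]
                # Second multiplexer: the same choice for the other operand.
                if i == 0:
                    s = t - j
                    b_in = b[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_out[i - 1][j]
                acc[i][j] += a_in * b_in
                next_a[i][j], next_b[i][j] = a_in, b_in
        a_out, b_out = next_a, next_b
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Element `a[i][s]` reaches unit (i, j) at cycle s + i + j, the same cycle as `b[s][j]`, which is why the row and column skews line up.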

Type: Grant

Filed: November 1, 2019

Date of Patent: September 14, 2021

Assignee: Preferred Networks, Inc.

Inventor: Tanvir Ahmed
Speculative instruction wakeup to tolerate draining delay of memory ordering violation check buffers

Patent number: 11113065

Abstract: A technique for speculatively executing load-dependent instructions includes detecting that a memory ordering consistency queue is full for a completed load instruction. The technique also includes storing data loaded by the completed load instruction into a storage location for storing data when the memory ordering consistency queue is full. The technique further includes speculatively executing instructions that are dependent on the completed load instruction. The technique also includes in response to a slot becoming available in the memory ordering consistency queue, replaying the load instruction. The technique further includes in response to receiving loaded data for the replayed load instruction, testing for a data mis-speculation by comparing the loaded data for the replayed load instruction with the data loaded by the completed load instruction that is stored in the storage location.
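A toy model of the replay check: when the ordering queue is full, the loaded value is kept in a side buffer so dependents can proceed, and once a slot frees the load is replayed and compared. All names are hypothetical, and the oldest-first replay order is an assumption.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class LoadOrderingModel:
    """Toy model of a full memory-ordering queue plus a side buffer for loaded data."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.ordering_queue: Deque[Tuple[int, int]] = deque()  # (load id, address)
        self.side_buffer: Dict[int, Tuple[int, int]] = {}      # load id -> (address, data)

    def complete_load(self, load_id: int, addr: int, memory: Dict[int, int]) -> int:
        data = memory[addr]
        if len(self.ordering_queue) < self.capacity:
            self.ordering_queue.append((load_id, addr))
        else:
            # Queue full: keep the loaded value so dependents can still wake up.
            self.side_buffer[load_id] = (addr, data)
        return data  # dependents consume this value speculatively

    def free_slot(self, memory: Dict[int, int]) -> Tuple[int, bool]:
        """Retire the oldest entry, replay the oldest buffered load, report mis-speculation."""
        self.ordering_queue.popleft()
        load_id = next(iter(self.side_buffer))  # dicts preserve insertion order
        addr, old_data = self.side_buffer.pop(load_id)
        self.ordering_queue.append((load_id, addr))
        replayed = memory[addr]
        return load_id, replayed != old_data


memory = {0x10: 5, 0x20: 7}
lq = LoadOrderingModel(capacity=1)
lq.complete_load(1, 0x10, memory)      # fits in the queue
x = lq.complete_load(2, 0x20, memory)  # queue full: value goes to the side buffer
y = x + 1                              # dependent instruction runs speculatively
memory[0x20] = 9                       # another agent writes the location
print(lq.free_slot(memory))            # (2, True): y must be squashed and re-executed
```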

Type: Grant

Filed: October 31, 2019

Date of Patent: September 7, 2021

Assignee: Advanced Micro Devices, Inc.

Inventors: John Kalamatianos, Susumu Mashimo, Krishnan V. Ramani, Scott Thomas Bingham
Processor and control methods thereof for performing deep learning

Patent number: 11093439

Abstract: A processor for performing deep learning is provided herein. The processor includes a processing element unit including a plurality of processing elements arranged in a matrix form including a first row of processing elements and a second row of processing elements. The processing elements are fed with filter data by a first data input unit, which is connected to the processing elements of the first row. A second data input unit feeds target data to the processing elements. A shifter composed of registers feeds instructions to the processing elements. A controller in the processor controls the processing elements, the first data input unit and the second data input unit to process the filter data and target data, thus providing sum-of-products (convolution) functionality.
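The sum-of-products the processing elements compute can be sketched as a grid of accumulators, one per output pixel, that receives one filter tap per step together with the matching target data. This assumes each tap reaches every PE in the step it is used; the patent's actual routing through the first row and the instruction shifter are not modelled, and `pe_array_conv2d` is a hypothetical name. As is usual in deep-learning frameworks, the "convolution" is computed as a cross-correlation.

```python
from typing import List

Grid = List[List[float]]


def pe_array_conv2d(target: Grid, kernel: Grid) -> Grid:
    """Valid 2-D convolution (cross-correlation) with one accumulator per PE."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(target) - kh + 1, len(target[0]) - kw + 1
    acc = [[0] * ow for _ in range(oh)]  # PE (r, c) owns output pixel (r, c)
    for u in range(kh):
        for v in range(kw):
            weight = kernel[u][v]  # one filter tap per step, delivered to every PE
            for r in range(oh):
                for c in range(ow):
                    # Target data for this tap; each PE does one multiply-accumulate.
                    acc[r][c] += weight * target[r + u][c + v]
    return acc


print(pe_array_conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))  # [[6, 8], [12, 14]]
```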

Type: Grant

Filed: September 27, 2018

Date of Patent: August 17, 2021

Assignee: SAMSUNG ELECTRONICS CO., LTD.

Inventors: Kyoung-hoon Kim, Young-hwan Park, Dong-kwan Suh, Keshava prasad Nagaraja, Suk-jin Kim, Han-su Cho, Hyun-jung Kim
Systems and methods for performing instructions to convert to 16-bit floating-point format

Patent number: 11068263

Abstract: Disclosed embodiments relate to systems and methods for performing instructions to convert to 16-bit floating-point format. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a first source vector comprising N single-precision elements, and a destination vector comprising at least N 16-bit floating-point elements, the opcode to indicate execution circuitry is to convert each of the elements of the specified source vector to 16-bit floating-point, the conversion to include truncation and rounding, as necessary, and to store each converted element into a corresponding location of the specified destination vector, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.
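The same abstract covers the next patent as well. A per-element sketch of the conversion, assuming the 16-bit target is bfloat16 (the abstract only says "16-bit floating-point"), with round-to-nearest-even and NaNs forced to quiet NaNs. Real hardware may treat denormals and NaN payloads differently, and the helper names are hypothetical.

```python
import math
import struct
from typing import List


def fp32_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fp32_to_bf16_bits(x: float) -> int:
    """Convert an fp32 value to a bfloat16 bit pattern, rounding to nearest even."""
    bits = fp32_bits(x)
    if math.isnan(x):
        return (bits >> 16) | 0x0040  # truncate and force a quiet NaN
    lsb = (bits >> 16) & 1              # last bit that survives truncation
    return (bits + 0x7FFF + lsb) >> 16  # ties go to the even neighbour


def bf16_bits_to_float(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


def convert_vector(src: List[float]) -> List[int]:
    # One destination element per source element.
    return [fp32_to_bf16_bits(x) for x in src]


print([hex(b) for b in convert_vector([1.0, 1 + 2**-8, 1 + 2**-7 + 2**-8])])
# ['0x3f80', '0x3f80', '0x3f82']
```

The second and third inputs sit exactly halfway between two bfloat16 values, so they show the "truncation and rounding" step resolving ties toward an even last bit.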

Type: Grant

Filed: December 23, 2020

Date of Patent: July 20, 2021

Assignee: Intel Corporation

Inventors: Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Raanan Sade, Menachem Adelman, Zeev Sperber, Amit Gradstein, Simon Rubanovich
Systems and methods for performing instructions to convert to 16-bit floating-point format

Patent number: 11068262

Abstract: Disclosed embodiments relate to systems and methods for performing instructions to convert to 16-bit floating-point format. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a first source vector comprising N single-precision elements, and a destination vector comprising at least N 16-bit floating-point elements, the opcode to indicate execution circuitry is to convert each of the elements of the specified source vector to 16-bit floating-point, the conversion to include truncation and rounding, as necessary, and to store each converted element into a corresponding location of the specified destination vector, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.

Type: Grant

Filed: December 23, 2020

Date of Patent: July 20, 2021

Assignee: Intel Corporation

Inventors: Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Raanan Sade, Menachem Adelman, Zeev Sperber, Amit Gradstein, Simon Rubanovich
Vector reduction processor

Patent number: 11061854

Abstract: A vector reduction circuit configured to reduce an input vector of elements comprises a plurality of cells, wherein each of the plurality of cells other than a designated first cell that receives a designated first element of the input vector is configured to receive a particular element of the input vector, receive, from another of the one or more cells, a temporary reduction element, perform a reduction operation using the particular element and the temporary reduction element, and provide, as a new temporary reduction element, a result of performing the reduction operation using the particular element and the temporary reduction element. The vector reduction circuit also comprises an output circuit configured to provide, for output as a reduction of the input vector, a new temporary reduction element corresponding to a result of performing the reduction operation using a last element of the input vector.
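The cell chain behaves like a left fold. The sketch below assumes a simple sequential chain (the abstract does not fix the layout) and returns each cell's temporary reduction element alongside the final result so the data flow is visible; `cell_chain_reduce` is a hypothetical name.

```python
import operator
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cell_chain_reduce(vector: Sequence[T], op: Callable[[T, T], T]) -> Tuple[T, List[T]]:
    """Pass a temporary reduction element down a chain of cells."""
    if not vector:
        raise ValueError("input vector must not be empty")
    temp = vector[0]  # designated first cell: forwards the first element
    trace = [temp]
    for element in vector[1:]:
        temp = op(element, temp)  # each other cell combines its element with the incoming value
        trace.append(temp)
    return temp, trace  # output circuit: the value after the last element


print(cell_chain_reduce([3, 1, 4, 1, 5], operator.add))  # (14, [3, 4, 8, 9, 14])
print(cell_chain_reduce([3, 1, 4, 1, 5], max)[0])        # 5
```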

Type: Grant

Filed: July 1, 2020

Date of Patent: July 13, 2021

Assignee: Google LLC

Inventors: Gregory Michael Thorson, Andrew Everett Phelps, Olivier Temam
Advanced processor architecture

Patent number: 11061682

Abstract: The invention relates to a method for processing instructions out-of-order on a processor comprising an arrangement of execution units. The inventive method comprises looking up operand sources in a Register Positioning Table and setting operand input references of the instruction to be issued accordingly, checking for an Execution Unit (EXU) available for receiving a new instruction, and issuing the instruction to the available Execution Unit and entering a reference of the result register addressed by the instruction to be issued to the Execution Unit into the Register Positioning Table (RPT).
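A toy version of the three issue steps: look up each source in a Register Positioning Table, find a free Execution Unit, then record the destination register against that unit. Execution is instantaneous here, so a producer's result is always ready when a consumer reads it; timing, retirement and stalls beyond "no free unit" are not modelled, and all names are hypothetical.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Union


@dataclass
class ExecutionUnit:
    name: str
    busy: bool = False
    result: Optional[int] = None
    operand_refs: List[Union[str, "ExecutionUnit"]] = field(default_factory=list)


class Issuer:
    def __init__(self, n_units: int, registers: Dict[str, int]):
        self.units = [ExecutionUnit(f"EXU{i}") for i in range(n_units)]
        self.rpt: Dict[str, ExecutionUnit] = {}  # register -> EXU that produces it
        self.registers = registers

    def issue(self, dest: str, op: Callable[..., int], srcs: List[str]) -> Optional[ExecutionUnit]:
        # 1. Look up each source in the RPT: either a producing EXU or the register file.
        refs = [self.rpt.get(s, s) for s in srcs]
        # 2. Find an execution unit that can accept a new instruction.
        unit = next((u for u in self.units if not u.busy), None)
        if unit is None:
            return None  # no free EXU: the instruction waits
        # 3. Issue, then record that `dest` will come from this unit.
        values = [r.result if isinstance(r, ExecutionUnit) else self.registers[r] for r in refs]
        unit.busy, unit.operand_refs, unit.result = True, refs, op(*values)
        self.rpt[dest] = unit
        return unit


iss = Issuer(2, {"r1": 2, "r2": 3})
u0 = iss.issue("r3", operator.add, ["r1", "r2"])
u1 = iss.issue("r4", operator.mul, ["r3", "r1"])
print(u1.result, u1.operand_refs[0] is u0, iss.issue("r5", operator.sub, ["r4", "r1"]))
# 10 True None
```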

Type: Grant

Filed: December 13, 2015

Date of Patent: July 13, 2021

Inventor: Martin Vorbach
Processor, and method for processing information applied to processor

Patent number: 11055100

Abstract: Embodiments of the present disclosure relate to a method for processing information, and a processor. The processor includes an arithmetic and logic unit, a bypass unit, a queue unit, a multiplexer, and a register file. The bypass unit includes a data processing subunit, which is configured to acquire at least one valid processing result output by the arithmetic and logic unit, determine one processing result from among the valid processing results, output the determined processing result to the multiplexer, and output the remaining valid processing results to the queue unit. The multiplexer is configured to sequentially output more than one valid processing result to the register file.
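A cycle-level sketch, assuming a register file with one write port per cycle. The bypass passes one result straight to the multiplexer and queues the rest; the multiplexer prefers the direct path and drains the queue in otherwise idle cycles. The selection and priority policies are assumptions, and the toy ignores write-after-write ordering between queued and newer results.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Result = Tuple[str, int]  # (destination register, value)


def simulate_writeback(per_cycle: List[List[Result]], drain_cycles: int = 4):
    """Bypass unit + queue + multiplexer feeding a one-write-per-cycle register file."""
    register_file: Dict[str, int] = {}
    queue: Deque[Result] = deque()
    log = []
    for cycle, results in enumerate(per_cycle + [[] for _ in range(drain_cycles)]):
        direct = None
        if results:
            direct, *rest = results  # assumed policy: the first valid result goes straight through
            queue.extend(rest)       # the others wait in the queue unit
        # Multiplexer: one register-file write per cycle, direct path first, then the queue.
        if direct is not None:
            chosen = direct
        elif queue:
            chosen = queue.popleft()
        else:
            continue
        reg, value = chosen
        register_file[reg] = value
        log.append((cycle, reg, value))
    return register_file, log


rf, log = simulate_writeback([[("r1", 1), ("r2", 2), ("r3", 3)], [("r4", 4)]])
print(log)  # [(0, 'r1', 1), (1, 'r4', 4), (2, 'r2', 2), (3, 'r3', 3)]
```

The ALU never stalls in this model even when it produces three results in one cycle; the cost is only that the extra results reach the register file a few cycles later.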

Type: Grant

Filed: July 3, 2019

Date of Patent: July 6, 2021

Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.

Inventor: Jian Ouyang
Commit window move element

Patent number: 11048609

Abstract: A trace module has monitoring circuitry for monitoring processing of instructions by processing circuitry, and trace output circuitry for outputting a sequence of elements indicative of outcomes of the processing of instructions by the processing circuitry. The trace module supports output of a commit window move element indicating that a commit window, representing a portion of the trace stream comprising at least one speculative element representing at least one speculatively executed instruction, should move while the oldest remaining speculative element of the trace stream remains uncommitted. This can be useful for tracing of transactional memory functionality within program code.

Type: Grant

Filed: December 10, 2018

Date of Patent: June 29, 2021

Assignee: Arm Limited

Inventor: Michael John Gibbs
Systems and methods for performing 16-bit floating-point vector dot product instructions

Patent number: 11036504

Abstract: Disclosed embodiments relate to systems and methods for performing 16-bit floating-point vector dot product instructions. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of first source, second source, and destination vectors, the opcode to indicate execution circuitry is to multiply N pairs of 16-bit floating-point formatted elements of the specified first and second sources, and accumulate the resulting products with previous contents of a corresponding single-precision element of the specified destination, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.
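A per-lane sketch of the multiply-accumulate, assuming bfloat16 inputs and two 16-bit elements per single-precision destination element, loosely modelled on the pairing used by Intel's AVX-512 BF16 `VDPBF16PS` instruction. Rounding to fp32 after each addition is a simplification; the exact rounding and denormal behaviour of real hardware is not modelled, and the helper names are hypothetical.

```python
import struct
from typing import List


def bf16_to_fp32(b: int) -> float:
    """Widen a bfloat16 bit pattern to fp32 (exact: append 16 zero bits)."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]


def round_fp32(x: float) -> float:
    return struct.unpack(">f", struct.pack(">f", x))[0]


def dot_bf16_pairs(acc: List[float], src1: List[int], src2: List[int]) -> List[float]:
    """For each fp32 element i, add src1[2i]*src2[2i] + src1[2i+1]*src2[2i+1]."""
    if len(src1) != 2 * len(acc) or len(src2) != 2 * len(acc):
        raise ValueError("each source needs two bf16 elements per fp32 destination element")
    out = []
    for i, total in enumerate(acc):
        for k in (2 * i, 2 * i + 1):
            product = bf16_to_fp32(src1[k]) * bf16_to_fp32(src2[k])  # exact in float64
            total = round_fp32(total + product)  # assumed: round after each accumulation
        out.append(total)
    return out


print(dot_bf16_pairs([1.0, 0.0], [0x3F80, 0x4000, 0x4040, 0x4080], [0x4000, 0x4040, 0x3F80, 0x3F00]))
# [9.0, 5.0]
```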

Type: Grant

Filed: December 23, 2020

Date of Patent: June 15, 2021

Assignee: Intel Corporation

Inventors: Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Raanan Sade, Menachem Adelman, Zeev Sperber, Amit Gradstein, Simon Rubanovich
Reducing latency of common source data movement instructions

Patent number: 11029950

Abstract: A move data instruction to move data from one location to another location is obtained. Based on obtaining the move data instruction, a determination is made as to whether the data to be moved is located in a buffer. The buffer is configured to maintain the data for use by multiple move data instructions. The buffer is used to move the data from the one location to the other location, based on determining that the data to be moved is in the buffer.
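A toy model of the buffered move: the first move from a source reads memory and keeps the data, later moves from the same source are served from the buffer, and writes to a buffered address update the buffer so it stays consistent. The capacity, LRU eviction and all names are assumptions, and only writes made through `move` are tracked.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class MoveDataUnit:
    """Toy model: moves that share a source can be served from a small buffer."""

    def __init__(self, memory: Dict[Hashable, object], capacity: int = 4):
        self.memory = memory
        self.buffer: "OrderedDict[Hashable, object]" = OrderedDict()  # source -> data
        self.capacity = capacity
        self.buffer_hits = 0

    def move(self, src: Hashable, dst: Hashable) -> None:
        if src in self.buffer:
            data = self.buffer[src]  # data already buffered: skip the memory read
            self.buffer.move_to_end(src)
            self.buffer_hits += 1
        else:
            data = self.memory[src]
            self.buffer[src] = data
            if len(self.buffer) > self.capacity:
                self.buffer.popitem(last=False)  # evict the least recently used source
        self.memory[dst] = data
        if dst in self.buffer:
            self.buffer[dst] = data  # keep the buffer consistent with memory


memory = {0: "A", 1: None, 2: None, 3: None}
mover = MoveDataUnit(memory)
for dst in (1, 2, 3):
    mover.move(0, dst)
print(memory, mover.buffer_hits)  # {0: 'A', 1: 'A', 2: 'A', 3: 'A'} 2
```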

Type: Grant

Filed: July 3, 2019

Date of Patent: June 8, 2021

Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Yossi Shapira, Yair Fried, Eyal Naor, Amir Turi
Apparatus for calculating and retaining a bound on error during floating-point operations and methods thereof

Patent number: 11023230

Abstract: The apparatus and method for calculating and retaining a bound on error during floating-point operations inserts an additional bounding field into the standard floating-point format that records the retained significant bits of the calculation with notification upon insufficient retention. The bounding field, accounting for both rounding and cancellation errors, includes the lost bits D Field and the accumulated rounding error R Field. The D Field states the number of bits in the floating-point representation that are no longer meaningful. The bounds on the represented real value are determined by the truncated floating-point value and the addition of the error determined by the number of lost bits. The true, real value is absolutely contained by these bounds. The allowable loss (optionally programmable) of significant digits provides a fail-safe, real-time notification of loss of significant digits. This allows representation of real numbers accurate to the last digit.

Type: Grant

Filed: January 20, 2020

Date of Patent: June 1, 2021

Inventor: Alan A. Jorgensen
Bit string operations using a computing tile

Patent number: 11016765

Abstract: Systems, apparatuses, and methods related to bit string operations using a computing tile are described. An example apparatus includes a computing device (or “tile”) including a processing unit and a memory resource configured as a cache for the processing unit. The computing device can include circuitry to receive a command to initiate an operation to convert data comprising a bit string having a first format that supports arithmetic operations to a first level of precision to a bit string having a second format that supports arithmetic operations to a second level of precision. The computing device can receive, by the memory resource, the bit string based, at least in part, on receipt of the command and, responsive to receipt of the data, perform the operation on the bit string to convert the data from the first format to the second format.
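A sketch of the command-driven conversion, using IEEE-754 formats from Python's `struct` module as stand-ins ("e" binary16, "f" binary32, "d" binary64); the abstract does not name the first and second formats. `ComputingTile`, its command tuple and `convert_bit_string` are hypothetical.

```python
import struct
from typing import List, Tuple


def convert_bit_string(bits: int, src_fmt: str, dst_fmt: str) -> int:
    """Reinterpret `bits` as src_fmt, round to dst_fmt, return the new bit pattern."""
    width = struct.calcsize(">" + src_fmt)
    value = struct.unpack(">" + src_fmt, bits.to_bytes(width, "big"))[0]
    return int.from_bytes(struct.pack(">" + dst_fmt, value), "big")


class ComputingTile:
    """Hypothetical tile: a command names the formats, the data lands in its cache."""

    def __init__(self) -> None:
        self.cache: List[int] = []

    def handle(self, command: Tuple[str, str], bit_strings: List[int]) -> List[int]:
        src_fmt, dst_fmt = command
        self.cache = list(bit_strings)  # memory resource acting as the tile's cache
        return [convert_bit_string(b, src_fmt, dst_fmt) for b in self.cache]


tile = ComputingTile()
print([hex(b) for b in tile.handle(("d", "f"), [0x3FF0000000000000, 0x3FF8000000000000])])
# ['0x3f800000', '0x3fc00000']
```

Narrowing conversions round to nearest and can overflow: `struct.pack` raises `OverflowError` when a value is too large for the destination format.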

Type: Grant

Filed: April 29, 2019

Date of Patent: May 25, 2021

Assignee: Micron Technology, Inc.

Inventor: Vijay S. Ramesh
