Patents Examined by Michael J Metzger
  • Patent number: 11966737
    Abstract: Systems and methods for an efficient and robust multiprocessor-coprocessor interface that may be used between a streaming multiprocessor and an acceleration coprocessor in a GPU are provided. According to an example implementation, in order to perform an acceleration of a particular operation using the coprocessor, the multiprocessor: issues a series of write instructions to write input data for the operation into coprocessor-accessible storage locations; issues an operation instruction to cause the coprocessor to execute the particular operation; and then issues a series of read instructions to read result data of the operation from coprocessor-accessible storage locations to multiprocessor-accessible storage locations.
    Type: Grant
    Filed: September 2, 2021
    Date of Patent: April 23, 2024
    Assignee: NVIDIA CORPORATION
    Inventors: Ronald Charles Babich, Jr., John Burgess, Jack Choquette, Tero Karras, Samuli Laine, Ignacio Llamas, Gregory Muthler, William Parsons Newhall, Jr.
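    A minimal C sketch of the three-step protocol in this abstract: write the operands into coprocessor-visible storage, issue the operation, then read the result back. The coproc_t type, the slot layout, and the placeholder dot-product operation are invented for illustration and are not NVIDIA's actual interface.

        #include <stdio.h>

        #define COPROC_SLOTS 4

        typedef struct {
            double slots[COPROC_SLOTS];            /* coprocessor-accessible storage */
        } coproc_t;

        /* Step 1: write instructions place input data in coprocessor storage. */
        static void coproc_write(coproc_t *c, int slot, double v) { c->slots[slot] = v; }

        /* Step 2: the operation instruction triggers the accelerated computation
         * (here a stand-in dot product over the slots). */
        static void coproc_execute(coproc_t *c) {
            double acc = 0.0;
            for (int i = 0; i < COPROC_SLOTS; i += 2)
                acc += c->slots[i] * c->slots[i + 1];
            c->slots[0] = acc;                     /* result stays in coprocessor storage */
        }

        /* Step 3: read instructions move results to multiprocessor storage. */
        static double coproc_read(const coproc_t *c, int slot) { return c->slots[slot]; }

        int main(void) {
            coproc_t c;
            coproc_write(&c, 0, 1.0); coproc_write(&c, 1, 2.0);
            coproc_write(&c, 2, 3.0); coproc_write(&c, 3, 4.0);
            coproc_execute(&c);
            printf("result = %g\n", coproc_read(&c, 0));   /* 1*2 + 3*4 = 14 */
            return 0;
        }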
  • Patent number: 11947963
    Abstract: Computing resource management is improved during fast sorting using vector instructions. The process includes: determining a pivot value and a pivot position in a data set (e.g., by sampling with vectors and determining the sample median), determining whether moving data in the sampled portion may be avoided (e.g., if it is constant-valued or already sorted), and, leveraging that determination to possibly avoid unnecessary data movement, sorting the data set. Some examples further determine the microarchitecture version of the computing device performing the sorting and select an implementation of the sorting instructions that is tuned for that microarchitecture version (e.g., based on the number of vector registers and motherboard cache configuration). Some examples leverage a soft 3-way quicksort by finding data elements adjacent to the pivot position that also have the pivot value and adding a partition boundary at the end of the set of same-valued data elements.
    Type: Grant
    Filed: April 4, 2022
    Date of Patent: April 2, 2024
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Conor John Cunningham, Thierry Fevrier
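    A scalar C sketch of the "soft 3-way quicksort" idea from this abstract: after a standard partition, pivot-valued elements are gathered next to the partition point and a second boundary is placed at the end of that same-valued run, so it is excluded from further recursion. The pivot choice and function names are illustrative; the patented implementation samples with vector instructions and selects microarchitecture-tuned variants.

        #include <stdio.h>

        static int partition_lt(int *a, int lo, int hi, int pivot) {
            int i = lo;
            for (int j = lo; j < hi; j++)          /* move values < pivot left */
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            return i;                              /* first index with a[j] >= pivot */
        }

        static void soft_quicksort(int *a, int lo, int hi) {
            if (hi - lo < 2) return;
            int pivot = a[lo + (hi - lo) / 2];     /* stand-in for the sampled median */
            int mid = partition_lt(a, lo, hi, pivot);
            int end = mid;
            for (int j = mid; j < hi; j++)         /* gather pivot-valued elements */
                if (a[j] == pivot) { int t = a[end]; a[end] = a[j]; a[j] = t; end++; }
            soft_quicksort(a, lo, mid);            /* equal run [mid, end) needs no work */
            soft_quicksort(a, end, hi);
        }

        int main(void) {
            int v[] = {5, 3, 5, 1, 5, 2, 5, 4};
            soft_quicksort(v, 0, 8);
            for (int i = 0; i < 8; i++) printf("%d ", v[i]);
            printf("\n");                          /* 1 2 3 4 5 5 5 5 */
            return 0;
        }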
  • Patent number: 11940946
    Abstract: A vector reduction circuit configured to reduce an input vector of elements comprises a plurality of cells, wherein each of the plurality of cells other than a designated first cell that receives a designated first element of the input vector is configured to receive a particular element of the input vector, receive, from another of the one or more cells, a temporary reduction element, perform a reduction operation using the particular element and the temporary reduction element, and provide, as a new temporary reduction element, a result of performing the reduction operation using the particular element and the temporary reduction element. The vector reduction circuit also comprises an output circuit configured to provide, for output as a reduction of the input vector, a new temporary reduction element corresponding to a result of performing the reduction operation using a last element of the input vector.
    Type: Grant
    Filed: June 22, 2021
    Date of Patent: March 26, 2024
    Assignee: Google LLC
    Inventors: Gregory Michael Thorson, Andrew Everett Phelps, Olivier Temam
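    A software model of the cell chain in this abstract: the designated first cell takes the designated first element, each later cell combines its element with the temporary reduction passed along the chain, and the output circuit emits the temporary produced by the last cell. The choice of max as the reduction operation is arbitrary for the sketch.

        #include <stdio.h>

        static float reduce_op(float elem, float tmp) { return elem > tmp ? elem : tmp; }

        static float vector_reduce(const float *in, int n) {
            float tmp = in[0];                     /* designated first cell */
            for (int i = 1; i < n; i++)            /* each remaining cell */
                tmp = reduce_op(in[i], tmp);       /* new temporary reduction element */
            return tmp;                            /* output circuit: last temporary */
        }

        int main(void) {
            float v[] = {3.f, 1.f, 4.f, 1.f, 5.f};
            printf("%g\n", vector_reduce(v, 5));   /* 5 */
            return 0;
        }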
  • Patent number: 11941482
    Abstract: In some aspects, a heterogeneous computing system includes a quantum processor unit and a classical processor unit. In some instances, variables defined by a computer program are stored in a classical memory in the heterogeneous computing system. The computer program is executed in the heterogeneous computing system by operation of the quantum processor unit and the classical processor unit. Instructions are generated for the quantum processor by a host processor unit based on values of the variables stored in the classical memory. The instructions are configured to cause the quantum processor unit to perform a data processing task defined by the computer program. The values of the variables are updated in the classical memory based on output values generated by the quantum processor unit. The classical processor unit processes the updated values of the variables.
    Type: Grant
    Filed: March 9, 2021
    Date of Patent: March 26, 2024
    Assignee: Rigetti & Co, LLC
    Inventors: Chad Tyler Rigetti, William J. Zeng, Dane Christoffer Thompson
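    A schematic C loop for the classical/quantum interplay described here, with the quantum processor mocked out: a classical variable drives the instructions generated for the QPU, and the QPU's output updates the variable in classical memory. The qpu_run function and the update rule are stand-ins; Rigetti's actual stack compiles and runs real quantum programs.

        #include <stdio.h>

        /* Mock QPU: in reality, instructions generated from theta would run on
         * the quantum processor unit and a measured value would come back. */
        static double qpu_run(double theta) { return theta * 0.5; }

        int main(void) {
            double theta = 1.0;                    /* variable in classical memory */
            for (int step = 0; step < 5; step++) {
                double out = qpu_run(theta);       /* quantum data processing task */
                theta -= 0.1 * out;                /* classical update of the variable */
                printf("step %d: theta = %f\n", step, theta);
            }
            return 0;
        }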
  • Patent number: 11934826
    Abstract: Methods, systems, and apparatus, including computer-readable media, are described for performing vector reductions using a shared scratchpad memory of a hardware circuit having processor cores that communicate with the shared memory. For each of the processor cores, a respective vector of values is generated based on computations performed at the processor core. The shared memory receives the respective vectors of values from respective resources of the processor cores using a direct memory access (DMA) data path of the shared memory. The shared memory performs an accumulation operation on the respective vectors of values using an operator unit coupled to the shared memory. The operator unit is configured to accumulate values based on arithmetic operations encoded at the operator unit. A result vector is generated based on performing the accumulation operation using the respective vectors of values.
    Type: Grant
    Filed: November 19, 2021
    Date of Patent: March 19, 2024
    Assignee: Google LLC
    Inventors: Thomas Norrie, Gurushankar Rajamani, Andrew Everett Phelps, Matthew Leever Hedlund, Norman Paul Jouppi
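    A plain-C model of the accumulation path in this abstract: each core's result vector arrives at the shared scratchpad (standing in for the DMA data path), where an operator unit adds it into a running result vector. The core count, vector length, and the add operator are illustrative.

        #include <stdio.h>

        #define CORES 4
        #define VLEN  8

        int main(void) {
            float core_vec[CORES][VLEN];
            float shared_acc[VLEN] = {0};          /* lives in the shared memory */

            for (int c = 0; c < CORES; c++)        /* per-core computations */
                for (int i = 0; i < VLEN; i++)
                    core_vec[c][i] = (float)(c + i);

            for (int c = 0; c < CORES; c++)        /* "DMA" each vector in */
                for (int i = 0; i < VLEN; i++)
                    shared_acc[i] += core_vec[c][i];   /* operator unit: accumulate */

            for (int i = 0; i < VLEN; i++) printf("%g ", shared_acc[i]);
            printf("\n");
            return 0;
        }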
  • Patent number: 11934870
    Abstract: A method for scheduling computing tasks on a supercomputer includes offline reinforcement learning (OFRL) of a scheduler on a database (LDB). The database includes at least one execution history (HIST) that includes the state (LHPCS) of a learning supercomputer at several moments (T, T-1); the actions (LACT) related to the scheduling of learning tasks on the learning supercomputer at those moments (T); and a reward (REW) related to each task. The method also includes the use of the trained scheduler on the computing tasks to be scheduled.
    Type: Grant
    Filed: September 13, 2022
    Date of Patent: March 19, 2024
    Assignees: BULL SAS, LE COMMISSARIAT À L'ÉNERGIE ATOMIQUE ET AUX ÉNERGIES ALTERNATIVES
    Inventor: Pierre Seroul
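    A toy tabular illustration of the offline-RL setup in this abstract: a value table is fitted from a fixed execution history of (state, action, reward) tuples, and the trained table then picks scheduling actions. The two-state, two-action encoding and the reward-averaging update are invented for the sketch; the patent does not prescribe them.

        #include <stdio.h>

        #define STATES  2                          /* e.g. cluster load: low/high */
        #define ACTIONS 2                          /* e.g. schedule now / defer */
        #define HIST_N  5

        int main(void) {
            /* Execution history (HIST): state at each moment, action, reward. */
            int    hs[HIST_N] = {0, 0, 1, 1, 0};
            int    ha[HIST_N] = {0, 1, 0, 1, 0};
            double hr[HIST_N] = {1.0, 0.2, 0.1, 0.8, 0.9};

            double q[STATES][ACTIONS] = {{0}};
            int    n[STATES][ACTIONS] = {{0}};
            for (int i = 0; i < HIST_N; i++) {     /* offline training pass */
                n[hs[i]][ha[i]]++;
                q[hs[i]][ha[i]] += (hr[i] - q[hs[i]][ha[i]]) / n[hs[i]][ha[i]];
            }
            for (int s = 0; s < STATES; s++)       /* trained scheduler in use */
                printf("state %d -> action %d\n", s, q[s][1] > q[s][0]);
            return 0;
        }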
  • Patent number: 11934835
    Abstract: Validation of correct derivation of a parallel program from a sequential program for deployment of the parallel program to a plurality of processing units is described. The system receives the program code of the sequential program and the program code of the parallel program. A static analysis component computes a first control flow graph, and determines dependencies within the sequential program code. It further computes a further control flow graph for each thread or process of the parallel program and determines dependencies within the further control flow graphs. A checking component checks if the sequential program and the derived parallel program are semantically equivalent by comparing the respective first and further control flow graphs and respective dependencies. A release component declares a correct derivation state for the parallel program to qualify the parallel program for deployment if the derived parallel program and the sequential program are semantically equivalent.
    Type: Grant
    Filed: January 18, 2023
    Date of Patent: March 19, 2024
    Assignee: emmtrix Technologies GmbH
    Inventor: Timo Stripf
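    A drastically simplified version of the checking/release gate in this abstract: two control flow graphs, here adjacency matrices assumed to share a node numbering (an assumption the real tool does not need), are compared edge-for-edge before the parallel program is declared correctly derived.

        #include <stdbool.h>
        #include <stdio.h>

        #define NODES 3

        static bool cfg_equal(bool a[NODES][NODES], bool b[NODES][NODES]) {
            for (int i = 0; i < NODES; i++)
                for (int j = 0; j < NODES; j++)
                    if (a[i][j] != b[i][j]) return false;
            return true;
        }

        int main(void) {
            bool seq[NODES][NODES] = {{0,1,0},{0,0,1},{0,0,0}};  /* sequential CFG */
            bool par[NODES][NODES] = {{0,1,0},{0,0,1},{0,0,0}};  /* one thread's CFG */
            puts(cfg_equal(seq, par) ? "release: correct derivation state"
                                     : "reject: not semantically equivalent");
            return 0;
        }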
  • Patent number: 11915003
    Abstract: Disclosed are a process parasitism-based branch prediction method and device for serverless computing, an electronic device, and a readable storage medium. The method includes: receiving a user's call request for a target function; when capacity expansion is required, scheduling a container that executes the target function onto a new server that has not executed the target function within a preset period of time, a parasitic process having been pre-added to the container's base image; triggering the parasitic process when the container is initialized on the new server, where the parasitic process initiates a system call that causes the system kernel to select a target template function according to the type of the target function and copy it N times; and using execution data of the N copied template functions as training data to train a branch predictor on the new server.
    Type: Grant
    Filed: August 31, 2023
    Date of Patent: February 27, 2024
    Assignee: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY
    Inventors: Kejiang Ye, Yanying Lin, Chengzhong Xu
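    A conceptual sketch of the warm-up effect in this abstract: before the target function runs on a new server, a template function with a similar branch pattern is executed repeatedly so the hardware branch predictor accumulates representative history. The template body, the iteration count, and the training-by-execution shortcut are all stand-ins for the kernel-driven template copying the patent describes.

        #include <stdio.h>

        /* Template function mimicking the target function's branch behavior. */
        static int template_fn(const int *data, int len) {
            int acc = 0;
            for (int i = 0; i < len; i++)
                if (data[i] & 1) acc += data[i];   /* data-dependent branch */
                else             acc -= data[i];
            return acc;
        }

        int main(void) {
            int sample[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            enum { N = 1000 };                     /* N copied/executed templates */
            volatile int sink = 0;
            for (int i = 0; i < N; i++)
                sink += template_fn(sample, 8);    /* predictor training data */
            printf("branch predictor warmed by %d runs (sink=%d)\n", N, sink);
            return 0;
        }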
  • Patent number: 11914487
    Abstract: Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
    Type: Grant
    Filed: August 9, 2021
    Date of Patent: February 27, 2024
    Assignee: NeuroBlade Ltd.
    Inventors: Elad Sity, Eliad Hillel
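    A structural sketch in C of the layout this abstract describes: each processor subunit owns a dedicated memory bank (reached over the first plurality of buses) and is wired to another subunit (the second plurality). The ring topology and bank size are illustrative choices.

        #include <stdio.h>

        #define SUBUNITS   4
        #define BANK_WORDS 16

        typedef struct subunit {
            int bank[BANK_WORDS];      /* corresponding, dedicated memory bank */
            struct subunit *neighbor;  /* bus to another processor subunit */
        } subunit_t;

        int main(void) {
            subunit_t s[SUBUNITS];
            for (int i = 0; i < SUBUNITS; i++)
                s[i].neighbor = &s[(i + 1) % SUBUNITS];  /* subunit-to-subunit buses */
            s[0].bank[0] = 42;                     /* local access over dedicated bus */
            s[0].neighbor->bank[0] = s[0].bank[0]; /* hand data to the next subunit */
            printf("%d\n", s[1].bank[0]);
            return 0;
        }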
  • Patent number: 11907716
    Abstract: The present disclosure discloses a method for vector reading and writing, a vector-register system, a device, and a medium. When a vector-write instruction is received, a vector-register controller converts the address space of the vector to be written into a bit address in the vector register file; for a nonstandard vector, a nonstandard-vector converting unit first converts the vector, and the write is then performed, so that vector data of any format can be saved. When a vector-read instruction is received, the vector-register controller converts the address space of the vector to be read into a bit address in the vector register file according to the width and length to be read, and the read is then performed, so that vector data of any format can be read.
    Type: Grant
    Filed: April 28, 2022
    Date of Patent: February 20, 2024
    Assignee: INSPUR SUZHOU INTELLIGENT TECHNOLOGY CO., LTD.
    Inventors: Lingjun Kong, Zhaochun Pang, Qi Song
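    A small sketch of the address conversion at the heart of this abstract: a vector's element index and element width (which need not be a standard power of two) map to a flat bit address in the register file. The linear packed mapping is an assumption made for illustration.

        #include <stdint.h>
        #include <stdio.h>

        /* Convert (element index, element width in bits) to a register-file
         * bit address under a simple packed layout. */
        static uint64_t to_bit_addr(uint64_t elem_index, uint32_t width_bits) {
            return elem_index * (uint64_t)width_bits;
        }

        int main(void) {
            uint32_t width = 24;                   /* a "nonstandard" element width */
            uint64_t bit = to_bit_addr(5, width);
            printf("element 5 starts at bit %llu (byte %llu, offset %llu)\n",
                   (unsigned long long)bit,
                   (unsigned long long)(bit / 8),
                   (unsigned long long)(bit % 8));
            return 0;
        }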
  • Patent number: 11900114
    Abstract: Disclosed embodiments relate to systems and methods to skip inconsequential matrix operations. In one example, a processor includes decode circuitry to decode an instruction having fields to specify an opcode and locations of first source, second source, and destination matrices, the opcode indicating that the processor is to multiply each element at row M and column K of the first source matrix with a corresponding element at row K and column N of the second source matrix, and accumulate a resulting product with previous contents of a corresponding element at row M and column N of the destination matrix, the processor to skip multiplications that, based on detected values of corresponding multiplicands, would generate inconsequential results; scheduling circuitry to schedule execution of the instruction; and execution circuitry to execute the instruction as per the opcode.
    Type: Grant
    Filed: August 1, 2022
    Date of Patent: February 13, 2024
    Assignee: Intel Corporation
    Inventors: Elmoustapha Ould-Ahmed-Vall, William Rash, Subramaniam Maiyuran, Varghese George, Rajesh Sankaran
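    A scalar C rendering of the skip logic in this abstract: each multiply-accumulate into element (M,N) of the destination is skipped when a multiplicand is zero, because the product cannot change the accumulator. Tile shapes and values are arbitrary; in the patented design the detection happens in hardware execution circuitry, not in a software loop.

        #include <stdio.h>

        enum { M = 2, K = 3, N = 2 };

        int main(void) {
            int a[M][K] = {{1, 0, 2}, {0, 0, 3}};
            int b[K][N] = {{4, 5}, {6, 7}, {0, 1}};
            int c[M][N] = {{0}};
            int skipped = 0;
            for (int m = 0; m < M; m++)
                for (int k = 0; k < K; k++)
                    for (int n = 0; n < N; n++) {
                        if (a[m][k] == 0 || b[k][n] == 0) { skipped++; continue; }
                        c[m][n] += a[m][k] * b[k][n];  /* consequential products only */
                    }
            printf("c = {{%d,%d},{%d,%d}}, skipped %d of %d multiplies\n",
                   c[0][0], c[0][1], c[1][0], c[1][1], skipped, M * K * N);
            return 0;
        }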
  • Patent number: 11893389
    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
    Type: Grant
    Filed: March 27, 2023
    Date of Patent: February 6, 2024
    Assignee: Intel Corporation
    Inventors: Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Raanan Sade, Menachem Adelman, Zeev Sperber, Amit Gradstein, Simon Rubanovich
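    A scalar model of one accumulation step of the tile instruction in this abstract: each 32-bit (doubleword) source element is treated as eight 4-bit lanes, corresponding lanes are multiplied to form eight products, and the products are accumulated into the 32-bit destination element with saturation. Unsigned nibbles are assumed here; signed variants also exist.

        #include <stdint.h>
        #include <stdio.h>

        static int32_t sat_add32(int32_t acc, int64_t add) {
            int64_t s = (int64_t)acc + add;
            if (s > INT32_MAX) return INT32_MAX;
            if (s < INT32_MIN) return INT32_MIN;
            return (int32_t)s;
        }

        /* One (m,n,k) step: dot product of the eight nibbles of a and b. */
        static int32_t dp_nibbles(int32_t dst, uint32_t a, uint32_t b) {
            int64_t sum = 0;
            for (int lane = 0; lane < 8; lane++) {
                uint32_t an = (a >> (4 * lane)) & 0xF;
                uint32_t bn = (b >> (4 * lane)) & 0xF;
                sum += (int64_t)an * bn;           /* eight products */
            }
            return sat_add32(dst, sum);            /* accumulate and saturate */
        }

        int main(void) {
            printf("%d\n", dp_nibbles(0, 0x11111111u, 0x22222222u)); /* 8*(1*2) = 16 */
            return 0;
        }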
  • Patent number: 11886884
    Abstract: In an embodiment, a processor includes a branch prediction circuit and a plurality of processing engines. The branch prediction circuit is to: detect a coherence operation associated with a first memory address; identify a first branch instruction associated with the first memory address; and predict a direction for the identified branch instruction based on the detected coherence operation. Other embodiments are described and claimed.
    Type: Grant
    Filed: November 12, 2019
    Date of Patent: January 30, 2024
    Assignee: Intel Corporation
    Inventors: Christopher Wilkerson, Binh Pham, Patrick Lu, Jared Warner Stark, IV
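    A toy single-entry model of the predictor in this abstract, under one plausible use case: a spin-wait branch keeps testing a flag and is predicted to stay in the loop until a coherence operation for the flag's address arrives (another core wrote that line), at which point the prediction flips to the loop exit. The table layout and flip policy are assumptions for the sketch.

        #include <stdbool.h>
        #include <stdio.h>

        typedef struct {
            unsigned long addr;    /* first memory address the branch depends on */
            bool predict_taken;    /* current direction prediction */
        } coherence_predictor_t;

        static void on_coherence_op(coherence_predictor_t *p, unsigned long addr) {
            if (p->addr == addr)                   /* branch tied to this address */
                p->predict_taken = !p->predict_taken;  /* value likely changed */
        }

        int main(void) {
            coherence_predictor_t p = {0x1000, true};  /* predict: keep spinning */
            on_coherence_op(&p, 0x1000);           /* invalidation for the flag line */
            printf("predict taken: %s\n", p.predict_taken ? "yes" : "no");
            return 0;
        }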
  • Patent number: 11886880
    Abstract: The present disclosure provides a data processing apparatus and related products. The products include a control module comprising an instruction caching unit, an instruction processing unit, and a storage queue unit. The instruction caching unit is configured to store computation instructions associated with an artificial neural network operation; the instruction processing unit is configured to parse the computation instructions to obtain a plurality of operation instructions; and the storage queue unit is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or computation instructions to be executed in queue order. By adopting the above-mentioned method, the present disclosure can improve the operation efficiency of related products when performing operations of a neural network model.
    Type: Grant
    Filed: June 24, 2022
    Date of Patent: January 30, 2024
    Assignee: CAMBRICON TECHNOLOGIES CORPORATION LIMITED
    Inventors: Shaoli Liu, Bingrui Wang, Xiaoyong Zhou, Yimin Zhuang, Huiying Lan, Jun Liang, Hongbo Zeng
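    A compact model of the control-module flow in this abstract: cached computation instructions are parsed into finer-grained operation instructions that enter a queue and execute in queue order. The specific parse rule (each computation instruction expands into two operations) is invented for the sketch.

        #include <stdio.h>

        #define QCAP 8

        int main(void) {
            const char *cached[] = {"CONV", "POOL"};   /* instruction caching unit */
            const char *queue[QCAP];                   /* storage queue unit */
            int head = 0, tail = 0;

            for (int i = 0; i < 2; i++) {              /* instruction processing unit */
                queue[tail++] = cached[i];             /* parsed operation #1 */
                queue[tail++] = "WRITEBACK";           /* parsed operation #2 */
            }
            while (head < tail)                        /* execute in queue order */
                printf("exec %s\n", queue[head++]);
            return 0;
        }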
  • Patent number: 11886882
    Abstract: Described herein are systems and methods for secure multithread execution. For example, some methods include fetching an instruction of a first thread from a memory into a processor pipeline that is configured to execute instructions from two or more threads in parallel using execution units of the processor pipeline; detecting that the instruction has been designated as a sensitive instruction; responsive to detection of the sensitive instruction, disabling execution of instructions of threads other than the first thread in the processor pipeline during execution of the sensitive instruction by an execution unit of the processor pipeline; executing the sensitive instruction using an execution unit of the processor pipeline; and, responsive to completion of execution of the sensitive instruction, enabling execution of instructions of threads other than the first thread in the processor pipeline.
    Type: Grant
    Filed: April 5, 2022
    Date of Patent: January 30, 2024
    Assignee: Marvell Asia Pte, Ltd.
    Inventor: Shubhendu Sekhar Mukherjee
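    A one-file model of the thread gating in this abstract: while a sensitive instruction from one thread occupies an execution unit, issue from other threads is disabled, then re-enabled when it completes. Each instruction here takes one step and a stalled instruction is simply dropped rather than replayed, a deliberate simplification of a real pipeline.

        #include <stdbool.h>
        #include <stdio.h>

        typedef struct { int thread; bool sensitive; const char *op; } instr_t;

        int main(void) {
            instr_t stream[] = {
                {0, false, "add"}, {1, false, "mul"},
                {0, true,  "keyload"},             /* designated sensitive instruction */
                {1, false, "sub"}, {0, false, "store"},
            };
            bool gate = false;                     /* only the sensitive thread may issue */
            for (int i = 0; i < 5; i++) {
                if (gate && stream[i].thread != 0) {
                    printf("stall t%d %s\n", stream[i].thread, stream[i].op);
                    continue;                      /* other threads disabled */
                }
                printf("exec  t%d %s\n", stream[i].thread, stream[i].op);
                gate = stream[i].sensitive;        /* raised during the sensitive op,
                                                      cleared once it has completed */
            }
            return 0;
        }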
  • Patent number: 11880768
    Abstract: A processor-implemented data processing method includes encoding a plurality of weights of a filter of a neural network using an inverted two's complement fixed-point format; generating weight data based on values of the encoded weights corresponding to same filter positions of a plurality of filters; and performing an operation on the weight data and input activation data using a bit-serial scheme to control when to perform an activation function with respect to the weight data and input activation data.
    Type: Grant
    Filed: March 1, 2023
    Date of Patent: January 23, 2024
    Assignees: Samsung Electronics Co., Ltd., Seoul National University R&DB Foundation
    Inventors: Seungwon Lee, Dongwoo Lee, Kiyoung Choi, Sungbum Kang
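    A bit-level illustration of the weight encoding named in this abstract, under one reading of "inverted two's complement": a negative weight is stored as the bitwise complement of its two's complement pattern, so small-magnitude negative values gain leading zeros that a bit-serial engine can skip. That reading is an assumption for the sketch, not a claim about the exact patented format.

        #include <stdint.h>
        #include <stdio.h>

        static uint8_t encode_weight(int8_t w) {
            uint8_t bits = (uint8_t)w;              /* two's complement pattern */
            return (w < 0) ? (uint8_t)~bits : bits; /* invert negative weights */
        }

        int main(void) {
            int8_t ws[] = {-3, 0, 3};
            for (int i = 0; i < 3; i++)             /* -3: 0xFD -> 0x02 */
                printf("w=%4d -> 0x%02X\n", ws[i], encode_weight(ws[i]));
            return 0;
        }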
  • Patent number: 11875151
    Abstract: Inter-process serving of machine learning features from mapped memory for machine learning models is described. ML features are populated in a data structure that is serialized. State data is stored that indicates that reader process(es) are to read from a first memory mapped data file and not a second memory mapped data file. The serialized bytes are stored in the second memory mapped data file and the state data is updated to indicate that the reader process(es) are to read from the second memory mapped data file. A request is received and parsed to prepare keys from attributes of the request. Based on the state data, the serialized bytes are read from the second memory mapped data file that correspond to the keys. The serialized bytes are deserialized and copied to a data structure available to an inference algorithm.
    Type: Grant
    Filed: July 26, 2023
    Date of Patent: January 16, 2024
    Assignee: CLOUDFLARE, INC.
    Inventor: Oleksandr Bocharov
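    A compressed sketch of the double-buffered handoff in this abstract: the writer serializes features into whichever file is not currently marked live, then flips the state so readers switch files. Plain arrays stand in for the two memory mapped data files, and the state flip is not actually atomic here as real cross-process code would require.

        #include <stdio.h>
        #include <string.h>

        #define FILE_SZ 64

        static char file_a[FILE_SZ], file_b[FILE_SZ]; /* the two "mapped" files */
        static int  live = 0;                         /* state data: 0 -> file_a */

        static void writer_publish(const char *serialized) {
            char *standby = live ? file_a : file_b;   /* write the non-live file */
            strncpy(standby, serialized, FILE_SZ - 1);
            live = !live;                             /* update state: readers switch */
        }

        static const char *reader_fetch(void) {
            return live ? file_b : file_a;            /* read per current state */
        }

        int main(void) {
            writer_publish("user42:ctr=0.031");       /* serialized ML features */
            printf("reader sees: %s\n", reader_fetch());
            return 0;
        }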
  • Patent number: 11868770
    Abstract: Embodiments detailed herein relate to arithmetic operations on floating-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands whose values are in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.
    Type: Grant
    Filed: December 29, 2022
    Date of Patent: January 9, 2024
    Assignee: INTEL CORPORATION
    Inventors: Gregory Henry, Alexander Heinecke
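    A worked C sketch of the conversion in this abstract: each double operand is split into a leading float plus a float residual, products are formed among the lower-precision parts, and the accumulated sum is returned in the original format. The two-way split is a textbook variant chosen for illustration; the patent covers splitting into a plurality of lower precision values with a stored exponent per operand.

        #include <stdio.h>

        static void split(double x, float *hi, float *lo) {
            *hi = (float)x;                        /* lower-precision leading part */
            *lo = (float)(x - (double)*hi);        /* residual */
        }

        static double mul_split(double a, double b) {
            float ah, al, bh, bl;
            split(a, &ah, &al);
            split(b, &bh, &bl);
            /* products among lower-precision parts, accumulated in double */
            return (double)ah * bh + (double)ah * bl
                 + (double)al * bh + (double)al * bl;
        }

        int main(void) {
            double a = 1.0 / 3.0, b = 3.0;
            printf("split product: %.17g\n", mul_split(a, b));
            printf("plain float  : %.17g\n", (double)((float)a * (float)b));
            return 0;
        }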
  • Patent number: 11868775
    Abstract: Methods of encoding and decoding are described which use a variable number of instruction words to encode instructions from an instruction set, such that different instructions within the instruction set may be encoded using different numbers of instruction words. To encode an instruction, the bits within the instruction are re-ordered and formed into instruction words based upon their variance as determined using empirical or simulation data. The bits in the instruction words are compared to corresponding predicted values and some or all of the instruction words that match the predicted values are omitted from the encoded instruction.
    Type: Grant
    Filed: May 5, 2022
    Date of Patent: January 9, 2024
    Assignee: Imagination Technologies Limited
    Inventors: Simon Thomas Nield, James McCarthy
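    A sketch of the omission step in this abstract: the instruction is cut into fixed-size words, each word is compared with its predicted value, and matching words are dropped, leaving a presence mask plus only the mismatching words. The variance-based bit re-ordering the abstract also describes is omitted for brevity.

        #include <stdint.h>
        #include <stdio.h>

        #define WORDS 4

        int main(void) {
            uint16_t instr[WORDS]     = {0xA000, 0x0000, 0x1234, 0xFFFF};
            uint16_t predicted[WORDS] = {0xA000, 0x0000, 0x0000, 0xFFFF};
            uint16_t encoded[WORDS];
            uint8_t  mask = 0;                     /* 1 bit per word: stored or omitted */
            int n = 0;

            for (int i = 0; i < WORDS; i++)        /* keep only mismatching words */
                if (instr[i] != predicted[i]) {
                    mask |= (uint8_t)(1u << i);
                    encoded[n++] = instr[i];
                }

            printf("mask=0x%X, stored:", mask);
            for (int i = 0; i < n; i++) printf(" 0x%04X", encoded[i]);
            printf(" (%d of %d words)\n", n, WORDS);
            return 0;
        }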
  • Patent number: 11868781
    Abstract: In one embodiment, a method includes accessing a loaded but paused source process executable and disassembling the source process executable to identify a system call to be instrumented and an adjacent relocatable instruction. Instrumenting the system call includes building a trampoline for the system call that includes a check flag instruction at or near an entry point to the trampoline and two areas of the trampoline that are selectively executed according to results of the check flag instruction. Building a first area of the trampoline includes providing instructions to execute a relocated copy of the adjacent relocatable instruction and return flow to an address immediately following the adjacent relocatable instruction. Building a second area of the trampoline includes providing instructions to invoke at least one handler associated with executing a relocated copy of the system call and return flow to an address immediately following the system call.
    Type: Grant
    Filed: March 24, 2022
    Date of Patent: January 9, 2024
    Assignee: Sysdig, Inc.
    Inventor: Loris Degioanni
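    A functional C model of the trampoline in this abstract: a single entry point checks a flag and branches either to the area that runs the relocated adjacent instruction and resumes after it, or to the area that invokes the handler around the relocated system call. Real instrumentation patches machine code in a paused process; ordinary functions stand in here.

        #include <stdbool.h>
        #include <stdio.h>

        static bool via_adjacent;                  /* the flag checked at entry */

        static void relocated_adjacent(void)  { puts("run relocated adjacent instruction"); }
        static void handler_and_syscall(void) { puts("invoke handler + relocated syscall"); }

        static void trampoline(void) {
            if (via_adjacent)                      /* check-flag instruction */
                relocated_adjacent();              /* area 1: return past instruction */
            else
                handler_and_syscall();             /* area 2: return past system call */
        }

        int main(void) {
            via_adjacent = true;  trampoline();
            via_adjacent = false; trampoline();
            return 0;
        }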