Patents by Inventor Ron Shalev

Ron Shalev has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11714653
    Abstract: A method for computing includes defining a processing pipeline, including at least a first stage in which producer processors compute and output data to respective locations in a buffer and a second processing stage in which one or more consumer processors read the data from the buffer and apply a computational task to the data read from the buffer. The computational task is broken into multiple, independent work units, for application by the consumer processors to respective ranges of the data in the buffer, and respective indexes are assigned to the work units in a predefined index space. A mapping is generated between the index space and the addresses in the buffer, and execution of the work units is scheduled such that at least one of the work units can begin execution before all the producer processors have completed the first processing stage.
    Type: Grant
    Filed: February 15, 2021
    Date of Patent: August 1, 2023
    Assignee: HABANA LABS LTD.
    Inventors: Tzachi Cohen, Michael Zuckerman, Doron Singer, Ron Shalev, Amos Goldman
  • Patent number: 11467827
    Abstract: A method for computing includes providing software source code defining a processing pipeline including multiple, sequential stages of parallel computations, in which a plurality of processors apply a computational task to data read from a buffer. A static code analysis is applied to the software source code so as to break the computational task into multiple, independent work units, and to define an index space in which the work units are identified by respective indexes. Based on the static code analysis, mapping parameters that define a mapping between the index space and addresses in the buffer are computed, indicating by the mapping the respective ranges of the data to which the work units are to be applied. The source code is compiled so that the processors execute the work units identified by the respective indexes while accessing the data in the buffer in accordance with the mapping.
    Type: Grant
    Filed: April 6, 2021
    Date of Patent: October 11, 2022
    Assignee: HABANA LABS LTD.
    Inventors: Michael Zuckerman, Tzachi Cohen, Doron Singer, Ron Shalev, Amos Goldman
  • Patent number: 11321092
    Abstract: A processor includes an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
    Type: Grant
    Filed: October 25, 2018
    Date of Patent: May 3, 2022
    Assignee: HABANA LABS LTD.
    Inventors: Shlomo Raikin, Sergei Gofman, Ran Halutz, Evgeny Spektor, Amos Goldman, Ron Shalev
  • Patent number: 11249724
    Abstract: A computational apparatus includes a memory unit and Read-Modify-Write (RMW) logic. The memory unit is configured to hold a data value. The RMW logic, which is coupled to the memory unit, is configured to perform an atomic RMW operation on the data value stored in the memory unit.
    Type: Grant
    Filed: August 28, 2019
    Date of Patent: February 15, 2022
    Assignee: HABANA LABS LTD.
    Inventors: Shlomo Raikin, Ron Shalev, Sergei Gofman, Ran Halutz, Nadav Klein
  • Patent number: 10915297
    Abstract: Computational apparatus includes a systolic array of processing elements. In each of a sequence of processing cycles, the processing elements in a first row of the array each receive a respective first plurality of first operands, while the processing elements in a first column of the array each receive a respective second plurality of second operands. Each processing element, except in the first row and first column, receives the respective first and second pluralities of the operands from adjacent processing elements in a preceding row and column of the array. Each processing element multiplies pairs of the first and second operands together to generate multiple respective products, and accumulates the products in accumulators. Synchronization logic loads a succession of first and second vectors of the operands into the array, and upon completion of processing triggers the processing elements to transfer respective data values from the accumulators out of the array.
    Type: Grant
    Filed: November 12, 2018
    Date of Patent: February 9, 2021
    Assignee: HABANA LABS LTD.
    Inventors: Ran Halutz, Tomer Rothschild, Ron Shalev
  • Patent number: 10915494
    Abstract: A vector processor includes a coefficient memory and a processor. The processor has an Instruction Set Architecture (ISA), which includes an instruction that approximates a mathematical function by a polynomial. The processor is configured to approximate the mathematical function over an argument, by reading one or more coefficients of the polynomial from the coefficient memory and evaluating the polynomial at the argument using the coefficients.
    Type: Grant
    Filed: November 11, 2018
    Date of Patent: February 9, 2021
    Assignee: HABANA LABS LTD.
    Inventors: Ron Shalev, Evgeny Spektor, Sergei Gofman, Ran Halutz, Shlomo Raikin, Hilla Ben Yaacov
  • Patent number: 10891255
    Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.
    Type: Grant
    Filed: March 18, 2015
    Date of Patent: January 12, 2021
    Assignee: Intel Corporation
    Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
  • Patent number: 10853448
    Abstract: Computational apparatus includes a memory, which is configured to contain multiple matrices of input data values. An array of processing elements is configured to perform multiplications of respective first and second input operands and to accumulate products of the multiplication to generate respective output values. Data access logic is configured to select from the memory a plurality of mutually-disjoint first matrices and a second matrix, and to distribute to the processing elements the input data values in a sequence that is interleaved among the first matrices, along with corresponding input data values from the second matrix, so as to cause the processing elements to compute, in the interleaved sequence, respective convolutions of each of the first matrices with the second matrix.
    Type: Grant
    Filed: September 11, 2017
    Date of Patent: December 1, 2020
    Assignee: HABANA LABS LTD.
    Inventors: Ron Shalev, Tomer Rothschild
  • Patent number: 10853070
    Abstract: A processor includes a processing engine, an address queue, an address generation unit, and logic circuitry. The processing engine is configured to process instructions that access data in an external memory. The address generation unit is configured to generate respective addresses for the instructions to be processed by the processing engine, to provide the addresses to the processing engine, and to write the addresses to the address queue. The logic circuitry is configured to access the external memory on behalf of the processing engine while compensating for variations in access latency to the external memory, by reading the addresses from the address queue, and executing the instructions in the external memory in accordance with the addresses read from the address queue.
    Type: Grant
    Filed: October 3, 2018
    Date of Patent: December 1, 2020
    Assignee: HABANA LABS LTD.
    Inventors: Ron Shalev, Evgeny Spektor, Ran Halutz
  • Patent number: 10713214
    Abstract: Computational apparatus includes a systolic array of processing elements, each including a multiplier and first and second accumulators. In each of a sequence of processing cycles, the processing elements perform the following steps concurrently: Each processing element, except in the first row and first column of the array, receives first and second operands from adjacent processing elements in a preceding row and column of the array, respectively, multiplies the first and second operands together to generate a product, and accumulates the product in the first accumulator. In addition, each processing element passes a stored output data value from the second accumulator to a succeeding processing element along a respective column of the array, receives a new output data value from a preceding processing element along the respective column, and stores the new output data value in the second accumulator.
    Type: Grant
    Filed: September 20, 2018
    Date of Patent: July 14, 2020
    Assignee: HABANA LABS LTD.
    Inventors: Ron Shalev, Ran Halutz
  • Patent number: 10489479
    Abstract: Computational apparatus includes a memory, which contains first and second input matrices of input data values, having at least three dimensions including respective heights and widths in a predefined sampling space and a common depth in a feature dimension, orthogonal to the sampling space. An array of processing elements each perform a multiplication of respective first and second input operands and to accumulate products of the multiplication to generate a respective output value. Data access logic extracts first and second pluralities of vectors of the input data values extending in the feature dimension from the first and second input matrices, respectively, and distributes the input data values from the extracted vectors in sequence to the processing elements so as to cause the processing elements to compute a convolution of first and second two-dimensional matrices composed respectively of the first and second pluralities of vectors.
    Type: Grant
    Filed: September 11, 2017
    Date of Patent: November 26, 2019
    Assignee: Habana Labs Ltd.
    Inventors: Ron Shalev, Sergei Gofman, Amos Goldman, Tomer Rothschild
  • Patent number: 9990287
    Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
    Type: Grant
    Filed: December 21, 2011
    Date of Patent: June 5, 2018
    Assignee: Intel Corporation
    Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
  • Publication number: 20160275043
    Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.
    Type: Application
    Filed: March 18, 2015
    Publication date: September 22, 2016
    Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
  • Patent number: 9411728
    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.
    Type: Grant
    Filed: December 23, 2011
    Date of Patent: August 9, 2016
    Assignee: Intel Corporation
    Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
  • Patent number: 9405545
    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
    Type: Grant
    Filed: December 30, 2011
    Date of Patent: August 2, 2016
    Assignee: Intel Corporation
    Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
  • Publication number: 20140223105
    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
    Type: Application
    Filed: December 30, 2011
    Publication date: August 7, 2014
    Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
  • Publication number: 20140208031
    Abstract: An apparatus and method are described for efficiently transferring data from a producer core to a consumer core within a central processing unit (CPU). For example, one embodiment of a method comprises: A method for transferring a chunk of data from a producer core of a central processing unit (CPU) to consumer core of the CPU, comprising: writing data to a buffer within the producer core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the fill buffer to a cache accessible by both the producer core and the consumer core; and upon the consumer core detecting that data is available in the cache, providing the data to the consumer core from the cache upon receipt of a read signal from the consumer core.
    Type: Application
    Filed: December 21, 2011
    Publication date: July 24, 2014
    Inventors: Shlomo Raikin, Robert Valentine, Raanan Sade, Julius Yuli Mandelbalt, Ron Shalev, Larisa Novakovsky
  • Publication number: 20140192069
    Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
    Type: Application
    Filed: December 21, 2011
    Publication date: July 10, 2014
    Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
  • Publication number: 20130326145
    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.
    Type: Application
    Filed: December 23, 2011
    Publication date: December 5, 2013
    Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
  • Patent number: 8527993
    Abstract: Techniques are provided which may be implemented in various methods and/or apparatuses that to provide a tasking system buffer interface capability to interface with a plurality of shared processes/engines.
    Type: Grant
    Filed: October 1, 2010
    Date of Patent: September 3, 2013
    Assignee: Qualcomm Incorporated
    Inventors: Raheel Khan, Joseph C. Chan, Ron Shalev, Naveed U. Zaman