Patents by Inventor Ron Shalev

Ron Shalev has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Fine-grained pipelining using index space mapping

Patent number: 11714653

Abstract: A method for computing includes defining a processing pipeline, including at least a first stage in which producer processors compute and output data to respective locations in a buffer and a second processing stage in which one or more consumer processors read the data from the buffer and apply a computational task to the data read from the buffer. The computational task is broken into multiple, independent work units, for application by the consumer processors to respective ranges of the data in the buffer, and respective indexes are assigned to the work units in a predefined index space. A mapping is generated between the index space and the addresses in the buffer, and execution of the work units is scheduled such that at least one of the work units can begin execution before all the producer processors have completed the first processing stage.

Type: Grant

Filed: February 15, 2021

Date of Patent: August 1, 2023

Assignee: HABANA LABS LTD.

Inventors: Tzachi Cohen, Michael Zuckerman, Doron Singer, Ron Shalev, Amos Goldman
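
The scheduling idea in the abstract of patent 11714653 can be illustrated with a minimal sketch. It is not the patented method: the affine form of the mapping, and the names `AffineMap`, `address_range`, and `ready_work_units`, are assumptions made for illustration only. Each consumer work unit has an index, the mapping translates that index into a range of buffer addresses, and a work unit becomes eligible to run once the producers have written its whole range, even while other producers are still running.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AffineMap:
    """Hypothetical mapping from a work-unit index to buffer addresses.

    Assumption: index i covers `length` consecutive addresses starting
    at base + i * stride. The abstract does not specify this form.
    """
    base: int
    stride: int
    length: int

    def address_range(self, index: int) -> range:
        start = self.base + index * self.stride
        return range(start, start + self.length)


def ready_work_units(mapping: AffineMap, num_units: int,
                     written: List[bool]) -> List[int]:
    """Return indexes of work units whose entire input range is written."""
    return [
        i for i in range(num_units)
        if all(written[addr] for addr in mapping.address_range(i))
    ]


# Buffer of 8 addresses, split into 4 work units of 2 addresses each.
mapping = AffineMap(base=0, stride=2, length=2)
written = [False] * 8

# Suppose one producer has finished addresses 0..3 while another
# producer is still computing addresses 4..7.
for addr in range(4):
    written[addr] = True

early = ready_work_units(mapping, num_units=4, written=written)
```

In this sketch, `early` holds the indexes of the work units that could be scheduled before the first stage has completed.
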
Index space mapping using static code analysis

Patent number: 11467827

Abstract: A method for computing includes providing software source code defining a processing pipeline including multiple, sequential stages of parallel computations, in which a plurality of processors apply a computational task to data read from a buffer. A static code analysis is applied to the software source code so as to break the computational task into multiple, independent work units, and to define an index space in which the work units are identified by respective indexes. Based on the static code analysis, mapping parameters that define a mapping between the index space and addresses in the buffer are computed, indicating by the mapping the respective ranges of the data to which the work units are to be applied. The source code is compiled so that the processors execute the work units identified by the respective indexes while accessing the data in the buffer in accordance with the mapping.

Type: Grant

Filed: April 6, 2021

Date of Patent: October 11, 2022

Assignee: HABANA LABS LTD.

Inventors: Michael Zuckerman, Tzachi Cohen, Doron Singer, Ron Shalev, Amos Goldman
Tensor-based memory access

Patent number: 11321092

Abstract: A processor includes an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.

Type: Grant

Filed: October 25, 2018

Date of Patent: May 3, 2022

Assignee: HABANA LABS LTD.

Inventors: Shlomo Raikin, Sergei Gofman, Ran Halutz, Evgeny Spektor, Amos Goldman, Ron Shalev
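
As a rough software analogy for the abstract of patent 11321092, the sketch below keeps a definition of a multi-dimensional array (base address, shape, strides) and resolves multi-dimensional coordinates to a flat memory address. The names `TensorDescriptor`, `address`, and `load_by_coords`, and the row-major stride layout, are assumptions chosen for illustration and are not taken from the patent or from any real ISA.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TensorDescriptor:
    """Hypothetical array definition held in the processor's internal memory."""
    base: int                 # address of element (0, ..., 0) in external memory
    shape: Tuple[int, ...]    # size of each dimension
    strides: Tuple[int, ...]  # address step per dimension, in elements

    def address(self, coords: Sequence[int]) -> int:
        """Translate multi-dimensional coordinates into a flat address."""
        if len(coords) != len(self.shape):
            raise ValueError("coordinate rank does not match tensor rank")
        for c, n in zip(coords, self.shape):
            if not 0 <= c < n:
                raise IndexError("coordinate out of bounds")
        return self.base + sum(c * s for c, s in zip(coords, self.strides))


def load_by_coords(memory: List[int], desc: TensorDescriptor,
                   coords: Sequence[int]) -> int:
    """Stand-in for a load instruction that operates on coordinates."""
    return memory[desc.address(coords)]


# External memory holding a 2 x 3 x 4 array in row-major order.
external_memory = list(range(100, 124))
desc = TensorDescriptor(base=0, shape=(2, 3, 4), strides=(12, 4, 1))
value = load_by_coords(external_memory, desc, (1, 2, 3))
```

The point of the design, as the abstract describes it, is that the program works with coordinates while the address arithmetic is handled by the instruction itself.
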
Processing-memory architectures performing atomic read-modify-write operations in deep learning systems

Patent number: 11249724

Abstract: A computational apparatus includes a memory unit and Read-Modify-Write (RMW) logic. The memory unit is configured to hold a data value. The RMW logic, which is coupled to the memory unit, is configured to perform an atomic RMW operation on the data value stored in the memory unit.

Type: Grant

Filed: August 28, 2019

Date of Patent: February 15, 2022

Assignee: HABANA LABS LTD.

Inventors: Shlomo Raikin, Ron Shalev, Sergei Gofman, Ran Halutz, Nadav Klein
Hardware accelerator for systolic matrix multiplication

Patent number: 10915297

Abstract: Computational apparatus includes a systolic array of processing elements. In each of a sequence of processing cycles, the processing elements in a first row of the array each receive a respective first plurality of first operands, while the processing elements in a first column of the array each receive a respective second plurality of second operands. Each processing element, except in the first row and first column, receives the respective first and second pluralities of the operands from adjacent processing elements in a preceding row and column of the array. Each processing element multiplies pairs of the first and second operands together to generate multiple respective products, and accumulates the products in accumulators. Synchronization logic loads a succession of first and second vectors of the operands into the array, and upon completion of processing triggers the processing elements to transfer respective data values from the accumulators out of the array.

Type: Grant

Filed: November 12, 2018

Date of Patent: February 9, 2021

Assignee: HABANA LABS LTD.

Inventors: Ran Halutz, Tomer Rothschild, Ron Shalev
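
The data flow in the abstract of patent 10915297 can be approximated by a small cycle-by-cycle simulation. This is a simplified sketch under stated assumptions: each processing element handles one operand pair per cycle (the abstract describes pluralities of operands), inputs are skewed so that matching operands meet in the right cell, and the function name `systolic_matmul` is hypothetical. One operand stream enters through the first row and moves down the columns, the other enters through the first column and moves along the rows, and every cell accumulates its products locally.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate an n x m array of multiply-accumulate cells computing a @ b."""
    n, depth, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    down = [[0] * m for _ in range(n)]   # operands moving down columns
    right = [[0] * m for _ in range(n)]  # operands moving along rows

    # The last operand pair reaches the far corner after depth + n + m - 2 cycles.
    for t in range(depth + n + m - 2):
        new_down = [[0] * m for _ in range(n)]
        new_right = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if i == 0:
                    k = t - j  # column j is fed with a delay of j cycles
                    top = b[k][j] if 0 <= k < depth else 0
                else:
                    top = down[i - 1][j]   # from the cell in the preceding row
                if j == 0:
                    k = t - i  # row i is fed with a delay of i cycles
                    left = a[i][k] if 0 <= k < depth else 0
                else:
                    left = right[i][j - 1]  # from the cell in the preceding column
                acc[i][j] += top * left
                new_down[i][j] = top
                new_right[i][j] = left
        down, right = new_down, new_right

    # After the final cycle the accumulators hold the product matrix.
    return acc


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The synchronization logic mentioned in the abstract corresponds loosely to the loop bound here: results are read out of the accumulators only after the last operands have passed through the array.
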
Approximation of mathematical functions in a vector processor

Patent number: 10915494

Abstract: A vector processor includes a coefficient memory and a processor. The processor has an Instruction Set Architecture (ISA), which includes an instruction that approximates a mathematical function by a polynomial. The processor is configured to approximate the mathematical function over an argument, by reading one or more coefficients of the polynomial from the coefficient memory and evaluating the polynomial at the argument using the coefficients.

Type: Grant

Filed: November 11, 2018

Date of Patent: February 9, 2021

Assignee: HABANA LABS LTD.

Inventors: Ron Shalev, Evgeny Spektor, Sergei Gofman, Ran Halutz, Shlomo Raikin, Hilla Ben Yaacov
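
A minimal sketch of the idea in the abstract of patent 10915494: polynomial coefficients are read from a coefficient memory and the polynomial is evaluated at the argument. Everything specific here is an assumption for illustration, including the choice of exp(x) on [0, 1), the split into 8 intervals, the second-order Taylor coefficients, and the names `COEFF_MEMORY` and `approx_exp`. The abstract does not say how the coefficients are derived or indexed.

```python
import math

NUM_INTERVALS = 8  # assumed table size


def build_coefficient_memory():
    """Fill a table with quadratic Taylor coefficients of exp per interval."""
    memory = []
    for i in range(NUM_INTERVALS):
        center = (i + 0.5) / NUM_INTERVALS
        e = math.exp(center)
        # Coefficients of e/2 * d^2 + e * d + e, highest degree first,
        # where d is the distance of the argument from the interval center.
        memory.append((center, (e / 2.0, e, e)))
    return memory


COEFF_MEMORY = build_coefficient_memory()


def approx_exp(x: float) -> float:
    """Approximate exp(x) for 0 <= x < 1 from the coefficient table."""
    if not 0.0 <= x < 1.0:
        raise ValueError("argument outside the tabulated range")
    center, coeffs = COEFF_MEMORY[int(x * NUM_INTERVALS)]
    d = x - center
    result = 0.0
    for c in coeffs:  # Horner's rule: one multiply-add per coefficient
        result = result * d + c
    return result


error = abs(approx_exp(0.3) - math.exp(0.3))
```

With this table size the error stays well below 1e-3 across the range. A hardware instruction would presumably trade table size against polynomial degree in a similar way, but that trade-off is not described in the abstract.
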
Heterogeneous multiprocessor including scalar and SIMD processors in a ratio defined by execution time and consumed die area

Patent number: 10891255

Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.

Type: Grant

Filed: March 18, 2015

Date of Patent: January 12, 2021

Assignee: Intel Corporation

Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
Processor suspension buffer and instruction queue

Patent number: 10853070

Abstract: A processor includes a processing engine, an address queue, an address generation unit, and logic circuitry. The processing engine is configured to process instructions that access data in an external memory. The address generation unit is configured to generate respective addresses for the instructions to be processed by the processing engine, to provide the addresses to the processing engine, and to write the addresses to the address queue. The logic circuitry is configured to access the external memory on behalf of the processing engine while compensating for variations in access latency to the external memory, by reading the addresses from the address queue, and executing the instructions in the external memory in accordance with the addresses read from the address queue.

Type: Grant

Filed: October 3, 2018

Date of Patent: December 1, 2020

Assignee: HABANA LABS LTD.

Inventors: Ron Shalev, Evgeny Spektor, Ran Halutz
Hiding latency of multiplier-accumulator using partial results

Patent number: 10853448

Abstract: Computational apparatus includes a memory, which is configured to contain multiple matrices of input data values. An array of processing elements is configured to perform multiplications of respective first and second input operands and to accumulate products of the multiplication to generate respective output values. Data access logic is configured to select from the memory a plurality of mutually-disjoint first matrices and a second matrix, and to distribute to the processing elements the input data values in a sequence that is interleaved among the first matrices, along with corresponding input data values from the second matrix, so as to cause the processing elements to compute, in the interleaved sequence, respective convolutions of each of the first matrices with the second matrix.

Type: Grant

Filed: September 11, 2017

Date of Patent: December 1, 2020

Assignee: HABANA LABS LTD.

Inventors: Ron Shalev, Tomer Rothschild
Hardware accelerator for outer-product matrix multiplication

Patent number: 10713214

Abstract: Computational apparatus includes a systolic array of processing elements, each including a multiplier and first and second accumulators. In each of a sequence of processing cycles, the processing elements perform the following steps concurrently: Each processing element, except in the first row and first column of the array, receives first and second operands from adjacent processing elements in a preceding row and column of the array, respectively, multiplies the first and second operands together to generate a product, and accumulates the product in the first accumulator. In addition, each processing element passes a stored output data value from the second accumulator to a succeeding processing element along a respective column of the array, receives a new output data value from a preceding processing element along the respective column, and stores the new output data value in the second accumulator.

Type: Grant

Filed: September 20, 2018

Date of Patent: July 14, 2020

Assignee: HABANA LABS LTD.

Inventors: Ron Shalev, Ran Halutz
Matrix multiplication engine

Patent number: 10489479

Abstract: Computational apparatus includes a memory, which contains first and second input matrices of input data values, having at least three dimensions including respective heights and widths in a predefined sampling space and a common depth in a feature dimension, orthogonal to the sampling space. An array of processing elements each perform a multiplication of respective first and second input operands and accumulate products of the multiplication to generate a respective output value. Data access logic extracts first and second pluralities of vectors of the input data values extending in the feature dimension from the first and second input matrices, respectively, and distributes the input data values from the extracted vectors in sequence to the processing elements so as to cause the processing elements to compute a convolution of first and second two-dimensional matrices composed respectively of the first and second pluralities of vectors.

Type: Grant

Filed: September 11, 2017

Date of Patent: November 26, 2019

Assignee: Habana Labs Ltd.

Inventors: Ron Shalev, Sergei Gofman, Amos Goldman, Tomer Rothschild
Apparatus and method for memory-hierarchy aware producer-consumer instruction

Patent number: 9990287

Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.

Type: Grant

Filed: December 21, 2011

Date of Patent: June 5, 2018

Assignee: Intel Corporation

Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
ENERGY AND AREA OPTIMIZED HETEROGENEOUS MULTIPROCESSOR FOR CASCADE CLASSIFIERS

Publication number: 20160275043

Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.

Type: Application

Filed: March 18, 2015

Publication date: September 22, 2016

Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
Methods and apparatus for efficient communication between caches in hierarchical caching design

Patent number: 9411728

Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.

Type: Grant

Filed: December 23, 2011

Date of Patent: August 9, 2016

Assignee: Intel Corporation

Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
Method and apparatus for cutting senior store latency using store prefetching

Patent number: 9405545

Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.

Type: Grant

Filed: December 30, 2011

Date of Patent: August 2, 2016

Assignee: Intel Corporation

Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
METHOD AND APPARATUS FOR CUTTING SENIOR STORE LATENCY USING STORE PREFETCHING

Publication number: 20140223105

Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.

Type: Application

Filed: December 30, 2011

Publication date: August 7, 2014

Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
APPARATUS AND METHOD FOR MEMORY-HIERARCHY AWARE PRODUCER-CONSUMER INSTRUCTIONS

Publication number: 20140208031

Abstract: An apparatus and method are described for efficiently transferring data from a producer core to a consumer core within a central processing unit (CPU). For example, one embodiment of a method for transferring a chunk of data from a producer core of a central processing unit (CPU) to a consumer core of the CPU comprises: writing data to a buffer within the producer core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the fill buffer to a cache accessible by both the producer core and the consumer core; and upon the consumer core detecting that data is available in the cache, providing the data to the consumer core from the cache upon receipt of a read signal from the consumer core.

Type: Application

Filed: December 21, 2011

Publication date: July 24, 2014

Inventors: Shlomo Raikin, Robert Valentine, Raanan Sade, Julius Yuli Mandelbalt, Ron Shalev, Larisa Novakovsky
APPARATUS AND METHOD FOR MEMORY-HIERARCHY AWARE PRODUCER-CONSUMER INSTRUCTION

Publication number: 20140192069

Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.

Type: Application

Filed: December 21, 2011

Publication date: July 10, 2014

Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
METHODS AND APPARATUS FOR EFFICIENT COMMUNICATION BETWEEN CACHES IN HIERARCHICAL CACHING DESIGN

Publication number: 20130326145

Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.

Type: Application

Filed: December 23, 2011

Publication date: December 5, 2013

Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
Tasking system interface methods and apparatuses for use in wireless devices

Patent number: 8527993

Abstract: Techniques are provided which may be implemented in various methods and/or apparatuses to provide a tasking system buffer interface capability to interface with a plurality of shared processes/engines.

Type: Grant

Filed: October 1, 2010

Date of Patent: September 3, 2013

Assignee: Qualcomm Incorporated

Inventors: Raheel Khan, Joseph C. Chan, Ron Shalev, Naveed U. Zaman
