Patents by Inventor Ron Shalev
Ron Shalev has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11714653
Abstract: A method for computing includes defining a processing pipeline, including at least a first stage in which producer processors compute and output data to respective locations in a buffer and a second processing stage in which one or more consumer processors read the data from the buffer and apply a computational task to the data read from the buffer. The computational task is broken into multiple, independent work units, for application by the consumer processors to respective ranges of the data in the buffer, and respective indexes are assigned to the work units in a predefined index space. A mapping is generated between the index space and the addresses in the buffer, and execution of the work units is scheduled such that at least one of the work units can begin execution before all the producer processors have completed the first processing stage.
Type: Grant
Filed: February 15, 2021
Date of Patent: August 1, 2023
Assignee: HABANA LABS LTD.
Inventors: Tzachi Cohen, Michael Zuckerman, Doron Singer, Ron Shalev, Amos Goldman
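Illustrative sketch (not taken from the patent): the abstract above describes mapping work-unit indexes to buffer ranges so that a work unit can start once its own input range has been written. The affine mapping, the `WorkUnitMap` name, and the per-producer region layout below are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnitMap:
    """Hypothetical affine mapping: work-unit index -> buffer address range."""
    base: int      # first buffer address covered by work unit 0
    stride: int    # distance between start addresses of consecutive units
    length: int    # number of addresses each work unit reads

    def address_range(self, index: int) -> range:
        start = self.base + index * self.stride
        return range(start, start + self.length)


def ready_work_units(mapping, num_units, producer_regions, finished):
    """Return indexes of work units whose whole input range has been written.

    producer_regions maps a producer id to the range of addresses it writes;
    finished is the set of producer ids that have completed.
    """
    written = set()
    for producer in finished:
        written.update(producer_regions[producer])
    return [
        i for i in range(num_units)
        if all(addr in written for addr in mapping.address_range(i))
    ]


# Two producers each fill half of a 16-element buffer; four work units of 4.
mapping = WorkUnitMap(base=0, stride=4, length=4)
regions = {"p0": range(0, 8), "p1": range(8, 16)}

# Only producer p0 has finished, yet some consumer work units can already run.
early = ready_work_units(mapping, 4, regions, finished={"p0"})
```

In this toy layout the work units that read only addresses 0 to 7 become runnable while producer `p1` is still writing, which is the scheduling property the abstract refers to.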
-
Patent number: 11467827
Abstract: A method for computing includes providing software source code defining a processing pipeline including multiple, sequential stages of parallel computations, in which a plurality of processors apply a computational task to data read from a buffer. A static code analysis is applied to the software source code so as to break the computational task into multiple, independent work units, and to define an index space in which the work units are identified by respective indexes. Based on the static code analysis, mapping parameters that define a mapping between the index space and addresses in the buffer are computed, indicating by the mapping the respective ranges of the data to which the work units are to be applied. The source code is compiled so that the processors execute the work units identified by the respective indexes while accessing the data in the buffer in accordance with the mapping.
Type: Grant
Filed: April 6, 2021
Date of Patent: October 11, 2022
Assignee: HABANA LABS LTD.
Inventors: Michael Zuckerman, Tzachi Cohen, Doron Singer, Ron Shalev, Amos Goldman
-
Patent number: 11321092
Abstract: A processor includes an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
Type: Grant
Filed: October 25, 2018
Date of Patent: May 3, 2022
Assignee: HABANA LABS LTD.
Inventors: Shlomo Raikin, Sergei Gofman, Ran Halutz, Evgeny Spektor, Amos Goldman, Ron Shalev
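Illustrative sketch (not taken from the patent): one way to picture access by multi-dimensional coordinates is a stored array definition that translates coordinates into an external-memory address. The `TensorDescriptor` name, the row-major layout, and the field names are assumptions for this example; the abstract does not specify a layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorDescriptor:
    """Hypothetical array definition: base address, shape, and element size."""
    base_address: int
    shape: Tuple[int, ...]
    element_size: int

    def address_of(self, coords: Tuple[int, ...]) -> int:
        """Translate multi-dimensional coordinates to a row-major byte address."""
        if len(coords) != len(self.shape):
            raise ValueError("coordinate rank does not match array rank")
        offset = 0
        for coord, extent in zip(coords, self.shape):
            if not 0 <= coord < extent:
                raise IndexError("coordinate out of bounds")
            # Row-major: the last dimension varies fastest.
            offset = offset * extent + coord
        return self.base_address + offset * self.element_size


# A 2 x 3 x 4 array of 4-byte elements stored at an assumed base address.
desc = TensorDescriptor(base_address=0x1000, shape=(2, 3, 4), element_size=4)
addr = desc.address_of((1, 2, 3))
```

The caller works purely in coordinates, and the address arithmetic stays inside the descriptor, loosely mirroring the split between the indices and the array definition in the abstract.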
-
Patent number: 11249724
Abstract: A computational apparatus includes a memory unit and Read-Modify-Write (RMW) logic. The memory unit is configured to hold a data value. The RMW logic, which is coupled to the memory unit, is configured to perform an atomic RMW operation on the data value stored in the memory unit.
Type: Grant
Filed: August 28, 2019
Date of Patent: February 15, 2022
Assignee: HABANA LABS LTD.
Inventors: Shlomo Raikin, Ron Shalev, Sergei Gofman, Ran Halutz, Nadav Klein
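Illustrative sketch (not taken from the patent): an atomic read-modify-write can be modeled in software as a read, an update, and a write performed under a single lock. The `AtomicCell` class is a hypothetical software stand-in; the patent describes hardware logic coupled to a memory unit, not a lock.

```python
import threading


class AtomicCell:
    """Software stand-in for a memory word with read-modify-write logic."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def rmw(self, modify):
        """Atomically replace the value with modify(value); return the old value."""
        with self._lock:
            old = self._value
            self._value = modify(old)
            return old

    def load(self):
        with self._lock:
            return self._value


cell = AtomicCell(0)


def worker():
    # Each increment is a single indivisible read-modify-write.
    for _ in range(1000):
        cell.rmw(lambda v: v + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = cell.load()
```

Because every update happens inside one critical section, no increment from the four threads is lost.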
-
Patent number: 10915297
Abstract: Computational apparatus includes a systolic array of processing elements. In each of a sequence of processing cycles, the processing elements in a first row of the array each receive a respective first plurality of first operands, while the processing elements in a first column of the array each receive a respective second plurality of second operands. Each processing element, except in the first row and first column, receives the respective first and second pluralities of the operands from adjacent processing elements in a preceding row and column of the array. Each processing element multiplies pairs of the first and second operands together to generate multiple respective products, and accumulates the products in accumulators. Synchronization logic loads a succession of first and second vectors of the operands into the array, and upon completion of processing triggers the processing elements to transfer respective data values from the accumulators out of the array.
Type: Grant
Filed: November 12, 2018
Date of Patent: February 9, 2021
Assignee: HABANA LABS LTD.
Inventors: Ran Halutz, Tomer Rothschild, Ron Shalev
-
Patent number: 10915494
Abstract: A vector processor includes a coefficient memory and a processor. The processor has an Instruction Set Architecture (ISA), which includes an instruction that approximates a mathematical function by a polynomial. The processor is configured to approximate the mathematical function over an argument, by reading one or more coefficients of the polynomial from the coefficient memory and evaluating the polynomial at the argument using the coefficients.
Type: Grant
Filed: November 11, 2018
Date of Patent: February 9, 2021
Assignee: HABANA LABS LTD.
Inventors: Ron Shalev, Evgeny Spektor, Sergei Gofman, Ran Halutz, Shlomo Raikin, Hilla Ben Yaacov
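Illustrative sketch (not taken from the patent): approximating a function by reading polynomial coefficients from a table and evaluating the polynomial at the argument can be shown with Horner's rule. The `COEFFICIENT_MEMORY` table, the choice of Taylor coefficients for exp(x), and the function names are assumptions for this example; the abstract does not say how the coefficients are derived.

```python
import math

# Hypothetical coefficient memory: Taylor coefficients of exp(x) about 0,
# lowest order first. A real design would likely hold fitted coefficients.
COEFFICIENT_MEMORY = {
    "exp": [1.0 / math.factorial(k) for k in range(7)],
}


def poly_approx(function_name, x):
    """Evaluate the stored polynomial at x using Horner's rule."""
    coefficients = COEFFICIENT_MEMORY[function_name]
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result


def vector_poly_approx(function_name, xs):
    """Apply the approximation element-wise, as a vector instruction would."""
    return [poly_approx(function_name, x) for x in xs]


inputs = [-0.5, 0.0, 0.5]
approx = vector_poly_approx("exp", inputs)
```

With a degree-6 Taylor polynomial the approximation is only accurate near zero, so a practical table would typically hold different coefficients per argument interval.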
-
Patent number: 10891255
Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.
Type: Grant
Filed: March 18, 2015
Date of Patent: January 12, 2021
Assignee: Intel Corporation
Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
-
Patent number: 10853070
Abstract: A processor includes a processing engine, an address queue, an address generation unit, and logic circuitry. The processing engine is configured to process instructions that access data in an external memory. The address generation unit is configured to generate respective addresses for the instructions to be processed by the processing engine, to provide the addresses to the processing engine, and to write the addresses to the address queue. The logic circuitry is configured to access the external memory on behalf of the processing engine while compensating for variations in access latency to the external memory, by reading the addresses from the address queue, and executing the instructions in the external memory in accordance with the addresses read from the address queue.
Type: Grant
Filed: October 3, 2018
Date of Patent: December 1, 2020
Assignee: HABANA LABS LTD.
Inventors: Ron Shalev, Evgeny Spektor, Ran Halutz
-
Patent number: 10853448
Abstract: Computational apparatus includes a memory, which is configured to contain multiple matrices of input data values. An array of processing elements is configured to perform multiplications of respective first and second input operands and to accumulate products of the multiplication to generate respective output values. Data access logic is configured to select from the memory a plurality of mutually-disjoint first matrices and a second matrix, and to distribute to the processing elements the input data values in a sequence that is interleaved among the first matrices, along with corresponding input data values from the second matrix, so as to cause the processing elements to compute, in the interleaved sequence, respective convolutions of each of the first matrices with the second matrix.
Type: Grant
Filed: September 11, 2017
Date of Patent: December 1, 2020
Assignee: HABANA LABS LTD.
Inventors: Ron Shalev, Tomer Rothschild
-
Patent number: 10713214
Abstract: Computational apparatus includes a systolic array of processing elements, each including a multiplier and first and second accumulators. In each of a sequence of processing cycles, the processing elements perform the following steps concurrently: Each processing element, except in the first row and first column of the array, receives first and second operands from adjacent processing elements in a preceding row and column of the array, respectively, multiplies the first and second operands together to generate a product, and accumulates the product in the first accumulator. In addition, each processing element passes a stored output data value from the second accumulator to a succeeding processing element along a respective column of the array, receives a new output data value from a preceding processing element along the respective column, and stores the new output data value in the second accumulator.
Type: Grant
Filed: September 20, 2018
Date of Patent: July 14, 2020
Assignee: HABANA LABS LTD.
Inventors: Ron Shalev, Ran Halutz
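Illustrative sketch (not taken from the patent): a generic output-stationary systolic array can be simulated cycle by cycle to show operands moving between neighboring processing elements while each element multiplies and accumulates. The operand directions, the input skew, and the separate `drain_columns` step are assumptions for this example. In particular, the abstract describes the column-wise output transfer running concurrently with computation, which this simplified model does not capture.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic array."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # first accumulators
    a_reg = [[0] * m for _ in range(n)]    # operands moving left to right
    b_reg = [[0] * m for _ in range(n)]    # operands moving top to bottom
    for cycle in range(n + m + k - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    step = cycle - i       # row i input is delayed by i cycles
                    new_a[i][j] = a[i][step] if 0 <= step < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # from the left neighbor
                if i == 0:
                    step = cycle - j       # column j input is delayed by j cycles
                    new_b[i][j] = b[step][j] if 0 <= step < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # from the upper neighbor
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b
    return acc


def drain_columns(acc):
    """Shift results out through the bottom row, one row per cycle."""
    width = len(acc[0])
    out_regs = [row[:] for row in acc]     # second accumulators, loaded once
    drained = []
    for _ in range(len(acc)):
        drained.append(out_regs[-1])       # bottom row leaves the array
        out_regs = [[0] * width] + out_regs[:-1]
    return drained[::-1]                   # restore top-to-bottom row order


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = drain_columns(systolic_matmul(a, b))
```

Each processing element only ever reads from its left and upper neighbors, so the data movement is purely local, which is the defining property of a systolic design.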
-
Patent number: 10489479
Abstract: Computational apparatus includes a memory, which contains first and second input matrices of input data values, having at least three dimensions including respective heights and widths in a predefined sampling space and a common depth in a feature dimension, orthogonal to the sampling space. An array of processing elements each perform a multiplication of respective first and second input operands and to accumulate products of the multiplication to generate a respective output value. Data access logic extracts first and second pluralities of vectors of the input data values extending in the feature dimension from the first and second input matrices, respectively, and distributes the input data values from the extracted vectors in sequence to the processing elements so as to cause the processing elements to compute a convolution of first and second two-dimensional matrices composed respectively of the first and second pluralities of vectors.
Type: Grant
Filed: September 11, 2017
Date of Patent: November 26, 2019
Assignee: Habana Labs Ltd.
Inventors: Ron Shalev, Sergei Gofman, Amos Goldman, Tomer Rothschild
-
Patent number: 9990287
Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
Type: Grant
Filed: December 21, 2011
Date of Patent: June 5, 2018
Assignee: Intel Corporation
Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
-
Publication number: 20160275043
Abstract: In one embodiment, a heterogeneous multicore processor is described that is optimized to execute multi-stage computer vision algorithms such as cascade classifier workloads. In such embodiment the heterogeneous processor includes at least one SIMD core, such as a vector processor core, coupled with one or more scalar cores. In one embodiment the heterogeneous multiprocessor executes multi-stage compute operations, where the SIMD core computes a first set of stages and the one or more scalar cores compute the second set of stages. In one embodiment, a process for designing a heterogeneous multicore processor is disclosed which optimizes the ratio of scalar to SIMD cores based on execution time of the multi-stage compute operation in relation to processor die area consumed by a processor configuration having the ratio.
Type: Application
Filed: March 18, 2015
Publication date: September 22, 2016
Inventors: Edward T. Grochowski, Michael E. Kounavis, Ron Shalev
-
Patent number: 9411728
Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.
Type: Grant
Filed: December 23, 2011
Date of Patent: August 9, 2016
Assignee: Intel Corporation
Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
-
Patent number: 9405545
Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
Type: Grant
Filed: December 30, 2011
Date of Patent: August 2, 2016
Assignee: Intel Corporation
Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
-
Publication number: 20140223105
Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
Type: Application
Filed: December 30, 2011
Publication date: August 7, 2014
Inventors: Stanislav Shwartsman, Melih Ozgul, Sebastien Hily, Shlomo Raikin, Raanan Sade, Ron Shalev
-
Publication number: 20140208031
Abstract: An apparatus and method are described for efficiently transferring data from a producer core to a consumer core within a central processing unit (CPU). For example, one embodiment of a method comprises: A method for transferring a chunk of data from a producer core of a central processing unit (CPU) to consumer core of the CPU, comprising: writing data to a buffer within the producer core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the fill buffer to a cache accessible by both the producer core and the consumer core; and upon the consumer core detecting that data is available in the cache, providing the data to the consumer core from the cache upon receipt of a read signal from the consumer core.
Type: Application
Filed: December 21, 2011
Publication date: July 24, 2014
Inventors: Shlomo Raikin, Robert Valentine, Raanan Sade, Julius Yuli Mandelbalt, Ron Shalev, Larisa Novakovsky
-
Publication number: 20140192069
Abstract: An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
Type: Application
Filed: December 21, 2011
Publication date: July 10, 2014
Inventors: Shlomo Raikin, Raanan Sade, Robert Valentine, Julius Yuli Mandelblat, Ron Shalev, Larisa Novakovsky
-
Publication number: 20130326145
Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line.
Type: Application
Filed: December 23, 2011
Publication date: December 5, 2013
Inventors: Ron Shalev, Yiftach Gilad, Shlomo Raikin, Igor Yanover, Stanislav Shwartsman, Raanan Sade
-
Patent number: 8527993
Abstract: Techniques are provided which may be implemented in various methods and/or apparatuses to provide a tasking system buffer interface capability to interface with a plurality of shared processes/engines.
Type: Grant
Filed: October 1, 2010
Date of Patent: September 3, 2013
Assignee: Qualcomm Incorporated
Inventors: Raheel Khan, Joseph C. Chan, Ron Shalev, Naveed U. Zaman