Patents by Inventor Chen Koren
Chen Koren has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20240078285Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and executing the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in a subsequent multiplications.Type: ApplicationFiled: November 6, 2023Publication date: March 7, 2024Inventors: Dan BAUM, Chen KOREN, Elmoustapha OULD-AHMED-VALL, Michael ESPIG, Christopher J. HUGHES, Raanan SADE, Robert VALENTINE, Mark J. CHARNEY, Alexander F. HEINECKE
-
Patent number: 11847185Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and executing the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in a subsequent multiplications.Type: GrantFiled: September 24, 2021Date of Patent: December 19, 2023Assignee: Intel CorporationInventors: Dan Baum, Chen Koren, Elmoustapha Ould-Ahmed-Vall, Michael Espig, Christopher J. Hughes, Raanan Sade, Robert Valentine, Mark J. Charney, Alexander F. Heinecke
-
Patent number: 11513893Abstract: A system includes a compute circuit that preemptively performs a computation on a data word before receiving an indication of data errors from an error checking and correction (ECC) circuit. The ECC circuit reads the data word from a memory array and performs error detection and error correction on the data word. The compute circuit reads the data word and performs the computation on the data word to generate an output value, without waiting for the ECC circuit to check and correct the data word. In response to error detection in the data word by the ECC circuit, the compute circuit delays outputting the output value until correction of the output value in accordance with the error detection by the ECC circuit.Type: GrantFiled: December 21, 2020Date of Patent: November 29, 2022Assignee: Intel CorporationInventors: Somnath Paul, Charles Augustine, Chen Koren, George Shchupak, Muhammad M. Khellah
-
Patent number: 11450672Abstract: An ultra-deep compute Static Random Access Memory (SRAM) with high compute throughput and multi-directional data transfer capability is provided. Compute units are placed in both horizontal and vertical directions to achieve a symmetric layout while enabling communication between the compute units. An SRAM array supports simultaneous read and write to the left and right section of the same SRAM subarray by duplicating pre-decoding logic inside the SRAM array. This allows applications with non-overlapping read and write address spaces to have twice the bandwidth as compared to a baseline SRAM array.Type: GrantFiled: April 27, 2020Date of Patent: September 20, 2022Assignee: Intel CorporationInventors: Charles Augustine, Somnath Paul, Muhammad M. Khellah, Chen Koren
-
Publication number: 20220012305Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and executing the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in a subsequent multiplications.Type: ApplicationFiled: September 24, 2021Publication date: January 13, 2022Inventors: Dan BAUM, Chen KOREN, Elmoustapha OULD-AHMED-VALL, Michael ESPIG, Christopher J. HUGHES, Raanan SADE, Robert VALENTINE, Mark J. CHARNEY, Alexander F. HEINECKE
-
Publication number: 20210109809Abstract: A system includes a compute circuit that preemptively performs a computation on a data word before receiving an indication of data errors from an error checking and correction (ECC) circuit. The ECC circuit reads the data word from a memory array and performs error detection and error correction on the data word. The compute circuit reads the data word and performs the computation on the data word to generate an output value, without waiting for the ECC circuit to check and correct the data word. In response to error detection in the data word by the ECC circuit, the compute circuit delays outputting the output value until correction of the output value in accordance with the error detection by the ECC circuit.Type: ApplicationFiled: December 21, 2020Publication date: April 15, 2021Inventors: Somnath Paul, Charles Augustine, Chen Koren, George Shchupak, Muhammad M. Khellah
-
Patent number: 10929503Abstract: An apparatus and method for a masked multiply instruction to support neural network pruning operations. For example, one embodiment of a processor comprises: a decoder to decode a matrix multiplication with masking (GEMM) instruction identifying a destination matrix register to store a result, and source registers storing an A-matrix, a B-matrix, and a matrix mask; execution circuitry to execute the GEMM instruction, the execution circuitry to multiply a plurality of B-matrix elements with a plurality of A-matrix elements, each of the B-matrix elements associated with a mask value in the matrix mask, wherein if the mask value is set to a first value, then the execution circuitry is to multiply the B-matrix element with one or more of the A-matrix elements to generate a first partial result, and if the mask value is set to a second value, then the execution circuitry is to multiply an alternate B-matrix element with a one or more of the A-matrix elements to generate a second partial result.Type: GrantFiled: December 21, 2018Date of Patent: February 23, 2021Assignee: Intel CorporationInventors: Omid Azizi, Chen Koren, Nitin Garegrat
-
Publication number: 20200258890Abstract: An ultra-deep compute Static Random Access Memory (SRAM) with high compute throughput and multi-directional data transfer capability is provided. Compute units are placed in both horizontal and vertical directions to achieve a symmetric layout while enabling communication between the compute units. An SRAM array supports simultaneous read and write to the left and right section of the same SRAM subarray by duplicating pre-decoding logic inside the SRAM array. This allows applications with non-overlapping read and write address spaces to have twice the bandwidth as compared to a baseline SRAM array.Type: ApplicationFiled: April 27, 2020Publication date: August 13, 2020Inventors: Charles AUGUSTINE, Somnath PAUL, Muhammad M. KHELLAH, Chen KOREN
-
Publication number: 20200210517Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and executing the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in a subsequent multiplications.Type: ApplicationFiled: December 27, 2018Publication date: July 2, 2020Inventors: Dan BAUM, Chen KOREN, Elmoustapha OULD-AHMED-VALL, Michael ESPIG, Christopher J. HUGHES, Raanan SADE, Robert VALENTINE, Mark J. CHARNEY, Alexander F. HEINECKE
-
Patent number: 10620951Abstract: Disclosed embodiments relate to sparse matrix multiplication (SMM) acceleration using column folding and squeezing. In one example, a processor, in response to a SMM instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, uses execution circuitry to pack the second matrix by replacing one or more zero-valued elements with non-zero elements yet to be processed, each of the replaced elements further including a field to identify its logical position within the second matrix, and, the execution circuitry further to, for each non-zero element at row M and column K of the specified first matrix, generate a product of the element and each corresponding non-zero element at row K, column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the specified output matrix.Type: GrantFiled: June 22, 2018Date of Patent: April 14, 2020Assignee: Intel CorporationInventors: Omid Azizi, Guy Boudoukh, Tony Werner, Andrew Yang, Michael Rotzin, Chen Koren, Eriko Nurvitadhi
-
Patent number: 10509846Abstract: An accelerator for increasing the processing speed of a processor. The accelerator operates in two distinct modes. In a first mode for dense layer processing, row data sets and column data sets are sent to a multiplier for multiplication. In a second mode for sparse layer processing compressed row data sets are received by a row multiplexer and compressed column data sets are received by a column multiplexer. Each multiplexer is configured to compare the indexes of data sets with one another to determine matching indexes. When indexes match, the matching data sets are selected and sent to the multiplier for multiplication. When indexes do not match, data sets are stored in memory devices for subsequent cycles.Type: GrantFiled: December 13, 2017Date of Patent: December 17, 2019Assignee: Intel CorporationInventors: Chen Koren, Dan Baum
-
Publication number: 20190121837Abstract: An apparatus and method for a masked multiply instruction to support neural network pruning operations. For example, one embodiment of a processor comprises: a decoder to decode a matrix multiplication with masking (GEMM) instruction identifying a destination matrix register to store a result, and source registers storing an A-matrix, a B-matrix, and a matrix mask; execution circuitry to execute the GEMM instruction, the execution circuitry to multiply a plurality of B-matrix elements with a plurality of A-matrix elements, each of the B-matrix elements associated with a mask value in the matrix mask, wherein if the mask value is set to a first value, then the execution circuitry is to multiply the B-matrix element with one or more of the A-matrix elements to generate a first partial result, and if the mask value is set to a second value, then the execution circuitry is to multiply an alternate B-matrix element with a one or more of the A-matrix elements to generate a second partial result.Type: ApplicationFiled: December 21, 2018Publication date: April 25, 2019Inventors: OMID AZIZI, CHEN KOREN, NITIN GAREGRAT
-
Publication number: 20190042538Abstract: An accelerator for increasing the processing speed of a processor. The accelerator operates in two distinct modes. In a first mode for dense layer processing, row data sets and column data sets are sent to a multiplier for multiplication. In a second mode for sparse layer processing compressed row data sets are received by a row multiplexer and compressed column data sets are received by a column multiplexer. Each multiplexer is configured to compare the indexes of data sets with one another to determine matching indexes. When indexes match, the matching data sets are selected and sent to the multiplier for multiplication. When indexes do not match, data sets are stored in memory devices for subsequent cycles.Type: ApplicationFiled: December 13, 2017Publication date: February 7, 2019Inventors: Chen Koren, Dan Baum
-
Publication number: 20190042237Abstract: Disclosed embodiments relate to sparse matrix multiplication (SMM) acceleration using column folding and squeezing. In one example, a processor, in response to a SMM instruction having fields to specify locations of first, second, and output matrices, the second matrix being a sparse matrix, uses execution circuitry to pack the second matrix by replacing one or more zero-valued elements with non-zero elements yet to be processed, each of the replaced elements further including a field to identify its logical position within the second matrix, and, the execution circuitry further to, for each non-zero element at row M and column K of the specified first matrix, generate a product of the element and each corresponding non-zero element at row K, column N of the packed second matrix, and accumulate each generated product with a previous value of a corresponding element at row M and column N of the specified output matrix.Type: ApplicationFiled: June 22, 2018Publication date: February 7, 2019Inventors: Omid AZIZI, Guy BOUDOUKH, Tony WERNER, Andrew YANG, Michael ROTZIN, Chen KOREN, Eriko NURVITADHI
-
Patent number: 9448879Abstract: An apparatus and method are described for detecting and correcting instruction fetch errors within a processor core. For example, in one embodiment, an instruction processing apparatus for detecting and recovering from instruction fetch errors comprises, the instruction processing apparatus performing the operations of: detecting an error associated with an instruction in response to an instruction fetch operation; and determining if the instruction is from a speculative access, wherein if the instruction is not from a speculative access, then responsively performing one or more operations to ensure that the error does not corrupt an architectural state of the processor core.Type: GrantFiled: December 22, 2011Date of Patent: September 20, 2016Assignee: INTEL CORPORATIONInventors: Theodros Yigzaw, Oded Lempel, Hisham Shafi, Geeyarpuram N. Santhanakrishnan, Jose A. Vargas, Ganapati N Srinivasa, Mohan J Kumar, Larisa Novakovsky, Lihu Rappoport, Chen Koren, Julius Mandelblat, Michael Mishaeli
-
Patent number: 9348591Abstract: This disclosure includes tracking of in-use states of cache lines to improve throughput of pipelines and thus increase performance of processors. Access data for a number of sets of instructions stored in an instruction cache may be tracked using an in-use array in a first array until the data for one or more of those sets reach a threshold condition. A second array may then be used as the in-use array to track the sets of instructions after a micro-operation is inserted into the pipeline. When the micro-operation retires from the pipeline, the first array may be cleared. The process may repeat after the second array reaches the threshold condition. During the tracking, an in-use state for an instruction line may be detected by inspecting a corresponding bit in each of the arrays. Additional arrays may also be used to track the in-use state.Type: GrantFiled: December 29, 2011Date of Patent: May 24, 2016Assignee: Intel CorporationInventors: Ilhyun Kim, Chen Koren, Alexandre J. Farcy, Robert L. Hinton, Choon Wei Khor, Lihu Rappoport
-
Publication number: 20140298140Abstract: An apparatus and method are described for detecting and correcting instruction fetch errors within a processor core. For example, in one embodiment, an instruction processing apparatus for detecting and recovering from instruction fetch errors comprises, the instruction processing apparatus performing the operations of: detecting an error associated with an instruction in response to an instruction fetch operation; and determining if the instruction is from a speculative access, wherein if the instruction is not from a speculative access, then responsively performing one or more operations to ensure that the error does not corrupt an architectural state of the processor core.Type: ApplicationFiled: December 22, 2011Publication date: October 2, 2014Inventors: Theodros Yigzaw, Oded Lempel, Hisham Hafi, Geeyarpuram N Santhanakrisnan, Jose A Vargas, Ganapati N Srinivasa, Mohan J Kumar, Larisa Novakovsky, Lihu Rappoport, Chen Koren, Julius Yuli Mandelblat
-
Patent number: 8782374Abstract: Methods and apparatus for inclusion of TLB (translation look-aside buffer) in processor micro-op caches are disclosed. Some embodiments for inclusion of TLB entries have micro-op cache inclusion fields, which are set responsive to accessing the TLB entry. Inclusion logic may the flush the micro-op cache or portions of the micro-op cache and clear corresponding inclusion fields responsive to a replacement or invalidation of a TLB entry whenever its associated inclusion field had been set. Front-end processor state may also be cleared and instructions refetched when replacement resulted from a TLB miss.Type: GrantFiled: December 2, 2008Date of Patent: July 15, 2014Assignee: Intel CorporationInventors: Lihu Rappoport, Chen Koren, Franck Sala, Oded Lempel, Ido Ouziel, Ron Gabor, Gregory Pribush, Lior Libis
-
Publication number: 20130275733Abstract: This disclosure includes tracking of in-use states of cache lines to improve throughput of pipelines and thus increase performance of processors. Access data for a number of sets of instructions stored in an instruction cache may be tracked using an in-use array in a first array until the data for one or more of those sets reach a threshold condition. A second array may then be used as the in-use array to track the sets of instructions after a micro-operation is inserted into the pipeline. When the micro-operation retires from the pipeline, the first array may be cleared. The process may repeat after the second array reaches the threshold condition. During the tracking, an in-use state for an instruction line may be detected by inspecting a corresponding bit in each of the arrays. Additional arrays may also be used to track the in-use state.Type: ApplicationFiled: December 29, 2011Publication date: October 17, 2013Inventors: Ilhyun Kim, Chen Koren, Alexandre J. Farcy, Robert L. Hinton, Choon Wei Khor, Lihu Rappoport
-
Patent number: 8433850Abstract: Methods and apparatus for instruction restarts and inclusion in processor micro-op caches are disclosed. Embodiments of micro-op caches have way storage fields to record the instruction-cache ways storing corresponding macroinstructions. Instruction-cache in-use indications associated with the instruction-cache lines storing the instructions are updated upon micro-op cache hits. In-use indications can be located using the recorded instruction-cache ways in micro-op cache lines. Victim-cache deallocation micro-ops are enqueued in a micro-op queue after micro-op cache miss synchronizations, responsive to evictions from the instruction-cache into a victim-cache. Inclusion logic also locates and evicts micro-op cache lines corresponding to the recorded instruction-cache ways, responsive to evictions from the instruction-cache.Type: GrantFiled: December 2, 2008Date of Patent: April 30, 2013Assignee: Intel CorporationInventors: Lihu Rappoport, Chen Koren, Franck Sala, Ilhyun Kim, Lior Libis, Ron Gabor, Oded Lempel