Patents by Inventor Hadi Parandeh-Afshar
Hadi Parandeh-Afshar has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11048509
Abstract: Providing multi-element multi-vector (MEMV) register file access in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device includes a vector processor comprising multiple processing elements (PEs) communicatively coupled via a corresponding plurality of channels to a vector register file comprising a plurality of memory banks. The vector processor provides a direct memory access (DMA) controller that is configured to receive a plurality of vectors that each comprise a plurality of vector elements representing operands for processing a loop iteration. The DMA controller arranges the vectors in the vector register file such that, for each group of vectors to be accessed in parallel, vector elements for each vector are stored consecutively, but corresponding vector elements of consecutive vectors are stored in different memory banks of the vector register file.
Type: Grant
Filed: June 5, 2018
Date of Patent: June 29, 2021
Inventors: Hadi Parandeh Afshar, Amrit Panda, Eric Rotenberg, Gregory Michael Wright
-
Patent number: 10846260
Abstract: Providing reconfigurable fusion of processing elements (PEs) in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device provides a vector processor including a plurality of PEs and a decode/control circuit. The decode/control circuit receives an instruction block containing a vectorizable loop comprising a loop body. The decode/control circuit determines how many PEs of the plurality of PEs are required to execute the loop body, and reconfigures the plurality of PEs into one or more fused PEs, each including the determined number of PEs required to execute the loop body. The plurality of PEs, reconfigured into one or more fused PEs, then executes one or more loop iterations of the loop body. Some aspects further include a PE communications link interconnecting the plurality of PEs, to enable communications between PEs of a fused PE and communications of inter-iteration data dependencies between PEs without requiring vector register file access operations.
Type: Grant
Filed: July 5, 2018
Date of Patent: November 24, 2020
Assignee: Qualcomm Incorporated
Inventors: Hadi Parandeh Afshar, Amrit Panda, Eric Rotenberg, Gregory Michael Wright
-
Patent number: 10628162
Abstract: Enabling parallel memory accesses by providing explicit affine instructions in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device implementing a block-based dataflow instruction set architecture (ISA) includes a decoder circuit configured to provide an affine instruction that specifies a base parameter indicating a base value B, a stride parameter indicating a stride interval value S, and a count parameter indicating a count value C. The decoder circuit of the vector-processor-based device decodes the affine instruction, and generates an output stream comprising one or more output values, wherein a count of the output values of the output stream equals the count value C. Using an index X where 0≤X<C, each Xth output value in the output stream is generated as a sum of the base value B and a product of the stride interval value S and the index X.
Type: Grant
Filed: June 19, 2018
Date of Patent: April 21, 2020
Assignee: Qualcomm Incorporated
Inventors: Amrit Panda, Eric Rotenberg, Hadi Parandeh Afshar, Gregory Michael Wright
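The output stream described in the abstract for patent 10628162 can be illustrated with a short software sketch. This only models the arithmetic (output X equals B + S·X for each index X below C); the function name `affine_stream` and the use of plain Python integers are assumptions made for illustration and are not taken from the patent.

```python
def affine_stream(base, stride, count):
    """Return the values base + stride * x for x = 0 .. count - 1.

    Hypothetical software model of the affine output stream; the patent
    describes a decoder circuit, not this function.
    """
    if count < 1:
        # The abstract speaks of "one or more output values".
        raise ValueError("count must be at least 1")
    return [base + stride * x for x in range(count)]


if __name__ == "__main__":
    # Base 100, stride 4, count 5: five evenly spaced values.
    print(affine_stream(100, 4, 5))
```

With a base of 100, a stride of 4, and a count of 5, the stream contains 100, 104, 108, 112, and 116, which is the kind of regular address pattern that can be issued as parallel memory accesses.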
-
Publication number: 20200065098
Abstract: Providing efficient handling of branch divergence in vectorizable loops by vector-processor-based devices is disclosed. In some aspects, a vector-processor-based device provides a plurality of processing elements (PEs) coupled to a scheduler circuit comprising a clock cycle threshold and a mask register comprising a plurality of bits corresponding to a plurality of loop iterations of a vectorizable loop to be executed. The scheduler circuit initiates a first execution interval, during which loop iterations of the vectorizable loop are assigned to PEs for parallel execution. If a loop iteration's execution time exceeds the clock cycle threshold, the scheduler circuit sets a mask register bit corresponding to the loop iteration indicating that the loop iteration is incomplete, and defers its execution.
Type: Application
Filed: August 21, 2018
Publication date: February 27, 2020
Inventors: Hadi Parandeh Afshar, Eric Rotenberg, Gregory Michael Wright
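The mask-register behavior in the abstract for publication 20200065098 can be sketched as follows. This is a simplified model of the first execution interval only: it assumes the cycle count of every loop iteration is known up front, which real hardware would instead observe while running. The names `first_interval_mask` and `deferred_iterations` are hypothetical, and how deferred iterations are executed later is not modeled because the abstract does not describe it.

```python
def first_interval_mask(cycle_counts, threshold):
    """Return a bit mask with bit i set when iteration i exceeds the threshold.

    A set bit marks the iteration as incomplete (deferred), mirroring the
    mask register bits described in the abstract.
    """
    mask = 0
    for i, cycles in enumerate(cycle_counts):
        if cycles > threshold:
            mask |= 1 << i
    return mask


def deferred_iterations(mask):
    """List the iteration indices whose mask bit is set."""
    indices = []
    i = 0
    while mask >> i:
        if (mask >> i) & 1:
            indices.append(i)
        i += 1
    return indices


if __name__ == "__main__":
    # Iterations 1 and 3 take longer than the 10-cycle threshold.
    mask = first_interval_mask([3, 12, 5, 20], threshold=10)
    print(bin(mask), deferred_iterations(mask))
```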
-
Publication number: 20200012618
Abstract: Providing reconfigurable fusion of processing elements (PEs) in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device provides a vector processor including a plurality of PEs and a decode/control circuit. The decode/control circuit receives an instruction block containing a vectorizable loop comprising a loop body. The decode/control circuit determines how many PEs of the plurality of PEs are required to execute the loop body, and reconfigures the plurality of PEs into one or more fused PEs, each including the determined number of PEs required to execute the loop body. The plurality of PEs, reconfigured into one or more fused PEs, then executes one or more loop iterations of the loop body. Some aspects further include a PE communications link interconnecting the plurality of PEs, to enable communications between PEs of a fused PE and communications of inter-iteration data dependencies between PEs without requiring vector register file access operations.
Type: Application
Filed: July 5, 2018
Publication date: January 9, 2020
Inventors: Hadi Parandeh Afshar, Amrit Panda, Eric Rotenberg, Gregory Michael Wright
-
Publication number: 20190384606
Abstract: Enabling parallel memory accesses by providing explicit affine instructions in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device implementing a block-based dataflow instruction set architecture (ISA) includes a decoder circuit configured to provide an affine instruction that specifies a base parameter indicating a base value B, a stride parameter indicating a stride interval value S, and a count parameter indicating a count value C. The decoder circuit of the vector-processor-based device decodes the affine instruction, and generates an output stream comprising one or more output values, wherein a count of the output values of the output stream equals the count value C. Using an index X where 0≤X<C, each Xth output value in the output stream is generated as a sum of the base value B and a product of the stride interval value S and the index X.
Type: Application
Filed: June 19, 2018
Publication date: December 19, 2019
Inventors: Amrit Panda, Eric Rotenberg, Hadi Parandeh Afshar, Gregory Michael Wright
-
Publication number: 20190369994
Abstract: Providing multi-element multi-vector (MEMV) register file access in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device includes a vector processor comprising multiple processing elements (PEs) communicatively coupled via a corresponding plurality of channels to a vector register file comprising a plurality of memory banks. The vector processor provides a direct memory access (DMA) controller that is configured to receive a plurality of vectors that each comprise a plurality of vector elements representing operands for processing a loop iteration. The DMA controller arranges the vectors in the vector register file such that, for each group of vectors to be accessed in parallel, vector elements for each vector are stored consecutively, but corresponding vector elements of consecutive vectors are stored in different memory banks of the vector register file.
Type: Application
Filed: June 5, 2018
Publication date: December 5, 2019
Inventors: Hadi Parandeh Afshar, Amrit Panda, Eric Rotenberg, Gregory Michael Wright
-
Patent number: 9231594
Abstract: New logic blocks capable of replacing the use of Look-Up Tables (LUTs) in integrated circuits, such as Field-Programmable Gate Arrays (FPGAs), are disclosed herein. In one embodiment, the new logic block is a tree structure comprised of a number of levels of cells with each cell consisting of a logic gate or the functional equivalent of a logic gate, one or more selectable inverters, and wherein the inputs of the logic block consist of the inputs to the logic gate or functional equivalent of the logic gate and inputs to the selectable inverters. The new logic blocks can map circuits more efficiently than LUTs, because they include multi-output blocks and can cover more logic depth due to the higher input and output bandwidth.
Type: Grant
Filed: August 13, 2014
Date of Patent: January 5, 2016
Assignee: ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE (EPFL)
Inventors: Hadi Parandeh Afshar, David Novo Bruna, Paolo Ienne Lopez, Grace Zgheib
-
Publication number: 20140347096
Abstract: New logic blocks capable of replacing the use of Look-Up Tables (LUTs) in integrated circuits, such as Field-Programmable Gate Arrays (FPGAs), are disclosed herein. In one embodiment, the new logic block is a tree structure comprised of a number of levels of cells with each cell consisting of a logic gate or the functional equivalent of a logic gate, one or more selectable inverters, and wherein the inputs of the logic block consist of the inputs to the logic gate or functional equivalent of the logic gate and inputs to the selectable inverters. The new logic blocks can map circuits more efficiently than LUTs, because they include multi-output blocks and can cover more logic depth due to the higher input and output bandwidth.
Type: Application
Filed: August 13, 2014
Publication date: November 27, 2014
Inventors: Hadi Parandeh Afshar, David Novo Bruna, Paolo Ienne Lopez, Grace Zgheib
-
Patent number: 8836368
Abstract: New logic blocks capable of replacing the use of Look-Up Tables (LUTs) in integrated circuits, such as Field-Programmable Gate Arrays (FPGAs), are disclosed herein. In one embodiment, the new logic block is an AND-Inverter Cone (AIC), which is a binary tree including one or more AND gates with a programmable conditional inversion and a number of intermediary outputs. Compared to LUTs, AICs are richer in terms of input and output bandwidth, because the area of the AICs grows only linearly with the number of inputs. Also, the delay grows only logarithmically with the input count. The new logic blocks can map circuits more efficiently than LUTs, because the AICs are multi-output blocks and can cover more logic depth due to the higher input bandwidth.
Type: Grant
Filed: December 21, 2011
Date of Patent: September 16, 2014
Assignee: Ecole Polytechnique Federale de Lausanne (EPFL)
Inventors: Hadi Parandeh Afshar, David Novo Bruña, Paolo Ienne Lopez
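A small software model may help picture the AND-Inverter Cone from the abstract for patent 8836368. The sketch below assumes a complete binary tree of 2-input AND gates with one programmable inversion bit on each gate's output; the exact placement of the inverters in the patented block may differ, and the name `evaluate_cone` and its argument layout are hypothetical.

```python
def evaluate_cone(inputs, invert):
    """Evaluate a complete binary tree of AND gates with optional inversion.

    inputs: list of booleans whose length is a power of two (at least 2).
    invert: one boolean per gate, ordered level by level from the gates
            nearest the inputs up to the root (len(inputs) - 1 entries).
    Returns the gate outputs of every level, so intermediary outputs are
    visible as well as the root output (the last entry).
    """
    n = len(inputs)
    if n < 2 or n & (n - 1):
        raise ValueError("number of inputs must be a power of two, at least 2")
    if len(invert) != n - 1:
        raise ValueError("need exactly one inversion bit per gate")

    level = [bool(v) for v in inputs]
    levels = []
    gate = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            and_out = level[i] and level[i + 1]
            # Programmable conditional inversion of the gate output.
            nxt.append(and_out != bool(invert[gate]))
            gate += 1
        levels.append(nxt)
        level = nxt
    return levels


if __name__ == "__main__":
    # Inverting all three gates turns the cone into (a AND b) OR (c AND d).
    print(evaluate_cone([True, True, False, False], [True, True, True]))
```

In this model a cone with n inputs uses n − 1 gates arranged in log2(n) levels, which matches the linear area growth and logarithmic delay growth stated in the abstract.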
-
Patent number: 8667046
Abstract: A Generalized Programmable Counter Array (GPCA) is a reconfigurable multi-operand adder, which can be reprogrammed to sum a plurality of operands of arbitrary size. The GPCA is configured to compress the input words down to two operands using parallel counters. Resulting operands are then summed using a standard Ripple Carry Adder to produce the final result. The GPCA consists of a linear arrangement of identical compressor slices (CSlice).
Type: Grant
Filed: February 20, 2009
Date of Patent: March 4, 2014
Assignee: Ecole Polytechnique Federale de Lausanne/Service des Relations Industrielles
Inventors: Philip Brisk, Alessandro Cevrero, Frank K. Gurkaynak, Paolo Ienne Lopez, Hadi Parandeh-Afshar
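The two-stage flow in the abstract for patent 8667046 (compress many operands to two, then add them with a ripple carry adder) can be sketched in software. The sketch substitutes the simplest parallel counter, a 3:2 counter (full adder) applied bitwise, for the patent's compressor slices, whose internal structure is not modeled here. All function names are hypothetical, and operands are assumed to be non-negative integers.

```python
def compress_3_to_2(a, b, c):
    """Bitwise 3:2 counter: returns (sum_word, carry_word) with a+b+c == sum+carry."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def ripple_carry_add(a, b, width):
    """Add two non-negative integers bit by bit, propagating one carry."""
    result = 0
    carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result | (carry << width)  # final carry-out becomes the top bit


def multi_operand_sum(operands):
    """Compress the operands down to two words, then ripple-carry add them."""
    words = list(operands)
    if any(w < 0 for w in words):
        raise ValueError("this sketch handles non-negative operands only")
    while len(words) > 2:
        a, b, c = words.pop(), words.pop(), words.pop()
        words.extend(compress_3_to_2(a, b, c))  # three words become two
    while len(words) < 2:
        words.append(0)
    width = max(words[0].bit_length(), words[1].bit_length(), 1)
    return ripple_carry_add(words[0], words[1], width)


if __name__ == "__main__":
    print(multi_operand_sum([5, 9, 3, 12, 7]))
```

Each compression step replaces three words with two while preserving their total, so carries are only propagated once, in the final addition.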
-
Publication number: 20130162292
Abstract: New logic blocks capable of replacing the use of Look-Up Tables (LUTs) in integrated circuits, such as Field-Programmable Gate Arrays (FPGAs), are disclosed herein. In one embodiment, the new logic block is an AND-Inverter Cone (AIC), which is a binary tree including one or more AND gates with a programmable conditional inversion and a number of intermediary outputs. Compared to LUTs, AICs are richer in terms of input and output bandwidth, because the area of the AICs grows only linearly with the number of inputs. Also, the delay grows only logarithmically with the input count. The new logic blocks can map circuits more efficiently than LUTs, because the AICs are multi-output blocks and can cover more logic depth due to the higher input bandwidth.
Type: Application
Filed: December 21, 2011
Publication date: June 27, 2013
Applicant: ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE (EPFL)
Inventors: Hadi Parandeh Afshar, David Novo Bruña, Paolo Ienne Lopez
-
Publication number: 20090216826
Abstract: A Generalized Programmable Counter Array (GPCA) is a reconfigurable multi-operand adder, which can be reprogrammed to sum a plurality of operands of arbitrary size. The GPCA is configured to compress the input words down to two operands using parallel counters. Resulting operands are then summed using a standard Ripple Carry Adder to produce the final result. The GPCA consists of a linear arrangement of identical compressor slices (CSlice).
Type: Application
Filed: February 20, 2009
Publication date: August 27, 2009
Applicant: Ecole Polytechnique Federale de Lausanne/Service des Relations Industrielles (SRI)
Inventors: Philip Brisk, Alessandro Cevrero, Frank K. Gurkaynak, Paolo Ienne Lopez, Hadi Parandeh-Afshar