Multiplication Followed By Addition (i.e., x*y+z) Patents (Class 708/523)

Patent number: 11409536
Abstract: A method and apparatus for performing a multiprecision computation in a plurality of arithmetic logic units (ALUs) includes pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices. A third SIMD block channel device is paired with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices. A plurality of source inputs are received at the first block pair and the second block pair. The first block pair computes a first result, and the second block pair computes a second result.
Type: Grant
Filed: November 3, 2016
Date of Patent: August 9, 2022
Assignee: ADVANCED MICRO DEVICES, INC.
Inventors: Bin He, YunXiao Zou, Jiasheng Chen, Michael Mantor

Patent number: 11334319
Abstract: An apparatus and method for multiplying packed unsigned words.
Type: Grant
Filed: June 30, 2017
Date of Patent: May 17, 2022
Assignee: Intel Corporation
Inventors: Venkateswara Rao Madduri, Elmoustapha Ould-Ahmed-Vall, Robert Valentine

Patent number: 11308574
Abstract: Embodiments described herein provide a graphics processor that can perform a variety of mixed and multiple precision instructions and operations. One embodiment provides a streaming multiprocessor that can concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions. The streaming multiprocessor can perform concurrent integer and floating-point operations and includes a mixed precision core to perform operations at multiple precisions.
Type: Grant
Filed: August 3, 2020
Date of Patent: April 19, 2022
Assignee: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Sara S. Baghsorkhi, Anbang Yao, Kevin Nealis, Xiaoming Chen, Altug Koker, Abhishek R. Appu, John C. Weast, Mike B. Macpherson, Dukhwan Kim, Linda L. Hurd, Ben J. Ashbaugh, Barath Lakshmanan, Liwei Ma, Joydeep Ray, Ping T. Tang, Michael S. Strickland

Patent number: 11281428
Abstract: A data processing apparatus is provided to convert a plurality of signed digits to an output value, the data processing apparatus comprising: receiver circuitry to receive, at each of a plurality of iterations, a signed digit from the plurality of signed digits, and previous intermediate data. Conversion circuitry performs a negative-output conversion from the signed digit to an unsigned digit, such that the output value comprising the unsigned digit is negative. Concatenation circuitry concatenates bits of the unsigned digit and bits of the previous intermediate data to produce updated intermediate data, and output circuitry provides the updated intermediate data as the previous intermediate data of a next iteration. After the plurality of iterations, the output circuitry outputs at least part of the updated intermediate data as the output value.
Type: Grant
Filed: March 12, 2019
Date of Patent: March 22, 2022
Assignee: ARM LIMITED
Inventor: Javier Diaz Bruguera

Patent number: 11270405
Abstract: An apparatus to facilitate compute optimization is disclosed. The apparatus includes a mixed precision core to perform a mixed precision multidimensional matrix multiply and accumulate operation on 8-bit and/or 32-bit signed or unsigned integer elements.
Type: Grant
Filed: August 3, 2020
Date of Patent: March 8, 2022
Assignee: Intel Corporation
Inventors: Abhishek R. Appu, Altug Koker, Linda L. Hurd, Dukhwan Kim, Mike B. Macpherson, John C. Weast, Feng Chen, Farshad Akhbari, Narayan Srinivasa, Nadathur Rajagopalan Satish, Joydeep Ray, Ping T. Tang, Michael S. Strickland, Xiaoming Chen, Anbang Yao, Tatiana Shpeisman

Patent number: 11256476
Abstract: A tile of an FPGA includes a multiple mode arithmetic circuit. The multiple mode arithmetic circuit is configured by control signals to operate in an integer mode, a floating-point mode, or both. In some example embodiments, multiple integer modes (e.g., unsigned, two's complement, and sign-magnitude) are selectable, multiple floating-point modes (e.g., 16-bit mantissa and 8-bit sign, 8-bit mantissa and 6-bit sign, and 6-bit mantissa and 6-bit sign) are supported, or any suitable combination thereof. The tile may also fuse a memory circuit with the arithmetic circuits. Connections directly between multiple instances of the tile are also available, allowing multiple tiles to be treated as larger memories or arithmetic circuits. By using these connections, referred to as cascade inputs and outputs, the input and output bandwidth of the arithmetic circuit is further increased.
Type: Grant
Filed: August 8, 2019
Date of Patent: February 22, 2022
Assignee: Achronix Semiconductor Corporation
Inventors: Daniel Pugh, Raymond Nijssen, Michael Philip Fitton, Marcel Van der Goot

Patent number: 11249723
Abstract: A method related to posit tensor processing can include receiving, by a plurality of multiply-accumulator (MAC) units coupled to one another, a plurality of universal number (unum) or posit bit strings organized in a matrix and to be used as operands in a plurality of respective recursive operations performed using the plurality of MAC units and performing, using the MAC units, the plurality of respective recursive operations. Iterations of the respective recursive operations are performed using at least one bit string that is a same bit string as was used in a preceding iteration of the respective recursive operations. The method can further include prior to receiving the plurality of unum or posit bit strings, performing an operation to organize the plurality of unum or posit bit strings to achieve a threshold bandwidth ratio, a threshold latency, or both during performance of the plurality of respective recursive operations.
Type: Grant
Filed: April 2, 2020
Date of Patent: February 15, 2022
Assignee: Micron Technology, Inc.
Inventor: Vijay S. Ramesh

Patent number: 11237833
Abstract: The present invention discloses an instruction processing apparatus, comprising a first register adapted to store first source data, a second register adapted to store second source data, a third register adapted to store accumulated data, a decoder adapted to receive and decode a multiply-accumulate instruction, and an execution unit. The multiply-accumulate instruction indicates that the first register serves as a first operand, the second register serves as a second operand, the third register serves as a third operand, and a shift flag.
Type: Grant
Filed: April 10, 2020
Date of Patent: February 1, 2022
Assignee: Alibaba Group Holding Limited
Inventors: Jiahui Luo, Zhijian Chen, Yubo Guo, Wenmeng Zhang

Patent number: 11200723
Abstract: A texture filtering unit includes a datapath block and a control block. The datapath block includes one or more parallel computation pipelines, each containing at least one hardware logic component configured to receive a plurality of inputs and generate an output value as part of a texture filtering operation. The control block includes a plurality of sequencers and an arbiter. Each sequencer executes a microprogram that defines a sequence of operations to be performed by the one or more pipelines in the datapath block as part of a texture filtering operation and the arbiter controls access, by the sequencers, to the one or more pipelines in the datapath based on predefined prioritization rules.
Type: Grant
Filed: February 25, 2020
Date of Patent: December 14, 2021
Assignee: Imagination Technologies Limited
Inventor: Casper Van Benthem

Patent number: 11113084
Abstract: This application concerns methods, apparatus, and systems for performing quantum circuit synthesis and/or for implementing the synthesis results in a quantum computer system. In certain example embodiments: a universal gate set, a target unitary described by a target angle, and target precision is received (input); a corresponding quaternion approximation of the target unitary is determined; and a quantum circuit corresponding to the quaternion approximation is synthesized, the quantum circuit being over a single qubit gate set, the single qubit gate set being realizable by the given universal gate set for the target quantum computer architecture.
Type: Grant
Filed: September 26, 2016
Date of Patent: September 7, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Vadym Kliuchnikov, Jon Yard, Martin Roetteler, Alexei Bocharov

Patent number: 11061854
Abstract: A vector reduction circuit configured to reduce an input vector of elements comprises a plurality of cells, wherein each of the plurality of cells other than a designated first cell that receives a designated first element of the input vector is configured to receive a particular element of the input vector, receive, from another of the one or more cells, a temporary reduction element, perform a reduction operation using the particular element and the temporary reduction element, and provide, as a new temporary reduction element, a result of performing the reduction operation using the particular element and the temporary reduction element. The vector reduction circuit also comprises an output circuit configured to provide, for output as a reduction of the input vector, a new temporary reduction element corresponding to a result of performing the reduction operation using a last element of the input vector.
Type: Grant
Filed: July 1, 2020
Date of Patent: July 13, 2021
Assignee: Google LLC
Inventors: Gregory Michael Thorson, Andrew Everett Phelps, Olivier Temam
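
The chained reduction described in this abstract can be illustrated in software. The following is only a rough sketch, not the patented circuit: the function name `reduce_chain` and the idea of modelling each cell as one loop iteration are assumptions made for illustration.

```python
from operator import add


def reduce_chain(vector, op):
    """Reduce `vector` by passing a temporary reduction element from cell to cell.

    Hypothetical software model: each loop iteration stands in for one cell.
    """
    if not vector:
        raise ValueError("input vector must contain at least one element")
    # The designated first cell only receives the first element and forwards it.
    temp = vector[0]
    # Every later cell combines its own element with the incoming temporary
    # reduction element and passes the result on as the new temporary element.
    for element in vector[1:]:
        temp = op(element, temp)
    # The value produced after the last element is the reduction of the vector.
    return temp


total = reduce_chain([3, 1, 4, 1, 5], add)
largest = reduce_chain([3, 1, 4, 1, 5], max)
```

Any associative two-input operation (sum, maximum, minimum) can be passed as `op` in this model.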

Patent number: 10990354
Abstract: An accelerating device includes a signal detector that converts a first input signal and a second input signal into a first converted input signal and a second converted input signal, respectively, and that generates a final zero-value flag signal, a first one-value flag signal, and a second one-value flag signal. The accelerating device further includes a processing element (PE) that processes the first converted input signal and the second converted input signal based on the final zero-value flag signal, the first one-value flag signal, and the second one-value flag signal and that skips a first arithmetic operation and a second arithmetic operation when the final zero-value flag signal has a first value. The first value of the final zero-value flag signal indicates that the first input signal, or the second input signal, or both have a value of 0.
Type: Grant
Filed: September 12, 2019
Date of Patent: April 27, 2021
Assignee: SK hynix Inc.
Inventor: Jae Hyeok Jang
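
A software analogy may help show why the flags save work. The sketch below is an assumption-laden illustration, not the device in the abstract: it treats the two skipped arithmetic operations as a multiply and an accumulate, and it assumes the one-value flags are used to bypass the multiplier (the abstract does not say how they are used). The names `detect_flags` and `pe_accumulate` are hypothetical, and the signal conversion step is omitted.

```python
def detect_flags(a, b):
    """Hypothetical signal detector: returns (zero_flag, a_is_one, b_is_one)."""
    zero_flag = a == 0 or b == 0
    return zero_flag, a == 1, b == 1


def pe_accumulate(pairs):
    """Accumulate a*b over `pairs`, counting how many real multiplies were needed."""
    acc = 0
    multiplies = 0
    for a, b in pairs:
        zero_flag, a_is_one, b_is_one = detect_flags(a, b)
        if zero_flag:
            # Product is 0: skip both the multiply and the accumulate.
            continue
        if a_is_one:
            product = b          # assumed shortcut: 1 * b needs no multiplier
        elif b_is_one:
            product = a          # assumed shortcut: a * 1 needs no multiplier
        else:
            product = a * b
            multiplies += 1
        acc += product
    return acc, multiplies


result, used = pe_accumulate([(0, 5), (1, 7), (3, 1), (2, 4), (6, 0)])
```

In this example only one of the five input pairs reaches the multiplier, while the accumulated value matches a plain sum of products.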

Patent number: 10853037
Abstract: Embodiments of the present disclosure pertain to digital circuits with compressed carries. In one embodiment, an adder circuit generates a sum and carry. The carry is compressed to reduce the number of bits required to represent the carry. In one embodiment, a multiplier circuit generates output product values. The output product values may be summed to produce a sum and carry. The carry may be compressed. In other embodiments, a multiplier circuit receives an input sum and compressed carry. The compressed input carry is decompressed and added to output product values and the input sum, and a resulting carry is compressed. The output of such a multiplier is another sum and compressed carry.
Type: Grant
Filed: July 14, 2020
Date of Patent: December 1, 2020
Assignee: Groq, Inc.
Inventors: Christopher Aaron Clark, Jonathan Ross

Patent number: 10846088
Abstract: When executing a program on a data processor comprising an execution unit for executing instructions in a program to be executed by the data processor, the execution unit being associated with one or more hardware units operable to execute instructions, at least one instruction in a program is associated with an indication of whether the instruction should be issued directly for execution by a hardware unit or should be intercepted during its execution by the execution unit. The execution unit then, when decoding the instruction for execution by a hardware unit in the program, determines from the indication associated with the instruction whether the instruction should be issued directly for execution by a hardware unit or intercepted during its execution by the execution unit, and issues the instruction for execution by a hardware unit directly, or pauses execution of the instruction and performs another operation, accordingly.
Type: Grant
Filed: August 21, 2018
Date of Patent: November 24, 2020
Assignee: Arm Limited
Inventors: Mark Underwood, Hakan Lars-Goran Persson, Arne Aas

Patent number: 10838695
Abstract: The present embodiments relate to circuitry that efficiently performs floating-point arithmetic operations and fixed-point arithmetic operations. Such circuitry may be implemented in specialized processing blocks. If desired, the specialized processing blocks may include configurable interconnect circuitry to support a variety of different use modes. For example, the specialized processing block may efficiently perform a fixed-point or floating-point addition operation or a portion thereof, a fixed-point or floating-point multiplication operation or a portion thereof, a fixed-point or floating-point multiply-add operation or a portion thereof, just to name a few. In some embodiments, two or more specialized processing blocks may be arranged in a cascade chain and perform together more complex operations such as a recursive mode dot product of two vectors of floating-point numbers or a Radix-2 Butterfly circuit, just to name a few.
Type: Grant
Filed: June 4, 2019
Date of Patent: November 17, 2020
Assignee: Altera Corporation
Inventor: Martin Langhammer

Patent number: 10817587
Abstract: A reconfigurable matrix multiplier (RMM) system/method allowing tight or loose coupling to supervisory control processor application control logic (ACL) in a system-on-a-chip (SOC) environment is disclosed. The RMM provides for C=A*B matrix multiplication operations having A-multiplier-matrix (AMM), B-multiplicand-matrix (BMM), and C-product-matrix (CPM), as well as C=A*B+D operations in which D-summation-matrix (DSM) represents the result of a previous multiplication operation or another previously defined matrix. The RMM provides for additional CPM LOAD/STORE paths allowing overlapping of compute/data transfer operations and provides for CPM data feedback to the AMM or BMM operand inputs from a previously calculated CPM result.
Type: Grant
Filed: February 26, 2018
Date of Patent: October 27, 2020
Assignee: TEXAS INSTRUMENTS INCORPORATED
Inventors: Arthur John Redfern, Donald Edward Steiss, Timothy David Anderson, Kai Chirca

Patent number: 10795676
Abstract: An apparatus and method for multiplying packed real and imaginary components of complex numbers.
Type: Grant
Filed: September 29, 2017
Date of Patent: October 6, 2020
Assignee: Intel Corporation
Inventors: Venkateswara Madduri, Elmoustapha Ould-Ahmed-Vall, Jesus Corbal, Mark Charney, Robert Valentine, Binwei Yang

Patent number: 10776109
Abstract: A microprocessor with dynamically adjustable bit width is provided, which has a bit width register, a datapath, a statistical register, and a bit width adjuster. The bit width register stores at least one bit width. The datapath operates according to the bit width stored in the bit width register to acquire input operands from received data and process input operands. The statistical register collects calculation results of the datapath. The bit width adjuster adjusts the bit width stored in the bit width register based on the calculation results collected in the statistical register.
Type: Grant
Filed: October 18, 2018
Date of Patent: September 15, 2020
Assignee: SHANGHAI ZHAOXIN SEMICONDUCTOR CO., LTD.
Inventors: Jing Chen, Xiaoyang Li, Juanli Song, Zhenhua Huang, Weilin Wang, Jiin Lai

Patent number: 10769746
Abstract: A data queuing and format apparatus is disclosed. A first selection circuit may be configured to selectively couple a first subset of data to a first plurality of data lines dependent upon control information, and a second selection circuit may be configured to selectively couple a second subset of data to a second plurality of data lines dependent upon the control information. A storage array may include multiple storage units, and each storage unit may be configured to receive data from one or more data lines of either the first or second plurality of data lines dependent upon the control information.
Type: Grant
Filed: September 25, 2014
Date of Patent: September 8, 2020
Assignee: Apple Inc.
Inventors: Liang Xia, Robert D. Kenney, Benjiman L. Goodman, Terence M. Potter

Patent number: 10664270
Abstract: An apparatus and method for performing signed multiplication of packed signed/unsigned doublewords and accumulation with a quadword.
Type: Grant
Filed: December 21, 2017
Date of Patent: May 26, 2020
Assignee: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Mark Charney, Jesus Corbal, Venkateswara Madduri

Patent number: 10628124
Abstract: Techniques and circuits are provided for stochastic rounding. In an embodiment, a circuit includes carry-save adder (CSA) logic having three or more CSA inputs, a CSA sum output, and a CSA carry output. One of the three or more CSA inputs is presented with a random number value, while other CSA inputs are presented with input values to be summed. The circuit further includes adder logic having adder inputs and a sum output. The CSA carry output of the CSA logic is coupled with one of the adder inputs of the adder logic, and the CSA sum output of the CSA logic is coupled with another input of the adder inputs of the adder logic. A particular number of most significant bits of the sum output of the adder logic represent a stochastically rounded sum of the input values.
Type: Grant
Filed: March 22, 2018
Date of Patent: April 21, 2020
Assignee: ADVANCED MICRO DEVICES, INC.
Inventor: Gabriel H. Loh
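
The data flow in this abstract (random value into a carry-save adder, sum and carry into a final adder, keep the top bits) can be mimicked with integers. This is a minimal sketch under stated assumptions: non-negative integer inputs, exactly three CSA inputs, and a random value as wide as the number of discarded bits. The function names and the `drop_bits` parameter are hypothetical and do not come from the patent.

```python
import random


def carry_save_add(a, b, c):
    """3:2 carry-save adder on non-negative ints: a + b + c == sum_out + carry_out."""
    sum_out = a ^ b ^ c
    carry_out = ((a & b) | (a & c) | (b & c)) << 1
    return sum_out, carry_out


def stochastic_round_sum(x, y, drop_bits, rng):
    """Add x and y, then discard the low `drop_bits` bits with stochastic rounding."""
    # Assumption: the random value spans exactly the bits that will be discarded.
    r = rng.getrandbits(drop_bits)
    # The random value enters as the third CSA input next to the two addends.
    sum_out, carry_out = carry_save_add(x, y, r)
    # The final adder combines the CSA sum and carry outputs.
    total = sum_out + carry_out
    # Only the most significant bits are kept as the rounded result.
    return total >> drop_bits


rng = random.Random(0)
samples = [stochastic_round_sum(5, 6, 3, rng) for _ in range(1000)]
```

With `x + y = 11` and three discarded bits, each sample is either 1 or 2, and the average over many samples tends toward 11/8, which is the point of stochastic rounding.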

Patent number: 10546045
Abstract: Systems and methods are provided for performing a dot product. Each of a first series of numbers is divided into a first value, comprising the N most significant bits of the number, and a second value to form first and second sets of values. Each of a second series of numbers is divided into a third value, comprising the N most significant bits of the number, and a fourth value to form third and fourth sets of values. A dot product of the first and fourth sets of values is computed to provide a first partial sum. A dot product of the first and third sets of values is computed to provide a second partial sum. A dot product of the second and third sets of values is computed to provide a third partial sum. The partial sums are summed to provide a result for the dot product.
Type: Grant
Filed: December 19, 2017
Date of Patent: January 28, 2020
Assignee: TEXAS INSTRUMENTS INCORPORATED
Inventors: Lester Anderson Longley, Misael Lopez Cruz, Victor Cheng
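
The three partial sums named in this abstract can be worked through numerically. The sketch below assumes unsigned fixed-width integers and keeps each high part at its original bit position, so that every number equals its high part plus its low part; the function names, `width`, and `n` are illustrative assumptions. The abstract lists no low-by-low partial sum, so the sketch returns that term separately to show what the three listed sums leave out.

```python
def split(values, width, n):
    """Split each unsigned `width`-bit value into (high n bits kept in place, low bits)."""
    mask = (1 << (width - n)) - 1
    high = [v & ~mask for v in values]
    low = [v & mask for v in values]
    return high, low


def dot(xs, ys):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(xs, ys))


def split_dot(a, b, width=16, n=8):
    """Return (sum of the three partial sums, omitted low-by-low term)."""
    a_hi, a_lo = split(a, width, n)   # first and second sets of values
    b_hi, b_lo = split(b, width, n)   # third and fourth sets of values
    first_partial = dot(a_hi, b_lo)   # first set with fourth set
    second_partial = dot(a_hi, b_hi)  # first set with third set
    third_partial = dot(a_lo, b_hi)   # second set with third set
    return first_partial + second_partial + third_partial, dot(a_lo, b_lo)


a = [0x1234, 0x00FF]
b = [0x0102, 0x0300]
approx, residual = split_dot(a, b)
exact = dot(a, b)
```

Under these assumptions `approx + residual` equals `exact`, and the residual is bounded by the product of the low parts, which is small relative to the full result when `n` covers the significant bits.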

Patent number: 10528346
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
Type: Grant
Filed: March 29, 2018
Date of Patent: January 7, 2020
Assignee: Intel Corporation
Inventors: Dipankar Das, Naveen K. Mellempudi, Mrinmay Dutta, Arun Kumar, Dheevatsa Mudigere, Abhisek Kundu

Patent number: 10372417
Abstract: Disclosed herein is a computer implemented method for performing multiply-add operations of binary numbers P, Q, R, S, B in an arithmetic unit of a processor, the operation calculating a result as an accumulated sum, which equals to B+n×P×Q+m×R×S, where n and m are natural numbers. Further disclosed herein is an arithmetic unit configured to implement multiply-add operations of binary numbers P, Q, R, S, B comprising at least a first binary arithmetic unit for calculating an aligned high part result and a second binary arithmetic unit for calculating an aligned low part result of the multiply-add operations.
Type: Grant
Filed: July 13, 2017
Date of Patent: August 6, 2019
Assignee: International Business Machines Corporation
Inventors: Tina Babinsky, Michael Klein, Cedric Lichtenau, Silvia M. Mueller

Patent number: 10365860
Abstract: A circuit that includes a plurality of array cores, each array core of the plurality of array cores comprising: a plurality of distinct data processing circuits; and a data queue register file; a plurality of border cores, each border core of the plurality of border cores comprising: at least a register file, wherein: [i] at least a subset of the plurality of border cores encompasses a periphery of a first subset of the plurality of array cores; and [ii] a combination of the plurality of array cores and the plurality of border cores define an integrated circuit array.
Type: Grant
Filed: March 1, 2019
Date of Patent: July 30, 2019
Assignee: quadric.io, Inc.
Inventors: Nigel Drego, Aman Sikka, Mrinalini Ravichandran, Ananth Durbha, Robert Daniel Firu, Veerbhan Kheterpal

Patent number: 10338925
Abstract: Tensor register files in a hardware accelerator are disclosed. An apparatus may comprise tensor operation calculators each configured to perform a type of tensor operation. The apparatus may also comprise tensor register files, each of which is associated with one of the tensor operation calculators. The apparatus may also comprise logic configured to store respective ones of the tensors in the plurality of tensor register files in accordance with the type of tensor operation to be performed on the respective tensors. The apparatus may also control read access to tensor register files based on a type of tensor operation that a machine instruction is to perform.
Type: Grant
Filed: May 24, 2017
Date of Patent: July 2, 2019
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jeremy Halden Fowers, Steven Karl Reinhardt, Kalin Ovtcharov, Eric Sen Chung

Processor and method for executing in-memory copy instructions indicating on-chip or off-chip memory
Patent number: 10261796
Abstract: A processor and a method for executing an instruction on a processor are provided. In the method, a to-be-executed instruction is fetched, the instruction including a source address field, a destination address field, an operation type field, and an operation parameter field; in at least one execution unit, an execution unit controlled by a to-be-generated control signal according to the operation type field is determined, a source address and a destination address of data operated by the execution unit are determined according to the source address field and the destination address field, and a data amount of the data operated by the execution unit controlled by the to-be-generated control signal is determined according to the operation parameter field; the control signal is generated; and the execution unit in the at least one execution unit is controlled by using the control signal.
Type: Grant
Filed: November 23, 2016
Date of Patent: April 16, 2019
Assignee: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY CO., LTD
Inventors: Jian Ouyang, Wei Qi, Yong Wang

Patent number: 10198263
Abstract: Apparatus and methods are disclosed for nullifying one or more registers identified in a target field of a nullification instruction. In some examples of the disclosed technology, an apparatus can include memory and one or more block-based processor cores configured to fetch and execute a plurality of instruction blocks. One of the cores can include a control unit configured, based at least in part on receiving a nullification instruction, to obtain a register identification of at least one of a plurality of registers, based on a target field of the nullification instruction. A write to the at least one register associated with the register identification is nullified. The nullification instruction is in a first instruction block of the plurality of instruction blocks. Based on the nullified write to the at least one register, a subsequent instruction is executed from a second, different instruction block.
Type: Grant
Filed: March 3, 2016
Date of Patent: February 5, 2019
Assignee: Microsoft Technology Licensing, LLC
Inventors: Douglas C. Burger, Aaron L. Smith

Patent number: 10169297
Abstract: In one example in accordance with the present disclosure a resistive memory array is described. The array includes a number of resistive memory elements to receive a common-valued read signal. The array also includes a number of multiplication engines to perform a multiply operation by receiving a memory element output from a corresponding resistive memory element, receiving an input signal, and generating a multiplication output based on a received memory element output and a received input signal. The array also includes an accumulation engine to sum multiplication outputs from the number of multiplication engines.
Type: Grant
Filed: April 16, 2015
Date of Patent: January 1, 2019
Assignee: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP
Inventor: Brent Buchanan

Patent number: 10152456
Abstract: A correlation operation circuit includes a first SRAM storing a plurality of pieces of detection pattern data, product-sum operators, a second SRAM storing intermediate data, and a comparator. When time series data is sequentially input, the intermediate data of all correlation functions referring to one time series data is updated in a period during which the one time series data is input. When one time series data is input, the product-sum operator multiplies the detection pattern data sequentially read from the first SRAM by the one input time series data. The corresponding intermediate data is read from the second SRAM in synchronization with the multiplication, and the sequentially-calculated products are cumulatively added to the read intermediate data to be written back into the second SRAM as the intermediate data. As a result, the calculated correlation function data is supplied to the comparator to be compared with a predetermined specified value.
Type: Grant
Filed: May 1, 2017
Date of Patent: December 11, 2018
Assignee: Renesas Electronics Corporation
Inventor: Hiroshi Ueki
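
The per-sample update of intermediate data can be modelled in software. The sketch below is a loose interpretation, not the circuit itself: it assumes a single detection pattern whose coefficients play the role of the detection pattern data, uses a small circular list in place of the second SRAM, and treats the comparator as a simple threshold test. The function name `stream_correlate` and its parameters are hypothetical.

```python
def stream_correlate(samples, pattern, threshold):
    """Return start indices whose correlation with `pattern` reaches `threshold`.

    Each incoming sample updates every correlation sum that refers to it,
    so no sample has to be stored or revisited.
    """
    length = len(pattern)
    partial = [0] * length           # in-flight sums (stand-in for the second SRAM)
    hits = []
    for j, x in enumerate(samples):
        partial[j % length] = 0      # open a fresh sum for the window starting at j
        for k, coeff in enumerate(pattern):
            start = j - k            # window in which sample j sits at offset k
            if start >= 0:
                # Read-modify-write of the intermediate data for that window.
                partial[start % length] += coeff * x
        done = j - length + 1        # this window has now seen all its samples
        if done >= 0 and partial[done % length] >= threshold:
            hits.append(done)        # comparator: value meets the specified threshold
    return hits


hits = stream_correlate([0, 1, 2, 3, 0, 1, 2, 3], [1, 2, 3], 14)
```

For this input the windows starting at indices 1 and 5 contain the sequence 1, 2, 3 and reach the threshold of 14, so they are the ones reported.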

Patent number: 10146248
Abstract: A model calculation unit for calculating a data-based function model in a control unit is provided, the model calculation unit having a processor core which includes: a multiplication unit for carrying out a multiplication on the hardware side; an addition unit for carrying out an addition on the hardware side; an exponential function unit for calculating an exponential function on the hardware side; a memory in the form of a configuration register for storing hyperparameters and node data of the data-based function model to be calculated; and a logic circuit for controlling, on the hardware side, the calculation sequence in the multiplication unit, the addition unit, the exponential function unit and the memory in order to ascertain the data-based function model.
Type: Grant
Filed: April 7, 2014
Date of Patent: December 4, 2018
Assignee: ROBERT BOSCH GMBH
Inventors: Tobias Lang, Heiner Markert, Axel Aue, Wolfgang Fischer, Ulrich Schulmeister, Nico Bannow, Felix Streichert, Andre Guntoro, Christian Fleck, Anne Von Vietinghoff, Michael Saetzler, Michael Hanselmann, Matthias Schreiber

Patent number: 10140090
Abstract: Methods, systems and computer program products for computing and summing up multiple products in a single multiplier are provided. Aspects include receiving a first number and a second number, creating partial products of the first number and the second number based on a multiplication of the first number and the second number, and reducing the number of partial products to create an intermediate result. Aspects also include receiving a third number and a fourth number, creating partial products of the third number and the fourth number based on a multiplication of the third number and the fourth number, creating a reduction tree and adding the intermediate result to the reduction tree. Aspects further include reducing the number of partial products in the reduction tree to create a second sum value and a second carry value and adding the second sum value and the second carry value to create a result.
Type: Grant
Filed: September 28, 2016
Date of Patent: November 27, 2018
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Michael Klein, Manuela Niekisch
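
The sequence of steps in this abstract can be traced with integer arithmetic. The sketch below is an illustration under assumptions, not the claimed method: it uses non-negative integers, simple shift-and-add partial products, 3:2 carry-save compression as the reduction step, and it assumes the intermediate result is held as a sum/carry pair (the abstract does not state its form). All function names are hypothetical.

```python
def partial_products(x, y):
    """One shifted copy of x for every set bit of y (non-negative ints)."""
    return [x << i for i in range(y.bit_length()) if (y >> i) & 1]


def csa(a, b, c):
    """3:2 compressor: a + b + c == sum_out + carry_out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def reduce_tree(terms):
    """Compress a list of terms down to a (sum, carry) pair without carry propagation."""
    terms = list(terms)
    while len(terms) > 2:
        a, b, c = terms.pop(), terms.pop(), terms.pop()
        terms.extend(csa(a, b, c))
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]


def sum_of_products(p, q, r, s):
    """Compute p*q + r*s with a single carry-propagate addition at the very end."""
    # First product: reduce its partial products to an intermediate (sum, carry) pair.
    intermediate = reduce_tree(partial_products(p, q))
    # Second product: its partial products plus the intermediate result share one tree.
    tree = partial_products(r, s) + list(intermediate)
    second_sum, second_carry = reduce_tree(tree)
    # Only now are sum and carry added to form the final result.
    return second_sum + second_carry


result = sum_of_products(13, 11, 7, 9)
```

Here `13 * 11 + 7 * 9` is evaluated with the first product never being fully added up on its own; its sum/carry pair is simply folded into the second reduction.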

Patent number: 10140251
Abstract: A processor and a method for executing a matrix multiplication operation on a processor. A specific implementation of the processor includes a data bus and an array processor having k processing units. The data bus is configured to sequentially read n columns of row vectors from an M×N multiplicand matrix and input same to each processing unit in the array processor, read an n×k submatrix from an N×K multiplier matrix and input each column vector of the submatrix to a corresponding processing unit in the array processor, and output a result obtained by each processing unit after executing a multiplication operation. Each processing unit in the array processor is configured to execute in parallel a vector multiplication operation on the input row and column vectors. Each processing unit includes a Wallace tree multiplier having n multipliers and n-1 adders. This implementation improves the processing efficiency of a matrix multiplication operation.
Type: Grant
Filed: May 9, 2017
Date of Patent: November 27, 2018
Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.
Inventors: Ni Zhou, Wei Qi, Yong Wang, Jian Ouyang

Patent number: 10127013
Abstract: Integrated circuits with specialized processing blocks that can support both fixed-point and floating-point operations are provided. A specialized processing block of this type may include partial product generators, compression circuits, and a main adder. The main adder may include a high adder, a middle adder, a low adder, floating-point rounding circuitry, and associated selection circuitry. The middle adder may include prefix networks for outputting generate and propagate vectors, and redundant LSB processing logic for outputting LSB generate and propagate bits. The middle adder may include additional logic circuitry for generating a sum output, a sum-plus-1 output, and a sum-plus-2 output. The specialized processing block may further include accumulation circuitry to support multiply-accumulation functions for any suitable number of channels.
Type: Grant
Filed: December 23, 2016
Date of Patent: November 13, 2018
Assignee: Altera Corporation
Inventor: Martin Langhammer

Patent number: 10108581
Abstract: A vector reduction circuit configured to reduce an input vector of elements comprises a plurality of cells, wherein each of the plurality of cells other than a designated first cell that receives a designated first element of the input vector is configured to receive a particular element of the input vector, receive, from another of the one or more cells, a temporary reduction element, perform a reduction operation using the particular element and the temporary reduction element, and provide, as a new temporary reduction element, a result of performing the reduction operation using the particular element and the temporary reduction element. The vector reduction circuit also comprises an output circuit configured to provide, for output as a reduction of the input vector, a new temporary reduction element corresponding to a result of performing the reduction operation using a last element of the input vector.
Type: Grant
Filed: April 3, 2017
Date of Patent: October 23, 2018
Assignee: Google LLC
Inventors: Gregory Michael Thorson, Andrew Everett Phelps, Olivier Temam

Patent number: 10097185
Abstract: In an example embodiment, a digital block comprises a datapath circuit, one or more programmable logic devices (PLDs), and one or more control circuits. The datapath circuit comprises structural arithmetic elements. The one or more PLDs comprise uncommitted programmable logic. The one or more control circuits comprise a control register configured to store user-defined control bits, where the one or more control circuits are configured to control both the structural arithmetic elements and the uncommitted programmable logic based on the user-defined control bits.
Type: Grant
Filed: December 20, 2017
Date of Patent: October 9, 2018
Assignee: Cypress Semiconductor Corporation
Inventors: Bert Sullam, Warren Snyder, Haneef Mohammed

Patent number: 10089078
Abstract: A circuit includes a multiplier, an adder, a first result register and a second result register coupled to outputs of the multiplier and the adder, respectively. The circuit further includes: a first selection unit configured to selectively provide, to the multiplier and in response to a first control signal, a first value from a first plurality of values; and a second selection unit configured to selectively provide, to the multiplier and in response to a second control signal, a second value from a second plurality of values. The circuit also includes: a third selection unit configured to selectively provide, to the adder and in response to a third control signal, a third value from a third plurality of values; and a fourth selection unit configured to selectively provide, to the adder and in response to a fourth control signal, a fourth value from a fourth plurality of values.
Type: Grant
Filed: September 23, 2016
Date of Patent: October 2, 2018
Assignee: STMICROELECTRONICS S.R.L.
Inventors: David Vincenzoni, Samuele Raffaelli

Patent number: 10042639
Abstract: According to one embodiment, a processor includes an instruction decoder to receive an instruction to process a multiply-accumulate operation, the instruction having a first operand, a second operand, a third operand, and a fourth operand. The first operand is to specify a first storage location to store an accumulated value; the second operand is to specify a second storage location to store a first value and a second value; and the third operand is to specify a third storage location to store a third value. The processor further includes an execution unit coupled to the instruction decoder to perform the multiply-accumulate operation to multiply the first value with the second value to generate a multiply result and to accumulate the multiply result and at least a portion of a third value to an accumulated value based on the fourth operand.
Type: Grant
Filed: January 3, 2017
Date of Patent: August 7, 2018
Assignee: Intel Corporation
Inventors: Vinodh Gopal, Erdinc Ozturk, James D. Guilford, Gilbert M. Wolrich

Patent number: 10037210
Abstract: An apparatus is described that includes a semiconductor chip having an instruction execution pipeline having one or more execution units with respective logic circuitry to: a) execute a first instruction that multiplies a first input operand and a second input operand and presents a lower portion of the result, where the first and second input operands are respective elements of first and second input vectors; b) execute a second instruction that multiplies a first input operand and a second input operand and presents an upper portion of the result, where the first and second input operands are respective elements of first and second input vectors; and c) execute an add instruction where a carry term of the add instruction's adding is recorded in a mask register.
Type: Grant
Filed: September 6, 2016
Date of Patent: July 31, 2018
Assignee: INTEL CORPORATION
Inventors: Gilbert M. Wolrich, Kirk S. Yap, James D. Guilford, Erdinc Ozturk, Vinodh Gopal, Wajdi K. Feghali, Sean M. Gulley, Martin G. Dixon
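
The three instructions described in this abstract can be modeled in a few lines. This is a rough sketch, assuming 64-bit unsigned lanes and a carry mask with one bit per lane; the function names `vmul_lo`, `vmul_hi`, and `vadd_carry` are hypothetical and are not the mnemonics of any real instruction set.

```python
MASK64 = (1 << 64) - 1  # assumed lane width: 64 bits


def vmul_lo(a, b):
    """Per lane, return the lower 64 bits of the 128-bit product."""
    return [(x * y) & MASK64 for x, y in zip(a, b)]


def vmul_hi(a, b):
    """Per lane, return the upper 64 bits of the 128-bit product."""
    return [(x * y) >> 64 for x, y in zip(a, b)]


def vadd_carry(a, b):
    """Per lane, add modulo 2**64 and record carry-outs in a bit mask."""
    sums, carry_mask = [], 0
    for lane, (x, y) in enumerate(zip(a, b)):
        total = x + y
        sums.append(total & MASK64)
        if total > MASK64:
            carry_mask |= 1 << lane  # bit `lane` marks a carry-out
    return sums, carry_mask


a, b = [MASK64, 2], [2, 3]
low, high = vmul_lo(a, b), vmul_hi(a, b)
sums, carries = vadd_carry([MASK64, 1], [1, 1])
```

Splitting the product into low and high halves and keeping carries in a separate mask is the usual building block for multi-word (big-number) arithmetic, because the recorded carries can be fed into the addition of the next, more significant word.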

Patent number: 9946612
Abstract: Implementations of encoding techniques are disclosed. In one embodiment, an encoding system includes a codec device, a switching network, rerouting circuitry, a logic integrated circuit, and memory devices. The codec device includes a plurality of input and output (I/O) ports to transport data signals. The switching network is coupled both to the plurality of I/O ports and to a plurality of channels external to the device. The plurality of I/O ports includes at least one spare channel. The rerouting circuitry is coupled to and configured to control the switching network, and the logic integrated circuit has logic circuitry including command and decode queueing circuitry, redundancy circuits, and error correction circuitry. The memory devices do not include any circuitry included in the logic circuitry. Other systems and apparatuses are also described.
Type: Grant
Filed: July 20, 2015
Date of Patent: April 17, 2018
Assignee: Micron Technology, Inc.
Inventor: Timothy M. Hollis

Patent number: 9760110
Abstract: Methods and systems for memory-based computing include combining multiple operations into a single lookup table and combining multiple memory-based operation requests into a single read request. Operation result values are read from a multi-operation lookup table that includes result values for a first operation above a diagonal of the lookup table and includes result values for a second operation below the diagonal. Numerical inputs are used as column and row addresses in the lookup table and the requested operation determines which input corresponds to the column address and which input corresponds to the row address. Multiple operations are combined into a single request by combining respective members from each operation into respective inputs and reading an operation result value from a lookup table to produce a combined result output. The combined result output is separated into a plurality of individual result outputs corresponding to the plurality of requests.
Type: Grant
Filed: February 4, 2016
Date of Patent: September 12, 2017
Assignee: International Business Machines Corporation
Inventors: Minsik Cho, Ruchir Puri
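
The shared-table idea in this abstract can be sketched as follows. This is a minimal illustration, assuming two commutative operations (addition and multiplication here) on small non-negative integers. The abstract does not say how equal inputs are handled, so this sketch assumes the diagonal holds the first operation and keeps the second operation's equal-input results in a separate list. All names (`build_table`, `lookup`, `diagonal_lower`) are hypothetical.

```python
from operator import add, mul


def build_table(n, upper_op, lower_op):
    """Pack two commutative operations into one n-by-n table."""
    table = [[None] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            if row <= col:
                # On and above the diagonal: first operation (assumption
                # for the diagonal itself).
                table[row][col] = upper_op(row, col)
            else:
                # Below the diagonal: second operation.
                table[row][col] = lower_op(row, col)
    # Assumption: equal-input results of the second operation live in a
    # small side list, because the diagonal is already occupied.
    diagonal_lower = [lower_op(i, i) for i in range(n)]
    return table, diagonal_lower


def lookup(table, diagonal_lower, which, a, b):
    """The requested operation decides which input is row and which is column."""
    if which not in ("upper", "lower"):
        raise ValueError("which must be 'upper' or 'lower'")
    small, large = min(a, b), max(a, b)
    if which == "upper":
        return table[small][large]   # row < col: above the diagonal
    if small == large:
        return diagonal_lower[small]
    return table[large][small]       # row > col: below the diagonal


table, diagonal_lower = build_table(8, add, mul)
sum_result = lookup(table, diagonal_lower, "upper", 3, 5)
product_result = lookup(table, diagonal_lower, "lower", 3, 5)
```

Because both operations are commutative, each needs only one triangle of the table, which is what lets a single table serve two operations.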

Patent number: 9753695
Abstract: A datapath circuit may include a digital multiply and accumulate circuit (MAC) and a digital hardware calculator for parallel computation. The digital hardware calculator and the MAC may be coupled to an input memory element for receipt of input operands. The MAC may include a digital multiplier structure with partial product generators coupled to an adder to multiply first and second input operands and generate a multiplication result. The digital hardware calculator may include a first lookup table coupled between a calculator input and a calculator output register. The first lookup table may include table entry values mapped to corresponding math function results in accordance with a first predetermined mathematical function. The digital hardware calculator may be configured to calculate, based on the first lookup table, a computationally hard mathematical function such as a logarithm function, an exponential function, a division function and a square root function.
Type: Grant
Filed: August 27, 2013
Date of Patent: September 5, 2017
Assignee: Analog Devices Global
Inventors: Mikael M. Mortensen, Jeffrey G. Bernstein

Patent number: 9743082
Abstract: The present invention relates to an apparatus and method for encoding and decoding an image by skip encoding. The image-encoding method by skip encoding, which performs intra-prediction, comprises: performing a filtering operation on the signal which is reconstructed prior to an encoding object signal in an encoding object image; using the filtered reconstructed signal to generate a prediction signal for the encoding object signal; setting the generated prediction signal as a reconstruction signal for the encoding object signal; and not encoding the residual signal which can be generated on the basis of the difference between the encoding object signal and the prediction signal, thereby performing skip encoding on the encoding object signal.
Type: Grant
Filed: March 10, 2015
Date of Patent: August 22, 2017
Assignees: Electronics and Telecommunications Research Institute, Kwangwoon University Industry-Academic Collaboration Foundation, University-Industry Cooperation Group Of Kyung Hee University
Inventors: Sung Chang Lim, Ha Hyun Lee, Se Yoon Jeong, Hui Yong Kim, Suk Hee Cho, Jong Ho Kim, Jin Ho Lee, Jin Soo Choi, Jin Woong Kim, Chie Teuk Ahn, Dong Gyu Sim, Seoung Jun Oh, Gwang Hoon Park, Sea Nae Park, Chan Woong Jeon

Patent number: 9690579
Abstract: A first floating-point operation unit receives first and second variables and performs a first operation generating a first output. A first rounding unit receives and rounds the first output to generate a second output if a control bit is in a first state. A second floating-point operation unit receives a third variable and either the first output or the second output and performs a second operation on the third variable and either the first output or the second output, to generate a third output. The second floating-point operation unit receives and operates on the second output if the control bit is in the first state, or the first output if the control bit is in a second state. A second rounding unit receives and rounds the third output.
Type: Grant
Filed: December 29, 2014
Date of Patent: June 27, 2017
Assignee: ARM Finance Overseas Limited
Inventor: David Yiu-Man Lau
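
The effect of the control bit can be shown numerically. The sketch below assumes the first operation is a multiplication and the second an addition on IEEE 754 double-precision values, which the abstract does not state. It emulates the unrounded intermediate with exact rational arithmetic; the function `multiply_add` and its `round_intermediate` flag are hypothetical names standing in for the two states of the control bit.

```python
from fractions import Fraction


def multiply_add(a, b, c, round_intermediate):
    """Compute a*b + c with or without rounding the product first."""
    exact_product = Fraction(a) * Fraction(b)  # first operation, exact
    if round_intermediate:
        # First rounding unit active: round the product to a double.
        stage_one = Fraction(float(exact_product))
    else:
        # Rounding bypassed: forward the exact product.
        stage_one = exact_product
    # Second operation, followed by the final rounding to a double.
    return float(stage_one + Fraction(c))


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
separate = multiply_add(a, b, -1.0, round_intermediate=True)
merged = multiply_add(a, b, -1.0, round_intermediate=False)
```

Here the exact product is 1 - 2^-54, which rounds to 1.0 as a double, so the two settings give different final results: the separately rounded path loses the small term that the merged path preserves.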

Patent number: 9692579
Abstract: According to some embodiments, a secondary network node detects a first data transmission of media content from a primary network node to a first wireless device. The first data transmission has a first data quality description D(n1) and a first transport format T(k1). The secondary network node selects a second data quality description D(n2′) and a second transport format T(k2′) for a second data transmission. The second data quality description D(n2′) and second transport format T(k2′) differ from the first data quality description D(n1) and first transport format T(k1), respectively. The secondary network node transmits the second data transmission to a second wireless device according to the second data quality description D(n2′) and the second transport format T(k2′). The second data transmission includes at least a portion of the media content.
Type: Grant
Filed: August 5, 2014
Date of Patent: June 27, 2017
Assignee: Telefonaktiebolaget LM Ericsson (publ)
Inventor: Ali S. Khayrallah

Patent number: 9535706
Abstract: According to one embodiment, a processor includes an instruction decoder to receive an instruction to process a multiply-accumulate operation, the instruction having a first operand, a second operand, a third operand, and a fourth operand. The first operand is to specify a first storage location to store an accumulated value; the second operand is to specify a second storage location to store a first value and a second value; and the third operand is to specify a third storage location to store a third value. The processor further includes an execution unit coupled to the instruction decoder to perform the multiply-accumulate operation to multiply the first value with the second value to generate a multiply result and to accumulate the multiply result and at least a portion of a third value to an accumulated value based on the fourth operand.
Type: Grant
Filed: March 22, 2016
Date of Patent: January 3, 2017
Assignee: Intel Corporation
Inventors: Vinodh Gopal, Erdinc Ozturk, James D. Guilford, Gilbert M. Wolrich

Patent number: 9519460
Abstract: A single-instruction multiple-data (SIMD) multiplier-accumulator apparatus and method. A multiplier block with two 16-bit by 32-bit multiplier circuits transforms a selectable number of input multipliers and multiplicands into a selected number of products. Each multiplier circuit comprises an array of full adders that generates and sums partial products using carry-save addition. An accumulator block, with additional data width to help prevent overflow, adds the products to a selectable number of input addends and outputs a number of results. Embodiments perform one to four multiplications together, depending on the number of bits (eight, 16, 24, or 32) selected for the input operands. Embodiments output 20-bit, 40-bit, or 80-bit multiply-accumulate results at rates of at least 1.1 GHz. Embodiments support signed inputs, negated multiplication products, and Q-format data. A hybrid sign extension management approach improves performance for 80-bit outputs.
Type: Grant
Filed: September 25, 2014
Date of Patent: December 13, 2016
Assignee: Cadence Design Systems, Inc.
Inventors: Aamir A. Farooqui, David Lawrence Heine

Patent number: 9495154
Abstract: Embodiments disclosed herein include vector processing engines (VPEs) having programmable data path configurations for providing multi-mode vector processing. Related vector processors, systems, and methods are also disclosed. The VPEs include a vector processing stage(s) configured to process vector data according to a vector instruction executed in the vector processing stage. Each vector processing stage includes vector processing blocks each configured to process vector data based on the vector instruction being executed. The vector processing blocks are capable of providing different vector operations for different types of vector instructions based on data path configurations. Data paths of the vector processing blocks are reprogrammable to process vector data differently according to the particular vector instruction being executed.
Type: Grant
Filed: March 13, 2013
Date of Patent: November 15, 2016
Assignee: QUALCOMM Incorporated
Inventor: Raheel Khan

Patent number: 9483442
Abstract: According to an embodiment, a matrix operation apparatus executing a matrix operation includes multiple nodes, the nodes including: a multiplier configured to perform a first operation for a first input, which is column data, and a second input, which is row data, for the matrix operation and output element components of an operation result of the matrix operation; and an accumulator configured to perform cumulative addition of operation results of the multiplier.
Type: Grant
Filed: February 28, 2014
Date of Patent: November 1, 2016
Assignee: KABUSHIKI KAISHA TOSHIBA
Inventors: Seiji Maeda, Hiroyuki Usui
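
One way to read the multiplier-plus-accumulator arrangement in this abstract is as a matrix product built from column-times-row partial results. The sketch below is an interpretation under that assumption, not the patented design: each step multiplies one column of the left matrix by one row of the right matrix and accumulates the element products. The function name `matrix_multiply_by_nodes` is hypothetical.

```python
def matrix_multiply_by_nodes(a, b):
    """Multiply matrices by accumulating column-by-row products."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # One accumulator per output element, standing in for one node each.
    acc = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        column = [a[i][k] for i in range(rows)]  # first input: column data
        row = b[k]                               # second input: row data
        for i in range(rows):
            for j in range(cols):
                # Multiplier output is added into the node's accumulator.
                acc[i][j] += column[i] * row[j]
    return acc


product = matrix_multiply_by_nodes([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

With this ordering every node sees one new pair of inputs per step and only ever adds to its own running total, which is why a multiplier and an accumulator per node suffice.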

Patent number: 9465578
Abstract: A system and method are provided for performing 32-bit or dual 16-bit floating-point arithmetic operations using logic circuitry. An operating mode for a multiplication operation is received, where the operating mode is one of a 32-bit floating-point mode and a dual 16-bit floating-point mode. Based on the operating mode, nine recoding terms for a mantissa of at least one floating-point input operand are determined. A dual-mode multiplier array circuit that is configurable to generate partial products for either one 32-bit floating-point result or for two 16-bit floating-point results computes the partial products based on the nine recoding terms. The partial products are processed to generate an output based on the operating mode.
Type: Grant
Filed: December 13, 2013
Date of Patent: October 11, 2016
Assignee: NVIDIA Corporation
Inventors: David C. Tannenbaum, Srinivasan Iyer