Abstract: A process for performing vector dot products receives a row vector and a column vector as floating point numbers in a format of sign plus exponent bits plus mantissa bits. The process generates a single dot product value by separately processing the sign bits, exponent bits, and mantissa bits to form a sign bit, a normalized mantissa formed by multiplying pairs of multiplicand elements, and exponent information including MAX_EXP and EXP_DIFF. A second pipeline stage receives the multiplied pairs of normalized mantissas, optionally performs an exponent adjustment, and pads, complements and shifts the normalized mantissas. The results are then added in a series of stages until a single addition result remains, which is normalized using MAX_EXP to form the floating point output result.
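The exponent-alignment idea in this abstract can be illustrated with a short software sketch. It is only an approximation of the described hardware: the function name `dot_product_aligned` and the `frac_bits` parameter are assumptions made for illustration, Python signed integers stand in for the padded and complemented mantissas, and a plain loop stands in for the staged adder tree.

```python
import math


def dot_product_aligned(row, col, frac_bits=48):
    """Dot product computed by aligning mantissa products to a shared MAX_EXP."""
    products = []
    for a, b in zip(row, col):
        ma, ea = math.frexp(a)  # a == ma * 2**ea, with 0.5 <= |ma| < 1 or ma == 0
        mb, eb = math.frexp(b)
        # Signed mantissa product and the sum of the two exponents.
        products.append((ma * mb, ea + eb))

    exponents = [e for m, e in products if m != 0.0]
    if not exponents:
        return 0.0
    max_exp = max(exponents)  # plays the role of MAX_EXP

    acc = 0  # integer accumulator standing in for the adder stages
    for m, e in products:
        if m == 0.0:
            continue
        exp_diff = max_exp - e  # plays the role of EXP_DIFF
        fixed = int(m * (1 << frac_bits))  # pad the mantissa to a fixed-point integer
        acc += fixed >> exp_diff  # arithmetic right shift aligns it to MAX_EXP

    # Normalize the single accumulated value using MAX_EXP.
    return math.ldexp(float(acc), max_exp - frac_bits)


print(dot_product_aligned([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Shifting every product to the largest exponent means only one normalization is needed at the end, at the cost of discarding low-order bits of the smaller products.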
Abstract: A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
Abstract: An operation device includes a quantizer circuit, a buffer circuit, a convolution core circuit and a multiply-add circuit. The quantizer circuit receives first feature data and performs asymmetric uniform quantization on the first feature data to obtain and store in the buffer circuit second feature data. The quantizer circuit further receives a first weighting coefficient and performs symmetric uniform quantization on the first weighting coefficient to obtain and store in the buffer circuit a second weight coefficient. The convolution core circuit performs a convolution operation on the initial operation result, an actual quantization scale factor and an actual bias value to obtain a final operation result.
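The two quantization schemes named in this abstract can be sketched as follows. The formulas are the commonly used textbook definitions of asymmetric and symmetric uniform quantization, not details taken from the patent; the function names, the 8-bit default, and the use of Python's built-in `round` are all assumptions.

```python
def quantize_asymmetric(values, bits=8):
    """Map [min, max] onto [0, 2**bits - 1] using a scale and a zero point."""
    lo, hi = min(values), max(values)
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    quantized = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def quantize_symmetric(values, bits=8):
    """Map [-peak, +peak] onto [-(2**(bits-1) - 1), 2**(bits-1) - 1]; no zero point."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    quantized = [min(qmax, max(-qmax, round(v / scale))) for v in values]
    return quantized, scale


features, f_scale, f_zero = quantize_asymmetric([-10.0, 41.0, 245.0])
weights, w_scale = quantize_symmetric([-127.0, 0.0, 64.0])
print(features, f_scale, f_zero)
print(weights, w_scale)
```

An asymmetric scheme suits feature data whose range is not centered on zero, while a symmetric scheme keeps weights centered so that no zero-point correction is needed in the multiply.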
Abstract: A function approximation system is disclosed for determining output floating point values of functions calculated using floating point numbers. Complex functions have different shapes in different subsets of their input domain, making them difficult to predict for different values of the input variable. The function approximation system comprises an execution unit configured to determine corresponding values of a given function given a floating point input to the function; a plurality of look up tables for each function type; a correction table of values which determines if corrections to the output value are required; and a table selector for finding an appropriate table for a given function.
Abstract: An embodiment method of processing at least one sensing signal comprising a time-series of signal samples comprises high-pass filtering the time series of signal samples to produce a filtered time series; applying delay embedding processing to the filtered time series; producing a first matrix by storing the set of time-shifted time series as an ordered list of entries in the first matrix; applying a first truncation to produce a second matrix by truncating the entries in the ordered list of entries at one end of the first matrix to remove a number of items equal to the product of the first delay embedding parameter decreased by one times the second delay embedding parameter; applying entry-wise processing to the second matrix, and forwarding a set of estimated kernel densities and/or a set of images generated as a function of the set of estimated kernel densities to a user circuit.
Abstract: In some examples, a device includes a first processing core comprising a resistive memory array to perform an analog computation, and a digital processing core comprising a digital memory programmable with different values to perform different computations responsive to respective different conditions. The device further includes a controller to selectively apply input data to the first processing core and the digital processing core.
Type:
Grant
Filed:
April 30, 2018
Date of Patent:
January 2, 2024
Assignee:
Hewlett Packard Enterprise Development LP
Inventors:
John Paul Strachan, Dejan S. Milojicic, Martin Foltin, Sai Rahul Chalamalasetti, Amit S. Sharma
Abstract: A method includes dividing a fraction of a floating point result into a first portion and a second portion. The method includes outputting a first normalizer result based on the first portion during a first clock cycle. The method includes storing a first segment of the first portion during the first clock cycle. The method includes outputting a first rounder result based on the first normalizer result during the first clock cycle. The method includes outputting a second normalizer result based on the second portion during a second clock cycle. The method includes outputting a second rounder result based on the second normalizer result and the first segment during the second clock cycle.
Type:
Grant
Filed:
September 21, 2021
Date of Patent:
January 2, 2024
Assignee:
International Business Machines Corporation
Inventors:
Nicol Hofmann, Michael Klein, Petra Leber, Kerstin Claudia Schelm
Abstract: In order to provide smaller, faster and less error-prone circuits for sorting possibly metastable inputs, a novel sorting circuit is provided. According to the invention, the circuit is metastability-containing.
Type:
Grant
Filed:
October 31, 2019
Date of Patent:
December 26, 2023
Assignee:
Max-Planck-Gesellschaft zur Förderung D. Wissenschaften e.V.
Abstract: An execution unit for a processor, the execution unit comprising: a look up table having a plurality of entries, each of the plurality of entries comprising an initial estimate for a result of an operation; a preparatory circuit configured to search the look up table using an index value dependent upon the operand to locate an entry comprising a first initial estimate for a result of the operation; a plurality of processing circuits comprising at least one multiplier circuit; and control circuitry configured to provide the first initial estimate to the at least one multiplier circuit of the plurality of processing circuits so as to perform processing, by the plurality of processing circuits, of the first initial estimate to generate the function result, said processing comprising applying one or more Newton-Raphson iterations to the first initial estimate.
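A table-seeded Newton-Raphson scheme of this general kind can be sketched for one concrete operation, the reciprocal. The choice of operation, the 16-entry table, the midpoint seeding, and the names `RECIP_TABLE` and `reciprocal` are assumptions for illustration; the abstract does not specify them.

```python
import math

TABLE_BITS = 4
# Entry i covers mantissas in [1 + i/16, 1 + (i + 1)/16) and stores the
# reciprocal of the interval midpoint as the initial estimate.
RECIP_TABLE = [
    1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS)) for i in range(1 << TABLE_BITS)
]


def reciprocal(x, iterations=3):
    """Approximate 1/x from a table lookup refined by Newton-Raphson steps."""
    if x == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))  # m in [0.5, 1)
    m, e = m * 2.0, e - 1  # rescale so that m is in [1, 2)
    index = int((m - 1.0) * (1 << TABLE_BITS))  # top fraction bits select the entry
    y = RECIP_TABLE[index]
    for _ in range(iterations):
        y = y * (2.0 - m * y)  # Newton-Raphson step for f(y) = 1/y - m
    return sign * math.ldexp(y, -e)


print(reciprocal(3.0))
```

Each iteration uses only multiplications and a subtraction and roughly squares the relative error, which is why a small table followed by a few multiplier passes is sufficient.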
Abstract: A pseudo-random number generation circuit device includes a pseudo-random number generation circuit including a logic circuit configured based on rule data that generates a next random number value from a current random number value, a cycle detection circuit that detects, based on a seed, an end of a cycle of random numbers, which are generated by the pseudo-random number generation circuit, and a rule data generation circuit that generates new rule data at a first trigger, at which the cycle detection circuit detects the end of the cycle of random numbers, to output the new rule data to the pseudo-random number generation circuit, wherein the cycle detection circuit stores a random number value, which is generated by a new logic circuit configured based on the new rule data, as the seed.
Abstract: An apparatus and method for efficiently performing a multiply add or multiply accumulate operation. For example, one embodiment of a processor comprises: a decoder to decode an instruction specifying an operation, the instruction comprising a first operand identifying a multiplier and a second operand identifying a multiplicand; and fused multiply-add (FMA) execution circuitry comprising first multiplication circuitry to perform a multiplication using the multiplicand and multiplier to generate a result for multipliers and multiplicands falling within a first precision range, and second multiplication circuitry to be used instead of the first multiplication circuitry for multipliers and multiplicands falling within a second precision range.
Abstract: A hardware logic representation of a circuit to implement an operation to perform multiplication by an invariant rational is generated by truncating an infinite single summation array (which is represented in a finite way). The truncation is performed by identifying a repeating section and then discarding all but a finite number of the repeating sections whilst still satisfying a defined error bound. To further reduce the size of the summation array, the binary representation of the invariant rational is converted into canonical signed digit notation prior to creating the finite representation of the infinite array.
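The canonical signed digit (CSD) step mentioned in this abstract can be illustrated on its own. The sketch below uses the standard non-adjacent-form recoding for a non-negative integer constant; it does not cover the repeating-section truncation for rationals, and the function names are hypothetical.

```python
def to_csd(n):
    """Return CSD digits of a non-negative integer, least significant digit first.

    Each digit is -1, 0 or 1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # 1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def multiply_by_constant(x, csd_digits):
    """Multiply x by the constant encoded in csd_digits using shifts and adds."""
    total = 0
    for shift, d in enumerate(csd_digits):
        if d == 1:
            total += x << shift
        elif d == -1:
            total -= x << shift
    return total


print(to_csd(7))
print(multiply_by_constant(5, to_csd(7)))
```

Recoding 7 as 8 - 1 replaces three additions with one subtraction, which is the sense in which CSD shrinks a summation array.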
Abstract: Apparatus and methods are disclosed for performing block floating-point (BFP) operations, including in implementations of neural networks. All or a portion of one or more matrices or vectors can share one or more common exponents. Techniques are disclosed for selecting the shared common exponents. In some examples of the disclosed technology, a method includes producing BFP representations of matrices or vectors, at least two elements of the respective matrices or vectors sharing a common exponent, performing a mathematical operation on two or more of the plurality of matrices or vectors, and producing an output matrix or vector. Based on the output matrix or vector, one or more updated common exponents are selected, and an updated matrix or vector is produced having some elements that share the updated common exponents.
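A minimal sketch of block floating-point with one shared exponent per vector is given below. The exponent selection rule (fit the largest magnitude), the 8-bit mantissa width, and the names `to_bfp` and `bfp_dot` are assumptions; the abstract leaves the selection technique open.

```python
import math


def to_bfp(values, mantissa_bits=8):
    """Encode a vector as signed integer mantissas plus one shared exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    _, e = math.frexp(peak)  # peak < 2**e
    shared_exp = e - (mantissa_bits - 1)  # largest element fits the mantissa range
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = [
        max(-limit, min(limit, round(math.ldexp(v, -shared_exp)))) for v in values
    ]
    return mantissas, shared_exp


def bfp_dot(a, b, mantissa_bits=8):
    """Dot product using integer multiply-accumulate and a single exponent add."""
    ma, ea = to_bfp(a, mantissa_bits)
    mb, eb = to_bfp(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(ma, mb))
    return math.ldexp(float(acc), ea + eb)


print(to_bfp([1.0, 0.5, -0.25]))
print(bfp_dot([1.0, 0.5, -0.25], [2.0, 4.0, 8.0]))
```

Because the exponent is shared, the inner loop is pure integer arithmetic, and the exponents are combined once per vector pair rather than once per element.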
Abstract: A neural network system includes a data type converter and a MAC operator. The data type converter may convert 32-bit floating-point format into one of a plurality of 16-bit floating-point formats. The MAC operator may perform MAC operations using 16-bit floating-point format data converted by the data type converter. The MAC operator includes a data type modulator configured to modulate the bit number of the converted 16-bit floating-point format to provide a modulated floating-point format with bit number different from the bit number of the converted 16-bit floating-point format.
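The abstract does not name the 16-bit formats involved. As an illustration only, the sketch below converts a 32-bit float to two widely used 16-bit formats, bfloat16 and IEEE 754 half precision. The choice of these two formats and the function names are assumptions, and NaN inputs are not handled by the bfloat16 rounding shown.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the float32 encoding, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def float32_to_half_bits(x):
    """Encode x as IEEE 754 half precision using the struct 'e' format."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


print(hex(float32_to_bfloat16_bits(1.0)))
print(hex(float32_to_half_bits(1.0)))
```

bfloat16 keeps the 8-bit float32 exponent and shortens the mantissa to 7 bits, while half precision uses a 5-bit exponent and a 10-bit mantissa, so the two formats trade range against precision differently.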
Abstract: The disclosure relates to a low-loss arithmetic circuit, which includes a plurality of arithmetic units, a plurality of storage units, and one or more reset MOSFETs. Each arithmetic unit includes 4 MOSFETs. The disclosure also relates to an operating method of the low-loss arithmetic circuit and a low-loss Processing-in-Memory circuit.
Abstract: A systolic array can implement an architecture tailored to perform matrix multiplications on constrained fine-grained sparse weight matrices. Each processing element in the systolic array may include a weight register configured to store a weight value, and a multiplexor configured to select a feature map (FMAP) input element from multiple FMAP input data buses based on metadata associated with the weight value. Each processing element may also include a multiplier configured to multiply the selected feature map input element with the weight value to generate a multiplication result, and an adder configured to add the multiplication result to a partial sum input to generate a partial sum output.
Type:
Grant
Filed:
June 30, 2020
Date of Patent:
October 31, 2023
Assignee:
Amazon Technologies, Inc.
Inventors:
Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja
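The processing element described in the systolic array abstract above (weight register, metadata-driven multiplexor, multiplier, adder) can be modeled in a few lines. The grouping of four feature map inputs per group, the data layout of `compressed`, and the function names are assumptions for illustration.

```python
def processing_element(weight, select, fmap_buses, partial_sum_in):
    """One PE: metadata selects an input bus, then multiply and accumulate."""
    fmap_value = fmap_buses[select]  # multiplexor driven by the weight's metadata
    return partial_sum_in + weight * fmap_value


def sparse_dot(compressed, fmap, group_size=4):
    """Accumulate over groups of stored (weight, select) pairs.

    compressed holds, for each group of group_size inputs, only the non-zero
    weights together with their position inside the group.
    """
    partial_sum = 0
    for g, group in enumerate(compressed):
        buses = fmap[g * group_size:(g + 1) * group_size]
        for weight, select in group:
            partial_sum = processing_element(weight, select, buses, partial_sum)
    return partial_sum


# Dense weights [0, 3, 0, 5, 2, 0, 0, -1] stored as two groups of two non-zeros.
compressed = [[(3, 1), (5, 3)], [(2, 0), (-1, 3)]]
print(sparse_dot(compressed, [1, 2, 3, 4, 5, 6, 7, 8]))
```

Only the stored non-zero weights consume a multiplier, and the metadata index replaces the zero-valued positions that a dense array would otherwise process.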
Abstract: An Application Specific Integrated Circuit (ASIC) for computing a convolutional neural network (CNN) has a first input bus receiving an ordered stream of values from an array, each position in the array having one or more channels, and a plurality of kernel processing tiles receiving inputs through configurable multiplexors. The kernel processing tiles and buses are arranged and connected in a manner that the ASIC operates as a pipelined system delivering an output stream in synchronization with the input stream.
Abstract: A multiplier circuit is provided to multiply a first operand and a second operand. The multiplier circuit includes a carry-save adder network comprising a plurality of carry-save adders to perform partial product additions to reduce a plurality of partial products to a redundant result value that represents a product of the first operand and the second operand. A number of the carry-save adders that is used to generate the redundant result value is controllable and is dependent on a width of at least one of the first operand and the second operand.
Type:
Grant
Filed:
August 5, 2020
Date of Patent:
October 17, 2023
Assignee:
Arm Limited
Inventors:
Tai Li, Jack William Derek Andrew, Michael Alexander Kennedy
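The carry-save reduction described in the multiplier abstract above can be sketched in software. This model chains 3:2 carry-save adders linearly rather than in a tree, and the names `carry_save_add` and `multiply_csa` and the `width` parameter are assumptions; it is meant only to show a redundant (sum, carry) result and a width-dependent adder count.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three addends to a sum word and a carry word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def multiply_csa(x, y, width=8):
    """Reduce width partial products to a redundant (sum, carry) pair.

    Returns the redundant pair and the number of carry-save adders used.
    """
    partials = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    csa_count = 0
    while len(partials) > 2:
        s, c = carry_save_add(partials[0], partials[1], partials[2])
        partials = partials[3:] + [s, c]
        csa_count += 1
    while len(partials) < 2:
        partials.append(0)
    return (partials[0], partials[1]), csa_count


(s, c), used = multiply_csa(13, 11, width=4)
print(s + c, used)
```

The final product is obtained by adding the two redundant words once, so narrower operands need fewer carry-save stages before that single carry-propagate addition.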
Abstract: A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU) include mirroring matrices to form paired matrices and solving the paired matrices simultaneously.
Type:
Grant
Filed:
May 10, 2021
Date of Patent:
October 17, 2023
Assignee:
International Business Machines Corporation
Inventors:
Minsik Cho, David Shing-ki Kung, Ruchir Puri
Abstract: An arithmetic logic unit according to an embodiment of the present technology includes: a plurality of input lines; and a multiply-accumulate operation device. Electrical signals are input to the plurality of input lines.