Abstract: Provided is a method and system for performing multi-device-based inference for a large language model. A multi-device-based inference system may include a plurality of devices, each mapped to one of the partitions into which a large language model (LLM) is divided according to an intra-layer parallelism method. Here, each of the plurality of devices may synchronize data by sharing its sub-result of a matrix multiplication on the data with the other devices while the matrix multiplication is being performed.
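The following is a minimal sketch of that idea, simulated on a single host with NumPy: the weight matrix is split row-wise across devices, each device computes its sub-result of the matrix multiplication, and the sub-results are reduced into the final product. The device count, the row-wise split, and the sequential loop standing in for the overlapped exchange are illustrative assumptions, not the patented design.

```python
# Intra-layer (tensor) parallelism with partial-result sharing, simulated
# on one host. NUM_DEVICES and the row-wise split are assumptions.
import numpy as np

NUM_DEVICES = 4
D_IN, D_OUT = 1024, 1024

rng = np.random.default_rng(0)
x = rng.standard_normal(D_IN)
W = rng.standard_normal((D_IN, D_OUT))

# Row-wise split: each "device" holds a slice of W and sees a slice of x.
x_shards = np.split(x, NUM_DEVICES)
W_shards = np.split(W, NUM_DEVICES, axis=0)

# Each device produces a sub-result of the matrix multiplication. In real
# hardware the sub-results are exchanged while later tiles are still being
# multiplied; this loop stands in for that overlapped exchange.
acc = np.zeros(D_OUT)
for dev in range(NUM_DEVICES):
    sub_result = x_shards[dev] @ W_shards[dev]  # local partial product
    acc += sub_result                           # share/reduce with peers

assert np.allclose(acc, x @ W)  # reduced sub-results equal the full product
```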
Abstract: Provided is a mixed-precision multiply-and-accumulate (MAC) tree structure to maximize memory bandwidth usage for computational acceleration of a generative large language model. A MAC tree-based operator may include a plurality of floating-point (FP) multipliers connected in parallel and configured to process a multiplication operation on data delivered from an external memory; a plurality of first converters configured to convert the output of each of the plurality of FP multipliers from floating point to fixed point; a fixed-point (FXP) adder tree connected to the plurality of first converters and configured to process summation of the multiplication results of the plurality of FP multipliers; an FXP accumulator configured to accumulate the output of the FXP adder tree; and a second converter configured to convert the output of the FXP accumulator from fixed point back to floating point.
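A behavioral sketch of that datapath in Python, assuming 16 parallel lanes and a Q16.16 fixed-point format (both numbers are illustrative, not taken from the abstract):

```python
# Mixed-precision MAC tree: FP multiplies, per-lane FP->FXP conversion,
# an integer adder tree, integer accumulation, and a final FXP->FP step.
import numpy as np

LANES = 16       # number of parallel FP multipliers (assumed)
FRAC_BITS = 16   # fixed-point fractional bits (assumed Q16.16 format)

def fp_to_fxp(x):
    """First converters: floating point -> fixed point (scaled integer)."""
    return np.round(x * (1 << FRAC_BITS)).astype(np.int64)

def fxp_to_fp(x):
    """Second converter: fixed point -> floating point."""
    return np.float64(x) / (1 << FRAC_BITS)

def mac_tree(weights, activations):
    """One pass of the MAC tree over streamed operand vectors."""
    acc = np.int64(0)                 # FXP accumulator
    for i in range(0, len(weights), LANES):
        w = weights[i:i+LANES]
        a = activations[i:i+LANES]
        products = w * a              # parallel FP multipliers
        fxp = fp_to_fxp(products)     # FP -> FXP, one converter per lane
        acc += fxp.sum()              # FXP adder tree feeding the accumulator
    return fxp_to_fp(acc)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
a = rng.standard_normal(256).astype(np.float32)
print(mac_tree(w, a), float(np.dot(w, a)))  # close, up to quantization error
```

One general motivation for such an arrangement is that fixed-point addition is exact and order-independent, so the adder tree and accumulator introduce no rounding beyond the per-lane conversion.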
Abstract: Provided is a method and system for efficient hardware mapping of a generative giant artificial intelligence model. A hardware mapping method may include receiving, by at least one processor, model software, and sequentially performing, by the at least one processor, source-code-level simulation, instruction-level simulation, and register-transfer-level (RTL) simulation for the model software.
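A hedged sketch of that three-stage flow, with each stage gating the next. The stage functions, their pass/fail contract, and the file name are assumptions made for illustration; they are not the patented toolchain.

```python
# The model software passes through three simulation levels in order;
# a failure at any level stops the flow. Stage internals are assumed.
from typing import Callable, List

def source_level_sim(model_sw: str) -> bool:
    # e.g. run the reference model in plain software and check its outputs
    print(f"[1] source-code-level simulation of {model_sw}")
    return True

def instruction_level_sim(model_sw: str) -> bool:
    # e.g. compile to the accelerator ISA and simulate the instruction trace
    print(f"[2] instruction-level simulation of {model_sw}")
    return True

def rtl_sim(model_sw: str) -> bool:
    # e.g. replay the same trace against the register-transfer-level design
    print(f"[3] register-transfer-level simulation of {model_sw}")
    return True

def hardware_mapping_flow(model_sw: str) -> bool:
    stages: List[Callable[[str], bool]] = [
        source_level_sim,
        instruction_level_sim,
        rtl_sim,
    ]
    # all() evaluates lazily, so the sequence short-circuits on failure.
    return all(stage(model_sw) for stage in stages)

hardware_mapping_flow("model.sw")
```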
Abstract: Provided is a method and system for weight memory mapping for a streaming operation of giant generative artificial intelligence hardware. A weight memory mapping system may include a weight memory configured to store a weight matrix for a pretrained artificial intelligence model; an input register configured to store a plurality of input data; a first hardware operator configured to process a matrix multiplication operation between the plurality of input data and the weight matrix and to compute a lane-level final sum while the matrix multiplication operation is in progress by reusing a partial sum of the operation; and a second hardware operator configured to preprocess the next matrix multiplication operation, using the final sum, while the current matrix multiplication operation is in progress.
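A minimal sketch of the overlap described above, assuming a tile size of 64 and taking a memory-layout pass as a stand-in for the preprocessing step (both are assumptions): the first operator keeps the running partial sum in place so the lane-level final sum is ready as the last tile lands, and the second operator stages the next matrix in the meantime.

```python
# Streaming matmul with partial-sum reuse plus overlapped staging of the
# next operation. TILE and preprocess_next() are illustrative assumptions.
import numpy as np

TILE = 64

def stream_matmul(x, W):
    """First operator: tile-wise matmul that reuses the running partial sum."""
    partial = np.zeros(W.shape[1])
    for i in range(0, len(x), TILE):
        # The partial sum stays resident instead of being written back and
        # re-read, so the final sum is available as the last tile arrives.
        partial += x[i:i+TILE] @ W[i:i+TILE, :]
    return partial

def preprocess_next(W_next):
    """Second operator: stage the next weight matrix while the matmul runs."""
    return np.ascontiguousarray(W_next)  # stand-in for layout/mapping work

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
W_cur = rng.standard_normal((256, 128))
W_next = rng.standard_normal((128, 64))

y = stream_matmul(x, W_cur)          # final sum computed during streaming
W_staged = preprocess_next(W_next)   # overlapped in hardware, sequential here
z = stream_matmul(y, W_staged)       # next operation consumes the final sum
assert np.allclose(z, (x @ W_cur) @ W_next)
```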
Abstract: Provided is a method and system for verifying an operation and data precision of generative giant artificial intelligence hardware. A verification method may include receiving, by at least one processor, target device information, a model instruction, and a model parameter related to an artificial intelligence (AI) model; constructing a simulator corresponding to real hardware based on the target device information; processing an operation between the model instruction and the model parameter through the constructed simulator; and storing a processing result of the operation in a memory module included in the simulator. Here, the at least one processor may include a CPU and a GPU, and the constructing of the simulator may include constructing a first simulator that uses the CPU in response to a high-precision mode being selected and constructing a second simulator that uses both the CPU and the GPU in response to a low-latency mode being selected.
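A hedged sketch of the mode-dependent construction step. The class names, the Mode enum, and the string stand-ins for real computation are illustrative assumptions, not the patented interfaces.

```python
# Simulator selection: CPU-only for the high-precision mode, CPU+GPU for
# the low-latency mode. Names and internals are assumed for illustration.
from enum import Enum, auto

class Mode(Enum):
    HIGH_PRECISION = auto()
    LOW_LATENCY = auto()

class CpuSimulator:
    """First simulator: high-precision reference path on the CPU."""
    def run(self, instruction, parameter):
        result = f"cpu({instruction}, {parameter})"
        self.memory = result   # store the result in the simulator's memory module
        return result

class CpuGpuSimulator(CpuSimulator):
    """Second simulator: offloads the heavy operations to the GPU."""
    def run(self, instruction, parameter):
        result = f"gpu({instruction}, {parameter})"
        self.memory = result
        return result

def build_simulator(target_device_info: dict, mode: Mode):
    # target_device_info would size memories, lanes, etc.; unused in this sketch
    if mode is Mode.HIGH_PRECISION:
        return CpuSimulator()
    return CpuGpuSimulator()

sim = build_simulator({"device": "lpu"}, Mode.LOW_LATENCY)
print(sim.run("matmul", "weights.bin"))
```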
Abstract: Provided is a latency processing unit. The latency processing unit may include a plurality of multiplier-accumulator (MAC) trees configured to perform a matrix product operation for at least one of a plurality of partitions that implement an artificial intelligence (AI) model; a streamlined memory access unit configured to connect each of the plurality of MAC trees, through a plurality of channels, to the high-bandwidth memory in which the at least one partition is stored; a vector execution engine configured to perform an additional operation on the operation results of the plurality of MAC trees; a local memory unit configured to store the operation results of the vector execution engine and an activation value; and an instruction scheduling unit configured to schedule the operations of the plurality of MAC trees and the vector execution engine.
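A structural sketch of how those blocks could fit together, with NumPy standing in for the hardware datapath. All class and member names are illustrative assumptions; the abstract fixes only the block names and their roles.

```python
# Latency processing unit blocks: MAC trees over weight partitions, a vector
# execution engine for post-processing, a local memory for activations, and
# a step() method standing in for the instruction scheduling unit.
import numpy as np

class MacTree:
    """Matrix-product operator for one weight partition, fed (in hardware)
    from its own high-bandwidth-memory channel via streamlined access."""
    def __init__(self, weight_partition):
        self.w = weight_partition
    def matmul(self, x):
        return x @ self.w

class VectorExecutionEngine:
    """Additional operation on MAC-tree results (ReLU as a stand-in)."""
    def execute(self, y):
        return np.maximum(y, 0.0)

class Lpu:
    def __init__(self, partitions):
        self.mac_trees = [MacTree(p) for p in partitions]  # one per partition
        self.vee = VectorExecutionEngine()
        self.local_memory = {}
    def step(self, x):
        # Instruction scheduling unit: sequence the MAC trees, then the
        # vector engine, then the write to the local memory unit.
        y = np.concatenate([t.matmul(x) for t in self.mac_trees])
        act = self.vee.execute(y)
        self.local_memory["activation"] = act  # store the activation value
        return act

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
lpu = Lpu(np.split(W, 4, axis=1))   # column partitions of the model weights
x = rng.standard_normal(256)
print(lpu.step(x).shape)            # (128,)
```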