Operator Processing Method and Computer Device
An operator processing method includes obtaining a real-time shape of any to-be-output first tensor by combining in real time one or more micro-operators in a pre-constructed micro-operator library. Then, a micro-operator included in one combination (for example, the combination with optimal performance, because different combinations have different performance) is selected for execution. Micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. In addition, shapes of the micro-operators are fixed and different, and are used as a “basis” of “shape space”.
This is a continuation of International Patent Application No. PCT/CN2022/143087 filed on Dec. 29, 2022, which claims priority to Chinese Patent Application No. 202210049301.8 filed on Jan. 17, 2022. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
TECHNICAL FIELD

The present disclosure relates to the field of machine learning, and in particular, to an operator processing method and a computer device.
BACKGROUND

Just-in-time (JIT) compilation is a widely used compilation technology. To be specific, just-in-time compilation is a process in which a compiler is invoked immediately at runtime to perform standard compilation (including front-end lexical/syntactical analysis, intermediate representation analysis, running cost estimation based on a cost model, back-end code generation, and the like) on a specified task, and to generate and execute binary code.
The time consumed by invoking the static shape compiler in just-in-time compilation to generate binary code of an operator whose shape is consistent with the shape of the to-be-output tensor is on the order of seconds (generally within 10 seconds), whereas the calculation overheads of one operator (for example, convolution) are on the order of microseconds to milliseconds. That is, the compilation overheads of just-in-time compilation are excessively high relative to the entire calculation process. In addition, in the dynamic shape scenario, the static shape compiler may be invoked hundreds of thousands of times, and the overall calculation performance of the system is severely affected by just-in-time compilation.
SUMMARY

Embodiments of the present disclosure provide an operator processing method and a computer device. A real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, a micro-operator included in one combination (for example, the combination with optimal performance) is finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of a just-in-time compilation technology at runtime. In addition, shapes of the micro-operators are fixed and different, and are used as a “basis” of “shape space”. Therefore, any shape may be a combination of the one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, a quantity of micro-operators in the micro-operator library is limited, and the occupied memory space is small.
In view of this, embodiments of the present disclosure provide the following technical solutions.
According to a first aspect, an embodiment of the present disclosure first provides an operator processing method, which may be applied to the field of artificial intelligence, and may be specifically applied to a dynamic shape scenario. For example, a common computer vision (CV) model (such as a detection or segmentation model), a speech model (such as an automatic speech recognition (ASR) model), or a natural language processing (NLP) model involves a dynamic shape problem in training and inference. The method includes: obtaining a shape of a to-be-output tensor (that is, from an output perspective), where the to-be-output tensor may be referred to as a first tensor. After a first shape of the to-be-output first tensor is obtained, at least one combination that meets the first shape is further determined. Each combination includes at least one target micro-operator, and each target micro-operator is from a same micro-operator library. It should be noted herein that micro-operators in the micro-operator library are pre-compiled micro-operators, and the micro-operators are independent of each other. It should be noted that in some implementations, any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. In other words, the micro-operator library has completeness of “solution space”, and each micro-operator in the micro-operator library may be considered as a “basis” of “shape space”. There may be one or more combinations of target micro-operators that are selected from the micro-operator library to form the first shape. If there are n combinations, a micro-operator (which may be referred to as a first micro-operator) included in a first combination is selected from the n combinations for execution. The first micro-operator is one or more of the target micro-operators.
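The combination step above can be sketched in a few lines. The following is an illustrative Python sketch, not the claimed implementation: the micro-operator sizes, the single-dimension view of the shape, and all function names are assumptions made for illustration only.

```python
from itertools import combinations_with_replacement

# Hypothetical fixed micro-operator edge sizes, acting as a "basis" of shape space.
MICRO_SIZES = [1, 2, 4, 8, 16]

def combinations_for_length(target, sizes=MICRO_SIZES, max_ops=8):
    """Enumerate multisets of micro-operator sizes whose sum equals `target`.

    Each multiset is one candidate combination covering a single dimension
    of the to-be-output tensor's first shape.
    """
    results = []
    for k in range(1, max_ops + 1):
        for combo in combinations_with_replacement(sizes, k):
            if sum(combo) == target:
                results.append(combo)
    return results

# A runtime-determined dimension of 7 can be tiled as, e.g., (1, 2, 4) or (1, 1, 1, 4).
combos = combinations_for_length(7)
```

Because several multisets sum to the same target, there are usually n > 1 candidate combinations, which is what makes the later cost-based selection meaningful.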
In the foregoing implementation of the present disclosure, a real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, a micro-operator included in one combination (for example, the combination with optimal performance) is finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of a just-in-time compilation technology at runtime. In addition, shapes of the micro-operators are fixed and different, and are used as a “basis” of “shape space”. Therefore, any shape may be a combination of the one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, a quantity of micro-operators in the micro-operator library is limited, and the occupied memory space is small.
In a possible implementation of the first aspect, a final first combination may be selected from the n combinations based on a perspective of a consumed running cost. Specifically, first, a total running cost of micro-operators included in each of the n combinations can be calculated, to obtain n running costs. Then, one running cost that meets a preset condition (which may be referred to as a first preset condition) is selected from the n running costs as a final running cost (namely, a first running cost). For example, a combination with a smallest running cost value can be selected as the first combination, and finally, the micro-operator (namely, the first micro-operator) included in the first combination corresponding to the first running cost is executed.
In the foregoing implementation of the present disclosure, that the final first combination is selected from the n combinations based on the running cost is specifically described. A combination with a small running cost value is obtained by estimating a running cost of a micro-operator included in each combination. This improves system performance.
In a possible implementation of the first aspect, the running cost may be indicated by the total duration required to execute the micro-operators included in a combination. Specifically, the duration required to execute each of m micro-operators included in a target combination is calculated, to obtain m durations, where one micro-operator in the target combination corresponds to one duration. Then, a total duration is obtained based on the m durations (that is, the total duration is obtained by adding the m durations), where the target combination is any one of the n combinations, and m≥1. Finally, a total duration is calculated by using each of the n combinations as the target combination, to obtain n total durations.
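The duration-based selection can be illustrated with a short sketch. The per-micro-operator durations and the helper names below are hypothetical; a real system would measure or model these costs offline for each pre-compiled kernel.

```python
# Hypothetical execution durations (microseconds) per micro-operator size.
DURATION_US = {1: 5.0, 2: 6.0, 4: 8.0, 8: 13.0}

def total_duration(combination):
    """Running cost of one combination = sum of its m micro-operator durations."""
    return sum(DURATION_US[size] for size in combination)

def select_first_combination(combinations):
    """Pick the combination whose total duration is smallest (one possible
    first preset condition)."""
    return min(combinations, key=total_duration)

# Three ways to cover a runtime dimension of 8:
candidates = [(8,), (4, 4), (2, 2, 4)]
best = select_first_combination(candidates)
# (8,) costs 13.0, (4, 4) costs 16.0, and (2, 2, 4) costs 20.0, so (8,) is chosen.
```

The micro-operators of the chosen combination (the first micro-operator) are then the ones actually executed.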
In the foregoing implementation of the present disclosure, a running cost of a micro-operator included in each combination is indicated by total duration required to execute all micro-operators in the combination. This is feasible.
In a possible implementation of the first aspect, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes.
The foregoing implementation of the present disclosure describes a prerequisite to be met by the micro-operator in the micro-operator library, to obtain any shape of the first tensor by combining as few micro-operators as possible. This reduces memory overheads.
In a possible implementation of the first aspect, the micro-operator library is selected from a plurality of candidate micro-operator libraries that are constructed in advance, operator types of micro-operators from different micro-operator libraries are different, and operator types of micro-operators from the same micro-operator library are the same. For example, it is assumed that there are three micro-operator libraries (namely, candidate micro-operator libraries) deployed on a computer device: micro-operator libraries A, B, and C. The micro-operator library A is used to perform a convolution operation, the micro-operator library B is used to perform a matrix multiplication operation, and the micro-operator library C is used to perform a vector addition operation. In this case, each micro-operator in the micro-operator library A is a convolution operator, each micro-operator in the micro-operator library B is a matrix multiplication operator, and each micro-operator in the micro-operator library C is a vector addition operator. It is also assumed that a currently ongoing task is a training process of a convolutional neural network, and a currently invoked micro-operator library is the micro-operator library A in the foregoing three micro-operator libraries that are constructed in advance on the computer device (if the computer device has only one micro-operator library, the currently invoked micro-operator library can only be that one micro-operator library).
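The library selection can be pictured as a lookup keyed by operator type. The library contents and key names below are illustrative assumptions standing in for the libraries A, B, and C of the example.

```python
# Hypothetical candidate micro-operator libraries keyed by operator type;
# each library holds only pre-compiled kernels of that one type.
CANDIDATE_LIBRARIES = {
    "conv2d": {"kernels": ["conv_1x1", "conv_2x2", "conv_4x4"]},   # library A
    "matmul": {"kernels": ["mm_16x16", "mm_32x32"]},               # library B
    "vec_add": {"kernels": ["add_128", "add_256"]},                # library C
}

def select_library(operator_type):
    """Pick the micro-operator library matching the current task's operator type."""
    try:
        return CANDIDATE_LIBRARIES[operator_type]
    except KeyError:
        raise ValueError(f"no micro-operator library for {operator_type!r}")

# Training a convolutional neural network invokes the convolution library.
lib = select_library("conv2d")
```

Because all micro-operators within one library share an operator type, the lookup fully determines which group of kernels the runtime may combine.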
In the foregoing implementation of the present disclosure, that the micro-operator library may be selected from the candidate micro-operator libraries that are constructed is specifically described. This is widely applicable and flexible.
In a possible implementation of the first aspect, shapes of the micro-operators in the micro-operator library are fixed and different, but may take various specific forms. In this embodiment of the present disclosure, it is assumed that all the micro-operators in the micro-operator library are second-order tensors. In this case, all the micro-operators in the micro-operator library may be in a square shape, or may be in a rectangular shape, or may include both squares and rectangles. This is not limited in the present disclosure, provided that any shape of the tensor can be obtained by combining the one or more micro-operators in the micro-operator library.
In the foregoing implementation of the present disclosure, that the micro-operators in the micro-operator library may be in various shapes is described. This is flexible.
In a possible implementation of the first aspect, because each micro-operator in the micro-operator library has a fixed shape (namely, a static shape), each micro-operator may be continuously optimized to obtain optimal performance, and the micro-operator with optimal performance is added to the micro-operator library. That is, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.
In the foregoing implementation of the present disclosure, that each micro-operator in the micro-operator library is also obtained through continuous optimization and selection is specifically described. This is implementable.
In a possible implementation of the first aspect, a selection manner may include: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is one candidate micro-operator whose performance meets a preset condition (which may be referred to as a second preset condition). For example, a micro-operator with optimal performance is selected from the candidate micro-operators and added to the micro-operator library.
In the foregoing implementation of the present disclosure, that the micro-operator whose performance meets a requirement, for example, the micro-operator with the optimal performance, is selected as each micro-operator in the micro-operator library is specifically described. This indirectly improves overall system performance.
In a possible implementation of the first aspect, the attribute information of the candidate micro-operator includes but is not limited to a throughput, occupied bandwidth, and the like. For example, the micro-operator with the optimal performance may be selected from the candidate micro-operators based on two factors: the throughput and the occupied bandwidth. The micro-operator is added to the micro-operator library. In an example, some candidate micro-operators that do not meet a throughput requirement may be first screened out from the candidate micro-operators based on the throughput factor, and then a micro-operator finally added to the micro-operator library is selected from remaining candidate micro-operators based on the bandwidth occupation factor. In another example, different weights may be assigned to the two factors: the throughput and the occupied bandwidth, weighted summation is performed on each candidate micro-operator, and finally, a candidate micro-operator with an optimal result is selected based on a weighted summation result, and added to the micro-operator library. A specific implementation of how to select the micro-operator from the candidate micro-operators and add the candidate micro-operator to the micro-operator library is not limited in the present disclosure.
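The weighted-summation example can be sketched as follows. The attribute values, the weights, and the sign convention (higher throughput is better, lower occupied bandwidth is better) are all illustrative assumptions, not values prescribed by the method.

```python
# Hypothetical attribute records for candidate micro-operators that share one
# fixed shape.
candidates = [
    {"name": "cand_a", "throughput": 90.0, "bandwidth": 30.0},
    {"name": "cand_b", "throughput": 80.0, "bandwidth": 10.0},
]

def score(candidate, w_throughput=0.7, w_bandwidth=0.3):
    """Weighted summation of the two factors. Occupied bandwidth enters with a
    negative sign because a smaller value is preferable."""
    return (w_throughput * candidate["throughput"]
            - w_bandwidth * candidate["bandwidth"])

# The candidate with the best weighted result is added to the micro-operator library.
best = max(candidates, key=score)
```

With these illustrative weights, cand_a scores about 54 and cand_b about 53, so cand_a would be added; different weightings (or the two-stage screening variant described above) can change the outcome.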
The foregoing implementation of the present disclosure specifically describes the specific attribute information of the micro-operator. This is universally applicable.
In a possible implementation of the first aspect, the shape of the first tensor dynamically changes in the dynamic shape scenario. To be specific, the first tensor is a dynamic shape tensor to be output, and a specific value of the shape of the first tensor is determined only at runtime. The shape with the specific value is referred to as the first shape of the first tensor.
In the foregoing implementation of the present disclosure, that the first tensor may be the to-be-output tensor in the dynamic shape scenario is specifically described. This is widely applicable.
According to a second aspect, an embodiment of the present disclosure provides a computer device, and the computer device has a function of performing the method according to any one of the first aspect or the possible implementations of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or the software includes one or more modules corresponding to the foregoing function.
According to a third aspect, an embodiment of the present disclosure provides a computer device. The computer device may include a memory, a processor, and a bus system. The memory is configured to store a program. The processor is configured to invoke the program stored in the memory, to perform the method in any one of the first aspect or the possible implementations of the first aspect of an embodiment of the present disclosure.
According to a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium. The computer-readable storage medium stores instructions. When the instructions are run on a computer, the computer is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.
According to a fifth aspect, an embodiment of the present disclosure provides a computer program. When the computer program is run on a computer, the computer is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.
According to a sixth aspect, an embodiment of the present disclosure provides a chip. The chip (for example, a central processing unit (CPU)) includes at least one processor and at least one interface circuit. The interface circuit is coupled to the processor. The at least one interface circuit is configured to: perform receiving and sending functions, and send instructions to the at least one processor. The at least one processor is configured to run a computer program or the instructions, and has a function of performing the method in any one of the first aspect or the possible implementations of the first aspect. The function may be implemented by hardware, software, or a combination of hardware and software. The hardware or the software includes one or more modules corresponding to the function. In addition, the interface circuit is configured to communicate with another module outside the chip. For example, the interface circuit may send, to a graphics processing unit (GPU) for execution, a first micro-operator included in a first combination obtained by the processor on the chip.
Embodiments of the present disclosure provide an operator processing method and a computer device. A real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, a micro-operator included in one combination (for example, the combination with optimal performance) is finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of a just-in-time compilation technology at runtime. In addition, shapes of the micro-operators are fixed and different, and are used as a “basis” of “shape space”. Therefore, any shape may be a combination of the one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, a quantity of micro-operators in the micro-operator library is limited, and the occupied memory space is small.
In the specification, claims, and accompanying drawings of the present disclosure, the terms such as “first” and “second” are intended to distinguish similar objects but do not necessarily indicate a specific order or sequence. It should be understood that the terms used in such a way are interchangeable in proper circumstances, and this is merely a manner of distinguishing objects having a same attribute in describing embodiments of the present disclosure. In addition, the terms “include” and “contain” and any other variants mean to cover the non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units not expressly listed or inherent to such a process, method, product, or device.
Embodiments of the present disclosure relate to much related knowledge about an operator, a dynamic shape, just-in-time compilation, and the like. To better understand the solutions in embodiments of the present disclosure, the following first describes related terms and concepts that may be used in embodiments of the present disclosure. It should be understood that explanations of related concepts may be limited due to specific cases of embodiments of the present disclosure, but this does not mean that the present disclosure is limited to the specific cases. Specific cases in different embodiments may also vary. This is not specifically limited herein.
(1) Operator

The operator is a general term for a mathematical operation, such as a convolution operator or a matrix multiplication operator. Specifically, the operator may be considered as a mapping from one function space to a function space, that is, O: X→X. The operator in a broad sense can be extended to any space, such as an inner product space.
(2) Tensor

In the field of artificial intelligence, the tensor is a multi-dimensional array. Specifically, the tensor may also be defined, given a basis, as a group of numbers that meet a specific transformation rule. If these numbers are written in a vertical row, they form a first-order tensor. If these numbers are written in a two-dimensional array, they form a second-order tensor. If these numbers are written in a three-dimensional array, they form a third-order tensor. A tensor of order higher than three may be referred to as a higher order tensor.
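The notion of tensor order can be made concrete with a minimal sketch using nested Python lists as a stand-in for a real tensor type; the helper name is an illustrative assumption.

```python
# First-, second-, and third-order tensors written as nested lists.
first_order = [1.0, 2.0, 3.0]                    # vertical row: order 1
second_order = [[1.0, 2.0], [3.0, 4.0]]          # two-dimensional array: order 2
third_order = [[[1.0], [2.0]], [[3.0], [4.0]]]   # three-dimensional array: order 3

def order(t):
    """Order of a nested-list tensor = its nesting depth."""
    return 1 + order(t[0]) if isinstance(t, list) else 0
```

A tensor of order higher than three is built the same way, by one more level of nesting per order.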
(3) Shape of a Tensor

In applications such as AI and high-performance computing (HPC), computationally intensive operations (referred to as operators) such as matrix multiplication, convolution, and vector operations are required. These operators are mathematical function computations on a multi-dimensional array (namely, the tensor). Each dimension of the multi-dimensional array has a specific value, for example, a height H and a width W of a two-dimensional array, and a height H, a width W, and a depth D of a three-dimensional array. The values of these different dimensions are collectively referred to as a shape, namely, the shape of the multi-dimensional array (namely, the tensor).
The shape of the tensor is the shape assigned to the tensor when the tensor is defined. For example, the shape of a 5*5 matrix is a length of 5 and a width of 5. For another example, if a tensor of a convolution kernel (filter) used for convolution calculation is defined as 3*3, the shape of the tensor is a length of 3 and a width of 3. In an example, it is assumed that the pixel dimensions of a convolved picture are 6*6, the depth is 3 (indicating R/G/B, the three primary colors), and the convolution kernel is a 3*3*3 cube. In this case, the convolution kernel is a tensor of shape (3, 3, 3), and may be denoted as Tensor1. The convolved picture may be indicated as a tensor of shape (6, 6, 3), and may be denoted as Tensor2. The operation of performing a convolution operation on the two tensors Tensor1 and Tensor2 is referred to as one operator, denoted as convolution.
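The Tensor1/Tensor2 example can be reproduced with a short sketch; the nested-list representation and the `shape` helper are illustrative assumptions.

```python
def shape(t):
    """Recover the shape tuple of a nested-list tensor, e.g. (6, 6, 3)."""
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

# A 3*3*3 convolution kernel (Tensor1) and a 6*6 picture of depth 3 (Tensor2),
# matching the example above.
tensor1 = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
tensor2 = [[[0.0] * 3 for _ in range(6)] for _ in range(6)]
```

The convolution operator then consumes exactly these two shapes, which is why a change in either shape forces a different operator instance.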
(4) Dynamic Shape of a Tensor

Each tensor indicates a physical meaning, for example, the pixel array of a picture or a convolution kernel. If a shape of a to-be-output tensor of each layer does not change in a model training or inference calculation process, an operator used in this calculation process may be referred to as a fixed shape operator, and the to-be-output tensor is a fixed shape tensor. On the contrary, if a shape of a to-be-output tensor at each layer can change in a model training or inference calculation process, an operator used in this calculation process may be referred to as a dynamic shape operator, and the to-be-output tensor is a dynamic shape tensor.
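A dynamic shape can be pictured with an ASR-style workload where each batch has a different utterance length; the batch contents, the field name, and the feature dimension are hypothetical values chosen for illustration.

```python
# Each utterance has a different number of frames, so the to-be-output
# tensor's first dimension is known only at runtime.
batches = [
    {"utterance_frames": 118},
    {"utterance_frames": 452},
    {"utterance_frames": 37},
]

def output_shape(batch, feature_dim=80):
    """The first dimension varies per batch: a dynamic shape."""
    return (batch["utterance_frames"], feature_dim)

shapes = [output_shape(b) for b in batches]
```

A fixed shape operator could serve only one of these shapes; a dynamic shape workload must handle all of them, which is exactly the scenario the micro-operator combination targets.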
(5) Static Compiler

The static compiler means that the compilation process is completed within an offline life cycle: binary code is generated during compilation, and the binary code is executed by a processor at runtime.
(6) JIT Compiler

The JIT compiler means that the compilation process is completed within the runtime life cycle: the binary code produced by compilation is generated on the fly at runtime.
(7) Static Shape Compiler

The static shape compiler, which may also be referred to as a fixed shape compiler, is a static compiler that supports compiling only fixed shape operators.
(8) Dynamic Shape Compiler

The dynamic shape compiler is a compiler that supports compiling operators of any shape.
(9) Neural Network

The neural network may include neural units, and may be specifically understood as a neural network including an input layer, a hidden layer, and an output layer. Usually, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. A neural network with a plurality of hidden layers is called a deep neural network (DNN). Work at each layer of the neural network may be described according to the mathematical expression y = a(W·x + b), where x is the input vector and y is the output vector. Work at each layer of the neural network may be understood as completing transformation from input space to output space (that is, from row space to column space of a matrix) by performing five operations on the input space (a set of input vectors). The five operations include: 1. dimension increase/dimension reduction; 2. zooming in/out; 3. rotation; 4. translation; and 5. “bending”. The operations 1, 2, and 3 are completed by “W·x”, the operation 4 is completed by “+b”, and the operation 5 is implemented by “a( )”. The word “space” is used herein for expression because a classified object is not a single thing, but a type of thing. Space is a set of all individuals of such a type of thing. W is the weight matrix of each layer of the neural network, and each value in the matrix indicates a weight value of one neuron at the layer. The matrix W determines the space transformation from the input space to the output space described above. In other words, W at each layer of the neural network controls how to transform space. A purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of the neural network is essentially learning how to control space transformation, and more specifically, learning the weight matrices.
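The per-layer expression y = a(W·x + b) can be sketched directly; the sigmoid activation, the layer dimensions, and the function name are illustrative choices, not part of the definition.

```python
import math

def dense_layer(x, W, b):
    """One layer computing y = a(W·x + b), with a(·) chosen as the sigmoid."""
    y = []
    for row, bias in zip(W, b):
        z = sum(w * xi for w, xi in zip(row, x)) + bias  # W·x + b
        y.append(1.0 / (1.0 + math.exp(-z)))             # activation a(·)
    return y

# A 2-input, 2-output layer: W rotates/scales the input space, b translates it,
# and the sigmoid supplies the nonlinear "bending".
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
out = dense_layer([0.0, 0.0], W, b)
```

Stacking such layers, with a different W and b at each layer, is what gives the deep neural network its composed space transformation.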
(10) Loss Function

During training of the neural network, because it is expected that an output of the neural network is as close as possible to the value that is actually expected to be predicted, a current prediction value of the network may be compared with the target value that is actually expected, and then the weight matrix at each layer of the neural network is updated based on the difference between the current prediction value and the target value (there is usually an initialization process before the first update, that is, a parameter is preconfigured for each layer of the neural network). For example, if the prediction value of the network is excessively high, the weight matrix is adjusted to lower the prediction value until the neural network can predict the target value that is actually expected. Therefore, “how to obtain the difference between the predicted value and the target value through comparison” needs to be predefined. This is the role of a loss function or an objective function. The loss function and the objective function are important equations that measure the difference between the predicted value and the target value. The loss function is used as an example: a higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the neural network is a process of minimizing the loss as much as possible.
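As a concrete instance of such a difference measure, the following sketch uses the mean squared error; the choice of this particular loss and the sample values are illustrative assumptions.

```python
def mse_loss(predicted, target):
    """Mean squared error: one common loss measuring the gap between the
    network's prediction and the expected target."""
    n = len(predicted)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / n

# A smaller loss means the prediction is closer to the target, so training
# drives this value down.
far = mse_loss([0.9, 0.1], [0.0, 1.0])   # poor prediction
near = mse_loss([0.1, 0.9], [0.0, 1.0])  # better prediction
```

The weight update described above is exactly what moves the network from the high-loss prediction toward the low-loss one.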
In the training process of a neural network, the value of a parameter of the neural network model may be corrected by using the error back propagation (BP) algorithm, so that the reconstruction error loss of the neural network model becomes increasingly smaller. Specifically, an input signal is transferred forward until an error loss occurs at the output, and the parameters in the initial neural network model are updated based on back propagation of the error loss information, to make the error loss converge. The back propagation algorithm is a back propagation process centered on the error loss, and is intended to obtain the parameters, such as the weight matrices, of an optimal neural network model.
The following describes embodiments of the present disclosure with reference to the accompanying drawings. A person of ordinary skill in the art may learn that, with development of technologies and emergence of a new scenario, the technical solutions provided in embodiments of the present disclosure are also applicable to a similar technical problem.
First, a system architecture provided in an embodiment of the present disclosure is described. For details,
It should be noted that an actual implementation product form of the present disclosure may be a software form, or may be a hardware form. This is not specifically limited in the present disclosure. For ease of description, the following uses an example in which the product form is the software form for description. Specifically, refer to
In this embodiment of the present disclosure, the micro-operator library may be a single-kernel micro-operator library, or may be a multi-kernel micro-operator library. This is not specifically limited in the present disclosure. The single-kernel micro-operator library means that each micro-operator in the micro-operator library is implemented as single-kernel code.
The micro-operator library may support a plurality of different operator types, for example, convolution, matrix multiplication, and vector addition operations. However, it should be noted that operators in a same micro-operator library are of a same operator type. These operators are packaged, in the form of binary code of one group of micro-operators, in a micro-operator library of the type to which the operators belong. To be specific, for a convolution operator, binary code of one group of convolution micro-operators of fixed and mutually different shapes is implemented, for example, convolution micro-operators that support 1*1, 2*2, 4*4, . . . , or 512*512 calculation. However, it should be noted that these micro-operators of different shapes need to have the following features: a. The binary code can be implemented; b. Each micro-operator has a fixed shape that differs from the others; c. The micro-operators are independent of each other and have their own attribute information, such as bandwidth and throughput.
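The packaging of one such group can be sketched as building a table of fixed-shape entries. The power-of-two shape progression from 1*1 to 512*512 follows the example above, but the record fields and file names are illustrative assumptions.

```python
# Sketch of packaging one group of fixed-shape convolution micro-operators.
def build_conv_library():
    library = {}
    size = 1
    while size <= 512:
        library[(size, size)] = {
            "binary": f"conv_{size}x{size}.bin",  # pre-compiled kernel (feature a)
            "throughput": None,                   # attribute filled offline (feature c)
            "bandwidth": None,
        }
        size *= 2  # each entry has a fixed, distinct shape (feature b)
    return library

lib = build_conv_library()
```

Because the shape progression is geometric, the library stays small (ten entries here) while still spanning a wide range of runtime shapes by combination.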
(2) Real-Time Cost Model

In the foregoing implementation of the present disclosure, any shape of a to-be-output tensor can be obtained in “real time” by combining one or more micro-operators in the micro-operator library. Different combinations have different calculation performance. Therefore, a “real-time” cost model is required to obtain, through calculation based on the attributes of each micro-operator, a combination with optimal performance, and finally to execute the micro-operators included in the combination with the optimal performance.
The real-time cost model may be embedded in a standard operator launch and execution process in a form of one independent module. As shown in
It should be noted herein that the execution process is performed by a GPU (for example, an AI Core) of the computer device by starting an operator, and the AI Core is also referred to as the device end. After an AI CPU selects, based on the real-time cost model, the combination whose micro-operators are to be finally executed, the micro-operators are launched to the AI Core, and the AI Core performs the specific micro-operator execution operation.
It should be noted that, in some implementations of the present disclosure, the real-time cost model may be further divided into the following three modules.
- (1) Combiner: The combiner combines, in real time, one or more fixed-shape micro-operators in the micro-operator library, to form any shape (namely, the dynamic shape) of the to-be-output tensor, where there may be one or more such combinations.
- (2) Cost model: The cost model uses the combiner's result as an input, and determines, through calculation based on an attribute (for example, bandwidth or a throughput) of each micro-operator, a specific combination whose calculation performance meets a preset condition (for example, optimal calculation performance) among the plurality of different combinations.
- (3) Device parameter calculation: A combination output by the cost model is a set of micro-operators that are to be scheduled to each device (for example, a GPU) and actually executed. Therefore, a corresponding parameter needs to be calculated for each device, so that the binary code of the corresponding micro-operator can be correctly executed.
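The interplay of the combiner (module 1) and the cost model (module 2) can be sketched in a deliberately simplified one-dimensional form, where each micro-operator is reduced to a fixed length and an illustrative per-instance cost. The function names, the library contents, and the cost values are invented for illustration and do not reflect the actual modules of the disclosure.

```python
from itertools import combinations_with_replacement

# Hypothetical 1-D analogue: micro-operator "shapes" are fixed lengths, and a
# dynamic output length must be covered exactly by some multiset of them.
LIB = {1: 1.0, 2: 1.8, 4: 3.0}  # length -> illustrative per-instance cost

def combiner(target, sizes):
    """Module (1): enumerate multisets of fixed sizes summing to target."""
    results = []
    for r in range(1, target + 1):
        for combo in combinations_with_replacement(sorted(sizes), r):
            if sum(combo) == target:
                results.append(combo)
    return results

def cost_model(combos, lib):
    """Module (2): pick the combination with the smallest total cost."""
    return min(combos, key=lambda c: sum(lib[s] for s in c))

combos = combiner(7, LIB)        # e.g. (1, 2, 4), (1, 1, 1, 4), ...
best = cost_model(combos, LIB)   # (1, 2, 4) has the smallest total cost here
```

Module (3) would then translate the winning combination into per-device launch parameters, which is hardware-specific and is omitted from this sketch.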
Based on the foregoing system architecture, the following describes in detail an operator processing method provided in an embodiment of the present disclosure. Specifically,
401: Obtain a first shape of a first tensor.
First, a computer device may obtain a shape of a to-be-output tensor (that is, from an output perspective) by using a CPU (namely, a host end), where the to-be-output tensor may be referred to as the first tensor. A shape of the first tensor dynamically changes in a dynamic shape scenario. To be specific, the first tensor is a dynamic shape tensor to be output, and a specific value of the shape of the first tensor is determined only at runtime. The shape with the specific value is referred to as the first shape of the first tensor.
It should be noted that, in this embodiment of the present disclosure, the computer device may include a handheld terminal device, for example, a mobile phone, a computer, or an iPad, or may include a smart wearable device, for example, a smart band, a smart watch, or a smart heart rate meter, or may include a wheeled mobile device, for example, a vehicle (for example, an autonomous driving vehicle), an aircraft, or a robot (for example, a robotic vacuum cleaner). A specific product form of the computer device is not limited in the present disclosure. Any electronic device that can be used to perform the operator processing method in the present disclosure may be referred to as the computer device.
402: Determine at least one combination that meets a first shape, where each combination includes at least one target micro-operator, each target micro-operator is from a micro-operator library, and micro-operators in the micro-operator library are pre-compiled and independent of each other.
After obtaining the first shape of the to-be-output first tensor, the computer device further determines the at least one combination that meets the first shape. Each combination includes at least one target micro-operator, and the target micro-operators are from a same micro-operator library. It should be noted herein that micro-operators in the micro-operator library are pre-compiled micro-operators, and the micro-operators are independent of each other. It should be noted that in some implementations of the present disclosure, any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. In other words, the micro-operator library has completeness of “solution space”, and each micro-operator in the micro-operator library may be considered as a “basis” of “shape space”. In this embodiment of the present disclosure, the micro-operators in the micro-operator library are pre-compiled, and only micro-operators with optimal performance are selected into the library. Therefore, a compiler is not needed at runtime, and only appropriate target micro-operators need to be selected from the micro-operator library for combination. This reduces compilation overheads.
It should be noted that, in some implementations of the present disclosure, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes. For ease of understanding, the following provides an example for illustration. Specifically,
It should be noted that, in some implementations of the present disclosure, shapes of the micro-operators in the micro-operator library are fixed and different, but are not arbitrary. In this embodiment of the present disclosure, it is assumed that all the micro-operators in the micro-operator library are second-order tensors, so that the micro-operators in the micro-operator library may all be in a square shape, or may all be in a rectangular shape, or may include both squares and rectangles. This is not limited in the present disclosure, provided that any shape of the tensor can be obtained by combining the one or more micro-operators in the micro-operator library.
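As one possible illustration of how fixed-shape micro-operators can be combined into an arbitrary second-order shape, the greedy sketch below tiles each axis with the largest basis sizes first. It assumes, purely for illustration, that the library contains a rectangular micro-operator for every pair of basis sizes and that the basis includes size 1; the function names are hypothetical.

```python
def tile_axis(length, sizes):
    """Greedily cover one axis length with the largest fixed sizes first."""
    tiles, remaining = [], length
    for s in sorted(sizes, reverse=True):
        count, remaining = divmod(remaining, s)
        tiles += [s] * count
    assert remaining == 0  # basis contains 1, so any length is coverable
    return tiles

def tile_shape(shape, sizes):
    """One combination covering an M*N output with rectangular micro-operators
    whose side lengths are drawn from the fixed basis sizes."""
    rows, cols = (tile_axis(d, sizes) for d in shape)
    return [(r, c) for r in rows for c in cols]

tiles = tile_shape((6, 3), {1, 2, 4})  # covers the 6*3 output exactly
```

A greedy per-axis cover yields only one combination; an actual combiner would enumerate several combinations so that the cost model has alternatives to compare.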
In addition, it should be further noted that, in this embodiment of the present disclosure, the micro-operator library is obtained based on at least one candidate micro-operator library that is constructed in advance, and operator types of micro-operators from a same micro-operator library are the same. For example, it is assumed that three micro-operator libraries (namely, candidate micro-operator libraries) are deployed on a computer device: micro-operator libraries A, B, and C. The micro-operator library A is used to perform a convolution operation, the micro-operator library B is used to perform a matrix multiplication operation, and the micro-operator library C is used to perform a vector addition operation. In this case, each micro-operator in the micro-operator library A is a convolution operator, each micro-operator in the micro-operator library B is a matrix multiplication operator, and each micro-operator in the micro-operator library C is a vector addition operator. It is also assumed that a currently ongoing task is a training process of a convolutional neural network, and a currently invoked micro-operator library is the micro-operator library A in the foregoing three micro-operator libraries that are constructed in advance on the computer device (if the computer device has only one micro-operator library, the currently invoked micro-operator library can only be that one micro-operator library). It should be noted herein that, in this embodiment of the present disclosure, the micro-operator library may be a single-kernel micro-operator library, or may be a multi-kernel micro-operator library. This is not specifically limited in the present disclosure. The single-kernel micro-operator library means that each micro-operator in the micro-operator library is implemented as single-kernel code.
It should be further noted that, in some implementations of the present disclosure, because each micro-operator in the micro-operator library has a fixed shape (namely, a static shape), each micro-operator may be continuously optimized to obtain optimal performance, and a micro-operator with optimal performance is added to the micro-operator library, that is, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.
In an example, a selection manner may include: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is one candidate micro-operator whose performance meets a preset condition (which may be referred to as a second preset condition). For example, a micro-operator with optimal performance is selected from the candidate micro-operators and added to the micro-operator library.
It should be noted that, in some implementations of the present disclosure, the attribute information of the candidate micro-operator includes but is not limited to a throughput, occupied bandwidth, and the like. For example, the micro-operator with the optimal performance may be selected from the candidate micro-operators based on two factors: the throughput and the occupied bandwidth. The micro-operator is added to the micro-operator library. In an example, some candidate micro-operators that do not meet a throughput requirement may be first screened out from the candidate micro-operators based on the throughput factor, and then a micro-operator finally added to the micro-operator library is selected from remaining candidate micro-operators based on the bandwidth occupation factor. In another example, different weights may be assigned to the two factors: the throughput and the occupied bandwidth, weighted summation is performed on each candidate micro-operator, and finally, a candidate micro-operator with an optimal result is selected based on a weighted summation result, and added to the micro-operator library. A specific implementation of how to select the micro-operator from the candidate micro-operators and add the candidate micro-operator to the micro-operator library is not limited in the present disclosure.
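The weighted-summation variant described above might look as follows. The weights, field names, and candidate attribute values are invented for illustration only; the disclosure does not fix a particular scoring formula.

```python
def select_best(candidates, w_throughput=0.7, w_bandwidth=0.3):
    """Pick the same-shape candidate with the best weighted score:
    higher throughput is better, higher occupied bandwidth is worse."""
    def score(c):
        return w_throughput * c["throughput"] - w_bandwidth * c["bandwidth"]
    return max(candidates, key=score)

# Two pre-compiled candidates of the same shape (illustrative numbers):
candidates = [
    {"name": "v1", "throughput": 10.0, "bandwidth": 4.0},  # score 5.8
    {"name": "v2", "throughput": 9.0, "bandwidth": 1.0},   # score 6.0
]
best = select_best(candidates)  # "v2" wins despite its lower throughput
```

The screening variant mentioned above would instead first drop candidates below a throughput threshold and then rank the remainder by occupied bandwidth; either way, only the winner is added to the micro-operator library.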
The following describes how to select a micro-operator with optimal performance from at least two candidate micro-operators whose shapes are the same and that are pre-compiled.
For ease of understanding, the following first uses matrix multiplication as an example to describe “tiling”, a core concept of operator compilation. Specifically,
403: If there are n combinations, select, from the n combinations, a first micro-operator included in a first combination for execution, where the first micro-operator is one or more of the target micro-operators, and n≥2.
There may be one or more combinations of target micro-operators that are selected from the micro-operator library, to form the first shape. The following separately describes the combinations.
(1) One Combination

If there is one combination, there is no comparison condition, and a micro-operator included in the combination is directly executed. Specifically, an AI CPU of the computer device transfers the micro-operator included in the combination to an AI core, to start a micro-operator execution operation.
(2) N Combinations, where n≥2
If there are n combinations, a micro-operator (which may be referred to as a first micro-operator) included in a first combination is selected from the n combinations for execution. The first micro-operator is one or more of the target micro-operators.
It should be noted that, in some implementations of the present disclosure, the final first combination may be selected from the n combinations from a perspective of the consumed running cost. Specifically, first, a total running cost of the micro-operators included in each of the n combinations can be calculated, to obtain n running costs. Then, one running cost that meets a preset condition (which may be referred to as a first preset condition) is selected from the n running costs as the final running cost (namely, a first running cost). For example, the combination with the smallest running cost value can be selected as the first combination, and finally, the micro-operator (namely, the first micro-operator) included in the first combination corresponding to the first running cost is executed. For a specific process, refer to
For ease of understanding the process, the following is an example for illustration. It is assumed that for a specific first shape, there are totally three combinations: a combination 1, a combination 2, and a combination 3. It is assumed that target micro-operators included in the combination 1 include two micro-operators A and one micro-operator B, target micro-operators included in the combination 2 include four micro-operators C and one micro-operator D, and target micro-operators included in the combination 3 include one micro-operator A and three micro-operators E. A total running cost of micro-operators included in each combination is calculated, to separately obtain a running cost 1 corresponding to the combination 1, a running cost 2 corresponding to the combination 2, and a running cost 3 corresponding to the combination 3. Assuming that a value of the running cost 2 is lowest among the three running costs, the four micro-operators C and the one micro-operator D included in the combination 2 corresponding to the running cost 2 may be executed.
It should be noted that, in some implementations of the present disclosure, the running cost may be indicated by the total duration required to execute the micro-operators included in a combination. Specifically, the duration required to execute each of m micro-operators included in a target combination is calculated, to obtain m durations, where one micro-operator in the target combination corresponds to one duration. Then, a total duration is obtained based on the m durations (that is, the total duration is obtained by adding the m durations), where the target combination is any one of the n combinations, and m≥1. Finally, a total duration is calculated by using each of the n combinations as the target combination, to obtain n total durations.
For ease of understanding the process, the foregoing example is still used for further illustration. It is assumed that for a specific first shape, there are totally three combinations: a combination 1, a combination 2, and a combination 3. It is assumed that target micro-operators included in the combination 1 include two micro-operators A and one micro-operator B, target micro-operators included in the combination 2 include four micro-operators C and one micro-operator D, and target micro-operators included in the combination 3 include one micro-operator A and three micro-operators E. The combination 1 includes three target micro-operators. Therefore, the duration required to execute each micro-operator in the combination 1 may be calculated, three durations are obtained in total, and the three durations are added to obtain a total duration A. Similarly, the combination 2 includes five target micro-operators, five durations may be obtained through calculation, and the five durations are added to obtain a total duration B. The combination 3 includes four target micro-operators, four durations may be obtained through calculation, and the four durations are added to obtain a total duration C. Finally, the final combination may be selected by comparing the total durations corresponding to the combinations.
It should be noted that, in this embodiment of the present disclosure, in a specific calculation process, a running cost of each combination may be calculated based on the following formula (1):

TotalDuration = Σi ⌈Amount(ei)/HardwareAICoreNumber⌉ × DataVolume(ei)/Throughput(ei)  (1)

ei indicates a specific micro-operator in the micro-operator library, for example, a micro-operator of shape 512*512. DataVolume(ei) indicates a total calculation amount of the micro-operator ei, namely, a quantity of floating-point operations (a sum of floating-point multiplications and additions). Throughput(ei) indicates a throughput of the micro-operator ei, and a unit is FLOPS, that is, “floating-point operations per second”, which indicates an amount of floating-point multiplication or addition performed per second. DataVolume(ei)/Throughput(ei) indicates the duration required by a single AI core to calculate one ei, and a unit is second.

Amount(ei) indicates a quantity of ei in a current combination. HardwareAICoreNumber indicates a quantity of AI cores in a computer device; for example, the Ascend 910 has 32 AI cores. ⌈ ⌉ indicates rounding up, and ⌈Amount(ei)/HardwareAICoreNumber⌉ indicates a quantity of rounds required to calculate all ei. For example, if there are 33 ei in the current combination, that is, Amount(ei)=33, and HardwareAICoreNumber=32, 33/32 is rounded up to 2, that is, “two rounds” are required to complete calculation of all ei.

⌈Amount(ei)/HardwareAICoreNumber⌉ × DataVolume(ei)/Throughput(ei) indicates the time required to complete calculation of all ei. Finally, any to-be-calculated first shape may be decomposed into a plurality of micro-operators, for example, x e1, y e2, and z e3. For each ei (namely, e1, e2, and e3), the required duration may be obtained through calculation based on the foregoing formula (1). Finally, the sum of the time required by the three ei is obtained, which is the final total duration of the micro-operators included in the combination.
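A direct transcription of the running-cost calculation described above into Python, under the assumption that each distinct micro-operator in a combination is given as an (amount, data volume, throughput) triple; the function name and the numeric values in the example are illustrative only.

```python
import math

def total_duration(combination, num_ai_cores):
    """Running cost of one combination, per formula (1):
    for each distinct micro-operator e_i,
      ceil(Amount(e_i) / HardwareAICoreNumber) * DataVolume(e_i) / Throughput(e_i)
    summed over all distinct e_i in the combination.

    combination: list of (amount, data_volume_flop, throughput_flops) triples.
    """
    total = 0.0
    for amount, data_volume, throughput in combination:
        rounds = math.ceil(amount / num_ai_cores)   # scheduling rounds for e_i
        total += rounds * data_volume / throughput  # seconds to finish all e_i
    return total

# 33 instances on 32 AI cores -> 2 rounds, as in the example above.
cost = total_duration([(33, 1e6, 1e9)], num_ai_cores=32)
```

A caller would evaluate this for every candidate combination and execute the micro-operators of the combination with the smallest result.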
For ease of further understanding of the foregoing process of step 401 to step 403, the following uses a specific instance to describe the operator processing method provided in this embodiment of the present disclosure. For an implementation process of the instance, refer to
It should be noted herein that, in the embodiment corresponding to
It should be further noted that, in this embodiment of the present disclosure, because the micro-operators included in the micro-operator library are used as “bases” of “shape space”, generally, a larger quantity of “bases” indicates more combinations, which means that a micro-operator combination with better performance can be finally selected. However, one disadvantage is that the combination process and the performance estimation process are correspondingly slowed down when the quantity of “bases” is larger. On the contrary, a smaller quantity of “bases” indicates fewer combinations (on a premise that any shape can still be obtained through combination), and a corresponding disadvantage is that operator performance is impaired. Therefore, in an actual application process, a compromise needs to be considered. Generally, a quantity of micro-operators in a micro-operator library is less than 10. This also indicates that the micro-operator library constructed in advance in this embodiment of the present disclosure needs to store binary code of only a limited quantity of micro-operators, and any dynamic shape of the first tensor can be obtained through combination in real time at runtime. This avoids high memory overheads and overcomes a problem of poor dynamic shape programmability.
Further, the operator processing method in this embodiment of the present disclosure may be applied to an architecture shown in
It should be noted that the operator processing method provided in this embodiment of the present disclosure may be applied to various AI chips. For example, for an AI-dedicated chip such as Ascend, a to-be-implemented micro-operator (for example, matrix multiplication) may be selected, and micro-operators of 1*1, 2*2, 4*4, . . . , and the like are separately implemented from an output perspective, up to a single-core matrix operator of fixed shape N*N (the N*N operator being the single-core (Ascend AI core) micro-operator with the highest performance on the target platform, namely, the micro-operator with the largest single-core throughput), to form a micro-operator library. For these fixed micro-operators, the code implementation with the highest single-core performance can be automatically searched for in advance by using a tool like AutoTVM; the code is compiled into binary code, and the binary code is stored in the micro-operator library. Then, in a graph execution engine of a DNN, the combiner, the real-time cost model, and a related device parameter calculation module are separately implemented. The combiner considers various different micro-operator combinations to form the shape of the to-be-output first tensor. The real-time cost model may be implemented based on formula (1), and is used to obtain, through calculation in real time, the combination with the smallest running cost. For the micro-operators included in the combination with the smallest running cost, related parameter information is calculated by using the device parameter calculation module, and the micro-operators are launched by the micro-operator scheduler to each AI core for execution.
In this embodiment of the present disclosure, a host end (namely, an AI CPU) does not run a compiler (that is, does not perform a specific compilation process), but runs the combiner and the real-time cost model. This reduces compilation overheads. A device end (namely, the AI Core) executes binary code of a fixed shape micro-operator, and each micro-operator has highest single-core performance. Performance, a throughput, and bandwidth usage of each micro-operator are known in advance. This improves overall calculation performance of a system.
Finally, an application scenario of embodiments of the present disclosure is described. The operator processing method provided in this embodiment of the present disclosure may be applied to a dynamic shape scenario. For example, CV models (such as detection and segmentation models), speech models (for example, an ASR model), and NLP models all involve a dynamic shape problem in training and inference. Typically, an image resolution and a batch data size (batch size) change in picture detection, input lengths of ASR corpus slices are different, and a size of each tensor in a common sparse model in an advertisement recommendation service is also variable. The operator processing method provided in this embodiment of the present disclosure may be applied to these application scenarios.
Based on the corresponding embodiments, the following further provides a related computer device used to implement the solutions, to better implement the solutions in embodiments of the present disclosure. The computer device may include a handheld terminal device, for example, a mobile phone, a computer, or an iPad, or may include a smart wearable device, for example, a smart band, a smart watch, or a smart heart rate meter, or may include a wheeled mobile device, for example, a vehicle (for example, an autonomous driving vehicle), an aircraft, or a robot (for example, a robotic vacuum cleaner). A specific product form of the computer device is not limited in the present disclosure. Any electronic device that can be used to perform the operator processing method in the present disclosure may be referred to as the computer device. Specifically,
In a possible design, the selection module 1203 is specifically configured to: calculate a total running cost of micro-operators included in each of the n combinations, to obtain n running costs, select one running cost that meets a first preset condition from the n running costs as a first running cost, and finally execute the first micro-operator included in the first combination corresponding to the first running cost.
In a possible design, if the running cost is indicated by a total duration, the selection module 1203 is further specifically configured to: calculate a duration required to execute each of m micro-operators included in a target combination, to obtain m durations, obtain a total duration based on the m durations, where the target combination is any one of the n combinations, and m≥1, and calculate a total duration by using each of the n combinations as the target combination, to obtain n total durations.
In a possible design, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes.
In a possible design, the micro-operator library is selected from a plurality of candidate micro-operator libraries that are constructed in advance, operator types of micro-operators from different micro-operator libraries are different, and operator types of micro-operators from the same micro-operator library are the same.
In a possible design, the shape of the micro-operator in the micro-operator library includes at least one of the following: a square and a rectangle.
In a possible design, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.
In a possible design, a selection manner includes but is not limited to: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is one candidate micro-operator whose performance meets a second preset condition.
In a possible design, the attribute information includes at least any one of the following: a throughput and occupied bandwidth.
In a possible design, the first tensor is a dynamic shape tensor to be output.
It should be noted that content such as information exchange and an execution process between the modules/units in the computer device 1200 in
An embodiment of the present disclosure further provides a computer device.
The computer device 1300 may further include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, for example, Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In this embodiment of the present disclosure, the central processing unit 1322 is configured to perform the operator processing method in the embodiment corresponding to
It should be noted that the central processing unit 1322 may be further configured to perform any step in the method embodiment corresponding to
An embodiment of the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a program for signal processing. When the program is run on a computer, the computer is enabled to perform the steps performed by the computer device in the descriptions of the foregoing embodiments.
In addition, it should be noted that the described apparatus embodiments are only examples. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected based on actual requirements to achieve the objectives of the solutions in embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided by the present disclosure, connection relationships between modules indicate that the modules have communication connections with each other, which may be specifically implemented as one or more communication buses or signal cables.
Based on the description of the foregoing implementations, a person skilled in the art may clearly understand that the present disclosure may be implemented by software in addition to necessary universal hardware, or by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Usually, any function implemented by a computer program may be easily implemented by using corresponding hardware. In addition, specific hardware structures used to implement a same function may be various, for example, an analog circuit, a digital circuit, or a dedicated circuit. In some embodiments, technical solutions of the present disclosure may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, for example, a floppy disk, a Universal Serial Bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in embodiments of the present disclosure.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the process or functions according to embodiments of the present disclosure are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device, such as a training device or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid-state disk (SSD)), or the like.
Claims
1. A method comprising:
- obtaining a first shape of a first tensor;
- determining n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination in the n combinations comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and wherein micro-operators in the micro-operator library are pre-compiled and independent of each other; and
- selecting, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.
2. The method of claim 1, wherein selecting, from the n combinations, the first micro-operator in the first combination for execution comprises:
- calculating a total running cost of micro-operators comprised in each of the n combinations to obtain n running costs;
- selecting, from the n running costs, a first running cost satisfying a preset condition; and
- executing the first micro-operator in the first combination corresponding to the first running cost.
3. The method of claim 2, wherein calculating the total running cost of the micro-operators comprised in each of the n combinations to obtain the n running costs comprises:
- calculating a duration required to execute each of m micro-operators comprised in a target combination to obtain m durations, wherein the target combination is any one of the n combinations, and wherein m≥1;
- obtaining a total duration of the target combination based on the m durations; and
- calculating the total duration for each of the n combinations as the target combination to obtain n total durations, wherein the n total durations are the n running costs.
4. The method of claim 1, wherein each micro-operator in the micro-operator library has a fixed shape, and wherein different micro-operators have different shape sizes.
5. The method of claim 1, further comprising selecting the micro-operator library from a plurality of pre-constructed candidate micro-operator libraries, wherein micro-operators from different micro-operator libraries have different operator types, and wherein the micro-operators from a same micro-operator library have a same operator type.
6. The method of claim 1, wherein a shape of a micro-operator in the micro-operator library is square or rectangular.
7. The method of claim 1, further comprising selecting each micro-operator in the micro-operator library from at least two pre-compiled candidate micro-operators having a same shape.
8. The method of claim 7, wherein selecting each micro-operator in the micro-operator library comprises:
- calculating a performance of each pre-compiled candidate micro-operator based on attribute information of each candidate micro-operator; and
- using a target candidate micro-operator as one micro-operator in the micro-operator library, wherein the target candidate micro-operator is a candidate micro-operator whose performance satisfies a second preset condition.
9. The method of claim 8, wherein the attribute information comprises at least a throughput or occupied bandwidth.
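Claims 7 through 9 describe populating the library by comparing same-shape pre-compiled candidates on their attribute information. A minimal sketch follows; the scoring formula (favoring throughput, penalizing occupied bandwidth) and all field names are assumptions, since the claims only name the attributes and a "second preset condition".

```python
# Hypothetical sketch of claims 7-9: among pre-compiled candidates
# sharing one shape, keep the candidate whose performance, estimated
# from its attribute information, best satisfies the condition
# (taken here as "highest score").

def performance(candidate):
    # Assumed scoring: reward throughput, penalize occupied bandwidth.
    return candidate["throughput"] / (1.0 + candidate["occupied_bandwidth"])

def pick_for_library(candidates):
    """Select the best-performing candidate for the micro-operator library."""
    return max(candidates, key=performance)

cands = [
    {"impl": "tiled_v1", "throughput": 400.0, "occupied_bandwidth": 1.0},
    {"impl": "tiled_v2", "throughput": 360.0, "occupied_bandwidth": 0.2},
]
best = pick_for_library(cands)
```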
10. The method of claim 1, further comprising outputting the first tensor, wherein the first tensor is a dynamic shape tensor.
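The method claims above repeatedly refer to "n combinations of micro-operators satisfying the first shape". A one-dimensional sketch of how such combinations might be enumerated is shown below; the fixed micro-operator sizes and the tiling-by-sum model are illustrative assumptions, not the claimed implementation.

```python
# Hypothetical 1-D illustration: fixed micro-operator sizes act as a
# "basis", and every non-decreasing sequence of sizes summing to the
# target dimension is one candidate combination.

def combinations_for(length, sizes):
    """Enumerate non-decreasing size sequences that sum to `length`."""
    results = []

    def search(remaining, prefix, min_size):
        if remaining == 0:
            results.append(list(prefix))
            return
        for s in sizes:
            if min_size <= s <= remaining:
                prefix.append(s)
                search(remaining - s, prefix, s)
                prefix.pop()

    search(length, [], 0)
    return results

# A tensor dimension of 96 covered by micro-operators of sizes 32 and 64:
combos = combinations_for(96, [32, 64])
```

For a real dynamic-shape tensor the search would run per dimension, producing the n ≥ 2 combinations whose costs are then compared as in claims 2 and 3.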
11. A device comprising:
- an obtainer configured to obtain a first shape of a first tensor;
- a combiner configured to determine n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and wherein micro-operators in the micro-operator library are pre-compiled and independent of each other; and
- a selector configured to select, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.
12. The device of claim 11, wherein the selector is further configured to:
- calculate a total running cost of micro-operators comprised in each of the n combinations to obtain n running costs;
- select, from the n running costs, a first running cost satisfying a preset condition; and
- execute the first micro-operator in the first combination corresponding to the first running cost.
13. The device of claim 12, wherein the selector is further configured to:
- calculate a duration required to execute each of m micro-operators comprised in a target combination to obtain m durations, wherein the target combination is any one of the n combinations, and wherein m≥1;
- obtain a total duration of the target combination based on the m durations; and
- calculate the total duration for each of the n combinations as the target combination to obtain n total durations, wherein the n total durations are the n running costs.
14. The device of claim 11, wherein each micro-operator in the micro-operator library has a fixed shape, and wherein different micro-operators have different shape sizes.
15. The device of claim 11, wherein the selector is further configured to select the micro-operator library from a plurality of pre-constructed candidate micro-operator libraries, wherein the micro-operators from different micro-operator libraries have different operator types, and wherein the micro-operators from a same micro-operator library have a same operator type.
16. The device of claim 11, wherein a shape of a micro-operator in the micro-operator library is square or rectangular.
17. The device of claim 11, wherein the selector is further configured to select each micro-operator in the micro-operator library from at least two pre-compiled candidate micro-operators having a same shape.
18. The device of claim 17, wherein the selector is further configured to calculate performance of each pre-compiled candidate micro-operator based on attribute information of each candidate micro-operator and to use a target candidate micro-operator as one micro-operator in the micro-operator library, wherein the target candidate micro-operator is one candidate micro-operator whose performance satisfies a second preset condition.
19. The device of claim 18, wherein the attribute information comprises at least a throughput or occupied bandwidth.
20. A chip comprising:
- a memory configured to store instructions; and
- at least one processor coupled to the memory and configured to execute the instructions to cause the chip to: obtain a first shape of a first tensor; determine at least n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and wherein micro-operators in the micro-operator library are pre-compiled and independent of each other; and select, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.