Operator Processing Method and Computer Device

An operator processing method includes combining, in real time, one or more micro-operators in a pre-constructed micro-operator library to form a real-time shape of any to-be-output first tensor. Then, the micro-operators included in one combination (for example, a combination with optimal performance, because different combinations have different performance) are selected for execution. The micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed at runtime. In addition, the shapes of the micro-operators are fixed and mutually different, and serve as a “basis” of the “shape space”.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of International Patent Application No. PCT/CN2022/143087 filed on Dec. 29, 2022, which claims priority to Chinese Patent Application No. 202210049301.8 filed on Jan. 17, 2022. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

The present disclosure relates to the field of machine learning, and in particular, to an operator processing method and a computer device.

BACKGROUND

Just-in-time (JIT) compilation is a widely used compilation technology in which a compiler is invoked immediately at runtime to perform standard compilation (including front-end lexical/syntax analysis, intermediate representation analysis, running cost estimation based on a cost model, back-end code generation, and the like) on a specified task, and to generate and execute binary code.

FIG. 1 describes a process of applying the just-in-time compilation technology in a dynamic shape scenario. The term “shape” literally refers to the form of a tensor. In applications such as artificial intelligence (AI) and high-performance computing (HPC), computationally intensive operations (referred to as operators) such as matrix multiplication, convolution, and vector operations are required. These operators are mathematical function computations on a multi-dimensional array (also referred to as a tensor). Each dimension of the multi-dimensional array has a specific value, for example, a height H and a width W of a two-dimensional array, and a height H, a width W, and a depth D of a three-dimensional array. The values of these different dimensions are collectively referred to as a shape, namely, the shape of the multi-dimensional array (namely, the tensor). In FIG. 1, due to the dynamic shape scenario, the shape of a specific to-be-output tensor cannot be known in advance (for example, before neural network training). In other words, because the shape of the tensor changes in real time, binary code cannot be pre-compiled for an operator (such as a convolution operator) of a corresponding layer. Herein, “cannot be pre-compiled” specifically refers to the scenario in which a static shape compiler is used. However, the specific value of the shape of the tensor is determined at runtime. To be specific, when a specific operator is calculated at runtime, the system knows the value of the shape of the to-be-output tensor at the current moment. For example, in FIG. 1, at runtime, the system knows that the shape of the to-be-output tensor is a 378*125 matrix. The determined 378*125 shape information is therefore sufficient for the static shape compiler. To be specific, the system pauses calculation (in fact, calculation cannot be performed at this moment, because there is no binary code of an operator that supports the 378*125 matrix shape), actively invokes the static shape compiler to compile an operator that supports the 378*125 tensor, and generates binary code for use.

Invoking the static shape compiler in just-in-time compilation to generate the binary code of an operator whose shape is consistent with the shape of the to-be-output tensor takes time on the order of seconds (generally within 10 seconds), whereas the calculation overhead of one operator (for example, a convolution) is on the order of microseconds to milliseconds. In other words, the compilation overhead of just-in-time compilation is excessively high relative to the entire calculation process. In addition, in the dynamic shape scenario, the static shape compiler may be invoked hundreds of thousands of times, and the overall calculation performance of the system is severely degraded by just-in-time compilation.

SUMMARY

Embodiments of the present disclosure provide an operator processing method and a computer device. A real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, the micro-operators included in one combination (for example, a combination with optimal performance) are finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, the micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of the just-in-time compilation technology at runtime. In addition, the shapes of the micro-operators are fixed and mutually different, and serve as a “basis” of the “shape space”. Therefore, any shape may be a combination of one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, the quantity of micro-operators in the micro-operator library is limited, so the occupied memory space is small.

In view of this, embodiments of the present disclosure provide the following technical solutions.

According to a first aspect, an embodiment of the present disclosure first provides an operator processing method, which may be applied to the field of artificial intelligence, and may be specifically applied to a dynamic shape scenario. For example, a common computer vision (CV) model (such as a detection or segmentation model), a speech model (such as an automatic speech recognition (ASR) model), or a natural language processing (NLP) model involves a dynamic shape problem in training and inference. The method includes: obtaining a shape of a to-be-output tensor (that is, from an output perspective), where the to-be-output tensor may be referred to as a first tensor. After a first shape of the to-be-output first tensor is obtained, at least one combination that meets the first shape is further determined. Each combination includes at least one target micro-operator, and each target micro-operator is from a same micro-operator library. It should be noted herein that the micro-operators in the micro-operator library are pre-compiled micro-operators, and the micro-operators are independent of each other. In some implementations, any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. In other words, the micro-operator library has completeness of the “solution space”, and each micro-operator in the micro-operator library may be considered as a “basis” of the “shape space”. There may be one or more combinations of target micro-operators that are selected from the micro-operator library to form the first shape. If there are n combinations, a micro-operator (which may be referred to as a first micro-operator) included in a first combination is selected from the n combinations for execution. The first micro-operator is one or more of the target micro-operators.

In the foregoing implementation of the present disclosure, a real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, the micro-operators included in one combination (for example, a combination with optimal performance) are finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, the micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of the just-in-time compilation technology at runtime. In addition, the shapes of the micro-operators are fixed and mutually different, and serve as a “basis” of the “shape space”. Therefore, any shape may be a combination of one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, the quantity of micro-operators in the micro-operator library is limited, so the occupied memory space is small.

In a possible implementation of the first aspect, the final first combination may be selected from the n combinations from the perspective of the consumed running cost. Specifically, first, the total running cost of the micro-operators included in each of the n combinations is calculated, to obtain n running costs. Then, one running cost that meets a preset condition (which may be referred to as a first preset condition) is selected from the n running costs as the final running cost (namely, a first running cost). For example, the combination with the smallest running cost value can be selected as the first combination, and finally, the micro-operator (namely, the first micro-operator) included in the first combination corresponding to the first running cost is executed.

In the foregoing implementation of the present disclosure, that the final first combination is selected from the n combinations based on the running cost is specifically described. A combination with a small running cost value is obtained by estimating a running cost of a micro-operator included in each combination. This improves system performance.

In a possible implementation of the first aspect, the running cost may be indicated by the total duration required to execute the micro-operators included in a combination. Specifically, the duration required to execute each of m micro-operators included in a target combination is calculated, to obtain m durations, where one micro-operator in the target combination corresponds to one duration. Then, a total duration is obtained based on the m durations (that is, the total duration is obtained by adding the m durations), where the target combination is any one of the n combinations, and m≥1. Finally, a total duration is calculated by using each of the n combinations as the target combination, to obtain n total durations.

In the foregoing implementation of the present disclosure, a running cost of a micro-operator included in each combination is indicated by total duration required to execute all micro-operators in the combination. This is feasible.

In a possible implementation of the first aspect, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes.

The foregoing implementation of the present disclosure describes a prerequisite to be met by the micro-operator in the micro-operator library, to obtain any shape of the first tensor by combining as few micro-operators as possible. This reduces memory overheads.

In a possible implementation of the first aspect, the micro-operator library is selected from a plurality of candidate micro-operator libraries that are constructed in advance, operator types of micro-operators from different micro-operator libraries are different, and operator types of micro-operators from the same micro-operator library are the same. For example, it is assumed that there are three micro-operator libraries (namely, candidate micro-operator libraries) deployed on a computer device: micro-operator libraries A, B, and C. The micro-operator library A is used to perform a convolution operation, the micro-operator library B is used to perform a matrix multiplication operation, and the micro-operator library C is used to perform a vector addition operation. In this case, each micro-operator in the micro-operator library A is a convolution operator, each micro-operator in the micro-operator library B is a matrix multiplication operator, and each micro-operator in the micro-operator library C is a vector addition operator. It is also assumed that a currently ongoing task is a training process of a convolutional neural network, and the currently invoked micro-operator library is the micro-operator library A among the foregoing three micro-operator libraries that are constructed in advance on the computer device (if the computer device has only one micro-operator library, the currently invoked micro-operator library can only be that one micro-operator library).

In the foregoing implementation of the present disclosure, that the micro-operator library may be selected from the candidate micro-operator libraries that are constructed is specifically described. This is widely applicable and flexible.

In a possible implementation of the first aspect, the shapes of the micro-operators in the micro-operator library are fixed and mutually different, but also regular. In this embodiment of the present disclosure, assuming that all the micro-operators in the micro-operator library are second-order tensors, all the micro-operators in the micro-operator library may be square, may be rectangular, or may include both squares and rectangles. This is not limited in the present disclosure, provided that any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library.

In the foregoing implementation of the present disclosure, that the micro-operators in the micro-operator library may be in various shapes is described. This is flexible.

In a possible implementation of the first aspect, because each micro-operator in the micro-operator library has a fixed shape (namely, a static shape), each micro-operator may be continuously optimized to obtain optimal performance, and the micro-operator with optimal performance is added to the micro-operator library. That is, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, the shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.

In the foregoing implementation of the present disclosure, that each micro-operator in the micro-operator library is also obtained through continuous optimization and selection is specifically described. This is implementable.

In a possible implementation of the first aspect, a selection manner may include: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is one candidate micro-operator whose performance meets a preset condition (which may be referred to as a second preset condition). For example, a micro-operator with optimal performance is selected from the candidate micro-operators and added to the micro-operator library.

In the foregoing implementation of the present disclosure, that the micro-operator whose performance meets a requirement, for example, the micro-operator with the optimal performance, is selected as each micro-operator in the micro-operator library is specifically described. This indirectly improves overall system performance.

In a possible implementation of the first aspect, the attribute information of a candidate micro-operator includes but is not limited to a throughput, an occupied bandwidth, and the like. For example, the micro-operator with optimal performance may be selected from the candidate micro-operators based on two factors, the throughput and the occupied bandwidth, and added to the micro-operator library. In an example, candidate micro-operators that do not meet a throughput requirement may first be screened out based on the throughput factor, and then the micro-operator finally added to the micro-operator library is selected from the remaining candidate micro-operators based on the bandwidth occupation factor. In another example, different weights may be assigned to the two factors, the throughput and the occupied bandwidth, weighted summation may be performed for each candidate micro-operator, and finally the candidate micro-operator with the optimal weighted summation result is selected and added to the micro-operator library. A specific implementation of how to select a micro-operator from the candidate micro-operators and add it to the micro-operator library is not limited in the present disclosure.

The foregoing implementation of the present disclosure specifically describes the specific attribute information of the micro-operator. This is universally applicable.

In a possible implementation of the first aspect, the shape of the first tensor dynamically changes in the dynamic shape scenario. To be specific, the first tensor is a dynamic shape tensor to be output, and a specific value of the shape of the first tensor is determined only at the runtime. The shape with the specific value is referred to as the first shape of the first tensor.

In the foregoing implementation of the present disclosure, that the first tensor may be the to-be-output tensor in the dynamic shape scenario is specifically described. This is widely applicable.

According to a second aspect, an embodiment of the present disclosure provides a computer device, and the computer device has a function of performing the method according to any one of the first aspect or the possible implementations of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or the software includes one or more modules corresponding to the foregoing function.

According to a third aspect, an embodiment of the present disclosure provides a computer device. The computer device may include a memory, a processor, and a bus system. The memory is configured to store a program. The processor is configured to invoke the program stored in the memory, to perform the method in any one of the first aspect or the possible implementations of the first aspect of an embodiment of the present disclosure.

According to a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium. The computer-readable storage medium stores instructions. When the instructions are run on a computer, the computer is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.

According to a fifth aspect, an embodiment of the present disclosure provides a computer program. When the computer program is run on a computer, the computer is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.

According to a sixth aspect, an embodiment of the present disclosure provides a chip. The chip (for example, a central processing unit (CPU)) includes at least one processor and at least one interface circuit. The interface circuit is coupled to the processor. The at least one interface circuit is configured to: perform receiving and sending functions, and send instructions to the at least one processor. The at least one processor is configured to run a computer program or the instructions, and has a function of performing the method in any one of the first aspect or the possible implementations of the first aspect. The function may be implemented by hardware, software, or a combination of hardware and software. The hardware or the software includes one or more modules corresponding to the function. In addition, the interface circuit is configured to communicate with another module outside the chip. For example, the interface circuit may send, to a graphics processing unit (GPU) for execution, a first micro-operator included in a first combination obtained by the processor on the chip.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic flowchart of applying a just-in-time compilation technology in a dynamic shape scenario;

FIGS. 2A and 2B are schematic diagrams of a dynamic shape decoupled architecture according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a product form according to an embodiment of the present disclosure;

FIG. 4 is a schematic flowchart of an operator processing method according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a micro-operator library according to an embodiment of the present disclosure;

FIGS. 6A and 6B are schematic diagrams of a specific shape of a micro-operator included in a micro-operator library according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of compiling an operator according to an embodiment of the present disclosure;

FIG. 8 is another schematic diagram of compiling an operator according to an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of combining and calculating a running cost in real time according to an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of an instance of an operator processing method according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram of an application architecture according to an embodiment of the present disclosure;

FIG. 12 is a schematic diagram of a structure of a computer device according to an embodiment of the present disclosure; and

FIG. 13 is a schematic diagram of a structure of a computer device according to an embodiment of the present disclosure.

DESCRIPTION OF EMBODIMENTS

Embodiments of the present disclosure provide an operator processing method and a computer device. A real-time shape of any to-be-output first tensor can be obtained by combining in real time one or more micro-operators in a micro-operator library that is constructed in advance. Because different combinations have different performance, the micro-operators included in one combination (for example, a combination with optimal performance) are finally selected for execution. This indirectly improves system performance. In embodiments of the present disclosure, the micro-operators in the micro-operator library are pre-compiled. Therefore, a compiler is not needed. This reduces compilation overheads, avoiding the excessively high compilation overheads of the just-in-time compilation technology at runtime. In addition, the shapes of the micro-operators are fixed and mutually different, and serve as a “basis” of the “shape space”. Therefore, any shape may be a combination of one or more micro-operators in the micro-operator library, providing completeness of the solution space and high programmability in a dynamic shape scenario. In addition, the quantity of micro-operators in the micro-operator library is limited, so the occupied memory space is small.

In the specification, claims, and accompanying drawings of the present disclosure, the terms “first” and “second” are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence. It should be understood that the terms used in such a way are interchangeable in proper circumstances, and are merely a manner of distinguishing between objects having a same attribute when embodiments of the present disclosure are described. In addition, the terms “include” and “contain” and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.

Embodiments of the present disclosure relate to knowledge about operators, dynamic shapes, just-in-time compilation, and the like. To better understand the solutions in embodiments of the present disclosure, the following first describes related terms and concepts that may be used in embodiments of the present disclosure. It should be understood that the explanations of related concepts may be limited by the specific cases of embodiments of the present disclosure, but this does not mean that the present disclosure is limited to those specific cases. The specific cases in different embodiments may also vary. This is not specifically limited herein.

(1) Operator

An operator is a general term for a mathematical operation, such as a convolution operator or a matrix multiplication operator. Specifically, an operator may be considered as a mapping from one function space to another function space, that is, O: X→X. In a broad sense, an operator can be extended to any space, such as an inner product space.

(2) Tensor

In the field of artificial intelligence, a tensor is a multi-dimensional array. Specifically, the tensor may also be defined, given a basis, as a group of numbers that meet a specific transformation rule. If these numbers are written in a vertical row, they form a first-order tensor. If these numbers are written in a two-dimensional array of numbers, they form a second-order tensor. If these numbers are written in a three-dimensional array of numbers, they form a third-order tensor. A tensor of an order higher than three may be referred to as a higher-order tensor.
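For illustration only, the tensor orders above can be sketched with NumPy arrays (the use of NumPy is an assumption made for this example; the disclosure itself does not depend on any particular library):

import numpy as np

first_order = np.zeros(5)          # a vector: one dimension, order 1
second_order = np.zeros((5, 5))    # a matrix: two dimensions, order 2
third_order = np.zeros((3, 3, 3))  # a cube of numbers: three dimensions, order 3
print(first_order.ndim, second_order.ndim, third_order.ndim)  # 1 2 3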

(3) Shape of a Tensor

The term “shape” literally refers to the form of a tensor. In applications such as AI and HPC, computationally intensive operations (referred to as operators) such as matrix multiplication, convolution, and vector operations are required. These operators are mathematical function computations on a multi-dimensional array (namely, the tensor). Each dimension of the multi-dimensional array has a specific value, for example, a height H and a width W of a two-dimensional array, and a height H, a width W, and a depth D of a three-dimensional array. The values of these different dimensions are collectively referred to as a shape, namely, the shape of the multi-dimensional array (namely, the tensor).

The shape of a tensor is the shape assigned to the tensor when the tensor is defined. For example, the shape of a 5*5 matrix is a length of 5 and a width of 5. For another example, if the tensor of a convolution kernel (filter) used for convolution calculation is defined as 3*3, the shape of that tensor is a length of 3 and a width of 3. In an example, it is assumed that the pixel shape of a convolved picture is 6*6 and its depth is 3 (indicating R/G/B, the three primary colors), and that the convolution kernel is a 3*3*3 cube. In this case, the convolution kernel is a tensor of shape (3, 3, 3), and may be denoted as Tensor1. The convolved picture may be indicated as a tensor of shape (6, 6, 3), and may be denoted as Tensor2. The operation of performing convolution on the two tensors Tensor1 and Tensor2 is referred to as one operator, denoted as convolution.

(4) Dynamic Shape of a Tensor

Each tensor indicates a physical meaning, for example, a pixel or a convolution kernel of a picture. If a shape of a to-be-output tensor of each layer does not change in a model training or inference calculation process, an operator used in this calculation process may be referred to as a fixed shape operator, and the to-be-output tensor is a fixed shape tensor. On the contrary, if a shape of a to-be-output tensor at each layer can change in a model training or inference calculation process, an operator used in this calculation process may be referred to as a dynamic shape operator, and the to-be-output tensor is a dynamic shape tensor.

(5) Static Compiler

A static compiler is a compiler whose compilation process is completed within the offline life cycle. Binary code is generated during compilation, and the binary code is executed by a processor at runtime.

(6) JIT Compiler

A JIT compiler is a compiler whose compilation process is completed within the runtime life cycle. The binary code is generated temporarily at runtime.

(7) Static Shape Compiler

The static shape compiler can also be referred to as a fixed shape compiler, and is a static compiler that only supports compiling a fixed shape operator.

(8) Dynamic Shape Compiler

The dynamic shape compiler is a dynamic compiler that supports compiling an operator of any shape.

(9) Neural Network

The neural network may include neural units, and may be specifically understood as a neural network including an input layer, a hidden layer, and an output layer. Usually, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. A neural network with a plurality of hidden layers is called a deep neural network (DNN). Work at each layer of the neural network may be described according to the mathematical expression y = a(W·x + b), where x is the input vector and y is the output vector. Work at each layer of the neural network may be understood as completing transformation from input space to output space (that is, from the row space to the column space of a matrix) by performing five operations on the input space (a set of input vectors). The five operations include: 1. dimension increase/dimension reduction; 2. zooming in/out; 3. rotation; 4. translation; and 5. “bending”. The operations 1, 2, and 3 are completed by W·x, the operation 4 is completed by +b, and the operation 5 is implemented by a( ). The word “space” is used herein because a classified object is not a single thing, but a type of thing, and space is the set of all individuals of such a type of thing. W is the weight matrix of each layer of the neural network, and each value in the matrix indicates the weight value of one neuron at the layer. The matrix W determines the space transformation from the input space to the output space described above. In other words, W at each layer of the neural network controls how to transform space. The purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of the neural network is essentially learning to control the space transformation, and more specifically, learning the weight matrices.
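As a minimal sketch of the expression above (NumPy and the tanh activation are illustrative assumptions; the disclosure does not prescribe either):

import numpy as np

def layer(x, W, b, a=np.tanh):
    # One layer of a neural network: y = a(W·x + b).
    # W·x performs dimension change/scaling/rotation, +b performs translation,
    # and the activation a( ) performs the "bending".
    return a(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # maps 3-dimensional input space to 4-dimensional output space
b = rng.standard_normal(4)
x = np.array([1.0, 2.0, 3.0])
y = layer(x, W, b)                # y has shape (4,)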

(10) Loss Function

During training of the neural network, because it is expected that an output of the neural network is as close as possible to a value that is actually expected to be predicted, a current prediction value of the network may be compared with a target value that is actually expected, and then a weight matrix at each layer of the neural network is updated based on a difference between the current prediction value and the target value (there is usually an initialization process before the first update, that is, a parameter is preconfigured for each layer of the neural network). For example, if the prediction value of the network is large, the weight matrix is adjusted to lower the prediction value until the neural network can predict the target value that is actually expected. Therefore, “how to obtain the difference between the predicted value and the target value through comparison” needs to be predefined. This is a loss function or an objective function. The loss function and the objective function are important equations that measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the neural network is a process of minimizing the loss as much as possible.

In the training process of a neural network, the values of the parameters of the neural network model may be corrected by using an error back propagation (BP) algorithm, so that the reconstruction error loss of the neural network model becomes increasingly smaller. Specifically, an input signal is transferred forward until an error loss occurs at the output, and the parameters in the initial neural network model are updated based on back propagation of the error loss information, to make the error loss converge. The back propagation algorithm is an error-loss-centered back propagation process intended to obtain the optimal parameters, such as the weight matrices, of the neural network model.

The following describes embodiments of the present disclosure with reference to the accompanying drawings. A person of ordinary skill in the art may learn that, with development of technologies and emergence of a new scenario, the technical solutions provided in embodiments of the present disclosure are also applicable to a similar technical problem.

First, a system architecture provided in an embodiment of the present disclosure is described. For details, FIGS. 2A and 2B are schematic diagrams of a dynamic shape decoupled architecture according to an embodiment of the present disclosure. The sub-schematic diagram in FIG. 2A shows the two-layer decoupled architecture from a top-down perspective. The upper layer is a dynamic shape calculation interface layer, the lower layer is a hardware virtualization layer, and the hardware virtualization layer supports the application programming interfaces (APIs) of one group of micro-operators. From an output perspective, the shape of a to-be-output tensor at the dynamic shape calculation interface layer dynamically changes at runtime, and may be denoted as a shape (X, Y), where X and Y change in real time. However, the values of X and Y can be determined at runtime. In this case, the shape (X, Y) with determined values may be obtained by combining one or more micro-operators in a micro-operator library (namely, the group of micro-operators), and each micro-operator in the micro-operator library may be considered as a “basis” of the “shape space”. Then, the running costs of different combinations of the group of “bases” may be calculated in real time by using a cost model. The hardware virtualization layer includes the abstract bottom-layer hardware, and stores the micro-operator library, for example, the micro-operators 1*1, 2*2, . . . shown in FIG. 2A. The sub-schematic diagram in FIG. 2B shows the two-layer decoupled architecture from a bottom-up perspective. The hardware virtualization layer maps the implementation of the actual hardware layer to the APIs of the group of micro-operators. The actual hardware layer may include physical hardware such as an AI core and a GPU, and is configured to execute the binary code of the group of micro-operators finally selected by the cost model.

It should be noted that an actual implementation product form of the present disclosure may be a software form, or may be a hardware form. This is not specifically limited in the present disclosure. For ease of description, the following uses an example in which the product form is the software form for description. Specifically, refer to FIG. 3. In some implementations of the present disclosure, a software product form in this embodiment of the present disclosure may include two parts: a micro-operator library (which may also be referred to as an operator code library, a binary code library, and the like) and a real-time cost model. The two parts may be separately deployed on a CPU (for example, an AI CPU) of a computer device, namely, a host end. The following separately describes the two parts.

(1) Micro-Operator Library

In this embodiment of the present disclosure, the micro-operator library may be a single-kernel micro-operator library, or may be a multi-kernel micro-operator library. This is not specifically limited in the present disclosure. The single-kernel micro-operator library means that each micro-operator in the micro-operator library is implemented as single-kernel code.

The micro-operator library may support a plurality of different types of operators, for example, operators for convolution, matrix multiplication, and vector addition. However, it should be noted that the operators in a same micro-operator library are of a same operator type. These operators are packaged, in the form of binary code of one group of micro-operators, into the micro-operator library of the type to which they belong. To be specific, for a convolution operator, binary code of one group of convolution micro-operators with fixed and mutually different shapes is implemented, for example, convolution micro-operators that support 1*1, 2*2, 4*4, . . . , or 512*512 calculation. However, it should be noted that these micro-operators of different shapes need to have the following features: a. The binary code can be implemented; b. Each micro-operator has a fixed and different shape; c. The micro-operators are independent of each other and have their own attribute information, such as bandwidth and throughput.
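As a hypothetical Python sketch of how such a library could be modeled (the record fields and numeric values are illustrative assumptions, not part of the disclosure):

from dataclasses import dataclass

@dataclass(frozen=True)
class MicroOperator:
    # One pre-compiled micro-operator: fixed shape (feature b) and its own
    # attribute information (feature c).
    height: int
    width: int
    throughput_flops: float   # attribute: throughput, in FLOPS
    bandwidth: float          # attribute: occupied bandwidth
    binary: bytes = b""       # feature a: the pre-compiled binary code (placeholder)

# A convolution micro-operator library: fixed and mutually different shapes.
conv_library = [
    MicroOperator(1, 1, 1.0e9, 0.1),
    MicroOperator(2, 2, 4.0e9, 0.3),
    MicroOperator(4, 4, 1.5e10, 0.8),
    MicroOperator(512, 512, 2.0e12, 40.0),
]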

(2) Real-Time Cost Model

In the foregoing implementation of the present disclosure, any shape of a to-be-output tensor can be obtained “in real time” by one or more combinations of micro-operators in the micro-operator library. Different combinations have different calculation performance. Therefore, one “real-time” cost model is required to obtain, through calculation based on the attributes of each operator, a combination with optimal performance, so that the micro-operators included in the combination with optimal performance are finally executed.

The real-time cost model may be embedded in a standard operator launch and execution process in a form of one independent module. As shown in FIG. 3, the standard operator launch and execution process includes: a graph execution engine→infer shape→memory allocation→real-time cost model→operator launch→writing a device parameter to a GM→starting operator execution. The real-time cost model is located between initiating an operator calculation operation by the graph execution engine and formally launching an operator to each device. There are two pieces of input information of the real-time cost model: the micro-operator library and shape information of the to-be-output tensor. A main function of the real-time cost model is to use one or more fixed shape micro-operators in the micro-operator library in real time, to complete a combination of any shape (namely, a dynamic shape) of the to-be-output tensor, and select a micro-operator included in a combination (for example, a combination with optimal performance) for execution.

It should be noted herein that the execution process is performed by a GPU (for example, an AI core) of the computer device by starting an operator, and the AI core is also referred to as a device end. After the AI CPU selects, based on the real-time cost model, the combination whose micro-operators are to be finally executed, the micro-operators are launched to the AI core, and the AI core performs the specific micro-operator execution operation.

It should be noted that, in some implementations of the present disclosure, the real-time cost model may be further divided into the following three modules.

    • (1) Combiner: The combiner uses the one or more fixed shape micro-operators in the micro-operator library in real time, to complete the combination of any shape (namely, the dynamic shape) of the to-be-output tensor, where there may be one or more combinations.
    • (2) Cost model: The cost model uses a combiner result as an input, and determines a specific combination whose calculation performance meets a preset condition (for example, optimal calculation performance) in a plurality of different combinations through calculation based on an attribute (for example, bandwidth or a throughput) of each micro-operator.
    • (3) Device parameter calculation: The combination output by the cost model is the micro-operators that are to be scheduled to each device (for example, a GPU) and actually executed. Therefore, a corresponding parameter needs to be calculated for each device, so that the binary code of the corresponding micro-operator can be calculated on that device. A simplified sketch of the combiner idea is shown below.
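The following is a deliberately simplified, hypothetical sketch of the combiner module (all names are illustrative assumptions): it enumerates decompositions of a single axis length into the fixed “basis” sizes, whereas the actual combiner works on full tensor shapes:

def decompose(length, basis, partial=()):
    # Enumerate ways to write `length` as an ordered sum of the fixed basis sizes.
    # A one-dimensional simplification of the combiner.
    if length == 0:
        yield partial
        return
    for size in basis:
        if size <= length:
            yield from decompose(length - size, basis, partial + (size,))

# All decompositions of one axis of length 6 over the basis {4, 2, 1}:
for combo in decompose(6, (4, 2, 1)):
    print(combo)   # (4, 2), (4, 1, 1), (2, 4), (2, 2, 2), ...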

Based on the foregoing system architecture, the following describes in detail an operator processing method provided in an embodiment of the present disclosure. Specifically, FIG. 4 is a schematic flowchart of an operator processing method according to an embodiment of the present disclosure. The method may specifically include the following steps.

401: Obtain a first shape of a first tensor.

First, a computer device may obtain a shape of a to-be-output tensor (that is, from an output perspective) by using a CPU (namely, a host end), where the to-be-output tensor may be referred to as the first tensor. A shape of the first tensor dynamically changes in a dynamic shape scenario. To be specific, the first tensor is a dynamic shape tensor to be output, and a specific value of the shape of the first tensor is determined only at runtime. The shape with the specific value is referred to as the first shape of the first tensor.

It should be noted that, in this embodiment of the present disclosure, the computer device may include a handheld terminal device, for example, a mobile phone, a computer, or an iPad, or may include a smart wearable device, for example, a smart band, a smart watch, or a smart heart rate meter, or may include a wheeled mobile device, for example, a vehicle (for example, an autonomous driving vehicle), an aircraft, or a robot (for example, a robotic vacuum cleaner). A specific product form of the computer device is not limited in the present disclosure. Any electronic device that can be used to perform the operator processing method in the present disclosure may be referred to as the computer device.

402: Determine at least one combination that meets a first shape, where each combination includes at least one target micro-operator, each target micro-operator is from a micro-operator library, and micro-operators in the micro-operator library are pre-compiled and independent of each other.

After obtaining the first shape of the to-be-output first tensor, the computer device further determines the at least one combination that meets the first shape. Each combination includes at least one target micro-operator, and the target micro-operators are from a same micro-operator library. It should be noted herein that the micro-operators in the micro-operator library are pre-compiled micro-operators, and the micro-operators are independent of each other. It should be noted that, in some implementations of the present disclosure, any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. In other words, the micro-operator library has completeness of the “solution space”, and each micro-operator in the micro-operator library may be considered as a “basis” of the “shape space”. In this embodiment of the present disclosure, the micro-operators in the micro-operator library are pre-compiled and only micro-operators with optimal performance are selected. Therefore, a compiler is not needed at runtime, and only appropriate target micro-operators need to be selected from the micro-operator library for combination. This reduces compilation overheads.

It should be noted that, in some implementations of the present disclosure, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes. For ease of understanding, the following provides an example for illustration. Specifically, FIG. 5 is a schematic diagram of a micro-operator library according to an embodiment of the present disclosure. The following uses an example in which all micro-operators in the micro-operator library are second-order tensors (namely, matrices). The micro-operator library includes n micro-operators in total: E1, E2, . . . , and En. The shape of the micro-operator E1 is H1*W1, the shape of the micro-operator E2 is H2*W2, . . . , and the shape of the micro-operator En is Hn*Wn. Generally, H1≠H2≠ . . . ≠Hn, and/or W1≠W2≠ . . . ≠Wn. However, in special cases, some of the heights may be equal, or some of the widths may be equal. This is not limited in the present disclosure, provided that each micro-operator has a fixed and different shape.

It should be noted that, in some implementations of the present disclosure, the shapes of the micro-operators in the micro-operator library are fixed and mutually different, but also regular. In this embodiment of the present disclosure, assuming that all the micro-operators in the micro-operator library are second-order tensors, all the micro-operators in the micro-operator library may be square, may be rectangular, or may include both squares and rectangles. This is not limited in the present disclosure, provided that any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. FIGS. 6A and 6B are schematic diagrams of specific shapes of micro-operators included in a micro-operator library according to an embodiment of the present disclosure. The sub-schematic diagram in FIG. 6A indicates that each micro-operator in the micro-operator library is square, to be specific, H1=W1, H2=W2, . . . , and Hn=Wn. The sub-schematic diagram in FIG. 6B indicates that each micro-operator in the micro-operator library is rectangular, to be specific, H1≠W1, H2≠W2, . . . , and Hn≠Wn. In an example, the value of each H may be a half, a quarter, or the like of the corresponding W, to be specific, H1=1/2 W1, H2=1/2 W2, . . . , Hn=1/2 Wn, or H1=1/4 W1, H2=1/4 W2, . . . , Hn=1/4 Wn, and the like. In another example, the value of each W may be a half, a quarter, or the like of the corresponding H, to be specific, 1/2 H1=W1, 1/2 H2=W2, . . . , 1/2 Hn=Wn, or 1/4 H1=W1, 1/4 H2=W2, . . . , 1/4 Hn=Wn, and the like. The specific rectangular shape of each micro-operator in the micro-operator library is not limited in the present disclosure. Similarly, in this embodiment of the present disclosure, if all the micro-operators in the micro-operator library are third-order tensors, all the micro-operators in the micro-operator library may be polyhedrons, for example, may be cubes, may be cuboids, or may include both cubes and cuboids. This is not limited in the present disclosure, provided that any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. The case of a higher-order tensor is similar, and details are not described herein.

In addition, it should be further noted that, in this embodiment of the present disclosure, the micro-operator library is obtained based on at least one candidate micro-operator library that is constructed in advance, and operator types of micro-operators from a same micro-operator library are the same. For example, it is assumed that there are three micro-operator libraries (namely, candidate micro-operator libraries) deployed on a computer device: micro-operator libraries A, B, and C. The micro-operator library A is used to perform a convolution operation, the micro-operator library B is used to perform a matrix multiplication operation, and the micro-operator library C is used to perform a vector addition operation. In this case, each micro-operator in the micro-operator library A is a convolution operator, each micro-operator in the micro-operator library B is a matrix multiplication operator, and each micro-operator in the micro-operator library C is a vector addition operator. It is also assumed that a currently ongoing task is a training process of a convolutional neural network, and the currently invoked micro-operator library is the micro-operator library A among the foregoing three micro-operator libraries that are constructed in advance on the computer device (if the computer device has only one micro-operator library, the currently invoked micro-operator library can only be that one micro-operator library). It should be noted herein that, in this embodiment of the present disclosure, the micro-operator library may be a single-kernel micro-operator library, or may be a multi-kernel micro-operator library. This is not specifically limited in the present disclosure. The single-kernel micro-operator library means that each micro-operator in the micro-operator library is implemented as single-kernel code.

It should be further noted that, in some implementations of the present disclosure, because each micro-operator in the micro-operator library has a fixed shape (namely, a static shape), each micro-operator may be continuously optimized to obtain optimal performance, and the micro-operator with optimal performance is added to the micro-operator library. That is, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, the shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.

In an example, a selection manner may include: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is one candidate micro-operator whose performance meets a preset condition (which may be referred to as a second preset condition). For example, a micro-operator with optimal performance is selected from the candidate micro-operators and added to the micro-operator library.

It should be noted that, in some implementations of the present disclosure, the attribute information of a candidate micro-operator includes but is not limited to a throughput, an occupied bandwidth, and the like. For example, the micro-operator with optimal performance may be selected from the candidate micro-operators based on two factors, the throughput and the occupied bandwidth, and added to the micro-operator library. In an example, candidate micro-operators that do not meet a throughput requirement may first be screened out based on the throughput factor, and then the micro-operator finally added to the micro-operator library is selected from the remaining candidate micro-operators based on the bandwidth occupation factor. In another example, different weights may be assigned to the two factors, the throughput and the occupied bandwidth, weighted summation may be performed for each candidate micro-operator, and finally the candidate micro-operator with the optimal weighted summation result is selected and added to the micro-operator library. A specific implementation of how to select a micro-operator from the candidate micro-operators and add it to the micro-operator library is not limited in the present disclosure.
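The two example selection strategies above can be sketched as follows (the candidate names, attribute values, throughput threshold, and weights are all hypothetical; in practice the two factors would also need consistent units or normalization):

# Each candidate: (name, throughput_flops, occupied_bandwidth).
candidates = [
    ("cand_1", 1.8e12, 35.0),
    ("cand_2", 2.0e12, 40.0),
    ("cand_3", 1.2e12, 20.0),
]

# Example 1: screen out candidates below the throughput requirement,
# then pick the remaining candidate with the lowest occupied bandwidth.
MIN_THROUGHPUT = 1.5e12
viable = [c for c in candidates if c[1] >= MIN_THROUGHPUT]
pick_1 = min(viable, key=lambda c: c[2])           # -> cand_1

# Example 2: weighted summation of the two normalized factors
# (higher throughput is better, higher occupied bandwidth is worse).
max_tp = max(c[1] for c in candidates)
max_bw = max(c[2] for c in candidates)
W_TP, W_BW = 0.7, 0.3
pick_2 = max(candidates,
             key=lambda c: W_TP * (c[1] / max_tp) - W_BW * (c[2] / max_bw))  # -> cand_2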

The following describes how to select a micro-operator with optimal performance from at least two candidate micro-operators whose shapes are the same and that are pre-compiled.

For ease of understanding, the following first uses matrix multiplication as an example to describe “tiling”, a core concept of operator compilation. Specifically, FIG. 7 is a schematic diagram of compiling an operator according to an embodiment of the present disclosure. It is assumed that the to-be-output tensor is a second-order tensor whose shape is M*N. In FIG. 7, the compiler applies two different compilation policies to the same matrix multiplication serial code. One is tiling along the M-axis, shown in the first row in FIG. 7, which may be referred to as a tiling policy 1. The other is tiling along both the M-axis and the N-axis, shown in the second row in FIG. 7, which may be referred to as a tiling policy 2. From the output perspective, the tiling arrangement obtained through the first compilation manner is shown on the right side of the first row in FIG. 7, and the tiling arrangement obtained through the second compilation manner is shown on the right side of the second row in FIG. 7. The performance of operators compiled by using different tiling policies also differs. FIG. 8 still uses the same matrix multiplication serial semantic expression as an example. It is assumed that there are N compilation policies: a policy 1, a policy 2, . . . , and a policy N. The policy 1 corresponds to the tiling policy 1, the policy 2 corresponds to the tiling policy 2, . . . , and the policy N corresponds to the tiling policy N. The obtained tiling arrangements are separately shown in FIG. 8. That is, different policies correspond to different tiling policies, and consequently different versions of compiled operator binary code are generated. The N compiled operators (that is, binary code 1 to binary code N) obtained in the N compilation manners are run on hardware. If the binary code corresponding to a policy delivers the highest performance, the compiled operator corresponding to that policy is considered optimal code, and may be added to the micro-operator library as an optimal fixed shape micro-operator.
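The two tiling policies can be sketched in Python as loop structures over the M*N output (the tile sizes TM and TN are illustrative assumptions; a real compiler would also tile the K-axis and generate vectorized code). Both functions compute the same result; only the traversal order of the output, and hence the locality and performance on real hardware, differs:

def matmul_tile_m(A, B, C, M, N, K, TM=2):
    # Tiling policy 1: tile the output along the M-axis only.
    for m0 in range(0, M, TM):
        for m in range(m0, min(m0 + TM, M)):
            for n in range(N):
                C[m][n] = sum(A[m][k] * B[k][n] for k in range(K))

def matmul_tile_mn(A, B, C, M, N, K, TM=2, TN=2):
    # Tiling policy 2: tile the output along both the M-axis and the N-axis.
    for m0 in range(0, M, TM):
        for n0 in range(0, N, TN):
            for m in range(m0, min(m0 + TM, M)):
                for n in range(n0, min(n0 + TN, N)):
                    C[m][n] = sum(A[m][k] * B[k][n] for k in range(K))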

403: If there are n combinations, select, from the n combinations, a first micro-operator included in a first combination for execution, where the first micro-operator is one or more of the target micro-operators, and n≥2.

There may be one or more combinations of target micro-operators that are selected from the micro-operator library, to form the first shape. The following separately describes the combinations.

(1) One Combination

If there is one combination, there is no comparison condition, and a micro-operator included in the combination is directly executed. Specifically, an AI CPU of the computer device transfers the micro-operator included in the combination to an AI core, to start a micro-operator execution operation.

(2) n Combinations, where n≥2

If there are n combinations, a micro-operator (which may be referred to as a first micro-operator) included in a first combination is selected from the n combinations for execution. The first micro-operator is one or more of the target micro-operators.

It should be noted that, in some implementations of the present disclosure, the final first combination may be selected from the n combinations from the perspective of the consumed running cost. Specifically, first, the total running cost of the micro-operators included in each of the n combinations is calculated, to obtain n running costs. Then, one running cost that meets a preset condition (which may be referred to as a first preset condition) is selected from the n running costs as the final running cost (namely, a first running cost). For example, the combination with the smallest running cost value can be selected as the first combination, and finally, the micro-operator (namely, the first micro-operator) included in the first combination corresponding to the first running cost is executed. For the specific process, refer to FIG. 9. FIG. 9 is a schematic diagram of combining and calculating a running cost in real time according to an embodiment of the present disclosure. It is assumed that the to-be-output first tensor is a second-order tensor, the shape of the first tensor is a quadrilateral P whose width is W and height is H, and H and W dynamically change. At runtime, each specific shape P can be obtained by combining the micro-operators in the micro-operator library, and a plurality of different combinations may form the shape P. Finally, the one or more micro-operators included in one of the combinations are selected and launched to M AI cores for execution. In this process, any shape P can be combined in real time, and a specific combination can be obtained through calculation in real time, to obtain optimal performance of P on the AI cores.

For ease of understanding the process, the following is an example for illustration. It is assumed that, for a specific first shape, there are three combinations in total: a combination 1, a combination 2, and a combination 3. It is assumed that the target micro-operators included in the combination 1 are two micro-operators A and one micro-operator B, the target micro-operators included in the combination 2 are four micro-operators C and one micro-operator D, and the target micro-operators included in the combination 3 are one micro-operator A and three micro-operators E. The total running cost of the micro-operators included in each combination is calculated, to separately obtain a running cost 1 corresponding to the combination 1, a running cost 2 corresponding to the combination 2, and a running cost 3 corresponding to the combination 3. Assuming that the value of the running cost 2 is the lowest among the three running costs, the four micro-operators C and the one micro-operator D included in the combination 2 corresponding to the running cost 2 may be executed.
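This selection step can be sketched as follows (the per-micro-operator costs are hypothetical numbers, chosen so that, as in the example above, the combination 2 has the lowest total running cost):

# Hypothetical running cost of each micro-operator (for example, estimated duration).
cost_per_op = {"A": 5.0, "B": 8.0, "C": 2.0, "D": 4.0, "E": 3.0}

combinations = {
    "combination 1": ["A", "A", "B"],            # total 18.0
    "combination 2": ["C", "C", "C", "C", "D"],  # total 12.0
    "combination 3": ["A", "E", "E", "E"],       # total 14.0
}

# Total running cost of each combination = sum of the costs of its micro-operators.
totals = {name: sum(cost_per_op[op] for op in ops)
          for name, ops in combinations.items()}
best = min(totals, key=totals.get)   # first preset condition: smallest running cost
print(best)                          # combination 2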

It should be noted that, in some implementations of the present disclosure, the running cost may be indicated by the total duration required to execute the micro-operators included in the combination. Specifically, the duration required to execute each of m micro-operators included in a target combination is calculated, to obtain m durations, where each micro-operator in the target combination corresponds to one duration. Then, a total duration is obtained based on the m durations (that is, the total duration is obtained by adding the m durations), where the target combination is any one of the n combinations, and m≥1. Finally, a total duration is calculated by using each of the n combinations as the target combination, to obtain n total durations.

For ease of understanding the process, the foregoing example is still used for further illustration. It is assumed that for a specific first shape, there are three combinations in total: a combination 1, a combination 2, and a combination 3, where the combination 1 includes two micro-operators A and one micro-operator B, the combination 2 includes four micro-operators C and one micro-operator D, and the combination 3 includes one micro-operator A and three micro-operators E. The combination 1 includes three target micro-operators. Therefore, the duration required to execute each micro-operator in the combination 1 may be calculated, three durations are obtained in total, and the three durations are added to obtain a total duration A. Similarly, the combination 2 includes five target micro-operators, five durations may be obtained through calculation, and the five durations are added to obtain a total duration B. The combination 3 includes four target micro-operators, four durations may be obtained through calculation, and the four durations are added to obtain a total duration C. Finally, the final combination may be selected by comparing the total durations of the combinations.

It should be noted that, in this embodiment of the present disclosure, in a specific calculation process, a running cost of each combination may be calculated based on the following formula (1):

$$\mathrm{Cost} = \sum_{i \in [1,\,\mathrm{maxBase}]} \left( \frac{\mathrm{DataVolume}(e_i)}{\mathrm{Throughput}(e_i)} \times \left\lceil \frac{\mathrm{Amount}(e_i)}{\mathrm{HardwareAICoreNumber}} \right\rceil \right) \tag{1}$$

Here, e_i indicates a specific micro-operator in the micro-operator library, for example, a micro-operator of shape 512*512. DataVolume(e_i) indicates the total calculation amount of the micro-operator e_i, namely, the floating-point calculation amount (the sum of floating-point multiplications and additions). Throughput(e_i) indicates the throughput of the micro-operator e_i, in floating-point operations per second (FLOPS), namely, the amount of floating-point multiplication or addition completed per second. The quotient DataVolume(e_i)/Throughput(e_i) therefore indicates the duration, in seconds, required by a single AI core to calculate one e_i. Amount(e_i) indicates the quantity of e_i in the current combination, and HardwareAICoreNumber indicates the quantity of AI cores in the computer device; for example, the Ascend 910 has 32 AI cores. ⌈·⌉ indicates rounding up, and ⌈Amount(e_i)/HardwareAICoreNumber⌉ indicates the quantity of rounds required to calculate all e_i. For example, if there are 33 e_i in the current combination, that is, Amount(e_i) = 33, then ⌈33/32⌉ is rounded up to 2, that is, "two rounds" are required to complete the calculation of all e_i. Accordingly, DataVolume(e_i)/Throughput(e_i) × ⌈Amount(e_i)/HardwareAICoreNumber⌉ indicates the time required to complete the calculation of all e_i. Finally, any to-be-calculated first shape may be decomposed into a plurality of micro-operators, for example, x e_1, y e_2, and z e_3. The duration corresponding to each e_i (namely, e_1, e_2, and e_3) may be obtained through calculation based on the foregoing formula (1), and the sum of the times required by the three e_i is the final total duration of the micro-operators included in the combination.
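To make formula (1) concrete, the following is a minimal Python sketch of the cost calculation and of selecting the cheapest combination. The class and function names (MicroOp, combination_cost, select_first_combination) and the hard-coded core count are illustrative assumptions; the patent fixes only the formula itself.

```python
from math import ceil

# Minimal sketch of the running-cost model in formula (1). All names here
# are hypothetical; only the formula is taken from the text.

HARDWARE_AI_CORE_NUMBER = 32  # e.g., Ascend 910 has 32 AI cores


class MicroOp:
    def __init__(self, name: str, data_volume: float, throughput: float):
        self.name = name
        self.data_volume = data_volume  # DataVolume(e_i): total floating-point operations
        self.throughput = throughput    # Throughput(e_i): floating-point operations per second


def combination_cost(combination: dict) -> float:
    """Total duration (seconds) of one combination, per formula (1).

    `combination` maps each distinct micro-operator e_i to Amount(e_i),
    its quantity in the combination.
    """
    cost = 0.0
    for op, amount in combination.items():
        single_core_time = op.data_volume / op.throughput  # duration of one e_i on one AI core
        rounds = ceil(amount / HARDWARE_AI_CORE_NUMBER)    # rounds needed to run all e_i
        cost += single_core_time * rounds
    return cost


def select_first_combination(combinations: list) -> dict:
    """Select the combination with the smallest running cost (the first preset condition)."""
    return min(combinations, key=combination_cost)
```

For instance, under 32 AI cores, a combination containing 33 copies of a micro-operator whose single-core duration is 1 ms costs 2 ms, matching the "two rounds" example above.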

For ease of further understanding of the foregoing process of step 401 to step 403, the following uses a specific instance to describe the operator processing method provided in this embodiment of the present disclosure. For an implementation process of the instance, refer to FIG. 10. FIG. 10 is a schematic diagram of an instance of an operator processing method according to an embodiment of the present disclosure. In this instance, it is assumed that a to-be-output first tensor is a second-order tensor of dynamic shape M*N, it is assumed that determined values of M and N are respectively 512 and 768 at runtime, and it is assumed that all micro-operators included in a micro-operator library that is constructed in advance are in square shapes (as shown in FIG. 10). These micro-operators are binary code of one group of pre-compiled micro-operators with optimal performance. After it is determined that a current first shape of the first tensor is 512*768, a target micro-operator may be selected from the micro-operator library for combination, to form the first shape. This operation may be completed by a combiner. It is assumed that there are two combinations, as shown in FIG. 10. Running costs of micro-operators included in different combinations may be further calculated, and a micro-operator included in a combination with a lowest running cost value is selected as a to-be-executed micro-operator. This operation may be completed by a real-time cost model, and the micro-operator included in the selected target combination is directly run by an AI core.
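To make the FIG. 10 instance concrete, the following toy sketch shows one simple way a combiner could form a single candidate combination: greedily tiling the 512*768 output with the largest square micro-operators that fit. The greedy_tile function is a hypothetical illustration; the patent does not fix the combiner's enumeration algorithm, which in general yields several candidate combinations rather than one.

```python
# Toy combiner sketch: cover an M*N output with square micro-operators by
# greedy tiling. This illustrates only how one candidate combination of the
# FIG. 10 kind could be formed; the actual combiner enumerates multiple
# combinations.

def greedy_tile(m: int, n: int, square_sizes: list) -> dict:
    """Return {tile_size: count} covering an m*n area with square tiles.

    Assumes every sub-rectangle encountered is coverable by the given
    sizes, as in the FIG. 10 instance (512*768 with 512 and 256 squares).
    """
    sizes = sorted(square_sizes, reverse=True)
    combination: dict = {}

    def cover(h: int, w: int) -> None:
        if h == 0 or w == 0:
            return
        s = next(size for size in sizes if size <= h and size <= w)
        rows, cols = h // s, w // s
        # Place the largest block of s*s tiles, then recurse on the two
        # remaining rectangular strips (right strip, then bottom strip).
        combination[s] = combination.get(s, 0) + rows * cols
        cover(h, w - cols * s)
        cover(h - rows * s, cols * s)

    cover(m, n)
    return combination


print(greedy_tile(512, 768, [512, 256]))  # {512: 1, 256: 2}
```

Running it on the FIG. 10 instance yields one 512*512 micro-operator plus two 256*256 micro-operators, which is exactly one of the two combinations shown in FIG. 10.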

It should be noted herein that, in the embodiment corresponding to FIG. 10, one combination includes two 256*256 micro-operators and one 512*512 micro-operator. FIG. 10 shows one arrangement manner, that is, the two 256*256 micro-operators are arranged on the right, and the 512*512 micro-operator is arranged on the left. In actual application, the two 256*256 micro-operators may instead be arranged on the left, and the 512*512 micro-operator on the right. A specific arrangement manner of the selected target micro-operators is not limited in the present disclosure, and each arrangement is considered the same combination (whether two arrangements belong to a same combination is related only to whether the quantities and shapes of the selected target micro-operators are the same). Estimating the running cost of the micro-operators included in a combination is unrelated to the locations of the target micro-operators in the combination, because the operator operation of a micro-operator is not affected by the specific data of the real-time tensor, and the data is written in real time.

It should be further noted that, in this embodiment of the present disclosure, because the micro-operators included in the micro-operator library are used as the "bases" of the "shape space", a larger quantity of "bases" generally indicates more combinations, which means that a micro-operator combination with better performance can be finally selected. However, one disadvantage is that the combination process and the performance estimation process are correspondingly slowed down when the quantity of "bases" is larger. On the contrary, a smaller quantity of "bases" indicates fewer combinations (on the premise that any shape can still be obtained through combination), and the corresponding disadvantage is that operator performance is impaired. Therefore, in an actual application process, a compromise needs to be considered. Generally, the quantity of micro-operators in a micro-operator library is less than 10. This also indicates that the micro-operator library constructed in advance in this embodiment of the present disclosure needs to store the binary code of only a limited quantity of micro-operators, and any dynamic shape of the first tensor can be obtained through combination in real time at runtime. This avoids high memory overheads and overcomes the problem of poor dynamic shape programmability.

Further, the operator processing method in this embodiment of the present disclosure may be applied to an architecture shown in FIG. 11. FIG. 11 is a schematic diagram of an application architecture according to an embodiment of the present disclosure. The architecture includes an operator launch engine, a micro-operator library, a combiner, a real-time cost model, a micro-operator scheduler, and a runtime executor. Functions of these modules are as follows: (1) The operator launch engine is the initiator of operator calculation, and knows in real time the specific operator to be calculated and information about the to-be-output tensor (for example, the first shape of the first tensor). (2) The micro-operator library is the binary code of one group of pre-compiled optimal micro-operators. (3) The combiner combines, in real time, micro-operators in the micro-operator library to obtain any shape of the to-be-output first tensor. (4) The real-time cost model obtains, through calculation in real time based on the performance attributes of the micro-operators, the specific combination with optimal performance. (5) The micro-operator scheduler evenly schedules the group of micro-operators in the combination to be executed on the bottom-layer multi-cores. (6) The runtime executor initiates execution of each micro-operator in the final target combination. A structural sketch of this pipeline is given below.
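Purely as a structural sketch, the data flow among the six modules can be expressed as follows. Every callable is a hypothetical placeholder standing in for the corresponding module; only the order of the stages is taken from the architecture description.

```python
from typing import Callable, List, Sequence, Tuple

# Structural sketch of the FIG. 11 pipeline. Each callable is a hypothetical
# placeholder for the corresponding module.

def process_operator(
    first_shape: Tuple[int, int],                                     # from the operator launch engine (1)
    enumerate_combinations: Callable[[Tuple[int, int]], List[list]],  # combiner (3) over the library (2)
    combination_cost: Callable[[list], float],                        # real-time cost model (4)
    launch_on_core: Callable[[int, Sequence], None],                  # runtime executor (6)
    num_cores: int,
) -> None:
    combinations = enumerate_combinations(first_shape)    # (3) combine micro-operators in real time
    best = min(combinations, key=combination_cost)        # (4) pick the combination with optimal performance
    for core_id in range(num_cores):                      # (5) scheduler: spread the group evenly
        launch_on_core(core_id, best[core_id::num_cores])  # (6) execute each micro-operator
```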

It should be noted that the operator processing method provided in this embodiment of the present disclosure may be applied to various AI chips. For example, for an AI-dedicated chip such as Ascend, a to-be-implemented micro-operator (for example, matrix multiplication) may be selected, and shapes 1*1, 2*2, 4*4, and the like are separately implemented from an output perspective, up to single-core matrix operators of a fixed shape N*N (the N*N operator is the single-core (Ascend AI Core) micro-operator with the highest performance on the target platform, namely, the micro-operator with the largest single-core throughput), to form a micro-operator library. For these fixed-shape micro-operators, the code implementation with the highest single-core performance can be automatically searched for in advance by using a tool such as AutoTVM; the code is then compiled into binary code, and the binary code is stored in the micro-operator library. Then, in the graph execution engine of a DNN, the combiner, the real-time cost model, and a related device parameter calculation module are separately implemented. The combiner considers various different micro-operator combinations to form the shape of the to-be-output first tensor. The real-time cost model may be implemented based on formula (1), and is used to obtain the combination with the smallest running cost through calculation in real time. For the micro-operators included in the combination with the smallest running cost, related parameter information is calculated by using the device parameter calculation module, and the micro-operators are launched by the micro-operator scheduler to each AI core for execution.
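The offline library-construction step described above could be sketched roughly as follows, under the assumption that a candidate search (for example, AutoTVM-style), a benchmarking step, and a compilation step are available as callables; all names here are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hedged sketch of building the micro-operator library offline: for each
# fixed shape, keep the candidate implementation with the highest measured
# single-core throughput and store its compiled binary. The three callables
# are hypothetical stand-ins; the patent describes this step only at a
# high level.

def build_micro_op_library(
    shapes: Iterable[Tuple[int, int]],              # e.g., (1, 1), (2, 2), (4, 4), ..., (N, N)
    candidates_for: Callable[[Tuple[int, int]], list],
    measure_throughput: Callable[[object], float],  # single-core FLOPS of a candidate
    compile_to_binary: Callable[[object], bytes],
) -> Dict[Tuple[int, int], bytes]:
    library = {}
    for shape in shapes:
        best = max(candidates_for(shape), key=measure_throughput)
        library[shape] = compile_to_binary(best)    # store only the pre-compiled binary
    return library
```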

In this embodiment of the present disclosure, the host end (namely, the AI CPU) does not run a compiler (that is, does not perform a specific compilation process), but runs the combiner and the real-time cost model. This reduces compilation overheads. The device end (namely, the AI Core) executes the binary code of fixed-shape micro-operators, and each micro-operator has the highest single-core performance. The performance, throughput, and bandwidth usage of each micro-operator are known in advance. This improves the overall calculation performance of the system.

Finally, an application scenario of embodiments of the present disclosure is described. The operator processing method provided in this embodiment of the present disclosure may be applied to a dynamic shape scenario. For example, CV models (such as detection and segmentation models), speech models (such as ASR models), and NLP models all involve dynamic shape problems in training and inference. Typical examples are as follows: in picture detection, the image resolution changes and the batch data size (batch size) changes; input lengths of ASR corpus slices are different; and the size of each tensor in a common sparse model in an advertisement recommendation service is also variable. The operator processing method provided in this embodiment of the present disclosure may be applied to these application scenarios.

Based on the corresponding embodiments, the following further provides a related computer device used to implement the solutions, to better implement the solutions in embodiments of the present disclosure. The computer device may include a handheld terminal device, for example, a mobile phone, a computer, or an iPad, or may include a smart wearable device, for example, a smart band, a smart watch, or a smart heart rate meter, or may include a wheeled mobile device, for example, a vehicle (for example, an autonomous driving vehicle), an aircraft, or a robot (for example, a robotic vacuum cleaner). A specific product form of the computer device is not limited in the present disclosure. Any electronic device that can be used to perform the operator processing method in the present disclosure may be referred to as the computer device. Specifically, FIG. 12 is a schematic diagram of a structure of a computer device according to an embodiment of the present disclosure. The computer device 1200 includes an obtaining module 1201, a combination module 1202, and a selection module 1203. The obtaining module 1201 is configured to obtain a first shape of a first tensor. The combination module 1202 is configured to determine at least one combination that meets the first shape, where each combination includes at least one target micro-operator, each target micro-operator is from a micro-operator library, and micro-operators in the micro-operator library are pre-compiled and independent of each other. Any shape can be obtained by combining one or more micro-operators in the micro-operator library. The selection module 1203 is configured to: if there are n combinations, select, from the n combinations, a first micro-operator included in a first combination for execution, where the first micro-operator is one or more of the target micro-operators, and n≥2.

In a possible design, the selection module 1203 is specifically configured to: calculate a total running cost of micro-operators included in each of the n combinations, to obtain n running costs, select one running cost that meets a first preset condition from the n running costs as a first running cost, and finally execute the first micro-operator included in the first combination corresponding to the first running cost.

In a possible design, if the running cost is indicated by a total duration, the selection module 1203 is further specifically configured to: calculate the duration required to execute each of m micro-operators included in a target combination, to obtain m durations; obtain a total duration based on the m durations, where the target combination is any one of the n combinations, and m≥1; and calculate a total duration by using each of the n combinations as the target combination, to obtain n total durations.

In a possible design, each micro-operator in the micro-operator library has a fixed shape, and different micro-operators have different shape sizes.

In a possible design, the micro-operator library is selected from a plurality of candidate micro-operator libraries that are constructed in advance, operator types of micro-operators from different micro-operator libraries are different, and operator types of micro-operators from the same micro-operator library are the same.

In a possible design, the shape of a micro-operator in the micro-operator library includes at least one of the following: a square and a rectangle.

In a possible design, each micro-operator in the micro-operator library is selected from at least two candidate micro-operators, shapes of the at least two candidate micro-operators are the same, and the at least two candidate micro-operators are pre-compiled.

In a possible design, a selection manner includes but is not limited to: calculating performance of each candidate micro-operator based on attribute information of each candidate micro-operator, and using a target candidate micro-operator as one micro-operator in the micro-operator library. The target candidate micro-operator is a candidate micro-operator whose performance meets a second preset condition.

In a possible design, the attribute information includes at least one of the following: a throughput and occupied bandwidth.

In a possible design, the first tensor is a dynamic shape tensor to be output.

It should be noted that content such as information exchange and an execution process between the modules/units in the computer device 1200 in FIG. 12 is based on a same concept as the method embodiment corresponding to FIG. 4 in the present disclosure. For specific content, refer to the descriptions in the foregoing method embodiments of the present disclosure.

An embodiment of the present disclosure further provides a computer device. FIG. 13 is a schematic diagram of a structure of a computer device according to an embodiment of the present disclosure. For ease of description, only the part related to this embodiment of the present disclosure is shown. For specific technical details that are not disclosed, refer to the method part of the embodiments of the present disclosure. The modules described in the embodiment corresponding to FIG. 12 may be deployed on the computer device 1300, to implement the functions of the computer device 1200 in the embodiment corresponding to FIG. 12. Specifically, the computer device 1300 is implemented by one or more servers. The computer device 1300 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPUs) 1322, a memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) that store an application 1342 or data 1344. The memory 1332 and the storage medium 1330 may be used for temporary storage or permanent storage. A program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the computer device 1300. Further, the central processing unit 1322 may be configured to communicate with the storage medium 1330, and perform, on the computer device 1300, the series of instruction operations in the storage medium 1330.

The computer device 1300 may further include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, for example, Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.

In this embodiment of the present disclosure, the central processing unit 1322 is configured to perform the operator processing method in the embodiment corresponding to FIG. 4. For example, the central processing unit 1322 may be configured to obtain the shape of a to-be-output tensor (that is, from an output perspective), where the to-be-output tensor may be referred to as a first tensor. A specific value of the shape of the first tensor is determined only at runtime, and the shape with the specific value is referred to as the first shape of the first tensor. After the first shape of the to-be-output first tensor is obtained, at least one combination that meets the first shape is further determined. Each combination includes at least one target micro-operator, and each target micro-operator is from a same micro-operator library. It should be noted herein that the micro-operators in the micro-operator library are pre-compiled micro-operators that are independent of each other. In addition, any shape of the tensor can be obtained by combining one or more micro-operators in the micro-operator library. In other words, the micro-operator library has the completeness of a "solution space", and each micro-operator in the micro-operator library may be considered as a "basis" of the "shape space". In this embodiment of the present disclosure, the micro-operators in the micro-operator library are pre-compiled, and only micro-operators with optimal performance are selected. Therefore, a compiler is not needed at runtime, and only appropriate target micro-operators need to be selected from the micro-operator library for combination. This reduces compilation overheads. There may be one or more combinations of target micro-operators that are selected from the micro-operator library to form the first shape. If there are n combinations, a micro-operator (which may be referred to as a first micro-operator) included in a first combination is selected from the n combinations for execution. The first micro-operator is one or more of the target micro-operators.

It should be noted that the central processing unit 1322 may be further configured to perform any step in the method embodiment corresponding to FIG. 4 in the present disclosure. For specific content, refer to the descriptions in the foregoing method embodiments of the present disclosure.

An embodiment of the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a program for signal processing. When the program is run on a computer, the computer is enabled to perform the steps performed by the computer device in the descriptions of the foregoing embodiments.

In addition, it should be noted that the described apparatus embodiments are only examples. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected based on actual requirements to achieve the objectives of the solutions in embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided by the present disclosure, connection relationships between modules indicate that the modules have communication connections with each other, which may be specifically implemented as one or more communication buses or signal cables.

Based on the description of the foregoing implementations, a person skilled in the art may clearly understand that the present disclosure may be implemented by software in addition to necessary universal hardware, or by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Usually, any function implemented by a computer program may be easily implemented by using corresponding hardware. In addition, specific hardware structures used to implement a same function may be various, for example, an analog circuit, a digital circuit, or a dedicated circuit. In some embodiments, technical solutions of the present disclosure may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, for example, a floppy disk, a Universal Serial Bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in embodiments of the present disclosure.

All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the process or functions according to embodiments of the present disclosure are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device, such as a training device or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid-state disk (SSD)), or the like.

Claims

1. A method comprising:

obtaining a first shape of a first tensor;
determining n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination in the n combinations comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and wherein micro-operators in the micro-operator library are pre-compiled and independent of each other; and
selecting, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.

2. The method of claim 1, wherein selecting, from the n combinations, the first micro-operator in the first combination for execution comprises:

calculating a total running cost of micro-operators comprised in each of the n combinations to obtain n running costs;
selecting, from the n running costs, a first running cost satisfying a preset condition; and
executing the first micro-operator in the first combination corresponding to the first running cost.

3. The method of claim 2, wherein calculating the total running cost of the micro-operators comprised in each of the n combinations to obtain the n running costs comprises:

calculating a duration required to execute each of m micro-operators comprised in a target combination to obtain m durations, wherein the target combination is any one of the n combinations, and wherein m≥1;
obtaining a total duration of the target combination based on the m durations; and
calculating the total duration for each of the n combinations as the target combination to obtain n total durations, wherein the n total durations are the n running costs.

4. The method of claim 1, wherein each micro-operator in the micro-operator library has a fixed shape, and wherein different micro-operators have different shape sizes.

5. The method of claim 1, further comprising selecting the micro-operator library from a plurality of pre-constructed candidate micro-operator libraries, wherein micro-operators from different micro-operator libraries have different operator types, and wherein the micro-operators from a same micro-operator library have a same operator type.

6. The method of claim 1, wherein a shape of a micro-operator in the micro-operator library is square or rectangular.

7. The method of claim 1, further comprising selecting each micro-operator in the micro-operator library from at least two pre-compiled candidate micro-operators having a same shape.

8. The method of claim 7, wherein selecting each micro-operator in the micro-operator library comprises:

calculating a performance of each pre-compiled candidate micro-operator based on attribute information of each candidate micro-operator; and
using a target candidate micro-operator as one micro-operator in the micro-operator library, wherein the target candidate micro-operator is a candidate micro-operator whose performance satisfies a second preset condition.

9. The method of claim 8, wherein the attribute information comprises at least a throughput or occupied bandwidth.

10. The method of claim 1, further comprising outputting the first tensor, wherein the first tensor is a dynamic shape tensor.

11. A device comprising:

an obtainer configured to obtain a first shape of a first tensor;
a combiner configured to determine n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and micro-operators in the micro-operator library are pre-compiled and independent of each other; and
a selector configured to select, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.

12. The device of claim 11, wherein the selector is further configured to:

calculate a total running cost of micro-operators comprised in each of the n combinations to obtain n running costs;
select, from the n running costs, a first running cost satisfying a preset condition; and
execute the first micro-operator in the first combination corresponding to the first running cost.

13. The device of claim 12, wherein the selector is further configured to:

calculate a duration required to execute each of m micro-operators comprised in a target combination to obtain m durations, wherein the target combination is any one of the n combinations, and wherein m≥1;
obtain a total duration of the target combination based on the m durations; and
calculate the total duration for each of the n combinations as the target combination to obtain n total durations, wherein the n total durations are the n running costs.

14. The device of claim 11, wherein each micro-operator in the micro-operator library has a fixed shape, and wherein different micro-operators have different shape sizes.

15. The device of claim 11, wherein the selector is further configured to select the micro-operator library from a plurality of pre-constructed candidate micro-operator libraries, wherein the micro-operators from different micro-operator libraries have different operator types, and wherein the micro-operators from a same micro-operator library have a same operator type.

16. The device of claim 11, wherein a shape of a micro-operator in the micro-operator library is square or rectangular.

17. The device of claim 11, wherein the selector is further configured to select each micro-operator in the micro-operator library from at least two pre-compiled candidate micro-operators having a same shape.

18. The device of claim 17, wherein the selector is further configured to calculate performance of each pre-compiled candidate micro-operator based on attribute information of each candidate micro-operator and using a target candidate micro-operator as one micro-operator in the micro-operator library, wherein the target candidate micro-operator is one candidate micro-operator whose performance satisfies a second preset condition.

19. The device of claim 18, wherein the attribute information comprises at least a throughput or occupied bandwidth.

20. A chip comprising:

a memory configured to store instructions; and
at least one processor coupled to the memory and configured to execute the instructions to cause the chip to: obtain a first shape of a first tensor; determine at least n combinations of micro-operators satisfying the first shape, wherein n≥2, wherein each combination comprises at least one target micro-operator, wherein each of the at least one target micro-operator is from a micro-operator library, and micro-operators in the micro-operator library are pre-compiled and independent of each other; and select, from the n combinations, a first micro-operator in a first combination for execution, wherein the first micro-operator is one or more of the at least one target micro-operator.
Patent History
Publication number: 20240370521
Type: Application
Filed: Jul 16, 2024
Publication Date: Nov 7, 2024
Inventors: Qing Zhou (Beijing), Feng Yu (Shenzhen), Jian He (Shenzhen)
Application Number: 18/773,914
Classifications
International Classification: G06F 17/15 (20060101); G06F 8/41 (20060101);