NEURAL NETWORK SPARSIFICATION DEVICE AND METHOD, AND RELATED PRODUCT
This disclosure relates to a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model, wherein a processing device of the present disclosure is included in an integrated circuit device, and the integrated circuit device comprises a universal interconnection interface and a computation device. The computation device interacts with the processing device to jointly complete computing operations specified by the user. The integrated circuit device further comprises a storage device, and the storage device is connected to the computation device and the processing device, respectively, for data storage of the computation device and the processing device.
This application claims priority to the Chinese patent application No. 2020112169035, filed on Nov. 4, 2020 and entitled “NEURAL NETWORK SPARSIFICATION DEVICE, METHOD AND CORRESPONDING PRODUCT,” and the Chinese patent application No. 2020115661411, filed on Dec. 25, 2020 and entitled “NEURAL NETWORK SPARSIFICATION DEVICE, METHOD AND CORRESPONDING PRODUCT.”
TECHNICAL FIELD
The present disclosure relates generally to the field of neural networks. More particularly, the present disclosure relates to a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
BACKGROUND
In recent years, with the rapid development of deep learning, the performance of algorithms in a range of fields such as computer vision and natural language processing has progressed by leaps and bounds. However, deep learning algorithms are computation-intensive and storage-intensive tools. As information processing tasks become increasingly complex and the demands on the real-time performance and accuracy of algorithms keep rising, neural networks are often designed deeper and deeper, so their requirements for the amount of computation and storage space keep increasing. As a result, it is difficult to directly apply existing artificial intelligence techniques based on deep learning to mobile phones, satellites or embedded devices with limited hardware resources.
Therefore, compression, acceleration and optimization of deep neural network models become particularly important. A large number of studies attempt to reduce the computation and storage requirements of a neural network without affecting model accuracy, which is of great significance for engineering applications of deep learning techniques on embedded and mobile devices; sparsification is one such model lightweighting method.
Network parameter sparsification reduces redundant components in a large network through an appropriate method, thereby lowering the network's requirements for the amount of computation and storage space. Although existing fine-grained parameter sparsification methods achieve excellent model performance, they are unfriendly to hardware access; in other words, on-chip and off-chip input/output overhead is high and performance is low. Structured sparsity methods based on channels and convolution kernels improve hardware performance, but incur a greater loss of model accuracy. Finally, most existing sparsification algorithms operate in an off-line fine-tuning mode, in other words, a pre-trained model is fine-tuned after sparsification; this mode is greatly limited, and considerable performance benefits cannot be obtained from model training.
Therefore, a solution that performs inference using a parameter tensor sparsified through online training is urgently needed.
SUMMARY
In order to at least partially solve the technical problems mentioned in the background, solutions of the present disclosure provide a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
In one aspect of the present disclosure, the present disclosure provides a method of performing sparse training on a neural network model, which comprises a mask adjustment stage and a mask fixing stage. In the mask adjustment stage, the following steps are repeated in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on the updated mask adjustment parameters. In the mask fixing stage, the updated mask adjustment parameters in the mask adjustment stage may be taken as initial values of mask fixing parameters, and the following steps are repeated in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on an updated mask tensor to compute the value of the loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; and updating the mask fixing parameters based on the partial derivatives. The updated mask fixing parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
In another aspect of the present disclosure, the present disclosure provides a method of performing sparse training on a neural network model, comprising: in a mask adjustment stage, repeating the following steps in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on updated mask adjustment parameters. The updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
In another aspect of the present disclosure, the present disclosure provides a computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model. The computer program code, when executed by a processing device, performs the aforementioned method.
In another aspect of the present disclosure, the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device. The processing device comprises a control module, a computation module, and an updating module. When the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on updated mask adjustment parameters. When the control module sets that a mask fixing stage is entered, the updating module takes the updated mask adjustment parameters as initial values of mask fixing parameters, and the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters. The updating module updates the mask fixing parameters based on the partial derivatives. The computation device is configured to shield the updated mask fixing parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
In another aspect of the present disclosure, the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device. The processing device comprises a control module, a computation module, and an updating module; when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on the updated mask adjustment parameters. The computation device is configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
In another aspect of the present disclosure, the present disclosure provides a board card, comprising the aforementioned integrated circuit device.
According to the present disclosure, in the model training stage, the parameters are trained and the mask tensor is updated at the same time, which achieves the technical effects of reducing input/output overhead and improving accuracy.
The above and other objectives, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are illustrated exemplarily rather than restrictively, and identical or corresponding reference numerals refer to identical or corresponding parts, in which:
Technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only some of the embodiments of the present disclosure, but not all of them. All other embodiments, which can be derived by those skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that terms “first”, “second”, “third”, “fourth”, and the like, in the claims, description, and drawings of the present disclosure are used for distinguishing different objects, rather than for describing a specific order. Terms “including” and “comprising”, when used in the description and claims of this disclosure, indicate the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
It should also be understood that the terms used in the description of the present disclosure are for the purpose of describing specific embodiments only, and are not intended to limit the present disclosure. As used in the description and claims of this disclosure, “a”, “an” and “this” in the singular are intended to include the plural, unless the context clearly indicates otherwise. It should be further understood that the term “and/or” used in the description and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes these combinations.
As used in the description and claims, a term “if” may be interpreted contextually as “when”, or “once”, or “in response to determining”, or “in response to detecting.”
Specific implementations of the present disclosure will be described in detail below in conjunction with the accompanying drawings.
A neural network is composed of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, and may have anywhere from a few layers to hundreds of layers. Each layer executes one operator, for example, a convolutional layer executes a convolution operator, so there are as many operators to execute as there are layers. In this disclosure, when a specific layer is mentioned, the operator to which that layer corresponds is meant.
The chip 101 is connected with an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. Data to be processed may be transferred to the chip 101 by the external device 103 via the external interface device 102. A computation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have interfaces in different forms, for example, a PCIe interface, according to different application scenes.
The board card 10 further comprises a storage device 104 for data storage, which comprises one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transmission. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in an application scene, the control device 106 may include a micro controller unit (MCU).
The computation device 201 is configured to perform a user-specified operation, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform computation of deep learning or machine learning, and it may interact with the processing device 203 through the interface device 202, to jointly complete the user-specified operation.
The interface device 202 is configured to transmit data and control instructions between the computation device 201 and the processing device 203. For example, the computation device 201 may, via the interface device 202, acquire input data from the processing device 203, and write the input data into a storage device on the computation device 201. Further, the computation device 201 may, via the interface device 202, acquire a control instruction from the processing device 203, and write the control instruction into a control cache on the computation device 201. Alternatively or optionally, the interface device 202 may also read data in the storage device of the computation device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic controls that include, but are not limited to, data movement, and starting and/or stopping the computation device 201. Depending on the implementation, the processing device 203 may be one or more of a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor. These processors include, but are not limited to, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic, a discrete hardware component, and the like, and the number of processors may be determined according to actual needs. As mentioned above, the computation device 201 of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computation device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The DRAM 204 is configured to store data to be processed, and is a DDR memory, typically 16 GB or larger in size, which stores data of the computation device 201 and/or the processing device 203.
The control module 31 is configured to coordinate and control work of the computation module 32 and the storage module 33 to complete a deep learning task, and comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is configured to acquire an instruction from the processing device 203, and the instruction decode unit 312 is configured to decode the acquired instruction and send a decoded result as control information to the computation module 32 and the storage module 33.
The computation module 32 comprises a vector computation unit 321 and a matrix computation unit 322. The vector computation unit 321 is configured to perform vector computation, and may support complex computation such as vector multiplication, addition, and nonlinear transformation; and the matrix computation unit 322 is responsible for the core computation of a deep learning algorithm, in other words, matrix multiplication and convolution.
The storage module 33 is configured to store or move related data, and comprises a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access (DMA) 333. The NRAM 331 is configured to store input neurons, output neurons and intermediate computation results; the WRAM 332 is configured to store the convolution kernels, in other words, the weights, of a deep learning network; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data movement between the single-core computation device 301 and the DRAM 204.
From the perspective of the system-on-chip level, the multi-core computation device 41 comprises, as shown in
There may be a plurality of external storage controllers 401, two of which are exemplarily shown in the figure. The external storage controller 401 is configured to access, in response to an access request issued by a processor core, an external storage device, for example, the DRAM 204.
From the perspective of the cluster level, each cluster 405 comprises, as shown in
There are 4 processor cores 406 exemplarily shown in the figure, and the number of the processor cores 406 is not limited in the present disclosure. An internal structure of the processor core is shown in
Returning to
The MEM core 407 comprises the SRAM 408, the broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in a same cluster 405 does not need to be obtained from the DRAM 204 individually by each processor core 406, but is transferred among the processor cores 406 through the SRAM 408, and the MEM core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that inter-core communication efficiency is improved and on-chip and off-chip input/output access is greatly reduced.
The broadcast bus 409, CDMA 410, and GDMA 411 are used for performing communication among the processor cores 406, communication among the clusters 405, and data transmission between the cluster 405 and the DRAM 204, respectively, which will be described separately below.
The broadcast bus 409 is used for completing high-speed communication among the processor cores 406 in the cluster 405, and the inter-core communication modes supported by the broadcast bus 409 of this embodiment comprise unicast, multicast and broadcast. Unicast refers to point-to-point (for example, from a single processor core to a single processor core) data transmission; multicast is a communication mode in which a piece of data is transmitted from the SRAM 408 to several specific processor cores 406; and broadcast is a communication mode in which a piece of data is transmitted from the SRAM 408 to all the processor cores 406, which is a special case of multicast.
The CDMA 410 is used for controlling access to SRAMs 408 between different clusters 405 within a same computation device 201.
The GDMA 411 cooperates with the external storage controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be learned from the foregoing, the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be accomplished via two channels. The first channel directly connects the DRAM 204 with the NRAM 531 or WRAM 532 through the IODMA 533; the second channel transmits data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 via the MVDMA 534. Although the second channel outwardly requires more elements to participate and has a longer data flow, in fact, its bandwidth is much greater than that of the first channel in some embodiments, and therefore the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient through the second channel. In an embodiment of the present disclosure, a data transmission channel may be selected according to the hardware conditions.
In other embodiments, the functions of the GDMA 411 and the IODMA 533 may be integrated in a same component. For ease of description, the GDMA 411 and the IODMA 533 are regarded as different components in the present disclosure; for those skilled in the art, a component will fall within the scope of protection of the present disclosure as long as it realizes functions and technical effects similar to those of the present disclosure. Further, the functions of the GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 may also be implemented by a same component.
The training of a neural network adjusts the parameters of the layers by inputting training samples, so that the result computed by the neural network is as close to the real result as possible. Neural network training comprises forward propagation and backward propagation. In forward propagation, a training sample is input to the existing model and computed by the layers of the neural network, gradually extracting the input feature map into abstract features. In backward propagation, a loss function is computed according to the result of the forward propagation and the real value, the partial derivatives of the loss function with respect to the parameters are computed through the chain rule by adopting a gradient descent method to update the parameters, and training then continues with the updated parameters; the above is repeated many times, such that the final computation result of the forward propagation is as anticipated.
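As a concrete illustration of this flow, the following is a minimal sketch of one forward/backward pass, assuming a single linear layer with a squared-error loss; the names and shapes are illustrative only, not the disclosed device.

```python
import numpy as np

# Minimal sketch (illustrative, not the disclosed device): one training step
# for a single linear layer y = W @ x with a squared-error loss.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))        # parameters of the layer
x = rng.standard_normal(8)             # one training sample
t = rng.standard_normal(4)             # the real (target) value
lr = 0.01                              # stride (learning rate)

y = W @ x                              # forward propagation
loss = 0.5 * np.sum((y - t) ** 2)      # loss function

dL_dy = y - t                          # backward propagation via chain rule
dL_dW = np.outer(dL_dy, x)             # partial derivatives w.r.t. W

W -= lr * dL_dW                        # gradient-descent parameter update
```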
In this embodiment, an epoch refers to the process of training once with all training samples; the set of these training samples is the training set; and training on one batchsize of training samples is one iteration. For example, if a training set has 1000 training samples and the batchsize is set to 10, then 10 training samples participate in training in each iteration, and there are 100 iterations in one epoch. In practice, the training of a neural network model may go through a plurality of epochs.
Based on the aforementioned hardware environment, the embodiment provides a solution of performing sparse training on a neural network model. More specifically, the processing device 203 trains parameters and a mask tensor at the same time in a neural network training stage. As shown in
In step 701, it is set that a mask adjustment stage is entered. While performing training, the prior art only trains all the parameters (such as weights and biases) and usually does not mask them. The purpose of masking the parameters in this embodiment is to reduce the participation of the parameters in the training stage and avoid over-fitting, so as to reduce the amount of computation, and meanwhile to have the mask tensor updated along with the updating of the parameters in the training process, to obtain a more ideal mask tensor. The control module 62 starts to enter the mask adjustment stage; in other words, it begins to mask a part of the parameters by using the mask tensor. In one application scene, the parameters and the mask tensor are randomly generated at the beginning of the training, and the random generation module 61 randomly generates the initial values of the mask tensor and the parameters. In another application scene, the mask tensor is generated according to the randomly generated parameters at the beginning of the training; in other words, the random generation module 61 randomly generates the initial values of the parameters, and the mask tensor determination module 65 determines the initial value of the mask tensor according to the initial values of the parameters.
In some embodiments, when the mask tensor is a one-dimensional tensor (in other words, a vector), the mask tensor determination module 65 may determine the initial value of the mask tensor by: selecting, from every m data elements of a specified dimension of the initial values of the above parameters, the n data elements with larger absolute values as valid data elements, where m>n; and generating the initial value of the mask tensor based on the positions of the n valid data elements among the m data elements. In some implementations, the specified dimension may be the input channel dimension (Cin). Specifically, in this embodiment, the parameters are divided into a plurality of intervals in units of a specific parameter count m, the parameters in each interval are sorted according to their absolute values from large to small, and then, in the mask tensor, the elements at positions that correspond to the first n parameters with larger absolute values in each interval are set to 1, and the elements at positions that correspond to the m-n parameters with smaller absolute values in each interval are set to 0, as sketched below.
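The following NumPy sketch illustrates this initialization, assuming m=4, n=2 and a parameter vector whose length is a multiple of m; the helper name is hypothetical.

```python
import numpy as np

def init_mask_vector(params, m=4, n=2):
    # Per interval of m consecutive elements, set the mask to 1 at the
    # positions of the n elements with the largest absolute values and to 0
    # elsewhere. A full sort is not required; a partial partition suffices.
    flat = params.reshape(-1, m)
    mask = np.zeros_like(flat, dtype=np.int8)
    top_n = np.argpartition(-np.abs(flat), n - 1, axis=1)[:, :n]
    np.put_along_axis(mask, top_n, 1, axis=1)
    return mask.reshape(params.shape)

w0 = np.array([0.3, -1.2, 0.05, 0.9, -0.1, 0.2, -0.7, 0.4])
print(init_mask_vector(w0))  # [0 1 0 1 0 0 1 1]
```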
In some other embodiments, when the mask tensor is a two-dimensional tensor, the control module 62 will preset a specific count of two-dimensional mask tensors and then select one of the preset two-dimensional mask tensors as the initial value of the mask tensor. Each dimension of these two-dimensional mask tensors comprises m elements, of which n elements are 1 and m-n elements are 0, where m>n.
The mask tensor of this embodiment is exemplarily set as a two-dimensional mask matrix for masking the input channel (cin) and the output channel (cout) of a convolution kernel of the convolutional layer. Assuming that m is 4 and n is 2, the cin×cout mask matrix is set to 4(m)×4(m), where there are 2(n) elements of 1 and 2(m-n) elements of 0 in any row and any column. Since there are 90 such 4×4 mask matrices, the control module 62 presets, in this step, 90 4×4 mask matrices with 2 elements of 1 and 2 elements of 0 in any row and any column, which are pre-stored in the DRAM 204. Although this embodiment is illustrated by taking the input channel (cin) and the output channel (cout) as an example, the present disclosure is not limited thereto, and any parameter can be masked according to the teaching of this embodiment.
Selecting one from among this specific count (for example, 90) of two-dimensional mask tensors as the initial value may comprise: respectively masking two specified dimensions of the initial values of the parameters of the neural network layer based on each preset two-dimensional mask tensor to obtain masked parameter tensors; performing product sum computation on the training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that yields the largest of all parameter evaluation values as the initial value of the mask tensor, as sketched below. In some implementations, the above two specified dimensions can be the input channel dimension and the output channel dimension. The masking process of the two-dimensional mask tensor can refer to the description made thereinafter in conjunction with
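As an illustration, the following sketch enumerates the 4×4 binary matrices with exactly two 1s in every row and column (there are 90, as stated above) and selects the one with the largest parameter evaluation value; it assumes NumPy, and the helper name and the product-sum scoring shown here are an illustrative reading of the selection step, not a definitive implementation.

```python
import itertools
import numpy as np

# Enumerate all 4x4 binary matrices with exactly two 1s in every row and
# every column; as stated above, there are 90 of them.
rows = [r for r in itertools.product([0, 1], repeat=4) if sum(r) == 2]
masks = [np.array(m) for m in itertools.product(rows, repeat=4)
         if all(sum(col) == 2 for col in zip(*m))]
assert len(masks) == 90

def best_mask(weights, data, masks):
    # Hypothetical selection: score each mask by the product sum of the
    # absolute values of the training data and the masked parameters,
    # then keep the mask with the largest parameter evaluation value.
    scores = [float(np.sum(np.abs(data) * np.abs(weights * m))) for m in masks]
    return masks[int(np.argmax(scores))]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))   # a cin x cout slice of a convolution kernel
x = rng.standard_normal((4, 4))   # the matching slice of training data
m0 = best_mask(w, x, masks)       # chosen initial value of the mask tensor
```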
After entering the mask adjustment stage, the processing device 203 repeats the following steps in a plurality of epochs.
In step 702, the mask adjustment parameters are masked based on the mask tensor in forward propagation to compute a value of the loss function. Here, for ease of identification, the parameters in the mask adjustment stage are referred to as mask adjustment parameters. Taking the aforementioned 4×4 mask matrix as an example, in this step, the computation module 63 masks the input channel and the output channel respectively according to the mask matrix selected from the 90 mask matrices in the initialization step.
In step 703, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in backward propagation. In back propagation, the computation module 63 propagates the output error of the neural network step by step from the output end of the neural network model toward the input end, and in the process, the effect of each mask adjustment parameter on the loss function is computed by using the chain rule; in other words, the partial derivative of the loss function with respect to each mask adjustment parameter is computed.
In step 704, the mask adjustment parameters are updated based on the partial derivatives. According to the effect of each mask adjustment parameter on the error, the updating module 64 scales the partial derivatives by a stride to update the mask adjustment parameters of the whole neural network.
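Interpreted together with the gradient descent method described above, the update of step 704 can be written as the standard rule (a hedged reading of "scaling by a stride", with $\eta$ denoting the stride, $W$ the mask adjustment parameters, and $L$ the loss function):

$$W \leftarrow W - \eta \, \frac{\partial L}{\partial W}$$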
In this embodiment, the updating module 64 may update the mask adjustment parameters based on the partial derivatives at each training sample or each iteration. Taking the aforementioned example in which an epoch comprises a training set of 1000 training samples and the batchsize is 10: if the mask adjustment parameters are updated after each training sample is trained, the updating will be performed 1000 times in the epoch; and if the mask adjustment parameters are updated every iteration, the updating will be performed 100 times in the epoch.
In step 705, the mask tensor is updated based on the updated mask adjustment parameters. The updating module 64 of this embodiment updates the mask tensor in a variety of ways.
If the mask tensor is one-dimensional, in other words, a mask vector, the mask vector can only mask a single parameter. As shown in
The division unit 641 divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit 642 sorts the mask adjustment parameters in each interval according to their absolute values from large to small; and the adjustment unit 643 sets, in the mask vector, the elements that correspond to the first n sorted mask adjustment parameters to 1, and the remaining elements that correspond to the m-n mask adjustment parameters with smaller absolute values to 0. In other words, the first n mask adjustment parameters with larger absolute values are retained, and the m-n mask adjustment parameters with smaller absolute values are masked.
In this embodiment, the mask adjustment parameters in each interval are completely sorted to identify the n parameters with larger absolute values and the m-n parameters with smaller absolute values, but the present disclosure does not necessarily need to perform a complete sort, as long as the n parameters with larger absolute values and the m-n parameters with smaller absolute values can be identified; the ordering of the n parameters with larger absolute values among themselves and the ordering of the m-n parameters with smaller absolute values among themselves are not necessary information. Taking the first interval 902 as an example, the present disclosure only needs to judge that b01 and b02 are the 2 parameters with larger absolute values and that b03 and b04 are the 2 parameters with smaller absolute values; the relative ordering of the absolute values of b01 and b02, and of b03 and b04, is not critical, and they may be left unsorted to save computation resources.
If the mask tensor is multi-dimensional, the updating module 64 may perform product sum computation on the training data and each masked parameter tensor to obtain a parameter evaluation value. The purpose of obtaining the parameter evaluation value is to compute the amount of information retained after masking by the mask tensor. If the parameter evaluation value is high, it indicates that not too much information is lost due to the masking; the mask tensor reduces the amount of computation on the premise of retaining most of the information and is a high-quality mask tensor. On the contrary, if the parameter evaluation value is low, it indicates that too much information is lost after the masking, and the mask tensor is not a high-quality mask tensor. The updating process of the multi-dimensional mask tensor is similar to the initialization process described above for the two-dimensional mask tensor; in other words, the mask tensor determination module 65 may be implemented as part of the updating module 64.
For another example, the absolute values of the elements of the training data matrix 1001 and the corresponding elements of the masked parameter matrix 803 are multiplied, and then the products are summed to obtain a parameter evaluation value S2, which can be written as $S_2 = \sum_{i,j} \left| x_{i,j} \right| \cdot \left| \bar{w}_{i,j} \right|$, where $x_{i,j}$ is an element of the training data matrix 1001 and $\bar{w}_{i,j}$ is the corresponding element of the masked parameter matrix 803.
The parameter evaluation value reflects the result of this absolute-value computation: the parameter evaluation value S1 or S2 represents how much information is retained after the masking, and the higher the parameter evaluation value, the more information is retained. In one application scene, either of the parameter evaluation values S1 and S2 may be selected, while in another application scene, both S1 and S2 may be employed, which is not limited in this disclosure.
The updating module 64 performs masking with all the candidate mask tensors and obtains the corresponding parameter evaluation values. In the foregoing example, this means that masking is performed with all of the 90 4×4 mask matrices and 90 parameter evaluation values are obtained. The mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, in other words, the parameter mask tensor. There are many ways of selecting the largest parameter evaluation value; for example, the sorting unit 642 may sort all the parameter evaluation values from large to small to obtain the largest one, or simply compare two values with a two-input comparator, keep the larger for comparison with the next parameter evaluation value, and be left with the largest parameter evaluation value after all 90 parameter evaluation values have been compared. If a plurality of mask tensors have the same largest parameter evaluation value, the updating module 64 may select one of them based on a specific rule or hardware characteristic, for example, the top-ranked one, the bottom-ranked one, the first one, the last one, or one chosen randomly.
The mask tensor having the largest parameter evaluation value is the mask tensor that retains the most information, and in this embodiment, this mask tensor is taken as the parameter mask tensor.
In this embodiment, the updating module 64 will update the parameter mask tensor at each iteration or each epoch. If, in step 704, the mask adjustment parameters are updated after each training sample is trained, it is advantageous to update the parameter mask tensor every iteration; and if, in step 704, the mask adjustment parameters are updated every iteration, it is advantageous to update the parameter mask tensor at the end of each epoch.
Through the flow shown in
Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the previous embodiment in that a mask-free stage is entered before the mask adjustment stage. In the mask-free stage, the processing device 203 only trains the parameters, in other words, does not mask the parameters, and after the mask-free stage is finished and the mask adjustment stage is entered, the parameters are trained and the mask matrix is updated at the same time. Training flow of this embodiment is shown in
In step 1101, the control module 62 first sets that a mask-free stage is entered. In the mask-free stage of this embodiment, the parameters are not masked and all of them participate in the training; at the beginning of the training, the random generation module 61 randomly generates the parameter values. For ease of identification, the parameters participating in the training in the mask-free stage are called mask-free parameters.
In step 1102, the computation module 63 computes the value of the loss function in forward propagation based on the mask-free parameters. In this step, the computation module 63 computes the loss function in the manner of the prior art: a training sample is input in forward propagation and computed by the layers of the neural network, the input feature map is gradually extracted into abstract features, and the loss function is computed by using the forward propagation result and the real value.
In step 1103, the computation module 63 computes partial derivatives of the loss function with respect to the mask-free parameters in back propagation. The computation module 63 computes the partial derivative of the loss function with respect to each mask-free parameter through a chain rule by employing a gradient descent method.
In step 1104, the updating module 64 updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as the initial values of the mask adjustment parameters. According to the effect of each mask-free parameter on the error, the updating module 64 scales the partial derivatives by a stride to update the mask-free parameters of the whole neural network. In this embodiment, the updating module 64 may likewise update the mask-free parameters based on the partial derivatives at each training sample or each iteration.
In this embodiment, the steps 1102, 1103 and 1104 may be repeated for a specific count of epochs to update the mask-free parameters many times, and after the last updating, the updated mask-free parameters are taken as the initial values of the mask adjustment parameters in the next stage.
In step 1105, it is set that a mask adjustment stage is entered. The control module 62 sets that the mask adjustment stage is entered; in other words, it begins to use the mask tensor to mask part of the parameters. At the beginning of the mask adjustment stage, as described above, the initial values of the mask adjustment parameters are the mask-free parameters finally updated in the mask-free stage, and the mask tensor can be generated in two ways: one is to randomly generate the mask tensor by the random generation module 61; the other is to determine the initial value of the mask tensor based on the mask-free parameters finally updated in the mask-free stage (in other words, the initial values of the mask adjustment parameters), in the same manner as step 705, which is not repeated here.
In step 1106, the mask adjustment parameters are masked in forward propagation based on the mask tensor to compute the value of the loss function. In step 1107, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in back propagation. In step 1108, the mask adjustment parameters are updated based on the partial derivatives. In step 1109, the mask tensor is updated based on updated mask adjustment parameters. These steps are the same as the steps 702, 703, 704 and 705, respectively, and are not repeated.
The counts of epochs in the mask-free stage and in the mask adjustment stage are not limited in this embodiment, and can be arranged by those skilled in the art according to a specific situation, and the counts of epochs in the mask-free stage and in the mask adjustment stage are not necessarily the same.
Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the above embodiment in that the training is divided into three stages: a mask-free stage, a mask adjustment stage and a mask fixing stage. In the mask-free stage, the processing device 203 only trains the parameters and does not mask the parameters; in the mask adjustment stage, the processing device 203 takes the updated mask-free parameters as initial values, trains the parameters and the mask tensor at the same time; and in the mask fixing stage, the processing device 203 takes the updated mask adjustment parameters and the updated mask tensor in the mask adjustment stage as initial values, and continues to train the parameters without changing or updating the mask tensor.
The flow executed in the mask-free stage and the mask adjustment stage in this embodiment is shown in
In step 1201, the control module 62 sets that a mask fixing stage is entered. In the mask fixing stage, the control module 62 takes the updated mask adjustment parameters from the mask adjustment stage as the initial values of the parameters of this stage (hereinafter referred to as the mask fixing parameters). Since the mask tensor has been updated in the mask adjustment stage, it is no longer updated in this stage; the mask fixing parameters are masked based on the mask tensor finally updated in the mask adjustment stage and continue to be trained.
In this embodiment, the following steps are repeated in at least one epoch.
In step 1202, the computation module 63 masks the mask fixing parameters in forward propagation based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function. This step is similar to the step 702 and is not repeated.
In step 1203, the computation module 63 computes the partial derivatives of the loss function with respect to the mask fixing parameters in backward propagation. This step is similar to the step 703 and is not repeated.
In step 1204, the updating module 64 updates the mask fixing parameters based on the partial derivatives. This step is similar to the step 704 and is not repeated.
In this embodiment, the training is divided into three stages. In the mask-free stage, no mask tensor masks the parameters, and only the parameters are trained to speed up the convergence of the parameters. In the mask adjustment stage, the initial values of the parameters are no longer randomly generated, but are trained mask-free parameters, which is helpful to quickly obtain an ideal mask tensor. After the mask tensor has been updated, the mask fixing stage is entered, parameters are continually trained by using the updated mask tensor, and finally, the trained parameters will better match the mask tensor.
In view of the above, those skilled in the art will understand that there may be several implementations of the present disclosure as shown in
The implementation 1301 only has a mask adjustment stage, in which both the initial value W0 of the parameters and the initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameters. The parameters are trained and the mask tensor is updated at the same time, to obtain a trained parameter Wf and an updated mask tensor Mf.
The implementation 1302 only has a mask-free stage and a mask adjustment stage. In the mask-free stage, only the parameters are trained; the initial value W0 of the parameters is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. In the mask adjustment stage, the parameters are trained and the mask tensor is updated at the same time; the initial value of the parameters in this stage is the updated parameter W1, and the initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1; finally, a trained parameter Wf and an updated mask tensor Mf are obtained.
The implementation 1303 only has a mask adjustment stage and a mask fixing stage. In the mask adjustment stage, both the initial value W0 of the parameters and the initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameters. The parameters are trained and the mask tensor is updated at the same time, to obtain an updated parameter W1 and an updated mask tensor Mf. In the mask fixing stage, the parameters are masked by the updated mask tensor Mf and continue to be trained; the initial value of the parameters in this stage is the updated parameter W1, and finally a trained parameter Wf is obtained.
The implementation 1304 has a mask-free stage, a mask adjustment stage, and a mask fixing stage. In the mask-free stage, only the parameters are trained; the initial value W0 of the parameters is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. In the mask adjustment stage, the parameters are trained and the mask tensor is updated at the same time; the initial value of the parameters in this stage is the updated parameter W1, and the initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1; an updated parameter W2 and an updated mask tensor Mf are obtained. In the mask fixing stage, the parameters are masked by the updated mask tensor Mf and continue to be trained; the initial value of the parameters in this stage is the updated parameter W2, and finally a trained parameter Wf is obtained.
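To make the three-stage flow of implementation 1304 concrete, the following is a schematic, runnable sketch under simplifying assumptions: a single linear layer, a squared-error loss, and the top-n-of-m rule of step 705 as the mask update. The helper names are illustrative stand-ins, not the disclosed modules.

```python
import numpy as np

def derive_mask(W, m=4, n=2):
    # Per group of m consecutive elements, keep the n with the largest
    # absolute values (mask 1) and mask the rest (mask 0), per step 705.
    flat = W.reshape(-1, m)
    mask = np.zeros_like(flat)
    top = np.argpartition(-np.abs(flat), n - 1, axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(W.shape)

def train_step(W, x, t, lr=0.01, mask=None):
    # One gradient-descent step on 0.5 * ||(W * mask) @ x - t||^2;
    # with mask=None the parameters are trained without masking.
    Wm = W if mask is None else W * mask
    err = Wm @ x - t
    grad = np.outer(err, x)
    if mask is not None:
        grad = grad * mask
    return W - lr * grad

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))          # W0, randomly generated
x, t = rng.standard_normal(8), rng.standard_normal(4)

for _ in range(10):                      # mask-free stage: W0 -> W1
    W = train_step(W, x, t)
M = derive_mask(W)                       # M0 derived from W1
for _ in range(10):                      # mask adjustment stage: W1 -> W2
    W = train_step(W, x, t, mask=M)
    M = derive_mask(W)                   # mask updated with the parameters
for _ in range(10):                      # mask fixing stage: W2 -> Wf
    W = train_step(W, x, t, mask=M)      # M (= Mf) is no longer updated
```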
The implementation 1305 has, in addition to a mask-free stage, a mask adjustment stage, and a mask fixing stage, other training stages (indicated by dashed lines) between the mask-free stage and the mask adjustment stage, and between the mask adjustment stage and the mask fixing stage. In the mask-free stage, only the parameters are trained; the initial value W0 of the parameters is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. The mask-free stage may then be followed by any training stage disclosed or not disclosed in this disclosure, in which the parameters are trained or the mask tensor is updated. Assuming that this stage is a mask fixing stage, the initial value of the parameters in this stage is the updated parameter W1, and the initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1, to obtain an updated parameter W2.
Then the mask adjustment stage is entered, in which the parameters are trained and the mask tensor is updated at the same time; the initial value of the parameters in this stage is the updated parameter W2, and the initial value of the mask tensor is still the mask tensor M0, to obtain an updated parameter W3 and an updated mask tensor M1. This stage is then followed by any stage disclosed or not disclosed in the present disclosure, in which the parameters are trained or the mask tensor is updated. Assuming that this stage is a parameter fixing stage, in other words, the parameters are fixed and not trained and only the mask tensor is trained, the initial value of the parameters in this stage is the updated parameter W3, and the initial value of the mask tensor is the updated mask tensor M1, to obtain an updated mask tensor Mf.
Finally, in the mask fixing stage, the parameters are masked by the updated mask tensor Mf and continue to be trained; the initial value of the parameters in this stage is the updated parameter W3, and finally a trained parameter Wf is obtained.
The various implementations shown in
The count of epochs in each stage of various implementations is not limited in the present disclosure, and can be arranged by those skilled in the art according to a specific situation, and the count of epochs in each stage is not necessarily the same.
The aforementioned embodiments do not necessarily require that all of the preset specific count of epochs are completed. The control module 62 may further judge whether the percentage of element values of the parameter mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold. If the threshold is reached, it indicates that the training result has basically converged and more training brings limited improvement in accuracy; therefore, the mask adjustment stage is ended to complete the training. Such a threshold is typically set above 70%; in other words, if the percentage of unchanged element values of the parameter mask tensor is above 70%, the training will be stopped. The threshold is not limited in the present disclosure, and may be 80%, 90%, 100%, or any other percentage.
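A minimal sketch of such a convergence check follows, assuming NumPy; the helper name is hypothetical.

```python
import numpy as np

def mask_converged(prev_mask, curr_mask, threshold=0.7):
    # Fraction of parameter mask tensor elements left unchanged between two
    # consecutive epochs; the stage may be ended once it reaches the threshold.
    unchanged = float(np.mean(prev_mask == curr_mask))
    return unchanged >= threshold
```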
Another embodiment of the present disclosure is a computer-readable storage medium that has stored thereon computer program code for performing sparse training on a neural network model, which when executed by a processor, performs the methods of the embodiments as described above. In some implementation scenes, the above integrated units may be implemented in a form of a software program module. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. Based on this, when the solutions of the present disclosure are embodied in a form of a software product (for example, computer-readable storage medium), the software product may be stored in a memory, and it may include several instructions to cause a computer device (for example, a personal computer, a server, a network device, or the like) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The above memory may include, but is not limited to, various media that may have stored thereon program code, for example, a U disk, a flash disk, a read only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disk, or the like.
In the above embodiments, after the training is completed, when the computation device 201 performs inference, the trained parameters are shielded by using the updated parameter mask tensor, to control the processing area of the feature map input to the neural network model, so that on one hand, the expected accuracy can be reached, and on the other hand, the amount of computation can be reduced in the process of the inference to achieve sparsification.
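By way of illustration only, masked inference can be sketched as follows; the mask here is a fabricated stand-in for the example, not a trained parameter mask tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
W_trained = rng.standard_normal((4, 4))                  # trained parameters
M_f = (np.abs(W_trained) > 0.5).astype(W_trained.dtype)  # stand-in mask tensor
feature = rng.standard_normal(4)                         # input feature map
output = (W_trained * M_f) @ feature  # masked parameters do not participate
```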
According to different application scenes, an electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile memory, a wearable device, a visual terminal, an automatic driving terminal, transportation, a household appliance, and/or medical device. The transportation comprises an airplane, a ship and/or a vehicle; the household appliance comprises a television set, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; and the medical device comprises a nuclear magnetic resonance instrument, a B-ultrasonic scanner and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to fields such as Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunication, finance, retail, construction site, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in application scenes related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, an electronic device or apparatus with high computational capability according to the present disclosure may be applied to a cloud device (for example, cloud server), and an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a camera). In one or more embodiments, as hardware information of a cloud device and hardware information of a terminal device and/or an edge device are compatible with each other, appropriate hardware resources can be matched from hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate hardware resources of the terminal device and/or the edge device, so that unified management, scheduling and cooperative work of terminal-cloud integration or cloud-edge-terminal integration are achieved.
It should be noted that for the sake of brevity, this disclosure presents some methods and their embodiments as a series of actions and their combinations, but those skilled in the art will appreciate that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art will appreciate, in light of the disclosure or teachings of the present disclosure, that certain steps therein may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, in other words, the actions or modules involved therein are not necessarily required for the implementation of a certain solution or solutions of the present disclosure. In addition, according to different solutions, the description of some embodiments in the present disclosure may focus on different emphases. In view of this, those skilled in the art will appreciate that for portions that are not described in detail in one embodiment of the present disclosure, reference may also be made to the related description of other embodiments.
In a specific implementation, based on the disclosure and teachings of the present disclosure, those skilled in the art will appreciate that several embodiments disclosed in the present disclosure may also be implemented in other ways not disclosed herein. For example, as for units in the foregoing embodiments of the electronic device or apparatus, the units are split based on logic functions considered herein, and they may be split in other ways in a practical implementation. For another example, a plurality of units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of a connection relation between different units or components, the connection discussed above in conjunction with the accompanying drawings may be direct or indirect coupling between the units or components. In some scenes, the foregoing direct or indirect coupling involves a communication connection that uses an interface, wherein the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, a unit described as a separate part may or may not be physically separate, and a component shown as a unit may or may not be a physical unit. The aforementioned components or units may be located in a same position or distributed over a plurality of network units. In addition, according to actual needs, some or all of the units can be selected to achieve the objectives of the solutions described in the embodiments of the present disclosure. In addition, in some scenes, a plurality of units in embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.
In other implementation scenes, the above integrated unit may also be implemented in a form of hardware, in other words, a specific hardware circuit, which may include a digital circuit, and/or an analog circuit, and the like. A physical implementation of a hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, various devices described herein (for example, computation devices or other processing devices) may be implemented by a suitable hardware processor, such as a central processing unit, GPU, FPGA, DSP, ASIC, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, or the like), and it may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
The above content may be better understood in light of the following Clauses:
Clause 1. A method of performing sparse training on a neural network model, comprising:
- in a mask adjustment stage, repeating the following steps in a plurality of epochs:
- masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
- updating the mask adjustment parameters based on the partial derivatives; and
- updating the mask tensor based on the updated mask adjustment parameters; and
- in a mask fixing stage, taking the updated mask adjustment parameters in the mask adjustment stage as initial values of mask fixing parameters, repeating the following steps in a plurality of epochs:
- masking, in forward propagation, the mask fixing parameters based on the updated mask tensor to compute a value of the loss function;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; and
- updating the mask fixing parameters based on the partial derivatives,
- wherein the updated mask fixing parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
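As a concrete, non-authoritative illustration of the two-stage procedure of Clause 1, the following is a minimal Python (PyTorch) sketch that trains a toy linear layer under a 2:4 mask. The names (`make_mask`, `run_epoch`), the mean-squared-error loss, the SGD optimizer, and the toy data are all illustrative assumptions; `make_mask` follows the interval-based magnitude rule described in Clause 10 below. Note that in this sketch the gradient reaching masked entries is zero; a straight-through variant that updates all dense parameters would also be consistent with the clause.

```python
import torch

def make_mask(w, m=4, n=2):
    # n:m magnitude rule (cf. Clause 10): in every group of m elements,
    # keep the n entries with the largest absolute value.
    groups = w.detach().abs().reshape(-1, m)
    mask = torch.zeros_like(groups)
    mask.scatter_(1, groups.topk(n, dim=1).indices, 1.0)
    return mask.reshape(w.shape)

w = torch.randn(8, 16, requires_grad=True)       # mask adjustment parameters
mask = make_mask(w)                              # mask tensor
opt = torch.optim.SGD([w], lr=0.01)
x, t = torch.randn(32, 16), torch.randn(32, 8)   # toy training data

def run_epoch(fixed_mask):
    global mask
    opt.zero_grad()
    y = x @ (w * mask).t()                       # forward propagation: parameters are masked
    loss = torch.nn.functional.mse_loss(y, t)    # value of the loss function
    loss.backward()                              # partial derivatives w.r.t. the parameters
    opt.step()                                   # update the parameters
    if not fixed_mask:
        mask = make_mask(w)                      # re-derive the mask from the updated parameters

for _ in range(5):   # mask adjustment stage
    run_epoch(fixed_mask=False)
for _ in range(5):   # mask fixing stage: updated parameters carry over, mask frozen
    run_epoch(fixed_mask=True)
```

At deployment, the frozen mask then shields the trained parameters, which is what the final limitation of Clause 1 uses to control the processed area of an input feature map.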
Clause 2. The method of Clause 1, further comprising:
- in a mask-free stage, repeating the following steps in a plurality of epochs:
- computing, in forward propagation, the value of the loss function based on mask-free parameters;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
- updating the mask-free parameters based on the partial derivatives,
- wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
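Continuing the sketch above, the mask-free stage of Clause 2 is ordinary dense training used as a warm-up before the two masked stages; only the final parameter values matter, since they seed the mask adjustment stage. The epoch count here is an arbitrary assumption.

```python
for _ in range(3):                     # mask-free stage: no mask in forward propagation
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(x @ w.t(), t)
    loss.backward()                    # partials w.r.t. the mask-free parameters
    opt.step()
# the warmed-up parameters become the initial mask adjustment parameters;
# re-deriving the mask from them corresponds to Clause 4 below
mask = make_mask(w)
```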
Clause 3. The method of Clause 2, further comprising:
- randomly generating initial values of the mask tensor and the mask-free parameters.
Clause 4. The method of Clause 1, further comprising:
- determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
Clause 5. The method of Clause 4, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
- selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, where m>n; and
- generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
Clause 6. The method of Clause 5, wherein the specified dimension is an input channel dimension.
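A small worked example of the selection rule of Clauses 5 and 6, reusing the hypothetical `make_mask` helper from the earlier sketch (the grouping runs along the input channel dimension, here the last axis):

```python
row = torch.tensor([0.3, -1.2, 0.05, 0.8, -0.4, 0.1, 2.0, -0.9])
print(make_mask(row, m=4, n=2))
# tensor([0., 1., 0., 1., 0., 0., 1., 1.])
# group 1: |-1.2| and |0.8| are the two largest of four -> positions 2 and 4 are valid
# group 2: |2.0| and |-0.9| are the two largest of four -> positions 7 and 8 are valid
```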
Clause 7. The method of Clause 4, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor includes:
- presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1, and m-n elements are 0, where m>n;
- masking two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- selecting a two-dimensional mask tensor that yields the largest of all parameter evaluation values as the initial value of the mask tensor.
Clause 8. The method of Clause 7, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
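The two-dimensional case of Clauses 7 and 8 can be sketched as follows. The candidate pool contains every m-by-m 0/1 tensor with exactly n ones per row and per column, and each candidate is scored on a batch of training data. The particular scoring function (sum of absolute layer responses as the product-sum) and the tiling of the small mask over the full output-channel-by-input-channel weight are assumptions made for illustration; the function names are hypothetical.

```python
import torch
from itertools import combinations, product

def preset_2d_masks(m=4, n=2):
    # every m-by-m 0/1 tensor with exactly n ones per row and per column
    rows = [torch.tensor([1.0 if i in keep else 0.0 for i in range(m)])
            for keep in combinations(range(m), n)]
    masks = []
    for choice in product(rows, repeat=m):
        cand = torch.stack(choice)
        if (cand.sum(dim=0) == n).all():          # enforce n ones per column too
            masks.append(cand)
    return masks                                   # 90 candidates for m=4, n=2

def init_2d_mask(w, x, m=4, n=2):
    best_mask, best_score = None, float("-inf")
    for cand in preset_2d_masks(m, n):
        tiled = cand.repeat(w.shape[0] // m, w.shape[1] // m)   # cover both channel dims
        score = (x @ (w * tiled).t()).abs().sum()               # product-sum evaluation
        if score > best_score:
            best_mask, best_score = cand, score
    return best_mask                               # candidate with the largest evaluation value

mask2d = init_2d_mask(torch.randn(8, 16), torch.randn(32, 16))
```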
Clause 9. The method of Clause 1, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
Clause 10. The method of Clause 1, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
- after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
- sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
- setting, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval, to 1; and
- setting, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
Clause 11. The method of Clause 10, wherein the mask adjustment stage further comprises:
- judging whether the percentage of element values of the mask tensor that remain unchanged over a plurality of consecutive epochs reaches a threshold; and
- if the threshold is reached, ending the mask adjustment stage.
Clause 12. The method of Clause 11, wherein the threshold is one of 80%, 90%, and 100%.
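Clauses 10 through 12 re-derive the mask after each epoch (the interval sort is the same n:m magnitude rule as `make_mask` above) and end the mask adjustment stage once the mask stabilizes. A minimal sketch of the stopping test, continuing the earlier training loop; for simplicity it compares only two consecutive epochs, whereas the clause allows a longer window:

```python
def mask_stable(prev_mask, new_mask, threshold=0.8):
    # fraction of mask elements unchanged between consecutive epochs (Clause 11)
    return (prev_mask == new_mask).float().mean().item() >= threshold

prev = mask.clone()
for _ in range(100):                       # upper bound on mask adjustment epochs
    run_epoch(fixed_mask=False)            # from the earlier sketch
    if mask_stable(prev, mask, threshold=0.8):
        break                              # mask has stabilized: end the mask adjustment stage
    prev = mask.clone()
```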
Clause 13. The method of Clauses 5 to 8 or 10, wherein m is 4 and n is 2.
Clause 14. The method of Clause 10, wherein the specific count is 1.
Clause 15. A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 1 to 12.
Clause 16. An integrated circuit device for performing sparse training on a neural network model, comprising:
- a processing device, comprising a control module, a computation module, and an updating module,
- wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters,
- wherein when the control module sets that a mask fixing stage is entered, the updating module takes the updated mask adjustment parameters as initial values of mask fixing parameters, and the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; the updating module updates the mask fixing parameters based on the partial derivatives; and
- a computation device configured to shield the updated mask fixing parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
Clause 17. The integrated circuit device of Clause 16, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
Clause 18. The integrated circuit device of Clause 17, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
Clause 19. The integrated circuit device of Clause 16, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
Clause 20. The integrated circuit device of Clause 19, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
- select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, where m>n; and
- generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
Clause 21. The integrated circuit device of Clause 20, wherein the specified dimension is an input channel dimension.
Clause 22. The integrated circuit device of Clause 19, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
- preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
- mask two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
Clause 23. The integrated circuit device of Clause 22, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
Clause 24. The integrated circuit device of Clause 16, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
Clause 25. The integrated circuit device of Clause 16, wherein when the mask tensor is a one-dimensional tensor, the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; and the adjustment unit sets, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval, to 1, and sets, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
Clause 26. The integrated circuit device of Clause 25, wherein in the mask adjustment stage, the control module judges whether the percentage of element values of the mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold; and if the threshold is reached, the mask adjustment stage is ended.
Clause 27. The integrated circuit device of Clause 26, wherein the threshold is one of 80%, 90%, and 100%.
Clause 28. The integrated circuit device of Clauses 20 to 23 or 25, wherein m is 4 and n is 2.
Clause 29. The integrated circuit device of Clause 25, wherein the specific count is 1.
Clause 30. A board card, comprising the integrated circuit device of any of Clauses 16 to 29.
Clause 31. A method of performing sparse training on a neural network model, comprising:
- in a mask adjustment stage, repeating the following steps in a plurality of epochs:
- masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
- updating the mask adjustment parameters based on the partial derivatives; and
- updating the mask tensor based on the updated mask adjustment parameters,
- wherein the updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
Clause 32. The method of Clause 31, further comprising:
- in a mask-free stage, repeating the following steps in a plurality of epochs:
- computing, in forward propagation, the value of the loss function based on mask-free parameters;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
- updating the mask-free parameters based on the partial derivatives;
- wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
Clause 33. The method of Clause 32, further comprising:
- randomly generating initial values of the mask tensor and the mask-free parameters.
Clause 34. The method of Clause 31, further comprising:
- determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
Clause 35. The method of Clause 34, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
- selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
- generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
Clause 36. The method of Clause 35, wherein the specified dimension is an input channel dimension.
Clause 37. The method of Clause 34, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor comprises:
- presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
- masking two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- selecting a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
Clause 38. The method of Clause 37, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
Clause 39. The method of Clause 31, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
Clause 40. The method of Clause 31, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
- after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
- sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
- setting, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval, to 1; and
- setting, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
Clause 41. The method of Clause 40, wherein the mask adjustment stage further includes:
- judging whether the percentage of element values of the mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold; and
- if the threshold is reached, ending the mask adjustment stage.
Clause 42. The method of Clause 41, wherein the threshold is one of 80%, 90%, and 100%.
Clause 43. The method of Clauses 35 to 38 or 40, wherein m is 4 and n is 2.
Clause 44. The method of Clause 40, wherein the specific count is 1.
Clause 45. A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 31 to 42.
Clause 46. An integrated circuit device for performing sparse training on a neural network model, comprising:
- a processing device, comprising a control module, a computation module, and an updating module,
- wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters; and
- a computation device configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
Clause 47. The integrated circuit device of Clause 46, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
Clause 48. The integrated circuit device of Clause 47, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
Clause 49. The integrated circuit device of Clause 46, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
Clause 50. The integrated circuit device of Clause 49, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
- select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
- generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
Clause 51. The integrated circuit device of Clause 50, wherein the specified dimension is an input channel dimension.
Clause 52. The integrated circuit device of Clause 49, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
- preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
- mask two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
Clause 53. The integrated circuit device of Clause 52, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
Clause 54. The integrated circuit device of Clause 46, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
Clause 55. The integrated circuit device of Clause 46, wherein when the mask tensor is a one-dimensional tensor, the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; and the adjustment unit sets, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval, to 1, and sets, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
Clause 56. The integrated circuit device of Clause 55, wherein in the mask adjustment stage, the control module judges whether the percentage of element values of the mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold; and if the threshold is reached, the mask adjustment stage is ended.
Clause 57. The integrated circuit device of Clause 56, wherein the threshold is one of 80%, 90%, and 100%.
Clause 58. The integrated circuit device of Clauses 50 to 53 or 55, wherein m is 4 and n is 2.
Clause 59. The integrated circuit device of Clause 55, wherein the specific count is 1.
Clause 60. A board card, comprising the integrated circuit device of any of Clauses 46 to 59.
The embodiments of the present disclosure have been described above in detail, and specific examples have been applied herein to explain the principles and implementations of the present disclosure; the description of the above embodiments is intended only to help in understanding the methods and core ideas of the present disclosure. Meanwhile, those of ordinary skill in the art may, according to the ideas of the present disclosure, make variations to the specific implementations and application scopes. In summary, the contents of this specification should not be construed as restrictions on the present disclosure.
Claims
1.-30. (canceled)
31. A method of performing sparse training on a neural network model, comprising:
- in a mask adjustment stage, repeating following steps in a plurality of epochs:
- masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
- updating the mask adjustment parameters based on the partial derivatives; and
- updating the mask tensor based on the updated mask adjustment parameters, wherein the updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
32. The method of claim 31, further comprising:
- in a mask-free stage, repeating following steps in a plurality of epochs:
- computing, in forward propagation, the value of the loss function based on mask-free parameters;
- computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
- updating the mask-free parameters based on the partial derivatives;
- wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
33. The method of claim 32, further comprising:
- randomly generating initial values of the mask tensor and the mask-free parameters.
34. The method of claim 31, further comprising:
- determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
35. The method of claim 34, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
- selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
- generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
36. The method of claim 35, wherein the specified dimension is an input channel dimension.
37. The method of claim 34, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor includes:
- presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
- masking two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- selecting a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
38. The method of claim 37, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
39. The method of claim 31, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
40. The method of claim 31, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
- after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
- sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
- setting, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval, to 1; and
- setting, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
41.-44. (canceled)
45. A non-transitory computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of claim 31.
46. An integrated circuit device for performing sparse training on a neural network model, comprising:
- a processing device, comprising a control module, a computation module, and an updating module,
- wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters; and
- a computation device configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
47. The integrated circuit device of claim 46, wherein when the control module sets that a mask-free stage is entered, the computation module repeats following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
48. The integrated circuit device of claim 47, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
49. The integrated circuit device of claim 46, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
50. The integrated circuit device of claim 49, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
- select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
- generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
51. The integrated circuit device of claim 50, wherein the specified dimension is an input channel dimension.
52. The integrated circuit device of claim 49, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
- preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
- mask two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
- perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
- select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
53. The integrated circuit device of claim 52, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
54. The integrated circuit device of claim 46, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
55.-60. (canceled)
Type: Application
Filed: Feb 3, 2022
Publication Date: Jul 21, 2022
Inventors: Yufeng GAO (Hefei), Shibing ZHU (Hefei)
Application Number: 17/557,802