Fault Diagnosis Method and Apparatus for Big-Data Network System
A fault diagnosis method for a big-data network system includes extracting fault information from historical data in the network system, to form training sample data, which is trained to obtain a deep sum product network model that can be used to perform fault diagnosis; and diagnosing a fault of the network system based on the deep sum product network model. The embodiments of the present application resolve a problem that it is difficult to diagnose a fault of a big-data network system.
This application claims priority to Chinese Patent Application No. 201510669888.2, filed on Oct. 13, 2015, which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
Embodiments of the present application relate to the big data processing field, and more specifically, to a fault diagnosis method and apparatus for a big-data network system.
BACKGROUND
Fault diagnosis is a process in which various checking and testing methods are used to determine a status and an abnormal situation of a system and locate a type of a fault or a cause why a fault is generated, and finally, a solution is provided to perform fault recovery.
Fault diagnosis is an important process in many industrial systems. In recent years, as the modern industry demonstrates a trend of becoming larger and more complex, fault diagnosis becomes more important, and a greater challenge is imposed on a fault diagnosis technology.
Conventional fault diagnosis manners mainly include two types: one type is to establish an accurate fault diagnosis model, and the other type is to perform diagnosis based on experience of experts. The foregoing two fault diagnosis modes are mainly applicable to fault diagnosis of a simple system. However, data of a big-data network system (such as a network of a telecommunications operator or a large data center network) is of various types, and includes structured data, such as data in a database, and also includes non-structured data, such as a graph or text. In addition, the big-data network system generates massive data each day, and network system fault symptoms and causes are also diversified. Therefore, it is very difficult to perform diagnosis by relying on a conventional fault diagnosis method.
SUMMARY
Embodiments of the present application provide a fault diagnosis method and apparatus for a big-data network system, to resolve a problem that it is difficult to perform diagnosis on the big-data network system.
According to a first aspect, a fault diagnosis method for a big-data network system is provided, including obtaining historical data of the network system, where the historical data is heterogeneous data, the heterogeneous data includes structured data and non-structured data, and the historical data includes fault information, which is used to describe a cause and a symptom of multiple faults of the network system; obtaining the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables, where one group of values of the fault-related random variables is used to indicate an association relationship between a symptom and a cause of one fault of the network system, and the fault-related random variables include a random variable of a first category and a random variable of a second category, where the random variable of the first category is used to represent a symptom of a fault of the network system, and the random variable of the second category is used to represent a cause of the fault of the network system; using the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model; assigning a value to the random variable of the first category according to a symptom of a current fault of the network system; determining a marginal probability or a conditional probability of the random variable of the second category by using the deep sum product network model and according to the assigned value of the random variable of the first category; and deducing a cause of the current fault according to the marginal probability or the conditional probability of the random variable of the second category.
With reference to the first aspect, in an implementation manner of the first aspect, the using the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model includes generating a numerical matrix according to the multiple groups of values, where each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and each column of the numerical matrix is corresponding to one variable of the fault-related random variables; dividing the numerical matrix into m×n first submatrices of an equal size, where both m and n are positive integers, and a sum of m and n is greater than or equal to 2; obtaining m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and determining the deep sum product network model according to the m×n sum product network models.
With reference to the first aspect or any one of the foregoing implementation manners of the first aspect, in another implementation manner of the first aspect, the determining the deep sum product network model according to the m×n sum product network models includes calculating a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and calculating a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
With reference to the first aspect or any one of the foregoing implementation manners of the first aspect, in another implementation manner of the first aspect, the determining the deep sum product network model according to the m×n sum product network models includes calculating a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and calculating a product of the n intermediate sum product network models, to obtain the deep sum product network model.
With reference to the first aspect or any one of the foregoing implementation manners of the first aspect, in another implementation manner of the first aspect, the determining the deep sum product network model according to the m×n sum product network models includes generating a target matrix that uses the m×n sum product network models as elements; and recursively splitting the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model, where the deep sum product network uses the m×n sum product network models as leaf nodes.
With reference to the first aspect or any one of the foregoing implementation manners of the first aspect, in another implementation manner of the first aspect, the obtaining the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables includes discretizing information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and/or extracting information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
With reference to the first aspect or any one of the foregoing implementation manners of the first aspect, in another implementation manner of the first aspect, the structured data is data in a database, and the discretizing information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables includes performing discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.
According to a second aspect, a fault diagnosis apparatus for a big-data network system is provided, including an obtaining module configured to obtain historical data of the network system, where the historical data is heterogeneous data, the heterogeneous data includes structured data and non-structured data, and the historical data includes fault information, which is used to describe a cause and a symptom of multiple faults of the network system; an extraction module configured to obtain the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables, where one group of values of the fault-related random variables is used to indicate an association relationship between a symptom and a cause of one fault of the network system, and the fault-related random variables include a random variable of a first category and a random variable of a second category, where the random variable of the first category is used to represent a symptom of a fault of the network system, and the random variable of the second category is used to represent a cause of the fault of the network system; a training module configured to use the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model; a value assignment module configured to assign a value to the random variable of the first category according to a symptom of a current fault of the network system; a determining module configured to determine a marginal probability or a conditional probability of the random variable of the second category by using the deep sum product network model and according to the assigned value of the random variable of the first category; and a deduction module configured to deduce a cause of the current fault according to the marginal probability or the conditional probability of the random variable of the second category.
With reference to the second aspect, in an implementation manner of the second aspect, the training module is configured to generate a numerical matrix according to the multiple groups of values, where each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and each column of the numerical matrix is corresponding to one variable of the fault-related random variables; divide the numerical matrix into m×n first submatrices of an equal size, where both m and n are positive integers, and a sum of m and n is greater than or equal to 2; obtain m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and determine the deep sum product network model according to the m×n sum product network models.
With reference to the second aspect or any one of the foregoing implementation manners of the second aspect, in another implementation manner of the second aspect, the training module is configured to calculate a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and calculate a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
With reference to the second aspect or any one of the foregoing implementation manners of the second aspect, in another implementation manner of the second aspect, the training module is configured to calculate a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and calculate a product of the n intermediate sum product network models, to obtain the deep sum product network model.
With reference to the second aspect or any one of the foregoing implementation manners of the second aspect, in another implementation manner of the second aspect, the training module is configured to generate a target matrix that uses the m×n sum product network models as elements; and recursively split the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model, where the deep sum product network uses the m×n sum product network models as leaf nodes.
With reference to the second aspect or any one of the foregoing implementation manners of the second aspect, in another implementation manner of the second aspect, the extraction module is configured to discretize information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and/or extract information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
With reference to the second aspect or any one of the foregoing implementation manners of the second aspect, in another implementation manner of the second aspect, the structured data is data in a database, and the extraction module is configured to perform discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.
A deep sum product network model is a multi-layer non-linear probability model. This type of probability model features large scale, strong expressiveness, high efficiency in accurate deduction, and the like, and is mostly applied in an image processing field. In order to apply the deep sum product network model to fault diagnosis in a big-data network system, in the embodiments of the present application, random variables are first divided into a random variable of a first category and a random variable of a second category, and then fault information is extracted from various types of heterogeneous data, so as to assign a value to a random variable, thereby obtaining training sample data that meets a training requirement of the deep sum product network model. After the deep sum product network model is trained, a value is assigned to the random variable of the first category according to a symptom of a current fault of the network system, and then a marginal probability or a conditional probability of the random variable of the second category is deduced, thereby deducing a cause of the current fault of the network system. By using the foregoing manner, in the embodiments of the present application, the deep sum product network model is applied to a fault diagnosis process in the big-data network system, so as to resolve a problem that it is difficult to diagnose a fault of the big-data network system.
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly describes the accompanying drawings required for describing the embodiments of the present application. The accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
The following clearly describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. The described embodiments are a part rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative efforts shall fall within the protection scope of the present application.
For ease of understanding, first, a definition of a deep sum product network model and a probability deduction manner are briefly described.
A problem addressed by the deep sum product network model is as follows: for a group of random variables {X1, X2, . . . , Xp}, several observation samples are given, and a multi-variate probabilistic model of the observation samples is established, thereby deducing a marginal probability P(XA) or a conditional probability P(XA|XB), where both A and B are subsets of {1, . . . , p}.
The following briefly describes the deep sum product network model with reference to
A manner of deducing a deep sum product network is as follows.
Values are assigned to leaf nodes and are calculated from bottom to top, and a final probability value can be obtained when a value of a root node is calculated.
A marginal probability is calculated by setting the leaf values of every marginalized variable to 1, so that the variable is summed out during the bottom-up calculation.
A conditional probability is calculated by using a Bayes formula P(XA|XB)=P(XA, XB)/P(XB).
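The deduction manner above can be illustrated with a minimal sketch. The node classes, the helper names (`evaluate`, `bernoulli`, `conditional`), and the toy network weights are all assumptions made for illustration and do not come from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    var: str    # name of the random variable
    value: int  # value on which this indicator leaf fires


@dataclass
class Product:
    children: list


@dataclass
class Sum:
    weighted: list  # list of (weight, child) pairs


def evaluate(node, evidence):
    """Bottom-up evaluation; variables missing from `evidence` are marginalized."""
    if isinstance(node, Leaf):
        if node.var not in evidence:
            return 1.0  # marginalized variable: every indicator is set to 1
        return 1.0 if evidence[node.var] == node.value else 0.0
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= evaluate(child, evidence)
        return result
    return sum(weight * evaluate(child, evidence) for weight, child in node.weighted)


def bernoulli(var, p):
    """Sum node over the two indicators of a binary variable with P(var=1) = p."""
    return Sum([(p, Leaf(var, 1)), (1 - p, Leaf(var, 0))])


def conditional(root, query, given):
    """Bayes formula: P(query | given) = P(query, given) / P(given)."""
    return evaluate(root, {**query, **given}) / evaluate(root, given)


# Toy network (illustrative weights): a mixture of two product distributions.
spn = Sum([
    (0.6, Product([bernoulli("X1", 0.9), bernoulli("X2", 0.2)])),
    (0.4, Product([bernoulli("X1", 0.1), bernoulli("X2", 0.7)])),
])

marginal_x1 = evaluate(spn, {"X1": 1})                # approximately 0.58
cond_x1_x2 = conditional(spn, {"X1": 1}, {"X2": 1})   # approximately 0.34
```

Evaluating the root with an empty evidence set returns 1, which is a quick check that the toy network is normalized.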
The following briefly describes a process of training a deep sum product network model.
First, training sample data of a deep sum product network may be considered as a |T|×|V| matrix M, where T represents a sample set (where |T| represents a quantity of samples), and V represents a random variable set (where |V| represents a quantity of random variables).
A process of training a structure of the deep sum product network is shown in
In splitting in the variable dimension (that is, column splitting of the matrix M), an independence test (Independency test), or referred to as an independence hypothesis test, is performed on the random variable set, to split variables into independent sets, where each variable splitting is corresponding to one “product” node in the deep sum product network.
In splitting in the sample dimension (that is, row splitting of the matrix M), a mixture probabilistic model is trained, to divide a sample into different components, where each sample splitting is corresponding to one “sum” node in the deep sum product network.
The following is pseudo code of the deep sum product network training algorithm.
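As a rough illustration of the recursion described above (a sketch under stated assumptions, not the original pseudo code), the structure learner can be written with the independence test and the mixture model estimation passed in as callables. The names `learn_spn`, `leaf`, `split_variables`, and `split_samples`, and the nested-tuple representation, are hypothetical.

```python
from collections import Counter


def leaf(rows, var):
    """Univariate leaf: empirical distribution of one variable."""
    counts = Counter(row[var] for row in rows)
    return ("leaf", var, {value: n / len(rows) for value, n in counts.items()})


def learn_spn(rows, variables, split_variables, split_samples):
    """Recursively build a nested-tuple network from rows (a list of dicts).

    split_variables(rows, variables) -> list of variable groups (column split)
    split_samples(rows, variables)   -> list of row clusters (row split)
    Both callables are assumed to return partitions of their input.
    """
    if len(variables) == 1:
        return leaf(rows, variables[0])

    # Column split: independent variable groups become children of a product node.
    groups = split_variables(rows, variables)
    if len(groups) > 1:
        return ("product", [
            learn_spn(rows, group, split_variables, split_samples)
            for group in groups
        ])

    # Row split: sample clusters become weighted children of a sum node.
    clusters = [c for c in split_samples(rows, variables) if c]
    if len(clusters) > 1:
        return ("sum", [
            (len(c) / len(rows),
             learn_spn(c, variables, split_variables, split_samples))
            for c in clusters
        ])

    # Neither split succeeded: fall back to a fully factorized product of leaves.
    return ("product", [leaf(rows, v) for v in variables])
```

Each sum-node weight is the fraction of samples that falls into the corresponding cluster, which keeps the resulting network normalized.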
A definition, a probability deduction manner, and a training process of a deep sum product network are briefly described above. The following describes a fault diagnosis method for a big-data network system according to an embodiment of the present application in detail with reference to
410. Obtain historical data of the network system, where the historical data includes heterogeneous data, the heterogeneous data includes structured data and non-structured data, and the historical data includes fault information, which is used to describe a cause and a symptom of multiple faults of the network system.
Alternatively, the heterogeneous data includes at least two types of the following: structured data, non-structured data, or semi-structured data.
420. Obtain the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables, where one group of values of the fault-related random variables is used to indicate an association relationship between a symptom and a cause of one fault of the network system, and the fault-related random variables include a random variable of a first category and a random variable of a second category, where the random variable of the first category is used to represent a symptom of a fault of the network system, and the random variable of the second category is used to represent a cause of the fault of the network system.
430. Use the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model.
440. Assign a value to the random variable of the first category according to a symptom of a current fault of the network system.
450. Determine a marginal probability or a conditional probability of the random variable of the second category by using the deep sum product network model and according to the assigned value of the random variable of the first category.
460. Deduce a cause of the current fault according to the marginal probability or the conditional probability of the random variable of the second category.
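Steps 440 to 460 can be sketched as follows, assuming only that the trained model exposes a function returning the probability of a partial assignment. A small joint probability table stands in for the deep sum product network here, and the variable names (`high_latency`, `link_down`, `config_error`) and probabilities are hypothetical.

```python
def make_table_model(variables, table):
    """Stand-in model: `table` maps full assignments to probabilities."""
    def prob(partial):
        total = 0.0
        for values, p in table.items():
            full = dict(zip(variables, values))
            if all(full[name] == value for name, value in partial.items()):
                total += p
        return total
    return prob


def diagnose(prob, symptoms, cause_vars):
    """Rank candidate causes by P(cause = 1 | observed symptoms)."""
    evidence_p = prob(symptoms)  # step 440: symptom variables are assigned
    scores = {
        cause: prob({**symptoms, cause: 1}) / evidence_p  # step 450
        for cause in cause_vars
    }
    # Step 460: the most probable cause comes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


variables = ("high_latency", "link_down", "config_error")
prob = make_table_model(variables, {
    (1, 1, 0): 0.20, (1, 0, 1): 0.05, (1, 0, 0): 0.05,
    (0, 1, 0): 0.02, (0, 0, 1): 0.08, (0, 0, 0): 0.60,
})
ranking = diagnose(prob, {"high_latency": 1}, ["link_down", "config_error"])
# ranking[0] is ("link_down", about 0.667); config_error scores about 0.167
```

Because `diagnose` depends only on the `prob` callable, the table can be replaced by any model that answers marginal queries.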
A deep sum product network model is a multi-layer non-linear probability model. This type of probability model features large scale, strong expressiveness, high speed and accuracy, and the like, and is mostly applied in an image processing field. In order to apply the deep sum product network model to fault diagnosis in a big-data network system, in this embodiment of the present application, random variables are first divided into a random variable of a first category and a random variable of a second category, and then fault information is extracted from various types of heterogeneous data, so as to assign a value to a random variable, thereby obtaining training sample data that meets a training requirement of the deep sum product network model. After the deep sum product network model is trained, a value is assigned to the random variable of the first category according to a symptom of a current fault of the network system, and then a marginal probability or a conditional probability of the random variable of the second category is deduced, thereby deducing a cause of the current fault of the network system. By using the foregoing manner, in this embodiment of the present application, the deep sum product network model is applied to a fault diagnosis process in the big-data network system, so as to resolve a problem that it is difficult to diagnose a fault of the big-data network system.
It should be understood that, a network system may include structured data, non-structured data, and semi-structured data. In this embodiment of the present application, fault information needs to be first extracted from various types of historical data, so as to assign a value to a fault-related random variable. Manners of extracting information from data of different types are different.
Information in the structured field may be discretized according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and/or information may be extracted from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
Structured data, pure non-structured data, and semi-structured data are used as examples in the following and are separately described.
(1) Structured data: Generally, structured data is converted directly. Using a table in a relational database as an example, each row of the table corresponds to one instance, and each column of the table corresponds to a subset of the variables. Each column is discretized according to its value range and is thereby converted to several variables. For example, if the value range of a column (“Country”) is {“China”, “India”, “US”}, the column can be converted to three variables, which respectively mean “Country=China”, “Country=India”, and “Country=US”. For a specific instance whose “Country” value is “India”, the variables to which the column is converted take the values [0, 1, 0].
(2) Semi-structured data: Semi-structured data has no fixed structure, and is generally represented in a format such as Extensible Markup Language (XML) or JavaScript Object Notation (JSON). A structured field (such as an XML tag or a JSON key) is enumerable, and each field may be corresponding to one variable subset. If content of each field is also structured, the data is processed in accordance with the manner of processing structured data in (1); or if content of each field is non-structured, the data is processed in accordance with the manner of processing non-structured data in (3).
(3) Non-structured data: Non-structured data refers to unenumerable data with a variable length, typically, such as text, an image, or a video. Methods for processing non-structured data are relatively diversified. Using text data as an example, structured data related to an application (fault diagnosis) needs to be extracted from non-structured data by using an information extraction technology, which mainly includes named entity recognition, which extracts an entity word or phrase that occurs in text; keyword extraction, which extracts an important word and phrase from text; relationship extraction, which extracts a relationship between entities in text; and text categorization, which automatically maps text to a preconfigured categorization system.
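The column conversion described in (1) can be sketched as a one-hot encoding. The function name `one_hot_column` is an assumption; the "Country" values are the ones used in the example in (1).

```python
def one_hot_column(values, categories):
    """Convert one categorical column into len(categories) binary variables."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one variable is set per instance
        encoded.append(row)
    return encoded


countries = ["China", "India", "US"]
encoded = one_hot_column(["India", "US"], countries)
# encoded == [[0, 1, 0], [0, 0, 1]]
```

A value outside the declared value range raises a `KeyError`, which makes an unexpected category visible instead of silently encoding it as all zeros.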
The foregoing information extraction tasks need to be implemented by means of statistics collection, machine learning, data mining, or human assistance, according to the characteristics of the specific data. Typical technologies include, for named entity recognition, a conditional random field (CRF); for keyword extraction, term frequency-inverse document frequency (TF-IDF), a weighting technique commonly used in information retrieval and data mining; for relationship extraction, a bootstrapping method; and for text categorization, a classifier such as a support vector machine (SVM), a decision tree, or a neural network.
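As one possible illustration of TF-IDF keyword extraction, the following sketch scores whitespace-separated terms in a few log lines. The function name `tfidf_keywords`, the tokenization, and the sample log lines are assumptions; a production system would use a proper tokenizer and a much larger corpus.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=2):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results


logs = [
    "link down on port 7",
    "link flapping on port 7",
    "disk full on node 3",
]
keywords = tfidf_keywords(logs, top_k=1)
# keywords[0] == ["down"] and keywords[1] == ["flapping"]
```

Terms that occur in every document, such as "on", receive a score of zero and therefore never rank as keywords.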
The big-data network system has another characteristic: the data volume is large, and the symptoms displayed when faults occur on the network system and the causes of those faults are generally diverse. In addition, multiple types of heterogeneous data are involved, and many variables are derived from semi-structured data or non-structured data; therefore, the quantity of random variable dimensions is generally very large. As the scale of the training data increases, an existing single-machine training algorithm for a deep sum product network converges more slowly or cannot process the data at all.
Optionally, as an embodiment, step 430 may include generating a numerical matrix according to the multiple groups of values, where each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and each column of the numerical matrix is corresponding to one variable of the fault-related random variables; dividing the numerical matrix into m×n first submatrices of an equal size, where both m and n are positive integers, and a sum of m and n is greater than or equal to 2; obtaining m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and determining the deep sum product network model according to the m×n sum product network models.
This embodiment of the present application proposes a distributed training manner for a deep sum product network model, where a numerical matrix is divided into multiple submatrices, multiple sum product network models are obtained in a distributed training manner, and then the deep sum product network model is determined based on the multiple obtained sum product network models. The distributed training manner is more suitable for training a deep sum product network model for a big-data network system, which can effectively avoid a problem that an algorithm cannot perform calculation or cannot be converged because a sample size is too large or there are too many random variables.
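The division of the numerical matrix into m×n equal-size first submatrices can be sketched as follows. The function name `partition`, the seeded shuffle, and the requirement that the row and column counts be exactly divisible are assumptions made to keep the example short.

```python
import random


def partition(matrix, m, n, seed=0):
    """Randomly and equally divide a matrix into m x n blocks.

    Returns a dict mapping (i, j) to (row_ids, col_ids, block).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % m or n_cols % n:
        raise ValueError("row count must be divisible by m and column count by n")
    rng = random.Random(seed)
    row_ids = list(range(n_rows))
    col_ids = list(range(n_cols))
    rng.shuffle(row_ids)  # random assignment of samples to row groups
    rng.shuffle(col_ids)  # random assignment of variables to column groups
    row_step, col_step = n_rows // m, n_cols // n
    blocks = {}
    for i in range(m):
        rows = sorted(row_ids[i * row_step:(i + 1) * row_step])
        for j in range(n):
            cols = sorted(col_ids[j * col_step:(j + 1) * col_step])
            blocks[(i, j)] = (rows, cols, [[matrix[r][c] for c in cols] for r in rows])
    return blocks
```

Keeping the row and column indices with each block records which samples and which variables a submodel was trained on, which is needed when the submodels are later combined.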
It should be noted that, there may be multiple manners of determining the deep sum product network model according to the m×n sum product network models. The following describes the manners in detail with reference to specific examples.
Optionally, as an embodiment, the determining the deep sum product network model according to the m×n sum product network models may include calculating a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and calculating a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
Optionally, as an embodiment, the determining the deep sum product network model according to the m×n sum product network models may include calculating a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and calculating a product of the n intermediate sum product network models, to obtain the deep sum product network model.
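The two combination manners above can be sketched by treating each trained submodel as a callable that returns the probability of a partial assignment. The uniform sum weights 1/m are an assumption (the blocks have equal size); the helper `bernoulli_leaf` and the toy parameters are hypothetical.

```python
from math import prod


def bernoulli_leaf(var, p):
    """Toy submodel over one binary variable; an absent variable is marginalized."""
    def model(assignment):
        if var not in assignment:
            return 1.0
        return p if assignment[var] == 1 else 1.0 - p
    return model


def product_then_sum(models):
    """Multiply the models in each row, then average the m row products."""
    m = len(models)
    def combined(assignment):
        return sum(prod(model(assignment) for model in row) for row in models) / m
    return combined


def sum_then_product(models):
    """Average the models in each column, then multiply the n column mixtures."""
    m = len(models)
    def combined(assignment):
        return prod(
            sum(model(assignment) for model in column) / m
            for column in zip(*models)
        )
    return combined


# models[i][j] was trained on sample group i and variable group j.
models = [
    [bernoulli_leaf("symptom", 0.9), bernoulli_leaf("cause", 0.8)],
    [bernoulli_leaf("symptom", 0.1), bernoulli_leaf("cause", 0.2)],
]
query = {"symptom": 1, "cause": 1}
p_row_first = product_then_sum(models)(query)   # approximately 0.37
p_col_first = sum_then_product(models)(query)   # approximately 0.25
```

The two manners generally give different models: multiplying within a row first preserves the dependence between variable groups within each sample group, whereas summing within a column first treats the variable groups as independent.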
Optionally, as an embodiment, the determining the deep sum product network model according to the m×n sum product network models may include generating a target matrix that uses the m×n sum product network models as elements; and recursively splitting the target matrix based on an independence test of a random variable and mixture probabilistic model estimation (or referred to as mixture probabilistic model estimation of a sample), to obtain the deep sum product network model, where the deep sum product network uses the m×n sum product network models as leaf nodes.
As shown in
Step one: Equally and randomly divide the numerical matrix according to rows and columns into m×n first submatrices {Mij}, i=1, . . . , m; j=1, . . . , n.
Step two: Train a deep sum product network Sij with each Mij, where this process may be performed in parallel for the m×n {Mij}, and according to a scale of Mij, a manner of training the deep sum product network Sij may use the LearnSPNSingle algorithm mentioned above or a distributed algorithm LearnSPN mentioned in the following.
Step three: For the obtained m×n deep sum product networks {Sij}, use a LearnSPNProb algorithm in the following to learn a deep sum product network S, where a leaf node of S is a sum product network model.
It should be noted that the main difference between the LearnSPNProb algorithm and the LearnSPNSingle algorithm lies in the input: the latter takes a numerical matrix, whereas the former takes a target matrix of probability distributions. More precisely, each element of the target matrix is a sum product network model. For details, refer to
Independence Test [1]:
Independence of two groups of random variables is tested by calculating mutual information of the two groups of variables, where a manner of calculating the mutual information may be:
where $\{(x_j^{(l)}, x_k^{(l)})\}_{l=1}^{L}$ are samples from $P_i(x_j, x_k)$. These samples may be sampled from an initial numerical matrix or may be obtained through sampling by using a Monte Carlo method. When $I(x_j, x_k) < \varepsilon$, the two groups of random variables are considered independent, where $\varepsilon$ is a threshold given in advance.
An independence test is performed between any two groups of random variables, and an edge is added between variable groups that are not independent. If the resulting graph includes multiple connected components, the variables are divided according to those components; if the resulting graph includes only one connected component, the samples are divided by using the following mixture probabilistic model.
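The variable split described above can be sketched with an empirical mutual information estimate and a union-find pass over the dependency graph. This simplified version tests single variables pairwise rather than groups of variables, and the threshold value, function names, and sample columns are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sample columns."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


def independent_groups(columns, epsilon=0.05):
    """Group variables into connected components of the dependency graph."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for index, a in enumerate(names):
        for b in names[index + 1:]:
            # An edge joins two variables that fail the independence test.
            if mutual_information(columns[a], columns[b]) >= epsilon:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


columns = {
    "alarm":     [1, 1, 0, 0, 1, 1, 0, 0],
    "link_down": [1, 1, 0, 0, 1, 1, 0, 0],
    "fan_fault": [1, 0, 1, 0, 1, 0, 1, 0],
}
groups = independent_groups(columns)
# "alarm" and "link_down" share a component; "fan_fault" forms its own.
```

More than one returned group corresponds to a product node; a single group means the samples have to be split instead.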
Mixture Probabilistic Model [2]:
In the mixture probabilistic model, assuming that a probability includes K latent components:
where $D_{KL}(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ indicates a Kullback-Leibler (KL) divergence between distributions. A log-likelihood function of an entire sample is:
where the likelihood function includes a hidden variable z, and a standard Expectation-maximization (EM) algorithm may be used for optimization and solution. The sample can be divided into K components according to an optimization result, thereby implementing splitting from a sample dimension.
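The sample split by EM can be sketched as follows. This is a simplified stand-in, not the model of this application: it assumes binary numerical samples and a mixture of independent Bernoulli components instead of the KL-divergence-based components over distributions, and the function name, the deterministic initialisation, and the smoothing constants are all hypothetical choices.

```python
import math


def em_bernoulli_mixture(rows, k, iters=50):
    """Fit a k-component independent-Bernoulli mixture by EM; return hard labels."""
    n, d = len(rows), len(rows[0])
    weights = [1.0 / k] * k
    # Deterministic initialisation (assumption): spread parameters over (0, 1).
    theta = [[(c + 1) / (k + 1)] * d for c in range(k)]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each latent component z per sample.
        resp = []
        for x in rows:
            logp = []
            for c in range(k):
                lp = math.log(weights[c])
                for v, t in zip(x, theta[c]):
                    lp += math.log(t if v else 1.0 - t)
                logp.append(lp)
            mx = max(logp)
            ps = [math.exp(l - mx) for l in logp]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights and parameters (lightly smoothed).
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = (nc + 1e-3) / (n + k * 1e-3)
            theta[c] = [
                (sum(r[c] * x[j] for r, x in zip(resp, rows)) + 1e-3) / (nc + 2e-3)
                for j in range(d)
            ]
    # Assign each sample to its most probable component.
    return [max(range(k), key=lambda c: r[c]) for r in resp]


# Hypothetical samples: the first two variables separate two clusters.
samples = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1), (0, 0, 0)]
labels = em_bernoulli_mixture(samples, k=2)
```

The returned labels partition the rows into K groups, which corresponds to the split along the sample dimension.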
The input of the distributed algorithm LearnSPN for training a deep sum product network is a |T|×|V| numerical matrix M, where V represents a variable set, and T represents a sample set. The output of the algorithm is a deep sum product network. The algorithm equally and randomly divides M into m×n parts of training data (where each part corresponds to one first submatrix in the foregoing description): {(Mij, Ti, Vj), i=1, ..., m, j=1, ..., n}, evenly distributes the m×n parts of the training data to K machines, and performs parallel calculation. When the scale of Mij is relatively small, the single-machine training algorithm LearnSPNSingle may be used for training; when the scale of Mij is relatively large, the distributed algorithm LearnSPN may be applied again to Mij. After calculation is complete, m×n sum product networks {Sij, i=1, ..., m, j=1, ..., n} are obtained. Then, the LearnSPNProb algorithm for the target matrix is used to construct the final deep sum product network S, where the leaf nodes of S are the sum product networks {Sij}.
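The division of M into m×n parts and their placement on K machines can be sketched as follows. The sketch assumes that the matrix is held as a list of rows, that "equally and randomly" means shuffling the row and column indices before dealing them out, and that placement is round-robin; `split_matrix` and `assign_to_machines` are hypothetical names.

```python
import random


def split_matrix(matrix, m, n, seed=0):
    """Randomly divide a matrix into m x n blocks keyed by (i, j)."""
    rng = random.Random(seed)
    row_ids = list(range(len(matrix)))
    col_ids = list(range(len(matrix[0])))
    rng.shuffle(row_ids)
    rng.shuffle(col_ids)
    # Deal the shuffled indices into m sample groups Ti and n variable groups Vj.
    row_groups = [sorted(row_ids[i::m]) for i in range(m)]
    col_groups = [sorted(col_ids[j::n]) for j in range(n)]
    blocks = {}
    for i, rows in enumerate(row_groups):
        for j, cols in enumerate(col_groups):
            sub = [[matrix[r][c] for c in cols] for r in rows]
            blocks[(i, j)] = (sub, rows, cols)  # (Mij, Ti, Vj)
    return blocks


def assign_to_machines(blocks, k):
    """Round-robin placement of block keys on k workers (an assumed policy)."""
    machines = [[] for _ in range(k)]
    for idx, key in enumerate(sorted(blocks)):
        machines[idx % k].append(key)
    return machines


# Hypothetical 4 x 6 matrix whose entry encodes its own position.
M = [[10 * r + c for c in range(6)] for r in range(4)]
blocks = split_matrix(M, m=2, n=3)
machines = assign_to_machines(blocks, k=4)
```

Each block keeps its row and column index sets, so the trained models Sij can later be tied back to the sample group Ti and the variable group Vj.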
The following is pseudo code of the LearnSPN algorithm and the LearnSPNProb algorithm.
The following describes the embodiments of the present application in more detail with reference to a specific example. It should be noted that, the example in
Referring to
Step one: Collect historical data (which is heterogeneous data and records fault information) from a network system, and process the data to form training sample data (which is a numerical matrix).
Step two: Input the training sample to a deep sum product network model, and perform distributed learning, to obtain the deep sum product network model.
Step three: Collect, from the system, information that includes a fault symptom, so as to assign a value to a random variable of a first category, and deduce a conditional probability or a marginal probability of a random variable of a second category by using the deep sum product network model obtained through training, thereby learning a cause of a current fault of the network system.
For example, the random variable of the second category includes three variables, which respectively indicate three causes that may lead to the fault of the network system; and based on the deduced conditional probability, a cause corresponding to a random variable with a highest conditional probability is used as the cause of the current fault of the network system.
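The inference in step three can be illustrated with a very small sum product network. This is a sketch under stated assumptions: the network structure, the variable names, and all probabilities are hypothetical, and a trained model would be much larger. A leaf whose variable is not assigned evaluates to 1, which marginalizes that variable out.

```python
class Leaf:
    """Univariate distribution over one discrete variable."""

    def __init__(self, var, probs):
        self.var, self.probs = var, probs

    def value(self, evidence):
        # An unassigned variable is marginalized out: its leaf sums to 1.
        if self.var not in evidence:
            return 1.0
        return self.probs[evidence[self.var]]


class Product:
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result


class Sum:
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children

    def value(self, evidence):
        return sum(w * child.value(evidence) for w, child in self.weighted_children)


def cause_posterior(spn, symptoms, cause_var, causes):
    """P(cause | symptoms) as a ratio of two bottom-up evaluations."""
    p_symptoms = spn.value(symptoms)
    return {c: spn.value({**symptoms, cause_var: c}) / p_symptoms for c in causes}


# Hypothetical model: one symptom variable and one cause variable with three values.
spn = Sum([
    (0.5, Product([
        Leaf("packet_loss", {"high": 0.9, "low": 0.1}),
        Leaf("cause", {"link": 0.8, "config": 0.1, "overload": 0.1}),
    ])),
    (0.5, Product([
        Leaf("packet_loss", {"high": 0.2, "low": 0.8}),
        Leaf("cause", {"link": 0.1, "config": 0.6, "overload": 0.3}),
    ])),
])
posterior = cause_posterior(spn, {"packet_loss": "high"}, "cause", ["link", "config", "overload"])
best_cause = max(posterior, key=posterior.get)
```

The cause with the highest conditional probability, here `best_cause`, is the one reported as the cause of the current fault.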
The foregoing describes the fault diagnosis method for the big-data network system according to the embodiments of the present application in detail with reference to
A deep sum product network model is a multi-layer non-linear probability model. This type of probability model features large scale, strong expressiveness, high speed and accuracy, and the like, and is mostly applied in an image processing field. In order to apply the deep sum product network model to fault diagnosis in a big-data network system, in this embodiment of the present application, random variables are first divided into a random variable of a first category and a random variable of a second category, and then fault information is extracted from various types of heterogeneous data, so as to assign a value to a random variable, thereby obtaining training sample data that meets a training requirement of the deep sum product network model. After the deep sum product network model is trained, a value is assigned to the random variable of the first category according to a symptom of a current fault of the network system, and then a marginal probability or a conditional probability of the random variable of the second category is deduced, thereby deducing a cause of the current fault of the network system. By using the foregoing manner, in this embodiment of the present application, the deep sum product network model is applied to a fault diagnosis process in the big-data network system, so as to resolve a problem that it is difficult to diagnose a fault of the big-data network system.
Optionally, as an embodiment, the training module 730 may be configured to generate a numerical matrix according to the multiple groups of values, where each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and each column of the numerical matrix is corresponding to one variable of the fault-related random variables; divide the numerical matrix into m×n first submatrices of an equal size, where both m and n are positive integers, and a sum of m and n is greater than or equal to 2; obtain m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and determine the deep sum product network model according to the m×n sum product network models.
Optionally, as an embodiment, the training module 730 may be configured to calculate a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and calculate a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
Optionally, as an embodiment, the training module 730 may be configured to calculate a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and calculate a product of the n intermediate sum product network models, to obtain the deep sum product network model.
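The two combination manners above (product within a row followed by a sum, and sum within a column followed by a product) can be sketched as follows. The sketch assumes that each trained model Sij is available as a callable returning a probability for an assignment of its variable group Vj, and that the sum uses equal weights 1/m unless weights are given; the description does not specify the weights, so this is an assumption, and the function names are hypothetical.

```python
def product_then_sum(models, weights=None):
    """Multiply the models in each row, then take a weighted sum over the m rows."""
    m = len(models)
    weights = weights or [1.0 / m] * m

    def density(parts):  # parts[j] is the assignment of variable group Vj
        total = 0.0
        for w, row in zip(weights, models):
            p = 1.0
            for model, part in zip(row, parts):
                p *= model(part)
            total += w * p
        return total

    return density


def sum_then_product(models, weights=None):
    """Take a weighted sum over each column, then multiply the n column results."""
    m, n = len(models), len(models[0])
    weights = weights or [1.0 / m] * m

    def density(parts):
        p = 1.0
        for j in range(n):
            p *= sum(w * models[i][j](parts[j]) for i, w in enumerate(weights))
        return p

    return density


# Hypothetical 2 x 2 grid of block models, each over one binary variable.
models = [
    [{0: 0.9, 1: 0.1}.get, {0: 0.8, 1: 0.2}.get],
    [{0: 0.3, 1: 0.7}.get, {0: 0.4, 1: 0.6}.get],
]
row_model = product_then_sum(models)
col_model = sum_then_product(models)
```

Both results are normalized distributions, but they generally differ: the first keeps the dependence between variable groups within each sample group, whereas the second treats the variable groups as independent.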
Optionally, as an embodiment, the training module 730 may be configured to generate a target matrix that uses the m×n sum product network models as elements; and recursively split the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model, where the deep sum product network uses the m×n sum product network models as leaf nodes.
Optionally, as an embodiment, the extraction module 720 may be configured to discretize information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and/or extract information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
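As a rough illustration of keyword extraction from non-structured data, the following sketch maps free-text log content to binary values of fault-related random variables. The keyword table, the variable names, and the plain substring matching are hypothetical simplifications; named-entity recognition, relationship extraction, or text categorization would require a trained model.

```python
import re

# Hypothetical mapping from random variables to indicative keywords.
KEYWORDS = {
    "cause_link_failure": ("link down", "fiber cut"),
    "symptom_packet_loss": ("packet loss", "dropped packets"),
}


def extract_variables(text):
    """Return 1 for each variable whose keywords appear in the text, else 0."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        var: int(any(keyword in normalized for keyword in keywords))
        for var, keywords in KEYWORDS.items()
    }


ticket = extract_variables("Severe PACKET  loss observed after fiber cut near site A")
```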
Optionally, as an embodiment, the structured data is data in a database, and the extraction module 720 may be configured to perform discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.
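Column-by-column discretization according to a value range can be sketched as follows. Equal-width binning over the observed minimum and maximum of each column is an assumption made for illustration; the description only requires that the value range of the field be used, and the function names and the sample fields are hypothetical.

```python
def discretize_column(values, bins):
    """Map numeric values to bin indices 0..bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: a single bin
    width = (hi - lo) / bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def discretize_table(rows, bins):
    """Discretize every column of a table independently."""
    columns = list(zip(*rows))
    coded = [discretize_column(col, bins) for col in columns]
    return [list(r) for r in zip(*coded)]


# Hypothetical records: (CPU load in percent, latency in milliseconds).
records = [(0.0, 10), (5.0, 20), (10.0, 40)]
coded = discretize_table(records, bins=2)
```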
A deep sum product network model is a multi-layer non-linear probability model. This type of probability model features large scale, strong expressiveness, high speed and accuracy, and the like, and is mostly applied in an image processing field. In order to apply the deep sum product network model to fault diagnosis in a big-data network system, in this embodiment of the present application, random variables are first divided into a random variable of a first category and a random variable of a second category, and then fault information is extracted from various types of heterogeneous data, so as to assign a value to a random variable, thereby obtaining training sample data that meets a training requirement of the deep sum product network model. After the deep sum product network model is trained, a value is assigned to the random variable of the first category according to a symptom of a current fault of the network system, and then a marginal probability or a conditional probability of the random variable of the second category is deduced, thereby deducing a cause of the current fault of the network system. By using the foregoing manner, in this embodiment of the present application, the deep sum product network model is applied to a fault diagnosis process in the big-data network system, so as to resolve a problem that it is difficult to diagnose a fault of the big-data network system.
Optionally, as an embodiment, the processor 820 may be configured to generate a numerical matrix according to the multiple groups of values, where each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and each column of the numerical matrix is corresponding to one variable of the fault-related random variables; divide the numerical matrix into m×n first submatrices of an equal size, where both m and n are positive integers, and a sum of m and n is greater than or equal to 2; obtain m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and determine the deep sum product network model according to the m×n sum product network models.
Optionally, as an embodiment, the processor 820 may be configured to calculate a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and calculate a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
Optionally, as an embodiment, the processor 820 may be configured to calculate a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and calculate a product of the n intermediate sum product network models, to obtain the deep sum product network model.
Optionally, as an embodiment, the processor 820 may be configured to generate a target matrix that uses the m×n sum product network models as elements; and recursively split the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model, where the deep sum product network uses the m×n sum product network models as leaf nodes.
Optionally, as an embodiment, the processor 820 may be configured to discretize information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and/or extract information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
Optionally, as an embodiment, the structured data is data in a database, and the processor 820 may be configured to perform discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present application.
It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, or some of the technical solutions, may essentially be implemented in a form of a software product. The software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementation manners of the present application, but are not intended to limit the protection scope of the present application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims
1. A fault diagnosis method for a network system, comprising:
- obtaining historical data of the network system, wherein the historical data is heterogeneous data, wherein the heterogeneous data comprises structured data and non-structured data, wherein the historical data comprises fault information, and wherein the fault information is used to describe a cause and a symptom of multiple faults of the network system;
- obtaining the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables, wherein one group of values of the fault-related random variables is used to indicate an association relationship between a symptom and a cause of one fault of the network system, wherein the fault-related random variables comprise a random variable of a first category and a random variable of a second category, wherein the random variable of the first category is used to represent a symptom of a fault of the network system, and wherein the random variable of the second category is used to represent a cause of the fault of the network system;
- using the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model;
- assigning a value to the random variable of the first category according to a symptom of a current fault of the network system;
- determining a marginal probability or a conditional probability of the random variable of the second category by using the deep sum product network model and according to the assigned value of the random variable of the first category; and
- deducing a cause of the current fault according to the marginal probability or the conditional probability of the random variable of the second category.
2. The method according to claim 1, wherein using the multiple groups of values of the fault-related random variables as training sample data, to train the deep sum product network model comprises:
- generating a numerical matrix according to the multiple groups of values, wherein each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and wherein each column of the numerical matrix is corresponding to one variable of the fault-related random variables;
- dividing the numerical matrix into m×n first submatrices of an equal size, wherein both m and n are positive integers, and wherein a sum of m and n is greater than or equal to 2;
- obtaining m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and
- determining the deep sum product network model according to the m×n sum product network models.
3. The method according to claim 2, wherein determining the deep sum product network model according to the m×n sum product network models comprises:
- calculating a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and
- calculating a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
4. The method according to claim 2, wherein determining the deep sum product network model according to the m×n sum product network models comprises:
- calculating a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and
- calculating a product of the n intermediate sum product network models, to obtain the deep sum product network model.
5. The method according to claim 2, wherein determining the deep sum product network model according to the m×n sum product network models comprises:
- generating a target matrix that uses the m×n sum product network models as elements; and
- recursively splitting the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model,
- wherein the deep sum product network uses the m×n sum product network models as leaf nodes.
6. The method according to claim 1, wherein obtaining the fault information from the structured field of the structured data and data content of the non-structured data, to determine the multiple groups of values of fault-related random variables comprises discretizing information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables.
7. The method according to claim 1, wherein obtaining the fault information from the structured field of the structured data and data content of the non-structured data, to determine the multiple groups of values of fault-related random variables comprises extracting information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
8. The method according to claim 1, wherein obtaining the fault information from the structured field of the structured data and data content of the non-structured data, to determine the multiple groups of values of fault-related random variables comprises:
- discretizing information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and
- extracting information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
9. The method according to claim 6, wherein the structured data is data in a database, and wherein discretizing information in the structured field according to the value range of the structured field in the structured data, to determine the values of the fault-related random variables comprises performing discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.
10. A fault diagnosis apparatus for a network system, comprising:
- a non-transitory computer readable medium having instructions stored thereon; and
- a computer processor coupled to the non-transitory computer readable medium and configured to execute the instructions to: obtain historical data of the network system, wherein the historical data is heterogeneous data, wherein the heterogeneous data comprises structured data and non-structured data, and wherein the historical data comprises fault information, which is used to describe a cause and a symptom of multiple faults of the network system; obtain the fault information from a structured field of the structured data and data content of the non-structured data, to determine multiple groups of values of fault-related random variables, wherein one group of values of the fault-related random variables is used to indicate an association relationship between a symptom and a cause of one fault of the network system, wherein the fault-related random variables comprise a random variable of a first category and a random variable of a second category, wherein the random variable of the first category is used to represent a symptom of a fault of the network system, and wherein the random variable of the second category is used to represent a cause of the fault of the network system; use the multiple groups of values of the fault-related random variables as training sample data, to train a deep sum product network model; assign a value to the random variable of the first category according to a symptom of a current fault of the network system; determine a marginal probability or a conditional probability of the random variable of the second category by using the deep sum product network model and according to the assigned value of the random variable of the first category; and deduce a cause of the current fault according to the marginal probability or the conditional probability of the random variable of the second category.
11. The apparatus according to claim 10, wherein the computer processor is further configured to execute the instructions to:
- generate a numerical matrix according to the multiple groups of values, wherein each row of the numerical matrix is corresponding to one fault of the multiple faults of the network system, and wherein each column of the numerical matrix is corresponding to one variable of the fault-related random variables;
- divide the numerical matrix into m×n first submatrices of an equal size, wherein both m and n are positive integers, and wherein a sum of m and n is greater than or equal to 2;
- obtain m×n sum product network models in a distributed training manner and according to the m×n first submatrices; and
- determine the deep sum product network model according to the m×n sum product network models.
12. The apparatus according to claim 11, wherein the computer processor is further configured to execute the instructions to:
- calculate a product of sum product network models obtained by training first submatrices that are located in a same row in the m×n first submatrices, to obtain m intermediate sum product network models; and
- calculate a sum of the m intermediate sum product network models, to obtain the deep sum product network model.
13. The apparatus according to claim 11, wherein the computer processor is further configured to execute the instructions to:
- calculate a sum of sum product network models obtained by training first submatrices that are located in a same column in the m×n first submatrices, to obtain n intermediate sum product network models; and
- calculate a product of the n intermediate sum product network models, to obtain the deep sum product network model.
14. The apparatus according to claim 11, wherein the computer processor is further configured to execute the instructions to:
- generate a target matrix that uses the m×n sum product network models as elements; and
- recursively split the target matrix based on an independence test of a random variable and mixture probabilistic model estimation, to obtain the deep sum product network model,
- wherein the deep sum product network uses the m×n sum product network models as leaf nodes.
15. The apparatus according to claim 10, wherein the computer processor is further configured to discretize information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables.
16. The apparatus according to claim 10, wherein the computer processor is further configured to extract information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
17. The apparatus according to claim 10, wherein the computer processor is further configured to:
- discretize information in the structured field according to a value range of the structured field in the structured data, to determine the values of the fault-related random variables; and
- extract information from the data content of the non-structured data by using at least one of a named-entity recognition algorithm, a keyword extraction algorithm, a relationship extraction algorithm, or a text categorization algorithm, to determine the values of the fault-related random variables.
18. The apparatus according to claim 15, wherein the structured data is data in a database, and wherein the computer processor is further configured to perform discretization column by column in the database according to a value range of a field corresponding to each column in the database, to obtain the values of the fault-related random variables.