Lagrangian support vector machine
A Lagrangian support vector machine solves problems having massive data sets (e.g., millions of sample points) by defining an input matrix representing a set of data having an input space with a dimension of n that corresponds to a number of features associated with the data set, generating a support vector machine to solve a system of linear equations corresponding to the input matrix, the system of linear equations being defined by a positive definite matrix, and calculating a separating surface with the support vector machine to divide the set of data into two subsets of data.
This invention was made with United States government support awarded by the following agencies: DODAF F-49620-00-1-0085. The United States has certain rights in this invention.
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to support vector machines for separating data based on multiple characteristics. More particularly, it is directed to an apparatus and method for classifying millions of data points into separate classes using a linear or nonlinear separator using a Lagrangian support vector machine.
2. Discussion of the Prior Art
Support vector machines are powerful tools for data classification and are often used for data mining operations. Classification is based on identifying a linear or nonlinear separating surface to discriminate between elements of an extremely large data set containing millions of sample points by tagging each of the sample points with a tag determined by the separating surface. The separating surface depends only on a subset of the original data. This subset of data, which is all that is needed to generate the separating surface, constitutes the set of support vectors. Mathematically, support vectors are data points corresponding to constraints with positive multipliers in a constrained optimization formulation of a support vector machine.
Support vector machines have been used by medical institutions in making diagnostic and prognostic decisions as well as by financial institutions making credit and fraud detection decisions. For example, support vector machines are used to classify breast cancer patients using a criterion that is closely related to the decision whether a patient is prescribed to have chemotherapy treatment or not. This criterion is the presence of metastasized lymph nodes (node-positive) or their absence (node-negative).
By using a linear support vector machine, a number of available features are selected to classify patients into node-positive and node-negative patients. The features constituting the n-dimensional space in which the separation is accomplished are the mean, standard error and maximum value of a number of cytological nuclear measurements of size, shape and texture taken from a patient's breast, along with the tumor size. A subset of the features is then used in a nonlinear support vector machine to classify the entire set of patients into three prognosis groups: good (node-negative), intermediate (1 to 4 metastasized lymph nodes) and poor (more than 4 metastasized lymph nodes). The classification method is used to assign new patients to one of the three prognostic groups with an associated survival curve and a possible indication of whether chemotherapy is advisable.
This classification and data mining process, however, is extremely resource intensive, slow and expensive given current classification tools. To separate the millions of sample points into different data sets, costly linear and quadratic programming solvers are often required, and these solvers are complicated and cost prohibitive. Unfortunately, they are also very slow in processing and classifying the sample points.
What is needed, therefore, is an apparatus and method for simply and quickly solving problems with millions of sample points using standard tools, thereby eliminating the need for complicated and costly optimization tools. This apparatus and method would need to be based on a simple reformulation of the problem (e.g., an implicit Lagrangian formulation of the dual of a simple reformulation of the standard quadratic program of a linear support vector machine). This reformulation would thereby minimize an unconstrained differentiable convex function in an m-dimensional space, where m is the number of points to be classified in a given n-dimensional input space. The necessary optimality condition for the unconstrained minimization problem would therefore be transformed into a simple symmetric positive definite complementarity problem, thereby significantly reducing the computational resources necessary to classify the data.
SUMMARY OF THE INVENTION
The present invention is directed to an apparatus and method for classifying data comprising the steps of defining an input matrix representing a set of data having an input space with a dimension of n, wherein n corresponds to a number of features associated with a data set, generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix, and calculating a separating surface with the support vector machine to divide the set of data into a plurality of subsets of data.
According to another aspect of the preferred embodiment of the present invention, a method of classifying data comprises the steps of defining an input matrix representing a set of data having an input space with a dimension of n, wherein n corresponds to a number of features associated with a data set, generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix, and calculating a linear separating surface with the support vector machine to divide the set of data into a plurality of subsets of data.
According to another aspect of the invention, a method of classifying data comprises the steps of defining an input matrix representing a set of data having an input space with a dimension of n, wherein n corresponds to a number of features associated with a data set, generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix, and calculating a nonlinear separating surface with the support vector machine to divide the set of data into a plurality of subsets of data.
According to yet a further aspect of the preferred embodiment of the present invention, a method of determining a separating surface between features of a data set comprises the steps of defining an input matrix A representing the data set having an input space with a dimension of n, wherein n corresponds to a number of the features associated with the data set, constructing a support vector machine to define the separating surface by solving a system of linear equations corresponding to the input matrix A, wherein the system of linear equations is defined by a positive definite matrix with a dimension equal to (n+1), and dividing the data set into a plurality of subsets of data based on the separating surface calculated by the support vector machine.
According to yet another aspect of the preferred embodiment of the present invention, a support vector machine includes an input module that generates an input matrix representing a set of data having an input space with a dimension of n, wherein n corresponds to a number of features associated with a data set, a processor that receives an input signal from the input module representing the input matrix, wherein the processor calculates an output signal representing a solution to a system of linear equations corresponding to the input signal, and the system of linear equations is defined by a positive definite matrix, and an output module that divides the set of data into a plurality of subsets of data based on the output signal from the processor that corresponds to a separating surface between the plurality of subsets of data.
According to yet another aspect of the preferred embodiment of the present invention, a method of classifying patients comprises the steps of defining an input matrix representing a set of patient data having an input space with a dimension of n, wherein n corresponds to a number of features associated with each patient in the set of patient data, generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix, and calculating a separating surface with the support vector machine to divide the set of patient data into a plurality of subsets of data.
These and other objects, features, and advantages of the invention will become apparent to those skilled in the art from the following detailed description and the accompanying drawings. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the present invention, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the present invention without departing from the spirit thereof, and the invention includes all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
A preferred exemplary embodiment of the invention is illustrated in the accompanying drawings, in which like reference numerals represent like parts throughout.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In particular, Lagrangian support vector machine 10 includes an input module 12, a Lagrangian support vector processor 14 and an output module 16. Input module 12 receives a data set 18 via a bus 20 and generates an input matrix representing data set 18. The input matrix has an input space with a dimension of n corresponding to the number of features associated with data set 18.
Processor 14 receives an input signal transmitted from input module 12 via a bus 22 representing the input matrix and calculates an output signal representing a solution to a system of linear equations corresponding to the input signal. A positive definite matrix defines the system of linear equations. Output module 16 receives the output signal via a bus 24 and generates a separating surface 26 to divide the set of data into two subsets of data based on the output signal from processor 14. Separating surface 26 is a linear or nonlinear surface.
LSVM processor 14 constructs the separating surface 26 by using each sample point in the set of data based on an implicit Lagrangian formulation of the dual of a simple reformulation of the standard quadratic program of a linear support vector machine. This leads to the minimization of an unconstrained differentiable convex function in an m-dimensional space, where m is the number of points to be classified in a given n-dimensional input space. The necessary optimality condition for this unconstrained minimization problem is transformed into a simple symmetric positive definite complementarity problem.
In a step 28, an input matrix is defined representing the set of data having an input space with a dimension of n corresponding to the number of features associated with the data set. Thereafter, in a step 30, support vector machine 10 is generated and, in a step 32, solves the system of linear equations corresponding to the input matrix. Machine 10 then calculates the separating surface in a step 34 and uses it to classify new data in a step 36.
All vectors described in this specification are column vectors unless transposed to a row vector by a prime ′. For a vector x in the n-dimensional real space Rn, x+ denotes the vector in Rn with all of its negative components set to zero. This corresponds to projecting x onto the nonnegative orthant. The base of the natural logarithms is denoted by ε, and for a vector y in Rm, ε−y denotes a vector in Rm with components ε−yi, i=1, . . . , m.
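For illustration, the projection x+ is a single vectorized MATLAB™ statement; the helper name pl is an assumption of this sketch, not part of the original disclosure:

```matlab
% Plus function (x)_+ : set every negative component of x to zero,
% i.e., project x onto the nonnegative orthant.
pl = @(x) max(x, 0);   % equivalently, (abs(x) + x)/2
```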
A fundamental tool used throughout is the Sherman-Morrison-Woodbury identity:
(I/v+HH′)−1=v(I−H(I/v+H′H)−1H′), (1)
where v is a positive number, I is an identity matrix, and H is an arbitrary m×k matrix. This identity, easily verifiable by premultiplying both sides by (I/v+HH′), enables inversion of a large m×m matrix by merely inverting a smaller k×k matrix.
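As a minimal sketch of how identity (1) is exploited (the sizes and variable names here are illustrative assumptions), a product of the form (I/v+HH′)−1·rhs is computed by solving only a k×k system:

```matlab
% Apply (I/v + H*H')^{-1} to a vector using the SMW identity (1):
% (I/v + H*H')^{-1} = v*(I - H*(I/v + H'*H)^{-1}*H'),
% so only the small k x k matrix I/v + H'*H is ever factored.
m = 100000; k = 10; v = 0.1;
H = randn(m, k);                        % arbitrary m x k matrix, k << m
rhs = ones(m, 1);
small = eye(k)/v + H'*H;                % k x k
y = v * (rhs - H * (small \ (H'*rhs))); % y = (I/v + H*H')^{-1} * rhs
```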
Processor 14 classifies m points in the n-dimensional real space Rn, represented by the m×n matrix A, according to membership of each point Ai in the class A+ or A− as specified by a given m×m diagonal matrix D with plus ones or minus ones along its diagonal. For this problem, a standard support vector machine with a linear kernel is given by the following quadratic program with parameter v>0:
min(w,γ,y)∈Rn+1+m ve′y+(1/2)w′w subject to D(Aw−eγ)+y≧e, y≧0, (2)
wherein e denotes a vector of ones,
wherein w is the normal to the bounding planes 40:
x′w=γ±1 (3)
and γ determines their location relative to the origin. The separating surface 44 is the plane:
x′w=γ, (4)
midway between the bounding planes 40. The quadratic term in (2) is twice the reciprocal of the square of the 2-norm distance 2/∥w∥2 between the two bounding planes 40. When the two classes are not strictly linearly separable, the bounding planes 40 bound the two classes with a soft margin determined by the nonnegative error variable y:
Aiw+yi≧γ+1, for Dii=1,
Aiw−yi≦γ−1, for Dii=−1. (5)
Traditionally the 1-norm of the error variable y is minimized parametrically with weight v in (2), resulting in an approximate separation of the two classes. The dual of the quadratic program (2) is:
minu∈Rm (1/2)u′DAA′Du−e′u subject to e′Du=0, 0≦u≦ve. (6)
The variables (w,γ) of the primal problem (2) which determine separating surface 44 are obtained from the solution of the dual problem (6). In this regard, the matrix DAA′D appearing in the dual objective function (6) is not positive definite in general because typically m>>n. Also, an equality constraint is present in addition to bound constraints, which for large problems necessitates special computational procedures. Furthermore, a one-dimensional optimization problem must be solved in order to determine the location γ of separating surface 44.
In order to overcome all of these difficulties, as well as the necessity of essentially inverting a very large matrix of the order of m×m, the preferred embodiment of the present invention includes critical modifications to the standard support vector machine formulation.
Lagrangian Support Vector Machine
In the preferred embodiment of the present invention, Lagrangian support vector machine 10 is generated by changing the 1-norm of y to a 2-norm squared, which makes the constraint y≧0 redundant. The term γ2 is also appended to w′w, thereby maximizing margin 38 between the parallel bounding planes 40 with respect to both w and γ (e.g., with respect to both orientation and location of the planes, rather than just with respect to w, which merely determines the orientation of the plane). Therefore, Lagrangian support vector machine 10 in the present invention is defined by:
min(w,γ,y)∈Rn+1+m (v/2)y′y+(1/2)(w′w+γ2) subject to D(Aw−eγ)+y≧e, (7)
for which the dual is:
min0≦u∈Rm (1/2)u′(I/v+D(AA′+ee′)D)u−e′u. (8)
The variables (w,γ) of the primal problem which determine separating surface 44 are recovered directly from the solution of the dual (8) above by the relations:
w=A′Du, γ=−e′Du. (9)
The matrix appearing in the dual objective function is positive definite, and there is no equality constraint and no upper bound on the dual variable u. The only constraint present is a nonnegativity constraint. Based on these facts, Lagrangian support vector processor 14 implements a simple iterative method to solve the dual problem.
The following two matrices are defined to simplify notation:
H=D[A −e], Q=I/v+HH′. (10)
With these definitions, the dual problem (8) becomes:
min0≦u∈Rm f(u):=(1/2)u′Qu−e′u. (11)
The single time that Q−1 is computed at the outset of the method, the Sherman-Morrison-Woodbury identity (1) is used, so that only an (n+1)×(n+1) matrix is inverted rather than the much larger m×m matrix Q.
Necessary and sufficient optimality conditions for the dual problem (11) are:
0≦u⊥Qu−e≧0. (12)
Therefore, by using an established identity between any two real numbers (or vectors) a and b:
0≦a⊥b≧0⇔a=(a−αb)+, α>0, (13)
the optimality condition (12) can then be written in the following equivalent form for any positive α:
Qu−e=((Qu−e)−αu)+. (14)
These optimality conditions lead to processor 14 implementing the following simple iterative scheme:
ui+1=Q−1(e+((Qui−e)−αui)+),i=0,1, . . . , (15)
for which global linear convergence is established from any starting point under the condition:
0<α<2/v. (16)
In the preferred embodiment of the present invention, this condition is implemented as α=1.9/v, wherein v is the parameter of the SVM formulation (7). As a result, the optimality condition (14) is also the necessary and sufficient condition for the unconstrained minimum of the implicit Lagrangian associated with the dual problem (11).
Processor 14 sets the gradient with respect to u of this convex and differentiable Lagrangian to zero, yielding:
(αI−Q)((Qu−e)−((Q−αI)u−e)+)=0 (19)
which is equivalent to the optimality condition (14) under the assumption that α is positive and not an eigenvalue of Q.
Lagrangian support vector machine 10 achieves global linear convergence of iteration (15) under condition (16) given as follows:
Let Q in Rm×m be the symmetric positive definite matrix defined by (10) and let (16) hold. Starting with an arbitrary u0∈Rm, the iterates ui of (15) converge to the unique solution ū of (11) at the linear rate:
∥Qui+1−Qū∥≦∥I−αQ−1∥·∥Qui−Qū∥. (20)
In the preferred embodiment of the present invention, LSVM processor 14 generates the linear and nonlinear separators by implementing the iterative scheme (15) described above.
For example, by implementing (15) with standard MATLAB™ commands, processor 14 solves problems with millions of sample points without any specialized optimization package, as sketched below. The input parameters, besides A, D and v of (10), which define the problem, are itmax, the maximum number of iterations, and tol, the tolerated nonzero error in ∥ui+1−ui∥ at termination. The quantity ∥ui+1−ui∥ bounds from above:
∥Q∥−1·∥Qui−e−((Qui−e)−αui)+∥, (21)
which measures the violation of the optimality criterion (14). It follows that ∥ui+1−ui∥ also bounds, up to a constant factor, the distance ∥ui−ū∥ between ui and the unique solution ū of (11).
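The following is a minimal MATLAB™ sketch of the complete linear method under the definitions (10) and iteration (15); the function name lsvm, its exact signature, and the starting point u0=Q−1e are illustrative choices rather than a verbatim reproduction of the commands referenced above.

```matlab
function [it, opt, w, gamma] = lsvm(A, D, v, itmax, tol)
% Minimal sketch of iteration (15) for the linear kernel.
% A: m x n data matrix; D: m x m diagonal matrix of +/-1 labels;
% v: parameter of (7); itmax, tol: termination controls.
[m, n] = size(A);
alpha = 1.9/v;                   % satisfies condition (16): 0 < alpha < 2/v
e = ones(m, 1);
H = D*[A -e];                    % the matrix H of (10), m x (n+1)
S = H / (eye(n+1)/v + H'*H);     % SMW identity (1): only an (n+1) x (n+1) solve
u = v*(e - S*(H'*e));            % u0 = Q^{-1} e
oldu = u + 1;
it = 0;
while it < itmax && norm(u - oldu) > tol
    Qu = u/v + H*(H'*u);         % Q*u, never forming the m x m matrix Q
    z = e + max((Qu - e) - alpha*u, 0);
    oldu = u;
    u = v*(z - S*(H'*z));        % u_{i+1} = Q^{-1} z via (1)
    it = it + 1;
end
opt = norm(u - oldu);            % bounds the violation of (14), cf. (21)
w = A'*(D*u); gamma = -e'*(D*u); % separating surface (4) via relations (9)
end
```

With D stored as a sparse diagonal matrix of the ±1 class labels, each iteration costs only two matrix-vector products with the m×(n+1) matrix H.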
Lagrangian support vector machine 10 is also used to solve classification problems with positive semidefinite nonlinear kernels. The iterative method implemented by processor 14 applies directly to such problems, with the matrix Q redefined for the nonlinear kernel as described below.
In this regard, problems with large datasets are handled using the Sherman-Morrison-Woodbury (SMW) identity (1) only if the inner product terms of the kernel are explicitly known. Nevertheless, LSVM processor 14 is a useful tool for classification with nonlinear kernels because the following implementation also requires, for example, only MATLAB™ commands, without making use of the Sherman-Morrison-Woodbury identity or any optimization package.
For A∈Rm×n and B∈Rn×l, the kernel K(A, B) maps Rm×n×Rn×l into Rm×l. A typical kernel is the Gaussian kernel, whose ij-th element is ε−μ∥Ai′−B·j∥22, i=1, . . . , m, j=1, . . . , l, wherein ε is the base of the natural logarithms and μ is a positive parameter. With G=[A −e], the nonlinear separating surface is then:
K([x′ −1], G′)Du=0, (22)
where u is the solution of the dual problem (11) with Q re-defined for a general nonlinear kernel as follows:
Q=I/v+DK(G,G′)D. (23)
The nonlinear separating surface (22) degenerates to the linear one (4) if K(G,G′)=GG′ and (9) is utilized.
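As a sketch consistent with the Gaussian kernel definition above (the routine name gaussK and the parameter mu are illustrative assumptions), the kernel matrix can be formed from squared norms and one matrix product; with μ=2·10−4 this reproduces the kernel (27) used in the experiments below:

```matlab
function K = gaussK(A, B, mu)
% Gaussian kernel: K(i,j) = exp(-mu*||A(i,:)' - B(:,j)||^2),
% for A in R^(m x n) and B in R^(n x l), so that K is m x l.
sqA = sum(A.^2, 2);                  % m x 1 squared norms of the rows of A
sqB = sum(B.^2, 1);                  % 1 x l squared norms of the columns of B
K = exp(-mu*(sqA + sqB - 2*(A*B)));  % ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a'*b
end
```

A new point x is then classified against the surface (22) by the sign of K([x′ −1], G′)Du:

```matlab
G = [A -ones(size(A,1),1)];                    % G = [A -e]
label = sign(gaussK([x' -1], G', mu)*(D*u));   % +1: class A+, -1: class A-
```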
The dual problem for a linear kernel (8) is written in the following equivalent form:
min0≦u∈Rm (1/2)u′(I/v+DGG′D)u−e′u, (24)
and the linear kernel GG′ is replaced by a general nonlinear positive semidefinite symmetric kernel K(G,G′) to obtain:
min0≦u∈Rm (1/2)u′(I/v+DK(G,G′)D)u−e′u. (25)
This is the formulation given above in (23). The Karush-Kuhn-Tucker necessary and sufficient optimality conditions for this problem are:
0≦u⊥(I/v+DK(G,G′)D)u−e≧0, (26)
which is the basis for a nonlinear support vector machine with a positive semidefinite kernel K(G,G′). The positive semidefiniteness of the nonlinear kernel K(G,G′) is needed in order to ensure the existence of a solution to both (25) and (26).
The above-referenced results remain valid, with Q redefined as above, for any positive semidefinite kernel K. This includes the iterative scheme (15) and the convergence result (20). However, because the Sherman-Morrison-Woodbury identity is not used for a nonlinear kernel, the MATLAB™ commands used to generate a nonlinear solution differ from those used to generate the linear solution, as sketched below.
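A minimal sketch of that nonlinear variant (the function name lsvm_kernel is an illustrative assumption): the m×m matrix Q of (23) is formed explicitly and inverted once at the outset, since the SMW identity (1) no longer applies.

```matlab
function [it, opt, u] = lsvm_kernel(K, D, v, itmax, tol)
% Minimal sketch of iteration (15) with Q = I/v + D*K*D, cf. (23) and (25).
% K: m x m kernel matrix K(G,G'); D: m x m diagonal matrix of +/-1 labels.
m = size(K, 1);
alpha = 1.9/v;                   % satisfies condition (16)
e = ones(m, 1);
Q = eye(m)/v + D*K*D;            % positive definite for positive semidefinite K
P = inv(Q);                      % the single m x m inversion, done once
u = P*e;
oldu = u + 1;
it = 0;
while it < itmax && norm(u - oldu) > tol
    oldu = u;
    u = P*(e + max((Q*u - e) - alpha*u, 0));   % iteration (15)
    it = it + 1;
end
opt = norm(u - oldu);
end
```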
The implementation of the method, with the parameters v and α set as described above, resulted in LSVM processor 14 solving a massively sized test problem in 6 iterations in 81.52 minutes to an optimality criterion of 9.398e-5 on a 2-norm violation of (14). The same problem was solved in the same number of iterations and to the same accuracy in 6.74 minutes on a 250 MHz UltraSPARC II processor with 2 gigabytes of memory.
Additional experiments were conducted using a 400 MHz Pentium II Xeon processor and a maximum of 2 Gigabytes of memory available for each process. The computer used Windows NT Server 4.0, with MATLAB 5.3.1. A set of experiments comparing LSVM processor 14 to SVMlight were run on a 250 MHz UltraSPARC II processor with a maximum of 8 Gigabytes of memory available running MATLAB 5.3.0 under an experimental version of Solaris 5.6.
Table 1 below illustrates experimental results indicating that the reformulation of the standard vector machine as implemented by processor 14 and described above performs similarly to SVM-QP, the conventional SVM. Results are also shown for an active set SVM (ASVM) method. For six data sets, tenfold cross validation was performed in order to compare test set accuracies between the methods. Moreover, a tuning set for each algorithm was utilized to find the optimal value of the parameter v. For both LSVM and ASVM, an optimality tolerance of 0.001 was used to determine when to terminate. SVM-QP was implemented using the high-performing CPLEX barrier quadratic programming solver with its default stopping criterion. Altering the CPLEX default stopping criterion to match that of LSVM did not result in significant change in timing relative to LSVM, but did reduce test set correctness for SVM-QP. The results in Table 1 include both normalized data (by subtracting the mean and dividing by the standard deviation) and unnormalized data.
The results outlined above in Table 1 illustrate that the method of the present invention achieves test set correctness comparable to that of the conventional methods on both normalized and unnormalized data.
Table 2 compares the method implemented by LSVM processor 14 with SVMlight on the Adult dataset, which is commonly used to compare standard SVM methods. The results demonstrate that for the largest training sets, the method of the present invention runs faster while maintaining comparable test set correctness.
Table 3 illustrates results from running the method on massively sized datasets containing millions of sample points.
Additional experiments demonstrate the effectiveness of the method of the present invention in generating a highly nonlinear separating surface using the Gaussian kernel:
K(G,G′)ij=exp(−2·10−4∥Gi−Gj∥22), i,j=1, . . . , m, (27)
wherein Gi denotes the i-th row of G.
Therefore, the method of the present invention quickly and simply classifies both linearly and nonlinearly separable data sets, including data sets with millions of sample points, using standard tools in place of complicated and costly optimization packages.
The scope of the application is not to be limited by the description of the preferred embodiments described above, but is to be limited solely by the scope of the claims that follow. For example, having all the data in memory is simpler to code and results in faster running times. However, it is not a fundamental requirement of the method; blocks of data may instead be read from external storage as needed and the required matrix products accumulated incrementally, as sketched below.
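A sketch of that block-wise alternative, under the assumption that the data matrix A and a ±1 label vector d are available block-by-block (the disk read is simulated here by indexing, and blocksize is an illustrative choice): only the small (n+1)×(n+1) products needed by the SMW identity (1) are accumulated, so the full m×(n+1) matrix H never resides in memory.

```matlab
% Accumulate H'*H and H'*e of (10) one row block at a time, where H = D*[A -e].
[m, n] = size(A);
blocksize = 50000;                          % illustrative block size
HtH = zeros(n+1, n+1);
Hte = zeros(n+1, 1);
for first = 1:blocksize:m
    last = min(first + blocksize - 1, m);
    Ab = A(first:last, :);                  % in practice: read this block from disk
    db = d(first:last);                     % +/-1 labels for the block (column)
    Hb = db .* [Ab -ones(last-first+1, 1)]; % D_b*[A_b  -e_b]
    HtH = HtH + Hb'*Hb;
    Hte = Hte + Hb'*ones(last-first+1, 1);
end
% eye(n+1)/v + HtH is then factored once, exactly as in the in-memory sketch.
```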
Claims
1. A method of classifying numerical data sets comprising the steps of:
- defining an input matrix representing a set of numerical data having an input space with a dimension of n, wherein n corresponds to a number of features associated with the data set;
- generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix; and
- calculating a separating surface with the support vector machine to divide the set of numerical data into at least two subsets of data.
2. A method according to claim 1, wherein a dimension of the positive definite matrix is equal to the dimension of (n+1).
3. A method according to claim 2, wherein the separating surface is a linear surface.
4. A method according to claim 2, wherein the separating surface is a nonlinear surface.
5. A method according to claim 3, wherein the separating surface is midway between a pair of parallel bounding planes.
6. A method according to claim 5, wherein the positive definite matrix is defined as: Q=I/v+HH′, wherein H=D[A −e], and v is a parameter associated with the distance between the pair of parallel bounding planes, A is a matrix representing the set of data, e is a vector of ones, and D is a diagonal matrix wherein a value on a diagonal of the D matrix is equal to a classification of the corresponding row of the A matrix.
7. A method according to claim 6, further comprising the step of minimizing a function defined by: min0≤u∈Rm f(u):=(1/2)u′Qu−e′u.
8. A method according to claim 7, wherein the separating plane is generated by iteratively calculating a value u defined by: ui+1=Q−1(e+((Qui−e)−αui)+), i=0, 1, . . . .
9. A method according to claim 8, wherein global linear convergence is achieved by satisfying a condition defined by: 0<α<2/v.
10. A method according to claim 9, wherein the separating surface is defined by a vector that is orthogonal to the pair of parallel bounding planes, and a coordinate that represents the location of the separating surface relative to an origin.
11. A method according to claim 10, wherein the vector is represented by: w=A′Du.
12. A method according to claim 10, wherein the coordinate is represented by: γ=−e′Du.
13. A method according to claim 4, wherein the positive definite matrix is defined as: Q=I/v+DK(G,G′)D, wherein G=[A −e] and v is a parameter, I is an identity matrix, A is a matrix representing the set of data, e is a vector of ones, D is a matrix wherein a value on a diagonal of the D matrix is equal to the classification of the corresponding row of the A matrix, and K is a mathematical kernel.
14. A method according to claim 13, wherein the kernel K is a positive semidefinite kernel function.
15. A method according to claim 14, wherein the kernel K(A,B) maps Rm×n×Rn×l into Rm×l for A∈Rm×n and B∈Rn×l.
16. A method according to claim 15, wherein the kernel K(A,B) is a Gaussian kernel.
17. A method according to claim 16, further comprising the step of minimizing a function defined by: min0≤u∈Rm f(u):=(1/2)u′Qu−e′u.
18. A method according to claim 17, wherein the nonlinear separating surface is generated by iteratively calculating a value u defined by: ui+1=Q−1(e+((Qui−e)−αui)+), i=0, 1, . . . .
19. A method according to claim 18, wherein the nonlinear separating surface is defined by: K([x′ −1], G′)Du=0, wherein G=[A −e].
20-41. (canceled)
42. A support vector computing machine to classify numerical data sets comprising:
- an input module that generates an input matrix representing a set of numerical data having an input space with a dimension of n, wherein n corresponds to a number of features associated with the numerical data set;
- a processor that receives an input signal from the input module representing the numerical data, wherein the processor calculates an output signal representing a solution to a system of linear equations corresponding to the input signal, and the system of linear equations is defined by a positive definite matrix; and
- an output module that divides the set of numerical data into a plurality of subsets of numerical data based on the output signal from the processor that corresponds to a separating surface between the plurality of subsets of data.
43. A machine according to claim 42, wherein a dimension of the positive definite matrix is equal to the dimension of (n+1).
44. A machine according to claim 42, wherein the separating surface is a nonlinear surface.
45. A method of classifying patients comprising the steps of:
- defining an input matrix representing a set of patient data having an input space with a dimension of n, wherein n corresponds to a number of features associated with each patient in the set of patient data;
- generating a support vector machine to solve a system of linear equations corresponding to the input matrix, wherein the system of linear equations is defined by a positive definite matrix; and
- calculating a separating surface with the support vector machine to divide the set of patient data into a plurality of subsets of data.
46. A method according to claim 45, wherein a dimension of the positive definite matrix is equal to the dimension of (n+1).
47. A method according to claim 45, wherein the separating surface is a linear surface.
48. A method according to claim 45, wherein the separating surface is a nonlinear surface.
49. (canceled)
50. (canceled)
51. A machine according to claim 42, wherein the separating surface is a linear surface.
52. A method according to claim 1, wherein the at least two subsets of data include a node-positive set of data and a node-negative set of data.
53. A method according to claim 1, wherein the at least two subsets of data include a good prognostic set of data and a poor prognostic set of data.
54. A method according to claim 1, wherein the set of numerical data comprises a set of data indicating presence of metastasized lymph nodes.
55. A method according to claim 45, wherein the plurality of subsets of data include a node-positive set of data and a node-negative set of data.
56. A method according to claim 45, wherein the plurality of subsets of data include a good prognostic set of data, an intermediate prognostic set of data, and a poor prognostic set of data.
57. A method according to claim 45, wherein the set of patient data comprises a set of data indicating presence of metastasized lymph nodes.
Type: Application
Filed: Mar 9, 2006
Publication Date: Jan 4, 2007
Applicant: Wisconsin Alumni Research Foundation (Madison, WI)
Inventors: Olvi Mangasarian (Madison, WI), David Musicant (Burnsville, MN)
Application Number: 11/372,483
International Classification: G06F 15/18 (20060101);