Clustering and classification employing softmax function including efficient bounds
A function optimization method includes the operations of: constructing an upper bound using a double majorization bounding process to a sum-of-exponentials function including a summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$; optimizing the constructed upper bound respective to parameters β to generate optimized parameters β; and outputting the optimized sum-of-exponentials function represented at least by the optimized parameters β. An inference process includes the operations of: invoking the function optimization method respective to a softmax function constrained by discrete observations y defining categorization observation conditioned by continuous variables x representing at least one input object; and applying the optimized softmax function output by the invocation of the softmax function optimization method to the continuous variables x representing at least one input object to generate classification probabilities.
The following relates to the information processing arts and related arts.
The softmax function is the extension of the sigmoid function to more than two variables. The softmax function has the form:
$$\frac{e^{x_k}}{\sum_{k'=1}^{K} e^{x_{k'}}} \qquad (1)$$
The denominator is a sum-of-exponentials function of the form:
$$\sum_{k=1}^{K} e^{x_k} \qquad (2)$$
where $x = (x_1, \ldots, x_K) \in \mathbb{R}^K$. The softmax function finds application in neural networks, classifiers, and so forth, while the sum-of-exponentials function finds even more diverse application in these fields as well as in statistical thermodynamics (for example, the partition function), quantum mechanics, information science, classification, and so forth. For some applications a log of the sum-of-exponentials function is a more useful formulation.
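The relationship between the softmax function, the sum-of-exponentials function, and its logarithm can be illustrated with a minimal NumPy sketch. The function names `log_sum_exp` and `softmax` are chosen here for illustration and do not come from the source; the max-subtraction trick is a common numerical safeguard, not something the source prescribes.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum_k exp(x_k)) for a 1-D array x."""
    x = np.asarray(x, dtype=float)
    m = x.max()  # subtracting the maximum avoids overflow in exp
    return m + np.log(np.exp(x - m).sum())

def softmax(x):
    """Softmax: exp(x_k) divided by the sum-of-exponentials."""
    x = np.asarray(x, dtype=float)
    return np.exp(x - log_sum_exp(x))

p = softmax([1.0, 2.0, 3.0])
print(p, p.sum())

# With two variables (x, 0), the first component is the sigmoid 1 / (1 + exp(-x)).
x = 0.7
print(softmax([x, 0.0])[0], 1.0 / (1.0 + np.exp(-x)))
```

The second print shows why the softmax is described as an extension of the sigmoid: the two-variable case reduces to it.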
One application of the softmax function is in the area of inference problems, such as Gaussian process classifiers, Bayesian multiclass logistic regression, and more generally for deterministic approximation of probabilistic models dealing with discrete variables conditioned on continuous variables. Such applications entail computing the expectation of the log-sum-of-exponentials function:
$$\gamma = E_Q\left[\log \sum_{k=1}^{K} e^{\beta_k^T x}\right] \qquad (3)$$
where $E_Q$ denotes the expectation for a distribution $Q(\beta)$ which is the probability density function (pdf) of a given multidimensional distribution in $\mathbb{R}^{d \times K}$ and $x$ is a vector of $\mathbb{R}^d$. The expectation can be computed using Monte Carlo simulations, but this can be computationally expensive. Taylor expansion techniques are also known, but tend to provide skewed results when the variance of the pdf $Q(\beta)$ is large.
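For reference, the Monte Carlo approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: Q is taken to factorize into independent Gaussians, one per $\beta_k$ with mean `mus[k]` and covariance `Sigmas[k]`, and the function name, sample count, and toy inputs are hypothetical. Because each $\beta_k^T x$ is then a scalar Gaussian, the sketch samples those projections directly.

```python
import numpy as np

def mc_expected_log_sum_exp(mus, Sigmas, x, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_Q[log sum_k exp(beta_k^T x)].

    Assumes independent beta_k ~ N(mus[k], Sigmas[k]), so that beta_k^T x
    is Gaussian with mean mu_k^T x and variance x^T Sigma_k x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.array([np.dot(mu, x) for mu in mus])
    variances = np.array([x @ np.asarray(S) @ x for S in Sigmas])
    stds = np.sqrt(variances)
    # One row per sample, one column per class k
    z = means + stds * rng.standard_normal((n_samples, len(means)))
    m = z.max(axis=1, keepdims=True)  # stabilize the exponentials
    lse = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    return float(lse.mean())

# Hypothetical toy inputs: K = 3 classes, d = 2 dimensions
mus = [np.array([0.5, -0.2]), np.array([-1.0, 0.3]), np.array([0.1, 0.8])]
Sigmas = [np.eye(2) for _ in mus]
x = np.array([1.0, -1.0])
print(mc_expected_log_sum_exp(mus, Sigmas, x))
```

The cost grows linearly with the number of samples, which is the computational expense the text refers to.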
Another known approach for computing the expectation is to use an upper bound. In this approach an upper bound on the log-sum-of-exponentials function is identified, from which an estimate of the expectation is obtained. The chosen upper bound should be tight respective to the log-sum-of-exponentials function, and should be computationally advantageous for computing the expectation. For the log-sum-of-exponentials function, a known upper bound having a quadratic form is given by:
See, e.g., Krishnapuram et al., Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6):957-68, 2005; Böhning, Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(9):197-200, 1992. These quadratic bounds are generally tight. However, they use the worst curvature over the space, which can result in inefficient integration when using the upper bound.
As a result, the use of the softmax function for inference problems with more than two variables has heretofore been computationally difficult or impossible for many practical inference and classification problems.
BRIEF DESCRIPTION
In some embodiments disclosed as illustrative examples, a storage medium stores instructions executable to implement a sum-of-exponentials function optimization method including the operations of: constructing an upper bound using a double majorization bounding process to a sum-of-exponentials function including at least one summation of exponentials; optimizing the constructed upper bound respective to parameters of the exponentials of the at least one summation of exponentials to generate optimized parameters; and outputting the optimized sum-of-exponentials function represented at least by the optimized parameters.
In some embodiments disclosed as illustrative examples, an inference engine is disclosed comprising a processor programmed to perform an inference process comprising: generating an upper bound by double majorization for a sum-of-exponentials function including at least one summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ constrained by an input object representation vector and an output classification observation vector; optimizing the upper bound respective to parameters $\beta_k$ of the sum-of-exponentials function; and classifying one or more input objects by applying the sum-of-exponentials function with the optimized parameters $\beta_k$ to said one or more input objects.
In some embodiments disclosed as illustrative examples, a storage medium stores instructions defining a function for optimizing a sum-of-exponentials function including a summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ by optimization operations comprising: upper bounding the sum of exponentials $\sum_{k=1}^{K} e^{\beta_k^T x}$ by a product of sigmoids; upper bounding terms of the form $\log(1+e^{x})$ in the product of sigmoids to generate a double majorization upper bound; minimizing the double majorization upper bound respective to the parameters β to generate optimized parameters β; and outputting the optimized sum-of-exponentials function represented at least by the optimized parameters β.
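The logarithm is concave, so for any $\phi > 0$ a suitable upper bound on the log-sum-of-exponentials function is given by:
$$\log \sum_{k=1}^{K} e^{x_k} \le \phi \sum_{k=1}^{K} e^{x_k} - \log \phi - 1 \qquad (5)$$
where the equality holds if and only if:
$$\phi = \left(\sum_{k=1}^{K} e^{x_k}\right)^{-1} \qquad (6)$$
Advantageously, the expectation of the right hand side can be readily computed for certain useful distributions, since it involves only terms of the form $E_Q[e^{x_k}]$.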
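For any $x \in \mathbb{R}^K$ and for any $\alpha \in \mathbb{R}$, a product of sigmoids and corresponding bound can be written as follows:
$$\prod_{k=1}^{K} \left(1 + e^{x_k - \alpha}\right) \ge \sum_{k=1}^{K} e^{x_k - \alpha} \qquad (7)$$
from which can be obtained:
$$\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \log\left(1 + e^{x_k - \alpha}\right) \qquad (8)$$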
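The first majorization step, in the form stated in claim 5, $\log \sum_k e^{x_k} \le \alpha + \sum_k \log(1 + e^{x_k - \alpha})$, can be checked numerically with a short sketch. The function names and the random test points are illustrative assumptions, not part of the source.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum_k exp(x_k))."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def product_of_sigmoids_bound(x, alpha):
    """alpha + sum_k log(1 + exp(x_k - alpha)), valid for any real alpha."""
    x = np.asarray(x, dtype=float)
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow
    return alpha + np.logaddexp(0.0, x - alpha).sum()

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(scale=3.0, size=4)
    alpha = rng.normal()
    print(log_sum_exp(x), "<=", product_of_sigmoids_bound(x, alpha))
```

The bound holds for every α because expanding the product of the factors $(1 + e^{x_k - \alpha})$ yields the sum $\sum_k e^{x_k - \alpha}$ plus non-negative terms.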
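A property of this bound is that its asymptotes are parallel in most directions. More exactly, by applying the bound to $a x$ where $a \to \infty$, the difference between the right and the left part of the equation tends to a constant if there exists at least one positive $x_k$ and $x_k \ne x_{k'}$ for all $k \ne k'$. The standard quadratic bound for $\log(1+e^{x})$ is given by:
$$\log\left(1 + e^{x}\right) \le \lambda(\xi)\left(x^2 - \xi^2\right) + \frac{x - \xi}{2} + \log\left(1 + e^{\xi}\right) \qquad (9)$$
for all $\xi \in \mathbb{R}$. It is applied inside the sum of Equation (8), for any $x \in \mathbb{R}^K$, any $\alpha \in \mathbb{R}$, and any $\xi \in [0, \infty)^K$, to yield:
$$\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \left[\frac{x_k - \alpha - \xi_k}{2} + \lambda(\xi_k)\left((x_k - \alpha)^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right)\right] \qquad (10)$$
where:
$$\lambda(\xi) = \frac{1}{2\xi}\left[\frac{1}{1 + e^{-\xi}} - \frac{1}{2}\right] \qquad (11)$$
Equations (10) and (11) set forth an upper bound on the log-sum-of-exponentials.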
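The double majorization bound can be evaluated numerically as sketched below, using the form recited in claim 6. The function names are illustrative, and the special handling of $\xi = 0$ (where $\lambda$ has the limit 1/8) is an implementation detail assumed here rather than taken from the source.

```python
import numpy as np

def jj_lambda(xi):
    """lambda(xi) = (1 / (2 xi)) * (1 / (1 + exp(-xi)) - 1/2), with limit 1/8 at xi = 0."""
    xi = np.asarray(xi, dtype=float)
    small = np.abs(xi) < 1e-8
    safe = np.where(small, 1.0, xi)  # avoid division by zero
    lam = (1.0 / (1.0 + np.exp(-safe)) - 0.5) / (2.0 * safe)
    return np.where(small, 0.125, lam)

def double_majorization_bound(x, alpha, xi):
    """Upper bound on log sum_k exp(x_k) for real alpha and xi_k >= 0."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(xi, dtype=float)
    t = x - alpha
    terms = (t - xi) / 2.0 + jj_lambda(xi) * (t ** 2 - xi ** 2) + np.logaddexp(0.0, xi)
    return alpha + terms.sum()

x = np.array([2.0, -1.0, 0.5])
alpha = 1.0
print(np.log(np.exp(x).sum()))                                   # exact value
print(double_majorization_bound(x, alpha, np.ones(3)))           # arbitrary xi
print(double_majorization_bound(x, alpha, np.abs(x - alpha)))    # xi_k = |x_k - alpha|
```

Choosing $\xi_k = |x_k - \alpha|$ makes the quadratic step exact, so the last value coincides with the product-of-sigmoids bound $\alpha + \sum_k \log(1 + e^{x_k - \alpha})$.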
Attention is turned to the illustrative application of estimating the expectation γ of the log-sum-of-exponentials set forth in Equation (3). A variational technique based on the minimization of an upper bound of γ can be used. In the following, the pdf $Q(\beta_k)$ is considered to be a multivariate normal distribution with mean $\mu_k$ and variance $\Sigma_k$. This illustrative example is readily extended to other multivariate distributions with bounded variances. Using the moment generating function
$$E_Q\left[e^{\beta_k^T x}\right] = e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x} \qquad (12)$$
of the multivariate normal distribution, the following bound on the expectation γ is obtained:
$$\gamma \le \phi \sum_{k=1}^{K} e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x} - \log \phi - 1 \qquad (13)$$
for every $\phi > 0$. Minimizing the right hand side with respect to φ leads to the upper bound:
$$\gamma \le \log \sum_{k=1}^{K} e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x} \qquad (14)$$
which shows that the bound becomes accurate when the variance of Q is small. The expectation over quadratic bounds gives:
A minimization of the right hand side with respect to $\chi$ gives the solution $\chi_k = \mu_k^T x$. At this point, the difference between the expectation γ and its upper bound is at least $(K-1)\,x^T \Sigma x$. This means that the bound is tight if the distribution lies in a manifold orthogonal to $x$.
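The log-based bound on the expectation can be sketched numerically as follows. This assumes the tangent-line bound $\log z \le \phi z - \log \phi - 1$ (valid for any $\phi > 0$) combined with the Gaussian moment generating function $E_Q[e^{\beta_k^T x}] = e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x}$; the function names and toy inputs are hypothetical.

```python
import numpy as np

def gaussian_exponents(mus, Sigmas, x):
    """mu_k^T x + 0.5 x^T Sigma_k x, the log of the Gaussian moment generating function."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(mu, x) + 0.5 * (x @ np.asarray(S) @ x)
                     for mu, S in zip(mus, Sigmas)])

def log_bound(mus, Sigmas, x, phi):
    """Bound on the expectation for a fixed phi > 0."""
    a = gaussian_exponents(mus, Sigmas, x)
    return phi * np.exp(a).sum() - np.log(phi) - 1.0

def optimized_log_bound(mus, Sigmas, x):
    """Value of the bound after minimizing over phi (closed form)."""
    a = gaussian_exponents(mus, Sigmas, x)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Hypothetical toy inputs: K = 2 classes, d = 2 dimensions
mus = [np.array([0.5, -0.2]), np.array([-1.0, 0.3])]
Sigmas = [0.1 * np.eye(2), 0.2 * np.eye(2)]
x = np.array([1.0, -1.0])
best = optimized_log_bound(mus, Sigmas, x)
print(best, log_bound(mus, Sigmas, x, np.exp(-best)))
```

The two printed values agree because the minimizing φ is the reciprocal of the sum of exponentials; the variance enters only through the term $\frac{1}{2} x^T \Sigma_k x$, which vanishes as the variance of Q shrinks.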
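Returning to Equation (10), the following holds:
$$\gamma \le \sum_{k=1}^{K} \left[\lambda(\xi_k)\left(x^T \Sigma_k x + (\mu_k^T x)^2\right) + \left(\tfrac{1}{2} - 2\alpha\lambda(\xi_k)\right)\mu_k^T x\right] + \alpha + \sum_{k=1}^{K} \left[-\frac{\xi_k + \alpha}{2} + \lambda(\xi_k)\left(\alpha^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right)\right] \qquad (17)$$
The minimization of the upper bound with respect to ξ gives:
$$\xi_k^2 = x^T \Sigma_k x + (\mu_k^T x)^2 + \alpha^2 - 2\alpha\,\mu_k^T x \qquad (18)$$
for $k = 1, \ldots, K$. The minimization with respect to α gives:
$$\alpha = \frac{\frac{1}{2}\left(\frac{K}{2} - 1\right) + \sum_{k=1}^{K} \lambda(\xi_k)\,\mu_k^T x}{\sum_{k=1}^{K} \lambda(\xi_k)} \qquad (19)$$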
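The alternating minimization over ξ and α can be sketched as a simple fixed-point loop. The ξ update follows Equation (18). The α update used here, $\alpha = \left[\tfrac{1}{2}(K/2 - 1) + \sum_k \lambda(\xi_k)\mu_k^T x\right] / \sum_k \lambda(\xi_k)$, is obtained by setting the derivative of the expected bound with respect to α to zero and should be read as a derivation made for this sketch. The names, the starting point, and the iteration count are hypothetical; `m[k]` stands for $\mu_k^T x$ and `v[k]` for $x^T \Sigma_k x$.

```python
import numpy as np

def jj_lambda(xi):
    """lambda(xi) = (1 / (2 xi)) * (1 / (1 + exp(-xi)) - 1/2), with limit 1/8 at xi = 0."""
    xi = np.asarray(xi, dtype=float)
    small = np.abs(xi) < 1e-8
    safe = np.where(small, 1.0, xi)
    lam = (1.0 / (1.0 + np.exp(-safe)) - 0.5) / (2.0 * safe)
    return np.where(small, 0.125, lam)

def expected_bound(m, v, alpha, xi):
    """Expectation of the double majorization bound under Gaussian Q."""
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    xi = np.asarray(xi, dtype=float)
    lam = jj_lambda(xi)
    # E[(beta_k^T x - alpha)^2] = v_k + (m_k - alpha)^2
    terms = (m - alpha - xi) / 2.0 + lam * (v + (m - alpha) ** 2 - xi ** 2) + np.logaddexp(0.0, xi)
    return float(alpha + terms.sum())

def optimize_bound(m, v, n_iter=100):
    """Alternate the closed-form updates of xi and alpha."""
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    K = m.size
    alpha = float(m.mean())  # hypothetical starting point
    for _ in range(n_iter):
        xi = np.sqrt(v + (m - alpha) ** 2)  # Equation (18)
        lam = jj_lambda(xi)
        alpha = float((0.5 * (K / 2.0 - 1.0) + (lam * m).sum()) / lam.sum())
    xi = np.sqrt(v + (m - alpha) ** 2)
    return alpha, xi, expected_bound(m, v, alpha, xi)

m = np.array([0.5, -1.0, 2.0])   # hypothetical values of mu_k^T x
v = np.array([0.3, 0.2, 0.5])    # hypothetical values of x^T Sigma_k x
alpha, xi, bound = optimize_bound(m, v)
print(alpha, xi, bound)
```

Each update minimizes the bound exactly in one block of variables while the other is held fixed, so the bound never increases from one iteration to the next.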
Several experiments were performed to compare the approximations based on the log, on the quadratic upper bound for the softmax (see Equation (4)), and the upper bound based on a double majorization as disclosed herein. The table shown in
With continuing reference to the table shown in
With reference to
In some embodiments, the disclosed softmax optimization processes, inference engines employing same, and so forth are embodied by a storage medium storing instructions executable (for example, by a digital processor) to implement a softmax function optimization method, inference engine, or so forth. The storage medium may include, for example: a magnetic disk or other magnetic storage medium; an optical disk or other optical storage medium; a random access memory (RAM), read-only memory (ROM), or other electronic memory device or chip or set of operatively interconnected chips; an Internet server from which the stored instructions may be retrieved via the Internet or a local area network; or so forth.
The illustrative Bayesian inference engine 10 employs multinomial logistic regression. A multinomial logistic model is considered, with discrete observations $y = (y_1, \ldots, y_n) \in \{1, \ldots, K\}^n$ 20 conditioned by continuous variables $x = (x_1, \ldots, x_n) \in \mathbb{R}^{n \times d}$ 22, such as a vector to be transformed, via a softmax function of the form:
$$P(y = k \mid x; \beta) = \frac{e^{\beta_k^T x}}{\sum_{k'=1}^{K} e^{\beta_{k'}^T x}} \qquad (20)$$
An initial (that is, not yet optimized) softmax function 24 has the form of Equation (20) with an initial (that is, not yet optimized) transformation matrix β 26. The objective is to optimize the transformation matrix β of the softmax function respective to the constraints 20, 22. As an example of one practical application of the Bayesian inference engine 10, the discrete observations y 20 may be category observations for a text-based document classifier, and the continuous variables x 22 may represent “bag-of-words” representations of documents to be classified. In another practical application, the discrete observations y 20 may be category observations for an image classifier, and the continuous variables x 22 may represent features vector representations of images to be classified. These are merely illustrative practical applications, and numerous other applications entailing optimization of parameters 26 of the softmax function 24 constrained by observations 20 conditioned by continuous variables 22 are contemplated.
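Applying a softmax model with a given transformation matrix β to input objects can be sketched as follows. The layout (one column $\beta_k$ per category, one row per input object), the function name `predict_proba`, and the toy bag-of-words counts are assumptions made for illustration only.

```python
import numpy as np

def predict_proba(beta, X):
    """Classification probabilities P(y = k | x; beta) for each row x of X.

    beta: array of shape (d, K), one column beta_k per category.
    X:    array of shape (n, d), one row per input object.
    """
    scores = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical toy data: 2 documents, vocabulary of 3 words, K = 2 categories
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])
beta = np.array([[1.0, -1.0],
                 [-0.5, 0.5],
                 [0.2, 0.0]])
probs = predict_proba(beta, X)
print(probs)
print(probs.argmax(axis=1))  # most probable category per document
```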
The Bayesian inference determines values for the transformation matrix elements of the parameters matrix β 26 that optimally satisfy the softmax function 24 under the constraints 20, 22. In other words, it is desired to maximize the posterior distribution $P(\beta \mid x, y)$. For any choice of the prior there does not exist a closed form solution for the posterior distribution $P(\beta \mid x, y)$. Variational approximations are based on the minimization of $KL(Q(\beta) \,\|\, P(\beta \mid y, x))$ over a class of distributions $F_Q$ for which the computation is effective.
With continuing reference to
Using a quadratic bound for the second expectation, a lower bound F(μ,Σ,ξ) of L(μ,Σ) is obtained:
where Ak, bk, and c depend on the choice of the quadratic bound. Equation (22) represents the upper bound output by the construction operation 30.
It will be appreciated that it is possible to construct 30 the upper bound as set forth in Equation (22) without actually generating the initial softmax function 24. To illustrate this option, in
With continuing reference to
With continuing reference to
so that a closed-form solution is obtained of the maximum of F(μ,Σ,ξ) with respect to μ and Σ:
$\hat{\Sigma}_k = (A_k +$
and
$\hat{\mu}_k = \hat{\Sigma}_k (b +$
This gives the updates:
The optimized parameters βk, k=1, . . . , K 34 are directly computed from the optimized independent Gaussian priors P=N(
In some applications, it may be desirable to operate on the moment generating function EQ[eβ
The minimization 32 with respect to φ is straightforward. For a fixed φ, the objective can be decomposed into a sum of independent functions of $(\mu_k, \Sigma_k)$ that can be maximized independently for $k = 1, \ldots, K$. Since the gradient can be computed easily and F is concave with respect to μ and Σ, the maximization can be done using a standard optimization package. In one suitable approach the reparameterizations $\phi = a^2$ and $\Sigma = R^{0.5} (R^{0.5})^T$ are used to transform to an unconstrained maximization problem. Alternatively, it is possible to find $\mu_k$ and $\Sigma_k$ using a fixed point equation. In the case of quadratic bounds, the maximization of this function is done by iteratively maximizing with respect to the variational parameters and (μ, Σ). Every computation is analytical in this case.
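The reparameterization idea can be illustrated with a minimal sketch: an unconstrained scalar `a` and an unconstrained square matrix `R` (playing the role of the factor written $R^{0.5}$ in the text) are mapped to a non-negative φ and a symmetric positive semidefinite Σ, so that a generic optimizer can search over `a` and `R` freely. The function name and the random inputs are hypothetical.

```python
import numpy as np

def to_constrained(a, R):
    """Map unconstrained (a, R) to (phi, Sigma) with phi >= 0 and Sigma positive semidefinite."""
    phi = a ** 2      # a square is never negative
    Sigma = R @ R.T   # a matrix times its transpose is symmetric positive semidefinite
    return phi, Sigma

rng = np.random.default_rng(0)
a = rng.normal()
R = rng.normal(size=(3, 3))
phi, Sigma = to_constrained(a, R)
print(phi >= 0.0)
print(np.allclose(Sigma, Sigma.T), np.linalg.eigvalsh(Sigma).min() >= -1e-12)
```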
As used herein, the terms “optimization”, “minimization”, and the like are intended to encompass both absolute optimization or absolute minimization, and approximate optimization or approximate minimization. For example, in the Bayesian inference engine 10 of
Phraseology such as “of the form” when used respective to mathematical expressions is intended to be broadly construed to encompass the base mathematical expression modified by substantially any substantively insubstantial modification. For example, the softmax function form of Equation (20) may be modified by the addition of scalar constants, scaling constants within the exponential, or so forth, and the softmax function so modified remains of the same form as that of Equation (20).
The illustrated embodiments relate to the softmax function. However, the disclosed techniques are more generally applicable to substantially any sum-of-exponentials function that includes at least one summation of exponentials. The optimization techniques disclosed herein entail constructing an upper bound using a double majorization bounding process to the softmax function or other sum-of-exponentials function including at least one summation of exponentials, and optimizing the constructed upper bound respective to parameters of the exponentials of the at least one summation of exponentials to generate optimized parameters.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
Claims
1. A storage medium storing instructions executable to implement a sum-of-exponentials function optimization method including the operations of:
- constructing an upper bound using a double majorization bounding process to a sum-of-exponentials function including at least one summation of exponentials;
- optimizing the constructed upper bound respective to parameters of the exponentials of the at least one summation of exponentials to generate optimized parameters; and
- outputting the optimized sum-of-exponentials function represented at least by the optimized parameters.
2. The storage medium as set forth in claim 1, wherein the sum-of-exponentials function including at least one summation of exponentials comprises at least one softmax function of the form: $\frac{e^{\beta_k^T x}}{\sum_{k'=1}^{K} e^{\beta_{k'}^T x}}$ and the optimizing comprises:
- optimizing the constructed upper bound respective to parameters β to generate optimized parameters β.
3. The storage medium as set forth in claim 2 where the optimized function contains a conditional probability model of outputs $y \in \{1, \ldots, K\}$ given inputs $x \in \mathbb{R}^d$ under the form of a softmax function $P(y = k \mid x; \beta) = \frac{e^{\beta_k^T x}}{\sum_{k'=1}^{K} e^{\beta_{k'}^T x}}$.
4. The storage medium as set forth in claim 1, wherein the summation of exponentials is of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ and the operation of constructing an upper bound using a double majorization bounding process comprises first and second majorization processes comprising:
- upper bounding the summation of exponentials by a product of sigmoid functions; and
- employing a bound on $\log(1+e^{x})$ in the upper bounding of the summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$.
5. The storage medium as set forth in claim 4, wherein the first majorization process generates an upper bound of the form: $\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \log\left(1 + e^{x_k - \alpha}\right)$ where $\alpha \in \mathbb{R}$.
6. The storage medium as set forth in claim 5, wherein the second majorization process modifies the upper bound of the form: $\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \log\left(1 + e^{x_k - \alpha}\right)$ to the form: $\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \left[\frac{x_k - \alpha - \xi_k}{2} + \lambda(\xi_k)\left((x_k - \alpha)^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right)\right]$ for $\alpha \in \mathbb{R}$ and $\xi \in [0, \infty)^K$ where: $\lambda(\xi) = \frac{1}{2\xi}\left[\frac{1}{1 + e^{-\xi}} - \frac{1}{2}\right]$.
7. The storage medium as set forth in claim 1, wherein the summation of exponentials is of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ and the operation of constructing an upper bound using a double majorization bounding process comprises constructing an upper bound to an expectation of the sum-of-exponentials function under the form $E_Q\left[\sum_{k=1}^{K} e^{\beta_k^T x}\right]$, into an expectation having the form $E_Q\left[e^{\beta_k^T x}\right]$ where $Q(\beta_k)$ denotes a probability density function (pdf) over the parameters β and $E_Q[\ldots]$ denotes an expectation with respect to Q.
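8. The storage medium as set forth in claim 7, wherein a second majorization process of the double majorization bounding process produces an upper bound on expectation γ of the form $\gamma \le \sum_{k=1}^{K} \left[\lambda(\xi_k)\left(x^T \Sigma_k x + (\mu_k^T x)^2\right) + \left(\tfrac{1}{2} - 2\alpha\lambda(\xi_k)\right)\mu_k^T x\right] + \alpha + \sum_{k=1}^{K} \left[-\frac{\xi_k + \alpha}{2} + \lambda(\xi_k)\left(\alpha^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right)\right]$ for $\alpha \in \mathbb{R}$ and $\xi \in [0, \infty)^K$ where: $\lambda(\xi_k) = \frac{1}{2\xi_k}\left[\frac{1}{1 + e^{-\xi_k}} - \frac{1}{2}\right]$ where $\mu_k$ is the expectation of $\beta_k$ and $\Sigma_k$ is the expectation of $(\beta_k - \mu_k)(\beta_k - \mu_k)^T$.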
9. The storage medium as set forth in claim 1, wherein the storage medium stores further instructions executable to implement an inference method including the operations of:
- invoking the sum-of-exponentials function optimization method respective to a sum-of-exponentials function constrained by discrete observations y defining categorization observation conditioned by continuous variables x representing at least one input object where $y = (y_1, \ldots, y_n) \in \{1, \ldots, K\}^n$ and $x = (x_1, \ldots, x_n) \in \mathbb{R}^{n \times d}$; and
- applying the optimized sum-of-exponentials function output by the invocation of the sum-of-exponentials function optimization method to the continuous variables x representing at least one input object to generate classification probabilities.
10. The storage medium as set forth in claim 9, wherein the continuous variables x represent at least one input text-based document in a bag-of-words representation.
11. The storage medium as set forth in claim 9, wherein the continuous variables x represent at least one input image in a features vector representation.
12. An inference engine comprising a processor programmed to perform an inference process comprising:
- generating an upper bound by double majorization for a sum-of-exponentials function including at least one summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ constrained by an input object representation vector and an output classification observation vector;
- optimizing the upper bound respective to parameters $\beta_k$ of the sum-of-exponentials function; and
- classifying one or more input objects by applying the sum-of-exponentials function with the optimized parameters $\beta_k$ to said one or more input objects.
13. The inference engine as set forth in claim 12, wherein the generating comprises:
- a first majorization operation comprising upper bounding the summation of exponentials $\sum_{k=1}^{K} e^{\beta_k^T x}$ by a product of sigmoid functions; and
- a second majorization operation comprising bounding terms of the form $\log(1+e^{x})$ generated by the first majorization operation.
14. The inference engine as set forth in claim 12, wherein the generating comprises generating an upper bound by double majorization for an expectation of the sum-of-exponentials function of the form $E_Q\left[e^{\beta_k^T x}\right]$ where x denotes the input object representation vector, $Q(\beta_k)$ denotes a probability density function (pdf) over the parameters $\beta_k$, and $E_Q[\ldots]$ denotes an expectation with respect to Q.
15. The inference engine as set forth in claim 12, wherein the sum-of-exponentials function is a softmax function.
16. The inference engine as set forth in claim 12, wherein the input object representation vector includes a bag-of-words representation of at least one text-based document and the classifying performs document classification.
17. The inference engine as set forth in claim 12, wherein the input object representation vector includes a features vector representation of at least one image and the classifying performs image classification.
18. A storage medium storing instructions defining a function for optimizing a sum-of-exponentials function including a summation of exponentials of the form $\sum_{k=1}^{K} e^{\beta_k^T x}$ by optimization operations comprising:
- upper bounding the sum of exponentials $\sum_{k=1}^{K} e^{\beta_k^T x}$ by a product of sigmoids;
- upper bounding terms of the form $\log(1+e^{x})$ in the product of sigmoids to generate a double majorization upper bound;
- minimizing the double majorization upper bound respective to the parameters β to generate optimized parameters β; and
- outputting the optimized sum-of-exponentials function represented at least by the optimized parameters β.
19. The storage medium as set forth in claim 18, wherein the optimization function is configured to optimize a softmax function of the form $\frac{e^{\beta_k^T x}}{\sum_{k'=1}^{K} e^{\beta_{k'}^T x}}$.
20. The storage medium as set forth in claim 19, wherein:
- the upper bounding of the sum of exponentials by the product of sigmoids generates an upper bound of the form: $\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \log\left(1 + e^{x_k - \alpha}\right)$ where $\alpha \in \mathbb{R}$; and
- the upper bounding terms of the form $\log(1+e^{x})$ in the product of sigmoids generates the double majorization upper bound of the form: $\log \sum_{k=1}^{K} e^{x_k} \le \alpha + \sum_{k=1}^{K} \left[\frac{x_k - \alpha - \xi_k}{2} + \lambda(\xi_k)\left((x_k - \alpha)^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right)\right]$ for $\xi \in [0, \infty)^K$ where: $\lambda(\xi_k) = \frac{1}{2\xi_k}\left[\frac{1}{1 + e^{-\xi_k}} - \frac{1}{2}\right]$.
21. The storage medium as set forth in claim 19, wherein the upper bounding a sum of exponentials by a product of sigmoids comprises upper bounding an expectation of the softmax function of the form $E_Q\left[e^{\beta_k^T x}\right]$ where $Q(\beta_k)$ denotes a probability density function (pdf) over the parameters β and $E_Q[\ldots]$ denotes an expectation with respect to Q.
7567946 | July 28, 2009 | Andreoli et al. |
7865089 | January 4, 2011 | Andreoli et al. |
7889842 | February 15, 2011 | Gautier et al. |
7929165 | April 19, 2011 | Bressan et al. |
- Direct torque control theory of a double three-phrase permanent magnet synchronous motor, Yi Guo; Wei Feng Shi; Chen, C.L.P.; Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on Digital Object Identifier: 10.1109/ICSMC.2009.5346078 Publication Year: 2009 , pp. 4780-4785.
- “Back-Seat Driver”: Spatial Sound for Vehicular Way-Finding and Situation Awareness, Michael Cohen; Owen Noel Newton Fernando; Tatsuya Nagai; Kensuke Shimizu; Frontier of Computer Science and Technology, 2006. FCST '06. Japan-China Joint Workshop on Digital Object Identifier: 10.1109/FCST.2006.1 Publication Year: 2006 , pp. 109-115.
- Chipman et al., “Discussion of the paper Bayesian Treed Generalized Linear Models,” Proceedings Seventh Valencia International Meeting on Bayesian Statistics, vol. 7, pp. 98-101, 2002.
- Blei et al., “A Correlated Topic Model of Science,” Annals of Applied Statistics, 1:17-35, 2007.
- Bohning, “The lower bound method in probit regression,” Elsevier Science B.V., Computational Statistics & Data Analysis 30, pp. 13-17, 1999.
- Gibbs, “Bayesian Gaussian Processes for Regression and Classification,” Ph.D. Thesis, University of Cambridge, 1997.
- Girolami et al., “Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors,” Neural Comput., 18(8):1790-1817, 2006.
- Jaakkola et al., “A variational approach to Bayesian logistic regression models and their extensions,” In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics., 1996.
- Jebara et al., “On Reversing Jensen's Inequality,” In Advances in Neural Information Processing Systems 13, pp. 231-237, 2000.
- Krishnapuram et al., “Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds,” IEEE Trans Pattern Anal Mach Intell, 27(6):957-68, 2005.
- Lawrence et al., “Reducing the Variability in cDNA Microarray Image Processing by Bayesian Inference,” University of Sheffield School of Medicine and Biomedical Science, UK, pp. 1-10, (2003).
- Murphy, “Inference and Learning in Hybrid Bayesian Networks,” Technical Report UCB/CSD-98-990, EECS Department, University of California, 1998.
- Bishop et al., “VIBES: A variational Inference Engine for Bayesian Networks,” Oral presentation with live demo.
- Bouchard, “Efficient Bounds for the Softmax Function and Applications to Approximate Inference in Hybrid Models,” (2008).
Type: Grant
Filed: Oct 14, 2008
Date of Patent: Nov 22, 2011
Patent Publication Number: 20100094787
Assignee: Xerox Corporation (Norwalk, CT)
Inventor: Guillaume Bouchard (Crolles)
Primary Examiner: Michael B Holmes
Attorney: Fay Sharpe LLP
Application Number: 12/250,714
International Classification: G06F 15/18 (20060101); G06F 1/02 (20060101); G06N 3/02 (20060101); G06N 5/02 (20060101);