OPTIMIZING PARAMETERS FOR MACHINE TRANSLATION

Info

Publication number: 20100004919
Type: Application
Filed: Jul 2, 2009
Publication Date: Jan 7, 2010
Applicant: GOOGLE INC. (Mountain View, CA)
Inventors: Wolfgang Macherey (Mountain View, CA), Franz Josef Och (Palo Alto, CA), Ignacio E. Thayer (San Francisco, CA), Jakob Uszkoreit (Palo Alto, CA)
Application Number: 12/497,169

Abstract

Methods, systems, and apparatus, including computer program products, for language translation are disclosed. In one implementation, a method is provided. The method includes determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 61/078,262, entitled “Statistical Machine Translation”, which was filed on Jul. 3, 2008. The disclosure of the above application is incorporated herein by reference in its entirety.

BACKGROUND

This specification relates to statistical machine translation.

Manual translation of text by a human operator can be time consuming and costly. One goal of machine translation is to automatically translate text in a source language to corresponding text in a target language. There are several different approaches to machine translation including example-based machine translation and statistical machine translation. Statistical machine translation attempts to identify a most probable translation in a target language given a particular input in a source language. For example, when translating a sentence from French to English, statistical machine translation identifies the most probable English sentence given the French sentence. This maximum likelihood translation can be expressed as:

$\underset{e}{\arg \max} P (e | f),$

which describes the English sentence, e, out of all possible sentences, that provides the highest value for P(e|f). Additionally, Bayes Rule provides that:

$P (e | f) = \frac{P (e) P (f | e)}{P (f)} .$

Using Bayes Rule, this most likely sentence can be re-written as:

$\underset{e}{\arg \max} P (e | f) = \underset{e}{\arg \max} P (e) P (f | e) .$

Consequently, the most likely e (i.e., the most likely English translation) is one that maximizes the product of the probability that e occurs and the probability that e would be translated into f (i.e., the probability that a given English sentence would be translated into the French sentence).
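For illustration only, the decision rule above can be sketched in a few lines of Python. The function name `most_likely_translation` and the probability values are hypothetical and are not part of this specification; log-probabilities are summed rather than multiplying raw probabilities.

```python
import math

def most_likely_translation(candidates, lm_logprob, tm_logprob):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    return max(candidates, key=lambda e: lm_logprob[e] + tm_logprob[e])

# Hypothetical model scores for two English candidates of one French sentence.
lm = {  # log P(e): how likely the English sentence is on its own
    "the house is small": math.log(0.02),
    "small the house is": math.log(0.0001),
}
tm = {  # log P(f | e): how likely e would be translated into f
    "the house is small": math.log(0.3),
    "small the house is": math.log(0.4),
}

best = most_likely_translation(list(lm), lm, tm)
print(best)
```

In this sketch the fluent candidate is selected even though its P(f|e) is slightly lower, because the product P(e)P(f|e) is larger.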

Components that perform translation portions of a language translation task are frequently referred to as decoders. In certain instances, a first decoder (a first-pass decoder) can generate a list of possible translations, e.g., an N-best list. A second decoder (a second-pass decoder), e.g., a Minimum Bayes-Risk (MBR) decoder, can then be applied to the list to ideally identify which of the possible translations are the most accurate, as measured by minimizing a loss function that is part of the identification. Typically, an N-best list contains between 100 and 10,000 candidate translations, or hypotheses. Increasing the number of candidate translations improves the translation performance of an MBR decoder.

SUMMARY

This specification describes technologies relating to language translation.

In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.

These and other embodiments can optionally include one or more of the following features. The translation lattice includes a phrase lattice. Arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined. The error surfaces are determined and traversed using a line optimization technique. The line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations. The line optimization technique determines and traverses the error surface starting from a random point in a parameter space. The line optimization technique determines and traverses the error surface using random directions to adjust the weights.

The weights are limited by restrictions. The weights are adjusted using weights priors. The weights are adjusted over all sentences in a group of sentences. The method further includes selecting a target translation, from a plurality of candidate translations, that maximizes a-posteriori probability for the translation lattice. The translation lattice represents more than one billion candidate translations. The phrases include sentences. The phrases all include sentences.

In general, another aspect of the subject matter described in this specification can be embodied in systems that include a language model that includes: a collection of feature functions in a translation lattice; a plurality of error surfaces for a set of candidate language translations, across the feature functions; and weighting values for feature functions selected to minimize error for traversal of the error surfaces. Other embodiments of this aspect include corresponding methods, apparatus, and computer program products.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. MBR decoding of a lattice increases sizes of hypothesis and evidence spaces, thereby increasing a number of candidate translations available and the likelihood of obtaining an accurate translation. In addition, MBR decoding provides a better approximation of a corpus BLEU score (as described in further detail below), thereby further improving translation performance. Furthermore, MBR decoding of a lattice is runtime efficient, thereby increasing the flexibility of statistical machine translation since the decoding can be performed at runtime.

Lattice-based Minimum Error Rate Training (MERT) provides exact error surfaces for all translations in a translation lattice, thereby further improving translation performance of a statistical machine translation system. The systems and techniques for lattice-based MERT are also space and runtime efficient, thereby reducing an amount of memory used, e.g., limiting memory requirements to be linearly related (at most) with the size of the lattice, and increasing a speed of translation performance.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a conceptual diagram of an example process for translating input text from a source language to a target language.

FIG. 2A illustrates an example translation lattice.

FIG. 2B illustrates an example MBR automaton for the translation lattice of FIG. 2A.

FIG. 3 illustrates a portion of an example translation lattice.

FIG. 4 shows an example process for MBR decoding.

FIG. 5A shows an example process for Minimum Error Rate Training (MERT) on a lattice.

FIG. 5B illustrates an example Minimum Error Rate Trainer.

FIG. 6 shows an example of a generic computer device and a generic mobile computer device.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Statistical Translation Overview

Machine translation seeks to take input text in one language and accurately convert it into text in another language. Generally, the accuracy of a translation is measured against the ways in which expert humans would translate the input. An automatic translation system can analyze prior translations performed by human experts to form a statistical model of translation from one language to another. No such model can be complete, however, because the meaning of words often depends on context. Consequently, a step-wise word-for-word transformation of words from one language to another may not provide acceptable results. For example, idioms such as “babe in the woods” or slang phrases, do not translate well in a literal word-for-word transformation.

Adequate language models can help provide such context for an automatic translation process. The models can, for example, provide indications regarding the frequency with which two words appear next to each other in normal usage, e.g., in training data, or that other groups of multiple words or elements (n-grams) appear in a language. An n-gram is a sequence of n consecutive tokens, e.g., words or characters. An n-gram has an order or size, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bi-gram) includes two tokens.

A given n-gram can be described according to different portions of the n-gram. An n-gram can be described as a context and a future token, (context, w), where the context has a length n−1 and w represents the future token. For example, the 3-gram “c₁c₂c₃” can be described in terms of an n-gram context and a future token, where c₁, c₂, and c₃ each represent a character. The n-gram left context includes all tokens of the n-gram preceding the last token of the n-gram. In the given example, “c₁c₂” is the context. The leftmost token in the context is referred to as the left token. The future token is the last token of the n-gram, which in the example is “c₃”. The n-gram can also be described with respect to a right context. The right context includes all tokens of the n-gram following the first token of the n-gram, represented as a (n−1)-gram. In the example above, “c₂c₃” is the right context.
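The decomposition of an n-gram into these parts can be sketched as follows. The function name `describe_ngram` and the dictionary keys are assumptions chosen for this example only.

```python
def describe_ngram(tokens):
    """Split an n-gram into the parts named in the text above."""
    tokens = tuple(tokens)
    if not tokens:
        raise ValueError("an n-gram needs at least one token")
    return {
        "order": len(tokens),            # n, the number of tokens
        "context": tokens[:-1],          # left context: all tokens before the last
        "future_token": tokens[-1],      # the last token
        "left_token": tokens[0] if len(tokens) > 1 else None,  # leftmost context token
        "right_context": tokens[1:],     # all tokens after the first, an (n-1)-gram
    }

parts = describe_ngram(["c1", "c2", "c3"])
print(parts)
```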

Each n-gram can have an associated probability estimate, e.g., a log-probability, that is calculated as a function of a count of occurrences in training data relative to a count of total occurrences in the training data. In some implementations, the probabilities of n-grams being a translation of input text are trained using the relative frequency of the n-grams represented in a target language as a reference translation of corresponding text in a source language in training data, e.g., training data including a set of text in the source language and corresponding text in the target language.
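One simple reading of such a relative-frequency estimate is sketched below, assuming the estimate is the count of an n-gram divided by the total number of n-grams of the same order in the training tokens. The function name `ngram_log_probs` is hypothetical, and practical language models typically add smoothing, which is omitted here.

```python
import math
from collections import Counter

def ngram_log_probs(tokens, n):
    """Relative-frequency log-probabilities for all n-grams of order n."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    # log(count / total); no smoothing, so unseen n-grams get no entry at all
    return {gram: math.log(count / total) for gram, count in counts.items()}

log_probs = ngram_log_probs("a b a b c".split(), 2)
print(log_probs)
```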

Additionally, in some implementations, a distributed training environment is used for large training data (e.g., terabytes of data). One example technique for distributed training is MapReduce. Details of MapReduce are described in J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pp. 137-150 (Dec. 6, 2004).

Past usage represented by a training set can be used to predict how samples in one language should be translated to a target language. In particular, the n-grams, associated probability estimates, and respective counts can be stored in a language model for use by a decoder, e.g., a Bayesian decoder to identify translations for input text. A score indicating the likelihood that input text can be translated to corresponding text in a target language can be calculated by mapping the n-grams included in the input text to associated probability estimates for a particular translation.

Example Translation Process

FIG. 1 is a conceptual diagram of an example process 100 for translating input text from a source language to a target language. A source sample 102 is shown as a passage of Chinese text, and is provided to a first decoder 104. The decoder 104 can take a variety of forms and can be used in an attempt to maximize a posterior probability for the passage, given a training set of documents 106 that has been provided to the decoder 104 during a training phase for the decoder 104. In translating the sample 102, the decoder 104 can select n-grams from within the document and attempt to translate the n-grams. The decoder 104 can be provided with a re-ordering model, alignment model, and language model, among other possible models. The models direct the decoder 104 in selecting n-grams from within the sample 102 for translation. As one simple example, the model can use delimiters, e.g., punctuation such as a comma or period, to identify the end of an n-gram that may represent a word.

The decoder 104 can produce a variety of outputs, e.g., data structures that include possible translations. For example, the decoder 104 can produce an N-best list of translations. In some implementations, the decoder 104 generates a translation lattice 108, as described in further detail below.

A second decoder 110 then processes the translation lattice 108. While the first decoder 104 is generally aimed at maximizing the posterior probability of the translation, i.e., matching the input to what the historical collection of documents 106 may indicate to be a best match to past expert manual translations of other passages, the second decoder 110 is aimed at maximizing a quality measure for the translation. As such, the second decoder 110 may re-rank the candidate translations that reside in the translation lattice so as to produce a “best” translation that may be displayed to a user of the system 100. This translation is represented by the English sample 112 corresponding to the translation of the Chinese sample 102.

The second decoder 110 can use a process known as MBR decoding, which seeks the hypothesis (or candidate translation) that minimizes the expected error in classification. The process thus directly incorporates a loss function into the decision criterion for making a translation selection.

Minimum Bayes Risk Decoding

Minimum Bayes-Risk (MBR) decoding aims to find a translation hypothesis, e.g., a candidate translation, that has the least expected error under the probability model. Statistical machine translation can be described as mapping of input text F in a source language to translated text E in a target language. A decoder δ(F), e.g., decoder 104, can perform the mapping. If the reference translation E is known, the decoder performance can be measured by the loss function L(E, δ(F)). Given such a loss function L(E, E′) between an automatic translation E′ and the reference translation E, and an underlying probability model P(E, F), the MBR decoder, e.g., the second decoder 110, can be represented by:

$\hat{E} = \underset{E^{'} \in Ψ}{\arg \min} R (E^{'}) = \underset{E^{'} \in Ψ}{\arg \min} \sum_{E \in Ψ} L (E, E^{'}) P (E | F),$

where R(E′) represents the Bayes risk of candidate translation E′ under the loss function L, and Ψ represents the space of translations. For N-best MBR, the space Ψ is an N-best list produced, for example, by the first decoder 104. When a translation lattice is used, Ψ represents candidate translations encoded in the translation lattice.

If the loss function between any two hypotheses can be bounded, i.e., L(E, E′)≦L_max, the MBR decoder can be written in terms of a gain function, G(E, E′)=L_max−L(E, E′), as:

$\hat{E} = \underset{E^{'} \in Ψ}{\arg \max} \sum_{E \in Ψ} G (E, E^{'}) P (E | F) . \qquad \text{(Eq. 1)}$

In some implementations, MBR decoding uses different spaces for hypothesis selection and risk computation. For example, the hypothesis can be selected from an N-best list and the risk can be computed based on a translation lattice. In the example, the MBR decoder can be rewritten as:

$\hat{E} = \underset{E^{'} \in Ψ_{h}}{\arg \max} \sum_{E \in Ψ_{e}} G (E, E^{'}) P (E | F),$

where Ψ_h represents the hypothesis space and Ψ_e represents an evidence space used for computing Bayes risk.
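A minimal sketch of MBR decoding with separate hypothesis and evidence spaces follows. The function names, the bigram-overlap gain, and the posterior values are assumptions made for this example; they stand in for the gain function G and the model posterior P(E|F) and are not taken from this specification.

```python
def mbr_decode(hypotheses, evidence, gain):
    """Pick the hypothesis E' with the highest expected gain.

    hypotheses: iterable of candidate translations (the hypothesis space).
    evidence:   dict mapping translation E -> posterior P(E | F) (the evidence space).
    gain:       function G(E, E') returning a number.
    """
    def expected_gain(h):
        return sum(p * gain(e, h) for e, p in evidence.items())
    return max(hypotheses, key=expected_gain)

def bigrams(sentence):
    words = sentence.split()
    return set(zip(words, words[1:]))

def shared_bigrams(e, h):
    """Hypothetical gain: number of distinct bigrams shared by E and E'."""
    return len(bigrams(e) & bigrams(h))

# Hypothetical posteriors over three candidate translations.
evidence = {"a b d e": 0.4, "b c d e": 0.3, "b c d a": 0.3}
map_choice = max(evidence, key=evidence.get)
mbr_choice = mbr_decode(list(evidence), evidence, shared_bigrams)
print(map_choice, "|", mbr_choice)
```

In this toy setting the MBR choice differs from the highest-posterior candidate because it shares more bigrams with the remaining evidence.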

MBR decoding can be improved by using larger spaces, i.e., hypothesis and risk computation spaces. Lattices can include more candidate translations than an N-best list. For example, lattices can include more than one billion candidate translations. As such, representing the hypothesis and risk computation spaces using lattices increases the accuracy of MBR decoding, thereby increasing the likelihood that an accurate translation is provided.

Example Translation Lattice and MBR Decoding

FIG. 2A illustrates an example translation lattice 200. In particular, translation lattice 200 is a translation n-gram lattice that can be considered to be a compact representation for very large N-best lists of translation hypotheses and their likelihoods. Specifically, the lattice is an acyclic weighted finite state acceptor including states (e.g., states 0 through 6) and arcs representing transitions between states. Each arc is associated with an n-gram (e.g., a word or phrase) and a weight. For example, in translation lattice 200, n-grams are represented by labels “a”, “b”, “c”, “d”, and “e”. State 0 is connected to a first arc that provides a path to state 1, a second arc that provides a path to state 4 from state 1, and a third arc that provides a path to state 5 from state 4. The first arc is associated with “a” and weight 0.5, the second arc is associated with “b” and weight 0.6, and the third arc is associated with “d” and weight 0.3.

Each path in the translation lattice 200, including consecutive transitions beginning at an initial state (e.g., state 0) and ending at a final state (e.g., state 6), expresses a candidate translation. Aggregation of the weights along a path produces a weight of the path's candidate translation H(E, F) according to the model. The weight of the path's candidate translation represents the posterior probability of the translation E given the source sentence F as:

$P (E | F) = \frac{\exp (α \cdot H (E, F))}{\sum_{E^{'} \in Ψ} \exp (α \cdot H (E^{'}, F))},$

where α∈(0, ∞) is a scaling factor that flattens the distribution when α<1, and sharpens the distribution when α>1.
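The effect of the scaling factor can be illustrated with a short sketch. The function name `path_posteriors` and the scores are hypothetical; the maximum is subtracted before exponentiating only for numerical stability and does not change the result.

```python
import math

def path_posteriors(scores, alpha=1.0):
    """Turn path scores H(E, F) into posteriors P(E | F) with scaling factor alpha."""
    if alpha <= 0:
        raise ValueError("alpha must lie in (0, inf)")
    peak = max(alpha * h for h in scores.values())
    exps = {e: math.exp(alpha * h - peak) for e, h in scores.items()}
    z = sum(exps.values())  # normalizer over all candidate paths
    return {e: v / z for e, v in exps.items()}

scores = {"x": 2.0, "y": 1.0}          # hypothetical H(E, F) values
flat = path_posteriors(scores, alpha=0.5)
sharp = path_posteriors(scores, alpha=2.0)
print(flat, sharp)
```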

In some implementations, a gain function G is expressed as a sum of local gain functions g_i. A gain function can be considered to be a local gain function if it can be applied to all paths in the lattice using Weighted Finite State Transducer (WFST) composition, resulting in an o(N) increase in the number of states N in the lattice. The local gain functions can weight n-grams. For example, given a set of n-grams N={w₁, . . . , w_|N|}, a local gain function g_w: ℰ×ℰ→ℝ, where w∈N, can be expressed as:

g_w(E|E′)=θ_w·#_w(E′)·δ_w(E),

where θ_w is a constant, #_w(E′) is the number of times that w occurs in E′, and δ_w(E) is 1 if w∈E and 0 otherwise. Assuming that the overall gain function G(E, E′) can be written as a sum of local gain functions and a constant θ₀ times the length of the hypothesis E′, the overall gain function can be expressed as:

$G (E, E^{'}) = θ_{0} \langle E^{'} \rangle + \sum_{w \in N} g_{w} (E | E^{'}) = θ_{0} \langle E^{'} \rangle + \sum_{w \in N} θ_{w} \cdot \#_{w} (E^{'}) \cdot δ_{w} (E) .$

Using this overall gain function, the risk, i.e.,

$\sum_{E \in Ψ} G (E, E^{'}) P (E | F),$

can be rewritten such that the MBR decoder for the lattice (in Equation 1) is expressed as:

$\hat{E} = \underset{E^{'} \in Ψ}{\arg \max} \left\{ θ_{0} \langle E^{'} \rangle + \sum_{w \in N} θ_{w} \cdot \#_{w} (E^{'}) \cdot P (w | Ψ) \right\}, \qquad \text{(Eq. 2)}$

where P(w|Ψ) is the posterior probability of the n-gram w in the lattice, or Σ_{E∈Ψ_w} P(E|F), and can be expressed as:

$P (w | Ψ) = \sum_{E \in Ψ_{w}} P (E | F) = \frac{Z (Ψ_{w})}{Z (Ψ)}, \qquad \text{(Eq. 3)}$

where Ψ_w={E∈Ψ | δ_w(E)>0} represents the paths of the lattice containing the n-gram w at least once, and Z(Ψ_w) and Z(Ψ) represent the sum of weights of all paths in the lattices Ψ_w and Ψ, respectively.
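The ratio Z(Ψ_w)/Z(Ψ) can be illustrated on a tiny lattice by brute-force path enumeration. This sketch assumes that the weight of a path is the product of its arc weights and that the lattice is small enough to enumerate; the function names and the example lattice are hypothetical, and the enumeration is exponential in general.

```python
def enumerate_paths(arcs, state, final):
    """Yield (words, weight) for every path from state to final in an acyclic lattice."""
    if state == final:
        yield (), 1.0
        return
    for nxt, word, weight in arcs.get(state, []):
        for words, w in enumerate_paths(arcs, nxt, final):
            yield (word,) + words, weight * w

def contains(words, ngram):
    """True if ngram occurs at least once as a contiguous subsequence of words."""
    n = len(ngram)
    return any(words[i:i + n] == ngram for i in range(len(words) - n + 1))

def ngram_posterior(arcs, start, final, ngram):
    """P(w | lattice) = Z(paths containing w) / Z(all paths)."""
    ngram = tuple(ngram)
    z_all = z_w = 0.0
    for words, w in enumerate_paths(arcs, start, final):
        z_all += w
        if contains(words, ngram):
            z_w += w
    return z_w / z_all

# Hypothetical lattice: state -> [(next_state, word, weight), ...]
lattice = {
    0: [(1, "a", 0.6), (2, "b", 0.4)],
    1: [(3, "b", 1.0)],
    2: [(3, "c", 1.0)],
}
print(ngram_posterior(lattice, 0, 3, ["b"]), ngram_posterior(lattice, 0, 3, ["a", "b"]))
```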

In some implementations, the MBR decoder (Equation 2) is implemented using WFSTs. A set of n-grams that are included in the lattice are extracted, e.g., by traversing the arcs in the lattice in topological order. Each state in the lattice has a corresponding set of n-gram prefixes. Each arc leaving a state extends each of the state's prefixes by a single word. N-grams that occur at a state followed by an arc in the lattice are included in the set. As an initialization step, an empty prefix can be initially added to each state's set.

For each n-gram w, an automaton (e.g., another lattice) matching paths containing the n-gram is generated, and the automaton is intersected with the lattice to find a set of paths containing the n-gram, i.e., Ψ_w. For example, if Ψ represents the weighted lattice, Ψ_w can be represented as:

Ψ_w=Ψ∩(Σ*wΣ*).

The posterior probability P(w|Ψ) of n-gram w can be calculated as a ratio of the total weights of paths in Ψ_w to the total weights of paths in the original lattice Ψ, as given above in Equation 3.

The posterior probability for each n-gram w can be calculated as described above, and then multiplied by θ_w (an n-gram factor) as described with respect to Equation 2. An automaton is generated that accepts an input with weight equal to the number of times the n-gram occurs in the input times θ_w·P(w|Ψ). The automaton can be represented using the weighted regular expression:

w̄(w/(θ_wP(w|Ψ)) w̄)*,

where w̄=Σ*−(Σ*wΣ*) is the language that includes all strings that do not contain the n-gram w.

Each generated automaton is successively intersected with second automatons that each begin as an un-weighted copy of the lattice. Each of these second automatons is generated by intersecting the un-weighted lattice with an automaton accepting (Σ/θ₀)*. The resulting automaton represents the total expected gain of each path. A path in the resulting automaton that represents a word sequence E′ has a cost:

$θ_{0} \langle E^{'} \rangle + \sum_{w \in N} θ_{w} \cdot \#_{w} (E^{'}) \cdot P (w | Ψ) .$

The path associated with the least cost, e.g., according to Equation 2, is extracted from the resulting automaton, producing the lattice MBR candidate translation.

In implementations where the hypothesis and evidence spaces lattices are different, the evidence space lattice is used for extracting the n-grams and computing associated posterior probabilities. The MBR automaton is constructed starting with an un-weighted copy of the hypothesis space lattice. Each of the n-gram automata is successively intersected with the un-weighted copy of the hypothesis space lattice.

An approximation to the BLEU score is used to calculate a decomposition of the overall gain function G(E, E′) as a sum of local gain functions. A BLEU score is an indicator of translation quality of text which has been machine translated. Additional details of BLEU are described in K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. Technical Report RC22176 (W0109-022), IBM Research Division. In particular, the system calculates a first order Taylor-series approximation to the change in corpus BLEU score from including a sentence to not including the sentence in the corpus.

Given a reference length r of a corpus (e.g., a length of a reference sentence, or a sum of the lengths of multiple reference sentences), a candidate length c₀, and a number of n-gram matches {c_n|1≦n≦4}, the corpus BLEU score B(r, c₀, c_n) can be approximated as:

$\begin{aligned} \log B &= \min \left( 0, 1 - \frac{r}{c_{0}} \right) + \frac{1}{4} \sum_{n = 1}^{4} \log \frac{c_{n}}{c_{0} - Δ_{n}} \\ &\approx \min \left( 0, 1 - \frac{r}{c_{0}} \right) + \frac{1}{4} \sum_{n = 1}^{4} \log \frac{c_{n}}{c_{0}}, \end{aligned}$

where Δ_n, the difference between a number of words in the candidate and the number of n-grams: Δ_n=n−1, is assumed to be negligible.

The corpus log(BLEU) gain is defined as the change in log(BLEU) when the statistics of a new sentence E′ are included in the corpus statistics, and is expressed as:

G=log B′−log B,

where the counts in B′ are those of B added to the counts for the current sentence. In some implementations, an assumption that c≧r is used, and only c_n is treated as a variable. Therefore, the corpus log BLEU gain can be approximated by a first-order vector Taylor series expansion about the initial values of c_n as:

$G = \sum_{n = 0}^{N} (c_{n}^{'} - c_{n}) \left. \frac{\partial \log B^{'}}{\partial c_{n}^{'}} \right|_{c_{n}^{'} = c_{n}},$

where the partial derivatives are expressed as:

$\frac{\partial \log B}{\partial c_{0}} = \frac{- 1}{c_{0}}, \quad \text{and} \quad \frac{\partial \log B}{\partial c_{n}} = \frac{1}{4 c_{n}} .$

Therefore, the corpus log(BLEU) gain can be rewritten as:

$G = Δ \log B \approx - \frac{Δ c_{0}}{c_{0}} + \frac{1}{4} \sum_{n = 1}^{4} \frac{Δ c_{n}}{c_{n}},$

where the Δ terms count various statistics in a sentence of interest, rather than the corpus as a whole. These approximations suggest that the values of θ₀ and θ_w (e.g., in Equation 2) can be expressed as:

$θ_{0} = \frac{- 1}{c_{0}}, \quad \text{and} \quad θ_{w} = \frac{1}{4 c_{\langle w \rangle}} .$
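As a numerical sanity check of the first-order approximation, the sketch below compares the exact change in approximate log(BLEU) with the linearized gain for one added sentence. All counts are hypothetical, the corpus is chosen so that the candidate length is at least the reference length (so the brevity term is zero), and Δ_n is ignored as in the text.

```python
import math

def log_bleu(r, c0, matches):
    """Approximate corpus log(BLEU): brevity term plus mean log n-gram precision."""
    brevity = min(0.0, 1.0 - r / c0)
    return brevity + 0.25 * sum(math.log(cn / c0) for cn in matches)

def linear_gain(c0, matches, d_c0, d_matches):
    """First-order estimate of the change in log(BLEU) for one added sentence."""
    return -d_c0 / c0 + 0.25 * sum(dc / cn for dc, cn in zip(d_matches, matches))

# Hypothetical corpus statistics and the statistics of one new sentence.
r, c0, matches = 950, 1000, [800, 500, 300, 200]
d_r, d_c0, d_matches = 24, 25, [20, 12, 8, 5]

exact = (log_bleu(r + d_r, c0 + d_c0, [c + d for c, d in zip(matches, d_matches)])
         - log_bleu(r, c0, matches))
approx = linear_gain(c0, matches, d_c0, d_matches)
print(exact, approx)
```

For counts of this size the two values agree closely, which is why the per-sentence gain can be treated as linear in the sentence statistics.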

Assuming that the precision of each n-gram is a constant ratio r times the precision of a corresponding (n−1)-gram, the BLEU score can be accumulated at the sentence level. For example, if the average sentence length in a corpus is assumed to be 25 words, then:

$\frac{\# \, (n)\text{-gram tokens}}{\# \, (n - 1)\text{-gram tokens}} = 1 - \frac{1}{25} = 0.96 .$

If the unigram precision is p, the n-gram factors (nε{1,2,3,4}), as a function of the parameters p and r and the number of unigram tokens T, can be expressed as:

$θ_{0} = \frac{- 1}{T}, \quad \text{and} \quad θ_{w} = \frac{1}{Tp \cdot 4 {(0.96 \cdot r)}^{n}} .$

In some implementations, p and r are set to the average values of unigram precision and precision ratio across multiple training sets. Substituting the n-gram factors in Equation 2 provides that the MBR decoder, e.g., an MBR decision rule, does not depend on T and multiple values of T can be used.
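The n-gram factors can be computed directly from the expressions above. In this sketch the function name `ngram_factors` and its parameter names are assumptions; the value 0.96 corresponds to the assumed average sentence length of 25 words, and the values T=10, p=0.85, and r=0.75 are used only as sample inputs.

```python
def ngram_factors(T, p, r, token_ratio=0.96, max_order=4):
    """Return (theta_0, {n: theta_n}) following the expressions above."""
    theta0 = -1.0 / T
    theta = {
        n: 1.0 / (T * p * 4 * (token_ratio * r) ** n)
        for n in range(1, max_order + 1)
    }
    return theta0, theta

theta0, theta = ngram_factors(T=10, p=0.85, r=0.75)
print(theta0, theta)

# Changing T rescales every factor by the same constant, so the arg max
# in the decision rule is unaffected.
theta0_b, theta_b = ngram_factors(T=100, p=0.85, r=0.75)
```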

FIG. 2B illustrates an example MBR automaton for the translation lattice of FIG. 2A. The bold path in the translation lattice 200 in FIG. 2A is a Maximum A Posteriori (MAP) hypothesis, and the bold path in the MBR automaton 250 in FIG. 2B is an MBR hypothesis. In the example illustrated by FIGS. 2A and 2B, T=10, p=0.85, and r=0.75. Note that the MBR hypothesis (bcde) has a higher decoder cost relative to the MAP hypothesis (abde). However, bcde receives a higher expected gain than abde since it shares more n-grams with the third ranked hypothesis (bcda), illustrating how a lattice can help select MBR translations that are different from a MAP translation.

Minimum Error Rate Training (MERT) Overview

Minimum error rate training (MERT) measures an error metric of a decision rule for classification, e.g., MBR decision rule using a zero-one loss function. In particular, MERT estimates model parameters such that the decision under the zero-one loss function maximizes an end-to-end performance measure on a training corpus. In combination with log-linear models, the training procedure optimizes an unsmoothed error count. As previously stated, the translation that maximizes the a-posteriori probability can be selected based on

$\underset{e}{\arg \max} P (e | f) .$

Since the true posterior distribution is unknown, P(e|f) is approximated with a log-linear translation model, for example, which combines one or more feature functions h_m(e,f) with feature function weights λ_m, where m=1, . . . , M. The log-linear translation model can be expressed as:

$P (e | f) = P_{λ_{1}^{M}} (e | f) = \frac{\exp [\sum_{m = 1}^{M} λ_{m} h_{m} (e, f)]}{\sum_{e^{'}} \exp [\sum_{m = 1}^{M} λ_{m} h_{m} (e^{'}, f)]} .$

The feature function weights are the parameters of the model, and the MERT criterion finds a parameter set λ₁^M that minimizes the error count on a representative set of training sentences using the decision rule, e.g., P(e|ƒ). Given source sentences f₁^S of a training corpus, reference translations r₁^S, and a set of K candidate translations C_s={e_s,1, . . . , e_s,K}, the corpus-based error count for translations e₁^S is additively decomposable into error counts of individual sentences, i.e.,

$E (r_{1}^{S}, e_{1}^{S}) = \sum_{s = 1}^{S} E (r_{s}, e_{s}) .$

The MERT criterion can be expressed as:

$\begin{aligned} \hat{λ}_{1}^{M} &= \underset{λ_{1}^{M}}{\arg \min} \left\{ \sum_{s = 1}^{S} E (r_{s}, \hat{e} (f_{s}; λ_{1}^{M})) \right\} \\ &= \underset{λ_{1}^{M}}{\arg \min} \left\{ \sum_{s = 1}^{S} \sum_{k = 1}^{K} E (r_{s}, e_{s, k}) \, δ (\hat{e} (f_{s}; λ_{1}^{M}), e_{s, k}) \right\}, \end{aligned}$

where

$\hat{e} (f_{s}; λ_{1}^{M}) = \underset{e}{\arg \max} \left\{ \sum_{m = 1}^{M} λ_{m} h_{m} (e, f_{s}) \right\} . \qquad \text{(Eq. 4)}$

A line optimization technique can be used to train a linear model under the MERT criterion. The line optimization determines, for each feature function h_m and sentence f_s, the exact error surface on a set of candidate translations C_s. The feature function weights are then adjusted by traversing the combined error surfaces of sentences in the training corpus and setting weights to a point where the resulting error is a minimum.

The most probable sentence hypothesis in C_s along a line λ₁^M+γ·d₁^M can be defined as:

$\hat{e} (f_{s}; γ) = \underset{e \in C_{s}}{\arg \max} \left\{ {(λ_{1}^{M} + γ \cdot d_{1}^{M})}^{T} \cdot h_{1}^{M} (e, f_{s}) \right\} .$

The total score for any candidate translation corresponds to a line in the plane with γ as the independent variable. Overall, C_s defines K lines, where each line may be divided into at most K line segments due to possible intersections with the other K−1 lines.
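The mapping from a candidate translation to its line can be sketched as follows: expanding the dot product gives an intercept λ·h and a slope d·h. The function names and the sample vectors are hypothetical.

```python
def candidate_line(features, weights, direction):
    """Return (slope, intercept) of a candidate's score as a function of gamma.

    score(gamma) = (weights + gamma * direction) . features
                 = weights . features + gamma * (direction . features)
    """
    intercept = sum(l * h for l, h in zip(weights, features))
    slope = sum(d * h for d, h in zip(direction, features))
    return slope, intercept

def score_at(features, weights, direction, gamma):
    """Direct evaluation of the score, used here only to check the line form."""
    return sum((l + gamma * d) * h for l, d, h in zip(weights, direction, features))

slope, intercept = candidate_line([1.0, 2.0], [0.5, 0.5], [1.0, 0.0])
print(slope, intercept)
```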

For each γ, the decoder (e.g., the second decoder 110) determines a respective candidate translation that yields the highest score and therefore corresponds to a topmost line segment. A sequence of topmost line segments constitutes an upper envelope that is a point-wise maximum over all lines defined by C_s. The upper envelope is a convex hull and can be inscribed with a convex polygon whose edges are the segments of a piecewise linear function in γ. In some implementations, the upper envelope is calculated using a sweep line technique. Details of the sweep line technique are described, for example, in W. Macherey, F. Och, I. Thayer, and J. Uszkoreit, Lattice-based Minimum Error Rate Training for Statistical Machine Translation, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 725-734, Honolulu, October 2008.
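One common way to compute such an upper envelope is to sort the lines by slope and maintain a stack of lines that are still visible. The sketch below follows that general idea and is not asserted to match the referenced technique in every detail; the function name `upper_envelope` and the tuple layout are assumptions.

```python
def upper_envelope(lines):
    """Compute the upper envelope of lines given as (slope, intercept, label).

    Returns [(gamma_start, label), ...] ordered by gamma_start; each label is
    the topmost line from gamma_start up to the next entry's gamma_start.
    """
    # Sort by slope; among parallel lines the one with the highest intercept comes last.
    ordered = sorted(lines, key=lambda line: (line[0], line[1]))
    hull = []  # entries: (gamma_start, slope, intercept, label)
    for slope, intercept, label in ordered:
        start = float("-inf")
        while hull:
            prev_start, s, b, _ = hull[-1]
            if s != slope:
                # Intersection of the previous line with the new, steeper line.
                x = (b - intercept) / (slope - s)
                if x > prev_start:
                    start = x
                    break
            # Previous line is parallel and lower, or hidden by the new line.
            hull.pop()
        hull.append((start, slope, intercept, label))
    return [(g, label) for g, _, _, label in hull]

envelope = upper_envelope([
    (-1.0, 0.0, "A"),   # y = -x
    (0.0, 1.0, "B"),    # y = 1
    (1.0, 0.0, "C"),    # y = x
    (0.0, -5.0, "D"),   # y = -5, never on top
])
print(envelope)
```

Each boundary in the returned list is a value of γ at which the topmost candidate, and therefore possibly the error count, changes.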

MERT on Lattices

A lattice (e.g., a phrase lattice) for a source sentence f can be defined as a connected, directed acyclic graph _f=(ν_f, ε_f) with vertices set ν_f, unique source and sink nodes s, tεν_f, and a set of arcs; {umlaut over (ε)}_f⊂ν_f×ν_f. Each arc is labeled with a phrase ρ_ij=e_i₁, . . . , e_i_jand the (local) feature function values h₁^M(ρ_ij, f) of this phrase. A path π=(v₀, ε₀, v₁, ε₁, . . . , ε_n−1, v_n) in G_f(with ε_iεε_fand v_i, v_i+1εν_fas the tail and head of ε_i, 0≦i<n) defines a partial translation e (of f), which is the concatenation of all phrases along this path. Related feature function values are obtained by summing over the arc-specific feature function values:

$π : \underset{v_{0}}{•} \overset{ϕ_{0,1}}{\underset{h_{1}^{M}(ϕ_{0,1}, f)}{\to}} \underset{v_{1}}{•} \overset{ϕ_{1,2}}{\underset{h_{1}^{M}(ϕ_{1,2}, f)}{\to}} \dots \overset{ϕ_{n-1,n}}{\underset{h_{1}^{M}(ϕ_{n-1,n}, f)}{\to}} \underset{v_{n}}{•}$

$e_{π} = \underset{i,j : v_{i} \to v_{j} \in π}{\bigcirc} ϕ_{ij} = ϕ_{0,1} \circ \dots \circ ϕ_{n-1,n}$

$h_{1}^{M}(e_{π}, f) = \sum_{i,j : v_{i} \to v_{j} \in π} h_{1}^{M}(ϕ_{ij}, f)$
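The summation over arcs can be sketched in Python as follows. The sketch assumes a path is given as a list of (phrase, local feature values) pairs; the function names, the example phrases, and the feature values are hypothetical. The second function shows how the summed features turn a path into a line in γ, with slope d·h and intercept λ·h.

```python
from typing import List, Sequence, Tuple

Arc = Tuple[str, Sequence[float]]  # (phrase, local feature values h_1..h_M)


def path_translation_and_features(path: List[Arc]) -> Tuple[str, List[float]]:
    """Concatenate the phrases along a path and sum the arc-specific features."""
    totals = [0.0] * len(path[0][1])
    for _, feats in path:
        for k, value in enumerate(feats):
            totals[k] += value
    return " ".join(phrase for phrase, _ in path), totals


def as_line(features: Sequence[float], lam: Sequence[float],
            d: Sequence[float]) -> Tuple[float, float]:
    """Score (lam + gamma * d)^T h as a line in gamma: (slope, intercept)."""
    slope = sum(dk * hk for dk, hk in zip(d, features))
    intercept = sum(lk * hk for lk, hk in zip(lam, features))
    return slope, intercept


text, feats = path_translation_and_features(
    [("the house", [1.0, -0.5]), ("is small", [2.0, -1.5])])
print(text, feats, as_line(feats, lam=[1.0, 1.0], d=[1.0, 0.0]))
```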

In the following discussion, the notations enter(ν) and leave(ν) refer to the sets of incoming and outgoing arcs, respectively, for a node ν ∈ ν_f. Similarly, head(ε) and tail(ε) denote the head and tail of arc ε, respectively.

FIG. 3 illustrates a portion of an example translation lattice 300. In FIG. 3, incoming arcs 302, 304, and 306 enter node ν 310. In addition, outgoing arcs 312 and 314 exit node ν 310.

Each path that starts at a source node s and ends in ν (e.g., node ν 310) defines a partial translation hypothesis that can be represented as a line (cf. Equation 4). Assume that the upper envelope for these partial translation hypotheses is known, and the lines that define the envelope are denoted by f₁, . . . , f_N. Outgoing arcs ε that are elements of the set leave(ν), e.g., arc 312, represent continuations of these partial candidate translations. Each outgoing arc defines another line denoted by g(ε). Adding the parameters of g(ε) to all lines in the set f₁, . . . , f_N produces an upper envelope defined by f₁+g(ε), . . . , f_N+g(ε).

Because the addition of g(ε) does not change the number of line segments or their relative order in the envelope, the structure of the convex hull is preserved. Therefore, the resulting upper envelope can be propagated over an outgoing arc ε to a successor node ν′=head(ε). Other incoming arcs for ν′ may be associated with different upper envelopes. The upper envelopes are merged into a single, combined envelope, which is the convex hull of the union over the line sets that constitute the individual envelopes. By combining the upper envelopes for each incoming arc of ν′, the upper envelope for all partial candidate translations that are associated with paths starting at the source node s and ending in ν′ is generated.
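The propagate-and-merge step can be sketched in Python under simplifying assumptions: nodes are integers numbered in topological order (0 is the source, the highest number is the sink), and each arc already carries the slope and intercept of its line g(ε). The function names and the arc layout are hypothetical, and the envelope routine is a plain sort-and-sweep stand-in.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Line = Tuple[float, float, str]           # (slope, intercept, partial translation)
Arc = Tuple[int, int, float, float, str]  # (tail, head, slope, intercept, phrase)


def upper_envelope(lines: List[Line]) -> List[Line]:
    """Keep only the lines that are topmost for some gamma."""
    hull: List[Tuple[float, Line]] = []
    for line in sorted(lines, key=lambda l: (l[0], l[1])):
        m, b, _ = line
        if hull and hull[-1][1][0] == m:
            hull.pop()  # same slope, lower intercept: dominated
        start = float("-inf")
        while hull:
            x0, (m0, b0, _) = hull[-1]
            x = (b0 - b) / (m - m0)
            if x <= x0:
                hull.pop()
            else:
                start = x
                break
        hull.append((start, line))
    return [line for _, line in hull]


def lattice_envelope(num_nodes: int, arcs: List[Arc]) -> List[Line]:
    """Upper envelope over all source-to-sink paths of the lattice."""
    leave: Dict[int, List[Arc]] = defaultdict(list)
    for arc in arcs:
        leave[arc[0]].append(arc)
    incoming: Dict[int, List[Line]] = defaultdict(list)
    incoming[0] = [(0.0, 0.0, "")]  # empty hypothesis at the source
    for node in range(num_nodes):
        # Merge the envelopes that arrived over all incoming arcs.
        env = upper_envelope(incoming[node])
        if node == num_nodes - 1:
            return env
        for _, head, m, b, phrase in leave[node]:
            # Adding g(arc) shifts every line but keeps the envelope structure.
            incoming[head].extend(
                (m0 + m, b0 + b, (text + " " + phrase).strip())
                for m0, b0, text in env)
    return []


arcs = [(0, 1, 1.0, 0.0, "a"), (0, 1, -1.0, 0.0, "b"),
        (1, 2, 0.0, 1.0, "c"), (0, 2, 0.0, -5.0, "d")]
print(lattice_envelope(3, arcs))
```

Only envelope lines are forwarded at each node, so hypotheses that can never be topmost are dropped before they are extended.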

Other implementations are possible. In particular, additional refinements can be performed to improve the performance of MERT (for lattices). For example, in order to prevent the line optimization technique from stopping in a poor local optimum, MERT can explore additional starting points that are randomly chosen by sampling the parameter space. As another example, the range of some or all feature function weights can be limited by defining weight restrictions. In particular, a weight restriction for a feature function h_m can be specified as an interval R_m=[l_m, r_m], l_m, r_m ∈ ℝ∪{−∞, +∞}, which defines an admissible region from which the feature function weight λ_m can be selected. If the line optimization is performed under weight restrictions, γ is selected such that: l₁^M ≦ λ₁^M+γ·d₁^M ≦ r₁^M.
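One way to apply such restrictions is to convert the per-weight intervals into a single admissible interval for γ. The Python sketch below does this under the assumption that the weights, direction, and bounds are plain lists; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def admissible_gamma(lam: Sequence[float], d: Sequence[float],
                     lower: Sequence[float],
                     upper: Sequence[float]) -> Tuple[float, float]:
    """Return (lo, hi) such that lower <= lam + gamma * d <= upper
    holds component-wise for every gamma in [lo, hi]."""
    lo, hi = -math.inf, math.inf
    for lm, dm, l, r in zip(lam, d, lower, upper):
        if dm == 0.0:
            # This weight does not move along the line; it must already be valid.
            if not (l <= lm <= r):
                raise ValueError("weight lies outside its restriction")
            continue
        a, b = (l - lm) / dm, (r - lm) / dm
        if dm < 0:
            a, b = b, a  # dividing by a negative number flips the bounds
        lo, hi = max(lo, a), min(hi, b)
    return lo, hi


print(admissible_gamma(lam=[0.5, 1.0], d=[1.0, -2.0],
                       lower=[0.0, -math.inf], upper=[1.0, 2.0]))
```

If the returned lower bound exceeds the upper bound, no γ along that line satisfies all restrictions.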

In some implementations, weights priors can be used. Weights priors provide a small (positive or negative) boost ω on the objective function if the new weight is chosen so as to match a certain target value λ_m*:

$γ_{opt} = \underset{γ}{\arg\min} \left\{ \sum_{s} E(r_{s}, \hat{e}(f_{s}; γ)) + \sum_{m} δ(λ_{m} + γ \cdot d_{m}, λ_{m}^{*}) \cdot ω \right\}$

A zero weights prior λ_m*=0 allows feature selection because the weights of feature functions that are not discriminative are set to zero. As another example, an initial weights prior λ_m*=λ_m can be used to limit changes in parameters, such that an updated parameter set has fewer differences relative to an initial weights set.
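The effect of a weights prior on the choice of γ can be sketched in Python. The sketch assumes that δ equals 1 when its two arguments match (within a small tolerance) and 0 otherwise, and that the error surface has already been reduced to a list of (γ, error count) candidates; both are assumptions for illustration, as are the function names.

```python
import math
from typing import List, Sequence, Tuple


def prior_term(lam: Sequence[float], d: Sequence[float], gamma: float,
               target: Sequence[float], omega: float, tol: float = 1e-9) -> float:
    """Sum of omega over all weights that land on their target value."""
    matches = sum(
        1 for lm, dm, tm in zip(lam, d, target)
        if math.isclose(lm + gamma * dm, tm, rel_tol=0.0, abs_tol=tol))
    return matches * omega


def best_gamma(candidates: List[Tuple[float, float]], lam: Sequence[float],
               d: Sequence[float], target: Sequence[float], omega: float) -> float:
    """Pick the gamma that minimizes error count plus the prior term."""
    return min(candidates,
               key=lambda c: c[1] + prior_term(lam, d, c[0], target, omega))[0]


candidates = [(0.25, 10), (-1.0, 10), (2.0, 11)]
# Zero weights prior: gamma = -1.0 drives the first weight to exactly zero.
print(best_gamma(candidates, lam=[1.0, 0.5], d=[1.0, 0.0],
                 target=[0.0, 0.0], omega=-0.5))
```

Because the objective is minimized, a negative ω in this sketch rewards a match and breaks the tie between the two candidates with equal error counts.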

In some implementations, an interval [γ_i^{f_s}, γ_{i+1}^{f_s}) of a translation hypothesis, which has a change in error count ΔE_i^{f_s} that is equal to zero, is merged with an interval [γ_{i−1}^{f_s}, γ_i^{f_s}) of its left-adjacent translation hypothesis. The resulting interval [γ_{i−1}^{f_s}, γ_{i+1}^{f_s}) has a larger range, and the reliability of a selection of the optimum value of λ can be increased.
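Interval merging can be sketched in Python as follows, assuming that a sentence's error surface is stored as a sorted list of (γ_i, ΔE_i) pairs whose first entry starts at −∞ and carries the initial error count. This representation and the function name are hypothetical.

```python
import math
from typing import List, Tuple

Boundary = Tuple[float, int]  # (left end of the interval, change in error count)


def merge_zero_change_intervals(surface: List[Boundary]) -> List[Boundary]:
    """Drop every boundary whose error-count change is zero, which merges
    that interval into its left-adjacent interval."""
    merged = [surface[0]]  # the first interval has no left neighbor
    for gamma, delta in surface[1:]:
        if delta != 0:
            merged.append((gamma, delta))
    return merged


surface = [(-math.inf, 5), (-1.0, 0), (0.5, -2), (2.0, 0), (3.0, 1)]
print(merge_zero_change_intervals(surface))
```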

In some implementations, the system uses random directions to update multiple feature functions simultaneously. If the directions used in line optimization are the coordinate axes of the M-dimensional parameter space, each iteration results in an update of a single feature function. While this update technique provides a ranking of the feature functions according to their discriminative power, e.g., each iteration selects a feature function for which changing the corresponding weight yields the highest gain, the update technique does not account for possible correlations between the feature functions. As a result, optimization may stop in a poor local optimum. The use of random directions allows multiple feature functions to be updated simultaneously. The use of random directions can be implemented by selecting lines that connect one or more random points on the surface of an M-dimensional hypersphere with the hypersphere's center (defined by the initial parameter set).
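A random direction of this kind can be sketched in Python. The sketch draws a point on the unit hypersphere by normalizing a vector of independent Gaussian samples, which is one standard way to sample such a point and is an assumption here rather than a method stated above; the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def random_direction(m: int, rng: random.Random) -> List[float]:
    """Unit vector pointing from the hypersphere's center to a random
    point on its surface."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def point_on_line(lam: Sequence[float], d: Sequence[float],
                  gamma: float) -> List[float]:
    """Weights reached by moving gamma along direction d from lam."""
    return [lm + gamma * dm for lm, dm in zip(lam, d)]


rng = random.Random(0)
d = random_direction(4, rng)
print(d, point_on_line([1.0, 2.0, 0.0, -1.0], d, 0.5))
```

Unlike a coordinate axis, such a direction generally has nonzero components for all M weights, so a single line search changes several weights at once.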

FIG. 4 shows an example process 400 for MBR decoding. For convenience, MBR decoding will be described with respect to a system that performs the decoding. The system accesses 410 a hypothesis space. The hypothesis space represents a plurality of candidate translations, e.g., in a target language of corresponding input text in a source language. For example, a decoder (e.g., second decoder 110 in FIG. 1) can access a translation lattice (e.g., translation lattice 108). The system performs 420 decoding on the hypothesis space to obtain a translation hypothesis that minimizes an expected error in classification calculated relative to an evidence space. For example, the decoder can perform the decoding. The system provides 430 the obtained translation hypothesis for use by a user as a suggested translation in a target translation. For example, the decoder can provide a translation text (e.g., English sample 112) for use by a user.
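The decoding step of process 400 can be illustrated with a small Python sketch that works on explicit lists rather than on a lattice. It assumes the evidence space is a list of (translation, posterior probability) pairs and uses a word-set overlap loss purely as a stand-in for a real translation error metric; the names, the loss, and the probabilities are all hypothetical.

```python
from typing import Callable, List, Tuple


def overlap_loss(reference: str, hypothesis: str) -> float:
    """Toy loss: 1 minus the Jaccard overlap of the two word sets."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return 1.0 - len(ref & hyp) / len(ref | hyp)


def mbr_decode(hypotheses: List[str], evidence: List[Tuple[str, float]],
               loss: Callable[[str, str], float]) -> str:
    """Return the hypothesis with the lowest expected loss under the evidence."""
    def risk(hyp: str) -> float:
        return sum(prob * loss(ref, hyp) for ref, prob in evidence)
    return min(hypotheses, key=risk)


evidence = [("a b", 0.4), ("c d e", 0.35), ("c d f", 0.25)]
hypotheses = [text for text, _ in evidence]
print(mbr_decode(hypotheses, evidence, overlap_loss))
```

In this example the most probable single translation is "a b", but "c d e" has the lowest expected loss because it partly agrees with "c d f".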

FIG. 5A shows an example process 500 for MERT on a lattice. For convenience, performing MERT will be described with respect to a system that performs the training. The system determines 510, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice. For example, error surface generation module 560 of Minimum Error Rate Trainer 550 in FIG. 5B can determine the corresponding plurality of error surfaces. The system adjusts 520 weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set. For example, update module 570 of Minimum Error Rate Trainer 550 can adjust the weights. The system selects 530 weighting values that minimize error counts for the traversed combination. For example, error minimization module 580 of Minimum Error Rate Trainer 550 can select weighting values. The system applies 540 the selected weighting values to convert a sample of text from a first language to a second language. For example, Minimum Error Rate Trainer 550 can apply the selected weighting values to a decoder.
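The traversal and selection steps of process 500 can be sketched in Python for a single line search. The sketch assumes each sentence's error surface is given as an initial error count (valid for γ toward −∞) plus a list of (γ, ΔE) boundaries, and it returns the midpoint of the best interval; the representation, the midpoint choice, and the function name are assumptions for illustration.

```python
import math
from typing import List, Tuple

# (error count at gamma -> -inf, [(gamma_i, delta_E_i), ...])
Surface = Tuple[int, List[Tuple[float, int]]]


def optimal_gamma(surfaces: List[Surface]) -> Tuple[float, int]:
    """Merge per-sentence error surfaces and return (gamma, total error)
    for an interval with the lowest total error count."""
    error = sum(initial for initial, _ in surfaces)
    events = sorted((g, de) for _, changes in surfaces for g, de in changes)
    best_err = error
    best_left = -math.inf
    best_right = events[0][0] if events else math.inf
    for i, (gamma, delta) in enumerate(events):
        error += delta
        right = events[i + 1][0] if i + 1 < len(events) else math.inf
        # Only evaluate once all changes at the same gamma have been applied.
        if right > gamma and error < best_err:
            best_err, best_left, best_right = error, gamma, right
    if math.isinf(best_left) and math.isinf(best_right):
        return 0.0, best_err
    if math.isinf(best_left):
        return best_right - 1.0, best_err
    if math.isinf(best_right):
        return best_left + 1.0, best_err
    return (best_left + best_right) / 2.0, best_err


surfaces = [(3, [(-1.0, -2), (2.0, 1)]), (2, [(0.0, -1), (4.0, 2)])]
print(optimal_gamma(surfaces))
```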

FIG. 6 shows an example of a generic computer device 600 and a generic mobile computer device 650, which may be used with the techniques (e.g., processes 400 and 500) described. Computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 650 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the systems and techniques described and/or claimed in this document.

Computing device 600 includes a processor 602, memory 604, a storage device 606, a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610, and a low speed interface 612 connecting to low speed bus 614 and storage device 606. The components 602, 604, 606, 608, 610, and 612 are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 602 can process instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, such as display 616 coupled to high speed interface 608. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 604 stores information within the computing device 600. In one implementation, the memory 604 is a volatile memory unit or units. In another implementation, the memory 604 is a non-volatile memory unit or units. The memory 604 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 606 is capable of providing mass storage for the computing device 600. In one implementation, the storage device 606 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 604, the storage device 606, or memory on processor 602.

The high speed controller 608 manages bandwidth-intensive operations for the computing device 600, while the low speed controller 612 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 608 is coupled to memory 604, display 616 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 610, which may accept various expansion cards (not shown). In the implementation, low-speed controller 612 is coupled to storage device 606 and low-speed expansion port 614. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 620, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 624. In addition, it may be implemented in a personal computer such as a laptop computer 622. Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), such as device 650. Each of such devices may contain one or more of computing device 600, 650, and an entire system may be made up of multiple computing devices 600, 650 communicating with each other.

Computing device 650 includes a processor 652, memory 664, an input/output device such as a display 654, a communication interface 666, and a transceiver 668, among other components. The device 650 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. The components 650, 652, 664, 654, 666, and 668 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 652 can execute instructions within the computing device 650, including instructions stored in the memory 664. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 650, such as control of user interfaces, applications run by device 650, and wireless communication by device 650.

Processor 652 may communicate with a user through control interface 658 and display interface 656 coupled to a display 654. The display 654 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 656 may comprise appropriate circuitry for driving the display 654 to present graphical and other information to a user. The control interface 658 may receive commands from a user and convert them for submission to the processor 652. In addition, an external interface 662 may be provided in communication with processor 652, so as to enable near area communication of device 650 with other devices. External interface 662 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 664 stores information within the computing device 650. The memory 664 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 674 may also be provided and connected to device 650 through expansion interface 672, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 674 may provide extra storage space for device 650, or may also store applications or other information for device 650. Specifically, expansion memory 674 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 674 may be provided as a security module for device 650, and may be programmed with instructions that permit secure use of device 650. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 664, expansion memory 674, memory on processor 652, or a propagated signal that may be received, for example, over transceiver 668 or external interface 662.

Device 650 may communicate wirelessly through communication interface 666, which may include digital signal processing circuitry where necessary. Communication interface 666 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 668. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 670 may provide additional navigation- and location-related wireless data to device 650, which may be used as appropriate by applications running on device 650.

Device 650 may also communicate audibly using audio codec 660, which may receive spoken information from a user and convert it to usable digital information. Audio codec 660 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 650. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 650.

The computing device 650 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 680. It may also be implemented as part of a smartphone 682, personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any implementation or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims

1. A computer-implemented method comprising:

determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice;

adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set;

selecting weighting values that minimize error counts for the traversed combination; and

applying the selected weighting values to convert a sample of text from a first language to a second language.

2. The method of claim 1, where the translation lattice comprises a phrase lattice.

3. The method of claim 2, where arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined.

4. The method of claim 1, where the error surfaces are determined and traversed using a line optimization technique.

5. The method of claim 4, where the line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations.

6. The method of claim 5, where the line optimization technique determines and traverses the error surface starting from a random point in a parameter space.

7. The method of claim 5, where the line optimization technique determines and traverses the error surface using random directions to adjust the weights.

8. The method of claim 1, where the weights are limited by restrictions.

9. The method of claim 1, where the weights are adjusted using weights priors.

10. The method of claim 1, where the weights are adjusted over all sentences in a group of sentences.

11. The method of claim 1, further comprising selecting a target translation, from a plurality of candidate translations, that maximizes a-posteriori probability for the translation lattice.

12. The method of claim 1, where the translation lattice represents more than one billion candidate translations.

13. The method of claim 1, where the phrases comprise sentences.

14. The method of claim 1, where the phrases all comprise sentences.

15. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising:

determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice;

adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set;

selecting weighting values that minimize error counts for the traversed combination; and

applying the selected weighting values to convert a sample of text from a first language to a second language.

16. The program product of claim 15, where the translation lattice comprises a phrase lattice.

17. The program product of claim 16, where arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined.

18. The program product of claim 15, where the error surfaces are determined and traversed using a line optimization technique.

19. The program product of claim 18, where the line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations.

20. The program product of claim 19, where the line optimization technique determines and traverses the error surface starting from a random point in a parameter space.

21. The program product of claim 19, where the line optimization technique determines and traverses the error surface using random directions to adjust the weights.

22. The program product of claim 15, where the weights are limited by restrictions.

23. The program product of claim 15, where the weights are adjusted using weights priors.

24. The program product of claim 15, where the weights are adjusted over all sentences in a group of sentences.

25. The program product of claim 15, further comprising selecting a target translation, from a plurality of candidate translations, that maximizes a-posteriori probability for the translation lattice.

26. The program product of claim 15, where the translation lattice represents more than one billion candidate translations.

27. The program product of claim 15, where the phrases comprise sentences.

28. The program product of claim 15, where the phrases all comprise sentences.

29. A system, comprising:

a machine-readable storage device including a program product; and

one or more computers operable to execute the program product and perform operations comprising: determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.

30. A computer-implemented system, comprising:

a language model that includes: a collection of feature functions in a translation lattice; a plurality of error surfaces for a set of candidate language translations, across the feature functions; and weighting values for feature functions selected to minimize error for traversal of the error surfaces.