SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD

- Microsoft

A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.

Description
CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/841,542, filed Apr. 6, 2020, the entirety of which is hereby incorporated herein by reference for all purposes.

BACKGROUND

In automatic speech recognition (ASR), a text transcription of a spoken input is generated at a computing device. This text transcription is frequently generated in real time as a user is speaking. When ASR is performed in real time, there is a delay between the time at which the user speaks the input and the time at which the computing device outputs the transcription. Long delays between the input and output may make an ASR application program slow and cumbersome to use.

In addition, previous attempts to reduce the latency of ASR have frequently led to increases in the word error rate (WER), the rate at which the ASR application program incorrectly identifies words included in the input. Thus, existing ASR methods have had a tradeoff between low latency and low WER.

SUMMARY

According to one aspect of the present disclosure, a computing system is provided, including one or more processors configured to receive an audio input. The one or more processors may be further configured to generate a text transcription of the audio input at a sequence-to-sequence speech recognition model. The sequence-to-sequence speech recognition model may be configured to assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the sequence-to-sequence speech recognition model may be further configured to generate a plurality of hidden states. Based on the plurality of hidden states, the sequence-to-sequence speech recognition model may be further configured to generate a plurality of output text tokens corresponding to the plurality of frames. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may be further configured to output the text transcription including the plurality of output text tokens to an application program, user interface, or file storage location.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically shows an example computing system including one or more processors configured to execute a sequence-to-sequence speech recognition model, according to one embodiment of the present disclosure.

FIG. 2 shows an example timeline of the generation of a text transcription for an audio input, according to the embodiment of FIG. 1.

FIG. 3A schematically shows the one or more processors during training of an external alignment model and an encoder neural network, according to the embodiment of FIG. 1.

FIG. 3B schematically shows the one or more processors during training of the encoder neural network when multi-task training is used, according to the embodiment of FIG. 1.

FIG. 3C schematically shows the one or more processors during training of the encoder neural network when the encoder neural network is pre-trained with a framewise cross-entropy loss term, according to the embodiment of FIG. 1.

FIG. 4A shows the respective selection probabilities for a plurality of hidden states when the decoder model is a monotonic chunkwise attention model, according to the embodiment of FIG. 1.

FIG. 4B shows the selection probabilities of FIG. 4A in an embodiment in which the decoder neural network includes a one-dimensional convolutional layer.

FIG. 5A schematically shows the one or more processors during training of the decoder neural network, according to the embodiment of FIG. 1.

FIG. 5B schematically shows the one or more processors during concurrent training of the encoder neural network and the decoder neural network when the encoder neural network is trained in part at a first linear bottleneck layer and a second linear bottleneck layer, according to the embodiment of FIG. 1.

FIG. 6A shows a flowchart of a method that may be used at a computing system to generate a text transcription of an audio input, according to the embodiment of FIG. 1.

FIG. 6B shows additional steps of the method of FIG. 6A that may be performed when training an encoder neural network.

FIG. 6C shows additional steps of the method of FIG. 6A that may be performed when training a decoder neural network.

FIG. 7 shows a schematic view of an example computing environment in which the computer device of FIG. 1 may be enacted.

DETAILED DESCRIPTION

End-to-end ASR models are a class of ASR models in which the input and output are each represented as an ordered sequence of values. For example, the input and output of an end-to-end ASR model may each be represented as a vector. The respective elements of the input sequence and the output sequence may each encode frames that correspond to time intervals in the input sequence and output sequence respectively. An end-to-end ASR model may be a frame-synchronous model in which the length of the input sequence equals the length of the output sequence. Examples of frame-synchronous models include connectionist temporal classification (CTC), recurrent-neural-network-transducer (RNN-T), and recurrent neural aligner (RNA) models. Alternatively, the end-to-end ASR model may be a label-synchronous model in which the input sequence and output sequence have different respective lengths. Examples of label-synchronous models include attention-based sequence-to-sequence (S2S) and transformer models.

Some previously developed attention-based S2S models have lower WERs than frame-synchronous models. However, previous attempts to apply attention-based S2S models in real-time streaming scenarios have encountered difficulties due to the attention-based S2S models having high latencies.

In order to address the shortcomings of existing ASR models, a computing system 10 is provided, as schematically shown in FIG. 1 according to one example embodiment. The computing system 10 may include one or more processors 12. In some embodiments, the one or more processors 12 may each include a plurality of processor cores on which one or more processor threads may be executed. The computing system 10 may further include memory 14 that may be operatively coupled to the one or more processors 12 such that the one or more processors 12 may store data in the memory 14 and retrieve data from the memory 14. The memory 14 may include Random Access Memory (RAM) and may further include non-volatile storage. The non-volatile storage may store instructions configured to be executed by the one or more processors 12.

The computing system 10 may further include one or more input devices 16, which may be operatively coupled to the one or more processors 12. For example, the one or more input devices 16 may include one or more microphones, one or more cameras (e.g. RGB cameras, depth cameras, or stereoscopic cameras), one or more accelerometers, one or more orientation sensors (e.g. gyroscopes or magnetometers), one or more buttons, one or more touch sensors, or other types of input devices 16. The computing system 10 may further include one or more output devices 18, which may also be operatively coupled to the one or more processors 12. The one or more output devices 18 may, for example, include one or more displays, one or more speakers, one or more haptic feedback units, or other types of output devices 18. The one or more processors 12 of the computing system 10 may be configured to transmit instructions to output a user interface 74, such as a graphical user interface, on the one or more output devices 18. In addition, the one or more processors 12 may be further configured to receive user input interacting with the user interface 74 via the one or more input devices 16.

In some embodiments, the functions of the one or more processors 12 and the memory 14 may be instantiated across a plurality of operatively coupled computing devices. For example, the computing system 10 may include one or more client computing devices communicatively coupled to one or more server computing devices. Each of the operatively coupled computing devices may perform some or all of the functions of the one or more processors 12 or memory 14 discussed below. For example, a client computing device may receive one or more inputs at the one or more input devices 16 and may offload one or more steps of processing those inputs to one or more server computing devices. The server computing devices may, in this example, return one or more outputs to the client computing device to output on the one or more output devices 18. In such embodiments, the one or more processors 12 may be distributed between the client computing device and the one or more server computing devices.

The one or more processors 12 may be configured to receive an audio input 20. In embodiments in which a processor 12 and one or more microphones are included in the same physical computing device, the processor 12 may receive the audio input 20 from the one or more microphones via an application program interface (API). In other embodiments, at least one processor 12 of the one or more processors 12 may receive an audio input 20 conveyed to the processor 12 from another physical computing device (e.g. a thin client computing device). In some embodiments, the one or more processors 12 may be further configured to pre-process the audio input 20 by dividing the audio input 20 into an ordered sequence of frames 22 corresponding to time intervals within the audio input 20.
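
The frame-level pre-processing described above can be pictured as slicing the waveform into short, regularly spaced windows. The following Python/NumPy sketch is illustrative only; the function name and the 25 ms frame length and 10 ms hop are assumed values rather than parameters specified in this disclosure, and a real front end would typically also apply windowing and a filterbank transform per frame.

```python
import numpy as np

def split_into_frames(waveform: np.ndarray, sample_rate: int,
                      frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Divide a 1-D waveform into overlapping fixed-length frames (time intervals)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    return np.stack([waveform[i * hop_len: i * hop_len + frame_len]
                     for i in range(num_frames)])

# Example: one second of audio at 16 kHz yields 98 frames of 400 samples each.
frames = split_into_frames(np.zeros(16000), sample_rate=16000)
print(frames.shape)  # (98, 400)
```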

The one or more processors 12 may be further configured to generate a text transcription 70 of the audio input 20 at a sequence-to-sequence speech recognition model 30, as described in further detail below. The text transcription 70 may include a plurality of output text tokens 62, which may indicate words, portions of words, punctuation marks, speaker identifiers, utterance delimiters, and/or other text indicating one or more features of the audio input 20. In some embodiments, the audio input 20 may be a streaming audio input received by the one or more processors 12 over an input time interval. In such embodiments, the one or more processors 12 may be further configured to output the text transcription 70 during the input time interval concurrently with receiving the audio input 20. Thus, the one or more processors 12 may be configured to transcribe the audio input 20 in real time as the audio input 20 is received. After the text transcription 70 has been generated, the one or more processors 12 may be further configured to output the text transcription 70 including the plurality of output text tokens 62 to an application program 72, a user interface 74, or a file storage location 76.

The S2S speech recognition model 30 may include an external alignment model 40, an encoder neural network 50, and a decoder neural network 60. Each of these sub-models of the S2S speech recognition model 30 is described in further detail below.

At the external alignment model 40, the one or more processors 12 may be further configured to assign a respective plurality of external-model text tokens 42 to a plurality of frames 22 included in the audio input 20. The frames 22 to which the external-model text tokens 42 are assigned may be the frames 22 into which the audio input 20 was segmented during pre-processing. The external alignment model 40 may be an acoustic feature detection model that is configured to assign the external-model text tokens 42 to indicate senone-level features in the audio input 20. For example, boundaries between words included in the audio input 20 may be estimated at the external alignment model 40. The external alignment model 40 may be a recurrent neural network (RNN). In some embodiments, the external alignment model 40 may be a CTC model.

Each external-model text token 42 identified at the external alignment model 40 may have an external-model alignment 44 within the audio input 20. The external-model alignment 44 of an external-model text token 42 may be an indication of a frame 22 with which the external-model text token 42 is associated. Thus, the external-model alignment 44 may be an estimate of a ground-truth alignment of acoustic features in a user's utterance.

At the encoder neural network 50, based on the audio input 20, the one or more processors 12 may be further configured to generate a plurality of hidden states 52. The hidden states 52 may be word-level or sub-word-level latent representations of features included in the audio input 20. In some embodiments, the plurality of hidden states 52 may be represented as a vector of encoder outputs h=(h1, . . . , hT). The encoder neural network 50 may be an RNN, such as a long short-term memory (LSTM) network, a gated recurrent unit (GRU), or some other type of RNN.

At the decoder neural network 60, the one or more processors 12 may be further configured to generate a plurality of output text tokens 62 based on the plurality of hidden states 52, as discussed in further detail below. The plurality of output text tokens 62 may be represented as a vector y=(y1, . . . , yL), where L is the total number of output text tokens 62. The plurality of output text tokens 62 may be included in the text transcription 70 that is output by the S2S speech recognition model 30. Each output text token 62 generated at the decoder neural network 60 may be associated with a frame 22 of the audio input 20 and may have a corresponding output alignment 64 within the audio input 20 that indicates the frame 22 with which the output text token 62 is associated.

For each output text token 62, a latency 66 between the output alignment 64 and the external-model alignment 44 may be below a predetermined latency threshold 68. Example values of the predetermined latency threshold 68 are 4 frames, 8 frames, 12 frames, 16 frames, 24 frames, and 32 frames. Alternatively, the predetermined latency threshold 68 may be some other number of frames.

FIG. 2 shows an example timeline 90 of the generation of a text transcription 70 for the audio input 20 “add an event for dinner tomorrow at seven thirty p.m.” In the example of FIG. 2, a respective output text token 62 is generated for each word of the audio input 20, as well as for the delimiter <EOS> that marks the end of the utterance. The timeline 90 of FIG. 2 further shows the output alignment 64 for each output text token 62. For one of the output text tokens 62, the timeline 90 also shows the external-model alignment 44 for that output text token 62 and the latency 66 between the output alignment 64 and the external-model alignment 44.

To evaluate the latency 66 between the output alignment 64 and the external-model alignment 44 for a plurality of audio inputs 20, the one or more processors 12 may be configured to compute a corpus-level latency Δcorpus or an utterance-level latency Δutterance. The corpus-level latency Δcorpus may be computed as the difference (e.g. in number of frames 22) between the respective boundaries b̂ik of each of a plurality of output text tokens 62 and the corresponding boundaries bik of the external-model text tokens 42 computed at the external alignment model 40. An example equation for the corpus-level latency Δcorpus is provided below:

\Delta_{\mathrm{corpus}} = \frac{1}{\sum_{k=1}^{N} \lvert y^{k} \rvert} \sum_{k=1}^{N} \sum_{i=1}^{\lvert y^{k} \rvert} \left( \hat{b}_{i}^{k} - b_{i}^{k} \right)

In this equation, N is the number of audio inputs 20, yk is the sequence of output text tokens 62 generated for the kth audio input 20, and |yk| is the number of output text tokens 62 in that sequence. The utterance-level latency Δutterance may be computed as an average of the mean latency for each audio input 20. An example equation for the utterance-level latency Δutterance is as follows:

\Delta_{\mathrm{utterance}} = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{\lvert y^{k} \rvert} \sum_{i=1}^{\lvert y^{k} \rvert} \left( \hat{b}_{i}^{k} - b_{i}^{k} \right)
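
As a rough illustration of the two latency metrics above, the following Python/NumPy sketch computes the corpus-level and utterance-level latencies from lists of token boundaries expressed as frame indices. The function names and the toy boundary values are assumptions introduced for this example.

```python
import numpy as np

def corpus_level_latency(pred_boundaries, ref_boundaries):
    """Mean boundary difference (in frames) pooled over every token of every utterance."""
    diffs = [p - r
             for pred, ref in zip(pred_boundaries, ref_boundaries)
             for p, r in zip(pred, ref)]
    return float(np.mean(diffs))

def utterance_level_latency(pred_boundaries, ref_boundaries):
    """Average over utterances of each utterance's mean boundary difference."""
    per_utt = [float(np.mean(np.asarray(pred) - np.asarray(ref)))
               for pred, ref in zip(pred_boundaries, ref_boundaries)]
    return float(np.mean(per_utt))

# Two toy utterances: predicted token boundaries vs. external-model boundaries (frame indices).
pred = [[12, 30, 55], [8, 20]]
ref = [[10, 28, 50], [8, 18]]
print(corpus_level_latency(pred, ref))     # 2.2 frames
print(utterance_level_latency(pred, ref))  # (3.0 + 1.0) / 2 = 2.0 frames
```

The corpus-level metric pools all token-level differences before averaging, whereas the utterance-level metric averages each utterance first, so the two values differ whenever utterances contain different numbers of tokens.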

Turning now to FIG. 3A, the one or more processors 12 are shown according to one example embodiment when training the external alignment model 140 and the encoder neural network 150. In the example of FIG. 3A, the external alignment model 140 and the encoder neural network 150 are trained using a plurality of training audio inputs 120, each including a plurality of training frames 122. For each training audio input 120, the one or more processors 12 may be configured to generate, at the external alignment model 140, a plurality of training external-model text tokens 142 with a respective plurality of training external-model alignments 144. The external alignment model 140 may be trained using a senone-level framewise cross-entropy loss function 146. In some embodiments, the same training audio inputs 120 may be used to train both the external alignment model 140 and the encoder neural network 150.

In the example of FIG. 3A, when the encoder neural network 150 is trained, the one or more processors 12 may be further configured to generate a plurality of training hidden states 152 for each of the training audio inputs 120. The encoder neural network 150 may be trained at least in part with an encoder loss function 158 including a sequence-to-sequence loss term 158A and a framewise cross-entropy loss term 158B. In one example embodiment, the following encoder loss function 158 may be used:


L_{\mathrm{total}} = (1 - \lambda_{\mathrm{CE}}) L_{\mathrm{S2S}}(y \mid x) + \lambda_{\mathrm{CE}} L_{\mathrm{CE}}(A \mid x)

In the above equation, λCE is a tunable hyperparameter which may have a value between 0 and 1. x may be the input sequence of the encoder neural network 150 represented as a vector x=(x1, . . . , xT). y may be a plurality of ground-truth output text tokens represented as a vector y=(y1, . . . , yL), where L is the total number of training output text tokens associated with a training audio input 120, as discussed in further detail below. In addition, A=(a1, . . . , aT) may be a plurality of word-level alignments received from the external alignment model 140, where each aj is a K-dimensional one-hot vector. In this example, K is the vocabulary size of the external alignment model 140. The framewise cross-entropy loss term 158B may be given by the following equation:

L_{\mathrm{CE}}(A \mid x) = - \sum_{j=1}^{T} a_{j} \log q_{j}^{\mathrm{CE}}

In this equation, T is the length of the input sequence x, and qjCE is the jth posterior probability distribution for the framewise cross-entropy loss term 158B.
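
The encoder loss function 158 above can be viewed as an interpolation between the two loss terms. The sketch below assumes the sequence-to-sequence loss has already been computed elsewhere and uses toy one-hot alignments and uniform posteriors; the function names and values are illustrative assumptions, not part of this disclosure.

```python
import numpy as np

def framewise_ce_loss(alignments_onehot: np.ndarray, posteriors: np.ndarray) -> float:
    """L_CE(A|x) = -sum_j a_j . log(q_j), with a_j one-hot and q_j a probability distribution."""
    eps = 1e-12
    return float(-np.sum(alignments_onehot * np.log(posteriors + eps)))

def encoder_loss(s2s_loss: float, ce_loss: float, lambda_ce: float) -> float:
    """L_total = (1 - lambda_CE) * L_S2S + lambda_CE * L_CE.

    lambda_CE = 1 corresponds to framewise cross-entropy pre-training,
    lambda_CE = 0 to pure sequence-to-sequence training, and
    0 < lambda_CE < 1 to multi-task learning.
    """
    return (1.0 - lambda_ce) * s2s_loss + lambda_ce * ce_loss

# Toy example: T = 3 frames, vocabulary size K = 4, uniform posteriors.
A = np.eye(4)[[0, 2, 2]]          # one-hot alignments a_1..a_T from the external alignment model
Q = np.full((3, 4), 0.25)         # framewise posteriors q_j^CE
total = encoder_loss(s2s_loss=1.5, ce_loss=framewise_ce_loss(A, Q), lambda_ce=0.3)
```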

The above equation for the encoder loss function 158 may be used in embodiments in which the encoder neural network 150 is trained concurrently with the decoder neural network 160, as discussed in further detail below with reference to FIG. 3B. In other embodiments, the encoder neural network 150 may be pre-trained with some other loss function that does not depend upon outputs of the decoder neural network 160, and the decoder neural network 160 may be subsequently trained with the training hidden states 152 output by the pre-trained encoder neural network 150.

In some embodiments, the encoder neural network 150 may be trained with the sequence-to-sequence loss term 158A and the framewise cross-entropy loss term 158B concurrently via multi-task learning. In such embodiments, 0<λCE<1 in the above equation for the encoder loss function 158. When the encoder neural network 150 is trained via multi-task learning, the encoder neural network 150 may be trained concurrently with the decoder neural network 160, as shown in FIG. 3B. In the example of FIG. 3B, the plurality of training hidden states 152 generated at the encoder neural network 150 are output to both the decoder neural network 160 and a framewise cross-entropy layer 170. The one or more processors 12 may be configured to compute the sequence-to-sequence loss term 158A using the training hidden states 152 output by the encoder neural network 150 and compute the framewise cross-entropy loss term 158B from the outputs of the framewise cross-entropy layer 170.

In other embodiments, the encoder neural network 150 may be pre-trained with the framewise cross-entropy loss term 158B prior to training with the sequence-to-sequence loss term 158A. In such embodiments, as shown in FIG. 3C, the encoder neural network 150 may be trained with the framewise cross-entropy loss term 158B during a first training phase 102 and trained with the sequence-to-sequence loss term 158A during a second training phase 104. When the example encoder loss function 158 shown above is used, the tunable hyperparameter λCE may be set to 1 during the first training phase 102 and set to 0 during the second training phase 104.

Returning to FIG. 1, the decoder neural network 60 may be configured to receive the plurality of hidden states 52 from the encoder neural network 50. Similarly to the encoder neural network 50, the decoder neural network 60 may be an RNN, such as an LSTM or a GRU. In some embodiments, the decoder neural network 60 may be a monotonic chunkwise attention model. When the decoder neural network 60 is a monotonic chunkwise attention model, the one or more processors 12 may be further configured to stochastically determine a binary attention state 56 for each hidden state 52 that indicates whether an output text token 62 corresponding to that hidden state 52 is generated.

FIG. 4A shows a grid 200 of selection probabilities pi,j for pairs of decoder outputs yi and hidden states hj. The selection probabilities pi,j may be computed using the following equations:

e_{i,j}^{\mathrm{mono}} = g \, \frac{v^{\top}}{\lVert v \rVert} \, \mathrm{ReLU}(W_{h} h_{j} + W_{s} s_{i} + b) + r, \qquad p_{i,j} = \sigma\left( e_{i,j}^{\mathrm{mono}} \right)

In these equations, ei,jmono is a monotonic energy activation, hj is the jth hidden state 52 output by the encoder neural network 50, si is the ith state of the decoder neural network 60, σ is a logistic sigmoid function, ReLU is the rectified linear unit function, and g, v, Wh, Ws, b, and r are learnable parameters of the decoder neural network 60.
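
A minimal sketch of the selection-probability computation above is shown below, using NumPy arrays with small random parameter values; the function name, the dimensions, and the values chosen for g and r are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def selection_probability(h_j, s_i, W_h, W_s, b, v, g, r):
    """p_{i,j} = sigmoid( g * (v / ||v||) . ReLU(W_h h_j + W_s s_i + b) + r )."""
    activation = np.maximum(0.0, W_h @ h_j + W_s @ s_i + b)      # ReLU
    energy = g * (v / np.linalg.norm(v)) @ activation + r        # monotonic energy e_{i,j}^mono
    return sigmoid(energy)

# Toy dimensions: hidden and decoder states of size 4, attention space of size 8.
rng = np.random.default_rng(0)
p_ij = selection_probability(h_j=rng.standard_normal(4), s_i=rng.standard_normal(4),
                             W_h=rng.standard_normal((8, 4)), W_s=rng.standard_normal((8, 4)),
                             b=np.zeros(8), v=rng.standard_normal(8), g=1.0, r=-1.0)
```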

The grid 200 shown in FIG. 4A includes a plurality of chunks 202 that each include the respective selection probabilities pi,j for a plurality of consecutive hidden states hj and a decoder output yi. Each chunk 202 of the plurality of chunks 202 may include a number of selection probabilities pi,j equal to a predetermined chunk size w. In the example of FIG. 4A, the predetermined chunk size w is 4. In other embodiments, some other predetermined chunk size w such as 3 or 5 may be used. Chunks 202 including the first or last element of the vector of hidden states h may be smaller than the predetermined chunk size w used for other chunks 202.

For each chunk 202, the one or more processors 12 may be configured to sample a Bernoulli random variable zi,j from a probability distribution of the selection probabilities pi,j included in that chunk 202. In the example grid 200 of FIG. 4A, darker colors correspond to higher selection probabilities pi,j. When the Bernoulli random variable zi,j has a value of 1 for a selection probability pi,j, the one or more processors 12 may be further configured to “attend” to the hidden state hj associated with that selection probability pi,j by outputting an association between the hidden state hj and the decoder output yi. When the Bernoulli random variable zi,j has a value of 0 for a selection probability pi,j, the one or more processors 12 may instead select some other hidden state hj to associate with the value of the decoder output yi.

The one or more processors 12 may be further configured to determine a respective output alignment 64 for each selection probability pi,j included in each chunk 202. The output alignment αi,j corresponding to a selection probability pi,j is given by the following equation:

\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \left( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \right) = p_{i,j} \left( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right)

The plurality of output alignments αi,j may indicate locations in the audio input 20 of expected boundaries between the output text tokens 62. Thus, the monotonic energy activations ei,jmono may be used to determine the selection probabilities pi,j, as discussed above, which may be used to determine the output alignments αi,j.
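
The recursive form of the output alignments above can be sketched one output step at a time. The function below assumes the alignments are computed in expectation (as during training) rather than from sampled binary attention states, and the starting alignment and toy probabilities are assumed values for illustration.

```python
import numpy as np

def expected_alignments(p_i: np.ndarray, alpha_prev: np.ndarray) -> np.ndarray:
    """One row of alpha: alpha_{i,j} = p_{i,j} * ((1 - p_{i,j-1}) * alpha_{i,j-1} / p_{i,j-1} + alpha_{i-1,j}).

    p_i:        selection probabilities p_{i,1..T} for output step i
    alpha_prev: alignment row alpha_{i-1,1..T} from the previous output step
    """
    T = len(p_i)
    alpha = np.zeros(T)
    eps = 1e-12
    for j in range(T):
        carry = 0.0 if j == 0 else (1.0 - p_i[j - 1]) * alpha[j - 1] / (p_i[j - 1] + eps)
        alpha[j] = p_i[j] * (carry + alpha_prev[j])
    return alpha

# First output step: the "previous" alignment is concentrated on the first frame.
p_1 = np.array([0.1, 0.2, 0.7, 0.9, 0.3, 0.1])
alpha_1 = expected_alignments(p_1, alpha_prev=np.eye(6)[0])
```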

The one or more processors 12 may be further configured to determine a chunkwise energy activation ei,jchunk for each chunk 202. For example, the one or more processors 12 may use the following example equation for ei,jchunk:


e_{i,j}^{\mathrm{chunk}} = V \, \mathrm{ReLU}(W s^{\mathrm{chunk}} + U h^{\mathrm{chunk}})

In the above equation, ei,jchunk is a scalar array with a size equal to the chunk size w. It will be appreciated that hchunk is a sequence of the respective hidden states 52 for the selection probabilities pi,j included in the chunk 202, and schunk is a sequence of respective decoder states for those selection probabilities pi,j. Further, U, V, and W are affine change-of-dimension layers and may be trained when training the decoder neural network 160.

The one or more processors 12 may be further configured to normalize the chunkwise energy activation ei,jchunk using the following equation for an induced probability distribution {βi,j}:

\beta_{i,j} = \sum_{k=j}^{j+w-1} \left( \alpha_{i,k} \exp\left( e_{i,j}^{\mathrm{chunk}} \right) \Big/ \sum_{l=k-w+1}^{k} \exp\left( e_{i,l}^{\mathrm{chunk}} \right) \right)

In this equation, w is the predetermined chunk size discussed above. The induced probability distribution {βi,j} may be a probability distribution of output text tokens 62 that may be output by the decoder neural network 60.

The one or more processors 12 may be further configured to determine a plurality of weighted encoder memory values ci using the induced probability distribution {βi,j}, as shown in the following equation:

c_{i} = \sum_{j=1}^{T} \beta_{i,j} h_{j}

Thus, rather than merely setting the weighted encoder memory values ci to be equal to the corresponding hidden states hj, the one or more processors 12 may be configured to compute a respective softmax of the selection probabilities pi,j included in each chunk 202. The weighted encoder memory values ci may be included in a context vector which the decoder neural network 60 may use as an input.
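
The chunkwise weighting described above can be sketched as follows, assuming the chunkwise energies are available as a single vector over all frames and that chunks at the sequence edges are simply clipped; the function name, the toy values, and these simplifications are assumptions made for the example.

```python
import numpy as np

def chunkwise_context(alpha_i: np.ndarray, e_chunk: np.ndarray,
                      H: np.ndarray, w: int) -> np.ndarray:
    """Induced distribution beta_{i,j} and context vector c_i = sum_j beta_{i,j} h_j.

    alpha_i: expected alignments alpha_{i,1..T} for output step i
    e_chunk: chunkwise energies, here given as one value per frame
    H:       encoder hidden states, shape (T, d)
    w:       predetermined chunk size
    """
    T = len(alpha_i)
    exp_e = np.exp(e_chunk - np.max(e_chunk))            # numerically stable softmax terms
    beta = np.zeros(T)
    for j in range(T):
        for k in range(j, min(j + w, T)):                # k = j .. j + w - 1, clipped at the edge
            denom = exp_e[max(0, k - w + 1): k + 1].sum()
            beta[j] += alpha_i[k] * exp_e[j] / denom
    return beta @ H                                       # context vector c_i, shape (d,)

# Toy example: T = 6 frames, hidden size d = 4, chunk size w = 3.
rng = np.random.default_rng(1)
c_i = chunkwise_context(alpha_i=np.array([0.0, 0.1, 0.6, 0.3, 0.0, 0.0]),
                        e_chunk=rng.standard_normal(6),
                        H=rng.standard_normal((6, 4)), w=3)
```

Because each softmax denominator spans the w energies ending at frame k, the induced distribution βi carries the same total mass as αi, so the context vector ci remains a weighted combination of hidden states rather than a single selected hidden state.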

In some embodiments, as shown in FIG. 1, the plurality of hidden states 52 generated by the encoder neural network 50 may be passed through a one-dimensional convolutional layer 54 prior to generating the binary attention states 56. The one-dimensional convolutional layer 54 may have a convolution kernel Wc with a kernel size k (e.g. 3, 4, or 5) and a channel size d. The channel size d may be equal to the dimension of the hidden states hj. The one or more processors 12 may be further configured to transform the hidden states hj into an attention space using the following transformation:


h'_{i,j} = W_{h}(W_{c} * h_{j})

In this equation, h′i,j is a transformed hidden state.
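
As a rough illustration of this transformation, the sketch below applies a depthwise one-dimensional convolution over the frame axis followed by a projection into the attention space. The depthwise kernel shape (k, d), the function name, and the toy dimensions are assumptions; the disclosure does not spell out the exact layer shapes.

```python
import numpy as np

def conv_attention_transform(H: np.ndarray, W_c: np.ndarray, W_h: np.ndarray) -> np.ndarray:
    """h'_j = W_h (W_c * h)_j: a depthwise 1-D convolution over frames, then a projection.

    H:   encoder hidden states, shape (T, d)
    W_c: depthwise convolution kernel, shape (k, d) -- the kernel shape is an assumption
    W_h: projection into the attention space, shape (d_att, d)
    """
    k, d = W_c.shape
    pad = k // 2
    H_pad = np.pad(H, ((pad, pad), (0, 0)))           # "same" padding along the time axis
    conv = np.stack([
        sum(W_c[m] * H_pad[j + m] for m in range(k))  # per-channel weighted sum of k neighbors
        for j in range(H.shape[0])
    ])
    return conv @ W_h.T                                # transformed states, shape (T, d_att)

# Toy example: T = 10 frames, channel size d = 4, kernel size k = 3, attention size 8.
rng = np.random.default_rng(2)
H_att = conv_attention_transform(rng.standard_normal((10, 4)),
                                 W_c=rng.standard_normal((3, 4)),
                                 W_h=rng.standard_normal((8, 4)))
```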

FIG. 4B shows another example grid 210 of selection probabilities pi,j in an embodiment where a one-dimensional convolutional layer 54 is included in the decoder neural network 60. In the embodiment of FIG. 4B, the one or more processors 12 are further configured to “look” back one value of j and “look” ahead one value of j from each of the selection probabilities pi,j for which zi,j=1. Thus, the boundary prediction made by the decoder neural network 60 may be made more accurate by incorporating information from one or more frames 22 before or after a selected frame 22.

FIG. 5A shows the one or more processors 12 when training the decoder neural network 160, according to one example embodiment. In the example of FIG. 5A, the decoder neural network 160 may use the training hidden states 152 of the encoder neural network 150 as training data. In some embodiments, the decoder neural network 160 may be configured to receive the training hidden states 152 as a context vector {ci} of weighted encoder memory values.

When the decoder neural network 160 is trained, the decoder neural network 160 may be configured to generate a plurality of training binary attention states 156 corresponding to the plurality of training hidden states 152. In some embodiments, as shown in the example of FIG. 5A, the decoder neural network 160 may further include a one-dimensional convolutional layer 154. The plurality of training hidden states 152 may be input into the one-dimensional convolutional layer 154 prior to generating the training binary attention states 156. The one-dimensional convolutional layer 154 may be trained concurrently with other layers of the decoder neural network 160.

From the plurality of training binary attention states 156, the decoder neural network 160 may be further configured to generate a respective plurality of training output text tokens 162 having a respective plurality of training output alignments 164. The decoder neural network 160 may be configured to generate the plurality of training output text tokens 162 such that each training output text token 162 has a training latency 166 below the predetermined latency threshold 68. In one example embodiment, the following constraint may be applied to the training output alignments αi,j:

\alpha_{i,j} = \begin{cases} p_{i,j} \left( (1 - p_{i,j-1}) \dfrac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right), & j \le b_{i} + \delta \\ 0, & \text{otherwise} \end{cases}

In the above equation, bi is the ith external-model alignment 44 and δ is the predetermined latency threshold 68. Thus, the training latency 166 may be kept below the predetermined latency threshold 68 during training of the decoder neural network 160 as well as at runtime.
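
A sketch of this constrained recursion is shown below, mirroring the expected-alignment recursion from the earlier sketch and zeroing any alignment mass beyond bi + δ. Indexing is zero-based here for convenience, and the boundary and threshold values are toy assumptions.

```python
import numpy as np

def constrained_alignments(p_i: np.ndarray, alpha_prev: np.ndarray,
                           b_i: int, delta: int) -> np.ndarray:
    """Expected alignments with alpha_{i,j} forced to 0 for j > b_i + delta.

    b_i:   external-model boundary (frame index) of the i-th token
    delta: predetermined latency threshold, in frames
    """
    T = len(p_i)
    alpha = np.zeros(T)
    eps = 1e-12
    for j in range(T):
        if j > b_i + delta:           # no alignment mass beyond the latency budget
            break
        carry = 0.0 if j == 0 else (1.0 - p_i[j - 1]) * alpha[j - 1] / (p_i[j - 1] + eps)
        alpha[j] = p_i[j] * (carry + alpha_prev[j])
    return alpha

# Token boundary at frame 3 with a 2-frame budget: frames 6 and later receive zero mass.
alpha = constrained_alignments(p_i=np.full(10, 0.4), alpha_prev=np.eye(10)[2],
                               b_i=3, delta=2)
```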

The decoder neural network 160 may be trained using a decoder loss function 168 including a sequence-to-sequence loss term 168A. In some embodiments, the decoder loss function 168 may be a delay constrained training loss function including the sequence-to-sequence loss term 168A and an attention weight regularization term 168B. For example, the decoder loss function 168 may be computed using the following equation:

L_{\mathrm{total}} = L_{\mathrm{S2S}} + \lambda_{\mathrm{QUA}} \left| L - \sum_{i=1}^{L} \sum_{j=1}^{T} \alpha_{i,j} \right|

In the above equation, Ltotal is the decoder loss function 168, LS2S is the sequence-to-sequence loss term 168A, λQUA is a tunable hyperparameter, and L is the total number of training output text tokens 162. By including the attention weight regularization term 168B in the decoder loss function 168, exponential decay of {αi,j} may be avoided, and the number of nonzero values of αi,j may be matched to L.
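
The delay constrained training loss can be sketched directly from the equation above. In the example below, the sequence-to-sequence loss is taken as a precomputed scalar and the alignment matrix is a toy value; both are assumptions made for illustration.

```python
import numpy as np

def delay_constrained_loss(s2s_loss: float, alpha: np.ndarray, lambda_qua: float) -> float:
    """L_total = L_S2S + lambda_QUA * | L - sum_{i,j} alpha_{i,j} |.

    alpha has shape (L, T): one alignment row per output token. The regularizer
    pushes the total alignment mass toward L, i.e. one unit of mass per token.
    """
    L = alpha.shape[0]
    return s2s_loss + lambda_qua * abs(L - float(alpha.sum()))

# Toy example: three tokens whose alignment rows each sum to slightly less than one.
alpha = np.array([[0.90, 0.05, 0.00, 0.00],
                  [0.00, 0.80, 0.10, 0.00],
                  [0.00, 0.00, 0.70, 0.20]])
loss = delay_constrained_loss(s2s_loss=2.0, alpha=alpha, lambda_qua=1.0)  # 2.0 + |3 - 2.75|
```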

As an alternative to the delay constrained training loss function, the decoder loss function 168 may be a minimum latency training loss function including the sequence-to-sequence loss term 168A and a minimum latency loss term 168C. The minimum latency training loss function may be given by the following equation:

L_{\mathrm{total}} = L_{\mathrm{S2S}} + \lambda_{\mathrm{MinLT}} \, \frac{1}{L} \sum_{i=1}^{L} \left| \sum_{j=1}^{T} j \, \alpha_{i,j} - b_{i} \right|

In the above equation, λMinLT is a tunable hyperparameter. In addition, the sum of the values jαi,j over j represents the expected boundary location of the ith training output text token 162. Minimum latency training may account for differences in the training latencies 166 of different training output text tokens 162 when computing the value of the decoder loss function 168.
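
Similarly, the minimum latency training loss can be sketched as follows: the expected boundary of each token is the α-weighted average frame index, compared against the external-model boundary bi. The toy alignment matrix and boundary values are assumptions made for this example.

```python
import numpy as np

def minimum_latency_loss(s2s_loss: float, alpha: np.ndarray,
                         boundaries: np.ndarray, lambda_minlt: float) -> float:
    """L_total = L_S2S + lambda_MinLT * (1/L) * sum_i | sum_j j*alpha_{i,j} - b_i |.

    sum_j j*alpha_{i,j} is the expected boundary frame of the i-th output token;
    boundaries holds the external-model boundaries b_i (1-based frame indices here).
    """
    L, T = alpha.shape
    frame_index = np.arange(1, T + 1)               # j = 1 .. T
    expected_boundary = alpha @ frame_index         # expected boundary per token
    return s2s_loss + lambda_minlt * float(np.mean(np.abs(expected_boundary - boundaries)))

# Toy example: two tokens whose expected boundaries fall near frames 2 and 4.
alpha = np.array([[0.1, 0.8, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 0.2, 0.7, 0.1]])
loss = minimum_latency_loss(s2s_loss=2.0, alpha=alpha,
                            boundaries=np.array([2, 4]), lambda_minlt=1.0)
```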

In some embodiments, as shown in FIG. 5B, the encoder neural network 150 and the decoder neural network 160 may be trained concurrently. In such embodiments, the encoder neural network 150 may be trained at least in part at a first linear bottleneck layer 180 and a second linear bottleneck layer 182. The first linear bottleneck layer 180 and the second linear bottleneck layer 182 may each be configured to receive the plurality of training hidden states 152 from the encoder neural network 150. In some embodiments, the first linear bottleneck layer 180 and the second linear bottleneck layer 182 may be configured to receive the plurality of training hidden states 152 as a context vector {ci} of weighted encoder memory values. The one or more processors 12 may be further configured to concatenate the outputs of the first linear bottleneck layer 180 and the second linear bottleneck layer 182 to form a concatenated bottleneck layer 184. The outputs of the concatenated bottleneck layer 184 may be used as training inputs at the decoder neural network 160. In addition, the outputs of the second linear bottleneck layer 182 may be received at a framewise cross-entropy layer 170. The one or more processors 12 may be further configured to compute a framewise cross-entropy loss term 158B based on the outputs of the framewise cross-entropy layer 170.
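
A minimal sketch of the two-branch bottleneck arrangement described above is given below, assuming plain linear projections without bias terms; the layer sizes and function name are illustrative assumptions rather than details specified in this disclosure.

```python
import numpy as np

def bottleneck_branches(H: np.ndarray, W1: np.ndarray, W2: np.ndarray):
    """Two linear bottleneck projections of the encoder states, concatenated for the decoder.

    H:  training hidden states (or weighted encoder memory values), shape (T, d)
    W1: first linear bottleneck,  shape (d_b, d)
    W2: second linear bottleneck, shape (d_b, d) -- its output also feeds the CE layer
    """
    z1 = H @ W1.T                                       # first bottleneck branch
    z2 = H @ W2.T                                       # second bottleneck branch
    decoder_input = np.concatenate([z1, z2], axis=-1)   # concatenated bottleneck layer
    return decoder_input, z2                            # z2 also goes to framewise cross-entropy

# Toy example: 20 frames, hidden size 8, bottleneck size 4 -> decoder input of width 8.
rng = np.random.default_rng(3)
decoder_input, ce_branch = bottleneck_branches(rng.standard_normal((20, 8)),
                                               W1=rng.standard_normal((4, 8)),
                                               W2=rng.standard_normal((4, 8)))
```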

FIG. 6A shows a flowchart of a method 300 for use with a computing system, according to one example embodiment. The method 300 may be performed at the computing system 10 of FIG. 1 or at some other computing system. At step 302, the method 300 may include receiving an audio input. The audio input may be received at one or more processors via one or more microphones included in the computing system. The one or more processors and the one or more microphones may be provided in the same physical computing device or in separate physical computing devices that are communicatively coupled. In some embodiments, the audio input may be pre-processed, such as by dividing the audio input into a plurality of frames associated with respective time intervals.

At step 304, the method 300 may further include generating a text transcription of the audio input at a sequence-to-sequence speech recognition model. The sequence-to-sequence speech recognition model may include an external alignment model configured to generate the plurality of external-model text tokens, an encoder neural network configured to generate the plurality of hidden states, and a decoder neural network configured to generate the plurality of output text tokens. Each of the external alignment model, the encoder neural network, and the decoder neural network may be an RNN, such as an LSTM, a GRU, or some other type of RNN.

At step 306, step 304 may include assigning a respective plurality of external-model text tokens to a plurality of frames included in the audio input. These external-model text tokens may be assigned by the external alignment model. Each external-model text token assigned to a frame may have an external-model alignment within the audio input that indicates the frame to which the external-model text token is assigned. The external alignment model may be an acoustic model configured to identify senone-level features in the audio input and assign the external-model text tokens to the senone-level features.

At step 308, step 304 may further include generating a plurality of hidden states based on the audio input. The hidden states may be generated at the encoder neural network and may be word-level or sub-word-level latent representations of features included in the audio input.

At step 310, step 304 may further include generating a plurality of output text tokens corresponding to the plurality of frames at a decoder neural network. The plurality of output text tokens may be generated at the decoder neural network based on the plurality of hidden states. Each output text token may have a corresponding output alignment within the audio input that indicates a frame with which the output text token is associated. In addition, the decoder neural network may be configured to generate the plurality of output text tokens such that for each output text token, a latency between the output alignment and the external-model alignment is below a predetermined latency threshold. This latency constraint may be enforced, for example, by generating a plurality of output alignments and discarding any output alignment with a latency higher than the predetermined latency threshold relative to the external-model alignment.

At step 312, the method 300 may further include outputting the text transcription including the plurality of output text tokens to an application program, a user interface, or a file storage location. In some embodiments, the audio input may be a streaming audio input received over an input time interval. In such embodiments, the text transcription may be output during the input time interval concurrently with receiving the audio input. Thus, the text transcription may be generated and output in real time as the audio input is in the process of being received.

FIG. 6B shows additional steps of the method 300 that may be performed in some embodiments to train the encoder neural network. The steps shown in FIG. 6B may be performed prior to receiving the audio input at step 302 of FIG. 6A. At step 314, the method 300 may further include training the encoder neural network at least in part with an encoder loss function including a sequence-to-sequence loss term and a framewise cross-entropy loss term. In some embodiments, step 314 may further include, at step 316, pre-training the encoder neural network with the framewise cross-entropy loss term during a first training phase. In embodiments in which the encoder neural network is pre-trained with the framewise cross-entropy loss term, the framewise cross-entropy loss term may be computed from the outputs of a framewise cross-entropy layer configured to receive the plurality of training hidden states output by the encoder neural network. After the encoder neural network is pre-trained, step 314 may further include, at step 318, training the encoder neural network with the sequence-to-sequence loss term during a second training phase.

Alternatively, step 314 may include, at step 320, training the encoder neural network with the sequence-to-sequence loss term and the framewise cross-entropy loss term concurrently via multi-task learning. When the encoder neural network is trained via multi-task learning, the encoder neural network and the decoder neural network may be trained concurrently. Training the encoder neural network via multi-task learning may, in some embodiments, include training the encoder neural network at least in part at a first linear bottleneck layer and a second linear bottleneck layer, as shown in step 322. When a first linear bottleneck layer and a second linear bottleneck layer are used to train the encoder neural network, the outputs of the first linear bottleneck layer and the second linear bottleneck layer may be concatenated to form a concatenated bottleneck layer. The outputs of the concatenated bottleneck layer may be used as inputs to the decoder neural network. In addition, the outputs of the second linear bottleneck layer may be received at a framewise cross-entropy layer. A framewise cross-entropy loss term may be computed from the outputs of the framewise cross-entropy layer.

FIG. 6C shows additional steps of the method 300 that may be performed in some embodiments to train the decoder neural network. At step 324, the method 300 may further include training the decoder neural network at least in part with a delay constrained training loss function including a sequence-to-sequence loss term and an attention weight regularization term. Alternatively, at step 326, the method 300 may further include training the decoder neural network at least in part with a minimum latency training loss function including a sequence-to-sequence loss term and a minimum latency term. In some embodiments, the decoder neural network may be trained concurrently with the encoder neural network.

Using the systems and methods discussed above, the latency between inputs and outputs during ASR may be reduced in comparison to conventional ASR techniques such as CTC, RNN-T, and RNA. This reduction in latency may improve the experience of using ASR by reducing the amount of time the user has to wait while entering speech inputs. By reducing the amount of time for which the user of an ASR system has to wait for speech inputs to be processed into text, the systems and methods discussed above may allow the user to obtain text transcriptions of speech inputs more quickly and with fewer interruptions. The systems and methods discussed above may also have higher processing efficiency compared to existing S2S ASR methods. As a result of this increase in processing efficiency, network latency may also be reduced when the S2S speech recognition model is instantiated at least in part at one or more server computing devices that communicate with a client device. In addition, the systems and methods described above may result in a reduced word error rate in comparison to existing ASR techniques.

In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.

FIG. 7 schematically shows a non-limiting embodiment of a computing system 400 that can enact one or more of the methods and processes described above. Computing system 400 is shown in simplified form. Computing system 400 may embody the computing system 10 described above and illustrated in FIG. 1. Computing system 400 may take the form of one or more personal computers, server computers, tablet computers, home-entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), and/or other computing devices, and wearable computing devices such as smart wristwatches and head mounted augmented reality devices.

Computing system 400 includes a logic processor 402, volatile memory 404, and a non-volatile storage device 406. Computing system 400 may optionally include a display subsystem 408, input subsystem 410, communication subsystem 412, and/or other components not shown in FIG. 7.

Logic processor 402 includes one or more physical devices configured to execute instructions. For example, the logic processor may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.

The logic processor may include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, the logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the logic processor 402 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic processor optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic processor may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. In such a case, it will be understood that these virtualized aspects may be run on different physical logic processors of various different machines.

Non-volatile storage device 406 includes one or more physical devices configured to hold instructions executable by the logic processors to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 406 may be transformed—e.g., to hold different data.

Non-volatile storage device 406 may include physical devices that are removable and/or built-in. Non-volatile storage device 406 may include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., ROM, EPROM, EEPROM, FLASH memory, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), or other mass storage device technology. Non-volatile storage device 406 may include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 406 is configured to hold instructions even when power is cut to the non-volatile storage device 406.

Volatile memory 404 may include physical devices that include random access memory. Volatile memory 404 is typically utilized by logic processor 402 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 404 typically does not continue to store instructions when power is cut to the volatile memory 404.

Aspects of logic processor 402, volatile memory 404, and non-volatile storage device 406 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.

The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 400 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function. Thus, a module, program, or engine may be instantiated via logic processor 402 executing instructions held by non-volatile storage device 406, using portions of volatile memory 404. It will be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.

When included, display subsystem 408 may be used to present a visual representation of data held by non-volatile storage device 406. The visual representation may take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of display subsystem 408 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 408 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic processor 402, volatile memory 404, and/or non-volatile storage device 406 in a shared enclosure, or such display devices may be peripheral display devices.

When included, input subsystem 410 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity; and/or any other suitable sensor.

When included, communication subsystem 412 may be configured to communicatively couple various computing devices described herein with each other, and with other devices. Communication subsystem 412 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network, such as an HDMI over Wi-Fi connection. In some embodiments, the communication subsystem may allow computing system 400 to send and/or receive messages to and/or from other devices via a network such as the Internet.

According to one aspect of the present disclosure, a computing system is provided, including one or more processors configured to receive an audio input. The one or more processors may be further configured to generate a text transcription of the audio input at a sequence-to-sequence speech recognition model configured to at least assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the sequence-to-sequence speech recognition model may be further configured to generate a plurality of hidden states. Based on the plurality of hidden states, the sequence-to-sequence speech recognition model may be further configured to generate a plurality of output text tokens corresponding to the plurality of frames. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may be further configured to output the text transcription including the plurality of output text tokens to an application program, a user interface, or a file storage location.

According to this aspect, the sequence-to-sequence speech recognition model may include an external alignment model configured to generate the plurality of external-model text tokens, an encoder neural network configured to generate the plurality of hidden states, and a decoder neural network configured to generate the plurality of output text tokens. The encoder neural network and the decoder neural network may be recurrent neural networks.

According to this aspect, the decoder neural network may be a monotonic chunkwise attention model.

According to this aspect, for each hidden state, the one or more processors are further configured to stochastically determine a binary attention state.

According to this aspect, the audio input may be a streaming audio input received by the one or more processors over an input time interval. The one or more processors may be configured to output the text transcription during the input time interval concurrently with receiving the audio input.

According to this aspect, the encoder neural network may be trained at least in part with an encoder loss function including a sequence-to-sequence loss term and a framewise cross-entropy loss term.

According to this aspect, the encoder neural network may be pre-trained with the framewise cross-entropy loss term prior to training with the sequence-to-sequence loss term.

According to this aspect, the encoder neural network may be trained with the sequence-to-sequence loss term and the framewise cross-entropy loss term concurrently via multi-task learning.

According to this aspect, the encoder neural network may be trained at least in part at a first linear bottleneck layer and a second linear bottleneck layer.

According to this aspect, the decoder neural network may be trained at least in part with a delay constrained training loss function including a sequence-to-sequence loss term and an attention weight regularization term.

According to this aspect, the decoder neural network may be trained at least in part with a minimum latency training loss function including a sequence-to-sequence loss term and a minimum latency loss term.

According to another aspect of the present disclosure, a method for use with a computing system is provided. The method may include receiving an audio input. The method may further include generating a text transcription of the audio input at a sequence-to-sequence speech recognition model. The text transcription may be generated at least by assigning a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the text transcription may be further generated by generating a plurality of hidden states. Based on the plurality of hidden states, the text transcription may be further generated by generating a plurality of output text tokens corresponding to the plurality of frames. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The method may further include outputting the text transcription including the plurality of output text tokens to an application program, a user interface, or a file storage location.

According to this aspect, the sequence-to-sequence speech recognition model may include an external alignment model configured to generate the plurality of external-model text tokens, an encoder neural network configured to generate the plurality of hidden states, and a decoder neural network configured to generate the plurality of output text tokens. The encoder neural network and the decoder neural network may be recurrent neural networks.

According to this aspect, the audio input may be a streaming audio input received over an input time interval. The text transcription may be output during the input time interval concurrently with receiving the audio input.

According to this aspect, the method may further include training the encoder neural network at least in part with an encoder loss function including a sequence-to-sequence loss term and a framewise cross-entropy loss term.

According to this aspect, the method may further include pre-training the encoder neural network with the framewise cross-entropy loss term prior to training with the sequence-to-sequence loss term.

According to this aspect, the method may further include training the encoder neural network with the sequence-to-sequence loss term and the framewise cross-entropy loss term concurrently via multi-task learning.

According to this aspect, the method may further include training the decoder neural network at least in part with a delay constrained training loss function including a sequence-to-sequence loss term and an attention weight regularization term. The decoder neural network may be a monotonic chunkwise attention model.

According to this aspect, the method may further include training the decoder neural network at least in part with a minimum latency training loss function including a sequence-to-sequence loss term and a minimum latency loss term. The decoder neural network may be a monotonic chunkwise attention model.

According to another aspect of the present disclosure, a computing system is provided, including one or more processors configured to receive an audio input. The one or more processors may be further configured to generate a text transcription of the audio input at a sequence-to-sequence speech recognition model configured to at least, at an external alignment model, assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. The sequence-to-sequence speech recognition model may be further configured to, at one or more recurrent neural networks including at least a monotonic chunkwise attention model, generate a plurality of output text tokens corresponding to the plurality of frames. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may be further configured to output the text transcription including the plurality of output text tokens to an application program, a user interface, or a file storage location.

It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims

1. A computing system comprising:

one or more processors configured to:
receive an audio input;
generate a text transcription of the audio input at a sequence-to-sequence speech recognition model that includes a trained external alignment model, the sequence-to-sequence speech recognition model being configured to at least:
assign, via the trained external alignment model, a respective plurality of external-model text tokens to a plurality of frames included in the audio input, wherein each external-model text token has an external-model alignment within the audio input;
based at least in part on the audio input and the respective external-model alignments of the external-model text tokens, generate a plurality of output text tokens corresponding to the plurality of frames;
compute a latency between the plurality of external-model text tokens and the plurality of output text tokens based at least in part on differences between respective output boundaries of the output text tokens and corresponding external-model boundaries of the external-model text tokens, wherein:
each output text token has a corresponding output alignment within the audio input; and
for each output text token, a latency between the output alignment and the external-model alignment is constrained to be below a predetermined latency threshold; and
output the text transcription including the plurality of output text tokens.

2. The computing system of claim 1, wherein:

the audio input is a streaming audio input received by the one or more processors over an input time interval; and
the one or more processors are configured to output the text transcription during the input time interval concurrently with receiving the audio input.

3. The computing system of claim 1, wherein the one or more processors are further configured to pre-process the audio input at least in part by dividing the audio input into the plurality of frames.

4. The computing system of claim 1, wherein, at the external alignment model, the one or more processors are further configured to assign the plurality of external-model text tokens to the frames as indicators of respective senone-level features included in the audio input.

5. The computing system of claim 1, wherein the sequence-to-sequence speech recognition model includes one or more recurrent neural networks.

6. The computing system of claim 5, wherein the external alignment model is a recurrent neural network.

7. The computing system of claim 5, wherein the one or more recurrent neural networks include a trained encoder neural network and a trained decoder neural network.

8. The computing system of claim 7, wherein the trained decoder neural network is a monotonic chunkwise attention model at which the one or more processors are further configured to:

compute a plurality of monotonic energy activations based at least in part on a plurality of encoder outputs of the trained encoder neural network; and
compute a respective plurality of selection probabilities of the output text tokens based at least in part on the monotonic energy activations.

9. The computing system of claim 8, wherein the one or more processors are configured to compute the output alignments of the output text tokens based at least in part on the plurality of selection probabilities.

10. The computing system of claim 1, wherein the sequence-to-sequence speech recognition model further includes a one-dimensional convolutional layer.

11. A method for use with a computing system, the method comprising:

receiving an audio input;
generating a text transcription of the audio input at a sequence-to-sequence speech recognition model that includes a trained external alignment model, wherein generating the text transcription at the sequence-to-sequence speech recognition model includes:
assigning, via the trained external alignment model, a respective plurality of external-model text tokens to a plurality of frames included in the audio input, wherein each external-model text token has an external-model alignment within the audio input;
based at least in part on the audio input and the respective external-model alignments of the external-model text tokens, generating a plurality of output text tokens corresponding to the plurality of frames;
computing a latency between the plurality of external-model text tokens and the plurality of output text tokens based at least in part on differences between respective output boundaries of the output text tokens and corresponding external-model boundaries of the external-model text tokens, wherein:
each output text token has a corresponding output alignment within the audio input; and
for each output text token, a latency between the output alignment and the external-model alignment is constrained to be below a predetermined latency threshold; and
outputting the text transcription including the plurality of output text tokens.

12. The method of claim 11, wherein:

the audio input is a streaming audio input received over an input time interval; and
the text transcription is output during the input time interval concurrently with receiving the audio input.

13. The method of claim 11, further comprising pre-processing the audio input at least in part by dividing the audio input into the plurality of frames.

14. The method of claim 11, further comprising, at the external alignment model, assigning the plurality of external-model text tokens to the frames as indicators of respective senone-level features included in the audio input.

15. The method of claim 11, wherein the sequence-to-sequence speech recognition model includes one or more recurrent neural networks.

16. The method of claim 15, wherein the external alignment model is a recurrent neural network.

17. The method of claim 15, wherein the one or more recurrent neural networks include a trained encoder neural network and a trained decoder neural network.

18. The method of claim 17, wherein the trained decoder neural network is a monotonic chunkwise attention model, the method further comprising:

computing a plurality of monotonic energy activations based at least in part on a plurality of encoder outputs of the trained encoder neural network; and
computing a respective plurality of selection probabilities of the output text tokens based at least in part on the monotonic energy activations.

19. The method of claim 11, wherein the sequence-to-sequence speech recognition model further includes a one-dimensional convolutional layer.

20. A computing system comprising:

one or more processors configured to:
receive an audio input;
pre-process the audio input at least in part by dividing the audio input into a plurality of frames;
generate a text transcription of the audio input at a sequence-to-sequence speech recognition model that includes a plurality of trained neural networks, the sequence-to-sequence speech recognition model being configured to at least:
at a first trained neural network of the plurality of trained neural networks, assign a respective plurality of external-model text tokens to the plurality of frames, wherein each external-model text token has an external-model alignment within the audio input; and
at one or more additional trained neural networks of the plurality of trained neural networks:
based at least in part on the audio input and the respective external-model alignments of the external-model text tokens, generate a plurality of output text tokens corresponding to the plurality of frames;
compute a latency between the plurality of external-model text tokens and the plurality of output text tokens based at least in part on differences between respective output boundaries of the output text tokens and corresponding external-model boundaries of the external-model text tokens, wherein:
each output text token has a corresponding output alignment within the audio input; and
for each output text token, a latency between the output alignment and the external-model alignment is constrained to be below a predetermined latency threshold; and
output the text transcription including the plurality of output text tokens.
Patent History
Publication number: 20230154467
Type: Application
Filed: Jan 20, 2023
Publication Date: May 18, 2023
Applicant: Microsoft Technology Licensing, LLC (Redmond, WA)
Inventors: Yashesh GAUR (Bellevue, WA), Jinyu LI (Redmond, WA), Liang LU (Redmond, WA), Hirofumi INAGUMA (Kyoto), Yifan GONG (Sammamish, WA)
Application Number: 18/157,303
Classifications
International Classification: G10L 15/26 (20060101); G10L 15/16 (20060101);