Pattern matching for large vocabulary speech recognition with packed distribution and localized trellis access
A method is provided for improving pattern matching in a speech recognition system having a plurality of acoustic models (20). Similarity measures for acoustic feature vectors (54) are determined in groups that are then buffered into cache memory (59). To further reduce computational processing, the acoustic data may be partitioned amongst a plurality of processing nodes (66, 67, 68). In addition, a priori knowledge of the spoken order may be used to establish the access order (124) used to copy records from the main speech parameter table (120, 200) into a sub-table (130, 204). The sub-table is processed such that the entries are in contiguous memory locations (206) and sorted according to the processing order (208). The speech processing algorithm is then directed to operate upon the sub-table (210) which causes the processor to load the sub-table into high speed cache memory (104, 212).
This application is a continuation-in-part of U.S. application Ser. No. 10/127,184, entitled “Pattern Matching for Large Vocabulary Speech Recognition Systems,” filed Apr. 22, 2002.
BACKGROUND OF THE INVENTION
The present invention relates generally to large vocabulary continuous speech recognition systems, and more particularly, to a method for improving pattern matching in a large vocabulary continuous speech recognition system.
Pattern matching is one of the more computationally intensive aspects of the speech recognition process. Conventional pattern matching involves computing similarity measures for each acoustic feature vector in relation to each of the acoustic models. However, due to the large number of acoustic models, only a subset of acoustic models may be loaded into the available memory at any given time. In order to compute similarity measures for a given acoustic feature vector, conventional pattern matching requires a number of I/O operations to load and unload each of the acoustic models into the available memory space.
Therefore, it is desirable to provide an improved method of pattern matching that reduces the number of I/O operations associated with loading and unloading each acoustic model into memory.
SUMMARY OF THE INVENTION
In accordance with the present invention, a method is provided for improving pattern matching in a speech recognition system having a plurality of acoustic models. The improved method includes: receiving continuous speech input; generating a sequence of acoustic feature vectors that represent temporal and spectral behavior of the speech input; loading a first group of acoustic feature vectors from the sequence of acoustic feature vectors into a memory workspace accessible to a processor; loading an acoustic model from the plurality of acoustic models into the memory workspace; and determining a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the acoustic model. Prior to retrieving another group of acoustic feature vectors, similarity measures are computed for the first group of acoustic feature vectors in relation to each of the acoustic models employed by the speech recognition system. In this way, the improved method reduces the number of I/O operations associated with loading and unloading each acoustic model into memory.
In accordance with another aspect of the invention, a method is provided for processing speech data utilizing high speed cache memory. The cache memory has an associated cache mechanism for transfer of data from system memory into cache memory that may operate automatically or under program control, depending on the features provided by the processor. First, a main table of speech data in system memory is provided along with a list that establishes a processing order of a subset of said speech data. In this regard, the term “list” is intended to encompass any data structure that can represent sequential information (such as the sequential information found in a speech utterance).
The method involves copying the subset of said speech data into a sub-table that is processed such that entries in said sub-table occupy contiguous memory locations. Then the sub-table is operated upon using a speech processing algorithm, and the cache mechanism associated with said high speed cache memory is employed (automatically or programmatically) to transfer the sub-table into said high speed cache memory. In this way, the speech processing algorithm accesses the subset of speech data at cache memory access rates and thereby provides significant speed improvement.
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The exemplary speech recognizer performs the recognition process in three steps as shown in
Next, acoustic pattern matching occurs at step 12. During this step, a similarity measure is computed between each frame of input speech and each reference pattern. The process defines a local measure of closeness between acoustic feature vectors and further involves aligning two speech patterns which may differ in duration and rate of speaking. The pattern classification step uses a plurality of acoustic models 14 generated during the training phase.
A diagram of a simple Hidden Markov Model is shown at 20 of
Each Hidden Markov Model includes a collection of probabilities associated with the states themselves and with transitions amongst the states. Because probability values associated with each state may be more complex than a single value could represent, some systems will represent probability in terms of a Gaussian distribution. To provide a more robust model, a mixture of Gaussian distributions may be used in a blended manner to represent probability values as shown diagrammatically at 26 and referenced by a mixture index pointer 28. Thus, associated with each state is a mixture index pointer which in turn identifies the Gaussian mixture density data for that state.
Transitions amongst the states are illustrated by arrows. Each self-loop transition has an associated transition probability as depicted at 22; whereas each transition to another state also has an associated transition probability as depicted at 24. Likewise, transition probabilities may be represented by Gaussian distributions data or Gaussian mixture density data.
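For illustration, a state's blended Gaussian mixture density as described above may be sketched as follows. This is a minimal illustrative computation, not the implementation disclosed in the specification; the diagonal-covariance assumption and the log-sum-exp formulation are choices made here for the sketch.

```python
import math

def log_gaussian(x, mean, var):
    # Log density of a one-dimensional Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mixture_density(x, weights, means, variances):
    # Log of a weighted sum of diagonal-covariance Gaussians; each dimension
    # of the feature vector x contributes an independent log-density term.
    terms = []
    for w, mu, var in zip(weights, means, variances):
        terms.append(math.log(w) + sum(log_gaussian(xi, m, v)
                                       for xi, m, v in zip(x, mu, var)))
    # Log-sum-exp keeps the blended mixture numerically stable.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

A mixture with a single component of weight one reduces, as expected, to the plain Gaussian log density.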
In the context of large vocabulary speech recognizers, Hidden Markov Models are typically used to model sub-word units, such as phonemes. However, speech recognition systems that employ word-level acoustic models or acoustic models based on another speech sub-component are also within the scope of the present invention. For more information regarding the basic structure of Hidden Markov Modeling, see Junqua, Jean-Claude and Haton, Jean-Paul, Robustness in Automatic Speech Recognition, Fundamentals and Applications, Kluwer Academic Publishers, 1996.
Speech recognition concludes with a decoding step 16. The probability that a particular phoneme was spoken is provided by the acoustic models as part of the pattern matching process. A sequence of words can then be constructed by concatenating the phonemes observed during the pattern matching process. The process of combining probabilities for each possible path and searching through the possible paths to select the one with highest probability is commonly referred to as decoding or searching. In other words, the decoding process selects a sequence of words having the highest probability given the observed input speech. A variety of well known searching algorithms may be used to implement the decoding process.
In one aspect of the present invention, an improved method is provided for performing pattern matching in a large vocabulary continuous speech recognition system as shown in
Referring to
A similarity measure can then be computed at step 36 for each acoustic feature vector in the first group of vectors. For example, a Gaussian computation may be performed for each acoustic feature vector as is well known in the art. Resulting similarity measures may be stored in an output memory space which is also accessible to the processor performing the computations. By performing the similarity computation for a group of acoustic feature vectors, the present invention reduces the number of I/O operations required to load and unload each acoustic model.
Prior to retrieving additional acoustic models, the acoustic models currently resident in the memory workspace are removed at step 38. Additional acoustic models are then loaded into the memory space at step 42. If desired, the removal step 38 can be performed concurrently with the loading step 42; the loading step can overwrite what is already stored in the memory workspace, thereby removing the models then resident. Similarity measures are computed for each acoustic feature vector in the first vector group in relation to each of the additional acoustic models resident in the memory workspace at step 36. Again, the resulting similarity measures may be stored in an output memory space which is also accessible to the processor performing the computations. This process is repeated via step 40 until similarity measures are computed for the first group of acoustic feature vectors in relation to each of the acoustic models employed by the speech recognition system.
Once similarity measures have been determined for the first group of acoustic feature vectors, the search process is performed at step 44. In particular, the search process updates the search space based on the similarity measures for the first group of acoustic feature vectors. It is to be understood that this aspect of the present invention is not limited to a particular searching algorithm, but may be implemented using a variety of well known searching algorithms.
Contemporaneous with the search process, a subsequent group of acoustic feature vectors may be retrieved into the memory workspace at step 48. A similarity measure is computed for each acoustic feature vector in this subsequent group as described above. In other words, acoustic models are loaded and unloaded into the memory workspace and a Gaussian computation is performed for each acoustic feature vector in relation to the acoustic models resident in the memory workspace. This process is repeated via step 40 until similarity measures are computed for the subsequent group of acoustic feature vectors in relation to each of the acoustic models employed by the speech recognition system. It is envisioned that the first group of acoustic feature vectors is removed from the memory workspace prior to loading the subsequent group of acoustic feature vectors into the memory workspace. One skilled in the art will readily recognize that this is an iterative process that is performed for each of the acoustic feature vectors that represents the input speech.
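The grouped iteration described above may be sketched as follows. This is an illustrative sketch rather than the disclosed implementation; the helpers load_models and score are hypothetical stand-ins for the model I/O and the Gaussian computation, and the group and batch sizes are arbitrary.

```python
def grouped_similarity(feature_vectors, model_ids, load_models, score,
                       group_size=10, model_batch=4):
    # Compute score(vector, model) for every pair, loading each batch of
    # acoustic models only once per group of feature vectors, so model I/O
    # is amortized over the whole group rather than repeated per vector.
    results = {}
    for g in range(0, len(feature_vectors), group_size):
        group = feature_vectors[g:g + group_size]
        for b in range(0, len(model_ids), model_batch):
            # Load a batch of models, score it against every vector in the
            # group, then let the next batch overwrite it.
            batch = load_models(model_ids[b:b + model_batch])
            for mid, model in batch.items():
                for i, vec in enumerate(group, start=g):
                    results[(i, mid)] = score(vec, model)
    return results
```

With 20 vectors, 8 models, a group size of 10 and a batch size of 4, the models are loaded only 4 times in total, versus once per vector-model pairing in the naive ordering.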
It is further envisioned that the improved method for performing pattern matching may be distributed across multiple processing nodes as shown in
An acoustic front-end node 52 is receptive of speech input and operable to generate a sequence of acoustic feature vectors as is known in the art. The acoustic front-end node 52 is further able to replicate the sequence of acoustic feature vectors 54 and distribute the replicated sequences 54 amongst the plurality of pattern matching nodes 56. It is envisioned that the replicated sequence of acoustic feature vectors may be partitioned into groups of vectors which are periodically or upon request communicated to the plurality of pattern matching nodes.
Each pattern matching node 56 is comprised of a data processor 58 and a memory space 59 accessible to the data processor 58. To perform pattern matching, each pattern matching node 56 is adapted to receive the replicated sequence of acoustic feature vectors 54 from the acoustic front-end node 52. As described above, each pattern matching node 56 is operable to load one or more acoustic models 60 into a resident memory space, and then determine similarity measures for each acoustic feature vector in relation to the loaded acoustic models. In this approach, each pattern matching node 56 is responsible for a predetermined range of acoustic models, such that computation of similarity measures for a given acoustic feature vector or group of vectors can occur in parallel, thereby further improving the overall performance of the speech recognition process.
In another aspect of the present invention, the decoding process may be distributed amongst a plurality of processing nodes. In general, the search space is comprised of observed acoustic data (also referred to as the potential search space). Referring to
To further reduce computational processing, the observed acoustic data may be partitioned amongst a plurality of processing nodes as shown in
Partitioning the observed acoustic data further includes defining link data 64 that is indicative of the relationships between the segmented acoustic data residing at the different processing nodes. Since each processing node only evaluates a subset of the observed acoustic data, link data is maintained at each of the processing nodes. As further described below, changes in the link data are communicated amongst the plurality of processing nodes.
In
For illustration purposes, a decoding process based on lexical trees is further described below. Lexical trees generally represent the pronunciations of words in the vocabulary and may be constructed by concatenating the phonemes observed during the pattern matching process. Each node in a lexical tree is associated with a state of a certain phoneme of a certain word of a certain word history for language model conditioning. The states of all phonemes of all words have been compiled into lexical trees. These trees are replicated for word history language model conditioning.
Referring to
The pattern matching subsystem 82 is comprised of a plurality of pattern matching nodes 88. To perform pattern matching, each pattern matching node 88 is adapted to receive a replicated sequence of acoustic feature vectors from an acoustic front-end node (not shown). As described above, each pattern matching node 88 determines similarity measures for a predetermined range of acoustic models, such that computation of similarity measures for a given acoustic feature vector occurs in parallel. Resulting similarity measures are then communicated from each of the pattern matching nodes 88 via the communication link 86 to the lexical search subsystem 84.
Resulting similarity measures are preferably communicated in a multicast mode over an unreliable link. A reliable link typically requires a connection protocol, such as TCP, which guarantees that the information is received by the intended recipient. Reliable links are typically more expensive in terms of bandwidth and latency, and thus should only be used when guaranteed delivery is required. In contrast, an unreliable link usually does not require a connection to be opened but does not guarantee that all transmitted data is received by the recipient. In an exemplary embodiment, the communication link 86 is a standard Ethernet link (e.g., 100 Mbits/sec). Although an unreliable link is presently preferred to maximize throughput, a reliable link may also be used to communicate similarity measures between the pattern matching subsystem and the lexical searching subsystem.
Similarly, the lexical search subsystem 84 is comprised of a plurality of searching nodes 90. The search space is partitioned such that each searching node 90 is responsible for evaluating one or more of the lexical trees which define the search space. To do so, each searching node 90 is adapted to receive similarity measures from each of the pattern matching nodes 88 in the pattern matching subsystem 82.
If a searching node does not receive some of the similarity measure data that it needs, the node could either compute it or ask for it to be retransmitted. To recompute similarity measures, the searching node would need access to all of the acoustic models, which could entail considerable memory use. On the other hand, retransmitting similarity measures is equivalent to implementing reliable multicast. Although the approach is expensive in terms of bandwidth and especially in terms of latency, it may be feasible in some applications.
For instance, the latency problem due to retransmissions inherent in the reliable multicast mode may not be a problem with the horizontal caching technique described above. To maximize throughput on the communication link, assume that a daisy chain is constructed with reliable links between the pattern matching nodes 88. The daisy chain is used to synchronize the transmission of the similarity measures using a round-robin approach. This approach has the advantage that the pattern matching nodes would not try to write on the shared link at the same time, which would otherwise create collisions and possible retransmissions.
Using this approach, the first pattern matching node would write the first 10 frames (equivalent to 100 milliseconds of speech) of its output cache on the shared unreliable link. The first node then signals the next node on the chain that it is now its turn to transmit data. The next node will transmit its data and then signal yet another node. Assuming 8 pattern matching nodes, the total amount of data each node will have to send over the shared medium is 10 frames×10,000 mixtures/8 nodes×4 bytes=50 Kbytes=0.4 Mbits. To complete this process for 8 nodes, it takes 32 milliseconds over a 100 Mbits per second shared link, not accounting for overhead or latency due to the transmission and synchronization of the daisy chain. Since only one third of the total aggregate bandwidth of the communication link has been used, the remainder of the bandwidth could be used for retransmissions associated with the reliable multicast. One skilled in the art will readily recognize that if the latencies are too high, the horizontal caching technique provides the flexibility to increase the batch size to more than 10 frames, thereby reducing the sensitivity to latencies.
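The bandwidth arithmetic above can be checked numerically. The figure of 10,000 Gaussian scores per frame is an assumption inferred from the stated 50 Kbyte per-node total; the other quantities are taken from the passage.

```python
FRAMES_PER_BATCH = 10        # 10 frames = 100 ms of speech at a 10 ms frame period
SCORES_PER_FRAME = 10_000    # assumed number of Gaussian scores per frame
NODES = 8
BYTES_PER_SCORE = 4
LINK_MBITS_PER_SEC = 100     # standard Ethernet link from the embodiment

# Per-node share of the scores for one 10-frame batch, in bytes and Mbits.
per_node_bytes = FRAMES_PER_BATCH * SCORES_PER_FRAME // NODES * BYTES_PER_SCORE
per_node_mbits = per_node_bytes * 8 / 1e6

# Round-robin time for all 8 nodes on the shared link, in milliseconds.
total_ms = NODES * per_node_mbits / LINK_MBITS_PER_SEC * 1000
```

Under these assumptions the computation reproduces 50 Kbytes (0.4 Mbits) per node and 32 milliseconds for the full round-robin, about one third of the 100 ms batch interval.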
Each searching node 90 only processes a subset of the lexical trees in the search space. To do so, each searching node 90 needs to know the state of its associated lexical trees as well as data indicating the links between all of the lexical trees in the search space. Thus, each searching node further includes a data store for maintaining the link data.
Since processing of associated lexical trees by a searching node may result in changes to the link data, each searching node 90 is further operable to communicate changes to the link data to each of the other searching nodes in the lexical search subsystem. Here, the communication problem is more difficult because synchronization up to the frame time (e.g., 10 milliseconds) and reliability must be guaranteed. Although a shared communication link may be feasible, a switching network is preferably used to link searching nodes in the lexical search subsystem. In particular, each searching node 90 is interconnected by a switching fabric 92 having a dedicated link.
In operation, each searching node 90 will be listening and reading the similarity measures from the pattern matching subsystem 82. In this case, each searching node 90 is multi-threaded, so that reading from the communication link can be done in parallel with processing of lexical trees. At the end of each frame, each search node 90 will send the likely word endings and a few other statistics (e.g., likelihood histograms used to adapt the beam search) to a search reduction server 94. The search reduction server 94 is operable to combine information about word endings, apply a language model to generate a new (global) search state, and send the search state back (in multicast mode) to each searching node 90. All of this must be accomplished in a time window smaller than the frame time, and in a reliable way, since the search state has to be kept consistent across all nodes. Therefore, efficient reliable multicast is preferably employed. In addition, the search reduction server is further operable to generate the recognized sentence and to compute statistics, such as the confidence measure or the speaker identity, as post processing.
Reducing the size of the search space is another known technique for reducing the computational processing associated with the decoding process. Histogram pruning is one such technique for reducing the number of active nodes residing in the search space; it achieves N best (or approximately N best) pruning through the computation of a histogram. The histogram represents the probability density function of the scores of the nodes. It is defined as y=f(X), where X is the score and y is the number of nodes at a given time t with that score. Since scores are real numbers, X does not represent a specific value, but rather a range.
For illustration purposes, a simplistic example of histogram pruning is provided below. Suppose we have 10 active states at time t, and that we wish to retain only 5 of them. Assume the active states are as follows:
- s0: score 3 associated to node n0
- s1: score 2 associated to node n1
- s2: score 5 associated to node n2
- s3: score 4 associated to node n3
- s4: score 4 associated to node n4
- s5: score 3 associated to node n5
- s6: score 5 associated to node n6
- s7: score 3 associated to node n7
- s8: score 2 associated to node n8
- s9: score 5 associated to node n9
Thus, the histogram maps:
- f(2)=2 (states s1 and s8)
- f(3)=3 (states s0, s5, s7)
- f(4)=2 (states s3 and s4)
- f(5)=3 (states s2, s6, s9)
We do not need to know which states are associated with which value of X, and therefore a simple array y=f(X) is sufficient.
Next, to identify the N=5 best, we look at the histogram to compute the threshold, T, corresponding to the pruning. If T=6 or above, no states satisfy score(s)>=T. If T=5, we accumulate backwards the number of nodes s which satisfy score(s)>=T: f(5)=3. In this case, only three nodes meet the threshold. Since three nodes are insufficient to meet our pruning criterion (3<N=5), we continue by setting T=4. In this case, five nodes meet the threshold. The threshold (T=4) can then be applied to the list of nodes as follows:
- s0: score 3 associated to node n0===>remove
- s1: score 2 associated to node n1===>remove
- s2: score 5 associated to node n2===>KEEP
- s3: score 4 associated to node n3===>KEEP
- s4: score 4 associated to node n4===>KEEP
- s5: score 3 associated to node n5===>remove
- s6: score 5 associated to node n6===>KEEP
- s7: score 3 associated to node n7===>remove
- s8: score 2 associated to node n8===>remove
- s9: score 5 associated to node n9===>KEEP
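The worked example above can be reproduced with a short sketch. Integer score bins are assumed here for simplicity; as the passage notes, real scores are real numbers and X would denote a range rather than a specific value.

```python
from collections import Counter

def histogram_prune(scores, n_keep):
    # Build the histogram y = f(X) from a mapping of state -> score; we do
    # not need to remember which states fall in which bin.
    hist = Counter(scores.values())
    threshold = max(hist)            # start at the best score bin
    kept = hist[threshold]
    # Walk the threshold downward until at least n_keep states survive.
    while kept < n_keep and threshold > min(hist):
        threshold -= 1
        kept += hist.get(threshold, 0)
    # Apply the threshold to the list of states: keep score >= T.
    return {state for state, x in scores.items() if x >= threshold}
```

On the ten states listed above with N=5, the threshold settles at T=4 and the states s2, s3, s4, s6, and s9 are kept, matching the table.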
Histogram pruning may be implemented in the distributed environment of the present invention as described below. Assume the search space is divided amongst three search nodes, K1, K2, and K3, such that:
- s0: score 3: processed by node K1
- s1: score 2: processed by node K2
- s2: score 5: processed by node K3
- s3: score 4: processed by node K1
- s4: score 4: processed by node K1
- s5: score 3: processed by node K1
- s6: score 5: processed by node K2
- s7: score 3: processed by node K2
- s8: score 2: processed by node K3
- s9: score 5: processed by node K3
To identify 5 active states, each search processing node computes its own histogram as follows:
- K1: f(3)=2 (s0 and s5), f(4)=2 (s3 and s4)
- K2: f(2)=1 (s1), f(3)=1 (s7), f(5)=1 (s6)
- K3: f(2)=1 (s8), f(5)=2 (s2,s9)
Unfortunately, this example is not very representative of the typical distribution of scores. The distribution is typically of an identifiable form, such as exponential. In other words, y=f(X)=alpha*exp(-alpha*(M-X)). In this case, the threshold may be computed from estimates of the parameters alpha and M. Specifically, the threshold is T=M-(1/alpha)*log N, where M is the maximum score and the expectation (average value) is M-1/alpha.
To compute the threshold, an algorithm is implemented at each searching node. The algorithm involves looping through all the nodes and computing the mean value and max value of all scores. Let Mk denote the max score on search processing node Kk, Ek denote the mean value of the scores on node Kk, and Wk be the number of active nodes on Kk, where k=1, 2 . . . n.
The overall threshold may be recovered by using Mk, Ek, and Wk from each of the searching nodes. The overall maximum M is equal to the largest Mk and the overall mean is 1/(sum Wk)*(sum of Wk*Ek). Since Mk, Ek, and Wk are the only entities that need to be transmitted, they are called sufficient statistics for the computation of the threshold T. Furthermore, these statistics are much smaller than the large array y=f(X).
Based on these sufficient statistics, computation of a threshold is done at one of the processing nodes (possibly the root node) and then transmitted back to each of the search nodes. The threshold is applied to the active nodes at each processing node as previously explained.
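The recovery of the threshold from the sufficient statistics might be sketched as follows. This is an illustrative sketch under the exponential score model described above; the estimate 1/alpha = M - mean follows from the stated expectation M - 1/alpha, and the triples (Mk, Ek, Wk) are as defined in the passage.

```python
import math

def combined_threshold(stats, n_keep):
    # stats holds one (Mk, Ek, Wk) triple per searching node: max score,
    # mean score, and number of active nodes on node Kk.
    M = max(Mk for Mk, Ek, Wk in stats)              # overall maximum score
    W = sum(Wk for Mk, Ek, Wk in stats)              # total active nodes
    mean = sum(Wk * Ek for Mk, Ek, Wk in stats) / W  # weighted overall mean
    inv_alpha = M - mean                             # from expectation M - 1/alpha
    return M - inv_alpha * math.log(n_keep)          # T = M - (1/alpha) * log N
```

Note how little data crosses the network: each node contributes only three numbers, rather than the full array y=f(X).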
Packed Distribution and Localized Trellis Access
Large vocabulary speech applications will typically employ a very large number of speech parameters. For example, an exemplary large vocabulary speech recognition system may require a Gaussian mixture table containing 100,000 Gaussians, or more. There is a class of speech processing problems that initially require access to the entire table, but that later constrain access to a subset of the entire table. For example, in a multi-pass recognizer, the speech processing algorithm uses the first pass to constrain the search space used by subsequent passes.
The need to deal with massive amounts of data makes large vocabulary speech applications highly processor intensive. Unfortunately, conventional processing algorithms do little to combat this problem, but instead place the computational burden on comparatively expensive processors. This traditional “brute force” approach has placed large vocabulary applications off limits for a variety of consumer products that do not have powerful processors. However, as will be more fully explained herein, it is possible to significantly improve processing throughput, and to significantly reduce processor overhead, by taking advantage of the a priori knowledge of the temporal order or spoken order inherent in many speech applications. As will be more fully explained, these improvements are achieved through a packed distribution and localized trellis access method, whereby a subset of the full parameter data space is selected, ordered and packed into a new data structure. The new data structure is designed so that the processor can load it into its faster cache memory and then utilize the cached information in a very efficient manner. Specifically, the information is ordered and packed to allow the processor to access the data in substantially sequential order, with a substantially reduced likelihood that the cache will need to be flushed and reloaded (a time consuming and inefficient process).
To understand how the packed distribution and localized trellis access method is able to produce processing speed improvements (10-fold or more), some knowledge of microprocessor caching techniques will be helpful.
The basic concept behind caching is to place the most frequently used program instructions and/or the most frequently used data in the fastest memory available to the processor. In random access memory devices, data access is mediated by a clock. This clock dictates how quickly the information can be read from or written to the memory device under its control. In a typical microprocessor architecture, the microprocessor itself may operate under control of a high speed clock, while the main memory of the computer system will typically operate using a slower clock. This is because it is generally not economically feasible to construct random access main memory circuits that are able to operate at the same clock speed as the microprocessor.
Illustrated in
Typically, the memory architecture, illustrated in
In a typical speech processing application, such as a large vocabulary application, the speech parameters will be stored in a table occupying a portion of the main memory 100. As the speech processing algorithm performs its task, utilizing these parameters, portions of the parameter table will be loaded into the microprocessor's cache memory—as a natural consequence of being accessed by the speech processing algorithm. In conventional speech processing algorithms, however, no attempt is made to optimize what gets loaded into the cache.
According to the present invention, it is possible to optimize what gets loaded into the cache and thereby substantially improve the speed at which speech processing tasks may be performed. Referring to
The present invention selects a subset of the Gaussian mixture table, corresponding to the Gaussian mixture values actually used in the recognition process, and stores that subset in a packed mixture table. In
Speech data is different from other forms of data, such as financial data, in that there is a sequential order or spoken order to the speech data. This can be illustrated by a directed graph, shown at 122. The graph shows all possible sequences by which one sound unit may follow another sound unit. In this regard, sound units can be individual phones, or they can be larger structures, such as syllables, words, etc. To illustrate the concept of the directed graph, refer to
Rather than utilize the selected Gaussian mixture values directly from mixture table 120, the packed distribution and localized trellis access technique resorts and packs the selected subset of table 120 into the packed mixture table 130. This is performed using the processing step or module 128. The selected subset from table 120 is (a) placed in sequential order that corresponds to the sequential order or spoken order described by graph 122 and (b) packed so that the respective sorted values are adjacent or contiguous in memory.
After performing the resorting and packing operation, the algorithm passes control to the speech processing algorithm that will utilize the data. This is done by passing the address of the packed mixture table 130 to the processing algorithm, so that it will operate upon the data stored in the packed mixture table, rather than upon the data in the Gaussian mixture table 120. In so doing, the microprocessor will load the packed mixture table 130 into its cache 104, where all operations using the cached values will be performed at much higher speed than would be possible if main memory were utilized.
Further expanding on the explanation provided by
In
The operation of the re-sort and pack step or module 128 (
For a further understanding of the presently preferred processing method, refer to
At step 204, data items are selected from the main table based on the list order and these are copied into a sub-table. As illustrated by constraining steps 206 and 208, the sub-table is processed so that entries are stored in contiguous memory locations. In the presently preferred embodiment, the sub-table may be implemented in main system memory. In addition, the sub-table is processed so that entries are sorted according to the processing order established by the list in step 202. It will be appreciated that constraining steps 206 and 208 can be processed in either order, or concurrently. In the presently preferred embodiment the sub-table is constructed by sequentially adding entries to the sub-table in successively contiguous memory locations, with the order of the entries being established by selecting them from the main table in the order established by the list. Of course, alternate embodiments can be envisioned where the sub-table is initially constructed in a non-contiguous fashion and then later compacted, or where the sub-table is initially constructed with one sort order and thereafter re-sorted according to the list.
After copying the entries to the sub-table in step 204, the applicable speech processing algorithm is directed to operate upon the sub-table at step 210. Operating upon the sub-table causes it to be transferred into high speed cache memory by the cache mechanism associated with that memory. In this regard, most modern microprocessors will automatically transfer a given block of information into high speed cache memory so that the block of data can be processed more rapidly. Of course, the transfer into high speed cache memory can also be effected by an explicit processor command, if desired.
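The flowchart steps can be sketched in miniature as follows. This is a hedged illustration, not the claimed implementation: the function names, the keying of the main table, and the trivial `process` stand-in for the speech processing algorithm are all assumptions made for clarity.

```python
from array import array

def build_sub_table(main_table, access_order):
    """Steps 202-208: given the list establishing the processing order
    (step 202), copy the selected records from the main table (step 204)
    into a sub-table whose entries occupy contiguous memory (step 206)
    and are sorted according to the processing order (step 208)."""
    sub_table = array("d")
    for key in access_order:
        sub_table.append(main_table[key])
    return sub_table

def process(sub_table):
    """Step 210: the algorithm operates on the sub-table; sequential
    access over one contiguous block is what allows the processor's
    cache mechanism to pull it into high speed cache memory (step 212)."""
    return sum(sub_table)

main_table = {10: 1.5, 11: 2.5, 12: 4.0}  # hypothetical speech parameters
order = [12, 10]                          # a priori processing order
sub = build_sub_table(main_table, order)
print(list(sub))      # [4.0, 1.5]
print(process(sub))   # 5.5
```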
The packed distribution and localized trellis access method illustrated in the accompanying drawings may be used to improve a variety of speech processing algorithms, including:
- local distance computation and trellis expansion for algorithms of the Viterbi and Baum-Welch type. These computations are central to many large vocabulary continuous speech recognition training and recognition applications;
- Viterbi beam search algorithms for real-time recognition;
- constrained search on word/phone graphs or focused language models;
- re-scoring of word lattices for acoustic model adaptation;
- maximum mutual information estimation (MMIE) acoustic model training;
- expectation maximization-based maximum likelihood acoustic model training;
- multi-pass recognition processes.
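To make the first application above concrete, the following is a minimal sketch of a local distance computation that walks the packed parameters in strictly increasing memory order, the access pattern the cache rewards. It is an assumption-laden simplification: single one-dimensional Gaussians stand in for full mixtures, and the names `local_distances`, `packed_means`, and `packed_vars` are invented for illustration.

```python
import math

def local_distances(frames, packed_means, packed_vars):
    """Negative log-likelihood of each frame under each packed Gaussian
    state, visiting the packed parameter arrays sequentially (front to
    back), as a Viterbi- or Baum-Welch-type pass would."""
    dists = []
    for x in frames:
        row = []
        for mu, var in zip(packed_means, packed_vars):
            row.append(0.5 * (math.log(2 * math.pi * var)
                              + (x - mu) ** 2 / var))
        dists.append(row)
    return dists

# Two frames scored against two packed states; each frame should score
# best (lowest distance) against the state centered on it.
dists = local_distances([0.0, 1.0],
                        packed_means=[0.0, 1.0],
                        packed_vars=[1.0, 1.0])
```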
In applications such as those listed above, the packed distribution and localized trellis access method provides a speed improvement of at least one order of magnitude over conventional methods. The exact speed improvement factor depends, in part, upon the speed benefit that the high speed cache memory produces over system memory. Thus, processors having faster high-speed cache performance will show even greater speed improvement when the technique is utilized. In this regard, it will be appreciated that the technique exploits the speed benefit of the high-speed cache by (a) localizing the memory access based upon the order that the expansion algorithm or other speech processing algorithm explores the trellis and (b) sorting the memory representation of the Gaussian parameters (or other speech parameters) such that the memory is accessed in increasing order.
In general, the advantages of the technique can be enjoyed in applications where the system has some a priori knowledge of which speech parameters will be needed, and where those parameters are of sufficiently small size to fit in the high-speed cache memory. For a typical very large vocabulary continuous speech recognition application, these conditions are met during training and during passes of recognition that occur after a first pass (e.g., adaptation or re-scoring passes). These conditions are also met for dialog systems where each state is associated with a particular vocabulary, and for text-prompted speaker recognition systems, where each prompted text evokes a particular set of speech parameters. Finally, the algorithm described here can be combined with the other algorithms described earlier in this document to further improve memory access by decreasing the bandwidth used between the main CPU and the system memory.
The foregoing discloses and describes merely exemplary embodiments of the present invention. One skilled in the art will readily recognize from such discussion, and from accompanying drawings and claims, that various changes, modifications, and variations can be made therein without departing from the spirit and scope of the present invention.
Claims
1. A method for improving pattern matching in a speech recognition system having a plurality of acoustic models, comprising:
- (a) receiving continuous speech input;
- (b) generating a sequence of acoustic feature vectors that represent temporal and spectral behavior of the speech input;
- (c) loading a first group of acoustic feature vectors from the sequence of acoustic feature vectors into a memory workspace accessible to a processor;
- (d) loading an acoustic model from the plurality of acoustic models into the memory workspace; and
- (e) determining a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the acoustic model.
2. The method of claim 1 further comprises loading a next acoustic model from the plurality of acoustic models into the memory workspace, and determining a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to said next acoustic model until similarity measures for the first group of acoustic feature vectors are determined in relation to each of the plurality of acoustic models.
3. The method of claim 2 further comprises removing the acoustic model from the memory workspace prior to retrieving the next acoustic model from the plurality of acoustic models.
4. The method of claim 2 further comprises storing the similarity measures for the first group of acoustic feature vectors in an output memory space.
5. The method of claim 2 further comprises updating a search space based on the similarity measures for the first group of acoustic feature vectors; and subsequently performing a searching operation on the search space.
6. The method of claim 2 further comprises loading a second group of acoustic feature vectors from the sequence of acoustic feature vectors into the memory workspace; and determining similarity measures for the second group of acoustic feature vectors in relation to each of the plurality of acoustic models.
7. The method of claim 1 wherein the acoustic model is further defined as a Hidden Markov Model having a plurality of states, such that probability values for transitioning amongst the plurality of states are expressed in terms of Gaussian data.
8. The method of claim 7 wherein the step of determining a similarity measure further comprises performing a Gaussian computation.
9. An architectural arrangement for a speech recognition system having a plurality of acoustic models residing in a data store, comprising:
- an acoustic front-end node receptive of continuous speech input, the acoustic front-end node operable to generate a sequence of acoustic feature vectors that represent temporal and spectral behavior of the speech input;
- a first pattern matching node having a first data processor and a first memory space accessible to the first data processor, the first pattern matching node adapted to receive a first group of acoustic feature vectors from the sequence of acoustic feature vectors into the first memory space, the first pattern matching node further operable to load a first acoustic model in the first memory space from the data store and to determine a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the first acoustic model using the first data processor; and
- a second pattern matching node having a second data processor and a second memory space accessible to the second data processor, the second pattern matching node adapted to receive the first group of acoustic feature vectors into the second memory space, the second pattern matching node further operable to load a second acoustic model in the second memory space from the data store and to determine a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the second acoustic model using the second data processor.
10. A method for improving pattern matching in a speech recognition system having a plurality of acoustic models, comprising:
- receiving continuous speech input;
- generating a sequence of acoustic feature vectors that represent temporal and spectral behavior of the speech input;
- retrieving a first group of acoustic feature vectors from the sequence of acoustic feature vectors into a first memory workspace accessible to a first processor;
- retrieving a first acoustic model from the plurality of acoustic models into the first memory workspace;
- retrieving a first group of acoustic feature vectors from the sequence of acoustic feature vectors into a second memory workspace accessible to a second processor;
- retrieving a second acoustic model from the plurality of acoustic models into the second memory workspace; and
- determining a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the first acoustic model by the first processor contemporaneously with determining a similarity measure for each acoustic feature vector of the first group of acoustic feature vectors in relation to the second acoustic model by the second processor.
11. A method for improving the decoding process in a speech recognition system, comprising:
- generating a search space that is comprised of observed acoustic data, the search space having an active search space;
- partitioning the active search space amongst a plurality of processing nodes; and
- performing a searching operation on the active search space allocated to each processing node, such that searching operations occur concurrently on at least two of the plurality of processing nodes.
12. The method of claim 11 further comprises defining the active search space as a plurality of lexical trees and distributing the plurality of lexical trees amongst the plurality of processing nodes.
13. The method of claim 12 further comprises maintaining link data indicative of links between the lexical trees at each of the plurality of processing nodes and communicating changes in the link data amongst the plurality of processing nodes.
14. The method of claim 11 wherein the step of partitioning the active search space further comprises allocating the active search space amongst the plurality of processing nodes based on available processing power associated with each processing node.
15. A method for improving the decoding process in a speech recognition system, comprising:
- generating a search space that is comprised of observed acoustic data, the search space having an active search space;
- partitioning the active search space amongst a plurality of processing nodes; and
- performing a searching operation on the active search space allocated to each processing node, such that searching operations occur concurrently on at least two of the plurality of processing nodes;
- wherein the step of partitioning the active search space further comprises segmenting the active search space in a manner that minimizes links and allocating the segmented active search space amongst the plurality of processing nodes in proportion to processing power associated with each processing node.
16. The method of claim 11 wherein the step of performing a searching operation on the observed acoustic data further comprises defining the search operation as at least one of a Viterbi search algorithm, a stack decoding algorithm, a multi-pass search algorithm and a forward-backward search algorithm.
17. A distributed architectural arrangement for a speech recognition system, the speech recognition system operable to generate a search space defined by a plurality of lexical trees, comprising:
- a first searching node having a first data processor and a first memory space accessible to the first data processor, the first searching node adapted to receive similarity measures that correlate speech input to a plurality of acoustic models and operable to evaluate a first lexical tree based on the similarity measures;
- a second searching node having a second data processor and a second memory space accessible to the second data processor, the second searching node adapted to receive the similarity measures and operable to evaluate a second lexical tree based on the similarity measures; and
- a communication link interconnecting the first and second searching nodes.
18. The distributed architectural arrangement of claim 17 wherein the plurality of lexical trees are interconnected by one or more links and each of the searching nodes maintains link data indicative of the links amongst the plurality of lexical trees.
19. The distributed architectural arrangement of claim 18 wherein the evaluation of the first lexical tree by the first searching node results in changes to the link data, such that the first searching node is further operable to communicate the changes to the link data across the communication link to the second searching node.
20. A distributed architectural arrangement for a speech recognition system, the speech recognition system operable to generate a search space defined by a plurality of lexical trees, comprising:
- a first searching node having a first data processor and a first memory space accessible to the first data processor, the first searching node adapted to receive similarity measures that correlate speech input to a plurality of acoustic models and operable to evaluate a first lexical tree based on the similarity measures; a second searching node having a second data processor and a second memory space accessible to the second data processor, the second searching node adapted to receive the similarity measures and operable to evaluate a second lexical tree based on the similarity measures; and a communication link interconnecting the first and second searching nodes; wherein the plurality of lexical trees are interconnected by one or more links and each of the searching nodes maintains link data indicative of the links amongst the plurality of lexical trees; and
- wherein the first searching node initiates communication of the changes to the link data prior to completing the evaluation of the first lexical tree.
21. A distributed architectural arrangement for a speech recognition system, the speech recognition system operable to generate a search space defined by a plurality of lexical trees, comprising:
- a first searching node having a first data processor and a first memory space accessible to the first data processor, the first searching node adapted to receive similarity measures that correlate speech input to a plurality of acoustic models and operable to evaluate a first lexical tree based on the similarity measures;
- a second searching node having a second data processor and a second memory space accessible to the second data processor, the second searching node adapted to receive the similarity measures and operable to evaluate a second lexical tree based on the similarity measures;
- a communication link interconnecting the first and second searching nodes; and
- a pattern matching node adapted to receive acoustic feature vector data indicative of the speech input and operable to determine similarity measures for the acoustic feature vector data in relation to the plurality of acoustic models, the pattern matching node further operable to communicate similarity measures over an unreliable second communication link to each of the first searching node and the second searching node.
22. The distributed architectural arrangement of claim 21 wherein at least one of the first searching node and the second searching node is operable to request retransmission of similarity measures from the pattern matching node upon detecting an error in the transmission of the similarity measures from the pattern matching node.
23. The distributed architectural arrangement of claim 22 wherein at least one of the first searching node and the second searching node is operable to recompute similarity measures upon detecting an error in the transmission of the similarity measures from the pattern matching node.
24. A distributed architectural arrangement for a speech recognition system, the speech recognition system operable to generate a search space defined by a plurality of lexical trees, comprising:
- a first searching node having a first data processor and a first memory space accessible to the first data processor, the first searching node adapted to receive similarity measures that correlate speech input to a plurality of acoustic models and operable to evaluate a first lexical tree based on the similarity measures;
- a second searching node having a second data processor and a second memory space accessible to the second data processor, the second searching node adapted to receive the similarity measures and operable to evaluate a second lexical tree based on the similarity measures; and
- a communication link interconnecting the first and second searching nodes;
- wherein the plurality of lexical trees are interconnected by one or more links and each of the searching nodes maintains link data indicative of the links amongst the plurality of lexical trees; and
- wherein at least one of the first searching node and the second searching node is operable to reduce the search space by performing histogram pruning.
25. The distributed architectural arrangement of claim 24 wherein each searching node is operable to compute a histogram associated with its processing and communicate statistics indicative of the histogram to the other searching node.
26. The distributed architectural arrangement of claim 24 wherein the histogram statistics are further defined as a maximum score value, a mean score value and a number of active nodes associated with the searching node.
27. A method for processing speech data utilizing high speed cache memory having an associated cache mechanism for transfer of data from system memory into cache memory, comprising:
- providing a main table of speech data in system memory;
- providing a list that establishes a processing order of a subset of said speech data;
- copying said subset of said speech data into a sub-table that is processed such that entries in said sub-table occupy contiguous memory locations;
- using a speech processing algorithm to operate upon said sub-table; and
- employing the cache mechanism associated with said high speed cache memory to transfer said sub-table into said high speed cache memory, thereby allowing said speech processing algorithm to access said subset of speech data at cache memory access rates.
28. The method of claim 27 wherein said main table stores speech parameters.
29. The method of claim 27 wherein said main table stores Gaussian parameters.
30. The method of claim 27 wherein said list that establishes a processing order is developed from a speech utterance having a temporal sequence.
31. The method of claim 27 wherein said sub-table resides in system memory.
32. The method of claim 27 wherein said copying step is performed such that said entries in said sub-table are sorted in an order defined by said processing order established by said list.
33. The method of claim 27 wherein said speech processing algorithm is a multi-pass process that includes one pass that establishes said list.
34. The method of claim 27 wherein said speech processing algorithm is a multi-pass recognition process.
35. The method of claim 27 wherein said speech processing algorithm is an acoustic model adaptation process.
36. The method of claim 27 wherein said speech processing algorithm is a lattice rescoring process.
37. The method of claim 27 wherein said speech processing algorithm is a speech model training process.
38. The method of claim 27 wherein said speech processing algorithm is a constrained search on a word/phone graph.
39. The method of claim 27 wherein said speech processing algorithm is a constrained search on a focused language model.
40. The method of claim 27 wherein said speech processing algorithm is a Viterbi or Baum-Welch local distance computation.
41. The method of claim 27 wherein said speech processing algorithm is a trellis expansion algorithm.
42. The method of claim 27 wherein said speech processing algorithm is a beam search algorithm.
43. The method of claim 27 wherein said list that establishes a processing order is developed from language constraints from which a temporal order of access can be devised.
44. The method of claim 27 wherein said list that establishes a processing order is developed from search space constraints from which a temporal order of access can be devised.
45. The method of claim 27 wherein said speech processing algorithm is a multi-pass process that includes one pass that outputs language constraints that are used to establish said list.
46. The method of claim 27 wherein said speech processing algorithm is a multi-pass process that includes one pass that outputs search space constraints that are used to establish said list.
Type: Application
Filed: Mar 19, 2003
Publication Date: Jul 21, 2005
Applicant: MATSUSHITA ELECTRIC INDUSTRIAL CO., LTD (Osaka)
Inventors: Patrick Nguyen (Santa Barbara, CA), Luca Rigazio (Santa Barbara, CA)
Application Number: 10/512,354