AUTOMATED TUNING OF SPEECH RECOGNITION PARAMETERS
A method for execution on a server for serving presence information, the method for providing dynamically loaded speech recognition parameters to a speech recognition engine, can be provided. The method can include storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met. The method can further include receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine. The method can further include selecting a first set of speech recognition parameters responsive to executing the at least one rule and providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
1. Field of the Invention
The present invention relates to automatic speech recognition, and more particularly relates to the tuning of speech recognition parameters for automatic speech recognition engines.
2. Description of the Related Art
Speech recognition (or SR) systems translate audio information into text information. An SR system processes incoming speech and uses speech recognition parameters (e.g., grammars and weights) to determine the natural language represented by the speech. In an SR system, speech recognition occurs based on a score describing a phonetic similarity to the natural language options in a set of grammars. A grammar is an available set of natural language options in a particular context. A grammar can represent a set of words or phrases. When speech is recognized as one of the words or phrases in a grammar, the SR system returns the natural language interpretation of the speech.
The SR system computes scores for the options of the grammars for speech. The score of an option is based on two kinds of information: acoustic information and grammatical information. A probabilistic framework for the acoustic information defines the “acoustic score” as the likelihood that a particular option was spoken, given the acoustic properties of an utterance. The grammatical information biases some options in relation to others. In a probabilistic framework, the grammatical information is defined as a probability associated with each option. These probabilities are referred to herein as “grammar weights”, or simply “weights”. The score computed by the SR system for an option, given an utterance, is a combination of the acoustic score and the grammar weight. The SR system chooses the grammar option having the highest score as the natural language interpretation of the speech. Increasing the grammar weight of an option (and thus increasing the score of the option) therefore increases the chance of that option being chosen as the natural language interpretation of a given utterance.
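By way of illustration, the following is a minimal Python sketch of how such a combined score might be computed in the log domain; the grammar options, weights and acoustic likelihoods shown are hypothetical values for illustration, not figures from any particular SR product.

```python
import math

# Hypothetical grammar: each option carries a grammar weight (a prior),
# and the acoustic model supplies a likelihood for the current utterance.
grammar_weights = {
    "boca raton": 0.6,   # biased upward in this deployment
    "boca chica": 0.3,
    "bora bora":  0.1,
}
acoustic_likelihoods = {  # illustrative acoustic scores for one utterance
    "boca raton": 0.20,
    "boca chica": 0.35,
    "bora bora":  0.05,
}

def combined_score(option):
    # Combined score = acoustic score plus grammar weight, in log space:
    # log P(utterance | option) + log P(option).
    return math.log(acoustic_likelihoods[option]) + math.log(grammar_weights[option])

best = max(grammar_weights, key=combined_score)
print(best)  # "boca raton"
```

Note that although "boca chica" has the higher acoustic score in this sketch, the larger grammar weight of "boca raton" flips the decision, which is precisely why raising an option's weight raises its chance of being chosen as the interpretation.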
An application author, i.e., a voice application programmer, defines the grammars for a speech engine. Grammar weights are defined by application authors in the course of the application programming process and are therefore alterable by the application author. The grammar weights of grammars may be determined (either assigned or tuned) according to a specific method to maximize the ability of the SR system to correctly interpret speech. However, because acoustic scores are modeled by the manufacturer of the speech recognition software, the acoustic scores are typically fixed in a particular version of the speech recognition software. This can produce obstacles during maintenance, re-deployment, piloting and other phases of production. For example, if an SR system is originally deployed for recognizing residential addresses and is later deployed for recognizing business addresses, the speech recognition parameters, which were originally hard-coded into the application, must be re-worked or modified to recognize business addresses. This can be time-consuming and costly. It is therefore desirable for an SR system to have easy access to speech recognition parameters so as to allow for customization to different environments independent of applications.
Therefore, a need arises for a more efficient method for providing access to speech recognition parameters to speech recognition systems that are deployed in different environments.
BRIEF SUMMARY OF THE INVENTION

Embodiments of the present invention address deficiencies of the art in respect to automatic speech recognition and provide a novel and non-obvious method, system and computer program product for providing dynamically loaded speech recognition parameters. In one embodiment of the invention, a method for execution on a server for serving presence information, the method for providing dynamically loaded speech recognition parameters to a speech recognition engine, can be provided. The method can include storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met. The method can further include receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine. The method can further include selecting a first set of speech recognition parameters responsive to executing the at least one rule and providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
In another embodiment of the invention, a method for execution on a server for serving presence information, the method for providing dynamically loaded speech recognition parameters to a speech recognition engine, is provided. The method can include storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met. The method can further include storing periodically updated metadata about a plurality of speech recognition engines and selecting a first speech recognition engine based on most recently stored metadata. The method can further include receiving notice that a speech recognition session has been initiated between a user and the first speech recognition engine and executing the at least one rule. The method can further include selecting a first set of speech recognition parameters responsive to executing the at least one rule and providing to the first speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
In yet another embodiment of the invention, a computer system comprising a server for serving presence information, the server for providing dynamically loaded speech recognition parameters to a speech recognition engine, can be provided. The system can include a repository for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met. The system further can include a processor configured for receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine and executing the at least one rule. The processor may further be configured for selecting a first set of speech recognition parameters responsive to executing the at least one rule and providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
Additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The aspects of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. The embodiments illustrated herein are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
Embodiments of the present invention provide a method, system and computer program product for providing dynamically loaded speech recognition parameters. The method can include storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met. The method can further include initiating a speech recognition session between a user and the speech recognition engine and executing the at least one rule. The method can further include selecting a first set of speech recognition parameters responsive to executing the at least one rule and loading the first set of speech recognition parameters for performing speech recognition of the user.
Also connected to network 106 are three sets of speech recognition servers 130, 132, 134. Each type of speech recognition, such as recognizing names as opposed to numbers, necessitates a specific set of grammars and weights. Thus, each set of speech servers handles a particular type of speech recognition. The first set of speech recognition servers 130 comprises a set of servers that provide speech recognition for address capture, wherein an address, whether residential or business, is recognized. The second set of speech recognition servers 132 comprises a set of servers that provide speech recognition for cities. The third set of speech recognition servers 134 comprises a set of servers that provide speech recognition for a date.
The speech recognition servers 130, 132, 134 are configured to be used according to their abilities. Thus, a particular speech recognition server may be used for one turn. A turn is one segment of a speech recognition session. A speech recognition session may comprise various segments wherein each segment is directed towards recognizing a particular type of data. For example, a speech recognition application may be programmed to recognize an address and a city. The aforementioned speech recognition session may be divided into two segments or turns wherein the first turn is serviced by an address speech recognition server (found in group 130) and the second turn is serviced by a city speech recognition server (found in group 132).
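A short Python sketch of this turn-based routing follows; the pool and server names are hypothetical stand-ins for the groups 130, 132 and 134.

```python
# Hypothetical mapping of turn types to the specialized server groups
# described above (130: addresses, 132: cities, 134: dates).
SERVER_POOLS = {
    "address": ["sr-addr-1", "sr-addr-2"],  # group 130
    "city":    ["sr-city-1"],               # group 132
    "date":    ["sr-date-1"],               # group 134
}

def candidates_for_turn(turn_type):
    """Return the servers able to service one turn of a session."""
    return SERVER_POOLS[turn_type]

# A session that captures an address and then a city is two turns,
# each routed to a different group.
for turn_type in ["address", "city"]:
    print(turn_type, "->", candidates_for_turn(turn_type))
```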
Also connected to the network 106 is the presence server 110. The presence server 110 serves presence information, which is a status indicator that conveys the ability and willingness of an entity, such as a user or a server, to communicate or operate normally. Presence information, and related metadata, is provided by each server 130, 132, and 134 to presence server 110. The presence information, and related metadata, is stored in appropriate databases 116, 118 and can be made available for distribution to other entities. Users and servers may publish presence information and related metadata to indicate their current communication and performance status. This published information informs others that wish to contact or interact with an entity of its availability and willingness to communicate and process information.
In an embodiment of the present invention, presence server 110 is a modified commercially available presence server such as the IBM WebSphere Presence Server available from International Business Machines Corp. of Armonk, N.Y. Conventionally, a presence server serves presence information, which is a status indicator that conveys the ability and willingness of a potential communication partner to communicate. A user's client provides presence information via a network connection to a presence server, which stores it in the user's personal availability record and can make it available for distribution to other users to convey the user's availability for communication. The presence server 110 can be a commercially available presence server modified to serve additional information, besides presence information, and to provide additional functions, as described below.
In an embodiment of the present invention, each speech recognition server 130, 132, 134 publishes a variety of data to the presence server 110, including load data, supported grammars, availability, health, supported languages and acoustic model characteristics. Speech recognition servers may also publish performance data to the presence server 110, such as recognition accuracy, grammar usage and the like. The above data published by speech recognition servers may be stored in a recognition engine metadata database 116. A user 102, as well as other users, may publish to the presence server 110 such data as the user's current physical location (e.g., an address), a sphere indicator (e.g., "at home," "in an office," or "driving in a car"), and availability, which indicates whether the user 102 is currently available for a SIP session. The above data published by users may be stored in a user metadata database 120.
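The sort of metadata payload a recognition server might publish can be sketched as follows; the field names and values are assumptions for illustration rather than a schema defined by the presence server.

```python
import json
import time

def build_presence_payload(server_id):
    # Illustrative fields a recognition server might publish; the names
    # are assumptions, not a fixed interface.
    return {
        "server": server_id,
        "timestamp": time.time(),
        "load": 0.42,                       # fraction of capacity in use
        "supported_grammars": ["address_en", "city_en"],
        "available": True,
        "health": "ok",
        "languages": ["en-US"],
        "acoustic_model": "telephony-8khz",
        "recognition_accuracy": 0.91,       # recent performance data
        "grammar_usage": {"address_en": 310, "city_en": 120},
    }

print(json.dumps(build_presence_payload("sr-addr-1"), indent=2))
```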
Stored in the parameters database 118 are speech recognition parameters such as grammars, weights, accuracy settings, threshold values and sensitivity values. Also stored in the parameters database 118 are rules for adjusting the speech recognition parameters. A rule comprises an if-portion including criteria that must be met and a then-portion specifying speech recognition parameters that must be used when the criteria is met. Factors that may be taken into account when determining whether the criteria is met include time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine. For example, if a rule states that recognition accuracy is below 40% and the current recognition accuracy of a recognition engine server is 33%, then the criteria is met. The then-portion of the rule then dictates that a specified set of speech recognition parameters is selected.
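A minimal sketch of such a rule, and of the test its if-portion implies, might look as follows; the parameter names in the then-portion are hypothetical.

```python
# A minimal representation of one rule: an if-portion (criteria) and a
# then-portion (the parameters to load when the criteria are met).
rule = {
    "if": {"recognition_accuracy_below": 0.40},
    "then": {  # hypothetical parameter set
        "grammar": "address_fallback",
        "weights": {"street_number": 1.5},
        "sensitivity": 0.7,
        "threshold": 0.3,
    },
}

def criteria_met(rule, engine_metadata):
    """Test the if-portion against current engine metadata."""
    return engine_metadata["recognition_accuracy"] < rule["if"]["recognition_accuracy_below"]

# 33% current accuracy, as in the example above: the criteria is met,
# so the then-portion's parameters are the ones selected.
if criteria_met(rule, {"recognition_accuracy": 0.33}):
    print(rule["then"])
```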
Also connected to network 106 are web interface 112 and administrative terminal 114. These interfaces are used to prompt an administrator for input in response to a situation, such as low recognition accuracy. In this process, the administrator provides commands to the system, as described in greater detail below.
In an optional step after step 204, the IVR 106 gathers metadata about the user 102. The gathered metadata may include the user's current physical location (e.g., an address), a sphere indicator (e.g., "at home," "in an office," or "driving in a car"), and availability, which indicates whether the user 102 is currently available for a SIP session. The user metadata may be gathered from a separate entity such as a location server. In a second optional step, the gathered metadata is stored by the presence server 110 in the user metadata database 120.
In step 206, the IVR 106 routes the original invite to the load balancer 108. In step 208, the load balancer 108 queries, via the presence server 110, the recognition engine metadata database 116 for the most recent metadata about the recognition engine servers 130, 132 and 134. In step 210, the load balancer 108 receives the metadata about the recognition engine servers 130, 132 and 134 from the recognition engine metadata database 116.
In step 212, the load balancer 108 selects a recognition engine server within the servers 130, 132 and 134 based on the received metadata. The load balancer 108 may take a variety of factors into account when making the selection of step 212. The load balancer 108 takes into account the grammars and languages supported by each recognition engine server within the servers 130, 132 and 134. For example, if the IVR 106 is currently capturing addresses in English, only those recognition engine servers servicing address capture in English are considered. The load balancer 108 also takes into account load data, availability data and health data for each recognition engine server so as to determine which servers currently have enough bandwidth to service the user 102 at the highest capacity. The load balancer 108 also takes into account acoustic model characteristics so as to determine which server uses the appropriate model to service the speech recognition type of the current turn.
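The selection logic of step 212 might be sketched as follows, assuming the metadata fields published earlier; the filtering and tie-breaking shown are one plausible policy for illustration, not the only possible one.

```python
def select_engine(engines, grammar, language):
    """Pick a recognition engine from the most recent presence metadata.

    Hard constraints first (grammar, language, availability, health),
    then prefer the least-loaded candidate. The field names follow the
    metadata sketched earlier; the policy itself is illustrative.
    """
    candidates = [
        e for e in engines
        if grammar in e["supported_grammars"]
        and language in e["languages"]
        and e["available"]
        and e["health"] == "ok"
    ]
    return min(candidates, key=lambda e: e["load"]) if candidates else None

engines = [
    {"server": "sr-addr-1", "supported_grammars": ["address_en"],
     "languages": ["en-US"], "available": True, "health": "ok", "load": 0.8},
    {"server": "sr-addr-2", "supported_grammars": ["address_en"],
     "languages": ["en-US"], "available": True, "health": "ok", "load": 0.3},
]
print(select_engine(engines, "address_en", "en-US")["server"])  # sr-addr-2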
In step 214, the load balancer 108 routes the original invite to the selected recognition engine server, in this case recognition engine server 140. In step 216, recognition engine server 140 receives the original invite from the device 104 and initiates a SIP connection with the device 104. In step 218, the recognition engine server 140 queries the presence server 110 for the appropriate speech recognition parameters. In step 219, the presence server 110 executes the rules in parameter database 118 to determine the appropriate speech recognition parameters for loading into the recognition engine server 140. The process of executing a rule is described in greater detail below.
As described earlier, a rule comprises an if-portion including criteria that must be met and a then-portion specifying speech recognition parameters that must be used when the criteria is met. Step 219 involves reading metadata from the parameters database 118, wherein the metadata includes at least one value for at least one of time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine. Next, it is determined whether the metadata meets the criteria of the rule. For example, if the rule states a time of day between 9 am and 5 pm and the current time of day is 1 pm, then the criteria is met. In another example, if the rule states that recognition accuracy is below 40% and the current recognition accuracy of the recognition engine server 140 is 33%, then the criteria is met. Finally, assuming the criteria of the if-portion of the rule is met, the then-portion of the rule dictates that a specified set of speech recognition parameters is selected.
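Putting the two examples together, the rule execution of step 219 might be sketched as a loop that returns the then-portion of the first rule whose criteria are met; the criteria keys used here are assumptions standing in for the time-of-day and accuracy criteria named in the text.

```python
def execute_rules(rules, metadata):
    """Return the then-portion of the first rule whose criteria are met."""
    for rule in rules:
        crit = rule["if"]
        ok = True
        if "hour_between" in crit:
            lo, hi = crit["hour_between"]
            ok = ok and lo <= metadata["hour"] <= hi
        if "accuracy_below" in crit:
            ok = ok and metadata["recognition_accuracy"] < crit["accuracy_below"]
        if ok:
            return rule["then"]
    return None  # no rule fired; keep the current parameters

rules = [{"if": {"hour_between": (9, 17), "accuracy_below": 0.40},
          "then": {"grammar": "address_strict", "sensitivity": 0.6}}]

# 1 pm and 33% accuracy: both example criteria from the text are met.
print(execute_rules(rules, {"hour": 13, "recognition_accuracy": 0.33}))
```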
In an optional step after step 219, the presence server 110 takes additional user metadata, from database 120, into account when selecting speech recognition parameters as in step 219. For example, if the user metadata in database 120 indicates that the user 102 is driving during the SIP session, then appropriate speech recognition parameters that optimize recognition during driving are selected.
In step 220, the presence server 110 retrieves the selected speech recognition parameters from the parameter database 118. In step 222, the presence server 110 sends the retrieved speech recognition parameters to the recognition engine server 140. In step 224, recognition engine server 140 receives and loads the speech recognition parameters. In step 226, the current turn is executed and, in step 228, the control flow ends.
In step 304, the presence server 110 executes the rules in parameter database 118 to determine the appropriate speech recognition parameters for loading into the recognition engine server 140. The process of executing a rule is described in greater detail above. In this example, a rule is executed wherein a grammar weight is changed due to the low recognition accuracy.
In an optional step after step 304, an administrator, connected via web interface 112 or administrative terminal 114, is prompted for input in response to the low recognition accuracy. In this alternative, the administrator provides commands to the system.
In step 306, recognition engine server 140 sends a notification, via the presence server 110, to all other recognition engine servers 130, 132 and 134. The notification may be a standard text message sent via TCP/IP or SIP NOTIFY events. The notification states that new speech recognition parameters are available and shall be loaded at the next turn. In step 308, the next turn is initiated.
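The TCP/IP variant of the step 306 notification might be sketched as follows; the peer addresses, port and message format are hypothetical, and the SIP NOTIFY variant would use a SIP stack instead of a raw socket.

```python
import socket

# Hypothetical peer addresses standing in for servers 130, 132 and 134.
PEERS = [("sr-city-1.example.com", 5070), ("sr-date-1.example.com", 5070)]

MESSAGE = b"PARAMS-UPDATED: new speech recognition parameters; load at next turn\n"

def notify_peers():
    """Fan a plain-text notification out over TCP to each peer."""
    for host, port in PEERS:
        try:
            with socket.create_connection((host, port), timeout=2) as sock:
                sock.sendall(MESSAGE)
        except OSError as err:
            # The peers in this sketch do not exist; a real deployment
            # would retry or surface the failure via the presence server.
            print(f"could not notify {host}: {err}")

notify_peers()
```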
In step 310, the recognition engine server 140 queries the presence server 110 for the appropriate speech recognition parameters. In step 312, the presence server 110 executes the rules in parameter database 118 to determine the appropriate speech recognition parameters for loading into the recognition engine server 140. The process of executing a rule is described in greater detail above.
In step 314, the presence server 110 retrieves the selected speech recognition parameters from the parameter database 118. In step 316, the presence server 110 sends the retrieved speech recognition parameters to the recognition engine server 140. In step 318, recognition engine server 140 receives and loads the speech recognition parameters. In step 320, the current turn is executed and, in step 322, the control flow ends.
Embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, and the like. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Claims
1. A method for execution on a server for serving presence information, the method for providing dynamically loaded speech recognition parameters to a speech recognition engine, comprising:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met;
- receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine;
- executing the at least one rule;
- selecting a first set of speech recognition parameters responsive to executing the at least one rule; and
- providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
2. The method of claim 1, wherein the step of storing at least one rule comprises:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein criteria includes at least one value for at least one of time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine.
3. The method of claim 2, wherein the step of storing at least one rule further comprises:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein speech recognition parameters include any one of grammars, weights, accuracy settings, thresholds, and sensitivity.
4. The method of claim 3, wherein the step of executing the at least one rule comprises:
- reading metadata including at least one value for at least one of time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine; and
- determining that the metadata meets criteria of the at least one rule.
5. The method of claim 4, wherein the step of selecting a first set of speech recognition parameters comprises:
- selecting a first set of speech recognition parameters identical to the speech recognition parameters specified by the then-portion of the at least one rule.
6. A method for execution on a server for serving presence information, the method for providing dynamically loaded speech recognition parameters to a speech recognition engine, comprising:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met;
- storing periodically updated metadata about a plurality of speech recognition engines;
- selecting a first speech recognition engine based on most recently stored metadata;
- receiving notice that a speech recognition session has been initiated between a user and the first speech recognition engine;
- executing the at least one rule;
- selecting a first set of speech recognition parameters responsive to executing the at least one rule; and
- providing to the first speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
7. The method of claim 6, wherein the step of storing at least one rule comprises:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein criteria includes at least one value for at least one of time of day, recognition accuracy of a speech recognition engine, and grammar usage of a speech recognition engine.
8. The method of claim 7, wherein the step of storing at least one rule further comprises:
- storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein speech recognition parameters include any one of grammars, weights, accuracy settings, thresholds, and sensitivity.
9. The method of claim 8, wherein the step of storing periodically updated metadata comprises:
- storing periodically updated metadata about a plurality of speech recognition engines, including any one of load data, supported grammars, availability, health, supported languages and acoustic model characteristics.
10. The method of claim 9, wherein the step of executing the at least one rule comprises:
- reading metadata including at least one value for at least one of time of day, recognition accuracy of the first speech recognition engine, and grammar usage of the first speech recognition engine; and
- determining that the metadata meets criteria of the at least one rule.
11. The method of claim 10, wherein the step of selecting a first set of speech recognition parameters comprises:
- selecting a first set of speech recognition parameters identical to the speech recognition parameters specified by the then-portion of the at least one rule.
12. A computer system comprising a server for serving presence information, the server for providing dynamically loaded speech recognition parameters to a speech recognition engine, comprising:
- a repository for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met; and
- a processor configured for: receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine; executing the at least one rule; selecting a first set of speech recognition parameters responsive to executing the at least one rule; and providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
13. The computer system of claim 12, wherein criteria includes at least one value for at least one of time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine.
14. The computer system of claim 13, wherein speech recognition parameters include any one of grammars, weights, accuracy settings, thresholds, and sensitivity.
15. The computer system of claim 12, further comprising:
- a load balancing server for distributing speech recognition sessions among a plurality of speech recognition engines based on availability of the speech recognition engines.
16. A computer program product comprising a computer usable medium on a server for serving presence information, the computer usable medium embodying computer usable program code for providing dynamically loaded speech recognition parameters to a speech recognition engine, comprising:
- computer usable program code for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met;
- computer usable program code for receiving notice that a speech recognition session has been initiated between a user and the speech recognition engine;
- computer usable program code for executing the at least one rule;
- computer usable program code for selecting a first set of speech recognition parameters responsive to executing the at least one rule; and
- computer usable program code for providing to the speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
17. The computer program product of claim 16, wherein the computer usable program code for storing at least one rule comprises:
- computer usable program code for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein criteria includes at least one value for at least one of time of day, recognition accuracy of the speech recognition engine, and grammar usage of the speech recognition engine.
18. The computer program product of claim 17, wherein the computer usable program code for storing at least one rule further comprises:
- computer usable program code for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met, and wherein speech recognition parameters include any one of grammars, weights, accuracy settings, thresholds, and sensitivity.
19. A computer program product comprising a computer usable medium on a server for serving presence information, the computer usable medium embodying computer usable program code for providing dynamically loaded speech recognition parameters to a speech recognition engine, comprising:
- computer usable program code for storing at least one rule for selecting speech recognition parameters, wherein a rule comprises an if-portion including criteria and a then-portion specifying speech recognition parameters that must be used when the criteria is met;
- computer usable program code for storing periodically updated metadata about a plurality of speech recognition engines;
- computer usable program code for selecting a first speech recognition engine based on most recently stored metadata;
- computer usable program code for receiving notice that a speech recognition session has been initiated between a user and the first speech recognition engine;
- computer usable program code for executing the at least one rule;
- computer usable program code for selecting a first set of speech recognition parameters responsive to executing the at least one rule; and
- computer usable program code for providing to the first speech recognition engine the first set of speech recognition parameters for performing speech recognition of the user.
Type: Application
Filed: Oct 18, 2007
Publication Date: Apr 23, 2009
Patent Grant number: 9129599
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventors: Girish Dhanakshirur (Delray Beach, FL), Baiju D. Mandalia (Boca Raton, FL), Wendi L. Nusbickel (Boca Raton, FL)
Application Number: 11/874,230
International Classification: G10L 21/00 (20060101);