User dedicated automatic speech recognition
A multi-mode voice controlled user interface is described. The user interface is adapted to conduct a speech dialog with one or more possible speakers and includes a broad listening mode which accepts speech inputs from the possible speakers without spatial filtering, and a selective listening mode which limits speech inputs to a specific speaker using spatial filtering. The user interface switches listening modes in response to one or more switching cues.
The present application is a continuation of U.S. patent application Ser. No. 14/382,839, entitled: “U
The present invention relates to user interfaces for computer systems, and more specifically to a user dedicated, multi-mode, voice controlled interface using automatic speech recognition.
BACKGROUND ART
In voice controlled devices, automatic speech recognition (ASR) typically is triggered using a push-to-talk (PTT) button. Pushing the PTT button makes the system respond to any spoken word inputs regardless of who uttered the speech. In distant talking applications such as voice controlled televisions or computer gaming consoles, the PTT button may be replaced by an activation word command. In addition, more than one user may want to use voice control.
ASR systems typically are equipped with a signal preprocessor to cope with interference and noise. Often multiple microphones are used, particularly for distant talking interfaces where the speech enhancement algorithm is spatially steered towards the assumed direction of the speaker (beamforming). Consequently, interferences from other directions will be suppressed. This improves the ASR performance for the desired speaker, but decreases the ASR performance for others. Thus the ASR performance depends on the spatial position of the speaker relative to the microphone array and on the steering direction of the beamforming algorithm.
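As a rough illustration of the spatial steering described above, the following Python sketch computes steering delays for a uniform linear microphone array and applies a basic delay-and-sum beamformer. Delay-and-sum is one common beamforming technique and is used here only as an example; the text does not say which algorithm is used. The function names, the linear array geometry, and the integer sample delays are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg):
    """Relative arrival delays in seconds for a uniform linear array.

    angle_deg is measured from the array axis (90 degrees is broadside).
    Hypothetical helper; the array geometry is an assumption, not from the text.
    """
    theta = math.radians(angle_deg)
    return [i * spacing_m * math.cos(theta) / SPEED_OF_SOUND_M_S
            for i in range(num_mics)]


def delay_and_sum(channels, delays_samples):
    """Align each channel by its non-negative integer delay, then average."""
    if len(channels) != len(delays_samples):
        raise ValueError("need one delay per channel")
    if any(d < 0 for d in delays_samples):
        raise ValueError("delays must be non-negative")
    length = min(len(ch) - d for ch, d in zip(channels, delays_samples))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays_samples)) / len(channels)
        for n in range(length)
    ]


# A source whose sound reaches microphone 1 one sample later than microphone 0.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic0 = source
mic1 = [0.0] + source[:-1]

steered = delay_and_sum([mic0, mic1], [0, 1])    # beam aimed at the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # beam aimed elsewhere
```

With the delays matched to the source direction, the two channels add coherently and the source is recovered at full amplitude; with mismatched delays the same source is attenuated. This is the effect that improves recognition for the desired speaker and degrades it for speakers in other directions.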
SUMMARY
Embodiments of the present invention are directed to a multi-mode voice controlled user interface for an automatic speech recognition (ASR) system employing at least one hardware implemented computer processor, and corresponding methods of using such an interface. The user interface is adapted to conduct a speech dialog with one or more possible speakers and includes a broad listening mode which accepts speech inputs from the possible speakers without spatial filtering, and a selective listening mode which limits speech inputs to a specific speaker using spatial filtering. The user interface switches listening modes in response to one or more switching cues.
The broad listening mode may use an associated broad mode recognition vocabulary and the selective listening mode may use a different associated selective mode recognition vocabulary. The switching cues may include one or more mode switching words from the speech inputs, one or more dialog states in the speech dialog, and/or one or more visual cues from the possible speakers. The selective listening mode may use acoustic speaker localization and/or image processing for the spatial filtering.
The user interface may operate in selective listening mode simultaneously in parallel for each of a plurality of selected speakers. In addition or alternatively, the interface may be adapted to operate in both listening modes in parallel, whereby the interface accepts speech inputs from any user in the room in the broad listening mode, and at the same time accepts speech inputs from only one selected speaker in the selective listening mode.
Embodiments of the present invention also include a device for automatic speech recognition (ASR) that includes a voice controlled user interface employing at least one hardware implemented computer processor. The user interface is adapted to conduct a speech dialog with one or more possible speakers. A user selection module is in communication with the user interface for limiting the user interface using spatial filtering based on image processing of the possible speakers so as to respond to speech inputs from only one specific speaker.
The spatial filtering may be further based on selective beamforming of multiple microphones. The user interface may be further adapted to provide visual feedback to indicate a direction of the specific speaker and/or the identity of the specific speaker. The image processing may include performing gesture recognition of visual images of the possible speakers and/or facial recognition of visual images of the faces of the possible speakers.
Embodiments of the present invention are directed towards user dedicated ASR which limits the voice control functionality to one selected user rather than to any user who happens to be in the vicinity. This may be based, for example, on a user speaking a special activation word that invokes the user limiting functionality. The system may then remain dedicated to the designated user until a specific dialog ends or some other mode switching event occurs. While operating in user dedicated mode, the system does not respond to any spoken inputs from other users (interfering speakers).
More particularly, embodiments of the present invention include a user-dedicated, multi-mode, voice-controlled interface using automatic speech recognition with two different kinds of listening modes: (1) a broad listening mode that responds to speech inputs from any user in any direction, and (2) a selective listening mode that limits speech inputs to a specific speaker in a specific location. The interface system can switch modes based on different switching cues: dialog-state, certain activation words, or visual gestures. The different listening modes may also use different recognition vocabularies, for example, a limited vocabulary in broad listening mode and a larger recognition vocabulary in selective listening mode. To limit the speech inputs to a specific speaker, the system may use acoustic speaker localization and/or video processing means to determine speaker position.
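The mode switching described above can be pictured as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class name, the example phrases, and the vocabularies are hypothetical, and the `speaker_id` argument stands in for the result of spatial filtering, which a real system would obtain from acoustic localization and/or video processing rather than from a label.

```python
from enum import Enum


class ListeningMode(Enum):
    BROAD = "broad"
    SELECTIVE = "selective"


# Hypothetical vocabularies: limited in broad mode, larger in selective mode.
BROAD_VOCABULARY = {"hello tv"}
SELECTIVE_VOCABULARY = {"volume up", "volume down", "next channel", "goodbye"}
ACTIVATION_PHRASE = "hello tv"    # assumed switching cue into selective mode
END_OF_DIALOG_PHRASE = "goodbye"  # assumed cue that ends the dialog


class MultiModeInterface:
    def __init__(self):
        self.mode = ListeningMode.BROAD
        self.selected_speaker = None

    def handle(self, speaker_id, phrase):
        """Return the accepted phrase, or None if the input is ignored."""
        if self.mode is ListeningMode.BROAD:
            # Broad mode: any speaker, but only the limited vocabulary.
            if phrase not in BROAD_VOCABULARY:
                return None
            if phrase == ACTIVATION_PHRASE:
                self.mode = ListeningMode.SELECTIVE
                self.selected_speaker = speaker_id
            return phrase
        # Selective mode: only the selected speaker, with the larger vocabulary.
        if speaker_id != self.selected_speaker:
            return None
        if phrase not in SELECTIVE_VOCABULARY:
            return None
        if phrase == END_OF_DIALOG_PHRASE:
            self.mode = ListeningMode.BROAD
            self.selected_speaker = None
        return phrase


ui = MultiModeInterface()
ignored_command = ui.handle("anna", "volume up")  # not in the broad vocabulary
activation = ui.handle("anna", "hello tv")        # switches to selective mode
interferer = ui.handle("ben", "volume up")        # other speaker is ignored
command = ui.handle("anna", "volume up")          # selected speaker is accepted
```

In this sketch the dialog state is reduced to a single end-of-dialog phrase; a visual gesture or a richer dialog manager could drive the same transitions.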
Embodiments of the present invention also include an arrangement for automatic speech recognition (ASR) that is dedicated to a specific user and does not respond to any other user. Potential users are detected by means of image processing using images from one or more cameras. Image processing may rely on detection of one or more user cues to determine and select the dedicated user, for example, gesture recognition, facial recognition, etc. Based on the results of such user selection, the steering direction of the acoustic spatial filter can be controlled, continuing to rely on ongoing visual information. User feedback (via a GUI) can be given to identify the direction and/or identity of the selected dedicated user, for example, to indicate the spatial steering direction of the system.
The spatial filtering of a specific speaker performed in selective listening mode may be based on a combination of content information together with acoustic information, as shown in
As shown in
Depending on the listening mode, different acoustic models may be used in the ASR engine or even different ASR engines may be used. Either way, the ASR grammar needs to be switched when switching listening modes. For some number of multiple users M, there may either be N=M beams, N<M beams or N=1 beam used by the interface.
It may be useful for the interface to communicate to the specific speaker when the device is in selective listening mode and listening only to that speaker. There are several different ways in which this can be done. For example, a visual display may show a schematic image of the room scene with user highlighting to identify the location of the selected specific speaker. Or more simply, a light bar display can be intensity coded to indicate the spatial direction of the selected specific speaker. Or an avatar may be used to deliver listening mode feedback as part of a dialog with the user(s).
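One way to picture the intensity-coded light bar is a row of LEDs whose brightness peaks at the LED nearest the steering direction. The Python sketch below is a hypothetical illustration: the function name, the linear mapping from angle to LED position, and the triangular brightness falloff are all assumptions, not details taken from the text.

```python
def light_bar_intensities(num_leds, speaker_angle_deg,
                          field_of_view_deg=180.0, spread_leds=1.5):
    """Brightness in [0, 1] per LED, peaking nearest the speaker direction.

    Angle 0 maps to the first LED and field_of_view_deg to the last one.
    The triangular falloff over spread_leds LEDs is an illustrative choice.
    """
    if num_leds < 1:
        raise ValueError("need at least one LED")
    clamped = min(max(speaker_angle_deg, 0.0), field_of_view_deg)
    centre = clamped / field_of_view_deg * (num_leds - 1)
    return [max(0.0, 1.0 - abs(i - centre) / spread_leds)
            for i in range(num_leds)]


# Speaker straight ahead of a five-LED bar: the middle LED is brightest.
levels = light_bar_intensities(5, 90.0)
```

Neighbouring LEDs glow more dimly, so the bar reads as a soft pointer toward the selected speaker rather than a single lit dot.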
For example, one useful application of the foregoing would be in the specific context of controlling a television or gaming console based on user dedicated ASR with broad and selective listening modes where potential users and their spatial positions are detected by means of one or more cameras. Initially, the interface system is in broad listening mode and potential user information is provided to a spatial voice activity detection process that checks speaker positions for voice activity. When the broad listening mode detects the mode switching cue, e.g. the activation word, the spatial voice activity detection process provides information about who provided that switching cue. The interface system then switches to selective listening mode by spatial filtering (beamforming and/or blind source separation) and dedicates/limits the ASR to that user. User feedback is also provided over a GUI as to listening direction, and from then on the spatial position of the dedicated user is followed by the one or more cameras. A mode transition back to broad listening mode may depend on dialog state or another switching cue.
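The hand-off from cue detection to user tracking in this scenario can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `DetectedUser` record, the function names, the angular tolerance, and the use of a single acoustic direction estimate for the cue are all hypothetical. The text only states that camera-detected positions are checked for voice activity and that the dedicated user is then followed by the cameras.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectedUser:
    user_id: str
    angle_deg: float  # horizontal position estimated from camera images


def attribute_cue(users: List[DetectedUser], cue_direction_deg: float,
                  tolerance_deg: float = 15.0) -> Optional[DetectedUser]:
    """Pick the camera-detected user closest to where the cue came from."""
    if not users:
        return None
    nearest = min(users, key=lambda u: abs(u.angle_deg - cue_direction_deg))
    if abs(nearest.angle_deg - cue_direction_deg) > tolerance_deg:
        return None  # no detected user plausibly produced the cue
    return nearest


def follow_user(selected_id: str, users: List[DetectedUser]) -> Optional[float]:
    """Updated steering angle for the dedicated user, or None if not seen."""
    for user in users:
        if user.user_id == selected_id:
            return user.angle_deg
    return None


# Activation word heard from roughly 112 degrees while two users are visible.
frame1 = [DetectedUser("left", 40.0), DetectedUser("right", 120.0)]
selected = attribute_cue(frame1, 112.0)

# The dedicated user moves; the steering direction follows the camera track.
frame2 = [DetectedUser("left", 42.0), DetectedUser("right", 105.0)]
steering = follow_user(selected.user_id, frame2)

# The dedicated user leaves the camera view; the position is no longer known.
frame3 = [DetectedUser("left", 42.0)]
lost = follow_user(selected.user_id, frame3)
```

When `follow_user` returns `None`, a system along these lines could treat the lost position as a cue to return to broad listening mode.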
Embodiments of the invention may be implemented in whole or in part in any conventional computer programming language such as VHDL, SystemC, Verilog, ASM, etc. Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
Embodiments can be implemented in whole or in part as a computer program product for use with a computer system. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention.
Claims
1. A device for automatic speech recognition (ASR) comprising:
- a multi-mode voice-controlled user interface employing at least one hardware implemented computer processor, wherein the user interface is adapted to conduct a speech dialog with one or more possible speakers and includes: a broad listening mode which accepts speech inputs from the possible speakers without spatial filtering and has an associated limited broad mode recognition vocabulary; and a selective listening mode which limits speech inputs to a specific speaker using spatial filtering and has an associated selective mode recognition vocabulary that is larger than the limited broad mode recognition vocabulary,
wherein the user interface is adapted to:
- switch from the broad listening mode to the selective listening mode in response to one or more switching cues,
- in the selective listening mode, engage the specific speaker in a dialog using the selective mode recognition vocabulary, and
- the user interface is adapted to remain in the selective listening mode so long as a location of the specific speaker is known.
2. A device according to claim 1, wherein the switching cues include one or more mode switching words from the speech inputs.
3. A device according to claim 1, wherein the switching cues include one or more dialog states in the speech dialog.
4. A device according to claim 1, wherein the switching cues include one or more visual cues from the possible speakers.
5. A device according to claim 1, wherein the selective listening mode uses acoustic speaker localization for the spatial filtering.
6. A device according to claim 1, wherein the selective listening mode uses image processing for the spatial filtering.
7. A device according to claim 1, wherein the user interface operates in the selective listening mode simultaneously in parallel for each of a plurality of selected speakers, so that each of the plurality of selected speakers has its own selective listening mode and dialog with the user interface.
8. A device according to claim 1, wherein the user interface is adapted to operate in both listening modes in parallel, whereby the user interface accepts speech inputs in the broad listening mode, and at the same time accepts speech inputs from at least one selected speaker in at least one selective listening mode.
9. The device according to claim 1, wherein the user interface is adapted to switch from the selective listening mode to the broad listening mode in response to either an end of the dialog or an activation word.
10. A computer program product encoded in a non-transitory computer-readable medium for operating an automatic speech recognition (ASR) system, the product comprising:
- program code executable to conduct a speech dialog with one or more possible speakers via a multi-mode voice-controlled user interface adapted to: accept speech inputs from the possible speakers in a broad listening mode without spatial filtering, the broad listening mode having an associated limited broad mode recognition vocabulary; and limit speech inputs to a specific speaker in a selective listening mode using spatial filtering, the selective listening mode having an associated selective mode recognition vocabulary that is larger than the limited broad mode recognition vocabulary,
- wherein the program code is executable to cause the user interface to: switch from the broad listening mode to the selective listening mode in response to one or more switching cues, in the selective listening mode, engage the specific speaker in a dialog using the selective mode recognition vocabulary, and the program code is executable to cause the user interface to remain in the selective listening mode so long as a location of the specific speaker is known.
11. The computer program product of claim 10, wherein the program code is executable to switch from the selective listening mode to the broad listening mode in response to either an end of the dialog or an activation word.
12. A method for automatic speech recognition (ASR) comprising:
- employing a multi-mode voice-controlled user interface having a computer processor to conduct a speech dialog with one or more possible speakers by: employing a broad listening mode which accepts speech inputs from the possible speakers without spatial filtering and has an associated limited broad mode recognition vocabulary; and employing a selective listening mode which limits speech inputs to a specific speaker using spatial filtering and has an associated selective mode recognition vocabulary that is larger than the limited broad mode recognition vocabulary,
- the user interface: switching from the broad listening mode to the selective listening mode in response to one or more switching cues, in the selective listening mode, engaging the specific speaker in a dialog using the selective mode recognition vocabulary, and remaining in the selective listening mode so long as a location of the specific speaker is known.
13. The method according to claim 12, wherein the switching cues include one or more mode switching words from the speech inputs.
14. The method according to claim 12, wherein the switching cues include one or more dialog states in the speech dialog.
15. The method according to claim 12, wherein the switching cues include one or more visual cues from the possible speakers.
16. The method according to claim 12, wherein the selective listening mode includes using acoustic speaker localization for the spatial filtering.
17. The method according to claim 12, wherein the selective listening mode includes using image processing for the spatial filtering.
18. The method according to claim 12, wherein the user interface operates in selective listening mode simultaneously in parallel for each of a plurality of selected speakers, so that each of the plurality of selected speakers has its own selective listening mode and dialog with the user interface.
19. The method according to claim 12, wherein the user interface operates in both listening modes in parallel, such that the user interface accepts speech inputs in the broad listening mode, and at the same time accepts speech inputs from at least one selected speaker in at least one selective listening mode.
20. The method according to claim 12, including the user interface switching from the selective listening mode to the broad listening mode in response to either an end of the dialog or an activation word.
Type: Grant
Filed: Jan 22, 2018
Date of Patent: Sep 29, 2020
Patent Publication Number: 20180158461
Assignee: NUANCE COMMUNICATIONS, INC. (Burlington, MA)
Inventors: Tobias Wolff (Neu Ulm), Markus Buck (Biberach), Tim Haulick (Blaubeuren), Suhadi (Stuttgart)
Primary Examiner: Jonathan C Kim
Application Number: 15/876,545
International Classification: G10L 15/22 (20060101); G10L 15/28 (20130101); G06F 3/16 (20060101); G10L 25/51 (20130101); G10L 15/183 (20130101); G10L 21/0216 (20130101);