MULTI-MODAL INTERACTION WITH INTELLIGENT ASSISTANTS IN VOICE COMMAND DEVICES
A method comprising detecting an activation of an intelligent assistant on an electronic device, waking up the intelligent assistant from a sleep mode in response to the activation, and determining an amount of vocabulary the intelligent assistant acts upon during a listening mode based on a type of the activation.
This application is a continuation application of U.S. application Ser. No. 17/950,668, filed Sep. 22, 2022, which is a continuation of U.S. application Ser. No. 16/367,063, filed Mar. 27, 2019, now U.S. Pat. No. 11,482,215, issued Oct. 25, 2022, the disclosures of which are incorporated herein in their entireties.
TECHNICAL FIELD
One or more embodiments generally relate to voice command devices, in particular, a method and system for emergent multi-modal interaction with intelligent assistants in voice command devices.
BACKGROUND
A voice user interface provides a way of interacting with an intelligent personal assistant (IPA) or a virtual assistant (VA) operating on a voice command device. Colloquially, an IPA or a VA is called an “intelligent assistant”. A voice command device is a device controllable with a voice user interface (i.e., a voice controlled device). A voice user interface makes spoken human interaction with devices possible, using speech recognition to understand voice commands (i.e., spoken commands) and questions, and text-to-speech to output replies. With advances in automatic speech recognition (ASR) and natural language understanding (NLU), voice user interfaces are becoming increasingly popular in devices such as, but not limited to, automobiles, mobile devices (e.g., smartphones, tablets, watches, etc.), appliances (e.g., washers, dryers, etc.), entertainment devices (e.g., televisions, smart speakers, etc.), etc.
SUMMARY
One embodiment provides a method comprising detecting an activation of an intelligent assistant on an electronic device, waking up the intelligent assistant from a sleep mode in response to the activation, and determining an amount of vocabulary the intelligent assistant acts upon during a listening mode based on a type of the activation.
Another embodiment provides an electronic device comprising at least one sensor unit, at least one input/output (I/O) unit, at least one processor, and a non-transitory processor-readable memory device storing instructions that when executed by the at least one processor causes the at least one processor to perform operations. The operations include detecting a physical interaction with the electronic device via the at least one sensor unit, and activating an intelligent assistant on the electronic device from a sleep mode in response to the physical interaction.
One embodiment provides a non-transitory processor-readable medium that includes a program that when executed by a processor performs a method comprising detecting an activation of an intelligent assistant on an electronic device, waking up the intelligent assistant from a sleep mode in response to the activation, and determining an amount of vocabulary the intelligent assistant acts upon during a listening mode based on a type of the activation.
These and other aspects and advantages of one or more embodiments will become apparent from the following detailed description, which, when taken in conjunction with the drawings, illustrate by way of example the principles of the one or more embodiments.
For a fuller understanding of the nature and advantages of the embodiments, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings.
The following description is made for the purpose of illustrating the general principles of one or more embodiments and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations. Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
One or more embodiments generally relate to voice command devices, in particular, a method and system for emergent multi-modal interaction with intelligent assistants in voice command devices. One embodiment provides a method comprising detecting an activation of an intelligent assistant on an electronic device, waking up the intelligent assistant from a sleep mode in response to the activation, and determining an amount of vocabulary the intelligent assistant acts upon during a listening mode based on a type of the activation.
Another embodiment provides an electronic device comprising at least one sensor unit, at least one input/output (I/O) unit, at least one processor, and a non-transitory processor-readable memory device storing instructions that when executed by the at least one processor causes the at least one processor to perform operations. The operations include detecting a physical interaction with the electronic device via the at least one sensor unit, and activating an intelligent assistant on the electronic device from a sleep mode in response to the physical interaction.
One embodiment provides a non-transitory processor-readable medium that includes a program that when executed by a processor performs a method comprising detecting an activation of an intelligent assistant on an electronic device, waking up the intelligent assistant from a sleep mode in response to the activation, and determining an amount of vocabulary the intelligent assistant acts upon during a listening mode based on a type of the activation.
Speech is generally present in an environment that includes one or more voice command devices, yet only a small portion of that speech is meant for any particular voice command device. One of the challenges a voice command device therefore encounters is determining which speech is intended for the device.
Conventionally, a user may indicate that an utterance (i.e., a voice command) from the user is intended for a particular voice command device by interacting with a speech input button. For example, the speech input button may be a dedicated physical button or another type of I/O component coupled to or integrated with the device. As another example, the speech input button may be a software component of a graphical user interface (GUI) displayed on a display coupled to or integrated with the device.
Another conventional solution is a user producing a special utterance (e.g., a word or a phrase) that instructs a particular voice command device to attend to a voice command following the special utterance. Colloquially, the special utterance is called a “wake word.” For example, with a GALAXY® Note9™ smart phone from Samsung®, a user can press a dedicated physical button on a side of the phone and issue a voice command. Alternatively, the user can say a special utterance first, such as “Hi Bixby®”, and follow up the special utterance with a voice command.
One or more embodiments provide a voice user interface system for activating a voice agent (i.e., intelligent assistant) operating on an electronic device based on one or more physical interactions with the device that a user usually performs when utilizing the device. In one embodiment, the voice user interface system is configured to listen, when appropriate, for voice interactions that make sense in tandem with how the device is being physically manipulated. Unlike conventional solutions that require explicit activation from a user (e.g., via interacting with a speech input button or saying a special utterance), this voice user interface system provides automatic wake up of the voice agent where the wake up is implicit (i.e., does not require explicit activation from a user).
Examples of an electronic device 50 include, but are not limited to, a desktop computer, a mobile electronic device (e.g., a tablet, a smart phone, a laptop, etc.), a wearable device (e.g., a smart watch, etc.), an Internet of Things (IoT) device, a smart appliance such as a smart television, a smart oven, a smart washer, a smart light, etc.
In one embodiment, the electronic device 50 comprises one or more sensor units 70 integrated in or coupled to the electronic device 50, such as a camera, a microphone, a GPS, a motion sensor, etc. A sensor unit 70 may be utilized to capture content and/or sensor-based contextual information. For example, an application on the electronic device 50 may utilize at least one sensor unit 70 to capture content and/or sensor-based contextual information, such as a microphone for audio data (e.g., voice commands, ambient noise, etc.), a camera for image data (e.g., still and/or video images of an environment surrounding the electronic device 50, etc.), a GPS for location data (e.g., location coordinates), a motion sensor for proximity/motion data (e.g., data indicative of a user 30 within proximity of the electronic device 50), etc.
In one embodiment, the electronic device 50 comprises one or more I/O units 65 integrated in or coupled to the electronic device 50. In one embodiment, the one or more I/O units 65 include, but are not limited to, a physical user interface (PUI) and/or a GUI, such as a keyboard, a keypad, a touch interface, a touch screen, a knob, a button, a display screen, etc. In one embodiment, a user 30 can utilize at least one I/O unit 65 to configure one or more user preferences, configure one or more parameters (e.g., user permissions), provide input (e.g., a search query), etc.
In one embodiment, the electronic device 50 is a voice command device. The one or more applications on the electronic device 50 include, but are not limited to, a voice user interface system 200 configured to facilitate hands-free user interaction with the electronic device 50. In one embodiment, the system 200 provides a voice agent 205, wherein the voice agent 205 is an intelligent assistant configured to allow a user 30 to control the electronic device 50 via voice.
In one embodiment, the voice agent 205 has at least two different operating modes: a sleep mode in which the voice agent 205 is asleep (i.e., not active or in a powered down state), and a listening mode in which the voice agent 205 is awake (i.e., active or in a powered up state) and listening for a voice command from a user 30 (e.g., a voice command captured via at least one sensor unit 70).
As described in detail later herein, in one embodiment, the system 200 comprises a listening agent 250.
In one embodiment, the voice agent 205 can be activated via different types of activation methods. For example, in one embodiment, the voice agent 205 is activated explicitly via a wake word activation. A wake word activation is an example activation condition requiring an explicit utterance from a user 30 (“user utterance”) of a particular wake word for the electronic device 50. In one embodiment, the voice agent 205 is configured to wake up from the sleep mode in response to the system 200 detecting (e.g., via the listening agent 250 and at least one sensor unit 70) a user utterance of the wake word. For example, the voice agent 205 is woken up from the sleep mode in response to a microphone capturing a user utterance of the wake word.
As another example, in one embodiment, the voice agent 205 is activated automatically via a physical event activation without requiring a user utterance. A physical event activation is an example activation condition requiring a physical interaction that is typical of a user 30 utilizing the electronic device 50. In one embodiment, the voice agent 205 is configured to automatically wake up from the sleep mode in response to the system 200 detecting (e.g., via the listening agent 250 and at least one sensor unit 70 and/or at least one I/O unit 65) a physical interaction with the electronic device 50. Examples of different physical interactions that can automatically wake up the voice agent 205 from the sleep mode include, but are not limited to: a mechanical interaction (i.e., mechanical actuation) with a movable/rotatable/adjustable member of the electronic device 50, such as a door closure activation (e.g., a user 30 shutting a door of the electronic device 50); a proximity interaction (e.g., a user 30 is within physical proximity of the electronic device 50, or a user 30 has entered a room including the electronic device 50); and a user interface (UI) activation, such as an actuation of a PUI (e.g., a user 30 interacting with a knob, a button, and/or another hardware I/O unit 65) or a GUI (e.g., a user 30 interacting with a touch screen and/or another software I/O unit 65).
For example, if the electronic device 50 comprises a smart washer, the voice agent 205 is automatically woken up from the sleep mode in response to at least one sensor unit 70 detecting a door of a washing chamber of the washer closing, thereby saving a user 30 from having to explicitly utter a wake word for the electronic device 50 to wake up the voice agent 205.
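By way of a non-limiting illustration, the Python sketch below pictures the two activation paths (explicit wake word activation and implicit physical event activation) as a small state transition; the class and attribute names are assumptions made for illustration, not the disclosure's implementation.

```python
from enum import Enum, auto


class ActivationType(Enum):
    WAKE_WORD = auto()       # explicit: user utters the wake word
    PHYSICAL_EVENT = auto()  # implicit: door closure, proximity, UI actuation, etc.


class VoiceAgent:
    """Minimal sketch of a voice agent with sleep and listening modes."""

    def __init__(self):
        self.mode = "sleep"
        self.activation_type = None

    def wake_up(self, activation_type: ActivationType) -> None:
        # Transition from sleep mode to listening mode and remember how the
        # agent was activated; later behavior (timeouts, vocabulary, feedback)
        # can be tuned based on this value.
        self.mode = "listening"
        self.activation_type = activation_type


agent = VoiceAgent()

# Explicit activation: a microphone capture matched the wake word.
agent.wake_up(ActivationType.WAKE_WORD)

# Implicit activation: a sensor reported that the washer door was closed.
agent.wake_up(ActivationType.PHYSICAL_EVENT)
```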
In one embodiment, the system 200 is configured to utilize different interaction feedback methods for providing feedback to a user 30 (e.g., visual and/or speech feedback indicating a current mode of the voice agent 205 and/or a response, etc.) based on a type of activation method used to activate the voice agent 205.
In one embodiment, the system 200 is configured to control the number and the type of actions that the electronic device 50 can be invoked/triggered to perform via voice commands based on contextual information such as, but not limited to, a context of a user 30, a context of the electronic device 50 such as current state (i.e., current configuration), an environment of the user 30 and/or the electronic device 50 such as time, day, temperature, weather, ambient lighting, location, etc.
In one embodiment, the system 200 is configured to determine an identity of a user 30 who produced a user utterance detected via the listening agent 250 based on voice identification (ID), and control actions the user 30 can request the electronic device 50 to perform via voice based on the identity of the user 30. In one embodiment, voice ID involves determining voice patterns/qualities of the user 30 (e.g., sound, rhythm, dialect, pitch, etc.) based on the user utterance, and comparing the voice patterns/qualities against voice patterns/qualities of recognized/identified voice recordings (e.g., maintained on the one or more storage units 60). For example, if the electronic device 50 comprises a smart washer, the system 200 can permit only particular, authorized users 30, such as adults in a household, to start the washer via voice without having to actuate a physical UI or a GUI after the system 200 detects a door closure of the washer. Restricting the start of the washer to only a voice command from an adult after detecting a door closure of the washer ensures safe operation of the washer. For example, such restrictions can prevent an unauthorized user 30, such as a child, from being able to start the washer (e.g., the child may get inside the washer and attempt to turn on the washer via voice after closing the door of the washer).
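The voice ID gating in the washer example can be sketched as a speaker-matching step followed by an authorization check. In the minimal sketch below, the embeddings, similarity threshold, user IDs, and helper names are all assumptions used for illustration only.

```python
import math


def cosine_similarity(a, b):
    # Compare an utterance embedding against an enrolled voice profile.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical enrolled voice profiles (embeddings) maintained in storage.
ENROLLED_VOICES = {
    "parent_1": [0.9, 0.1, 0.3],
    "child_1": [0.2, 0.8, 0.4],
}

# Only these users may start the washer by voice after a door closure.
AUTHORIZED_TO_START = {"parent_1"}


def identify_user(utterance_embedding, threshold=0.85):
    best_id, best_score = None, 0.0
    for user_id, profile in ENROLLED_VOICES.items():
        score = cosine_similarity(utterance_embedding, profile)
        if score > best_score:
            best_id, best_score = user_id, score
    return best_id if best_score >= threshold else None


def may_start_washer(utterance_embedding) -> bool:
    user_id = identify_user(utterance_embedding)
    return user_id in AUTHORIZED_TO_START


print(may_start_washer([0.88, 0.12, 0.31]))  # True: matches parent_1, who is authorized
print(may_start_washer([0.25, 0.79, 0.41]))  # False: matches child_1, who is not authorized
```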
In one embodiment, the system 200 is configured to recognize, based on a type of activation method used to activate the voice agent 205 and voice ID, that one or more user utterances detected by the listening agent 250 are not intended for the electronic device 50, thereby reducing a likelihood of user speech activating an unintentional or undesired action on the electronic device 50.
In one embodiment, the system 200 is configured to adjust an amount of vocabulary required for the system 200 to interpret a user utterance to only what is contextually relevant based on a type of activation method used to activate the voice agent 205 and/or a current state of the electronic device 50, thereby ensuring that only voice commands that are appropriate for the current state of the electronic device 50 are acted upon.
In one embodiment, the one or more applications on the electronic device 50 may further include one or more software mobile applications 90 loaded onto or downloaded to the electronic device 50, such as a camera application, a social media application, etc. A software mobile application 90 on the electronic device 50 may exchange data with the system 200.
In one embodiment, the electronic device 50 comprises a communications unit 80 configured to exchange data with one or more remote devices 20 and/or one or more other electronic devices 50 over a communications network/connection 40 (e.g., a wireless connection such as a Wi-Fi connection or a cellular data connection, a wired connection, or a combination of the two). The communications unit 80 may comprise any suitable communications circuitry operative to connect to a communications network (e.g., communications network 40) and to exchange communications operations and media from the electronic device 50 to other devices connected to the communications network 40. The communications unit 80 may be operative to interface with the communications network 40 using any suitable communications protocol such as, for example, Wi-Fi (e.g., an IEEE 802.11 protocol), Bluetooth®, high frequency systems (e.g., 900 MHz, 2.4 GHz, and 5.6 GHz communication systems), infrared, GSM, GSM plus EDGE, CDMA, quadband, and other cellular protocols, VOIP, TCP-IP, or any other suitable protocol.
For example, a remote device 20 may comprise a remote server (e.g., a computer, device, or program that manages network resources, etc.) providing an online platform for hosting one or more online services (e.g., an online social media service, an online data source, etc.) and/or distributing one or more software mobile applications 90. As another example, the system 200 may be loaded onto or downloaded to the electronic device 50 from a remote device 20 that maintains and distributes updates for the system 200. As yet another example, a remote device 20 may comprise a cloud computing environment providing shared pools of configurable computing system resources and higher-level services.
In one embodiment, the different electronic devices 50 of the cluster 100 can be located within proximity of one another (e.g., within one or more rooms in a household). For example, the cluster 100 comprises at least one of the following electronic devices 50 located in a kitchen: a smart microwave, a smart range, a smart oven, a smart dishwasher, a family hub, etc. As another example, the cluster 100 comprises at least one of the following electronic devices 50 located in a laundry room: a smart washer, a smart dryer, etc.
In one embodiment, each electronic device 50 of the cluster 100 is configured to exchange data with one or more other electronic devices 50 of the cluster 100 over a communications network/connection (e.g., a wireless connection such as a Wi-Fi connection or a cellular data connection, a wired connection, or a combination of the two).
In one embodiment, a user 30 can simultaneously activate all voice agents 205 of all electronic devices 50 of the cluster 100 via wake word activation.
In one embodiment, a user 30 can automatically activate only one voice agent 205 of one electronic device 50 of the cluster 100 via a physical event activation that involves only the electronic device 50. As popular voice user interfaces typically share the same wake word across multiple electronic devices 50 (e.g., the wake word “Alexa®” is used to activate different Amazon® voice command devices, the wake word “Siri®” is used to activate different Apple® voice command devices, the wake phrase “Hey Google®” is used to activate different Android® voice command devices, etc.), physical event activation removes the problem of knowing which particular electronic device 50 is being activated by a wake word, thereby reducing a likelihood of multiple electronic devices 50 being simultaneously activated. Physical event activation relieves a user 30 from having to say a wake word, and also saves the user 30 from having to specify which particular electronic device 50 should attend to a voice command.
In one embodiment, an electronic device 50 of the cluster 100 can operate as a master device that controls one or more other electronic devices 50 of the cluster 100 (e.g., trigger/invoke the one or more other electronic devices 50 to perform one or more actions in response to one or more voice commands).
In one embodiment, an action performed at one electronic device 50 of the cluster 100 can trigger a voice agent 205 of another electronic device 50 of the cluster 100 to wake up from the sleep mode. For example, an action performed at a first electronic device 50 of the cluster 100 can trigger a voice agent 205 of a second electronic device 50 of the cluster 100 that operates as a master device to wake up from the sleep mode and listen for voice commands that are intended for the first electronic device 50.
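One way to picture this cross-device wake-up is a cluster object that routes an action notification from the source device to the voice agent of a master device; the class, method, and attribute names in the sketch below are assumptions.

```python
class Device:
    def __init__(self, name, is_master=False):
        self.name = name
        self.is_master = is_master
        self.agent_mode = "sleep"
        self.listening_for = None  # which device's commands this agent attends to

    def wake_agent(self, on_behalf_of=None):
        self.agent_mode = "listening"
        self.listening_for = on_behalf_of or self.name


class Cluster:
    def __init__(self, devices):
        self.devices = devices
        self.master = next(d for d in devices if d.is_master)

    def on_action_performed(self, source_device):
        # An action at one device (e.g., a door closure at the washer) wakes the
        # master device's agent to listen for commands aimed at the source device.
        self.master.wake_agent(on_behalf_of=source_device.name)


washer = Device("smart washer W")
hub = Device("family hub FH", is_master=True)
cluster = Cluster([washer, hub])

cluster.on_action_performed(washer)
print(hub.agent_mode, hub.listening_for)  # listening smart washer W
```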
In one embodiment, a portable or moveable electronic device 50 can be added to or removed from the cluster 100, such as a smart AI speaker or a smart phone.
In one embodiment, the system 200 comprises a UI & sensors logic unit 210 configured to: (1) receive UI & sensor data captured by at least one I/O unit 65 and/or at least one sensor unit 70, wherein the UI & sensor data is indicative of one or more physical interactions with the electronic device 50, (2) generate one or more activation signals in response to the one or more physical interactions detected, wherein the one or more activation signals trigger the voice agent 205 to automatically wake up from the sleep mode, and (3) generate control data indicative of one or more adjustments/updates to a current state of the electronic device 50.
In one embodiment, the system 200 comprises a wake word detection unit 240 configured to: (1) receive audio data (i.e., sound input data) captured by at least one sensor unit 70 (e.g., a microphone), wherein the audio data comprises ambient noise and/or one or more user utterances, (2) determine whether the audio data comprises an explicit user utterance of a wake word for the electronic device 50, and (3) generate one or more activation signals in response to determining the audio data comprises an explicit user utterance of the wake word, wherein the one or more activation signals trigger the voice agent 205 to wake up from the sleep mode.
In one embodiment, the system 200 comprises a device control and state logic unit 220 configured to provide an electromechanical application programming interface (API) for communicating with one or more mechanical parts (“device mechanics”) 230 of the electronic device 50. The device control and state logic unit 220 allows for the one or more device mechanics 230 to be controlled via the electromechanical API based on control data (e.g., from the UI & sensors logic unit 210 and/or from a device action initiator 274).
In one embodiment, the device control and state logic unit 220 is configured to generate state data indicative of at least one of the following: a current state of the electronic device 50, or one or more recent actions performed by the electronic device 50.
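A rough sketch of the kind of electromechanical API and state data described above follows; the method names, fields, and commands are assumptions, not the disclosure's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeviceState:
    door_closed: bool = False
    running: bool = False
    recent_actions: List[str] = field(default_factory=list)


class DeviceControlAndStateLogic:
    """Hypothetical electromechanical API wrapper around device mechanics."""

    def __init__(self):
        self.state = DeviceState()

    def apply_control(self, command: str) -> None:
        # Adjust the current state based on control data (e.g., from the
        # UI & sensors logic unit or from a device action initiator).
        if command == "start":
            self.state.running = True
        elif command == "stop":
            self.state.running = False
        self.state.recent_actions.append(command)

    def report_state(self) -> DeviceState:
        # State data consumed elsewhere (e.g., by a vocabulary chooser).
        return self.state
```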
In one embodiment, the system 200 comprises a listening agent 250. When the voice agent 205 is in the sleep mode, the listening agent 250 is configured to: (1) receive one or more activation signals from the UI & sensors logic unit 210 and/or the wake word detection unit 240, and (2) awaken the voice agent 205 from the sleep mode and place the voice agent 205 in the listening mode in response to the one or more activation signals. As stated above, the voice agent 205 can either be woken up explicitly via a wake word activation or automatically via a physical event activation.
When the voice agent 205 is in the listening mode, the listening agent 250 is configured to listen for one or more voice commands from a user 30. Specifically, the listening agent 250 is configured to: (1) receive audio data captured by at least one sensor unit 70 (e.g., a microphone), wherein the audio data comprises ambient noise and/or one or more user utterances, and (2) analyze the audio data.
The listening agent 250 comprises one or more components for analyzing audio data. In one embodiment, the listening agent 250 comprises an ambient noise analysis unit 252 configured to analyze audio data to determine whether the audio data comprises ambient noise. In one embodiment, the listening agent 250 comprises an utterance capture unit 253 configured to: (1) analyze audio data to determine whether the audio data comprises a user utterance, and (2) in response to determining the audio data comprises a user utterance, capture/extract the user utterance in the audio data as speech data (i.e., data including spoken language). As described in detail later herein, the speech data can be forwarded to another component of the system 200 for automatic speech recognition (ASR).
In one embodiment, the listening agent 250 comprises a voice ID analysis unit 255 configured to determine a user ID indicative of an identity of a user 30 who produced a user utterance included in audio data based on voice ID. As described in detail later herein, the user ID can be forwarded to another component of the system 200 for determining one or more user permissions for the user 30.
In one embodiment, the listening agent 250 comprises a speech start stop unit 254 configured to: (1) wait and listen for a user utterance for a pre-determined amount of time (i.e., a timing window), and (2) generate/issue a timeout when the pre-determined amount of time has elapsed and a complete user utterance is unavailable. In one embodiment, the listening agent 250 is configured to generate different types of timeouts, wherein each type of timeout has its own corresponding timing window. For example, in one embodiment, the listening agent 250 generates a no-speak timeout indicating that no user utterance has been detected after a corresponding time window for the no-speak timeout has elapsed (i.e., the listening agent 250 has detected an absence of voice commands during the time window). As another example, in one embodiment, the listening agent 250 generates a command-too-long timeout indicating that a user utterance detected is too long (i.e., the user utterance is not complete) after a corresponding time window for the command-too-long timeout has elapsed.
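The timeout behavior of the speech start stop unit 254 can be sketched as follows; the capture callback, timing values, and return conventions are assumptions used only for illustration.

```python
import time


def listen_with_timeouts(get_audio_chunk, no_speak_window=10.0, command_too_long_window=15.0):
    """Listen for an utterance, returning one of:
    ('ok', utterance), ('no_speak_timeout', None), or
    ('command_too_long_timeout', partial_utterance)."""
    start = time.monotonic()
    words = []
    speech_started = False

    while True:
        chunk = get_audio_chunk()  # hypothetical capture callback; None means silence
        elapsed = time.monotonic() - start

        if chunk is None:
            if speech_started:
                # Silence after speech: treat the utterance as complete.
                return "ok", " ".join(words)
            if elapsed >= no_speak_window:
                # No user utterance detected within the timing window.
                return "no_speak_timeout", None
        else:
            speech_started = True
            words.append(chunk)
            if elapsed >= command_too_long_window:
                # The utterance is still going, so it is treated as incomplete.
                return "command_too_long_timeout", " ".join(words)

        time.sleep(0.05)  # avoid a busy loop in this sketch


# Example: a capture stub that never produces speech triggers the no-speak timeout.
status, _ = listen_with_timeouts(lambda: None, no_speak_window=0.2)
print(status)  # no_speak_timeout
```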
In one embodiment, the system 200 maintains one or more activation rules that adapt one or more algorithms for controlling one or more behaviors of the electronic device 50 based on a type of activation method used to activate the voice agent 205 (e.g., wake word activation or physical event activation). For example, in one embodiment, one or more parameters controlling one or more behaviors of the listening agent 250 are selectively adjusted/tuned based on a type of activation method used to activate the voice agent 205.
In one embodiment, the listening agent 250 comprises an activation tuning table 251 configured to maintain one or more activation rules for adapting one or more parameters controlling one or more behaviors of the listening agent 250 based on a type of activation method used to activate the voice agent 205. For example, in one embodiment, for a parameter controlling a timing window for a no-speak timeout, the timing window is shorter (e.g., 1.5 seconds) if the voice agent 205 is activated via physical event activation (e.g., door closure), whereas the timing window is longer (e.g., 10 seconds) if the voice agent 205 is activated via wake word activation.
Table 1 below provides an example set of activation rules maintained in an activation tuning table 251, in one or more embodiments.
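Such activation rules can be thought of as a lookup from activation type to listening parameters. The minimal sketch below uses the illustrative timing windows from the preceding paragraph; the remaining parameter names and values are assumptions.

```python
# Hypothetical activation tuning table: listening parameters keyed by activation type.
ACTIVATION_TUNING_TABLE = {
    "wake_word": {
        "no_speak_timeout_s": 10.0,   # the user explicitly addressed the device
        "command_too_long_timeout_s": 15.0,
    },
    "physical_event": {
        "no_speak_timeout_s": 1.5,    # e.g., after a door closure
        "command_too_long_timeout_s": 5.0,
    },
}


def tuning_for(activation_type: str) -> dict:
    return ACTIVATION_TUNING_TABLE[activation_type]


print(tuning_for("physical_event")["no_speak_timeout_s"])  # 1.5
```

A shorter window after a physical event activation limits how long the device listens when the user may not have intended to speak to it at all.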
In one embodiment, the system 200 comprises a language interpreter 260 configured to interpret spoken language included in a user utterance. In one embodiment, the language interpreter 260 comprises an ASR unit 262 configured to receive speech data from the utterance capture unit 253, and recognize and translate spoken language included in the speech data into text based on ASR.
In one embodiment, the language interpreter 260 comprises a natural language understanding (NLU) unit 263 configured to receive text from the ASR unit 262, and determine an intent of a voice command from a user 30 based on the text and NLU. For example, in one embodiment, the intent is indicative of an action that the user 30 wants the electronic device 50 to perform either now or sometime in the future. In one embodiment, the language interpreter 260 can generate and forward intent data indicative of an intent of a voice command from a user 30 to another component of the system 200.
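The interpreter flow (speech data, then ASR to text, then NLU to intent) can be sketched as a small pipeline; the placeholder transcription and rule-based intent extraction below are assumptions standing in for real ASR and NLU components.

```python
def asr(speech_data: bytes) -> str:
    # Placeholder: a real ASR unit would transcribe the captured audio to text.
    return "start the washer"


def nlu(text: str) -> dict:
    # Placeholder rule-based intent extraction.
    if "start" in text:
        return {"intent": "start_device", "schedule": "now"}
    if "stop" in text:
        return {"intent": "stop_device", "schedule": "now"}
    return {"intent": "unknown"}


def interpret(speech_data: bytes) -> dict:
    # Speech data -> text -> intent data, mirroring the described flow.
    return nlu(asr(speech_data))


print(interpret(b"\x00\x01"))  # {'intent': 'start_device', 'schedule': 'now'}
```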
In one embodiment, the language interpreter 260 maintains different vocabularies available for use in interpreting spoken language included in a user utterance. In one embodiment, the language interpreter 260 is configured to select a particular vocabulary to utilize from the different vocabularies available based on a type of activation method used to activate the voice agent 205 and/or a current state of the electronic device 50. For example, in one embodiment, the system 200 is configured to detect user speech that is not intended for the electronic device 50 when the voice agent 205 is activated via physical event activation. In one embodiment, if the voice agent 205 is activated via physical event activation, to reduce a likelihood of the electronic device 50 performing an undesired or unintentional action based on user speech, the language interpreter 260 is configured to select a vocabulary to utilize that is reduced/smaller compared to one or more other vocabularies available. In one embodiment, the language interpreter 260 is configured to select a vocabulary to utilize that is suitable for use in interpreting only voice commands that are available based on a current state of the electronic device 50, thereby ensuring that the electronic device 50 will perform an action that a user 30 intends or desires for the electronic device 50 to perform.
As another example, if the electronic device 50 comprises a smart light in a room and the voice agent 205 is activated in response to sensor-based contextual information indicating that a user 30 has just entered the room (e.g., detected via at least one sensor unit 70, such as a motion sensor), the language interpreter 260 is configured to select a vocabulary to utilize during a listening window (e.g., a brief period of time after the user 30 has entered the room) that is reduced/smaller compared to one or more other vocabularies available, wherein the selected vocabulary includes only a restricted/reduced set of voice commands the user 30 is permitted to make. For example, the smart light will only turn on if the user 30 issues the voice command “Lights on”. This restriction prevents the smart light from automatically turning on when the user 30 enters the room after bed time and reduces a likelihood of disturbing any sleeping occupants in the room. By comparison, if the voice agent 205 is activated in response to a wake word activation instead, the language interpreter 260 is configured to select a vocabulary to utilize during a listening window that is larger compared to a vocabulary utilized in response to a physical event activation.
In one embodiment, the language interpreter 260 comprises a vocabulary chooser table 261 maintaining different vocabularies available for the language interpreter 260 to select and use based on a type of activation method used to activate the voice agent 205 and/or a current state of the electronic device 50. For example, in one embodiment, if the electronic device 50 comprises a smart washer that is currently stopped and the voice agent 205 is activated via either closing a door of a washing chamber of the washer or actuating a physical UI/GUI, the language interpreter 260 selects from the vocabulary chooser table 261 a vocabulary that includes the word “Start”, thereby enabling a user 30 to start the washer by simply uttering the word “Start” (i.e., the user 30 need not explicitly utter a wake word for the electronic device 50).
Table 2 below provides an example set of different vocabularies maintained in a vocabulary chooser table 261, in one or more embodiments.
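Such a vocabulary chooser can be modeled as a lookup keyed by activation type and current device state. The sketch below is a minimal illustration for the smart washer example above; the state names and word lists beyond “Start” are assumptions.

```python
# Hypothetical vocabulary chooser table for a smart washer.
VOCABULARY_CHOOSER_TABLE = {
    # (activation_type, current_state) -> accepted vocabulary
    ("physical_event", "stopped"): {"start"},
    ("physical_event", "running"): {"stop", "pause"},
    ("wake_word", "stopped"): {"start", "status", "schedule"},
    ("wake_word", "running"): {"stop", "pause", "status", "time remaining"},
}


def choose_vocabulary(activation_type: str, current_state: str) -> set:
    return VOCABULARY_CHOOSER_TABLE.get((activation_type, current_state), set())


def command_is_actionable(utterance: str, activation_type: str, current_state: str) -> bool:
    return utterance.lower() in choose_vocabulary(activation_type, current_state)


# Door closure on a stopped washer: only "Start" is acted upon.
print(command_is_actionable("Start", "physical_event", "stopped"))   # True
print(command_is_actionable("Status", "physical_event", "stopped"))  # False
```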
In one embodiment, the system 200 comprises an action planner 270 configured to receive at least one of the following: (1) intent data from the language interpreter 260, and (2) a user ID from the voice ID analysis unit 255.
In one embodiment, the action planner 270 comprises a device action initiator 274 configured to plan/schedule one or more actions for the electronic device 50 to perform either now or in the future based on intent data and/or user ID. The device action initiator 274 provides an action API for communicating with the device control and state logic unit 220. The device action initiator 274 allows for making one or more adjustments/updates to a current state of the electronic device 50 via the action API to carry out the one or more actions.
In one embodiment, the system 200 is configured to utilize different interaction feedback methods for providing feedback to a user 30 based on a type of activation method used to activate the voice agent 205. In one embodiment, the system 200 comprises a speech interface visual feedback component 280 comprising one or more I/O units 65 (e.g., one or more indicator lights, a display screen, etc.) for displaying visual feedback indicative of a current mode of the voice agent 205, such as whether the voice agent 205 is listening (i.e., waiting and listening for a user utterance), whether the voice agent 205 is thinking (i.e., interpreting a user utterance), whether the voice agent 205 is responding (i.e., providing feedback), whether the voice agent 205 is in the sleep mode, etc. For example, an indicator light can selectively flash different colors and/or adjust intensity/brightness based on a current mode of the voice agent 205 (e.g., flash green if the voice agent 205 is listening, flash yellow if the voice agent 205 is thinking, flash red if the voice agent 205 is responding, or dim if the voice agent 205 is in the sleep mode). As another example, a display screen can selectively display different phrases based on the current mode of the voice agent 205 (e.g., display “Listening . . . ” if the voice agent 205 is listening, display “Thinking . . . ” if the voice agent 205 is thinking, display “Responding . . . ” if the voice agent 205 is responding).
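The mode-to-feedback mapping described above can be illustrated with a simple lookup; the colors and phrases follow the examples in the preceding paragraph, while the structure itself is an assumption.

```python
# Visual feedback per voice-agent mode, following the examples above.
VISUAL_FEEDBACK = {
    "listening": {"indicator": "flash green", "display": "Listening . . ."},
    "thinking": {"indicator": "flash yellow", "display": "Thinking . . ."},
    "responding": {"indicator": "flash red", "display": "Responding . . ."},
    "sleep": {"indicator": "dim", "display": ""},
}


def feedback_for(mode: str) -> dict:
    return VISUAL_FEEDBACK[mode]


print(feedback_for("thinking"))  # {'indicator': 'flash yellow', 'display': 'Thinking . . .'}
```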
In one embodiment, the system 200 comprises a speech synthesizer unit 290 configured to generate a speech response (e.g., via text-to-speech) indicative of a current mode of the voice agent 205, an error (e.g., a timeout), and/or an acknowledgement/confirmation. In one embodiment, the system 200 comprises a voice output component 285 comprising one or more I/O units 65 (e.g., a speaker, etc.) for outputting a speech response received from the speech synthesizer unit 290. For example, the system 200 can selectively generate and output different speech responses based on the current mode of the voice agent 205 (e.g., output “Listening . . . ” if the voice agent is listening, output “Thinking . . . ” if the voice agent is thinking, output “Responding . . . ” if the voice agent is responding).
In one embodiment, the action planner 270 comprises a visual feedback unit 272 configured to trigger the speech interface visual feedback component 280 to display a particular visual feedback based on intent data and/or user ID. In one embodiment, the action planner 270 comprises a speech feedback unit 273 configured to trigger the speech synthesizer unit 290 to generate a particular speech response based on intent data and/or user ID.
In one embodiment, the system 200 is configured to selectively generate and output a speech response based on a type of activation method used to activate the voice agent 205. For example, in one embodiment, if the voice agent 205 is activated via wake word activation and the system 200 encounters an error (e.g., a voice command includes a word not included in a vocabulary utilized by the language interpreter 260, a timeout, a user 30 does not have user permissions to control the electronic device 50 via voice, etc.), the system 200 is configured to generate and output a speech response indicating that a voice command received is unknown. As another example, in one embodiment, if the voice agent 205 is activated via physical event activation and the system 200 encounters an error, the system 200 is configured not to generate and output a speech response, as no voice command may have been intended for the electronic device 50.
In one embodiment, the action planner 270 comprises an error feedback table 275 maintaining one or more error feedback rules for use in controlling when a speech response is generated and outputted by the system 200 based on a type of activation method used to activate the voice agent 205.
Table 3 below provides an example set of error feedback rules maintained in an error feedback table 275, in one or more embodiments.
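One way to model such error feedback rules is a per-activation-type lookup that either yields a speech response or yields nothing (silence). The sketch below is illustrative; the specific messages are assumptions.

```python
# Hypothetical error feedback rules keyed by activation type.
ERROR_FEEDBACK_RULES = {
    "wake_word": {
        "unknown_command": "Sorry, I didn't understand that.",
        "no_speak_timeout": "I didn't hear a command.",
    },
    "physical_event": {
        # Stay silent: the user may not have intended a voice command at all.
        "unknown_command": None,
        "no_speak_timeout": None,
    },
}


def speech_response_for_error(activation_type: str, error: str):
    return ERROR_FEEDBACK_RULES[activation_type].get(error)


print(speech_response_for_error("wake_word", "unknown_command"))       # spoken response
print(speech_response_for_error("physical_event", "unknown_command"))  # None (no output)
```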
In one embodiment, the action planner 270 comprises one or more user permissions tables 271 corresponding to one or more users 30, wherein each user permissions table 271 maintains one or more user permissions for a corresponding user 30. The action planner 270 utilizes a user permissions table 271 corresponding to a user 30 to determine whether the user 30 is authorized to control the electronic device 50 via voice based on a user ID for the user 30 and/or a type of activation method used to activate the voice agent 205.
Table 4 below provides an example set of user permissions maintained in a user permissions table 271 corresponding to a user 30, in one or more embodiments.
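A user permissions table can be modeled as a per-user lookup of permitted voice commands, optionally differentiated by activation type. The sketch below is illustrative; the user IDs and command sets are assumptions.

```python
# Hypothetical per-user permissions: which voice commands each user may issue,
# possibly differing by how the voice agent was activated.
USER_PERMISSIONS = {
    "parent_1": {
        "wake_word": {"start", "stop", "schedule", "change settings"},
        "physical_event": {"start", "stop"},
    },
    "child_1": {
        "wake_word": {"status"},
        "physical_event": set(),  # e.g., cannot start the washer after a door closure
    },
}


def is_authorized(user_id: str, activation_type: str, command: str) -> bool:
    return command in USER_PERMISSIONS.get(user_id, {}).get(activation_type, set())


print(is_authorized("parent_1", "physical_event", "start"))  # True
print(is_authorized("child_1", "physical_event", "start"))   # False
```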
In one embodiment, a voice agent 205 operating in an electronic device 50 is configured to determine at least one other electronic device 50 that the electronic device 50 can be used in combination with. For example, based on recent actions performed by the smart washer W and the smart dryer D (e.g., users 30 wash clothes via the smart washer W first then dry the washed clothes via the smart dryer D), the voice agent 205 of the smart washer W determines the smart washer W and the smart dryer D can be used in combination. In one embodiment, the voice agent 205 of the smart washer W triggers the voice agent 205 of the smart dryer D to wake up from the sleep mode before the current wash cycle of the smart washer W ends. In one embodiment, the voice agent 205 of the smart washer W or the voice agent 205 of the smart dryer D provides feedback (e.g., speech and/or visual feedback) to the user 30 that indicates when the current wash cycle of the smart washer W will terminate. This feedback informs the user 30 when washed clothes can be removed from the smart washer W.
In one embodiment, the user 30 automatically activates the voice agent 205 of the smart dryer D via closing a door of a drying chamber of the smart dryer D after the user 30 has removed washed clothes from the smart washer W and placed the washed clothes in the drying chamber.
Other examples of physical event activations include, but are not limited to, the user 30 automatically activating a voice agent 205 of the smart microwave M via closing a door of the smart microwave M (e.g., after placing a food item into the smart microwave M for microwaving), the user 30 automatically activating a voice agent 205 of the smart refrigerator FH via closing a door of the smart refrigerator FH (e.g., after placing a food item into the smart refrigerator FH for refrigerating), the user 30 automatically activating a voice agent 205 of the smart dishwasher DW via closing a door of a dishwashing chamber of the smart dishwasher DW (e.g., after placing plates in the dishwashing chamber for washing), etc.
In one embodiment, an activity performed by one electronic device 50 of a cluster of electronic devices 50 can activate a voice agent 205 of one or more other electronic devices 50 of the cluster.
For example, assume the user 30 retrieves frozen chicken from the smart refrigerator FH and closes a door of the smart microwave M after placing the frozen chicken in the smart microwave M for defrosting. The voice agent 205 of the smart microwave M is activated in response to the door closure of the smart microwave M. When the user 30 issues a voice command “Start”, the voice agent 205 determines contextual information relevant to the voice command, such as microwave settings necessary to perform the voice command. For example, based on the recipe and/or one or more online data sources, the voice agent 205 determines microwave settings such as temperature and amount of time necessary to defrost the frozen chicken, and triggers/invokes the smart microwave M to start in accordance with the microwave settings and display visual feedback via a control surface of the smart microwave M, wherein the visual feedback is indicative of the microwave settings. The user 30 can review the microwave settings displayed, and make adjustments, if necessary, via voice interaction (e.g., issuing a voice command) or touch interaction (e.g., interacting with the control surface).
If the recipe requires the chicken to be seared, a voice agent 205 of the smart range R can be activated before the smart microwave M finishes defrosting the chicken. For example, assume a voice agent 205 of the smart microwave M or a master device of the cluster triggers the voice agent 205 of the smart range R to wake up a minute or a couple of seconds before the smart microwave M finishes the defrosting. When the user 30 places a pan including the defrosted chicken on the smart range R (after retrieving the chicken from the smart microwave M) and issues a voice command “Start”, the voice agent 205 of the smart range R determines contextual information relevant to the voice command, such as range settings necessary to perform the voice command. For example, based on the recipe and/or one or more online data sources, the voice agent 205 determines range settings such as temperature and amount of time necessary to sear the chicken, and triggers/invokes the smart range R to start in accordance with the range settings and display visual feedback via a control surface of the smart range R, wherein the visual feedback is indicative of the range settings. The user 30 can review the range settings displayed, and make adjustments, if necessary, via voice interaction (e.g., issuing a voice command) or touch interaction (e.g., interacting with the control surface).
If the recipe requires the chicken to be baked after it is seared, a voice agent 205 of the smart oven O can be activated before the smart range R finishes searing the chicken. For example, assume a voice agent 205 of the smart range R or a master device of the cluster triggers the voice agent 205 of the smart oven O to wake up a few minutes before the smart range R finishes the searing and to start pre-heating the smart oven O. The voice agent 205 determines contextual information relevant to pre-heating the smart oven O, such as oven settings necessary to pre-heat the smart oven O. For example, the voice agent 205 determines oven settings such as temperature and amount of time necessary to bake the chicken, and triggers/invokes the smart oven O to start a pre-heat in accordance with the temperature. When the user 30 places the pan including the seared chicken into the smart oven O and issues a voice command “Start”, the voice agent 205 of the smart oven O triggers/invokes the smart oven O to start a bake in accordance with the oven settings and display visual feedback via a control surface of the smart oven O, wherein the visual feedback is indicative of the oven settings. The user 30 can review the oven settings displayed, and make adjustments, if necessary, via voice interaction (e.g., issuing a voice command) or touch interaction (e.g., interacting with the control surface).
Based on contextual information relevant to the user 30 (e.g., sensor-based contextual information captured via at least one sensor unit 70, calendar information for the user 30, a voice command scheduled for the future, etc.), a voice agent 205 operating in the cluster (e.g., a master device) can determine, while the chicken is baking in the smart oven O, whether the user 30 will leave or has left the household (e.g., to pick-up children from school), and trigger/invoke the smart oven O to stop the bake while the user 30 is away. The voice agent 205 can trigger/invoke the smart oven O to resume the bake upon detecting that the user 30 has returned to the household.
In one embodiment, the smart home device H operates as a master device that controls multiple electronic devices 50 in different rooms. In one embodiment, when the voice agent 205 of the smart home device H wakes up in response to the wake word, the voice agent 205 is configured to determine contextual information relevant to the statement based on a current state of each of the multiple electronic devices 50 (e.g., which of the multiple electronic devices 50 is a contributing factor to the statement).
In one embodiment, the smart home mini device HM operates as a master device that controls multiple electronic devices 50 in different rooms. In one embodiment, when the voice agent 205 of the smart home mini device HM wakes up in response to the wake word, the voice agent 205 is configured to determine which of the multiple electronic devices 50 the request is intended for (e.g., which of the electronic devices 50 is best suited to perform the request), and trigger/invoke the electronic device 50 that the request is intended for to perform the request.
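The routing decision can be pictured as a capability lookup maintained by the master device; the device names and capability sets in the sketch below are assumptions.

```python
# Hypothetical capability registry maintained by a master device.
DEVICE_CAPABILITIES = {
    "smart TV": {"play music", "play video"},
    "smart light": {"lights on", "lights off", "dim lights"},
    "smart speaker": {"play music", "set timer"},
}


def route_request(request: str):
    # Return the first device whose capabilities cover the request.
    for device, capabilities in DEVICE_CAPABILITIES.items():
        if request in capabilities:
            return device
    return None


print(route_request("play music"))  # smart TV (first capable device found)
print(route_request("lights on"))   # smart light
```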
In one embodiment, when the user 30 moves from the living room to a kitchen, the user 30 can activate a voice agent 205 operating in multiple electronic devices 50 in the kitchen, such as a smart microwave M, a smart range R, a smart oven O, and/or a smart home device H, via wake word activation.
In one embodiment, the smart home device H operates as a master device that controls multiple electronic devices 50 in different rooms. In one embodiment, when the voice agent 205 of the smart home device H wakes up in response to the wake word, the voice agent 205 is configured to determine contextual information relevant to the inquiry based on a current state of each of the multiple electronic devices 50.
In one embodiment, the smart home mini device HM operates as a master device that controls multiple electronic devices 50 in different rooms. In one embodiment, when the voice agent 205 of the smart home mini device HM wakes up in response to the wake word, the voice agent 205 is configured to determine which of the multiple electronic devices 50 the first request is intended for, and trigger/invoke the electronic device 50 that the first request is intended for to perform the first request.
In one embodiment, the smart refrigerator FH operates as a master device that controls multiple electronic devices 50 in different rooms. In one embodiment, when the voice agent 205 of the smart refrigerator FH wakes up in response to the wake word, the voice agent 205 is configured to determine which of the multiple electronic devices 50 the second request is intended for based on a current state of each of the multiple electronic devices 50.
In one embodiment, if a voice agent 205 operating in multiple electronic devices 50 in multiple locations (e.g., different rooms) receives conflicting voice commands from multiple users 30, the voice agent 205 is configured to inform at least one of the multiple users 30 of the conflict.
In one embodiment, process blocks 901-903 may be performed by one or more components of the voice user interface system 200.
In one embodiment, process blocks 911-912 may be performed by one or more components of the voice user interface system 200.
Information transferred via communications interface 607 may be in the form of signals such as electronic, electromagnetic, optical, or other signals capable of being received by communications interface 607, via a communication link that carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, and/or other communication channels. Computer program instructions representing the block diagram and/or flowcharts herein may be loaded onto a computer, programmable data processing apparatus, or processing devices to cause a series of operations performed thereon to generate a computer implemented process.
Embodiments have been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. Each block of such illustrations/diagrams, or combinations thereof, can be implemented by computer program instructions. The computer program instructions when provided to a processor produce a machine, such that the instructions, which execute via the processor create means for implementing the functions/operations specified in the flowchart and/or block diagram. Each block in the flowchart/block diagrams may represent a hardware and/or software module or logic. In alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures, concurrently, etc.
The terms “computer program medium,” “computer usable medium,” “computer readable medium”, and “computer program product,” are used to generally refer to media such as main memory, secondary memory, removable storage drive, a hard disk installed in hard disk drive, and signals. These computer program products are means for providing software to the computer system. The computer readable medium allows the computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as a floppy disk, ROM, flash memory, disk drive memory, a CD-ROM, and other permanent storage. It is useful, for example, for transporting information, such as data and computer instructions, between computer systems. Computer program instructions may be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
As will be appreciated by one skilled in the art, aspects of the embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the embodiments may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of one or more embodiments may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of one or more embodiments are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference in the claims to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more.” All structural and functional equivalents to the elements of the above-described exemplary embodiment that are currently known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the present claims. No claim element herein is to be construed under the provisions of 35 U.S.C. section 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or “step for.”
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention.
The embodiments have been described with reference to certain versions thereof; however, other versions are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the preferred versions contained herein.
Claims
1. A method of operating an electronic device comprising a display, a microphone, a speaker, and at least one sensor, the method comprising:
- based on a first occurrence of a first event, providing a first feedback via at least one of the display or the speaker, the first event comprising identifying a wake word from a first speech signal obtained via the microphone;
- generating a first timeout based on no speech signal being obtained via the microphone within a first time duration after providing the first feedback;
- based on a first occurrence of a second event, providing a second feedback via at least one of the display or the speaker, the second event comprising detecting, via the at least one sensor, an external object within a proximity of the electronic device, and the second feedback being different from the first feedback; and
- generating a second timeout based on no speech signal being obtained via the microphone within a second time duration after providing the second feedback,
- wherein the method further comprises:
- generating a third timeout based on no voice command being identified from a second speech signal obtained via the microphone within a third time duration after a second occurrence of the first event; and
- generating a fourth timeout based on no voice command being identified from a third speech signal obtained via the microphone within a fourth time duration after a second occurrence of the second event.
2. The method of claim 1, wherein the first feedback comprises at least one of a feedback based on a sound, provided via the speaker, or a feedback based on a visual indication, provided via the display, and
- wherein the second feedback comprises a feedback based on a visual indication, provided via the display.
3. The method of claim 1, wherein the first time duration is different from the third time duration, and the second time duration is different from the fourth time duration.
4. The method of claim 1, further comprising: providing, based on the generating the first timeout, a visual indication indicating a transition to a sleep mode.
5. The method of claim 1, further comprising: providing, based on the generating the second timeout, a visual indication indicating a transition to a sleep mode.
6. The method of claim 1, further comprising: based on the second speech signal being obtained via the microphone and prior to the generating the third timeout, displaying, on the display, a text phrase indicating a transition to a listening mode.
7. The method of claim 1, further comprising: based on the third speech signal being obtained via the microphone and prior to the generating the fourth timeout, displaying, on the display, a text phrase indicating a transition to a listening mode.
8. The method of claim 1, wherein the generating the third timeout comprises providing a speech response, via the speaker, indicating no voice command being identified from the second speech signal.
9. The method of claim 1, wherein the generating the fourth timeout comprises providing a visual feedback, via the display, indicating no voice command being identified from the third speech signal.
10. The method of claim 3, wherein the first time duration is longer than the second time duration, and the third time duration is longer than the fourth time duration.
11. An electronic device comprising:
- a display;
- a microphone;
- a speaker;
- at least one sensor; and
- at least one processor coupled to the display, the microphone, the speaker, and the at least one sensor, and configured to: based on a first occurrence of a first event, provide a first feedback via at least one of the display or the speaker, the first event comprising identifying a wake word from a first speech signal obtained via the microphone; generate a first timeout based on no speech signal being obtained via the microphone within a first time duration after providing the first feedback; based on a first occurrence of a second event, provide a second feedback via at least one of the display or the speaker, the second event comprising detecting, via the at least one sensor, an external object within a proximity of the electronic device, and the second feedback being different from the first feedback; and generate a second timeout based on no speech signal being obtained via the microphone within a second time duration after providing the second feedback,
- wherein the at least one processor is further configured to: generate a third timeout based on no voice command being identified from a second speech signal obtained via the microphone within a third time duration after a second occurrence of the first event; and generate a fourth timeout based on no voice command being identified from a third speech signal obtained via the microphone within a fourth time duration after a second occurrence of the second event.
12. The electronic device of claim 11, wherein the first feedback comprises a feedback based on a sound, provided via the speaker, or a feedback based on a visual indication, provided via the display.
13. The electronic device of claim 11, wherein the first time duration is different from the third time duration, and the second time duration is different from the fourth time duration.
14. The electronic device of claim 11, wherein the at least one processor is further configured to: provide, based on the generating the first timeout, a visual indication indicating a transition to a sleep mode.
15. The electronic device of claim 11, wherein the at least one processor is further configured to: provide, based on the generating the second timeout, a visual indication indicating a transition to a sleep mode.
16. The electronic device of claim 11, wherein the at least one processor is further configured to: based on the second speech signal being obtained via the microphone and prior to the generating the third timeout, control the display to display a text phrase indicating a transition to a listening mode.
17. The electronic device of claim 11, wherein the at least one processor is further configured to: based on the third speech signal being obtained via the microphone and prior to the generating the fourth timeout, control the display to display a text phrase indicating a transition to a listening mode.
18. The electronic device of claim 11, wherein the at least one processor is further configured to, in generating the third timeout, provide a speech response, via the speaker, indicating no voice command being identified from the second speech signal.
19. The electronic device of claim 11, wherein the at least one processor is further configured to, in generating the fourth timeout, provide a visual feedback, via the display, indicating no voice command being identified from the third speech signal.
20. The electronic device of claim 13, wherein the first time duration is longer than the second time duration, and the third time duration is longer than the fourth time duration.
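For illustration only, and not as part of the claims or the described embodiments, the following is a minimal Python sketch of the dual-activation timeout behavior recited above: a wake-word activation and a proximity activation each trigger different feedback and run against different timeout windows, with separate windows for obtaining no speech at all and for obtaining speech in which no voice command is identified. All class names, durations, and printed feedback messages are assumptions, the claims' "second occurrence" language is simplified to per-session bookkeeping, and a real device would use platform audio and sensor APIs rather than the stubbed inputs shown here.

    # Hypothetical sketch only; names, durations, and outputs are illustrative assumptions.
    import time
    from enum import Enum, auto

    class Activation(Enum):
        WAKE_WORD = auto()   # "first event": wake word identified in a speech signal
        PROXIMITY = auto()   # "second event": external object detected near the device

    class AssistantSession:
        """Tracks one listening session and its timeout behavior per activation type."""

        # Assumed durations (seconds): wake-word sessions are given longer windows
        # than proximity sessions, and the no-speech windows differ from the
        # no-command windows.
        NO_SPEECH_WINDOW = {Activation.WAKE_WORD: 8.0, Activation.PROXIMITY: 4.0}
        NO_COMMAND_WINDOW = {Activation.WAKE_WORD: 6.0, Activation.PROXIMITY: 3.0}

        def __init__(self, activation: Activation):
            self.activation = activation
            self.started = time.monotonic()
            self._give_feedback()

        def _give_feedback(self):
            # Wake-word activation: audible and/or visual feedback.
            # Proximity activation: visual-only feedback, different from the first.
            if self.activation is Activation.WAKE_WORD:
                print("feedback: chime + listening indicator")
            else:
                print("feedback: listening indicator only")

        def wait_for_speech(self, speech_arrived: bool) -> str:
            """No speech at all within the window -> first/second timeout, then sleep."""
            window = self.NO_SPEECH_WINDOW[self.activation]
            if not speech_arrived and time.monotonic() - self.started > window:
                print("visual indication: transitioning to sleep mode")
                return "timeout"
            return "listening"

        def wait_for_command(self, command_identified: bool) -> str:
            """Speech obtained but no command identified -> third/fourth timeout."""
            window = self.NO_COMMAND_WINDOW[self.activation]
            if not command_identified and time.monotonic() - self.started > window:
                if self.activation is Activation.WAKE_WORD:
                    print("speech response: no command was recognized")
                else:
                    print("visual feedback: no command was recognized")
                return "timeout"
            return "listening"

    if __name__ == "__main__":
        session = AssistantSession(Activation.WAKE_WORD)
        print(session.wait_for_speech(speech_arrived=False))  # "listening" while the window has not elapsed

As a usage note, the sketch keys both the feedback choice and the timeout windows off the activation type alone, which is one simple way to realize the claimed asymmetry between wake-word and proximity activations.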
Type: Application
Filed: Jun 9, 2023
Publication Date: Oct 19, 2023
Applicant: SAMSUNG ELECTRONICS CO., LTD. (Suwon)
Inventors: Jeffrey C. OLSON (Woodside, CA), Henry N. HOLTZMAN (San Francisco, CA), Jean-David HSU (San Francisco, CA), Jeffrey A. MORGAN (Napa, CA)
Application Number: 18/332,479