Multi-User Warm Words
A method includes detecting a presence of multiple users within an environment of an assistant-enabled device (AED) and obtaining, for each user, a respective active set of warm words that each specify a respective action for a digital assistant to perform. Based on each respective active set of warm words, the method also includes executing a warm word arbitration routine to enable a final set of warm words for detection by the AED. Here, the final set of warm words includes warm words selected from the respective active set of warm words. While the final set of warm words is enabled, the method also includes receiving audio data corresponding to an utterance captured by the AED, detecting a warm word from the final set of warm words in the audio data, and instructing the digital assistant to perform the respective action specified by the detected warm word.
This disclosure relates to multi-user warm words.
BACKGROUND

A user's manner of interacting with an assistant-enabled device is primarily, if not exclusively, by means of voice input. For example, a user may ask a device to perform an action including media playback (e.g., music or podcasts), where the device responds by initiating playback of audio that matches the user's criteria. In instances where a device (e.g., a smart speaker) is commonly shared by multiple users in an environment, the device may need to field multiple actions requested by the users that may compete with one another.
SUMMARY

One aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include detecting a presence of multiple users within an environment of an assistant-enabled device (AED) executing a digital assistant, and for each user of the multiple users, obtaining a respective active set of warm words that each specify a respective action for the digital assistant to perform. Based on the respective active set of warm words for each user of the multiple users, the operations also include executing a warm word arbitration routine to enable a final set of warm words for detection by the AED. The final set of warm words enabled for detection by the AED includes warm words selected from the respective active set of warm words for at least one user of the multiple users detected within the environment of the AED. While the final set of warm words is enabled for detection by the AED, the operations also include receiving audio data corresponding to an utterance captured by the AED, detecting, in the audio data, a warm word from the final set of warm words, and instructing the digital assistant to perform the respective action specified by the detected warm word.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, detecting the presence of the multiple users within the environment of the AED includes detecting at least one of the multiple users is present within the environment based on proximity information of the AED relative to a user device associated with the at least one of the multiple users. In additional implementations, detecting the presence of the multiple users within the environment of the AED includes receiving image data corresponding to a scene of the environment and detecting the presence of at least one of the multiple users within the environment based on the image data. In some examples, detecting the presence of the multiple users within the environment of the AED includes detecting the presence of a corresponding user within the environment of the AED based on receiving voice data characterizing a voice query issued by the corresponding user and directed toward the digital assistant, performing speaker identification on the voice data to identify the corresponding user that issued the voice query, and determining that the identified corresponding user that issued the voice query is present within the environment of the AED. In these examples, the voice query issued by the corresponding user comprises a command that causes the digital assistant to perform a long-standing operation specified by the command and obtaining the respective active set of warm words includes adding, in response to the digital assistant performing the long-standing operation specified by the command, to the respective active set of warm words for the corresponding user that issued the voice query, one or more warm words specifying respective actions for controlling the long-standing operation.
In some implementations, the operations also include detecting the presence of a new user within the environment of the AED, wherein the warm word arbitration routine executes in response to detecting the presence of the new user within the environment. The operations may optionally include determining that one of the multiple users is no longer present within the environment of the AED, wherein the warm word arbitration routine executes in response to determining that the one of the multiple users is no longer present within the environment. In some examples, the operations also include determining addition of a new warm word to, or removal of one of the warm words from, the respective active set of warm words for one of the multiple users present in the environment of the AED, wherein the warm word arbitration routine executes in response to determining the addition of the new warm word to, or the removal of the one of the warm words from, the respective active set of warm words for the one of the multiple users present in the environment of the AED. In some implementations, the operations also include determining a change in ambient context of the AED, wherein the warm word arbitration routine executes in response to determining the change in ambient context.
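The conditions enumerated above that cause the warm word arbitration routine to re-execute can be sketched as a simple event filter. This is an illustrative sketch only; the event names and the function are hypothetical and not part of the disclosure.

```python
# Hypothetical events that warrant re-running the warm word arbitration
# routine, mirroring the triggers described above. Names are illustrative.
ARBITRATION_TRIGGERS = {
    "user_entered",             # a new user detected within the environment
    "user_left",                # a user is no longer present
    "warm_word_added",          # a warm word added to a user's active set
    "warm_word_removed",        # a warm word removed from a user's active set
    "ambient_context_changed",  # e.g., music playback starts or stops
}


def should_rearbitrate(event: str) -> bool:
    """Return True when the event warrants re-executing arbitration."""
    return event in ARBITRATION_TRIGGERS
```

In practice such a filter would sit in the device's event loop, so arbitration runs only when presence or active sets actually change rather than on every audio frame.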
In some examples, executing the warm word arbitration routine includes obtaining enabled warm word constraints and determining a number of warm words to enable in the final set of warm words for detection by the AED based on the enabled warm word constraints. Here, the enabled warm word constraints include at least one of memory and computing resource availability on the AED for detection of warm words, computational requirements for enabling each warm word in the respective active set of warm words for each user of the multiple users present in the environment of the AED, an acceptable false accept rate tolerance, or an acceptable false reject rate tolerance. In some implementations, executing the warm word arbitration routine includes ranking the warm words in the respective active set of warm words from highest priority to lowest priority based on warm word prioritization signals for each corresponding user of the multiple users, and enabling the final set of warm words for detection by the AED based on the rankings of the warm words in the respective active set of warm words for each corresponding user of the multiple users. In some examples, executing the warm word arbitration routine includes identifying any shared warm words corresponding to warm words present in the respective active set of warm words for each of at least two of the multiple users and determining the final set of warm words is based on assigning, for inclusion in the final set of warm words, a higher priority to warm words identified as shared warm words.
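Determining the number of warm words to enable from the enabled warm word constraints might be sketched as taking the tighter of a memory limit and a false-accept budget. The disclosure names the constraint signals but no formula, so this combination, and every name below, is an assumption for illustration.

```python
def max_enabled_warm_words(memory_budget_kb: int,
                           cost_per_model_kb: int,
                           false_accept_budget: int,
                           false_accepts_per_word: int) -> int:
    """Hypothetical cap on the final set of warm words.

    The count is limited both by memory/compute availability on the AED
    (each enabled warm word model costs roughly `cost_per_model_kb`) and
    by an acceptable false accept rate tolerance (each enabled word is
    assumed to contribute `false_accepts_per_word` expected false
    accepts against a total budget).
    """
    limit_by_memory = memory_budget_kb // cost_per_model_kb
    limit_by_false_accepts = false_accept_budget // false_accepts_per_word
    return min(limit_by_memory, limit_by_false_accepts)
```

With, say, a 1024 KB budget, 128 KB per model, and a false-accept budget of 50 against 10 per word, the false-accept tolerance is the binding constraint and five warm words would be enabled.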
In some implementations, executing the warm word arbitration routine includes determining a warm word affinity score for each user and enabling the final set of warm words for detection by the AED based on the warm word affinity score determined for each corresponding user. Here, the warm word affinity score determined for each corresponding user may be based on at least one of: frequency of warm word usage by the corresponding user; frequency of interactions between the corresponding user and the digital assistant; duration of presence of the corresponding user in the environment of the AED; a proximity of the user relative to the AED; or a current user context.
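The warm word affinity score could, for instance, combine the listed signals as a weighted sum. The weights, and treating proximity inversely so that closer users score higher, are assumptions for illustration; the disclosure lists the signals but does not specify how they combine.

```python
def warm_word_affinity(usage_freq: float,
                       interaction_freq: float,
                       presence_minutes: float,
                       proximity_m: float,
                       weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Hypothetical per-user affinity score.

    Combines frequency of warm word usage, frequency of assistant
    interactions, duration of presence in the environment, and proximity
    to the AED. Proximity enters inversely: a user standing next to the
    device contributes more than one across the room.
    """
    w_usage, w_interact, w_presence, w_proximity = weights
    return (w_usage * usage_freq
            + w_interact * interaction_freq
            + w_presence * presence_minutes
            + w_proximity / (1.0 + proximity_m))
```

A routine could then rank each user's active warm words by the owner's affinity score when deciding which words survive into the final set.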
In some examples, the operations further include determining that the warm word detected in the audio data corresponding to the utterance captured by the user device comprises a speaker-specific warm word selected from the respective active set of warm words for each of one or more users among the multiple users, such that the digital assistant only performs the respective action specified by the speaker-specific warm word when the speaker-specific warm word is spoken by any of the one or more users corresponding to the respective active set of warm words. Based on determining that the warm word detected in the audio data includes a speaker-specific warm word, the operations include performing speaker verification on the audio data to determine that the utterance was spoken by one of the one or more users from whose respective active set of warm words the detected warm word was selected. Here, instructing the digital assistant to perform the respective action specified by the detected warm word is based on the speaker verification performed on the audio data.
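Gating speaker-specific warm words behind speaker verification might be sketched as follows. The data shapes and the `verify_speaker` callable are assumptions; the disclosure describes the behavior, not an implementation.

```python
def handle_warm_word(word, audio, final_set, verify_speaker):
    """Hypothetical dispatch for a detected warm word.

    `final_set` maps each enabled warm word to a tuple of (action,
    allowed_users), where `allowed_users` is None for a warm word any
    speaker may use, or a set of users for a speaker-specific warm word.
    `verify_speaker` is an assumed speaker-verification callable that
    returns the identified user (or None). Returns the action the
    digital assistant should perform, or None to ignore the utterance.
    """
    entry = final_set.get(word)
    if entry is None:
        return None                      # word not enabled for detection
    action, allowed_users = entry
    if allowed_users is None:
        return action                    # not speaker-specific
    speaker = verify_speaker(audio)      # verify who actually spoke
    return action if speaker in allowed_users else None
```

So "next" spoken by anyone controls playback, while a speaker-specific "stop timer" is honored only for a user from whose active set it was selected.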
In some implementations, the final set of warm words is enabled for detection by activating, for each warm word in the final set of warm words, a respective warm word model to run on the assistant-enabled device, and detecting, in the audio data, the warm word from the final set of warm words includes detecting, using the activated respective warm word model, the warm word in the audio data without performing speech recognition on the audio data. In these implementations, detecting the warm word in the audio data may include extracting audio features of the audio data, generating, using the activated respective warm word model, a warm word confidence score by processing the extracted audio features, and determining that the audio data corresponding to the utterance includes the warm word when the warm word confidence score satisfies a warm word confidence threshold. Alternatively, the final set of warm words may be enabled for detection by executing a speech recognizer on the AED, the speech recognizer biased to recognize the warm words in the final set of warm words, and detecting, in the audio data, the warm word from the final set of warm words may include recognizing, using the speech recognizer executing on the AED, the warm word in the audio data.
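The model-based detection path above reduces to scoring extracted audio features and comparing against a confidence threshold. This is a minimal sketch; the feature representation, the model callable, and the threshold value are all assumptions.

```python
def detect_warm_word(audio_features, warm_word_model, threshold=0.8) -> bool:
    """Hypothetical detection without full speech recognition.

    `warm_word_model` stands in for an activated per-word model: any
    callable that processes extracted audio features and produces a
    warm word confidence score in [0, 1]. Detection fires when the
    score satisfies the (assumed) confidence threshold.
    """
    confidence = warm_word_model(audio_features)
    return confidence >= threshold
```

Because no transcript is produced, this path can run continuously at low power, with a biased on-device speech recognizer optionally confirming detections.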
Another aspect of the disclosure provides a system including data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed by the data processing hardware cause the data processing hardware to perform operations that include detecting a presence of multiple users within an environment of an assistant-enabled device (AED) executing a digital assistant, and for each user of the multiple users, obtaining a respective active set of warm words that each specify a respective action for the digital assistant to perform. Based on the respective active set of warm words for each user of the multiple users, the operations also include executing a warm word arbitration routine to enable a final set of warm words for detection by the AED. The final set of warm words enabled for detection by the AED includes warm words selected from the respective active set of warm words for at least one user of the multiple users detected within the environment of the AED. While the final set of warm words is enabled for detection by the AED, the operations also include receiving audio data corresponding to an utterance captured by the AED, detecting, in the audio data, a warm word from the final set of warm words, and instructing the digital assistant to perform the respective action specified by the detected warm word.
This aspect may include one or more of the following optional features. In some implementations, detecting the presence of the multiple users within the environment of the AED includes detecting at least one of the multiple users is present within the environment based on proximity information of the AED relative to a user device associated with the at least one of the multiple users. In additional implementations, detecting the presence of the multiple users within the environment of the AED includes receiving image data corresponding to a scene of the environment and detecting the presence of at least one of the multiple users within the environment based on the image data. In some examples, detecting the presence of the multiple users within the environment of the AED includes detecting the presence of a corresponding user within the environment of the AED based on receiving voice data characterizing a voice query issued by the corresponding user and directed toward the digital assistant, performing speaker identification on the voice data to identify the corresponding user that issued the voice query, and determining that the identified corresponding user that issued the voice query is present within the environment of the AED. In these examples, the voice query issued by the corresponding user comprises a command that causes the digital assistant to perform a long-standing operation specified by the command and obtaining the respective active set of warm words includes adding, in response to the digital assistant performing the long-standing operation specified by the command, to the respective active set of warm words for the corresponding user that issued the voice query, one or more warm words specifying respective actions for controlling the long-standing operation.
In some implementations, the operations also include detecting the presence of a new user within the environment of the AED, wherein the warm word arbitration routine executes in response to detecting the presence of the new user within the environment. The operations may optionally include determining that one of the multiple users is no longer present within the environment of the AED, wherein the warm word arbitration routine executes in response to determining that the one of the multiple users is no longer present within the environment. In some examples, the operations also include determining addition of a new warm word to, or removal of one of the warm words from, the respective active set of warm words for one of the multiple users present in the environment of the AED, wherein the warm word arbitration routine executes in response to determining the addition of the new warm word to, or the removal of the one of the warm words from, the respective active set of warm words for the one of the multiple users present in the environment of the AED. In some implementations, the operations also include determining a change in ambient context of the AED, wherein the warm word arbitration routine executes in response to determining the change in ambient context.
In some examples, executing the warm word arbitration routine includes obtaining enabled warm word constraints and determining a number of warm words to enable in the final set of warm words for detection by the AED based on the enabled warm word constraints. Here, the enabled warm word constraints include at least one of memory and computing resource availability on the AED for detection of warm words, computational requirements for enabling each warm word in the respective active set of warm words for each user of the multiple users present in the environment of the AED, an acceptable false accept rate tolerance, or an acceptable false reject rate tolerance. In some implementations, executing the warm word arbitration routine includes ranking the warm words in the respective active set of warm words from highest priority to lowest priority based on warm word prioritization signals for each corresponding user of the multiple users, and enabling the final set of warm words for detection by the AED based on the rankings of the warm words in the respective active set of warm words for each corresponding user of the multiple users. In some examples, executing the warm word arbitration routine includes identifying any shared warm words corresponding to warm words present in the respective active set of warm words for each of at least two of the multiple users and determining the final set of warm words is based on assigning, for inclusion in the final set of warm words, a higher priority to warm words identified as shared warm words.
In some implementations, executing the warm word arbitration routine includes determining a warm word affinity score for each user and enabling the final set of warm words for detection by the AED based on the warm word affinity score determined for each corresponding user. Here, the warm word affinity score determined for each corresponding user may be based on at least one of: frequency of warm word usage by the corresponding user; frequency of interactions between the corresponding user and the digital assistant; duration of presence of the corresponding user in the environment of the AED; a proximity of the user relative to the AED; or a current user context.
In some examples, the operations further include determining that the warm word detected in the audio data corresponding to the utterance captured by the user device comprises a speaker-specific warm word selected from the respective active set of warm words for each of one or more users among the multiple users, such that the digital assistant only performs the respective action specified by the speaker-specific warm word when the speaker-specific warm word is spoken by any of the one or more users corresponding to the respective active set of warm words. Based on determining that the warm word detected in the audio data includes a speaker-specific warm word, the operations include performing speaker verification on the audio data to determine that the utterance was spoken by one of the one or more users from whose respective active set of warm words the detected warm word was selected. Here, instructing the digital assistant to perform the respective action specified by the detected warm word is based on the speaker verification performed on the audio data.
In some implementations, the final set of warm words is enabled for detection by activating, for each warm word in the final set of warm words, a respective warm word model to run on the assistant-enabled device, and detecting, in the audio data, the warm word from the final set of warm words includes detecting, using the activated respective warm word model, the warm word in the audio data without performing speech recognition on the audio data. In these implementations, detecting the warm word in the audio data may include extracting audio features of the audio data, generating, using the activated respective warm word model, a warm word confidence score by processing the extracted audio features, and determining that the audio data corresponding to the utterance includes the warm word when the warm word confidence score satisfies a warm word confidence threshold. Alternatively, the final set of warm words may be enabled for detection by executing a speech recognizer on the AED, the speech recognizer biased to recognize the warm words in the final set of warm words, and detecting, in the audio data, the warm word from the final set of warm words may include recognizing, using the speech recognizer executing on the AED, the warm word in the audio data.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION

A user's manner of interacting with an assistant-enabled device is primarily, if not exclusively, by means of voice input. For example, a user may ask a device to perform an action including media playback (e.g., music or podcasts), where the device responds by initiating playback of audio that matches the user's criteria. Consequently, the assistant-enabled device must have some way of discerning when any given utterance in a surrounding environment is directed toward the device as opposed to being directed to an individual in the environment or originating from a non-human source (e.g., a television or music player). One way to accomplish this is to use a hotword, which by agreement among the users in the environment, is reserved as a predetermined word(s) that is spoken to invoke the attention of the device. In an example environment, the hotword used to invoke the assistant's attention is the phrase "OK computer." Consequently, each time the words "OK computer" are spoken, the utterance is picked up by a microphone and conveyed to a hotword detector, which performs speech modeling techniques to determine whether the hotword was spoken and, if so, awaits an ensuing command or query. Accordingly, utterances directed at an assistant-enabled device take the general form [HOTWORD] [QUERY], where "HOTWORD" in this example is "OK computer" and "QUERY" can be any question, command, declaration, or other request that can be speech recognized, parsed and acted on by the system, either alone or in conjunction with the server via the network.
In cases where the user provides a sequence of several hotword-based commands to an assistant-enabled device, such as a mobile phone or smart speaker, the user's interaction with the phone or speaker may become awkward. The user may speak, "Ok computer, play my homework playlist." The phone or speaker may begin to play the first song on the playlist. The user may wish to advance to the next song and speak, "Ok computer, next." To advance to yet another song, the user may speak, "Ok computer, next," again. To alleviate the need to keep repeating the hotword before speaking a command, the assistant-enabled device may be configured to recognize/detect a narrow set of hotphrases or warm words to directly trigger respective actions. In the example, the warm word "next" serves the dual purpose of a hotword and a command so that the user can simply utter "next" to invoke the assistant-enabled device to trigger performance of the respective action instead of uttering "Ok computer, next." Other non-limiting warm words and hotphrases may include "what's the weather?", "set a timer", "volume up", and "volume down".
A set of warm words can be active for controlling a long-standing operation. As used herein, a long-standing operation refers to an application or event that a digital assistant performs for an extended duration and one that can be controlled by the user while the application or event is in progress. For instance, when a digital assistant sets a timer for 30-minutes, the timer is a long-standing operation from the time of setting the timer until the timer ends or a resulting alert is acknowledged after the timer ends. In this instance, a warm word such as “stop timer” could be active to allow the user to stop the timer by simply speaking “stop timer” without first speaking the hotword. Likewise, a command instructing the digital assistant to play music from a streaming music service is a long-standing operation while the digital assistant is streaming music from the streaming music service through a playback device. In this instance, an active set of warm words can be “pause”, “pause music”, “volume up”, “volume down”, “next”, “previous”, etc., for controlling playback of the music the digital assistant is streaming through the playback device. The long-standing operation may include multi-step dialogue queries such as “book a restaurant” in which different sets of warm words will be active depending on a given stage of the multi-step dialogue. For instance, the digital assistant may prompt a user to select from a list of restaurants, whereby a set of warm words may become active that each include a respective identifier (e.g., restaurant name or number in list) for selecting a restaurant from the list and complete the action of booking a reservation for that restaurant.
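The relationship described above between a long-standing operation and its active warm words can be sketched as a lookup, with dialogue stage selecting among sets for multi-step queries. The words are drawn from the examples above; the structure and names are illustrative assumptions.

```python
# Illustrative mapping from a long-standing operation to the warm words
# that become active while the operation is in progress.
ACTIVE_WARM_WORDS = {
    "timer": ["stop timer"],
    "music_playback": ["pause", "pause music", "volume up",
                       "volume down", "next", "previous"],
}


def active_set(operation: str, stage: str = None) -> list:
    """Return the warm words active for an operation (hypothetical).

    For multi-step dialogue queries such as booking a restaurant, the
    active set depends on the current stage, e.g. identifiers for
    selecting from a prompted list of restaurants.
    """
    if operation == "book_restaurant" and stage == "select_restaurant":
        return ["one", "two", "three"]   # assumed list identifiers
    return ACTIVE_WARM_WORDS.get(operation, [])
```

When the long-standing operation ends (the timer is acknowledged, the music stops), its entries would be removed from the user's active set, which in turn would trigger re-arbitration.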
One challenge with warm words is limiting the number of words/phrases that are simultaneously active so that quality and efficiency are not degraded. For instance, the number of false positives, indicating when the assistant-enabled device incorrectly detected/recognized one of the active words, greatly increases with the number of warm words that are simultaneously active. Moreover, a user who issues a command to initiate a long-standing operation cannot prevent others from speaking active warm words for controlling that operation.
Since warm words and/or hotphrases may be active for extended periods of time (e.g., always-on), there may often be a limit in terms of how many different words/phrases may be enabled for detection by an assistant-enabled device (AED) at any given time. For instance, a computational budget based on computing resource constraints due to processing capabilities and memory availability on the AED may impact how many different warm words can be enabled at any given time. Consideration of the computational budget is especially important for battery-powered devices since the models for recognizing/detecting warm words and/or hotphrases typically run on digital signal processors (DSPs). Another challenge with warm words is limiting the number of words/phrases that are simultaneously enabled so that quality and efficiency are not degraded. For instance, the number of false positives (also referred to as the 'false alarm rate'), indicating when the AED incorrectly detected/recognized one of the warm words, greatly increases with the number of warm words that are simultaneously enabled for detection by the AED.
As the total number of different warm words that can be enabled for detection is typically limited due to one or more of the factors mentioned above, difficulties arise on shared AEDs that aim to serve multiple users (e.g., household family members sharing an assistant-enabled device), since different users may have divergent preferences that each may depend on different sets of warm words being active. For instance, one user may wish to issue commands for controlling music playback on the AED, while another user sharing the same AED may wish to issue messaging commands for facilitating communication of messages between that user and some remote recipient. Moreover, some of the users may have different tolerance preferences with respect to consequences of false-accepting (and/or false-rejecting) detection of warm words. For instance, some users may tolerate an over-trigger of "stop" while other users may not. Ideally, the AED aims to enable as many of the warm words in each active set of warm words as the computational budget of the AED permits at any given time.
Implementations herein are directed toward detecting a presence of multiple users within an environment of the AED and obtaining, for each user, a respective active set of warm words that each specify a respective action for the digital assistant to perform. Based on the respective active set of warm words obtained for each user, implementations are further directed toward executing a warm word arbitration routine to enable a final set of warm words for detection by the AED, whereby the final set of warm words includes warm words selected from the respective active set of warm words for at least one user among the multiple users detected within the environment of the AED.
As used herein, an "active" warm word includes a warm word that a user prefers to have the AED enable for detection/recognition without requiring the user to speak a predefined hotword to "wake up" the AED, whereas a warm word in the final set of warm words enabled for detection is affirmatively enabled for the AED to detect/recognize. Thus, as long as the computational budget of the AED permits, the warm word arbitration routine will aim to include all of the active warm words in the final set of warm words enabled for detection by the AED. Otherwise, when the total number of warm words permissible in the final set of warm words is less than the total number of active warm words from the respective active sets of warm words, the warm word arbitration routine is tasked with selecting for inclusion in the final set of warm words those active warm words that are ranked with higher priorities than those not selected. As a result, there may be conditions and scenarios where a warm word found in the active set of warm words for a given user is ultimately not included in the final set of warm words, and thus not enabled for detection by the AED, such that the unselected warm word would not be recognized/detected when spoken in an utterance unless the utterance also included a predefined hotword (e.g., "Hey Computer"). Accordingly, the warm word arbitration routine aims to dynamically select warm words for inclusion in the final set of warm words on an ongoing basis in order to maximize the total number of warm words permissible in the final set of warm words enabled for detection by the AED.
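Selecting the final set under a budget, as described above, can be sketched as merging every user's active set, scoring each word, and keeping the highest-priority words up to the budget. The scoring (a base priority plus a bonus for warm words shared across users, per the shared-warm-word example earlier) is an illustrative assumption, as is every name below.

```python
def arbitrate(active_sets: dict, budget: int, base_priority: dict) -> list:
    """Hypothetical warm word arbitration sketch.

    `active_sets` maps each present user to that user's active warm
    words; `base_priority` maps a warm word to an assumed prioritization
    score. Warm words appearing in at least two users' active sets are
    treated as shared and receive a priority bonus. Ties break
    alphabetically so the result is deterministic.
    """
    # Count how many users' active sets contain each warm word.
    counts = {}
    for words in active_sets.values():
        for word in set(words):          # de-duplicate within one user
            counts[word] = counts.get(word, 0) + 1

    def rank(word):
        shared_bonus = 1.0 if counts[word] > 1 else 0.0
        return (-(base_priority.get(word, 0.0) + shared_bonus), word)

    # Keep only as many warm words as the computational budget permits.
    return sorted(counts, key=rank)[:budget]
```

Under this sketch, a word two users both want ("stop") outranks a higher base-priority word wanted by only one user, matching the intuition that shared warm words serve more of the household.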
In some implementations, for each warm word 112 in the final set of warm words 112F, the AED 104 additionally receives a respective warm word model 330 configured to detect the corresponding warm word 112 in streaming audio without performing speech recognition. For example, the AED 104 (and/or the server 130) additionally includes one or more warm word models 330. Here, the warm word models 330 may be stored on the memory hardware 12 of the AED 104 or the remote memory hardware 134 on the server 130. If stored on the server 130, the AED 104 may request the server 130 to retrieve a warm word model 330 for a corresponding warm word 112 and provide the retrieved warm word model 330 so that the AED 104 (via the warm word arbitration routine 401) can enable the warm word model 330. An enabled warm word model 330 running on the AED 104 may detect an utterance of the corresponding warm word 112 in streaming audio captured by the AED 104 without performing speech recognition on the captured audio. Further, a single warm word model 330 may be capable of detecting all of the warm words 112 from the final set of warm words 112F in streaming audio.
In some configurations, the AED 104 receives code associated with an application loaded on the AED 104 (e.g., a music application running in the foreground or background of the AED 104) to identify any warm words 112 and associated warm word models 330 that developers of the application want users 102 to be able to speak to interact with the application and the respective actions for each warm word 112. In other examples, the AED 104 receives, for at least one warm word 112 in the respective active set of warm words 112A for at least one user 102, via a warm word application programming interface (API) executing on the AED 104, a respective warm word model 330 configured to detect the corresponding warm word 112 in streaming audio without performing speech recognition. The warm words 112 in the registry may also relate to follow-up queries that the user 102 (or typical users) tend to issue following the given query, e.g., “Ok computer, play my music playlist”.
In additional implementations, enabling the final set of warm words 112F causes the AED 104 to execute the speech recognizer 116 in a low-power and low-fidelity state. Here, the speech recognizer 116 is constrained or biased to only recognize the warm words in the final set of warm words 112F when spoken in utterances captured by the AED 104. Since the speech recognizer 116 is only recognizing a limited number of terms/phrases, the number of parameters of the speech recognizer 116 may be drastically reduced, thereby reducing the memory requirements and number of computations needed for recognizing the active warm words in speech. Accordingly, the low-power and low-fidelity characteristics of the speech recognizer 116 may be suitable for execution on a digital signal processor (DSP). In these implementations, the speech recognizer 116 executing on the AED 104 may recognize an utterance 106 of a warm word 112 in streaming audio captured by the AED 104 in lieu of using a warm word model 330. In some examples, detection of a warm word 112 by a corresponding warm word model 330 is confirmed by the speech recognizer 116 performing speech recognition on the audio data.
In the example shown, the AED 104 includes a smart speaker. However, the AED 104 can include other computing devices, such as, without limitation, a smart phone, tablet, smart display, headset, desktop/laptop, smart watch, smart appliance, headphones, other wearables, or vehicle infotainment device. The AED 104 includes data processing hardware 10 and memory hardware 12 storing instructions that when executed on the data processing hardware 10 cause the data processing hardware 10 to perform operations. The AED 104 includes an array of one or more microphones 16 configured to capture acoustic sounds such as speech directed toward the AED 104. The AED 104 may also include, or be in communication with, an audio output device (e.g., speaker) 18 that may output audio such as music 122 and/or synthesized speech from the digital assistant 105. In some configurations, the AED 104 also includes, or is in communication with, a display 13 configured to display content from various sources. Moreover, the AED 104 may include, or be in communication with, one or more cameras 19 configured to capture images within the environment and output image data 412 (
In some configurations, the AED 104 is in communication with multiple user devices 50 associated with the multiple users 102. In the examples shown, the second and third users 102b, 102c each include a respective user device 50 that includes a smart phone that the respective user 102 may interact with. However, the user device 50 can include other computing devices, such as, without limitation, a smart watch, smart display, smart glasses/headset, tablet, smart appliance, headphones, a computing device, a smart speaker, or another assistant-enabled device. Each user device 50 may include at least one microphone 52 residing on the user device 50 and in communication with the AED 104. In these configurations, the user device 50 may also be in communication with the one or more microphones 16 residing on the AED 104. Additionally, the multiple users 102 may control and/or configure the AED 104, as well as interact with the digital assistant 105, using an interface 200, such as a graphical user interface (GUI) 200 rendered for display on a respective screen of each user device 50.
With continued reference to
In additional implementations, the user detector 410 automatically detects one or more of the multiple users 102 in the environment by receiving image data 412 corresponding to a scene of the environment and obtained by the camera 19. Here, the user detector 410 detects the multiple users 102 based on the received image data 412. The user detector 410 may detect the users 102 based on the image data 412 anonymously without uniquely identifying the users 102. In some implementations, the user detector 410 performs facial recognition on the received image data 412 and attempts to uniquely identify each user 102 based on the facial recognition performed. In these implementations, each user 102 expressly grants privileges to the digital assistant 105 to perform facial recognition and each user 102 has the option to revoke the granted privilege at any time.
Similarly, the user detector 410 may detect the multiple users 102 in the environment by performing speaker identification (
In some implementations, the user detector 410 resolves an identity of each of the multiple users. In some scenarios, a user 102 is identified as an enrolled user 200 of the AED 104 that is authorized to access or control various functions of the AED 104 and digital assistant 105. The AED 104 may have multiple different enrolled users 200 each having registered user accounts indicating particular permissions or rights regarding functionality of the AED 104. For instance, the AED 104 may operate in a multi-user environment such as a household with multiple family members, whereby each family member corresponds to an enrolled user 200 having permissions for accessing a different respective set of resources. To illustrate, a mother named Barb speaking the command “play my music playlist” would result in the digital assistant 105 streaming music from a rock music playlist associated with the mother, as opposed to a different music playlist created by, and associated with, another enrolled user 200 of the household such as a teenage daughter whose playlist includes pop music.
In some examples, the enrolled speaker vector 154 for an enrolled user 200 includes a text-dependent enrolled speaker vector. For instance, the text-dependent enrolled speaker vector may be extracted from one or more audio samples of the respective enrolled user 200 speaking a predetermined term such as the hotword 110 (e.g., “Ok computer”) used for invoking the AED 104 to wake-up from a sleep state. In other examples, the enrolled speaker vector 154 for an enrolled user 200 is text-independent obtained from one or more audio samples of the respective enrolled user 200 speaking phrases with different terms/words and of different lengths. In these examples, the text-independent enrolled speaker vector may be obtained over time from audio samples obtained from speech interactions the user 102 has with the AED 104 or other device linked to the same account.
Additionally, the AED 104 (and/or server 120) may optionally store one or more other text-dependent speaker vectors 158 each extracted from one or more audio samples of the respective enrolled user 200 speaking a specific term or phrase. For example, the enrolled user 200a may include a respective text-dependent speaker vector 158 for each of one or more warm words 112 that, when enabled for detection by the AED 104, may be spoken to cause the AED 104 to perform a respective action for controlling a long-standing operation or perform some other command. Accordingly, a text-dependent speaker vector 158 for a respective enrolled user 200 represents speech characteristics of the respective enrolled user 200 speaking the specific warm word 112. A text-dependent speaker vector 158 stored for a respective enrolled user 200 that is associated with a specific warm word 112 may be used to verify the respective enrolled user 200 speaking the specific warm word 112 to command the AED 104 to perform an action for controlling a long-standing operation.
The warm word preferences 212 stored for each respective enrolled user may further include acceptable false accept rate tolerances and/or acceptable false reject rate tolerances for his/her active set of warm words. These tolerances can dictate how sensitive resulting warm word models are for detecting the presence of the warm words in speech. These tolerances may be included within the enabled warm word constraints 430 received by the warm word arbitration routine 401 when enabling the final set of warm words 112F for detection by the AED 104.
In some examples, the hotword detector 108 is configured to identify hotwords that are in the initial portion of the utterance 106. In this example, the hotword detector 108 may determine that the utterance 106 “Ok computer, play my music playlist” includes the hotword 110 “ok computer” if the hotword detector 108 detects acoustic features in the audio data 402 that are characteristic of the hotword 110. The acoustic features may be mel-frequency cepstral coefficients (MFCCs) that are representations of short-term power spectrums of the utterance 106 or may be mel-scale filterbank energies for the utterance 106. For example, the hotword detector 108 may detect that the utterance 106 “Ok computer, play music” includes the hotword 110 “ok computer” based on generating MFCCs from the audio data 402 and classifying that the MFCCs include MFCCs that are similar to MFCCs that are characteristic of the hotword “ok computer” as stored in a hotword model of the hotword detector 108. As another example, the hotword detector 108 may detect that the utterance 106 “Ok computer, play music” includes the hotword 110 “ok computer” based on generating mel-scale filterbank energies from the audio data 402 and classifying that the mel-scale filterbank energies include mel-scale filterbank energies that are similar to mel-scale filterbank energies that are characteristic of the hotword “ok computer” as stored in the hotword model of the hotword detector 108.
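The MFCC-similarity classification step can be pictured with a simplified sketch. Here a single averaged feature vector is compared against a stored template via cosine similarity; a real hotword model would instead be a trained classifier over frame sequences, and the 0.85 threshold is a hypothetical value:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def detect_hotword(frame_features, template, threshold=0.85):
    """Average per-frame MFCC-like features and compare to a hotword template.

    frame_features: list of per-frame feature vectors from the audio data.
    template: stored feature vector characteristic of the hotword.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    mean = [sum(f[i] for f in frame_features) / n for i in range(dim)]
    return cosine_similarity(mean, template) >= threshold
```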
When the hotword detector 108 determines that the audio data 402 that corresponds to the utterance 106 includes the hotword 110, the AED 104 may trigger a wake-up process to initiate speech recognition on the audio data 402 that corresponds to the utterance 106. For example, a speech recognizer 116 running on the AED 104 may perform speech recognition or semantic interpretation on the audio data 402 that corresponds to the utterance 106. The speech recognizer 116 may perform speech recognition on the portion of the audio data 402 that follows the hotword 110. In this example, the speech recognizer 116 may identify the words “play my music playlist” as a command 118 following the hotword 110.
In some implementations, the speech recognizer 116 is located on a server 120 in addition to, or in lieu of, the AED 104. Upon the hotword detector 108 triggering the AED 104 to wake-up responsive to detecting the hotword 110 in the utterance 106, the AED 104 may transmit the audio data 402 corresponding to the utterance 106 to the server 120 via a network 132. The AED 104 may transmit the portion of the audio data 402 that includes the hotword 110 for the server 120 to confirm the presence of the hotword 110. Alternatively, the AED 104 may transmit only the portion of the audio data 402 that corresponds to the portion of the utterance 106 after the hotword 110 to the server 120. The server 120 executes the speech recognizer 116 to perform speech recognition and returns a transcription of the audio data 402 to the AED 104. In turn, the AED 104 identifies the words in the utterance 106, and the AED 104 performs semantic interpretation and identifies any speech commands. The AED 104 (and/or the server 120) may identify the command for the digital assistant 105 to perform the long-standing operation of “play music”. In the example shown, the digital assistant 105 begins to perform the long-standing operation of playing music 122 as playback audio from the speaker 18 of the AED 104. The digital assistant 105 may stream the music 122 from a streaming service (not shown) or the digital assistant 105 may instruct the AED 104 to play music stored on the AED 104.
The AED 104 (and/or the server 120) may include an operation identifier 124 configured to identify one or more long-standing operations the digital assistant 105 is currently performing. For each long-standing operation the digital assistant 105 is currently performing, the warm word selector 400 via the warm word arbitration routine 401 may select, for inclusion in the final set of warm words 112F, a corresponding set of one or more warm words 112 each associated with a respective action for controlling the long-standing operation. In some examples, the warm word selector 400 accesses the warm word preferences 212 from the enrolled user data/information for enrolled users 200 of the AED 104 (e.g., stored on the memory hardware 12) or another registry or table that associates the identified long-standing operation with a corresponding set of one or more warm words 112 that are highly correlated with the long-standing operation. For example, if the long-standing operation corresponds to a set timer function, the associated set of one or more warm words 112 available for the warm word selector 400 to activate includes the warm word 112 “stop timer” for instructing the digital assistant 105 to stop the timer. Similarly, for the long-standing operation of “Call [contact name]”, the associated set of warm words 112 includes a “hang up” and/or “end call” warm word(s) 112 for ending the call in progress. In the example shown, for the long-standing operation of playing music 122, the associated set of one or more warm words 112 available for the warm word selector 400 to activate includes the warm words 112 “next”, “pause”, “previous”, “volume up”, and “volume down” each associated with the respective action for controlling playback of the music 122 from the speaker 18 of the AED 104.
Accordingly, the warm word selector 400 may determine to include these warm words 112 in the final set of warm words 112F enabled for detection by the AED 104 while the digital assistant 105 is performing the long-standing operation and may disable these warm words 112 once the long-standing operation ends. Similarly, the warm word arbitration routine 401 may enable/disable different warm words 112 depending on a state of the long-standing operation in progress. For example, if the user speaks “pause” to pause the playback of music 122, the warm word arbitration routine 401 may add a warm word 112 for “play” into the final set of warm words 112F to resume the playback of the music 122. In some configurations, instead of accessing a registry or the warm word preferences 212 from the enrolled user data/information for enrolled users 200, the warm word selector 400 examines code associated with an application of the long-standing operation (e.g., a music application running in the foreground or background of the AED 104) to identify any warm words 112 that developers of the application want users 102 to be able to speak to interact with the application and the respective actions for each warm word 112.
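The state-dependent enabling and disabling of warm words 112 can be sketched as a small state machine. The state names, transitions, and word lists below are illustrative assumptions built around the disclosure's “pause”/“play” example:

```python
class PlaybackWarmWords:
    """Sketch of state-dependent warm word enablement for a music operation.

    Hypothetical states and transitions; the disclosure only describes
    swapping "play" in for "pause" when playback is paused.
    """

    STATE_WORDS = {
        "playing": {"next", "pause", "previous", "volume up", "volume down"},
        "paused": {"play", "next", "previous", "volume up", "volume down"},
    }
    TRANSITIONS = {"pause": "paused", "play": "playing"}

    def __init__(self):
        self.state = "playing"

    def enabled(self):
        """Warm words currently enabled for the long-standing operation."""
        return self.STATE_WORDS[self.state]

    def on_warm_word(self, word):
        """Apply a detected warm word and return the updated enabled set."""
        if word in self.TRANSITIONS:
            self.state = self.TRANSITIONS[word]
        return self.enabled()
```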
In some implementations, after adding the warm words 112 correlated with the long-standing operation for inclusion in the final set of warm words 112F enabled for detection, the digital assistant 105 associates these warm words 112 with only the user 102 that spoke the utterance 106 with the command 118 for the digital assistant 105 to perform the long-standing operation. That is, the digital assistant 105 configures a portion of the warm words 112 in the final set of warm words 112F to be speaker-specific such that they are dependent on a speaking voice of the particular user 102 that provided the initial command 118 to initiate the long-standing operation. As will become apparent, by making these warm words 112 dependent on the speaking voice of the particular user 102, the AED 104 (e.g., via the digital assistant 105) will only perform the respective action specified by one of the warm words 112 when the warm word is spoken by the particular user 102, thereby suppressing performance (or at least requiring approval from the particular user 102) of the respective action when the warm word 112 is spoken by a different speaker.
Referring to
Once the first speaker-discriminative vector 511 is output from the model 510, the speaker identification process 500 determines whether the extracted speaker-discriminative vector 511 matches any of the enrolled speaker vectors 154 stored on the AED 104 (e.g., in the memory hardware 12) for enrolled users 200a-n (
In some implementations, the speaker identification process 500 uses a comparator 520 that compares the first speaker-discriminative vector 511 to the respective enrolled speaker vector 154 associated with each enrolled user 200a-n of the AED 104. Here, the comparator 520 may generate a score for each comparison indicating a likelihood that the utterance 106 corresponds to an identity of the respective enrolled user 200, and the identity is accepted when the score satisfies a threshold. When the score does not satisfy the threshold, the comparator may reject the identity. In some implementations, the comparator 520 computes a respective cosine distance between the first speaker-discriminative vector 511 and each enrolled speaker vector 154 and determines the first speaker-discriminative vector 511 matches one of the enrolled speaker vectors 154 when the respective cosine distance satisfies a cosine distance threshold.
In some examples, the first speaker-discriminative vector 511 is a text-dependent speaker-discriminative vector extracted from a portion of the audio data that includes the hotword 110 and each enrolled speaker vector 154 is also text-dependent on the same hotword 110. The use of text-dependent speaker vectors can improve accuracy in determining whether the first speaker-discriminative vector 511 matches any of the enrolled speaker vectors 154. In other examples, the first speaker-discriminative vector 511 is a text-independent speaker-discriminative vector extracted from the entire audio data that includes both the hotword 110 and the command 118 or from the portion of the audio data that includes the command 118.
When the speaker identification process 500 determines that the first speaker-discriminative vector 511 matches one of the enrolled speaker vectors 154, the process 500 identifies the user 102 that spoke the utterance 106 as the respective enrolled user 200 associated with the one of the enrolled speaker vectors 154 that matches the extracted speaker-discriminative vector 511. In the example shown, the comparator 520 determines the match based on the respective cosine distance between the first speaker-discriminative vector 511 and the enrolled speaker vector 154 associated with the first enrolled user 200a satisfying a cosine distance threshold. In some scenarios, the comparator 520 identifies the user 102 as the respective first enrolled user 200a associated with the enrolled speaker vector 154 having the shortest respective cosine distance from the first speaker-discriminative vector 511, provided this shortest respective cosine distance also satisfies the cosine distance threshold.
Conversely, when the speaker identification process 500 determines that the first speaker-discriminative vector 511 does not match any of the enrolled speaker vectors 154, the process 500 may identify the user 102 that spoke the utterance 106 as a guest user of the AED 104. Accordingly, the user detector 410 may associate the activated set of one or more warm words 112 with the guest user and use the first speaker-discriminative vector 511 as a reference speaker vector representing the speech characteristics of the voice of the guest user. In some instances, the guest user could enroll with the AED 104 and the AED 104 could store the first speaker-discriminative vector 511 as a respective enrolled speaker vector 154 for the newly enrolled user.
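The comparator's matching logic, including the guest-user fallback, can be sketched as follows. The cosine-distance threshold of 0.4 is a hypothetical value; real thresholds would be tuned to the acceptable false accept/reject tolerances:

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two speaker vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def identify_speaker(discriminative_vector, enrolled_vectors, threshold=0.4):
    """Return the enrolled user whose speaker vector is closest, else None.

    enrolled_vectors: dict mapping enrolled user id -> enrolled speaker vector.
    Returning None corresponds to treating the speaker as a guest user.
    """
    best_user, best_dist = None, float("inf")
    for user, vector in enrolled_vectors.items():
        dist = cosine_distance(discriminative_vector, vector)
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist <= threshold else None
```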
In the example shown in
Referring to
Additionally, the GUI 300a may render for display an identifier of the long-standing operation (e.g., “Playing Track 1”), an identifier of the AED 104 (e.g., smart speaker) that is currently performing the long-standing operation, and/or an identity of the active user 102 (e.g., Barb) that initiated the long-standing operation. In some implementations, the identity of the active user 102 includes an image 304 of the active user 102. Accordingly, by identifying the active user 102 and the active warm words 112, the GUI 300a reveals the active user 102 as a “controller” of the long-standing operation that may speak any of the active warm words 112 displayed in GUI 300a to perform a respective action for controlling the long-standing operation. As mentioned above, the warm words in the active set of warm words 112 related to controlling the long-standing operation may optionally be dependent on the speaking voice of Barb 102, since Barb 102 seeded the initial command 118 “play music,” to initiate the long-standing operation. By making the active set of warm words 112 dependent on the speaking voice of Barb 102, the AED 104 (e.g., via the digital assistant 105) will only perform a respective action associated with one of the warm words 112 when the warm word 112 is spoken by Barb 102, and will suppress performance (or at least require approval from Barb 102) of the respective action when the warm word 112 is spoken by a different speaker.
The user device 50 may also render graphical elements 302 for display in the GUI 300a for performing the respective actions associated with the respective active warm words 112 to control playback of the music 122 from the speaker 18 of the AED 104. In the example shown, the graphical elements 302 are associated with playback controls for the long-standing operation of playing music 122, that when selected, cause the device 50 to perform a respective action. For instance, the graphical elements 302 may include playback controls for performing the action associated with the warm word 112 “next,” performing the action associated with the warm word 112 “pause,” performing the action associated with the warm word 112 “previous,” performing the action associated with the warm word 112 “volume up,” and performing the action associated with the warm word 112 “volume down.” The GUI 300a may receive user input indications via any one of touch, speech, gesture, gaze, and/or an input device (e.g., mouse or stylus) to control the playback of the music 122 from the speaker 18 of the AED 104. For example, the user 102 may provide a user input indication indicating selection of the “next” control (e.g., by touching the graphical button in the GUI 300a that universally represents “next”) to cause the AED 104 to perform the action of advancing to the next song in the playlist associated with the music 122.
Referring to
With continued reference to the GUI 300b of
The GUI 300b of
With continued reference to
Referring back to
Referring to
In some examples, the warm word arbitration routine 401 executes in response to the user detector 410 detecting the presence of a new user within the environment of the AED 104. Notably, since the AED 104 obtains a respective active set of warm words 112A for any new user detected, the warm word arbitration routine 401 executes to determine whether the final set of warm words 112F should be updated to include any of the active warm words for the new user. By the same notion, updating the final set of warm words 112F may include removing some warm words from the final set to make room (i.e., in terms of processing/memory capability and/or false acceptance thresholds) for any of the new active warm words for the new user.
Additionally or alternatively, the warm word arbitration routine 401 may execute in response to the AED 104 determining that one of the previously detected users is no longer present within the environment of the AED 104. For instance,
In some implementations, the AED 104 determines a change in ambient context and the warm word arbitration routine 401 executes in response to determining the change in ambient context. In some examples, the warm word selector 400 receives the ambient context signal 440 from a source indicating the change in ambient context. For example, the smart doorbell may provide the ambient context signal 440 indicating the occurrence of one or more visitor presence conditions and the oven could provide the ambient context signal 440 when the oven has reached the pre-heated temperature. An incoming call to one or more enrolled users 200 of the AED 104 could also serve as an ambient context signal 440 that causes the warm word arbitration routine 401 to potentially update the final set of warm words 112F to include new warm words related to the incoming phone call event such as, without limitation, warm words such as ‘answer’ or ‘ignore’ that when spoken would cause the AED 104 to answer or ignore the incoming call.
Execution of the warm word arbitration routine 401 may include obtaining enabled warm word constraints 430 and determining a number of warm words to enable in the final set of warm words 112F for detection by the AED 104 based on the enabled warm word constraints 430. The warm word constraints 430 may permit the warm word arbitration routine 401 to determine a warm word capacity for the AED 104 at any given time. The warm word constraints 430 may include at least one of memory and computing resource availability of the AED 104 for detection of warm words, computational requirements for enabling each warm word in the respective active set of warm words 112A for each user of the multiple users present in the environment of the AED, an acceptable false accept rate tolerance, or an acceptable false reject rate tolerance. The acceptable false accept and reject rate tolerances could be based on global settings, device settings, or be user-defined on a per-user basis. In some examples, the acceptable false accept and reject rate tolerances can be dynamic where a fading of warm word detection sensitivity occurs when a user is less active or further from the device.
The memory and computing resource availability may take into account whether the AED 104 is battery powered, and if so, the current capacity of the battery. The memory and computing resource availability may further take into account current warm words in the final set of warm words, account applications currently running on the AED 104 in addition to processing capabilities and storage capacity.
The computational requirements may determine memory and processing resources required to execute a respective warm word model associated with each warm word in the active sets of warm words 112A. For instance, the computational requirements could include warm word model size/parameters for each warm word as well as whether any of the warm words are speaker-specific, requiring performance of the speaker identification process 500. Additionally or alternatively, the computational requirements may determine memory and processing resources required to execute the speech recognizer 116 in a low-power mode sufficient for only recognizing utterances of the warm words. Notably, a low acceptable false accept rate tolerance may require more processing of a respective warm word model for detecting a warm word than a higher acceptable false accept rate tolerance.
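One way to picture how the enabled warm word constraints 430 bound the size of the final set of warm words 112F is a simple budget calculation. Both formulas below (a memory division and an additive false-accept budget) are illustrative assumptions, not the disclosure's method:

```python
def warm_word_budget(memory_bytes_free, model_bytes_each,
                     max_false_accept_rate, per_word_false_accept_rate,
                     hard_cap=10):
    """Hypothetical capacity check: how many warm word models the AED can enable.

    Each enabled model consumes memory, and each additional model adds to the
    aggregate false-accept exposure; the number of warm words enabled is
    limited by whichever constraint binds first.
    """
    by_memory = memory_bytes_free // model_bytes_each
    # Small epsilon guards against floating-point rounding in the division.
    by_far = int(max_false_accept_rate / per_word_false_accept_rate + 1e-9)
    return max(0, min(by_memory, by_far, hard_cap))
```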
In some implementations, execution of the warm word arbitration routine 401 includes ranking the warm words in the respective active set of warm words from highest priority to lowest priority for each corresponding user 102 of the multiple users 102 based on warm word prioritization signals 413 and enabling the final set of warm words 112F for detection by the AED 104 based on the rankings of the warm words in the respective active sets of warm words 112A for each corresponding user 102. Here, the warm word prioritization signals 413 may include at least one of usage frequency by the corresponding user of each warm word in the respective active set of warm words 112A, a current state of the AED 104, ambient context of the AED 104, or co-presence information indicating prior warm word usage and/or operations performed by the AED when the corresponding user was previously present in the environment with one or more combinations of other ones of the multiple users. The current state of the AED may include any long-standing operations the digital assistant is currently performing on the AED 104. For instance, if the digital assistant is playing back music on a highest volume setting the prioritization signal 413 may cause the warm word arbitration routine 401 to rank the warm words ‘play’ and ‘volume up’ in the active set of warm words 112A with a low priority since it is unlikely the user will speak these warm words. The ambient context may include activity recognition performed by the AED based on one or more signals. For instance,
The co-presence information may indicate how warm words are utilized by the corresponding user in the presence of other users. For instance, the first user 102a and the second user 102b may each often speak the warm word “play music” in the presence of one another, whereas the first user 102a seldom speaks the warm word ‘play music’ in the presence of the third user 102c.
The arbitration routine 401 may select higher priority warm words from the ranked active sets of warm words for all of the users 102 for inclusion in the final set of warm words 112F to enable for detection by the AED 104. As mentioned previously, the number of warm words included in the final set of warm words 112F is limited by the enabled warm word constraints 430. In some examples, to optimize execution of the arbitration routine 401, the arbitration routine 401 considers only warm words from the top-N warm words in the respective active set of warm words 112A ranked for each user. The value of N may be fixed or variable for different users. In examples implementing values of N that are variable among the different sets of users, the routine 401 may determine a warm word affinity score for each corresponding user 102 among the multiple users and enable the final set of warm words based on the warm word affinity score. The warm word affinity score for each corresponding user 102 may be based on a frequency of warm word usage by the corresponding user 102 and/or frequency of interactions between the corresponding user 102 and the digital assistant 105. Here, frequency of warm word usage may indicate how often the corresponding user 102 uses warm words when interacting with the digital assistant 105, while the frequency of interactions may indicate how often the corresponding user 102 interacts with the digital assistant 105 generally. The frequency of warm word usage and/or interactions may be further constrained by a current duration that the presence of the corresponding user 102 is detected within the environment of the AED 104, specific days of the week and/or times of day. For instance, a user may frequently ask “What's the weather” every morning around the same time. 
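The ranking-and-selection step can be sketched as follows. The per-word priority scores, the top-N truncation, and the greedy fill are illustrative assumptions; the disclosure names the prioritization signals 413 but not a specific formula:

```python
def arbitrate(active_sets, priorities, budget, top_n=3):
    """Sketch of arbitration: rank each user's warm words, keep the top-N per
    user, then fill the final set highest-priority first up to the budget.

    active_sets: dict mapping user id -> list of active warm words.
    priorities: dict mapping (user id, warm word) -> numeric priority.
    budget: number of warm words permitted by the enabled constraints.
    """
    candidates = []
    for user, words in active_sets.items():
        ranked = sorted(words, key=lambda w: priorities[(user, w)], reverse=True)
        for word in ranked[:top_n]:
            candidates.append((priorities[(user, word)], word))
    final = []
    for _, word in sorted(candidates, reverse=True):
        if word not in final:  # the same warm word may be active for two users
            final.append(word)
        if len(final) == budget:
            break
    return final
```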
The warm word affinity score may additionally or alternatively be based on at least one of a duration of presence of the corresponding user 102 in the environment of the AED 104, a proximity of the corresponding user 102 relative to the AED 104, or a current user context. For instance, proximity information 54 for a corresponding user may cause the arbitration routine 401 to assign a higher priority to warm words in the respective active sets of warm words for users closer to the AED 104 than other users since closer users may be more likely to interact with AEDs than users further away.
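A warm word affinity score combining usage frequency, duration of presence, and proximity might be sketched as below; the weights, the 30-minute saturation point, and the distance discount are all hypothetical values chosen for illustration:

```python
def affinity_score(warm_word_freq, interaction_freq, presence_minutes,
                   distance_meters):
    """Hypothetical affinity score for one user among the multiple users.

    Frequency signals are blended, then weighted by how long the user has
    been present and discounted with distance from the AED.
    """
    usage = 0.6 * warm_word_freq + 0.4 * interaction_freq
    presence = min(presence_minutes / 30.0, 1.0)  # saturates after 30 minutes
    proximity = 1.0 / (1.0 + distance_meters)     # nearer users score higher
    return usage * presence * proximity
```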
In the examples of
To illustrate how a combination of current user context and proximity affect the warm word affinity score for the third user 102c, the current user context and proximity for the third user 102c in
In addition to duration of presence, the arbitration routine 401 may further predict an expected likelihood of user presence in the future and determine the affinity score based on the expected likelihood of user presence. For instance, when the operation identifier 124 determines that the transcription output from the speech recognizer 116 for the second utterance 106b includes the command 118 ‘ . . . stop music and start commute routine when I get in the car’, the arbitration routine 401 may predict that the presence of the user 102a in the environment will likely end. As a result, the arbitration routine may start gradually reducing the affinity score for the first user 102a until the first user 102a is no longer detected as shown in
With continued reference to
Referring to
At operation 602, the method 600 includes detecting a presence of multiple users 102 within the environment of the AED 104. The AED 104 executes a digital assistant 105. The AED 104 may execute multiple digital assistants simultaneously in some configurations. At operation 604, for each user 102 of the multiple users 102, the method also includes obtaining a respective active set of warm words 112A that each specify a respective action for the digital assistant 105 to perform.
At operation 606, based on the respective active set of warm words for each user of the multiple users, the method 600 includes executing a warm word arbitration routine 401 to enable a final set of warm words 112F for detection by the AED 104. Here, the final set of warm words 112F enabled for detection by the AED 104 include warm words 112 selected from the respective active set of warm words 112A for at least one user 102 of the multiple users 102 detected within the environment of the AED 104.
At operation 608, while the final set of warm words are enabled for detection by the AED 104, the method includes receiving audio data 502 corresponding to an utterance 106 captured by the AED 104, detecting a warm word 112 from the final set of warm words 112F in the audio data 502, and instructing the digital assistant to perform the respective action specified by the detected warm word.
In some examples, the final set of warm words 112F are enabled for detection by activating, for each warm word in the final set of warm words 112F, a respective warm word model 330 to run on the AED. Here, the method uses the activated respective warm word model 330 to detect the warm word in the audio data 502 without performing speech recognition on the audio data 502. More specifically, detecting the warm word in the audio data may include extracting audio features in the audio data 502, generating a warm word confidence score by using the activated respective warm word model 330 to process the extracted audio features, and determining that the audio data corresponding to the utterance includes the warm word when the warm word confidence score satisfies a warm word confidence threshold. In some examples, the warm word confidence threshold may be ascertained for a corresponding enrolled user 200 by accessing the acceptable false accept rate tolerances and/or acceptable false reject rate tolerances stored in the warm word preferences 212 for the enrolled user 200. In additional examples, when the final set of warm words 112F includes two different warm words that are phonetically similar, the warm word confidence thresholds for detecting each of these warm words may be increased to reduce the propensity for false accepts in which a warm word model 330 wrongly detects one warm word when the phonetically similar warm word was actually spoken.
The computing device 700 includes a processor 710, memory 720, a storage device 730, a high-speed interface/controller 740 connecting to the memory 720 and high-speed expansion ports 750, and a low-speed interface/controller 760 connecting to a low-speed bus 770 and the storage device 730. Each of the components 710, 720, 730, 740, 750, and 760 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 710 (e.g., the data processing hardware 10, 132 of
The memory 720 stores information non-transitorily within the computing device 700. The memory 720 (e.g., the memory hardware 12, 134 of
The storage device 730 is capable of providing mass storage for the computing device 700. In some implementations, the storage device 730 is a computer-readable medium. In various different implementations, the storage device 730 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 720, the storage device 730, or memory on processor 710.
The high-speed controller 740 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 760 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 740 is coupled to the memory 720, the display 780 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 750, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 760 is coupled to the storage device 730 and a low-speed expansion port 790. The low-speed expansion port 790, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 700a or multiple times in a group of such servers 700a, as a laptop computer 700b, or as part of a rack server system 700c.
Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.
Claims
1. A computer-implemented method that when executed by data processing hardware causes the data processing hardware to perform operations comprising:
- detecting a presence of multiple users within an environment of an assistant-enabled device (AED), the AED executing a digital assistant;
- for each user of the multiple users, obtaining a respective active set of warm words that each specify a respective action for the digital assistant to perform;
- based on the respective active set of warm words for each user of the multiple users, executing a warm word arbitration routine to enable a final set of warm words for detection by the AED, the final set of warm words enabled for detection by the AED comprising warm words selected from the respective active set of warm words for at least one user of the multiple users detected within the environment of the AED; and
- while the final set of warm words are enabled for detection by the AED: receiving audio data corresponding to an utterance captured by the AED; detecting, in the audio data, a warm word from the final set of warm words; and instructing the digital assistant to perform the respective action specified by the detected warm word.
2. The computer-implemented method of claim 1, wherein detecting the presence of the multiple users within the environment of the AED comprises detecting at least one of the multiple users is present within the environment based on proximity information of the AED relative to a user device associated with the at least one of the multiple users.
3. The computer-implemented method of claim 1, wherein detecting the presence of the multiple users within the environment of the AED comprises:
- receiving image data corresponding to a scene of the environment; and
- detecting the presence of at least one of the multiple users within the environment based on the image data.
4. The computer-implemented method of claim 1, wherein detecting the presence of the multiple users within the environment of the AED comprises detecting the presence of a corresponding user within the environment of the AED based on:
- receiving voice data characterizing a voice query issued by the corresponding user and directed toward the digital assistant;
- performing speaker identification on the voice data to identify the corresponding user that issued the voice query; and
- determining that the identified corresponding user that issued the voice query is present within the environment of the AED.
5. The computer-implemented method of claim 4, wherein:
- the voice query issued by the corresponding user comprises a command that causes the digital assistant to perform a long-standing operation specified by the command; and
- obtaining the respective active set of warm words comprises adding, in response to the digital assistant performing the long-standing operation specified by the command, to the respective active set of warm words for the corresponding user that issued the voice query, one or more warm words specifying respective actions for controlling the long-standing operation.
6. The computer-implemented method of claim 1, wherein the operations further comprise:
- detecting the presence of a new user within the environment of the AED, wherein the warm word arbitration routine executes in response to detecting the presence of the new user within the environment.
7. The computer-implemented method of claim 1, wherein the operations further comprise:
- determining that one of the multiple users is no longer present within the environment of the AED,
- wherein the warm word arbitration routine executes in response to determining that the one of the multiple users is no longer present within the environment.
8. The computer-implemented method of claim 1, wherein the operations further comprise:
- determining addition of a new warm word to, or removal of one of the warm words from, the respective active set of warm words for one of the multiple users present in the environment of the AED,
- wherein the warm word arbitration routine executes in response to determining the addition of the new warm word to, or the removal of the one of the warm words from, the respective active set of warm words for the one of the multiple users present in the environment of the AED.
9. The computer-implemented method of claim 1, wherein the operations further comprise:
- determining a change in ambient context of the AED,
- wherein the warm word arbitration routine executes in response to determining the change in ambient context.
10. The computer-implemented method of claim 1, wherein executing the warm word arbitration routine comprises:
- obtaining enabled warm word constraints, the enabled warm word constraints comprising at least one of: memory and computing resource availability on the AED for detection of warm words; computational requirements for enabling each warm word in the respective active set of warm words for each user of the multiple users present in the environment of the AED; an acceptable false accept rate tolerance; or an acceptable false reject rate tolerance; and
- determining a number of warm words to enable in the final set of warm words for detection by the AED based on the enabled warm word constraints.
11. The computer-implemented method of claim 1, wherein executing the warm word arbitration routine comprises:
- for each corresponding user of the multiple users, ranking the warm words in the respective active set of warm words from highest priority to lowest priority based on warm word prioritization signals, the warm word prioritization signals comprising at least one of: usage frequency by the corresponding user of each warm word in the respective active set of warm words; a current state of the AED; ambient context of the AED; or co-presence information indicating prior warm word usage and/or operations performed by the AED when the corresponding user was previously present in the environment with one or more combinations of other ones of the multiple users; and
- enabling the final set of warm words for detection by the AED based on the rankings of the warm words in the respective active set of warm words for each corresponding user of the multiple users.
12. The computer-implemented method of claim 1, wherein executing the warm word arbitration routine comprises:
- for each corresponding user of the multiple users, determining a warm word affinity score; and
- enabling the final set of warm words for detection by the AED based on the warm word affinity score determined for each corresponding user.
13. The computer-implemented method of claim 12, wherein the warm word affinity score determined for each corresponding user is based on at least one of:
- frequency of warm word usage by the corresponding user;
- frequency of interactions between the corresponding user and the digital assistant;
- duration of presence of the corresponding user in the environment of the AED;
- a proximity of the user relative to the AED; or
- a current user context.
14. The computer-implemented method of claim 1, wherein executing the warm word arbitration routine comprises:
- identifying any shared warm words corresponding to warm words present in the respective active set of warm words for each of at least two of the multiple users; and
- determining the final set of warm words based on assigning, for inclusion in the final set of warm words, a higher priority to warm words identified as shared warm words.
15. The computer-implemented method of claim 1, wherein the operations further comprise:
- determining that the warm word detected in the audio data corresponding to the utterance captured by the AED comprises a speaker-specific warm word selected from the respective active set of warm words for each of one or more users among the multiple users such that the digital assistant only performs the respective action specified by the speaker-specific warm word when the speaker-specific warm word is spoken by any of the one or more users corresponding to the respective active set of warm words; and
- based on determining that the warm word detected in the audio data corresponding to the utterance captured by the AED includes a speaker-specific warm word, performing speaker verification on the audio data to determine that the utterance was spoken by one of the one or more users corresponding to the respective active set of warm words from which the detected warm word was selected,
- wherein instructing the digital assistant to perform the respective action specified by the detected warm word is based on the speaker verification performed on the audio data.
16. The computer-implemented method of claim 1, wherein:
- the final set of warm words are enabled for detection by activating, for each warm word in the final set of warm words, a respective warm word model to run on the assistant-enabled device; and
- detecting, in the audio data, the warm word from the final set of warm words comprises detecting, using the activated respective warm word model, the warm word in the audio data without performing speech recognition on the audio data.
17. The computer-implemented method of claim 16, wherein detecting the warm word in the audio data comprises:
- extracting audio features of the audio data;
- generating, using the activated respective warm word model, a warm word confidence score by processing the extracted audio features; and
- determining that the audio data corresponding to the utterance includes the warm word when the warm word confidence score satisfies a warm word confidence threshold.
18. The computer-implemented method of claim 1, wherein:
- the final set of warm words are enabled for detection by executing a speech recognizer on the AED, the speech recognizer biased to recognize the warm words in the final set of warm words; and
- detecting, in the audio data, the warm word from the final set of warm words comprises recognizing, using the speech recognizer executing on the AED, the warm word in the audio data.
19. A system comprising:
- data processing hardware; and
- memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: detecting a presence of multiple users within an environment of an assistant-enabled device (AED), the AED executing a digital assistant; for each user of the multiple users, obtaining a respective active set of warm words that each specify a respective action for the digital assistant to perform; based on the respective active set of warm words for each user of the multiple users, executing a warm word arbitration routine to enable a final set of warm words for detection by the AED, the final set of warm words enabled for detection by the AED comprising warm words selected from the respective active set of warm words for at least one user of the multiple users detected within the environment of the AED; and while the final set of warm words are enabled for detection by the AED: receiving audio data corresponding to an utterance captured by the AED; detecting, in the audio data, a warm word from the final set of warm words; and instructing the digital assistant to perform the respective action specified by the detected warm word.
20. The system of claim 19, wherein detecting the presence of the multiple users within the environment of the AED comprises detecting at least one of the multiple users is present within the environment based on proximity information of the AED relative to a user device associated with the at least one of the multiple users.
21. The system of claim 19, wherein detecting the presence of the multiple users within the environment of the AED comprises:
- receiving image data corresponding to a scene of the environment; and
- detecting the presence of at least one of the multiple users within the environment based on the image data.
22. The system of claim 19, wherein detecting the presence of the multiple users within the environment of the AED comprises detecting the presence of a corresponding user within the environment of the AED based on:
- receiving voice data characterizing a voice query issued by the corresponding user and directed toward the digital assistant;
- performing speaker identification on the voice data to identify the corresponding user that issued the voice query; and
- determining that the identified corresponding user that issued the voice query is present within the environment of the AED.
23. The system of claim 22, wherein:
- the voice query issued by the corresponding user comprises a command that causes the digital assistant to perform a long-standing operation specified by the command; and
- obtaining the respective active set of warm words comprises adding, in response to the digital assistant performing the long-standing operation specified by the command, to the respective active set of warm words for the corresponding user that issued the voice query, one or more warm words specifying respective actions for controlling the long-standing operation.
24. The system of claim 19, wherein the operations further comprise:
- detecting the presence of a new user within the environment of the AED,
- wherein the warm word arbitration routine executes in response to detecting the presence of the new user within the environment.
25. The system of claim 19, wherein the operations further comprise:
- determining that one of the multiple users is no longer present within the environment of the AED,
- wherein the warm word arbitration routine executes in response to determining that the one of the multiple users is no longer present within the environment.
26. The system of claim 19, wherein the operations further comprise:
- determining addition of a new warm word to, or removal of one of the warm words from, the respective active set of warm words for one of the multiple users present in the environment of the AED,
- wherein the warm word arbitration routine executes in response to determining the addition of the new warm word to, or the removal of the one of the warm words from, the respective active set of warm words for the one of the multiple users present in the environment of the AED.
27. The system of claim 19, wherein the operations further comprise:
- determining a change in ambient context of the AED,
- wherein the warm word arbitration routine executes in response to determining the change in ambient context.
28. The system of claim 19, wherein executing the warm word arbitration routine comprises:
- obtaining enabled warm word constraints, the enabled warm word constraints comprising at least one of: memory and computing resource availability on the AED for detection of warm words; computational requirements for enabling each warm word in the respective active set of warm words for each user of the multiple users present in the environment of the AED; an acceptable false accept rate tolerance; or an acceptable false reject rate tolerance; and
- determining a number of warm words to enable in the final set of warm words for detection by the AED based on the enabled warm word constraints.
29. The system of claim 19, wherein executing the warm word arbitration routine comprises:
- for each corresponding user of the multiple users, ranking the warm words in the respective active set of warm words from highest priority to lowest priority based on warm word prioritization signals, the warm word prioritization signals comprising at least one of: usage frequency by the corresponding user of each warm word in the respective active set of warm words; a current state of the AED; ambient context of the AED; or co-presence information indicating prior warm word usage and/or operations performed by the AED when the corresponding user was previously present in the environment with one or more combinations of other ones of the multiple users; and
- enabling the final set of warm words for detection by the AED based on the rankings of the warm words in the respective active set of warm words for each corresponding user of the multiple users.
30. The system of claim 19, wherein executing the warm word arbitration routine comprises:
- for each corresponding user of the multiple users, determining a warm word affinity score; and
- enabling the final set of warm words for detection by the AED based on the warm word affinity score determined for each corresponding user.
31. The system of claim 30, wherein the warm word affinity score determined for each corresponding user is based on at least one of:
- frequency of warm word usage by the corresponding user;
- frequency of interactions between the corresponding user and the digital assistant;
- duration of presence of the corresponding user in the environment of the AED;
- a proximity of the user relative to the AED; or
- a current user context.
32. The system of claim 19, wherein executing the warm word arbitration routine comprises:
- identifying any shared warm words corresponding to warm words present in the respective active set of warm words for each of at least two of the multiple users; and
- determining the final set of warm words based on assigning, for inclusion in the final set of warm words, a higher priority to warm words identified as shared warm words.
33. The system of claim 19, wherein the operations further comprise:
- determining that the warm word detected in the audio data corresponding to the utterance captured by the AED comprises a speaker-specific warm word selected from the respective active set of warm words for each of one or more users among the multiple users such that the digital assistant only performs the respective action specified by the speaker-specific warm word when the speaker-specific warm word is spoken by any of the one or more users corresponding to the respective active set of warm words; and
- based on determining that the warm word detected in the audio data corresponding to the utterance captured by the AED includes a speaker-specific warm word, performing speaker verification on the audio data to determine that the utterance was spoken by one of the one or more users corresponding to the respective active set of warm words from which the detected warm word was selected,
- wherein instructing the digital assistant to perform the respective action specified by the detected warm word is based on the speaker verification performed on the audio data.
34. The system of claim 19, wherein:
- the final set of warm words are enabled for detection by activating, for each warm word in the final set of warm words, a respective warm word model to run on the assistant-enabled device; and
- detecting, in the audio data, the warm word from the final set of warm words comprises detecting, using the activated respective warm word model, the warm word in the audio data without performing speech recognition on the audio data.
35. The system of claim 34, wherein detecting the warm word in the audio data comprises:
- extracting audio features of the audio data;
- generating, using the activated respective warm word model, a warm word confidence score by processing the extracted audio features; and
- determining that the audio data corresponding to the utterance includes the warm word when the warm word confidence score satisfies a warm word confidence threshold.
36. The system of claim 19, wherein:
- the final set of warm words are enabled for detection by executing a speech recognizer on the AED, the speech recognizer biased to recognize the warm words in the final set of warm words; and
- detecting, in the audio data, the warm word from the final set of warm words comprises recognizing, using the speech recognizer executing on the AED, the warm word in the audio data.
Type: Application
Filed: Nov 17, 2022
Publication Date: May 23, 2024
Applicant: Google LLC (Mountain View, CA)
Inventors: Matthew Sharifi (Kilchberg), Victor Carbune (Zurich)
Application Number: 18/056,697