MULTI-MODAL SMART AUDIO DEVICE SYSTEM ATTENTIVENESS EXPRESSION
A method may involve receiving output signals from each microphone of a plurality of microphones in the environment, each of the plurality of microphones residing in a microphone location of the environment, the output signals corresponding to an utterance of a person. The method may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location and generating a plurality of spatially-varying attentiveness signals within the zone. Each attentiveness signal may be generated by a device located within the zone. Each attentiveness signal may indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command and may indicate a relevance metric of the corresponding device.
This application claims priority to U.S. Provisional Patent Application No. 62/880,110 filed 30 Jul. 2019; U.S. Provisional Patent Application No. 62/880,112 filed 30 Jul. 2019; U.S. Provisional Patent Application No. 62/964,018 filed 21 Jan. 2020; and U.S. Provisional Patent Application No. 63/003,788 filed 1 Apr. 2020, which are incorporated herein by reference.
TECHNICAL FIELD
This disclosure pertains to systems and methods for automatically controlling a plurality of smart audio devices in an environment.
BACKGROUND
Audio devices, including but not limited to smart audio devices, have been widely deployed and are becoming common features of many homes. Although existing systems and methods for controlling audio devices provide benefits, improved systems and methods would be desirable.
NOTATION AND NOMENCLATURE
Herein, we use the expression “smart audio device” to denote a smart device which is either a single purpose audio device or a virtual assistant (e.g., a connected virtual assistant). A single purpose audio device is a device (e.g., a smart speaker, a television (TV) or a mobile phone) including or coupled to at least one microphone (and which may in some examples also include or be coupled to at least one speaker) and which is designed largely or primarily to achieve a single purpose. Although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. Similarly, the audio input and output in a mobile phone may do many things, but these are serviced by the applications running on the phone. In this sense, a single purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single purpose audio devices may be configured to group together to achieve playing of audio over a zone or user-configured area.
Herein, a “virtual assistant” (e.g., a connected virtual assistant) is a device (e.g., a smart speaker, a smart display or a voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker) and which may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud enabled or otherwise not implemented in or on the virtual assistant itself. Virtual assistants may sometimes work together, e.g., in a very discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, i.e., the one which is most confident that it has heard a wakeword, responds to the word. Connected devices may form a sort of constellation, which may be managed by one main application which may be (or include or implement) a virtual assistant.
Herein, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (i.e., is listening for) a sound command. In some instances, what may be referred to herein as a “wakeword” may include more than one word, e.g., a phrase.
Herein, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a good compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed. The speaker feed may, in some instances, undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
SUMMARY
At least some aspects of the present disclosure may be implemented via methods, such as methods of controlling a system of devices in an environment. In some instances, the methods may be implemented, at least in part, by a control system such as those disclosed herein. Some such methods may involve receiving output signals from each microphone of a plurality of microphones in the environment. Each of the plurality of microphones may reside in a microphone location of the environment. The output signals may, in some examples, correspond to an utterance of a person. According to some examples, at least one of the microphones may be included in or configured for communication with a smart audio device. In some instances, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock.
Some such methods may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location. Some such methods may involve generating a plurality of spatially-varying attentiveness signals within the zone. In some instances, each attentiveness signal of the plurality of attentiveness signals may be generated by a device located within the zone. Each attentiveness signal may, for example, indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command. In some examples, each attentiveness signal may indicate a relevance metric of the corresponding device.
In some implementations, an attentiveness signal generated by a first device may indicate a relevance metric of a second device. The second device may, in some examples, be a device corresponding to the first device. In some instances, the utterance may be, or may include, a wakeword. According to some such examples, the attentiveness signals vary, at least in part, according to estimations of wakeword confidence.
According to some examples, at least one of the attentiveness signals may be a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance. In some instances, the at least one previous signal may be, or may include, a light signal. According to some such examples, the modulation may be a color modulation, a color saturation modulation and/or a light intensity modulation.
In some instances, the at least one previous signal may be, or may include, a sound signal. According to some such examples, the modulation may be a level modulation. Alternatively, or additionally, the modulation may be a change in one or more of a fan speed, a flame size, a motor speed or an air flow rate.
According to some examples, the modulation may be what is referred to herein as a “swell.” A swell may be, or may include, a predetermined sequence of signal modulations. In some examples, the swell may include a first time interval corresponding to a signal level increase from a baseline level. According to some such examples, the swell may include a second time interval corresponding to a signal level decrease to the baseline level. In some instances, the swell may include a hold time interval after the first time interval and before the second time interval. The hold time interval may, in some instances, correspond to a constant signal level. In some examples, the swell may include a first time interval corresponding to a signal level decrease from a baseline level.
According to some examples, the relevance metric may be based, at least in part, on an estimated distance from a location. In some instances, the location may be an estimated location of the person. In some examples, the estimated distance may be an estimated distance from the location to an acoustic centroid of a plurality of microphones within the zone. According to some implementations, the relevance metric may be based, at least in part, on an estimated visibility of the corresponding device.
Some such methods may involve an automated process of determining whether a device is in a device group. According to some such examples, the automated process may be based, at least in part, on sensor data corresponding to light and/or sound emitted by the device. In some instances, the automated process may be based, at least in part, on communications between a source and a receiver. The source may, for example, be a light source and/or a sound source. According to some examples, the automated process may be based, at least in part, on communications between a source and an orchestrating hub device and/or a receiver and the orchestrating hub device. In some instances, the automated process may be based, at least in part, on a light source and/or a sound source being switched on and off for a duration of time.
Some such methods may involve automatically updating the automated process according to explicit feedback from the person. Alternatively, or additionally, some methods may involve automatically updating the automated process according to implicit feedback. The implicit feedback may, for example, be based on a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result and/or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
Some methods may involve selecting at least one speaker of a device located within the zone and controlling the at least one speaker to provide sound to the person. Alternatively, or additionally, some methods may involve selecting at least one microphone of a device located within the zone. Some such methods may involve providing signals output by the at least one microphone to a smart audio device.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.
For example, the software may include instructions for controlling one or more devices to perform a method that involves controlling a system of devices in an environment. Some such methods may involve receiving output signals from each microphone of a plurality of microphones in the environment. Each of the plurality of microphones may reside in a microphone location of the environment. The output signals may, in some examples, correspond to an utterance of a person. According to some examples, at least one of the microphones may be included in or configured for communication with a smart audio device. In some instances, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock.
Some such methods may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location. Some such methods may involve generating a plurality of spatially-varying attentiveness signals within the zone. In some instances, each attentiveness signal of the plurality of attentiveness signals may be generated by a device located within the zone. Each attentiveness signal may, for example, indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command. In some examples, each attentiveness signal may indicate a relevance metric of the corresponding device.
In some implementations, an attentiveness signal generated by a first device may indicate a relevance metric of a second device. The second device may, in some examples, be a device corresponding to the first device. In some instances, the utterance may be, or may include, a wakeword. According to some such examples, the attentiveness signals vary, at least in part, according to estimations of wakeword confidence.
According to some examples, at least one of the attentiveness signals may be a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance. In some instances, the at least one previous signal may be, or may include, a light signal. According to some such examples, the modulation may be a color modulation, a color saturation modulation and/or a light intensity modulation.
In some instances, the at least one previous signal may be, or may include, a sound signal. According to some such examples, the modulation may be a level modulation. Alternatively, or additionally, the modulation may be a change in one or more of a fan speed, a flame size, a motor speed or an air flow rate.
According to some examples, the modulation may be what is referred to herein as a “swell.” A swell may be, or may include, a predetermined sequence of signal modulations. In some examples, the swell may include a first time interval corresponding to a signal level increase from a baseline level. According to some such examples, the swell may include a second time interval corresponding to a signal level decrease to the baseline level. In some instances, the swell may include a hold time interval after the first time interval and before the second time interval. The hold time interval may, in some instances, correspond to a constant signal level. In some examples, the swell may include a first time interval corresponding to a signal level decrease from a baseline level.
According to some examples, the relevance metric may be based, at least in part, on an estimated distance from a location. In some instances, the location may be an estimated location of the person. In some examples, the estimated distance may be an estimated distance from the location to an acoustic centroid of a plurality of microphones within the zone. According to some implementations, the relevance metric may be based, at least in part, on an estimated visibility of the corresponding device.
Some such methods may involve an automated process of determining whether a device is in a device group. According to some such examples, the automated process may be based, at least in part, on sensor data corresponding to light and/or sound emitted by the device. In some instances, the automated process may be based, at least in part, on communications between a source and a receiver. The source may, for example, be a light source and/or a sound source. According to some examples, the automated process may be based, at least in part, on communications between a source and an orchestrating hub device and/or a receiver and the orchestrating hub device. In some instances, the automated process may be based, at least in part, on a light source and/or a sound source being switched on and off for a duration of time.
Some such methods may involve automatically updating the automated process according to explicit feedback from the person. Alternatively, or additionally, some methods may involve automatically updating the automated process according to implicit feedback. The implicit feedback may, for example, be based on a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result and/or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
Some methods may involve selecting at least one speaker of a device located within the zone and controlling the at least one speaker to provide sound to the person. Alternatively, or additionally, some methods may involve selecting at least one microphone of a device located within the zone. Some such methods may involve providing signals output by the at least one microphone to a smart audio device.
At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
Some embodiments involve a system of orchestrated smart audio devices, in which each of the devices may be configured to indicate (to a user) when it has heard a “wakeword” and is listening for a sound command (i.e., a command indicated by sound) from the user.
A class of embodiments involves the use of voice based interfaces in various environments (e.g., relatively large living environments) where there is no single point of attention for the user interaction or user interface. As technology progresses towards extensive Internet of Things (IoT) automation and connected devices, there are many things around and on us that represent the ability to take sensory input and to deliver information through the change or transduction of signals into the environment. In the case of automation for our living or work spaces, intelligence (e.g., provided at least in part by automated assistant(s)) may be embodied in a very pervasive or ubiquitous sense in the environment in which we are living or working. There may be a sense that an assistant is a bit omnipresent and also non-intrusive, which may in itself create a certain paradoxical aspect of the user interface.
Home automation and assistants in our personal and living spaces may no longer reside in, control or embody a single device. There may be a design goal that collectively many devices try to present a pervasive service or presence. However, to be natural, we need to engage and trigger a normal sense of interaction and acknowledgement through interaction with such personal assistants.
It is natural that we engage such interfaces primarily with voice. In accordance with some embodiments it is envisaged that there is use of voice for both initiating an interaction (e.g., with at least one smart audio device), and also for engaging with at least one smart audio device (e.g., an assistant). In some applications, speech may be the most frictionless and high bandwidth approach to specifying more detail in a request, and/or providing ongoing interaction and acknowledgement.
However, the process of human communication, while anchored on language, is actually built on the first stages of signaling for and acknowledging attention. We typically do not issue commands or voice information without first having some sense that the recipient is available, ready and interested. The ways we can command attention are numerous, yet in current system and user-interface design, the way a system shows attentiveness is mirrored more in the single-interface text space of computing than in interaction efficiency and naturalness. Most systems involve simple visual indicators (lights), primarily on the device that hosts the nearest microphone or user console; this approach is not well suited to foreseeable future living environments with more pervasive system integration and ambient computing.
Signaling and attention expression are key parts of the transaction where a user indicates a desire to interact with at least one smart audio device (e.g., virtual assistant), and each device shows awareness of the user and initial and ongoing attention towards comprehension and support. In conventional designs there are several jarring aspects of interactions where an assistant is seen as more of a discrete device interface. These aspects include:
- where there are multiple points or devices potentially ready to accept input and give attention, it is not simply the closest device to the user that is most appropriate to express attention;
- given the broad range of ergonomics of living and flexible working spaces, the visual attention of a user may not be aligned with any lighting response to indicate acknowledgement;
- while voice may come from a discrete place, it is really often the house or habitat we are addressing and seeking support from, and giving a more pervasive sense of attention is superior to a single device that must change discretely or abruptly between interactions;
- in conditions of high noise and echo, it may be possible to make mistakes in locating the user to express attentiveness to a particular zone, location or device;
- in many cases a user may be moving to or from a particular area and therefore decisions on the boundary would be jarring if made to be a forced choice to a location or device;
- generally, forms of attention expression have very discrete time boundaries, in terms of something clearly happening or not.
Accordingly, we envision that interaction between a user and one or more smart audio devices will typically start with a call (originated by the user) to attention (e.g., a wakeword uttered by the user), and continue with at least one indication (or signal or expression) of “attentiveness” from the smart audio device(s), or from devices associated with the smart audio devices. We also envision that in some embodiments, at least one smart audio device (e.g., a suggestive assistant) may be constantly listening for sound signals (e.g., of a type indicating activity by a user), or may be continuously sensitive to other activity (not necessarily sound signals), and that the smart audio device will enter a state or operating mode in which it awaits a command (e.g., a voice command) from a user upon detecting sound (or activity) of a predetermined type. Upon entering this latter state or operating mode, each such device expresses attentiveness (e.g., in any of the ways described herein).
It is known to configure a smart audio device in a discrete physical zone to detect a user (who has uttered a wakeword that has been detected by the device), and to respond to the wakeword by transmitting a visual signal and/or an auditory signal which can be seen or heard by a user in the zone. Some disclosed embodiments implement a departure from this known approach by configuring one or more smart audio devices (of a system) to consider a user's position as uncertain (within some volume, or area, of uncertainty), and by using all available smart audio devices within the volume (or area) of uncertainty to provide a spatially-varying expression of “attentiveness” of the system through one or more (e.g., all) states or operating modes of the devices. In some embodiments, the goal is not to pick the single closest device to the user and override its current setting, but to modulate behavior of all the devices according to a relevance metric, which may in some examples be based at least in part on a device's estimated proximity to the user. This gives the sense of a system which is focusing its attention on a localized area, eliminating the jarring experience of a distant device indicating that the system is listening when the user is attempting to get the attention of a closer one of the devices.
Some embodiments provide (or are configured to provide) a coordinated utilization of all the smart audio devices in an environment or in a zone of the environment, by defining and implementing the ability of each device to generate an attentiveness signal (e.g., in response to a wakeword). In some implementations, some or all of the devices may be configured to “mix in” the attentiveness signal into a current configuration (and/or to generate the attentiveness signal to be at least partially determined by the current configurations of all the devices). In some implementations, each device may be configured to determine a probabilistic estimate of a distance from a location, such as the device's distance from the user's position. Some such implementations may provide a cohesive, orchestrated expression of the system's behavior in a way that is perceptually relevant to the user.
For a smart audio device which includes (or is coupled to) at least one speaker, the attentiveness signal may be sound emitted from at least one such speaker. Alternatively, or additionally, the attentiveness signal may be of some other type (e.g., light). In some examples, the attentiveness signal may be or include two or more components (e.g., emitted sound and light).
Herein, we sometimes use the phrase “attentiveness indication” or “attentiveness expression” interchangeably with the phrase “attentiveness signal.”
In a class of embodiments, a plurality of smart audio devices may be coordinated (orchestrated), and each of the devices may be configured to generate an attentiveness signal in response to a wakeword. In some implementations, a first device may provide an attentiveness signal corresponding to a second device. In some examples, the attentiveness signals corresponding to all the devices are coordinated. Aspects of some embodiments pertain to implementing smart audio devices, and/or to coordinating smart audio devices.
In accordance with some embodiments, in a system, multiple smart audio devices may respond (e.g., by emitting light signals) in coordinated fashion (e.g., to indicate a degree of attentiveness or availability) to determination by the system of a common operating point (or operating state). For example, the operating point may be a state of attentiveness, entered in response to a wakeword from a user, with all the devices having an estimate (e.g., with at least one degree of uncertainty) of the user's position, and in which the devices emit light of different colors depending on their estimated distances from the user.
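As one illustrative sketch of such a distance-dependent light response (a minimal Python example; the particular colors and distance range are assumptions, not values from this disclosure), each device's estimated distance from the user could be mapped to an emitted color by interpolating between a "near" color and a "far" color:

# Minimal sketch (assumed colors and range, not from this disclosure): map each
# device's estimated distance from the user to an emitted light color, so that
# nearer devices respond with a visibly different color than distant ones.
def distance_to_color(distance_m, max_distance_m=6.0,
                      near_rgb=(255, 180, 0), far_rgb=(0, 60, 255)):
    t = min(max(distance_m / max_distance_m, 0.0), 1.0)  # clamp to [0, 1]
    return tuple(round((1.0 - t) * n + t * f) for n, f in zip(near_rgb, far_rgb))

# Devices at 0.5 m, 2.5 m and 5.0 m from the user's estimated position:
for d in (0.5, 2.5, 5.0):
    print(d, distance_to_color(d))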
Following on from the study of users and experiments with interactions, the inventors have recognized some particular rules or guidelines which may apply to wide area life assistants expressing attention and which underpin some disclosed embodiments. These include the following:
- attention may show a continuous and responsive escalation to a person's signaling. This gives a better indication and closed loop on training the required signaling effort, and creates a more natural interaction. It may be useful to note the range of intensity of signaling (e.g., from a whispered gentle request to a shouted expletive) and to determine associated impedance matched responses (e.g., from a response corresponding to a gentle raised glance through to a response corresponding to standing to attention);
- signaling attention may similarly continuously propagate uncertainty and ambiguity about the location and focal point of the user. The wrong item or object responding creates a very disconnected and disembodied sense of interaction and attention. Therefore, forced choices should be avoided;
- more (rather than less) pervasive signaling and transducers are often preferred to complement any single point of voice response, with continuous control often an important component; and it may be advantageous for the expression of attention to be able to naturally swell and return to a baseline setting or environment, giving the sense of companionship and presence rather than a purely transactional and information based interface.
It is well known that some things quickly anthropomorphize, and subtle aspects of timing and continuity have a large impact. Some disclosed embodiments implement continuous control of output devices in an environment to register some sensory effect on a user, and control the devices in a way to naturally swell and return to express attention and release, while avoiding jarring hard decisions around location and binary decisions of interaction threshold.
In a living space (e.g., that of
- 1. The kitchen sink and food preparation area (in the upper left region of the living space);
- 2. The refrigerator door (to the right of the sink and food preparation area);
- 3. The dining area (in the lower left region of the living space);
- 4. The open area of the living space (to the right of the sink and food preparation area and dining area);
- 5. The TV couch (at the right of the open area);
- 6. The TV itself;
- 7. Tables; and
- 8. The door area or entry way (in the upper right region of the living space).
In accordance with some embodiments, a system that estimates where a sound (e.g., a wakeword or other signal for attention) arises or originates may have some determined confidence in (or multiple hypotheses for) the estimate. For example, if a user happens to be near a boundary between zones of the system's environment, an uncertain estimate of the user's location may include a determined confidence that the user is in each of the zones. In some conventional implementations of a voice interface, it is required that the voice assistant's voice be issued from only one location at a time, thus forcing a single choice of location (e.g., one of the eight speaker locations, 1.1 and 1.3, in
Next, with reference to
More specifically, elements of the
- 102: direct local voice (produced by the user 101);
- 103: voice assistant device (coupled to one or more loudspeakers). Device 103 is positioned nearer to the user 101 than is device 105 or device 107, and thus device 103 is sometimes referred to as a “near” device, device 105 may be referred to as a “mid-distance” device and device 107 may be referred to as a “distant” device;
- 104: plurality of microphones in (or coupled to) the near device 103;
- 105: mid-distance voice assistant device (coupled to one or more loudspeakers);
- 106: plurality of microphones in (or coupled to) the mid-distance device 105;
- 107: distant voice assistant device (coupled to one or more loudspeakers);
- 108: plurality of microphones in (or coupled to) the distant device 107;
- 109: Household appliance (e.g. a lamp); and
- 110: Plurality of microphones in (or coupled to) household appliance 109. In some examples, each of the microphones 110 may be configured for communication with a device configured for implementing one or more of the disclosed methods, which may in some instances be at least one of devices 103, 105 or 107.
The system of
As talker 101 utters sound 102 indicative of a wakeword in the acoustic space, the sound is received by nearby device 103, mid-distance device 105, and far device 107. In this example, each of devices 103, 105, and 107 is (or includes) a wakeword detector, and each of devices 103, 105, and 107 is configured to determine when wakeword likelihood (probability that a wakeword has been detected by the device) exceeds a predefined threshold. As time progresses, the wakeword likelihood determined by each device can be graphed as a function of time.
As is apparent from inspection of
According to some examples, a local maximum may be determined subsequent to determining that a wakeword confidence value exceeds a wakeword detection start threshold, which may be a predetermined threshold. For example, referring to
In some such implementations, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of an audio frame as compared to a wakeword confidence value of a previous audio frame, which in some instances may be the most recent audio frame or one of the most recent audio frames. For example, a local maximum may be determined by detecting, after a previous wakeword confidence value has exceeded the wakeword detection start threshold, a decrease in a wakeword confidence value of audio frame n as compared to a wakeword confidence value of audio frame n-k, wherein k is an integer.
According to some such implementations, some methods may involve initiating a local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device exceeds, with a rising edge, the wakeword detection start threshold. Some such methods may involve terminating the local maximum determination time interval after a wakeword confidence value of the first device, the second device or another device falls below a wakeword detection end threshold.
For example, referring again to
According to some examples, the local maximum determination time interval may terminate after a wakeword confidence value of all devices in a group falls below the wakeword detection end threshold 215b. For example, referring to
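A minimal sketch of the local-maximum logic described above (in Python; the threshold values and the per-frame layout of device confidences are assumptions, not values from this disclosure):

# Minimal sketch (assumed thresholds and data layout): determine each device's
# local-maximum wakeword confidence within a window that opens when any
# confidence exceeds the start threshold (rising edge) and closes once every
# device's confidence has fallen below the end threshold.
START_THRESHOLD = 0.6   # wakeword detection start threshold (assumed value)
END_THRESHOLD = 0.4     # wakeword detection end threshold (assumed value)

def local_maxima(confidences_per_frame):
    # confidences_per_frame: list of {device_id: confidence} dicts, one per audio frame.
    in_window = False
    maxima = {}
    for frame in confidences_per_frame:
        if not in_window and any(c > START_THRESHOLD for c in frame.values()):
            in_window = True                     # rising edge: open the window
        if in_window:
            for dev, c in frame.items():
                maxima[dev] = max(maxima.get(dev, float("-inf")), c)
            if all(c < END_THRESHOLD for c in frame.values()):
                break                            # all devices below end threshold
    return maxima

frames = [{"near": 0.2, "mid": 0.1}, {"near": 0.7, "mid": 0.5},
          {"near": 0.9, "mid": 0.6}, {"near": 0.5, "mid": 0.3},
          {"near": 0.3, "mid": 0.2}]
print(local_maxima(frames))   # {'near': 0.9, 'mid': 0.6}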
In this example, the apparatus 300 includes an interface system 305 and a control system 310. The interface system 305 may, in some implementations, be configured for receiving input from each of a plurality of microphones in an environment. The interface system 305 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 305 may include one or more wireless interfaces. The interface system 305 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 305 may include one or more interfaces between the control system 310 and a memory system, such as the optional memory system 315 shown in
The control system 310 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 310 may reside in more than one device. For example, a portion of the control system 310 may reside in a device within one of the environments depicted in
In some implementations, the control system 310 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 310 may be configured for implementing methods of generating a plurality of spatially-varying attentiveness signals, e.g., such as those disclosed herein. In some such examples, the control system 310 may be configured for determining a relevance metric for at least one device.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 315 shown in
In some examples, the apparatus 300 may include the optional microphone system 320 shown in
In some implementations, the apparatus 300 may include the optional light system 325 shown in
According to some such examples the apparatus 300 may be, or may include, a smart audio device. In some such implementations the apparatus 300 may be, or may include, a wakeword detector. For example, the apparatus 300 may be, or may include, a virtual assistant.
In this example, block 405 involves receiving output signals from each microphone of a plurality of microphones in the environment. In this example, each of the plurality of microphones resides in a microphone location of the environment and the output signals correspond to an utterance of a person. The utterance may, in some examples, be (or include) a wakeword. At least one of the microphones may be included in, or configured for communication with, a smart audio device.
In some implementations, a single device may receive output signals from each microphone of a plurality of microphones in the environment in block 405. According to some such examples, the single device may be located in the environment. However, in other examples the single device may be located outside the environment. For example, at least a portion of method 400 may, in some instances, be performed by a remote device such as a server.
In other implementations, multiple devices may receive the output signals in block 405. In some such implementations, a control system of each of a plurality of smart devices may receive output signals from multiple microphones of each smart device in block 405.
The microphones of different devices in the environment may or may not be synchronous microphones, depending on the particular implementation. As used herein, microphones may be referred to as “synchronous” if the sounds detected by the microphones are digitally sampled using the same sample clock, or synchronized sample clocks. For example, a first microphone (or a first group of microphones, such as all microphones of a first smart device) within the environment may sample audio data according to a first sample clock and a second microphone (or a second group of microphones, such as all microphones of a second smart device) may sample audio data according to the first sample clock.
According to some alternative implementations, at least some microphones, or microphone systems, of an environment may be “asynchronous.” As used herein, microphones may be referred to as “asynchronous” if the sounds detected by the microphones are digitally sampled using distinct sample clocks. For example, a first microphone (or a first group of microphones, such as all microphones of a first smart device) within the environment may sample audio data according to a first sample clock and a second microphone (or a second group of microphones, such as all microphones of a second smart device) may sample audio data according to a second sample clock. In some instances, the microphones in an environment may be randomly located, or at least may be distributed within the environment in an irregular and/or asymmetric manner.
In the example shown in
In such a living space there are a set of natural activity zones where a person would be performing a task or activity, or crossing a threshold. These action areas (zones) are where there may be an effort to estimate the location (e.g., to determine an uncertain location) or context of the user to assist with other aspects of the interface.
In the
- The kitchen sink and food preparation area (in the upper left region of the living space);
- The refrigerator door (to the right of the sink and food preparation area);
- The dining area (in the lower left region of the living space);
- The open area of the living space (to the right of the sink and food preparation area and dining area);
- The TV couch (at the right of the open area);
- The TV itself;
- Tables; and
- The door area or entry way (in the upper right region of the living space).
It is apparent that there are often a similar number of lights with similar positioning to suit action areas. Some or all of the lights may be individually controllable networked agents.
In some examples, the goal is not to estimate the user's exact geometric location but to form a robust estimate of a discrete zone (e.g., in the presence of heavy noise and residual echo). As used herein, the “geometric location” of an object or a user in an environment refers to a location based on a coordinate system, whether the coordinate system is with reference to GPS coordinates, with reference to the environment as a whole (e.g., according to a Cartesian or polar coordinate system having its origin somewhere within the environment) or with reference to a particular device within the environment (e.g., according to a Cartesian or polar coordinate system having the device as its origin), such as a smart audio device. According to some examples, the estimate of a user's location in an environment may be determined without reference to geometric locations of the plurality of microphones.
In some examples, the user's zone may be estimated via a data-driven approach that involves a plurality of high-level acoustic features derived, at least partially, from at least one of the wakeword detectors. These acoustic features (which may include wakeword confidence and/or received level) may, in some implementations, consume very little bandwidth and may be transmitted asynchronously to a device implementing a classifier with very little network load. Examples of such approaches are disclosed in U.S. Provisional Patent Application No. 62/950,004, filed on Dec. 18, 2019 and entitled “Acoustic Zoning with Distributed Microphones.”
Some such methods may involve receiving output signals from each microphone of a plurality of microphones in the environment. Each of the plurality of microphones may reside in a microphone location of the environment. In some examples, the output signals may correspond to a current utterance of a user.
Some such methods may involve determining multiple current acoustic features from the output signals of each microphone and applying a classifier to the multiple current acoustic features. Applying the classifier may involve applying a model trained on previously-determined acoustic features derived from a plurality of previous utterances made by the user in a plurality of user zones in the environment. Some such methods may involve determining, based at least in part on output from the classifier, an estimate of the user zone in which the user is currently located. The user zones may, for example, include a sink area, a food preparation area, a refrigerator area, a dining area, a couch area, a television area and/or a doorway area.
In some examples, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock. In some examples, at least one of the microphones may be included in, or configured for communication with, a smart audio device. According to some examples, the plurality of user zones may involve a plurality of predetermined user zones.
According to some examples, the estimate may be determined without reference to geometric locations of the plurality of microphones. In some examples, the multiple current acoustic features may be determined asynchronously.
In some instances, the current utterance and/or the previous utterances may include wakeword utterances. In some examples, a user zone may be estimated as a class with maximum posterior probability.
According to some implementations, the model may be trained using training data that is labelled with user zones. In some instances, the classifier may involve applying a model trained using unlabelled training data that is not labelled with user zones. In some examples, applying the classifier may involve applying a Gaussian Mixture Model trained on one or more of normalized wakeword confidence, normalized mean received level, or maximum received level.
In some examples, training of the model may continue during a process of applying the classifier. For example, the training may be based on explicit feedback from the user. Alternatively, or additionally, the training may be based on implicit feedback, such as implicit feedback regarding the success (or lack thereof) of beamforming or microphone selection based on an estimated user zone. In some examples, the implicit feedback may include a determination that a user has terminated the response of a voice assistant abnormally. According to some implementations, the implicit feedback may include a command recognizer returning a low-confidence result. In some instances, the implicit feedback may include a second-pass retrospective wakeword detector returning low confidence that the wakeword was spoken.
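One way to realize such a classifier is sketched below (a hedged example only; the feature layout, the use of scikit-learn Gaussian mixtures and the synthetic training data are assumptions, not details from this disclosure): one mixture is fitted per labelled user zone over the high-level acoustic features, and a new utterance is assigned the zone with maximum posterior probability.

# Minimal sketch (assumed feature layout, scikit-learn mixtures and synthetic
# data): estimate the user zone as the class with maximum posterior
# probability, using one Gaussian mixture per labelled zone.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_zone_models(features, labels, n_components=2):
    # features: (N, D) array of acoustic features (e.g., normalized wakeword
    # confidence, mean received level, maximum received level), one row per past utterance.
    models = {}
    for zone in set(labels):
        models[zone] = GaussianMixture(n_components=n_components).fit(
            features[np.asarray(labels) == zone])
    return models

def classify_zone(models, utterance_features):
    # Equal zone priors assumed; returns a posterior probability per zone.
    zones = sorted(models)
    log_like = np.array([models[z].score_samples(utterance_features[None, :])[0]
                         for z in zones])
    post = np.exp(log_like - log_like.max())
    post /= post.sum()
    return dict(zip(zones, post))

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal([0.8, -30.0, -20.0], 0.1, (50, 3)),
                   rng.normal([0.4, -45.0, -35.0], 0.1, (50, 3))])
labels = ["kitchen"] * 50 + ["couch"] * 50
models = train_zone_models(feats, labels)
print(classify_zone(models, np.array([0.75, -32.0, -22.0])))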
Returning to
The “corresponding device” may or may not be the device providing an attentiveness signal, depending on the particular implementation. For example, a virtual assistant may include a speaker system and/or a light system and may be configured to generate an attentiveness signal that indicates a relevance metric of the virtual assistant via the speaker system and/or a light system.
In some alternative examples, an attentiveness signal generated by a first device may indicate a relevance metric of a second device. In such examples, the second device is the “corresponding device” referenced in block 415. Referring to
In some examples, the relevance metric may be based, at least in part, on an estimated distance from a location. In some examples, the location may be an estimated location of the person who made the utterance that is referenced in block 405. According to some such examples, the relevance metric may be based, at least in part, on an estimated distance from the person to the device corresponding to the attentiveness signal.
In some implementations, the estimated distance may be an estimated distance from one location (e.g., a light's location, a smart device's location, etc.) to an acoustic centroid of a plurality of microphones within the zone. For example, the estimated distance may be an estimated Euclidean distance from the acoustic centroid of the microphones within the zone. In other instances, the estimated distance may be an estimated Mahalanobis distance from the acoustic centroid of the microphones within the zone. In further instances, the relevance metric may be the posterior probability that the given light would be classified as being associated with the given zone if it were a microphone.
In some implementations, a control system may be configured to estimate posterior probabilities p(Ck|W(j)) of a feature set W(j) corresponding to an utterance, for example by using a classifier. The classifier may, in some such implementations, be a Bayesian classifier. Probabilities p(Ck|W(j)) may indicate a probability (for the jth utterance and the kth zone, for each of the zones Ck, and each of the utterances) that the user is in each of the zones Ck. These probabilities are an example of the output of such a classifier.
In some examples, the amount of attentiveness expression may be related (e.g., monotonically related) to p(Ck|W(j)). For example, in some instances a lighting device of interest may not include any microphones, so the classifier may determine or estimate a proxy based on the relative positions of the lighting device and nearby microphones.
According to some examples, a process of building and/or updating a zone location model may include the following:
- 1. Collect a set of zone classification posteriors p(Ck|W(j)) corresponding to a recent set of utterances j=1 . . . J (e.g., the set of the most recent 200 wakewords uttered in the household) along with the estimated position xj of the talker during each utterance in the set (e.g., in 3D Cartesian space);
- 2. Compute the “acoustic centroid” μk for each zone k (e.g., in 3D Cartesian space) as the weighted mean
μk = (Σj p(Ck|W(j)) xj) / (Σj p(Ck|W(j))), where the sums run over the utterances j=1 . . . J; and
- 3. Optionally, compute an “acoustic size and shape” of each zone, for example assuming a multivariate Gaussian distribution over Cartesian space. In some such examples, the process may involve computing a weighted covariance matrix, e.g., as follows:
Σk = (Σj p(Ck|W(j)) (xj − μk)(xj − μk)^T) / (Σj p(Ck|W(j))).
Then, given a new position y, a control system may be configured to do one or more of the following with the zone location model:
- 1. Compute the Euclidean distance dk = √((y − μk)^T (y − μk)) and use dk (e.g., in meters) as the relevance metric. Some such examples may involve passing dk through a monotonic function ƒ(dk) which maps dk into the range [0,1].
- 2. Compute the Mahalanobis distance mk = √((y − μk)^T Σk^−1 (y − μk)) and use mk (in units of standard deviations from the centroid) as the relevance metric. Some such examples may involve passing mk through a monotonic function g(mk) which maps mk into the range [0,1].
- 3. Evaluate the probability density of the multivariate Gaussian zone k model for location y:
N(y; μk, Σk) = (2π)^(−3/2) |Σk|^(−1/2) exp(−(1/2) (y − μk)^T Σk^−1 (y − μk)).
Some such examples may involve normalizing the probability density for each zone k into posterior probabilities
pk = N(y; μk, Σk) / Σk′ N(y; μk′, Σk′), where the sum in the denominator runs over all zones k′.
Some such implementations may involve directly using the posteriors pk as zone relevance metrics in the range [0, 1].
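The zone location model above can be sketched as follows (a minimal Python example; the data layout, the use of NumPy/SciPy and the synthetic positions are assumptions rather than details of this disclosure):

# Minimal sketch of the zone location model: a posterior-weighted centroid and
# covariance per zone, plus Euclidean distance, Mahalanobis distance and a
# Gaussian density that can serve as relevance metrics.
import numpy as np
from scipy.stats import multivariate_normal

def fit_zone_model(positions, posteriors):
    # positions: (J, 3) talker positions xj; posteriors: (J,) weights p(Ck|W(j)).
    w = posteriors / posteriors.sum()
    mu = w @ positions                          # acoustic centroid of the zone
    diff = positions - mu
    sigma = (w[:, None] * diff).T @ diff        # weighted ("acoustic") covariance
    return mu, sigma

def relevance_metrics(y, mu, sigma):
    d = np.linalg.norm(y - mu)                                   # Euclidean (m)
    m = np.sqrt((y - mu) @ np.linalg.inv(sigma) @ (y - mu))      # Mahalanobis
    density = multivariate_normal(mu, sigma).pdf(y)              # Gaussian density
    return d, m, density

rng = np.random.default_rng(1)
xj = rng.normal([2.0, 1.0, 1.2], 0.3, size=(200, 3))   # past talker positions
pk = rng.uniform(0.5, 1.0, size=200)                   # zone posteriors for those utterances
mu, sigma = fit_zone_model(xj, pk)
print(relevance_metrics(np.array([2.1, 1.2, 1.0]), mu, sigma))

Normalizing the densities evaluated at the same position y across all zones would yield the posterior relevance metrics pk in [0, 1] of option 3 above.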
According to some examples, the relevance metric may be based, at least in part, on an estimated visibility of the corresponding device. In some such examples, the relevance metric may be based, at least in part, on the elevation of the corresponding device, e.g., the height of the corresponding device from a floor of the environment. According to some such examples, if the estimated distance from the person to two devices is the same, or substantially the same (e.g., within a threshold percent, such as 10%, 8%, 5%, etc.) and one device has a higher elevation than the other device, the higher device will be assigned a higher relevance metric. In some such examples, a weighting factor of the relevance metric may be based on the estimated visibility of the corresponding device. For example, the weighting factor may correspond to the relative distances from the floor of the aforementioned devices. In other examples, the estimated visibility of the corresponding device and the corresponding weighting factor may be determined according to the relative positions of the person and one or more features of the environment, such as interior walls, furniture, etc. For example, the weighting factor may correspond to a probability that the corresponding device will be visible from the person's estimated location, e.g., based on a known environment layout, wall positions, furniture positions, counter positions, etc.
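A minimal sketch of such a visibility weighting (in Python; the elevation-based visibility proxy and the weighting constants are assumptions, not values from this disclosure):

# Minimal sketch (assumed proxy and constants): bias the relevance metric
# toward devices that are more likely to be visible to the person, using
# device elevation above the floor as a simple visibility proxy.
def visibility_weighted_relevance(base_relevance, elevation_m, max_elevation_m=2.5):
    visibility = min(max(elevation_m / max_elevation_m, 0.0), 1.0)
    return base_relevance * (0.5 + 0.5 * visibility)   # never fully suppress a device

# Two devices at roughly the same distance from the person: the higher one ranks higher.
print(visibility_weighted_relevance(0.8, 0.3))   # lamp near the floor
print(visibility_weighted_relevance(0.8, 2.0))   # fixture near the ceiling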
According to some implementations, the relevance metric may be based, at least in part, on estimations of wakeword confidence. In some such examples, a relevance metric may correspond to an estimation of wakeword confidence. According to some such examples, the wakeword confidence units may be a percentage, a number in the range of [0,1], etc. In some instances, wakeword detectors may use a logarithmic implementation. In some such logarithmic implementations, a wakeword confidence of zero means the likelihood that the wakeword was spoken is the same as the likelihood that the wakeword was not spoken (e.g., according to a particular training set). In some such implementations, increasingly positive numbers may indicate an increasing confidence that the wakeword was spoken. For example, a wakeword confidence score of +30 may correspond with a very high likelihood that the wakeword was spoken. In some such examples, negative numbers may indicate that it is unlikely that the wakeword was spoken. For example, a wakeword confidence score of −100 may correspond with a high likelihood that the wakeword was not spoken.
In other examples, a relevance metric for a particular device may be based on an estimation of wakeword confidence for that device as well as the estimated distance from the person to the device. For example, the estimation of wakeword confidence may be used as a weighting factor that is multiplied by the estimated distance to determine the relevance metric.
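As a hedged illustration (the sigmoid mapping, the proximity term and the distance range below are assumptions, not part of this disclosure), a log-odds wakeword confidence score and an estimated distance could be combined into a single relevance metric in the range [0, 1] as follows:

# Minimal sketch: combine a log-odds wakeword confidence score with an
# estimated distance into a relevance metric in [0, 1].
import math

def confidence_weight(log_odds):
    # Map a log-odds wakeword score (e.g., roughly -100..+30) into (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))

def relevance(log_odds, distance_m, max_distance_m=6.0):
    # Proximity (rather than raw distance) is used so the metric grows with relevance.
    proximity = max(0.0, 1.0 - distance_m / max_distance_m)
    return confidence_weight(log_odds) * proximity

print(relevance(+20, 1.0))   # confident, nearby device -> high relevance
print(relevance(-5, 1.0))    # nearby but unconfident -> low relevance
print(relevance(+20, 5.5))   # confident but distant -> low relevance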
The attentiveness signals may, for example, include light signals. In some such examples, the attentiveness signals may vary spatially within the zone according to color, color saturation, light intensity, etc. In some such examples, the attentiveness signals may vary spatially within the zone according to a rate at which lights are blinking. For example, lights that are blinking more quickly may indicate a relatively higher relevance metric of the corresponding device than lights that are blinking more slowly.
Alternatively, or additionally, the attentiveness signals may, for example, include sound waves. In some such examples, the attentiveness signals may vary spatially within the zone according to frequency, volume, etc. In some such examples, the attentiveness signals may vary spatially within the zone according to a rate at which a series of sounds are being produced, e.g., the number of beeps or chirps in a time interval. For example, sounds that are being produced at a higher rate may indicate a relatively higher relevance metric of the corresponding device than sounds that are being produced at a lower rate.
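A minimal sketch of mapping a per-device relevance metric onto such spatially-varying signal parameters (the particular blink-rate, intensity and chirp-rate ranges are assumptions, not values from this disclosure):

# Minimal sketch: translate a per-device relevance metric in [0, 1] into
# spatially-varying attentiveness signal parameters.
def attentiveness_parameters(relevance):
    relevance = min(max(relevance, 0.0), 1.0)
    return {
        "blink_rate_hz": 0.5 + 2.5 * relevance,      # faster blinking when more relevant
        "light_intensity": 0.2 + 0.8 * relevance,    # fraction of full brightness
        "chirps_per_10_s": round(1 + 9 * relevance), # more chirps when more relevant
    }

for r in (0.9, 0.4, 0.1):   # e.g., near, mid-distance and distant devices
    print(r, attentiveness_parameters(r))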
Referring again to
According to some examples, the attentiveness signals may include a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance. For example, if a light fixture or a light source system had previously been emitting light signals, the modulation may be a color modulation, a color saturation modulation and/or a light intensity modulation. If the previous signal had been a sound signal, the modulation may include a level or volume modulation, a frequency modulation, etc. In some examples, the modulation may be a change of a fan speed, a change of a flame size, a change of a motor speed and/or a change of an air flow rate.
According to some implementations the modulation may be a “swell.” The swell may be, or may include, a predetermined sequence of signal modulations. Some detailed examples are described below. Some such implementations may involve the use of variable output devices (which may, in some instances, be continuously variable output devices) in the system environment (e.g., the lights, speakers, fans, fireplace, etc., of a living space) that may be used for another purpose but are able to be modulated around their current operating point. Some examples may provide variable attentiveness indication(s) (e.g., varying attentiveness signals which have a swell), for example to indicate a varying expression (e.g., a varying amount) of attention across a set of devices. Some implementations may be configured to control variable attentiveness signal(s) (e.g., a swell) based on a function of estimated intensity of user signaling and/or confidence of user location(s).
For example, in response to a wakeword (having determined intensity and having a location of origin which is determined with uncertainty), two different lights of, or associated with, the devices may be activated to produce time-varying attentiveness signals. Because in this example the attentiveness signals are based in part on an estimated distance between a device and the location of origin of the wakeword, which varies according to the location of each device, the attentiveness signals are also spatially-varying.
In the example shown in
Variable Output Devices
Without loss of generality, Table 1 (below) indicates examples of devices (e.g., smart audio devices, each of which includes or is associated with (e.g., configured for communication with) a controllable light-emitting, sound-emitting, heat-emitting, moving, or vibrating element) which are useful as variable, and in some instances continuously variable, output devices. In these examples, the output of each variable output device is a time-varying attentiveness signal. Table 1 indicates some ranges of modulation of sound, light, heat, air movement, or vibration (each serving as an attentiveness signal) emitted from or produced by each of the devices. Although a single number is used to indicate some of the ranges, the single number indicates a maximum change during a "swell" and therefore implies a range from a baseline condition to the indicated maximum or minimum value. These ranges are provided merely by way of example and are not limiting. However, each range provides an example of a minimum detectable change in the indication and a maximum (commanded) attention indication.
For example, having determined an "attentiveness signal" (e.g., in the range [0,1]) for each modality, there may be an "attentiveness-to-swell" mapping from that attentiveness signal. In some examples, the attentiveness-to-swell mapping may be a monotonic mapping.
The attentiveness-to-swell mapping may, in some instances, be set heuristically or experimentally (for example on a demographically-representative group of test subjects) so that the mapping seems "natural," at least to a group of individuals who have provided feedback during the testing procedures. For example, for a color-change modality, an attentiveness of 0.1 may correspond to +20 nm of hue, whereas an attentiveness of 1 may correspond to +100 nm of hue. Color-changeable lights will generally not change the frequency of the transducer, but may instead have separate R, G, B LEDs which may be controlled with varying intensities, so the foregoing are merely rough examples. Table 1 provides some examples of natural mappings of attentiveness to produced physical phenomena, which will generally differ from modality to modality.
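By way of a non-limiting illustration, the following Python sketch implements a monotonic attentiveness-to-swell mapping for the color-change modality using the rough figures quoted above (+20 nm of hue at attentiveness 0.1, +100 nm at attentiveness 1.0). The piecewise-linear form, and the behavior below attentiveness 0.1, are assumptions.

```python
def hue_shift_nm(attentiveness: float) -> float:
    """Monotonic mapping from attentiveness in [0, 1] to a hue shift in nm."""
    a = max(0.0, min(1.0, attentiveness))
    if a < 0.1:
        # Below the assumed minimum detectable change, ramp up from zero.
        return a / 0.1 * 20.0
    # Linear from +20 nm at 0.1 up to +100 nm at 1.0.
    return 20.0 + (a - 0.1) / 0.9 * (100.0 - 20.0)
```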
In this example shown in
In the example shown in
According to the example shown in
In this instance, the swell envelope 601 also includes a hold 617, during which the attentiveness signal level remains the same. In some implementations, the attentiveness signal level may remain substantially the same during the hold 617, e.g., may remain within a determined percentage of the attentiveness signal level at the beginning of the hold 617 (e.g., within 1%, within 2%, within 3%, within 4%, within 5%, etc.). In the example shown in
Estimated Intensity
In some example embodiments, the normalized intensity of an attentiveness signal may vary from 0 (for a threshold detection of wakeword), through to 1 (for a wakeword having an estimated vocal effort causing voice levels 15-20 dB above normal).
Function for Modulating Swell of a Device
An example of a function for modulating the swell of an attentiveness signal having an initial intensity “Output” is:
Output=Output+Swell*Confidence*Intensity,
where the parameters Swell, Confidence, and Intensity may vary with time.
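By way of a non-limiting illustration, the following Python sketch applies the expression above over time. Only the formula Output = Output + Swell*Confidence*Intensity is taken from the text; the attack and release ramps of the envelope and the mapping from vocal effort to Intensity are assumptions (the hold portion corresponds to the hold described above).

```python
def intensity_from_vocal_effort(db_above_normal: float) -> float:
    """0 at threshold detection, 1 at roughly 15 dB or more above normal vocal
    effort (an assumed linear mapping)."""
    return max(0.0, min(1.0, db_above_normal / 15.0))

def swell_envelope(t: float, attack: float = 0.5, hold: float = 1.0, release: float = 1.5) -> float:
    """Unit swell value at time t seconds: ramp up, hold, then ramp back down."""
    if t < attack:
        return t / attack
    if t < attack + hold:
        return 1.0
    if t < attack + hold + release:
        return 1.0 - (t - attack - hold) / release
    return 0.0

def modulated_output(baseline: float, t: float, confidence: float, intensity: float) -> float:
    # Short-term additive delta on top of whatever the scene control has set.
    return baseline + swell_envelope(t) * confidence * intensity
```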
The control of large numbers of devices of an Internet of things (IoT), such as lights, is complicated in itself, even before introducing a step of swelling for expression of attention. Some embodiments have been designed with this in mind, e.g., in the sense that a swell is typically a short-term additive delta to whatever setting is occurring due to the broader scene or space context control.
In some implementations, the scene control may involve occupancy, and may additionally be shaped by and during voice commands that relate to the control of the system being co-opted for expressing attention. For example, audio attentiveness signals may be kept within a relatively lower amplitude range if more than one person is within a zone.
Some embodiments provide a way to decouple such scene control from the implementation of a swell. In some implementations, the swell of attentiveness signals for multiple devices may be controlled according to a separate protocol (in other words, separate from other protocols for controlling functionality of the devices), enabling the devices to participate in the human attention cycle as well as be controlled for the ambience of a living space.
Aspects of some embodiments may include the following:
- Continuous output actuators;
- Assignment of smart audio devices into activation groups, in some instances with devices assigned to more than one group;
- Swell with one or more designed temporal envelope(s);
- Range of swell controlled by a simple function of activation intensity and zone (or location) confidence.
Some examples of how a virtual assistant (or other smart audio device) may be controlled to exhibit an ambient presence, creating testable criteria that are not well represented in prior systems, may include the following:
- confidence scores (such as wakeword confidence scores) computed based on the estimation of a user's intent or invocation of the virtual assistant with specific contextual information (such as the location and/or zone in which the wakeword was spoken) may be published (e.g., shared between smart devices in an environment), and at least in some examples are not directly used to control the devices;
- suitably equipped devices with continuous electrical control may be controlled to use this information to “swell” their existing state to respond naturally and with reciprocity;
- the self-delegation of devices (e.g., automated discovery of and/or dynamic updating of zones by the devices) to perform "swells" may create emergent responses that do not require manual tables of positions and "zones," and may provide added robustness owing to the low user set-up requirements; and
- the continuous estimation, publishing, and growing confidence through an accumulation of statistical samples (e.g., via explicit or implicit user feedback) enable the system to create a semblance of presence that may, in some examples, move across space naturally, and in some examples may modulate in accordance with increased efforts by the user to address the assistant.
FIG. 7 shows an example embodiment of a system for implementing automatic optical orchestration.
Elements of FIG. 7 include the following:
- 700: Example home illustrating automatic optical orchestration, here a two room apartment;
- 701: Living room;
- 702: Bedroom;
- 703: Wall between living room and bedroom. According to this example, light cannot pass between the two rooms;
- 704: Living room window. Daylight illuminates the living room via this window during daytime hours;
- 705A-C: Plurality of smart ceiling (e.g., LED) lights illuminating the living room;
- 705D-F: Each ceiling light is orchestrated and communicates by Wi-Fi (or another protocol) with the smart home hub 740;
- 706: Living room table;
- 707: Living room smart speaker device incorporating light sensor;
- 707A: Device 707 is orchestrated by and communicates by Wi-Fi (or another protocol) with the smart home hub 740;
- 708A-C: Controlled light propagation from the lights 705A-C to the device 707;
- 709: Uncontrolled light propagation from the window 704 to the device 707;
- 710: Smart ceiling LED light illuminating the bedroom;
- 710A: The bedroom light is orchestrated by and communicates by Wi-Fi (or another protocol) with the smart home hub 740;
- 711: Potted plant;
- 712: IoT (internet of things) automatic watering device incorporating light sensor;
- 712A: IoT watering device is orchestrated by and communicates by Wi-Fi (or other protocol) with the smart home hub 740;
- 713: Bedroom table;
- 714: Bedroom smart speaker device incorporating light sensor;
- 714A: Bedroom smart speaker is orchestrated by and communicates with the smart home hub 740 by Wi-Fi or another protocol;
- 715: Controlled light propagation from the bedroom light 710 to the IoT watering device 712; and
- 716: Controlled light propagation from the bedroom light 710 to the bedroom smart speaker 714.
According to this example, the smart home hub 740 is an instance of the apparatus 300 that is described above with reference to
- 800: Graph displaying the continuous value of light intensity settings (810, 805A, 805B, and 805C) for an example set of smart lighting devices pictured in FIG. 7 (710, 705A, 705B, and 705C, respectively). Graph 800 also displays on the same time axis the continuous light sensor readings (812, 814, and 807) for the example light sensors pictured in FIG. 7 (712, 714, and 707, respectively);
- 810: The continuously controlled light intensity output for smart lighting device 710. The value at time 6:00 pm corresponds to the light completely off;
- 805A: The continuously controlled light intensity output for smart lighting device 705A. The value at time 6:00 pm corresponds to the light completely off;
- 805B: The continuously controlled light intensity output for smart lighting device 705B. The value at time 6:00 pm corresponds to the light completely off;
- 805C: The continuously controlled light intensity output for smart lighting device 705C. The value at time 6:00 pm corresponds to the light completely off;
- 812: The continuous light sensor reading for the example light sensor 712. The reading at time 6:00 pm is low;
- 814: The continuous light sensor reading for the example light sensor 714. The reading at time 6:00 pm is low;
- 807: The continuous light sensor reading for the example light sensor 707. The reading at time 6:00 pm is high;
- 830: The continuous light sensor reading is initially high due to daylight (709) entering through the window (704). As dusk falls, the ambient light intensity falls until 7:30 pm;
- 820: An event that occurs at 7:30 pm when two smart lighting devices (705A, 705B) are switched on by a user in response to the low light conditions in the room (701). The light intensities of the smart lighting devices 705A and 705B are increased as shown in the traces 820A and 820B. At the same time, the continuous light sensor reading at 820C increases with a discernibly similar response;
- 821: The event of 820 ends when the smart lighting devices 705A and 705B are switched off. The traces 820A and 820B correspondingly return to completely off, and the light sensor reading 807 returns to a low level;
- 820A: The increase and decrease of light output of smart lighting device 705A when it is switched on and then off;
- 820B: The increase and decrease of light output of smart lighting device 705B when it is switched on and then off;
- 820C: The increase and decrease of light output of the light sensor readings of sensor 707 in response to the lights 705A and 705B being switched on and off;
- 822: An event that occurs at 8:00 pm when the smart lighting device 710 is switched on and then off (823). The light intensity of the device modulates with the response 822A. The light sensor readings 812 and 814 then modulate with discernibly similar responses 822B and 822C;
- 824: An event that occurs at 8:30 pm when a new smart lighting device 705C is connected to the system. The light output is modulated either through an automatic sequence or by a user manually controlling the output of the light in an on/off pattern shown by 824A;
- 824A: The modulated output pattern of light 705C;
- 824B: In response to the modulation of smart light 705C, the continuous light sensor 707 reads a discernibly similar response 824B;
- 825: The event of 824 finishes;
- 826: In response to a user request, the lights in room 701 are enabled to a dim setting around 50% intensity. These lights are 705A, 705B, and 705C, with their 50% output intensity shown in the traces 826A, 826B, and 826C, respectively. Correspondingly, the continuous light sensor reading of sensor 707 modulates with a discernibly similar response; and
- 827: The event of 826 ends.
The management and enrolment of networked devices in the home and workplace presents a growing challenge as the number of such devices surges rapidly. Lighting, furniture, appliances, mobile phones, and wearables are all becoming increasingly connected, and present manual methods of installing and configuring such devices are not sustainable. Supplying network authentication details and pairing devices with user accounts and other services is just one example of the kind of enrolment devices need when initially installed. Another common step of enrolment and installation is the assignment of a particular "zone" or "group" to a set of devices, organizing them into a logical category often associated with specific physical spaces such as rooms. Lighting and appliances, which are usually statically installed, fall into this category most often. The labour and additional installation steps associated with assigning these "zones" or "groups" to devices present a usability challenge for users and lower the attractiveness of such devices as commercial products.
The present disclosure recognizes that these logical groupings and zones make sense in the context of home automation, but may be too rigid to provide the level of expression and fluidity desirable for human/machine interaction as users navigate the space. In some examples, the ability to modulate and swell the continuously variable output parameters of a collection of devices to best express attention may require that the system possess some knowledge about the distribution or relevance of these devices that is more finely grained or pertinent than the typical rigidly and manually assigned "zones." Herein we describe an inventive approach to automatically map such a distribution by aggregating both the readings produced by a plurality of sensors and the opportunistic sampling of the continuous output configurations of a plurality of smart devices. Herein we motivate the discussion with an example using light, i.e., using one or more light-sensitive components with digitizable output readings attached to one or more smart devices, together with the self-reported light intensity and hue output parameters for a plurality of smart lighting devices. However, it will be understood that other modalities, such as sound or temperature (with a temperature measurement component and smart connected heating and cooling appliances), are also possible embodiments of this method and approach.
With reference to
In our example, all smart lighting devices (710, 705A, 705B, and 705C) are initially emitting no light at 6:00 pm, seen in the traces 810, 805A-C, respectively. Devices 710, 705A and 705B are all presently installed and already mapped, while 705C is a new device that is not yet mapped by the system. The light sensor readings (812, 814, and 807) of three smart devices (712, 714, and 707 respectively) are also depicted. It should be understood that the vertical and horizontal axes (in
In our example, room 702 is a bedroom and room 701 is a living room. Room 702 contains one smart light emitting device 710 and two smart devices with light sensing capability, 712 (an IoT watering device) and 714 (a smart speaker). Room 701 contains two initially installed and mapped smart lights 705A and 705B, and one new unmapped smart light 705C. Room 701 also contains one smart speaker device 707, which possesses light sensing capability. A window 704 is also present in room 701, producing an uncontrolled amount of ambient light.
In our example, all smart devices are equipped to communicate over a home or local network, either via Wi-Fi or via some other communication protocol, and information collected or stored at one device may be transmitted to an orchestrating hub device 740. At time 6:00 pm, no light is produced by any of the smart lighting devices 710, 705A-C; however, there is light entering through the window 704 in room 701. Hence the light sensor readings for room 702 are low, and the reading for room 701 is high.
A series of events corresponding to changes in the lighting conditions will occur, and it will be demonstrated that corresponding changes in the light sensor readings are sufficient to establish a basic mapping between the smart sensing devices and the smart light emitting devices. Trace 830 depicts the sensor readings of device 707 decreasing as the sun sets and the amount of light (709) entering through the window 704 is reduced. At 7:30 pm event 820 occurs, as a user switches on the lights in the living room 701. Hence the light outputs 805A and 805B increase, as shown by the profiles 820A and 820B. Correspondingly, the light sensor reading 807 increases with the profile 820C. Notably, the light sensor readings 812 and 814 corresponding to the devices 712 and 714 in the adjacent room are not changed by this event. The event ends at the horizontal time marked by 821, when the lights are switched off again.
In a similar fashion to event 820, event 822 begins at time 8:00 pm when the bedroom light is switched on. The continuously variable output parameter (810) of the bedroom light (710) is increased with the profile 822A during this event. The light sensor readings (812 and 814) of smart devices 712 and 714 also modulate in a corresponding fashion with the profiles 822B and 822C respectively. Notably, the light sensor reading 807 is unaffected as it is in the adjacent room. At 823 the event ends as the bedroom light 710 is switched off.
At 8:30 pm the unmapped living room light 705C is toggled on and off in a periodic fashion for some short duration of time. This toggling could have been automatically initiated by the lighting device itself, at the request of the smart hub 740, or manually by a user using a physical switch or by alternately supplying power to and removing power from the device. Regardless of how this modulation in output (identifiable with profile 824A) was achieved, the reported output intensity (805C) of device 705C is communicated via the network for aggregation with the light sensor readings 812, 814 and 807. As in event 820, the only sensor in the living room (attached to device 707) reflects the output modulation 824A with a discernibly similar pattern 824B in the sensor reading. This event ends sometime shortly after it begins, as indicated by numeral 825.
With the data aggregated by the system up until this point, it is possible to deduce that the unmapped smart light 705C is strongly related to the lights 705A and 705B. This is because the degree to which 705A and 705B affect the light sensor readings (807) through the transmission of light 708A and 708B is highly similar to the degree to which the light emitted (708C) by 705C affects the same sensor. The degree of similarity (determined by a convolutional process about to be discussed in greater detail) determines to what degree the lights are co-located and contextually related. This soft decision and approximate relational mapping provides an example of how finer grained "zoning" and spatial awareness is afforded to the smart assistant system.
With the smart light 705C now effectively mapped, an example of a user request to switch on all the "living room" lights to 50% intensity is depicted in event 826. All three living room lights 705A-C are enabled at 50% output, depicted in the output traces 805A-C, and follow the profiles 826A-C. Correspondingly, the light sensor reading 807 also modulates with a profile 826D. The degree to which a device is "mapped" will increase in confidence over time, with the accumulation of correlated modulations observed in the output of the device and the readings of the sensor in question. So even though the new device 705C has at least been understood to co-exist with 705A and 705B, further analysis of events such as 826 occurring after the initial setup period should be understood as data with which to build an increasingly detailed and confident spatial map of the space that may be used to facilitate expressive personal assistant interactions, as previously discussed in this disclosure.
It will be understood that light sensors may incorporate specific filters to more selectively sense light produced by consumer and commercial LED lighting devices, removing spectra of light produced by uncontrollable light sources such as the sun.
It will be understood that the events of 824 in the example are optional from the point of view of the system. However, in this example the rate at which a device is mapped into the system is directly proportional to how often it modulates its output parameters. With this in mind, it is expected that devices can be integrated into the system's mapping more quickly with highly discernible modulation events, such as 824, that encode a high degree of information from an information-theory standpoint.
Some embodiments may be configured to implement continuous (or at least continued and/or periodic) re-mapping and refinement. The events described through the example of
We next discuss a subtler and complementary form of modulation that is explicitly not driven by user intervention, referred to herein as "pervasive refinement." A system may continuously adjust the output parameters of individual smart devices in a slow-moving fashion that is minimally detectable to users, but discernible to smart sensors, in order to build a mapping with increasingly higher fidelity. Instead of relying on the user to operate the system in a way that produces unambiguous information to correlate, the system can take control and perform its own modulation of individual smart output devices, again in a fashion that is only minimally detectable to users, and still discernible to sensors.
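By way of a non-limiting illustration, the following Python sketch shows one possible form of such pervasive refinement: a slow, low-amplitude perturbation added to a light's commanded intensity so that sensors can correlate against it while users barely notice. The sinusoidal shape, the approximately 1% amplitude, and the 60-second period are assumptions, not values from the disclosure.

```python
import math

def refined_setpoint(nominal: float, t_seconds: float,
                     amplitude: float = 0.01, period_s: float = 60.0) -> float:
    """Return the nominal intensity (0..1) plus a small, slow modulation that a
    co-located light sensor can detect and correlate against."""
    dither = amplitude * math.sin(2.0 * math.pi * t_seconds / period_s)
    return max(0.0, min(1.0, nominal + dither))
```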
Many examples of this approach are possible (with a focus on the light modality). Examples are shown in the following table:
With the premise and operation of the embodiments described above, we next describe in further detail the development of the mapping (over time) between continuous output devices and smart devices with sensors. We define the “mapping” H as a normalised similarity metric between a sensor equipped smart device and all the continuous output devices in the system. For a sensor-equipped smart device D{i} and smart output device L{j}, we can define a continuous similarity metric G as:
0<=G(D{i}, L{j})<=1,
where H is the set of all G for all D{i} and L{j} in the system: H={G(D{i}, L{j})} for all i, j.
With this established, it can be seen that selecting a discrete zone in the vicinity of D{i} could be achieved with a binary threshold d between 0 and 1:
Z=all j, such that G(D{i}, L{j})>d.
Establishing a continuous similarity metric G allows the concept of zones to become fluid, and we need not restrict ourselves to discrete zones for the purpose of expressing attention. Therefore, different values of d could be selected based on the degree of attentiveness or expression desired by the virtual assistant during an interaction.
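By way of a non-limiting illustration, the following Python sketch implements the discrete-zone selection above: given the similarity values G(D{i}, L{j}) for one sensor-equipped device D{i}, it keeps every output device whose similarity exceeds the threshold d. The dictionary data structure is an assumption; the device identifiers reuse the FIG. 7 example.

```python
def zone_for_sensor(G_i: dict, d: float) -> set:
    """G_i maps output-device id j -> G(D_i, L_j) in [0, 1]; return Z = {j : G > d}."""
    return {j for j, g in G_i.items() if g > d}

# A lower d yields a wider, more inclusive "zone"; a higher d a tighter one,
# so d can be varied with the degree of attentiveness to be expressed.
example_G = {"705A": 0.82, "705B": 0.79, "705C": 0.75, "710": 0.12}
print(zone_for_sensor(example_G, d=0.5))   # contains '705A', '705B', '705C'
```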
Referring again to
G(D{i}, L{j}) may be computed from the discretely sampled time series I[t] and S[t], which are the output device parameters communicated over a network and the sensor readings, respectively. I and S may be sampled at regular intervals close enough together that they may be meaningfully compared. Many similarity metrics assume zero-mean signals. However, constant ambient offsets are often present in environmental sensor readings (e.g., ambient lighting conditions).
Therefore, it is also possible to derive signals I[t]′ and S[t]′ from I[t] and S[t], and to compute G from these derived signals. For instance, the smoothed sample-to-sample delta may be expressed as follows:
I[t]′=(1−a)*I[t−1]′+a*(I[t]−I[t−1]); for 0<a<1
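By way of a non-limiting illustration, the following Python function transcribes the smoothed sample-to-sample delta above; the same recursion would be applied to S[t] to obtain S[t]′. The zero initial condition for the derived signal is an assumption.

```python
def smoothed_delta(x, a: float):
    """x is the sampled series (I[t] or S[t]); 0 < a < 1 controls the smoothing."""
    out = [0.0]                     # assume the derived signal starts at zero
    for t in range(1, len(x)):
        out.append((1.0 - a) * out[t - 1] + a * (x[t] - x[t - 1]))
    return out
```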
Establishing similarity between these two time series for a recent time period T can be achieved through many methods that will be familiar to those skilled in the art of signal processing and statistics, for example, by:
- 1. Pearson correlation coefficient (PCC or "r") between I[t] and S[t], setting G=(1+PCC)/2 (a sketch of this method follows the list), e.g., as described at http://mathworld.wolfram.com/CorrelationCoefficient.html, which is hereby incorporated by reference;
- 2. The method of 1, but with the time-delta derived versions of I and S;
- 3. The method of 1, but with mean removed versions of I and S; and/or
- 4. Dynamic time warping on both I and S, e.g., as described at https://en.wikipedia.org/wiki/Dynamic_time_warping (which is hereby incorporated by reference), using the produced distance metric as G.
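By way of a non-limiting illustration, the following Python sketch implements method 1 above over a recent window: G = (1 + PCC)/2, where PCC is the Pearson correlation between the reported output series I[t] and the sensor series S[t]. Methods 2 and 3 would first apply the smoothed delta or mean removal to the inputs; the zero-variance fallback value is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0  # assumed fallback

def similarity_G(I, S):
    # Maps correlation in [-1, 1] to a similarity G in [0, 1].
    return (1.0 + pearson(I, S)) / 2.0
```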
Some implementations may involve automatically updating an automated process of determining whether a device is in a device group, whether a device is in a zone and/or whether a person is in a zone. Some such implementations may involve updating the automated process according to implicit feedback based on one or more of a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
The goal of predicting the user zone in which the user is located may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wakeword. In such scenarios, implicit techniques for obtaining feedback on the quality of zone prediction may include:
- Penalizing predictions that result in misrecognition of the command following the wakeword. A proxy that may indicate misrecognition may include the user cutting short the voice assistant's response to a command, for example, by uttering a counter-command such as "Amanda, stop!";
- Penalizing predictions that result in low confidence that a speech recognizer has successfully recognized a command. Many automatic speech recognition systems have the capability to return a confidence level with their result that can be used for this purpose;
- Penalizing predictions that result in failure of a second-pass wakeword detector to retrospectively detect the wakeword with high confidence; and/or
- Reinforcing predictions that result in highly confident recognition of the wakeword and/or correct recognition of the user's command.
Following is an example of a failure of a second-pass wakeword detector to retrospectively detect the wakeword with high confidence. Suppose that after obtaining output signals corresponding to a current utterance from microphones in an environment and after determining acoustic features based on the output signals (e.g., via a plurality of first pass wakeword detectors configured for communication with the microphones), the acoustic features are provided to a classifier. In other words, the acoustic features are presumed to correspond to a detected wakeword utterance. Suppose further that the classifier determines that the person who made the current utterance is most likely to be in zone 3, which corresponds to a reading chair in this example. There may, for example, be a particular microphone or learned combination of microphones that is known to be best for listening to the person's voice when the person is in zone 3, e.g., to send to a cloud-based virtual assistant service for voice command recognition.
Suppose further that, after determining which microphone(s) will be used for speech recognition but before the person's speech is actually sent to the virtual assistant service, a second-pass wakeword detector operates on microphone signals corresponding to speech detected by the chosen microphone(s) for zone 3, i.e., the speech that is about to be submitted for command recognition. If that second-pass wakeword detector disagrees with the plurality of first-pass wakeword detectors that the wakeword was actually uttered, it is probably because the classifier incorrectly predicted the zone. Therefore, the classifier should be penalized.
Techniques for the a posteriori updating of the zone mapping model after one or more wakewords have been spoken may include:
- Maximum a posteriori (MAP) adaptation of a Gaussian Mixture Model (GMM) or nearest neighbor model; and/or
- Reinforcement learning, for example of a neural network, e.g., by associating an appropriate "one-hot" (in the case of a correct prediction) or "one-cold" (in the case of an incorrect prediction) ground truth label with the softmax output and applying online back-propagation to determine new network weights.
Some examples of a MAP adaptation in this context may involve adjusting the means in the GMM each time a wakeword is spoken. In this manner, the means may become more like the acoustic features that are observed when subsequent wakewords are spoken. Alternatively, or additionally, such examples may involve adjusting the variance/covariance or mixture weight information in the GMM each time a wakeword is spoken.
For example, a MAP adaptation scheme may be as follows:
μi,new=μi,old*α+x*(1−α)
In the foregoing equation, μi,old represents the mean of the ith Gaussian in the mixture, α represents a parameter which controls how aggressively MAP adaptation should occur (α may be in the range [0.9,0.999]), and x represents the feature vector of the new wakeword utterance. The index "i" would correspond to the mixture element that returns the highest a priori probability of containing the speaker's location at wakeword time.
Alternatively, each of the mixture elements may be adjusted according to their a priori probability of containing the wakeword, e.g., as follows:
μi,new=μi,old*βi+x*(1−βi)
In the foregoing equation, βi=α*(1−P(i)), wherein P(i) represents the a priori probability that the observation x is due to mixture element i.
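By way of a non-limiting illustration, the following Python sketch transcribes both MAP mean-update rules above. Feature vectors are plain Python lists; the surrounding GMM structure, the selection of the highest-probability element, and the posterior probabilities P(i) are assumed to be computed elsewhere.

```python
def map_update_single(mu_old, x, alpha=0.95):
    """Update only the selected mixture element:
    mu_new = mu_old * alpha + x * (1 - alpha)."""
    return [m * alpha + xi * (1.0 - alpha) for m, xi in zip(mu_old, x)]

def map_update_all(means, priors, x, alpha=0.95):
    """Update every mixture element i with beta_i = alpha * (1 - P(i))."""
    updated = []
    for mu_old, p_i in zip(means, priors):
        beta = alpha * (1.0 - p_i)
        updated.append([m * beta + xi * (1.0 - beta) for m, xi in zip(mu_old, x)])
    return updated
```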
In one reinforcement learning example, there may be three user zones. Suppose that for a particular wakeword, the model predicts the probabilities as being [0.2, 0.1, 0.7] for the three user zones. If a second source of information (for example, a second-pass wakeword detector) confirms that the third zone was correct, then the ground truth label could be [0, 0, 1] (“one hot”). The a posteriori updating of the zone mapping model may involve back-propagating the error through a neural network, effectively meaning that the neural network will more strongly predict zone 3 if shown the same input again. Conversely, if the second source of information shows that zone 3 was an incorrect prediction, the ground truth label could be [0.5, 0.5, 0.0] in one example. Back-propagating the error through the neural network would make the model less likely to predict zone 3 if shown the same input in the future.
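By way of a non-limiting illustration, the following toy Python fragment works through the arithmetic of the reinforcement step above for the three user zones and the probabilities quoted. The quantity p − y is the standard softmax/cross-entropy gradient at the output layer; the actual weight update is framework-specific and omitted here.

```python
probs = [0.2, 0.1, 0.7]          # model's zone probabilities for this wakeword

one_hot = [0.0, 0.0, 1.0]        # second source confirms zone 3 was correct
grad_confirm = [p - y for p, y in zip(probs, one_hot)]
# approx [0.2, 0.1, -0.3]: back-propagating this pushes zone 3 up for this input

one_cold = [0.5, 0.5, 0.0]       # zone 3 shown to be an incorrect prediction
grad_penalize = [p - y for p, y in zip(probs, one_cold)]
# approx [-0.3, -0.4, 0.7]: back-propagating this pushes zone 3 down for this input
```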
Alternatively, or additionally, some implementations may involve automatically updating the automated process according to explicit feedback from a person. Explicit techniques for obtaining feedback may include:
- Asking the user whether the prediction was correct using a voice user interface (UI). For example, a sound indicative of the following may be provided to the user: "I think you are on the couch, please say 'right' or 'wrong'."
- Informing the user that incorrect predictions may be corrected at any time using the voice UI. (e.g., sound indicative of the following may be provided to the user: “I am now able to predict where you are when you speak to me. If I predict wrongly, just say something like ‘Amanda, I'm not on the couch. I'm in the reading chair’”).
- Informing the user that correct predictions may be rewarded at any time using the voice UI. (e.g., sound indicative of the following may be provided to the user: “I am now able to predict where you are when you speak to me. If I predict correctly you can help to further improve my predictions by saying something like ‘Amanda, that's right. I am on the couch.’”).
- Including physical buttons or other UI elements that a user can operate in order to give feedback (e.g., a thumbs up and/or thumbs down button on a physical device or in a smartphone app).
While specific embodiments and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of this disclosure.
Claims
1. A method of controlling a system of devices in an environment, the method comprising:
- receiving output signals from each microphone of a plurality of microphones in the environment, each of the plurality of microphones residing in a microphone location of the environment, the output signals corresponding to an utterance of a person;
- determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location;
- generating a plurality of spatially-varying attentiveness signals within the zone, each attentiveness signal of the plurality of attentiveness signals being generated by a device located within the zone, each attentiveness signal indicating that a corresponding device is in an operating mode in which the corresponding device is awaiting a command, each attentiveness signal indicating a relevance metric of the corresponding device.
2. The method of claim 1, wherein an attentiveness signal generated by a first device indicates a relevance metric of a second device, the second device being a corresponding device.
3. The method of claim 1, wherein the relevance metric is based, at least in part, on an estimated distance from a location.
4. The method of claim 3, wherein the location is an estimated location of the person.
5. The method of claim 3, wherein the estimated distance is an estimated distance from the location to an acoustic centroid of a plurality of microphones within the zone.
6. The method of claim 1, wherein the relevance metric is based, at least in part, on an estimated visibility of the corresponding device.
7. The method of claim 1, wherein the utterance comprises a wakeword.
8. The method of claim 4, wherein the attentiveness signals vary, at least in part, according to estimations of wakeword confidence.
9. The method of claim 1, wherein at least one of the attentiveness signals comprises a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance.
10. The method of claim 9, wherein the at least one previous signal comprises a light signal and wherein the modulation comprises at least one of a color modulation, a color saturation modulation or a light intensity modulation.
11-17. (canceled)
18. The method of claim 1, wherein at least one of the microphones is included in or configured for communication with a smart audio device.
19. The method of claim 1, further comprising an automated process of determining whether a device is in a device group.
20. The method of claim 19, wherein the automated process is based, at least in part, on sensor data corresponding to at least one of light or sound emitted by the device, or, wherein the automated process is based, at least in part, on communications between at least one of a source and an orchestrating hub device or a receiver and the orchestrating hub device, or, wherein the automated process is based, at least in part, on a light source or a sound source being switched on and off for a duration of time.
21-25. (canceled)
26. The method of claim 19, further comprising automatically updating the automated process according to implicit feedback based on one or more of a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
27. The method of claim 1, further comprising selecting at least one speaker of a device located within the zone and controlling the at least one speaker to provide sound to the person.
28. The method of claim 1, further comprising selecting at least one microphone of a device located within the zone and providing signals output by the at least one microphone to a smart audio device.
29. The method of claim 1, wherein a first microphone of the plurality of microphones samples audio data according to a first sample clock and a second microphone of the plurality of microphones samples audio data according to a second sample clock.
30. An apparatus configured to perform the method of claim 1.
31. A system configured to perform the method of claim 1.
32. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of claim 1.
Type: Application
Filed: Jul 30, 2020
Publication Date: Aug 25, 2022
Applicant: Dolby Laboratories Licensing Corporation (San Francisco, CA)
Inventors: Christopher Graham HINES (Sydney, New South Wales), Rowan James KATEKAR (Redfern, New South Wales), Glenn N. DICKINS (Como, New South Wales), Richard J. CARTWRIGHT (Sydney, New South Wales), Jeremiha Emile DOUGLAS (Mill Valley, CA), Mark R.P. THOMAS (Walnut Creek, CA)
Application Number: 17/626,617