CALL INITIATION BY VOICE COMMAND
A terminal for use in a communications network for calling a called party includes a capture buffer for storing a media stream, including a voice command, received from the caller. A session manager in the terminal controls call handling and launches the call from the caller to the called party in accordance with the voice command. Responsive to the session manager, a stream controller will transmit the media stream from the capture buffer to the called party following set up of the call. In this way the called party may identify the caller's voice and screen the call. Silence intervals in the media stream are detected and suppressed.
This invention relates to managing both audio-only and audio-video calls.
BACKGROUND ART
Presently, a caller seeking to establish an audio-only or audio-video call with one or more called parties does so through a series of steps, beginning with initiating the call. After initiating the call, call set-up occurs to establish a connection between the caller and the called party. Assuming the called party chooses to accept the call once set up, the caller will then announce himself or herself to the called party. The advent of Caller Identification (“Caller ID”) allows the called party to engage in “Call Screening,” whereby a called party examines the caller ID (e.g., the telephone number of the caller) to decide whether to answer the call. If the called party has a call answering service, provided by either a stand-alone answering machine or a network service, the called party can forgo answering the call, thereby allowing the call answering machine or answering service to take a message. With many stand-alone answering machines, the called party can listen to the call as the answering machine answers the call. Before the answering machine records a message from the caller, the called party can interrupt the answering machine and accept the call. However, if the called party accepts the call once the answering machine has begun to record the caller's message, the answering machine will now record the conversation between the caller and called party.
Traditionally, a caller initiates a call by entering a sequence of Dual Tone Multi-Frequency (DTMF) signals representing the telephone number of the called party. The caller will enter the called party's telephone number through a key pad on the caller's communication device (e.g., a telephone) for transmission to a network to which the caller subscribes. Rather than enter a telephone number, the caller could enter another type of identifier, for example, the called party's name, IP address or URL, to enable the network to set up (i.e., route) the call properly.
Many new communications devices, for example mobile telephones, now include voice recognition technology thereby allowing a caller to speak a command (e.g., “call John Smith”) to initiate a call to that party. In response to the voice command, the communications device will first ascertain the telephone number or other associated identifier of the called party (e.g., IP address or URL) and then translate that identification of the called party into signaling information needed to launch the call into the communications network. Presently, the initial voice command made by the caller to launch the call typically never reaches the called party. Instead, the caller's communication device typically will discard the voice command during the process of translating the voice command into the signaling information necessary to initiate the call. At best, the called party will only receive the telephone number identifying the caller. In some instances, outbound calls from various individuals at a common location will have a single number associated with a trunk line carrying the call from that common location into the communications network. Under such circumstances, the called party will only receive the telephone number of the trunk line that carried the call and will not know the identity of the actual caller.
Thus, a need exists for a voice-activated call initiation technique that overcomes the aforementioned disadvantages.
BRIEF SUMMARY OF THE INVENTION
Briefly, in accordance with an illustrated embodiment of the present principles, a method for establishing a call from a caller to a called party commences by storing a media stream, including a voice command, received from the caller. Thereafter, the call is launched from the caller to the called party in accordance with the voice command. The media stream is transmitted to the called party following set up of the call.
In the exemplary embodiment of
Understanding the flow of communications signals between the two stations 120 and 140 will aid in understanding of the operation of the communications system 100 of
At the station 140, the camera 154 will capture the image of the subscriber 142 and generate a video signal that varies accordingly, whereas the microphone 153 at that station will capture that subscriber's voice and generate an audio signal that varies accordingly. A capture buffer 155 in the terminal 141 will buffer the video and audio signals from the camera 154 and the microphone 153, respectively, until accessed as audio/video media data 161 by a stream controller 162. The stream controller 162 in the terminal 141 will transmit the audio/video media data 161 as a media stream 163 to the station 120 via the communications network 103 for receipt at a receive buffer 183 in the terminal 121 at the station 120. Thereafter, the receive buffer 183 will provide the video portion of the media stream to a monitor 184 and the audio portion of the stream to an audio reproduction device 187. In this way, the monitor 184 will display a presentation 185 that includes an image 186 of the subscriber 142, whereas the audio reproduction device 187 will reproduce that subscriber's voice.
At the stations 120 and 140, session managers 127 and 157 in the terminals 121 and 141, respectively, provide the management functions necessary for handling calls. In some embodiments, each session manager could take the form of a module implementing the well-known Session Initiation Protocol (SIP), an Internet standard generally used in conjunction with Voice over Internet Protocol (VoIP) and suitable for this purpose, but also having the capability of managing calls with video. In other embodiments, different protocols suitable for the purpose may be substituted. The session managers 127 and 157 have the responsibility for setting up, initiating, managing, terminating, and tearing down the connections necessary for a call. In this embodiment, the session managers 127 and 157 typically communicate with the presence block 110 via connections 128 and 104, respectively, to register themselves or to find each other over the communication channel 103.
Control of the session managers 127 and 157 can occur in several different ways. For example, the session managers 127 and 157 could receive control commands from subscribers 122 and 142, respectively, through a graphical subscriber interface (not shown). Further, the session managers 127 and 157 of
The session managers 127 and 157 can provide control signals to the stream controllers 132 and 162, respectively, in a well-known manner. In this way, once a station has initiated a call using SIP and the other station has accepted the call, also using SIP, then the session managers can establish the separate connections between the stations via the stream controllers to transport corresponding media streams 133 and 163 between stations. In some embodiments, well-known protocols exist that are suitable for these connections to carry the media streams 133 and 163, including the Real-time Transport Protocol (RTP) and the corresponding RTP Control Protocol (RTCP), both Internet standards. Other embodiments may use different protocols.
In some cases, the media streams 133 and 163 can include spoken or gesture-based commands sent from one station to the other station. In other instances, transmitting such commands will prove undesirable. In accordance with the present principles, the command recognition components (e.g., the speech recognition modules 126 and 156 as shown, as well as the gesture recognition module (not shown)) will provide a control signal to their respective stream controllers 132 and 162. These control signals indicate which portions of the audio/visual media data 131 and 161 should undergo streaming and which portions should not.
Providing specific control signals from the command recognition components (e.g., the speech recognition modules 126 and 156) to the stream controllers 132 and 162 can give rise to a substantial delay (e.g., on the order of several seconds) between the capture of the audio/visual signals in the buffers 125 and 155 and actual access of the audio/visual media data 131 and 161 by stream controllers 132 and 162, for transmission as the media streams 133 and 163, all respectively. A much shorter delay (e.g., on the order of less than 100 ms) will greatly enhance communication between the subscribers 122 and 142. In accordance with the present principles, the stream controllers 132 and 162 (or other component within each corresponding terminal) keep track of the current delay imposed by the corresponding capture buffers 125 and 155, respectively. The terminals 121 and 141 reduce this current delay (to reach a predetermined minimum delay, which may be substantially zero) by reading the corresponding audio/visual media data 131 and 161, respectively, at a faster-than-real-time rate with stream controllers 132 and 162 to provide, for an interval of time, time-compressed media streams 133 and 163, respectively. Reading the corresponding audio/visual media data 131 and 161, respectively, at a faster-than-real-time rate provides a somewhat faster-than-real-time representation of data from capture buffers 125 and 155, respectively.
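By way of non-limiting illustration, the catch-up arithmetic implied by reading the capture buffer at a faster-than-real-time rate can be sketched as follows. The function name and the example figures are assumptions made for illustration only and do not appear in the embodiments described herein.

```python
def catch_up_seconds(current_delay_s, target_delay_s, speed):
    """Estimate the wall-clock time needed to shrink the buffer delay.

    While the stream controller reads at `speed` times real time, it
    consumes `speed` seconds of buffered media per second while capture
    adds one second, so the delay shrinks by (speed - 1) seconds per
    wall-clock second. All names here are hypothetical.
    """
    if speed <= 1.0:
        raise ValueError("speed must exceed 1.0 to reduce the delay")
    excess = max(0.0, current_delay_s - target_delay_s)
    return excess / (speed - 1.0)


# Example: five seconds of accumulated delay, target of zero.
time_at_double_speed = catch_up_seconds(5.0, 0.0, 2.0)
time_at_four_thirds = catch_up_seconds(5.0, 0.0, 4.0 / 3.0)
```

Under these assumptions, doubling the read rate removes the delay in about as much time as the delay itself, whereas a gentler speed-up takes proportionally longer.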
In some embodiments, the stream controllers 132 and 162 can also serve to reduce the current delay when it exceeds a predetermined minimum delay by making use of information collected from silence detectors 135 and 165. Each of the silence detectors 135 and 165 previews data in a corresponding one of the capture buffers 125 and 155, respectively, by reading ahead of the audio/video media data 131 and 161, respectively, and identifying intervals within the buffered data where the corresponding one of the subscribers 122 and 142 appears not to speak. Playing a portion of the media stream at faster-than-real-time appears much less noticeable to the remote subscriber while the local subscriber remains silent, especially if the corresponding stream controller offers no pitch compensation while streaming audio at faster-than-real-time. In embodiments where the stream controller (or receive buffer) does offer pitch compensation when audio streams out at faster-than-real-time, a remote subscriber will perceive the local subscriber as speaking somewhat quickly, but without suffering from “chipmunk effect” (that is, an artificially high-pitched voice).
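By way of non-limiting illustration, a silence detector that reads ahead in buffered audio could be sketched as below. The block-wise RMS test, the threshold, and the minimum run length are assumptions chosen for illustration; the embodiments described herein do not prescribe a particular detection method.

```python
import math


def find_silent_intervals(samples, sample_rate, block_ms=10,
                          rms_threshold=0.01, min_blocks=3):
    """Return (start_s, end_s) pairs where buffered audio appears silent.

    `samples` is a list of floats in the range -1.0..1.0. A block counts
    as quiet when its RMS level falls below `rms_threshold`; a run of at
    least `min_blocks` quiet blocks is reported as a silent interval.
    All parameter values are hypothetical.
    """
    block = max(1, sample_rate * block_ms // 1000)
    n_blocks = len(samples) // block
    intervals = []
    run_start = None
    for b in range(n_blocks):
        chunk = samples[b * block:(b + 1) * block]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        quiet = rms < rms_threshold
        if quiet and run_start is None:
            run_start = b
        elif not quiet and run_start is not None:
            if b - run_start >= min_blocks:
                intervals.append((run_start * block / sample_rate,
                                  b * block / sample_rate))
            run_start = None
    # Close a quiet run that extends to the end of the buffered data.
    if run_start is not None and n_blocks - run_start >= min_blocks:
        intervals.append((run_start * block / sample_rate,
                          n_blocks * block / sample_rate))
    return intervals
```

A stream controller could consult the returned intervals to decide where faster-than-real-time play out would be least noticeable.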
During step 202, audio will begin to accumulate in the capture buffer of the terminal (depicted generically in
If during step 205, the terminal does not detect an initiation command, the process 200 reverts to step 204 during which the terminal continues to monitor for an initiation command. When the terminal detects an initiation command during step 205, then during step 206, the session manager at the terminal is given a notification to initiate a connection (i.e., place a call, or accept an incoming call) and the process 200 waits for notification that the session manager has completed the call connection. The detected initiation command may contain parameters, for example, who or what other station to call. Under such circumstances, the terminal will supply such parameters to the presence block 110 in order to resolve to the address of a remote terminal. Alternatively, the terminal itself could resolve the address of a remote terminal using local data, for example, a locally maintained address book (not shown). The initiation command could contain other parameters, for example, the beginning point in the capture buffer 203 for a subsequent media stream (e.g., stream 133). The grammar for individual commands, discussed below in conjunction with
Upon establishment of a connection, the stream controller (e.g., 132) will receive a command to begin sending a media stream (e.g., media stream 133) during step 207, here at normal speed, beginning in the capture buffer at a point indicated by the initiation command detected during step 205. While the stream controller (e.g., the stream controller 132 in the terminal 121) transmits the media stream (e.g., the media stream 133) at normal speed, a check occurs during step 208 to detect a silent interval. If the current position in audio/video media data (e.g., the data 131) does not correspond to a silent interval (e.g., as detected and noted by the silence detector 135), then during step 209, the stream controller will continue to provide the media stream at normal speed. However, upon detection of a substantially silent interval (e.g., an interval during which the subscriber 122 does not speak), then during step 210, the stream controller will play out the media stream at faster-than-real-time.
A speed of 1⅓ faster-than-real-time generally achieves a sufficient speed-up during step 210 and at other times, though a greater or lesser speedup could occur for aesthetic reasons. If the delays accumulated in capture buffer 203 of
In the course of answering a call, the capture buffer 203 will typically accumulate less delay than for placing a call, since the command for answering a call typically has a simpler structure. At the same time, the protocol exchange and delay in opening a media stream (e.g., stream 163) after accepting a call remains shorter, whereby the accumulated delay becomes ratiometrically larger by comparison to setting up the media stream than the delay accumulated when placing the call. (On the other hand, when a called party takes a long time to accept a call, then the accumulated delay at the call-initiator's buffer will grow much larger.)
In some embodiments, during step 210, all or a portion of the silent interval detected during step 208 may get skipped, though this may result in a discontinuity in the resulting media stream (i.e., a ‘pop’ in the audio or a ‘skip’ in the video). The use of well-known audio and video techniques (e.g., fade-out, fade-in, crossfade, etc.) can at least partially address such discontinuities. During step 211, while play out proceeds at faster-than-real-time during the silent interval (or in alternative embodiments, skipping occurs), the terminal makes a determination whether the stream controller has “caught up” to the capture buffer, that is, whether the accumulated delay between the play out point for audio/video media data and the current capture point for live audio and video from microphone and camera has decreased to equal or become less than a predetermined value.
Determination of the predetermined value during step 211 will depend on the type of call (e.g., with or without video) and may depend on the hardware implementation. Often, transfer of the audio and video signals to the capture buffer at a given terminal occurs in blocks (e.g., 10 ms blocks for audio or frames of video every 1/30 second), as frequently seen with Universal Serial Bus (USB) computer peripherals and many other interfaces. Likewise, the stream controller at the terminal will access the audio/video media data in the same or different sized blocks. Because such accesses typically occur in quantized units (e.g., by the block or by the frame), and may occur asynchronously, the term ‘substantially’ is used in this context. For example, the “predetermined value” used for comparison against the accumulated delay could comprise a maximum buffer value of not more than 2 frames of video, or in an audio-only embodiment, not more than 50 ms of buffered audio. This predetermined value should not exceed 250 ms.
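By way of non-limiting illustration, the speed selection of steps 208 through 211 can be sketched as follows. The function names and the 10 ms tick are assumptions made for illustration; the 50 ms threshold and the 1⅓ speed-up are the example values mentioned above.

```python
def choose_playout_speed(accumulated_delay_s, in_silent_interval,
                         threshold_s=0.05, fast_speed=4.0 / 3.0):
    """Pick the play out speed for the next block of media.

    Normal speed applies once the stream controller has caught up
    (step 211) or while the subscriber is speaking (step 209); the
    faster speed applies only during silent intervals (step 210).
    """
    if accumulated_delay_s <= threshold_s:
        return 1.0
    return fast_speed if in_silent_interval else 1.0


def delay_after(accumulated_delay_s, silent_flags, tick_s=0.01):
    """Track the accumulated delay over a series of fixed-length ticks.

    `silent_flags` holds one boolean per tick saying whether the current
    play out position lies in a silent interval (hypothetical input).
    """
    for silent in silent_flags:
        speed = choose_playout_speed(accumulated_delay_s, silent)
        accumulated_delay_s -= (speed - 1.0) * tick_s
    return accumulated_delay_s
```

In this sketch, speech never shortens the delay, while roughly three seconds of silence at the example speed-up removes about one second of accumulated delay.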
As long as the determination made during step 211 finds that the stream controller at the given terminal has not caught up and has not sufficiently reduced the accumulated delay, the process 200 of
The discussion of the call initiation process 200 of
The transactions illustrated in
Starting at the top of
At some point prior to call initiation, at least the called party's station 310 will register with the presence service 330, for example by using a ‘REGISTER’ message 340 in accordance with the SIP protocol discussed previously, so that callers can find the station 310 using the information registered with the presence service. In some cases, the station 310 can register with the presence service 330 under at least one particular name for that station, while in other cases, the station 310 will have a registration associated with at least one particular subscriber (e.g., Scott 311). The presence service 330 will accept the registration for station 310 and will make that registration information available to assist in call routing and initiation. In connection with system 100 of
Referring to
So far, the capture buffer 125 of the terminal 121 of
Upon recognizing the utterance 341 “Kirk to Engineering” as a command to initiate a call to a particular station or person, the station 320 sends an SIP “INVITE” message 345 for “Engineering” to the presence service 330 to identify the station associated with the label “Engineering.” Referring to In
Upon receiving SIP “INVITE” message 347, the station 310 now becomes aware of the call initiated by the subscriber Kirk 321. In some embodiments, the station 310 can provide a notification tone 348 to the subscriber Scott 311. The called party station 310 responds to the “INVITE” message 347 with an SIP “SUCCESS” response code which (along with other well-known SIP transaction steps) allows caller station 320 to initiate a media stream at event 350 and begin transferring the buffered media.
Beginning at event 350, the capture buffer will begin transferring the captured version of utterance 341 as stream portion 351 during the interval 352. The utterance undergoes play out in real-time for interval 352, though delayed by a total amount of time from the start of the interval 342 to the start of the media stream at 350. This amount represents the cumulative delay in the capture buffer for the utterance 341. This delay arises from a combination of the utterance duration 342, the address recognition latency 344, and the cumulative latency from the sending of initial invitation message 345 to the receipt of the success response 349, plus some non-zero processing time (not explicitly identified).
Upon receipt of the early portion of the utterance in the early part of stream portion 351 at the called party station 310, some non-zero buffer latency 353 will occur as the receive buffer 143 at the terminal 141 of the called party station captures and decodes the utterance.
After decoding, which need not wait for the entire utterance to be received, the stream undergoes playback 354 to the subscriber Scott 311 via the audio reproduction device 147 of
If the capture buffer 155 in terminal 141 (corresponding to the called party station 310 of
However, in the illustrated embodiment of
Any time following the SIP “Success” response 349, the called party station 310 may begin play out of the called party media stream 163, shown here as beginning at event 356. However, since the subscriber Scott 311 has not accepted the call, the called party media stream initially remains muted. Initiating the called party stream before subscriber Scott 311 has personally accepted the call serves to minimize latency due to setting up the stream when and if the subscriber Scott 311 does eventually accept the call. The video corresponding to the called party media stream, muted while awaiting call acceptance, may appear black, or could include a video graphic (not shown) indicative of the called party station 310, or subscriber 311, depending on the configuration. The called party stream 163 in its muted condition may still undergo play out to the subscriber 321 as a video output 357 on the monitor 184 of
As the called party station reproduces the utterance 341 as the output 354, the subscriber Scott 311 will hear and/or see the communication from the subscriber Kirk 321. After a short reaction time 358, the subscriber Scott 311 replies with an utterance 359. Since capture buffer 155 has already become operational, the speech recognition module 156 of
Immediately after the utterance 359, the subscriber Scott 311 may become substantially silent. However, as discussed above, such silence still represents an input signal (which may include video) captured in the buffer 155 of the terminal 141 of
As the receive buffer 143 of
At this point, when the subscriber Kirk 321 says, “Kirk to Engineering” 341, the subscriber Scott 311 will hear this utterance in the form of the output 354 about five seconds later. The subscriber Scott 311 will typically respond with the utterance “Scott here, Captain,” which the subscriber Kirk 321 will hear as the output 365 about ten seconds after his own original utterance 341. While these latencies appear high and perceptibly much larger than an in-person experience, each of stations 310 and 320 actively works to reduce their respective contributions to the overall latency as the call proceeds. The transaction continues in
Referring to
Upon hearing the subscriber Scott's acknowledgement 365 (
After the normal-speed interval 371, the silence detector 135 of
In response to hearing subscriber Kirk's order 373, after a reaction time 374, the subscriber Scott 311 replies with the utterance 382 of acknowledgement “Aye, Sir”. The faster-than-real-time play out has now consumed the accumulated delay in the capture buffer 125 of
In the above described exemplary embodiment, the faster-than-real-time speeds of 1⅓× and 2× represent example values and serve as teaching examples. Higher or lower speed-up values remain possible, including skips as previously discussed. Additionally, transitions between playback speeds are shown in
As discussed above, the silence detectors 135 and 165 of
During step 402, at least audio begins to accumulate in the capture buffer 403, which generically represents the corresponding one of the capture buffers 125 and 155 of
If during step 405, the terminal does not detect an initiation command, the process 400 reverts back to step 404 to resume monitoring for such a command. However, if during step 405, the terminal detects an initiation command, then, during step 406, the session manager at the terminal (e.g., the session manager 127 at the terminal 121 of
In the case of placing a call, the detected initiation command may contain parameters, for example, who or what other station to call. The terminal could supply such parameters to the presence block 110 to resolve the address of a remote terminal (e.g., terminal 141). Alternatively, the local terminal (e.g., the terminal 121 of
During step 407, upon the connection (e.g., for media stream 133) being established, the stream controller (e.g., the stream controller 132) receives a trigger to begin sending a media stream, here at faster than normal speed (e.g., 1⅓× or 2× normal speed), beginning in the capture buffer at a point indicated by the initiation command found during step 405. As previously discussed, even though the media stream is playing faster than normal, the audio signal may be processed so as to leave the voice pitch substantially unchanged.
During step 411 of
As long as the stream controller 132 of
Process 500 begins upon commencement of step 501 with the stream already established, for example by using call initiation process 400 of
Regardless of the current play out speed, during step 511, the contents of capture buffer 125 of
As an example of a supplementary command, a local subscriber could direct his or her local terminal to start streaming video. In this regard, policy or system configuration considerations might dictate that a call is accepted in an audio-only mode. After call acceptance, the local subscriber might decide to provide a video stream. Under such circumstances, the local subscriber might utter the command “Computer, video on.” for receipt at that subscriber's terminal (e.g., terminal 121 of
In some embodiments, the terminal may have already energized the camera 124 so the capture buffer 125 has already begun accumulating images from the camera in synchronism with the audio accumulated from the microphone 123, all of
In some embodiments, the terminal will redact the act of the subscriber 122 giving such a command (“Computer, video on”) from the media stream. In such a case, the remote subscriber 142 would remain unaware that the caller issued the command (other than because the mode of the call has changed to include the transmission of video). The redaction of the command occurs in the following manner. The terminal will choose a sufficiently long predetermined accumulated delay used as the target in step 508 for the signal word (e.g., “Computer”) to undergo capture in buffer 125 and recognition by the speech recognition module 126 (all of
While the stream remains paused at the point immediately preceding the signal word (or other recognized command), the capture buffer 125 accumulates the images from the camera 124 and the audio from the microphone 123. If subsequent to the pause, the speech recognition module 126 does not recognize any command, then the stream becomes unpaused and access by stream controller 132 of
Process 500 as described above uses the speed control paradigm as illustrated in process 400. In other words, if excess delay has accumulated in the capture buffer, then the stream controller plays the media stream out at faster than normal speed to reduce the excess delay. Alternatively, a variation of the above described stream management process could use the “faster than normal, but only when silent” paradigm described with respect to the process 200 of
In some cases, a terminal can excise (redact) all or part of a recognized command before streaming the media data to a called party. Generally, the grammar of the pattern for recognition will indicate the portion subject to redaction. For example, assume that the terminal wishes to redact the command, “Computer, video on” before it reaches media stream 133. Such a command could have the following expression (in which curly-bracketed phrases are stream and call management instructions, unenclosed phrases are literal command phrases as might be spoken by a subscriber, and angle-bracketed phrases are tokens which are to be replaced, perhaps iteratively, until resolved to literal values):
{REDACT_S}<signal> VIDEO ON {REDACT_E} {VIDEO_ON}
wherein the token <signal> constitutes the locally defined signal word (e.g., “Computer”, though the subscriber could customize the signal word), and the unenclosed phrase “VIDEO ON” constitutes the specific command utterance.
In an alternative expression for the command, the command grammar uses the token <VIDEO_ON> instead of the literal (unenclosed) version of the command. The grammar would then include a collection of phrases corresponding to that command token. This allows the terminal to match this specific token with any utterance of “VIDEO ON”, “TURN VIDEO ON”, “ACTIVATE VIDEO”, or “START VIDEO.” The actual utterances acceptable in place of the command token can further depend on the spoken language preference of the subscriber. Thus, for a subscriber speaking German, the terminal would seek to recognize literal utterances such as “VIDEO AN” or “AKTIVIEREN VIDEO” for the <VIDEO_ON> token. Grammar elements such as tokens that are satisfied by any one of one or more literal values are well known.
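By way of non-limiting illustration, resolving a command token against a language-dependent collection of phrases could be sketched as below. The table layout, the language codes, and the function names are assumptions made for illustration; the phrases themselves are the examples given above.

```python
# Hypothetical phrase table: token name -> language code -> accepted phrases.
TOKEN_PHRASES = {
    "VIDEO_ON": {
        "en": ["VIDEO ON", "TURN VIDEO ON", "ACTIVATE VIDEO", "START VIDEO"],
        "de": ["VIDEO AN", "AKTIVIEREN VIDEO"],
    },
}


def normalize(utterance):
    """Upper-case the utterance and collapse punctuation and spacing."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " "
                   for ch in utterance)
    return " ".join(kept.upper().split())


def matches_token(token, utterance, language="en"):
    """Report whether the utterance satisfies the token in that language."""
    phrases = TOKEN_PHRASES.get(token, {}).get(language, [])
    return normalize(utterance) in phrases
```

With such a table, supporting another language or another phrasing amounts to editing the collection for the token rather than every command form that uses it.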
To indicate that all or a portion of the spoken command requires redaction from the outbound stream, the command will include two redaction instructions {REDACT_S} and {REDACT_E}. These redaction instructions indicate that the portion of the utterance corresponding to those tokens and literals that lie between the start and end redaction instructions requires redaction. The two redaction instructions always appear as a pair within a command form, and always in the start→end order, though some embodiments might choose to admit an unmatched instruction in a command form with the interpretation that if {REDACT_S} does not appear, the terminal will assume the presence of this redact operator at the beginning of the command or signal (if present). When {REDACT_E} does not appear, the terminal assumes the presence of such a redact operator at the end of the command (no such examples shown).
Under certain circumstances, a terminal could stream the uttered command to the remote station before recognizing and parsing the utterance to place a {REDACT_S} in the stream. This could occur if the subscriber entering the command speaks slowly, if the command phrase exceeds a prescribed length, or if the accumulated delay and/or the buffer latency at the station being commanded is too small. When this situation occurs, the {REDACT_S} can be placed at the current streaming position to execute immediately, unless this placement occurs after the {REDACT_E} instruction, in which case the instruction to redact cannot undergo execution.
Lastly, the {VIDEO_ON} instruction marks the point in the matched utterance at which the action triggered by the recognized command should take place. Thus, due to the redaction tokens, redaction of the entirety of the command utterance “Computer, video on” from the audio stream occurs, with the audio stream resuming following the placement of the {REDACT_E} instruction placed at the end of that portion of the utterance matching “ON”. Coincident with the resumption of the audio stream, synchronized video may undergo streaming too, because of the placement of the {VIDEO_ON} instruction.
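By way of non-limiting illustration, the placement of instructions within a matched utterance could be sketched as below. The sketch assumes that a recognizer supplies each spoken word with start and end times, that every token or literal matches exactly one word, and that an instruction is placed at the end of the preceding matched word (or at the start of the first word when nothing precedes it). These assumptions, and all function names, are hypothetical simplifications.

```python
import re

# One element per match: {INSTRUCTION}, <token>, or a bare literal word.
ELEMENT = re.compile(r"\{(\w+)\}|<(\w+)>|([^\s{}<>]+)")


def place_instructions(command_form, words, tokens):
    """Match `words` against `command_form` and time-stamp instructions.

    `words` is a list of (text, start_s, end_s) tuples and `tokens` maps
    a token name to the set of upper-case words accepted for it.
    Returns a list of (instruction, time_s), or None when no match.
    """
    if not words:
        return None
    placed = []
    i = 0  # index of the next spoken word to consume
    for instr, token, literal in ELEMENT.findall(command_form):
        if instr:
            when = words[i - 1][2] if i > 0 else words[0][1]
            placed.append((instr, when))
            continue
        if i >= len(words):
            return None
        accepted = tokens.get(token, set()) if token else {literal.upper()}
        if words[i][0].upper() not in accepted:
            return None
        i += 1
    return placed if i == len(words) else None


def redaction_span(placed):
    """Return (start_s, end_s) to redact, or None when not both present."""
    times = dict(placed)
    if "REDACT_S" in times and "REDACT_E" in times:
        return (times["REDACT_S"], times["REDACT_E"])
    return None
```

Applied to the command form shown above and the utterance “Computer, video on”, this sketch would place both {REDACT_E} and {VIDEO_ON} at the end of the word matching “ON”, so that the whole utterance falls inside the redaction span.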
Row 621 in
Row 622 shows a command providing the same function, but because this command contains no signal word as part of the command form, the predetermined accumulated delay in the capture buffer must be sufficient to recognize the utterance “VIDEO ON” and still enable redaction of that utterance from the media stream. Here, the command has the same grammar as in Row 621, but without the <signal> token. If both commands 621 and 622 remain simultaneously available, then adequate buffering must exist for the longer of the two so that when the signal word of command 621 triggers recognition before recognition of command 622 starts, no ambiguity or race condition occurs. If the subscriber only uttered the words “VIDEO ON” with no signal word, then only command 622 will trigger. Commands 623 and 624 are analogous to commands 621 and 622, respectively, but deactivate the video.
Note that the instruction {VIDEO_ON} in commands 621 and 622 and the instruction {VIDEO_OFF} in the commands 623 and 624 could logically appear anywhere in the grammar associated with these commands, since everything else in the command gets redacted anyway, which would leave the start and end position of the command recognized as coincident in the resulting media stream 133 after redaction. This is not always the case, as will be discussed below with respect to certain commands (e.g., command 601).
Some commands 625-628 contain other stream control instructions, such as {MUTE} and {UNMUTE}. The redaction instruction not only prevents the stream from being heard, but also attempts to remove the redacted portion from the timeline. If sufficient delay has accumulated (particularly as might occur toward the beginning of a call), the stream recipient may not miss the redacted portion. The {MUTE} and {UNMUTE} instructions behave differently. They control audibility, but do not alter the timeline of the media stream. As an example, consider row 627, containing the command form grammar {MUTE} MUTE. The {MUTE} instruction in this command marks the point in the stream where suppression of the audio should start. The bare word “MUTE” constitutes the literal utterance that triggers the command. Since the {MUTE} instruction appears before the literal utterance, muting of the audio occurs before streaming the utterance. Were the command form to read MUTE {MUTE}, then the command would mute the audio of the stream following the streaming of the utterance, so the recipient would hear the audio cut out after hearing the word “mute”, which some implementors may prefer. Note that, at least in English, the command “MUTE” constitutes a shorter utterance than “COMPUTER”, and so no accumulated delay requirement exists for proper recognition of this command, even without a signal word in the grammar. Note that the {MUTE} and {UNMUTE} instructions do not require pairing within a command (as do {REDACT_S} and {REDACT_E}), though they could be, and that they need not be paired in separate commands: A subscriber might command the system to mute, and a few moments or minutes later, either forgetting himself or herself or just for extra assurance, could command the system to mute again.
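By way of non-limiting illustration, the difference between redaction and muting can be sketched on a list of audio samples. The function names and the use of sample indices are assumptions made for illustration only.

```python
def redact(samples, start, end):
    """Remove samples[start:end]; the timeline becomes shorter."""
    return samples[:start] + samples[end:]


def mute(samples, start, end):
    """Silence samples[start:end]; the timeline keeps its length."""
    return samples[:start] + [0.0] * (end - start) + samples[end:]


# Example: a six-sample stream in which samples 2..3 carry a command.
stream = [0.1, 0.2, 0.9, 0.8, 0.3, 0.4]
redacted = redact(stream, 2, 4)  # four samples remain
muted = mute(stream, 2, 4)       # six samples remain, two of them zero
```

The redacted stream is shorter than the original, which is why redaction relies on accumulated delay to hide the gap, whereas the muted stream retains its original duration.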
In
Upon receipt of an inbound call and subsequent recognition of a call acceptance command (e.g., commands 607-610, 615), the {BUFFER} instruction triggers the start of accumulated delay. The {ACCEPT} instruction represents the point at which to start a connection and defines the command as a call acceptance type command. In both cases, the speech recognition module 126 of
In the exemplary commands appearing in rows 601-610, the elements such as <self_ref>, <station_ref>, and <addressee_ref> are tokens that also each represent a parameter in the grammar. For example, in row 601, the token <self_ref> represents an utterance by subscriber 122 initiating a call in his or her name, in this case, the literal utterance “Kirk”. Some systems might interpret this element as requiring a subscriber to identify himself or herself to the system for the command to be recognized. In alternative embodiments, the grammar constraints might allow any brief utterance that appears between the signal word and the literal “TO”.
In the same example, the <station_ref> token represents that an utterance must match a station known to the system, such as contained in a local address file (not shown) or found in the presence database 112 of
With regard to the commands 608-610, each command form contains grammar that recognizes a single occurrence of one of several different greetings. In example 608, these greetings include the literals “HERE”, “AYE”, “HELLO” separated by the vertical bar character, whereas for the command 609, such greeting words (and others) are represented by the single <familiar_greeting> token. Such a construct allows for easier construction and maintenance of command grammars. For example, adding the word “HEY” as a literal corresponding to the <familiar_greeting> token makes that word apply to all instances of the collective token, as in command 610. Otherwise, the word would have to be added to every individual instance of the command form (e.g., as another alternative literal in command 608), and keeping those instances consistent would require tracking, which could prove awkward. The collective token construct further offers the advantage of possibly covering different languages by collecting the various greeting words under the <familiar_greeting> element and allowing modification thereof by an explicit or default language selection (not shown).
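The maintenance advantage of a collective token over inline alternatives can be sketched as follows. The two command forms, the TOKENS table, and the compile_form helper are hypothetical simplifications (stream instructions such as {BUFFER} are omitted), not the grammar of table 600 itself.

```python
import re
from typing import Dict, List

# Hypothetical token table: one place to maintain the greeting words.
TOKENS: Dict[str, List[str]] = {"familiar_greeting": ["HERE", "AYE", "HELLO"]}


def compile_form(form: str, tokens: Dict[str, List[str]]) -> "re.Pattern[str]":
    """Compile a whitespace-separated command form into a regular expression.

    Each element is either a <token> looked up in `tokens`, or one or more
    literal alternatives separated by the vertical bar character.
    """
    parts = []
    for element in form.split():
        if element.startswith("<") and element.endswith(">"):
            words = tokens[element[1:-1]]
        else:
            words = element.split("|")
        parts.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


INLINE_FORM = "HERE|AYE|HELLO SIR"          # alternatives written out in place
COLLECTIVE_FORM = "<familiar_greeting> SIR"  # alternatives behind one token

inline = compile_form(INLINE_FORM, TOKENS)
collective = compile_form(COLLECTIVE_FORM, TOKENS)
```

Appending "HEY" to the token table and recompiling changes every form that uses <familiar_greeting>, whereas the inline form would need to be edited separately.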
In other examples, e.g., rows 630 and 631 in table 620, the system 100 of FIG. could generate a call-waiting signal (not shown) to let a subscriber know that another call is waiting. The subscriber could ignore the call-waiting signal. Alternatively, the subscriber could accept the new incoming call as a second call, using the “switch call” command type. After accepting the second call, the subscriber could later terminate the second call and resume the first call using the “resume call” command type.
In the former case, when a new incoming call awaits acceptance, and the terminal now recognizes the “switch call” command from the local subscriber (e.g., command 630), a new outbound stream can begin at the point indicated by the {BUFFER} instruction. In this example, the {BUFFER} instruction appears after the “switch command” token (matched by the literal utterance “Switch Calls”). Therefore, the command utterance “Switch Calls” made by the local subscriber does not become part of the media stream sent to and heard by the remote subscriber who initiated the second call. The remote subscriber who initiated the first call will also not hear this command utterance because of the {MUTE} instruction. The start of the second call and placement of the first call on hold both occur in response to the {ACCEPT} instruction.
While the first call remains on hold, the mute may remain in effect, or the system could choose to provide another effect (e.g., “music on hold” or a visual notification) while the hold persists. Upon acceptance, the second call does not undergo muting because the {MUTE} instruction only applies to the call that was active at the time of encountering that instruction.
In the latter case, upon termination of the second call to return to the first call, the {MUTE} instruction prevents the second caller from hearing the resume call command and the {RESUME} instruction marks the point of termination of the second stream and release of the first stream from hold. Assigning the stream to take up at the {BUFFER} instruction can eliminate any accumulated delay remaining for this first stream. The {UNMUTE} instruction applies to the now currently active first stream, which had undergone muting in response to the “switch call” command 630. For other variations of the “resume call” command, the {MUTE} instruction might be absent, in which case the second caller would hear the utterance of the “Resume Call” command. Further, if the {BUFFER} instruction appeared at the start of the command, the first caller could hear the same utterance, though the accumulated delay for that stream would be set to at least the entire command utterance.
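The effect of the {MUTE}, {ACCEPT}, {RESUME}, and {UNMUTE} instructions on two calls can be sketched as a small state machine. The Call and Terminal classes are hypothetical, the instruction order within each command is an assumption, and the {BUFFER} instruction is left out because it concerns stream position rather than call state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Call:
    name: str
    muted: bool = False
    on_hold: bool = False


class Terminal:
    """Hypothetical model of call state as seen by the local terminal."""

    def __init__(self, active: Call) -> None:
        self.active: Optional[Call] = active
        self.held: List[Call] = []
        self.waiting: Optional[Call] = None

    def run(self, instruction: str) -> None:
        if instruction == "MUTE" and self.active:
            # Applies only to the call active when the instruction is met.
            self.active.muted = True
        elif instruction == "UNMUTE" and self.active:
            self.active.muted = False
        elif instruction == "ACCEPT" and self.waiting:
            # Start the waiting call and place the current call on hold.
            if self.active:
                self.active.on_hold = True
                self.held.append(self.active)
            self.active, self.waiting = self.waiting, None
        elif instruction == "RESUME" and self.held:
            # End the current call and release the held call.
            self.active = self.held.pop()
            self.active.on_hold = False


first, second = Call("first"), Call("second")
terminal = Terminal(first)
terminal.waiting = second

for instruction in ("MUTE", "ACCEPT"):            # "switch call" (assumed order)
    terminal.run(instruction)
state_after_switch = (terminal.active.name, first.muted, first.on_hold, second.muted)

for instruction in ("MUTE", "RESUME", "UNMUTE"):  # "resume call" (assumed order)
    terminal.run(instruction)
```

After the switch, the first call is muted and on hold while the second call is active and unmuted. After the resume, the first call is active again and unmuted, and the second caller never heard the resume command.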
For cases in which the subscriber wants to actively reject an inbound call, the subscriber can do so using one of the exemplary call denial commands shown in rows 611-612, whereas row 613 depicts a passive denial command. The {DECLINE} instruction indicates that the command will block a connection to the inbound call, thereby refusing the call. As is common in many grammars, the notation in rows 611 and 612 separates multiple alternative literals, any one of which will match (i.e., any one of the utterances “Cancel”, “Block”, “Deny” would match the grammar). Whether certain words carry a further connotation (e.g., whether the word “Block” would implicitly result in a terminal ignoring future calls from the same caller) remains a design choice available to different implementations. The command grammar could support additional instructions like {BLACKLIST} (not shown in
In some cases, for example to simplify the task of speech recognition for the call initiation or call acceptance commands, the structure of the command can include a signal word, for example “Computer”, as in “Computer: Kirk to Engineering” where the signal word would not comprise part of the stream, but “Kirk to Engineering” would. In this case, the stream would begin just after the signal word, but still within the interval of the command utterance. Row 601 depicts such an example. In some instances, an utterance can contain an explicit or implied dialing command, immediately followed by a portion of the conversation, as in “Mr. Scott, meet me on the bridge.” Here, the capture buffer in the terminal would buffer the entirety of the utterance, even though only the first portion corresponds to a command to initiate a connection. Row 606 shows such an example.
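Choosing where the outbound stream begins when a signal word is present can be sketched as follows. The word timings, the Word tuple layout, and the stream_start function are assumptions for illustration; a real recognizer would supply its own timing data.

```python
from typing import List, Tuple

# (recognized text, start time in seconds, end time in seconds); hypothetical layout
Word = Tuple[str, float, float]


def stream_start(words: List[Word], signal_word: str = "computer") -> float:
    """Return the capture-buffer time at which the outbound stream should begin.

    If the utterance opens with the signal word, the stream begins just after
    that word; otherwise it begins with the first recognized word.
    """
    if not words:
        raise ValueError("no recognized words")
    text, start, end = words[0]
    if text.lower() == signal_word and len(words) > 1:
        return end
    return start


with_signal = [("Computer", 0.0, 0.6), ("Kirk", 0.8, 1.2),
               ("to", 1.2, 1.4), ("Engineering", 1.4, 2.2)]
without_signal = [("Kirk", 0.3, 0.7), ("to", 0.7, 0.9), ("Engineering", 0.9, 1.7)]

start_a = stream_start(with_signal)
start_b = stream_start(without_signal)
```

With these assumed timings the stream begins at 0.6 s when the signal word is spoken and at 0.3 s when it is not, so "Kirk to Engineering" is streamed in both cases and "Computer" is not.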
In a video call, the called party may first accept the call with a verbal response (e.g., “Here, Sir”, as depicted in row 608), but the system configuration may only allow connection of the audio stream at first. To activate the video portion of the media stream, the subscriber would utter a subsequent command, “Video on” (as depicted in row 622). The terminal could squelch that utterance in the return stream (depending upon configuration or preferences), including removing the duration of the command utterance from the timeline when possible, after which the terminal will activate the video portion of the media stream.
In other embodiments, instead of the terminal streaming the audio/video media data at faster-than-real-time, the terminal could skip portions of the stream (not shown). In such an implementation, skipping only silent sections of the audio/video media data becomes preferable. In this regard, the terminal could crossfade between the last fraction of a second before the skipped portion and the first fraction of a second following the skipped portion.
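Skipping a silent section with a crossfade across the cut can be sketched as below. The function name, the linear fade, and the representation of audio as a list of floats are assumptions; locating the silent section is taken as already done.

```python
from typing import List


def cut_with_crossfade(samples: List[float], a: int, b: int, fade: int) -> List[float]:
    """Remove samples[a:b] and crossfade across the cut.

    The last `fade` samples before the skipped portion fade out while the
    first `fade` samples after it fade in, and the two are summed.
    """
    if not (fade <= a <= b <= len(samples) - fade):
        raise ValueError("cut region needs `fade` samples on both sides")
    before = samples[a - fade:a]
    after = samples[b:b + fade]
    blended = []
    for i in range(fade):
        w = (i + 1) / (fade + 1)  # linear weight rising toward the later audio
        blended.append(before[i] * (1.0 - w) + after[i] * w)
    return samples[:a - fade] + blended + samples[b + fade:]


# Four samples of signal, six of silence, four of signal (hypothetical data).
audio = [1.0] * 4 + [0.0] * 6 + [1.0] * 4
shortened = cut_with_crossfade(audio, a=4, b=10, fade=2)
```

The output is shorter than the input by the skipped span plus the fade length, which is the amount of accumulated delay recovered by this cut.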
Thus, audio input signal 710 matches the command form 711 (and row 601 of
The audio input signal 750 contains the utterance “Kirk to Engineering”, without any signal word, which does not fulfill the grammar for the command form 711 (which requires the signal word), but does fulfill the grammar for the command form 751 (and row 602 of table 600). The audio input signal 750 begins with an extended silence 757, which gets broken by the first portion 753 containing the 0.4 s long utterance “Kirk”, which constitutes an acceptable match for the <self_ref> token of form 751. The second portion 754 contains the spoken word “to”, which matches the literal “TO” of form 751, and the third portion 755 contains the spoken word “Engineering”, which corresponds to the <station_ref> token of form 751 as above (assuming “Engineering” constitutes a currently recognized station name). In this example, the {BUFFER} instruction appears first, just ahead of the <self_ref> token. As such, for some embodiments, the buffer position could be determined to be the start of first portion 753, which corresponded to the <self_ref> element, but such an assignment can frequently cause a click or pop at the start of the buffer, since there could exist some aesthetically desirable pre-utterance that precedes the portion 753 identified by speech recognition module 126 of
Thus, audio input signal 750 matches the command form 751 (and row 602 of
The foregoing describes a technique for managing both audio-only and audio-video calls.
Claims
1. A method for establishing a call from a caller to a called party, comprising the steps of:
- storing a media stream, including a voice command, received from the caller;
- launching the call from the caller to the called party in accordance with the voice command; and
- transmitting the media stream to the called party following set up of the call.
2. The method according to claim 1 wherein the media stream includes video in addition to the voice command.
3. The method according to claim 1 wherein the transmitting step includes the steps of:
- (a) monitoring the media stream for a silence interval;
- (b) transmitting the silence interval of the media command at faster than real time; and
- (c) transmitting intervals of the media stream other than the silence interval at real time.
4. The method according to claim 1 wherein the transmitting step includes the steps of:
- (a) monitoring the media stream for a silence interval;
- (b) omitting the silence interval of the media command from transmission; and
- (c) transmitting intervals of the media stream other than the silence interval at real time.
5. The method according to claim 3 wherein the steps of (a) and (b) are repeated for each subsequent silence interval to reduce accumulated delay in the media stream below a predetermined value.
6. The method according to claim 4 wherein the steps of (a) and (b) are repeated for each subsequent silence interval to reduce accumulated delay in the media stream below a predetermined value.
7. The method according to claim 1 wherein the step of transmitting the media stream includes transmitting the media stream at faster than real time until accumulated delay in the stream is reduced below a predetermined value.
8. The method according to claim 1 wherein the step of launching the call includes sending a call initiation command using SIP.
9. A method for initiating a call between a caller and a called party, comprising the steps of:
- storing a media stream, including a voice command, received from a first participant selected from the caller and the called party;
- initiating the call in accordance with the voice command; and,
- transmitting the media stream to a second participant selected as another of the caller and the called party, following set up of the call.
10. The method according to claim 9 wherein the media stream includes video in addition to the voice command.
11. The method according to claim 9 wherein the transmitting step includes the steps of:
- (a) monitoring the media stream for a silence interval;
- (b) transmitting the silence interval of the media command at faster than real time; and
- (c) transmitting intervals of the media stream other than the silence interval at real time.
12. The method according to claim 9 wherein the transmitting step includes the steps of:
- (a) monitoring the media stream for a silence interval;
- (b) omitting the silence interval of the media command from transmission; and
- (c) transmitting intervals of the media stream other than the silence interval at real time.
13. The method according to claim 11 wherein the steps of (a) and (b) are repeated for each subsequent silence interval to reduce accumulated delay in the media stream below a predetermined value.
14. The method according to claim 12 wherein the steps of (a) and (b) are repeated for each subsequent silence interval to reduce accumulated delay in the media stream below a predetermined value.
15. The method according to claim 9 wherein the step of transmitting the media stream includes transmitting the media stream at faster than real time until accumulated delay in the stream is reduced below a predetermined value.
16. The method according to claim 9 wherein the step of launching the call includes sending a call initiation command using SIP.
17. A terminal for use in a communications network for calling a called party, comprising:
- a capture buffer for storing a media stream, including a voice command, received from the caller;
- a session controller for launching the call from the caller to the called party in accordance with the voice command; and
- a stream controller for transmitting the media stream from the capture buffer to the called party following set up of the call.
18. The terminal according to claim 17 wherein the media stream stored in the capture buffer includes video in addition to the voice command.
19. The terminal according to claim 17 further comprising a silence detector for monitoring the media stream for a silence interval and for causing the stream controller to transmit the silence interval of the media command at faster than real time and to transmit intervals of the media stream other than the silence interval at real time.
20. The terminal according to claim 17 further comprising a silence detector for monitoring the media stream for a silence interval and for causing the stream controller to omit the silence interval of the media command and to transmit intervals of the media stream other than the silence interval at real time.
21. The terminal according to claim 19 wherein the stream controller transmits each subsequent silence interval at faster than real time until accumulated delay in the stream is reduced below a predetermined value.
22. The terminal according to claim 19 wherein the stream controller skips each subsequent silence interval until accumulated delay in the stream is reduced below a predetermined value.
23. The terminal according to claim 17 wherein the stream controller transmits the media stream at faster than real time until accumulated delay in the stream is reduced below a predetermined value.
24. The terminal according to claim 17 wherein the session manager launches the call by sending a call initiation command using SIP.
Type: Application
Filed: May 1, 2013
Publication Date: Feb 11, 2016
Patent Grant number: 10051115
Applicant:
Inventor: William Gibbens REDMANN (Glendale, CA)
Application Number: 14/770,481