SPEECH PROCESSING APPARATUS AND SPEECH PROCESSING METHOD
[Problem] To obtain a meaning intended for conveyance by a user from speech of the user while reducing trouble for the user. [Solution] A speech processing apparatus includes an analysis unit configured to analyze a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
The present disclosure relates to a speech processing apparatus and a speech processing method.
BACKGROUND
Speech processing apparatuses with a speech agent function have recently become popular. The speech agent function is a function to analyze the meaning of speech uttered by a user and execute processing in accordance with the meaning obtained by the analysis. For example, when a user utters a speech “Send an email let's meet in Shibuya tomorrow to A”, the speech processing apparatus with the speech agent function analyzes the meaning of the speech, and sends an email having a body “Let's meet in Shibuya tomorrow” to A by using a pre-registered email address of A. Examples of other types of processing executed by the speech agent function include answering a question from a user, for example, as disclosed in Patent Literature 1.
CITATION LIST
Patent Literature
Patent Literature 1: JP 2016-192121 A
SUMMARY
Technical Problem
The speech uttered by a user may include a correct speech expressing a meaning intended for conveyance by the user, and an error speech not expressing the meaning intended for conveyance by the user. The error speech is, for example, a filler such as “well” and “umm”, and a soliloquy such as “what was it?”. When a user utters speech including the error speech, the user may utter the speech again from the start to provide the speech including only the correct speech to the speech agent function. However, uttering the speech again from the start is troublesome for the user.
Thus, the present disclosure proposes a novel and improved speech processing apparatus and method enabling acquisition of a meaning intended for conveyance by a user from speech of the user while reducing the trouble for the user.
Solution to Problem
According to the present disclosure, a speech processing apparatus is provided that includes an analysis unit configured to analyze a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
Moreover, according to the present disclosure, a speech processing method is provided that includes analyzing, by a processor, a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
Advantageous Effects of Invention
As described above, the present disclosure enables the acquisition of the meaning intended for conveyance by the user from the speech of the user while reducing the trouble for the user. Note that the effects described above are not necessarily limitative. With or in the place of the above effects, there may be achieved any one of the effects described in this specification or other effects that may be grasped from this specification.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the appended drawings. In this specification and the appended drawings, structural elements having substantially the same function and structure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
Additionally, in this specification and the appended drawings, a plurality of structural elements having substantially the same function and structure are sometimes distinguished from each other using different alphabets after the same reference numerals. However, when a plurality of structural elements having substantially the same function and structure do not particularly have to be distinguished from each other, the structural elements are denoted only with the same reference numerals.
Moreover, the present disclosure will be described in the order of the following items.
- 1. Overview of Speech processing apparatus
- 2. Configuration of Speech processing apparatus
- 3. Specific examples of Meaning correction
- 3-1. First example
- 3-2. Second example
- 3-3. Third example
- 3-4. Fourth example
- 4. Operation of Speech processing apparatus
- 5. Modification
- 6. Hardware configuration
- 7. Conclusion
Overview of Speech Processing Apparatus
First, an overview of a speech processing apparatus according to an embodiment of the present disclosure will be described with reference to
For example, when the user of the speech processing apparatus 20 utters a speech “Send an email let's meet in Shibuya tomorrow to A” as illustrated in
Note that the speech processing apparatus 20, which is illustrated as a stationary apparatus in
Here, the speech uttered by the user may include a correct speech expressing a meaning intended for conveyance by the user, and an error speech not expressing the meaning intended for conveyance by the user. The error speech is, for example, a filler such as “well” and “umm”, and a soliloquy such as “what was it?” A negative word such as “not” and a speech talking to another person also sometimes fall under the error speech. When the user utters speech including such an error speech, e.g., when the user utters a speech “Send an email let's meet in, umm . . . where is that? Shibuya tomorrow to A”, uttering the speech again from the start is troublesome for the user.
The inventors of this application have developed the embodiment of the present disclosure by focusing on the above circumstances. In accordance with the embodiment of the present disclosure, the meaning intended for conveyance by the user can be obtained from the speech of the user while reducing the trouble for the user. In the following, a configuration and an operation of the speech processing apparatus 20 according to the embodiment of the present disclosure will be sequentially described in detail.
Configuration of Speech Processing Apparatus
(Image Processing Unit)
The image processing unit 220 includes an imaging unit 221, a face image extraction unit 222, an eye feature value extraction unit 223, a visual line identification unit 224, a face feature value extraction unit 225, and a facial expression identification unit 226 as illustrated in
The imaging unit 221 captures an image of a subject to acquire the image of the subject. The imaging unit 221 outputs the acquired image of the subject to the face image extraction unit 222.
The face image extraction unit 222 determines whether a person area exists in the image input from the imaging unit 221. When a person area exists in the image, the face image extraction unit 222 extracts a face image from the person area to identify the user. The face image extracted by the face image extraction unit 222 is output to the eye feature value extraction unit 223 and the face feature value extraction unit 225.
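As a rough illustration of the role of the face image extraction unit 222, the following sketch crops face regions from a camera frame. The use of OpenCV's bundled Haar cascade, the parameter values, and the function name are assumptions for illustration; the disclosure does not prescribe a particular detector.

```python
# Hypothetical sketch of face-image extraction using OpenCV's bundled Haar cascade.
# The detector choice and parameters are illustrative assumptions; any person/face
# detector could fill the role of the face image extraction unit 222.
import cv2

def extract_face_images(frame):
    """Return a list of cropped face images found in a BGR camera frame."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in faces]
```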
The eye feature value extraction unit 223 analyzes the face image input from the face image extraction unit 222 to extract a feature value for identifying a visual line of the user.
The visual line identification unit 224, which is an example of a behavior analysis unit configured to analyze user behaviors, identifies a direction of the visual line based on the feature value extracted by the eye feature value extraction unit 223. The visual line identification unit 224 identifies a face direction in addition to the visual line direction. The visual line direction, a change in the visual line, and the face direction obtained by the visual line identification unit 224 are output to the analysis unit 260 as an example of analysis results of the user behaviors.
The face feature value extraction unit 225 extracts a feature value for identifying a facial expression of the user based on the face image input from the face image extraction unit 222.
The facial expression identification unit 226, which is an example of the behavior analysis unit configured to analyze the user behaviors, identifies the facial expression of the user based on the feature value extracted by the face feature value extraction unit 225. For example, the facial expression identification unit 226 may identify an emotion corresponding to the facial expression by recognizing whether the user changes his/her facial expression during utterance, and which emotion the change in the facial expression is based on, e.g., whether the user is angry, laughing, or embarrassed. A correspondence relation between the facial expression and the emotion may be explicitly given by a designer as a rule using a state of eyes or a mouth, or may be obtained by a method of preparing data in which the facial expression and the emotion are associated with each other and performing statistical learning using the data. Additionally, the facial expression identification unit 226 may identify the facial expression of the user by utilizing time series information based on a moving image, or by preparing a reference image (e.g., an image with a blank expression), and comparing the face image output from the face image extraction unit 222 with the reference image. The facial expression of the user and a change in the facial expression of the user identified by the facial expression identification unit 226 are output to the analysis unit 260 as an example of the analysis results of the user behaviors. Note that the speech processing apparatus 20 can also obtain whether the user is talking to another person or is uttering speech to the speech processing apparatus 20 by using the image obtained by the imaging unit 221 as the analysis results of the user behaviors.
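The rule-based variant mentioned above could look roughly like the following sketch, assuming simple scalar feature values produced by the face feature value extraction unit 225. The feature names, thresholds, and emotion labels are illustrative assumptions, not the actual rules of the apparatus; a statistically learned classifier could replace them.

```python
# Hypothetical rule-based mapping from facial feature values to a coarse emotion
# label, as one possible realization of the facial expression identification unit 226.
# Feature names, thresholds, and labels are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class FaceFeatures:
    eye_openness: float       # 0.0 (closed) .. 1.0 (wide open)
    mouth_corner_lift: float  # negative = corners down, positive = corners up
    brow_lowering: float      # 0.0 (relaxed) .. 1.0 (strongly lowered)

def identify_expression(features: FaceFeatures) -> str:
    """Return a coarse emotion label from hand-written rules."""
    if features.brow_lowering > 0.6 and features.mouth_corner_lift < 0.0:
        return "angry"
    if features.mouth_corner_lift > 0.3:
        return "laughing"
    if features.eye_openness < 0.3 and features.mouth_corner_lift < 0.1:
        return "embarrassed"
    return "neutral"

print(identify_expression(FaceFeatures(0.8, 0.5, 0.1)))  # -> laughing
```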
(Speech Processing Unit)
The speech processing unit 240 includes a sound collection unit 241, a speech section detection unit 242, a speech recognition unit 243, a word detection unit 244, an utterance direction estimation unit 245, a speech feature detection unit 246, and an emotion identification unit 247 as illustrated in
The sound collection unit 241 has a function as a speech input unit configured to acquire an electrical sound signal from air vibration containing environmental sound and speech. The sound collection unit 241 outputs the acquired sound signal to the speech section detection unit 242.
The speech section detection unit 242 analyzes the sound signal input from the sound collection unit 241, and detects a speech section equivalent to a speech signal in the sound signal by using an intensity (amplitude) of the sound signal and a feature value indicating a speech likelihood. The speech section detection unit 242 outputs the sound signal corresponding to the speech section, i.e., the speech signal to the speech recognition unit 243, the utterance direction estimation unit 245, and the speech feature detection unit 246. The speech section detection unit 242 may obtain a plurality of speech sections by dividing one utterance section by a break of the speech.
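A minimal, amplitude-only sketch of speech section detection is shown below; a practical speech section detection unit 242 would additionally use a feature value indicating speech likelihood, which is omitted here. The frame length and threshold are assumed values.

```python
# Minimal amplitude-only speech section detector; a real implementation would also
# use a learned speech-likelihood feature. Frame length and threshold are assumptions.
import numpy as np

def detect_speech_sections(signal: np.ndarray, sample_rate: int,
                           frame_ms: int = 20, threshold: float = 0.02):
    """Return (start_sample, end_sample) pairs where frame RMS exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    sections, start = [], None
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        rms = float(np.sqrt(np.mean(signal[i:i + frame_len] ** 2)))
        if rms >= threshold and start is None:
            start = i
        elif rms < threshold and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(signal)))
    return sections
```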
The speech recognition unit 243 recognizes the speech signal input from the speech section detection unit 242 to obtain a character string representing the speech uttered by the user. The character string obtained by the speech recognition unit 243 is output to the word detection unit 244 and the analysis unit 260.
The word detection unit 244 stores therein a list of words possibly falling under the error speech not expressing the meaning intended for conveyance by the user, and detects the stored word from the character string input from the speech recognition unit 243. The word detection unit 244 stores therein, for example, words falling under the filler such as “well” and “umm”, words falling under the soliloquy such as “what was it?” and words corresponding to the negative word such as “not” as the words possibly falling under the error speech. The word detection unit 244 outputs the detected word and an attribute (e.g., the filler or the negative word) of this word to the analysis unit 260.
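The word detection unit 244 can be pictured as a lookup against the stored word list, roughly as in the following sketch. The word list and attribute labels are examples taken from the description above; matching on raw substrings is a simplification made for illustration.

```python
# Hypothetical word detection unit: scan the recognized character string for stored
# words that may constitute error speech and report each hit with its attribute.
ERROR_WORD_LIST = {
    "well": "filler",
    "umm": "filler",
    "what was it?": "soliloquy",
    "not": "negative",
}

def detect_error_words(text: str):
    """Return (word, attribute, position) tuples for every stored word found in text."""
    hits = []
    lowered = text.lower()
    for word, attribute in ERROR_WORD_LIST.items():
        pos = lowered.find(word)
        while pos != -1:
            hits.append((word, attribute, pos))
            pos = lowered.find(word, pos + 1)
    return sorted(hits, key=lambda hit: hit[2])

print(detect_error_words("Send an email let's meet in, umm... where is that? Shibuya tomorrow to A"))
```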
The utterance direction estimation unit 245, which is an example of the behavior analysis unit configured to analyze the user behaviors, analyzes the speech signal input from the speech section detection unit 242 to estimate a user direction as viewed from the speech processing apparatus 20. When the sound collection unit 241 includes a plurality of sound collection elements, the utterance direction estimation unit 245 can estimate the user direction, which is a speech source direction, and movement of the user as viewed from the speech processing apparatus 20 based on a phase difference between speech signals obtained by the respective sound collection elements. The user direction and the user movement are output to the analysis unit 260 as an example of the analysis results of the user behaviors.
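For a two-element sound collection unit, the phase (time) difference between the microphone signals can be turned into a source angle roughly as sketched below. The microphone spacing, the far-field assumption, and the use of plain cross-correlation rather than a more robust estimator are all illustrative choices, not details taken from the disclosure.

```python
# Illustrative two-microphone direction estimate from the time difference of arrival,
# one simple way to exploit the phase difference between sound collection elements.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def estimate_direction(mic_left: np.ndarray, mic_right: np.ndarray,
                       sample_rate: int, mic_distance: float = 0.1) -> float:
    """Return the estimated source angle in degrees (0 = straight ahead)."""
    corr = np.correlate(mic_left, mic_right, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_right) - 1)  # delay in samples
    delay = lag / sample_rate                          # delay in seconds
    sin_theta = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```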
The speech feature detection unit 246 detects a speech feature such as a voice volume, a voice pitch and a pitch fluctuation from the speech signal input from the speech section detection unit 242. Note that the speech feature detection unit 246 can also calculate an utterance speed based on the character string obtained by the speech recognition unit 243 and the length of the speech section detected by the speech section detection unit 242.
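The utterance-speed calculation mentioned above can be sketched as characters of the recognized string per second of the detected speech section, optionally normalized by a stored per-user reference speed; the reference value below is an assumption.

```python
# Sketch of the utterance-speed calculation; the "normal" reference speed is assumed.
def utterance_speed(recognized_text: str, section_seconds: float) -> float:
    return len(recognized_text) / max(section_seconds, 1e-6)

def relative_speed(recognized_text: str, section_seconds: float,
                   normal_chars_per_second: float = 8.0) -> float:
    """> 1.0 means faster than the user's usual speed, < 1.0 slower."""
    return utterance_speed(recognized_text, section_seconds) / normal_chars_per_second
```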
The emotion identification unit 247, which is an example of the behavior analysis unit configured to analyze the user behaviors, identifies an emotion of the user based on the speech feature detected by the speech feature detection unit 246. For example, the emotion identification unit 247 acquires, based on the speech feature detected by the speech feature detection unit 246, information expressed in the voice depending on the emotion, e.g., an articulation degree such as whether the user speaks clearly or unclearly, a relative utterance speed in comparison with a normal utterance speed, and whether the user is angry or embarrassed. A correspondence relation between the speech and the emotion may be explicitly given by a designer as a rule using a voice state, or may be obtained by a method of preparing data in which the voice and the emotion are associated with each other and performing statistical learning using the data. Additionally, the emotion identification unit 247 may identify the emotion of the user by preparing a reference voice of the user, and comparing the speech output from the speech section detection unit 242 with the reference voice. The user emotion and a change in the emotion identified by the emotion identification unit 247 are output to the analysis unit 260 as an example of the analysis results of the user behaviors.
(Analysis Unit)
The analysis unit 260 includes a meaning analysis unit 262, a storage unit 264, and a correction unit 266 as illustrated in
The meaning analysis unit 262 analyzes the meaning of the character string input from the speech recognition unit 243. For example, when a character string “Send an email I won't need dinner tomorrow to Mom” is input, the meaning analysis unit 262 has a portion to perform morphological analysis on the character string and determine that the task is “to send an email” based on keywords such as “send” and “email”, and a portion to acquire the destination and the body as necessary arguments for achieving the task. In this example, “Mom” is acquired as the destination, and “I won't need dinner tomorrow” as the body. The meaning analysis unit 262 outputs these analysis results to the correction unit 266.
Note that a meaning analysis method may be any of a method of achieving the meaning analysis by machine learning using an utterance corpus created in advance, a method of achieving the meaning analysis by a rule, or a combination thereof. Additionally, to perform the morphological analysis as a part of the meaning analysis processing, the meaning analysis unit 262 has a mechanism for assigning an attribute to each word, and an internal dictionary. Using this attribute assignment mechanism and the dictionary, the meaning analysis unit 262 can determine what kind of word each word included in the uttered speech is, that is, its attribute such as a person name, a place name, or a common noun.
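As a toy illustration of the rule-based path, the sketch below determines the task from keywords and extracts the arguments the task needs. The regular expressions, task names, and argument names are assumptions for illustration; the actual meaning analysis unit 262 may combine such rules with machine learning over an utterance corpus, as noted above.

```python
# Toy rule-based meaning analysis: determine the task from keywords and extract the
# arguments the task needs. Patterns and task/argument names are assumptions.
import re

def analyze_meaning(text: str):
    """Return a dict with the task and its arguments, or None when no rule matches."""
    email = re.match(r"send an email (?P<body>.+) to (?P<destination>\w+)$", text, re.I)
    if email:
        return {"task": "send_email",
                "destination": email.group("destination"),
                "body": email.group("body")}
    schedule = re.match(r"register (?P<content>.+) as a schedule for (?P<date>\w+)$", text, re.I)
    if schedule:
        return {"task": "register_schedule",
                "date": schedule.group("date"),
                "content": schedule.group("content")}
    return None

print(analyze_meaning("Send an email I won't need dinner tomorrow to Mom"))
# {'task': 'send_email', 'destination': 'Mom', 'body': "I won't need dinner tomorrow"}
```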
The storage unit 264 stores therein a history of information regarding the user. The storage unit 264 may store therein information indicating, for example, what kind of order the user has given to the speech processing apparatus 20 by speech, and what kind of condition the image processing unit 220 and the speech processing unit 240 have identified regarding the user.
The correction unit 266 corrects the analysis results of the character string obtained by the meaning analysis unit 262. The correction unit 266 specifies a portion corresponding to the error speech included in the character string based on, for example, the change in the visual line of the user input from the visual line identification unit 224, the change in the facial expression of the user input from the facial expression identification unit 226, the word detection results input from the word detection unit 244, and the history of the information regarding the user stored in the storage unit 264, and corrects the portion corresponding to the error speech by deleting or replacing the portion. The correction unit 266 may specify the portion corresponding to the error speech in accordance with a rule in which a relation between each input and the error speech is described, or based on statistical learning of each input. The specification and correction processing of the portion corresponding to the error speech by the correction unit 266 will be more specifically described in “3. Specific examples of Meaning correction”.
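One possible rule of the kind the correction unit 266 might apply is sketched below: a speech section is treated as error speech when it contains a detected filler, soliloquy, or negative word and the behavior analysis for that section (visual line away from the apparatus, or an embarrassed expression) supports that reading. The data structure and the specific rule are assumptions for illustration, and a statistically learned model could replace them, as noted above.

```python
# Hypothetical correction rule: a section is error speech when it contains an
# error-type word AND the user's behavior during that section supports the reading.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpeechSection:
    text: str
    detected_words: List[Tuple[str, str]]  # (word, attribute) pairs from the word detection unit
    looking_at_apparatus: bool             # from the visual line identification unit
    expression: str                        # from the facial expression identification unit

def is_error_section(section: SpeechSection) -> bool:
    has_error_word = any(attr in ("filler", "soliloquy", "negative")
                         for _, attr in section.detected_words)
    behavior_suggests_error = (not section.looking_at_apparatus
                               or section.expression == "embarrassed")
    return has_error_word and behavior_suggests_error

def corrected_text(sections: List[SpeechSection]) -> str:
    """Keep only the sections judged to be correct speech."""
    return " ".join(s.text for s in sections if not is_error_section(s))

# Toy recreation of the kind of correction described in the first example below:
# the middle section contains a filler and is uttered while the user looks away.
sections = [
    SpeechSection("let's meet in", [], True, "neutral"),
    SpeechSection("umm... where is that?", [("umm", "filler")], False, "embarrassed"),
    SpeechSection("Shibuya tomorrow", [], True, "neutral"),
]
print(corrected_text(sections))  # -> let's meet in Shibuya tomorrow
```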
(Processing Execution Unit)
The processing execution unit 280 executes processing in accordance with the meaning corrected by the correction unit 266. The processing execution unit 280 may be, for example, a communication unit that sends an email, a schedule management unit that inputs an appointment to a schedule, an answer processing unit that answers a question from the user, an appliance control unit that controls operations of household electrical appliances, or a display control unit that changes display contents in accordance with the meaning corrected by the correction unit 266.
Specific Examples of Meaning Correction
The configuration of the speech processing apparatus 20 according to the embodiment of the present disclosure has been described above. Subsequently, some specific examples of the meaning correction performed by the correction unit 266 of the speech processing apparatus 20 will be sequentially described.
First Example
Moreover, in the example of
The correction unit 266 specifies whether each speech portion uttered by the user corresponds to the correct speech or the error speech based on the analysis results of the user behaviors such as the visual line direction, the facial expression and the utterance direction, and the detection of the filler. In the example illustrated in
As a result, the correction unit 266 deletes the meaning of the portion corresponding to the speech section A2 from the meaning of the uttered speech acquired by the meaning analysis unit 262. That is, the correction unit 266 corrects the meaning of the email body from “let's meet in, umm . . . where is that? Shibuya tomorrow” to “let's meet in Shibuya tomorrow”. With such a configuration, the processing execution unit 280 sends an email having a body “Let's meet in Shibuya tomorrow” intended for conveyance by the user to A.
Second Example
Moreover, in the example of
The correction unit 266 specifies whether each speech portion uttered by the user corresponds to the correct speech or the error speech based on the analysis results of the user behaviors such as the visual line direction, the facial expression and the utterance direction, and the detection of the negative word. In the example illustrated in
As a result, the correction unit 266 deletes the meaning of the speech portion corresponding to “not in Shibuya” from the meaning of the uttered speech acquired by the meaning analysis unit 262. That is, the correction unit 266 corrects the content of the schedule from “meeting in Shinjuku, not in Shibuya” to “meeting in Shinjuku”. With such a configuration, the processing execution unit 280 registers “meeting in Shinjuku” as a schedule for tomorrow.
Third Example
Moreover, in the example of
The correction unit 266 specifies whether each speech portion uttered by the user corresponds to the correct speech or the error speech based on the analysis results of the user behaviors such as the visual line direction, the facial expression and the utterance direction, and the detection of the negative word. In the example illustrated in
Fourth Example
Moreover, in the example of
The correction unit 266 specifies whether each speech portion uttered by the user corresponds to the correct speech or the error speech based on the analysis results of the user behaviors such as the visual line direction, the facial expression and the utterance direction, and the detection of the filler. In the example illustrated in
Additionally, in the example illustrated in
As a result, the correction unit 266 deletes the meanings of the portions corresponding to the speech sections D2 and D3 from the meaning of the uttered speech acquired by the meaning analysis unit 262. That is, the correction unit 266 corrects the meaning of the email body from “let's meet in, umm . . . where is that? Shibuya. Shibuya tomorrow” to “let's meet in Shibuya tomorrow”. With such a configuration, the processing execution unit 280 sends an email having a body “Let's meet in Shibuya tomorrow” intended for conveyance by the user to C.
The example in which speech uttered by a user other than the user whose speech is to be processed by the speech processing apparatus 20 is also input to the meaning analysis unit 262 has been described above. Alternatively, speech determined to have been uttered by another user, based on the utterance direction estimated by the utterance direction estimation unit 245, may be deleted before being input to the meaning analysis unit 262.
Operation of Speech Processing Apparatus
The configuration of the speech processing apparatus 20 and the specific examples of the processing according to the embodiment of the present disclosure have been described above. Subsequently, the operation of the speech processing apparatus 20 according to the embodiment of the present disclosure will be described with reference to
The speech recognition unit 243 recognizes the speech signal input from the speech section detection unit 242 to obtain the character string representing the speech uttered by the user (S320). The meaning analysis unit 262 then analyzes the meaning of the character string input from the speech recognition unit 243 (S330).
In parallel with the above steps S310 to S330, the speech processing apparatus 20 analyzes the user behaviors (S340). For example, the visual line identification unit 224 of the speech processing apparatus 20 identifies the visual line direction of the user, and the facial expression identification unit 226 identifies the facial expression of the user.
After that, the correction unit 266 corrects the analysis results of the character string obtained by the meaning analysis unit 262 based on the history information stored in the storage unit 264 and the analysis results of the user behaviors (S350). The processing execution unit 280 executes the processing in accordance with the meaning corrected by the correction unit 266 (S360).
Modification
The embodiment of the present disclosure has been described above. Hereinafter, some modifications of the embodiment of the present disclosure will be described. Note that the respective modifications described below may be applied to the embodiment of the present disclosure individually or by combination. Additionally, the respective modifications may be applied instead of the configuration described in the embodiment of the present disclosure or added to the configuration described in the embodiment of the present disclosure.
For example, the function of the correction unit 266 may be enabled or disabled depending on the application to be used, that is, the task in accordance with the meaning analyzed by the meaning analysis unit 262. More specifically, error speech may occur easily in some applications and rarely in others. In this case, the function of the correction unit 266 is disabled in applications in which error speech rarely occurs and is enabled in applications in which error speech occurs easily. This prevents corrections not intended by the user.
Additionally, the above embodiment has described the example in which the correction unit 266 performs the meaning correction after the meaning analysis performed by the meaning analysis unit 262. The processing order and the processing contents are not limited to the above example. For example, the correction unit 266 may delete the error speech portion first, and the meaning analysis unit 262 may then analyze the meaning of the character string from which the error speech portion has been deleted. This configuration can shorten the length of the character string as a target of the meaning analysis performed by the meaning analysis unit 262, and reduce the processing load on the meaning analysis unit 262.
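The alternative ordering described above, deleting the error speech portion before the meaning analysis, could be sketched as a simple string cleanup step ahead of the meaning analysis unit 262; the word list below is an assumption for illustration.

```python
# Sketch of the modification: strip detected error-speech words from the character
# string first, then pass the shorter string to meaning analysis.
def delete_error_words(text: str,
                       error_words=("umm", "well", "where is that?")) -> str:
    for word in error_words:
        text = text.replace(word, "")
    return " ".join(text.split())  # collapse leftover whitespace

print(delete_error_words("Send an email let's meet in, umm where is that? Shibuya tomorrow to A"))
# -> Send an email let's meet in, Shibuya tomorrow to A
```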
Moreover, the above embodiment has described the example in which the speech processing apparatus 20 has the plurality of functions illustrated in
Hardware Configuration
The embodiment of the present disclosure has been described above. The information processing such as the image processing, the speech processing and the meaning analysis described above is achieved by cooperation between software and hardware of the speech processing apparatus 20 described below.
The CPU 201 functions as an arithmetic processor and a controller, and controls the entire operation of the speech processing apparatus 20 in accordance with various computer programs. The CPU 201 may also be a microprocessor. The ROM 202 stores computer programs, operation parameters, and the like to be used by the CPU 201. The RAM 203 temporarily stores computer programs used in execution by the CPU 201, parameters that change as appropriate during the execution, and the like. These units are connected to one another via a host bus including, for example, a CPU bus. The CPU 201, the ROM 202, and the RAM 203 can cooperate with software to achieve the functions of, for example, the eye feature value extraction unit 223, the visual line identification unit 224, the face feature value extraction unit 225, the facial expression identification unit 226, the speech section detection unit 242, the speech recognition unit 243, the word detection unit 244, the utterance direction estimation unit 245, the speech feature detection unit 246, the emotion identification unit 247, the analysis unit 260, and the processing execution unit 280 described with reference to
The input device 208 includes an input unit that allows the user to input information, such as a mouse, a keyboard, a touch panel, a button, a microphone, a switch and a lever, and an input control circuit that generates an input signal based on the input from the user and outputs the input signal to the CPU 201. The user of the speech processing apparatus 20 can input various data or instruct processing operations to the speech processing apparatus 20 by operating the input device 208.
The output device 210 includes a display device such as a liquid crystal display (LCD) device, an organic light emitting diode (OLED) device, and a lamp. The output device 210 further includes a speech output device such as a speaker and a headphone. The display device displays, for example, a captured image or a generated image. Meanwhile, the speech output device converts speech data or the like to a speech and outputs the speech.
The storage device 211 is a data storage device configured as an example of the storage unit of the speech processing apparatus 20 according to the present embodiment. The storage device 211 may include a storage medium, a recording device that records data on the storage medium, a read-out device that reads out the data from the storage medium, and a deleting device that deletes the data recorded on the storage medium. The storage device 211 stores therein computer programs to be executed by the CPU 201 and various data.
The drive 212 is a storage medium reader-writer, and is incorporated in or externally connected to the speech processing apparatus 20. The drive 212 reads out information recorded on a removable storage medium 24 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory loaded thereinto, and outputs the information to the RAM 203. The drive 212 can also write information onto the removable storage medium 24.
The imaging device 213 includes an imaging optical system such as a photographic lens and a zoom lens for collecting light, and a signal conversion element such as a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) sensor. The imaging optical system collects light emitted from a subject to form a subject image on the signal conversion element, and the signal conversion element converts the formed subject image into an electrical image signal.
The communication device 215 is, for example, a communication interface including a communication device to be connected to the network 12. The communication device 215 may also be a wireless local area network (LAN) compatible communication device, a long term evolution (LTE) compatible communication device, or a wired communication device that performs wired communication.
Conclusion
In accordance with the embodiment of the present disclosure described above, various effects can be obtained.
For example, the speech processing apparatus 20 according to the embodiment of the present disclosure specifies the portion corresponding to the correct speech and the portion corresponding to the error speech by using not only the detection of a particular word but also the user behaviors when the particular word is detected. Consequently, a more appropriate specification result can be obtained. The speech processing apparatus 20 according to the embodiment of the present disclosure can also specify the speech uttered by a different user from the user who has uttered the speech to the speech processing apparatus 20 as the error speech by further using the utterance direction.
The speech processing apparatus 20 according to the embodiment of the present disclosure deletes or corrects the meaning of the portion specified as the error speech. Thus, even when the speech of the user includes the error speech, the speech processing apparatus 20 can obtain the meaning intended for conveyance by the user from the speech of the user without requiring the user to utter the speech again. As a result, the trouble for the user can be reduced.
The preferred embodiment(s) of the present disclosure has/have been described in detail with reference to the accompanying drawings, whilst the technical scope of the present disclosure is not limited to the above examples. A person skilled in the art may find various alterations and modifications within the technical scope of the appended claims, and it should be understood that they will naturally come under the technical scope of the present disclosure.
For example, the respective steps in the processing carried out by the speech processing apparatus 20 in this specification do not necessarily have to be time-sequentially performed in accordance with the order described as the flowchart. For example, the respective steps in the processing carried out by the speech processing apparatus 20 may be performed in an order different from the order described as the flowchart, or may be performed in parallel.
Additionally, a computer program that allows the hardware such as the CPU, the ROM and the RAM incorporated in the speech processing apparatus 20 to demonstrate a function equivalent to that of each configuration of the speech processing apparatus 20 described above can also be created. A storage medium storing the computer program is also provided.
Moreover, the effects described in this specification are merely illustrative or exemplary, and not restrictive. That is, with or in the place of the above effects, the technology according to the present disclosure can achieve other effects that are obvious to a person skilled in the art from the description of this specification.
Additionally, the present technology may also be configured as below.
(1)
A speech processing apparatus comprising an analysis unit configured to analyze a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
(2)
The speech processing apparatus according to (1), wherein the analysis unit includes
a meaning analysis unit configured to analyze the meaning of the speech uttered by the user based on the recognition result of the speech, and
a correction unit configured to correct the meaning obtained by the meaning analysis unit based on the analysis result of the behavior of the user.
(3)
The speech processing apparatus according to (2), wherein the correction unit determines whether to delete the meaning of the speech corresponding to one speech section in an utterance period of the user based on the analysis result of the behavior of the user in the speech section.
(4)
The speech processing apparatus according to any one of (1) to (3), wherein the analysis unit uses an analysis result of a change in a visual line of the user as the analysis result of the behavior of the user.
(5)
The speech processing apparatus according to any one of (1) to (4), wherein the analysis unit uses an analysis result of a change in a facial expression of the user as the analysis result of the behavior of the user.
(6)
The speech processing apparatus according to any one of (1) to (5), wherein the analysis unit uses an analysis result of a change in an utterance direction as the analysis result of the behavior of the user.
(7)
The speech processing apparatus according to any one of (1) to (6), wherein the analysis unit further analyzes the meaning of the speech based on a relation between the user and another user indicated by the speech.
(8)
The speech processing apparatus according to (3), wherein the correction unit further determines whether to delete the meaning of the speech corresponding to the speech section based on whether a particular word is included in the speech section.
(9)
The speech processing apparatus according to (8), wherein the particular word includes a filler or a negative word.
(10)
The speech processing apparatus according to any one of (1) to (9), further comprising:
a speech input unit to which the speech uttered by the user is input;
a speech recognition unit configured to recognize the speech input to the speech input unit;
a behavior analysis unit configured to analyze the behavior of the user while the user is uttering the speech; and
a processing execution unit configured to execute processing in accordance with the meaning obtained by the analysis unit.
(11)
A speech processing method comprising analyzing, by a processor, a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
REFERENCE SIGNS LIST
20 SPEECH PROCESSING APPARATUS
30 MOBILE TERMINAL
220 IMAGE PROCESSING UNIT
221 IMAGING UNIT
222 FACE IMAGE EXTRACTION UNIT
223 EYE FEATURE VALUE EXTRACTION UNIT
224 VISUAL LINE IDENTIFICATION UNIT
225 FACE FEATURE VALUE EXTRACTION UNIT
226 FACIAL EXPRESSION IDENTIFICATION UNIT
240 SPEECH PROCESSING UNIT
241 SOUND COLLECTION UNIT
242 SPEECH SECTION DETECTION UNIT
243 SPEECH RECOGNITION UNIT
244 WORD DETECTION UNIT
245 UTTERANCE DIRECTION ESTIMATION UNIT
246 SPEECH FEATURE DETECTION UNIT
247 EMOTION IDENTIFICATION UNIT
260 ANALYSIS UNIT
262 MEANING ANALYSIS UNIT
264 STORAGE UNIT
266 CORRECTION UNIT
280 PROCESSING EXECUTION UNIT
Claims
1. A speech processing apparatus comprising an analysis unit configured to analyze a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
2. The speech processing apparatus according to claim 1, wherein the analysis unit includes
- a meaning analysis unit configured to analyze the meaning of the speech uttered by the user based on the recognition result of the speech, and
- a correction unit configured to correct the meaning obtained by the meaning analysis unit based on the analysis result of the behavior of the user.
3. The speech processing apparatus according to claim 2, wherein the correction unit determines whether to delete the meaning of the speech corresponding to one speech section in an utterance period of the user based on the analysis result of the behavior of the user in the speech section.
4. The speech processing apparatus according to claim 1, wherein the analysis unit uses an analysis result of a change in a visual line of the user as the analysis result of the behavior of the user.
5. The speech processing apparatus according to claim 1, wherein the analysis unit uses an analysis result of a change in a facial expression of the user as the analysis result of the behavior of the user.
6. The speech processing apparatus according to claim 1, wherein the analysis unit uses an analysis result of a change in an utterance direction as the analysis result of the behavior of the user.
7. The speech processing apparatus according to claim 1, wherein the analysis unit further analyzes the meaning of the speech based on a relation between the user and another user indicated by the speech.
8. The speech processing apparatus according to claim 3, wherein the correction unit further determines whether to delete the meaning of the speech corresponding to the speech section based on whether a particular word is included in the speech section.
9. The speech processing apparatus according to claim 8, wherein the particular word includes a filler or a negative word.
10. The speech processing apparatus according to claim 1, further comprising:
- a speech input unit to which the speech uttered by the user is input;
- a speech recognition unit configured to recognize the speech input to the speech input unit;
- a behavior analysis unit configured to analyze the behavior of the user while the user is uttering the speech; and
- a processing execution unit configured to execute processing in accordance with the meaning obtained by the analysis unit.
11. A speech processing method comprising analyzing, by a processor, a meaning of speech uttered by a user based on a recognition result of the speech and an analysis result of a behavior of the user while the user is uttering the speech.
Type: Application
Filed: Jan 25, 2019
Publication Date: Jun 3, 2021
Applicant: Sony Corporation (Tokyo)
Inventor: Chika Myoga (Tokyo)
Application Number: 17/046,747