Information processing apparatus and method therefor
An information processing apparatus using a speech signal, comprising a playback unit configured to play back the speech signal, a speech recognition unit configured to subject the speech signal to speech recognition, a text generator to generate a linguistic text having linguistic elements and time information for synchronizing with playback of the speech signal, by using a speech recognition result of the speech recognition unit, and a presentation unit configured to present selectively the linguistic elements together with the time information in synchronism with the speech signal played back by the playback unit.
This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2003-207622, filed Aug. 15, 2003, the entire contents of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an information processing apparatus, and particularly to an information processing apparatus that outputs linguistic information based on a speech recognition result, and an information processing method therefor.
2. Description of the Related Art
Recently, research on metadata generation using linguistic information obtained from a speech recognition result of a speech signal has been actively pursued. Applying the generated metadata to a speech signal is useful for data management and search.
For example, Japanese Patent Laid-Open No. 8-249343 provides a technique for searching desired audio data by extracting specific expressions and keywords from a linguistic text obtained as a speech recognition result of the audio data, and indexing them to build an audio database.
There is a technique in which the linguistic text obtained as a speech recognition result is used as metadata for data management or search. However, there is no technique for dynamically displaying the linguistic text of the speech recognition result so that a user can easily understand the contents of a speech, and of a video corresponding to the speech, and can control its playback.
The object of the present invention is to provide an information processing apparatus capable of generating a linguistic text by speech recognition and displaying the linguistic text dynamically, and a method therefor.
BRIEF SUMMARY OF THE INVENTION
An aspect of the present invention is to provide an information processing apparatus using a speech signal, comprising: a playback unit configured to play back the speech signal; a speech recognition unit configured to subject the speech signal to speech recognition; a text generator to generate a linguistic text having linguistic elements and time information for synchronizing with playback of the speech signal, by using a speech recognition result of the speech recognition unit; and a presentation unit configured to present selectively the linguistic elements together with the time information in synchronism with the speech signal played back by the playback unit.
Another aspect of the present invention is to provide an information processing apparatus using a video-audio signal, comprising: a speech playback unit configured to play back a speech signal from the video-audio signal; a speech recognition unit configured to subject the speech signal to speech recognition; a text generator to generate a linguistic text having linguistic elements and time information for synchronizing with playback of the speech signal, by using a speech recognition result of the speech recognition unit; and a presentation unit configured to present selectively the linguistic elements together with the time information in synchronism with the speech signal played back by the speech playback unit.
Another aspect of the present invention is to provide an information processing method comprising: subjecting a speech signal to speech recognition to obtain a speech recognition result; generating a linguistic text including linguistic elements and time information for synchronizing with playback of the speech signal according to the speech recognition result; playing back the speech signal; and displaying selectively the linguistic elements together with the time information in synchronism with the speech signal being played back.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
There will now be described an embodiment of the present invention in conjunction with the accompanying drawings.
(First Embodiment)
The AV information delay unit (memory) 12 temporarily stores the AV information output from the data separator 10. The AV information is delayed until its speech signal has been recognized by the speech recognition unit 13 and linguistic information has been generated based on the speech recognition result. The AV information is output from the AV information delay unit 12 when the generated linguistic information is output from the linguistic information output unit 14. The speech recognition unit 13 acquires, as linguistic information from the speech signal, information including part-of-speech information of all recognizable words.
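The buffering behavior of the delay unit 12 can be sketched as follows. This is a minimal illustration, not the application's actual implementation; the class and method names (`AVDelayBuffer`, `on_recognition_result`) are hypothetical, and real AV frames would of course be binary data rather than strings.

```python
from collections import deque

class AVDelayBuffer:
    """Sketch of the AV information delay unit 12: frames are buffered
    until the recognizer has produced linguistic information covering
    their time range, then released for synchronized output."""

    def __init__(self):
        self.buffer = deque()          # (timestamp, frame) pairs awaiting release
        self.recognized_until = -1.0   # latest time covered by recognition results

    def push(self, timestamp, frame):
        # Frames arrive from the data separator in timestamp order.
        self.buffer.append((timestamp, frame))

    def on_recognition_result(self, end_time):
        # The speech recognition unit reports how far it has processed.
        self.recognized_until = max(self.recognized_until, end_time)

    def pop_ready(self):
        # Release only frames whose speech has already been recognized.
        ready = []
        while self.buffer and self.buffer[0][0] <= self.recognized_until:
            ready.append(self.buffer.popleft())
        return ready
```

With this structure, playback of the AV information can never run ahead of the linguistic information generated for it, which is the property the synchronous processor 15 relies on.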
The delayed AV information output from the AV information delay unit 12 and the linguistic information output from the linguistic information output unit 14 are supplied to the synchronous processor 15. The synchronous processor 15 plays back the delayed AV information. In addition, the synchronous processor 15 converts the linguistic text included in the linguistic information to a video signal, and outputs it to the display controller 16 in synchronism with playback of the AV information. The speech signal of the AV information played back by the synchronous processor 15 is input to a speaker 22 via an audio circuit 21, and the video playback signal is supplied to the display controller 16.
The display controller 16 synthesizes the video signal of the linguistic text with the image signal of the AV information and supplies the result to the display 17 for display. The linguistic information output from the linguistic information output unit 14 can be stored in a recorder 18 such as an HDD, or on a recording medium such as a DVD 19.
At first, in step S1, the linguistic information output unit 14 acquires a speech recognition result from the speech recognizer 13. A presentation method for the linguistic information is set during speech recognition or beforehand (step S2). The acquisition of information for setting the presentation method is described hereinafter.
In step S3, the linguistic text included in the speech recognition result acquired by the speech recognizer 13 is analyzed. This analysis can use a well-known morphological analysis technique. Various kinds of natural language processing, such as extraction of keywords and important sentences from the analysis result of the linguistic text, are performed. For example, summary information may be generated based on the morphological analysis result of the linguistic text included in the speech recognition result, and used as the linguistic information to be presented. It should be noted that linguistic information based on such summary information still requires time information for synchronizing with playback of the speech signal.
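The keyword-selection side of step S3 can be sketched as below, assuming the recognizer delivers each word with its speech start time. The toy part-of-speech table is a stand-in for a real morphological analyzer, and the function name `extract_keywords` is illustrative; the essential point is that each selected keyword carries its timestamp forward so it can later be presented in sync with playback.

```python
# Toy part-of-speech table standing in for a morphological analyzer.
TOY_POS = {"election": "noun", "results": "noun", "were": "verb",
           "announced": "verb", "today": "noun", "the": "det"}

def extract_keywords(recognized_words, keep_pos=("noun",)):
    """recognized_words: list of (word, speech_start_time) pairs from the
    recognizer.  Keeps only words whose part of speech is in keep_pos,
    preserving each word's start time for synchronized presentation."""
    return [(w, t) for w, t in recognized_words
            if TOY_POS.get(w.lower()) in keep_pos]

words = [("The", 0.0), ("election", 0.2), ("results", 0.7),
         ("were", 1.1), ("announced", 1.3), ("today", 1.9)]
print(extract_keywords(words))
# → [('election', 0.2), ('results', 0.7), ('today', 1.9)]
```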
In step S4, presentation linguistic information is selected. Concretely, information on words and phrases, or information on sentences, is selected according to setting information such as the basis of selection and the quantity of presentation. In step S5, an output (presentation) unit for the presentation linguistic information selected in step S4 is determined. In step S6, the presentation timing is set for every output unit based on the speech start time information. In step S7, the presentation duration is determined for each output unit.
In step S8, linguistic information representing a presentation notation, a presentation start time, and a presentation duration is output.
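Steps S6 through S8 can be sketched as a small scheduling function. This is an assumption-laden illustration of one plausible policy (the "until the speech start of the next word or phrase" rule mentioned later, with a fixed fallback for the last unit); the name `schedule_presentation` and the millisecond time unit are the author's own choices here, not the application's.

```python
def schedule_presentation(units, default_duration=2000):
    """units: list of (notation, speech_start_ms) output units.
    Returns (notation, presentation_start_ms, duration_ms) per unit:
    each unit appears at its speech start time (step S6) and stays on
    screen until the next unit's speech starts (step S7); the last unit
    gets a fixed fallback duration."""
    scheduled = []
    for i, (text, start) in enumerate(units):
        if i + 1 < len(units):
            duration = units[i + 1][1] - start   # until the next unit's speech start
        else:
            duration = default_duration          # no following unit: use fallback
        scheduled.append((text, start, duration))
    return scheduled

units = [("election", 200), ("results", 700), ("today", 1900)]
print(schedule_presentation(units))
# → [('election', 200, 500), ('results', 700, 1200), ('today', 1900, 2000)]
```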
At first, in step S10, it is decided whether to present keywords (important words or phrases). When keywords are to be presented, the process advances to step S11; otherwise, the process advances to step S12, and linguistic information is chosen and presented in units of a sentence.
In step S11, for setting the generation and basis of selection of presentation words or phrases, the user sets the part-of-speech specification, important word or phrase presentation, priority presentation words or phrases, and the presentation quantity. In step S12, for setting the presentation sentence generation and basis of selection, the user sets the presentation of sentences including designated words or phrases, a summary ratio, and so on. When setting is done in either step S11 or step S12, the process advances to step S13. In step S13, it is decided whether the linguistic information should be presented dynamically. When the user instructs a dynamic presentation, the velocity and direction of the dynamic presentation are set in step S14. Concretely, the direction and speed at which the presentation notation is scrolled are set.
In step S15, a unit of presentation and a start timing are designated. The unit of presentation is “sentence”, “clause”, or “words and phrases”, and the sentence-head speech start time, clause speech start time, or word-and-phrase speech start time is set as the start timing. In step S16, a presentation duration is designated for each unit of presentation. Here, the presentation duration can be designated as “until the speech start of the next word or phrase”, “a number of seconds”, or “until the end of the sentence”. In step S17, a presentation mode is set. The presentation mode includes, for example, the position of a unit of presentation, the character style (font), the size, and so on. The presentation mode is preferably set for all words and phrases, or for every designated word or phrase.
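The settings gathered through steps S10 to S17 can be collected in a single configuration object, sketched below. Every field name and default value here is illustrative, not the application's actual terminology; the point is only that the flowchart's choices map naturally onto one settings record consulted at presentation time.

```python
from dataclasses import dataclass

@dataclass
class PresentationSettings:
    """Hypothetical container for the choices made in steps S10-S17."""
    present_keywords: bool = True        # step S10: keywords vs. whole sentences
    parts_of_speech: tuple = ("noun",)   # step S11: part-of-speech specification
    summary_ratio: float = 0.3           # step S12: used when presenting sentences
    dynamic: bool = True                 # step S13: dynamic (scrolling) display
    scroll_direction: str = "left"       # step S14: direction of scrolling
    scroll_speed: float = 80.0           # step S14: e.g. pixels per second
    unit: str = "words and phrases"      # step S15: sentence / clause / words
    duration_rule: str = "until next"    # step S16: or "N seconds" / "sentence end"
    font: str = "gothic"                 # step S17: character style
    size: int = 24                       # step S17: character size

# The defaults act as the "beforehand" setting of step S2; the user may
# override any field, e.g. to scroll clauses upward instead:
settings = PresentationSettings(unit="clause", scroll_direction="up")
print(settings.unit, settings.scroll_direction)
```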
A TV viewer can visually understand the speech contents 51 in synchronism with the image 53 through such dynamic display (presentation) of the keyword caption. Together with the played-back speech contents 51, this helps understanding of the contents, such as confirming misheard content or quickly grasping the broad content. The speech recognizer 13, the linguistic information output unit 14, the synchronous processor 15, the display controller 16, and so on may be implemented by computer software.
(Second Embodiment)
The home server 60 further includes a search processor 600 that provides a search screen, for searching AV information stored in the AV information storage unit 61, to a user terminal 68 and to network home appliances (an AV television) 69 through a network 67 from a communication I/F (interface) unit 66.
The representative image (reduced still image) or reduced video of the part contents obtained by dividing the contents 81b (here, “news B”) is displayed on the region 82b. The linguistic information representing the speech contents of the part contents whose start time is 11:30, and of the part contents whose start time is 11:35, is scroll-displayed on the region 83b.
As described above, the keywords of the speech contents are displayed in a list, for each part contents, on the search screen 80 provided by the search processor 600. When each scrolling display reaches the end of its speech contents, it returns to the beginning and repeats. In the case of displaying the regions 82a, 84a, 82b and 84b as movie display, the movie display and the scrolling display may be synchronized in content. In this case, the first embodiment may be applied. When a linguistic text is obtained by speech recognition, the time information for synchronization may be derived from (the speech signal of) the contents to be recognized.
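The repeating scroll described above reduces to a simple wrapping position computation, sketched here under stated assumptions: horizontal scrolling, text entering from the right edge of its region, and a function name (`scroll_offset`) and pixel units chosen for illustration only.

```python
def scroll_offset(elapsed, speed, text_width, region_width):
    """Left-edge x position of a scrolling caption that restarts from
    the beginning once the whole text has passed through its region.
    One full cycle moves the text across region_width + text_width
    pixels; taking the elapsed time modulo the cycle length makes the
    display repeat, as on the search screen 80."""
    cycle = (region_width + text_width) / speed   # seconds per full pass
    phase = elapsed % cycle                       # wrap to repeat the display
    return region_width - speed * phase           # enters at right, exits at left
```

For a 100-pixel region, 50-pixel text, and 50 pixels per second, one cycle lasts 3 seconds: the text starts at the right edge, is fully gone by t = 3, and immediately reappears at the right edge.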
When a user specifies a keyword 86b by, for example, a mouse M in the search screen 80 as shown in
According to the second embodiment, a TV viewer can visually understand the speech content of the contents through the dynamic scrolling display of keywords generated based on the speech recognition result. In addition, desired contents can be adequately selected from the listed contents based on this visual understanding, realizing an efficient search of the AV information. According to the present invention as discussed above, it is possible to provide an information processing apparatus that generates a linguistic text by speech recognition and displays the linguistic text dynamically, and a method therefor.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Claims
1. An information processing apparatus using a speech signal, comprising:
- a playback unit configured to play back the speech signal;
- a speech recognition unit configured to subject the speech signal to speech recognition;
- a text generator to generate a linguistic text having linguistic elements and time information for synchronizing with playback of the speech signal, by using a speech recognition result of the speech recognition unit; and
- a presentation unit configured to present selectively the linguistic elements together with the time information in synchronism with the speech signal played back by the playback unit.
2. An information processing apparatus using a video-audio signal, comprising:
- a speech playback unit configured to play back a speech signal from the video-audio signal;
- a speech recognition unit configured to subject the speech signal to speech recognition;
- a text generator to generate a linguistic text having linguistic elements and time information for synchronizing with playback of the speech signal, by using a speech recognition result of the speech recognition unit; and
- a presentation unit configured to present selectively the linguistic elements together with the time information in synchronism with the speech signal played back by the speech playback unit.
3. The apparatus according to claim 2, which further includes a receiver unit configured to receive the video-audio signal including the speech signal, and a delay unit configured to temporarily store the video-audio signal received by the receiver unit and delay output of the video-audio signal until the text generator generates the linguistic text.
4. The apparatus according to claim 2, which includes a video player to play back a video signal of the video-audio signal in synchronism with the speech signal, and wherein the presentation unit includes a display device configured to display the linguistic text together with the video signal played back by the video player.
5. The apparatus according to claim 4, which further includes a receiver unit configured to receive the video-audio signal including the speech signal, and a delay unit configured to temporarily store the video-audio signal received by the receiver unit and delay output of the video-audio signal until the text generator generates the linguistic text.
6. The apparatus according to claim 2 and adapted to a recording medium, which further includes a synthesis unit configured to synthesize an image signal representing the linguistic text with the playback video signal, and an output unit configured to output a synthesis result of the synthesis unit to the recording medium.
7. The apparatus according to claim 6, which further includes a receiver unit configured to receive the video-audio signal including the speech signal, and a delay unit configured to temporarily store the video-audio signal received by the receiver unit and delay output of the video-audio signal until the text generator generates the linguistic text.
8. The apparatus according to claim 2, wherein the linguistic elements include words.
9. An information processing apparatus comprising:
- a memory to store a plurality of speech signals;
- a text generator to generate a plurality of linguistic texts by subjecting the speech signals to speech recognition;
- a keyword extractor to extract a plurality of keywords from the linguistic texts; and
- a display device configured to display the keywords dynamically.
10. The apparatus according to claim 9, wherein the display is configured to display a plurality of keywords dynamically for each of the linguistic texts.
11. The apparatus according to claim 9, which includes a selector to select from the speech signals of the memory a speech signal corresponding to a keyword of the keywords which is specified by a user, and a speech reproducer to reproduce the speech signal selected by the selector.
12. The apparatus according to claim 11, wherein the display is configured to display a plurality of keywords dynamically for each of the linguistic texts.
13. The apparatus according to claim 11 and adapted to a user terminal, which includes a transmitter to transmit the speech signal or the video-audio signal to the user terminal via a network.
14. The apparatus according to claim 9, wherein the memory stores video-audio signals including the speech signal, and which includes a selector to select from the video-audio signals of the memory a video-audio signal corresponding to a keyword of the keywords which is specified by a user, and a video-audio reproducer to reproduce the video-audio signal selected by the selector.
15. The apparatus according to claim 14, wherein the display is configured to display a plurality of keywords dynamically for each of the linguistic texts.
16. The apparatus according to claim 14 and adapted to a user terminal, which includes a transmitter to transmit the speech signal or the video-audio signal to the user terminal via a network.
17. The apparatus according to claim 9, wherein the keywords each represent part of speech contents of the speech signal.
18. An information processing method comprising:
- subjecting a speech signal to speech recognition to obtain a speech recognition result;
- generating a linguistic text including linguistic elements and time information for synchronizing with playback of the speech signal according to the speech recognition result;
- playing back the speech signal; and
- displaying selectively the linguistic elements together with the time information in synchronism with the speech signal being played back.
19. An information processing method comprising:
- storing a plurality of speech signals;
- subjecting the speech signals to speech recognition to generate a plurality of linguistic texts;
- extracting a plurality of keywords from the linguistic texts; and
- displaying the keywords dynamically.
20. An information processing program stored in a computer readable medium, comprising:
- means for instructing a computer to subject a speech signal to speech recognition to obtain a speech recognition result;
- means for instructing the computer to generate a linguistic text including time information for synchronizing with playback of the speech signal according to the speech recognition result;
- means for instructing the computer to reproduce the speech signal; and
- means for instructing the computer to display the linguistic text in synchronism with the reproduced speech signal.
21. An information processing program stored in a computer readable medium, comprising:
- means for instructing a computer to store a plurality of speech signals in a memory;
- means for instructing the computer to subject the speech signals to speech recognition to generate a plurality of linguistic texts;
- means for instructing the computer to extract a plurality of keywords from the linguistic texts; and
- means for instructing the computer to display the keywords dynamically.
Type: Application
Filed: Aug 13, 2004
Publication Date: Apr 14, 2005
Inventors: Kazuhiko Abe (Yokohama-shi), Akinori Kawamura (Kunitachi-shi), Yasuyuki Masai (Yokohama-shi), Makoto Yajima (Tachikawa-shi), Kohei Momosaki (Kawasaki-shi), Munehiko Sasajima (Yokohama-shi), Koichi Yamamoto (Kawasaki-shi)
Application Number: 10/917,344