System and method for processing speech to identify keywords or other information
A system and method are provided for performing speech processing. A system includes an audio detection system configured to receive a signal including speech and a memory having stored therein a database of keyword models forming an ensemble of filters associated with each keyword in the database. A processor is configured to receive the signal including speech from the audio detection system, decompose the signal including speech into a sparse set of phonetic impulses, and access the database of keywords and convolve the sparse set of phonetic impulses with the ensemble of filters. The processor is further configured to identify keywords within the signal including speech based on a result of the convolution and control operation of the electronic system based on the keywords identified.
This application is a continuation of U.S. patent application Ser. No. 13/926,659 filed Jun. 25, 2013.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
This invention was made with government support under H98230-07-C-0365 awarded by the Department of Defense. The government has certain rights in the invention.
BACKGROUND OF THE INVENTION
The present invention relates generally to receiving and processing speech signals, and more specifically, to techniques for identifying keywords in speech, for example, for use in command and control of an electronic device.
The presence of electronic devices in personal and professional settings has increased to the extent that manual operation may no longer be sufficient or suitable to take advantage of their full capabilities. Configurations of electronic devices that include voice-activated, command and control abilities offer practical and convenient methods for operation, in addition to direct physical input. However, voice operation requires rapid and accurate processing of speech signals, using reliable models that address user variability, distortions, and detection errors.
Traditionally, isolated word recognition systems were constructed using models for entire words. Although this approach is practical for a limited vocabulary, demand for larger vocabularies necessitated the use of sub-word units to enable the sharing of training examples across contexts and permit the modeling of out-of-vocabulary words. Current state-of-the-art spoken term detection systems typically employ large vocabulary databases and rely on automatic speech recognition (ASR) methods that use lattice searching. Although these systems have demonstrated good recognition performance, generating comprehensive sub-word lattices and spotting keywords in a large lattice is slow and incurs high computational overhead.
Thus far, systems with limited processing and storage capacities, such as portable electronic devices, have had to rely upon network connectivity, remote data storage, and powerful processing servers to perform computationally-intensive tasks and access larger vocabulary databases.
Consequently, there exists a need for a system and method in the art that can achieve fast and reliable recognition of audio input without the explicit need for remote servers, communications with a network, and any other remote data storage and processing units. Also, there exists a need for a system and method in the art that can be implemented on an electronic device with limited processing resources. Furthermore, there exists a need for a system and method in the art that can construct and implement efficient keyword search models in the absence of substantial amounts of training data. In addition to the above, there exists a need for a system and method in the art that can handle out-of-vocabulary keywords.
SUMMARY OF THE INVENTION
The present invention overcomes the above and other drawbacks by providing a system and method for speech signal processing that is computationally rapid, efficient, and does not require external resources for processing and execution support.
In accordance with one aspect of the invention, an electronic system configured to perform speech processing is disclosed that includes an audio detection system configured to receive a signal including speech and a memory having stored therein a database of keyword models forming an ensemble of filters associated with each keyword in the database. The system also includes a processor configured to receive the signal including speech from the audio detection system, decompose the signal including speech into a sparse set of phonetic impulses, and access the database of keywords and convolve the sparse set of phonetic impulses with the ensemble of filters. The processor is also configured to identify keywords within the speech signal using the convolved impulses and filters and control operation of the electronic system based on the keywords identified.
In accordance with another aspect of the invention, a method for performing speech processing is disclosed that includes the steps of a) receiving a signal including speech, b) accessing a database of keyword models forming an ensemble of filters associated with each keyword in the database, and c) decomposing, with an electronic system, the signal including speech into a sparse set of phonetic impulses. The method also includes the steps of d) accessing, with an electronic system, the database of keywords and convolving the sparse set of phonetic impulses with the ensemble of filters and e) identifying, with an electronic system, keywords within the signal including speech based on step d). The method also includes f) controlling operation of the electronic system based on the keywords identified in step e).
In accordance with yet another aspect of the invention, an electronic system configured to perform speech processing is disclosed that includes an audio detection system configured to receive a signal and generate a representation of speech as a sparse set of phonetic events in time. The system also includes a memory having stored therein a database of keyword models forming an ensemble of filters based upon Poisson rate parameters describing an evolution of phonetic events associated with each keyword in the database of keywords. The system further includes a processor configured to i) receive the representation of speech from the audio detection system, ii) decompose the representation of speech into a sparse set of phonetic impulses, and iii) access the database of keywords and convolve the sparse set of phonetic impulses with the ensemble of filters. The processor is also configured to iv) identify keywords within the speech signal based on iii) and v) control operation of the electronic system based on the keywords identified in iv).
The foregoing and other advantages of the invention will appear from the following description. In the description, reference is made to the accompanying drawings which form a part hereof, and in which there is shown by way of illustration a preferred embodiment of the invention. Such embodiment does not necessarily represent the full scope of the invention, however, and reference is made therefore to the claims and herein for interpreting the scope of the invention.
For example, a speech signal may be processed at a rate of 100 samples, or frames, per second, that is, at a frame rate of 100 Hz. Computations on the signal are also performed at this rate. In this example, it would be reasonable to assume approximately 175 frames and 40 phones, which would require 175 × 40 = 7000 floating-point values to describe an utterance of 40 phones over 175 frames. As will be further described, the present invention substantially improves the speed and efficiency of the process of analyzing speech.
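The storage arithmetic in this example can be checked with a short sketch. The variable names are illustrative and do not appear in the original text; the utterance duration is simply what 175 frames at 100 Hz implies.

```python
# Size of a dense phone-by-frame representation for the example above.
# All names here are illustrative assumptions.
FRAME_RATE_HZ = 100   # frames per second, from the example
NUM_FRAMES = 175      # frames in the example utterance
NUM_PHONES = 40       # phones in the example

# One floating-point value per phone per frame.
dense_values = NUM_FRAMES * NUM_PHONES

# Duration implied by the frame count and the frame rate.
duration_s = NUM_FRAMES / FRAME_RATE_HZ

print(dense_values)  # 7000
print(duration_s)    # 1.75
```

A dense representation therefore grows with the product of frames and phones, regardless of how few phonetic events the utterance actually contains.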
Information regarding reference keyword data may be reflected by models for phonetic events, which may be described using an inhomogeneous Poisson counting process and calculated using a maximum likelihood estimation (MLE) method.
The detection function 806 is a log likelihood ratio evaluated at a time 814, t, in an utterance, in which the numerator corresponds to the keyword-specific model parameters and the denominator corresponds to the background model parameters. When large values 808 are detected above a threshold 810, it is likely that the keyword has occurred.
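A log likelihood ratio of this kind can be sketched in general form as follows. The symbols are notation assumed here for illustration only: $d_w(t)$ is the detection function for keyword $w$, $O_t$ is the set of phonetic events observed in a window beginning at time $t$, $\theta_w$ denotes the keyword-specific model parameters, and $\theta_{bg}$ denotes the background model parameters.

```latex
d_w(t) = \log \frac{P(O_t \mid \theta_w)}{P(O_t \mid \theta_{bg})}
```

Under this form, values of $d_w(t)$ above the threshold indicate that the observed events are better explained by the keyword model than by the background model.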
The calculation of the detection function 806 at time t 814 may be approximated by performing summation of the products between the counts of phonetic events, n, and a matrix of score elements, φ, belonging to each keyword phone and time division.
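The Poisson rate models and the summation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the normalization of event times to the interval [0, 1), the smoothing constant, and the single background rate per phone are all assumptions. For a piecewise-constant Poisson rate, the maximum likelihood estimate in each time division is the observed event count divided by the total exposure, and the score element is taken here as the log of the keyword rate over the background rate. The term of the Poisson log likelihood ratio that does not depend on the counts is omitted.

```python
import math

def estimate_rates(examples, num_phones, num_divisions, smoothing=1e-3):
    """MLE of piecewise-constant Poisson rates from keyword examples.

    Each example is a list of (phone_index, position) pairs, with position
    normalized to [0, 1) within the keyword (an assumption). `smoothing`
    is an assumed constant that keeps every rate strictly positive.
    """
    counts = [[0.0] * num_divisions for _ in range(num_phones)]
    for events in examples:
        for phone, pos in events:
            d = min(int(pos * num_divisions), num_divisions - 1)
            counts[phone][d] += 1.0
    width = 1.0 / num_divisions          # duration of one division
    exposure = len(examples) * width     # total time observed per cell
    return [[(c + smoothing) / exposure for c in row] for row in counts]

def score_matrix(keyword_rates, background_rates):
    """phi[p][d] = log(keyword rate / background rate) for phone p, division d."""
    return [[math.log(k / b) for k in row]
            for row, b in zip(keyword_rates, background_rates)]

def detection_score(event_counts, phi):
    """Approximate detection function: sum over p, d of n[p][d] * phi[p][d]."""
    return sum(n * s
               for n_row, phi_row in zip(event_counts, phi)
               for n, s in zip(n_row, phi_row))

# Two training examples: phone 0 occurs early, phone 1 occurs late.
examples = [[(0, 0.1), (1, 0.7)], [(0, 0.2), (1, 0.8)]]
rates = estimate_rates(examples, num_phones=2, num_divisions=2)
phi = score_matrix(rates, background_rates=[1.0, 1.0])

match = detection_score([[1, 0], [0, 1]], phi)     # events in the expected order
mismatch = detection_score([[0, 1], [1, 0]], phi)  # events in the reverse order
print(match > 0 > mismatch)  # True
```

Because the score is a sum over counts, cells with no events contribute nothing, so the cost of evaluating it scales with the number of events rather than with the size of the score matrix.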
The system disclosed herein is configured to detect and retrieve speech signals from a file medium or an audio detection system, upon demand and continuously. The system includes a processor capable of reducing speech signals to a phonetic posteriorgram and then decomposing the posteriorgram into a sparse set of phonetic impulses. The system is also equipped with a database storage and retrieval capability, wherein an expandable keyword database is located. Based upon the examples in the database, the system computes a keyword model that is formulated and stored as a set of keyword filters, representative of a combined phonetic signature from all available examples. The system allows for modeling keywords when training data is abundant as well as scarce. Once a keyword that is associated with a pre-determined operational instruction is identified in a speech utterance, the system allows for the execution of the instruction.
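The decomposition of a posteriorgram into sparse phonetic impulses can be illustrated with a short sketch. The selection rule used here, thresholded local maxima of each phone's posterior trajectory, is an assumption chosen for simplicity and is not stated in the text above; the function name and the threshold value are likewise hypothetical.

```python
def extract_impulses(posteriorgram, threshold=0.5):
    """Return sparse (frame, phone) events from a dense posteriorgram.

    posteriorgram[t][p] is the posterior of phone p at frame t. An event is
    emitted where a phone's trajectory has a local maximum at or above
    `threshold` (an assumed selection rule).
    """
    events = []
    num_frames = len(posteriorgram)
    num_phones = len(posteriorgram[0]) if num_frames else 0
    for p in range(num_phones):
        for t in range(num_frames):
            value = posteriorgram[t][p]
            prev_v = posteriorgram[t - 1][p] if t > 0 else float("-inf")
            next_v = posteriorgram[t + 1][p] if t + 1 < num_frames else float("-inf")
            # Strict rise on the left, non-strict on the right, so a flat
            # peak yields a single event at its first frame.
            if value >= threshold and value > prev_v and value >= next_v:
                events.append((t, p))
    return sorted(events)

# Five frames, two phones: ten dense values reduce to three events.
posteriorgram = [
    [0.10, 0.80],
    [0.70, 0.20],
    [0.90, 0.05],
    [0.60, 0.10],
    [0.20, 0.70],
]
events = extract_impulses(posteriorgram)
print(events)  # [(0, 1), (2, 0), (4, 1)]
```

Only the event list needs to be passed to later stages, which is what makes the subsequent keyword search inexpensive.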
The method for keyword spotting in speech described here applies a convolution between a series of phonetic impulses of speech and a set of keyword model filters representing examples in the database. The method computes a score differential at the time location of every impulse in the sparse speech signal. The value of the score differential indicates the likelihood of a keyword match in the speech, and a high value is reported as a high-probability match.
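The convolution of sparse impulses with keyword filters can be sketched as follows. This is an illustrative sketch under assumptions, not the claimed implementation: the function names, the representation of a filter as one list of per-frame score values per phone, and the example filter values are all hypothetical. Each impulse adds its filter values to the scores of the keyword start times it is consistent with, so work is done only where events occur.

```python
def keyword_scores(events, filters, num_frames):
    """Accumulate keyword scores by convolving sparse events with filters.

    events:  list of (frame, phone) impulses.
    filters: dict mapping phone -> list of score values; filters[p][k] is the
             score contributed by an event of phone p occurring k frames
             after a hypothesized keyword start (assumed layout).
    Returns a list in which scores[t] is the score for a keyword starting at t.
    """
    scores = [0.0] * num_frames
    for frame, phone in events:
        taps = filters.get(phone)
        if taps is None:
            continue  # this phone carries no evidence for the keyword
        for k, weight in enumerate(taps):
            start = frame - k
            if 0 <= start < num_frames:
                scores[start] += weight
    return scores

def detect(scores, threshold):
    """Return the start frames whose score exceeds the threshold."""
    return [t for t, s in enumerate(scores) if s > threshold]

# Hypothetical 4-frame keyword: phone 0 expected early, phone 1 expected late.
filters = {
    0: [2.0, 1.0, -1.0, -1.0],
    1: [-1.0, -1.0, 1.0, 2.0],
}
events = [(5, 0), (8, 1), (12, 1)]
scores = keyword_scores(events, filters, num_frames=16)
print(detect(scores, threshold=3.0))  # [5]
```

In this sketch the cost is proportional to the number of events times the filter length, rather than to the number of frames times the number of phones.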
The event-based searching approach provides enhanced keyword spotting performance compared to traditional methods, due to the sparsity of the events in time that must be processed. Processing speeds can approach 500,000 times real time, allowing hundreds of thousands of keywords to be identified in parallel. Because high computational overhead is not required, this system and method are well suited to electronic devices with limited resources and connectivity capabilities, including consumer electronics such as cameras and other portable devices.
The present invention has been described in terms of one or more preferred embodiments, and it should be appreciated that many equivalents, alternatives, variations, and modifications, aside from those expressly stated, are possible and within the scope of the invention.
Claims
1. An electronic system configured to perform speech processing comprising:
- a memory having stored therein a database of keyword models, wherein the database of keyword models includes an ensemble of filters describing an evolution of phonetic events associated with each keyword in the database;
- a processor configured to: i) receive a signal including speech; ii) decompose the signal including speech into a sparse set of phonetic impulses; iii) access the database of keyword models; iv) identify keywords within the signal including speech using the sparse set of phonetic impulses and the database of keywords; and v) generate an output indicating the keywords identified in iv).
2. The system of claim 1 wherein the ensemble of filters is based upon Poisson rate parameters describing the evolution of phonetic events associated with each keyword in the database.
3. The system of claim 1 wherein the processor is further configured to generate a representation of speech as a sparse set of temporal phonetic events in response to the signal including speech in ii).
4. The system of claim 1 wherein the processor is further configured to compute a score for the phonetic impulses at each of a plurality of frames in time.
5. The system of claim 4 wherein the score relates a probability that a word occurred at a particular point in time associated with a given frame in the plurality of frames in time.
6. The system of claim 1 wherein the database of keyword models includes an ensemble of filters associated with each keyword in the database and the processor is configured to convolve the sparse set of phonetic impulses with the ensemble of filters and calculate a score vector based on the convolving of the sparse set of phonetic impulses with the ensemble of filters.
7. The system of claim 6 wherein the processor is further configured to identify keywords within the signal including speech based on the score vectors.
8. The system of claim 6 wherein the processor is further configured to track changes in the score vector to identify the keywords.
9. A method for performing speech processing comprising steps of:
- a) receiving a signal including speech;
- b) accessing a database of keyword models, wherein the database of keyword models includes an ensemble of filters describing an evolution of phonetic events associated with each keyword in the database;
- c) decomposing, with an electronic system, the signal including speech into a sparse set of impulses;
- d) accessing, with an electronic system, the database of keywords and processing the sparse set of impulses with information from the database of keyword models;
- e) identifying, with an electronic system, keywords within the signal including speech based on step d); and
- f) controlling operation of the electronic system based on the keywords identified in step e).
10. The method of claim 9 further comprising creating the ensemble of filters based upon Poisson rate parameters describing the evolution of phonetic events associated with each keyword in the database to form the database of keyword models.
11. The method of claim 9 wherein step c) includes generating a representation of speech as a sparse set of temporal phonetic events.
12. The method of claim 9 wherein step d) includes computing a score for the impulses at each of a plurality of frames in time.
13. The method of claim 12 wherein the score relates a probability that a word occurred at a particular point in time associated with a given frame in the plurality of frames in time.
14. The method of claim 9 further comprising convolving the sparse set of impulses with an ensemble of filters available in the database of keyword models and calculating a score vector based on the convolving of the sparse set of impulses with the ensemble of filters.
15. The method of claim 14 wherein step e) includes identifying keywords within the signal including speech based on the score vectors.
16. The method of claim 14 further comprising tracking changes in the score vector to identify the keywords.
17. An electronic system configured to perform speech processing comprising:
- an audio detection system configured to receive a signal and generate a representation of speech as a sparse set of events in time;
- a memory having stored therein a database of keyword models describing an evolution of events associated with each keyword in the database of keywords;
- a processor configured to: i) receive the representation of speech from the audio detection system; ii) decompose the representation of speech into a sparse set of phonetic impulses; iii) access the database of keywords and convolve the sparse set of phonetic impulses with information in the database of keyword models; iv) identify keywords within the speech signal based on iii); and v) generate an output signal based on the keywords identified in iv).
18. The system of claim 17 wherein the processor is further configured to compute a score for the sparse set of phonetic impulses at each of a plurality of frames in time that relates a probability that a word occurred at a particular point in time associated with a given frame in the plurality of frames in time.
19. The system of claim 17 wherein the processor is configured to calculate a score vector based on the convolving of the sparse set of phonetic impulses with an ensemble of filters.
20. The system of claim 19 wherein the processor is further configured to track changes in the score vector to identify the keywords.
20070038447 | February 15, 2007 | Kaneko |
20110144995 | June 16, 2011 | Bangalore et al. |
20110218805 | September 8, 2011 | Washio |
2189976 | May 2010 | EP |
- Jansen et al (2009) Point Process Models for Spotting Keywords in Continuous Speech. IEEE Transactions on Audio, Speech, and Language Processing, 17(8):1457-1470.
- Kintzley et al (2011) Event Selection from Phone Posteriorgrams Using Matched Filters. Interspeech, pp. 1905-1908.
- Kintzley et al (2012) Inverting the Point Process Model for Fast Phonetic Keyword Search. Interspeech, pp. 2438-2441.
- Kintzley et al (2012) MAP Estimation of Whole-Word Acoustic Models with Dictionary Priors. Interspeech, pp. 787-790.
Type: Grant
Filed: Aug 31, 2015
Date of Patent: Oct 24, 2017
Patent Publication Number: 20150371635
Assignee: The Johns Hopkins University (Baltimore, MD)
Inventors: Keith Kintzley (Severna Park, MD), Aren Jansen (Ellicott City, MD), Hynek Hermansky (Baltimore, MD), Kenneth Church (Croton-on-Hudson, NY)
Primary Examiner: Daniel Abebe
Application Number: 14/840,089
International Classification: G10L 15/00 (20130101); G10L 15/22 (20060101); G10L 15/14 (20060101); G10L 15/08 (20060101);