Method and apparatus for localization of an acoustic source

- PictureTel Corporation

An acoustic signal processing method and system using a pair of spatially separated microphones to obtain the direction or location of speech or other acoustic signals from a common sound source is disclosed. The invention includes a method and apparatus for processing the acoustic signals by determining whether signals acquired during a particular time frame represent the onset or beginning of a sequence of acoustic signals from the sound source, identifying acoustic received signals representative of the sequence of signals, and determining the direction of the source based upon the acoustic received signals. The invention has applications to videoconferencing where it may be desirable to automatically adjust a video camera, such as by aiming the camera in the direction of a person who has begun to speak.


Claims

1. A method of processing a sequence of acoustic signals arriving from an acoustic source comprising the steps of:

acquiring respective streams of acoustic data at a plurality of locations during a plurality of time frames;
determining whether the acoustic data acquired at any one of the locations during a particular time frame represents the beginning of the sequence of acoustic signals;
identifying acoustic received signals at at least two of said locations representative of said sequence of signals if said acoustic data at any one of said locations represents the beginning of the sequence; and
determining a direction of said source based upon the identified acoustic received signals.

2. The method of claim 1 wherein the step of determining whether the signals acquired during a particular time frame represent the beginning of the sequence of acoustic signals comprises the step of examining the magnitude of a plurality of frequency components of signals acquired during the particular time frame.

3. The method of claim 2 wherein the step of examining comprises the steps of:

determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and
determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.
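
For illustration only, the two per-frequency tests of claim 3 can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `is_onset`, the additive margins, the default values, and the rule that a minimum number of frequency bins must pass both tests are all hypothetical choices that the claim does not specify.

```python
from typing import Sequence


def is_onset(
    frame_mag: Sequence[float],
    noise_mag: Sequence[float],
    prev_frames_mag: Sequence[Sequence[float]],
    noise_margin: float = 2.0,    # hypothetical "first predetermined amount"
    history_margin: float = 1.0,  # hypothetical "second predetermined amount"
    min_bins: int = 2,            # hypothetical: bins that must pass both tests
) -> bool:
    """Return True if enough frequency bins pass both onset tests."""
    passing = 0
    for k, mag in enumerate(frame_mag):
        # Test 1: magnitude exceeds the background noise by the first margin.
        above_noise = mag - noise_mag[k] >= noise_margin
        # Test 2: magnitude exceeds every preceding frame by the second margin.
        above_history = all(
            mag - prev[k] >= history_margin for prev in prev_frames_mag
        )
        if above_noise and above_history:
            passing += 1
    return passing >= min_bins
```

A frame whose spectrum merely repeats the preceding frames fails the second test, so sustained sound is not reported as a new onset.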

4. The method of claim 3 wherein the step of identifying comprises the steps of:

identifying signals representative of cross-correlations between signals acquired at the plurality of locations during the particular time frame; and
subtracting a corresponding background noise from each of the signals representative of the cross-correlations.
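
As a rough illustration of the two steps of claim 4, the sketch below assumes that the "signals representative of cross-correlations" are held in the frequency domain as a cross-spectrum, and that a background-noise cross-spectrum estimate (`noise_cross`) is already available. Both points are assumptions made for the example, and all function names are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Plain O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def cross_spectrum(a: Sequence[float], b: Sequence[float]) -> List[complex]:
    """Per-frequency product A[k] * conj(B[k]) for one time frame."""
    return [ak * bk.conjugate() for ak, bk in zip(dft(a), dft(b))]


def noise_subtracted_cross_spectrum(
    a: Sequence[float],
    b: Sequence[float],
    noise_cross: Sequence[complex],
) -> List[complex]:
    """Subtract the background-noise estimate from each cross-spectrum bin."""
    return [c - n for c, n in zip(cross_spectrum(a, b), noise_cross)]
```

The phase of each remaining bin carries the inter-microphone delay at that frequency, which is the quantity the later claims extract.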

5. The method of claim 3 wherein the step of identifying comprises determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a first predetermined number of times greater than the background noise energy for that frequency.

6. The method of claim 3 wherein the step of identifying comprises determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a second predetermined number of times greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames.

7. The method of claim 4 wherein the step of determining the direction of the source comprises the step of extracting from the acoustic acquired signals a time delay indicative of the difference in arrival times of the sequence of acoustic signals at the plurality of microphone locations.
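
The relationship between a time delay and a direction, as recited in claim 7, can be illustrated with the following sketch. It assumes a far-field source, two microphones a known distance apart, a speed of sound of about 343 m/s, and a simple time-domain correlation peak search; none of these details comes from the claim, and the function names are hypothetical.

```python
import math
from typing import Sequence


def estimate_delay_samples(
    a: Sequence[float], b: Sequence[float], max_lag: int
) -> int:
    """Lag in samples that best aligns b with a; positive means b lags a."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(a)):
            u = t + lag
            if 0 <= u < len(b):
                score += a[t] * b[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(
    delay_s: float, mic_spacing_m: float, speed_of_sound: float = 343.0
) -> float:
    """Far-field bearing in degrees, measured from broadside to the mic pair."""
    s = speed_of_sound * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against noise-induced overshoot
    return math.degrees(math.asin(s))
```

A zero delay corresponds to a source directly in front of the pair, and the largest physically possible delay (spacing divided by the speed of sound) corresponds to a source on the line through the two microphones.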

8. The method of claim 7 further comprising the steps of extracting from the acoustic received signals a plurality of potential time delays each of which falls within one of a plurality of ranges of values, and selecting an actual time delay based upon the number of potential time delays falling within each range and a relative weight assigned to each range.

9. The method of claim 8 wherein ranges of potential time delays having relatively small values are assigned higher relative weights than ranges of potential time delays having larger values.
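
The selection rule of claims 8 and 9 can be pictured as a weighted histogram vote. The sketch below is one possible reading, not the patented procedure: scoring each range by count times weight, breaking ties in favor of the earlier range, and returning the mean of the winning range are all assumptions, as are the name `select_delay` and the example values.

```python
from typing import List, Optional, Sequence, Tuple


def select_delay(
    candidates: Sequence[float],
    ranges: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> Optional[float]:
    """Pick the range with the highest (count * weight) and return its mean."""
    best_score = 0.0
    best_members: List[float] = []
    for (low, high), weight in zip(ranges, weights):
        members = [d for d in candidates if low <= d < high]
        score = len(members) * weight
        if score > best_score:  # strict: ties keep the earlier range
            best_score, best_members = score, members
    if not best_members:
        return None
    return sum(best_members) / len(best_members)


# Hypothetical delays in milliseconds; smaller delays get larger weights.
ranges = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
weights = [2.0, 1.0, 0.5]
chosen = select_delay([0.2, 0.4, 1.1, 1.3, 1.5, 2.5], ranges, weights)
```

In this example the middle range holds the most candidates, but the first range wins because its weight is higher, which is the behavior claim 9 describes.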

10. A method of processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising the steps of:

acquiring a stream of acoustic data during a plurality of time frames;
determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;
identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;
determining the direction of said source based upon the acoustic received signals; and
aiming a video camera automatically in response to the step of determining the direction of said source.

11. The method of claim 10 wherein the step of determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals comprises the steps of:

determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and
determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

12. The method of claim 11 wherein the step of aiming comprises the step of panning the camera in the direction of said source.

13. The method of claim 11 wherein the step of aiming comprises the step of tilting the camera in the direction of said source.

14. The method of claim 11 further comprising the step of determining the location of the source based upon the acoustic received signals.

15. The method of claim 14 further comprising the step of automatically zooming a lens of the camera so as to frame the source in response to the step of determining the location of the source.

16. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source comprising:

a plurality of transducers for acquiring a stream of acoustic data during a plurality of time frames;
means for determining whether the acoustic data acquired at any one of the transducers during a particular time frame represents the beginning of the sequence of acoustic signals;
means for identifying acoustic received signals at at least two of said locations representative of said sequence of signals if said acoustic data at any one of said locations represents the beginning of the sequence; and
means for determining a direction of said source based upon the identified acoustic received signals.

17. The apparatus of claim 16 wherein the means for determining whether the acoustic data acquired represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;
a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and
a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

18. The apparatus of claim 17 wherein the means for identifying comprises means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a first predetermined number of times greater than the background noise energy for that frequency.

19. The apparatus of claim 17 wherein the means for identifying comprises means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a second predetermined number of times greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames.

20. The apparatus of claim 17 wherein the means for identifying comprises:

means for identifying signals representative of cross-correlations between signals acquired at the plurality of transducers during the particular time frame; and
a differencer for subtracting a corresponding background noise from each of the signals representative of the cross-correlations.

21. The apparatus of claim 20 wherein the means for determining the direction of the source comprises means for extracting from the acoustic acquired signals a time delay indicative of the difference in arrival times of the sequence of acoustic signals at the plurality of microphone locations.

22. The apparatus of claim 21 further comprising means for extracting from the acoustic received signals a plurality of potential time delays each of which falls within one of a plurality of ranges of values, and means for selecting an actual time delay based upon the number of potential time delays falling within each range and a relative weight assigned to each range.

23. The apparatus of claim 22 wherein ranges of potential time delays having relatively small values are assigned higher relative weights than ranges of potential time delays having larger values.

24. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising:

a video camera;
a plurality of transducers for acquiring a stream of acoustic data during a plurality of time frames;
means for determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;
means for identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence; and
means for determining the direction of said source based upon the acoustic received signals;
wherein the video camera is automatically aimed in the direction of the source in response to signals received from the means for determining the direction of said source.

25. The apparatus of claim 24 wherein the means for determining whether the acoustic data acquired represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;
a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and
a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

26. The apparatus of claim 25 wherein the camera is automatically panned in the direction of the source in response to the signals received from the means for determining the direction of said source.

27. The apparatus of claim 25 wherein the camera is automatically tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.

28. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising:

a video camera;
at least a first, second and third transducer for acquiring a stream of acoustic data during a plurality of time frames;
means for determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;
means for identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence; and
means for determining the direction of said source in two-dimensions based upon the acoustic received signals;
wherein the video camera is automatically aimed in the direction of the source in response to signals received from the means for determining the direction of said source.

29. The apparatus of claim 28 wherein the means for determining whether the acoustic data acquired represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;
a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and
a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

30. The apparatus of claim 28 wherein

the first transducer is displaced from the second transducer along a first axis and the third transducer is displaced from the second transducer along a second axis perpendicular to the first axis;
acoustic data acquired by the first and second transducers is used to determine the direction of the source with respect to the first axis, and acoustic data acquired by the second and third transducers is used to determine the direction of the source with respect to the second axis; and
the video camera is automatically panned and tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.
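
The two-axis arrangement of claim 30 can be illustrated with the sketch below, which converts one delay per axis into pan and tilt angles. It assumes a far-field source in front of the array, a horizontal first axis, a vertical second axis, and a speed of sound of about 343 m/s. The function name, the sign conventions, and the rescaling of physically impossible delay pairs are hypothetical.

```python
import math
from typing import Tuple

SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at room temperature


def pan_tilt_from_delays(
    delay_axis1_s: float,
    delay_axis2_s: float,
    spacing_axis1_m: float,
    spacing_axis2_m: float,
) -> Tuple[float, float]:
    """Return (pan, tilt) in degrees from the delays along two perpendicular axes."""
    # Direction cosines of the source along each microphone axis.
    u = SPEED_OF_SOUND_M_S * delay_axis1_s / spacing_axis1_m
    v = SPEED_OF_SOUND_M_S * delay_axis2_s / spacing_axis2_m
    # Noisy delays can give u^2 + v^2 > 1; rescale onto the unit circle.
    norm = math.hypot(u, v)
    if norm > 1.0:
        u, v = u / norm, v / norm
    # Remaining component points forward, away from the array plane.
    w = math.sqrt(max(0.0, 1.0 - u * u - v * v))
    pan = math.degrees(math.atan2(u, w))
    tilt = math.degrees(math.asin(max(-1.0, min(1.0, v))))
    return pan, tilt
```

With both delays at zero the source lies straight ahead, so the camera would need neither pan nor tilt.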

31. The apparatus of claim 28 further comprising a fourth transducer for acquiring a stream of acoustic data during the plurality of time frames, wherein

the first, second and third transducers are located along a first axis, with the second transducer located between the first and third transducers, and the fourth transducer is displaced from the second transducer along a second axis perpendicular to the first axis;
acoustic data acquired by the first and third transducers is used to determine the direction of the source with respect to the first axis, and acoustic data acquired by the second and fourth transducers is used to determine the direction of the source with respect to the second axis; and
the video camera is automatically panned and tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.

32. The apparatus of claim 31 further comprising:

means for determining the position of the source, wherein acoustic data acquired by three of the four transducers is used to determine the position of the source; and
wherein the camera automatically is zoomed so as to frame the source in response to signals received from the means for determining the position of the source.

33. A method of processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising the steps of:

acquiring a stream of acoustic data during a plurality of time frames at first, second and third transducers;
determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;
identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;
determining the direction of said source in two-dimensions based upon the acoustic received signals; and
aiming a video camera automatically in the direction of the source in response to the step of determining the direction of said source.

34. The method of claim 33 further comprising the steps of:

acquiring a stream of acoustic data during the plurality of time frames at a fourth transducer;
determining the position of the source in a third dimension, wherein acoustic data acquired by three of the four transducers is used to determine the position of the source; and
zooming the video camera automatically so as to frame the source in response to the step of determining the position of the source.

35. A method of operating a video camera for use in videoconferencing comprising the steps of:

displaying images corresponding to video data acquired by the video camera;
acquiring a stream of acoustic data, including a sequence of acoustic signals from a source, during a plurality of time frames;
determining the direction of said source based upon the acquired stream of acoustic data;
aiming the video camera automatically in response to the step of determining by tilting or panning the video camera; and
freezing the image displayed on a display during the period when the video camera is tilting or panning.

36. The method of claim 35 wherein the step of freezing comprises the step of freezing the image appearing on the display at a video frame occurring just prior to panning or tilting the video camera.
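
The display behavior of claims 35 and 36 can be sketched as a small generator that holds the last frame captured before the camera began to move. The stream format (pairs of a frame and a "camera is moving" flag) and the name `displayed_frames` are assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

F = TypeVar("F")


def displayed_frames(stream: Iterable[Tuple[F, bool]]) -> Iterator[F]:
    """Yield the frame to show for each (frame, camera_moving) input pair."""
    last_still: Optional[F] = None
    for frame, camera_moving in stream:
        if not camera_moving:
            # Camera is at rest: show live video and remember this frame.
            last_still = frame
            yield frame
        elif last_still is not None:
            # Camera is panning or tilting: repeat the last still frame.
            yield last_still
        else:
            # No still frame seen yet, so fall back to the live frame.
            yield frame
```

For the input frames f0, f1 (at rest), f2, f3 (moving), f4 (at rest), the viewer sees f0, f1, f1, f1, f4.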

37. A method of operating a system comprising at least first and second video cameras for use in videoconferencing, the method comprising:

displaying on a display images corresponding to video data acquired by the first video camera;
acquiring a stream of acoustic data, including a sequence of acoustic signals from a source, during a plurality of time frames;
determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;
identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;
determining the direction of said source based upon the acoustic received signals;
aiming the first video camera automatically in response to determining the direction by tilting or panning the video camera; and
displaying on the display images corresponding to video data acquired by the second video camera during the period when the first video camera is tilting or panning.

References Cited
U.S. Patent Documents
4581758 April 8, 1986 Coker et al.
4741038 April 26, 1988 Elko et al.
4965819 October 23, 1990 Kannes
4980761 December 25, 1990 Natori
5058419 October 22, 1991 Nordstrom et al.
5206721 April 27, 1993 Ashida et al.
5335011 August 2, 1994 Addeo et al.
5465302 November 7, 1995 Lazzari et al.
5550924 August 27, 1996 Helf et al.
Foreign Patent Documents
0 594 098 A1 April 1994 EPX
4-109784 April 1992 JPX
Other references
  • M. Omologo and P. Svaizer, "Acoustic Event Localization Using A Crosspower-Spectrum Phase Based Technique," Proceedings of the 1994 International Conference on Acoustics, Speech, and Signal Processing, Apr. 1994, Adelaide, South Australia, pp. II-273 to II-276.
Patent History
Patent number: 5778082
Type: Grant
Filed: Jun 14, 1996
Date of Patent: Jul 7, 1998
Assignee: PictureTel Corporation (Andover, MA)
Inventors: Peter L. Chu (Lexington, MA), Hong Wang (Westford, MA)
Primary Examiner: Forester W. Isen
Law Firm: Fish & Richardson P.C.
Application Number: 8/663,670
Classifications
Current U.S. Class: Directive Circuits For Microphones (381/92); 348/15
International Classification: H04N 7/15