Method and apparatus for localization of an acoustic source

Info

Patent number: 5778082
Type: Grant
Filed: Jun 14, 1996
Date of Patent: Jul 7, 1998
Assignee: PictureTel Corporation (Andover, MA)
Inventors: Peter L. Chu (Lexington, MA), Hong Wang (Westford, MA)
Primary Examiner: Forester W. Isen
Law Firm: Fish & Richardson P.C.
Application Number: 8/663,670

Abstract

An acoustic signal processing method and system using a pair of spatially separated microphones to obtain the direction or location of speech or other acoustic signals from a common sound source is disclosed. The invention includes a method and apparatus for processing the acoustic signals by determining whether signals acquired during a particular time frame represent the onset or beginning of a sequence of acoustic signals from the sound source, identifying acoustic received signals representative of the sequence of signals, and determining the direction of the source based upon the acoustic received signals. The invention has applications to videoconferencing where it may be desirable to automatically adjust a video camera, such as by aiming the camera in the direction of a person who has begun to speak.

Claims

1. A method of processing a sequence of acoustic signals arriving from an acoustic source comprising the steps of:

acquiring respective streams of acoustic data at a plurality of locations during a plurality of time frames;

determining whether the acoustic data acquired at any one of the locations during a particular time frame represents the beginning of the sequence of acoustic signals;

identifying acoustic received signals at at least two of said locations representative of said sequence of signals if said acoustic data at any one of said locations represents the beginning of the sequence; and

determining a direction of said source based upon the identified acoustic received signals.

2. The method of claim 1 wherein the step of determining whether the signals acquired during a particular time frame represent the beginning of the sequence of acoustic signals comprises the step of examining the magnitude of a plurality of frequency components of signals acquired during the particular time frame.

3. The method of claim 2 wherein the step of examining comprises the steps of:

determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and

determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.
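The two per-frequency tests of claim 3 can be illustrated with a minimal sketch. It is not the patented implementation. The function name `onset_mask`, the use of an FFT magnitude spectrum, and the reading of "by at least a predetermined amount" as an additive margin are all assumptions made for illustration.

```python
import numpy as np

def onset_mask(frame, noise_energy, prev_mags, first_amount, second_amount):
    """Return a boolean mask of frequency bins that pass both onset tests.

    frame         : 1-D array of time samples for the current frame
    noise_energy  : per-bin background noise estimate, shape (n_bins,)
    prev_mags     : per-bin magnitudes of preceding frames, shape (n_prev, n_bins)
    first_amount  : assumed additive margin over the noise estimate
    second_amount : assumed additive margin over every preceding frame
    """
    mag = np.abs(np.fft.rfft(frame))
    # Test 1: the bin exceeds the background noise estimate by the first margin.
    above_noise = mag > noise_energy + first_amount
    # Test 2: the bin exceeds the same bin in all preceding frames by the second margin.
    above_prev = np.all(mag > prev_mags + second_amount, axis=0)
    return above_noise & above_prev

# Hypothetical usage: a pure tone appearing after three silent frames.
n = np.arange(64)
tone = np.sin(2 * np.pi * 8 * n / 64)
mask = onset_mask(tone, np.zeros(33), np.zeros((3, 33)), 1.0, 1.0)
```

In this sketch a frame would be treated as a possible onset when enough bins in the mask are set; how many bins is a design parameter that the claim does not fix.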

4. The method of claim 3 wherein the step of identifying comprises the steps of:

identifying signals representative of cross-correlations between signals acquired at the plurality of locations during the particular time frame; and

subtracting a corresponding background noise from each of the signals representative of the cross-correlations.
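One possible reading of claim 4 is sketched below, assuming the cross-correlation is represented in the frequency domain as a cross-spectrum and that a background noise cross-spectrum has been estimated elsewhere. The function name `denoised_cross_spectrum` and the parameter `noise_cross_spectrum` are hypothetical.

```python
import numpy as np

def denoised_cross_spectrum(frame_a, frame_b, noise_cross_spectrum):
    """Cross-spectrum of two microphone frames minus a background noise estimate.

    frame_a, frame_b     : 1-D arrays of equal length from two locations
    noise_cross_spectrum : assumed per-bin estimate of the background
                           cross-spectrum, shape (len(frame) // 2 + 1,)
    """
    spec_a = np.fft.rfft(frame_a)
    spec_b = np.fft.rfft(frame_b)
    # The product with the complex conjugate is the frequency-domain
    # counterpart of the cross-correlation between the two frames.
    cross = spec_a * np.conj(spec_b)
    # Subtract the corresponding background noise term bin by bin.
    return cross - noise_cross_spectrum
```

Subtracting the noise term before the delay is extracted should reduce the influence of stationary sources such as fans, whose contribution is roughly the same in every frame.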

5. The method of claim 3 wherein the step of identifying comprises determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a first predetermined number of times greater than the background noise energy for that frequency.

6. The method of claim 3 wherein the step of identifying comprises determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a second predetermined number of times greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames.

7. The method of claim 4 wherein the step of determining the direction of the source comprises the step of extracting from the acoustic acquired signals a time delay indicative of the difference in arrival times of the sequence of acoustic signals at the plurality of microphone locations.
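As a rough illustration of extracting a time delay between two microphone signals, the sketch below takes the lag at the peak of a time-domain cross-correlation. This is a generic textbook technique and not necessarily the procedure of the patent; the function name `estimate_delay_samples` is hypothetical.

```python
import numpy as np

def estimate_delay_samples(sig_a, sig_b):
    """Return the lag, in samples, by which sig_a trails sig_b.

    A positive result means the sound reached microphone B first.
    """
    corr = np.correlate(sig_a, sig_b, mode="full")
    # In "full" mode, index 0 corresponds to a lag of -(len(sig_b) - 1).
    return int(np.argmax(corr)) - (len(sig_b) - 1)

# Hypothetical usage: the same click arrives 3 samples later at microphone A.
mic_b = np.zeros(64)
mic_b[10] = 1.0
mic_a = np.zeros(64)
mic_a[13] = 1.0
lag = estimate_delay_samples(mic_a, mic_b)
```

Dividing the lag by the sampling rate gives the delay in seconds.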

8. The method of claim 7 further comprising the steps of extracting from the acoustic received signals a plurality of potential time delays each of which falls within one of a plurality of ranges of values, and selecting an actual time delay based upon the number of potential time delays falling within each range and a relative weight assigned to each range.

9. The method of claim 8 wherein ranges of potential time delays having relatively small values are assigned higher relative weights than ranges of potential time delays having larger values.
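The selection step of claims 8 and 9 can be pictured as a weighted histogram vote. The sketch below is one way to do it under stated assumptions: the range edges and weights are supplied by the caller, each range scores its count times its weight, and the "actual" delay is taken as the mean of the candidates in the winning range. None of these specifics come from the claims, and the name `select_delay` is hypothetical.

```python
import numpy as np

def select_delay(candidates, range_edges, weights):
    """Pick a delay from candidate delays using a weighted histogram vote.

    candidates  : 1-D sequence of potential time delays
    range_edges : increasing edges defining len(weights) ranges
    weights     : relative weight of each range (larger for small delays,
                  following claim 9)
    Returns the mean of the candidates in the winning range, or None if
    no candidate falls inside any range.
    """
    candidates = np.asarray(candidates, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Range index of each candidate; -1 or len(weights) means out of range.
    idx = np.searchsorted(range_edges, candidates, side="right") - 1
    valid = (idx >= 0) & (idx < len(weights))
    if not valid.any():
        return None
    counts = np.bincount(idx[valid], minlength=len(weights))
    scores = counts * weights
    best = int(np.argmax(scores))
    return float(np.mean(candidates[valid & (idx == best)]))

# Hypothetical usage: three ranges, with the central (small-delay) range weighted double.
delay = select_delay([0.5, 1.0, 4.0, 4.5, 5.0], [-6, -2, 2, 6], [1.0, 2.0, 1.0])
```

Here the outer range holds three candidates and the central range only two, yet the central range wins (score 4 versus 3) because of its higher weight.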

10. A method of processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising the steps of:

acquiring a stream of acoustic data during a plurality of time frames;

determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;

identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;

determining the direction of said source based upon the acoustic received signals; and

aiming a video camera automatically in response to the step of determining the direction of said source.

11. The method of claim 10 wherein the step of determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals comprises the steps of:

determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and

determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

12. The method of claim 11 wherein the step of aiming comprises the step of panning the camera in the direction of said source.

13. The method of claim 11 wherein the step of aiming comprises the step of tilting the camera in the direction of said source.

14. The method of claim 11 further comprising the step of determining the location of the source based upon the acoustic received signals.

15. The method of claim 14 further comprising the step of automatically zooming a lens of the camera so as to frame the source in response to the step of determining the location of the source.

16. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source comprising:

a plurality of transducers for acquiring a stream of acoustic data during a plurality of time frames;

means for determining whether the acoustic data acquired at any one of the transducers during a particular time frame represents the beginning of the sequence of acoustic signals;

means for identifying acoustic received signals at at least two of said locations representative of said sequence of signals if said acoustic data at any one of said locations represents the beginning of the sequence; and

means for determining a direction of said source based upon the identified acoustic received signals.

17. The apparatus of claim 16 wherein the means for determining whether the acquired acoustic data represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;

a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and

a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

18. The apparatus of claim 17 wherein the means for identifying comprises means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a first predetermined number of times greater than the background noise energy for that frequency.

19. The apparatus of claim 17 wherein the means for identifying comprises means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component of the signals acquired during the particular time frame is at least a second predetermined number of times greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames.

20. The apparatus of claim 17 wherein the means for identifying comprises:

means for identifying signals representative of cross-correlations between signals acquired at the plurality of transducers during the particular time frame; and

a differencer for subtracting a corresponding background noise from each of the signals representative of the cross-correlations.

21. The apparatus of claim 20 wherein the means for determining the direction of the source comprises means for extracting from the acoustic acquired signals a time delay indicative of the difference in arrival times of the sequence of acoustic signals at the plurality of microphone locations.

22. The apparatus of claim 21 further comprising means for extracting from the acoustic received signals a plurality of potential time delays each of which falls within one of a plurality of ranges of values, and means for selecting an actual time delay based upon the number of potential time delays falling within each range and a relative weight assigned to each range.

23. The apparatus of claim 22 wherein ranges of potential time delays having relatively small values are assigned higher relative weights than ranges of potential time delays having larger values.

24. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising:

a video camera;

a plurality of transducers for acquiring a stream of acoustic data during a plurality of time frames;

means for determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;

means for identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence; and

means for determining the direction of said source based upon the acoustic received signals;

wherein the video camera is automatically aimed in the direction of the source in response to signals received from the means for determining the direction of said source.

25. The apparatus of claim 24 wherein the means for determining whether the acquired acoustic data represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;

a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and

a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

26. The apparatus of claim 25 wherein the camera is automatically panned in the direction of the source in response to the signals received from the means for determining the direction of said source.

27. The apparatus of claim 25 wherein the camera is automatically tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.

28. An apparatus for processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising:

a video camera;

at least a first, second and third transducer for acquiring a stream of acoustic data during a plurality of time frames;

means for determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;

means for identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence; and

means for determining the direction of said source in two dimensions based upon the acoustic received signals;

wherein the video camera is automatically aimed in the direction of the source in response to signals received from the means for determining the direction of said source.

29. The apparatus of claim 28 wherein the means for determining whether the acquired acoustic data represent the beginning of the sequence of acoustic signals comprises:

a background noise energy estimator;

a first means for determining, for a plurality of frequencies, whether the magnitude of each such frequency component of signals acquired during the particular time frame is greater than a background noise energy for that frequency by at least a first predetermined amount; and

a second means for determining, for the plurality of frequencies, whether the magnitude of each such frequency component is greater than the magnitude of corresponding frequency components of signals acquired during a pre-specified number of preceding time frames by at least a second predetermined amount.

30. The apparatus of claim 28 wherein

the first transducer is displaced from the second transducer along a first axis and the third transducer is displaced from the second transducer along a second axis perpendicular to the first axis;

acoustic data acquired by the first and second transducers is used to determine the direction of the source with respect to the first axis, and acoustic data acquired by the second and third transducers is used to determine the direction of the source with respect to the second axis; and

the video camera is automatically panned and tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.
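The two-axis arrangement of claim 30 can be illustrated by converting one time delay per axis into an angle. The sketch below assumes a far-field source (plane-wave arrival), a speed of sound of 343 m/s, and that the first axis drives pan while the second drives tilt. These assumptions, and the names `axis_angle` and `pan_tilt`, are not taken from the claims.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def axis_angle(delay_s, spacing_m, c=SPEED_OF_SOUND):
    """Angle (radians) of the source relative to the broadside of one microphone pair."""
    # Far-field model: path difference = spacing * sin(angle).
    ratio = c * delay_s / spacing_m
    # Clamp so measurement noise cannot push the ratio outside asin's domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)

def pan_tilt(delay_first_axis_s, delay_second_axis_s, spacing_first_m, spacing_second_m):
    """Return (pan, tilt) in radians from the delays measured along two perpendicular axes."""
    pan = axis_angle(delay_first_axis_s, spacing_first_m)
    tilt = axis_angle(delay_second_axis_s, spacing_second_m)
    return pan, tilt
```

A zero delay on both axes corresponds to a source straight ahead of the array, so both angles are zero and the camera would not need to move.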

31. The apparatus of claim 28 further comprising a fourth transducer for acquiring a stream of acoustic data during the plurality of time frames, wherein

the first, second and third transducers are located along a first axis, with the second transducer located between the first and third transducers, and the fourth transducer is displaced from the second transducer along a second axis perpendicular to the first axis;

acoustic data acquired by the first and third transducers is used to determine the direction of the source with respect to the first axis, and acoustic data acquired by the second and fourth transducers is used to determine the direction of the source with respect to the second axis; and

the video camera is automatically panned and tilted in the direction of the source in response to the signals received from the means for determining the direction of said source.

32. The apparatus of claim 31 further comprising:

means for determining the position of the source, wherein acoustic data acquired by three of the four transducers is used to determine the position of the source; and

wherein the camera is automatically zoomed so as to frame the source in response to signals received from the means for determining the position of the source.

33. A method of processing a sequence of acoustic signals arriving from an acoustic source for use in videoconferencing comprising the steps of:

acquiring a stream of acoustic data during a plurality of time frames at first, second and third transducers;

determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;

identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;

determining the direction of said source in two dimensions based upon the acoustic received signals; and

aiming a video camera automatically in the direction of the source in response to the step of determining the direction of said source.

34. The method of claim 33 further comprising the steps of:

acquiring a stream of acoustic data during the plurality of time frames at a fourth transducer;

determining the position of the source in a third dimension, wherein acoustic data acquired by three of the four transducers is used to determine the position of the source; and

zooming the video camera automatically so as to frame the source in response to the step of determining the position of the source.

35. A method of operating a video camera for use in videoconferencing comprising the steps of:

displaying images corresponding to video data acquired by the video camera;

acquiring a stream of acoustic data, including a sequence of acoustic signals from a source, during a plurality of time frames;

determining the direction of said source based upon the acquired stream of acoustic data;

aiming the video camera automatically in response to the step of determining by tilting or panning the video camera; and

freezing the image displayed on a display during the period when the video camera is tilting or panning.

36. The method of claim 35 wherein the step of freezing comprises the step of freezing the image appearing on the display at a video frame occurring just prior to panning or tilting the video camera.
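The freezing behavior of claims 35 and 36 can be sketched as a small piece of display logic that holds the last frame shown before the camera started moving. The class name `DisplayFreezer` and the `camera_moving` flag are hypothetical; how a real system learns that the camera is moving is not specified here.

```python
class DisplayFreezer:
    """Chooses which video frame to show while a camera pans or tilts."""

    def __init__(self):
        self._last_shown = None  # most recent frame displayed while the camera was still

    def frame_to_display(self, live_frame, camera_moving):
        # While the camera moves, keep showing the frame captured just before motion began.
        if camera_moving and self._last_shown is not None:
            return self._last_shown
        # Otherwise show the live frame and remember it. This also covers the corner
        # case where motion starts before any frame has been shown.
        self._last_shown = live_frame
        return live_frame

# Hypothetical usage with string placeholders standing in for video frames.
freezer = DisplayFreezer()
shown = [
    freezer.frame_to_display("f0", camera_moving=False),
    freezer.frame_to_display("f1", camera_moving=True),
    freezer.frame_to_display("f2", camera_moving=True),
    freezer.frame_to_display("f3", camera_moving=False),
]
```

In this run the viewer sees `f0` for the whole pan and the display resumes with `f3` once the camera is still.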

37. A method of operating a system comprising at least first and second video cameras for use in videoconferencing, the method comprising:

displaying on a display images corresponding to video data acquired by the first video camera;

acquiring a stream of acoustic data, including a sequence of acoustic signals from a source, during a plurality of time frames;

determining whether the acoustic data acquired during a particular time frame represent the beginning of the sequence of acoustic signals;

identifying acoustic received signals representative of said sequence of signals when said data represents the beginning of the sequence;

determining the direction of said source based upon the acoustic received signals;

aiming the first video camera automatically in response to determining the direction by tilting or panning the video camera; and

displaying on the display images corresponding to video data acquired by the second video camera during the period when the first video camera is tilting or panning.