AUDIO FINGERPRINTING FOR ADVERTISEMENT DETECTION
A device may receive an audio sample, and may separate the audio sample into multiple sub-band signals in multiple frequency bands. The device may modify an upper boundary and a lower boundary of at least one of the frequency bands to form modified frequency bands. The device may modify the sub-band signals to form banded signals associated with the modified frequency bands. The device may smooth the banded signals to form smoothed signal values. The device may identify peak values included in the smoothed signal values, and may generate an audio fingerprint for the audio sample based on the smoothed signal values and the peak values.
An audio fingerprint may refer to a condensed digital summary, generated from an audio sample, that can be used to identify the audio sample or locate similar items in an audio fingerprint database. For example, audio fingerprinting may be used to identify songs, melodies, tunes, etc.
The following detailed description of example implementations refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.
When providing an audio and/or video stream, a service provider may want to identify a segment of the stream based on contents included in the stream. For example, when providing a streaming television service, the service provider may want to identify a particular advertisement so that a substitute advertisement can be provided in place of the particular advertisement, or so that a viewer can be prevented from skipping an advertisement that an advertiser has paid to include in the stream. Such identification can be performed using an audio fingerprint. However, many audio fingerprinting techniques are slow and resource intensive (e.g., requiring a large amount of processing power, storage capacity, etc.), and may not be suitable for identifying an advertisement or another segment of a stream while the stream is being provided (e.g., for display to a viewer). Implementations described herein provide a fast, less resource intensive way to identify audio streams using a compact audio fingerprint.
As further shown in
The audio fingerprints for the audio stream and/or the audio samples may be generated using an audio fingerprinting technique described in more detail elsewhere herein. The audio fingerprinting technique may be used to quickly generate an audio fingerprint, so that an audio stream can be quickly identified before being provided to the user device. Furthermore, the audio fingerprinting technique may reduce a quantity of data points used to generate an audio fingerprint, thereby reducing an amount of storage space required to store the audio fingerprints. In this way, the fingerprint matching device may quickly and efficiently identify audio streams and/or segments of audio streams (e.g., audio samples).
Content serving device 210 may include one or more devices capable of receiving, generating, storing, processing, and/or providing a content stream, such as an audio stream, a video stream, an audio/video stream, etc. For example, content serving device 210 may include a storage device, a server (e.g., a content server, a host server, a web server, an HTTP server, etc.), or a similar device. Content serving device 210 may receive requests for one or more content streams (e.g., from fingerprint matching device 220 and/or user device 240), and may provide the requested content stream(s).
Fingerprint matching device 220 may include one or more devices capable of generating audio fingerprints based on an audio stream and/or an audio sample. For example, fingerprint matching device 220 may include a server (e.g., an application server, a content server, etc.), a traffic transfer device, or the like. In some implementations, fingerprint matching device 220 may receive an audio stream from content serving device 210, may generate an audio fingerprint for the audio stream, and may search for a matching audio fingerprint (e.g., using fingerprint storage device(s) 230) so that the audio stream may be identified. Fingerprint matching device 220 may identify characteristics associated with the matching audio fingerprint, and may control content provided to user device 240 based on the characteristics.
Fingerprint storage device 230 may include one or more devices capable of storing audio fingerprints and/or information associated with audio fingerprints (e.g., an audio identifier, information that identifies one or more characteristics associated with an audio fingerprint, etc.). For example, fingerprint storage device 230 may include a server (e.g., a storage server), a database, a storage device, or the like. Fingerprint matching device 220 may access one or more fingerprint storage devices 230 to identify matching audio fingerprints.
User device 240 may include one or more devices capable of receiving content and providing the received content (e.g., via a display, a speaker, etc.). For example, user device 240 may include a set-top box, a desktop computer, a laptop computer, a tablet, a smart phone, a television, a radio, a gaming system, or the like. In some implementations, user device 240 may receive content and/or instructions for providing the content from fingerprint matching device 220, and may provide the content (e.g., based on the instructions).
Network 250 may include one or more wired and/or wireless networks. For example, network 250 may include a cellular network (e.g., an LTE network, a 3G network, a code division multiple access (CDMA) network, etc.), a public land mobile network (PLMN), a wireless local area network (e.g., a Wi-Fi network), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, and/or a combination of these or other types of networks.
The number and arrangement of devices and networks shown in
Bus 310 may include a component that permits communication among the components of device 300. Processor 320 may include a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), etc.), a microprocessor, and/or any processing component (e.g., a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), etc.) that interprets and/or executes instructions. Memory 330 may include a random access memory (RAM), a read only memory (ROM), and/or another type of dynamic or static storage device (e.g., a flash memory, a magnetic memory, an optical memory, etc.) that stores information and/or instructions for use by processor 320.
Storage component 340 may store information and/or software related to the operation and use of device 300. For example, storage component 340 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, a solid state disk, etc.), a compact disc (CD), a digital versatile disc (DVD), a floppy disk, a cartridge, a magnetic tape, and/or another type of computer-readable medium, along with a corresponding drive.
Input component 350 may include a component that permits device 300 to receive information, such as via user input (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, a microphone, etc.). Additionally, or alternatively, input component 350 may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, an actuator, etc.). Output component 360 may include a component that provides output information from device 300 (e.g., a display, a speaker, one or more light-emitting diodes (LEDs), etc.).
Communication interface 370 may include a transceiver-like component (e.g., a transceiver, a separate receiver and transmitter, etc.) that enables device 300 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 370 may permit device 300 to receive information from another device and/or provide information to another device. For example, communication interface 370 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.
Device 300 may perform one or more processes described herein. Device 300 may perform these processes in response to processor 320 executing software instructions stored by a computer-readable medium, such as memory 330 and/or storage component 340. A computer-readable medium is defined herein as a non-transitory memory device. A memory device includes memory space within a single physical storage device or memory space spread across multiple physical storage devices.
Software instructions may be read into memory 330 and/or storage component 340 from another computer-readable medium or from another device via communication interface 370. When executed, software instructions stored in memory 330 and/or storage component 340 may cause processor 320 to perform one or more processes described herein. Additionally, or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
The number and arrangement of components shown in
As shown in
The audio sample may be received in the time domain, and fingerprint matching device 220 may convert the time domain audio sample to a frequency domain audio sample (e.g., using a Fast Fourier Transform). The frequency domain audio sample may be represented as S[n], where n represents a time index and/or a sample number (e.g., n=0, 1, 2, . . . ∞).
Fingerprint matching device 220 may separate the audio sample into multiple sub-band signals in different frequency bands (e.g., having different frequencies, or falling within different frequency ranges). For example, fingerprint matching device 220 may use a filter bank (e.g., one or more band-pass filters) to separate an input audio signal into multiple audio signals that each carry a particular frequency sub-band of the input audio signal. A particular sub-band signal may be represented as Sbin[f,n], where f represents a sub-band index. Fingerprint matching device 220 may separate the frequency domain audio sample into F sub-bands, such that f=1, 2, . . . , F. The value of F may be configurable, in some implementations. Additionally, or alternatively, the sub-bands may span a linear frequency scale.
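As a minimal sketch of this separation step, the following assumes that a plain discrete Fourier transform over fixed-length, non-overlapping frames stands in for the filter bank. The function name sub_band_signals, the frame length, the use of bin magnitudes, and the zero-based indices are illustrative assumptions and are not taken from the description above.

```python
import cmath
import math


def sub_band_signals(samples, frame_len):
    """Return s_bin[n][f]: magnitude of DFT bin f for frame n (linear scale).

    Assumptions: non-overlapping frames, positive-frequency bins only,
    zero-based indices (the description counts f from 1 to F).
    """
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    num_bins = frame_len // 2  # F sub-bands
    s_bin = []
    for start in starts:
        frame = samples[start:start + frame_len]
        bins = []
        for f in range(num_bins):
            acc = sum(x * cmath.exp(-2j * math.pi * f * t / frame_len)
                      for t, x in enumerate(frame))
            bins.append(abs(acc))
        s_bin.append(bins)
    return s_bin


# A pure tone with 2 cycles per 8-sample frame lands in sub-band index 2.
tone = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
s_bin = sub_band_signals(tone, frame_len=8)
strongest = s_bin[0].index(max(s_bin[0]))
```

A production implementation would more likely use a Fast Fourier Transform with windowing and overlapping frames; the direct transform is used here only to keep the sketch dependency-free.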
As further shown in
In some implementations, fingerprint matching device 220 may calculate Sband[k,n] from Sbin[f,n] as follows:
In the above expression, high[k] may represent an upper boundary of band k, low[k] may represent a lower boundary of band k, and the value of high[k]−low[k] may increase as the value of k increases. In other words, bands that include higher frequency values may have a larger bandwidth.
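The banding step can be sketched as follows, assuming that a banded signal is the sum of the sub-band values from low[k] through high[k], inclusive. The aggregation is an assumption (a mean or an energy sum would fit the description equally well), as are the hypothetical helper names doubling_band_edges and band_signal and the choice of bandwidths that double with each band.

```python
def doubling_band_edges(num_bands):
    """Return (low, high) lists of inclusive, zero-based sub-band indices.

    Assumption: each band is twice as wide as the previous one, so that
    high[k] - low[k] increases with k.
    """
    low, high = [], []
    start, width = 0, 1
    for _ in range(num_bands):
        low.append(start)
        high.append(start + width - 1)
        start += width
        width *= 2
    return low, high


def band_signal(s_bin_frame, low, high):
    """Collapse one frame of sub-band values into one value per band."""
    return [sum(s_bin_frame[low[k]:high[k] + 1]) for k in range(len(low))]


low, high = doubling_band_edges(4)
s_band_frame = band_signal([1.0] * 15, low, high)
```

With four bands, the sketch covers 15 sub-bands using bandwidths of 1, 2, 4, and 8, so higher bands aggregate more sub-bands than lower bands.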
As further shown in
Slpf[k,n]=α×Sband[k,n]+(1−α)×Slpf[k,n−1]
In the above expression, α may represent a configurable smoothing factor.
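The recursion for Slpf[k,n] can be sketched as follows. The function name low_pass, the frame-major list layout (s_band[n][k]), and the zero initial state for the step before the first frame are assumptions made for illustration.

```python
def low_pass(s_band, alpha):
    """Apply S_lpf[k,n] = alpha*S_band[k,n] + (1-alpha)*S_lpf[k,n-1] per band.

    s_band is indexed as s_band[n][k]. The state before the first frame is
    assumed to be zero.
    """
    prev = [0.0] * len(s_band[0])
    s_lpf = []
    for frame in s_band:
        cur = [alpha * x + (1 - alpha) * p for x, p in zip(frame, prev)]
        s_lpf.append(cur)
        prev = cur
    return s_lpf


# One band, three time steps, smoothing factor 0.5.
s_lpf = low_pass([[1.0], [1.0], [0.0]], alpha=0.5)
```

A smoothing factor closer to 1 tracks the banded signal more closely, while a smoothing factor closer to 0 weights the history more heavily.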
In some implementations, fingerprint matching device 220 may stabilize peak values, included in the banded signals, to form smoothed signals (e.g., by reducing noise and/or oscillations near the peak values). A smoothed signal may be represented as Ssm[k,n]. Fingerprint matching device 220 may generate a smoothed signal based on a filtered signal and/or a configurable decay factor β (e.g., β<1). For example, fingerprint matching device 220 may compare a filtered signal, for a particular band index and a current time step, to a product of the decay factor and a previous smoothed signal, associated with the particular band index and a previous time step. If the value of the filtered signal is greater than or equal to the product, then fingerprint matching device 220 may set a value of a current smoothed signal, for the particular band index and the current time step, equal to the value of the filtered signal. Otherwise, if the value of the filtered signal is less than the product, then fingerprint matching device 220 may set a value of a current smoothed signal, for the particular band index and the current time step, equal to the product. In this way, fingerprint matching device 220 may reduce noise near the peak values.
In other words, fingerprint matching device 220 may determine Ssm[k,n] from Slpf[k,n] as follows:
if (Slpf[k,n]≥Ssm[k,n−1]×β):
- then Ssm[k,n]=Slpf[k,n]
- else Ssm[k,n]=Ssm[k,n−1]×β
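The comparison above can be sketched as follows. The function name stabilize_peaks, the frame-major list layout (s_lpf[n][k]), and the zero initial value for the smoothed signal are assumptions made for illustration.

```python
def stabilize_peaks(s_lpf, beta):
    """Return s_sm[n][k] by holding each peak and letting it decay by beta.

    A filtered value replaces the held value only when it is at least as
    large as the decayed previous smoothed value. The smoothed signal before
    the first frame is assumed to be zero.
    """
    prev = [0.0] * len(s_lpf[0])
    s_sm = []
    for frame in s_lpf:
        cur = [x if x >= p * beta else p * beta for x, p in zip(frame, prev)]
        s_sm.append(cur)
        prev = cur
    return s_sm


# One band: a peak of 4.0 is held and decays until a larger value arrives.
s_sm = stabilize_peaks([[4.0], [1.0], [3.0], [0.0]], beta=0.5)
```

In this example, the value 1.0 at the second time step is replaced by the decayed peak (2.0), which suppresses small fluctuations that closely follow a peak.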
As further shown in
In other words, Peak[k,n] may be set equal to the maximum value of Ssm[k,n] within a frequency band window of size 2×H centered around k (e.g., from k−H to k+H) and within a time window of size 2×W centered around n (e.g., from n−W to n+W). The values of H and W may be configurable, in some implementations. In the above expression, M may represent a quantity of bands (e.g., k=1, 2, . . . , M), and N may represent a quantity of time vectors (e.g., n=1, 2, . . . , N). The expressions max(1, k−H) and min(M, k+H) may be used to ensure that the frequency band window does not fall outside of the range of k (e.g., from 1 to M). Similarly, the expressions max(1, n−W) and min(N, n+W) may be used to ensure that the time window does not fall outside of the range of n (e.g., from 1 to N).
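The windowed maximum described above can be sketched as follows, using zero-based indices so that the clamping expressions become max(0, ...) and min(M−1, ...) or min(N−1, ...). The function name local_max and the frame-major list layout (values[n][k]) are assumptions made for illustration.

```python
def local_max(values, half_bands, half_frames):
    """Return peak[n][k]: the maximum of values over a clamped window.

    The window spans bands k-half_bands..k+half_bands and frames
    n-half_frames..n+half_frames, clipped to the valid index ranges.
    """
    num_frames, num_bands = len(values), len(values[0])
    peak = []
    for n in range(num_frames):
        n_lo = max(0, n - half_frames)
        n_hi = min(num_frames - 1, n + half_frames)
        row = []
        for k in range(num_bands):
            k_lo = max(0, k - half_bands)
            k_hi = min(num_bands - 1, k + half_bands)
            row.append(max(values[i][j]
                           for i in range(n_lo, n_hi + 1)
                           for j in range(k_lo, k_hi + 1)))
        peak.append(row)
    return peak


s_sm = [[1, 2, 3],
        [4, 9, 5],
        [6, 7, 8]]
peak_time_only = local_max(s_sm, half_bands=0, half_frames=1)
peak_full = local_max(s_sm, half_bands=1, half_frames=1)
```

This direct form costs on the order of H×W comparisons per value. A separable or sliding-window maximum would be faster, but the direct form mirrors the description more closely.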
As further shown in
In some implementations, fingerprint matching device 220 may prune peak values by identifying a maximum value (e.g., a local maximum) within a frequency band window centered around band k, and within a time window centered around time n. A pruned peak value associated with a particular band k and time n may be represented as Peakprune[k,n], and may be calculated as follows:
In other words, Peakprune[k,n] may be set equal to the maximum value of Peak[k,n] within a frequency band window of size 2×Hprune centered around k (e.g., from k−Hprune to k+Hprune) and within a time window of size 2×Wprune centered around n (e.g., from n−Wprune to n+Wprune). The values of Hprune and Wprune may be configurable, and may be set to different values than H and W, respectively. In some implementations, Hprune may be set to a value greater than H, and Wprune may be set to a value less than W.
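The two passes (peak identification followed by pruning) can be sketched as follows. The helper name window_max and the sample values are hypothetical. The window sizes are chosen so that the pruning band window is wider, and the pruning time window narrower, than in the first pass.

```python
def window_max(values, half_bands, half_frames):
    """Maximum of values[n][k] over a window clipped to the valid indices."""
    num_frames, num_bands = len(values), len(values[0])
    out = []
    for n in range(num_frames):
        rows = range(max(0, n - half_frames),
                     min(num_frames, n + half_frames + 1))
        out.append([
            max(values[i][j]
                for i in rows
                for j in range(max(0, k - half_bands),
                               min(num_bands, k + half_bands + 1)))
            for k in range(num_bands)
        ])
    return out


s_sm = [[1, 2, 3],
        [4, 9, 5],
        [6, 7, 8]]
# First pass: H=0, W=1. Pruning pass: Hprune=1 (> H), Wprune=0 (< W).
peak = window_max(s_sm, half_bands=0, half_frames=1)
peak_prune = window_max(peak, half_bands=1, half_frames=0)
# Positions (n, k) whose smoothed value survives as a pruned peak.
survivors = [(n, k)
             for n in range(len(s_sm))
             for k in range(len(s_sm[0]))
             if peak_prune[n][k] == s_sm[n][k]]
```

In this example, only the value 9 at frame 1, band 1 remains a peak after pruning, which reduces the quantity of data points carried into the audio fingerprint.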
Fingerprint matching device 220 may use the pruned peak values to generate an audio fingerprint for the audio sample, as described in more detail elsewhere herein.
Although
As shown in
Fingerprint matching device 220 may determine whether the time index n satisfies a time index threshold, such as a maximum time index value N. For example, fingerprint matching device 220 may compare n to N, may determine that the time index threshold is satisfied when n is less than or equal to N, and may determine that the time index threshold is not satisfied when n is greater than N.
As further shown in
Fingerprint matching device 220 may initialize a band index k by setting the band index k equal to an initial value (e.g., 1). Fingerprint matching device 220 may use the band index to assist in performing an audio fingerprint algorithm to generate an audio fingerprint.
As further shown in
If Peakprune[k,n] is equal to Ssm[k,n] (block 525—YES), then process 500 may include setting a code vector bit, corresponding to the band index, equal to a first value (block 530). For example, if Peakprune[k,n] is equal to Ssm[k,n], then a signal corresponding to time index n and band index k corresponds to a peak value. In this case, fingerprint matching device 220 may indicate this peak value by setting a corresponding bit of the code vector equal to a first value. For example, fingerprint matching device 220 may set bit k−1 of code vector code[n] equal to one.
If Peakprune[k,n] is not equal to Ssm[k,n] (block 525—NO), then process 500 may include setting a code vector bit, corresponding to the band index, equal to a second value (block 535). For example, if Peakprune[k,n] is not equal to Ssm[k,n], then a signal corresponding to time index n and band index k does not correspond to a peak value. In this case, fingerprint matching device 220 may indicate this non-peak value by setting a corresponding bit of the code vector equal to a second value. For example, fingerprint matching device 220 may set bit k−1 of code vector code[n] equal to zero.
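The bit-setting logic of blocks 525 through 535 can be sketched as follows, with the code vector packed into an integer. The function name code_vector and the integer packing are assumptions made for illustration. Band indices are zero-based here, so band index k maps to bit k rather than to bit k−1.

```python
def code_vector(peak_prune_frame, s_sm_frame):
    """Return code[n] as an integer bit mask for one time index.

    Bit k is set to one when the pruned peak value equals the smoothed
    signal value for band k, and is left at zero otherwise.
    """
    code = 0
    for k, (peak, smoothed) in enumerate(zip(peak_prune_frame, s_sm_frame)):
        if peak == smoothed:
            code |= 1 << k
    return code


with_peak = code_vector([9, 9, 9], [4, 9, 5])     # band 1 is a peak
without_peak = code_vector([9, 9, 9], [1, 2, 3])  # no band is a peak
```

A code vector equal to zero indicates that no band includes a peak value at that time index.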
As further shown in
As further shown in
As further shown in
If the current code vector does not include a peak value (block 550—NO), then process 500 may include incrementing the time index (block 555) and returning to block 510. For example, fingerprint matching device 220 may increment the value of n (e.g., n=n+1). Fingerprint matching device 220 may return to block 510 to continue to analyze whether other time index values n (e.g., each n from n=1 through N) include peak values.
If the current code vector includes a peak value (block 550—YES), then process 500 may include generating a hash value from the current code vector, and generating a fingerprint value, for a current fingerprint index, that identifies the current time index and the hash value associated with the current time index (block 560). For example, the code vector may include M bits, and fingerprint matching device 220 may apply a hashing algorithm to generate a hash value hash[n], corresponding to the current time index, from code vector code[n]. For example, the hashing algorithm may include SHA1, SHA2, MD5, etc. The hash value may include fewer bits than the code vector, thereby reducing a size of an audio fingerprint that includes the hash value rather than the code vector.
Fingerprint matching device 220 may generate a fingerprint value FP[j], where FP[j] includes a pair of corresponding values {n, hash[n]}. When there is no peak value associated with time index n (e.g., when code[n]=null set=0), fingerprint matching device 220 may not store a fingerprint value for time index n.
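The generation of fingerprint values can be sketched as follows, assuming that code vectors are packed into integers and hashed with SHA-1 from the Python standard library. Truncating the digest to a few bytes is an assumption made to keep each hash value compact, and the function name fingerprint_values and the hash_bytes parameter are hypothetical.

```python
import hashlib


def fingerprint_values(codes, num_bands, hash_bytes=4):
    """Return a list of (n, hash[n]) pairs for time indices that have a peak.

    codes[n] is an integer bit mask with one bit per band. Time indices
    whose code vector is zero (no peak) produce no fingerprint value.
    """
    width = (num_bands + 7) // 8  # bytes needed to hold num_bands bits
    fp = []
    for n, code in enumerate(codes):
        if code == 0:
            continue
        digest = hashlib.sha1(code.to_bytes(width, "big")).hexdigest()
        fp.append((n, digest[:hash_bytes * 2]))  # truncated hex digest
    return fp


fp = fingerprint_values([0, 0b010, 0, 0b101], num_bands=3)
```

The position of each pair in the returned list plays the role of the fingerprint index j.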
As further shown in
If the time index does not satisfy the time index threshold (block 510—NO), then process 500 may include storing an audio fingerprint that includes an audio identifier and one or more fingerprint values (block 570). For example, when fingerprint matching device 220 has finished analyzing peak values for all time index values n=1 through N, then fingerprint matching device 220 may generate an audio fingerprint. In some implementations, the audio fingerprint may include an audio identifier (e.g., an advertisement identifier, a song identifier, etc.). Additionally, or alternatively, the audio fingerprint may include each generated fingerprint value FP[j] for j=1 to J. The value of J may correspond to the quantity of time index values n associated with a peak value.
Although
As shown in
As indicated above,
As shown in
As further shown in
As further shown in
If none of the stored audio fingerprints in the set share a correlation (e.g., based on a configurable matching threshold) with the generated audio fingerprint, then fingerprint matching device 220 may determine that there is no match. Fingerprint matching device 220 may perform an action based on determining that there is no match, such as by providing the audio stream to another device (e.g., user device 240) without instructions that may otherwise be provided if there were a match.
In some implementations, fingerprint matching device 220 may generate a histogram to determine a correlation between a stored audio fingerprint and the generated audio fingerprint. For each matching hash value included in the generated audio fingerprint and the stored audio fingerprint, fingerprint matching device 220 may calculate:
Δn=nmatching−ngenerated
In the above expression, nmatching may represent a time index value n paired with a matching hash value included in the stored audio fingerprint, and ngenerated may represent a time index value n paired with a matching hash value included in the generated audio fingerprint. For a particular stored audio fingerprint, fingerprint matching device 220 may plot the Δn values (e.g., for each pair of matching hash values) over the time indices in a histogram, and may determine whether a matching threshold is satisfied. In other words, fingerprint matching device 220 may determine whether a quantity of Δn values, for a particular time index, satisfies a matching threshold.
In some implementations, fingerprint matching device 220 may calculate a ratio of the quantity of Δn values, for a particular time index, to a total quantity of matching hash values between the particular stored audio fingerprint and the generated audio fingerprint, and may determine whether the ratio satisfies the matching threshold (e.g., which may be a configurable value set to, for example, 0.5, 0.6, etc.).
If a particular stored audio fingerprint satisfies the matching threshold, then fingerprint matching device 220 may identify the particular stored audio fingerprint as a matching audio fingerprint. If multiple stored audio fingerprints satisfy the matching threshold, then fingerprint matching device 220 may identify the stored audio fingerprint with a highest match ratio (e.g., a highest ratio of Δn values, for a particular time index, to matching hash values) as the matching audio fingerprint.
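The matching step can be sketched as follows, assuming that the histogram is binned by the value of Δn and that the match ratio is the count in the most populated bin divided by the total quantity of matching hash pairs. This interpretation, the function names match_ratio and best_match, the sample identifiers, and the sample hash values are assumptions made for illustration.

```python
from collections import Counter


def match_ratio(stored_fp, generated_fp):
    """Return (most common delta_n, ratio) for two lists of (n, hash) pairs."""
    stored_by_hash = {}
    for n, h in stored_fp:
        stored_by_hash.setdefault(h, []).append(n)
    offsets = Counter()
    for n_generated, h in generated_fp:
        for n_matching in stored_by_hash.get(h, []):
            offsets[n_matching - n_generated] += 1  # delta_n
    total = sum(offsets.values())
    if total == 0:
        return None, 0.0
    delta_n, count = offsets.most_common(1)[0]
    return delta_n, count / total


def best_match(stored, generated_fp, threshold=0.5):
    """Return the identifier with the highest ratio at or above threshold."""
    best_id, best_ratio = None, 0.0
    for audio_id, stored_fp in stored.items():
        _, ratio = match_ratio(stored_fp, generated_fp)
        if ratio >= threshold and ratio > best_ratio:
            best_id, best_ratio = audio_id, ratio
    return best_id


stored = {"Ad 1": [(2, "zzzz")],
          "Ad 3": [(1, "Rk0L"), (7, "qq88")]}
generated = [(31, "Rk0L"), (37, "qq88"), (40, "abcd")]
result = match_ratio(stored["Ad 3"], generated)
winner = best_match(stored, generated)
```

Because both matching hash values share the same Δn, all matching pairs fall into a single histogram bin and the ratio is 1.0. Unrelated audio tends to spread its accidental hash matches across many bins.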
As further shown in
While described herein with respect to advertisements, fingerprint matching device 220 may perform processes described herein for other types of audio content. For example, the audio stream may include a song clip, and fingerprint matching device 220 may determine a song identifier based on the song clip, and may provide the song identifier to user device 240 (e.g., for display). In this way, fingerprint matching device 220 may use the audio fingerprint generation technique described herein to identify any type of audio.
Although
As shown in
As shown by reference number 840, fingerprint matching device 220 may identify matching hash values included in an audio fingerprint for an audio sample identified as “Ad 3.” The matching hash values are shown as “Rk0L” and “qq88.” As shown, the time index values in the generated audio fingerprint, shown as 31 and 37, do not match the corresponding time index values in the stored audio fingerprint, shown as 1 and 7. This is because fingerprint matching device 220 analyzes the audio stream as the audio stream is received, and does not know where a new audio sample (e.g., segment), included in the audio stream, begins and ends. However, the difference between the time index values (e.g., Δn) is the same for these matching hash values (e.g., 37−31=7−1=6). Assume that fingerprint matching device 220 calculates a ratio of the quantity of matching time offset values to the total quantity of matching hash values, and determines that the ratio satisfies a matching threshold, as described in more detail in connection with
As shown in
As further shown, fingerprint matching device 220 may generate a histogram 880 of time offset differences for matching hash locations with respect to a particular time index (e.g., in the stored audio fingerprint). As shown by reference number 890, a high quantity of time offset differences (e.g., that satisfies a matching threshold) at a particular time index value may indicate that the stored audio fingerprint is a matching audio fingerprint.
As indicated above,
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.
As used herein, the term component is intended to be broadly construed as hardware, firmware, or a combination of hardware and software.
Some implementations are described herein in connection with thresholds. As used herein, satisfying a threshold may refer to a value being greater than the threshold, more than the threshold, higher than the threshold, greater than or equal to the threshold, less than the threshold, fewer than the threshold, lower than the threshold, less than or equal to the threshold, equal to the threshold, etc.
To the extent the aforementioned embodiments collect, store or employ personal information provided by individuals, it should be understood that such information shall be used in accordance with all applicable laws concerning protection of personal information. Additionally, the collection, storage and use of such information may be subject to consent of the individual to such activity, for example, through well known “opt-in” or “opt-out” processes as may be appropriate for the situation and type of information. Storage and use of personal information may be in an appropriately secure manner reflective of the type of information, for example, through various encryption and anonymization techniques for particularly sensitive information.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code—it being understood that software and hardware can be designed to implement the systems and/or methods based on the description herein.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Furthermore, as used herein, the term “set” is intended to include one or more items, and may be used interchangeably with “one or more.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.
Claims
1. A device, comprising:
- one or more processors to: receive an audio sample; separate the audio sample into a plurality of sub-band signals in a plurality of frequency bands; modify an upper boundary and a lower boundary of at least one of the plurality of frequency bands to form a plurality of modified frequency bands; modify the plurality of sub-band signals to form a plurality of banded signals associated with the plurality of modified frequency bands; smooth the plurality of banded signals to form a plurality of smoothed signal values; identify a plurality of peak values included in the plurality of smoothed signal values; and generate an audio fingerprint for the audio sample based on the plurality of smoothed signal values and the plurality of peak values.
2. The device of claim 1, where the one or more processors, when modifying the upper boundary and the lower boundary, are further to:
- convert a linear frequency scale of a frequency band, of the plurality of frequency bands, to a logarithmic frequency scale to form a modified frequency band, of the plurality of modified frequency bands.
3. The device of claim 1, where the one or more processors are further to:
- compare a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index identifying a modified frequency band of the plurality of modified frequency bands, the time index identifying a time associated with the audio sample; and
- where the one or more processors, when generating the audio fingerprint, are further to: generate the audio fingerprint based on comparing the peak value to the smoothed signal value.
4. The device of claim 3, where the one or more processors, when generating the audio fingerprint, are further to:
- generate a code vector, corresponding to the time index, to be included in the audio fingerprint;
- insert a first value or a second value into the code vector, at a position corresponding to the band index, based on comparing the peak value and the smoothed signal value, the first value being inserted when the peak value and the smoothed signal value are a same value, the second value being inserted when the peak value and the smoothed signal value are a different value;
- generate a hash value, corresponding to the time index, based on the code vector; and
- include the hash value in the audio fingerprint.
5. The device of claim 1, where the one or more processors are further to:
- select a subset of the plurality of peak values to form a plurality of pruned peak values;
- compare the plurality of pruned peak values, corresponding to different pairs of band indexes and time indexes, to respective smoothed signal values, of the plurality of smoothed signal values, corresponding to the different pairs of band indexes and time indexes, the band indexes identifying modified frequency bands of the plurality of modified frequency bands, the time indexes identifying times associated with the audio sample; and
- where the one or more processors, when generating the audio fingerprint, are further to: generate the audio fingerprint based on comparing the plurality of pruned peak values to the smoothed signal values.
6. The device of claim 1, where the one or more processors are further to:
- for each of the plurality of modified frequency bands, compare a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index corresponding to a modified frequency band of the plurality of modified frequency bands, the time index corresponding to a time associated with the audio sample; and
- generate a code vector based on comparing the peak value to the smoothed signal value for each of the plurality of modified frequency bands, the code vector having a length that corresponds to a quantity of modified frequency bands included in the plurality of modified frequency bands.
7. The device of claim 1, where the one or more processors are further to:
- cause a search of a data structure to be performed using the generated audio fingerprint;
- identify a matching audio fingerprint, stored in the data structure, based on the search of the data structure; and
- provide an audio identifier that identifies the matching audio fingerprint.
8. A computer-readable medium storing instructions, the instructions comprising:
- one or more instructions that, when executed by one or more processors, cause the one or more processors to: receive an audio sample; separate the audio sample into a plurality of sub-band signals in a plurality of frequency bands; modify an upper boundary and a lower boundary of at least one of the plurality of frequency bands to form a plurality of modified frequency bands; modify the plurality of sub-band signals to form a plurality of banded signals associated with the plurality of modified frequency bands; smooth the plurality of banded signals to form a plurality of smoothed signal values; identify a plurality of peak values included in the plurality of smoothed signal values; and generate an audio fingerprint for the audio sample based on the plurality of smoothed signal values and the plurality of peak values.
9. The computer-readable medium of claim 8, where the one or more instructions, that cause the one or more processors to modify the upper boundary and the lower boundary, further cause the one or more processors to:
- convert a linear frequency scale of a frequency band, of the plurality of frequency bands, to a logarithmic frequency scale to form a modified frequency band, of the plurality of modified frequency bands.
10. The computer-readable medium of claim 8, where the one or more instructions, when executed by the one or more processors, further cause the one or more processors to:
- compare a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index identifying a modified frequency band of the plurality of modified frequency bands, the time index identifying a time associated with the audio sample; and
- where the one or more instructions, that cause the one or more processors to generate the audio fingerprint, further cause the one or more processors to: generate the audio fingerprint based on comparing the peak value to the smoothed signal value.
11. The computer-readable medium of claim 10, where the one or more instructions, that cause the one or more processors to generate the audio fingerprint, further cause the one or more processors to:
- generate a code vector, corresponding to the time index, to be included in the audio fingerprint;
- insert a first value or a second value into the code vector, at a position corresponding to the band index, based on comparing the peak value and the smoothed signal value, the first value being inserted when the peak value and the smoothed signal value are a same value, the second value being inserted when the peak value and the smoothed signal value are a different value;
- generate a hash value, corresponding to the time index, based on the code vector; and
- include the hash value in the audio fingerprint.
12. The computer-readable medium of claim 8, where the one or more instructions, when executed by the one or more processors, further cause the one or more processors to:
- select a subset of the plurality of peak values to form a plurality of pruned peak values;
- compare the plurality of pruned peak values, corresponding to different pairs of band indexes and time indexes, to respective smoothed signal values, of the plurality of smoothed signal values, corresponding to the different pairs of band indexes and time indexes, the band indexes identifying modified frequency bands of the plurality of modified frequency bands, the time indexes identifying times associated with the audio sample; and
- where the one or more instructions, that cause the one or more processors to generate the audio fingerprint, further cause the one or more processors to: generate the audio fingerprint based on comparing the plurality of pruned peak values to the smoothed signal values.
13. The computer-readable medium of claim 8, where the one or more instructions, when executed by the one or more processors, further cause the one or more processors to:
- for each of the plurality of modified frequency bands, compare a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index corresponding to a modified frequency band of the plurality of modified frequency bands, the time index corresponding to a time associated with the audio sample; and
- generate a code vector based on comparing the peak value to the smoothed signal value for each of the plurality of modified frequency bands, the code vector having a length that corresponds to a quantity of modified frequency bands included in the plurality of modified frequency bands.
14. The computer-readable medium of claim 8, where the one or more instructions, when executed by the one or more processors, further cause the one or more processors to:
- cause a search of a data structure to be performed using the generated audio fingerprint;
- identify a matching audio fingerprint, stored in the data structure, based on the search of the data structure; and
- provide an audio identifier that identifies the matching audio fingerprint.
15. A method, comprising:
- receiving, by a device, an audio sample;
- separating, by the device, the audio sample into a plurality of sub-band signals in a plurality of frequency bands;
- modifying, by the device, an upper boundary and a lower boundary of at least one of the plurality of frequency bands to form a plurality of modified frequency bands;
- modifying, by the device, the plurality of sub-band signals to form a plurality of banded signals associated with the plurality of modified frequency bands;
- smoothing, by the device, the plurality of banded signals to form a plurality of smoothed signal values;
- identifying, by the device, a plurality of peak values included in the plurality of smoothed signal values;
- generating, by the device, an audio fingerprint for the audio sample based on the plurality of smoothed signal values and the plurality of peak values;
- causing, by the device, a search of a data structure to be performed using the generated audio fingerprint;
- identifying, by the device, a matching audio fingerprint, stored in the data structure, based on the search of the data structure; and
- providing, by the device, an audio identifier associated with the matching audio fingerprint.
16. The method of claim 15, where modifying the upper boundary and the lower boundary further comprises:
- converting a linear frequency scale of a frequency band, of the plurality of frequency bands, to a logarithmic frequency scale to form a modified frequency band, of the plurality of modified frequency bands.
17. The method of claim 15, further comprising:
- comparing a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index identifying a modified frequency band of the plurality of modified frequency bands, the time index identifying a time associated with the audio sample; and
- where generating the audio fingerprint further comprises: generating the audio fingerprint based on comparing the peak value to the smoothed signal value.
18. The method of claim 17, where generating the audio fingerprint further comprises:
- generating a code vector, corresponding to the time index, to be included in the audio fingerprint;
- inserting a first value or a second value into the code vector, at a position corresponding to the band index, based on comparing the peak value and the smoothed signal value, the first value being inserted when the peak value and the smoothed signal value are a same value, the second value being inserted when the peak value and the smoothed signal value are a different value;
- generating a hash value, corresponding to the time index, based on the code vector; and
- including the hash value in the audio fingerprint.
19. The method of claim 15, further comprising:
- selecting a subset of the plurality of peak values to form a plurality of pruned peak values;
- comparing the plurality of pruned peak values, corresponding to different pairs of band indexes and time indexes, to respective smoothed signal values, of the plurality of smoothed signal values, corresponding to the different pairs of band indexes and time indexes, the band indexes identifying modified frequency bands of the plurality of modified frequency bands, the time indexes identifying times associated with the audio sample; and
- where generating the audio fingerprint further comprises: generating the audio fingerprint based on comparing the plurality of pruned peak values to the smoothed signal values.
20. The method of claim 15, further comprising:
- for each of the plurality of modified frequency bands, comparing a peak value, of the plurality of peak values, that corresponds to a band index and a time index, to a smoothed signal value, of the plurality of smoothed signal values, that corresponds to the band index and the time index, the band index corresponding to a modified frequency band of the plurality of modified frequency bands, the time index corresponding to a time associated with the audio sample; and
- generating a code vector based on comparing the peak value to the smoothed signal value for each of the plurality of modified frequency bands, the code vector having a length that corresponds to a quantity of modified frequency bands included in the plurality of modified frequency bands.
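As an informal illustration of the code vector, hash value, and data structure search recited in claims 15, 18, and 20, the following Python sketch may be helpful. It is not part of the claims and rests on several assumptions that the claims do not state: the smoothed signal values and peak values are held in lists indexed as `[band index][time index]`, the first value is `1` and the second value is `0`, the hash is a truncated SHA-1 digest, and the data structure is a dictionary that maps hash values to audio identifiers. All function names and the `"ad-123"` identifier are hypothetical.

```python
import hashlib
from collections import Counter


def code_vector(smoothed, peaks, t, first=1, second=0):
    """Build one code vector for time index t, with one entry per band.

    `first` is inserted when the peak value equals the smoothed value at
    (band, t), and `second` otherwise. The 1/0 encoding is an assumption.
    """
    return [first if peaks[b][t] == smoothed[b][t] else second
            for b in range(len(smoothed))]


def hash_code_vector(vec):
    # Assumed hash: SHA-1 over the entries written as text, cut to 8 hex digits.
    bits = "".join(str(v) for v in vec)
    return hashlib.sha1(bits.encode("ascii")).hexdigest()[:8]


def fingerprint(smoothed, peaks):
    """Return one hash value per time index."""
    n_times = len(smoothed[0])
    return [hash_code_vector(code_vector(smoothed, peaks, t))
            for t in range(n_times)]


def best_match(fp, index):
    """Return the audio identifier sharing the most hash values with fp.

    `index` is a hypothetical dict: hash value -> set of audio identifiers.
    Returns None when no hash value is found in the index.
    """
    votes = Counter()
    for h in fp:
        for audio_id in index.get(h, ()):
            votes[audio_id] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data: 3 modified frequency bands, 4 time indexes.
smoothed = [
    [0.2, 0.9, 0.4, 0.1],
    [0.5, 0.3, 0.8, 0.2],
    [0.1, 0.1, 0.6, 0.7],
]
# Peak values are assumed to be copied from the smoothed values, so an
# exact equality test is meaningful for this sketch.
peaks = [
    [0.9, 0.9, 0.9, 0.9],
    [0.5, 0.5, 0.8, 0.8],
    [0.1, 0.1, 0.6, 0.7],
]

fp = fingerprint(smoothed, peaks)
index = {h: {"ad-123"} for h in fp}   # hypothetical fingerprint store
match = best_match(fp, index)         # "ad-123"
```

In this sketch each code vector has a length equal to the quantity of bands (three here), and the fingerprint holds one hash value per time index. A production search would likely tolerate partial matches and time offsets, which this example does not attempt.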
Type: Application
Filed: Mar 27, 2014
Publication Date: Oct 1, 2015
Applicant: Verizon Patent and Licensing Inc. (Basking Ridge, NJ)
Inventors: Erwin GOESNAR (Daly City, CA), Ravi Kalluri (San Jose, CA)
Application Number: 14/227,659