How Song Recognition Works, with Sample Code

Latest recommended article published on 2023-12-01 10:00:39

安安爸Chris


Category: Intelligent Speech. Tags: shazam

Article link: https://blog.youkuaiyun.com/mimiduck/article/details/118632364


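At the end of the last century, Shazam Entertainment built an audio fingerprinting algorithm on the constellation-map features of the spectrogram: it picks peaks and builds a fingerprint system to match music. This article explains how fingerprints are computed and matched together with their time information, which turns music search into a "find the diagonal line" problem. The algorithm is robust to noise and relies on an efficient data structure.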


Background

Song recognition is a mature technology. Almost every mainstream music player now offers it.

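Figure: the interface of a song-recognition app
The first to try it, however, was a company founded at the end of the last century called **"Shazam Entertainment Limited"**, which Apple acquired in 2018 for a reported 400 million US dollars.

The figure above shows the Shazam app, which is used for identifying the songs playing around you.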

Algorithm

The algorithm follows the paper An Industrial-Strength Audio Search Algorithm.

Constellation map

The algorithm cleverly relies on the local maxima of the **spectrogram**:

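A spectrogram is obtained by applying a short-time Fourier transform to the time-domain signal. Its x-axis is time, its y-axis is frequency, and the value at each point is the amplitude (or energy) of that frequency at that time.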
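As a rough illustration of these axes, the sketch below computes a magnitude spectrogram with plain NumPy. The function name `spectrogram`, the frame size of 256 samples, the hop of 128 samples, and the 8 kHz sample rate are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def spectrogram(y, n_fft=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    # cut the signal into overlapping, windowed frames
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # the real FFT of each frame gives n_fft // 2 + 1 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000                            # assumed sample rate in Hz
t = np.arange(sr) / sr               # one second of audio
y = np.sin(2 * np.pi * 1000 * t)     # a pure 1000 Hz tone
S = spectrogram(y)
peak_bin = int(np.argmax(S[:, 0]))   # strongest frequency bin in the first frame
freq_hz = peak_bin * sr / 256        # convert the bin index back to Hz
print(S.shape, freq_hz)
```

For a pure tone, every column of the matrix has its maximum in the row that corresponds to the tone's frequency.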

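Experiments show that these maxima are fairly robust to noise. Keeping only the local maxima of the spectrogram yields a sparse matrix that looks like a constellation map.

The constellation map discards the amplitude values and keeps only the positions of the maxima, so it carries less information. That loss is also what greatly reduces the computation the algorithm needs. It is an elegant case of improving by subtraction.

Figure: constellation map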
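One possible way to reduce a spectrogram to a constellation map is to keep only the cells that are strictly larger than all of their neighbours. The sketch below does this with NumPy. The function name `constellation`, the 3x3 neighbourhood, and the `min_amp` threshold are illustrative assumptions; the paper does not prescribe this exact peak picker.

```python
import numpy as np

def constellation(S, neighborhood=1, min_amp=0.0):
    """Return a boolean matrix that is True only at strict local maxima of S."""
    n_freq, n_time = S.shape
    # pad with -inf so that border cells can still be compared with "neighbours"
    padded = np.pad(S, neighborhood, mode="constant", constant_values=-np.inf)
    is_peak = S > min_amp
    for df in range(-neighborhood, neighborhood + 1):
        for dt in range(-neighborhood, neighborhood + 1):
            if df == 0 and dt == 0:
                continue
            shifted = padded[neighborhood + df:neighborhood + df + n_freq,
                             neighborhood + dt:neighborhood + dt + n_time]
            is_peak &= S > shifted
    return is_peak

# toy spectrogram: rows are frequency bins, columns are time frames
S = np.array([[1, 2, 1, 0],
              [2, 9, 2, 0],
              [1, 2, 1, 7],
              [0, 0, 3, 1]], dtype=float)
peaks = constellation(S)
# each constellation point is a (time, frequency) pair; amplitudes are dropped
points = [(int(t), int(f)) for f, t in np.argwhere(peaks)]
print(points)
```

Only the positions survive, which is why the result is sparse and cheap to process.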

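Fingerprint

The author then came up with a clever way to build a fingerprint system from the points of the constellation map.

Pick any point in the map as an anchor, choose a region near it (the target zone), and pair the anchor with every point in that region, as shown in Figure 1C.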

[Figure 1C/1D: an anchor point paired with the points in its target zone]

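Each point in the constellation map is a (time, frequency) coordinate, so a pair of points is combined as follows (Figure 1D):
$\mathrm{fingerprint} = \mathrm{hashcode}(f_1 : f_2 : t_2 - t_1)$

Each fingerprint is stored together with the absolute time t1 of its anchor, so that a match in the index also yields the absolute time at which it occurred.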

The number of points in a target zone is roughly constant. Call it M (the paper calls this the fan-out factor).

If a song has N constellation points, it produces roughly N × M fingerprints.
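The pairing step can be sketched as follows. This is a minimal illustration, assuming the constellation points are already available as (time, frequency) tuples. The function name `make_fingerprints` and the parameters `zone_width` and `fan_out` are hypothetical, and MD5 is used only because it is a convenient hash, not because the paper specifies it.

```python
import hashlib

def make_fingerprints(points, zone_width=3, fan_out=2):
    """points: list of (t, f) constellation points. Returns [(hash, t1), ...]."""
    points = sorted(points)
    result = []
    for i, (t1, f1) in enumerate(points):
        paired = 0
        for t2, f2 in points[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue          # skip points in the same time frame
            if dt > zone_width:
                break             # beyond the target zone of this anchor
            key = hashlib.md5(f"{f1}:{f2}:{dt}".encode("utf-8")).hexdigest()
            result.append((key, t1))   # keep the anchor's absolute time
            paired += 1
            if paired >= fan_out:
                break             # at most fan_out pairs per anchor
    return result

points = [(0, 10), (1, 20), (2, 15), (6, 30)]
fps = make_fingerprints(points)

# the same points shifted by 100 frames give the same hashes, only t1 differs
shifted = make_fingerprints([(t + 100, f) for t, f in points])
print(len(fps), [h for h, _ in fps] == [h for h, _ in shifted])
```

Because the hash only contains the time difference t2 - t1, it does not depend on where in the song the pair occurs. That is what lets a short clip match a full track.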

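Search and matching

The clip to be identified is fingerprinted in the same way, and its fingerprints are compared against the fingerprints of every song in the database.

Suppose the clip to be identified has the following fingerprints,
each carrying its time information: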

hashcode1:ta
hashcode2:tb
hashcode3:tc
hashcode4:td

The song it should match has the following fingerprints:

hashcode1:t1
hashcode2:t2
hashcode3:t3
hashcode4:t4
....
hashcode100:t100

Each matching hash then yields a coordinate point made of two times:

(ta,t1)
(tb,t2)
(tc,t3)
(td,t4)

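Plotting these points in a coordinate system shows how all the matches are distributed,

where the x-axis is the time in the matched database song and
the y-axis is the time in the clip being identified.

Because the relative times of corresponding points are fixed, the points of a true match fall on a diagonal line, as shown in Figure 3A.
Figure: coordinate system
The matching problem therefore becomes one of finding a diagonal line in this coordinate system. The paper mentions several line-finding algorithms but uses none of them, and instead adopts a much more convenient method.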

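As noted above, the points of a true match share the same time difference. So compute
abs(ta-t1), abs(tb-t2), abs(tc-t3), abs(td-t4); for a true match these values are all equal. Building a histogram of the time differences of all matched fingerprints, as in Figure 3B, gives a peak value. Across all the songs compared, the song with the highest peak is the match.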
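The histogram idea can be sketched in a few lines. This is an illustrative version, assuming fingerprints are lists of (hash, time) pairs. The function name `best_offset` is hypothetical. It uses the signed difference (song time minus clip time) rather than the absolute value, which is an assumption made here so that positive and negative offsets are not merged.

```python
from collections import Counter

def best_offset(clip, song):
    """clip, song: lists of (hash, time). Returns (offset, count) of the histogram peak."""
    song_times = {}
    for h, t in song:
        song_times.setdefault(h, []).append(t)
    offsets = Counter()
    for h, t_clip in clip:
        for t_song in song_times.get(h, []):
            offsets[t_song - t_clip] += 1   # one vote per matching hash
    if not offsets:
        return None, 0
    return offsets.most_common(1)[0]

song = [("h1", 10), ("h2", 11), ("h3", 12), ("h4", 13), ("h1", 40)]
clip = [("h1", 0), ("h2", 1), ("h3", 2), ("h4", 3)]
other = [("h7", 3), ("h8", 9)]

print(best_offset(clip, song))    # peak of the offset histogram
print(best_offset(clip, other))   # no common hashes
```

A true match concentrates its votes in one offset bin, while accidental hash collisions (such as the second "h1" above) scatter across other bins.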

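A data structure designed for search

Every fingerprint is a hash code, so lookups can use a hash table, which speeds up the search considerably. In addition, the data structure in the paper stores the time and song information alongside each fingerprint, so a single lookup directly returns the information of the corresponding song.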
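A simple way to picture such a structure is an inverted index that maps each hash to the songs and times where it occurs. The sketch below is an assumption about how this could look in Python. The names `build_index` and `identify` are hypothetical, and the paper's actual storage layout is not reproduced here.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {song_id: [(hash, time), ...]} -> {hash: [(song_id, time), ...]}"""
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for h, t in fingerprints:
            index[h].append((song_id, t))
    return index

def identify(index, clip):
    """Vote per (song, offset) bin and return (song_id, offset, votes) of the best bin."""
    votes = Counter()
    for h, t_clip in clip:
        for song_id, t_song in index.get(h, []):
            votes[(song_id, t_song - t_clip)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count

songs = {
    "song_a": [("h1", 10), ("h2", 11), ("h3", 12)],
    "song_b": [("h1", 5), ("h9", 6)],
}
index = build_index(songs)
clip = [("h1", 0), ("h2", 1), ("h3", 2)]
print(identify(index, clip))
```

With this layout the clip is never compared against each song separately: every hash lookup returns the candidate songs and their times in one step.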

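Noise robustness

The algorithm has a degree of noise robustness. For clean, noise-free audio, the peak and zone ranges can be reduced a little; for noisy audio, the peak setting and the zone size need to be increased.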

Sample code

Below is rough Python source code written along the lines described above, for reference.

import os
import time
import numpy as np
import soundfile as sf
import librosa
import scipy.signal
import hashlib
import random as rd

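N_FFT = 8000   # 1000 ms per frame, assuming 8 kHz audio; the hop is N_FFT // 2 (500 ms)
ZONE_WIDTH = 5 # target zone width in frames (5 hops, about 2.5 s)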


def calculate_fingerprint(y):
    fingerprint = {}

    # keep bins 50..1550; at 8 kHz with n_fft=8000 each bin is 1 Hz wide, so about 50-1550 Hz
    amplitude_spectrum = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=N_FFT // 2)[50:1551])

    size = amplitude_spectrum.shape[1]
    peak_spectrum = []
    for i in range(size):
        amplitude_frame = amplitude_spectrum[:, i]
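        # shift the frame down by half its maximum so only bins above that level stay positive
        amplitude_frame = amplitude_frame - np.max(amplitude_frame) / 2
        # keep peaks that are neither too low (height) nor too close together (distance)
        peaks, _ = scipy.signal.find_peaks(amplitude_frame, height=10, distance=5)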
        peak_spectrum.append(peaks)

    # iterate over the anchor points one by one
    # with N peaks in total and about M peaks per target zone (M is an approximate average),
    # this generates roughly N*M fingerprints
    for i in range(size - ZONE_WIDTH):
        anchor_peaks = peak_spectrum[i]
        if len(anchor_peaks) == 0:
            continue
        for j in range(len(anchor_peaks)):
            f1 = anchor_peaks[j]
            for n in range(ZONE_WIDTH):
                zone_peaks = peak_spectrum[i + n + 1]
                if len(zone_peaks) == 0:
                    continue
                for k in range(len(zone_peaks)):
                    f2 = zone_peaks[k]
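                    # hash f1:f2:(t2 - t1); the name avoids shadowing the built-in hash()
                    key = hashlib.md5('{}:{}:{}'.format(f1, f2, n + 1).encode("utf-8")).hexdigest()
                    # store the frame index of the target point; a repeated hash overwrites the earlier entry
                    fingerprint[key] = i + n + 1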

    return fingerprint


def compare_fingerprint(src, dst, log):
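    '''
    Compare two fingerprints by building a histogram of time differences.
    :param src:  the source fingerprint (dict of hash -> frame index)
    :param dst:  the destination fingerprint (dict of hash -> frame index)
    :param log:  an open, writable file object that receives the per-difference counts
    :return:     tuple (time difference with the most matches, number of matches),
                 or (-1, -1) if no hash matched
    '''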
    delta = {}
    max_key=0
    max_value=0
    for src_kv in src.items():
        dst_val = dst.get(src_kv[0])
        if dst_val is not None:
            d = int(abs(src_kv[1] - dst_val))
            delta[d] = delta.get(d, 0) + 1
            if max_value < delta[d]:
                max_key = d
                max_value = delta[d]

    if len(delta.items()) == 0:
        return -1,-1

    log.write("============start==============\n")
    for kv in delta.items():
        log.write('{}:{}\n'.format(kv[0], kv[1]))
    log.write("the max is {}:{}\n".format(max_key, max_value))
    print("the max is {}:{}".format(max_key, max_value))

    return max_key, max_value


D = 48000  # clip length in samples (6 s at 8 kHz)
N = 100    # number of random clips tested per file

def rand_test_file(wav, database, log, negative=False):
    # negative is unused and kept only for compatibility with existing callers
    print("============start testing for {}================".format(wav))
    log.write("============start testing for {}================\n".format(wav))
    y, sr = sf.read(wav)
    suc_count = 0

    for i in range(N):
        # pick a random clip of D samples from the file
        start = rd.randint(1000, len(y) - D - 1)
        end = start + D
        log.write("File: {}, Test {}, testing duration: {}s, start: {}s, end: {}s\n".format(
            os.path.split(wav)[1], i, len(y) / sr, start / sr, end / sr))

        fingerprint = calculate_fingerprint(y[start:end])

        # find the database song whose time-difference histogram has the highest peak
        best_count = 0
        best_t = 0
        match = ''
        for song_name, song_fingerprint in database.items():
            t, m = compare_fingerprint(song_fingerprint, fingerprint, log)
            if best_count < m:
                best_count = m
                best_t = t
                match = song_name

        if os.path.split(wav)[1] == match:
            log.write('Success! {},{}\n'.format(best_t, best_count))
            suc_count += 1
        else:
            log.write('Failure\n')

    return suc_count, N


database = {}  # file name -> fingerprint; named to avoid shadowing the built-in dict
suc = 0
ttl = 0
time1 = time.time()
# fingerprint every reference song once and cache the result
for root, dirs, files in os.walk('./audio/Source', topdown=False):
    for name in files:
        n, x = os.path.splitext(name)
        if x == ".wav":
            y, _ = sf.read(os.path.join(root, name))
            database[name] = calculate_fingerprint(y)
        else:
            print('ignore this file: ', name)
time2 = time.time()
print('cache feature cost:', time2 - time1)

log = open('demo3_run.log', 'w')

# run the random-clip test on every test file
for root, dirs, files in os.walk('./audio/Test', topdown=False):
    for name in files:
        n, x = os.path.splitext(name)
        if x == ".wav":
            s, t = rand_test_file(os.path.join(root, name), database, log, negative=True)
            suc = s + suc
            ttl = t + ttl
        else:
            print('ignore this file: ', name)


time3 = time.time()
if ttl > 0:
    print("there are total {} testing, the passing rate: {}%".format(ttl, suc * 100 / ttl))
    log.write("there are total {} testing, the passing rate: {}%\n".format(ttl, suc * 100 / ttl))
    print('totally cost {}s, each case cost {}s '.format(time3 - time2, (time3 - time2) / ttl))
else:
    # avoid dividing by zero when no .wav test file was found
    print('no test files found')

log.close()