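This walkthrough assumes HTK is already installed and its environment variables are set. The dataset used is TIMIT.
Data Preparation
Building the Task Grammar
Extracting data from TIMIT
Extracting the pronunciation dictionary
The following Python program extracts the pronunciation dictionary from TIMIT and converts it into the format HTK uses.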
def timit_dict_to_htk_dict(timit_dict_file, htk_dict_file):
    # Skip comment lines (starting with ';') and blank lines, then split each
    # entry at the first run of whitespace into [word, '/pronunciation/'].
    with open(timit_dict_file, 'r') as inf:
        timit_dicts = [line.strip().split(None, 1) for line in inf
                       if line.strip() and not line.startswith(';')]
    source_dict = {}
    for line in timit_dicts:
        # Upper-case the word and drop everything from '~' onward.
        word, pronun = line[0].upper().split('~')[0].strip(), line[1]
        # Remove leading and trailing periods.
        word = word.strip('.')
        # Remove a leading apostrophe.
        if word.startswith("'"):
            word = word[1:]
        if word not in source_dict:
            source_dict[word] = []
        source_dict[word].append(pronun)
    # One output line per (word, pronunciation) pair, with the slashes
    # around the pronunciation removed, sorted by word.
    htk_dicts = [[word, pronun.replace('/', '').strip()]
                 for word in source_dict for pronun in source_dict[word]]
    htk_dicts = sorted(htk_dicts, key=lambda x: x[0])
    with open(htk_dict_file, 'w') as outf:
        outf.write('\n'.join([' '.join(line) for line in htk_dicts]) + '\n')

timit_path = '/home/wd/D/Research/dataset/timit'
source_dict_file = '%s/timit/doc/timitdic.txt' % timit_path
htk_dict_file = 'data/htkdic'
timit_dict_to_htk_dict(source_dict_file, htk_dict_file)
The program creates the file data/htkdic, whose contents look like this:
-KNACKS n ae1 k s
-UPMANSHIP ah1 p m ax n sh ih p
-UPS ah p s
-ZAGGED z ae1 g d
A ax
ABBREVIATE ax b r iy1 v iy ey2 t
ABDOMEN ae1 b d ax m ax n
ABIDES ax b ay1 d z
ABILITY ax b ih1 l ix t iy
ABLE ey1 b el
...
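To make the per-entry transformation easier to follow, here is a minimal sketch that converts a single dictionary line. The helper name convert_entry is hypothetical, and the input layout (word, whitespace, then the pronunciation between slashes) is an assumption inferred from the output shown above, not checked against the corpus file.

```python
def convert_entry(line):
    """Convert one dictionary line to an HTK-style line (hypothetical helper)."""
    # Assumed layout: word, whitespace, pronunciation wrapped in slashes.
    word, pronun = line.strip().split(None, 1)
    # Upper-case, drop everything from '~' onward, strip periods.
    word = word.upper().split('~')[0].strip('.')
    # Remove a leading apostrophe.
    if word.startswith("'"):
        word = word[1:]
    return word + ' ' + pronun.replace('/', '').strip()

# Illustrative input lines written in the assumed layout.
print(convert_entry("abbreviate  /ax b r iy1 v iy ey2 t/"))
print(convert_entry("-knacks  /n ae1 k s/"))
```

Under these assumptions the two calls print the ABBREVIATE and -KNACKS lines in the same form as the listing above.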
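Create a data folder in the program directory (it must already exist when the dictionary script above writes data/htkdic), then use the following Python script to extract a subset of TIMIT: the utterances whose words all appear in data/htkdic.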
import os

# Punctuation removed from every transcription.
Delete_Letter = ['"', ':', ',', '!', '?', '.', '~', ';']

def wav_trans(timit_path, data_set, htk_dict_file, prompts_file):
    # Vocabulary: the first field of every line in the HTK dictionary.
    with open(htk_dict_file, 'r') as inf:
        htk_dict_vocabulary = set(line.strip().split(' ')[0] for line in inf)
    wav_trans_list = []
    # Walk <timit_path>/timit/<data_set>/<item>/<sub_item>/<file>,
    # sorted so the output order is reproducible.
    path_1 = '%s/timit/%s' % (timit_path, data_set)
    for item in sorted(os.listdir(path_1)):
        path_2 = '%s/%s' % (path_1, item)
        for sub_item in sorted(os.listdir(path_2)):
            path_3 = '%s/%s' % (path_2, sub_item)
            for file_name in sorted(os.listdir(path_3)):
                file_path = '%s/%s' % (path_3, file_name)
                if file_path.endswith('.txt'):
                    # Drop the first two fields of the .txt line; the rest is the sentence.
                    with open(file_path, 'r') as inf:
                        trans = inf.readline().strip().split(' ')[2:]
                    # Upper-case each word and remove a leading apostrophe.
                    trans = [i[1:].upper() if i.startswith("'") else i.upper() for i in trans]
                    trans = ' '.join(trans)
                    for c in Delete_Letter:
                        trans = trans.replace(c, '')
                    trans = trans.split(' ')
                    # Keep the utterance only if every word is in the dictionary.
                    keep = True
                    for word in trans:
                        if word not in htk_dict_vocabulary:
                            keep = False
                    if keep:
                        trans = ' '.join(trans)
                        file_path = file_path[:-3] + 'wav'
                        wav_trans_list.append([file_path, trans])
    # One line per utterance: wav path, tab, transcription.
    with open(prompts_file, 'w') as outf:
        outf.write('\n'.join(['\t'.join(line) for line in wav_trans_list]))

data_set = 'train'
prompts_file = 'data/%sprompts' % data_set
wav_trans(timit_path, data_set, htk_dict_file, prompts_file)
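The filtering rule can be tried on a single transcription line without touching the corpus. The sketch below is only an illustration: the function name normalize_transcript, the toy vocabulary, and the two input lines are all made up for the example, and the line layout (two leading numeric fields, then the sentence) is assumed from the way the script drops the first two fields.

```python
DELETE_CHARS = ['"', ':', ',', '!', '?', '.', '~', ';']

def normalize_transcript(txt_line, vocabulary):
    """Return the cleaned, upper-cased sentence, or None if any word is
    missing from the vocabulary (hypothetical helper)."""
    # Drop the two leading fields, keep the words of the sentence.
    words = txt_line.strip().split(' ')[2:]
    # Upper-case each word and remove a leading apostrophe.
    words = [w[1:].upper() if w.startswith("'") else w.upper() for w in words]
    text = ' '.join(words)
    for c in DELETE_CHARS:
        text = text.replace(c, '')
    # Keep the sentence only if every word is in the vocabulary.
    if all(w in vocabulary for w in text.split(' ')):
        return text
    return None

# Toy vocabulary and made-up input lines, for illustration only.
vocab = {"SHE", "HAD", "YOUR", "DARK", "SUIT"}
print(normalize_transcript("0 20000 She had your dark suit.", vocab))
print(normalize_transcript("0 20000 She had your dark coat.", vocab))
```

The first call prints SHE HAD YOUR DARK SUIT; the second prints None because COAT is not in the toy vocabulary, so that utterance would be left out of the prompts file.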
After the extraction script finishes, data/ contains a file named trainprompts with the following contents:
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/sa1.wav SHE HAD YOUR DARK SUIT IN GREASY WASH WATER ALL YEAR
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/sa2.wav DON'T ASK ME TO CARRY AN OILY RAG LIKE THAT
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/si1027.wav EVEN THEN IF SHE TOOK ONE STEP FORWARD HE COULD CATCH HER
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/si1657.wav OR BORROW SOME MONEY FROM SOMEONE AND GO HOME BY BUS
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/si648.wav A SAILBOAT MAY HAVE A BONE IN HER TEETH ONE MINUTE AND LIE BECALMED THE NEXT
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/sx127.wav THE EMPEROR HAD A MEAN TEMPER
/home/wd/D/Research/dataset/timit/timit/train/dr1/fcjf0/sx217.wav HOW PERMANENT ARE THEIR RECORDS
/home/wd/D/Research/dataset/timit/ti