Speech Feature Parameter Analysis Platform (Part 2): Speech Signal Acquisition - the WAVE File Format
Chapter 1: Speech Signal Acquisition
I use Cool Edit Pro to record the speech signal and save it as a wav file. There is not much to say about the recording process itself; the software is easy to find via Google. Cool Edit Pro is very professional, and it makes capturing and processing speech about as convenient as editing text.
One point is worth mentioning. The human voice occupies roughly 0-4 kHz, and by the sampling theorem the sampling rate must be at least twice the highest frequency, i.e. at least 8 kHz; in Cool Edit Pro the sensible choices are therefore 8 kHz, 11.025 kHz, and 22.050 kHz. I use 11.025 kHz, because speech sampled at 11.025 kHz yields more distinct features during parameter extraction than speech sampled at 8 kHz. However, one should not blindly choose an even higher rate: that only increases the space and time complexity of the system.
I. WAVE File Processing
For a wav file, the raw audio data and the format information must be extracted before any further analysis can be done. By designing a CWaveFile class, I implemented extraction of the raw audio data and the format information from wav files of any format and any sampling rate. (A sketch of the chunk-scanning core of such a loader appears after the format overview below.)
1. The WAVE file format. The following description is reproduced from Microsoft's definition of the WAVE format, as background material.
WAVE File Format
WAVE File Format is a file format for storing digital audio (waveform) data. It supports a variety of bit resolutions, sample rates, and channels of audio. This format is very popular on IBM PC (clone) platforms, and is widely used in professional programs that process digital audio waveforms. It takes into account some peculiarities of the Intel CPU such as little endian byte order.
This format uses Microsoft's version of the Electronic Arts Interchange File Format method for storing data in "chunks".
Data Types
A C-like language will be used to describe the data structures in the file. A few extra data types that are not part of standard C, but which will be used in this document, are:
pstring   Pascal-style string, a one-byte count followed by that many text bytes. The total number of bytes in this data type should be even. A pad byte can be added to the end of the text to accomplish this. This pad byte is not reflected in the count.

ID        A chunk ID (ie, 4 ASCII bytes).
Also note that when you see an array with no size specification (e.g., char ckData[];), this indicates a variable-sized array in our C-like language. This differs from standard C arrays.
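As a concrete illustration of the pstring rule, a reader has to consume the pad byte whenever the count byte plus the text bytes add up to an odd total. A minimal sketch (the function name and stream-based signature are my own, not part of the spec):

#include <fstream>
#include <string>

// Read a Pascal-style string: one count byte, 'count' text bytes, plus
// a pad byte (not included in the count) if the total would be odd.
std::string ReadPString(std::ifstream& in)
{
    char count = 0;
    in.get(count);
    std::string text(static_cast<unsigned char>(count), '\0');
    in.read(&text[0], static_cast<std::streamsize>(text.size()));
    if ((1 + text.size()) % 2 != 0)   // count byte + text bytes must be even
        in.ignore(1);                 // consume the pad byte
    return text;
}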
Constants
Decimal values are referred to as a string of digits, for example 123, 0, 100 are all decimal numbers. Hexadecimal values are preceded by a 0x - e.g., 0x0A, 0x1, 0x64.
Data Organization
All data is stored in 8-bit bytes, arranged in Intel 80x86 (ie, little endian) format. The bytes of multiple-byte values are stored with the low-order (ie, least significant) bytes first. Data bits are as follows (ie, shown with bit numbers on top):
            7  6  5  4  3  2  1  0
          +-----------------------+
    char: | lsb               msb |
          +-----------------------+

            7  6  5  4  3  2  1  0  15 14 13 12 11 10  9  8
          +-----------------------+-----------------------+
   short: | lsb     byte 0        | byte 1            msb |
          +-----------------------+-----------------------+

            7  6  5  4  3  2  1  0  15 14 13 12 11 10  9  8  23 22 21 20 19 18 17 16  31 30 29 28 27 26 25 24
          +-----------------------+-----------------------+-----------------------+-----------------------+
    long: | lsb     byte 0        |        byte 1         |        byte 2         | byte 3            msb |
          +-----------------------+-----------------------+-----------------------+-----------------------+
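In code, this means multi-byte values must be assembled low byte first when reading a WAVE file. A small sketch that works regardless of the host CPU's native byte order (the helper names ReadLE16/ReadLE32 are my own):

#include <cstdint>

// Assemble little-endian values from raw bytes: the first byte holds
// the least significant 8 bits, the last byte the most significant.
uint16_t ReadLE16(const uint8_t* p)
{
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

uint32_t ReadLE32(const uint8_t* p)
{
    return static_cast<uint32_t>(p[0])         |
           (static_cast<uint32_t>(p[1]) << 8)  |
           (static_cast<uint32_t>(p[2]) << 16) |
           (static_cast<uint32_t>(p[3]) << 24);
}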
File Structure
A WAVE file is a collection of a number of different types of chunks. There is a required Format ("fmt ") chunk which contains important parameters describing the waveform, such as its sample rate. The Data chunk, which contains the actual waveform data, is also required. All other chunks are optional. Among the other optional chunks are ones which define cue points, list instrument parameters, store application-specific information, etc. All of these chunks are described in detail in the following sections of this document.
All applications that use WAVE must be able to read the 2 required chunks and can choose to selectively ignore the optional chunks. A program that copies a WAVE should copy all of the chunks in the WAVE, even those it chooses not to interpret.
There are no restrictions upon the order of the chunks within a WAVE file, with the exception that the Format chunk must precede the Data chunk. Some inflexibly written programs expect the Format chunk as the first chunk (after the RIFF header) although they shouldn't because the specification doesn't require this.
Here is a graphical overview of an example of a minimal WAVE file. It consists of a single WAVE containing the 2 required chunks, a Format and a Data Chunk.
     __________________________
    | RIFF WAVE Chunk          |
    |   groupID  = 'RIFF'      |
    |   riffType = 'WAVE'      |
    |    __________________    |
    |   | Format Chunk     |   |
    |   |   ckID = 'fmt '  |   |
    |   |__________________|   |
    |    __________________    |
    |   | Sound Data Chunk |   |
    |   |   ckID = 'data'  |   |
    |   |__________________|   |
    |__________________________|
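Tying this back to the CWaveFile class mentioned at the start: the core of such a loader can be a simple chunk-scanning loop over this structure. The sketch below is my own illustration (not the project's actual code); it reuses the ReadLE16/ReadLE32 helpers from the earlier sketch, accepts the 'fmt ' and 'data' chunks in either order, skips all optional chunks, and honors the RIFF rule that chunks are padded to an even length:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

// Minimal WAVE loader sketch: pulls the key 'fmt ' fields and the raw
// 'data' bytes out of the file, ignoring all optional chunks.
// Uses ReadLE16/ReadLE32 from the earlier sketch.
bool LoadWave(const char* path,
              uint16_t& channels, uint32_t& sampleRate,
              uint16_t& bitsPerSample, std::vector<uint8_t>& samples)
{
    std::ifstream in(path, std::ios::binary);
    char hdr[12];
    if (!in.read(hdr, 12) ||
        std::memcmp(hdr, "RIFF", 4) != 0 || std::memcmp(hdr + 8, "WAVE", 4) != 0)
        return false;                                   // not a RIFF WAVE file

    char ck[8];
    while (in.read(ck, 8)) {                            // chunk ID + chunk size
        uint32_t size = ReadLE32(reinterpret_cast<uint8_t*>(ck + 4));
        if (std::memcmp(ck, "fmt ", 4) == 0) {
            uint8_t fmt[16];                            // PCM format fields
            if (size < 16 || !in.read(reinterpret_cast<char*>(fmt), 16))
                return false;
            channels      = ReadLE16(fmt + 2);          // nChannels
            sampleRate    = ReadLE32(fmt + 4);          // nSamplesPerSec
            bitsPerSample = ReadLE16(fmt + 14);         // wBitsPerSample
            in.seekg(size - 16 + (size & 1), std::ios::cur);
        } else if (std::memcmp(ck, "data", 4) == 0) {
            samples.resize(size);                       // raw audio bytes
            if (!in.read(reinterpret_cast<char*>(samples.data()), size))
                return false;
            in.seekg(size & 1, std::ios::cur);          // skip pad byte if odd
        } else {
            in.seekg(size + (size & 1), std::ios::cur); // skip optional chunk
        }
    }
    return !samples.empty();
}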
A Bastardized Standard
The WAVE format is sort of a bastardized standard that was concocted by too many "cooks" who didn't properly coordinate the addition of "ingredients" to the "soup". Unlike the AIFF standard, which was mostly designed by a small, coordinated group, the WAVE format has had all manner of much-too-independent, uncoordinated aberrations inflicted upon it. The net result is that there are far too many chunks that may be found in a WAVE file -- many of them duplicating the same information found in other chunks (but in an unnecessarily different way) simply because there have been too many programmers who took too many liberties with unilaterally adding their own additions to the WAVE format without properly coming to a consensus of what everyone else needed (and therefore it encouraged an "every man for himself" attitude toward adding things to this "standard"). One example is the Instrument chunk versus the Sampler chunk. Another example is the Note versus Label chunks in an Associated Data List. I don't even want to get into the totally irresponsible proliferation of compressed formats. (ie, It seems like everyone and his pet Dachshund has come up with some compressed version of storing wave data -- like we need 100 different ways to do that). Furthermore, there are lots of inconsistencies, for example how 8-bit data is unsigned, but 16-bit data is signed.
I've attempted to document only those aspects that you're very likely to encounter in a WAVE file. I suggest that you concentrate upon these and refuse to support the work of programmers who feel the need to deviate from a standard with inconsistent, proprietary, self-serving, unnecessary extensions. Please do your part to rein in half-assed programming.
Sample Points and Sample Frames
A large part of interpreting WAVE files revolves around the two concepts of sample points and sample frames.
A sample point is a value representing a sample of a sound at a given moment in time. For waveforms with greater than 8-bit resolution, each sample point is stored as a linear, 2's-complement value which may be from 9 to 32 bits wide (as determined by the wBitsPerSample field in the Format Chunk, assuming PCM format -- an uncompressed format). For example, each sample point of a 16-bit waveform would be a 16-bit word (ie, two 8-bit bytes) where 32767 (0x7FFF) is the highest value and -32768 (0x8000) is the lowest value. For 8-bit (or less) waveforms, each sample point is a linear, unsigned byte where 255 is the highest value and 0 is the lowest value. Obviously, this signed/unsigned sample point discrepancy between 8-bit and larger resolution waveforms was one of those "oops" scenarios where some Microsoft employee decided to change the sign sometime after 8-bit wave files were common but 16-bit wave files hadn't yet appeared.
Because most CPUs' read and write operations deal with 8-bit bytes, it was decided that a sample point should be rounded up to a size which is a multiple of 8 when stored in a WAVE. This makes the WAVE easier to read into memory. If your ADC produces a sample point from 1 to 8 bits wide, a sample point should be stored in a WAVE as an 8-bit byte (ie, unsigned char). If your ADC produces a sample point from 9 to 16 bits wide, a sample point should be stored in a WAVE as a 16-bit word (ie, signed short). If your ADC produces a sample point from 17 to 24 bits wide, a sample point should be stored in a WAVE as three bytes. If your ADC produces a sample point from 25 to 32 bits wide, a sample point should be stored in a WAVE as a 32-bit doubleword (ie, signed long). Etc.
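Both this rounding rule and the 8-bit-unsigned versus 16-bit-signed discrepancy noted earlier reduce to one-liners. A sketch (assuming PCM data; the function names are my own):

#include <cstdint>

// Round a sample point's bit width up to whole bytes:
// 1-8 bits -> 1 byte, 9-16 -> 2, 17-24 -> 3, 25-32 -> 4.
unsigned BytesPerSamplePoint(unsigned bitsPerSample)
{
    return (bitsPerSample + 7) / 8;
}

// Bridge the sign discrepancy: map an unsigned 8-bit sample
// (0..255, silence at 128) onto the signed 16-bit range.
int16_t Unsigned8ToSigned16(uint8_t s)
{
    return static_cast<int16_t>((static_cast<int>(s) - 128) * 256);
}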
Furthermore, the data bits should be left-justified, with any remaining (ie, pad) bits zeroed. For example, consider the case of a 12-bit sample point. It has 12 bits, so the sample point must be saved as a 16-bit word. Those 12 bits should be left-justified so that they become bits 4 to 15 inclusive, and bits 0 to 3 should be set to zero. Shown below is how a 12-bit sample point with a value of binary 101000010111 is formatted left-justified as a 16-bit word.
      ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
     |                                               |               |
     | 1   0   1   0   0   0   0   1   0   1   1   1 | 0   0   0   0 |
     |___|___|___|___|___|___|___|___|___|___|___|___|___|___|___|___|
     <-----------------------------------------------><-------------->
        12 bit sample point is left justified          rightmost 4 bits
                                                       are zero padded
But note that, because the WAVE format uses Intel little endian byte order, the LSB is stored first in the wave file as so:
      ___ ___ ___ ___ ___ ___ ___ ___       ___ ___ ___ ___ ___ ___ ___ ___
     |               |               |     |                               |
     | 0   1   1   1 | 0   0   0   0 |     | 1   0   1   0   0   0   0   1 |
     |___|___|___|___|___|___|___|___|     |___|___|___|___|___|___|___|___|
     <---------------><--------------->    <------------------------------->
       bits 0 to 3       4 pad bits                  bits 4 to 11
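Putting the two rules together (left-justify, then store the low byte first), writing the 12-bit example value comes out as follows; with the sample equal to binary 101000010111 (0xA17), this stores 0x70 and then 0xA1, matching the two diagrams above. A sketch:

#include <cstdint>

// Store a 12-bit sample point left-justified in a 16-bit word, low
// byte first. The 4 pad bits (bits 0-3 of the word) are zero.
void Store12BitSample(uint16_t sample12, uint8_t out[2])
{
    uint16_t word = static_cast<uint16_t>(sample12 << 4);  // occupy bits 4-15
    out[0] = static_cast<uint8_t>(word & 0xFF);  // 0x70 for the example value
    out[1] = static_cast<uint8_t>(word >> 8);    // 0xA1 for the example value
}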
For multichannel sounds (for example, a stereo waveform), single sample points from each channel are interleaved. For example, assume a stereo (ie, 2 channel) waveform. Instead of storing all of the sample points for the left channel first, and then storing all of the sample points for the right channel next, you "mix" the two channels' sample points together. You would store the first sample point of the left channel. Next, you would store the first sample point of the right channel. Next, you would store the second sample point of the left channel. Next, you would store the second sample point of the right channel, and so on, alternating between storing the next sample point of each channel. This is what is meant by interleaved data; you store the next sample point of each of the channels in turn, so that the sample points that are meant to be "played" (ie, sent to a DAC) simultaneously are stored contiguously.
The sample points that are meant to be "played" (ie, sent to a DAC) simultaneously are collectively called a sample frame. In the example of our stereo waveform, every two sample points makes up another sample frame. This is illustrated below for that stereo example.
       sample       sample             sample
       frame 0      frame 1            frame N
      _____ _____  _____ _____        _____ _____
     | ch1 | ch2 || ch1 | ch2 | . . .| ch1 | ch2 |
     |_____|_____||_____|_____|      |_____|_____|

      _____
     |     |  = one sample point
     |_____|
For a monophonic waveform, a sample frame is merely a single sample point (ie, there's nothing to interleave). For multichannel waveforms, you should follow the conventions shown below for which order to store channels within the sample frame. (ie, Below, a single sample frame is displayed for each example of a multichannel waveform).
      channels       1         2
                 _________ _________
                |  left   |  right  |
      stereo    |         |         |
                |_________|_________|
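For feature extraction it is usually convenient to undo this interleaving and work on each channel separately. A sketch for the 16-bit stereo case (assuming the 'data' bytes have already been reinterpreted as signed shorts):

#include <cstddef>
#include <cstdint>
#include <vector>

// Split interleaved stereo data into separate channel buffers. Each
// sample frame holds two sample points: left first, then right.
void DeinterleaveStereo(const std::vector<int16_t>& frames,
                        std::vector<int16_t>& left,
                        std::vector<int16_t>& right)
{
    std::size_t n = frames.size() / 2;    // number of sample frames
    left.resize(n);
    right.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        left[i]  = frames[2 * i];         // channel 1: left
        right[i] = frames[2 * i + 1];     // channel 2: right
    }
}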