Speech Recognition vs. Voice Recognition | 语音识别工作原理 | 模型训练 | 应用

注:机翻,未校

Speech Recognition 与 Voice Recognition

剑桥词典
 
speech recognition,语音识别
voice recognition,声音识别


Speech Recognition vs. Voice Recognition: In-Depth Comparison 语音识别与声音识别:深度比较

12 July 2023

Have you ever stopped to think about how your voice magically turns into written words or how your smartphone recognizes your unique vocal identity? It’s mind-boggling, right?
你有没有停下来想过你的声音是如何神奇地变成书面文字的,或者你的智能手机是如何识别你独特的声音身份的?这真是令人难以置信,对吧?

Imagine this: you’re sitting in a room, jotting down notes for an important presentation. Instead of tediously typing every word, wouldn’t it be incredible if you could simply speak your thoughts and watch as they appear on the screen before your eyes? That’s the power of speech recognition! It’s like having your own personal stenographer, effortlessly transcribing your spoken words into written text.
想象一下:你坐在一个房间里,为一个重要的演示文稿记笔记。与其乏味地输入每个单词,不如简单地说出您的想法并看着它们出现在您眼前的屏幕上,这不是很不可思议吗?这就是语音识别的力量!这就像拥有自己的私人速记员,毫不费力地将您的口语转录成书面文本。

But hold on, that’s not all! Have you ever seen those spy movies where a secret agent’s voice unlocks a high-tech vault? Well, that’s voice recognition in action! It’s like having a superpower that allows you to open doors, access your digital devices, and even perform secure transactions, all with the sound of your voice.
但是等一下,这还不是全部!你有没有看过那些秘密特工的声音打开高科技金库的间谍电影?嗯,这就是声音识别的实际应用!这就像拥有一种超能力,可以让您打开门、访问您的数字设备,甚至执行安全交易,所有这些都可以通过您的声音来实现。

Now, you might be wondering, what’s the difference between speech recognition and voice recognition? Aren’t they the same thing? Ah, my curious friend, not quite! While these terms are often used interchangeably, they actually refer to distinct technologies with their own unique abilities.
现在,您可能想知道,语音识别和声音识别有什么区别?它们不是一回事吗?啊,我好奇的朋友,不完全是!虽然这些术语经常互换使用,但它们实际上指的是具有自己独特能力的不同技术。

In this captivating article, we’ll unravel the secrets behind speech recognition and voice recognition, exploring their real-life applications, benefits, and most importantly, the intriguing differences between them.
在这篇引人入胜的文章中,我们将揭开语音识别和声音识别背后的秘密,探索它们在现实生活中的应用、好处,最重要的是,它们之间的有趣差异。

Understanding Speech Recognition 了解语音识别

 Automatic Speech Recognition

Speech recognition, also known as automatic speech recognition (ASR), is a technological marvel that enables computers to convert spoken language into written text. It involves the process of analyzing audio input, extracting the spoken words, and transforming them into written form. Speech recognition systems utilize sophisticated algorithms and language models to achieve accurate transcription.
语音识别,也称为自动语音识别 (ASR),是一项技术奇迹,它使计算机能够将口语转换为书面文本。它涉及分析音频输入、提取口语单词并将其转换为书面形式的过程。语音识别系统利用复杂的算法和语言模型来实现准确的转录。

How does Speech Recognition Work? 语音识别的工作原理是什么?

The workings of speech recognition are quite fascinating. Let’s take a closer look at the underlying process:
语音识别的工作原理非常有趣。让我们仔细看看底层过程:

Audio Input: The speech recognition system receives audio input, typically through a microphone or other audio devices.
音频输入:语音识别系统通常通过麦克风或其他音频设备接收音频输入。

Pre-processing: The audio input undergoes pre-processing to eliminate background noise, enhance clarity, and normalize the audio signal.
预处理:音频输入经过预处理,以消除背景噪音、提高清晰度并使音频信号标准化。

Acoustic Modeling: The system employs acoustic modeling techniques to analyze and interpret the audio input. This involves breaking down the speech into smaller units known as phonemes and mapping them to corresponding linguistic representations.
声学建模:该系统采用声学建模技术来分析和解释音频输入。这涉及将语音分解为更小的单元,称为音素,并将它们映射到相应的语言表示。

Language Modeling: Language models play a crucial role in speech recognition by utilizing statistical patterns and grammar rules to predict and correct potential errors in transcription. They enhance the accuracy and contextuality of the converted text.
语言建模:语言模型通过利用统计模式和语法规则来预测和纠正转录中的潜在错误,在语音识别中发挥着至关重要的作用。它们提高了转换文本的准确性和上下文性。

Decoding: Using a process called decoding, the system matches the audio input against its extensive database of acoustic and language models to determine the most likely transcription.
解码:使用称为解码的过程,系统将音频输入与其广泛的声学和语言模型数据库进行匹配,以确定最可能的转录。

Text Output: Finally, the speech recognition system generates the written output, providing an accurate representation of the spoken words.
文本输出:最后,语音识别系统生成书面输出,准确表示口语。
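The steps above can be sketched as a toy pipeline in Python. Everything below — the sample values, the "acoustic model" scores, and the "language model" probabilities — is an invented stand-in for illustration, not a real ASR system:

```python
# Toy sketch of the ASR pipeline: pre-process -> acoustic scoring -> LM rescoring -> decode.

def preprocess(samples):
    """Normalize the signal to a peak amplitude of 1.0 (a stand-in for real pre-processing)."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]

# Pretend acoustic model: maps each audio frame to candidate words with scores.
ACOUSTIC_SCORES = {
    0: {"recognize": 0.6, "wreck a nice": 0.4},
    1: {"speech": 0.5, "beach": 0.5},
}

# Pretend language model: probability of the second word given the first.
LM = {
    ("recognize", "speech"): 0.9, ("recognize", "beach"): 0.1,
    ("wreck a nice", "speech"): 0.2, ("wreck a nice", "beach"): 0.8,
}

def decode():
    """Pick the word sequence maximizing acoustic score * language-model score."""
    best, best_score = None, -1.0
    for w0, s0 in ACOUSTIC_SCORES[0].items():
        for w1, s1 in ACOUSTIC_SCORES[1].items():
            score = s0 * s1 * LM[(w0, w1)]
            if score > best_score:
                best, best_score = [w0, w1], score
    return " ".join(best)

audio = preprocess([0.1, 0.5, -0.25])
print(decode())  # the language model favors "recognize speech" over "wreck a nice beach"
```

Note how the acoustic scores alone cannot separate "speech" from "beach"; the language model supplies the context that breaks the tie, which is exactly the role described in the Language Modeling step.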

How Is the Speech Recognition Model Trained? 语音识别模型是如何训练的?

There are different types of speech datasets used to train a speech recognition model, which typically consist of paired audio and text samples. This means that for each audio segment, there is a corresponding transcription of the spoken words. The dataset needs to be diverse and representative of real-world speech patterns, encompassing different speakers, accents, languages, and recording conditions. Here’s an overview of the training process for a speech recognition model using such a dataset:
有不同类型的语音数据集用于训练语音识别模型,这些数据集通常由成对的音频和文本样本组成。这意味着对于每个音频片段,都有相应的口语转录。数据集需要多样化并代表现实世界的语音模式,包括不同的说话人、口音、语言和录制条件。以下是使用此类数据集的语音识别模型的训练过程概述:

Data Collection: Large amounts of audio data are collected from various sources, such as recorded speeches, interviews, lectures, custom collections, or publicly available datasets. The dataset should cover a wide range of topics and speakers to ensure generalization.
数据收集:从各种来源收集大量音频数据,例如录制的演讲、访谈、讲座、自定义集合或公开可用的数据集。数据集应涵盖广泛的主题和演讲者,以确保泛化。

Data Preprocessing: The collected audio data undergoes preprocessing steps to enhance its quality and normalize the audio signals. This may involve removing background noise, equalizing volume levels, and applying filters to improve clarity.
数据预处理:对采集的音频数据进行预处理,以提高音频质量并对音频信号进行归一化处理。这可能涉及消除背景噪音、均衡音量级别和应用滤波器以提高清晰度。

Transcription: Trained transcribers listen to the audio samples and manually transcribe the spoken words into written text. The transcriptions are carefully aligned with the corresponding audio segments to create the paired audio-text dataset.

Speech recognition models typically require large amounts of training data. For instance, OpenAI’s Whisper ASR system was trained on 680,000 hours of multilingual and multitask supervised data, making it one of the largest speech datasets ever created.
语音识别模型通常需要大量的训练数据。例如,OpenAI 的 Whisper ASR 系统在 680,000 小时的多语言和多任务监督数据上进行了训练,使其成为有史以来最大的语音数据集之一。

Dataset Split: The dataset is typically divided into three subsets: training, validation, and testing. The training subset, which is the largest, is used to train the model. The validation subset is used during training to monitor the model’s performance and adjust hyperparameters. The testing subset is used to evaluate the final model’s performance.
数据集拆分:数据集通常分为三个子集:训练、验证和测试。最大的训练子集用于训练模型。验证子集在训练期间用于监控模型的性能并调整超参数。测试子集用于评估最终模型的性能。
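As a rough sketch, such a split might look like the following in Python. The 80/10/10 ratio is a common convention, not a requirement, and the file names are invented placeholders:

```python
import random

def split_dataset(pairs, train=0.8, val=0.1, seed=42):
    """Shuffle (audio, transcript) pairs and split into train/validation/test subsets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)       # seeded shuffle for reproducibility
    n = len(pairs)
    n_train = int(n * train)
    n_val = int(n * val)
    return (pairs[:n_train],                 # used to fit model parameters
            pairs[n_train:n_train + n_val],  # used to tune hyperparameters
            pairs[n_train + n_val:])         # held out for final evaluation

data = [(f"clip_{i}.wav", f"transcript {i}") for i in range(100)]
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before splitting matters: without it, a dataset ordered by speaker or topic would leak systematic differences between the subsets.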

Feature Extraction: From the audio samples, various acoustic features are extracted. These features capture important characteristics of the audio, such as frequency content, duration, and intensity. Common features include Mel-frequency cepstral coefficients (MFCCs), spectrograms, and pitch information.
特征提取:从音频样本中提取各种声学特征。这些特征捕获音频的重要特性,例如频率内容、持续时间和强度。常见特征包括 Mel 频率倒谱系数 (MFCC)、频谱图和音高信息。
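Real systems compute MFCCs or spectrograms, as noted above; the toy below computes two much simpler frame-level features (energy and zero-crossing count) just to illustrate the framing idea. The signal values are invented:

```python
def frame_features(samples, frame_len=4):
    """Split a signal into fixed-length frames and compute two simple features
    per frame: energy (sum of squares) and zero-crossing count."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        feats.append((round(energy, 4), zero_crossings))
    return feats

signal = [0.1, -0.1, 0.2, -0.2, 0.0, 0.0, 0.0, 0.0]
print(frame_features(signal))  # a loud, oscillating frame vs. a silent frame
```

Even these two crude numbers separate voiced sound from silence; MFCCs apply the same frame-by-frame logic with a far richer description of each frame's frequency content.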

Language Modeling: Language models are trained on large textual datasets to learn statistical patterns, grammar rules, and linguistic contexts. These models provide additional contextual information during the training of the speech recognition model, improving its accuracy and contextuality.
语言建模:语言模型在大型文本数据集上进行训练,以学习统计模式、语法规则和语言上下文。这些模型在语音识别模型的训练过程中提供额外的上下文信息,从而提高其准确性和上下文性。

Training the Model: The speech recognition model is trained using the paired audio-text dataset and the extracted acoustic features. The model learns to associate the acoustic patterns with the corresponding textual representations. This involves using algorithms such as deep neural networks, recurrent neural networks (RNNs), or transformer-based models, which are trained using gradient-based optimization techniques.
训练模型:语音识别模型使用配对的音频文本数据集和提取的声学特征进行训练。该模型学习将声学模式与相应的文本表示相关联。这涉及使用深度神经网络、递归神经网络 (RNN) 或基于 transformer 的模型等算法,这些算法使用基于梯度的优化技术进行训练。

Iterative Training: The model is trained iteratively, where batches of data are fed to the model, and the model’s parameters are adjusted based on the prediction errors. The training process aims to minimize the difference between the predicted transcriptions and the ground truth transcriptions in the dataset.
迭代训练:对模型进行迭代训练,其中将批量数据馈送到模型,并根据预测误差调整模型的参数。训练过程旨在最小化数据集中预测转录与真实 (ground truth) 转录之间的差异。
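A real ASR model adjusts millions of parameters, but the iterative, gradient-based loop can be shown with a one-parameter toy: fitting y = w·x by repeatedly stepping against the gradient of the mean squared error. The data and learning rate are illustrative:

```python
# Toy illustration of iterative, gradient-based training: fit y = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pretend (input, target) pairs; true w = 2
w = 0.0
learning_rate = 0.05  # a hyperparameter of the kind tuned during training

for epoch in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # step against the gradient to reduce prediction error

print(round(w, 3))  # converges toward 2.0
```

Each pass mirrors the description above: feed data, measure the prediction error, and nudge the parameters to shrink it.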

Hyperparameter Tuning: During training, hyperparameters (parameters that control the learning process) are adjusted to optimize the model’s performance. This includes parameters related to network architecture, learning rate, regularization techniques, and optimization algorithms.
超参数优化:在训练期间,会调整超参数(控制学习过程的参数)以优化模型的性能。这包括与网络架构、学习率、正则化技术和优化算法相关的参数。

Validation and Testing: Throughout the training process, the model’s performance is evaluated on the validation subset to monitor its progress and prevent overfitting. Once training is complete, the final model is evaluated on the testing subset to assess its accuracy, word error rate, and other relevant metrics.
验证和测试:在整个训练过程中,在验证子集上评估模型的性能,以监控其进度并防止过度拟合。训练完成后,将在测试子集上评估最终模型,以评估其准确性、单词错误率和其他相关指标。
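The word error rate mentioned above is the standard ASR metric: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal implementation using the classic Levenshtein dynamic program:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 1.0 for a very verbose hypothesis, which is why it is reported alongside other metrics rather than as a simple accuracy.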

Fine-tuning and Optimization: After the initial training, the model can undergo further fine-tuning and optimization to improve its performance. This may involve incorporating additional training data, adjusting model architecture, or using advanced optimization techniques.
微调和优化:在初始训练之后,模型可以进行进一步的微调和优化,以提高其性能。这可能涉及合并额外的训练数据、调整模型架构或使用高级优化技术。

By training on a diverse and extensive dataset of paired audio and text samples, speech recognition models can learn to accurately transcribe spoken words, enabling applications such as transcription services, virtual assistants, and more. The training process involves leveraging the power of machine learning algorithms and optimizing model parameters to achieve high accuracy and robustness in recognizing and transcribing speech.
通过在配对音频和文本样本的多样化和广泛的数据集上进行训练,语音识别模型可以学习准确转录口语,从而支持转录服务、虚拟助手等应用程序。训练过程包括利用机器学习算法的强大功能和优化模型参数,以实现识别和转录语音的高精度和稳健性。

Applications of Speech Recognition 语音识别的应用

Speech recognition technology has revolutionized numerous industries, transforming the way we interact with devices and systems. Here are some prominent applications:
语音识别技术已经彻底改变了许多行业,改变了我们与设备和系统的交互方式。以下是一些突出的应用:

Transcription Services
Speech recognition has streamlined the transcription process, making it faster and more efficient. It has become an invaluable tool for medical, legal, and business professionals, saving hours of manual effort.
转录服务 语音识别简化了转录过程,使其更快、更高效。它已成为医疗、法律和商业专业人士的宝贵工具,可节省数小时的手动工作。

Voice Assistants
Virtual assistants like Apple’s Siri, Amazon’s Alexa, and Google Assistant employ speech recognition to understand and respond to user commands. They can perform tasks, answer queries, and control various devices using voice commands.
语音助手 Apple 的 Siri、Amazon 的 Alexa 和 Google Assistant 等虚拟助手使用语音识别来理解和响应用户命令。他们可以使用语音命令执行任务、回答查询和控制各种设备。

Accessibility
Speech recognition has significantly improved accessibility for individuals with disabilities. It allows people with motor impairments or visual impairments to interact with computers, smartphones, and other devices using their voices.
可及性 语音识别显著改善了残障人士的辅助功能。它允许有运动障碍或视力障碍的人使用他们的声音与计算机、智能手机和其他设备进行交互。

Call Centers
Many call centers leverage speech recognition technology to enhance customer service. It enables automated call routing, voice authentication, and real-time speech-to-text conversion for call transcripts.
呼叫中心 许多呼叫中心利用语音识别技术来增强客户服务。它支持自动呼叫路由、语音身份验证和通话记录的实时语音到文本转换。

Dictation Software
Speech recognition has made dictation effortless and accurate. Professionals in various fields, such as writers, journalists, and students, benefit from dictation software that converts spoken words into written text.
听写软件 语音识别使听写变得轻松而准确。各个领域的专业人士,例如作家、记者和学生,都受益于将口语转换为书面文本的听写软件。

Benefits of Speech Recognition 语音识别的优势

Speech recognition offers several advantages that make it a powerful technology:
语音识别具有多项优势,使其成为一项强大的技术:

Increased Productivity
Speech recognition enables faster and more efficient data entry, transcription, and command execution, enhancing productivity in various domains.
提高生产力 语音识别支持更快、更高效的数据输入、转录和命令执行,从而提高各个领域的生产力。

Accessibility and Inclusivity
By allowing individuals with disabilities to interact with devices using their voices, speech recognition promotes inclusivity and equal access to technology.
可访问性和包容性 语音识别允许残障人士使用他们的声音与设备交互,从而促进了包容性和平等使用技术。

Hands-Free Operation
With speech recognition, users can perform tasks without the need for manual input, making it ideal for situations where hands-free operation is necessary or convenient.
免提操作 通过语音识别,用户无需手动输入即可执行任务,非常适合需要或方便免提操作的情况。

Multilingual Support
Advanced speech recognition systems can recognize and transcribe multiple languages, facilitating communication in diverse linguistic contexts.
多语言支持 先进的语音识别系统可以识别和转录多种语言,从而促进不同语言环境中的交流。

Understanding Voice Recognition 了解声音识别

Voice recognition, also known as speaker recognition or voice authentication, is a technology that focuses on identifying and verifying the unique characteristics of an individual’s voice. It aims to determine the identity of the speaker, rather than convert speech into text.
声音识别,也称为说话人识别或语音身份验证,是一种专注于识别和验证个人声音独特特征的技术。它旨在确定说话者的身份,而不是将语音转换为文本。

How does Voice Recognition Work? 声音识别如何工作?

Voice recognition systems employ sophisticated algorithms and machine learning techniques to analyze various vocal features, such as pitch, tone, rhythm, and pronunciation. Let’s explore the process:
声音识别系统采用复杂的算法和机器学习技术来分析各种声音特征,例如音高、语气、节奏和发音。让我们来探索一下这个过程:

Enrollment: In the enrollment phase, the system records a sample of the user’s voice, capturing their unique vocal characteristics.
注册:在注册阶段,系统会录制用户的声音样本,捕捉他们独特的声音特征。

Feature Extraction: The system extracts specific features from the recorded voice sample, analyzing factors like pitch, speech rate, and spectral patterns.
特征提取:系统从录制的语音样本中提取特定特征,分析音高、语速和频谱模式等因素。

Voiceprint Creation: Using the extracted features, the system creates a unique voiceprint, which serves as a reference for future authentication.
声纹创建:使用提取的特征,系统创建一个唯一的声纹,作为未来身份验证的参考。

Authentication: When a user attempts to authenticate, their voice is compared to the stored voiceprint. The system assesses the similarity and determines whether the speaker’s identity matches the enrolled voiceprint.
身份验证:当用户尝试进行身份验证时,他们的语音将与存储的声纹进行比较。系统会评估相似性并确定说话人的身份是否与已注册的声纹匹配。

Decision: Based on the comparison results, the voice recognition system makes a decision, either granting or denying access.
决策:根据比较结果,声音识别系统做出决策,授予或拒绝访问权限。
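The enrollment-to-decision flow above can be sketched as a toy verifier that compares feature vectors by cosine similarity. The vectors, the class name, and the 0.95 threshold are all invented for illustration; real voiceprints are far richer representations:

```python
import math

def cosine_similarity(a, b):
    """Similarity in [-1, 1] between two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class VoiceAuthenticator:
    """Toy enrollment/verification flow; short lists stand in for real voiceprints."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # similarity required to accept a claimed identity
        self.voiceprints = {}

    def enroll(self, user, feature_vector):
        # In a real system this vector would be derived from pitch, speech rate,
        # spectral patterns, etc.; here it is just a list of numbers.
        self.voiceprints[user] = feature_vector

    def authenticate(self, user, feature_vector):
        enrolled = self.voiceprints.get(user)
        if enrolled is None:
            return False  # no voiceprint on file: deny access
        return cosine_similarity(enrolled, feature_vector) >= self.threshold

auth = VoiceAuthenticator()
auth.enroll("alice", [0.9, 0.1, 0.4])
print(auth.authenticate("alice", [0.88, 0.12, 0.41]))  # similar voice -> True
print(auth.authenticate("alice", [0.1, 0.9, 0.2]))     # different voice -> False
```

The threshold embodies the decision step: raising it reduces false accepts at the cost of more false rejects, a trade-off every deployed voice-authentication system must tune.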

How Is the Voice Recognition Model Trained? 声音识别模型是如何训练的?

Training a voice recognition model requires a dataset that encompasses audio samples from different individuals, capturing the unique vocal characteristics that differentiate one person from another. The dataset used for training a voice recognition model typically consists of the following components:
训练声音识别模型需要一个数据集,其中包含来自不同人的音频样本,以捕获区分一个人与另一个人的独特声音特征。用于训练声音识别模型的数据集通常由以下组件组成:

Enrolled Voice Samples: The dataset includes voice samples from individuals who voluntarily enroll in the system. These samples serve as the reference or template for each individual’s voiceprint. The enrollment process involves recording a set of voice samples from each person.
已注册的语音样本:该数据集包括来自自愿注册系统的个人的语音样本。这些样本用作每个人声纹的参考或模板。注册过程包括录制每个人的一组语音样本。

Test Voice Samples: Along with enrolled voice samples, the dataset also includes separate voice samples for testing and evaluation purposes. These samples are used to assess the model’s accuracy and performance in recognizing and verifying the identity of speakers.
测试语音样本:除了注册的语音样本外,数据集还包括用于测试和评估目的的单独语音样本。这些样本用于评估模型在识别和验证说话人身份方面的准确性和性能。

The training process for a voice recognition model involves the following steps:
声音识别模型的训练过程包括以下步骤:

Feature Extraction: From the enrolled voice samples, specific features are extracted to capture the unique vocal characteristics of each individual. These features may include pitch, speech rate, formant frequencies, spectral patterns, and other relevant acoustic properties.
特征提取:从注册的语音样本中提取特定特征以捕获每个人独特的声音特征。这些特征可能包括音高、语速、共振峰频率、频谱模式和其他相关的声学特性。

Voiceprint Creation: Using the extracted features, a voiceprint or voice template is created for each individual. The voiceprint represents a unique representation of an individual’s voice characteristics.
声纹创建:使用提取的特征,为每个人创建声纹或语音模板。声纹代表个人声音特征的独特表示。
