17. Testing a Movie Review App and Building a Conversational AI Customer Service Chatbot

stem5

Published 2025-09-26 16:24:11

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/stem5/article/details/152286722

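1. Mobile App Testing

To check how the mobile app behaves, we tested it on reviews of two films, Avatar and Interstellar.

The Avatar review comes from https://www.rogerebert.com/reviews/avatar-2009. The reviewer writes that watching Avatar felt like seeing Star Wars in 1977, and that James Cameron has once again silenced the doubters with an extraordinary film. The review describes the film as not only sensational entertainment but also a technical breakthrough, with environmental and anti-war themes, a newly invented language, and new stars. The reviewer rated it 4/5, while the mobile app predicted about 4.8/5.

The Interstellar review comes from https://www.rottentomatoes.com/m/interstellar_2014/ and says the film shows the polished filmmaking that audiences expect from writer-director Christopher Nolan. Its average rating on Rotten Tomatoes is 7/10, which is 3.5/5 on a 5-point scale; the mobile app predicted 3.37/5.

In these two examples the app's predictions fall within about 0.8 and 0.13 points of the reference ratings, so the mobile movie-review rating app gives reasonable ratings for both reviews, although two reviews are too few to support a general conclusion.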
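
The scale conversion and the gaps quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `rescale` is hypothetical and is not part of the app.

```python
def rescale(score, from_max, to_max=5.0):
    """Linearly map a score on a 0..from_max scale to a 0..to_max scale."""
    return score / from_max * to_max

# (reference rating on a 5-point scale, rating predicted by the app), as quoted above
ratings = {
    "Avatar": (4.0, 4.8),
    "Interstellar": (rescale(7, 10), 3.37),
}

for title, (reference, predicted) in ratings.items():
    gap = abs(predicted - reference)
    print(f"{title}: reference {reference:.2f}, predicted {predicted:.2f}, gap {gap:.2f}")
```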

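2. Overview of Conversational AI Customer Service Chatbots

Conversational chatbots have attracted a lot of attention recently because they can noticeably improve the customer experience. Modern businesses already use chatbots in many processes, making tedious tasks such as filling in forms or sending information over the internet more efficient.

The participants in a conversational chatbot system are the user and the bot. Using a conversational chatbot has several advantages:
- Personalized service: creating a personalized experience for every customer can be tedious, but not doing so hurts the business. A chatbot is a convenient way to give each customer a personalized experience.
- Round-the-clock support: hiring customer service representatives for 24/7 coverage is expensive. Using a chatbot for support outside working hours reduces the need for extra staff.
- Consistent replies: chatbot replies are consistent, whereas different representatives may answer the same question differently. Consistent replies save customers from calling repeatedly in the hope of a better answer.
- Patient service: a customer service representative may lose patience while serving a customer; a chatbot does not.
- Efficient record lookup: a chatbot can look up records more efficiently than a human agent.

Chatbots are not new; their origins go back to the 1950s. After the Second World War, Alan Turing devised the Turing test to distinguish humans from machines. In 1966, Joseph Weizenbaum developed Eliza, a program that imitated the language of a psychotherapist. The tool can still be found at http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm.

Chatbots can perform many kinds of tasks, for example:
- Answering product-related questions
- Making recommendations to customers
- Completing sentences
- Holding a conversation
- Negotiating prices with customers and taking part in bidding

When deciding whether it needs a chatbot, a business can refer to the following flowchart: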

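graph TD;
    A[Do we need a chatbot?] --> B{High volume of customer inquiries?};
    B -->|Yes| C{Are the question types fairly fixed?};
    B -->|No| D[Not needed];
    C -->|Yes| E[Needed];
    C -->|No| F[Consider other solutions];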

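3. Chatbot Architecture

The core of a chatbot is a natural language processing framework. It parses the data the user enters, interprets it in terms of what the user needs, and returns a suitable reply. While handling a request, the chatbot may need to consult a knowledge base and a store of historical transaction data.

Chatbots fall into two main categories:

| Type | How it works | Advantages | Disadvantages |
| ---- | ---- | ---- | ---- |
| Retrieval-based model | Selects a reply from predefined answers using a lookup table or knowledge base | Relatively simple to implement; most chatbots in production are of this type | Cannot handle unseen questions or requests with no predefined reply |
| Generative model | Generates replies on the fly; usually a probabilistic or machine-learning model | Can understand entities in the user input and produce human-like replies | Hard to train, needs large amounts of data, and is prone to grammatical errors |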
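
To make the retrieval-based row of the table concrete, here is a minimal sketch that picks a canned reply by word overlap. The `FAQ` table, the Jaccard similarity measure, and the threshold are all assumptions chosen for illustration; a production system would use a real knowledge base and a stronger similarity measure.

```python
import re

# Hypothetical lookup table: stored question -> canned reply
FAQ = {
    "how do i reset my password": "You can reset your password from the account settings page.",
    "where is my order": "Please share your order number and we will check the status.",
    "how do i cancel my subscription": "You can cancel at any time from the billing page.",
}
FALLBACK = "Sorry, I do not have an answer for that yet."

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_reply(query, threshold=0.3):
    """Return the reply of the most similar stored question, or a fallback."""
    query_tokens = tokens(query)
    best_question = max(FAQ, key=lambda q: jaccard(query_tokens, tokens(q)))
    if jaccard(query_tokens, tokens(best_question)) < threshold:
        return FALLBACK
    return FAQ[best_question]

print(retrieve_reply("Where is my order?"))
print(retrieve_reply("Do you sell gift cards?"))
```

The second query has no close match in the table, so the sketch falls back to a default reply, which is the weakness listed in the table for this type of model.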

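4. LSTM-Based Sequence-to-Sequence Model

The sequence-to-sequence architecture is well suited to capturing the context of a customer's input and generating a suitable reply. It works as follows:
- Encoding: the encoder LSTM encodes the input word sequence into a hidden state vector and a cell state vector, which capture the context of the whole input sentence.
- Decoding: the decoder LSTM takes the encoded information as its initial hidden and cell states and, at each step, predicts the next word from the current word. To predict the first word, a dummy start token is fed in as the current word; output stops when the dummy end token is predicted.

During training, the previous word of every target word is known in advance. At inference time it is not, so the word predicted at the previous step has to be fed in as the input.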
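
The difference between training and inference can be sketched without any neural network. In the snippet below, `teacher_forcing_pairs` builds the decoder inputs and targets used in training, and `greedy_decode` feeds each prediction back in as the next input. The marker tokens, both function names, and the lookup table standing in for a trained decoder are assumptions made for illustration.

```python
START, END = "<start>", "<end>"  # hypothetical marker tokens

def teacher_forcing_pairs(reply_tokens):
    """Return (decoder_inputs, decoder_targets) for one reply.

    At step t the decoder sees the true previous word and must predict word t.
    """
    decoder_inputs = [START] + reply_tokens
    decoder_targets = reply_tokens + [END]
    return decoder_inputs, decoder_targets

def greedy_decode(predict_next, max_len=10):
    """Inference: feed each predicted word back in as the next input."""
    previous, output = START, []
    for _ in range(max_len):
        previous = predict_next(previous)
        if previous == END:
            break
        output.append(previous)
    return output

inputs, targets = teacher_forcing_pairs(["happy", "to", "help"])
for seen, expected in zip(inputs, targets):
    print(f"decoder sees {seen!r:>10} -> should predict {expected!r}")

# A stand-in "model": a fixed next-word table instead of a trained LSTM
next_word = {START: "happy", "happy": "to", "to": "help", "help": END}
print(greedy_decode(next_word.get))
```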

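5. Building the Sequence-to-Sequence Model

The sequence-to-sequence model used for our chatbot is a slight modification of the basic architecture. Instead of only initializing the decoder with the encoder state, the encoder's final hidden state is fed to the decoder at every step, so the inputs for predicting a target word are the previous target word and that hidden state.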
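
The following sketch shows, with plain Python lists, what the decoder input looks like when the same context vector is attached to every step. The function names, dimensions, and values are toy assumptions; in a real model the vectors are tensors and the embeddings are learned.

```python
def repeat_context(context, steps):
    """Copy the encoder's final hidden state once per decoder time step."""
    return [list(context) for _ in range(steps)]

def build_decoder_inputs(prev_word_embeddings, context):
    """Concatenate each previous-word embedding with the same context vector."""
    repeated = repeat_context(context, len(prev_word_embeddings))
    return [emb + ctx for emb, ctx in zip(prev_word_embeddings, repeated)]

# Toy sizes (assumptions): embedding dimension 3, hidden state dimension 2, 4 decoder steps
prev_word_embeddings = [
    [0.1, 0.2, 0.3],  # embedding of START
    [0.4, 0.5, 0.6],  # embedding of target word 1
    [0.7, 0.8, 0.9],  # embedding of target word 2
    [1.0, 1.1, 1.2],  # embedding of target word 3
]
context = [9.0, 8.0]  # encoder's final hidden state

decoder_inputs = build_decoder_inputs(prev_word_embeddings, context)
for step, vector in enumerate(decoder_inputs):
    print(step, vector)
```

Each decoder step receives a vector of length 3 + 2 = 5, and the last two entries are identical across steps.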

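6. Model Definition

We build the sequence-to-sequence model with the LSTM variant of the RNN, because LSTMs handle long-term dependencies in long text sequences more effectively. The model contains two LSTMs:
- Encoder LSTM: encodes the input tweet into a context vector, which is the encoder LSTM's last hidden state. The input tweet is fed in as a sequence of word indices, which are mapped to word vectors in the embedding matrix.
- Decoder LSTM: tries to decode the context vector produced by the encoder into a meaningful reply. At each time step it combines the context vector with the previous word to generate the current word.

The model is defined by the following code: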

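from keras.layers import Input, Embedding, LSTM, RepeatVector, Dense, TimeDistributed, concatenate
from keras.models import Model

def define_model(self):
    # Embedding layer, shared by the encoder input and the decoder's last-word input.
    # input_length is omitted: the Input layers below already fix the sequence length.
    embedding = Embedding(
        output_dim=self.embedding_dim,
        input_dim=self.max_vocab_size,
        name='embedding',
    )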
    # Encoder input
    encoder_input = Input(
        shape=(self.max_seq_len,),
        dtype='int32',
        name='encoder_input',
    )
    embedded_input = embedding(encoder_input)
    encoder_rnn = LSTM(
        self.hidden_state_dim,
        name='encoder',
        dropout=self.dropout
    )
    # Context is repeated to the max sequence length so that the same context
    # can be fed at each step of the decoder
    context = RepeatVector(self.max_seq_len)(encoder_rnn(embedded_input))
    # Decoder    
    last_word_input = Input(
        shape=(self.max_seq_len,),
        dtype='int32',
        name='last_word_input',
    )
    embedded_last_word = embedding(last_word_input)
    # Combines the context produced by the encoder and the last word uttered as 
    # inputs to the decoder.
    decoder_input = concatenate([embedded_last_word, context],axis=2)
    # return_sequences causes LSTM to produce one output per timestep instead of
    # one at the end of the input, which is important for sequence producing models.
    decoder_rnn = LSTM(
        self.hidden_state_dim,
        name='decoder',
        return_sequences=True,
        dropout=self.dropout
    )
    decoder_output = decoder_rnn(decoder_input)
    # TimeDistributed allows the dense layer to be applied to each decoder output    
    # per timestep
    next_word_dense = TimeDistributed(
        Dense(int(self.max_vocab_size/20),activation='relu'),
        name='next_word_dense',
    )(decoder_output)
    next_word = TimeDistributed(
        Dense(self.max_vocab_size,activation='softmax'),
        name='next_word_softmax'
    )(next_word_dense)
    return Model(inputs=[encoder_input,last_word_input], outputs=[next_word])

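7. Loss Function for Model Training

The model is trained with a categorical cross-entropy loss on the target word predicted at each time step of the decoder LSTM. Because the vocabulary can be large, the sparse categorical cross-entropy loss is more efficient: the target label is simply the index of the target word, with no need to convert it to a one-hot vector.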
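
The two losses give the same value; they differ only in how the target is represented. The sketch below checks this for one time step over a toy five-word vocabulary. The probabilities and function bodies are illustrative assumptions, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(target_index, probs):
    """Cross-entropy against an integer class index."""
    return -math.log(probs[target_index])

# Softmax output over a toy 5-word vocabulary for one decoder time step
probs = [0.05, 0.10, 0.70, 0.10, 0.05]
target_index = 2
one_hot = [1.0 if i == target_index else 0.0 for i in range(len(probs))]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(target_index, probs)
print(dense_loss, sparse_loss)
```

With a vocabulary of tens of thousands of words, the one-hot target for every time step of every reply becomes a very large array, while the sparse form stores a single integer per step.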

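In summary, the tests show that the mobile app rates movie reviews reasonably well, and the conversational AI customer service chatbot, from architecture design to model implementation, gives businesses a practical way to improve customer service.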

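8. Building the Chatbot in Practice with Twitter Data

Next, we build the chatbot from the replies that 20 big brands posted on Twitter in response to customer tweets. The dataset twcs.zip is available at https://www.kaggle.com/thoughtvector/customer-support-on-twitter. Each tweet is identified by tweet_id, and its content is stored in the text field. Customer tweets can be recognized through the in_response_to_tweet_id field: it should be empty for a customer tweet, while for a customer service tweet it should point to the tweet_id of the customer tweet being answered.

The data is processed in the following steps:
1. Reading and preprocessing the data: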

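import pandas as pd

def process_data(self, path):
    data = pd.read_csv(path)
    if self.mode == 'train':
        # Tweets that reply to nothing get the sentinel -12345; these are the customer tweets
        data['in_response_to_tweet_id'] = data['in_response_to_tweet_id'].fillna(-12345)
        tweets_in = data[data['in_response_to_tweet_id'] == -12345]
        # Self-join: pair each customer tweet (_x columns) with the tweets replying to it (_y columns)
        tweets_in_out = tweets_in.merge(data, left_on=['tweet_id'], right_on=['in_response_to_tweet_id'])
        return tweets_in_out[:self.num_train_records]
    elif self.mode == 'inference':
        return data

In training mode, missing in_response_to_tweet_id values are filled with -12345 and used to select the customer tweets, which are then merged with the customer service tweets that answer them. The first num_train_records pairs are returned. In inference mode the data is returned as read.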
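
The pairing step is easier to see on a handful of rows. The sketch below reproduces the idea in plain Python rather than pandas; the rows are made up, and `pair_tweets_with_replies` is a hypothetical helper, not part of the original code.

```python
# Toy rows with the same columns as the dataset; the values are made up
tweets = [
    {"tweet_id": 1, "text": "My phone will not charge", "in_response_to_tweet_id": None},
    {"tweet_id": 2, "text": "Sorry to hear that. Which model is it?", "in_response_to_tweet_id": 1},
    {"tweet_id": 3, "text": "Where is my order?", "in_response_to_tweet_id": None},
    {"tweet_id": 4, "text": "Please DM us your order number.", "in_response_to_tweet_id": 3},
    {"tweet_id": 5, "text": "It is the 2017 model", "in_response_to_tweet_id": 2},
]

def pair_tweets_with_replies(rows):
    """Pair each tweet that replies to nothing with the tweets that reply to it."""
    openers = {r["tweet_id"]: r for r in rows if r["in_response_to_tweet_id"] is None}
    pairs = []
    for r in rows:
        parent = openers.get(r["in_response_to_tweet_id"])
        if parent is not None:
            pairs.append((parent["text"], r["text"]))
    return pairs

for question, answer in pair_tweets_with_replies(tweets):
    print(f"IN : {question}\nOUT: {answer}")
```

Tweet 5 answers a tweet that is itself a reply, so it is not paired. The same thing happens in the merge, which only joins against tweets whose in_response_to_tweet_id was empty.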

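2. Tokenizing the text and building the vocabulary:

import joblib
from nltk.tokenize import casual_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def tokenize_text(self, in_text, out_text):
    # Reserve three vocabulary slots for the special tokens UNK, PAD and START
    count_vectorizer = CountVectorizer(tokenizer=casual_tokenize, max_features=self.max_vocab_size - 3)
    count_vectorizer.fit(in_text + out_text)
    self.analyzer = count_vectorizer.build_analyzer()
    # Shift every word index by 3 to leave room for the special tokens
    self.vocabulary = {key_: value_ + 3 for key_, value_ in count_vectorizer.vocabulary_.items()}
    self.vocabulary['UNK'] = self.UNK
    self.vocabulary['PAD'] = self.PAD
    self.vocabulary['START'] = self.START
    self.reverse_vocabulary = {value_: key_ for key_, value_ in self.vocabulary.items()}
    # Persist the vocabularies and the fitted vectorizer for use at inference time
    joblib.dump(self.vocabulary, self.outpath + 'vocabulary.pkl')
    joblib.dump(self.reverse_vocabulary, self.outpath + 'reverse_vocabulary.pkl')
    joblib.dump(count_vectorizer, self.outpath + 'count_vectorizer.pkl')

CountVectorizer tokenizes the input and output texts and builds the vocabulary, assigning an index to each word. The special tokens UNK (unknown word), PAD (padding), and START (start of sentence) are then added. Finally, the vocabulary, the reverse vocabulary, and the fitted vectorizer are saved to files.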
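
The index shifting can be illustrated without scikit-learn. The sketch below builds a capped vocabulary with `collections.Counter` and whitespace tokenization. The reserved index values 0, 1, and 2, the cap of 8, and the function name are assumptions made for this example.

```python
from collections import Counter

UNK, PAD, START = 0, 1, 2  # assumed values for the reserved indices
MAX_VOCAB_SIZE = 8         # toy cap, including the three reserved entries

def build_vocabulary(sentences, max_vocab_size=MAX_VOCAB_SIZE):
    """Keep the most frequent words and shift their indices past the reserved ones."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    kept = sorted(word for word, _ in counts.most_common(max_vocab_size - 3))
    vocabulary = {word: index + 3 for index, word in enumerate(kept)}
    vocabulary.update({"UNK": UNK, "PAD": PAD, "START": START})
    reverse_vocabulary = {index: word for word, index in vocabulary.items()}
    return vocabulary, reverse_vocabulary

sentences = ["my order is late", "where is my order", "my phone is broken"]
vocabulary, reverse_vocabulary = build_vocabulary(sentences)
print(vocabulary)
```

Words that do not make the cut are absent from the vocabulary and are mapped to UNK when sentences are converted to indices.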

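3. Converting between words and indices:

def words_to_indices(self, sent):
    # Map tokens to indices (UNK for out-of-vocabulary words), then pad and truncate to max_seq_len
    word_indices = [self.vocabulary.get(token, self.UNK) for token in self.analyzer(sent)] + [self.PAD] * self.max_seq_len
    word_indices = word_indices[:self.max_seq_len]
    return word_indices

def indices_to_words(self, indices):
    # Skip padding and join the remaining words into a sentence
    return ' '.join(self.reverse_vocabulary[idx] for idx in indices if idx != self.PAD).strip()

The words_to_indices function converts a sentence into a fixed-length sequence of word indices, and indices_to_words converts an index sequence back into a sentence.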
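
A short standalone version shows the padding and truncation behaviour. It is a simplified sketch: it splits on whitespace instead of using the fitted analyzer, and the vocabulary, the reserved indices, and the sequence length are toy assumptions.

```python
UNK, PAD = 0, 1   # assumed reserved indices
MAX_SEQ_LEN = 6   # toy sequence length
vocabulary = {"UNK": UNK, "PAD": PAD, "where": 3, "is": 4, "my": 5, "order": 6}
reverse_vocabulary = {index: word for word, index in vocabulary.items()}

def words_to_indices(sentence):
    """Map words to indices, pad with PAD, then cut to MAX_SEQ_LEN."""
    indices = [vocabulary.get(w, UNK) for w in sentence.lower().split()]
    return (indices + [PAD] * MAX_SEQ_LEN)[:MAX_SEQ_LEN]

def indices_to_words(indices):
    """Map indices back to words, skipping padding."""
    return " ".join(reverse_vocabulary[i] for i in indices if i != PAD)

short = words_to_indices("where is my parcel")
long_ = words_to_indices("where is my order where is my order")
print(short)
print(long_)
print(indices_to_words(short))
```

The short sentence is padded to six indices and its unknown word becomes UNK, while the long sentence is cut after six words.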

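4. Replacing anonymized names:

import re

def replace_anonymized_names(self, data):
    def replace_name(match):
        cname = match.group(2).lower()
        # Named handles (for example brand accounts) are kept unchanged
        if not cname.isnumeric():
            return match.group(1) + match.group(2)
        # Numeric handles are anonymized names; keep the character matched before '@'
        return match.group(1) + '__cname__'
    re_pattern = re.compile(r'(\W@|^@)([a-zA-Z0-9_]+)')
    if self.mode == 'train':
        in_text = data['text_x'].apply(lambda txt: re_pattern.sub(replace_name, txt))
        out_text = data['text_y'].apply(lambda txt: re_pattern.sub(replace_name, txt))
        return list(in_text.values), list(out_text.values)
    else:
        return [re_pattern.sub(replace_name, x) for x in data]

A regular expression finds the @ handles in each tweet. Handles that consist only of digits are the anonymized names, and they are replaced with the single placeholder @__cname__, while named handles are left as they are. This keeps the data consistent and protects privacy.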
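
The effect of the pattern can be tried on a couple of sample strings. This is a standalone sketch; the function name `mask_numeric_handles` and the sample tweets are made up, and the sketch keeps the character matched before `@` so that neighbouring words stay separated.

```python
import re

HANDLE = re.compile(r'(\W@|^@)([a-zA-Z0-9_]+)')

def mask_numeric_handles(text):
    """Replace purely numeric @handles with a placeholder; keep named handles."""
    def replace(match):
        if match.group(2).isnumeric():
            return match.group(1) + '__cname__'
        return match.group(0)
    return HANDLE.sub(replace, text)

print(mask_numeric_handles("@115712 sorry for the delay, please DM @AppleSupport"))
print(mask_numeric_handles("thanks @115713!"))
```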

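9. Model Training and Evaluation

With data processing and the model definition in place, we can train the model, using the sparse categorical cross-entropy loss described earlier and the Adam optimizer. The following is a simplified example of the training flow:

import numpy as np

# Assumption: `bot` is a hypothetical instance of the class that defines the methods above
model = bot.define_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Load and preprocess the training data
train_data = bot.process_data('path/to/train_data.csv')
in_text, out_text = bot.replace_anonymized_names(train_data)
bot.tokenize_text(in_text, out_text)  # builds the vocabulary used by words_to_indices
input_indices = np.array([bot.words_to_indices(sent) for sent in in_text])
output_indices = np.array([bot.words_to_indices(sent) for sent in out_text])

# Teacher forcing: the last-word input is the target shifted right by one step, starting with START
last_word_indices = np.full_like(output_indices, bot.START)
last_word_indices[:, 1:] = output_indices[:, :-1]

# Train the model; the sparse loss takes one integer label per time step
model.fit([input_indices, last_word_indices], np.expand_dims(output_indices, -1), epochs=10, batch_size=32)

The input and output texts are converted to word-index sequences, and the model is trained with the fit method. The decoder's last-word input is the target sequence shifted right by one position and prefixed with START, so that each step learns to predict a word from the true previous word rather than from the word itself. After training, the model can be evaluated on test data by computing metrics such as accuracy and loss.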
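
When accuracy is computed on padded sequences, the padding positions are trivially easy to predict and can inflate the score. One possible way to account for this is to count only non-padding target positions, as in the sketch below. The function name, the padding index, and the sample sequences are assumptions for illustration.

```python
PAD = 1  # assumed padding index

def masked_token_accuracy(targets, predictions, pad=PAD):
    """Share of non-padding target positions where the predicted index matches."""
    correct = total = 0
    for target_seq, predicted_seq in zip(targets, predictions):
        for target, predicted in zip(target_seq, predicted_seq):
            if target == pad:
                continue
            total += 1
            correct += int(target == predicted)
    return correct / total if total else 0.0

targets     = [[5, 7, 9, 1, 1], [4, 6, 1, 1, 1]]
predictions = [[5, 7, 3, 1, 1], [4, 6, 1, 1, 8]]
print(masked_token_accuracy(targets, predictions))
```

Five target positions are real words and four of them are predicted correctly, so the masked accuracy is 0.8.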

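10. Applications and Extensions of the Chatbot

A trained chatbot can be used in many settings, such as customer service and intelligent assistants. In practice it can be integrated into a website, a mobile app, or a social media platform to serve users interactively in real time.

To improve the chatbot's performance and capabilities further, the following extensions are worth considering:
- Multilingual support: add multilingual datasets and multilingual models so that the chatbot can handle requests in different languages.
- Knowledge graph integration: combine the knowledge base with a knowledge graph so that the chatbot has richer knowledge to draw on and can give more accurate answers.
- Sentiment analysis: take the user's emotional state into account when replying, to provide more considerate service.

The following is a simple example of interacting with the chatbot: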

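import numpy as np
from keras.models import load_model

# Load the trained model
# Assumption: `bot` is a hypothetical instance of the class above, with its vocabulary loaded
model = load_model('path/to/trained_model.h5')

# Get the user's question
user_input = input("Enter your question: ")

# Encode the question as a batch containing one index sequence
input_indices = np.array([bot.words_to_indices(user_input)])

# Greedy decoding: start from START and feed each predicted word back in
last_words = np.full((1, bot.max_seq_len), bot.PAD)
last_words[0, 0] = bot.START
reply = []
for t in range(bot.max_seq_len):
    probs = model.predict([input_indices, last_words], verbose=0)
    next_word = int(np.argmax(probs[0, t]))
    if next_word == bot.PAD:  # the vocabulary has no end token, so padding marks the end
        break
    reply.append(next_word)
    if t + 1 < bot.max_seq_len:
        last_words[0, t + 1] = next_word

print("Chatbot reply:", bot.indices_to_words(reply))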